
It's nice to see such optimism, yet I have to (hopefully respectfully) criticize your approach.

I used to work in the same office as Bob Jervis who wrote Turbo C back in the day, and he said he never uses parser generators. Clearly something is wrong if the tools that have all the weight of the theory and formalisms are losing to coding parsers by hand.

When a number of really smart people do something, there might just be a reason for it. They might be wrong too, but if you're going to call them wrong, perhaps you should first take the time to figure out why they do what they do.

One thing that should be a red flag is that lots of people still write parsers manually instead of using tools.

Wait! People write programs manually too instead of automatic program constructors. It might be OK.

The thing about "manually" constructing a parser is that it is not a terribly difficult task once one has a strong grasp of the relation between language specifications and recursion. Manually constructing a recursive descent parser also doesn't involve throwing out the formal specification of the language. Rather, you transform your language specification until you shape it into a set of productions which can be captured by a series of recursive function calls (essentially taking the language to Greibach Normal Form, though there are a number of shortcuts you can take once you're comfortable with this process). It is not an ad-hoc affair, but it does allow you to add ad-hoc optimizations for the particular actions that are needed or not needed in the results of your parsing.
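
To make the "not terribly difficult" claim concrete, here is a minimal recursive-descent parser for a toy arithmetic grammar (my own illustration, not from the thread) - each production becomes one function, and the host language's call stack does the bookkeeping:

```python
# Toy grammar:
#   expr   -> term (('+'|'-') term)*
#   term   -> factor ('*' factor)*
#   factor -> NUMBER | '(' expr ')'
# Each function returns (value, next_index) over the input string s.

def parse_expr(s, i=0):
    val, i = parse_term(s, i)
    while i < len(s) and s[i] in '+-':
        op = s[i]
        rhs, i = parse_term(s, i + 1)
        val = val + rhs if op == '+' else val - rhs
    return val, i

def parse_term(s, i):
    val, i = parse_factor(s, i)
    while i < len(s) and s[i] == '*':
        rhs, i = parse_factor(s, i + 1)
        val *= rhs
    return val, i

def parse_factor(s, i):
    if s[i] == '(':
        val, i = parse_expr(s, i + 1)
        return val, i + 1            # skip the closing ')'
    j = i
    while j < len(s) and s[j].isdigit():
        j += 1
    return int(s[i:j]), j

# parse_expr('2*(3+4)') -> (14, 7)
```

Because you wrote the functions yourself, it's trivial to inline whatever action you need at the exact point in the parse where it matters - which is the ad-hoc optimization opportunity described above.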

The road block to any reusability in parsers is that a language's grammar exists at a different logical level than a single function, object or piece of data.

Specifically, two programs that parse the same grammar would likely need to do different things at different times in the parsing process. The simplest thing is to create the code for each of these programs separately and then let the compiler optimize the code, tossing out the things I don't need.

But any "general purpose" parser library more or less takes the code and fully compiles it - turns it into an AST or the equivalent and then lets the programmer deal with the results (the way Python or Ruby do reflection). Such an approach works for some things, but it inherently isn't going to be very fast because it doesn't know which parts of the code the final user is not concerned with.

Just consider. The simple way of removing a single "i" tag from strings like "<i>help me</i> he said" is never going to be "load XML parser, issue commands, close XML parser". A single loop removing the offending characters is going to beat this "algorithm" a thousand times over. Constructing the parser manually more or less allows you to boil it down to exactly this.
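
For the sake of illustration, the "single loop" the comment has in mind might look like this (a deliberately ad-hoc sketch of my own: no attributes, no CDATA, no namespaces):

```python
def strip_i_tags(s):
    # One pass over the string, copying everything except the
    # literal byte sequences "<i>" and "</i>".
    out = []
    i = 0
    while i < len(s):
        if s.startswith('<i>', i):
            i += 3
        elif s.startswith('</i>', i):
            i += 4
        else:
            out.append(s[i])
            i += 1
    return ''.join(out)

# strip_i_tags('<i>help me</i> he said') -> 'help me he said'
```

No parser object, no tree, no callbacks - just a scan, which is why it wins on this narrow input (and why, as the reply below points out, it fails the moment the input is less narrow).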

Qt Creator finds syntax errors as I type, and I remember Visual Studio doing the same. I don't think either uses the same parser as its respective compiler, because an editor's parser has to be more forgiving - otherwise it would fail to arrive at the piece of code you're actually editing - but I could be wrong. I know most syntax highlighting is custom written, but again, it isn't that hard to custom write.



> They might be wrong too but if you're going to call them wrong, perhaps you should first take the time to figure why they do what they do first.

Right back at'cha. You have criticized my approach without understanding it.

> Specifically, two programs that parse the same grammar would likely need to do different things at different times in the parsing process. [...] But any "general purpose" parser library more or less has taken the code and fully compile it - turn it into an AST or the equivalent and then let the programmer deal with the results

Actually, that's not what Gazelle does. If that were the best you could do, I'd agree with you.

At its lowest layer, Gazelle offers event-based parsing. It's like SAX, where the approach you have described is like DOM.

The way you would remove a single "i" tag from a string is by registering a callback that gets called when "i" is encountered, and another callback that gets called when the corresponding closing tag is encountered.
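
Gazelle's actual API is not shown in this thread, so as a stand-in, the same SAX-style idea can be sketched with Python's stdlib event-based parser - the "i" tags disappear simply because their start/end events are never re-emitted (attributes are ignored for brevity):

```python
import xml.sax

class DropITags(xml.sax.ContentHandler):
    """Re-emit the document from parse events, skipping <i> tags."""
    def __init__(self):
        super().__init__()
        self.out = []
    def startElement(self, name, attrs):
        if name != 'i':
            self.out.append('<%s>' % name)
    def endElement(self, name):
        if name != 'i':
            self.out.append('</%s>' % name)
    def characters(self, content):
        self.out.append(content)

handler = DropITags()
xml.sax.parseString(b'<p><i>help me</i> he said</p>', handler)
# ''.join(handler.out) -> '<p>help me he said</p>'
```

Nothing here builds a DOM; the input is streamed through, which is the point of the event-based layer.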

And once I've optimized Gazelle to generate really fast machine code, I doubt your "single loop" is going to beat me. For example, did you know that with SSE you can do 16 byte-wise compares in a single instruction? Are you going to put SSE instructions in your "single loop"?

Finally, your "single loop" is probably not going to correctly handle the case where the "i" tag is inside a CDATA section, and is therefore not actually an "i" tag at all. This is the biggest problem with ad hoc parsers -- nobody seems to realize how incomplete/buggy they are.
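
To illustrate the CDATA pitfall with an example of my own: a naive textual loop mangles character data that merely looks like a tag, while an event-based parser correctly reports it as plain text:

```python
import xml.sax

doc = b'<p><![CDATA[literal <i> text]]></p>'

# The naive approach treats the bytes inside CDATA as a real tag
# and corrupts the data:
naive = doc.decode().replace('<i>', '').replace('</i>', '')
# naive -> '<p><![CDATA[literal  text]]></p>'

# A real parser reports CDATA content as plain character data:
class Collect(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.text = []
    def characters(self, content):
        self.text.append(content)

h = Collect()
xml.sax.parseString(doc, h)
# ''.join(h.text) -> 'literal <i> text'  (the "<i>" survives intact)
```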

> Wait! People write programs manually too instead of automatic program constructors. It might be OK.

I think a better analogy would be: you would know C compilers weren't good enough if people were always writing in assembly instead. Writing a parser manually instead of using an abstraction like a CFG is like writing in assembly language.


> Just consider. The simple way of removing a single "i" tag from strings like "<i>help me</i> he said" is never going to be "load XML parser, issue commands, close XML parser". A single loop removing the offending character is going to beat this "algorithm" a thousand times over. Manual parser generation more or less allows you to boil down your parser to this.

However, the parser way might enable your code to handle tricky XML stuff, e.g. namespaces: "<html:i>help me</html:i>".
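
For instance, sketched with Python's stdlib ElementTree (my illustration, not Gazelle): a namespace-aware parser resolves the html: prefix to its declared URI, so elements match by namespace rather than by whatever literal prefix the document happens to use:

```python
import xml.etree.ElementTree as ET

doc = ('<p xmlns:html="http://www.w3.org/1999/xhtml">'
       '<html:i>help me</html:i> he said</p>')
root = ET.fromstring(doc)

# Match by namespace URI, not by the surface prefix "html:".
matches = root.findall('{http://www.w3.org/1999/xhtml}i')
# matches[0].text -> 'help me'
```

A prefix-matching string loop would break as soon as a document declared the same namespace under a different prefix.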


Either can be useful depending on the situation.

A just-in-time compiler would also be able to produce the quick loop, if that loop ran somewhat frequently.

The point is each of these approaches makes sense in a given context and so you can't say that any of them are relative "hacks" or "stone-age" or whatever.

I could imagine a future language/system that would allow a few XML-parser commands on a line to be translated into a single loop. But we aren't there yet - the solution isn't just in algorithms but in the whole development environment. Haberman's "program" of improving parsing is noble. The problem is that he's articulated it as simply bolting algorithms in, and you need more than that (i.e., you need a language that can do lots of compile-time futzing about).


Visual Studio is very often confused at edit-time. The parser should not be more forgiving, it should just remember the parts that once were correct but are now invalidated by further edits. That way, updates that keep local correctness would continue to work even if some parts prevented the correct parsing of the whole file.



