
It's nice to see such optimism, yet I have to (hopefully respectfully) criticize your approach.

I used to work in the same office as Bob Jervis who wrote Turbo C back in the day, and he said he never uses parser generators. Clearly something is wrong if the tools that have all the weight of the theory and formalisms are losing to coding parsers by hand.

When a number of really smart people do something, there might just be a reason for it. They might be wrong too, but if you're going to call them wrong, perhaps you should first take the time to figure out why they do what they do.

One thing that should be a red flag is that lots of people still write parsers manually instead of using tools.

Wait! People write programs manually too instead of automatic program constructors. It might be OK.

The thing about "manually" constructing a parser is that it is not a terribly difficult task once one has a strong grasp of the relation between language specifications and recursion. Manually constructing a recursive descent parser also doesn't involve throwing out the formal specification of the language. Rather, you transform your language specification until you shape it into a set of productions which can be captured by a series of recursive function calls (essentially taking the language to Greibach Normal Form, though there are a number of shortcuts you can take once you're comfortable with this process). It is not an ad-hoc affair, but it does allow you to add ad-hoc optimizations for the particular actions that are needed or not needed in the results of your parsing.
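
To make the "not terribly difficult" claim concrete, here is a minimal recursive-descent parser for a toy arithmetic grammar (my own illustration, not from the thread) - each production becomes one function, and the host language's call stack does the bookkeeping:

```python
# Toy grammar:
#   expr   -> term (('+'|'-') term)*
#   term   -> factor ('*' factor)*
#   factor -> NUMBER | '(' expr ')'
# Each function returns (value, next_index) over the input string s.

def parse_expr(s, i=0):
    val, i = parse_term(s, i)
    while i < len(s) and s[i] in '+-':
        op = s[i]
        rhs, i = parse_term(s, i + 1)
        val = val + rhs if op == '+' else val - rhs
    return val, i

def parse_term(s, i):
    val, i = parse_factor(s, i)
    while i < len(s) and s[i] == '*':
        rhs, i = parse_factor(s, i + 1)
        val *= rhs
    return val, i

def parse_factor(s, i):
    if s[i] == '(':
        val, i = parse_expr(s, i + 1)
        return val, i + 1            # skip the closing ')'
    j = i
    while j < len(s) and s[j].isdigit():
        j += 1
    return int(s[i:j]), j

# parse_expr('2*(3+4)') -> (14, 7)
```

Because you wrote the functions yourself, it's trivial to inline whatever action you need at the exact point in the parse where it matters - which is the ad-hoc optimization opportunity described above.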

The road block to any reusability in parsers is that a language's grammar exists at a different logical level than a single function, object or piece of data.

Specifically, two programs that parse the same grammar would likely need to do different things at different times in the parsing process. The simplest thing is to create the code for each of these programs separately and then let the compiler optimize the code, tossing out the things I don't need.

But any "general purpose" parser library more or less takes the code and fully compiles it - turns it into an AST or the equivalent and then lets the programmer deal with the results (the way Python or Ruby do reflection). Such an approach works for some things, but it inherently isn't going to be very fast because it doesn't know which parts of the code the final user is not concerned with.

Just consider. The simple way of removing a single "i" tag from strings like "<i>help me</i> he said" is never going to be "load XML parser, issue commands, close XML parser". A single loop removing the offending characters is going to beat this "algorithm" a thousand times over. Constructing the parser manually more or less allows you to boil it down to exactly this.
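
For the sake of illustration, the "single loop" the comment has in mind might look like this (a deliberately ad-hoc sketch of my own: no attributes, no CDATA, no namespaces):

```python
def strip_i_tags(s):
    # One pass over the string, copying everything except the
    # literal byte sequences "<i>" and "</i>".
    out = []
    i = 0
    while i < len(s):
        if s.startswith('<i>', i):
            i += 3
        elif s.startswith('</i>', i):
            i += 4
        else:
            out.append(s[i])
            i += 1
    return ''.join(out)

# strip_i_tags('<i>help me</i> he said') -> 'help me he said'
```

No parser object, no tree, no callbacks - just a scan, which is why it wins on this narrow input (and why, as the reply below points out, it fails the moment the input is less narrow).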

Qt Creator finds syntax errors as I type, and I remember Visual Studio doing the same. I don't think either uses the same parser as its respective compiler, because an editor's parser has to be more forgiving - otherwise it would fail to arrive at the piece of code you're actually editing - but I could be wrong. I know most syntax highlighting is custom written, but again, it isn't that hard to custom write.



> They might be wrong too but if you're going to call them wrong, perhaps you should first take the time to figure why they do what they do first.

Right back at'cha. You have criticized my approach without understanding it.

> Specifically, two programs that parse the same grammar would likely need to do different things at different times in the parsing process. [...] But any "general purpose" parser library more or less has taken the code and fully compile it - turn it into an AST or the equivalent and then let the programmer deal with the results

Actually, that's not what Gazelle does. If that were the best you could do, I'd agree with you.

At its lowest layer, Gazelle offers event-based parsing. It's like SAX, where the approach you have described is like DOM.

The way you would remove a single "i" tag from a string is by registering a callback that gets called when "i" is encountered, and another callback that gets called when the corresponding closing tag is encountered.
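
Gazelle's actual API is not shown in this thread, so as a stand-in, the same SAX-style idea can be sketched with Python's stdlib event-based parser - the "i" tags disappear simply because their start/end events are never re-emitted (attributes are ignored for brevity):

```python
import xml.sax

class DropITags(xml.sax.ContentHandler):
    """Re-emit the document from parse events, skipping <i> tags."""
    def __init__(self):
        super().__init__()
        self.out = []
    def startElement(self, name, attrs):
        if name != 'i':
            self.out.append('<%s>' % name)
    def endElement(self, name):
        if name != 'i':
            self.out.append('</%s>' % name)
    def characters(self, content):
        self.out.append(content)

handler = DropITags()
xml.sax.parseString(b'<p><i>help me</i> he said</p>', handler)
# ''.join(handler.out) -> '<p>help me he said</p>'
```

Nothing here builds a DOM; the input is streamed through, which is the point of the event-based layer.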

And once I've optimized Gazelle to generate really fast machine code, I doubt your "single loop" is going to beat me. For example, did you know that with SSE you can do 16 byte-wise compares in a single instruction? Are you going to put SSE instructions in your "single loop"?

Finally, your "single loop" is probably not going to correctly handle the case where the "i" tag is inside a CDATA section, and is therefore not actually an "i" tag at all. This is the biggest problem with ad hoc parsers -- nobody seems to realize how incomplete/buggy they are.
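
To illustrate the CDATA pitfall with an example of my own: a naive textual loop mangles character data that merely looks like a tag, while an event-based parser correctly reports it as plain text:

```python
import xml.sax

doc = b'<p><![CDATA[literal <i> text]]></p>'

# The naive approach treats the bytes inside CDATA as a real tag
# and corrupts the data:
naive = doc.decode().replace('<i>', '').replace('</i>', '')
# naive -> '<p><![CDATA[literal  text]]></p>'

# A real parser reports CDATA content as plain character data:
class Collect(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.text = []
    def characters(self, content):
        self.text.append(content)

h = Collect()
xml.sax.parseString(doc, h)
# ''.join(h.text) -> 'literal <i> text'  (the "<i>" survives intact)
```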

> Wait! People write programs manually too instead of automatic program constructors. It might be OK.

I think a better analogy would be: you would know C compilers weren't good enough if people were always writing in assembly instead. Writing a parser manually instead of using an abstraction like a CFG is like writing in assembly language.


> Just consider. The simple way of removing a single "i" tag from strings like "<i>help me</i> he said" is never going to be "load XML parser, issue commands, close XML parser". A single loop removing the offending character is going to beat this "algorithm" a thousand times over. Manual parser generation more or less allows you to boil down your parser to this.

However, the parser way might enable your code to handle tricky XML stuff, e.g. namespaces: "<html:i>help me</html:i>".
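
For instance, sketched with Python's stdlib ElementTree (my illustration, not Gazelle): a namespace-aware parser resolves the html: prefix to its declared URI, so elements match by namespace rather than by whatever literal prefix the document happens to use:

```python
import xml.etree.ElementTree as ET

doc = ('<p xmlns:html="http://www.w3.org/1999/xhtml">'
       '<html:i>help me</html:i> he said</p>')
root = ET.fromstring(doc)

# Match by namespace URI, not by the surface prefix "html:".
matches = root.findall('{http://www.w3.org/1999/xhtml}i')
# matches[0].text -> 'help me'
```

A prefix-matching string loop would break as soon as a document declared the same namespace under a different prefix.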


Either can be useful depending on the situation.

A just-in-time compiler would also be able to produce the quick loop, if that loop ran somewhat frequently.

The point is each of these approaches makes sense in a given context and so you can't say that any of them are relative "hacks" or "stone-age" or whatever.

I could imagine a future language/system that would allow a few XML-parser commands on a line to be translated into a single loop. But we aren't there yet - the solution isn't just in algorithms but in the whole development environment. Haberman's "program" of improving parsing is noble. The problem is that he's articulated it as simply bolting algorithms in, and you need more than that (i.e., you need a language that can do lots of compile-time futzing about).


Visual Studio is very often confused at edit-time. The parser should not be more forgiving, it should just remember the parts that once were correct but are now invalidated by further edits. That way, updates that keep local correctness would continue to work even if some parts prevented the correct parsing of the whole file.



