
It surprises me that they are still teaching parsing techniques based on limitations from 40 years ago, when memory was scarce and you had to parse a file one character at a time. Why not start with a backtracking recursive descent parser on a file stored in memory? It can be made efficient with some simple caching. In an introductory course there is no need to aim for maximum performance if parsing a 10k-line program takes less than a second.


Parsing is strange in that many people tend to believe it is a solved problem and yet every project handles it slightly differently (and almost none do it truly well).

I have been studying compiler design for several years and I have found that writing a simple parser by hand is the best way to go most of the time. There is a process to it: you start with a "Hello, world!" program and parse it character by character, with no separate lexer. You ensure that at each character your parser makes an unambiguous decision and never backtracks. The decision may be to enter a disambiguation function that also only moves forward. If the grammar gets in the way of preserving this property, change the grammar, not the parser design.

If you follow that high-level algorithm, you will end up with a parser whose performance is linear in the number of characters, which is asymptotically as good as you can hope for. It is easy to implement (provided you have solid programming fundamentals), and no caching is needed for efficiency.
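To make that concrete, here is a minimal sketch of the style in C: a hand-written, forward-only parser for arithmetic expressions with no separate lexer and no backtracking. The grammar, function names, and evaluate-while-parsing approach are illustrative choices of mine, not taken from any particular compiler.

```c
/* Forward-only, character-by-character parser: the cursor `p`
   only ever advances, and each branch is decided by looking at
   the current character alone. */
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static const char *p;                 /* cursor: never rewinds */

static void skip_ws(void) { while (*p == ' ' || *p == '\t') p++; }

static long parse_expr(void);         /* forward declaration */

/* primary := number | '(' expr ')' */
static long parse_primary(void) {
    skip_ws();
    if (isdigit((unsigned char)*p)) {
        long v = 0;
        while (isdigit((unsigned char)*p)) v = v * 10 + (*p++ - '0');
        return v;
    }
    if (*p == '(') {
        p++;
        long v = parse_expr();
        skip_ws();
        if (*p != ')') { fprintf(stderr, "expected ')' at '%s'\n", p); exit(1); }
        p++;
        return v;
    }
    fprintf(stderr, "unexpected character '%c'\n", *p);
    exit(1);
}

/* term := primary (('*'|'/') primary)* */
static long parse_term(void) {
    long v = parse_primary();
    for (;;) {
        skip_ws();
        if (*p == '*') { p++; v *= parse_primary(); }
        else if (*p == '/') { p++; v /= parse_primary(); }
        else return v;                /* decided on the current char alone */
    }
}

/* expr := term (('+'|'-') term)* */
static long parse_expr(void) {
    long v = parse_term();
    for (;;) {
        skip_ws();
        if (*p == '+') { p++; v += parse_term(); }
        else if (*p == '-') { p++; v -= parse_term(); }
        else return v;
    }
}

int main(void) {
    p = "1 + 2 * (3 + 4)";
    printf("%ld\n", parse_expr());    /* prints 15 */
    return 0;
}
```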

Deliberate backtracking in a compiler is an evil that should be avoided at all costs. It risks injecting exponentially poor performance into a developer's primary feedback loop, which is a theft of their time for which they have little to no recourse.
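As a toy illustration of that blow-up (my own example, not from any real compiler), here is a naive backtracking recognizer for the grammar S -> "aa" S | "a" S | "". On inputs it can never match, it re-parses the same suffixes over and over, so the call count grows roughly like the Fibonacci numbers:

```c
#include <stdio.h>
#include <string.h>

static long calls;

/* Naive backtracking recognizer for S -> "aa" S | "a" S | "" .
   When one alternative fails, it rewinds and retries the next,
   re-parsing suffixes it has already rejected. */
static int match_S(const char *s) {
    calls++;
    if (*s == '\0') return 1;                                    /* S -> "" */
    if (s[0] == 'a' && s[1] == 'a' && match_S(s + 2)) return 1;  /* S -> "aa" S */
    if (s[0] == 'a' && match_S(s + 1)) return 1;                 /* S -> "a" S */
    return 0;                                                    /* backtrack */
}

int main(void) {
    char buf[64];
    /* n 'a's followed by a 'b' can never match, forcing full backtracking. */
    for (int n = 10; n <= 30; n += 10) {
        memset(buf, 'a', (size_t)n);
        buf[n] = 'b'; buf[n + 1] = '\0';
        calls = 0;
        match_S(buf);
        printf("n=%2d  calls=%ld\n", n, calls);  /* grows exponentially in n */
    }
    return 0;
}
```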


I agree that if you want to write a production-grade parser, this is probably the best way to go. I also agree that parsing is not a solved problem for all cases, but that is true of many other problems too. For many cases it is a solved problem, and it is often not the first thing you should focus on optimizing.

If you teach a course about compiler construction, I think it might be better to teach your students how to write a grammar for some language and use an interactive parser that parses input according to that grammar (and visualizes the AST). See for example [1] and [2]. (Even if you feed it the C grammar, it succeeds in parsing thousands of lines of (preprocessed) C code at every keystroke. This interpreting parser is written in JavaScript and uses a simple caching strategy to improve performance.)

For the scripting language [3] in some of the Bizzdesign modeling tools, a similar interactive parser was used (implemented in C++). The scripting language is also used internally for implementing the various meta-models. These scripts are parsed once, cached, and interpreted often.

I think the same holds for many domain-specific languages (DSLs).

[1] https://info.itemis.com/demo/agl/editor

[2] https://fransfaase.github.io/MCH2022ParserWorkshop/IParseStu...

[3] https://help.bizzdesign.com/articles/#!horizzon-help/the-scr...


I love parsing and have written a lot of parsers, but never a typical programming language parser. It's interesting that most of the literature (from academic papers to blog posts) focuses on programming language parsers, while the vast majority of parsers out there deal with other things. I had to figure things out myself, and that's been the same story for every parser I've written.

A lesson I have to relearn every time: while you can always skip lexing (which is really just another parser), doing so almost always screws you over.


Well, I can immediately think of two reasons:

Backtracking parsers lead people into creating bad grammars. In principle, people are perfectly capable of creating simple context-free grammars and writing whatever parser they want to read them. But in practice your tools guide your decisions to a huge extent, and the less experience people have, the truer that becomes; so it's a really dangerous tool, particularly for students.

Also, fully backtracking parsers have the most unpredictable and hard-to-fix error conditions of all the options. There are middle grounds where the parser's execution stays predictable while you still get most of the benefit of backtracking, but reaching that optimum, and keeping your project close to it, takes a lot of complex engineering decisions.

Immediate edit: in a CS context there is one reason that is probably more important than any other. People use parsers as an application of automata theory and regular languages. Those two concepts are far more important than the practical implications of parsing.


What do you mean by bad grammars? Do you mean grammars that are hard to parse (requiring a lot of backtracking), or do you mean that it leads people to create bad languages?

My experience is that if a backtracking parser lists all the possible terminals it was expecting at the first location it failed to get past (with some additional information about the rules they occurred in), this usually gives enough information to understand what is wrong with the input or the grammar.


I mean grammars that are hard for people to follow.


> In an introduction course there is no need to aim for maximum performance if parsing a 10k lines program takes less than a second.

I'll do you one better. The main compiler in my course uses only six characters to parse every single project: `(read)`.


Are you referring to lookahead, as in allowing more ambiguous grammars?


Sorry, I meant to say: even if a grammar is not ambiguous, it can require unbounded look-ahead to be parsed correctly [1].

The grammar of C is ambiguous. The statement "a * b;" can be parsed both as a declaration of the variable 'b' with type pointer-to-'a' and as an expression multiplying 'a' and 'b'. It all depends on whether 'a' names a type. In most cases it would be a declaration, because why multiply two variables and ignore the result? One trick to deal with this is to give the declaration grammar rule precedence over the expression grammar rule; however, that is not something many parser-generating tools support.
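The ambiguity is easy to reproduce; the very same statement is legal C under both readings:

```c
typedef int a;      /* here 'a' names a type ...                  */
void f(void) {
    a * b;          /* ... so this declares 'b' as pointer-to-'a' */
}

void g(int a, int b) {
    a * b;          /* here 'a' is a variable, so this multiplies
                       'a' and 'b' and discards the result         */
}
```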

Yet the first C compilers were single-pass compilers with a single look-ahead lexical token, probably implemented as recursive descent parsers. This worked because the compiler kept a (reverse) list of all declarations, which let it determine, when 'a' was parsed, whether it started a declaration or a statement, based on whether 'a' had previously been defined as a type.
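A minimal sketch of that bookkeeping (often called the "lexer hack"); the table layout and function names here are my own illustrative assumptions, not the original compiler's code:

```c
#include <stdio.h>
#include <string.h>

enum token_kind { IDENTIFIER, TYPE_NAME };

#define MAX_TYPES 256
static const char *type_names[MAX_TYPES];
static int n_types;

/* Called by the parser when it reduces "typedef ... name;". */
static void declare_type(const char *name) {
    if (n_types < MAX_TYPES) type_names[n_types++] = name;
}

/* Called by the lexer for every identifier it scans; one token of
   look-ahead then suffices to tell the two readings of "a * b;" apart. */
static enum token_kind classify(const char *name) {
    for (int i = 0; i < n_types; i++)
        if (strcmp(type_names[i], name) == 0) return TYPE_NAME;
    return IDENTIFIER;
}

int main(void) {
    declare_type("a");  /* as if "typedef int a;" was parsed earlier */
    printf("a -> %s\n", classify("a") == TYPE_NAME ? "type name" : "identifier");
    printf("b -> %s\n", classify("b") == TYPE_NAME ? "type name" : "identifier");
    return 0;
}
```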

[1] https://stackoverflow.com/questions/12971191/grammars-that-r...


No, even if a grammar is ambiguous it can require unbounded look-ahead to be parsed, although this is rarely the case for meaningful grammars such as the ones you would write for a programming language.

What I wanted to say is that you do not need complex algorithms to implement a parser if you have a grammar that can be parsed with a single look-ahead lexical element.


I meant to say: No, even if a grammar is not ambiguous ...



