Grammar presets. #26

modulovalue opened this issue Mar 29, 2023 · 42 comments

@modulovalue

modulovalue commented Mar 29, 2023

Hello @mdaines,

(I hope you don't mind another issue)

I think it would be awesome if there was a way to load sample grammars (and perhaps make it easy to submit them via PRs?). I'm thinking of something like the "Load Sample" feature in edotor, but with a short description of why each grammar is interesting.


"Mysterious conflict"

Here's one grammar that I think is interesting (source):

on grammophone

It demonstrates a grammar that is LR(1), but not LALR(1). The reduce/reduce conflict that appears when LALR(1) merges LR(1) states is known as a "mysterious conflict".

@modulovalue
Author

modulovalue commented Apr 5, 2023

Grammar examples

I've just found this: http://smlweb.cpsc.ucalgary.ca/example-grammars/ by chasing links starting at the readme of this project. This contains a bunch of grammar examples for various grammar classes.

@mdaines
Owner

mdaines commented Apr 5, 2023

Yes, those would be good examples. They're already imported for test fixtures: https://github.com/mdaines/grammophone/blob/76612cb618f22fa7aaf9ec53f2ddefb6194f85b2/test/fixtures/example_grammars.js

@modulovalue
Author

modulovalue commented Apr 5, 2023

Technique: Determinization via Unit Production Elimination

Here are two more that I think are interesting because they have practical relevance to e.g. parameter/argument lists that allow optional trailing commas:

This grammar attempts to capture interspersed lists where the separator can optionally occur at the end of the list.

The following grammar is SLR(1): on grammophone

S -> L ','.
S -> L.
L -> L ',' a.
L -> a.

The following grammar wraps the list from above in a separate production rule, which makes it not LR(1): on grammophone

S -> P ','.
S -> P.
P -> L.
L -> L ',' a.
L -> a.

A parser generator that naively transforms an "interspersion" construct to one with a level of indirection (like in the second grammar) would generate a grammar that is not LR(1).
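To make the transformation concrete, here is a minimal sketch (plain Python, not grammophone code) of going back in the other direction: eliminating the unit production P -> L by inlining P, which recovers the first, SLR(1) grammar. The grammar representation (a dict from nonterminal to lists of right-hand sides) is just for illustration.

def inline_single_unit_nonterminal(grammar, nt):
    # Inline a nonterminal whose only rule is a single unit production nt -> X:
    # replace every use of nt by X and drop nt's own rule.
    rules = grammar[nt]
    assert len(rules) == 1 and len(rules[0]) == 1, "expected exactly one unit rule"
    target = rules[0][0]
    result = {}
    for lhs, rhss in grammar.items():
        if lhs == nt:
            continue  # drop the inlined nonterminal itself
        result[lhs] = [[target if sym == nt else sym for sym in rhs] for rhs in rhss]
    return result

# The second (non-LR(1)) grammar from above.
grammar = {
    "S": [["P", ","], ["P"]],
    "P": [["L"]],
    "L": [["L", ",", "a"], ["a"]],
}
print(inline_single_unit_nonterminal(grammar, "P"))
# {'S': [['L', ','], ['L']], 'L': [['L', ',', 'a'], ['a']]} -- the first, SLR(1) grammar.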

@modulovalue
Author

modulovalue commented May 24, 2023

Grammar examples

Here's a gist I found that contains more: https://gist.github.com/lin/dc83bb38eb458ded3ff01aec4a327d54

@modulovalue
Author

modulovalue commented Jun 19, 2023

Testcase: unnecessary LR(1) splits

This grammar demonstrates a source of inefficiency in the automaton of the standard LR(1) construction.

Looking at the LR(1) automaton, I think that states 13 & 18 and states 17 & 12 could be merged and the resulting automaton would still resolve the ambiguity that LR(1) was meant to resolve. Only 6 & 9 and 14 & 19 are required to be distinct new states.

This grammar is given as an example in some slides that introduce langcc. However, I'm not sure if langcc eliminates this source of inefficiency.

It appears that the act of eliminating this inefficiency is known as "minimizing" an LR automaton.

Edit: the author of the booze-tools project wrote a little something about this grammar: https://boozetools.readthedocs.io/en/latest/minimal.html

Note: Minimizing LR(1) State Machines is NP-Hard

@modulovalue
Author

modulovalue commented Jul 22, 2023

Testcase: NQLALR guard

Here are two other interesting grammars:

When implementing an LALR parser using the DeRemer and Pennello relation-based algorithm, it is easy to accidentally implement something that is known as NQLALR (Not-Quite-LALR), and apparently, back when this was a hot research topic, many people made this mistake.

According to some papers, NQLALR doesn't fit into the SLR -> LALR hierarchy, meaning that there are grammars that are SLR(1) but not NQLALR(1). This observation can be used to debug an LALR implementation.

The following papers present example grammars that can be used to do that:


Simple Computation of LALR(1) Lookahead Sets, Manuel Bermudez

An LALR(1) grammar that is non-SLR(1) and non-NQLALR(1).

https://mdaines.github.io/grammophone/?s=UyAtPiBhIGcgZC4KUyAtPiBhIEEgYy4KUyAtPiBiIEEgZC4KUyAtPiBiIGcgYy4KQSAtPiBCLgpCIC0+IGcu#/


On the (non-) relationship between SLR(1) and NQLALR(1) grammars, Manuel E. Bermudez, Karl M. Schimpf

An SLR(1) grammar that is not NQLALR(1).

https://mdaines.github.io/grammophone/?s=SSAtPiBTLgpTIC0+IGEgQiBkLgpTIC0+IEMgQiBnLgpTIC0+IGMgZC4KQiAtPiBEIEUuCkQgLT4gLgpFIC0+IC4KQyAtPiBjLg==

@modulovalue
Author

modulovalue commented Jul 29, 2023

Testcase: LALR correctness

The following paper discusses an incorrect oversimplification that one can make while implementing their algorithm.

Efficient Computation of LALR(1) Look-Ahead Sets, Frank DeRemer & Thomas Pennello

The paper doesn't discuss this, but I think it is possible to view the whole algorithm as a single dataflow problem. Their wording implies that two separate SCC runs are necessary; I conjecture that this is not the case. In any case, that grammar might also be valuable in verifying that an implementation is not incorrect in some obvious way.

According to the paper, the following grammar should be LALR(1) and an incorrect implementation would report it as not being LALR(1).

https://mdaines.github.io/grammophone/?s=UyAtPiBJLgpJIC0+IGEgQiBkLgpJIC0+IEMgQiBnLgpJIC0+IGMgZC4KQiAtPiBEIEUuCkMgLT4gYy4KRCAtPiAuCkUgLT4gLg==

@modulovalue
Author

modulovalue commented Jul 31, 2023

Metric: Includes Theorem

DeRemer & Pennello conjectured that if the "includes" relation of their LALR(1) construction algorithm contains a nontrivial SCC (this glosses over the reads condition; see the paper), then the grammar can't be LR(k) for any k. This was proven by Thomas J. Sager in "A Short Proof of a Conjecture of DeRemer and Pennello". These are example grammars that Sager gives where that is the case:

G1:
https://mdaines.github.io/grammophone/?s=UyAtPiBCLgpBIC0+IGMgQy4KQSAtPiBhLgpCIC0+IGQgQS4KQyAtPiBiIEIuCkMgLT4gYiBkIEEgdC4=

G2:
https://mdaines.github.io/grammophone/?s=UyAtPiBCLgpBIC0+IGMgQy4KQSAtPiBhLgpCIC0+IGQgQSBELgpDIC0+IGIgZCBBIHQuCkMgLT4gYiBCLgpEIC0+IC4=

G3:
https://mdaines.github.io/grammophone/?s=UyAtPiBCLgpBIC0+IGMgQy4KQSAtPiBhLgpCIC0+IGQgQS4KQyAtPiBiIGQgQSBFIHQuCkMgLT4gYiBCLgpFIC0+IC4=

G4:
https://mdaines.github.io/grammophone/?s=UyAtPiBCLgpBIC0+IGMgQy4KQSAtPiBhLgpCIC0+IGQgQSBELgpDIC0+IGIgZCBBIEUgdC4KQyAtPiBiIEIuCkQgLT4gLgpFIC0+IC4=

In general, whether a CFG is LR(k) for some k is undecidable. This is one of the few known heuristics for detecting that a CFG is not LR(k) for any k.
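Operationally, "contains a nontrivial SCC" just means the relation's graph has a cycle (a self-loop counts), so a plain DFS cycle check is enough to apply the heuristic. Here is a minimal sketch in Python, assuming the includes (or reads) relation has already been computed; the node names below are placeholders, not the output of any particular LALR implementation.

def has_nontrivial_scc(edges):
    # edges: {node: iterable of successor nodes}. Returns True iff there is a cycle.
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GREY
        for succ in edges.get(node, ()):
            c = color.get(succ, WHITE)
            if c == GREY:  # back edge found: the relation has a cycle
                return True
            if c == WHITE and visit(succ):
                return True
        color[node] = BLACK
        return False

    return any(color.get(node, WHITE) == WHITE and visit(node) for node in edges)

# Hypothetical "includes" relation with a cycle between two nonterminal transitions.
includes = {
    "(q1, A)": ["(q2, B)"],
    "(q2, B)": ["(q1, A)"],
    "(q3, C)": [],
}
print(has_nontrivial_scc(includes))  # True -> by the theorem above, not LR(k) for any k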

@modulovalue
Author

modulovalue commented Jul 31, 2023

Transformation: Recursion Scheme Fusion

The PhD thesis by Philippe Charles on LALR(k) parsing and automatic error recovery discusses the following three grammars extensively (Figure 3.1, page 31). They all model a simple BNF grammar. He uses them to highlight the importance of LALR(k) with k > 1. ...

Example Grammar: LALR(2)

... (a) is LALR(2), (b) is a version of (a) that was converted to LALR(1), and (c) is a version of (a) that keeps the shape of (a) but requires an explicit separator to be LALR(1).

(a) (b) (c)


Note: this comment is very similar to #26 (comment); the transformation from (a) to (b) appears to be the same.

@modulovalue
Author

modulovalue commented Jul 31, 2023

Metric: Read Theorem

In Efficient Computation of LALR(1) Look-Ahead Sets, Frank DeRemer & Thomas Pennello, the authors prove that if there's a nontrivial SCC in the reads relation of their LALR(1) algorithm, then the grammar is not LR(k) for any k.

They give two grammars as examples. One is ambiguous, the other is not. This is the unambiguous grammar that is not LR(k) for any k. (The other, ambiguous grammar is the same one without the f at the end of the A production: on grammophone.)

@modulovalue
Author

modulovalue commented Sep 12, 2023

Transformation: (Right-)CPS

The following grammar is presented in the paper published alongside langcc (cf. page 5, bullet 4):

Field -> S E.
Field -> E E.
S -> s.
S -> .

It is not LR(1), but can be made SLR(1) by the CPS transformation in that paper, which would transform it to the following grammar:

Field -> S.
Field -> E E.
S -> s E.
S -> E.

It can also be made SLR(1) by simply inlining S on grammophone.

@modulovalue
Author

modulovalue commented Oct 24, 2023

Metric: Context splitting only resolves R/R conflicts

In his dissertation on LR parsing (page 94, first bullet, third paragraph), Xin Chen claims that it is widely known that only reduce/reduce conflicts can be resolved by state splitting and not shift/reduce conflicts.

I think this is an interesting fact that doesn't seem obvious to me and I wanted to capture it here.

It can be observed in the first grammar in this issue and in the grammar mentioned in this comment. The LALR(1) parsing tables contain only reduce/reduce conflicts, so, I guess, an efficient implementation would use state splitting only to resolve those reduce/reduce conflicts and not consider shift/reduce conflicts at all.

Considering that claim, it is clear that the (a) grammar in #26 (comment) can't be determinized with state splitting alone. As the paper claims, either it needs to be transformed to (b), or more lookahead is needed.

@modulovalue
Author

modulovalue commented Oct 24, 2023

Technique: Moving LR(k) to runtime

This LR(2) grammar contains a reduce/reduce conflict in its LR(1) automaton and is used as an example for doing LR(k) at runtime in the dissertation by Xin Chen (page 164).

Testcase: LR(1) to LR(2) inefficient splitting

It is also an example of a source of inefficiency (similar to #26 (comment)) in the canonical LR(1) construction. LR(1) doesn't help with resolving any inadequate states (LR(0) has 3, LALR(1) resolves two of them, LR(2) resolves the remaining one), but the canonical LR(1) construction still splits one state. An efficient implementation of LR(k) should consider that and not split states that don't help with resolving inadequate states. (TODO: or maybe it needs to split this state in LR(1) so that LR(2) can be effective?)

@modulovalue
Author

modulovalue commented Oct 24, 2023

Transformation: ?

This mentions a yacc-grammar-snippet grammar for a language that is LR(2), although the grammar itself is not. This is also discussed in Xin Chen's dissertation (here, on page 167). The first link in this comment contains some prose that discusses different ways of dealing with it by, for example, rewriting it to an LR(1) grammar such as (b) or (c).


After a little bit of cleaning up, we get the following grammar:

on grammophone

# rule+
yacc -> rule | yacc rule.

# rulebody ";"?
rule -> rulebody | rulebody ";".

# (RULENAME ':' alt."|"+)?
rulebody -> RULENAME ':' alt | rulebody "|" alt.

# (TOKEN | RULENAME)*
alt -> alt TOKEN | alt RULENAME | .

TODO: http://www.cs.man.ac.uk/~pjj/complang/howto.html
TODO: http://www.cs.man.ac.uk/~pjj/complang/howto2.html
TODO: http://www.cs.man.ac.uk/~pjj/complang/grammar.html
TODO: http://www.cs.man.ac.uk/~pjj/complang/g2lr.html
TODO: http://www.cs.man.ac.uk/~pjj/cs2121/ho/node7.html
TODO: http://www.cs.man.ac.uk/~pjj/complang/g2ll.html
TODO: How can we go from that to (b) or (c), see derivation from above.

Consider (b):

  • we can right factor yacc and move the contents of the base case of rulebody out from the left (inverse of Left-CPS):

on grammophone

yacc -> RULENAME ':' rulebody T.

T -> ";" | yacc | ";" yacc | .

rulebody -> 
	| rulebody TOKEN
	| rulebody RULENAME	
	| rulebody "|".
  • we can turn the left recursion into right recursion and apply Right-CPS

on grammophone

yacc -> RULENAME ':' rulebody.

T -> ";" | yacc | ";" yacc | .

rulebody -> T
	| TOKEN rulebody
	| RULENAME rulebody	
	| "|" rulebody.

@modulovalue
Author

modulovalue commented Oct 24, 2023

Class: LR(closed)/LR(infinite)

Chris Clark suggested a grammar class that he calls LR(closed) here. An example grammar can be found on grammophone.

Quote:

Here is a grammar that isn't LR(k) for any k, for a language that
could have an LR(1) grammar for it--in fact the language is regular.
The language is (a|b)a+(a|b).

S : a A a ;
S : b A b ;
S : a B b ;
S : b B a ;
A : A a ;
B : B a ;
A : a ;
B : a ;

Note, this grammar is particularly interesting to me as it is part of
what we tried to solve in Yacc++, when we did LR(infinite) (or what I
sometimes call LR(closed)) parsing. If you look carefully, you will
see that you can parse it deterministically. However, you have to push
an indefinite number of rules on the "stack". Still, when you get
done, and you see the EOF marker, you can then deterministically
resolve the rules you had pushed. I believe that every unambiguous
CFG has such a deterministic parse, and I believe that computing the
"closure" is the method to find it (if it exists), which is why I
sometimes refer to the method as LR(closed). Unfortunately, due to
the halting problem, there are grammars for which attempting to
compute the closure will never terminate.

It is also mentioned in Xin Chen's dissertation here on page 170.

This is interesting in a different way. Here, an LR split removes a local ambiguity on a 'b', but since it doesn't remove the local ambiguity on 'a', that ambiguity is duplicated with the split. The number of local ambiguities can therefore increase when going from LALR(1) to LR(1) if one is not careful about how one counts them.


Determinization: Convert left recursion to right recursion -> Procedure cloning -> Right-CPS

We can clone A and B to get the following grammar: on grammophone Not LR(1)

We can convert the left recursion to right recursion: on grammophone Not LR(1)

We can then apply CPS to move the right context to the left into the base case of the recursion: on grammophone SLR(1)

@modulovalue
Author

modulovalue commented Oct 24, 2023

Minimal Grammar Example: LR(2)

Here's possibly the simplest LR(2) grammar (source).

@modulovalue
Author

modulovalue commented Oct 24, 2023

Minimal Grammar Example: LR(1)

Here's possibly the simplest LR(1) grammar that is not LALR(1) and doesn't introduce inefficient state splits in the canonical LR(1) construction (source).

Smaller

@mdaines
Owner

mdaines commented Oct 24, 2023

@modulovalue I decided to enable discussions as an experiment, and created an example grammars section: https://github.com/mdaines/grammophone/discussions/categories/example-grammars

I appreciate you posting these grammars, but I was wondering if discussions might be a better format than a single issue. Let me know what you think; I'm not sure if I like GitHub discussions or not.

I still plan to add a few example grammars to Grammophone when I get a chance.

@mdaines
Owner

mdaines commented Oct 24, 2023

It's not bothersome. Feel free to keep adding them here if you like.

Also, if you end up publishing the examples you've collected somewhere else I'm happy to link to that.

@modulovalue
Author

modulovalue commented Oct 24, 2023

Grammar Example LR(k) for any k

Here's a simple formula for constructing a sample LR(k) grammar for k > 0 (source):

// Here c^k means the concatenation of k letters 'c'.
// This is an LR(k+1) grammar, depending on the value of k.
S -> a A D a.
S -> b A D b.
S -> a B D b.
S -> b B D a.
A -> a.
B -> a.
D -> c^k.

LR(1), LR(2), LR(3), LR(4), LR(5), ...
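A tiny sketch that spells out the c^k notation by generating the grammar text for a chosen k (assuming k >= 1, as in the source); the result can be pasted into grammophone and should come out as LR(k+1).

def lr_grammar(k):
    # D -> c^k, i.e. k copies of the terminal c.
    c_k = " ".join("c" for _ in range(k))
    return "\n".join([
        "S -> a A D a.",
        "S -> b A D b.",
        "S -> a B D b.",
        "S -> b B D a.",
        "A -> a.",
        "B -> a.",
        "D -> " + c_k + ".",
    ])

print(lr_grammar(2))  # should be LR(3) but not LR(2)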

@modulovalue
Author

modulovalue commented Oct 25, 2023

Metric: Ill-foundedness

Here is an example of an "ill-founded" grammar as defined by the booze-tools project.

It seems to be similar to the "unrealizable nonterminals" metric that grammophone exposes.

It would be interesting to know whether the "ill-founded" concept adds something new or whether unrealizable nonterminals and ill-foundedness are one and the same thing.

on grammophone

Source

Here, "well-founded" means "can possibly produce a finite sequence of terminals."
The opposite is called "ill-founded".

Here are two examples of ill-founded grammars:
	S -> x S   -- This is ill-founded because there's always one more S.
	A -> B y;  B -> A x     -- Similar, but with mutual recursion.

A terminal symbol is well-founded. So is a nullable symbol, since zero is finite.
A rule with only well-founded symbols in the right-hand side is well-founded.
A non-terminal symbol with at least one well-founded rule is well-founded.
Induction applies. That induction happens to be called ``bipartite_closure``.

A grammar with only well-founded symbols is well-founded.
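Here is a minimal sketch of that induction in plain Python (not booze-tools' bipartite_closure): iterate to a fixpoint, starting from the terminals, and collect every symbol that can derive some finite terminal string; whatever never enters the set is ill-founded. If grammophone's "unrealizable nonterminals" means "cannot derive any terminal string", then it is the same fixpoint, which would suggest the two notions coincide.

def well_founded_symbols(grammar, terminals):
    # grammar: {nonterminal: [list of right-hand sides (symbol lists)]}.
    wf = set(terminals)  # every terminal is well-founded
    changed = True
    while changed:
        changed = False
        for lhs, rhss in grammar.items():
            if lhs in wf:
                continue
            # Well-founded if some rule has only well-founded symbols on the
            # right-hand side; an empty right-hand side (nullable) qualifies.
            if any(all(sym in wf for sym in rhs) for rhs in rhss):
                wf.add(lhs)
                changed = True
    return wf

# The two ill-founded examples from the quote:
print(well_founded_symbols({"S": [["x", "S"]]}, {"x"}))                          # {'x'}: S is ill-founded
print(well_founded_symbols({"A": [["B", "y"]], "B": [["A", "x"]]}, {"x", "y"}))  # x, y only: A and B are ill-founded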

@modulovalue
Author

modulovalue commented Oct 25, 2023

Testcase: GLR hidden left recursion termination issues

A naive implementation of GLR would have trouble with certain grammars containing epsilon productions. This is one: on grammophone

This is known as hidden left recursion.

Source

## Tomita's GSS Algorithm:

When you boil it all down, Tomita's chief contribution is the idea to use
a directed acyclic graph rather than a tree of states: each node could have
more than one predecessor, and so after any given machine cycle, you have
at most one active graph-node per LR table state. (In practice, you'll have
many fewer such active nodes).

Neglecting epsilon rules, this works pretty well: You still have a search
problem to carry out reductions, but the performance bounds are polynomial
in the size of the input string.

The problem with epsilon-rules is not immediately obvious, but it's easy
to trigger with a hidden-left-recursive grammar such as this:

S -> E S a     (rule 1)
S -> b         (rule 2)
E -> :epsilon  (rule 3)

It should be clear upon inspection that this context-free grammar describes
in fact a regular language (`ba*`) but the obvious variations on Tomita all
fail to recognize the string `baa`. Why? Well, with lookahead `b` we can
certainly shift the first token of rule 2 (where we'll soon recognize `S`)
but we can also possibly reduce rule 3 before shifting the `b`, but we wind
up in the same state after shifting `b` both ways. Now if you carefully
play this scenario forward: The first branch dies; the second allows `ba`
but then can't go back in time to recognize however many `E` symbols must
have preceded the first `b` so as to consume the right number of `a`
tokens which follow.

There are a number of hacks which solve the problem for subsets of the
epsilon-grammars, but no simple solution for all of them.

Tomita recognized these difficulties but did not offer a complete solution.

@modulovalue
Author

modulovalue commented Oct 28, 2023

Tutorial: LR(1)

The LALRPOP project provides an excellent explanation of how to implement LR(1) through the lane tracing method:

First example: on grammophone & the explanation
Second example: on grammophone & the explanation

This blogpost appears to be an extended version of that readme.

@modulovalue
Author

modulovalue commented Oct 31, 2023

Testcase: LR(1) disambiguation related incompleteness

The IELR paper claims:

In this paper, we demonstrate that a well known algorithm described by David Pager and implemented in Menhir, the most robust minimal LR(1) implementation we have discovered, does not always achieve the full power of canonical LR(1) when the given grammar is non-LR(1) coupled with a specification for resolving conflicts.

This is the grammar that they use on grammophone.

Notice how grammophone identifies that there are 4 example sentences.

[...] If he has declared a as left-associative, the parser chooses to reduce in the cases of both aa·a and aa·aa. In this way, the grammar author has specified a new language consisting of only 3 sentences: aaa, bab, and baab.

So, apparently, a strategy that uses associativity declarations for removing ambiguities could lead to a different language being recognized.

@modulovalue
Author

modulovalue commented Nov 1, 2023

Transformation: Deduplication

A very simple LR(2) grammar on grammophone with some prose (source) that shows how to rewrite that grammar to LR(0) by removing duplicates.

This grammar shows that it might make sense to have some form of duplicate detection that identifies identical production rules.
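One simple shape such duplicate detection could take, sketched in plain Python (the example grammar below is hypothetical, not the one linked above): group nonterminals whose rule sets become identical once each nonterminal's references to itself are normalized away.

from collections import defaultdict

def duplicate_nonterminals(grammar):
    # Group nonterminals with identical rule sets (up to their own name);
    # any group with more than one member is a candidate for merging.
    groups = defaultdict(list)
    for lhs, rhss in grammar.items():
        key = frozenset(
            tuple("<self>" if sym == lhs else sym for sym in rhs) for rhs in rhss
        )
        groups[key].append(lhs)
    return [group for group in groups.values() if len(group) > 1]

# Hypothetical grammar: A and B define the same right-recursive list of a's.
grammar = {
    "S": [["A", "x"], ["B", "y"]],
    "A": [["a"], ["a", "A"]],
    "B": [["a"], ["a", "B"]],
}
print(duplicate_nonterminals(grammar))  # [['A', 'B']] -- merging them shrinks the grammar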

@modulovalue
Author

modulovalue commented Nov 1, 2023

Grammar Examples

This repo contains A TON of random LR(k) grammars and a bunch of LR(1) grammars in bison syntax.

@modulovalue
Author

modulovalue commented Nov 8, 2023

Class: LR-regular (LRR)

This is an example grammar that is meant to be LRR (LR-regular): on grammophone.

It is a proper superclass of LR(k). That grammar comes from the following paper that introduces LRR:

LR-regular grammars—an extension of LR(k) grammars, Karel Čulik II and Rina Cohen


There's a somewhat common source of ambiguities in LR parsers that many people point out, where unbounded lookahead would be needed to remove all inadequate states, so those grammars can't be LR(k) for any k. I'm wondering whether LRR could help with that. I'm also wondering about the relationship between Chris Clark's LR(closed) grammars and LRR.

It would be nice to find some minimal LRR grammars so that their automata can be inspected more easily.

I'm also wondering if LRR could be generalized to something like "LR-CF", where the lookahead sets themselves are not just regular, but descriptions of a context-free language. Or maybe at least visibly pushdown languages (https://github.com/ianh/owl), which support a limited form of recursion that regular languages don't, and which retain desirable properties that regular languages have but context-free languages don't.


Leo optimizes Earley-style parsers to take linear time for LRR grammars.

One parser that claims to be able to parse LRR grammars is Marpa: A practical general parser: the recognizer, Jeffrey Kegler.

@modulovalue
Author

modulovalue commented Nov 18, 2023

Class: ECFGs

This paper by Kristensen and Madsen proposes to extend the LR-theory from CFGs to ECFGs. It argues that by doing that, many conflicts can be avoided.

(See page 8 for reference):

star-operator

The following grammar:

S -> E.
E -> a b* b d.
E -> a b* b c.

can be desugared into a left-recursive version (on grammophone) or a right-recursive version (on grammophone), both of which are not LR(1).

The LR(1) automaton of the left-recursive version contains an r/r conflict where both productions are epsilon productions in one inadequate state.


The LR(1) automaton of the right-recursive version contains two of the same r/r conflicts, but with an additional shift conflict for each, in two inadequate states.


If we canonicalize b*, then the left-recursive version becomes SLR(1) (on grammophone); the same trick doesn't work with the right-recursive version (on grammophone): it still has an s/r conflict. More lookahead might be able to resolve this conflict.

Left-recursive definitions in bottom-up parsers are more efficient because they don't need O(n) stack space. However, it is not clear to me which kinds of inadequate states can benefit from a right-recursive definition. Right-recursive definitions don't appear to be strictly worse (excluding performance).

From a performance perspective, left-recursive definitions are superior. But left-recursive definitions can introduce inadequate states that right-recursive definitions can't introduce (#26 (comment)).


On page 9, Kristensen and Madsen show an alternative desugaring that is SLR(1). (There's a typo there, the last production in both grammars should be B -> b, otherwise they recognize a different language. Or that section is just wrong):

left-recursive: on grammophone
right-recursive: on grammophone

But even there, the left-recursive version is LR(0), and the right-recursive version isn't. So it might make sense to prefer the left-recursive definition there (ignoring any performance benefits) because the increased expressivity might propagate to fewer inadequate states in more complex grammars?

If we apply the CPS idea from langcc, we can make the right recursive version LR(0): on grammophone

or-operator

Consider page 9 and page 33:

A naive desugaring produces a non-LR(1) grammar: on grammophone. If we canonicalize, we get an LR(0) grammar on grammophone, but by doing that we would lose meaning (e.g., the semantic actions would have to be the same so an automatic transformation is not practical here.)

If we apply the automatic transformation, we can keep meaning and have an LR(0) grammar: on grammophone


Kristensen and Madsen don't seem to provide a method for applying their transformation automatically to ECFGs, but 50 years later, langcc did that, although the author doesn't mention that in the paper.


I think the key question here is whether we could benefit from extending the definitions of nullsets/firstsets/(slr) followsets/(lalr) lookaheadsets to ECFGs and introduce new actions in LR automata to support + (and interspersed +), * (and interspersed *), ? and | explicitly as first class operations without having to desugar them to a mutilated CFG. (regular expression to NFA to DFA conversions already need to implicitly consider such extended definitions of nullsets/firstsets/followsets, so this shouldn't be anything new.)

Applying any such transformations makes it much harder to keep a sane parse tree or do incremental parsing and, in general, it just complicates things by a lot.

I haven't seen a single project that implements what the paper says, i.e., extending the LR-theory to support ECFGs as a first-class specification. Maybe by doing that we can have all the performance benefits of left-recursive definitions, the hypothetical additional expressiveness of right-recursive definitions, support incremental parsing, and have sane parse trees without having to do anything really complicated.

(Note: it is not straightforward to restore ASTs from regular expressions, so I guess it won't be straightforward to restore parse trees from first class ECFGs. https://github.com/ianh/owl shows that the parse tree problem for regular expressions can be solved.)

@modulovalue
Author

modulovalue commented Nov 18, 2023

Technique: Left vs Right Recursion

Amazing, I've just found an example that provides a grammar where a right-recursive definition can be superior:

What follows is a minimal example of what that post says:

If we want to parse a list of comma separated values with an optional trailing comma, we can:

  • accept a left recursive list that has a trailing comma and one that doesn't: on grammophone => Not LR(1)
  • accept a right recursive list that has a trailing comma and one that doesn't on grammophone => SLR(1)

Furthermore, we can:

  • accept a single right recursive list that has an optional trailing comma on grammophone => SLR(1)

Note: A right-recursive definition is not enough if the optional trailing separator is not in the base case of the right-recursive definition (on grammophone => not LR(1)). Joe's CPS transformation achieves precisely that: it moves the optional trailing separator into the base case.

However, we can have an LR(1) left-recursive definition of that language: on grammophone, so right recursion is not strictly necessary.

To conclude, a quote from the linked blogpost:

The idea that “left recursion is good, right recursion is bad” is a myth, at least in terms of conflicts. There are ways of working with left recursion and ways of working with right recursion; they are just not the same.


And to be complete, here's the example where a left recursive definition is superior:


Observation: followset differences between left and right recursion

Consider right-recursive on grammophone & left-recursive on grammophone

Notice how the rec nonterminal in the left-recursive version has a larger followset: label can follow rec in the left-recursive version, but not in the right-recursive version. Since LALR lookahead sets refine followsets, a similar observation could be made for LALR lookaheads.


Note:

Right recursive (x.',')*','?: on grammophone

@modulovalue
Author

modulovalue commented Nov 20, 2023

Transformation: (Left-)CPS

The CPS technique (#26 (comment)) introduces a transformation that is able to move the right remainder of a rule into a nonterminal to its left. However, it would also make sense to do the same with left remainders, and move them into nonterminals to the right.

This is a comma separated list that accepts an optional trailing comma on grammophone.
Applying such a "Left CPS" transformation would result in the following grammar: on grammophone. Here, the unnecessary intermediate O could be removed and the grammar became more compact. Therefore, such a Left-CPS transformation can make grammars more amenable to an optimization that removes unit productions. Also, applying "Left CPS" and inlining the intermediate O can lead to fewer conflicts. (TODO find a small repro grammar for this observation)

Left-CPS can only be applied to lists if they are left recursive, and Right-CPS only to right-recursive lists. (Why? Because lists end on the left if they are left recursive and on the right if they are right recursive.)

@modulovalue
Author

modulovalue commented Nov 20, 2023

Transformation: Pete Jinks

See: http://www.cs.man.ac.uk/~pjj/complang/heuristics.html

That page contains a bunch of equalities that can be used to rewrite grammars. I'm wondering if it would be possible to find more by, e.g., formalizing CFGs (ECFGs?) and their algebraic rules as an algebra and putting that through, e.g., the Knuth-Bendix completion procedure (and maybe we can even find a confluent term rewriting system)?

Some transformations are not obvious to me, especially the self-referential ones. Can an axiomatic system be defined to make their correctness more obvious by deriving them from basic algebraic manipulations?

PJ 4.12

(A* # B)* = (B* # A)* = (A+|B+)* = (A|B)* => made initial grammar SLR(1)

@modulovalue
Author

Example: CPS-eligibility criterion for *

Here's another example of a grammar (a* a) that can benefit from the CPS transformation.

Before: on grammophone Not LR(1)

After: on grammophone SLR(1)

Maybe this observation could be translated into a CPS-eligibility criterion: let the grammar be X* Y; if any symbol of first(Y) is in first(X), then the CPS transformation should be applied?
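Here is what checking that criterion could look like, sketched in plain Python (the criterion itself is only a guess from the example above, not something taken from the CPS paper): compute FIRST sets with the usual fixpoint and test whether FIRST of the trailing part overlaps FIRST of the starred part.

def first_sets(grammar, terminals):
    # Standard FIRST/nullable fixpoint. grammar: {nonterminal: [RHS symbol lists]}.
    first = {t: {t} for t in terminals}
    first.update({nt: set() for nt in grammar})
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhss in grammar.items():
            for rhs in rhss:
                all_nullable = True
                for sym in rhs:
                    if not first[sym] <= first[lhs]:
                        first[lhs] |= first[sym]
                        changed = True
                    if sym not in nullable:
                        all_nullable = False
                        break
                if all_nullable and lhs not in nullable:
                    nullable.add(lhs)
                    changed = True
    return first

# "a* a", with the star desugared into a left-recursive list R: S -> R a.  R -> R a | .
grammar = {"S": [["R", "a"]], "R": [["R", "a"], []]}
first = first_sets(grammar, {"a"})

# Hypothetical eligibility check for a rule shaped "X* Y" (here X* is R and Y is 'a'):
# apply CPS when FIRST(Y) overlaps FIRST(X*), i.e. the parser can't see where the list stops.
print(bool(first["R"] & first["a"]))  # True -> candidate for the CPS transformation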

@modulovalue
Author

Transformation: Procedure Cloning

GFGs (#27) introduce the idea of applying a traditional program optimization technique known as procedure cloning to CFGs.

Consider the following grammar (for this regular language: (m|n)*mf(m|n)*n): on grammophone. It is not LR(1). However, by combining procedure cloning and CPS we can make it SLR(1): on grammophone. We clone the flattened (m|n)* L list into L1 and L2 and replace each use of L with a unique list. This allows us to apply CPS without losing any information related to semantic actions.

We can make the grammar SLR(1) by using a left-recursive list: on grammophone.
However, if we add a level of indirection with the left-recursive list, the grammar loses its SLR(1) property and is not even LR(1): on grammophone. The right-recursive version is not susceptible to this level-of-indirection-induced inadequacy: on grammophone. The topic of unit rule elimination might be relevant here: #26 (comment)

@modulovalue
Author

Example: Need for LL(k) where k > 1.

Terence Parr and Russell Quong discuss the need for LL(k) & LR(k) where k > 1 in LL and LR Translators Need k > 1 Lookahead.

Example 1: We illustrate the advantage of an LL(2) grammar over an LL(1) grammar for recognizing a fragment of the C language. Consider distinguishing between C labels "ID :" and C assignment statements "ID = ...", where ID represents an identifier token. In the following grammar fragment, rule stat requires two lookahead symbols and is easily expressed with an LL(2) grammar. This common language feature is hard to express with an LL(1) grammar because the ID token is matched in many different grammar locations. Left-factoring the common ID prefix in stat and expr would result in a much less natural grammar.

on grammophone

Although manual left-factoring can sometimes solve the problem, it is not ideal for two reasons. First, even if left-factoring yields a reasonable grammar, the LL(k) grammar for k>1 is simply more convenient, which tends to reduce development time. Secondly, while left-factoring might be theoretically possible, it can be practically implausible, as in this example, where expr occurs throughout the grammar.

The next example, example 2, discusses the issue from #26 (comment).

The other examples don't appear to add anything new to this issue. A lot of that article discusses how actions can reduce the expressivity of LR parsers to that of an LL parser, but that's a big can of worms that I'm not going to try to investigate in this issue.

@modulovalue
Author

Transformation: Extended CPS

The CPS paper appears to only consider basic EBNF operations such as * and + in its flattening scheme and CPS transformation.

We can go further. One might need to have multiple base cases (such as when we want to parse lists with interspersed commas and optional trailing commas, see #26 (comment)). The paper only considers those with one base case. We can extend CPS to flattening procedures that introduce multiple base cases. I'm going to call this Extended CPS (ECPS).

Here's an example of a basic ECPS application with multiple base cases:

That simple grammar is a snippet from the Dart grammar that describes a reduced version of function expressions and record patterns.

Notice how the version without CPS is not LR(1), but the version with CPS is LR(0).

The CPS paper describes a very basic form of detecting which nonterminals are eligible for CPS. It can only be applied to EBNF symbols that were flattened before. I think we can do better. A better CPS transformation should not have to depend on the flattening scheme and should be able to find CPS-eligible symbols by examining the structure of the CFG, for example by looking for cycles in it. Maybe it's enough to consider SCCs (via Tarjan) that consist of a single fundamental cycle (via Johnson) with a single entry point to determine CPS eligibility (if there are multiple entry points, procedure cloning could help). If the recursion tails on the right, it should be eligible for Left-CPS, and if it tails on the left, it should be eligible for Right-CPS. (Also see jzimmerman/langcc#48 (comment) for the same idea, but for optionals.)

@modulovalue
Author

This link contains a tutorial from Cornell on how to make a Java grammar LALR(1). There might be something interesting there.

TODO implement the examples in grammophone

@modulovalue
Author

This link contains a tutorial on LL(k) from one of the only LL(k) parser generators (SLK). I'm not focusing on LL parsing, and most of the material in this issue covers LR parsing, but that page is very detailed, so it deserves to be mentioned.

TODO implement the examples in grammophone

@modulovalue
Author

Example: "star lookahead + 1"

Consider: on grammophone

S -> A | B.
A -> B1 a.
B -> B2 b.
B1 -> B1 b | b.
B2 -> B2 b | b.

(From: A Methodology for Removing LALR(k) Conflicts, right after Figure 1)

This is not LALR(k) for any k because unbounded lookahead is needed to decide whether to reduce B1 or B2. However, the language itself is regular.

I don't know of any strategies for dealing with this. It would be interesting to find a way of detecting conflicts of this kind and dealing with them.

  • does LR-regular handle it?
  • what automatic transformation could make this LALR(x) while preserving semantic actions?

@modulovalue
Author

SLR(1) / LALR(1)

Here's a simple grammar that is LALR(1) but not SLR(1).

On grammophone

S -> a A c | a B d | B c.
A -> z.
B -> z.
