Edit 12th Feb 2025

Information So, it turns out the actual grammar is more complex than what I initially thought. There seems to be a kind-of induced recursion due to mutual recursion between rule $\text{prefix}$, $\text{template-prefix}$ and $\text{prefix}$, $\text{closure-prefix}$. If you’re reading this post for the first time, just disregard this message.

\[ \text{A long-standing issue regarding algorithms that manipulate} \\ \text{context-free grammars (CFGs) in a “top-down” left-to-right fashion} \\ \text{is that left recursion can lead to nontermination.} ^{\text{[1]}} \]

Fixing Left Recursion

A grammar is left recursive if it comes in the following form

\[ \langle S \rangle ::= \langle S \rangle \ a \mid b \]

When parsing or generating a string using this grammar (using a backtracking algorithm), where $ \langle S \rangle $ is the start symbol, you are likely to enter a state of infinite recursion. Any ideas on how to fix this?

\[ \langle S \rangle ::= b \mid \langle S \rangle \ a \]

Rewriting it in above form won’t change a thing, except the fact that it’ll try to read or generate the terminal $b$ first. How do we fix this then? The root of our problem is left recursion, so if we can somehow make it right recursive then it’ll fix our issue. Before reading any further, give it a try yourself.

--- title: example string "aaa" --- stateDiagram-v2 [*] --> S S --> b S --> Sa Sa --> S b --> [*]

For given example string $\text{aaa}$ the parser will never terminate. Not even to inform us that the string does not belong to language generated by corresponding grammar.

Language Analysis

Let’s start by generating some sample strings in the given language, and try to discover a pattern. The following are expansion paths for $ \langle S \rangle $ :

\[ \begin{align} L(S) &= b \\ L(S) &= \langle S \rangle a \rightarrow \langle S \rangle \ a \ a \rightarrow \langle S \rangle \ a \ a \ … \ a \rightarrow b \ a \ a \ … \ a \end{align} \]

So we have two possible expansions $b$ or $b \ a^*$. The symbol $*$ means as many repetitions of anything that comes before it. It can be nothing, like an empty string, or it can be something. So, this lets us know that a left recursive grammar always starts with $b$, and then you generate or match as many $a$ as possible. The interesting thing about $a^*$ is that it can be generated with right recursion as well, with a rule like :

\[ \langle N \rangle = a \langle N \rangle \mid \epsilon \]

stateDiagram-v2 [*] --> N N --> aN : prepend a aN --> N : expand N N --> [*]

Now this can also be generated with a left recursion as well, but we’re doing all this just to avoid it, so we only care about right recursion. Now combining all the information we’ve gather up until now, we can

\[ \begin{align} \langle S \rangle &::= b \langle S’ \rangle \\ \langle S’ \rangle &::= a \langle S’ \rangle \mid a \end{align} \]

This grammar as you can see is completely right recursive.

--- title: example string "aaa" --- stateDiagram-v2 [*] --> S S --> bS' S' --> aS' S' --> a bS' --> S' aS' --> S' a --> [*]

This new machine will quickly terminate when it notices that the example string $ \text{aaa} $ does not start with $b$ like in our language.

Real Life Example

So, I’ve been (re)writing a C++ demangler for RizinOrg’s rz-libdemangle which was previously licensed to GPL, and I’m re-writing it to relicense it to LGPL v3, which is more permissive for commercial usage. You are allowed to link to precompiled binaries licensed with LGPLv3.

While re-writing grammar for Itanium ABI, I encountered a left recursive grammar.

\[ \begin{align} \langle \text{prefix} \rangle &::= \langle \text{unqualified-name} \rangle \\ &\quad \mid \langle \text{prefix} \rangle \langle \text{unqualified-name} \rangle \\ &\quad \mid \langle \text{template-prefix} \rangle \langle \text{template-args} \rangle \\ &\quad \mid \langle \text{closure-prefix} \rangle \\ &\quad \mid \langle \text{template-param} \rangle \\ &\quad \mid \langle \text{decltype} \rangle \\ &\quad \mid \langle \text{substitution} \rangle \end{align} \]

Notice how production number $6$ is left recursive in nature. Wanna know how I fixed it in code? You’ll have to follow this post a bit more. Real life examples are bit more complex than theory. Life happend here as well, and made some things not-so-straightforward.

Mutual Left Recursion

A mutual left recursion happens when two rules mutually call each other by expanding first non terminal as other rule. It generally looks like this :

\[ \begin{align} \langle S \rangle &::= \langle A \rangle \mid \langle B \rangle \\ \langle A \rangle &::= \langle B \rangle k \mid X \\ \langle B \rangle &::= \langle A \rangle m \mid Y \end{align} \]

Notice how production $13$ and $14$ call each other, making it a mutual recursion. Using a similar analysis as done for simple left recursion, we can remove left recursion here as well, by making each rule right recursion only.

stateDiagram-v2 [*] --> S S --> A S --> B A --> Bk Bk --> B B --> Am Am --> A A --> X B --> Y X --> [*] Y --> [*]

You can visually see the mutual recursion in path $ \textbf{A} \rightarrow \text{Bk} \rightarrow \textbf{B} \rightarrow \text{Am} \rightarrow \textbf{A} $, wherein $A$ ends up calling $B$ which eventually calls $A$.

Language Analysis

If you try to follow the pattern, then you’ll get the following possible languages

\[ \begin{align} L(A) &= X (mk)^* \\ L(A) &= Y k(mk)^* \\ L(B) &= Y (km)^* \\ L(B) &= X m(km)^* \end{align} \]

Noticing this pattern again, I can clearly see the right recursion this time again, so, I’ll re-write the mutually left recursive grammar as follows

\[ \begin{align} \langle S \rangle & ::= & \langle S1 \rangle \mid \langle S2 \rangle \\ \langle S1 \rangle & ::= & \langle A \rangle \mid \langle A \rangle \langle R1 \rangle \\ \langle S2 \rangle & ::= & \langle B \rangle \mid \langle B \rangle \langle R2 \rangle \\ \langle A \rangle & ::= & X \mid Y k \\ \langle B \rangle & ::= & Y \mid X m \\ \langle R1 \rangle & ::= & m k \langle R1 \rangle \mid m k \\ \langle R2 \rangle & ::= & k m \langle R2 \rangle \mid k m \end{align} \]

Again, notice that since the only recursive productions $20$ and $21$ are right recursive only, the whole grammar is right recursive, and at the same time, the grammar is no more mutually left recursive.

stateDiagram-v2 [*] --> S S --> S1 S --> S2 S1 --> A S1 --> AR1 AR1 --> R1 : expand R1 R1 --> mkR1 mkR1 --> R1 S2 --> B S2 --> BR2 BR2 --> R2 : expand R2 R2 --> kmR2 kmR2 --> R2 A --> X A --> Yk B --> Y B --> Xm X --> [*] Y --> [*] Yk --> [*] Xm --> [*] R1 --> mk R2 --> km mk --> [*] km --> [*]

Real Life Example

So, while (re)writing the demangler, I came across some rules that are mutually recursive. Take a look at the following grammar, and comment what you see :

\[ \begin{align} \langle \text{template-prefix} \rangle &::= \langle \text{template unqualified-name} \rangle \\ &\quad \mid \langle \text{prefix} \rangle \langle \text{template unqualified-name} \rangle \\ &\quad \mid \langle \text{template-param} \rangle \\ &\quad \mid \langle \text{substitution} \rangle \end{align} \]

\[ \begin{align} \langle \text{closure-prefix} \rangle &::= [ \langle \text{prefix} \rangle ] \langle \text{variable or member unqualified-name} \rangle M \\ &\quad \mid \langle \text{variable template template-prefix} \rangle \langle \text{template-args} \rangle M \end{align} \]

Did you notice it? $\langle \text{prefix} \rangle$ is mutually left recursive with $\langle \text{template-prefix} \rangle$ and $\langle \text{closure-prefix} \rangle$ at the same time, and this is where things get a bit complicated. The solution is not hard, it’s really simple though. I’ll have to show my original solution though.

Solution

For making things easy while writing the demangler, and for easy implementation of grammar rules and productions, I’ve devised some macros that make it look like I’m using a DSL to write the demangler. So, to help you understand, I’ll just show you the before and after code, and leave you to diff these out by yourself to understand. You can see the complete source code in the PR, or in source code of rz-libdemangle after my PR is merged.

Before Solving Anything

Before I’ve made any fixes to the grammar, here’s how the productions look in code :

DECL_RULE(prefix);
DECL_RULE(template_prefix);
DECL_RULE(closure_prefix);

DEFN_RULE (prefix, {
    MATCH (RULE (unqualified_name));
    MATCH (RULE (prefix) && RULE (unqualified_name));
    MATCH (RULE (template_prefix) && RULE (template_args));
    MATCH (RULE (closure_prefix));
    MATCH (RULE (template_param));
    MATCH (RULE (decltype));
    MATCH (RULE (substitution));
});

DEFN_RULE(template_prefix, {
    MATCH (RULE (template_unqualified_name));
    MATCH (RULE (prefix) && RULE (template_unqualified_name));
    MATCH (RULE (template_param));
    MATCH (RULE (substitution));
});

DEFN_RULE(closure_prefix, {
    MATCH (RULE_OPTIONAL (prefix) && RULE (variable_or_member_unqualified_name) && READ ('M')); 
    MATCH (RULE (variable_template_template_name) && RULE (template_args) && READ ('M')); 
});

So? Do you notice many mutual recursions at the same time. How do we fix this? To be honest, at the time of writing this, I haven’t checked the code, but I’m pretty sure that my theory is right.

After The Changes

DEFN_RULE (prefix_X, {
    MATCH (RULE (unqualified_name));
    MATCH (RULE (template_param));
    MATCH (RULE (decltype));
    MATCH (RULE (substitution));
});

DEFN_RULE (prefix_Yk_template_prefix, {
    MATCH (RULE (template_unqualified_name) && RULE (template_args));
    MATCH (RULE (template_param) && RULE (template_args));
    MATCH (RULE (substitution) && RULE (template_args));
});

DEFN_RULE (prefix_A_template_prefix, {
    MATCH (RULE (prefix_X));
    MATCH (RULE (prefix_Yk_template_prefix));
});

DEFN_RULE (prefix_mk_template_prefix, {
    MATCH (RULE (template_unqualified_name) && RULE (template_args));
});

DEFN_RULE (prefix_S1_template_prefix, {
    MATCH (RULE (prefix_A_template_prefix) && RULE_MANY (prefix_mk_template_prefix));
});

DEFN_RULE (prefix_Yk_closure_prefix, {
    MATCH (RULE (variable_template_template_prefix) && RULE (template_args) && READ ('M'));
});

DEFN_RULE (prefix_A_closure_prefix, {
    MATCH (RULE (prefix_X));
    MATCH (RULE (prefix_Yk_closure_prefix));
});

DEFN_RULE (prefix_mk_closure_prefix, {
    MATCH (RULE (variable_or_member_unqualified_name) && READ ('M'));
});

DEFN_RULE (prefix_S1_closure_prefix, {
    MATCH (RULE (prefix_A_closure_prefix) && RULE_MANY (prefix_mk_closure_prefix));
});

DEFN_RULE (prefix, {
    // fix left-recursion
    MATCH (RULE (prefix_X) && RULE_MANY (unqualified_name));

    // fix mutual-recursions
    MATCH (RULE (prefix_S1_template_prefix));
    MATCH (RULE (prefix_S1_closure_prefix));
});

The changes made for $ \langle \text{closure-prefix} \rangle $ :

DEFN_RULE (closure_prefix_Y, {
    MATCH (RULE (variable_template_template_prefix) && RULE (template_args) && READ ('M'));
});

DEFN_RULE (closure_prefix_Xm_prefix, {
    MATCH (RULE (unqualified_name) && RULE (variable_or_member_unqualified_name) && READ ('M'));
    MATCH (RULE (template_param) && RULE (variable_or_member_unqualified_name) && READ ('M'));
    MATCH (RULE (decltype) && RULE (variable_or_member_unqualified_name) && READ ('M'));
    MATCH (RULE (substitution) && RULE (variable_or_member_unqualified_name) && READ ('M'));
});

DEFN_RULE (closure_prefix_km_prefix, {
    MATCH (RULE (variable_template_template_prefix) && RULE (template_args) && READ ('M'));
});

DEFN_RULE (closure_prefix_B, {
    MATCH (RULE (closure_prefix_Y));
    MATCH (RULE (closure_prefix_Xm_prefix));
});

DEFN_RULE (closure_prefix_S2_prefix, {
    MATCH (RULE (closure_prefix_B) && RULE_MANY (closure_prefix_km_prefix));
});

DEFN_RULE (closure_prefix, { MATCH (RULE (closure_prefix_S2_prefix)); });

The changes made for $ \langle \text{template-prefix} \rangle $ :

DEFN_RULE (template_prefix_Y, {
    MATCH (RULE (template_unqualified_name));
    MATCH (RULE (template_param));
    MATCH (RULE (substitution));
});

DEFN_RULE (template_prefix_Xm_prefix, {
    MATCH (RULE (unqualified_name) && RULE (template_unqualified_name));
    MATCH (RULE (template_param) && RULE (template_unqualified_name));
    MATCH (RULE (decltype) && RULE (template_unqualified_name));
    MATCH (RULE (substitution) && RULE (template_unqualified_name));
});

DEFN_RULE (template_prefix_km_prefix, {
    MATCH (RULE (template_args) && RULE (template_unqualified_name));
});

DEFN_RULE (template_prefix_B, {
    MATCH (RULE (template_prefix_Y));
    MATCH (RULE (template_prefix_Xm_prefix));
});

DEFN_RULE (template_prefix_S2_prefix, {
    MATCH (RULE (template_prefix_B) && RULE_MANY (template_prefix_km_prefix));
});

DEFN_RULE (template_prefix, { MATCH (RULE (template_prefix_S2_prefix)); });

This is my current fix. This fixed any recursion issues for now. Previously I was getting stack overflow, because the stack frames just kept getting growing up and up. For those of you who are not comfortable with the code, here’s the grammar form :

\[ \begin{align} \langle \text{prefix-X} \rangle & ::= & \langle \text{unqualified-name} \rangle \\ & \quad \mid & \langle \text{template-param} \rangle \\ & \quad \mid & \langle \text{decltype} \rangle \\ & \quad \mid & \langle \text{substitution} \rangle \end{align} \]

\[ \begin{align} \langle \text{prefix-Yk-template-prefix} \rangle & ::= & \langle \text{template-unqualified-name} \rangle \ \langle \text{template-args} \rangle \\ & \quad \mid & \langle \text{template-param} \rangle \ \langle \text{template-args} \rangle \\ & \quad \mid & \langle \text{substitution} \rangle \ \langle \text{template-args} \rangle \\ \langle \text{prefix-A-template-prefix} \rangle & ::= & \langle \text{prefix-X} \rangle \\ & \quad \mid & \langle \text{prefix-Yk-template-prefix} \rangle \\ \langle \text{prefix-mk-template-prefix} \rangle & ::= & \langle \text{template-unqualified-name} \rangle \ \langle \text{template-args} \rangle \\ \langle \text{prefix-S1-template-prefix} \rangle & ::= & \langle \text{prefix-A-template-prefix} \rangle \ \langle \text{prefix-mk-template-prefix} \rangle ^* \end{align} \]

\[ \begin{align} \langle \text{prefix-Yk-closure-prefix} \rangle & ::= & \langle \text{variable-template-template-prefix} \rangle \ \langle \text{template-args} \rangle \ M \\ \langle \text{prefix-A-closure-prefix} \rangle & ::= & \langle \text{prefix-X} \rangle \\ & \quad \mid & \langle \text{prefix-Yk-closure-prefix} \rangle \\ \langle \text{prefix-mk-closure-prefix} \rangle & ::= & \langle \text{variable-or-member-unqualified-name} \rangle \ M \\ \langle \text{prefix-S1-closure-prefix} \rangle & ::= & \langle \text{prefix-A-closure-prefix} \rangle \ \langle \text{prefix-mk-closure-prefix} \rangle ^* \end{align} \]

\[ \begin{align} \langle \text{prefix} \rangle & ::= & \langle \text{prefix-X} \rangle \ {\langle \text{unqualified-name} \rangle ^*} \\ & \mid & \langle \text{prefix-S1-template-prefix} \rangle \\ & \mid & \langle \text{prefix-S1-closure-prefix} \rangle \end{align} \]

The changes made for $ \langle \text{closure-prefix} \rangle $ :

\[ \begin{align} \langle \text{closure-prefix-Y} \rangle & ::= & \langle \text{variable-template-template-prefix} \rangle \ \langle \text{template-args} \rangle \ M \end{align} \]

\[ \begin{align} \langle \text{closure-prefix-Xm-prefix} \rangle & ::= & \langle \text{unqualified-name} \rangle \ \langle \text{variable-or-member-unqualified-name} \rangle \ M \\ & \quad \mid & \langle \text{template-param} \rangle \ \langle \text{variable-or-member-unqualified-name} \rangle \ M \\ & \quad \mid & \langle \text{decltype} \rangle \ \langle \text{variable-or-member-unqualified-name} \rangle \ M \\ & \quad \mid & \langle \text{substitution} \rangle \ \langle \text{variable-or-member-unqualified-name} \rangle \ M \end{align} \]

\[ \begin{align} \langle \text{closure-prefix-B} \rangle & ::= & \langle \text{closure-prefix-Y} \rangle \\ & \quad \mid & \langle \text{closure-prefix-Xm-prefix} \rangle \end{align} \]

\[ \begin{align} \langle \text{closure-prefix-km-prefix} \rangle ::= \langle \text{variable or member unqualified-name} \rangle M \end{align} \]

\[ \begin{align} \langle \text{closure-prefix-S2-prefix} \rangle & ::= & \langle \text{closure-prefix-B} \rangle \ \langle \text{closure-prefix-km-prefix} \rangle ^* \end{align} \]

\[ \begin{align} \langle \text{closure-prefix} \rangle & ::= & \langle \text{closure-prefix-S2-prefix} \rangle \end{align} \]

The changes made for $ \langle \text{template-prefix} \rangle $ :

\[ \begin{align} \langle \text{template-prefix-Y} \rangle & ::= & \langle \text{template-unqualified-name} \rangle \\ & \quad \mid & \langle \text{template-param} \rangle \\ & \quad \mid & \langle \text{substitution} \rangle \end{align} \]

\[ \begin{align} \langle \text{template-prefix-Xm-prefix} \rangle & ::= & \langle \text{unqualified-name} \rangle \ \langle \text{template-unqualified-name} \rangle \\ & \quad \mid & \langle \text{template-param} \rangle \ \langle \text{template-unqualified-name} \rangle \\ & \quad \mid & \langle \text{decltype} \rangle \ \langle \text{template-unqualified-name} \rangle \\ & \quad \mid & \langle \text{substitution} \rangle \ \langle \text{template-unqualified-name} \rangle \end{align} \]

\[ \begin{align} \langle \text{template-prefix-B} \rangle & ::= & \langle \text{template-prefix-Y} \rangle \\ & \quad \mid & \langle \text{template-prefix-Xm-prefix} \rangle \end{align} \]

\[ \begin{align} \langle \text{template-prefix-km-prefix} \rangle ::= \langle \text{template-args} \rangle \langle \text{template-unqualified-name} \rangle \end{align} \]

\[ \begin{align} \langle \text{template-prefix-S2-prefix} \rangle & ::= & \langle \text{template-prefix-B} \rangle \ \langle \text{template-prefix-km-prefix} \rangle ^* \end{align} \]

\[ \begin{align} \langle \text{template-prefix} \rangle & ::= & \langle \text{template-prefix-S2-prefix} \rangle \end{align} \]

EDIT : The Induced Problem & Solution

So, the previous solution didn’t work. The actual solution is one more level harder than what I perceived it to be. To be honest, I initially had a suspicion about this, but then I dismissed the thought and ignored it, without verifying. This was a bad choice and took hours of my debugging. If only there were a tool to directly debug grammars. It’s also my foolishness to implement such a simple parser and not use one of those advanced tools or an algorithm to parse this complex grammar.

The Problem

The problem arises because $\langle \text{template-prefix} \rangle$ and $\langle \text{closure-prefix} \rangle$ are allowed to implicitly call each other, and while rewriting the grammar I did not consider this. I was under the impression that fixing the mutual left recursion for the pairs $\langle \text{prefix} \rangle$, $\langle \text{template-prefix} \rangle$ and $\langle \text{prefix} \rangle$, $\langle \text{closure-prefix} \rangle$ independently would solve the issue, but this is clearly not the case. I’ve already talked about how mutual left recursion can be a problem, and the solution for that is completely correct! Now let’s discuss this “induced recursion” problem.

This is a special case of mutual recursion actually, which can go deeper than just one single expansion. Let’s consider the following grammar for an easier analysis and study :

\[ \begin{align} \langle \text{S} \rangle ::=& \langle \text{A} \rangle \mid \langle \text{B} \rangle \mid \langle \text{C} \rangle \\ \langle \text{A} \rangle ::=& \langle \text{B} \rangle b \\ ::=& \langle \text{C} \rangle c \\ ::=& X \\ \langle \text{B} \rangle ::=& \langle \text{A} \rangle l \mid Y \\ \langle \text{C} \rangle ::=& \langle \text{A} \rangle m \mid Z \\ \end{align} \]

Now this mostly resembles our correct scenario at the moment. Here $\langle \text{prefix} \rangle$ is $\langle A \rangle$, $\langle \text{template-prefix} \rangle$ is $\langle B \rangle$, and $\langle \text{closure-prefix} \rangle$ is $\langle C \rangle$. Now, let’s follow one expansion path of this grammar :

stateDiagram-v2 [*] --> S S --> A S --> B S --> C A --> Bb A --> Cc A --> X B --> Al B --> Y C --> Am C --> Z Bb --> B Cc --> C Al --> A Am --> A X --> [*] Y --> [*] Z --> [*]

Can you see it? $\langle B \rangle$ and $\langle C \rangle$ being able to call each other? No?

stateDiagram-v2 [*] --> S S --> A A --> Bb A --> Cc B --> Al C --> Am Bb --> B Cc --> C Al --> A Am --> A

How about now? And the grammar rewrite I made last time, didn’t address this! So, the grammar rewrite will be a bit more complex than what we achieved last time. Also, notice this is only one level deep. $\langle A \rangle$ expands $\langle B \rangle$ which goes back to $\langle A \rangle$. This can be deep on multiple levels!

Again I’ll follow the simple path and try to generate all possible languages out of this, and then write a grammar that can generate the same language but without any of the problems here.

flowchart TD S(S) --> A(A) A --> X((X)) A --> Bb(Bb) A --> Cc(Cc) Bb --> Yb((Yb)) Bb --> Alb(Alb) Cc --> Zc((Zc)) Cc --> Amc(Amc) Alb --> Xlb((Xlb)) Alb --> Bblb(Bblb) Alb --> Cclb(Cclb) Amc --> Xmc((Xmc)) Amc --> Ccmc(Ccmc) Amc --> Bbmc(Bbmc) Bblb --> Yblb((Yblb)) Bblb --> Alblb(Alblb) Cclb --> Zclb((Zclb)) Cclb --> Amclb(Amclb) Ccmc --> Zcmc((Zcmc)) Ccmc --> Amcmc(Amcmc) Bbmc --> Ybmc((Ybmc)) Bbmc --> Albmc(Albmc) Alblb --> Xlblb((Xlblb)) Alblb --> Bblblb(Bblblb) Alblb --> Cclblb(Cclblb) Amcmc --> Xmcmc((Xmcmc)) Amcmc --> Bbmcmc(Bbmcmc) Amcmc --> Ccmcmc(Ccmcmc) Amclb --> Xmclb((Xmclb)) Amclb --> Bbmclb(Bbmclb) Amclb --> Ccmclb(Ccmclb) Albmc --> Xlbmb((Xlbmc)) Albmc --> Bblbmc(Bblbmc) Albmc --> Cclbmc(Cclbmc) Bblblb --> Yblblb((Yblblb)) Bblblb --> Alblblb(Alblblb) Bbmclb --> Ybmclb((Ybmclb)) Bbmclb --> Albmclb(Albmclb) Bbmcmc --> Ybmcmc((Ybmcmc)) Bbmcmc --> Albmcmc(Albmcmc) Bblbmc --> Yblbmb((Yblbmc)) Bblbmc --> Alblbmc(Alblbmc) Ccmcmc --> Zcmcmc((Zcmcmc)) Ccmcmc --> Amcmcmc(Amcmcmc) Ccmclb --> Zcmclb((Zcmclb)) Ccmclb --> Amcmclb(Amcmclb) Cclblb --> Zclblb((Zclblb)) Cclblb --> Amclblb(Amclblb) Cclbmc --> Zclbmc((Zclbmc)) Cclbmc --> Amclbmc(Amclbmc) Alblblb --> Xlblblb((Xlblblb)) Albmclb --> Xlbmclb((Xlbmclb)) Albmcmc --> Xlbmcmc((Xlbmcmc)) Alblbmc --> Xlblbmc((Xlblbmc)) Amcmcmc --> Xmcmcmc((Xmcmcmc)) Amcmclb --> Xmcmclb((Xmcmclb)) Amclblb --> Xmclblb((Xmclblb)) Amclbmc --> Xmclbmc((Xmclbmc)) Alblblb --> Bblblblb(Bblblblb) Albmclb --> Bblbmclb(Bblbmclb) Albmcmc --> Bblbmcmc(Bblbmcmc) Alblbmc --> Bblblbmc(Bblblbmc) Amcmcmc --> Bbmcmcmc(Bbmcmcmc) Amcmclb --> Bbmcmclb(Bbmcmclb) Amclblb --> Bbmclblb(Bbmclblb) Amclbmc --> Bbmclbmc(Bbmclbmc) Alblblb --> Cclblblb(Cclblblb) Albmclb --> Cclbmclb(Cclbmclb) Albmcmc --> Cclbmcmc(Cclbmcmc) Alblbmc --> Cclblbmc(Cclblbmc) Amcmcmc --> Ccmcmcmc(Ccmcmcmc) Amcmclb --> Ccmcmclb(Ccmcmclb) Amclblb --> Ccmclblb(Ccmclblb) Amclbmc --> Ccmclbmc(Ccmclbmc)

Wow, that blew up very fast! Here are the generated strings if it’s getting harder to read

L(A) generated

X
Yb
Zc
Xlb
Xmc
Yblb
Ybmc
Zclb
Zcmc
Xlblb
Xmcmc
Xmclb
Xlbmb
Yblblb
Ybmclb
Ybmcmc
Yblbmb
Zcmcmc
Zcmclb
Zclblb
Zclbmc
Xlblblb
Xlbmclb
Xlbmcmc
Xlblbmc
Xmcmcmc
Xmcmclb
Xmclblb
Xmclbmc

To make it easier to interpret, I’ll just write down the ones starting with X. 1st iteration :

2nd iteration :

Xlb
Xmc

3rd iteration :

Xlblb
Xmcmc
Xmclb
Xlbmb

4th iteration :

Xlblblb
Xlbmclb
Xlbmcmc
Xlblbmc
Xmcmcmc
Xmcmclb
Xmclblb
Xmclbmc

Now in each of these rules, once replace it with Yb and once with Zc. Notice anything? You’re getting the string just after rule C was expanded. Also to generate the next iteration whenever A was expanded, replace X once with Xlb and once with Xmc, or to go from a B to A, replace Y with Xl and Z with Xm. If we follow the pattern inductively like this. It looks like the language of A is trying to generate X followed by all permutation and combination of lb and mc. This can be generated by the grammar

\[ \begin{align} \langle A \rangle ::=& X \langle T \rangle \\ ::=& \text{Yb} \langle T \rangle \\ ::=& \text{Zc} \langle T \rangle \\ \langle T \rangle ::=& \text{lb} \langle T \rangle \mid \epsilon \\ ::=& \text{mc} \langle T \rangle \mid \epsilon \\ \end{align} \]

And this is grammar generated just for $A$. Let’s do the same for $B$ and $C$

flowchart TD S(S) --> B(B) B --> Y((Y)) B --> Al(Al) Al --> Xl((Xl)) Al --> Bbl(Bbl) Al --> Ccl(Ccl) Bbl --> Ybl((Ybl)) Bbl --> Albl(Albl) Ccl --> Zcl((Zcl)) Ccl --> Amcl(Amcl) Albl --> Xlbl((Xlbl)) Albl --> Bblbl(Bblbl) Albl --> Cclbl(Cclbl) Amcl --> Xmcl((Xmcl)) Amcl --> Bbmcl(Bbmcl) Amcl --> Ccmcl(Ccmcl) Bblbl --> Yblbl((Yblbl)) Bblbl --> Alblbl(Alblbl) Bbmcl --> Ybmcl((Ybmcl)) Bbmcl --> Albmcl(Albmcl) Ccmcl --> Zcmcl((Zcmcl)) Ccmcl --> Amcmcl(Amcmcl) Cclbl --> Zclbl((Zclbl)) Cclbl --> Amclbl(Amclbl) Alblbl --> Xlblbl((Xlblbl)) Alblbl --> Bblblbl(Bblblbl) Alblbl --> Cclblbl(Cclblbl) Albmcl --> Xlbmcl((Xlbmcl)) Albmcl --> Bblbmcl(Bblbmcl) Albmcl --> Cclbmcl(Cclbmcl) Amclbl --> Xmclbl((Xmclbl)) Amclbl --> Bbmclbl(Bbmclbl) Amclbl --> Ccmclbl(Ccmclbl) Amcmcl --> Xmcmcl((Xmcmcl)) Amcmcl --> Ccmcmcl(Ccmcmcl) Amcmcl --> Bbmcmcl(Bbmcmcl)

Now, $L(B)$ is

1st iteration :

2nd iteration :

Xl (replace $Y$ with $\text{Xl}$)

3rd iteration :

Ybl (replace $X$ with $\text{Yb}$)
Zcl (… or replace $X$ with $\text{Zc}$)

4th iteration :

Xlbl (… and again replace $Y$ with $\text{Xl}$ )
Xmcl (and replace $Z$ with $\text{Xm}$)

5th iteration :

Yblbl
Ybmcl
Zcmcl
Zclbl

6th iteration :

Xlblbl
Xlbmcl
Xmclbl
Xmcmcl

Notice the pattern for replacement of $X$ with $Yb$ or $Zc$ and $Y$ with $Xl$ and $Z$ with $Xm$, we observed when guessing the language of $A$ applies here as well. I’m pretty confident that this will work when generating language for $C$ as well, because nature loves symmetry. First, let’s generate grammar for language of $B$. Few things to not first :

We have onyl once case of $Y$ (not having any $\text{lb}$ or $\text{mc}$)
Strings start with either $Yb$ or $Zc$ or $X$
Strings always end with $l$.
At any stage, removing $Yb$ or $Zc$ or $X$ from beginning and $l$ from end, leaves out a string that is generated by rule for $\langle T \rangle$ in equation number $89$.

\[ \begin{align} \langle B \rangle ::=& Y \\ ::=& X \langle T \rangle l \\ ::=& Yb \langle T \rangle l \\ ::=& Zc \langle T \rangle l \\ \langle T \rangle ::=& \text{lb} \langle T \rangle \mid \epsilon \\ ::=& \text{mc} \langle T \rangle \mid \epsilon \\ \end{align} \]

… and now I have a gut feeling that following the same pattern, we’ll get the grammar for language of $C$ to be

\[ \begin{align} \langle C \rangle ::=& Z \\ ::=& X \langle T \rangle m \\ ::=& Yb \langle T \rangle m \\ ::=& Zc \langle T \rangle m \\ \langle T \rangle ::=& \text{lb} \langle T \rangle \mid \epsilon \\ ::=& \text{mc} \langle T \rangle \mid \epsilon \\ \end{align} \]

Final Comments

Grammars are very hard to get right in the first try. You are basically developing your own language. It takes experience, which basically demands you beforehand about what you want and what you don’t want. The other way is doing it iteratively, which is another name for trial-and-error (and as I said, hard to get right in first try). The final grammar we see here can be further simplified using some normalization algorithms to generate a normalized form of this grammar. These normalized forms generally do transformations to original grammar to remove redundancy, and help the actual parsing algorithm make correct decisions faster.

Did you like the post? Drop in a comment! If you find any error in this post, I’m only a human, and I’ll accept my mistakes, and make any changes if required.

Edit

I’m also sure that this grammar form will heavily change when the depth of this “induced recursion” increases, because then we won’t only have $X$, $Y$, $Z$ and $b$, $c$, $l$, $m$, but instead have a lot more terminals, and depending on how the rules appear, and how the recursion is performed (we took only one expansion per rule), the complexity will arise as well.

So? Will we have to go through this process again? For every new grammar we encounter, can’t we just use this as a formula? I don’t know to be honest. Maybe from next time I’ll just use a pre existing good parser generator, or just use a better algorithm. I also don’t know whether it’s always possible to rewrite the grammar like this or not. When it comes to this “inductive recursion”, things can get a lot scary.

One possible hack would be to first go through the grammar at hand, and look for any cases like this. This can be easily detectable by finding cycles in a graph. A graph where each node represents a rule or a terminal. I remember creating a tool for this on my stream, but sadly I don’t have the code now, just the live streamed video on YT.

So if you notice something like this, next time in your grammar, make sure to fix it then and there if possible, otherwise maybe try a right to left parser, or some other advanced algorithm that does not use backtracking.

Siddharth Mishra

Fixing Recursions In Grammar

Edit 12th Feb 2025

Fixing Left Recursion

Language Analysis

Real Life Example

Mutual Left Recursion

Language Analysis

Real Life Example

Solution

Before Solving Anything

After The Changes

EDIT : The Induced Problem & Solution

The Problem

Final Comments

Edit

Further Reading