Next: Computing First Sets of Up: Grammars Previous: So What Kind of

Building Parse Trees

The desirable complexity results of the previous section hold for recognition of context-free languages. But often it is the case that one wants, given a string to construct the parse tree(s) for it. An easy way to do this is to add an argument to each nonterminal to contain the parse tree, and then add the necessary code to each rule to construct the appropriate tree. For example, the following rule from our expression example:

    expr(Val) --> expr(Eval), [+], term(Tval), {Val is Eval+Tval}.

could become:

expr(Val,+(E1,T1)) --> expr(Eval,E1), [+], term(Tval,T1), {Val is Eval+Tval}.

We've added a second argument and in it constructed the parse tree for the entire expression phrase given the parse trees for the component expression and term phrases. All the other rules would be extended accordingly.

This is very easy to do and in almost all cases is the best way to construct the parse tree. However, from a complexity standpoint, it has certain drawbacks. It may cause us to lose the polynomial complexity we had. For example, consider the following grammar:

    :- auto_table.
    s --> b,[c]

    b --> b,b.
    b --> [a].

Here nonterminal s generates a list of ``a''s followed by a single ``c''. If we compile this grammar with XSB, using the auto_table directive, we get a program that will recognize strings correctly in O(n³) time. But if we add parameters to construct the parse tree, thusly:

    :- auto_table.
    s(s1(B,c)) --> b(B), [c].

    b(b1(B1,B2)) --> b(B1), b(B2).
    b(b2(a)) --> [a].

it may take exponential time. Now the string aⁿc has exponentially many parses (all bracketings of the length n string of ``a''s), so it will clearly take exponential time to produce them all. This is not so much of a problem; if there are exponentially many, there's no way to produce them all without taking exponential time. However, if we try to recognize the string aⁿ, this will take exponential time to report failure. This is because the system will construct all the (exponentially many) initial parses from the nonterminal ``b'' and then each time fail when it doesn't find a terminating ``c'' in the input string.

One might think that we could easily just maintain two versions of the grammar: one with no parameters to do recognition, and one with a parse-tree parameter. Then we'd first recognize and if there were no parses, we'd simply report failure, and if there were parses, we'd reprocess the input string, this time using the parsing version of the grammar. But this doesn't always work either. For example, say we added a few rules to our grammar above to obtain:

    :- auto_table.
    s --> b,[c].
    s --> g,[d].

    b --> b,b.
    b --> [a].

    g --> g, [a].
    g --> [a].

Here, the input string of [a,a,a,a,a,a,d] has only one parse, but naively parsing it with the parse-annotated grammar will construct exponentially many initial segments of parses, which come from the first rule and the rules for the nonterminal ``b''. So if we are serious about this problem, we must be a little more sophisticated.

We still use a recognizer, but we must mix calls to it throughout the parser. So we'll assume we have the recognizer given just previously and we want an efficient parser. We must mix calls to the recognizer in with the calls to the parsing procedures. One problem is that the DCG syntax does not support what we need, so we will have to write the routines directly above:

    :- auto_table.
    s(s(P,c),S0,S) :- 
           b(S0,S1), 'C'(S1,c,S),
           b(P,S0,S1).

    s(s(P,d),S0,S) :- 
           g(S0,S1), 'C'(S1,d,S),
           g(P,S0,S1).

    b(b(P1,P2),S0,S) :- 
           b(S0,S1), b(S1,S),
           b(P1,S0,S1), b(P2,S1,S).
    b(a,S0,S) :- 
           'C'(S0,a,S).

    g(g(P),S0,S) :- 
           g(S0,S1), 'C'(S1,a,S),
           g(P,S0,S1).
    g(a,S0,S) :- 
           'C'(S0,a,S).

Here for each rule, we first use the recognizer to see whether the rule will succeed. After we know it will succeed and have computed the exact substring that each nonterminal spans, then for each nonterminal we invoke the parsing routine to construct its parse.

Now as coded here, there will be multiple tabling: the auto_table directive will cause b/2, g/2, b/3, and g/3 all to be tabled. However, we don't really have to table the parsing nonterminals b/3 and g/3. Since they are always called with both the starting and ending string position known, they will not loop. This program has the desired property, that either it will fail in cubic time or it will succeed in cubic time and produce each parse in linear time.

Next: Computing First Sets of Up: Grammars Previous: So What Kind of

David S. Warren
1999-07-31