CoCKTaiL, part 2

Djun Kim
2008
10
01

The CoCKTaiL grammar

In my last article I gave some examples and discussed the movtivation for CoCKTaiL, a Drupal CCK type language. I promised that I would describe the grammer I have been working on and say a few words about the tools I have been using.

I should begin by saying that this is very much a work in progress, so what you see here should not be regarded as a specification, or any kind of promise of the final shape of things to come. Rather, it is an invitation to join in the party and help to bring this tool into being. I welcome any comments, suggestions, and help with development, testing, and documentation. In particular, if you have thoughts on how the grammar itself could be changed to make it simpler, or otherwise better, I'd love your input.

Without further ado, then, here is the grammar (at the current moment) of CoCKTaiL. The following is the output of the relatively new lemon parser-generator, about which we shall say more just a bit later. Note: the numbers at the start of each line are not part of the output; they have been added to allow reference to particular lines in the following discussion.


 1 // Reprint of input file "CoCKTaiL.y".
 2 // Symbols:
 3 //   0 $                 13 RPAREN            26 groupname       
 4 //   1 TYPE              14 KEYSTRING         27 fieldtype       
 5 //   2 LBRACE            15 TRUE              28 fieldname       
 6 //   3 RBRACE            16 FALSE             29 attr            
 7 //   4 ENCAPSED_STRING   17 DOTDOT            30 attrname        
 8 //   5 GROUP             18 error             31 attrvalue       
 9 //   6 FIELD             19 typedef           32 rhsvalue        
10 //   7 IDENTIFIER        20 typename          33 boolean         
11 //   8 SEMICOLON         21 field_or_group    34 range           
12 //   9 STRING            22 attributes        35 array           
13 //  10 INTEGER           23 groups_or_fields  36 arrayentries    
14 //  11 ARR               24 group_def         37 arrayentry      
15 //  12 LPAREN            25 field_def         38 arraykey        
16 typedef ::= TYPE typename LBRACE field_or_group RBRACE.
17 typename ::= ENCAPSED_STRING.
18 field_or_group ::= attributes groups_or_fields.
19 groups_or_fields ::=.
20 groups_or_fields ::= groups_or_fields group_def.
21 groups_or_fields ::= groups_or_fields field_def.
22 group_def ::= GROUP groupname LBRACE field_or_group RBRACE.
23 groupname ::= ENCAPSED_STRING.
24 field_def ::= FIELD fieldtype fieldname LBRACE attributes RBRACE.
25 fieldtype ::= IDENTIFIER.
26 fieldname ::= ENCAPSED_STRING.
27 attributes ::=.
28 attributes ::= attributes attr.
29 attr ::= attrname attrvalue SEMICOLON.
30 attrname ::= IDENTIFIER.
31 attrvalue ::= rhsvalue.
32 rhsvalue ::= STRING.
33 rhsvalue ::= IDENTIFIER.
34 rhsvalue ::= ENCAPSED_STRING.
35 rhsvalue ::= INTEGER.
36 rhsvalue ::= boolean.
37 rhsvalue ::= range.
38 rhsvalue ::= array.
39 array ::= ARR LPAREN arrayentries RPAREN.
40 arrayentries ::=.
41 arrayentries ::= arrayentry arrayentries.
42 arrayentry ::= arraykey rhsvalue SEMICOLON.
43 arraykey ::= KEYSTRING.
44 boolean ::= TRUE.
45 boolean ::= FALSE.
46 range ::= INTEGER DOTDOT INTEGER.

The first part of this defines the symbols of the grammar. By convention, the symbols in UPPERCASE are terminal symbols, while the symbols in lowercase are non-terminals. Terminal symbols are recognized and passed to the parser by a lexical analyser (a.k.a. lexer, scanner, or tokenizer).

The second part defines the grammar per se, as a sequence of rules or productions. Each rule is terminated by a period. A rule is interpreted as meaning that a sequence of input symbols matching the right hand side (RHS) may be replaced by (i.e., reduced to) the left hand side (LHS) of the rule. If a sequence of such reductions applied to an input results in the start symbol (typedef) then the input is 'accepted' as being a valid expression in the grammar.

For example, lines 27 and 28 define attributes to be either empty (line 27) or attributes follow by attribute. This is a recursive definition, but since we can escape from the recursion by taking the first choice for attributes, we see that these two lines define attributes as a list of attribute symbols.

Tools

I've already mentioned the Lemon parser generator, the latest in a venerable line of software tools of which YACC (yet another compiler-comiler) is probably the most famous progenitor, and GNU/Bison and AntLR the most widely used current exemplars.

Lemon was written (in C, to generate C-language output) by Dr. Richard Hipp, whom you may know as the author of SQLite, the embedded SQL Engine used by Trac, to name just one of many projects. Lemon was used to write the SQL parser for SQLite.

Well, this would be merely interesting if the story ended here, because after all, we want to write in PHP to generate PHP, don't we? Fortunately, Greg Beaver ported lemon to PHP and released it as a PEAR package: PHP_ParserGenerator. To make our joy complete, he also wrote a lexer generator to complete the suite of tools needed. Again, this is packaged as PHP_LexerGenerator in PEAR.

Using phplemon and plex to generate a CoCKTaiL parser

Step 1: Download and install PHP_ParserGenerator and PHP_LexerGenerator.

Step 2: Set up the environment. I added a couple of lines to my PATH specification to add the directories where phplemon and plex live to my shell's executable search path. Then phplemon parser.y runs the command line PHP lemon parser generator on the parser input file parser.y, generating output files parser.out and parser.php. Similiarly plex lexer.plex will build the lexer, generating lexer.php as output.

Step 3: Write the parser. This involves writing the grammar specification for the parser. I developed the parser iteratively, starting with a trivial grammar and adding features one at a time. The parser input file (CoCKTaiL.y in my case) consists of several sections, with the grammer proper being one of them. Besides specifying the productions as I've discussed above, one can add an action to each production. This is a chunk of code that will be executed by the parser when a production 'matches' (i.e., is reduced). The code snippets have access to the parser's context for the input stack, which can be used to generate appropriate code. Returning to our example above, if we have reduced an attributes symbol, we can gather the list of attributes and save them for future use.

I started out with empty actions, which kept the emphasis on the grammar, and kept the parser cleaner. This approach also lets one easily build multiple applications from the same grammar. A parser with empty actions is a validator; adding actions to generate the appropriate PHP code lets us build a CCK type generator.

Step 4: Build the lexer. Actually, building the parser and building the lexer kind of go hand-in-hand if you want to test with real input. The point to remember is that the parser generator defines the symbols, so it should be used to generate the symbolic names that the lexer uses. The 'C' version of Lemon can generate a C header file containing the terminal symbols; I had to extract these by hand from the generated PHP file, to build a class file defining the symbols which could then be used by the lexer.

In reality, the grammar is simple enough that we could build a lexer by hand. However, plex is very simple to use, and the resulting lexer is probably faster and less buggy than a hand-rolled one. Since the spec is rather in flux at the moment, flexibility is a great boon here.

Specifying a lexer amounts to writing regular expressions defining terminal symbols. Things get slightly more complicated, in that the lexer is actually state-driven: it can lex symbols differently depending on what state it is in; transitions between states are defined by specifying states to move to when particular sequences of symbols are read. Our present lexer is fairly simple. This may change, if it can help to simplify the input grammer. (In general, things are simpler for humans if we can rely on context to suppress otherwise ugly details. This means, however, that the lexer has to be aware of context: it does this by realizing different states for different contexts.)

Coming Up

In my next article, I will show completed code for a validator and a CCK type generator, that will parse a CoCKTaiL input file to produce PHP which can be imported by CCK. I welcome any comments, suggestions, and help with development and testing.

You could generate this as

You could generate this as a XML UML export and then use simple methods like PHP's XPath support to generate these

Exactly!

Having a suitable higher level language as a target for XSL transformation from XMI is one of the goals of this project.

Glad I reserved judgement

I feared that wheels where re-invented here - but in fact its quite the opposite. Looks like you are standing on the shoulders of giants.

andre

How about integrating it

How about integrating it with the DRUSH project?

聚合内容