CoCKTaiL, part 2
The CoCKTaiL grammar
In my last article I gave some examples and discussed the movtivation for CoCKTaiL, a Drupal CCK type language. I promised that I would describe the grammer I have been working on and say a few words about the tools I have been using.
I should begin by saying that this is very much a work in progress, so what you see here should not be regarded as a specification, or any kind of promise of the final shape of things to come. Rather, it is an invitation to join in the party and help to bring this tool into being. I welcome any comments, suggestions, and help with development, testing, and documentation. In particular, if you have thoughts on how the grammar itself could be changed to make it simpler, or otherwise better, I'd love your input.
Without further ado, then, here is the grammar (at the current moment) of CoCKTaiL. The following is the output of the relatively new lemon parser-generator, about which we shall say more just a bit later. Note: the numbers at the start of each line are not part of the output; they have been added to allow reference to particular lines in the following discussion.
1 // Reprint of input file "CoCKTaiL.y".
2 // Symbols:
3 // 0 $ 13 RPAREN 26 groupname
4 // 1 TYPE 14 KEYSTRING 27 fieldtype
5 // 2 LBRACE 15 TRUE 28 fieldname
6 // 3 RBRACE 16 FALSE 29 attr
7 // 4 ENCAPSED_STRING 17 DOTDOT 30 attrname
8 // 5 GROUP 18 error 31 attrvalue
9 // 6 FIELD 19 typedef 32 rhsvalue
10 // 7 IDENTIFIER 20 typename 33 boolean
11 // 8 SEMICOLON 21 field_or_group 34 range
12 // 9 STRING 22 attributes 35 array
13 // 10 INTEGER 23 groups_or_fields 36 arrayentries
14 // 11 ARR 24 group_def 37 arrayentry
15 // 12 LPAREN 25 field_def 38 arraykey
16 typedef ::= TYPE typename LBRACE field_or_group RBRACE.
17 typename ::= ENCAPSED_STRING.
18 field_or_group ::= attributes groups_or_fields.
19 groups_or_fields ::=.
20 groups_or_fields ::= groups_or_fields group_def.
21 groups_or_fields ::= groups_or_fields field_def.
22 group_def ::= GROUP groupname LBRACE field_or_group RBRACE.
23 groupname ::= ENCAPSED_STRING.
24 field_def ::= FIELD fieldtype fieldname LBRACE attributes RBRACE.
25 fieldtype ::= IDENTIFIER.
26 fieldname ::= ENCAPSED_STRING.
27 attributes ::=.
28 attributes ::= attributes attr.
29 attr ::= attrname attrvalue SEMICOLON.
30 attrname ::= IDENTIFIER.
31 attrvalue ::= rhsvalue.
32 rhsvalue ::= STRING.
33 rhsvalue ::= IDENTIFIER.
34 rhsvalue ::= ENCAPSED_STRING.
35 rhsvalue ::= INTEGER.
36 rhsvalue ::= boolean.
37 rhsvalue ::= range.
38 rhsvalue ::= array.
39 array ::= ARR LPAREN arrayentries RPAREN.
40 arrayentries ::=.
41 arrayentries ::= arrayentry arrayentries.
42 arrayentry ::= arraykey rhsvalue SEMICOLON.
43 arraykey ::= KEYSTRING.
44 boolean ::= TRUE.
45 boolean ::= FALSE.
46 range ::= INTEGER DOTDOT INTEGER.
The first part of this defines the symbols of the grammar. By convention, the symbols in UPPERCASE are terminal symbols, while the symbols in lowercase are non-terminals. Terminal symbols are recognized and passed to the parser by a lexical analyser (a.k.a. lexer, scanner, or tokenizer).
The second part defines the grammar per se, as a sequence of rules or productions. Each rule is terminated by a period. A rule is interpreted as meaning that a sequence of input symbols matching the right hand side (RHS) may be replaced by (i.e., reduced to) the left hand side (LHS) of the rule. If a sequence of such reductions applied to an input results in the start symbol (typedef) then the input is 'accepted' as being a valid expression in the grammar.
For example, lines 27 and 28 define attributes to be either empty (line 27) or attributes follow by attribute. This is a recursive definition, but since we can escape from the recursion by taking the first choice for attributes, we see that these two lines define attributes as a list of attribute symbols.
Tools
I've already mentioned the Lemon parser generator, the latest in a venerable line of software tools of which YACC (yet another compiler-comiler) is probably the most famous progenitor, and GNU/Bison and AntLR the most widely used current exemplars.
Lemon was written (in C, to generate C-language output) by Dr. Richard Hipp, whom you may know as the author of SQLite, the embedded SQL Engine used by Trac, to name just one of many projects. Lemon was used to write the SQL parser for SQLite.
Well, this would be merely interesting if the story ended here, because after all, we want to write in PHP to generate PHP, don't we? Fortunately, Greg Beaver ported lemon to PHP and released it as a PEAR package: PHP_ParserGenerator. To make our joy complete, he also wrote a lexer generator to complete the suite of tools needed. Again, this is packaged as PHP_LexerGenerator in PEAR.
Using phplemon and plex to generate a CoCKTaiL parser
Step 1: Download and install PHP_ParserGenerator and PHP_LexerGenerator.
Step 2: Set up the environment. I added a couple of lines to my PATH
specification to add the directories where phplemon and plex live to
my shell's executable search path. Then phplemon
parser.y runs the command line PHP lemon parser generator on
the parser input file parser.y, generating output files
parser.out and parser.php. Similiarly
plex lexer.plex will build the lexer, generating
lexer.php as output.
Step 3: Write the parser. This involves writing the grammar specification for the parser. I developed the parser iteratively, starting with a trivial grammar and adding features one at a time. The parser input file (CoCKTaiL.y in my case) consists of several sections, with the grammer proper being one of them. Besides specifying the productions as I've discussed above, one can add an action to each production. This is a chunk of code that will be executed by the parser when a production 'matches' (i.e., is reduced). The code snippets have access to the parser's context for the input stack, which can be used to generate appropriate code. Returning to our example above, if we have reduced an attributes symbol, we can gather the list of attributes and save them for future use.
I started out with empty actions, which kept the emphasis on the grammar, and kept the parser cleaner. This approach also lets one easily build multiple applications from the same grammar. A parser with empty actions is a validator; adding actions to generate the appropriate PHP code lets us build a CCK type generator.
Step 4: Build the lexer. Actually, building the parser and building the lexer kind of go hand-in-hand if you want to test with real input. The point to remember is that the parser generator defines the symbols, so it should be used to generate the symbolic names that the lexer uses. The 'C' version of Lemon can generate a C header file containing the terminal symbols; I had to extract these by hand from the generated PHP file, to build a class file defining the symbols which could then be used by the lexer.
In reality, the grammar is simple enough that we could build a lexer by hand. However, plex is very simple to use, and the resulting lexer is probably faster and less buggy than a hand-rolled one. Since the spec is rather in flux at the moment, flexibility is a great boon here.
Specifying a lexer amounts to writing regular expressions defining terminal symbols. Things get slightly more complicated, in that the lexer is actually state-driven: it can lex symbols differently depending on what state it is in; transitions between states are defined by specifying states to move to when particular sequences of symbols are read. Our present lexer is fairly simple. This may change, if it can help to simplify the input grammer. (In general, things are simpler for humans if we can rely on context to suppress otherwise ugly details. This means, however, that the lexer has to be aware of context: it does this by realizing different states for different contexts.)
Coming Up
In my next article, I will show completed code for a validator and a CCK type generator, that will parse a CoCKTaiL input file to produce PHP which can be imported by CCK. I welcome any comments, suggestions, and help with development and testing.














You could generate this as
You could generate this as a XML UML export and then use simple methods like PHP's XPath support to generate these
Exactly!
Having a suitable higher level language as a target for XSL transformation from XMI is one of the goals of this project.
Glad I reserved judgement
I feared that wheels where re-invented here - but in fact its quite the opposite. Looks like you are standing on the shoulders of giants.
andre
How about integrating it
How about integrating it with the DRUSH project?