Parsing XMLThis is a tutorial about parsing XML using ANTLR 3. If you do not know what ANTLR 3 is, have a look at this introduction. XML has been chosen as an example as it should be known to wide range of developers. Additionally, it isn‘t all that easy to parse, but then also not too difficult. Caution: This is an example given for educational reasons. In the magnitude of all cases, you‘ll be off far better by using an XML parser such as Xerces, expat etc. or an XML binding tool such as XML beans. If you are not sure whether you should write a parser of your own, have a look at section 5. Parsing XML using traditional parsing techniques is tricky, most of all because XML is a sick language. Depending on the context different characters are either keywords or simple text. Given an XML fragment like: <tag attribute="value">
The ‘=‘ has a special meaning as well as the ‘>‘ while this is of course not the case in plain text. Additionally, newlines and other white spaces inside tags have to be ignored, but are part of the text outside of them. For further investigations we restrict ourselves to a simple subset of XML. It will have the same complexity as XML seen from the perspective of parsing, but will make the examples much simpler. It is restricted to tags and parsable text (PCDATA) - no XML declaration, not doctype, no DTD or XSD, no processing instructions, no CDATA sections, no namespaces. We require that the reader has at least basic knowledge of XML. For a short introduction, you should have a look at this XML in 10 points article Note: Full sources are available as an attachment to this page. OverviewThis tutorial is devided into multiple parts. Each part explains about the tasks of one of ANTLR‘3 parser types. There is an optional excursion that tries to shed some light onto how ANTLR 3‘s parsing strategy works. If you are not interested in theory you can safely omit it. Parser Type Overview
Lexing XML with ANTLR 3First of all you have to decide which parts of the input XML file our parser needs. Traditionally, you have a so called lexer that reads through all characters and builds larger chunks called tokens. A token can be a keyword like struct or void in the C programming language or, it can be a string enclosed in quotes. A lexer thus is some sort of preprocessor for the next step, the token parser. You will learn more about the token parser later, here we will look more closely at the lexer. In ANTLR you describe a lexer using a lexer grammar. To declare a grammar called XMLLexer, you create a text file and add the following lines at the top: lexer grammar XMLLexer; What follows are a number of descriptions for tokens, like TAG_START_OPEN : ‘<‘ ; This description, called rule or production, says "generate a token called TAG_START_OPEN every time you see a ‘<‘ in the input character stream" to indicate a tag starts here. The following rule is for the ‘>‘ that indicates the end of a tag: TAG_CLOSE : ‘>‘ ; The token parser which we will inspect in detail later has to make sense of these tokens and find out where a full tag is. You now have seen something about tags, but what about text? This is a little more complicated as text is anything until a tag begins which is indicated by a ‘<‘: PCDATA : (~‘<‘)+ ; This rule contains two new concepts. First, everything inside (...)+ must be there at least once, but many also may be there many times more. Then there is this ~ before the~ ‘<‘~. The ~ means "anything but the character that follows". The complete description thus means the PCDATA token is composed of one or more characters that are not ‘<‘. To illustrate that, look at this stream of characters This is text<tag> Rule PCDATA would accept the text This is text and generate a PCDATA token from it. That‘s fine and just what we wanted. The next token will be TAG_START_OPEN upon seeing the ‘<‘. Hooray, still right!. But upon seeing tag it tries to generate an additional PCDATA token which certainly isn‘t what we want!. OK, no problem, here comes the rule for the tag name. We call it GENERIC_ID as you can use it for attribute names as well: GENERIC_ID : ( LETTER | ‘_‘ | ‘:‘) (NAMECHAR)* ; Before we investigate if this is appropriate to our problem, we have to explain some more concepts. (...)* is the same as (...)+ except that it also allows for no characters at all. The other new thing is the embedded NAMECHAR. It is a reference to another rule. This special rule describes how a character of an id might look like. You will see that rule later. Finally you have the | which is an or. Thus ( LETTER | ‘‘ | ‘:‘) means the first character can either be a LETTER - another sub rule - a ‘‘ or a ‘:‘. Let us complete this grammar with the missing sub rules to let it run through ANTLR: lexer grammar XMLLexer; TAG_START_OPEN : ‘<‘ ; TAG_CLOSE : ‘>‘ ; PCDATA : (~‘<‘)+ ; GENERIC_ID : ( LETTER | ‘_‘ | ‘:‘) (NAMECHAR)* ; fragment NAMECHAR : LETTER | DIGIT | ‘.‘ | ‘-‘ | ‘_‘ | ‘:‘ ; fragment DIGIT : ‘0‘..‘9‘ ; fragment LETTER : ‘a‘..‘z‘ | ‘A‘..‘Z‘ ; WS : (‘ ‘|‘\r‘|‘\t‘|‘\u000C‘|‘\n‘) {channel=99;} ; You can see that sub rules are prefixed by the keyword fragment. This is important as no sub rule can be used to identify a token. Otherwise your lexer would identify lots of LETTER tokens which you certainly do not want. The rule WS is special as it recognizes all white spaces, but does not pass it to the next stage, the token parser, because we don‘t care about white spaces. Sending the token to channel 99 is ANTLR‘s way of saying that. ‘0‘..‘9‘ is a short form of saying (0|1|2|3|4|5|6|7|8|9) and ‘a‘..‘z‘ respectively for the characters from a to z. This is how you invoke the ANTLR tool. Be careful to have all jars (ANTLR3, ANTLR2.7.6 and stringtemplate) from the ANTLR lib dir in your Java classpath: java org.antlr.Tool xmlLexer-version1.g As you can see, the ANTLR grammar is named xmlLexer-version1.g and followed the convention that all grammars have a g as extension. This is what ANTLR is likely to tell you: ANTLR Parser Generator Early Access Version 3.0b3 (July 21, 2006) 1989-2006 xmlLexer-version1.g:25:1: The following token definitions are unreachable: GENERIC_ID,WS Hmmm. The first line simply tells us about the ANTLR version, but the second line looks like there is something wrong. It says that the rules for white spaces and the tag name can not be reached and will thus never apply. You might have guessed already that the reason is the rule for PCDATA. The characters it can match are a superset of those possibly matched by WS and GENERIC_ID. Reconsidering our XML problem we can rephrase it in terms of context. In one context text can be PCDATA and in another it might be the name for a tag. GENERIC_ID : {tagMode}?=> ( LETTER | ‘_‘ | ‘:‘) (NAMECHAR)* ; WS : {tagMode}?=> (‘ ‘|‘\r‘|‘\t‘|‘\u000C‘|‘\n‘) {channel=99;} ; PCDATA : {!tagMode}?=> (~‘<‘)+ ; Gates are written inside {...}?=> and contain a boolean expression in the target language (Java by default). In your case this is a simple Boolean variable. Of course we need to define it and also toggle it upon identification of meaningful rules. Introduction of members goes like this: @members { boolean tagMode = false; } Cool. But just when to switch to the tag mode and when to switch back? Easy, every time a tag starts we switch to the tag mode: TAG_START_OPEN : ‘<‘ { tagMode = true; } ;
and every time a tag ends we switch back: TAG_CLOSE : ‘>‘ { tagMode = false; } ;
Everything inside {...} will be executed once a rule has been used to identify a token. It must thus be a valid statement of the target language. This is our complete grammar version #2: <pre c
Parsing XML with ANTLR3As soon as you have your lexer tuned to deliver the right tokens you need to think about the structure expressed by them. Or, you might say, reveal the inherent structure of the token stream coming from the lexer. This is the task of the token parser. Most people simply call it parser, but some people also call the combination of the lexer/parser/tree parser the parser. That‘s why I like to use the term token parser. Anyway, to illustrate the job of the token parser consider the programming language Pascal. In Pascal the keywords begin and end mark the beginning and the end of a block. begin and end would be tokens delivered by the lexer and the token parser now has to reveal that anything in between is a block. Or, to stick to our XML example, a name between a TAG_START_OPEN and a TAG_CLOSE together is are a start tag. In ANTLR we describe it like that: startTag : TAG_START_OPEN GENERIC_ID TAG_CLOSE ; Whereas the GENERIC_ID is the name of that specific start tag. Note that token names are usually all upper case and token parser rule names at least start with a lower case character. Other than that, syntax of the lexer and token parser description is unified. This means the general rule structure and most of the expressions - like ()+ and ()* - are the same both in the lexer as well as in the token parser. Which is good as you only have to learn one language! But wait, our rule isn‘t quite complete, we forgot about attributes: startTag : TAG_START_OPEN GENERIC_ID (attribute)* TAG_CLOSE ; attribute : GENERIC_ID ATTR_EQ ATTR_VALUE ; You have put the attribute definition into a separate rule to reuse it as a sub rule in the definition for the empty element tag: emptyElement : TAG_START_OPEN GENERIC_ID (attribute)* TAG_EMPTY_CLOSE ; Finally, the definition for the end tag, which is easy: endTag : TAG_END_OPEN GENERIC_ID TAG_CLOSE ; Using this grammar ANTLR can now identify all kinds of tags. This is the complete first version of the grammar: parser grammar XMLParser; startTag : TAG_START_OPEN GENERIC_ID (attribute)* TAG_CLOSE ; attribute : GENERIC_ID ATTR_EQ ATTR_VALUE ; endTag : TAG_END_OPEN GENERIC_ID TAG_CLOSE ; emptyElement : TAG_START_OPEN GENERIC_ID (attribute)* TAG_EMPTY_CLOSE ; Save it to a file - I use xmlParser-version1.g - and run it through ANTLR just like the lexer grammar: java org.antlr.Tool xmlLParser-version1.g ANTLR will know what to do with it because of the specific header. So, you have a parser now. But what‘s the use? What exactly is the effect of these token parser rules? In the lexer rules produce tokens once the characters coming in match it. Obviously, this is not the case here. Remember, what we wanted from the parser was to find out about the structure of our XML document. What you have is the structure of the tags inside that document which is only part of our overall structure. You also need to know which element is the root element, where the text and generally where all those tags are relatively to each other. It is important to know which tag is a child of another, what tags are siblings and so on. Marvel at this beautifully simple rule for the overall structure of an XML document: element : startTag (element | PCDATA )* endTag | emptyElement ; It matches all structures that either are an empty element or a complete element headed by a start tag and ended by an end tag. Such a complete element may contain text as PCDATA and other elements as sub elements. Both text and sub element can be mixed. That‘s it. That‘s the whole structure of an XML document! OK, we add this rule and also another to tell ANTLR where to start parsing. We also need to tune our parser to the types of the token coming from the lexer: parser grammar XMLParser; options { tokenVocab=XMLLexer; } document : element ; element : startTag (element | PCDATA )* endTag | emptyElement ; startTag : TAG_START_OPEN GENERIC_ID (attribute)* TAG_CLOSE ; attribute : GENERIC_ID ATTR_EQ ATTR_VALUE ; endTag : TAG_END_OPEN GENERIC_ID TAG_CLOSE ; emptyElement : TAG_START_OPEN GENERIC_ID (attribute)* TAG_EMPTY_CLOSE ; And this is the glue code to feed the token parser with the tokens coming from the lexer: import org.antlr.runtime.*; import org.antlr.runtime.tree.*; public class MainParser { public static void main(String[] args) { try { CharStream input = new ANTLRFileStream(args[0]); XMLLexer lex = new XMLLexer(input); CommonTokenStream tokens = new CommonTokenStream(lex); XMLParser parser = new XMLParser(tokens); parser.document(); } catch(Throwable t) { System.out.println("exception: "+t); t.printStackTrace(); } } } Save this class, compile it and start the whole parser on the XML we were using for the lexer. Like this: java MainParser simpleXML.xml And see: nothing! It may not seem so, but this is good as it indicates that we have successfully parsed the XML input. The structure actually has been revealed to be a tree, but we do not see much of it, yet. In the next section we will construct a tree data structure that exactly matches the revealed one.
Excursion: LL(*) parsingThis short excursion into ANTLR‘s parsing strategy is optional. If you are not interested you can safely continue with the next section and still understand the rest of the tutorial. However, if you are interested, you can learn a little bit about parsing theory and the special strategy ANTLR 3 uses as opposed to ANTLR 2.x. Look at these two rules for the start tag and the empty element. Both rules start with the same tokens, only the last token is different: startTag : TAG_START_OPEN GENERIC_ID (attribute)* TAG_CLOSE ; emptyElement : TAG_START_OPEN GENERIC_ID (attribute)* TAG_EMPTY_CLOSE ; When you tried to pass this grammar through earlier versions of ANTLR you would run into serious problems. The reason is that all parsers need to decide which rule they have to use to match an incoming token stream. Same parsing strategies solve this by delaying this decision until they have seen enough tokens. For this purpose tokens taken from the stream are often collected on a stack until there is enough information available to make the decision. In our example this would be until either the TAG_CLOSE or TAG_EMPTY_CLOSE is detected. Those strategies are called bottom up. Tools that use this strategy include famous yacc. Their drawback is a fair amount of complexity when it comes to debugging. ANTLR 2.x can not do this as it uses a top down parsing strategy. Top down parsing strategies have to decide which rule to use before taking any tokens from the token stream. What ANTLR 2.x can do instead is to look ahead in the token stream to make its decision. However, no matter how far you look ahead there always is the horizon beyond which you can not see. Because of this top down parsers are thus usually more restrictive concerning the grammars they can handle. This is also true for ANTLR 2.x as the amount of look ahead is arbitrary, but fixed. Because the number of attributes in a tag is not fixed, ANTLR 2.x can not handle this grammar. Now, this may sound a little bit complicated. But an example will help: This is a start tag <tagname attribute1="value1">
while this is an empty element <tagname attribute1="value1"/>
How many tokens would we need to look ahead to tell if this a start tag or an empty element? Or, in other words, until we either see the > or the />? Let‘s count. Remember that whitespace has already been stripped by the lexer and this is how the attribute rule looks like: attribute : GENERIC_ID ATTR_EQ ATTR_VALUE ; Tokens:
OK, so we have six tokens as a required look ahead to decide which alternative to take. Setting the fixed look ahead (ANTLR and many others call it k) to 6 thus seems to be sufficient. However, it really is not, as this only works for this specific example. In general, there can be any number attributes and for any given k I can give a tag that has too many attributes to be correctly predicted. The prediction mechanism of ANTLR 3 is more powerful, because it can look ahead a random number of tokens that is not fixed at the beginning of a parse. Maybe you can image that ANTLR 3 has a scout sent out when decisions become tricky. He would go ahead until he has the significant information while ANTLR 2.x only had a telescope with a certain range. This way ANTLR 3 can handle this grammar and is more tolerant to the way a grammar has to look like. Consequently, it allows for more natural looking grammars that are more intuitive to the programmer. What struck me most about ANTLR 3‘s new parsing strategy was that it appears to mimic what humans do. As a human we can easily decide the kind of the tag by looking at its end. We scan forward until we either see > or />. Once we made our decision based upon this information we return back to the start of the tag and collect all information inside of it. And that is what ANTLR 3 does. Both the way humans behave as well as ANTLR‘s parsing strategy come at a certain price, though. Firstly - and only theoretically - the parsing time may no longer be linear to the input size. Secondly, the forward scan can only handle a certain amount of structural complexity. It fails to properly detect too involved structures. This again is true both for humans as for ANTLR 3. However, when we encounter such complexities the language to be parsed may not be all too sensible.
Adding a tree parser stageIn the section about the token parser we revealed the structure of an XML document to be a tree. You did not see pretty much of it, though. The reason is that you did not actually build up any data structure from what you revealed. Such a data structure usually is described as an AST (abstract syntax tree), a tree structure that mimics the structure of the parsed input. In this section you will see how to build up such a structure and also how to traverse it using tree parsers. Tree parsers do not consume flat streams of tokens, both rather structured trees of tokens generated by the token parser. Usually, all redundant data is stripped from those trees and their internal structure is most natural. One example of redundant data is an end tag. Its purpose in a flat stream is to indicate where an element ends. Such structure can, however, be expressed in a two dimensional tree. No need for an end tag. An example of a token that can be completely skipped is the equal sign (=) between attribute name and value. It is there for better human readability, but actually does not carry any meaning. Finally, there is good reason to consider an empty element equivalent to a start tag directly followed by an end tag. This can be unified. Even though this is not a common approach, let us start with the tree grammar that looks most natural and simple. Later you can find out how tree construction must look like to produce a tree matching this grammar. To achieve this, let us remember that an XML document is a representation of a tree structure. Leaf nodes are either empty elements or text data. Non-leaf nodes are always elements that enclose other elements or text. Additionally, there are attributes and finally names that are associated to elements. It turns out that you can easily write all this in a single rule: element : ^( ELEMENT GENERIC_ID ( ^(ATTRIBUTE GENERIC_ID ATTR_VALUE) )* (element | PCDATA )* ) ; Cute, isn‘t it? This rule expresses the whole stammering above. The ^ symbol before an opening bracket indicates a hierarchical tree structure with the first item after the bracket being the root of the tree and the rest being the children. In this case we have a token called ELEMENT that tells us what follows actually is an element. The first child is GENERIC_ID which is the name of the element, then we have a list of attributes and finally the element‘s sub elements potentially mixed with text. All attributes are again structured as a tree. To complete the tree grammar we just need to add the header that tells the tree parser to use the same token vocabulary as the lexer and the parser and we are done: tree grammar XMLTreeParser; options { tokenVocab=XMLLexer; } For real processing of the tree you certainly would want to access the text of the tokens. In ANTLR 3 you do this by $name.text where name is the label of a tree element like name=GENERIC_ID. You need to tell ANTLR about the type of your tree and also need to introduce a start rule. This is the complete tree grammar that simply dumps the whole AST back to XML: tree grammar XMLTreeParser; options { tokenVocab=XMLLexer; ASTLabelType = Tree; } document : element ; element : ^( ELEMENT name=GENERIC_ID { System.out.print("<"+$name.text); } ( ^(ATTRIBUTE attrName=GENERIC_ID value=ATTR_VALUE) { System.out.print(" "+$attrName.text+"="+$value.text); } )* { System.out.println(">"); } (element | text=PCDATA { System.out.print($text.text); } )* { System.out.println("</"+$name.text+">"); } ) ; You now have to augment the parser you have already seen above with tree construction statements. In ANTLR 3 tree creation and matching syntax is very similar. You can actually declare how the constructed tree is supposed to look like by duplicating the grammar fragment that matches the tree. E.g. for the tree construction of the attribute part we could write: attribute : GENERIC_ID ATTR_EQ ATTR_VALUE -> ^(ATTRIBUTE GENERIC_ID ATTR_VALUE) ; where the tree construction part that comes after the -> is exactly the same as the fragment seen in the tree grammar above. Isn‘t that cool? It, however, turns out that tree construction declaration by example isn‘t sufficient in all scenarios. For example we want our end tag to be omitted completely. That‘s what the ! operator is for. You can either apply it to a full rule by placing it behind a rule name like in endTag! : TAG_END_OPEN GENERIC_ID TAG_CLOSE; or to a single item (endTag again) in a rule like in element : ( startTag^^ (element | PCDATA )* endTag! | emptyElement ) ; You will notice that this is the element rule you already are familiar with. Next to the ! we can see the double ^^ here. This operator declares startTag to be the root of the generated tree. As you can see, this does not only work for a single token, but also for complete sub trees! Additionally, we make use of the automatic tree construction feature already known from ANTLR 2. Left hand items without declaration will simply result in a flat tree node. Finally, we have the two rules for start tags and empty elements startTag : TAG_START_OPEN GENERIC_ID attribute* TAG_CLOSE -> ^(ELEMENT GENERIC_ID attribute*) ; emptyElement : TAG_START_OPEN GENERIC_ID attribute* TAG_EMPTY_CLOSE -> ^(ELEMENT GENERIC_ID attribute*) ; which generate a unified tree for both complete as well as for empty elements. As you might have noticed, we are using new token names which we have to declare. This results in this grammar head: parser grammar XMLParser; options { tokenVocab=XMLLexer; output=AST; } tokens { ELEMENT; ATTRIBUTE; } That‘s our complete parser grammar that actually creates an AST: parser grammar XMLParser; options { tokenVocab=XMLLexer; output=AST; } tokens { ELEMENT; ATTRIBUTE; } document : element ; element : ( startTag^^ (element | PCDATA )* endTag! | emptyElement ) ; startTag : TAG_START_OPEN GENERIC_ID attribute* TAG_CLOSE -> ^(ELEMENT GENERIC_ID attribute*) ; attribute : GENERIC_ID ATTR_EQ ATTR_VALUE -> ^(ATTRIBUTE GENERIC_ID ATTR_VALUE) ; endTag! : TAG_END_OPEN GENERIC_ID TAG_CLOSE; emptyElement : TAG_START_OPEN GENERIC_ID attribute* TAG_EMPTY_CLOSE -> ^(ELEMENT GENERIC_ID attribute*) ; Never has tree construction been easier! Finally, the glue code: import org.antlr.runtime.*; import org.antlr.runtime.tree.*; public class Main { public static void main(String[] args) { try { CharStream input = new ANTLRFileStream(args[0]); XMLLexer lex = new XMLLexer(input); CommonTokenStream tokens = new CommonTokenStream(lex); XMLParser parser = new XMLParser(tokens); XMLParser.document_return root = parser.document(); System.out.println("tree="+((Tree)root.tree).toStringTree()); CommonTreeNodeStream nodes = new CommonTreeNodeStream((Tree)root.tree); XMLTreeParser walker = new XMLTreeParser(nodes); walker.document(); } catch(Throwable t) { System.out.println("exception: "+t); t.printStackTrace(); } } } When you feed in an example that is a bit more complex, like: <t>Starting text inside t<t attrt1="valuet1" attrt2="valuet2"><t1>Text inside t1<empty/></t1><empty2/><empty3/></t></t> This is what your parser spits out: tree= (ELEMENT t Starting text inside t (ELEMENT t (ATTRIBUTE attrt1 "valuet1") (ATTRIBUTE attrt2 "valuet2") (ELEMENT t1 Text inside t1 (ELEMENT empty)) (ELEMENT empty2) (ELEMENT empty3))) <t> Starting text inside t<t attrt1="valuet1" attrt2="valuet2"> <t1> Text inside t1<empty> </empty> </t1> <empty2> </empty2> <empty3> </empty3> </t> </t> The first part of the output is the result of the toStringTree() method call from our glue code. The formatted XML is printed by the actions inside our tree parser. And, wow! So, what you have built is an XML unifier and pretty printer. As you have now mastered this complete trail through all ANTLR parser types you may ask yourself where to go from here. A good idea is to have a look at the examples
![]() ![]() ![]() When to use a custom made XML parserAs discussed in a mailing list thread
|
|