Ikke's blog » parsing
http://eikke.com

Pyparsing introduction: BNF to code
http://eikke.com/pyparsing-introduction-bnf-to-code/
Sun, 20 Jan 2008 16:32:51 +0000, by Nicolas

After reading my previous post, you should have a pretty good understanding of what a BNF definition is all about. Let’s put this theory into practice and write some basic parsers in Python, using Pyparsing!

Pyparsing allows a pretty one-to-one mapping of BNF to Python code: you can define sets and combinations, then parse any text fragment against them. One thing is important to notice: a basic BNF definition can (and should) be reused. Once you have written a BNF definition for an integer value, you can reuse it in, e.g., a basic integer math expression.
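As a small preview of that kind of reuse, here is a minimal sketch. It assumes Pyparsing’s Word, nums and oneOf; the names integer, operator and expression are made up for this illustration, and the integer definition is deliberately simplified (it accepts leading zeros).

```python
from pyparsing import Word, nums, oneOf

# Simplified integer: one or more digits (leading zeros are not rejected here).
integer = Word(nums)

# Hypothetical operator set, just enough for the illustration.
operator = oneOf('+ -')

# The integer definition is reused twice inside a larger definition.
expression = integer + operator + integer

print(integer.parseString('42'))           # ['42']
print(expression.parseString('12 + 30'))   # ['12', '+', '30']
```

The point is only that integer is written once and then used as a building block, just like a named set in BNF.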

The most basic element in Pyparsing is a Word. In its most basic form this is a set of characters which will match a string of any length, as long as all characters in this string are part of the Word’s character set.

A little introductory example: let’s write a parser which accepts words consisting of lowercase characters, or sentences which consist of such words separated by spaces. First we write a formal definition using BNF:

  • character ::= a | b | c | d | … | y | z
  • word ::= character | character word
  • sentence ::= word | sentence " " word

Let’s port this formal definition to Python code. First we need to do some imports, as in most Python programs. I’d encourage the reader to write this code in an interactive interpreter (give IPython a try, it rocks!) and experiment a little with it (tab-completion and ‘?’ rock!):

from pyparsing import Word
from string import ascii_lowercase as lowercase  # string.lowercase was removed in Python 3

Pyparsing includes several useful pre-defined lists of characters, including

  • alphas: a-zA-Z
  • nums: 0-9
  • alphanums: alphas + nums

These are normal Python strings. In this sample we only want lowercase characters though, so we import this from the string module.
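A quick way to convince yourself that these really are plain strings is to poke at them in the interpreter. This is only a small sketch using the three names listed above:

```python
from pyparsing import alphas, nums, alphanums

# The pre-defined character lists are ordinary str objects,
# so the usual string operations work on them.
print(type(nums).__name__)          # str
print(nums)                         # 0123456789
print(alphanums == alphas + nums)   # True
print('q' in alphas)                # True
```

Because they are strings, you can also build your own character sets by slicing or concatenating them before handing them to a Word.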

Now we can define one word: a word is a concatenation of lowercase characters.

word = Word(lowercase)

Let’s play around with this:

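print(word.parseString('hello'))
# prints ['hello']
print(word.parseString('hello world'))
# prints ['hello']: parsing stops at the first character which is not in the set
print(word.parseString('Hello'))
# raises ParseException, e.g. Expected W:(abcd...) (0), (1,1)
# (the exact wording of the message depends on the Pyparsing version)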
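The sentence rule from the BNF definition can be sketched in the same style. This is one possible translation, not necessarily the one the original post went on to use: it assumes Pyparsing’s OneOrMore and the parseAll keyword of parseString, and it relies on the fact that Pyparsing skips whitespace between tokens by default, so any run of whitespace is accepted as a separator, not just a single space.

```python
from string import ascii_lowercase
from pyparsing import Word, OneOrMore, ParseException

word = Word(ascii_lowercase)

# sentence ::= word | sentence " " word, written as "one or more words".
sentence = OneOrMore(word)

print(sentence.parseString('hello world'))   # ['hello', 'world']

# By default parsing stops quietly at the first text that does not match.
print(sentence.parseString('hello World'))   # ['hello']

# parseAll=True demands that the whole input matches the definition.
try:
    sentence.parseString('hello World', parseAll=True)
except ParseException as exc:
    print('rejected:', exc)
```

Asking for parseAll=True is usually what you want when validating input, since a partial match otherwise goes unnoticed.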
Text parsing, formal grammars and BNF introduction
http://eikke.com/text-parsing-formal-grammars-and-bnf-introduction/
Sun, 20 Jan 2008 03:39:43 +0000, by Nicolas

Parsing input is something most developers run into one day. Parsing binary input can be pretty straightforward, as most of the time you know the format of the input, i.e. you know what to expect: if you receive a message of 10 bytes, the first byte could be a message ID, the second one the payload length, the third one a message type ID, and the others the message content. Pretty easy to handle.
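To make the binary case concrete, here is a minimal sketch using Python’s standard struct module. The layout is just the hypothetical one described above (one byte each for message ID, payload length and message type ID, then the content), and parse_message is a name invented for this example.

```python
import struct

# Assumed header layout: message ID, payload length, message type ID,
# each stored in a single unsigned byte.
HEADER = struct.Struct('BBB')

def parse_message(data):
    """Split a raw message into (message ID, message type ID, payload)."""
    msg_id, length, msg_type = HEADER.unpack_from(data)
    payload = data[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError('message is shorter than its declared payload length')
    return msg_id, msg_type, payload

message = bytes([1, 7, 2]) + b'content'   # 3 header bytes + 7 payload bytes
print(len(message))             # 10
print(parse_message(message))   # (1, 2, b'content')
```

Everything the parser needs to know is stated up front in the format, which is exactly what free-form text input lacks.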

Parsing human-readable text can be harder though: human beings tend to be less strict when providing input (e.g. with whitespace), you can’t ask humans to prefix strings with their length, etc.

There are several ways to handle text input. One well-known method is matching with regular expressions. But writing regular expressions which are able to process not-so-strict input can be pretty tough, writing expressions to parse large bodies of text is hard, and using sub-expressions can become pretty complicated. Overall, regular expressions usually involve quite a lot of black magic for the average outsider.

xkcd comic: Regular expressions

Luckily, there are easier methods to parse text input too, of which I’d like to introduce one: a Python module called Pyparsing, which can do BNF-style text parsing.

First of all, let me explain “BNF”. The Backus-Naur Form, aka BNF, is a metasyntax you can use to express the grammar of a formal language. This might make no sense at all, so let’s split it up:

  1. Syntax: a syntax defines the structure of a sentence. Don’t just think about a normal sentence here, an expression in a programming language can be a sentence as well. In every language there’s an alphabet (example: the alphabet of integer numbers consists of 0, 1, 2,…, 9). A syntax defines how characters from this alphabet should be placed together to form a valid sentence (expression).
  2. Metasyntax: a metasyntax is a syntax to define a syntax. It has its own alphabet and well-formedness rules.
  3. Grammar: a set of rules against which sentences can be checked to figure out whether they’re valid or not. Do note that being valid does not imply the sentence has any real meaning.
  4. Formal language: a language with a very strict and unforgiving description (grammar). The English language is not a formal language: although “I in Belgium live” is not a correct sentence (it doesn’t conform to the grammar of the English language), everyone knows what it means. In a formal language, any string which consists of characters from the language’s alphabet but does not match the grammar has no meaning at all.

That’s about it. If this isn’t very clear to you, never mind, the upcoming examples should explain a lot.

Let’s write our first BNF definition of a very simple language: integers. As noted before, the alphabet of integers consists of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. The structure of integers is very simple: an integer is a concatenation of digits which shouldn’t start with a 0.

When creating a BNF definition, we first define the sets of characters from the alphabet we’re going to use:

  • nzdigit ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  • digit ::= 0 | nzdigit

nzdigit is the set of all non-zero digits, digit is any digit. The :, = and | characters are part of the alphabet of BNF’s metalanguage. The : and = characters (written together as ::=) are used to assign a set to a name, the | character denotes “or”. Notice we can re-use a previously defined set.

Now let’s see how to define the rules for integers. An integer consists of a concatenation of digits, but can’t start with a 0. So first of all we need to figure out how to represent a concatenation of digits. Remember we can use previously defined sets in a definition? Guess what, we can also re-use a set recursively, so this is a concatenation of digits:

  • digits ::= digit | digit digits

So, a “digits” is one single digit, or a single digit with a digits appended to it, which can in turn be a single digit or a digit with a digits appended to it, etc. “1”, “12”, “120” and “012” are all valid digits.
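If the recursion feels abstract, it may help to see the rule written as a recursive function. This is only a hand-written illustration in plain Python, and is_digits is a name made up for it; each branch corresponds to one alternative of the BNF rule.

```python
DIGIT = set('0123456789')

def is_digits(text):
    """digits ::= digit | digit digits"""
    # Both alternatives start with a digit, so check that first.
    if not text or text[0] not in DIGIT:
        return False
    # First alternative: exactly one digit.
    if len(text) == 1:
        return True
    # Second alternative: a digit followed by something that is itself "digits".
    return is_digits(text[1:])

for sample in ['1', '12', '120', '012', '', '12a']:
    print(repr(sample), is_digits(sample))
```

The first four samples are accepted, while the empty string and “12a” are rejected.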

So, finally we can define an integer:

  • integer ::= digit | nzdigit digits

An integer is a single digit (which may be 0), or a non-zero digit followed by one or more digits.
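The complete definition can be checked by hand with a few lines of plain Python. This is a sketch for experimenting, not a real parser: the names NZDIGIT, DIGIT, is_digits and is_integer simply mirror the BNF names used above.

```python
NZDIGIT = set('123456789')   # nzdigit ::= 1 | 2 | ... | 9
DIGIT = {'0'} | NZDIGIT      # digit ::= 0 | nzdigit

def is_digits(text):
    """digits ::= digit | digit digits, i.e. one or more digits."""
    return len(text) >= 1 and all(char in DIGIT for char in text)

def is_integer(text):
    """integer ::= digit | nzdigit digits"""
    # First alternative: a single digit, which may be 0.
    if len(text) == 1:
        return text in DIGIT
    # Second alternative: a non-zero digit followed by one or more digits.
    return len(text) > 1 and text[0] in NZDIGIT and is_digits(text[1:])

for sample in ['0', '7', '120', '012', '']:
    print(repr(sample), is_integer(sample))
```

Here “0”, “7” and “120” are accepted, while “012” and the empty string are rejected, which is the behaviour the definition asks for.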

This example should be clear. If it’s not, read over it again: it’s pretty important to ‘get’ this.

Exercise 1: define a BNF definition for a greeting: “Hello, John!”, where ‘John’ can be any name starting with an uppercase character, followed by some lowercase ones.

Exercise 2: write a BNF definition for a standard printf-style function call, which accepts string (“%s”) and integer (“%d”) values. The formatting string can consist of any alphanumeric characters, spaces and formatters. Variable names can consist of any alphanumeric characters, but can’t start with a digit. You can use some pre-defined definitions:

  • All the definitions we used above (nzdigit, digit, digits and integer)
  • lcchar and ucchar: lower-case and upper-case characters