Ikke's blog

'cause this is what I do

Text parsing, formal grammars and BNF introduction

Parsing input is something most developers run into one day. Parsing binary input can be pretty straight-forward, as most of the time you know the format of the input, ie you know what to expect: if you receive a message of 10 bytes, the first byte could be a message ID, the second one the payload length, third one message type ID, and others message content. Pretty easy to handle.

Parsing human-readable text can be harder though, as human beings tend to be less strict when providing input (eg whitespacing), you can’t ask humans to prepend strings with their length, etc.

There are several ways to handle text input. One well-known method is using regular expressions with matches, but writing regular expressions which are able to process not-so-strict input can be pretty though, writing expressions to parse large bodies of text is hard, using sub-expressions can become pretty complicated,… Overall regular expressions usually involve quite a lot of black magic for the average outsider.

xkcd comic: Regular expressions

Luckily, there are easier methods to parse text input too, of which I’d like to introduct one: a Python module called Pyparsing, which can do BNF-style text parsing.

First of all, let me explain “BNF”. The Backus-Naur Form, aka BNF, is a metasyntax you can use to express the grammar of a formal language. This might make no sense at all, so let’s split it up:

Syntax: a syntax defines the structure of a sentence. Don’t just think about a normal sentence here, an expression in a programming language can be a sentence as well. In every language there’s an alphabet (example: the alphabet of integer numbers consists of 0, 1, 2,…, 9). A syntax defines how characters from this alphabet should be placed together to form a valid sentence (expression).
Metasyntax: a metasyntax is a syntax to define a syntax. It has its own alphabet and well-formedness rules.
Grammar: set of rules against which sentences can be checked to figure out whether they’re valid or not. Do note being valid does not imply the sentence got a real significance.
Formal language: a language with a very strict and unforgiving description (grammar). The English language is not a formal language: although “I in Belgium live” is not a correct sentence (it doesn’t correspond to the grammar of the English language), everyone knows what it means. In a formal language, any string which consists of characters from the language’s alphabet, but does not match the grammar, got no significance at all.

That’s about it. If this isn’t very clear to you, never mind, the upcoming examples should explain a lot.

Let’s write our first BNF definition of a very simple language: integers. As noted before, the alphabet of integers consists of the numbers 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. The structure of integers is very simple, it’s a concatenation of numbers, but shouldn’t start with a 0.

When creating a BNF definition, we first define the sets of characters from the alphabet we’re going to use:

nzdigit ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
digit ::= 0 | nzdigit

nzdigit is a set of all non-zero numbers, digit is any digit. the :, = and | characters are part of the BNF’s metalanguage alphabet. The : and = characters are used to assign a set to a name, the | character denotes “or”. Notice we can re-use a previously defined set.

Now let’s see how to define the rules for integers. An integer consists of a concatenation of numbers, but can’t start with a 0. So first of all we need to figure out how to present a concatenation of numbers. Remember we can use previously defined sets in a definition? Guess what, we can also re-use a set recursively, so this is a concatenation of digits:

digits ::= digit | digit digits

So, a “digits” is one single digit, or a single digit with a digits appended to it, which can be a single digit or a digits appended to it, etc. “1″, “12″, “120″ and “012″ are valid digits.

So, finally we can define an integer:

integer ::= digit | nzdigit digits

An integer is a non-zero digit, or a non-zero digits followed by an arbitrary amount of digits.

This example should be clear, if it’s not, read over it again, it’s pretty important to ‘get’ this.

Exercise 1: define a BNF definition for a greeting: “Hello, John!”, where ‘John’ can be any name starting with an uppercase character, followed by some lowercase ones.

Exercise 2: write a BNF definition for a standard printf-style function, which accepts string (“%s”) and integer (“%d”) values. The formatting string can consist of any alphanumeric character, spaces and formatters. Variable names can consist of any alphanumeric character, but can’t start with a number. You can use some pre-defined definitions:

All the definitions we used above (nzdigit, digit, digits and integer)
lcchar and ucchar: lower-case and upper-case characters

Pages: 1 2

Posted in Development, Various.

Tagged with bnf, grammar, mathematics, parsing.

By Nicolas – January 20, 2008

14 Responses

Stay in touch with the conversation, subscribe to the RSS feed for comments on this post.

Anonymous says

Shouldn’t that be: integer ::= digit | nzdigit digits ? Otherwise, your language could not parse “0″.

January 20, 2008, 05:27 Reply
fok says

there are several parsing modules for python beyond pyparsing. My personal preference is ply (http://www.dabeaz.com/ply/)

January 20, 2008, 06:40 Reply
RubenV says

Interesting, but I think the issue of ambiguous grammars is missing. These bring their own class of problems.

Also, it might be interesting to mention the fact that regular expressions simply cannot represent the same class of languages as grammars. This is the very reason one would like to use grammars, which are inherently harder to parse (you need a PDA whereas a regular expression can be processed using a finite state automaton). On the other hand, the inexpressivity of regular expressions is the reason why they are used that often, they can be implemented very efficiently. You see, there is no free lunch

I can highly recommend Michael Sipser’s book on this topic: Introduction to the Theory of Computation. It’s a bit tough, but highly comprehensive (and very well written). More here: http://www-math.mit.edu/~sipser/book.html

January 20, 2008, 12:50 Reply
Nicolas says

Anonymous: correct! Fixed, thanks.

Fok: I kinda dislike the non-Pythoness of Yakk, and as a result the kinda hard to understand syntax of PLY. Thanks for the pointer though

Ruben: can’t any BNF definition be translated into a set of regular expressions and controller code?

January 20, 2008, 13:41 Reply
Mattias Bengtsson says

I’ve been using BNFC (BNF Converter) for some compiler construction courses at school. I think it is a really neat project. It can generate Java, Haskell and C++-parsers. I’m not sure if it can output Python parsers though.

January 20, 2008, 15:32 Reply
- Divya says
  
  Hi,
  I’m going to use BNF for the first time. Just going through online informations.
  Could you please tell me good docs if any.
  Thanks!
  
  September 15, 2010, 14:11 Reply
no one in particular says

For the record, note that you’ve described the whole numbers, not the integers, because integers include negative numbers.

(I’m not trying to be annoying, but I figured it was worth pointing out. Thanks for the interesting read!)

January 20, 2008, 18:46 Reply
RubenV says

Nicolas: Try expressing a language which recognizes balanced brackets, which can be arbitrarily nested using a regular expression.

Fairly trivial in BNF, impossible to implement using traditional regular expressions.

January 20, 2008, 18:46 Reply
Nicolas says
no one in particular: indeed… I translated it somewhat badly. So to be correct:
- whole_number ::= digit | nzdigit digits
- sign ::= “-” | “+” | “”
- integer ::= sign whole_number
Ruben: out of my head, if input is ‘foo(bar(baz)bat)’, can’t you do something like
```
parts = []
regex = '^([.^(]*)[$([.^)]*)$]*(.*)$'

def parse(s):
    parts = regex.parse(s)
    parts.append(parts[0])
    for i in range(1, len(parts)):
        parse(parts[i])

parse(s)
```
January 20, 2008, 19:30 Reply
RubenV says

Nicolas: You’re cheating there by combining recursion and regular expressions . Pure regular expressions can’t do that. Recursion is actually what sets regular languages and context free grammars apart.

January 20, 2008, 20:52 Reply
ethana2 says

Lojban would have been perfect if it had a proper phonetic base.
After I fork it and make the Best Language on Earth, we can avoid this nonsense

January 21, 2008, 09:45 Reply
sar says

Hi,

I’am a student , and as a project I have to write a grammar (LBNF one) -and it’s ok for it-; then, I have to use the BNFC, and that’s th e pb i.e. Idon’t know how to use it.

Please help me. and thanks a lot.

Best regards.

May 23, 2008, 18:52 Reply
preeti says

hi i need a source coding,to check the given expression in numbers is valid or not by using c programming

January 30, 2010, 18:16 Reply
Abhay Joshi says

I think tht the line
“An integer is a non-zero digit, or a non-zero digits followed by an arbitrary amount of digits.”
should be modified as
“An integer is a digit, or a non-zero digit followed by an arbitrary amount of digits.”
because your BNF looks like
nzdigit ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
digit ::= 0 | nzdigit
digits ::= digit | digit digits
integer ::= digit | nzdigit digits
and hence 0 is a valid integer.

March 14, 2011, 19:03 Reply

« django-validation now includes inheritance support Pyparsing introduction: BNF to code »

@donsbot Any reason Data.ByteString.zipWith' isn't exported/public API? 12:30:27 AM December 17, 2011 from web in reply to donsbot Reply Retweet Favorite
RT @raichoo: "#Ocaml has an OOP extension… that nobody uses […] not even its inventor" -Yaron Minsky, Janestreet. 09:28:04 PM December 16, 2011 from web Reply Retweet Favorite
Whenever you launch a tech startup, don't look for office space: head to Starbucks. Coffee, power & free wifi, all you need is a laptop! 06:30:52 PM December 16, 2011 from web Reply Retweet Favorite
@viktorklang Bit me a couple of time in the Haskell bindings as well. Simple approach to make a library "threadsafe"? 03:19:38 PM December 16, 2011 from web in reply to viktorklang Reply Retweet Favorite
W00t ^_^ RT @johtib Screenshot teaser for something I've been working on lately: http://t.co/OyGog474 01:19:31 PM December 16, 2011 from web Reply Retweet Favorite

@eikke

Proudly powered by WordPress and Carrington.

Text parsing, formal grammars and BNF introduction

14 Responses

About Ikke's blog

Meta

Friends

Me

Planets

License

Me @ Twitter

Text parsing, formal grammars and BNF introduction

14 Responses

Subscribe

About Ikke's blog

Meta

Friends

Me

Planets

License

Me @ Twitter

Tags