Skip to content


Text parsing, formal grammars and BNF introduction

Exercise 1: try to define a BNF definition for a greeting: “Hello, John!”, where ‘John’ can be any name starting with an uppercase character, followed by some lowercase ones.

Answer:

  • lccharacter ::= a | b | c | d | e | f | … | z
  • uccharacter ::= A | B | C | D | F | … | Z
  • lcseq ::= lccharacter | lccharacter lcseq
  • name ::= uccharacter lcseq
  • greeting ::= “Hello, ” name “!”

Obviously, you could have chosen other names. Notice the “” in the greeting definition. This stresses the fact these constant words are not part of the metalanguage.

Exercise 2: write a BNF definition for a standard printf-style function, which accepts string (“%s”) and integer (“%d”) values. The formatting string can consist of any alphanumeric character, spaces and formatters. Variable names can consist of any alphanumeric character, but can’t start with a number.

Answer:

  • charseq ::= lcchar | ucchar | lcchar charseq | ucchar charseq
  • varname ::= charseq | charseq digits | charseq digits varname
  • formatter ::= “%d” | “%s”
  • word ::= lcchar | ucchar | digit | lcchar word | ucchar word | digit word
  • part ::= formatter | word | ” “
  • spaced ::= part | ” ” part | part ” ” | ” ” part ” “
  • arg ::= spaced | spaced spaced
  • varlist ::= varname | varname “, ” varname | varname “, ” varlist
  • optarglist ::= “” | “, ” varlist
  • printf ::= “printf(” arg optarglist “);”

This one might look somewhat big, and maybe the rule names weren’t that well-chosen, sorry about that, but you should be able to recognize some parts.
Disclaimer: I just wrote down these by hand, there might be some mistakes in them.

Posted in Development, Various.

Tagged with , , , .


14 Responses

Stay in touch with the conversation, subscribe to the RSS feed for comments on this post.

  1. Anonymous says

    Shouldn’t that be: integer ::= digit | nzdigit digits ? Otherwise, your language could not parse “0″.

  2. fok says

    there are several parsing modules for python beyond pyparsing. My personal preference is ply (http://www.dabeaz.com/ply/)

  3. RubenV says

    Interesting, but I think the issue of ambiguous grammars is missing. These bring their own class of problems.

    Also, it might be interesting to mention the fact that regular expressions simply cannot represent the same class of languages as grammars. This is the very reason one would like to use grammars, which are inherently harder to parse (you need a PDA whereas a regular expression can be processed using a finite state automaton). On the other hand, the inexpressivity of regular expressions is the reason why they are used that often, they can be implemented very efficiently. You see, there is no free lunch :-)

    I can highly recommend Michael Sipser’s book on this topic: Introduction to the Theory of Computation. It’s a bit tough, but highly comprehensive (and very well written). More here: http://www-math.mit.edu/~sipser/book.html

  4. Nicolas says

    Anonymous: correct! Fixed, thanks.

    Fok: I kinda dislike the non-Pythoness of Yakk, and as a result the kinda hard to understand syntax of PLY. Thanks for the pointer though :-)

    Ruben: can’t any BNF definition be translated into a set of regular expressions and controller code?

  5. Mattias Bengtsson says

    I’ve been using BNFC (BNF Converter) for some compiler construction courses at school. I think it is a really neat project. It can generate Java, Haskell and C++-parsers. I’m not sure if it can output Python parsers though.

    • Divya says

      Hi,
      I’m going to use BNF for the first time. Just going through online informations.
      Could you please tell me good docs if any.
      Thanks!

  6. no one in particular says

    For the record, note that you’ve described the whole numbers, not the integers, because integers include negative numbers.

    (I’m not trying to be annoying, but I figured it was worth pointing out. Thanks for the interesting read!)

  7. RubenV says

    Nicolas: Try expressing a language which recognizes balanced brackets, which can be arbitrarily nested using a regular expression.

    Fairly trivial in BNF, impossible to implement using traditional regular expressions.

  8. Nicolas says

    no one in particular: indeed… I translated it somewhat badly. So to be correct:

    • whole_number ::= digit | nzdigit digits
    • sign ::= “-” | “+” | “”
    • integer ::= sign whole_number

    Ruben: out of my head, if input is ‘foo(bar(baz)bat)’, can’t you do something like

    parts = []
    regex = '^([.^(]*)[\(([.^)]*)\)]*(.*)$'
    
    def parse(s):
        parts = regex.parse(s)
        parts.append(parts[0])
        for i in range(1, len(parts)):
            parse(parts[i])
    
    parse(s)
  9. RubenV says

    Nicolas: You’re cheating there by combining recursion and regular expressions :-) . Pure regular expressions can’t do that. Recursion is actually what sets regular languages and context free grammars apart.

  10. ethana2 says

    Lojban would have been perfect if it had a proper phonetic base.
    After I fork it and make the Best Language on Earth, we can avoid this nonsense ;)

  11. sar says

    Hi,

    I’am a student , and as a project I have to write a grammar (LBNF one) -and it’s ok for it-; then, I have to use the BNFC, and that’s th e pb i.e. Idon’t know how to use it.

    Please help me. and thanks a lot.

    Best regards.

  12. preeti says

    hi i need a source coding,to check the given expression in numbers is valid or not by using c programming

  13. Abhay Joshi says

    I think tht the line
    “An integer is a non-zero digit, or a non-zero digits followed by an arbitrary amount of digits.”
    should be modified as
    “An integer is a digit, or a non-zero digit followed by an arbitrary amount of digits.”
    because your BNF looks like
    nzdigit ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    digit ::= 0 | nzdigit
    digits ::= digit | digit digits
    integer ::= digit | nzdigit digits
    and hence 0 is a valid integer.



Some HTML is OK

or, reply to this post via trackback.