Now we want to define a sentence. Did you notice that in the last example Pyparsing doesn’t raise an error on the input, but simply stops processing it after the first word? The parsing module is aware of whitespace, which (by default) marks the end of a structure. So we need to define a sentence as a list of words, (implicitly) separated by whitespace:
from pyparsing import OneOrMore
# 'word' is the Word(lowercase) expression defined on the previous page
sentence = OneOrMore(word)
print(sentence.parseString('hello world'))
# returns ['hello', 'world']
print(sentence.parseString('hello'))
# returns ['hello']
print(sentence.parseString('hello    world')) # notice >1 spaces
# returns ['hello', 'world']
print(sentence.parseString('Hello world'))
# raises a ParseException
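To see the whitespace behaviour for ourselves, here is a small sketch. It assumes Python 3, where the string module’s ‘lowercase’ is called ‘ascii_lowercase’, and it repeats the ‘word’ definition so the snippet runs on its own. The parseAll=True argument is an addition for illustration only: it asks Pyparsing to fail when part of the input is left unparsed.

```python
from string import ascii_lowercase
from pyparsing import OneOrMore, ParseException, Word

# Same definitions as in the text, repeated so this runs standalone.
word = Word(ascii_lowercase)
sentence = OneOrMore(word)

# A single word stops at the first whitespace; the rest is silently ignored.
first_only = word.parseString('hello world').asList()

# With parseAll=True, leftover input becomes an error instead.
try:
    word.parseString('hello world', parseAll=True)
    leftover_rejected = False
except ParseException:
    leftover_rejected = True

# OneOrMore(word) consumes every word, however many spaces separate them.
all_words = sentence.parseString('hello    world', parseAll=True).asList()

print(first_only, leftover_rejected, all_words)
```

So the single ‘word’ parser is not wrong, it just never looks past the first token unless we ask it to.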
Let’s enhance our sentence parser somewhat: we want to parse sentences which are defined using these rules:
- A word is a concatenation of lowercase characters (a-z)
- A sentence starts with a word which starts with one uppercase character (A-Z)
- A sentence consists of one or more words, including the first one
- A sentence ends with a dot, a question mark or an exclamation mark
Let’s rewrite this more formally using BNF:
- lccharacter ::= a | b | c | … | z
- uccharacter ::= A | B | C | … | Z
- word ::= lccharacter | lccharacter word
- startword ::= uccharacter | uccharacter word
- end ::= . | ? | !
- body ::= word | body " " word | ""
- sentence ::= startword body end
Notice ‘body’ can be an empty string, so ‘I!’ is a valid sentence.
Let’s translate this into Python code again. Did you notice ‘body’ is almost the same thing as the ‘sentence’ structure we defined before, except that it may also be an empty string? No need to re-explain this.
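One way the BNF above could be written with Pyparsing is sketched below. The names follow the BNF, but the exact expressions are assumptions: Word(ascii_uppercase, ascii_lowercase) for ‘startword’, ZeroOrMore for the possibly empty ‘body’, and oneOf for ‘end’. It also assumes Python 3 (‘ascii_lowercase’ instead of ‘lowercase’), and the is_sentence helper is a hypothetical name added for this example.

```python
from string import ascii_lowercase, ascii_uppercase
from pyparsing import ParseException, Word, ZeroOrMore, oneOf

word = Word(ascii_lowercase)
# One uppercase character followed by zero or more lowercase characters.
startword = Word(ascii_uppercase, ascii_lowercase)
# 'body' may be empty, hence ZeroOrMore rather than OneOrMore.
body = ZeroOrMore(word)
# A dot, a question mark or an exclamation mark.
end = oneOf('. ? !')
sentence = startword + body + end


def is_sentence(text):
    """Return True if the whole of text matches the sentence grammar."""
    try:
        sentence.parseString(text, parseAll=True)
        return True
    except ParseException:
        return False


tokens = sentence.parseString('A valid sentence.').asList()
print(tokens)
print(is_sentence('I!'), is_sentence('Hello world'))
```

As the BNF predicts, ‘I!’ is accepted because ‘body’ may be empty, while ‘Hello world’ is rejected because the closing punctuation is missing.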
Six pages?? Ouch. And no next button. Any chance you could put it all on one page next time? I feel like I’m reading some ad-infested hardware blog.
very nice tutorial! thanks!
What’s the difference from other parser systems like simpleparse?
Regards,
I don’t see the difference between these two (except for the whitespace):
print(sentence.parseString('hello    world')) # notice >1 spaces
# returns ['hello', 'world']
print(sentence.parseString('Hello world'))
# raises a ParseException
Why does the second one raise an exception?
Francis: I guess you’re referring to the snippet on page 2? It says:
from pyparsing import OneOrMore
sentence = OneOrMore(word)
The definition of ‘word’ is given on the previous page:
word = Word(lowercase)
where ‘lowercase’ is imported from the ‘string’ module and equals
abcdefghijklmnopqrstuvwxyz
The definition of the BNF type ‘word’ is Word(lowercase), i.e. a concatenation of any of the characters in the string (or list, if you prefer) ‘lowercase’, which is a-z.
A sentence is defined as OneOrMore words.
The string ‘Hello world’ cannot be parsed since it does not match OneOrMore(word): the first item in it (‘Hello’) contains a character not matching the definition of word, the ‘H’ (since we defined a word to be a concatenation of lowercase characters, it shouldn’t contain any uppercase characters).
As you can see, on page 3 a better definition of sentence is constructed using a ‘startword’ definition, which should be one uppercase character followed by zero or more lowercase characters. The example shows ‘A valid sentence.’ can be parsed and validated. The string ‘Hello world!’ would be valid in this BNF construct too. ‘Hello world’ would not match since we’re missing a punctuation mark.
Using the definitions from page 3
almost_valid_sentence = startword + body
or (even more limited)
hello_caps = startword + word
would validate and parse ‘Hello world’.
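A quick sketch to check this. The ‘startword’ and ‘body’ expressions below are assumed stand-ins for the page 3 definitions (one uppercase character plus optional lowercase characters, and zero or more words), written for Python 3:

```python
from string import ascii_lowercase, ascii_uppercase
from pyparsing import Word, ZeroOrMore

word = Word(ascii_lowercase)
startword = Word(ascii_uppercase, ascii_lowercase)  # assumed definition
body = ZeroOrMore(word)                             # assumed definition

# Neither expression requires closing punctuation.
almost_valid_sentence = startword + body
hello_caps = startword + word

almost = almost_valid_sentence.parseString('Hello world').asList()
caps = hello_caps.parseString('Hello world').asList()
print(almost, caps)
```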
Good introduction – thank you!
Although I share the feelings of “sb” about pagination.
hi poh,, what if the expr is like this A=B+c?
Good introduction to pyparsing. Thanks Nicolas!