Question

For some search-based code (in Python), I need to write a query syntax parser that would parse a simple Google-like query syntax. For example:

all of these words "with this phrase" OR that OR this site:within.site filetype:ps from:lastweek

As search becomes more and more popular, I expected to be able to easily find a Python library for doing this and thus avoid having to reinvent the wheel. Sadly, searching on Google doesn't yield much.

What would you recommend as a Python parsing library for this simple task?


Solution

While ply is the more classical approach (a Pythonic variant of lex + yacc) and thus may be easier to get started with if you're already familiar with such traditional tools, pyparsing is highly Pythonic and would be my top recommendation, especially for simple tasks like this one (which are really closer to lexing than to "full-blown" parsing... at least until you want to allow possibly-nested parentheses, but pyparsing won't really be troubled by those either ;-).
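To give a feel for how compact this gets, here is a minimal pyparsing sketch of a grammar for the query in the question. The token names and character classes are my own assumptions, not a full specification of the syntax (it ignores the OR operator, for instance):

```python
from pyparsing import Group, OneOrMore, QuotedString, Suppress, Word, alphanums

# Illustrative grammar sketch; names and character classes are assumptions.
key = Word(alphanums)                            # qualifier name, e.g. "site"
value = Word(alphanums + "._-")                  # qualifier value, e.g. "within.site"
qualified = Group(key + Suppress(":") + value)   # site:within.site -> ['site', 'within.site']
phrase = QuotedString('"')                       # "with this phrase" -> quotes stripped
term = Word(alphanums + "._-")                   # bare search word
query = OneOrMore(qualified | phrase | term)     # try qualified first, then phrase, then word

tokens = query.parseString('all "with this phrase" site:within.site filetype:ps')
print(tokens.asList())
# -> ['all', 'with this phrase', ['site', 'within.site'], ['filetype', 'ps']]
```

Note that `qualified` must come before `term` in the alternation, since a bare `Word` would otherwise consume the qualifier name and leave the colon unmatched.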

OTHER TIPS

There's also LEPL - http://www.acooke.org/lepl

SORRY - LEPL is no longer being developed.

Here's a quick solution I wrote during breakfast:

pl6 src: python3                                                      
Python 3.1 (r31:73572, Oct 24 2009, 05:39:09)                         
[GCC 4.4.1 [gcc-4_4-branch revision 150839]] on linux2                
Type "help", "copyright", "credits" or "license" for more information.
>>> from lepl import *                                                
>>>                                                                   
>>> class Alternatives(Node):                                         
...     pass                                                          
...
>>> class Query(Node):
...     pass
...
>>> class Text(Node):
...     pass
...
>>> def compile():
...     qualifier      = Word() & Drop(':')           > 'qualifier'
...     word           = ~Lookahead('OR') & Word()
...     phrase         = String()
...     text           = phrase | word
...     word_or_phrase = (Optional(qualifier) & text) > Text
...     space          = Drop(Space()[1:])
...     query          = word_or_phrase[1:, space]    > Query
...     separator      = Drop(space & 'OR' & space)
...     alternatives   = query[:, separator]          > Alternatives
...     return alternatives.string_parser()
...
>>> parser = compile()
>>>
>>> alternatives = parser('all of these words "with this phrase" '
...                       'OR that OR this site:within.site '
...                       'filetype:ps from:lastweek')[0]
>>>
>>> print(str(alternatives))
Alternatives
 +- Query
 |   +- Text
 |   |   `- 'all'
 |   +- Text
 |   |   `- 'of'
 |   +- Text
 |   |   `- 'these'
 |   +- Text
 |   |   `- 'words'
 |   `- Text
 |       `- 'with this phrase'
 +- Query
 |   `- Text
 |       `- 'that'
 `- Query
     +- Text
     |   `- 'this'
     +- Text
     |   +- qualifier 'site'
     |   `- 'within.site'
     +- Text
     |   +- qualifier 'filetype'
     |   `- 'ps'
     `- Text
         +- qualifier 'from'
         `- 'lastweek'
>>>

I would argue that LEPL isn't a "toy" - although it's recursive descent, it includes memoisation and trampolining, which help avoid some of the limitations of that approach.

However, it is pure Python, so it's not super-fast, and it's in active development (a new release, 4.0, with quite a few fixes and improvements, is coming relatively soon).

A few good options:

  • Whoosh: the only problem is that it has few parsing examples, since the parser may not be its main feature/focus, but it's definitely a good option

  • modgrammar: I didn't try it, but it seems pretty flexible and simple

  • ply

  • pyparsing: highly recommended; there are some good parsing examples online

If you're done with the project, what did you end up choosing?

PLY is great. It is based on the Lex/Yacc idiom and thus may already be familiar. It allows you to create arbitrarily complex lexers and parsers for any task, including the one you need.

Using a powerful tool like PLY instead of a simple toy is a good idea, because your needs can become more complex over time and you'll want to stay with the same tool.
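In the Lex/Yacc idiom, the first step is a lexer defined as a list of named token regexes. As a rough, dependency-free sketch of what the token rules for this query syntax might look like (the names and patterns below are illustrative assumptions, not taken from any existing grammar), using only the standard library:

```python
import re

# Token regexes approximating what a lex-style tokenizer for the
# query syntax might define; names and patterns are illustrative.
TOKEN_SPEC = [
    ("PHRASE",    r'"[^"]*"'),        # "with this phrase"
    ("OR",        r'\bOR\b'),         # alternative separator
    ("QUALIFIED", r'\w+:[\w.\-]+'),   # site:within.site, filetype:ps
    ("WORD",      r'[\w.\-]+'),       # bare search term
    ("WS",        r'\s+'),            # whitespace, skipped below
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(query):
    """Yield (token_type, value) pairs, skipping whitespace."""
    for match in MASTER_RE.finditer(query):
        kind = match.lastgroup
        if kind == "WS":
            continue
        value = match.group()
        if kind == "PHRASE":
            value = value[1:-1]  # strip the surrounding quotes
        yield kind, value

print(list(tokenize('all "with this phrase" OR site:within.site')))
# -> [('WORD', 'all'), ('PHRASE', 'with this phrase'),
#     ('OR', 'OR'), ('QUALIFIED', 'site:within.site')]
```

A real PLY lexer would express each of these as a `t_NAME` rule; the regex ordering matters the same way in both (e.g. `QUALIFIED` must be tried before `WORD`).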

PyParsing would be the right choice, although it is quite tedious; that's why I developed a query parser inspired by the Lucene and Gmail syntax. Its only dependency is pyparsing, and we have used it in several projects. It is fully customizable and extensible, and it abstracts you away from the pyparsing issues. You can check it out here:

http://www.github.com/sebastiandev/plyse

It's pretty well documented, so you'll find docs on how to do the querying, configuration, etc.

Whoosh has a comprehensive search query parser in its whoosh.qparser module; its QueryParser class should be reasonably easy to adapt to your use case.

See http://pythonhosted.org/Whoosh/parsing.html and https://bitbucket.org/mchaput/whoosh/src/55f9c484047a8306101c8eaa59e9a110f960a1c2/src/whoosh/qparser

I know this is an old question, but for future reference: I just uploaded my package searchstringparser to PyPI. It implements decent query-parsing machinery based on ply and outputs a string suitable for the PostgreSQL function tsquery. You can look at the lexer and parser classes to see if they fit your needs, or modify them accordingly.

Feedback welcome!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow