Question

I'm trying to parse a particular syntax for positions in biological sequences. The positions can have forms like:

12           -- a simple position in the sequence
12+34        -- a complex position as a base (12) and offset (+34)
12_56        -- a range, from 12 to 56
12+34_56-78  -- a range as a start to end, where either or both may be simple or complex

I'd like to have these parsed as dicts, roughly like this:

12          -> { 'start': { 'base': 12, 'offset': 0 },  'end': None }
12+34       -> { 'start': { 'base': 12, 'offset': 34 }, 'end': None }
12_56       -> { 'start': { 'base': 12, 'offset': 0 },
                   'end': { 'base': 56, 'offset': 0 } }
12+34_56-78 -> { 'start': { 'base': 12, 'offset': 34 },
                   'end': { 'base': 56, 'offset': -78 } }

I've made several stabs using pyparsing. Here's one:

from pyparsing import *
integer = Word(nums)
signed_integer = Word('+-', nums)
underscore = Suppress('_')
position = integer.setResultsName('base') + Or(signed_integer,Empty).setResultsName('offset')
interval = position.setResultsName('start') + Or(underscore + position,Empty).setResultsName('end')

The results are close to what I want:

In [20]: hgvspyparsing.interval.parseString('12-34_56+78').asDict()
Out[20]: 
{'base': '56',
'end': (['56', '+78'], {'base': [('56', 0)], 'offset': [((['+78'], {}), 1)]}),
'offset': (['+78'], {}),
'start': (['12', '-34'], {'base': [('12', 0)], 'offset': [((['-34'], {}), 1)]})}

Two questions:

  1. asDict() only worked on the root ParseResults. Is there a way to cajole pyparsing into returning a nested dict (and only that)?

  2. How do I get the optionality of the end of a range and the offset of a position? The Or() in the position rule doesn't cut it. (I tried similarly for the end of the range.) Ideally, I'd treat all positions as special cases of the most complex form (i.e., { start: { base, offset }, end: { base, offset } }), where the simpler cases use 0 or None.

Thanks!


Solution

Some general pyparsing tips:

Or(expr, empty) is better written as Optional(expr). Also, your Or expression was given the class Empty rather than an instance of it; you probably meant to write Empty() or empty for the second argument.

expr.setResultsName("name") can now be written as expr("name")

If you want to apply structure to your results, use Group.

Use dump() instead of asDict() to better view the structure of your parsed results.

Here is how I would build up your expression:

from pyparsing import Word, nums, oneOf, Combine, Group, Optional

integer = Word(nums)

sign = oneOf("+ -")
signedInteger = Combine(sign + integer)

integerExpr = Group(integer("base") + Optional(signedInteger, default="0")("offset"))

integerRange = integerExpr("start") + Optional('_' + integerExpr("end"))


tests = """\
12
12+34
12_56
12+34_56-78""".splitlines()

for t in tests:
    result = integerRange.parseString(t)
    print(t)
    print(result.dump())
    print(result.asDict())
    print(result.start.base, result.start.offset)
    if result.end:
        print(result.end.base, result.end.offset)
    print()

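Prints (as produced by the pyparsing release current when this was written; newer releases render asDict() as plain nested dicts):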

12
[['12', '0']]
- start: ['12', '0']
  - base: 12
  - offset: 0
{'start': (['12', '0'], {'base': [('12', 0)], 'offset': [('0', 1)]})}
12 0

12+34
[['12', '+34']]
- start: ['12', '+34']
  - base: 12
  - offset: +34
{'start': (['12', '+34'], {'base': [('12', 0)], 'offset': [('+34', 1)]})}
12 +34

12_56
[['12', '0'], '_', ['56', '0']]
- end: ['56', '0']
  - base: 56
  - offset: 0
- start: ['12', '0']
  - base: 12
  - offset: 0
{'start': (['12', '0'], {'base': [('12', 0)], 'offset': [('0', 1)]}), 'end': (['56', '0'], {'base': [('56', 0)], 'offset': [('0', 1)]})}
12 0
56 0

12+34_56-78
[['12', '+34'], '_', ['56', '-78']]
- end: ['56', '-78']
  - base: 56
  - offset: -78
- start: ['12', '+34']
  - base: 12
  - offset: +34
{'start': (['12', '+34'], {'base': [('12', 0)], 'offset': [('+34', 1)]}), 'end': (['56', '-78'], {'base': [('56', 0)], 'offset': [('-78', 1)]})}
12 +34
56 -78
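
To get exactly the nested dict asked for in the question (integer values, 'end' set to None when absent), one option is to attach parse actions that convert the matched strings to int and then copy the named results into plain dicts. The following is only a sketch along those lines, not part of the original answer: the names position, interval and parse_interval are made up for illustration, and it uses the same camelCase pyparsing calls as the code above.

```python
from pyparsing import Combine, Group, Optional, Suppress, Word, nums, oneOf

# Convert matched text to int at parse time.
integer = Word(nums).setParseAction(lambda toks: int(toks[0]))
signed_integer = Combine(oneOf("+ -") + Word(nums)).setParseAction(
    lambda toks: int(toks[0])
)

# A position is a base with an optional signed offset (default 0).
position = Group(integer("base") + Optional(signed_integer, default=0)("offset"))

# A range is a start position, optionally followed by "_" and an end position.
# Suppress keeps the underscore out of the results.
interval = position("start") + Optional(Suppress("_") + position("end"))


def parse_interval(text):
    """Hypothetical helper: parse text and return a plain nested dict."""
    result = interval.parseString(text, parseAll=True)

    def as_plain(pos):
        return {"base": pos["base"], "offset": pos["offset"]}

    end = as_plain(result["end"]) if "end" in result else None
    return {"start": as_plain(result["start"]), "end": end}


for text in ["12", "12+34", "12_56", "12+34_56-78"]:
    print(text, "->", parse_interval(text))
```

Building the dict by hand in parse_interval avoids depending on how a particular pyparsing version formats asDict(), and parseAll=True makes trailing garbage an error instead of being silently ignored.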

Other tips

Is the actual syntax more complicated than your examples? Because the parsing can be done fairly easily in pure Python:

bases = ["12", "12+34", "12_56", "12+34", "12+34_56-78"]

def parse_base(base_string):

    def parse_single(s):
        if '-' in s:
            offset_start = s.find("-")
            base, offset = int(s[:offset_start]), int(s[offset_start:])
        elif '+' in s:
            offset_start = s.find("+")
            base, offset = int(s[:offset_start]), int(s[offset_start:])
        else:
            base = int(s)
            offset = 0
        return {'base': base, 'offset': offset}

    range_split = base_string.split('_')
    if len(range_split) == 1:
        start = range_split[0]
        return {'start': parse_single(start), 'end': None}
    elif len(range_split) == 2:
        start, end = range_split
        return {'start': parse_single(start),
                'end': parse_single(end)}
    # More than one underscore is not a valid range.
    raise ValueError("invalid position: %r" % base_string)

Output:

for b in bases:
    print(parse_base(b))

{'start': {'base': 12, 'offset': 0}, 'end': None}
{'start': {'base': 12, 'offset': 34}, 'end': None}
{'start': {'base': 12, 'offset': 0}, 'end': {'base': 56, 'offset': 0}}
{'start': {'base': 12, 'offset': 34}, 'end': None}
{'start': {'base': 12, 'offset': 34}, 'end': {'base': 56, 'offset': -78}}
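
If the syntax really is this small, the same idea can also be written as a single regular expression, which validates the whole string in one step. This is a minimal sketch using only the standard re module; the names POSITION, INTERVAL_RE and parse_interval_re are illustrative and do not come from the answer above.

```python
import re

# One position: digits for the base, then an optional signed offset.
POSITION = r"(\d+)([+-]\d+)?"

# A start position, optionally followed by "_" and an end position.
INTERVAL_RE = re.compile(r"{0}(?:_{0})?".format(POSITION))


def parse_interval_re(text):
    """Hypothetical helper: return the nested dict, or raise ValueError."""
    match = INTERVAL_RE.fullmatch(text)
    if match is None:
        raise ValueError("not a valid position or range: {!r}".format(text))

    start_base, start_offset, end_base, end_offset = match.groups()
    start = {"base": int(start_base), "offset": int(start_offset or 0)}

    end = None
    if end_base is not None:
        end = {"base": int(end_base), "offset": int(end_offset or 0)}

    return {"start": start, "end": end}


for text in ["12", "12+34", "12_56", "12+34_56-78"]:
    print(text, "->", parse_interval_re(text))
```

Because fullmatch must consume the entire string, inputs such as "12_56_78" or "12+" raise ValueError rather than being partially parsed.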
Licensed under: CC-BY-SA with attribution