Question

I'm trying to parse data from about 30,000 file names. Here are a few examples:

A0038clone11-BA28.ab1
A038clone11-BA31.ab1
A0038clone11-BA32.ab1
A0001-R00-tatI-BA29.ab1
A0001-R00-V3-BA31.ab1
A0001-R00-LTR-BA43.ab1
A0001-R00-BA81-tat1.ab1
A0002_R07-Primer7.ab1
A0016_A0053_R01-Primer5.ab1
A0016:A0053-R02-Primer7.ab1
A0017_A0054_R04-Primer5.ab1
A0017_A0054_R07-Primer5.ab1
A0037_R06_042712-Primer7_R.ab1
A0037_R07-Primer5_R.ab1
A0041-R01-12Feb-BA87-tat2.ab1
A0094-R00-BA88-fall to early.ab1
A0094-R02-BA88-need to repeat.ab1
A0107-R01-WZ5-BA86-tat1.ab1
A0111_R04_P5_GC-Primer5.ab1
A0179-R02LTR-BA83_R-bad seq.ab1

I'm trying to extract the following:

  • Patient number (A0-something)
  • Visit number (R-something), which is often missing; I'm assuming a default of R00
  • Primer (either BA-something or Primer-something)
  • Clone number (clone-something), which is often missing; I'm assuming a default of clone01

I'm new to using pyparsing so I'd love some help.

My first guess was to do something like:

pat = pyp.Combine(pyp.Word('A') + pyp.Word(pyp.nums))
visit = pyp.Combine(pyp.Word('R') + pyp.Word(pyp.nums))
clone = pyp.Combine('clone' + pyp.Word(pyp.nums))

primer = pyp.Combine(pyp.oneOf('Primer BA', caseless=True) + pyp.Word(pyp.nums))
extension = pyp.Combine(pyp.Optional(pyp.CaselessLiteral('_R'))+pyp.CaselessLiteral('.ab1'))

parser = pat + pyp.Optional(visit, default='R00') + pyp.Optional(clone, default='clone01') + primer + extension
parser.setDefaultWhitespaceChars(' -/:_-')

But that fails when the order is off or there are extra words in there (like tatI, V3, etc).

Following the suggestions in "Pyparsing - where order of tokens in unpredictable", I tried the OneOrMore operator:

parser = pyp.OneOrMore(pyp.MatchFirst([pat, 
                                       visit, 
                                       clone, 
                                       primer,
                                       extension]))
parser.setDefaultWhitespaceChars(' -/:_-')

But that misses the primer in some instances, such as A0001-R00-LTR-BA43.ab1, though not in A0001-R00-BA81-tat1.ab1, for reasons that I don't understand.

Any suggestions would be greatly appreciated!


Solution

You were almost there. You also need to match the extra tokens (which I assume you don't care about). Just make sure that the catch-all match comes last, so it doesn't gobble up something you are interested in. With names being a string that holds the file names from your post, one per line:

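from pyparsing import *

def marker(key):
    # A case-insensitive keyword immediately followed by digits, e.g. "A0038" or "clone11"
    return Combine(CaselessLiteral(key) + Word(nums))

pat    = marker("a")
visit  = marker("r")
clone  = marker("clone")
primer = marker("ba") | marker("primer")
sep    = oneOf("- _").suppress()   # separators are matched but dropped from the results
other  = Word(alphanums + ":")     # catch-all for tokens we don't care about (tatI, V3, ...)

file_ext = Literal(".").suppress() + Word(alphanums)
EOL    = LineEnd().suppress()

# MatchFirst tries the alternatives in order, so the catch-all goes last
tokens = [pat("pat"),
          visit("visit"),
          clone("clone"),
          primer("primer"),
          sep, other]
grammar = OneOrMore(MatchFirst(tokens)) + file_ext + EOL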
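
To illustrate why the catch-all matters, here is a minimal sketch using a cut-down grammar rather than the full one above (the names known, without_other and with_other are made up for this example). OneOrMore stops at the first token that none of its alternatives match, so anything after an unknown word such as LTR is never reached:

```python
from pyparsing import (CaselessLiteral, Combine, OneOrMore, Word,
                       alphanums, nums, oneOf)

def marker(key):
    # Case-insensitive keyword immediately followed by digits
    return Combine(CaselessLiteral(key) + Word(nums))

sep = oneOf("- _").suppress()
known = marker("a") | marker("r") | marker("ba") | sep
other = Word(alphanums + ":")  # catch-all, tried only after the known markers

without_other = OneOrMore(known)
with_other = OneOrMore(known | other)

name = "A0001-R00-LTR-BA43"

# Stops at "LTR" because nothing matches it, so BA43 is never seen
print(without_other.parseString(name).asList())  # ['a0001', 'r00']

# "LTR" is consumed by the catch-all and parsing continues to the primer
print(with_other.parseString(name).asList())     # ['a0001', 'r00', 'LTR', 'ba43']
```

This is likely why the primer was found in A0001-R00-BA81-tat1.ab1, where it comes before the unknown token, but not in A0001-R00-LTR-BA43.ab1.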

By giving the intermediate results a name, e.g. clone("clone"), we can create a dictionary of them for easy access:

# scanString yields (tokens, start, end) tuples; the tokens are at index 0
for result in grammar.scanString(names):
    print(result[0].asDict())

resulting in

{'clone': 'clone11', 'primer': 'ba28', 'pat': 'a0038'}
{'clone': 'clone11', 'primer': 'ba31', 'pat': 'a038'}
{'clone': 'clone11', 'primer': 'ba32', 'pat': 'a0038'}
{'pat': 'a0001', 'primer': 'ba29', 'visit': 'r00'}
{'pat': 'a0001', 'primer': 'ba31', 'visit': 'r00'}
{'pat': 'a0001', 'primer': 'ba43', 'visit': 'r00'}
{'pat': 'a0001', 'primer': 'ba81', 'visit': 'r00'}
{'pat': 'a0002', 'primer': 'primer7', 'visit': 'r07'}
{'pat': 'a0053', 'primer': 'primer5', 'visit': 'r01'}
{'pat': 'a0016', 'primer': 'primer7', 'visit': 'r02'}
{'pat': 'a0054', 'primer': 'primer5', 'visit': 'r04'}
{'pat': 'a0054', 'primer': 'primer5', 'visit': 'r07'}
{'pat': 'a0037', 'primer': 'primer7', 'visit': 'r06'}
{'pat': 'a0037', 'primer': 'primer5', 'visit': 'r07'}
{'pat': 'a0041', 'primer': 'ba87', 'visit': 'r01'}
{'pat': 'a0094', 'primer': 'ba88', 'visit': 'r00'}
{'pat': 'a0094', 'primer': 'ba88', 'visit': 'r02'}
{'pat': 'a0107', 'primer': 'ba86', 'visit': 'r01'}
{'pat': 'a0111', 'primer': 'primer5', 'visit': 'r04'}
{'pat': 'a0179', 'primer': 'ba83', 'visit': 'r02'}
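
The dictionaries above simply leave out visit or clone when the file name does not contain them. One way to apply the defaults mentioned in the question is to merge each result into a dictionary of defaults. This is a small sketch in plain Python; DEFAULTS and with_defaults are hypothetical names, and the default values are written in lowercase on the assumption that they should match the lowercase output of the grammar:

```python
# Assumed defaults, taken from the question (R00 and clone01), lowercased
DEFAULTS = {"visit": "r00", "clone": "clone01"}

def with_defaults(parsed):
    """Return a copy of a parsed-name dict with missing fields filled in."""
    merged = dict(DEFAULTS)   # start from the defaults
    merged.update(parsed)     # parsed values override them
    return merged

row = with_defaults({"clone": "clone11", "primer": "ba28", "pat": "a0038"})
print(row["visit"], row["clone"])  # r00 clone11
```

In the loop above, this would be called as with_defaults(result[0].asDict()).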
Licensed under: CC-BY-SA with attribution