Pyparsing CSV string with random quotes

https://stackoverflow.com/questions/2797644

04-10-2019

Question

I have a string like the following:

<118>date=2010-05-09,time=16:41:27,device_id=FE-2KA3F09000049,log_id=0400147717,log_part=00,type=statistics,subtype=n/a,pri=information,session_id=o49CedRc021772,from="prvs=4745cd07e1=example@example.org",mailer="mta",client_name="example.org,[194.177.17.24]",resolved=OK,to="example@example.org",direction="in",message_length=6832079,virus="",disposition="Accept",classifier="Not,Spam",subject="=?windows-1255?B?Rlc6IEZ3OiDg5fDp5fog+fno5fog7Pf46eHp7S3u4+Tp7SE=?="

I tried the csv module, but it doesn't fit, because I couldn't find a way to make it ignore commas inside the quoted parts. Pyparsing looked like a better answer, but I couldn't work out how to declare the whole grammar.

At the moment I'm parsing this with my old Perl script, but I want it written in Python. If you need my Perl snippet, I'll be glad to provide it.

Any help is appreciated.
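To illustrate the problem described in the question, here is a minimal sketch of what the standard csv module does with this kind of input. The shortened `line` is a made-up excerpt of the log string above. The csv reader only treats a quote as special when it opens a field, so a quote that appears after `key=` is taken literally and the comma inside it still splits the field.

```python
import csv

# Shortened, hypothetical excerpt of the log line from the question.
line = 'resolved=OK,client_name="example.org,[194.177.17.24]",classifier="Not,Spam"'

# csv.reader accepts any iterable of lines; parse the single line.
fields = next(csv.reader([line]))

for field in fields:
    print(field)
```

The quoted values come back cut in two at their embedded commas, which is why a plain CSV split is not enough here.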

Solution

I'm not sure what you're really looking for, but

import re
data = "date=2010-05-09,time=16:41:27,device_id=FE-2KA3F09000049,log_id=0400147717,log_part=00,type=statistics,subtype=n/a,pri=information,session_id=o49CedRc021772,from=\"prvs=4745cd07e1=example@example.org\",mailer=\"mta\",client_name=\"example.org,[194.177.17.24]\",resolved=OK,to=\"example@example.org\",direction=\"in\",message_length=6832079,virus=\"\",disposition=\"Accept\",classifier=\"Not,Spam\",subject=\"=?windows-1255?B?Rlc6IEZ3OiDg5fDp5fog+fno5fog7Pf46eHp7S3u4+Tp7SE=?=\""
pattern = r"""(\w+)=((?:"(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'|[^\\,"'])+)"""
print(re.findall(pattern, data))

gives you

[('date', '2010-05-09'), ('time', '16:41:27'), ('device_id', 'FE-2KA3F09000049'),
 ('log_id', '0400147717'), ('log_part', '00'), ('type', 'statistics'),
 ('subtype', 'n/a'), ('pri', 'information'), ('session_id', 'o49CedRc021772'),
 ('from', '"prvs=4745cd07e1=example@example.org"'), ('mailer', '"mta"'),
 ('client_name', '"example.org,[194.177.17.24]"'), ('resolved', 'OK'),
 ('to', '"example@example.org"'), ('direction', '"in"'),
 ('message_length', '6832079'), ('virus', '""'), ('disposition', '"Accept"'),
 ('classifier', '"Not,Spam"'), 
 ('subject', '"=?windows-1255?B?Rlc6IEZ3OiDg5fDp5fog+fno5fog7Pf46eHp7S3u4+Tp7SE=?="')
]

You might want to strip the quotes from the quoted strings afterwards (using mystring.strip("'\"")).
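As a rough sketch of that clean-up step, the pairs can be collected into a dictionary while stripping the surrounding quotes. The names `parse_pairs` and `sample` are made up for this illustration, and `sample` is a shortened excerpt of the original log line.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
PATTERN = re.compile(
    r"""(\w+)=((?:"(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'|[^\\,"'])+)"""
)

def parse_pairs(line):
    """Return a dict of key -> value with surrounding quotes stripped."""
    return {key: value.strip("'\"") for key, value in PATTERN.findall(line)}

# Shortened, hypothetical excerpt of the log line from the question.
sample = 'resolved=OK,virus="",classifier="Not,Spam",client_name="example.org,[194.177.17.24]"'
result = parse_pairs(sample)
print(result)
```

Note that str.strip removes every leading and trailing quote character, not just one pair, so a value that ends in an escaped quote would lose more than intended.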

Edit: This regex now also correctly handles escaped quotes inside quoted strings (a="She said \"Hi!\"").
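A small sketch of that case, assuming backslash is the only escape character: the helper `unquote` (a hypothetical name, not part of the answer) removes exactly one pair of surrounding quotes and then drops the escaping backslashes, which avoids stripping too many quote characters from the ends.

```python
import re

PATTERN = re.compile(
    r"""(\w+)=((?:"(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'|[^\\,"'])+)"""
)
UNESCAPE = re.compile(r"\\(.)")  # a backslash followed by any character

def unquote(value):
    """Remove one pair of matching surrounding quotes and unescape the rest."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
        return UNESCAPE.sub(r"\1", value[1:-1])
    return value

# Hypothetical input containing escaped quotes inside a quoted value.
line = r'a="She said \"Hi!\"",b=2'
pairs = [(key, unquote(value)) for key, value in PATTERN.findall(line)]
print(pairs)
```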

Explanation of the regex:

(\w+)=((?:"(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'|[^\\,"'])+)

(\w+): Match the identifier and capture it in backreference no. 1

=: Match a =

(: Capture the following in backreference no. 2:

(?:: One of the following:

"(?:\\.|[^\\"])*": A double quote, followed by zero or more of the following: an escaped character, or a character that is neither a double quote nor a backslash; then a closing double quote

|: or

'(?:\\.|[^\\'])*': See above, just for single quotes

|: or

[^\\,"']: A character that is neither a backslash, a comma, nor a quote

)+: Repeat at least once, as many times as possible

): End of capturing group no. 2.
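The same breakdown can be kept next to the pattern itself with re.VERBOSE, which ignores whitespace outside character classes and allows `#` comments. This is only a sketch. The names `COMPACT`, `VERBOSE` and `sample` are made up here, and the check at the end simply confirms that both spellings find the same pairs on one sample string.

```python
import re

# Compact form, as given in the answer.
COMPACT = re.compile(
    r"""(\w+)=((?:"(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'|[^\\,"'])+)"""
)

# Annotated form of the same pattern.
VERBOSE = re.compile(r"""
    (\w+)                      # group 1: the identifier (key)
    =                          # a literal equals sign
    (                          # group 2: the value
      (?:
          "(?:\\.|[^\\"])*"    # double-quoted string, backslash escapes allowed
        | '(?:\\.|[^\\'])*'    # single-quoted string, backslash escapes allowed
        | [^\\,"']             # one character that is not a backslash, comma or quote
      )+                       # at least one piece, as many as possible
    )
""", re.VERBOSE)

# Shortened, hypothetical excerpt of the log line from the question.
sample = 'type=statistics,virus="",classifier="Not,Spam"'
print(VERBOSE.findall(sample) == COMPACT.findall(sample))
```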

Other suggestions

It might be better to leverage an existing parser rather than use ad hoc regexes.

parse_http_list(s)
    Parse lists as described by RFC 2068 Section 2.

    In particular, parse comma-separated lists where the elements of
    the list may include quoted-strings.  A quoted-string could
    contain a comma.  A non-quoted string could have quotes in the
    middle.  Neither commas nor quotes count if they are escaped.
    Only double-quotes count, not single-quotes.

parse_keqv_list(l)
    Parse list of key=value strings where keys are not duplicated.

Example:

>>> pprint.pprint(urllib2.parse_keqv_list(urllib2.parse_http_list(s)))
{'<118>date': '2010-05-09',
 'classifier': 'Not,Spam',
 'client_name': 'example.org,[194.177.17.24]',
 'device_id': 'FE-2KA3F09000049',
 'direction': 'in',
 'disposition': 'Accept',
 'from': 'prvs=4745cd07e1=example@example.org',
 'log_id': '0400147717',
 'log_part': '00',
 'mailer': 'mta',
 'message_length': '6832079',
 'pri': 'information',
 'resolved': 'OK',
 'session_id': 'o49CedRc021772',
 'subject': '=?windows-1255?B?Rlc6IEZ3OiDg5fDp5fog+fno5fog7Pf46eHp7S3u4+Tp7SE=?=',
 'subtype': 'n/a',
 'time': '16:41:27',
 'to': 'example@example.org',
 'type': 'statistics',
 'virus': ''}
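The example above uses urllib2, which exists only in Python 2. In Python 3 the same two helpers are available from urllib.request, although as far as I can tell they are not part of its documented API, so treat this as a sketch rather than a supported interface. The function name `parse_log_line` is made up, and removing the leading `<118>` rests on the assumption that it is a numeric prefix in angle brackets that should not become part of the first key.

```python
import re
from urllib.request import parse_http_list, parse_keqv_list

def parse_log_line(line):
    """Parse a comma-separated key=value log line into a dict."""
    # Assumption: a leading "<digits>" prefix is not part of the first key.
    line = re.sub(r"^<\d+>", "", line)
    # Split on commas outside double quotes, then split each item on the first "=".
    return parse_keqv_list(parse_http_list(line))

# Shortened, hypothetical excerpt of the log line from the question.
sample = '<118>date=2010-05-09,client_name="example.org,[194.177.17.24]",virus="",classifier="Not,Spam"'
fields = parse_log_line(sample)
print(fields)
```

As the docstrings above say, only double quotes are recognized by these helpers, so single-quoted values would be returned with their quotes intact.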

Licensed under: CC-BY-SA with attribution