Question

I am looking for better ways of parsing structured text data in PHP and getting that data into a PHP object graph. I have seen a lot of PHP parsers for a variety of text-based file formats, but pretty much all of them seem to be a brittle chain of regular expressions. There must be a better way!

In this specific case I am looking to parse MT940 files (bank account transactions), but I have run into the same problem with other file formats as well. Invariably I end up with a big chain of regexes that becomes complex to maintain, especially when different formats need to be supported. MT940 has this problem too: it isn't a strictly defined format, and pretty much every bank uses a slightly different dialect.

So, how do you design parsers that are more robust and extendable to deal with different dialects?

Here's an example MT940 statement, taken from this question:

{1:F01AHHBCH110XXX0000000000}{2:I940X           N2}{3:{108:XBS/091502}}{4:
:20:XBS/091202/0001
:25:5887/507004-50
:28C:140/1
:60F:C0914CHF7789,
:61:0912021202D36,80NTRFNONREF//0887-1202-29-941
04392579-0 LUTHY + xxx, ZUR
:86:6034?60LUTHY + xxxx, ZUR vom 01.12.09 um 16:28 Karten-Nr. 2232
2579-0
:62F:C091202CHF52,2
:64:C091302CHF52,2
-}

Solution

You could use this free parser (GPL 2.0):

http://www.kingsquare.nl/php-mt940

Here's another:

http://www.butcher.art.pl/en/2010/09/tutoriale/parser-php-mt940-format-wyciagow-bankowych/

Hopefully this will allow you to forgo reinventing the wheel on this.

"So, how do you design parsers that are more robust and extendable to deal with different dialects?"

Unfortunately there's no easy answer to this. You'd have to buckle down and familiarize yourself with all the variants you wish to support. From the Kingsquare page:

The parser tries to determine which originating bank it is from via the first few lines of the file and then loads up the engine per bank.
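To make the "engine per bank" idea concrete, here is a minimal sketch of how such a dispatch could look. This is not the Kingsquare code: the interface, the class names, and the detection rules are all hypothetical, made up purely for illustration. A real engine would inspect bank-specific markers in the first few lines.

```php
<?php
declare(strict_types=1);

// Hypothetical interface: one engine per bank dialect.
interface Mt940Engine
{
    /** True if this engine recognises the raw statement text. */
    public function accepts(string $raw): bool;

    /** Dialect name, used here only to show which engine was picked. */
    public function name(): string;
}

// Illustrative rule: the statement is wrapped in SWIFT blocks ("{1:...").
final class SwiftEnvelopeEngine implements Mt940Engine
{
    public function accepts(string $raw): bool
    {
        return strncmp(ltrim($raw), '{1:', 3) === 0;
    }

    public function name(): string
    {
        return 'swift-envelope';
    }
}

// Illustrative rule: the text starts directly with the :20: field.
final class BareFieldsEngine implements Mt940Engine
{
    public function accepts(string $raw): bool
    {
        return strncmp(ltrim($raw), ':20:', 4) === 0;
    }

    public function name(): string
    {
        return 'bare-fields';
    }
}

final class EngineRegistry
{
    /** @var Mt940Engine[] */
    private array $engines = [];

    public function register(Mt940Engine $engine): void
    {
        $this->engines[] = $engine;
    }

    /** Return the first registered engine that accepts the text. */
    public function detect(string $raw): Mt940Engine
    {
        foreach ($this->engines as $engine) {
            if ($engine->accepts($raw)) {
                return $engine;
            }
        }
        throw new RuntimeException('No engine recognises this statement');
    }
}

$registry = new EngineRegistry();
$registry->register(new SwiftEnvelopeEngine());
$registry->register(new BareFieldsEngine());

// prints "swift-envelope"
echo $registry->detect("{1:F01AHHBCH110XXX0000000000}{4:\n:20:XBS/091202/0001\n")->name(), PHP_EOL;
```

Supporting a new dialect then means adding one class and registering it, rather than threading another branch through an existing regex chain.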

This will take a lot of experience and study. Fortunately, their code could help you along immensely.
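One thing that tends to help regardless of dialect is separating tokenizing from interpretation: first split the file into generic tag/value fields, then let bank-specific code interpret each field. Below is a rough sketch of the first step. The function name is my own, and it assumes that every line not starting with ":tag:" continues the previous field, which fits the sample in the question but may not hold for every bank.

```php
<?php
declare(strict_types=1);

/**
 * Split MT940 text into tag/value fields.
 * Assumption: a line that does not start with ":tag:" is a
 * continuation of the previous field.
 *
 * @return array<int, array{tag: string, value: string}>
 */
function tokenizeMt940(string $raw): array
{
    $fields = [];
    foreach (preg_split('/\R/', $raw) as $line) {
        // Skip blank lines and the "-}" that closes the text block.
        if ($line === '' || rtrim($line) === '-}') {
            continue;
        }
        if (preg_match('/^:(\d{2}[A-Z]?):(.*)$/', $line, $m) === 1) {
            // A new field such as :61: or :62F: starts here.
            $fields[] = ['tag' => $m[1], 'value' => $m[2]];
        } elseif ($fields !== []) {
            // Continuation line: append it to the most recent field.
            $fields[count($fields) - 1]['value'] .= "\n" . $line;
        }
        // Anything before the first tag (e.g. the SWIFT header) is ignored.
    }
    return $fields;
}

$sample = "{1:F01AHHBCH110XXX0000000000}{4:\n"
    . ":20:XBS/091202/0001\n"
    . ":61:0912021202D36,80NTRFNONREF//0887-1202-29-941\n"
    . "04392579-0 LUTHY + xxx, ZUR\n"
    . ":62F:C091202CHF52,2\n"
    . "-}";

foreach (tokenizeMt940($sample) as $field) {
    echo $field['tag'], ' => ', str_replace("\n", ' | ', $field['value']), PHP_EOL;
}
```

There is still one regex, but it only recognises field boundaries. The dialect-specific knowledge lives in whatever consumes the fields, so a quirk in one bank's :86: layout doesn't touch the code that handles :61:.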

Licensed under: CC-BY-SA with attribution