best approach for my pattern match

https://stackoverflow.com/questions/22351063

13-06-2023

Question

I've built a regex for the following format:

4!a2!a2!c[3!c]

which translates to:

4 alpha characters, followed by
2 alpha characters, followed by
2 characters, followed by
3 optional characters

This is the standard format for a SWIFT BIC code, for example HSBCGB2LXXX.
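As a rough illustration, the format specification can be written out as a regex directly. This is a sketch rather than a validated BIC checker: it assumes `a` means a letter and `c` means a letter or digit, and it accepts lower case because the regex in this question does. The names `BIC_FORMAT` and `is_bic` are made up for the example.

```python
import re

# 4!a 2!a 2!c [3!c]: 4 letters, 2 letters, 2 alphanumerics,
# then an optional group of exactly 3 alphanumerics.
# Assumption: lower case is accepted, as in the question's own regex.
BIC_FORMAT = re.compile(
    r"[A-Za-z]{4}[A-Za-z]{2}[A-Za-z0-9]{2}(?:[A-Za-z0-9]{3})?"
)


def is_bic(text: str) -> bool:
    """Return True if the whole string has the 8- or 11-character shape."""
    return BIC_FORMAT.fullmatch(text) is not None


print(is_bic("HSBCGB2LXXX"))  # 11 characters
print(is_bic("HSBCGB2L"))     # 8 characters, optional part omitted
print(is_bic("HSBCGB2LXX"))   # 10 characters, not a valid length
```

Note that the last three characters are optional as a group, so a string of 9 or 10 characters does not fit the format.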

My regex to pull this out of a string is:

(?<=:32[^:]:)(([a-zA-Z]{4}[a-zA-Z]{2})[0-9][a-zA-Z]{1}[X]{3})

This targets a specific tag (32) and works. However, I'm not sure it's the cleanest approach, and it fails if there are any characters before the H.

the string being matched against is:

:32B:HsBfGB4LXXXHELLO

The string above returns HsBfGB4LXXX, but this:

:32B:2HsBfGB4LXXXHELLO

returns nothing.
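The behaviour described above can be reproduced with a short script. This sketch uses Python's `re` module, which is an assumption because the question does not say which regex engine is in use. The helper name `find_bic` is made up for the example.

```python
import re

# The regex from the question, unchanged. The lookbehind is fixed-width
# (5 characters), which Python's re module requires.
ORIGINAL = re.compile(
    r"(?<=:32[^:]:)(([a-zA-Z]{4}[a-zA-Z]{2})[0-9][a-zA-Z]{1}[X]{3})"
)


def find_bic(line):
    """Return group 1 of the first match, or None if nothing matches."""
    match = ORIGINAL.search(line)
    return match.group(1) if match else None


print(find_bic(":32B:HsBfGB4LXXXHELLO"))   # matches right after the tag
print(find_bic(":32B:2HsBfGB4LXXXHELLO"))  # the extra "2" blocks the match
```

The second call fails because the lookbehind pins the match to the position immediately after the tag, where the next character is a digit rather than a letter.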

EDIT

For clarity: I have a string which contains multiple lines, each starting with a tag of the form colon, two digits, optional letter, colon (e.g. :58A:). I want to specify the line to start matching in and return a BIC from anywhere in that line.

EDIT Some more example data to help:

:20:ABCDERF  Z
:23B:CRED
:32A:140310AUD2120,
:33B:AUD2120,
:50K:/111222333                        
Mr Bank of Dad              
Dads house
England            
:52D:/DBEL02010987654321
address 1 
address 2
:53B:/HSBCGB2LXXX
:57A://AU124040
AREFERENCE
:59:/44556677
A line which HSBCGB2LXXX contains a BIC
:70:Another line of data
:71A:Even more

So I need to pass in the tag (53 or 59) as a variable and get back only the BIC, HSBCGB2LXXX.

Solution

Your regex can be simplified, and corrected to allow a character before the H, to:

:32[^:]:.?([a-zA-Z]{6}\d[a-zA-Z]XXX)

The changes made were:

Dropped the lookbehind: the tag is simply made part of the match
Inserted .? meaning "one optional character"
([a-zA-Z]{4}[a-zA-Z]{2}) ==> [a-zA-Z]{6} (4+2=6)
[0-9] ==> \d (\d means "any digit")
[X]{3} ==> XXX (easier to read and fewer characters)

Group 1 of the match contains your target
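For illustration, here is how the simplified regex might be used to read group 1. The sketch assumes Python's `re` module, and the helper name `extract_bic` is hypothetical.

```python
import re

# The simplified regex from this answer: the tag is part of the match,
# and the BIC itself is captured in group 1.
SIMPLIFIED = re.compile(r":32[^:]:.?([a-zA-Z]{6}\d[a-zA-Z]XXX)")


def extract_bic(line):
    """Return the captured BIC (group 1), or None if there is no match."""
    match = SIMPLIFIED.search(line)
    return match.group(1) if match else None


for sample in (":32B:HsBfGB4LXXXHELLO", ":32B:2HsBfGB4LXXXHELLO"):
    print(sample, "->", extract_bic(sample))
```

Because .? allows at most one character, a line with two or more characters between the tag and the BIC would still not match.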

OTHER TIPS

I'm not quite sure if I understand your question completely, as your regular expression does not completely match what you have described above it. For example, you mentioned 3 optional characters, but in the regexp you use 3 mandatory X-es.

However, the actual regular expression can be further cleaned:

instead of [a-zA-Z]{4}[a-zA-Z]{2}, you can simply use [a-zA-Z]{6}, and the grouping parentheses around this might be unnecessary;
the {1} can be left out without any change in the result;
the X does not need surrounding brackets.

All in all (?<=:32[^:]:)([a-zA-Z]{6}[0-9][a-zA-Z]X{3}) is shorter and matches in the very same cases.
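A quick spot check of that claim, assuming Python's `re` module. It only compares the two patterns on a handful of sample strings, so it is a sanity check rather than a proof of equivalence. The names `SAMPLES` and `same_result` are made up for the example.

```python
import re

# The question's original regex and the shortened form from this answer.
ORIGINAL = re.compile(
    r"(?<=:32[^:]:)(([a-zA-Z]{4}[a-zA-Z]{2})[0-9][a-zA-Z]{1}[X]{3})"
)
SHORTER = re.compile(r"(?<=:32[^:]:)([a-zA-Z]{6}[0-9][a-zA-Z]X{3})")

# Sample strings: the two from the question plus one more for the same tag.
SAMPLES = [
    ":32B:HsBfGB4LXXXHELLO",
    ":32B:2HsBfGB4LXXXHELLO",
    ":32A:HSBCGB2LXXX",
]


def same_result(line):
    """True if both patterns agree on the matched text (or both fail)."""
    old = ORIGINAL.search(line)
    new = SHORTER.search(line)
    return (old.group(1) if old else None) == (new.group(1) if new else None)


print(all(same_result(line) for line in SAMPLES))
```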

If you give a better description of the domain, further improvements are probably possible.
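With the sample data from the question's edit, one possible approach is to work in two steps: first isolate the field that belongs to the requested tag, then search that field for something shaped like a BIC. The following is a sketch under several assumptions: Python's `re` module is used, a field runs from its tag to the next line that starts with a tag, and a BIC is upper case and stands alone as a word. The name `bic_for_tag` is hypothetical.

```python
import re
from typing import Optional

# Assumption: an upper-case BIC of 8 or 11 characters, delimited by word
# boundaries so that ordinary lower-case words are not picked up.
BIC = re.compile(r"\b[A-Z]{6}[A-Z0-9]{2}(?:[A-Z0-9]{3})?\b")

MESSAGE = """
:20:ABCDERF  Z
:23B:CRED
:32A:140310AUD2120,
:33B:AUD2120,
:50K:/111222333
Mr Bank of Dad
Dads house
England
:52D:/DBEL02010987654321
address 1
address 2
:53B:/HSBCGB2LXXX
:57A://AU124040
AREFERENCE
:59:/44556677
A line which HSBCGB2LXXX contains a BIC
:70:Another line of data
:71A:Even more
"""


def bic_for_tag(message: str, tag: str) -> Optional[str]:
    """Return the first BIC-shaped token in the field for `tag`, if any."""
    # A field starts at ":<tag><optional letter>:" at the start of a line
    # and runs up to the next line that starts with a tag, or to the end.
    field = re.compile(
        r"^:" + re.escape(tag) + r"[A-Z]?:(.*?)(?=^:\d{2}[A-Z]?:|\Z)",
        re.MULTILINE | re.DOTALL,
    )
    found = field.search(message)
    if found is None:
        return None
    bic = BIC.search(found.group(1))
    return bic.group(0) if bic else None


print(bic_for_tag(MESSAGE, "53"))
print(bic_for_tag(MESSAGE, "59"))
print(bic_for_tag(MESSAGE, "20"))
```

A limitation of this sketch is that any upper-case word of exactly 8 or 11 characters would also be reported as a BIC, so real data may need a stricter check.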

Licensed under: CC-BY-SA with attribution
