How can you employ multi-digit wildcards in python to parse a string using varying patterns?

https://stackoverflow.com/questions/22240724

10-06-2023
Question

I currently have several thousand files that I have to parse, and each file contains a single line of data. Two examples:

(CSoc:0.00825830327156463345,(PChapmani:0.00000254996576400768,PPatrius:0.00917039554517301569):0.16367666117488463562,CHaigi:0.00401845774067355072):0.0;

((CSoc:0.00298782818816040099,CHaigi:0.00148583412998809050):0.27857371651571366522,PPatrius:0.00188545323182991121,PChapmani:0.00799482946501124843):0.0;

My goal is to go into each file and insert a string immediately after the floating-point numbers that follow "CSoc:" and "PChapmani:". As you can see from these two examples, the order of the entries changes from file to file, and that order has to be maintained. I tried using .split() to do this, but the layout varies too much between files. For example, sometimes the number after "CSoc:" or after "PChapmani:" is immediately followed by a comma, and sometimes it is immediately followed by a parenthesis. I also tried using regex, but I am failing miserably...

Here is my pitiful regex attempt thus far:

for line in infile:
    print line
    r = re.compile('CSoc:(\d+).(\d+),')
    print r.split(line)

I'm not even trying to insert the string at this point, just trying to figure out how to distinguish different parts of the string based on these patterns.

Just to try to be as clear as possible, here is what I'm hoping to get eventually:

(CSoc:0.00825830327156463345 STRING,(PChapmani:0.00000254996576400768 STRING,PPatrius:0.00917039554517301569):0.16367666117488463562,CHaigi:0.00401845774067355072):0.0;
((CSoc:0.00298782818816040099 STRING,CHaigi:0.00148583412998809050):0.27857371651571366522,PPatrius:0.00188545323182991121,PChapmani:0.00799482946501124843 STRING):0.0;

Thanks very much for your time.

SOLUTION:

user2289175 provided an answer (below) that seems to work fine for me, even though I have some difficulty understanding the coding. This is how I implemented it:

import re

string = "TESTSTRING"
for file in filelist:
    openfile = open(file, "r")
    for line in openfile:
        print("1: " + line)
        # Append the string after the number that follows each label
        line = re.sub(r"(CSoc:[0-9\.]+)", r"\1 " + string, line)
        line = re.sub(r"(PChapmani:[0-9\.]+)", r"\1 " + string, line)
        print("2: " + line)
    openfile.close()

This provides me with the original line (1) and the new line (2) for comparison. To be honest I was only expecting this to work when the numbers were followed immediately by a ')' but it works for any situation I've thrown at it thus far... Here is some example output:

1: ((PPatrius:0.00204974573878130778,PChapmani:0.00505729864425210219):0.18772783359999054009,CSoc:0.00901378811915975846,CHaigi:0.00000166275543481961):0.0;
2: ((PPatrius:0.00204974573878130778,PChapmani:0.00505729864425210219 TESTSTRING):0.18772783359999054009,CSoc:0.00901378811915975846 TESTSTRING,CHaigi:0.00000166275543481961):0.0;

1: (CSoc:0.00536514757027959765,(PChapmani:0.00160443687004130928,PPatrius:0.00393832871636974006):0.08600185225519103860,CHaigi:0.00555651009595325897):0.0;
2: (CSoc:0.00536514757027959765 TESTSTRING,(PChapmani:0.00160443687004130928 TESTSTRING,PPatrius:0.00393832871636974006):0.08600185225519103860,CHaigi:0.00555651009595325897):0.0;

1: ((PPatrius:0.00448104193048302988,PChapmani:0.00000271124757644997):0.26894791764191683381,CSoc:0.00341363919340215930,CHaigi:0.00000271124757644997):0.0;
2: ((PPatrius:0.00448104193048302988,PChapmani:0.00000271124757644997 TESTSTRING):0.26894791764191683381,CSoc:0.00341363919340215930 TESTSTRING,CHaigi:0.00000271124757644997):0.0;
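
A note on why the character after the number does not matter: the class [0-9\.]+ consumes digits and dots and stops at the first character that is neither, so the match ends the same way whether a ',' or a ')' comes next. The two re.sub calls can also be folded into one pattern. The following is only a sketch, assuming the numbers are always plain decimals like the ones above (no exponent notation); the function name insert_after_values and its parameters are made up for illustration.

```python
import re

def insert_after_values(line, names, tag):
    """Return line with ' ' + tag added after the number following each name."""
    # Build one alternation such as CSoc|PChapmani; re.escape protects
    # against labels that contain regex metacharacters.
    alternation = "|".join(re.escape(name) for name in names)
    # (?<![A-Za-z]) stops a label from matching inside a longer label.
    # [0-9.]+ ends at the first character that is not a digit or a dot.
    pattern = re.compile(r"(?<![A-Za-z])((?:%s):[0-9.]+)" % alternation)
    # A function replacement keeps any backslashes in tag literal.
    return pattern.sub(lambda m: m.group(1) + " " + tag, line)
```

Calling insert_after_values(line, ["CSoc", "PChapmani"], "STRING") on the two example lines from the question should give the two desired lines shown there, with the other entries left untouched.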

Thanks again! Hope others find this helpful.

Solution

Use Python's re module for this: http://docs.python.org/2/library/re.html

Something like this should work:

import re

for line in infile:
    # re.sub(pattern, replacement, string)
    line = re.sub(r"([a-zA-Z]+:[0-9\.]+)", r"\1 STRING", line)
    print(line)

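Briefly, parentheses ( ) mark the start and end of a group: the regex engine matches the expression inside them and remembers the text it matched. Groups are numbered from 1 in the order their opening parentheses appear, and in the replacement string \1 is a backreference to whatever the first group matched. Note that this pattern matches every label that is followed by a number, so to tag only certain labels, spell them out in the pattern in place of [a-zA-Z]+. For more information about the regular expression syntax, check out the link above.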
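
To make the group numbering concrete, here is a small sketch on a shortened, made-up line in the same format (the values are invented for illustration, not taken from the files). It shows what \1 refers to with one group, and how a second pair of parentheses yields a second group.

```python
import re

# Shortened, invented example in the same layout as the real lines.
line = "(CSoc:0.5,(PChapmani:0.25,PPatrius:0.75):0.1,CHaigi:0.3):0.0;"

# One group: \1 is the whole "label:number" text that was matched.
tagged = re.sub(r"([a-zA-Z]+:[0-9.]+)", r"\1 STRING", line)
print(tagged)

# Two groups: group 1 is the label, group 2 is the number.
# findall returns one (group 1, group 2) tuple per match.
pairs = re.findall(r"([a-zA-Z]+):([0-9.]+)", line)
print(pairs)
```

The values such as ):0.1 are left alone because nothing matching [a-zA-Z]+ sits directly before the colon.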

Cheers!

Licensed under: CC-BY-SA with attribution