Python Regex Replace Matching Text

Question 1

This should do what you want:

import re

s = '''bla bla bla ||NULL||abc-swf||NULL||NULL
bla bla bla ||NULL||cdacda-swfend%23wrapclass||NULL||NULL
bla bla bla ||NULL||bgdbgdbgd-swf%28ML%29endBeliefnet.Web.UI.S||NULL||NULL'''

# bad_regex = re.compile(r'(?<=swf)[^|]+')  # stops at a single pipe character |
regex = re.compile(r'(?<=-swf).*?(?=\|\|)')  # matches everything between -swf and ||
s = regex.sub('', s)  # sub() returns a new string, so assign the result back

Output:

>>> print(s)
bla bla bla ||NULL||abc-swf||NULL||NULL
bla bla bla ||NULL||cdacda-swf||NULL||NULL
bla bla bla ||NULL||bgdbgdbgd-swf||NULL||NULL

Edit 1: The regex I gave in the original answer fails if the text for removal has a '|' character in it. I've replaced it with a regex that doesn't have this problem.
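
To illustrate the difference described in Edit 1, here is a minimal sketch comparing the two patterns. The sample row (with a single '|' inside the text to be removed) and the variable names are made up for this example; they are not from the original data.

```python
import re

# Hypothetical row: the text to remove ("end|extra") contains a single '|'
row = 'bla ||NULL||abc-swfend|extra||NULL||NULL'

old_regex = re.compile(r'(?<=swf)[^|]+')         # original pattern: stops at the first '|'
new_regex = re.compile(r'(?<=-swf).*?(?=\|\|)')  # replacement: stops only at '||'

old_result = old_regex.sub('', row)
new_result = new_regex.sub('', row)

print(old_result)  # bla ||NULL||abc-swf|extra||NULL||NULL  (leftover "|extra")
print(new_result)  # bla ||NULL||abc-swf||NULL||NULL
```

The lazy `.*?` together with the `(?=\|\|)` lookahead keeps consuming characters until the two-character delimiter appears, so a lone pipe inside the field no longer ends the match early.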

Question 2

To make this really fast you could try Cython. Before going that far, though, it may be worth checking whether plain string methods perform better than the regex:

def test_speed():
    row_text = 'bla bla bla ||NULL||cdacda-swfend%23wrapclass||NULL||NULL'
    # Split the row into its fields on the '||' delimiter (gives a list)
    string_list = row_text.split('||')
    # Only the third field contains '-swf': keep everything up to and including it
    string_list[2] = ''.join(string_list[2].partition('-swf')[0:2])
    # Join the fields back together and return the cleaned row
    return '||'.join(string_list)

%timeit test_speed()
100000 loops, best of 3: 1.36 µs per loop

Just an idea, but it seems to be quite fast.

Edit: looking at Kevin's regex example:

import re
regex = re.compile(r'(?<=swf)[^|]+')
def test_regex_speed(regex):
    row_text = 'bla bla bla ||NULL||cdacda-swfend%23wrapclass||NULL||NULL'
    return regex.sub('', row_text)  # return the cleaned row instead of discarding it

%timeit test_regex_speed(regex)
100000 loops, best of 3: 2.16 µs per loop

So the regex is a bit slower per row, but it can process the entire file in a single call.
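
Outside IPython, where `%timeit` is not available, the same comparison can be sketched with the standard-library `timeit` module. The function names `split_approach` and `regex_approach` are hypothetical, and the regex used here is the corrected one from the first answer; absolute timings will vary by machine, so none are shown.

```python
import re
import timeit

ROW = 'bla bla bla ||NULL||cdacda-swfend%23wrapclass||NULL||NULL'
REGEX = re.compile(r'(?<=-swf).*?(?=\|\|)')

def split_approach(row_text):
    # Assumes the '-swf' text always sits in the third '||'-separated field
    fields = row_text.split('||')
    fields[2] = ''.join(fields[2].partition('-swf')[0:2])
    return '||'.join(fields)

def regex_approach(row_text):
    return REGEX.sub('', row_text)

# Both approaches should produce the same cleaned row
split_result = split_approach(ROW)
regex_result = regex_approach(ROW)

# Time 10,000 calls of each; timeit.timeit returns total seconds as a float
split_time = timeit.timeit(lambda: split_approach(ROW), number=10_000)
regex_time = timeit.timeit(lambda: regex_approach(ROW), number=10_000)
print(f'split: {split_time:.4f}s  regex: {regex_time:.4f}s')
```

Note that the split version depends on the field position being fixed, while the regex does not; that may matter more than the speed difference if the rows are irregular.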

Edit 2: Sorry, I missed that the entire file is already in memory. For large files, though, I would still suggest going through them row by row to keep memory usage down.