Question

I have a huge text file that I handle as one big string for efficiency reasons (I don't read the file line by line). In it, I want to delete every character that comes after -swf and before the next ||.

I have a huge text which looks like this:

bla bla bla ||NULL||abc-swf||NULL||NULL
bla bla bla ||NULL||cdacda-swfend%23wrapclass||NULL||NULL
bla bla bla ||NULL||bgdbgdbgd-swf%28ML%29endBeliefnet.Web.UI.S||NULL||NULL

I want the final result to look like this:

bla bla bla ||NULL||abc-swf||NULL||NULL
bla bla bla ||NULL||cdacda-swf||NULL||NULL
bla bla bla ||NULL||bgdbgdbgd-swf||NULL||NULL

I can do this line by line using the partition function in Python, but that takes a lot of time because the file has more than 10M rows. Is there any way to do this without examining the file line by line?

Solution

This should do what you want

import re

s = '''bla bla bla ||NULL||abc-swf||NULL||NULL
bla bla bla ||NULL||cdacda-swfend%23wrapclass||NULL||NULL
bla bla bla ||NULL||bgdbgdbgd-swf%28ML%29endBeliefnet.Web.UI.S||NULL||NULL'''

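# The original pattern stops at a single pipe character |, so it is kept only for reference:
# bad_regex = re.compile(r'(?<=swf)[^|]+')
regex = re.compile(r'(?<=-swf).*?(?=\|\|)')  # matches everything between -swf and ||
s = regex.sub('', s)  # sub() returns a new string, so the result must be assigned

Output:

>>> print(s)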
bla bla bla ||NULL||abc-swf||NULL||NULL
bla bla bla ||NULL||cdacda-swf||NULL||NULL
bla bla bla ||NULL||bgdbgdbgd-swf||NULL||NULL

Edit 1: The regex I gave in the original answer fails if the text for removal has a '|' character in it. I've replaced it with a regex that doesn't have this problem.
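
To illustrate the difference described in this edit, here is a minimal sketch that runs both patterns on a row whose removable text contains a single '|'. The sample row and the variable names are made up for the illustration and are not from the question's data.

```python
import re

# Original pattern: the character class [^|] stops at the first single pipe.
old_regex = re.compile(r'(?<=swf)[^|]+')
# Replacement pattern: lazily consume text up to the next double pipe.
new_regex = re.compile(r'(?<=-swf).*?(?=\|\|)')

# Hypothetical row whose removable part ("end|more") contains a single '|'.
row = 'bla ||NULL||abc-swfend|more||NULL||NULL'

old_result = old_regex.sub('', row)
new_result = new_regex.sub('', row)

print(old_result)  # bla ||NULL||abc-swf|more||NULL||NULL
print(new_result)  # bla ||NULL||abc-swf||NULL||NULL
```

The old pattern leaves "|more" behind, while the lookahead version removes everything up to the delimiter.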

Other tips

To make it really fast you could try Cython. You could also first check whether this performs better:

def test_speed():
    row_text = 'bla bla bla ||NULL||cdacda-swfend%23wrapclass||NULL||NULL'
    string_list = row_text.split('||') # which gives a list
    # Then only partition in the string_list[2] area -> 
    string_list[2] = ''.join(string_list[2].partition('-swf')[0:2])
    # then join it together again: 
    row_text = '||'.join(string_list)

%timeit test_speed()
100000 loops, best of 3: 1.36 µs per loop

Just some ideas, but this seems to be quite fast.

Edit: looking at Kevin's regex example:

import re
regex = re.compile(r'(?<=swf)[^|]+')
def test_regex_speed(regex):
    row_text = 'bla bla bla ||NULL||cdacda-swfend%23wrapclass||NULL||NULL'
    regex.sub('', row_text)

%timeit test_regex_speed(regex)
100000 loops, best of 3: 2.16 µs per loop

So that's a bit slower, but you could do the entire file at once with the regex.
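
The %timeit lines above only work inside IPython. As a rough sketch, the same comparison could be run in plain Python with the standard timeit module. The function names here are made up, the regex is the lookahead pattern from the accepted answer, and the timings will differ from machine to machine, so none are shown.

```python
import re
import timeit

ROW = 'bla bla bla ||NULL||cdacda-swfend%23wrapclass||NULL||NULL'
REGEX = re.compile(r'(?<=-swf).*?(?=\|\|)')

def with_partition(row):
    # Split on the delimiter, trim only the third field, then rejoin.
    fields = row.split('||')
    fields[2] = ''.join(fields[2].partition('-swf')[:2])
    return '||'.join(fields)

def with_regex(row):
    # Remove everything between -swf and the next ||.
    return REGEX.sub('', row)

partition_result = with_partition(ROW)
regex_result = with_regex(ROW)

# Time 100,000 calls of each approach on the same row.
partition_time = timeit.timeit(lambda: with_partition(ROW), number=100_000)
regex_time = timeit.timeit(lambda: with_regex(ROW), number=100_000)
print(f'partition: {partition_time:.3f} s, regex: {regex_time:.3f} s per 100,000 calls')
```

Note that the partition version assumes the -swf value always sits in the third field, whereas the regex does not depend on the field position.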

Edit 2: Sorry, I missed that the entire file is already in memory. For optimal memory usage I would still suggest going through large files row by row, though.

Licensed under: CC-BY-SA with attribution