Python: Checking a list with regex, filling in blanks

https://stackoverflow.com/questions/21470039

05-10-2022
Question

I've tried to find ways to do this and searched online here, but cannot find examples to help me figure this out.

I'm reading in rows from a large csv and changing each row to a list. The problem is that the data source isn't very clean. It has empty strings or bad data sometimes, and I need to fill in default values when that happens. For example:

list_ex1 = ['apple','9','','2012-03-05','455.6']
list_ex2 = ['pear','0','45','wrong_entry','565.11']

Here, list_ex1 has a blank third entry and list_ex2 has erroneous data where a date should be. To be clear, I can create a regex that limits what each of the five entries should be:

reg_ex_check = ['[A-Za-z]+','[0-9]','[0-9]','[0-9]{4}-[0-1][0-9]-[0-3][0-9]','[0-9.]+']

That is:

1st entry: A string, no numbers
2nd entry: Exactly one digit between 0 and 9
3rd entry: Exactly one digit as well.
4th entry: Date in standard format (allowing any four digit ints for year)
5th entry: Float

If an entry is blank OR does not match the regular expression, then it should be filled in/replaced with the following defaults:

default_fill = ['empty','0','0','2000-01-01','0']

I'm not sure what the best way to go about this is. I think I could write a complicated loop, but that doesn't feel very 'pythonic' to me.

Any better ideas?

Solution

Use zip and a conditional expression in a list comprehension:

[x if re.match(r,x) else d for x,r,d in zip(list_ex2,reg_ex_check,default_fill)]
Out[14]: ['pear', '0', '45', '2000-01-01', '565.11']
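
For reference, here is a self-contained sketch of the same idea that can be run as a script. The function name fill_defaults is only an illustrative choice, not something from the question; the lists are copied from it.

```python
import re

# Patterns and defaults copied from the question.
reg_ex_check = ['[A-Za-z]+', '[0-9]', '[0-9]',
                '[0-9]{4}-[0-1][0-9]-[0-3][0-9]', '[0-9.]+']
default_fill = ['empty', '0', '0', '2000-01-01', '0']


def fill_defaults(row):
    """Keep each field that matches its pattern; otherwise use its default.

    fill_defaults is a hypothetical helper name used for illustration.
    """
    return [value if re.match(pattern, value) else default
            for value, pattern, default in zip(row, reg_ex_check, default_fill)]


list_ex1 = ['apple', '9', '', '2012-03-05', '455.6']
list_ex2 = ['pear', '0', '45', 'wrong_entry', '565.11']

print(fill_defaults(list_ex1))  # ['apple', '9', '0', '2012-03-05', '455.6']
print(fill_defaults(list_ex2))  # ['pear', '0', '45', '2000-01-01', '565.11']
```

Note that '45' survives in the second row: re.match only checks that the string starts with one digit, not that it ends there.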

You don't really need to explicitly check for blank strings since your various regexen (plural of regex) will all fail on blank strings.

Other note: you probably still want to add an anchor for the end of your string to each regex. Using re.match ensures that it tries to match from the start, but still provides no guarantee that there is not illegal stuff after your match. Consider:

['pear and a pear tree', '0blah', '4 4 4', '2000-01-0000', '192.168.0.bananas']

The entire list above is "acceptable" if you don't add a $ anchor to the end of each regex :-)
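
One possible way to get the anchored behaviour without editing every pattern is re.fullmatch (available since Python 3.4), which only succeeds when the whole string matches. This is a sketch of an alternative to adding $, not what the answer itself used, and fill_defaults_strict is a made-up name:

```python
import re

# Patterns and defaults copied from the question.
reg_ex_check = ['[A-Za-z]+', '[0-9]', '[0-9]',
                '[0-9]{4}-[0-1][0-9]-[0-3][0-9]', '[0-9.]+']
default_fill = ['empty', '0', '0', '2000-01-01', '0']


def fill_defaults_strict(row):
    """Like the list comprehension above, but the whole field must match.

    fill_defaults_strict is a hypothetical helper name used for illustration.
    """
    return [value if re.fullmatch(pattern, value) else default
            for value, pattern, default in zip(row, reg_ex_check, default_fill)]


bad_row = ['pear and a pear tree', '0blah', '4 4 4',
           '2000-01-0000', '192.168.0.bananas']

# Every field has trailing junk, so every field falls back to its default.
print(fill_defaults_strict(bad_row))
# ['empty', '0', '0', '2000-01-01', '0']
```

With this stricter check, the '45' in list_ex2 is also replaced by '0', which agrees with the "exactly one digit" rule in the question. The float pattern '[0-9.]+' is still loose (it accepts something like '1.2.3'), so it may need tightening separately.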

Other tips

What about something like this?

list(map(lambda x, y, z: x if re.search(y, x) else z, list_ex1, reg_ex_check, default_fill))
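
One thing to keep in mind with this variant: re.search scans the whole string, whereas re.match only looks at the start, so re.search is the more permissive check. A small sketch of the difference, using a made-up value 'x7':

```python
import re

pattern = '[0-9]'
value = 'x7'  # hypothetical bad entry: a digit preceded by junk

# re.match only tries to match at the start of the string.
print(bool(re.match(pattern, value)))   # False

# re.search accepts a match anywhere in the string.
print(bool(re.search(pattern, value)))  # True
```

So with re.search, an entry like 'x7' would be kept rather than replaced by the default, unless the patterns are anchored at both ends.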

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow