Question

The task is to extract lines that contain x occurrences of y. The input text file contains 1,000,000 lines of 50-200 characters each.

In this case, let's say lines that contain more than 20 spaces. Is there a more pythonic way of doing this? Currently, I'm doing it this way:

import codecs

def readlinesmorethan20spaces(intxtfile):
    with codecs.open(intxtfile, 'r','utf8') as fin:
        for i in fin:
            if i.count(" ") > 20:
                yield i.strip()

for i in readlinesmorethan20spaces("in.txt"):
    print i

How can this be done without Python, using only Unix commands? Is it even possible?

Solution

Unix way using grep and sed:

grep -E '(\s\S*){20,}' in.txt | sed 's/^\s*//;s/\s*$//'

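The first command keeps lines with 20 or more whitespace characters (they need not be consecutive); use {21,} instead for the question's strict "more than 20" test. The second command then strips leading and trailing whitespace. Note that \s is a GNU extension in both grep and sed.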

This isn’t an ideal approach: it’s probably slower than others (awk, maybe), but it’s quite simple. By the way, I’d be interested in a performance comparison of the different methods mentioned on this page…

Yeah, almost everything can be solved with regular expressions! ;)
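
For comparison with the Python version in the question, the same kind of pattern can be tried through Python's re module. This is only a sketch in Python 3 syntax, with made-up sample strings and a hypothetical helper name. It shows that a \s-based pattern counts any whitespace character, tabs included, whereas str.count(" ") counts literal spaces only.

```python
import re

# Same idea as the grep pattern: 20 or more groups of
# "one whitespace character followed by a run of non-whitespace".
PATTERN = re.compile(r"(\s\S*){20,}")

def has_20_whitespace(line):
    """Return True if the line contains at least 20 whitespace characters."""
    return PATTERN.search(line) is not None

# Hypothetical sample lines, not taken from the original input file.
spaces_only = "w " * 20 + "end"      # 20 literal spaces
tabs_only = "w\t" * 20 + "end"       # 20 tabs, no spaces
too_few = "w " * 19 + "end"          # 19 literal spaces

print(has_20_whitespace(spaces_only))  # True
print(has_20_whitespace(tabs_only))    # True: tabs are whitespace too
print(has_20_whitespace(too_few))      # False
print(tabs_only.count(" "))            # 0: str.count(" ") sees no spaces
```

If only literal spaces should count, as in the question's code, the pattern would need a plain space and [^ ] in place of \s and \S.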

OTHER TIPS

Comprehensions and generator expressions are generally more pythonic. In your context it would look something like this:

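import codecs

def readlinesmorethan20spaces(fin):
    # Takes an open file (or any iterable of lines); returning a generator
    # from inside a with-block would close the file before it is read.
    return (i.strip() for i in fin if i.count(' ') > 20)

with codecs.open("in.txt", 'r', 'utf8') as fin:
    for i in readlinesmorethan20spaces(fin):
        print i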

Returning a generator expression is just as lazy as your original yield version, so the file has to stay open until the generator has been fully consumed.

You could also do it as a single line if you want, though I think the above version is more readable (and this one never explicitly closes the file):

read_lines = lambda fn: (i.strip() for i in codecs.open(fn, 'r', 'utf8') if i.count(' ') > 20)

The Unix approach is less straightforward, but it should certainly be possible. A good start is probably to use awk to count the spaces in each line. Here's an example that prints the count and the line number:

awk -v FS="" '{cnt=0; for (i=1;i<=NF;i++) if ($i==" ") cnt++; print cnt"\t"NR}' stores.dat

I'd usually not bother with a generator:

import codecs
with codecs.open(intxtfile, 'r','utf8') as fin:
    for i in fin:
        if i.count(' ') <= 20:
            continue
        i = i.strip()
        ...

One advantage of using a function/generator is that it gives you finer-grained components to unit test. As mentioned in the comments, moving things around a little makes the generator much easier to test, because fin doesn't need to be an open file: it could equally well be a list, etc.

import codecs

def readlinesmorethan20spaces(fin):
    for i in fin:
        if i.count(" ") > 20:
            yield i.strip()

with codecs.open(intxtfile, 'r','utf8') as fin:
    for i in readlinesmorethan20spaces(fin):
        print i
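
To illustrate the testing point, here is a minimal sketch in Python 3 syntax that feeds the generator a plain list instead of an open file. The sample lines and the test function name are made up for illustration.

```python
def readlinesmorethan20spaces(fin):
    """Yield stripped lines, from any iterable of strings, that contain more than 20 spaces."""
    for line in fin:
        if line.count(" ") > 20:
            yield line.strip()

def test_readlinesmorethan20spaces():
    # Hypothetical in-memory input; no file is opened.
    fake_file = [
        "a " * 21 + "\n",   # 21 spaces: kept, then stripped
        "a " * 20 + "\n",   # exactly 20 spaces: dropped
        "short line\n",     # 1 space: dropped
    ]
    result = list(readlinesmorethan20spaces(fake_file))
    assert result == [("a " * 21).strip()]

test_readlinesmorethan20spaces()
```

Because the function only iterates over its argument, the same test would work with io.StringIO or any other iterable of strings.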

Another way, using collections.Counter (one of the standard library's container datatypes):

import codecs
import collections

def readlinesmorethan20spaces(intxtfile):
    with codecs.open(intxtfile, 'r','utf8') as fin:
        for line in fin:
            counter = collections.Counter(line)
            if counter[" "] > 20:
                yield line.strip()

for i in readlinesmorethan20spaces("in.txt"):
    print i
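
Since the thread raises the question of performance, here is a rough way to compare the two counting strategies with timeit (Python 3 syntax). The sample line and the helper names are invented, and timings depend on the machine and the data, so treat this as a measuring harness rather than a result. One thing can be said up front: Counter tallies every distinct character in the line, while str.count looks for a single one.

```python
import collections
import timeit

# Hypothetical sample line of roughly 150 characters,
# within the 50-200 range given in the question.
LINE = "lorem ipsum dolor sit amet " * 5 + "consectetur\n"

def count_with_str(line):
    """Count spaces with str.count."""
    return line.count(" ")

def count_with_counter(line):
    """Count spaces by building a Counter over every character."""
    return collections.Counter(line)[" "]

# Both strategies must agree before timing them is meaningful.
assert count_with_str(LINE) == count_with_counter(LINE)

for func in (count_with_str, count_with_counter):
    seconds = timeit.timeit(lambda: func(LINE), number=10000)
    print("%s: %.4f s for 10000 calls" % (func.__name__, seconds))
```

For a real comparison it would be better to time each complete function over the actual input file.
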
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow