How to read/extract lines with more than 20 spaces? - unix/python

Question 1

Unix way using grep and sed:

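grep -E '(\s\S*){21,}' in.txt | sed 's/^\s*//;s/\s*$//'

The first command keeps lines containing more than 20 whitespace characters (not necessarily consecutive); the second strips leading and trailing whitespace. Both \s and \S are GNU extensions; with other implementations, use [[:space:]] and [^[:space:]] instead.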

This isn’t an ideal approach: it’s probably slower than others (awk, maybe), but it’s quite simple. By the way, I’d be interested in a performance comparison of the different methods mentioned on this page…

Yeah, almost everything can be solved with regular expressions! ;)
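
As a rough Python counterpart to the grep/sed pipeline above, here is a minimal sketch using the re module. The function name lines_with_whitespace and its minimum parameter are made up for illustration. The newline is removed before matching because grep never sees it, whereas Python would count it as whitespace.

```python
import re

def lines_with_whitespace(lines, minimum):
    """Yield stripped lines containing at least `minimum` whitespace characters."""
    # One whitespace character followed by any run of non-whitespace,
    # repeated `minimum` or more times (the characters need not be adjacent).
    pattern = re.compile(r'(?:\s\S*){%d,}' % minimum)
    for line in lines:
        body = line.rstrip('\n')  # do not count the line terminator
        if pattern.search(body):
            yield body.strip()

# Usage with an in-memory list; an open file works the same way.
sample = ["too few spaces\n", " ".join(["w"] * 22) + "\n"]
selected = list(lines_with_whitespace(sample, 21))
```

Passing 21 as the threshold corresponds to "more than 20"; like the grep pattern, this counts tabs as well as spaces.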

Question 2

Comprehensions and generator expressions are generally more Pythonic. In your context it would look something like this:

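import codecs

def readlinesmorethan20spaces(fin):
    # fin is any iterable of lines, e.g. an open file. The file must stay
    # open while the generator is consumed, so it is opened by the caller.
    return (i.strip() for i in fin if i.count(' ') > 20)

with codecs.open("in.txt", 'r', 'utf8') as fin:
    for i in readlinesmorethan20spaces(fin):
        print(i)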

Returning a generator expression this way is just as lazy as your previous yield statement.

You could also do it as a single line if you want, though I think the above version is more readable, and the one-liner never closes the file explicitly:

read_lines = lambda fn: (i.strip() for i in codecs.open(fn, 'r', 'utf8') if i.count(' ') > 20)

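The Unix approach is less straightforward, but it should be entirely possible. A good start is probably to use awk to count the spaces in each line. Here's an example that prints the count and the line number (it relies on an empty FS splitting each line into single characters, which gawk does but POSIX does not guarantee):

awk -v FS="" '{cnt=0; for (i=1;i<=NF;i++) if ($i==" ") cnt++; print cnt"\t"NR}' stores.dat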
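
For comparison, the per-line counting done by the awk command can be sketched in Python. The helper name count_spaces is hypothetical; it emits the same pairs of count and 1-based line number, counting only the space character.

```python
def count_spaces(lines):
    """Yield (number_of_spaces, line_number) for each line, numbering from 1."""
    for lineno, line in enumerate(lines, start=1):
        yield line.count(' '), lineno

# Print in the same "count<TAB>line number" layout as the awk command.
for cnt, lineno in count_spaces(["a b c\n", "nospace\n"]):
    print("%d\t%d" % (cnt, lineno))
```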

Question 3

I'd usually not bother with a generator:

import codecs
with codecs.open(intxtfile, 'r','utf8') as fin:
    for i in fin:
        if i.count(' ') <= 20:
            continue
        i = i.strip()
        ...

One advantage of using a function/generator is that it gives you finer-grained components to unit test. As mentioned in the comments, moving things around a little makes the generator much easier to test, as fin doesn't need to be an open file; it could equally well be a list, etc.

import codecs

def readlinesmorethan20spaces(fin):
    for i in fin:
        if i.count(" ") > 20:
            yield i.strip()

with codecs.open(intxtfile, 'r','utf8') as fin:
    for i in readlinesmorethan20spaces(fin):
        print(i)
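
To illustrate the testing point, here is a minimal unittest sketch that feeds the generator a plain list instead of a file. The test class and method names are invented for this example, and the generator is repeated so the block runs on its own.

```python
import unittest

def readlinesmorethan20spaces(fin):
    # Same generator as above: fin can be any iterable of strings.
    for i in fin:
        if i.count(" ") > 20:
            yield i.strip()

class ReadLinesTest(unittest.TestCase):
    def test_keeps_only_lines_with_more_than_20_spaces(self):
        short = "only a few spaces\n"
        exactly_20 = " ".join(["w"] * 21) + "\n"   # 20 spaces: rejected
        long_line = " ".join(["w"] * 22) + "\n"    # 21 spaces: kept
        result = list(readlinesmorethan20spaces([short, exactly_20, long_line]))
        self.assertEqual(result, [long_line.strip()])

    def test_strips_surrounding_whitespace(self):
        padded = " " * 21 + "text\n"
        self.assertEqual(list(readlinesmorethan20spaces([padded])), ["text"])
```

Saved in a module, these tests can be run with python -m unittest; no file on disk is needed.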

Question 4

Another way, using collections.Counter from the high-performance container datatypes:

import codecs
import collections

def readlinesmorethan20spaces(intxtfile):
    with codecs.open(intxtfile, 'r','utf8') as fin:
        for line in fin:
            counter = collections.Counter(line)
            if counter[" "] > 20:
                yield line.strip()

for i in readlinesmorethan20spaces("in.txt"):
    print(i)
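
Since a performance comparison was asked for earlier on this page, a small timeit sketch like the following could compare the str.count and Counter variants on synthetic data. The helper names and the generated lines are assumptions for illustration, and no timings are claimed here; results will vary by machine and input.

```python
import collections
import timeit

def filter_with_count(lines):
    # Variant based on str.count, as in the earlier answers.
    return [line.strip() for line in lines if line.count(" ") > 20]

def filter_with_counter(lines):
    # Variant based on collections.Counter, as in this answer.
    return [line.strip() for line in lines
            if collections.Counter(line)[" "] > 20]

# Synthetic input: alternate lines with 5 and 25 spaces.
lines = [" ".join(["word"] * n) + "\n" for n in (6, 26)] * 500

# Both variants must select the same lines before timing means anything.
assert filter_with_count(lines) == filter_with_counter(lines)

for func in (filter_with_count, filter_with_counter):
    seconds = timeit.timeit(lambda: func(lines), number=10)
    print("%s: %.4f s" % (func.__name__, seconds))
```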