Question

I have a TSV file from QuickDAQ with three columns of 200 000 values that I want to import into NumPy. The problem is that genfromtxt seems to miss the last line. As far as I can see, the line is nothing out of the ordinary:

...
0,00232172012329102     0,0198968648910522      0,0049593448638916
0,00411009788513184     0,0142784118652344      0,00339150428771973
0,00499653816223145     0,00666630268096924     0,00308072566986084

Example of code that doesn't quite work:

In [245]: import numpy as np

In [246]: oompa = np.genfromtxt('C_20k_73_2.tsv',delimiter='\t',usecols=(0,1,2),unpack=True,skip_header=13,dtype=str)

In [248]: oompa[1]
Out[248]: 
array(['-0,00884926319122314', '-0,00379836559295654',
   '0,000106096267700195', ..., '0,0259654521942139',
   '0,0198968648910522', '0,0142784118652344'], 
  dtype='<U21')

The file had Windows-style line breaks; I've tried removing these in vi, but it doesn't make a difference. What could cause this kind of behaviour from genfromtxt, and how could it be dealt with, preferably without manually editing the TSV file?


Solution

Well, the file seems to have some lines containing just tabs. I'm surprised np.genfromtxt did not raise a ValueError. One way to prevent the problem would be to remove those empty tab lines. Another is to pass invalid_raise=False in the call to np.genfromtxt:

oompa = np.genfromtxt('C_20k_73_2.tsv',delimiter='\t',
            usecols=(0,1,2),unpack=True,skip_header=13,
            dtype=str, invalid_raise=False)

That will skip lines that are inconsistent with the number of columns np.genfromtxt expects to parse.
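To make the behaviour concrete, here is a minimal sketch using a hypothetical in-memory sample that mimics the file's layout (decimal commas, tab-separated, plus one stray line with too few fields); with invalid_raise=False the malformed line is skipped with a warning instead of aborting the parse:

```python
import io
import warnings
import numpy as np

# Hypothetical sample mimicking the file: three tab-separated columns,
# plus a stray line containing only a single tab (two empty fields).
sample = (
    "0,00232\t0,01989\t0,00495\n"
    "0,00411\t0,01427\t0,00339\n"
    "\t\n"  # malformed line: two fields instead of three
)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the ConversionWarning
    data = np.genfromtxt(io.StringIO(sample), delimiter='\t',
                         usecols=(0, 1, 2), unpack=True,
                         dtype=str, invalid_raise=False)

print(data.shape)  # (3, 2): three columns, two valid rows kept
```
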


If the file is not too long, an easy way to look at the last few lines of the file is

print(open(filename, 'rb').read().splitlines()[-3:])

Since this prints a list, you get the repr of the items in the list without having to call repr directly. The repr makes it easy to see where the tabs and end-of-line characters are.

By examining the repr of the last lines successfully parsed by np.genfromtxt compared to the first lines skipped, you should be able to spot the break in pattern which is causing the problem.
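As a quick illustration (with made-up byte strings standing in for two lines of the file), repr makes tabs and CRLF endings explicit instead of rendering them as blank space:

```python
# Made-up byte strings standing in for lines of the file: repr()
# shows whitespace characters literally, so stray tabs and Windows
# CRLF line endings become immediately visible.
line_ok = b"0,00499\t0,00666\t0,00308\r\n"
line_bad = b"\t\r\n"

print(repr(line_ok))   # b'0,00499\t0,00666\t0,00308\r\n'
print(repr(line_bad))  # b'\t\r\n'
```
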


If the file is very long, you can print the last few lines using

import collections

# Keep only the last two lines; older lines are discarded
# as new ones are appended, so memory use stays bounded.
lines = collections.deque(maxlen=2)
with open('data', 'rb') as f:
    lines.extend(f)
print(list(lines))

The problem with open(filename, 'rb').read().splitlines() is that it reads the entire file into memory and then splits the huge string into a huge list. This can cause a MemoryError when the file is too large. The deque has a maximum number of elements, thus preventing the problem so long as the lines themselves are not too long.
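To see that bounded-memory behaviour in action, here is a small sketch using an in-memory file (io.StringIO standing in for the real TSV):

```python
import collections
import io

# Simulate a long file; the deque retains only the last `maxlen`
# lines, discarding older ones as it iterates, so memory use stays
# constant no matter how long the file is.
fake_file = io.StringIO("\n".join(f"row {i}" for i in range(100_000)))
tail = collections.deque(fake_file, maxlen=3)

print(list(tail))  # only the final three lines survive
```
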

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow