Question

Considering a text file of 1.5 million lines and around 50-100 words per line.

To find the lines that contains the word, using os.popen('grep -w word infile') seems to be faster than

for line in infile: 
  if word in line:
    print line

How else could one search a word in a textfile in python? What is the fastest way to search through that large unindex textfile?

Était-ce utile?

La solution

There are several fast search algorithms (see wikipedia). They require you to compile the word into some structure. Grep is using Aho-Corasick algorithm.

I haven't seen the source code for python's in but either

  1. word is compiled for each line which takes time (I doubt in compiles anything, obviously it can compile it, cache the results, etc), or
  2. the search is inefficient. Consider searching for "word" in "worword" you first check "worw" and fail, then you check "o", then "r" and fail, etc. But there is no reason to recheck "o" or "r" if you are smart. So for example, Knuth–Morris–Pratt algorithm creates a table based on the searched word that tells it how many characters can be skipped when fail occurs.

Autres conseils

I may recommend to install and use the_silver_searcher.

In my test it searched ~ 1GB text file with ~ 29 million lines and found hundreds of searched word entries in only 00h 00m 00.73s, i.e. LESS than a second!

Here is Python 3 code which uses it to search for word and count number of times it was found:

import subprocess

word = "some"
file = "/path/to/some/file.txt"

command = ["/usr/local/bin/ag", "-wc", word, file]
output = subprocess.Popen(command, stdout=subprocess.PIPE).stdout.read()
print("Found entries:", output.rstrip().decode('ascii'))

This version searches for word and prints line numbers + actual text were the word was found:

import subprocess

word = "some"
file = "/path/to/some/file.txt"

command = ["/usr/local/bin/ag", "-w", word, file]
output = subprocess.Popen(command, stdout=subprocess.PIPE)

for line in output.stdout.readlines():
    print(line.rstrip().decode('ascii'))
Licencié sous: CC-BY-SA avec attribution
Non affilié à StackOverflow
scroll top