Question

I have a dictionary file, eng.dic, that looks like the sample below (imagine there are close to a billion words in that list), and I have to run different word queries against it quite often.

apple
pear
foo
bar
foo bar
dictionary
sentence

I have a string, let's say "foo-bar". Is there a better (more efficient) way of searching through that file to see whether it exists? If it exists, return a result; if it doesn't exist, append it to the dictionary file.

# read the dictionary once, normalizing spaces to hyphens
with open('en_dic', encoding='utf8') as dic_file:
    en_dic = {line.strip().replace(" ", "-") for line in dic_file}

query = "foo-bar"
if query in en_dic:
    print("exists")
else:
    # append the missing word to the dictionary file
    with open('en_dic', 'a', encoding='utf8') as dic_file:
        dic_file.write(query + "\n")

Are there any built-in search functions in Python, or any libraries that I can import to run such searches without much overhead?


Solution

As I already mentioned, going through the whole file when its size is significant is not a good idea. Instead, you should use established solutions and:

  1. index the words in the document,
  2. store the results of the indexing in an appropriate form (I suggest a database),
  3. check whether a word exists in the file (by checking the database),
  4. if it does not exist, add it to both the file and the database.

Storing data in a database is really a lot more efficient than trying to reinvent the wheel. If you use SQLite, the database will also be a single file, so the setup procedure is minimal.

So again, I am proposing storing the words in an SQLite database, querying it when you want to check whether a word exists in the file, and updating both when you add a word.
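
A minimal sketch of that approach using Python's built-in sqlite3 module; the file names en_dic.db and eng.dic are illustrative assumptions, not anything the question fixes:

import sqlite3

# one-file database; the PRIMARY KEY column is automatically indexed
conn = sqlite3.connect("en_dic.db")
conn.execute("CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)")

def check_or_add(word):
    # the index makes this lookup fast even for a huge word list
    if conn.execute("SELECT 1 FROM words WHERE word = ?", (word,)).fetchone():
        return True
    # unknown word: record it in the database and append it to the file
    conn.execute("INSERT INTO words (word) VALUES (?)", (word,))
    conn.commit()
    with open("eng.dic", "a", encoding="utf8") as f:
        f.write(word + "\n")
    return False

print(check_or_add("foo-bar"))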

To read more on this solution, see the answers to this question:

The most efficient way to index words in a document

OTHER TIPS

The most efficient way depends on the most frequent operation that you will perform on this dictionary.

If you need to re-read the file on each query, you can read it line by line until you either find your word or reach the end of the file. This is necessary if you have several concurrent workers that can update the file at the same time.
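
For example, a sketch of such a line-by-line check, assuming the eng.dic file name from the question:

def word_in_file(path, word):
    # scan sequentially and stop as soon as the word is found
    with open(path, encoding="utf8") as f:
        return any(line.strip() == word for line in f)

if not word_in_file("eng.dic", "foo-bar"):
    with open("eng.dic", "a", encoding="utf8") as f:
        f.write("foo-bar\n")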

If you don't need to read the file each time (e.g., you have only one process working with the dictionary), you can definitely write a more efficient implementation: 1) read all lines into a set (instead of a list), 2) for each "new" word perform both actions: update the set with add and append the word to the file.
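
A sketch of that single-process variant, again assuming eng.dic as the file name:

# load the whole dictionary into a set once at startup
with open("eng.dic", encoding="utf8") as f:
    words = {line.strip() for line in f}

def add_if_missing(word):
    if word in words:
        return True
    words.add(word)  # keep the in-memory set current
    with open("eng.dic", "a", encoding="utf8") as f:
        f.write(word + "\n")  # and persist the word for later runs
    return False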

If it is "pretty large" file, then access the lines sequentially and don't read the whole file into memory:

with open('largeFile', 'r') as inF:
    for line in inF:
        if 'myString' in line:
            pass  # do_something
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow