What is a scalable and practical way to search existence of a group of strings in a huge file
https://softwareengineering.stackexchange.com/questions/232450
02-10-2020
Question
Context: I built an app which generates around 1000 domain names based on user input. I need to check whether they are available by matching them against a huge zone file of parsed domain names, which is around 2 GB.
I have an Amazon micro instance and cannot store the text file there due to space constraints. I expect around 100k to 200k (or more) search queries per month.
Naive approach (potentially): store the text file in Dropbox, fetch its contents onto the EC2 instance, search it for the strings, and output the available domain names.
I only need to check whether domains exist or not. Do I have to store them in a database?
Some info: there are currently 100 million dot-com names registered, according to Verisign. My parsed domain names are one per line, like:
- APPLE
- STACKOVERFLOW, etc.
What is the best, practical way to deal with this problem? Ideally the check should take only a few seconds, but I am fine with anything that works at this point.
Solution
a) An indexing engine, like Lucene.
b) A Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter); see the sketch after this list.
c) A simple key hashing scheme: split your domains up by "hash modulo N" to divvy the work across multiple (simple) datastores.
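For option (b), a minimal pure-Python Bloom filter sketch (the bit-array size and hash count below are my own rough estimates, not figures from the answer). A Bloom filter answers "definitely not registered" with certainty and "probably registered" with a tunable false-positive rate, so positive hits would still need a confirming lookup:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: num_bits bits, num_hashes probes per item."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive the probe positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # False positives are possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Roughly 10 bits per name gives about a 1% false-positive rate, so
# 100 million names need on the order of 120 MB -- small enough for RAM.
bloom = BloomFilter(num_bits=1_000_000_000, num_hashes=7)
bloom.add("APPLE")
print("APPLE" in bloom)          # True
print("NOTREGISTERED" in bloom)  # False (with high probability)
```

(Loading 100 million names through pure Python would be slow; in practice a C-backed Bloom filter library would do the same job much faster.)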
BTW 2GB is tiny. You could easily do the whole thing in memory.
Also, since you've already normalized your strings into a standard format, I'd focus on storing and doing lookups on the hashes of strings rather than the strings themselves.
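As a sketch of that idea (assuming the one-name-per-line zone file described in the question; "zone.txt" is a placeholder path):

```python
import hashlib

def name_hash(name: str) -> int:
    # 64-bit hash of the normalized name; at 100M entries collisions are
    # rare but possible, and they would surface as false positives.
    digest = hashlib.blake2b(name.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

# Load the zone file once at startup.
registered = set()
with open("zone.txt") as zone:
    for line in zone:
        registered.add(name_hash(line.strip()))

def is_available(candidate: str) -> bool:
    return name_hash(candidate.upper()) not in registered
```

Be aware that a plain Python set carries tens of bytes of overhead per entry, so 100 million hashes would not actually fit in a micro instance's RAM; a sorted array of 8-byte hashes searched with bisect, or the Bloom filter above, is the compact version of the same idea.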
...
Full disclosure: I'd probably just store them in a database.
Other tips
Do I have to store them in a database?
That would be the simplest solution. Look for a lightweight (non-transactional) database library.
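For example, a minimal sketch using SQLite, which ships with Python and keeps everything in one indexed file on disk ("zone.txt" and "domains.db" are placeholder paths):

```python
import sqlite3

conn = sqlite3.connect("domains.db")
conn.execute("PRAGMA journal_mode = OFF")  # skip the rollback journal; the
conn.execute("PRAGMA synchronous = OFF")   # one-time bulk load is re-runnable
conn.execute("CREATE TABLE IF NOT EXISTS domains (name TEXT PRIMARY KEY)")

# One-time load from the parsed zone file.
with open("zone.txt") as zone:
    conn.executemany("INSERT OR IGNORE INTO domains VALUES (?)",
                     ((line.strip(),) for line in zone))
conn.commit()

def is_registered(name: str) -> bool:
    # The PRIMARY KEY is backed by a B-tree index, so each lookup is O(log n).
    row = conn.execute("SELECT 1 FROM domains WHERE name = ?",
                       (name.upper(),)).fetchone()
    return row is not None

print(is_registered("APPLE"))  # True once APPLE is in the zone file
```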
Another idea would be to convert the zone file into a compressed word list, treating each domain name as a "word". This will give you better compression, and if you choose the right scheme, you should be able to get fast and accurate lookup without keeping the whole "file" in memory.
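One way to picture that, sketched here with blocked front coding: each block stores its first name in full and every later name as a shared-prefix length plus the differing suffix, and a lookup binary-searches the block heads and decodes a single short block. The block size and sample names are illustrative:

```python
import bisect

BLOCK = 16  # names per block; bigger blocks compress better but scan slower

def build(sorted_names):
    """Front-code a sorted name list into blocks plus a list of block heads."""
    heads, blocks = [], []
    for i in range(0, len(sorted_names), BLOCK):
        chunk = sorted_names[i:i + BLOCK]
        heads.append(chunk[0])
        encoded = [chunk[0]]
        for prev, cur in zip(chunk, chunk[1:]):
            # Store only (shared-prefix length, differing suffix).
            p = 0
            while p < min(len(prev), len(cur)) and prev[p] == cur[p]:
                p += 1
            encoded.append((p, cur[p:]))
        blocks.append(encoded)
    return heads, blocks

def contains(heads, blocks, name):
    idx = bisect.bisect_right(heads, name) - 1
    if idx < 0:
        return False
    cur = blocks[idx][0]
    if cur == name:
        return True
    for prefix_len, suffix in blocks[idx][1:]:
        cur = cur[:prefix_len] + suffix  # reconstruct the next name
        if cur == name:
            return True
    return False

heads, blocks = build(sorted(["APPLE", "APPLET", "APPLIANCE", "STACKOVERFLOW"]))
print(contains(heads, blocks, "APPLET"))  # True
print(contains(heads, blocks, "ORANGE"))  # False
```

Sorted domain names share long prefixes, so a scheme like this can shrink the 2 GB file substantially while keeping lookups fast.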
A variation on the above would be to treat name components as words, and treat this as free-text search problem. Build a reverse index, and treat the search for a domain name as a search for a set of words; e.g. "www.example.com" is a search for "www" AND "example" AND "com". This is significantly more complicated, but it might give better compression overall, and it would allow you to do other queries.
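A toy sketch of that inverted-index idea (dot-separated labels as "words", lookup as an AND-intersection of posting sets; the names here are illustrative). Note the intersection yields candidates that contain all the labels, so an exact-match check is still needed at the end:

```python
from collections import defaultdict

def tokenize(domain: str):
    # Treat each dot-separated label as a "word".
    return domain.lower().split(".")

def build_index(domains):
    index = defaultdict(set)
    for doc_id, domain in enumerate(domains):
        for word in tokenize(domain):
            index[word].add(doc_id)
    return index

def search(index, query: str):
    # AND-query: intersect the posting sets of every label in the query.
    postings = [index.get(word, set()) for word in tokenize(query)]
    return set.intersection(*postings) if postings else set()

domains = ["www.example.com", "mail.example.com", "example.org"]
index = build_index(domains)
candidates = search(index, "www.example.com")
print([domains[i] for i in candidates])  # ['www.example.com']
```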
There is a lot of literature on the subject of text search ...
Putting the (original) file into an external file service and fetching / searching it there is a poor idea. It takes a significant amount of time to stream a 2 GB file from one machine to another over the internet, and you will probably end up paying for the I/O bandwidth you use. You would be better off paying up front for storage on the Amazon infrastructure.