Question

Context: I built an app that generates around 1000 domain names based on user input. I need to check whether they are available by matching them against a huge zone file of parsed domain names, which is around 2 GB.

I have an Amazon EC2 micro instance and cannot store the text file on it due to space constraints. I am expecting around 100k-200k search queries per month, possibly more.

Naive approach (potentially): store the text file in Dropbox, fetch its contents from the EC2 instance, search it for the strings, and output the available domain names.

I only need to check whether domains exist. Do I have to store them in a database?

Some info: according to Verisign, there are currently 100 million .com names registered. My parsed domain names are one per line, like:

  • GOOGLE
  • APPLE
  • FACEBOOK
  • STACKOVERFLOW, etc.

What is the most practical way to deal with this problem? Ideally the check should take only a few seconds, but I am fine with anything that works at this point.

The solution

a) an indexing engine, like Lucene

b) A Bloom filter: http://en.wikipedia.org/wiki/Bloom_filter (see the sketch after the notes below)

c) You could use a simple key hashing scheme, and split your domains up by a simple "hash modulo N" to divvy the work across multiple (simple) datastores.

  • BTW, 2 GB is tiny. You could easily do the whole thing in memory.

  • Also, since you've already normalized your strings into a standard format, I'd focus on storing and doing lookups on the hashes of strings rather than the strings themselves.
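
As a minimal sketch of option (b), the pure-Python Bloom filter below also follows the hashing note above: it stores only bit positions derived from each name's hash, never the strings themselves. The `BloomFilter` class, the file name `domains.txt`, and the sizing constants are all illustrative assumptions, not a fixed design:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes into an m-bit array."""

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, key):
        # Standard double-hashing trick: derive k probe positions
        # from the two halves of a single MD5 digest.
        digest = hashlib.md5(key.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


# ~10 bits per name with k=7 gives roughly a 1% false-positive rate,
# so 100 million names fit in about 125 MB of RAM.
bloom = BloomFilter(m_bits=10 * 100_000_000, k_hashes=7)
with open("domains.txt") as f:   # assumed: one normalized name per line
    for line in f:
        bloom.add(line.strip())

print("GOOGLE" in bloom)  # "in" -> probably registered; "not in" -> definitely not
```

A useful property for this app: a Bloom filter's "no" is definitive, so any name it reports as absent is genuinely unregistered. False positives only make a few available names look taken, which is a safe direction to fail in.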

...

Full disclosure: I'd probably just store them in a database.

Other tips

Do I have to store them in a database?

That would be the simplest solution. Look for a light-weight (non-transactional) database library.
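
SQLite is one lightweight option that ships with Python's standard library. A minimal sketch, assuming the parsed names live one per line in `domains.txt` (both file names are placeholders):

```python
import sqlite3

conn = sqlite3.connect("domains.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS domains (name TEXT PRIMARY KEY) WITHOUT ROWID"
)

# One-time load: stream the 2 GB list; the B-tree index is built as rows insert.
with open("domains.txt") as f, conn:
    conn.executemany(
        "INSERT OR IGNORE INTO domains VALUES (?)",
        ((line.strip(),) for line in f),
    )

def is_registered(name):
    """Indexed point lookup; a handful of page reads even at 100M rows."""
    return conn.execute(
        "SELECT 1 FROM domains WHERE name = ?", (name,)
    ).fetchone() is not None

print(is_registered("GOOGLE"))
```

Note that the database file will be roughly the size of the original list, so it still needs to live on attached storage rather than in memory; the primary-key index is what makes each lookup fast.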

Another idea would be to convert the zone file into a compressed word list, treating each domain name as a "word". This will give you better compression, and if you choose the right scheme, you should be able to get fast and accurate lookup without keeping the whole "file" in memory.
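
The exact encoding is up to you. As one illustration of disk-resident lookup (skipping the compression refinement), here is a sketch that writes the names as sorted fixed-width records and binary-searches the file directly. The 64-byte record size relies on a single .com label being at most 63 characters; the file names are placeholders, and the build step sorts in memory for brevity (the full 2 GB file would want an external sort):

```python
import os

RECORD = 64  # a .com label is at most 63 ASCII chars, so 64 bytes per record

def build_index(src="domains.txt", dst="domains.sorted"):
    """One-time: sort the names and write them as fixed-width records."""
    names = sorted({line.strip().upper() for line in open(src) if line.strip()})
    with open(dst, "wb") as out:
        for name in names:
            out.write(name.encode("ascii").ljust(RECORD, b"\0"))

def is_registered(name, idx="domains.sorted"):
    """Binary-search the record file: ~27 seeks for 100 million names."""
    target = name.strip().upper().encode("ascii")
    with open(idx, "rb") as f:
        lo, hi = 0, os.path.getsize(idx) // RECORD
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD)
            rec = f.read(RECORD).rstrip(b"\0")
            if rec == target:
                return True
            elif rec < target:
                lo = mid + 1
            else:
                hi = mid
    return False
```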

A variation on the above would be to treat name components as words, and treat this as a free-text search problem. Build a reverse (inverted) index, and treat the search for a domain name as a search for a set of words; e.g. "www.example.com" is a search for "www" AND "example" AND "com". This is significantly more complicated, but it might give better compression overall, and it would allow you to do other queries.
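In miniature, the idea looks like this (the sample names and the `lookup` helper are purely illustrative):

```python
from collections import defaultdict

# Toy corpus; in practice you would stream the zone file.
domains = ["www.example.com", "mail.example.com", "www.google.com"]

# Build the reverse index: word -> set of IDs of domains containing it.
index = defaultdict(set)
for doc_id, name in enumerate(domains):
    for word in name.split("."):
        index[word].add(doc_id)

def lookup(name):
    """AND-query: intersect the posting sets, then verify the exact name."""
    postings = [index.get(word, set()) for word in name.split(".")]
    candidates = set.intersection(*postings) if postings else set()
    # Intersection only proves all the words occur somewhere in the name,
    # so confirm the exact string before answering.
    return any(domains[i] == name for i in candidates)

print(lookup("www.example.com"))  # True
print(lookup("www.example.org"))  # False: no postings for "org"
```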

There is a lot of literature on the subject of text search ...


Putting the (original) file into an external file service and fetching/searching it is a poor idea. It takes a significant amount of time to stream a 2 GB file from one machine to another over the internet, and you will probably end up paying for the I/O bandwidth you use. You would be better off paying up front for storage on the Amazon infrastructure.

Licensed under: CC-BY-SA with attribution