Question

I'm working on an academic project (a search engine). The main functions of this search engine are:

1. Crawling
2. Storing
3. Indexing
4. Page ranking

All the sites that my search engine will crawl are available locally, which means it's an intranet search engine.

After the crawler stores the files it finds, these files need to be served quickly for caching purposes.

So I wonder: what is the fastest way to store and retrieve these files?

The first idea that came up was to use FTP or SSH, but these are connection-based protocols: the time to connect, locate the file, and fetch it is too long.

I've already read Google's "Anatomy of a Search Engine" paper and saw that they use a data repository. I'd like to do the same, but I don't know how.

NOTES: I'm using Linux/Debian, and the search engine back-end is coded in C/C++. Help!

Solution

Storing individual files is quite easy: `wget -r http://www.example.com` will store a local copy of example.com's entire (crawlable) content.
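If you do end up storing the page contents yourself, one way to serve them quickly afterwards, loosely in the spirit of the data repository the question mentions, is to append every page to a single file with a length prefix and keep an in-memory table of offsets, so retrieval is one seek and one read rather than a new FTP/SSH connection. Here is a minimal C++ sketch under that assumption; the `Repository` class, the record layout, and the file name `repository.dat` are illustrative, not the format from Google's paper:

```cpp
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>

// Illustrative append-only repository: each record is a 64-bit length
// followed by the raw page bytes. An in-memory table maps a document id
// to the record's offset, so a cached page is retrieved with one seek
// and one read instead of a fresh connection.
class Repository {
public:
    explicit Repository(const std::string& path)
        : path_(path), out_(path, std::ios::binary) {}

    // Append a page and remember where its record starts.
    void store(int doc_id, const std::string& content) {
        offsets_[doc_id] = out_.tellp();
        std::uint64_t len = content.size();
        out_.write(reinterpret_cast<const char*>(&len), sizeof(len));
        out_.write(content.data(), static_cast<std::streamsize>(len));
        out_.flush();  // make the record visible to readers immediately
    }

    // Fetch a page by document id with a single seek + read.
    std::string retrieve(int doc_id) const {
        std::ifstream in(path_, std::ios::binary);
        in.seekg(offsets_.at(doc_id));
        std::uint64_t len = 0;
        in.read(reinterpret_cast<char*>(&len), sizeof(len));
        std::string content(len, '\0');
        in.read(&content[0], static_cast<std::streamsize>(len));
        return content;
    }

private:
    std::string path_;
    std::ofstream out_;
    std::unordered_map<int, std::streampos> offsets_;
};

int main() {
    Repository repo("repository.dat");
    repo.store(1, "<html><body>an intranet page</body></html>");
    std::cout << repo.retrieve(1) << "\n";
}
```

In a real system you would also persist the offset table (or rebuild it by scanning the file at startup) and probably compress each record, but the basic store/retrieve path stays this simple.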

Of course, beware of generated pages, where the content is different depending on when (or from where) you access the page.

Another thing to consider is that you may not really want to store all the pages yourself, but just forward users to the site that actually contains them. That way, you only need to store a reference to which page contains which words, not the entire page. Since many pages contain a lot of repeated content, you only really need to store each unique word in your database along with a list of the pages that contain it. If you also filter out words that occur on nearly every page (such as "if", "and", "it", "to", "do", etc.), you can further reduce the amount of data you need to store. Count the occurrences of each word on each page, then compare pages to find which words appear almost everywhere and are therefore meaningless to search on.
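A minimal sketch of that idea, assuming an in-memory inverted index keyed by word; the `InvertedIndex` class name, the stop-word list, and the whitespace tokenizer are illustrative choices, not part of the original answer:

```cpp
#include <iostream>
#include <set>
#include <sstream>
#include <string>
#include <unordered_map>

// Illustrative in-memory inverted index: word -> set of document ids.
class InvertedIndex {
public:
    void add_document(int doc_id, const std::string& text) {
        std::istringstream words(text);
        std::string word;
        while (words >> word) {
            // Skip words that occur on nearly every page.
            if (stop_words_.count(word) == 0)
                index_[word].insert(doc_id);
        }
    }

    // Ids of the documents that contain the word (empty set if none).
    std::set<int> lookup(const std::string& word) const {
        auto it = index_.find(word);
        return it == index_.end() ? std::set<int>{} : it->second;
    }

private:
    std::set<std::string> stop_words_{"if", "and", "it", "to", "do", "the", "a"};
    std::unordered_map<std::string, std::set<int>> index_;
};

int main() {
    InvertedIndex index;
    index.add_document(1, "intranet search engine crawling and indexing");
    index.add_document(2, "page ranking for the intranet");
    for (int id : index.lookup("intranet"))
        std::cout << "\"intranet\" found in document " << id << "\n";
}
```

A lookup then returns only the list of pages to forward the user to; the page text itself never has to be stored.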

OTHER TIPS

Well, if the program is going to run constantly during operation, you could just store the pages in RAM: grab a gigabyte of RAM and you'd be able to store a great many pages. This would be much faster than caching them to the hard disk.
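A minimal sketch of that idea, assuming the crawled pages fit in memory and are keyed by URL; the `PageCache` class and its `put`/`get` functions are illustrative names:

```cpp
#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>

// Illustrative in-RAM page cache: URL -> page content, kept for the
// lifetime of the search-engine process so retrieval is a hash lookup.
class PageCache {
public:
    void put(const std::string& url, std::string content) {
        pages_[url] = std::move(content);
    }

    // Returns the cached page, or std::nullopt if it was never stored.
    std::optional<std::string> get(const std::string& url) const {
        auto it = pages_.find(url);
        if (it == pages_.end())
            return std::nullopt;
        return it->second;
    }

private:
    std::unordered_map<std::string, std::string> pages_;
};

int main() {
    PageCache cache;
    cache.put("http://intranet/home.html", "<html>home</html>");
    if (auto page = cache.get("http://intranet/home.html"))
        std::cout << *page << "\n";
}
```

The trade-off is that the cache disappears when the process stops, so this only works if the engine really does run continuously (or can re-crawl the intranet at startup).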

I gather from the question that the user is on a different machine from the search engine, and therefore from the cache as well. Perhaps I am overlooking something obvious here, but couldn't you just send them the HTML over the connection already established between the user and the search engine? Text is very light data-wise, after all, so it shouldn't be too much of a strain on the connection.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow