Question

I'm developing a tiny search engine using TF-IDF and cosine similarity. When pages are added, I build an inverted index to track each word's frequency across the different pages. I remove stopwords and other very common words, and stem plurals, verb forms, and so on.

My inverted index looks like:

map< string, map<int, float> > index

[
    word_a => [ id_doc=>frequency, id_doc2=>frequency2, ... ],
    word_b => [ id_doc=>frequency, id_doc2=>frequency2, ... ],
    ...
]

With this data structure, I can get the document frequency of a word with index["word_a"].size() and derive its idf weight from that. Given a query, the program loops over the keywords and scores the documents.
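
For illustration, my scoring loop is roughly the following (simplified; inv_index, score_query, num_docs, and the log-based idf formula are stand-ins for the real code):

    #include <cmath>
    #include <map>
    #include <string>
    #include <vector>
    using namespace std;

    map<string, map<int, float> > inv_index;  // word => (id_doc => frequency)

    // Accumulate a tf-idf score for every document matching a query keyword.
    map<int, float> score_query(const vector<string>& keywords, int num_docs) {
        map<int, float> scores;  // id_doc => score
        for (size_t i = 0; i < keywords.size(); ++i) {
            map<string, map<int, float> >::const_iterator w = inv_index.find(keywords[i]);
            if (w == inv_index.end()) continue;
            const map<int, float>& postings = w->second;
            float idf = log((float)num_docs / postings.size());  // df = postings.size()
            for (map<int, float>::const_iterator p = postings.begin();
                 p != postings.end(); ++p)
                scores[p->first] += p->second * idf;  // tf * idf
        }
        return scores;
    }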

I don't know data structures well, and my questions are:

  1. How should I store a 500 MB inverted index so it can be loaded at search time? Currently, I use Boost to serialize the index:

    #include <boost/archive/binary_oarchive.hpp>
    #include <boost/serialization/map.hpp>  // needed to serialize std::map
    ofstream ofs_index("index.sr", ios::binary);
    boost::archive::binary_oarchive oa(ofs_index);
    oa << index;
    

    And then I load it at search time:

    #include <boost/archive/binary_iarchive.hpp>
    #include <boost/serialization/map.hpp>
    ifstream ifs_index("index.sr", ios::binary);
    boost::archive::binary_iarchive ia(ifs_index);
    ia >> index;
    

    But it is very slow; it sometimes takes 10 seconds to load.

  2. I don't know whether std::map is efficient enough for an inverted index.

  3. In order to cluster documents, I currently get all keywords from each document and loop over them to score similar documents. I would like to avoid re-reading each document and instead use only the inverted index, as in the sketch below, but I suspect this data structure would be costly.
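
    What I have in mind is something like this sketch (untested), which walks the posting lists instead of re-reading documents and omits the idf weighting and norm division for brevity. The nested loop over each posting list is exactly the cost I'm worried about:

    #include <map>
    #include <string>
    #include <utility>
    using namespace std;

    typedef pair<int, int> doc_pair;  // (id_doc_a, id_doc_b) with id_doc_a < id_doc_b

    // Accumulate dot products between documents directly from the
    // posting lists, without re-reading any document.
    map<doc_pair, float> pairwise_scores(const map<string, map<int, float> >& inv_index) {
        map<doc_pair, float> sim;
        for (map<string, map<int, float> >::const_iterator w = inv_index.begin();
             w != inv_index.end(); ++w) {
            const map<int, float>& postings = w->second;
            // every pair of documents sharing this word gets a contribution
            for (map<int, float>::const_iterator a = postings.begin();
                 a != postings.end(); ++a) {
                map<int, float>::const_iterator b = a;
                for (++b; b != postings.end(); ++b)
                    sim[make_pair(a->first, b->first)] += a->second * b->second;
            }
        }
        return sim;  // divide by the documents' vector norms to get cosine similarity
    }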

Thank you in advance for any help!

Solution

The answer will pretty much depend on whether you need to support data comparable to or larger than your machine's RAM and whether in your typical use case you are likely to access all of the indexed data or rather only a small fraction of it.

If you are certain that your data will fit into your machine's memory, you can try to optimize the map-based structure you are using now. Storing your data in a map should give pretty fast access, but there will always be some initial overhead when you load the data from disk into memory. Also, if you only use a small fraction of the index, this approach may be quite wasteful as you always read and write the whole index, and keep all of it in memory.

Below I list some suggestions you could try out, but before you commit too much time to any of them, make sure that you actually measure what improves the load and run times and what does not. Without profiling the actual working code on the actual data you use, these are just guesses that may be completely wrong.

  • map is implemented as a balanced tree (usually a red-black tree). In many cases a hash map (std::unordered_map in C++11, or the non-standard hash_map before it) may give you better performance as well as better memory usage (fewer allocations and less fragmentation, for example).
  • Try reducing the size of the data: less data is faster to read from disk, needs fewer memory allocations, and can perform better in memory thanks to better locality. For example, you currently use a float to store each frequency, but you could instead store only the number of occurrences as an unsigned short in the map values, and keep a separate map holding the total number of words per document (also as a short). From those two numbers you can recompute the frequency on demand, while the data saved to disk is smaller, which could result in faster load times. See the sketch after this list.
  • Your map has a very large number of entries, and in such cases a custom memory allocator can sometimes improve performance.
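
To make the first two suggestions concrete, here is a sketch of a more compact layout (assuming C++11's std::unordered_map; the names counts, doc_length, and term_frequency are illustrative):

    #include <string>
    #include <unordered_map>
    using namespace std;

    // word => (id_doc => raw occurrence count); 2 bytes per count
    unordered_map<string, unordered_map<int, unsigned short> > counts;
    // id_doc => total number of words in that document
    unordered_map<int, unsigned short> doc_length;

    // Recompute the term frequency on demand instead of storing a float.
    // (A real lookup should use find() to avoid inserting empty entries.)
    inline float term_frequency(const string& word, int id_doc) {
        return (float)counts[word][id_doc] / doc_length[id_doc];
    }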

If your data could potentially grow beyond the size of your machine's RAM, I would suggest you use memory-mapped files for storing the data. Such an approach may require re-modelling your data structures and either using custom STL allocators or using completely custom data structures instead of std::map, but it may improve your performance by an order of magnitude if done well. In particular, this approach frees you from having to load the whole structure into memory at once, so your startup times will improve dramatically, at the cost of slight delays from disk accesses distributed over time as you touch different parts of the structure for the first time. The subject is quite broad and requires much deeper changes to your code than just tuning the map, but if you plan on handling huge data, you should certainly have a look at mmap and friends.
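
To give a taste of the mechanics, here is a minimal POSIX sketch that maps a file into memory. The file name index.bin and the idea of an in-place on-disk layout are placeholders; a Boost archive as-is is not suitable for in-place access:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        // "index.bin" stands in for an index file whose on-disk layout
        // is designed to be read in place.
        int fd = open("index.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }

        // Map the whole file; pages are faulted in lazily on first access,
        // so "loading" the index is effectively instantaneous.
        void* base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        // Cast offsets inside [base, base + st.st_size) to your on-disk
        // record types; only the pages you actually touch are read from disk.
        const unsigned char* bytes = static_cast<const unsigned char*>(base);
        printf("mapped %ld bytes, first byte: %u\n", (long)st.st_size, bytes[0]);

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }

The key property is that mmap defers all I/O: the kernel reads a page only when your code first dereferences an address inside it.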

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow