Question

I am writing a web script for a small online document management company that wants to let users quickly search the content of their files online. While many accounts are very small (fewer than 100 files of roughly 2 MB each), there are a handful that have 1,000,000 files or more. Support for PDF and DOC/DOCX is needed. Binary files won't be indexed.

We're looking for a simple solution that provides basic search results, nothing too fancy. Each user has a home folder (a search covers only that user's subfolders), so keep in mind that the search system should be optimized for that. To illustrate, if a user with a 100 MB account searches their home folder, it makes sense not to scan the other 4 TB of files.

What do you suggest?

Here are some options I was looking at:

1) I was thinking of using Windows Search for this, either through a command-line tool or an API. But each server can hold literally a billion files, and the top 3 results should be delivered instantly. Will Windows Search cope, or will this just lead to frustration?
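
Windows Search does expose a SQL-like interface over OLE DB, so here is a minimal sketch of what a scoped query would look like (assuming pywin32 on a Windows host whose Search service indexes the document store; the scope path and search term are placeholders):

```python
import win32com.client  # pywin32; talks to Windows Search through ADO/OLE DB

conn = win32com.client.Dispatch("ADODB.Connection")
rs = win32com.client.Dispatch("ADODB.Recordset")
conn.Open("Provider=Search.CollatorDSO;Extended Properties='Application=Windows';")

# Restrict the query to one user's home folder and ask only for the top 3 hits.
query = ("SELECT TOP 3 System.ItemPathDisplay FROM SystemIndex "
         "WHERE SCOPE='file:D:/DocStore/user123' AND CONTAINS('invoice')")
rs.Open(query, conn)

while not rs.EOF:
    print(rs.Fields.Item("System.ItemPathDisplay").Value)
    rs.MoveNext()

rs.Close()
conn.Close()
```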

2) Custom: building a simple indexing program on top of the open-source MySQL database to hold the index information. There are about 100,000 words in the English language, plus custom words and acronyms, so for fast lookups it makes sense to index by word and by user account. I will pre-process words so that "jogging" becomes "jog" and "fiddling" becomes "fiddle", to keep the DB small (see the stemming sketch below). Given 150 customer accounts per server, would it make sense to have one big DB, or to drop the UserID field and give each user their own DB?

Tables:
Table WordTable
EnglishWord (pk) | WordID

Table FileTable
FileID (pk) | FilePath

Table WordIndex
WordID (fk) | FileID (fk) | UserID | SettingsPatternID

Table Settings
SettingsPatternID | Top (bool) | IsWordForm (bool)

IsWordForm = indicates the entry is not an exact match but a form of the word. For example, the word in the document was originally "jogging" or "dancing", but it is filed under the stem "jog" or "dance". (If the query term was also a word form, this helps with relevance.) IsWordForm will be set on a large share of entries. Top = the word appears among the first 50 words of the document (which usually indicates the title).
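
As an illustration of the pre-processing step, here is a minimal sketch (the Porter stemmer from NLTK is just one possible choice, and mapping "stemming changed the token" to IsWordForm is my assumption about the intent):

```python
from nltk.stem import PorterStemmer  # pip install nltk; any stemmer would do

stemmer = PorterStemmer()

def index_terms(words):
    """Yield (stem, is_word_form) pairs, ready to become WordIndex entries."""
    for word in words:
        original = word.lower()
        stem = stemmer.stem(original)
        # IsWordForm is set when stemming changed the token, e.g. "jogging" -> "jog";
        # exact matches keep the flag off.
        yield stem, stem != original

print(list(index_terms(["Jogging", "jog", "dancing"])))
# [('jog', True), ('jog', False), ('danc', True)]  -- stems need not be dictionary words
```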

I'd like a small storage overhead (5-15%), and CPU is very precious. But for each file that's a lot of overhead, since every file will generate thousands of records in WordIndex, i.e.:

WordID, FileID, UserID, SettingsPatternID
WordID, FileID, UserID, SettingsPatternID
WordID, FileID, UserID, SettingsPatternID

... This is the longest table, and WordID is needlessly repeated.
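
For concreteness, a sketch of what WordIndex could look like in MySQL, with a file's rows bulk-inserted in one round trip (the composite key, data types, secondary index and connection details are assumptions, not recommendations):

```python
import mysql.connector  # pip install mysql-connector-python; credentials are placeholders

conn = mysql.connector.connect(host="localhost", user="indexer",
                               password="secret", database="docsearch")
cur = conn.cursor()

# A word appears in many files, so (WordID, FileID) identifies a row, not WordID alone.
cur.execute("""
    CREATE TABLE IF NOT EXISTS WordIndex (
        WordID            INT UNSIGNED NOT NULL,
        FileID            INT UNSIGNED NOT NULL,
        UserID            SMALLINT UNSIGNED NOT NULL,
        SettingsPatternID TINYINT UNSIGNED NOT NULL,
        PRIMARY KEY (WordID, FileID),
        KEY idx_user_word (UserID, WordID)
    ) ENGINE=InnoDB
""")

# One file yields thousands of rows; executemany keeps the insert round trips down.
rows = [(word_id, 42, 7, 1) for word_id in (101, 102, 103)]  # toy data
cur.executemany(
    "INSERT INTO WordIndex (WordID, FileID, UserID, SettingsPatternID) "
    "VALUES (%s, %s, %s, %s)", rows)
conn.commit()
conn.close()
```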

3) Hashing, with MySQL. Since we know it will be a search over words, a purely relational model might not be the best fit...

It may be more efficient to "hash" each word to a list of matching files. For example, give each word its own small table; you never need to "look up" the word, because the table itself tells you which word it is. Each word's table would look like this (an in-memory sketch of the same idea follows the tables):

Table *The Word*
FileID | UserID | SettingsPatternID
(There would be about 100,000 of these, one for each unique word.)

Table Settings
SettingsPatternID | Top (bool) | IsWordForm (bool)
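
The "hash each word to its posting list" idea, shown as an in-memory sketch rather than 100,000 physical tables (the names and tuple layout are illustrative only):

```python
from collections import defaultdict

# word -> list of (FileID, UserID, SettingsPatternID): one "table" per word
postings = defaultdict(list)

def index_file(file_id, user_id, words, settings_pattern_id=0):
    """Append this file to the posting list of every word it contains."""
    for word in set(words):            # set(): record each word once per file
        postings[word].append((file_id, user_id, settings_pattern_id))

def search(word, user_id):
    """Return this user's files that contain the word."""
    return [file_id for file_id, uid, _ in postings.get(word, []) if uid == user_id]

index_file(1, 7, ["annual", "report", "budget"])
index_file(2, 7, ["budget", "forecast"])
print(search("budget", 7))  # [1, 2]
```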

4) I've also looked at Solr, but I think it's overkill. Is that a bad assumption? While it supports PDF and DOC, it's also a fair bit of work to integrate... I almost feel it would be the same amount of work to do it myself, but of course, as a coder, I know that assumption is wrong too often...

Thoughts please!!!

Solution

4) I've also looked at Solr, but I think it's overkill. Is that a bad assumption? While it supports PDF and DOC, it's also a fair bit of work to integrate... I almost feel it would be the same amount of work to do it myself, but of course, as a coder, I know that assumption is wrong too often...

Definitely go with Solr: it is more costly to integrate, but it will be easier to set up and much easier to maintain.

Moreover, it already has many of the features you would otherwise have to implement (and debug, and maintain...) yourself.

I'd suggest, however, reviewing Solr's features, designing a basic interface around those features, and having it approved in writing. "Text searching" too often becomes an unspoken "I want the system to be able to read my mind". Also, explain that efficient text searching is not a "simple script": there are literally thousands of Ph.D. papers on semantics, stemming, relevance, proximity and so on, and many of them have found their way into Solr/Lucene.
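
To give a sense of what the integration amounts to, here is a rough sketch against Solr's HTTP API (the core name "docs", the "owner" and "content" field names, and the URLs are assumptions; the extracting handler and schema still have to be configured):

```python
import requests  # pip install requests

SOLR = "http://localhost:8983/solr/docs"

# Index a PDF: the /update/extract handler (Solr Cell) runs Tika to pull out the text,
# and literal.* parameters attach our own metadata, such as the owning user.
with open("/data/user123/report.pdf", "rb") as f:
    requests.post(f"{SOLR}/update/extract",
                  params={"literal.id": "user123/report.pdf",
                          "literal.owner": "user123",
                          "commit": "true"},
                  files={"file": f}).raise_for_status()

# Query: the filter query (fq) scopes the search to one user's documents, so a
# 100 MB account never touches the other accounts' entries in the index.
resp = requests.get(f"{SOLR}/select",
                    params={"q": "content:invoice", "fq": "owner:user123", "rows": 3})
for doc in resp.json()["response"]["docs"]:
    print(doc["id"])
```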

SolR is "overkill" if you assume that users might be satisfied by grep, both performance-wise, scalability-wise and result-wise. Trust me, they won't.

You may also try suggesting a Google Search Appliance. It helps establish a cost baseline: i.e., "if you want Google performance, this is the price of Google; any other ad hoc implementation without Google's economies of scale would cost far more to achieve the same level of performance".

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow