Question

I'm working with a mailing list archive and am tasked with setting up basic search, boolean search, and ultimately some sort of more intelligent tag-based searching.

I see both commercial products and some open-source projects (like Lucene.NET).

Has anyone else done any similar kind of work?

I'm working in Win2k3 server now, so the immediate thought was to use ASP Classic or ASP.NET. However, if there were another platform that was orders of magnitude better for the purpose, then I'd consider that as well. I'm not going to throw out something because of that ;)

Solution

Since you are setting up mail search, you will need two things: a search engine and a database. There are many search engines that offer what you need:

  • Sphinx
  • Solr (Lucene and Solr are merged now)
  • PostgreSQL (built-in search)

They provide advanced search tools like keywords, field-restricted search, boolean queries, phrase search and more. Here is another SO post looking into various text search engines: Comparison of full text search engine - Lucene, Sphinx, Postgresql, MySQL?

Sphinx and Solr are both fast at search. Sphinx does full database search and also supports partial indexing. Solr uses index-based search and scales with almost linear performance.
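To give a feel for what "index-based search" and boolean queries mean, here is a minimal sketch of an inverted index in plain Python. It is only an illustration and not how Sphinx or Solr are implemented; the sample messages and the function names (`tokenize`, `build_index`, `search`) are made up for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, all_ids, must=(), should=(), must_not=()):
    """Boolean search: AND over `must`, OR over `should`, NOT over `must_not`.

    Terms are expected in lowercase, matching what `tokenize` produces.
    """
    result = set(all_ids)
    for term in must:
        result &= index.get(term, set())
    if should:
        any_match = set()
        for term in should:
            any_match |= index.get(term, set())
        result &= any_match
    for term in must_not:
        result -= index.get(term, set())
    return sorted(result)


# Hypothetical sample messages, keyed by message id.
mails = {
    1: "Lucene indexing question",
    2: "Sphinx versus Lucene performance",
    3: "Off-topic: lunch on Friday",
}
index = build_index(mails)

# "lucene AND NOT sphinx" matches only message 1.
print(search(index, mails, must=["lucene"], must_not=["sphinx"]))
```

A real engine adds ranking, phrase matching and on-disk index structures on top of this idea, which is the part you probably do not want to write yourself.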

The second most important choice is the database where you store your mails. The mails will be stored in some format (schema), like fields in a table; it would be plain crazy not to use one, since this is not file search. Some search engines require particular databases to work: Sphinx uses SQL databases only, while Solr can be integrated with NoSQL databases.
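As a rough idea of what "fields in a table" could look like, here is a small sketch using Python's built-in `sqlite3` module, purely so the example runs without a server. The table name, column names and sample rows are assumptions for illustration; with PostgreSQL or another SQL database the schema would look much the same.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE mails (
        id      INTEGER PRIMARY KEY,
        sender  TEXT NOT NULL,
        subject TEXT NOT NULL,
        sent_at TEXT NOT NULL,   -- ISO 8601 date
        body    TEXT NOT NULL
    )
    """
)

# Hypothetical sample rows.
rows = [
    ("alice@example.org", "Lucene indexing question", "2009-03-01", "How do I reindex?"),
    ("bob@example.org", "Re: Lucene indexing question", "2009-03-02", "Delete the index first."),
    ("carol@example.org", "Lunch", "2009-03-02", "Anyone up for lunch?"),
]
conn.executemany(
    "INSERT INTO mails (sender, subject, sent_at, body) VALUES (?, ?, ?, ?)", rows
)


def search_subject(conn, word):
    """Field-restricted search: match `word` in the subject column only."""
    cur = conn.execute(
        "SELECT id, subject FROM mails WHERE subject LIKE ? ORDER BY id",
        (f"%{word}%",),
    )
    return cur.fetchall()


for mail_id, subject in search_subject(conn, "lucene"):
    print(mail_id, subject)
```

A `LIKE` query scans every row, so it is only a placeholder here; the search engine's index is what replaces it once the archive grows.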

If you are not worried about scaling issues (thousands of users, gigabytes of data, real-time performance requirements), you are fine with an SQL database. Otherwise you will have to use a NoSQL database with Solr.

SQL databases (like PostgreSQL) are the simplest to work with, do what you need, and require minimal setup effort. Connectors will allow you to send a query (the mail search) from the browser to your database.

Also, you said you use Win2k3. In my view you will have to switch to a Linux distribution to take full advantage of these search engines, since Win2k3 does not offer performance comparable to Linux distros.

Other tips

First, you should think about what you need.

  • What do you want to search in your e-mail archive? Just full-text search in the e-mails' plain data? Then you will not get matches in mails that are base64 encoded, for example. Do you need ‘fielded’ search? E.g.: search only in ‘subject’, ‘from’, ‘to’, ‘body’, ‘attachments’?
  • How do you want to provide access to search in the mails? Via a web page? On a command line? In some Windows program?

If you haven’t yet, you should examine what your data looks like. Is it in ‘mbox’ format (one file with the plain text of all mails concatenated), ‘maildir’ (a directory with many files, each containing one mail), or something else?
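If the archive turns out to be in mbox format, Python's standard `mailbox` module can read it directly (it also has a `mailbox.Maildir` class for maildir). A small sketch follows; the two sample messages, the file name and the helper `list_subjects` are made up for the example.

```python
import mailbox
import os
import tempfile

# Hypothetical two-message archive in mbox format: each message starts
# with a "From " separator line.
SAMPLE_MBOX = (
    "From alice@example.org Mon Mar  2 10:00:00 2009\n"
    "From: alice@example.org\n"
    "Subject: First post\n"
    "\n"
    "Hello list.\n"
    "\n"
    "From bob@example.org Mon Mar  2 11:00:00 2009\n"
    "From: bob@example.org\n"
    "Subject: Re: First post\n"
    "\n"
    "Welcome!\n"
)


def list_subjects(mbox_path):
    """Return the Subject header of every message in an mbox file."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [message["subject"] for message in box]
    finally:
        box.close()


# Write the sample archive to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "archive.mbox")
    with open(path, "w", newline="\n") as handle:
        handle.write(SAMPLE_MBOX)
    subjects = list_subjects(path)

print(subjects)
```

Running something like this over the real archive is a quick way to find out how many messages there are and how clean the headers look before you commit to a search product.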

Setting up a search engine means thinking about how the data needs to be prepared:

  • E-mails can contain different kinds of data. You will have to deal with base64-encoded data, character encodings such as UTF-8, and attachments.
  • Newsgroup mails may even be split across multiple e-mail messages.
  • If you want to search different ‘fields’ (‘subject’, ‘date’, ‘body’), they need to be extracted.
  • Data needs to be prepared by linguistic means. You will need to find out which language the mails are in (if there are several) and process the data, e.g. so that a search for mouse also matches mice and, perhaps, rats; or cursor and pointing device, depending on the topic of your mailing list.
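The decoding and field-extraction points above can be sketched with Python's standard `email` package. This is only an illustration under assumptions: the raw message is invented, `extract_fields` is a hypothetical helper, and attachments and multi-part newsgroup posts are not handled.

```python
import base64
from email import message_from_bytes, policy

# Hypothetical raw message: the subject is an RFC 2047 encoded word and
# the body is base64-encoded UTF-8 text.
RAW = b"\r\n".join([
    b"From: alice@example.org",
    b"Date: Mon, 02 Mar 2009 10:00:00 +0000",
    b"Subject: =?utf-8?b?" + base64.b64encode("Grüße".encode("utf-8")) + b"?=",
    b"MIME-Version: 1.0",
    b'Content-Type: text/plain; charset="utf-8"',
    b"Content-Transfer-Encoding: base64",
    b"",
    base64.b64encode("Grüße aus Köln".encode("utf-8")),
    b"",
])


def extract_fields(raw_bytes):
    """Parse one raw message and return decoded, searchable fields."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    # Prefer the plain-text part; None if the message has no such part.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    return {
        "from": str(msg["from"]),
        "date": str(msg["date"]),
        "subject": str(msg["subject"]),
        "body": body.strip(),
    }


fields = extract_fields(RAW)

# A naive byte search over the raw message misses the word,
# while the decoded body contains it.
print("Köln".encode("utf-8") in RAW)
print("Köln" in fields["body"])
```

The decoded fields are what you would feed to the search engine; linguistic processing such as stemming or synonyms would come after this step and usually depends on the engine you pick.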

Also think about:

  • Will there be updates to the data in the future?
  • Will there be deletes (including messages being relabeled later)?

Then compare the products you favour (commercial or open source): how much of this do they already provide, and what will you have to write yourself? Be aware that providing a search experience is more than downloading a search engine and dropping in a ton of data.

Licensed under: CC-BY-SA with attribution