Question

I've got Nutch and Lucene set up to crawl and index some sites, and I'd like to use a .NET website instead of the JSP site that comes with Nutch.

Can anyone recommend some solutions?

I've seen solutions where an app ran on the index server and the .NET site used remoting to connect to it.

Speed is obviously a consideration, so can this still perform well?

Edit: could NHibernate.Search work for this?

Edit: We ended up going with Solr index servers used by our ASP.NET site via the SolrNet library.


Solution

Instead of using Lucene directly, you could use Solr to index with Nutch (see here), then connect to Solr very easily using one of the two .NET client libraries available: SolrSharp and SolrNet.
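As a rough sketch of what the SolrNet side might look like (the CrawledPage class and the field names "id", "title" and "url" are assumptions about your Solr schema, and the Solr URL is a placeholder):

    // Hypothetical document class mapping Solr fields populated by a Nutch crawl.
    public class CrawledPage
    {
        [SolrNet.Attributes.SolrUniqueKey("id")]
        public string Id { get; set; }

        [SolrNet.Attributes.SolrField("title")]
        public string Title { get; set; }

        [SolrNet.Attributes.SolrField("url")]
        public string Url { get; set; }
    }

    // One-time wiring at application startup...
    SolrNet.Startup.Init<CrawledPage>("http://localhost:8983/solr");

    // ...then query from anywhere in the ASP.NET site.
    var solr = Microsoft.Practices.ServiceLocation.ServiceLocator.Current
        .GetInstance<SolrNet.ISolrOperations<CrawledPage>>();
    var results = solr.Query(new SolrNet.SolrQuery("nutch"));

Each result is a typed CrawledPage, so the site never has to touch the raw Lucene index.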

OTHER TIPS

In case it wasn't totally clear from the other answers, Lucene.NET and Lucene (Java) use the same index format, so you should be able to continue using your existing (Java-based) mechanisms for indexing, and then use Lucene.NET inside your .NET web application to query the index.

From the Lucene.NET incubator site:

In addition to the APIs and classes port to C#, the algorithm of Java Lucene is ported to C# Lucene. This means an index created with Java Lucene is back-and-forth compatible with the C# Lucene; both at reading, writing and updating. In fact a Lucene index can be concurrently searched and updated using Java Lucene and C# Lucene processes.
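So querying the Java-built index from .NET can be as simple as the sketch below (assuming Lucene.NET 2.x-era APIs; the index path and the "content"/"url" field names are placeholders for whatever your Nutch job actually wrote):

    // Open the index directory that the Java-based crawl wrote to disk.
    var searcher = new Lucene.Net.Search.IndexSearcher(@"C:\indexes\nutch");

    // Parse the user's query against an assumed "content" field.
    var parser = new Lucene.Net.QueryParsers.QueryParser(
        "content", new Lucene.Net.Analysis.Standard.StandardAnalyzer());
    var query = parser.Parse("nutch crawler");

    // Run the search and read stored fields from the hits.
    Lucene.Net.Search.Hits hits = searcher.Search(query);
    for (int i = 0; i < hits.Length(); i++)
    {
        Console.WriteLine(hits.Doc(i).Get("url"));
    }
    searcher.Close();

The analyzer on the .NET side should match the one used at index time, or queries won't tokenize the same way.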

I'm also working on this.

http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

It seems you can submit your query to Nutch and get the RSS results back.

edit:

Got this working today in a Windows Forms app as a proof of concept: two textboxes (searchurl and query), one for the server URL and one for the query, plus one DataGridView.

    private void Form1_Load(object sender, EventArgs e)
    {
        // Default to the local Nutch OpenSearch endpoint
        searchurl.Text = "http://localhost:8080/opensearch?query=";
    }

    private void search_Click(object sender, EventArgs e)
    {
        // Build the OpenSearch request URI; escape the query so
        // spaces and special characters survive the round trip
        string uri = searchurl.Text + Uri.EscapeDataString(query.Text);
        Console.WriteLine(uri);

        // Load the RSS response and flatten it into a DataSet
        XmlDocument myXmlDocument = new XmlDocument();
        myXmlDocument.Load(uri);

        DataSet ds = new DataSet();
        ds.ReadXml(new XmlNodeReader(myXmlDocument));

        // Bind the <item> elements (one per hit) to the grid
        SearchResultsGridView1.DataSource = ds;
        SearchResultsGridView1.DataMember = "item";
    }

Got here by searching for a comparison between SolrNet and SolrSharp; just thought I'd leave my impressions here.

It seems like SolrSharp is a dead project (it hasn't been updated in a long time), so the only option is SolrNet.

I hope this will help someone, I would have left a comment to the accepted answer but I don't have enough reputation yet :)

Instead of using Solr, I wrote a Java-based indexer that runs in a cron job, and a Java-based web service for querying. I actually didn't index pages so much as the different types of data the .NET site uses to build its pages. So there are actually four different indexes, each with a different document structure, that can all be queried in roughly the same way (say: users, posts, messages, photos).

By defining an XSD for the web service responses, I was able to generate classes in both .NET and Java to hold a representation of the documents. The web service basically runs the query against the right index and fills out the response XML from the hits; the .NET client parses that back into objects. There's also a JSON interface for any client-side JavaScript.
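As an illustration of the .NET side of that pattern, a class generated from such an XSD (for example with xsd.exe) can be deserialized straight from the service response. The SearchResponse shape and the service URL below are hypothetical:

    using System.Xml.Serialization;

    // Hypothetical classes generated from the shared XSD.
    [XmlRoot("searchResponse")]
    public class SearchResponse
    {
        [XmlElement("hit")]
        public Hit[] Hits { get; set; }
    }

    public class Hit
    {
        [XmlElement("id")]    public string Id { get; set; }
        [XmlElement("title")] public string Title { get; set; }
    }

    // Deserialize the web service's XML response into typed objects.
    var serializer = new XmlSerializer(typeof(SearchResponse));
    using (var stream = new System.Net.WebClient().OpenRead(
        "http://indexhost/search?index=posts&q=nutch"))
    {
        var response = (SearchResponse)serializer.Deserialize(stream);
    }

Because both sides are generated from the same XSD, the Java service and the .NET client can't silently drift apart on the document structure.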

Why not switch from Java Lucene to the .NET version? Sure, it's an investment, but it's mostly a class-substitution exercise. The last thing you need is more layers that add no value beyond being glue. Less glue and more stuff is what you should aim for...

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow