Question

I have set up the Nutch search engine to crawl websites. Now I need to write a PHP API to talk to it. I need to do two things:

  1. Using a PHP script, I need to tell Nutch which URLs to crawl (for this I have some pointers from http://www.cs.sjsu.edu/faculty/pollett/masters/Semesters/Fall07/sheetal/?Deliverable2.html).

  2. Using a PHP script, I need to retrieve the crawl results from the Nutch crawl DB. I can't seem to find any help on this (or I might be too dumb to see the answer if it's already there :( ).

If anyone has used a PHP API to read Nutch crawl results, please share some pointers with me.

Desperately waiting for some help.


Solution

I'm looking for a really good way to do this too. For now, I'm using a JSP API to display search results. This should start you off.

Alternatively, you could use PHP to receive your results as JSON objects.

To kick you off in this direction, there's an interesting page to get you started on JSON using jQuery. Google for other tutorials on JSON; there are plenty of them.

OTHER TIPS

For your question #1, you need to inject these URLs into the crawler. It's relatively simple:

  • create a file with the URLs you want added
  • issue the inject command with these URLs (you may need to wait for the end of the previous crawl/fetch/index cycle)
  • start a new crawl

Note: you also need to make sure the URLs are not filtered out (for example, by the rules in conf/regex-urlfilter.txt).
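
The inject steps above can be sketched roughly like this. It's in Python just to keep it short; the same two steps map onto PHP's file_put_contents and exec. Everything named here is an assumption for illustration: the seed.txt file name, the /opt/nutch install path, the crawl/crawldb location, and the Nutch 1.x style `bin/nutch inject <crawldb> <url_dir>` argument order. Check the usage message of your own Nutch version before running anything.

```python
import os
import tempfile


def write_seed_file(urls, seed_dir):
    """Write one URL per line to seed_dir/seed.txt and return the file path."""
    os.makedirs(seed_dir, exist_ok=True)
    seed_path = os.path.join(seed_dir, "seed.txt")  # hypothetical file name
    # Drop blank entries and duplicates while keeping the original order.
    unique_urls = list(dict.fromkeys(u.strip() for u in urls if u.strip()))
    with open(seed_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(unique_urls) + "\n")
    return seed_path


def build_inject_command(nutch_home, crawldb, seed_dir):
    """Return the argv list for an inject call (assumed Nutch 1.x argument order)."""
    return [os.path.join(nutch_home, "bin", "nutch"), "inject", crawldb, seed_dir]


if __name__ == "__main__":
    seed_dir = os.path.join(tempfile.mkdtemp(), "urls")
    write_seed_file(["http://example.com/", "http://example.org/"], seed_dir)
    # Hypothetical paths; hand the list to subprocess.run(...) to actually execute it.
    print(build_inject_command("/opt/nutch", "crawl/crawldb", seed_dir))
```

The command is only built, not executed, so you can log or inspect it first and run it once the previous crawl cycle has finished.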

You need to use Solr or another search platform for search; Nutch is just a crawler. The idea is simple:

  • Nutch for crawling
  • Solr to create an index
  • build an interface to search inside the index (your #2). I used SolariumBundle for this step.
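
To make the last bullet a bit more concrete, here is a minimal sketch of querying the Solr index over HTTP and reading the JSON that comes back. It's in Python for brevity; in PHP the same thing is roughly http_build_query plus json_decode, or a client library like the one mentioned above. The assumptions: Solr is reachable at http://localhost:8983/solr/nutch (the core name is made up), and the indexed documents carry `url` and `title` fields. The function names are only illustrative.

```python
import json
from urllib.parse import urlencode


def build_select_url(solr_base, terms, rows=10):
    """Build a Solr /select URL that asks for JSON output."""
    params = {"q": terms, "wt": "json", "rows": rows}
    return solr_base.rstrip("/") + "/select?" + urlencode(params)


def extract_hits(response_text):
    """Pull (url, title) pairs out of a Solr JSON response body."""
    docs = json.loads(response_text)["response"]["docs"]
    # The field names are an assumption about the index schema.
    return [(doc.get("url"), doc.get("title")) for doc in docs]


if __name__ == "__main__":
    # Hypothetical Solr location; fetch this URL with any HTTP client.
    print(build_select_url("http://localhost:8983/solr/nutch", "nutch php", rows=5))

    # A hand-written sample body in Solr's usual response shape, not real crawl data.
    sample = (
        '{"response": {"numFound": 1, "start": 0, '
        '"docs": [{"url": "http://example.com/", "title": "Example"}]}}'
    )
    print(extract_hits(sample))
```

Building the URL and parsing the response are kept separate so the parsing can be tried against a saved response before a live Solr instance is involved.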

Regarding #2, Nutch is written in Java and JSP, and I don't know of any PHP implementation (if you find one, I'm interested). So basically you need to create an AJAX or SOAP kind of communication scheme between your PHP script and the Nutch server. Have you tried the Nutch mailing list for help?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow