Question

EDIT: I found my answer and wrote it up below, but gave the bounty to tahagh, since he provided some good suggestions.


I am setting up nutch to crawl a local folder (a samba mount). I have followed this tutorial.

My folder looks like this:

nutch@ubuntu:~$ ls /mnt/ntserver/
expansion.docx  test-folder  test-shared.txt

with some more files and folders below test-folder.

When I run nutch, it doesn't index the files or the subfolder. It only puts a single document into solr, which is the index of the folder. This is what I get in solr after running nutch on an empty solr index:

"response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "content": [
          "Index of /mnt/ntserver Index of /mnt/ntserver ../ - - - expansion.docx Mon, 30 Dec 2013 14:00:42 GMT 70524 test-folder/ Fri, 17 Jan 2014 09:38:50 GMT - test-shared.txt Thu, 16 Jan 2014 11:33:42 GMT 16"
        ],
      .....

How can I get nutch to index the files and the subfolders?


Edit: if I set regex-urlfilter to allow everything (after filtering out gifs, http, etc.) with the rule +., then nutch seems to go up the folder hierarchy, but not down, and it still crawls only the directory indexes, not the files. This is what I get in solr:

"response": {
    "numFound": 26,
    "start": 0,
    "docs": [
      {
        "title": [
          "Index of /"
        ]
      },
      {
        "title": [
          "Index of /bin"
        ]
      },
      ...
      {
        "title": [
          "Index of /mnt"
        ]
      },
      {
        "title": [
          "Index of /mnt/ntserver"
        ]
      },
      ...
    ]

Additional info:

This is the crawl command I use:

apache-nutch-1.7/bin/nutch crawl -dir fileCrawl -urls apache-nutch-1.7/urls/ -solr http://localhost:8983/solr -depth 3 -topN 10000

This is the content of my seed urls file:

nutch@ubuntu:~$ cat apache-nutch-1.7/urls/urls_to_be_crawled.txt 
file:////mnt/ntserver

This is my regex-urlfilter.txt:

nutch@ubuntu:~$ cat apache-nutch-1.7/conf/regex-urlfilter.txt
# skip http: ftp: and mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|asp|ASP|xxx|XXX|yyy|YYY|cs|CS|dll|DLL|refresh|REFRESH)$

# accept any files
+.*mnt/ntserver.*

I have included protocol-file and set no limit on file size in nutch-site.xml:

nutch@ubuntu:~$ cat apache-nutch-1.7/conf/nutch-site.xml
...
<property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|index-more<!--|remove-empty-document|title-adder--></value>
    <description></description>
</property>

<property>
    <name>file.content.limit</name>
    <value>-1</value>
    <description> Needed to stop buffer overflow errors - Unable to read.....</description>
</property>

...

and I have commented out the duplicate slash removal in regex-normalize.xml:

nutch@ubuntu:~$ cat apache-nutch-1.7/conf/regex-normalize.xml
...
<!-- removes duplicate slashes - commented out, so we won't get invalid filenames 
<regex>
    <pattern>(?&lt;!:)/{2,}</pattern>
    <substitution>/</substitution>
</regex>
-->
...

Solution 2

I found out that in order to crawl a local file system, you have to add a trailing slash to the seed url; otherwise nutch will not identify the last part of the path as a directory.

So I changed my seed url from

file:////mnt/ntserver

to

file:////mnt/ntserver/

and then things worked.


More details:

If for instance I had the file test.txt under my /mnt/ntserver and had file:////mnt/ntserver as my seed url, then nutch would correctly parse the index of /mnt/ntserver, and find out that there was a file called test.txt, but then it would try to fetch the file /mnt/test.txt. After adding the trailing slash to the seed url, making it file:////mnt/ntserver/, nutch now tried to fetch the file /mnt/ntserver/test.txt, solving my problem.
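The resolution behaviour described above is just standard relative-URL resolution. A minimal sketch (using Python's urljoin rather than Nutch itself, and the standard three-slash form of the file scheme) shows why the trailing slash matters:

```python
from urllib.parse import urljoin

# Without a trailing slash, "ntserver" looks like a file name,
# so a relative link replaces it:
print(urljoin("file:///mnt/ntserver", "test.txt"))
# file:///mnt/test.txt

# With a trailing slash, "ntserver" is treated as a directory,
# so the relative link resolves below it:
print(urljoin("file:///mnt/ntserver/", "test.txt"))
# file:///mnt/ntserver/test.txt
```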

Incidentally, in order to stop nutch from going up the folder tree towards the root, I set file.crawl.parent to false in nutch-default.xml, but it could also be done via regex-urlfilter.txt.
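For reference, the property uses the usual Nutch property syntax; a sketch of the override (the property name comes from nutch-default.xml, and overrides conventionally go in nutch-site.xml; the description text here is my own):

```xml
<property>
    <name>file.crawl.parent</name>
    <value>false</value>
    <description>Stop nutch from crawling up to parent directories.</description>
</property>
```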

Other tips

Investigating the File and FileResponse sources, I found the following:

  1. There is a configuration parameter named "file.crawl.parent" which controls whether nutch should also crawl the parent of a directory. By default it is true.
  2. In this implementation, when nutch encounters a directory, it generates the list of files in it as a set of hyperlinks in the content; otherwise it reads the file content. Nutch uses File.isDirectory() to determine whether the given path is a directory, so check that your path is really interpreted as one.
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow