Question

I am following the tutorial here, trying to build a robot that crawls a website.

I am on a page that contains all the product categories. Say it is www.example.com/allproducts.

After diving into each category, you can see the product list in a table format, and you can click "next page" to loop through all the pages inside that category. Actually, you can only see pages 1, 2, 3, 4, 5, and the last page.

The first page in a category has a URL that looks like www.example.com/level1/level2/_/N-1, the second page looks like www.example.com/level1/level2/_/N-1/?No=100, and so on and so forth.

I personally don't have that much Java programming experience, and I am wondering:

can I crawl all the product list pages using Nutch and store the HTML for now,

and maybe later figure out a way to parse the HTML and index it correctly?

(1) Can I just modify conf/regex-urlfilter.txt and replace

# accept anything else
+. 

with something correct? (I just don't understand how

+^http://([a-z0-9]*\.)*nutch.apache.org/

only restricts the URLs to the Nutch domain... I would interpret that regular expression as saying that between the double slash and "nutch" there can be any characters that are alphanumeric, or an asterisk, backslash, or dot.)

How can I build the regular expression so it only scrapes http://www.example.com/.../.../_/N-../...?

(2) I can see the HTML is stored in the content folder inside the segment... However, when I open that file in vi, it looks like total nonsense to me... and I am wondering if that is the so-called Java serialization, which I need to deserialize in Java in order to read it.

Forgive me if those questions are too basic and thanks a lot for reading.


Solution

(1) Can I just modify conf/regex-urlfilter.txt and replace

Sure. You should replace +. with these lines:

# accept the all-products page
+www\.example\.com/allproducts

# accept category pages
+www\.example\.com/level1/level2/_/N-
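
If you want the rules anchored to the start of the URL rather than matching anywhere (see the note about partial matching below), you can add ^ as in the nutch.apache.org example from your question. A sketch, assuming the two levels can be any path segment without a slash:

+^http://www\.example\.com/allproducts
+^http://www\.example\.com/[^/]+/[^/]+/_/N-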

One important note about the regexes in this file: they are matched partially. So if you write a rule like +ab, it means: accept every URL that contains "ab".
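
For instance (hypothetical URLs), a +ab rule would accept all of these, since each contains the substring "ab":

http://ab.example.com/
http://www.example.com/ab/index.html
http://www.example.com/tabs.html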

By default, Nutch filters out URLs containing ? (since they are mostly dynamic pages). To prevent this, comment out this line in your regex-urlfilter.txt file:

-[?*!@=]
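
This matters here because your paginated category pages use ?No=100 style URLs, which this rule would otherwise drop. Commented out, the entry looks like this:

# -[?*!@=]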

(2) I can see the HTML ...

Nutch saves the fetched content in a binary format (Hadoop sequence/map files), not as plain HTML files, which is why it looks like gibberish in vi. See https://stackoverflow.com/a/10150402/1881318
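
As a rough sketch of how to read it back in Java (the segment path here is hypothetical, and this assumes the classic Hadoop 1.x SequenceFile.Reader constructor used by Nutch 1.x):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class SegmentContentReader {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical segment name -- substitute your own segment timestamp.
    Path data = new Path("crawl/segments/20130101000000/content/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();           // key: the page URL
    Content content = new Content(); // value: the fetched content plus metadata
    while (reader.next(url, content)) {
      System.out.println("URL: " + url);
      // content.getContent() returns the raw bytes -- the HTML you saw in vi.
      System.out.println(new String(content.getContent(), "UTF-8"));
    }
    reader.close();
  }
}

Alternatively, the readseg tool bundled with Nutch dumps a segment to plain text without any coding: bin/nutch readseg -dump crawl/segments/20130101000000 dumpdir (again, substitute your real segment path).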

Licensed under: CC-BY-SA with attribution