Apache Nutch: Get outlink URL's text context

https://stackoverflow.com/questions/22283624

11-06-2023

Question

Does anyone know an efficient way to extract the text context that surrounds an outlink URL? For example, given this sample text containing an outlink:

Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster. You can download Nutch here. For more information about Apache Nutch, please see the Nutch wiki.

In this example, I would like to get the sentence containing the link, plus the sentence before and the sentence after it. Is there a way to do this efficiently? Are there any methods I can invoke to get something like the position of the link within the fetched content, or a part of the Nutch code I can modify to do this? Thanks!

Solution

What you want to do is web scraping. Python and Hadoop both offer tools for that. To achieve it, you can use selectors.
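As a rough illustration of the idea (locate each link in the page text, then take the surrounding sentences), here is a minimal sketch using only the Python standard library. It is not Nutch or Scrapy code: the names `LinkTextExtractor` and `link_contexts`, the `example.org` URLs, and the naive sentence splitter are all assumptions made for this example, and it does not filter out `<script>` or `<style>` content.

```python
import re
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Collect page text and record where each <a href> anchor text sits in it."""

    def __init__(self):
        super().__init__()
        self.parts = []    # text fragments in document order
        self.length = 0    # length of the text collected so far
        self.links = []    # (href, start, end) character offsets into the text
        self._open = None  # (href, start) of the anchor currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (href, self.length)

    def handle_data(self, data):
        # Collapse whitespace runs so offsets refer to the normalized text.
        data = re.sub(r"\s+", " ", data)
        self.parts.append(data)
        self.length += len(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.links.append((href, start, self.length))
            self._open = None


def link_contexts(html, window=1):
    """Return (href, context) pairs: the sentence holding each link
    plus `window` sentences before and after it."""
    parser = LinkTextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Naive splitter (assumption): a sentence ends at . ! or ? followed by
    # whitespace or the end of the text. Abbreviations will confuse it.
    pattern = re.compile(r"\S.*?(?:[.!?](?=\s|\Z)|\Z)", re.S)
    spans = [(m.start(), m.end()) for m in pattern.finditer(text)]
    contexts = []
    for href, start, end in parser.links:
        # Sentences that overlap the anchor text; anchors without text are skipped.
        hit = [i for i, (s, e) in enumerate(spans) if s < end and start < e]
        if not hit:
            continue
        lo = max(hit[0] - window, 0)
        hi = min(hit[-1] + window, len(spans) - 1)
        contexts.append((href, text[spans[lo][0]:spans[hi][1]]))
    return contexts


# Sample text from the question; the link targets are placeholders.
SAMPLE_HTML = (
    "<p>Nutch can run on a single machine, but gains a lot of its strength "
    "from running in a Hadoop cluster. You can download Nutch "
    '<a href="http://example.org/download">here</a>. For more information '
    "about Apache Nutch, please see the "
    '<a href="http://example.org/wiki">Nutch wiki</a>.</p>'
)

for href, context in link_contexts(SAMPLE_HTML):
    print(href, "->", context)
```

Recording character offsets while the text is collected is what gives you "the position of the link within the fetched content"; the `window` parameter controls how many neighbouring sentences are kept on each side.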

Here you can find some examples of how to do that using Python Scrapy:

On Hadoop, the best way to go is to implement a crawler that uses selectors:

Cascading can be used to address the URLs you specify:

Hadoop and Cascading

Once you have the data, you can also use R to analyze it:

If you haven't done anything with Hadoop yet, here is a good starting point. You may also want to have a look at HUE Beeswax, an interactive tool that is very useful for data analysis.

Licensed under: CC-BY-SA with attribution
