Pattern matching in a Java web crawler using the crawler4j library

https://stackoverflow.com/questions/15007462

10-03-2022

Question

I want to implement a very simple web crawler in Java, and I have found this library, crawler4j: http://code.google.com/p/crawler4j/

I need a crawler that does the following:

Start from a URL that I specify and check whether the current page contains a specific word, such as a person's name or a company name (the word is also specified by me).

If the word is found, the URL of the current page has to be saved in a database.

So there is no semantic analysis, only syntactic analysis: the crawler just has to match the page content against tokens that I specify.

I would like to know whether this token search (checking whether a word is contained in the current page) is a feature already implemented by the WebCrawler class of crawler4j, or whether I have to implement it myself.

Solution

As noted by user1887511 it is dead simple to implement. Adapted from here.

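  // Goes inside your own class that extends WebCrawler.
  import edu.uci.ics.crawler4j.crawler.Page;
  import edu.uci.ics.crawler4j.parser.HtmlParseData;

  static String wordToFind = "...";

  // Called by crawler4j for each page that has been fetched and parsed.
  @Override
  public void visit(Page page) {
      if (page.getParseData() instanceof HtmlParseData) {
          HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
          String text = htmlParseData.getText();  // page text without markup
          if (text.contains(wordToFind)) {
              // saveToDB is a placeholder for your own persistence code.
              saveToDB(page.getWebURL().getURL());
          }
      }
  }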

Other suggestions

You have to implement it yourself. A starting point in the code is the visit() method of your WebCrawler subclass: it is called when a page has been visited, and the parsed page is passed to you. You can then do whatever you want with the page text, for instance match it against regex patterns.
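As a rough illustration of the regex idea, here is a minimal sketch that uses only java.util.regex and no crawler4j classes. The class name PageTextMatcher and its methods are hypothetical, invented for this example. It assumes you want a case-insensitive, whole-word match, which is stricter than a plain substring test:

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of crawler4j: checks whether a page's
// text contains a given word as a whole word, ignoring case.
final class PageTextMatcher {
    private final Pattern pattern;

    PageTextMatcher(String word) {
        // Pattern.quote makes the word match literally, so characters such as
        // '.' or '+' in a company name are not treated as regex operators.
        // The \b anchors restrict the match to word boundaries.
        this.pattern = Pattern.compile(
                "\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    // Returns true if the word occurs anywhere in the given text.
    boolean matches(String pageText) {
        return pageText != null && pattern.matcher(pageText).find();
    }
}
```

Inside visit() you could then build one matcher up front and call matcher.matches(htmlParseData.getText()) in place of the substring check. Note that the \b anchors assume the word begins and ends with a letter, digit, or underscore; for tokens that start or end with punctuation, a different boundary rule would be needed.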

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow