Question

I am implementing a web crawler using the Crawler4j library, but it is not finding all the links on a web site. I tried to extract all the links on one page using Crawler4j and it missed some of them.

Crawler4j version: crawler4j-3.3

URL I used: http://testsite2012.site90.com/frontPage.html

No. of links on this page: about 60, of which 4-5 are duplicates

No. of links crawler4j gave: 23

Here is the list of URLs on the page, and here is the list of URLs returned by Crawler4j.

I looked into the 'HtmlContentHandler.java' file that crawler4j uses to extract the links. In it, only links associated with 'src' and 'href' attributes are extracted.

I compared the two lists. Crawler4j is missing the links that are not associated with an 'src' or 'href' attribute and those that appear under the 'script' tag. Here is the list of links that crawler4j didn't crawl.

How can I extract all the links on this page? Do I need to do string manipulation (like finding 'http') on the parsed HTML page, or should I change the code of the 'HtmlContentHandler.java' file?

Which is the best way?

Even if I do string manipulation and extract all the links on this page, Crawler4j still crawls the website using only the links it extracted itself, so won't it miss some pages in that case?

Solution

Try using regular expressions to locate the links.

You can look here for an example.
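
As a rough illustration, here is a minimal sketch of regex-based extraction in plain Java. The class name and pattern are only examples, not part of crawler4j; the idea is to run a permissive URL pattern over the raw page HTML (for instance the string crawler4j gives you from HtmlParseData.getHtml()), so that URLs embedded inside 'script' blocks are picked up as well.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {

    // Deliberately permissive: matches http/https URLs anywhere in the
    // document, including inside <script> blocks and inline JavaScript.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "https?://[\\w\\-._~:/?#\\[\\]@!$&'()*+,;=%]+",
            Pattern.CASE_INSENSITIVE);

    // Returns every URL-looking string found in the given HTML,
    // preserving order and dropping duplicates.
    public static Set<String> extractUrls(String html) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher matcher = URL_PATTERN.matcher(html);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }
}
```

Keep in mind that crawler4j only follows the URLs its own parser discovered, so any extra links found this way would have to be fed back into the crawl yourself, for example by collecting them and seeding a follow-up crawl with CrawlController.addSeed.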
