Question

I am using crawler4j to crawl a website. When I visit a page, I would like to get the link text of all the links, not only the full URLs. Is this possible?

Thanks in advance.


Solution

In your class derived from WebCrawler, get the contents of the page inside visit(Page page) and apply a regular expression:

// Inside visit(Page page); requires java.util.regex.* and java.util.* imports.
Map<String, String> urlLinkText = new HashMap<String, String>();
// Charset.forName avoids the checked UnsupportedEncodingException that
// new String(byte[], String) would throw.
String content = new String(page.getContentData(),
        Charset.forName(page.getContentCharset()));
Pattern pattern = Pattern.compile("<a[^>]*href=\"([^\"]*)\"[^>]*>([^<]*)</a[^>]*>", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(content);
while (matcher.find()) {
    // group(1) is the href URL, group(2) is the link text between the tags.
    urlLinkText.put(matcher.group(1), matcher.group(2));
}

Then store urlLinkText somewhere you can reach it once the crawl is complete; for example, make it a private member of your crawler class and add a getter.
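As a quick sanity check, the same regex can be exercised outside the crawler on a small hand-written HTML fragment (the fragment and URLs below are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        // Hypothetical page content standing in for page.getContentData().
        String content = "<p>See <a href=\"/docs\">the docs</a> and "
                       + "<A HREF=\"http://example.com\">Example</A>.</p>";

        Map<String, String> urlLinkText = new HashMap<String, String>();
        Pattern pattern = Pattern.compile(
                "<a[^>]*href=\"([^\"]*)\"[^>]*>([^<]*)</a[^>]*>",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
            // Map each href to its visible link text.
            urlLinkText.put(matcher.group(1), matcher.group(2));
        }

        System.out.println(urlLinkText.get("/docs"));              // the docs
        System.out.println(urlLinkText.get("http://example.com")); // Example
    }
}
```

Note that CASE_INSENSITIVE also matches uppercase tags like `<A HREF=...>`, as in the second link above. A regex like this handles simple anchors but will miss links whose text contains nested tags; for messy real-world HTML a proper parser is more robust.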

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow