Question

I am using crawler4j to crawl a website. When I visit a page, I would like to get the link text of all the links, not only the full URLs. Is this possible?

Thanks in advance.


Solution

In your class derived from WebCrawler, get the contents of the page inside visit(Page page) and apply a regular expression:

// Inside visit(Page page); requires java.util.regex.* and java.util.* imports.
Map<String, String> urlLinkText = new HashMap<String, String>();
// Charset.forName avoids the checked UnsupportedEncodingException that
// new String(byte[], String) would throw.
String content = new String(page.getContentData(),
        Charset.forName(page.getContentCharset()));
Pattern pattern = Pattern.compile("<a[^>]*href=\"([^\"]*)\"[^>]*>([^<]*)</a[^>]*>", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(content);
while (matcher.find()) {
    // group(1) is the href URL, group(2) is the link text between the tags.
    urlLinkText.put(matcher.group(1), matcher.group(2));
}

Then store urlLinkText somewhere you can reach it once the crawl is complete; for example, make it a private member of your crawler class and add a getter.
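As a quick sanity check, the same regex can be exercised outside the crawler on a small hand-written HTML fragment (the fragment and URLs below are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        // Hypothetical page content standing in for page.getContentData().
        String content = "<p>See <a href=\"/docs\">the docs</a> and "
                       + "<A HREF=\"http://example.com\">Example</A>.</p>";

        Map<String, String> urlLinkText = new HashMap<String, String>();
        Pattern pattern = Pattern.compile(
                "<a[^>]*href=\"([^\"]*)\"[^>]*>([^<]*)</a[^>]*>",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
            // Map each href to its visible link text.
            urlLinkText.put(matcher.group(1), matcher.group(2));
        }

        System.out.println(urlLinkText.get("/docs"));              // the docs
        System.out.println(urlLinkText.get("http://example.com")); // Example
    }
}
```

Note that CASE_INSENSITIVE also matches uppercase tags like `<A HREF=...>`, as in the second link above. A regex like this handles simple anchors but will miss links whose text contains nested tags; for messy real-world HTML a proper parser is more robust.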

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow