Frage

I'd like do this programmatically:

Given a page URL, I need to get all links on the page. What's important is that at least 3 pieces of link info must be obtained: anchor text, href attribute value, absolute position of the link on the page.

Java CSSBox library is an option, but it's not fully implemented yet(the href attribute value cannot be obtained at the same time and some extra mapping must be done with additional library such as Jsoup). What's more, the CSSBox library renders a page really slow.

It seems that Javascript has all functions available but we have to inject the javascript code into the page and write a driver to take advantage of existing browsers. Scripting languages such as Python and Ruby have support for this as well. It is hard for me to find out the most handy tool.

War es hilfreich?

Lösung

Does PHP's DOM manipulation library help you? http://www.php.net/manual/en/book.dom.php

Lizenziert unter: CC-BY-SA mit Zuschreibung
Nicht verbunden mit StackOverflow
scroll top