Replacement for Jaxer for parsing/crawling websites

https://stackoverflow.com/questions/9375920

28-10-2019
|

Frage

I have an old tool an (ex-)colleague wrote a few years back with Jaxer, that I'd like to replace/rewrite.

Jaxer is an (abandoned) server-side framework based on a headless Mozilla/Gecko-Browser allowing you to use JavaScript and the DOM server-side.

Since Jaxer is abandoned and because I have big problems installing and running Aptana Studio 1.5 with Jaxer on a new computer, I'm looking for a library/framework/something on which I can base a new version.

This tool is only run locally inside Aptana Studio (the IDE for Jaxer) and was never intended to be an actual web app. It crawls our customers websites by loading them page by page into the server-side Mozilla. In order to do that it uses jQuery and predefined CSS selectors to find the links in the menus and parse other information out of the pages. The final result is basically a glorified sitemap.

I'd like to keep this modus operandi if possible and continue using jQuery/JavaScript/the DOM to load and parse/access the pages, but it can be wrapped in a framework based on another language such as Java. I considered writing something based on Gecko myself, but that seems a bit over the top, so I'm open for an other suggestions.

Lösung

As far as HTML crawling/parsing goes: http://ccil.org/~cowan/XML/tagsoup/

http://jsoup.org/

Lizenziert unter: CC-BY-SA mit Zuschreibung

Nicht verbunden mit StackOverflow