Question

I want to save the whole content of this specific web page using lynx:

http://build.chromium.org/f/chromium/perf/dashboard/ui/changelog.html?url=%2Ftrunk%2Fsrc&range=41818%3A40345&mode=html

I used these commands:

webpage="http://build.chromium.org/f/chromium/perf/dashboard/ui/changelog.html?url=%2Ftrunk%2Fsrc&range=41818%3A40345&mode=html"

lynx -crawl -dump "$webpage" > output

My output was only this:

SVN path: ____________________ SVN revision range: ____________________

I expected it to contain all the information about bugs and comments.

The URL includes the values "/trunk/src" and "41818:40345", which should be filled into the SVN path and SVN revision range fields and then submitted to fetch the content, but that didn't happen.

Question: Do you have any idea how to "tell" lynx to wait until the page has finished rendering its content?

Thanks in advance.


Solution

The problem here is that the web page is built by a JavaScript function. Such pages can be tricky to download with tools like lynx (or curl, which IMHO is better at the basic download problem). To download the contents you see on that page, you would first need to load the JavaScript files the page needs and then execute that JavaScript "as though you were a browser". The JavaScript then requests some data, which turns out to be XML, and builds HTML from it.

Note that the "website" doesn't render its data; your browser does. Or, to be more accurate, your browser is expected to render it, but lynx won't because it doesn't run JavaScript.

So you have a couple of options. You could try to find a scriptable JavaScript-aware browser (IIRC links does JavaScript, but I don't know offhand how to script it to do what you want).

Or you can cheat. Using Chrom{e,ium}'s "developer" tools, you can see which URL the JavaScript requests. In this case it turns out to be:

http://build.chromium.org/cgi-bin/svn-log?url=http://src.chromium.org/svn//trunk/src&range=41818:40345

so you could get it with curl as follows:

curl -G \
     -d url=http://src.chromium.org/svn//trunk/src \
     -d range=41818:40345 \
     http://build.chromium.org/cgi-bin/svn-log \
     > 41818-40345.xml
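
If you need this for more than one changelog page, the data URL can be derived from the page URL. The following is only a sketch: it assumes the page's script simply prefixes the "url" parameter with http://src.chromium.org/svn/ and passes "range" through unchanged, which matches the one request observed above but is not otherwise confirmed. The helper names query_param and decode_minimal are made up for illustration, and the decoder handles only the two escapes that occur here.

```shell
#!/bin/sh
# Sketch: derive the svn-log data URL from the changelog page URL.
# Assumption: the mapping below matches what the page's JavaScript does.

page='http://build.chromium.org/f/chromium/perf/dashboard/ui/changelog.html?url=%2Ftrunk%2Fsrc&range=41818%3A40345&mode=html'

# Print the value of one query-string parameter, still percent-encoded.
# $1 = full URL, $2 = parameter name
query_param() {
    printf '%s\n' "${1#*\?}" | tr '&' '\n' | sed -n "s/^$2=//p" | head -n 1
}

# Decode only %2F and %3A; this is not a general URL decoder.
decode_minimal() {
    printf '%s\n' "$1" | sed -e 's|%2[Ff]|/|g' -e 's|%3[Aa]|:|g'
}

svn_path=$(decode_minimal "$(query_param "$page" url)")
range=$(decode_minimal "$(query_param "$page" range)")

data_url="http://build.chromium.org/cgi-bin/svn-log?url=http://src.chromium.org/svn/${svn_path}&range=${range}"
printf '%s\n' "$data_url"
```

The printed URL can then be passed to curl, for example curl "$data_url" > log.xml.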

That XML data is in a pretty straightforward (i.e. apparently easy to reverse-engineer) format. You could then use a simple scriptable XML tool like xmlstarlet (or any XSLT tool) to take the XML apart and reformat it as you wish. With luck, you might even find some documentation (or a DTD) for the XML somewhere.
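
As a rough illustration of the xmlstarlet step, here is a sketch that prints one line per log entry. It assumes the saved file resembles `svn log --xml` output, i.e. a log root element containing logentry elements with a revision attribute and author and msg children. Check the real file first and adjust the XPath expressions if it differs. The function name summarize_log is made up for this example.

```shell
#!/bin/sh
# Sketch: print "revision | author | message" for every log entry.
# Assumption: the XML looks like
#   <log><logentry revision="N"><author>..</author><msg>..</msg></logentry></log>

# $1 = path to the saved XML file
summarize_log() {
    xmlstarlet sel -t -m '/log/logentry' \
        -v '@revision' -o ' | ' \
        -v 'author' -o ' | ' \
        -v 'normalize-space(msg)' -n \
        "$1"
}
```

Call it as summarize_log 41818-40345.xml. normalize-space() collapses multi-line commit messages onto a single line, which keeps the output easy to grep.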

At least, that's how I would proceed.

Licensed under: CC-BY-SA with attribution