質問

Is there some way download the following pdf from the command line?

http://www.ofsted.gov.uk/provider/files/1295389/urn/EY298883.pdf   

A simple wget http://www.ofsted.gov.uk/provider/files/1295389/urn/EY298883.pdf returns a web page. However if you go to it in firefox you get a pdf.

Related to How to get a JS redirected pdf linked from a web page where I tried to find a python solution.

役に立ちましたか?

解決

If you don't need a universal answer that simulates a web browser and runs the JS (you need to do this to get a universal solution), but are fine with just finding the download link from the html you get by yourself, then you can:

  1. wget the page (wget will follow HTTP redirect so that this will give you the target html with the JS that does the download)
  2. you then need to parse the HTML and find the link you're looking for
  3. you need to wget that link

I wrote some simple scripts to do 2,3 for you at https://github.com/pjump/wgetbyCss In order to use them, you need

  • ruby
  • the mechanize gem (gem install mechanize)

Then you can do:

 ./wget_by_link_text 'http://www.ofsted.gov.uk/filedownloading/?id=1295389&type=1&refer=1' "Please download the requested file here"

i.e.:

   ./wget_by_link_text url link_text [save_as]

To get that link by its text. Alternatively, you can use the wget_by_css script and get the link by its .auto_click class, or some other css selector.

他のヒント

in short: you can't using wget/curl

You could use curl -L constrains curl to follow redirection

 curl -L http://www.ofsted.gov.uk/provider/files/1295389/urn/EY298883.pdf

But it doesn't work as you can see curl-FAQ:

4.14 Redirects work in browser but not with curl!

curl supports HTTP redirects fine (see item 3.8). Browsers generally support at least two other ways to perform redirects that curl does not:

Meta tags. You can write a HTML tag that will cause the browser to redirect to another given URL after a certain time.

Javascript. You can write a Javascript program embedded in a HTML page that redirects the browser to another given URL.

There is no way to make curl follow these redirects. You must either manually figure out what the page is set to do, or you write a script that parses the results and fetches the new URL.

So I think bad news, you will have to do it by yourself within a script, see your other question as reference: How to get a JS redirected pdf linked from a web page


Consider to use seleniumhq the queen's website seems to be a hard nut for crawlers.

ライセンス: CC-BY-SA帰属
所属していません StackOverflow
scroll top