Question

I'm auditing our existing web application, which makes heavy use of HTML frames. I would like to download all of the HTML in each frame. Is there a way of doing this with wget or a little bit of scripting?


Solution

As an addition to Steve's answer, here is the relevant section of the wget manual; a combined example follows the excerpt.

Span to any host—‘-H’

The ‘-H’ option turns on host spanning, thus allowing Wget's recursive run to visit any host referenced by a link. Unless sufficient recursion-limiting criteria are applied (such as a depth limit), these foreign hosts will typically link to yet more hosts, and so on until Wget ends up sucking up much more data than you have intended.

Limit spanning to certain domains—‘-D’

The ‘-D’ option allows you to specify the domains that will be followed, thus limiting the recursion only to the hosts that belong to these domains. Obviously, this makes sense only in conjunction with ‘-H’.

A typical example would be downloading the contents of ‘www.server.com’, but allowing downloads from ‘images.server.com’, etc.:

      wget -rH -Dserver.com http://www.server.com/

You can specify more than one address by separating them with a comma, e.g. ‘-Ddomain1.com,domain2.com’.

Taken from: the GNU Wget manual.
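Applied to the original frameset question, a minimal sketch (mysite.com and the start URL are placeholders for your own site). Wget's recursive retrieval follows frame and iframe src links, so a shallow depth limit starting at the frameset page is usually enough to pull each frame's HTML, while ‘-H’ and ‘-D’ keep the crawl confined to the site's own domains:

      wget -r -l2 -H -Dmysite.com http://www.mysite.com/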

OTHER TIPS

      wget --recursive --domains=www.mysite.com http://www.mysite.com

This performs a recursive crawl, which also traverses into frames and iframes. Be careful to limit the scope of the recursion to your own web site, since you probably don't want to crawl the whole web.
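One way to keep that scope tight, sketched with placeholder values, is to add a depth limit and ‘--no-parent’ so the crawl stays within the starting directory:

      wget --recursive --level=2 --no-parent --domains=www.mysite.com http://www.mysite.com/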

wget has a -r option to make it recursive; try wget -r -l1 (in case the font makes it hard to read: that last part is a lowercase L followed by the number one). The -l1 part tells it to recurse to a maximum depth of 1. Try playing with this number to scrape more.
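For example, to go two levels deep instead (www.mysite.com again standing in for your own site):

      wget -r -l2 http://www.mysite.com/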

Licensed under: CC-BY-SA with attribution