How to screen scrape across origins in an IFRAME?

https://stackoverflow.com/questions/23499006

16-07-2023

Question

I have a business web app that needs to pull in information from various other web sites. For most sites, the user just instructs the server to pull the data (using either .NET's HttpWebRequest or Selenium).

But for some unfriendly, JavaScript-heavy sites, our users have to visit the site manually, navigate to the right spot, and copy and paste into our application.

Other than bookmarklets, is there any way for our page to show an IFRAME with the source web site loaded, allow the user to navigate within the frame, and then capture the IFRAME's body?

Since the site in the IFRAME isn't in the same domain (not even close), I can't seem to work around the browser's cross-origin scripting restrictions. I've tried HTML5's "sandbox" attribute, but it appears to only allow communication (via "allow-same-origin") the other way, from the IFRAME to the host site, which isn't useful to me. It also doesn't work if the site in question tries to load its frames into the top context.

What I'm ideally looking for is a solution that would allow the browser to be configured to trust my web site implicitly (it's an intranet app) and allow it to access any frame's contents. That would at least get me in the ballpark. Bonus points if I can get the iframe to redefine the "top" context as its own frame, so the hosted site functions properly within the frame.

Solution

The best approach I've found across many screen-scraping projects (scraping JS-heavy pages) is to write a user script (for example, a Greasemonkey script), set up a few virtual machines in their own IP space (for protection), and feed them a list of sites to visit from a remote program:

1. Check the queue at a set interval
2. Request the page with Greasemonkey, etc.
3. Capture the contents and send them to the remote program for processing
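
The three steps above can be sketched roughly as follows. This is only an illustration, not the code used in those projects: every name here (ScrapeJob, CaptureResult, ScraperDeps, buildResult, runOnce, startPolling) is hypothetical, and the queue, page capture, and upload are passed in as functions because the original answer does not say how the remote program is reached.

```typescript
// Hypothetical shapes; none of these names come from the original answer.
interface ScrapeJob {
  id: string;
  url: string;
}

interface CaptureResult {
  jobId: string;
  url: string;
  html: string;
  capturedAt: string; // ISO 8601 timestamp
}

// The three steps are injected as functions so the loop itself does not
// depend on any particular user-script API or server endpoint.
interface ScraperDeps {
  fetchQueue: () => Promise<ScrapeJob[]>;            // step 1: check the queue
  capturePage: (job: ScrapeJob) => Promise<string>;  // step 2: request the page
  submit: (result: CaptureResult) => Promise<void>;  // step 3: send for processing
}

// Package the captured markup together with the job it belongs to.
function buildResult(job: ScrapeJob, html: string, now: Date): CaptureResult {
  return { jobId: job.id, url: job.url, html, capturedAt: now.toISOString() };
}

// Process every job currently in the queue once; returns how many were handled.
async function runOnce(deps: ScraperDeps): Promise<number> {
  const jobs = await deps.fetchQueue();
  let handled = 0;
  for (const job of jobs) {
    const html = await deps.capturePage(job);
    await deps.submit(buildResult(job, html, new Date()));
    handled += 1;
  }
  return handled;
}

// Poll at a fixed interval, skipping a tick if the previous run is still busy.
// Returns a function that stops the polling.
function startPolling(deps: ScraperDeps, intervalMs: number): () => void {
  let busy = false;
  const timer = setInterval(() => {
    if (busy) return;
    busy = true;
    runOnce(deps)
      .catch((err) => console.error("scrape run failed:", err))
      .finally(() => { busy = false; });
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Inside an actual user script, capturePage would typically read something like document.documentElement.outerHTML once the target page has finished rendering, and submit would post the result to the remote program with whatever cross-origin request facility the user-script manager provides. Both of those wirings are assumptions here rather than details from the answer.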

You can't use an iframe for this, and you will bang your head against a wall if you try to go that route. The method I've described has worked for numerous large-scale scraping projects.
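
As a rough illustration of why the iframe route fails: browsers compare the origin (scheme, host, and port) of the embedding page with that of the framed page, and for a cross-origin frame the contentDocument property is null. The helper names and the FrameLike type below are hypothetical, and the URLs in the usage note are placeholders.

```typescript
// Two URLs share an origin only if scheme, host, and port all match.
function sameOrigin(a: string, b: string): boolean {
  return new URL(a).origin === new URL(b).origin;
}

// Minimal structural stand-in for an iframe element, so the sketch does not
// depend on DOM typings. A real HTMLIFrameElement has the same property.
interface FrameLike {
  contentDocument: { documentElement: { outerHTML: string } } | null;
}

// Returns the frame's markup, or null when the browser withholds the
// document (which is what happens for a cross-origin frame).
function readFrameHtml(frame: FrameLike): string | null {
  const doc = frame.contentDocument;
  return doc === null ? null : doc.documentElement.outerHTML;
}
```

For example, sameOrigin("https://intranet.example/app", "https://vendor.example/report") is false, so a page served from the first URL gets null from readFrameHtml for a frame showing the second.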

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow