Question

I've searched high and low, but all I can find are questions (and answers) about scraping content that is dynamically generated by JavaScript.

I'm putting together a simple tool to audit client websites by finding text in the HTML source and comparing it to a dictionary.

For example, "ga.js" = Google Analytics.

However, I'm noticing that comparable tools are picking up scripts that mine is not... because they don't actually appear in the HTML source. I can only see them through Chrome's Developer Tools:

Here's a capture from Chrome, since I can't post the image...

Those scripts, such as the "reflektion_b.js", are nowhere to be found in the HTML source.

My script, as it stands now, uses urllib2 (urlopen) to fetch the page and BeautifulSoup to parse it. Can anyone help me out with getting the list of script sources? Or maybe even reading the scripts themselves (not 100% necessary, but it could come in handy)?

Any help would be much appreciated.

Solution

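urllib2 only fetches the raw HTML and never executes the page's JavaScript, so scripts that are injected at runtime never show up in what you parse. You need a headless browser with a Python API. Ghost.py will probably do what you want.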

http://jeanphix.me/Ghost.py/
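
As a rough sketch of the headless-browser idea, here is one way it might look. Note that this swaps in Selenium with headless Chrome instead of Ghost.py (my substitution, not something the answer above names), and that `rendered_html` and `script_sources` are hypothetical helper names. It assumes Selenium and a matching Chrome driver are installed. The parsing half uses only the standard library; BeautifulSoup would work just as well.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ScriptSrcParser(HTMLParser):
    """Collects the src attribute of every <script> tag, resolved to an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:  # inline scripts have no src and are skipped
                self.sources.append(urljoin(self.base_url, src))


def script_sources(html, base_url):
    """Return absolute URLs of all external scripts referenced in html."""
    parser = ScriptSrcParser(base_url)
    parser.feed(html)
    return parser.sources


def rendered_html(url):
    """Load url in headless Chrome and return the DOM after JavaScript has run.

    Assumption: Selenium is installed; it is imported here so the parsing
    helpers above stay usable without it.
    """
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()
```

Usage would be something like `script_sources(rendered_html(url), url)`. One caveat: this only sees script tags present in the DOM at the moment `page_source` is read, so scripts injected later, or loaded without a script tag, can still be missed.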

Other tips

"Content that is dynamically generated by JavaScript" implies that the JavaScript in question is interpreted, which requires a JavaScript interpreter.

You probably need a web view instance with a mechanism for intercepting requests, so you can see which JavaScript files the page actually loads.
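
A minimal sketch of that request-interception idea, assuming Playwright for Python is installed (my choice of tool; the answer does not name one). `collect_script_urls` and `identify_tools` are hypothetical names, and the `SIGNATURES` entries are illustrative: "ga.js" comes from the question, while the "Reflektion" label is my guess at what that script belongs to.

```python
from urllib.parse import urlparse

# Illustrative signature dictionary: substring of the script filename -> tool name.
SIGNATURES = {
    "ga.js": "Google Analytics",
    "reflektion_b.js": "Reflektion",  # assumed label, not confirmed by the question
}


def collect_script_urls(page_url):
    """Load page_url in headless Chromium and record every script request it makes.

    Assumption: Playwright and its Chromium build are installed; the import
    is local so identify_tools below works without it.
    """
    from playwright.sync_api import sync_playwright

    script_urls = []

    def on_request(request):
        # resource_type is "script" for JavaScript files, however they were injected
        if request.resource_type == "script":
            script_urls.append(request.url)

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("request", on_request)
        page.goto(page_url, wait_until="networkidle")
        browser.close()
    return script_urls


def identify_tools(script_urls, signatures=SIGNATURES):
    """Map each known tool to the script URLs whose filename contains its signature."""
    found = {}
    for url in script_urls:
        filename = urlparse(url).path.rsplit("/", 1)[-1]
        for needle, tool in signatures.items():
            if needle in filename:
                found.setdefault(tool, []).append(url)
    return found
```

Calling `identify_tools(collect_script_urls(url))` would then give the audit result. Because it listens to network requests rather than reading the DOM, this also catches scripts that other scripts pull in after the initial page load.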

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow