Question

I'm writing a web crawler.

Getting a link from HTML is the easy part, but getting a link that is only produced by JavaScript is hard for me.

Can I get the result of running the JavaScript, so that I know where a link points?

For example, how can I retrieve the link to google.com from the JavaScript code below in Python?

<!DOCTYPE html>
<html lang="en">
    <head></head>
    <body>
        <a href="#" id="goog">to google</a>
    </body>
    <script>
        document.getElementById('goog').onclick = function() {
            window.location = "http://google.com";
        };

    </script>
</html>

Solution

You would need to install Node.js and run a separate piece of code that executes the JavaScript in context to emit the resulting HTML. This is possible with jsdom, but the key is extracting the JavaScript from the HTML page and setting up the context correctly.
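The extraction step can be sketched on the Python side with the standard library's html.parser module. This is only a rough sketch under some assumptions: the names ScriptCollector and extract_inline_scripts are made up for illustration, external scripts (those with a src attribute) are skipped, and actually running the collected code in jsdom or another engine is not shown.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text of every inline <script> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.scripts = []        # one string per inline script
        self._in_script = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # External scripts (with a src attribute) have no inline body, so skip them.
        if tag == "script" and "src" not in dict(attrs):
            self._in_script = True
            self._chunks = []

    def handle_data(self, data):
        # Script text may arrive in several pieces; buffer them all.
        if self._in_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self.scripts.append("".join(self._chunks))
            self._in_script = False


def extract_inline_scripts(html):
    """Return the source of each inline script in the given HTML text."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return collector.scripts


# The page from the question, used here as sample input.
PAGE = """<!DOCTYPE html>
<html lang="en">
    <head></head>
    <body>
        <a href="#" id="goog">to google</a>
    </body>
    <script>
        document.getElementById('goog').onclick = function() {
            window.location = "http://google.com";
        };
    </script>
</html>
"""

for source in extract_inline_scripts(PAGE):
    print(source)
```

The collected strings are what you would then hand to whatever executes the JavaScript; setting up a DOM-like context for them is the harder part and is left out here.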

OTHER TIPS

Python doesn't offer a built-in way to execute JavaScript. Doing so would be a large task, and it may not even be what you want, because you won't know how to trigger all of the relevant JavaScript (in your example, the URL only appears when the click handler runs).

For the code you showed, you could simply run a regex over the entire page to pull out URL-like strings, but that approach is ad hoc and error-prone.
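A minimal sketch of that regex approach, assuming the URLs you care about appear as complete quoted string literals starting with http:// or https://. The pattern and the name find_url_literals are illustrative choices, not a robust solution.

```python
import re

# Match a quoted string literal whose content starts with http:// or https://.
# Assumption: the URL is written out in full inside one pair of quotes.
URL_PATTERN = re.compile(r"""["'](https?://[^"'\s]+)["']""")


def find_url_literals(js_source):
    """Return URL-like string literals found in JavaScript source text."""
    return URL_PATTERN.findall(js_source)


# The script from the question, used here as sample input.
JS = """
document.getElementById('goog').onclick = function() {
    window.location = "http://google.com";
};
"""

print(find_url_literals(JS))  # prints ['http://google.com']
```

This misses any URL that is assembled at runtime (for example by string concatenation) and will also pick up URLs that are never used as links, which is why it stays ad hoc.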

Licensed under: CC-BY-SA with attribution