Question

The company I work for will be going through a site redesign in a few months, and one of the things we need is a table containing the URL of every page on the site. Ideally, there would also be columns containing the values of a set of predefined JavaScript variables (in this case, Omniture variables, so we can ensure each page is properly tagged with its place in the site hierarchy).

Here's an example of what might be in the HTML for a given page:

<script type="text/javascript">     
metrics_level2  = "biz";
metrics_level3  = "products";
metrics_level4  = "my_awesome_product";
metrics_pagename    = "biz|products|my_awesome_product";    
</script>

I've crawled the site with RapidMiner and the data is ready to go, but I'm unsure of the best way to isolate these variables and put "metrics_level2", "metrics_level3", etc. in their own columns. Is XPath the best way to do it? Regular expressions? My attempts with XPath seem to bring in the entire contents between the script tags, which requires a lot of cleaning up after the fact.


Solution

If you use PhantomJS (http://phantomjs.org/), you can read these variables with JavaScript, exactly as you would from within the web page. A very simple example follows:

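//where url is the page that contains these variables (placeholder address here).
var page = require('webpage').create();
var url = 'http://www.example.com/';

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Failed to load ' + url);
        phantom.exit(1);
        return;
    }
    //Page is loaded! evaluate() runs inside the page, so its globals are in scope.
    var dataFromPage = page.evaluate(function () {
        return {
            metrics_level2: metrics_level2,
            metrics_level3: metrics_level3,
            metrics_level4: metrics_level4
        };
    });
    //dataFromPage now contains those variables
    console.log(JSON.stringify(dataFromPage));

    phantom.exit();
});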

If you already have your web pages scraped and saved as HTML files, you could set the content of the page object through its content property, as opposed to opening the page as shown above. See http://phantomjs.org/api/webpage/property/content.html
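
For pages that are already saved as HTML, the regular-expression route raised in the question may also be enough, without a headless browser. The following is only a rough sketch in plain JavaScript: the function name extractMetrics and the sample markup are made up for illustration, and it assumes each variable is assigned a simple quoted string literal, as in the question's example. Values that are computed or concatenated at runtime would still need the PhantomJS approach.

```javascript
// Hypothetical helper: pull `name = "value";` assignments out of raw HTML.
// Assumes each value is a plain single- or double-quoted string literal.
function extractMetrics(html, names) {
  var result = {};
  names.forEach(function (name) {
    // Variable name, optional whitespace, "=", optional whitespace,
    // then a quoted value closed by the same quote character.
    var pattern = new RegExp('\\b' + name + '\\s*=\\s*(["\'])(.*?)\\1');
    var match = pattern.exec(html);
    result[name] = match ? match[2] : null; // null when the page lacks the tag
  });
  return result;
}

// Sample markup modelled on the snippet in the question.
var sampleHtml = [
  '<script type="text/javascript">',
  'metrics_level2  = "biz";',
  'metrics_level3  = "products";',
  'metrics_level4  = "my_awesome_product";',
  'metrics_pagename    = "biz|products|my_awesome_product";',
  '</script>'
].join('\n');

var row = extractMetrics(sampleHtml, [
  'metrics_level2',
  'metrics_level3',
  'metrics_level4',
  'metrics_pagename',
  'metrics_level5' // not present in the sample, so it comes back as null
]);

console.log(JSON.stringify(row));
```

Each returned object corresponds to one row of the table, with one column per variable name; a null marks a page where that variable was not found and probably needs tagging.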

Licensed under: CC-BY-SA with attribution