Question

I'm new to web scraping, but I have a template I can work from. I'm accessing a civic database (the corporations division for the state of MA) and would ideally like to retrieve the "Date of Organization in Massachusetts" shown on the website.

How could I fix the code I currently have (which is returning blank) so it can grab the date?

$.ajax({
    url: "http://corp.sec.state.ma.us/CorpWeb/CorpSearch/CorpSummary.aspx?   FEIN=800829800&SEARCH_TYPE=1",
    type: 'GET',
    cache: false,
    success: function(data) {
        var root;
        root = $("<div></div>")
        root.html(data.responseText)

        var content = root.find("#MainContent_lblOrganisationDate");
        var date = content.text();
        console.log(date);
    }
});

UPDATE

Thanks, everyone, for pointing out that browser-side JavaScript may be a poor choice for scraping. I've switched to cheerio and request. However, I'm still printing a blank value in my terminal, so now I suspect there's something wrong with how I'm handling the DOM structure. Any suggestions? Thanks so much for the help so far!

var request = require('request');
var cheerio = require('cheerio');

var url = 'http://corp.sec.state.ma.us/CorpWeb/CorpSearch/CorpSummary.aspx?FEIN=800829800&SEARCH_TYPE=1';
request(url, function(err, resp, body) {
    if (err)
        throw err;
    var $ = cheerio.load(body);
    var orgdate = $('#MainContent_tblOrg .p1 td #MainContent_lblOrganisationDate').text();
    console.log(orgdate);
});

Solution

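As @codeomnitrix said, you can't make cross-domain Ajax requests from the browser: the same-origin policy blocks them unless the target server explicitly allows it (for example, with CORS headers).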

If you wish to use JavaScript to write your web scraper, consider using Node.js. It's JavaScript, but meant for building backend services. It works well for network-driven apps and handles many concurrent requests.

You can also get many packages from npm, such as async, underscore, and cheerio, which can help you build a pretty decent web scraper.
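As a rough illustration of the Node.js approach, here is a minimal sketch that uses only Node's built-in `http` module so it runs without installing anything. The helper names `fetchPage` and `textById` are hypothetical, the regex lookup is a simplified stand-in for a real HTML parser such as cheerio, and the sample markup (including the date in it) is made up for the demo rather than copied from the actual page.

```javascript
const http = require('http');

// Download a page body as a string with Node's built-in http module.
function fetchPage(url) {
  return new Promise((resolve, reject) => {
    http.get(url, (res) => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

// Hypothetical helper: text of the first element carrying the given id.
// Assumes the id holds only letters, digits, and underscores, and that the
// element does not contain a nested tag of the same name.
function textById(html, id) {
  const pattern = new RegExp('<(\\w+)[^>]*\\bid="' + id + '"[^>]*>([\\s\\S]*?)</\\1>');
  const match = html.match(pattern);
  return match ? match[2].replace(/<[^>]+>/g, '').trim() : null;
}

// Made-up sample markup, not the real page.
const sample = '<table><tr><td><span id="MainContent_lblOrganisationDate"> 01-02-2003 </span></td></tr></table>';
console.log(textById(sample, 'MainContent_lblOrganisationDate'));
```

Against the live site, one would presumably call `fetchPage(url).then(function (html) { console.log(textById(html, 'MainContent_lblOrganisationDate')); })`, assuming the element id from the question is the right one. For anything beyond a single lookup, a parser like cheerio is the sturdier choice.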

OTHER TIPS

First of all, you can't do cross-domain Ajax requests: an Ajax request from the browser can reach only your own domain, not any arbitrary one. What you can do is write a small proxy script in PHP, Java, or whatever backend technology you are using, and then send the request to that script on your own server, which fetches the remote page for you.

You can search for a simple proxy script in PHP and then write the request as:

url: "proxy.php?q=http://corp.sec.state.ma.us/CorpWeb/CorpSearch/CorpSummary.aspx?"
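One detail worth sketching: the target URL carries its own `?` and `&`, so it likely needs to be percent-encoded before being passed as the `q` parameter. Otherwise `SEARCH_TYPE=1` would be read as a separate parameter of `proxy.php`. The helper name `buildProxyUrl` below is hypothetical; `proxy.php` and `q` are taken from the line above, and the proxy script is assumed to decode `q` before fetching.

```javascript
// Hypothetical helper: build the URL for a request routed through the proxy.
// encodeURIComponent escapes ":", "/", "?", "=" and "&" in the target URL.
function buildProxyUrl(proxyPath, targetUrl) {
  return proxyPath + '?q=' + encodeURIComponent(targetUrl);
}

const target = 'http://corp.sec.state.ma.us/CorpWeb/CorpSearch/CorpSummary.aspx?FEIN=800829800&SEARCH_TYPE=1';
console.log(buildProxyUrl('proxy.php', target));
```

The resulting string can then be used as the `url` option of the `$.ajax` call.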

Hope this helps.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow