Question

I'm trying to build a simple feature where a user specifies a term and the program fetches a definition for it and returns it. The best definition source I know of is Google's "define" keyword: if you start a search query with "define " or "define:", it returns accurate and sufficient definitions. However, I have no idea how to access this information programmatically.

Google's new Custom Search Engine API doesn't show definitions. The old one gives slightly better results, but it is deprecated and still doesn't show the definitions I see when I Google the term in the browser.

Failing Google, I turned to Wikipedia, which has a huge API, but I still couldn't find a way to extract summaries like Google's definitions.
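
One avenue that may be worth a look is the MediaWiki `prop=extracts` query (provided by the TextExtracts extension), which can return the plain-text intro of an article. Below is a rough sketch rather than a tested solution: `buildExtractUrl` and `firstExtract` are illustrative names of my own, not library functions, and the code assumes the default JSON response format in which pages are keyed by page id.

```javascript
// Build a MediaWiki API URL asking for the plain-text intro of one article.
// buildExtractUrl and firstExtract are hypothetical helper names.
function buildExtractUrl(title) {
  var params = new URLSearchParams({
    action: 'query',
    prop: 'extracts',
    exintro: '1',      // only the section before the first heading
    explaintext: '1',  // plain text instead of HTML
    redirects: '1',    // follow redirects such as "tests" -> "Test"
    format: 'json',
    titles: title
  });
  return 'https://en.wikipedia.org/w/api.php?' + params.toString();
}

// Pull the extract out of the parsed JSON response.
// Returns null when the page is missing or has no extract.
function firstExtract(apiResponse) {
  var pages = (apiResponse.query && apiResponse.query.pages) || {};
  var ids = Object.keys(pages);
  for (var i = 0; i < ids.length; i++) {
    if (pages[ids[i]].extract) {
      return pages[ids[i]].extract;
    }
  }
  return null;
}
```

The URL from `buildExtractUrl(term)` can be fetched with any HTTP client, and the parsed JSON body passed to `firstExtract`. The result is an article summary rather than a dictionary definition, so it may not suit every term.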

So my question is, does anybody know how I can get this information out of Google via the API or any other means?

An older question asks the same thing, but the answers given there no longer apply because Google Dictionary no longer exists.

Update: I'm now trying to scrape the definitions straight out of the results page. The problem is that the definitions show up when I visit the page in the browser (Firefox), but they don't appear anywhere in the page when I scrape it using cheerio. I should mention that I'm scraping through nitrous.io, so the page is requested from a different region and operating system than the one I browse with. Maybe it's region related. I will look into it further.

Update 2.0: I think the definitions may be loaded asynchronously. If so, I have no idea how to scrape them, because I've never really done scraping before and I'm just a newbie :(

Update 3.0: OK, so now I think the problem is not the asynchronous loading but the renderer of the page. When I load the page in Firefox, it looks like this:

[Screenshot: the results page in Firefox]

However, when I load it in IE (8) it looks like this:

[Screenshot: the results page in IE 8]

Anybody got some insight on this?

Solution

I finally found the answer: I had to set a User-Agent header when scraping. My resulting code for getting definitions via scraping:

var request = require('request')
  , cheerio = require('cheerio');

var searchTerm = 'test';

// Without a browser User-Agent the definition markup was missing from
// the response, so send a desktop Firefox one with the request.
var options = {
  url: 'https://www.google.co.uk/search?q=define+' + encodeURIComponent(searchTerm),
  headers: {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0'
  }
};

request(options, function (err, resp, body) {
  if (err) {
    console.error(err);
    return;
  }
  var $ = cheerio.load(body);
  var defineBlocks = $('.lr_dct_sf_sen');
  var numOfBlocks = Math.min(defineBlocks.length, 3); // first three senses at most
  for (var i = 0; i < numOfBlocks; i++) {
    // children[1].children[0] is the element at the font-size:small level
    printSense(defineBlocks[i].children[1].children[0]);
  }
});

// Print the definition text and any quoted example found in one sense block.
function printSense(block) {
  for (var j = 0; j < block.children.length; j++) {
    var line = block.children[j];
    if (!line.attribs) { continue; } // text node, nothing I want
    if ('style' in line.attribs) { // main text
      var definition = '';
      for (var k = 0; k < line.children.length; k++) {
        definition += line.children[k].children[0].data;
      }
      console.log(definition);
    } else if ('class' in line.attribs) { // example
      console.log('"' + line.children[1].children[0].data + '"');
    }
  }
}
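
The key step above is sending a browser-like User-Agent. As a rough sketch of the same idea using only Node's built-in `https` module instead of the `request` package: `buildDefineOptions` and `fetchDefinePage` are hypothetical names chosen for illustration, and the User-Agent string is the one from the code above.

```javascript
var https = require('https');

// Build https.get options that carry a desktop Firefox User-Agent.
// buildDefineOptions and fetchDefinePage are illustrative names only.
function buildDefineOptions(term) {
  return {
    hostname: 'www.google.co.uk',
    path: '/search?q=' + encodeURIComponent('define ' + term),
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0'
    }
  };
}

// Fetch the results page and pass the raw HTML to callback(err, body).
// Redirects are not followed and the status code is not checked.
function fetchDefinePage(term, callback) {
  https.get(buildDefineOptions(term), function (res) {
    var body = '';
    res.setEncoding('utf8');
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () { callback(null, body); });
  }).on('error', callback);
}
```

The HTML passed to the callback can then be given to `cheerio.load` as before. Whether the `.lr_dct_sf_sen` markup is present still depends on what Google serves for that User-Agent.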
Licensed under: CC-BY-SA with attribution