Get title of a page with cheerio

https://stackoverflow.com/questions/23326561

10-07-2023

Question

I'm trying to get the title tag of a URL with cheerio, but I'm getting empty string values. This is my code:

app.get('/scrape', function(req, res){

    url = 'http://nrabinowitz.github.io/pjscrape/';

    request(url, function(error, response, html){
        if(!error){
            var $ = cheerio.load(html);

            var title, release, rating;
            var json = { title : "", release : "", rating : ""};

            $('title').filter(function(){
                //var data = $(this);
                var data = $(this);
                title = data.children().first().text();
                release = data.children().last().children().text();

                json.title = title;
                json.release = release;
            })

            $('.star-box-giga-star').filter(function(){
                var data = $(this);
                rating = data.text();

                json.rating = rating;
            })
        }


        fs.writeFile('output.json', JSON.stringify(json, null, 4), function(err){

            console.log('File successfully written! - Check your project directory for the output.json file');

        })

        // Finally, we'll just send out a message to the browser reminding you that this app does not have a UI.
        res.send('Check your console!')
    })
});

Solution

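var request = require('request');
var cheerio = require('cheerio');

var url = 'http://nrabinowitz.github.io/pjscrape/';

request(url, function (error, response, body) {
  if (!error && response.statusCode == 200) {
    // Parse the HTML, then read the text of the <title> element directly.
    var $ = cheerio.load(body);
    var title = $("title").text();
    console.log(title);
  }
});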

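Calling $("title").text() returns the text inside the page's "title" tags. The code in the question came back empty because the title element contains only text and has no child elements, so data.children() matched nothing.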

Other tips

If Robert Ryan's solution still doesn't work, I'd be suspicious of the formatting of the original page, which may be malformed somehow.

In my case I was accepting gzip and other compressed encodings but never decoding the response, so cheerio was trying to parse compressed binary data. When I logged the original body to the console, I could see binary garbage instead of plain-text HTML.
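
A rough sketch of that check and the decoding step, using only Node's built-in zlib module. The decodeBody helper and the sample HTML are hypothetical, not part of the original answer, and the sketch assumes you have the raw body as a Buffer plus the response's Content-Encoding header value.

```javascript
const zlib = require('zlib');

// Hypothetical helper: decode a raw response body according to its
// Content-Encoding header before handing the HTML to cheerio.
function decodeBody(buffer, contentEncoding) {
  switch ((contentEncoding || '').toLowerCase()) {
    case 'gzip':
      return zlib.gunzipSync(buffer).toString('utf8');
    case 'deflate':
      return zlib.inflateSync(buffer).toString('utf8');
    case 'br':
      return zlib.brotliDecompressSync(buffer).toString('utf8');
    default:
      // No (or unknown) encoding: treat the bytes as plain UTF-8 text.
      return buffer.toString('utf8');
  }
}

// Simulate a gzip-compressed response with made-up sample HTML.
const sampleHtml = '<html><head><title>Example page</title></head><body></body></html>';
const compressed = zlib.gzipSync(Buffer.from(sampleHtml, 'utf8'));

// Quick diagnostic: gzip data always starts with the magic bytes 0x1f 0x8b.
const isGzip = compressed[0] === 0x1f && compressed[1] === 0x8b;

const decoded = decodeBody(compressed, 'gzip');
console.log(isGzip, decoded === sampleHtml);
```

If you are using the request library, passing encoding: null should give you the body as a Buffer, and as far as I know its gzip: true option handles the decoding for you, so a helper like this is mostly useful when you are managing the headers yourself.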

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow