Question

I'm using var tmp_title = $('title').text(); with cheerio.js to get a title from a page.

Is there anything that could normalize a string or strip whitespace escapes like \n\t or \n?

Example

\n\t defense.gov news article: thousands lay wreaths at arlington cemetery gravesites\n

Into

Thousands lay wreaths at arlington cemetery gravesites

Or is there a better way to get the title from a page? How can Google know that the title is in an <h3> tag, or does the Google crawler take the title from the <title> tag and then clean and normalize it to get a readable title string?


Solution

I would do some analysis across these candidates:

  • head > title
  • the Open Graph meta of the page: $('meta[property="og:title"]').attr('content') (some pages use name="og:title" instead)
  • hN (going down from h1 to find the first heading level that occurs exactly once on the page)
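
A minimal sketch of gathering these three candidates, assuming `$` is a page already loaded with cheerio (`cheerio.load(html)`). The helper name `collectTitleCandidates` and the returned field names are hypothetical, chosen only for illustration:

```javascript
// Hypothetical helper: collects raw title candidates from a loaded page.
// Assumption: `$` is the function returned by `cheerio.load(html)`.
function collectTitleCandidates($) {
  const candidates = {
    title: $('head > title').text(),
    // Open Graph tags normally use `property`; some pages use `name` instead.
    ogTitle:
      $('meta[property="og:title"]').attr('content') ||
      $('meta[name="og:title"]').attr('content') ||
      '',
    heading: '',
  };

  // Walk h1..h6 and keep the first level that occurs exactly once on the page.
  for (let level = 1; level <= 6; level++) {
    const headings = $('h' + level);
    if (headings.length === 1) {
      candidates.heading = headings.first().text();
      break;
    }
  }
  return candidates;
}
```

The values come back unnormalized on purpose, so the cleanup step can be applied uniformly to all of them afterwards.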

Then the "analysis" could be as basic as

  • trimming
  • taking the longest string sequence common to all 3 choices
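
One way to sketch this in plain JavaScript, assuming the "common string sequence" means the longest contiguous substring the candidates share. The names `normalizeTitle`, `longestCommonSubstring`, and `pickTitle` are hypothetical helpers, not part of cheerio:

```javascript
// Collapse runs of whitespace (\n, \t, spaces) into single spaces, then trim.
function normalizeTitle(raw) {
  return String(raw || '').replace(/\s+/g, ' ').trim();
}

// Longest contiguous substring shared by a and b (case-sensitive).
// Plain O(a.length * b.length) dynamic programming, fine for short titles.
function longestCommonSubstring(a, b) {
  let best = 0; // length of the best match found so far
  let end = 0;  // index in `a` just past the end of that match
  let prev = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const cur = new Array(b.length + 1).fill(0);
    for (let j = 1; j <= b.length; j++) {
      if (a[i - 1] === b[j - 1]) {
        cur[j] = prev[j - 1] + 1;
        if (cur[j] > best) {
          best = cur[j];
          end = i;
        }
      }
    }
    prev = cur;
  }
  return a.slice(end - best, end);
}

// Normalize every candidate, then reduce pairwise to a shared substring.
// Note: pairwise reduction is an approximation and is not guaranteed to find
// the longest substring common to all candidates in every case.
function pickTitle(candidates) {
  const cleaned = candidates.map(normalizeTitle).filter(Boolean);
  if (cleaned.length === 0) return '';
  const common = cleaned
    .reduce((acc, s) => longestCommonSubstring(acc, s))
    .trim();
  return common || cleaned[0]; // fall back to the first candidate
}

console.log(pickTitle([
  '\n\t defense.gov news article: thousands lay wreaths at arlington cemetery gravesites\n',
  'thousands lay wreaths at arlington cemetery gravesites',
  'thousands lay wreaths at arlington cemetery gravesites',
]));
// prints "thousands lay wreaths at arlington cemetery gravesites"
```

If only the whitespace cleanup is needed, `normalizeTitle($('title').text())` alone covers the \n\t case from the question.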

Or, if you don't mind relying on a SaaS web service, you could have a look at http://www.diffbot.com/.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow