Get a normalized title, or find the title, of a page with Node.js

https://stackoverflow.com/questions/20698097

20-09-2022

Question

I'm using var tmp_title = $('title').text(); with cheerio.js to get the title from a page.

Is there anything that could normalize a string, or strip whitespace sequences such as \n\t or \n from it?

Example

\n\t defense.gov news article: thousands lay wreaths at arlington cemetery gravesites\n

Into

Thousands lay wreaths at arlington cemetery gravesites

Or is there another way to get the title from a page? How does Google know that the title is in an <h3> tag? Or does the Google crawler take the title from the <title> tag and then clean and normalize it to get a readable title string?

Solution

I would run some analysis across these candidates:

head > title
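og: metas of the page: $('meta[property="og:title"]').attr('content') (Open Graph tags normally use the property attribute; some pages use name instead)
hN (walking down from h1 to find the first heading level that occurs exactly once on the page)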
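
As a rough sketch of how these three candidates could be gathered, the function below takes a document already loaded with cheerio. The function name collectTitleCandidates and the shape of the returned object are assumptions made for illustration, not part of the original answer; only standard cheerio calls (first, text, attr, length) are used.

```javascript
// Sketch only: `$` is expected to be the function returned by cheerio.load(html), e.g.
//   const cheerio = require('cheerio');
//   const $ = cheerio.load(html);
// collectTitleCandidates is a hypothetical helper name.
function collectTitleCandidates($) {
  const candidates = {};

  // 1. The <title> element inside <head>.
  const title = $('head > title').first().text();
  if (title) candidates.title = title;

  // 2. Open Graph title; accept both property= and name= variants.
  const og = $('meta[property="og:title"], meta[name="og:title"]')
    .first()
    .attr('content');
  if (og) candidates.og = og;

  // 3. First heading level (h1..h6) that appears exactly once on the page.
  for (let level = 1; level <= 6; level++) {
    const headings = $('h' + level);
    if (headings.length === 1) {
      candidates.heading = headings.text();
      break;
    }
  }

  return candidates;
}
```

Candidates that are missing from the page are simply left out of the result, so the later analysis only has to deal with strings that were actually found.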

Then the "analysis" could be as basic as:

trimming
taking the string sequence common to all 3 choices (in practice, the longest substring they share)
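
The two steps above could look roughly like the following. This is a minimal sketch under a few assumptions: the \n and \t in the title are real whitespace characters, matching is case-sensitive, and the helper names (normalizeWhitespace, longestCommonSubstring, pickTitle) are made up for illustration. The second and third sample candidates are hypothetical values, not taken from the actual page.

```javascript
// Collapse runs of whitespace (including \n and \t) and trim both ends.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// Longest substring contained in every string of the array.
// Brute force over substrings of the shortest string, longest first;
// fine for short inputs such as page titles.
function longestCommonSubstring(strings) {
  if (strings.length === 0) return '';
  const shortest = strings.reduce((a, b) => (b.length < a.length ? b : a));
  for (let len = shortest.length; len > 0; len--) {
    for (let start = 0; start + len <= shortest.length; start++) {
      const piece = shortest.slice(start, start + len);
      if (strings.every((s) => s.includes(piece))) return piece;
    }
  }
  return '';
}

// Normalize every candidate, drop empty ones, keep what they all share.
function pickTitle(candidates) {
  const cleaned = candidates.map(normalizeWhitespace).filter(Boolean);
  return normalizeWhitespace(longestCommonSubstring(cleaned));
}

const picked = pickTitle([
  '\n\t defense.gov news article: thousands lay wreaths at arlington cemetery gravesites\n',
  'thousands lay wreaths at arlington cemetery gravesites', // hypothetical og:title
  ' thousands lay wreaths at arlington cemetery gravesites | defense.gov', // hypothetical heading
]);
console.log(picked);
// → thousands lay wreaths at arlington cemetery gravesites
```

If the candidates differ in capitalization, lowercasing them before comparison would be one option, at the cost of losing the original casing in the result.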

Or, if you don't mind relying on a SaaS web service, you could have a look at http://www.diffbot.com/.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow