Question

I am using the Readability Parser API and the node-readability module to do web scraping/parsing for a server built on Node.js. I can get a lot of information (title, links, date, content, length, ...) about articles published on publisher sites and blogs (my targets), but I cannot get the language they are written in. Any idea how I could do this?

There is the Google Translate API, but it is not free, and I don't need any translation. There are also the Alchemy Language Detection API and the node-language-detect module, but those detect the language from a given text, whereas in my case information about the language may already be available in the HTML of the page itself (see http://www.w3.org/TR/i18n-html-tech-lang/).


Solution

While inferring the language of a web page can be difficult (Bonjour!), HTML is there to help. Look for the lang attribute:

<html lang="en-us">

Note that any element can carry this attribute. In the case of my opening sentence:

<p lang="en-us">While inferring the language of a web page can be difficult <span lang="fr">(Bonjour!)</span></p>

More info here: https://stackoverflow.com/a/7076990/1216976
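In Node, the attribute can be pulled out of the fetched markup. A minimal sketch using a regex on the `<html>` tag (for anything beyond the common case, a real HTML parser such as the cheerio module is more robust; `pageLanguage` is a hypothetical helper name):

```javascript
// Extract the lang attribute from the <html> start tag.
// Returns the language tag lowercased (e.g. "en-us"), or null if absent.
function pageLanguage(html) {
  // [^>]* keeps the search confined to the <html ...> tag itself.
  const match = /<html[^>]*\blang\s*=\s*["']?([A-Za-z-]+)["']?/i.exec(html);
  return match ? match[1].toLowerCase() : null;
}

console.log(pageLanguage('<html lang="en-US"><body></body></html>')); // → "en-us"
console.log(pageLanguage('<html><body></body></html>'));              // → null
```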

Alternatively, you could check the Content-Language response header, but that is less specific, since it describes the entire page.

OTHER TIPS

You could make a request for the linked content and read the language from the HTTP response headers: some servers respond with a Content-Language header.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow