Finding the subject of a webpage

https://stackoverflow.com/questions/9314183

26-10-2019

Question

I'm interested in finding the subject or topic of arbitrary webpages and linking it to an entity in an RDF database such as DBpedia. Are there any tools or libraries that do this, or has anyone tried something like this before?

Solution

Finding the subject of a webpage is probably closest to Automatic Summarization (see the Wikipedia page of the same name). One of the subtasks used for that is Keyphrase Extraction (KE). KE returns substrings (phrases) of the input text that are important, prominent, or relevant to that text. If you assume that named entities are usually key to the subject of your input text, then Named Entity Recognition (NER) is another possible subtask for what you want. NER returns the substrings that are names of entities, along with the type of each entity.
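To make the idea of keyphrase extraction concrete, here is a deliberately naive sketch: it treats runs of consecutive non-stopword tokens as candidate phrases and ranks them by how often they occur. The stopword list and the function names `candidate_phrases` and `top_keyphrases` are made up for illustration; real KE tools use much richer features than raw frequency.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); real systems use a fuller one.
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "by", "for", "in",
    "is", "it", "of", "on", "or", "the", "to", "was", "with",
}


def candidate_phrases(text):
    """Split text into runs of consecutive non-stopword tokens."""
    phrases = []
    # Punctuation also ends a phrase, so cut the text into fragments first.
    for fragment in re.split(r"[^\w\s'-]+", text.lower()):
        current = []
        for token in fragment.split():
            if token in STOPWORDS:
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(token)
        if current:
            phrases.append(" ".join(current))
    return phrases


def top_keyphrases(text, n=3):
    """Return the n most frequent candidate phrases."""
    counts = Counter(candidate_phrases(text))
    return [phrase for phrase, _ in counts.most_common(n)]
```

Calling `top_keyphrases(page_text, 5)` on the visible text of a page gives a rough list of topic candidates, which you could then try to match against entity labels in a knowledge base.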

From your description, it seems that you are looking for more than just KE or NER, since you mention linking to a knowledge base (KB) such as DBpedia. A tool called DBpedia Spotlight does exactly that. You can configure it to find every DBpedia resource in the input text, or only keyphrases, only named entities, etc., and in each case the phrases it finds are linked to DBpedia resources. Check it out: http://spotlight.dbpedia.org
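As a rough sketch of what calling such an annotation service over HTTP might look like, the snippet below builds a request and pulls (phrase, DBpedia URI) pairs out of a JSON reply. The endpoint URL, the `text` and `confidence` parameters, and the `Resources`, `@surfaceForm` and `@URI` field names are assumptions based on the public DBpedia Spotlight web service; the hosted service has moved over time, so check the project's current documentation before relying on any of them. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the public service; verify against the current docs.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_request(text, confidence=0.5):
    """Build a GET request asking for JSON annotations of the given text."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return urllib.request.Request(
        SPOTLIGHT_URL + "?" + query,
        headers={"Accept": "application/json"},
    )


def extract_links(response):
    """Return (surface form, DBpedia URI) pairs from a decoded JSON reply.

    Assumes annotations are listed under "Resources"; a reply without
    that key yields an empty list.
    """
    return [(r["@surfaceForm"], r["@URI"]) for r in response.get("Resources", [])]


def annotate(text, confidence=0.5):
    """Send the text to the service and return the linked entities."""
    with urllib.request.urlopen(build_request(text, confidence), timeout=30) as reply:
        return extract_links(json.load(reply))
```

Raising `confidence` should give fewer but more reliable links; lowering it trades precision for coverage.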

There are other tools such as AlchemyAPI, Zemanta, WikiMachine, Evri, HeadUp, Enrycher, etc. But as far as I know, DBpedia Spotlight is the only one that is free, open source (Apache License 2.0), and lets you configure the behavior of both phrase recognition and disambiguation. (Disclaimer: I am a co-creator of DBpedia Spotlight.)

OTHER TIPS

What you're essentially after is a named entity recognition tool. There are a number of free and commercial services available, such as Alchemy API, OpenCalais, Lupedia, or Zemanta. Some of my colleagues have blogged about their experiences with these services.

For the interlinking part you would typically use frameworks such as Silk or LIMES; very soon there will be an interlinking service in the cloud available via the EC FP7 project LATC. Disclaimer: I'm the LATC project co-ordinator and Silk/LIMES are products of LATC consortium members.

OpenLink Virtuoso does this already with its Sponger (an RDFizer) meta-cartridges for OpenCalais, Alchemy, Pingar and DBpedia Spotlight. That is, you feed it a page, it asks the above services for entities, and it gives you triples based on the identified entities.

(Disclaimer: I should know.)
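To illustrate the last step (going from identified entities to triples), here is a minimal sketch that emits one N-Triples line per entity, linking the page to it. This is not the Sponger's actual output: the choice of `dcterms:subject` as the predicate and the function name `to_ntriples` are assumptions made for the example, and it assumes the URIs need no escaping.

```python
# Hypothetical predicate choice; the Sponger's real vocabulary may differ.
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"


def to_ntriples(page_url, entity_uris):
    """Return N-Triples lines stating that the page has each entity as a subject."""
    seen = set()
    lines = []
    for uri in entity_uris:
        if uri in seen:  # skip entities that were recognised more than once
            continue
        seen.add(uri)
        lines.append(f"<{page_url}> <{DCTERMS_SUBJECT}> <{uri}> .")
    return "\n".join(lines)
```

The resulting text can be loaded into any RDF store that accepts N-Triples.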

Licensed under: CC-BY-SA with attribution
