How do I find whether a document on the web is semantically related to some other document?

https://stackoverflow.com/questions/6122060

03-01-2021

Question

Given a document d1 on the web and another document d2, how can I tell whether d1 and d2 are semantically related? Are there any APIs that do some natural language processing and could give me a hint that d1 is probably connected to d2? I need this urgently, so any help is appreciated.

Solution

You can use special microformats. See more at http://microformats.org/

Simple example:

<a href="http://creativecommons.org/licenses/by/2.0/" rel="license">cc by 2.0</a>

Rel-License is one of several microformats. By adding rel="license" to a hyperlink, a page indicates that the destination of that hyperlink is a license for the current page.
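As a rough illustration of how a program could read such markup, here is a minimal sketch that collects rel values from links using only Python's standard library. The names RelLinkParser and extract_rel_links are made up for this example, and a shared rel target between two pages is only a weak hint of a connection, not proof that they are related.

```python
from html.parser import HTMLParser


class RelLinkParser(HTMLParser):
    """Collect (rel value, href) pairs from <a> and <link> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        rel = attr_map.get("rel")
        if not href or not rel:
            return
        # A rel attribute may hold several space-separated values.
        for value in rel.split():
            self.links.append((value.lower(), href))


def extract_rel_links(html):
    """Return all (rel, href) pairs found in an HTML string (hypothetical helper)."""
    parser = RelLinkParser()
    parser.feed(html)
    return parser.links


sample = '<a href="http://creativecommons.org/licenses/by/2.0/" rel="license">cc by 2.0</a>'
print(extract_rel_links(sample))
# [('license', 'http://creativecommons.org/licenses/by/2.0/')]
```

Running this over d1 and d2 and comparing the resulting pairs would show, for instance, whether both pages declare the same license or tag target.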

Other tips

To relate documents semantically, you can use special vocabularies like SKOS and relate the documents in an ontology. Or, as silex mentioned, you can use microformats directly in your documents.

For natural language processing, there are various tools, such as GATE, that can extract information from text. This is not a trivial task, though.

Perhaps you can refine what you want to do? Do you want to define yourself which documents are related? Or do you want software to find out which documents may be related?

You need to look into "named entity extraction", i.e. using natural language processing to extract likely entities that are common to both documents. These are generally people, places, events, times, and organisations.

Take a look at OpenCalais http://www.opencalais.com/ for some real-world applications of this type of technology.
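To make the entity-overlap idea concrete, here is a minimal sketch that scores two texts by the entities they share. It does not call OpenCalais or any real NER tool: the regular expression below is a crude stand-in that treats runs of capitalized words as candidate entities (so sentence-initial words like "The" slip in), and the function names are made up for this example. A real extractor would replace candidate_entities.

```python
import re

# Crude stand-in for a named-entity extractor: runs of capitalized words.
# This is an assumption for illustration, not how a real NER tool works.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def candidate_entities(text):
    """Return the set of capitalized word runs found in the text."""
    return {match.group(0) for match in CANDIDATE.finditer(text)}


def entity_overlap(text_a, text_b):
    """Jaccard overlap of candidate entities: 0.0 (disjoint) to 1.0 (identical)."""
    a = candidate_entities(text_a)
    b = candidate_entities(text_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


d1 = "Barack Obama visited Berlin in June."
d2 = "The mayor of Berlin met Barack Obama."
print(entity_overlap(d1, d2))  # 0.5
```

Here both texts share "Barack Obama" and "Berlin" out of four distinct candidates, hence 0.5. Any threshold for calling two documents "related" would have to be tuned on real data.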

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow