RDF representation of entity references in text

https://stackoverflow.com/questions/3272473

17-09-2020

Question

Consider a sentence like:

John Smith travelled to Washington.

A name tagger would identify, on a good day, 'John Smith' as a person, and 'Washington' as a place. However, without other evidence, it can't tell which of all the possible 'John Smith's in the world, or even which of the various 'Washington's, it's got.

Eventually, some resolution process might decide, based on other evidence. Until that point, however, what is a good practice for representing these references in RDF? Assign them made-up unique identifiers in some namespace? Make blank tuples (e.g. 'Some person named John Smith was referenced in Document d'.)? Some other alternative? A book I have gives an example involving anonymous weather stations, but I am not quite following how their example fits in with everything else about RDF being described.

Solution

Assign them unique identifiers in your own namespace. If you later discover that this "Washington" is the same as http://dbpedia.org/resource/Washington,_D.C., or whatever, you can add an owl:sameAs to assert that.
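As a minimal sketch of that later step, the snippet below builds a single N-Triples line asserting owl:sameAs between a locally minted URI and the DBpedia resource mentioned above. The local URI and the helper name same_as_triple are assumptions made for illustration; they are not part of the original answer.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triple(local_uri: str, resolved_uri: str) -> str:
    """Return one N-Triples line stating that two URIs name the same thing."""
    return f"<{local_uri}> <{OWL_SAME_AS}> <{resolved_uri}> ."


# Hypothetical identifier minted in your own namespace for this mention.
local = "http://yourdomain.com/data/Washington-1"
# The DBpedia resource named in the answer above.
resolved = "http://dbpedia.org/resource/Washington,_D.C."

line = same_as_triple(local, resolved)
print(line)
```

The statement is added alongside the existing data, so nothing already written about the local identifier has to change once the mention is resolved.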

OTHER TIPS

First of all, there are good existing services you can use for entity recognition, such as OpenCalais, Zemanta and Alchemy.

To be more specific, though: yes, simply 'mint' your own URIs (identifiers) for each thing, then make statements about them. Here is one representation of this information in Turtle:

@base <http://yourdomain.com/data/> .
@prefix myont: <http://yourdomain.com/ontology/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# The identifiers contain '#', which would start a comment inside a prefixed
# name, so they are written as IRIs relative to @base instead.
<John_Smith#d> rdf:type foaf:Person ;
  foaf:name "John Smith"@en .

<Washington#d> rdf:type dbpedia-owl:Place ;
  rdfs:label "Washington"@en .

<John_Smith#d> myont:travelled_to <Washington#d> .

<http://yourdomain.com/some-doc#this> rdf:type foaf:Document ;
  dcterms:references <John_Smith#d>, <Washington#d> .

And if you later match them up, you can use owl:sameAs, as glenn mcdonald mentions.

It may also be relevant to read how Apache Stanbol does it: http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html

You can either mint your own URI as discussed above, or use a blank-node. There are pros and cons for both approaches:

URIs have an external identity, so you can explicitly refer to your concept in future queries, which can make some queries much simpler. But because they have an external identity, the algorithm you use to construct the URIs becomes a critical part of your infrastructure, and you have to guarantee that they are both stable and unique. This may be trivial at first, but when you start dealing with multiple documents being reprocessed at differing times, often in parallel, and on distributed systems, it pretty quickly ceases to be straightforward.

Blank-nodes were included specifically to solve this problem: their uniqueness is guaranteed by their scoping. But if you need to refer to a blank-node explicitly in a query, you will need either a non-standard extension or some way to characterize the node.

In both cases, but especially should you use a blank-node, you should include provenance statements to characterize it anyway.
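For the URI route, one way to get identifiers that stay stable when a document is reprocessed (possibly in parallel on different machines) is to derive them from the input alone, rather than from a counter or a clock. The sketch below is only an illustration of that idea: the namespace BASE, the function name mint_entity_uri, and the choice of a truncated SHA-256 over the document text plus character offsets are all assumptions, not something the answer prescribes.

```python
import hashlib

# Hypothetical namespace for minted entity identifiers.
BASE = "http://yourdomain.com/entities/"


def mint_entity_uri(doc_text: str, start: int, end: int) -> str:
    """Derive a URI from the document content and the mention's offsets.

    The same text and offsets always produce the same URI, regardless of
    which host processes the document or when it does so.
    """
    doc_hash = hashlib.sha256(doc_text.encode("utf-8")).hexdigest()[:16]
    return f"{BASE}{doc_hash}/{start}-{end}"


text = "John Smith travelled to Washington."
uri_a = mint_entity_uri(text, 0, 10)    # "John Smith"
uri_b = mint_entity_uri(text, 0, 10)    # same mention, reprocessed
uri_c = mint_entity_uri(text, 24, 34)   # "Washington"

print(uri_a)
print(uri_c)
```

Truncating the hash keeps the URIs short at the cost of a small collision risk, and any change to the document text yields new identifiers. Whether that is what you want depends on how you treat revised documents.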

@nathan's example is a good one to get the idea.

So an example using blank-nodes might be:

@prefix my: <http://yourdomain.com/2010/07/20/conceptmap#> .
@prefix proc: <http://yourdomain.com/2010/07/20/processing#> .
@prefix prg: <http://yourdomain.com/processors#> .

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
# Document IRIs carry '#' fragments, so they are written relative to @base
# rather than as prefixed names (where '#' would start a comment).
@base <http://yourdomain.com/doc-path/> .

_:1 rdf:type proc:ProcessRun ;
    proc:parser prg:tagger ;
    proc:version "1.0.2" ;
    proc:time "2010-07-03T20:35:45"^^xsd:dateTime ;
    proc:host prg:hostname-of-processing-node ;
    proc:file <some-doc#line=1,;md5=md5_sum_goes_here,mime-charset_goes_here> .

# RFC 5147 counts positions between characters, so the ten-character
# "John Smith" at the start of the text is char=0,10.
_:2 rdf:type foaf:Person ;
    foaf:name "John Smith"@en ;
    proc:identifiedBy _:1 ;
    proc:atLocation <some-doc#char=0,10> .

_:3 rdf:type owl:Thing ;
    foaf:name "Washington"@en ;
    proc:identifiedBy _:1 ;
    proc:atLocation <some-doc#char=24,34> .

<http://yourdomain.com/some-doc#this> rdf:type foaf:Document ;
                                      dcterms:references _:2, _:3 .

Note the use of RFC 5147 text/plain fragment identifiers to uniquely identify the file being processed; this gives you flexibility in how you identify individual runs. The alternative is to capture all this in the URI for the document root, or to abandon provenance altogether.

@prefix : <http://yourdomain.com/ProcessRun/parser=tagger/version=1.0.2/time=2010-07-03+20:35:45/host=hostname-of-processing-node/file=http%3A%2F%2Fyourdomain.com%2Fdoc-path%2Fsome-doc%23line%3D1%2C%3Bmd5%3Dmd5_sum_goes_here%2Cmime-charset_goes_here/> .

@prefix my: <http://yourdomain.com/2010/07/20/conceptmap#> .
@prefix proc: <http://yourdomain.com/2010/07/20/processing#> .
@prefix prg: <http://yourdomain.com/processors#> .

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
# Document IRIs carry '#' fragments, so they are written relative to @base
# rather than as prefixed names (where '#' would start a comment).
@base <http://yourdomain.com/doc-path/> .

:1 rdf:type proc:ProcessRun ;
    proc:parser prg:tagger ;
    proc:version "1.0.2" ;
    proc:time "2010-07-03T20:35:45"^^xsd:dateTime ;
    proc:host prg:hostname-of-processing-node ;
    proc:file <some-doc#line=1,;md5=md5_sum_goes_here,mime-charset_goes_here> .

# RFC 5147 counts positions between characters, so the ten-character
# "John Smith" at the start of the text is char=0,10.
:2 rdf:type foaf:Person ;
    foaf:name "John Smith"@en ;
    proc:identifiedBy :1 ;
    proc:atLocation <some-doc#char=0,10> .

:3 rdf:type owl:Thing ;
    foaf:name "Washington"@en ;
    proc:identifiedBy :1 ;
    proc:atLocation <some-doc#char=24,34> .

<http://yourdomain.com/some-doc#this> rdf:type foaf:Document ;
                                      dcterms:references :2, :3 .
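The RFC 5147 fragment identifiers used in these examples can be generated mechanically from the tagger's output. The following is a rough sketch, assuming the tagger hands back the matched string and that its first occurrence in the text is the one meant; the function name text_fragment and its parameters are hypothetical. RFC 5147 counts positions between characters, so a mention starting at index s with length n is written char=s,s+n, and the optional md5 integrity check covers the whole document body.

```python
import hashlib


def text_fragment(doc_uri: str, text: str, mention: str,
                  charset: str = "UTF-8") -> str:
    """Build a URI with an RFC 5147 char range and md5 integrity check.

    Assumes the first occurrence of `mention` in `text` is the one intended.
    """
    start = text.index(mention)      # position before the first character
    end = start + len(mention)       # position after the last character
    digest = hashlib.md5(text.encode(charset)).hexdigest()
    return f"{doc_uri}#char={start},{end};md5={digest},{charset}"


text = "John Smith travelled to Washington."
doc_uri = "http://yourdomain.com/doc-path/some-doc"

person_loc = text_fragment(doc_uri, text, "John Smith")
place_loc = text_fragment(doc_uri, text, "Washington")

print(person_loc)
print(place_loc)
```

Because the md5 value changes whenever the document changes, a consumer can detect that the stored offsets no longer apply to the text it has in hand.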

You will note that foaf:name has a domain of owl:Thing, so it can be applied to anything. An alternative might be to use skos:Concept and rdfs:label for the proper nouns.

One final consideration for blank-node vs. URI is that any datastore you use will ultimately have to store any URI you use, and this can have implications regarding performance if you are using very large datasets.

Ultimately, if I were going to publish the provenance information in the graph along with the final unified entities, I would be inclined to go with blank-nodes and allocate URIs to the concepts I ultimately unify entities with.

If however I am not going to be tracking the provenance of the inferences, and this is just one pass of many in a pipeline which will ultimately discard the intermediate results, I would just mint URIs using some sort of document hash, timestamp, and id and be done with it.

@prefix : <http://yourdomain.com/entities#> .
@prefix my: <http://yourdomain.com/2010/07/20/conceptmap#> .

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

:filename_timestamp_1 rdf:type foaf:Person ;
                      foaf:name "John Smith"@en .

:filename_timestamp_2 rdf:type owl:Thing ;
                      foaf:name "Washington"@en .

<http://yourdomain.com/some-doc#this> rdf:type foaf:Document ;
                                      dcterms:references :filename_timestamp_1, :filename_timestamp_2 .

Licensed under: CC-BY-SA with attribution
