Proper syntax for two /parallel/ left joins with the same name in SPARQL (using OPTIONAL probably)

StackOverflow https://stackoverflow.com/questions/23120912

  •  04-07-2023

Question

I read about the syntax for OPTIONAL here, and also followed this slideshow on OPTIONAL. I think my problem boils down to not having the right syntax for, given a base set, left joining one OR another field, whichever exists.

It is my understanding that OPTIONAL clauses are executed in order, so I am also trying to take advantage of this to fill in the ?University variable sequentially in order of my trust in the data field.

My sample query is trying to find which educational institutions have the most alumni who have been named Miss America. (I chose this because it's interesting and yet the set is fairly small, enough to debug.)

There are at least two fields that seem appropriate for identifying an educational alumni affiliation, dbpedia-owl:education and dbpedia2:almaMater.

My first query (I), pulling only dbpedia-owl:education:

SELECT count(distinct(?ma)) as ?people, ?University WHERE {
{
    ?ma dbpedia2:title :Miss_America ;
       rdf:type <http://dbpedia.org/ontology/Person> .
} UNION {
    ?ma <http://dbpedia.org/ontology/title> ?title;
       rdf:type <http://dbpedia.org/ontology/Person> .
    FILTER STRSTARTS(?title, "Miss America") .
}
    OPTIONAL {
             ?ma dbpedia-owl:education ?University 
      }
    OPTIONAL { ?ma dbpedia-owl:birthDate ?bday . }
}
ORDER BY DESC(?people)

SPARQL RESULTS

My second query (II), pulling only dbpedia2:almaMater:

SELECT count(distinct(?ma)) as ?people, ?University WHERE {
{
    ?ma dbpedia2:title :Miss_America ;
       rdf:type <http://dbpedia.org/ontology/Person> .
} UNION {
    ?ma <http://dbpedia.org/ontology/title> ?title;
       rdf:type <http://dbpedia.org/ontology/Person> .
    FILTER STRSTARTS(?title, "Miss America") .
}
    OPTIONAL { ?ma dbpedia2:almaMater ?University }
    OPTIONAL { ?ma dbpedia-owl:birthDate ?bday . }
}
ORDER BY DESC(?people)

SPARQL RESULTS

As you can see, I need to ask for both ways of phrasing Alma Mater, because they capture different things.

However, both combined forms of OPTIONAL, NESTED (III) and UNION (IV), seem to leave out items that were in (I) or (II). Neither truly gives me the OPTIONAL UNION of the above that I am looking for.

Here's the NESTED form (III):

SELECT count(distinct(?ma)) as ?people, ?University WHERE {
{
    ?ma dbpedia2:title :Miss_America ;
       rdf:type <http://dbpedia.org/ontology/Person> .
} UNION {
    ?ma <http://dbpedia.org/ontology/title> ?title;
       rdf:type <http://dbpedia.org/ontology/Person> .
    FILTER STRSTARTS(?title, "Miss America") .
}
    OPTIONAL {
             ?ma dbpedia-owl:education ?University 
             OPTIONAL { ?ma dbpedia2:almaMater ?University }
      }
    OPTIONAL { ?ma dbpedia-owl:birthDate ?bday . }
}
ORDER BY DESC(?people)

SPARQL RESULTS

Here's the UNION form (IV):

SELECT count(distinct(?ma)) as ?people, ?University WHERE {
{
    ?ma dbpedia2:title :Miss_America ;
       rdf:type <http://dbpedia.org/ontology/Person> .
} UNION {
    ?ma <http://dbpedia.org/ontology/title> ?title;
       rdf:type <http://dbpedia.org/ontology/Person> .
    FILTER STRSTARTS(?title, "Miss America") .
}
    OPTIONAL {{ ?ma dbpedia-owl:education ?University } UNION
             { ?ma dbpedia2:almaMater ?University } .
      }
    OPTIONAL { ?ma dbpedia-owl:birthDate ?bday . }
}
ORDER BY DESC(?people)

SPARQL RESULTS

Reviewing what I get when I just enumerate the names from (I) and (II) without aggregation, it doesn't seem that either (III) or (IV) gets me the proper result set, incorporating the data from (I) OR (II) where it exists. I understand that I can run the queries individually and then merge the results in a scripting language, or possibly bind each property to a different variable in its own OPTIONAL clause, but that seems clumsy. (But please let me know if this is the recommended way.)

So, to be concise about the question:

  • How do I phrase a query that will return all candidates having been named Miss America, joined on EITHER :almaMater or :education, whichever exists?

Additionally, I notice the most recent Miss America, Nina Davuluri, does not appear in the search results on the dbpedia endpoint, though she is in a searchbox at List_of_Miss_America_titleholders. How would I investigate the cause of the discrepancy between the wikidata and dbpedia end-points (and how can I help contribute data back?!)


Solution

First, it's much easier to help if you provide complete SPARQL queries, including prefixes (especially since you're using some non-standard ones), or if you use the same prefixes that the public endpoint UI does (see http://dbpedia.org/sparql?nsdecl). It's not immediately clear what dbpedia2 is, etc. (although I realize now that dbpedia2 is defined in the SNORQL explorer that you linked to).

Also, note that while Virtuoso may accept your queries, they're not all actually legal SPARQL. E.g., if you take your first query and go to http://sparql.org/validate/query, you'll see that the variable projection syntax isn't legal. It needs to be

select (count(distinct(?ma)) as ?people) ?University where

where the … as ?people is wrapped in parentheses, and there is no comma between the variables. (It's not a problem, but you can also use count(distinct ?ma) and save two parentheses.)
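To make both points concrete, here is a sketch of the first query rewritten as standalone, standard SPARQL 1.1. The prefix IRIs for dbpedia2 and : are assumptions based on what the SNORQL explorer predefined, and dbpedia-owl is taken from the full IRIs already used in the question; check them against http://dbpedia.org/sparql?nsdecl before relying on them. The unused birthDate pattern is left out. Standard SPARQL also requires a GROUP BY for the non-aggregated ?University, which Virtuoso lets you omit.

```sparql
# Prefix IRIs below are assumed (SNORQL defaults), not confirmed by the question.
PREFIX rdf:         <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpedia2:    <http://dbpedia.org/property/>
PREFIX :            <http://dbpedia.org/resource/>

# The aggregate is wrapped in parentheses, and there is no comma in the projection.
SELECT (COUNT(DISTINCT ?ma) AS ?people) ?University WHERE {
  {
    ?ma dbpedia2:title :Miss_America ;
        rdf:type dbpedia-owl:Person .
  } UNION {
    ?ma dbpedia-owl:title ?title ;
        rdf:type dbpedia-owl:Person .
    FILTER STRSTARTS(?title, "Miss America")
  }
  OPTIONAL { ?ma dbpedia-owl:education ?University }
}
# Every projected variable that is not aggregated must be grouped.
GROUP BY ?University
ORDER BY DESC(?people)
```

A query in this form should pass the validator at http://sparql.org/validate/query as well as run on the endpoint.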

Next, since DBpedia data is extracted from Wikipedia, it can be a bit mixed up at times, so it's always a good idea to browse the data a bit to find the best way to identify things. In this case, by looking at http://dbpedia.org/page/Angela_Perez_Baraquio, it appears that a good way to identify Miss America winners is to look for persons that have dcterms:subject category:Miss_America_winners. Thus, we have a query like:

select ?person where {
  ?person a dbpedia-owl:Person ;
          dcterms:subject category:Miss_America_winners
}

SPARQL results

Now, not all of these will have clean education or alma mater information, but you can use an alternation property path with | to match any one of several properties. Then you'd end up with a query like this (for three properties):

select ?education (count(distinct ?person) as ?numWinners) where {
  ?person a dbpedia-owl:Person ;
          dcterms:subject category:Miss_America_winners .
  optional { 
    ?person dbpprop:education|dbpprop:almaMater|dbpedia-owl:almaMater ?education 
  }
}
group by ?education

SPARQL results

It's not particularly enlightening; the biggest group is the people without values for any of those properties. For the other values, there's a mix of strings and resources. If nothing else, there are two winners for the University of Mississippi.

Selecting values of properties where there's a preference among the properties is actually not entirely trivial in SPARQL, and it's been discussed in this answers.semanticweb.com question: Preference patterns for SPARQL (1.1). There are a few ways to do it, but I think the easiest here is to match all the properties in optional blocks, and then coalesce them into one:

select ?person ?education where {
  ?person a dbpedia-owl:Person ;
          dcterms:subject category:Miss_America_winners .
  optional { ?person dbpedia-owl:almaMater ?ed1 }
  optional { ?person dbpprop:almaMater ?ed2 }
  optional { ?person dbpprop:education ?ed3 }
  bind( coalesce(?ed1,?ed2,?ed3) as ?education )
}

SPARQL results

For individuals who have values for more than one of these properties, we get the preferred property. E.g., for http://dbpedia.org/resource/Angela_Perez_Baraquio we get the dbpedia-owl:almaMater, http://dbpedia.org/resource/University_of_Hawaii. For cases where there are multiple values for the best property, we still get all of them. E.g., for http://dbpedia.org/resource/Kylene_Barker, we get both http://dbpedia.org/resource/Virginia_Tech and http://dbpedia.org/resource/Carroll_County_High_School_(Hillsville,_Virginia).
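To get back to the original goal of counting winners per institution, the coalesce pattern can be combined with aggregation. The following is a sketch rather than a tested result: the prefix declarations are my assumption of what the public endpoint predefines (see http://dbpedia.org/sparql?nsdecl), and the order of the arguments to coalesce encodes the same property preference as above.

```sparql
# Prefix IRIs are assumed to match the endpoint's predefined namespaces.
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpprop:     <http://dbpedia.org/property/>
PREFIX dcterms:     <http://purl.org/dc/terms/>
PREFIX category:    <http://dbpedia.org/resource/Category:>

select ?education (count(distinct ?person) as ?numWinners) where {
  ?person a dbpedia-owl:Person ;
          dcterms:subject category:Miss_America_winners .
  # Each property is matched independently; a missing one leaves its variable unbound.
  optional { ?person dbpedia-owl:almaMater ?ed1 }
  optional { ?person dbpprop:almaMater ?ed2 }
  optional { ?person dbpprop:education ?ed3 }
  # coalesce yields the first bound argument: ?ed1 is preferred, then ?ed2, then ?ed3.
  bind( coalesce(?ed1, ?ed2, ?ed3) as ?education )
}
group by ?education
order by desc(?numWinners)
```

Winners with none of the three properties end up in the group where ?education is unbound, and count(distinct ?person) keeps a winner from being counted twice for the same institution when a lower-priority property has several values.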

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow