Pregunta

I read some questions related to my question, like Same sparql not returning same results, but I think is a little different.

Consider this query which I submit into http://live.dbpedia.org/sparql (Virtuoso endpoint) and get 34 triples as a result. Result Sparql

SELECT  ?pred ?obj
    WHERE { 
           <http://dbpedia.org/resource/Johann_Sebastian_Bach> ?pred ?obj
        FILTER((langMatches(lang(?obj), "")) ||
                      (langMatches(lang(?obj), "EN"))
          )
    }

Then, I used the same query in a code in python:

import rdflib
import rdfextras
rdfextras.registerplugins()

g=rdflib.Graph()
g.parse("http://dbpedia.org/resource/Johann_Sebastian_Bach")

PREFIX = """
                PREFIX dbp: <http://dbpedia.org/resource/>
"""

query = """
                SELECT ?pred ?obj
                    WHERE {dbp:Johann_Sebastian_Bach ?pred ?obj
                        FILTER( (langMatches(lang(?obj), "")) ||
                                (langMatches(lang(?obj), "EN")))}
"""
query = PREFIX + query
result_set = g.query(query)
print len(result_set)

This time, I get only 27 triples! https://dl.dropboxusercontent.com/u/22943656/result.txt

I thought it could be related to the dbpedia site. I repeated these queries several time and always got the same difference. Therefore, I downloaded the RDF file to test it locally, and used the software Protége to simulate the Virtuoso endpoint. Even though, I still have different results from the sparql submitted into Protége and Python, 31 and 27. Is there any explanation for this difference? And how can I get the same result in both?

¿Fue útil?

Solución

As the question is written, there are a few possible problems. Based on the comments, the first one described here (about lang, langMatches, etc.) seems to be what you're actually running into, but I'll leave the descriptions of the other possible problems, in case someone else finds them useful.

lang, langMatches, and the empty string

lang is defined to return "" for literals with no language tags. According to RFC 4647 §2.1, language tags are defined as follows:

2.1. Basic Language Range

A "basic language range" has the same syntax as an [RFC3066] language tag or is the single character "*". The basic language range was originally described by HTTP/1.1 [RFC2616] and later [RFC3066]. It is defined by the following ABNF [RFC4234]:

language-range   = (1*8ALPHA *("-" 1*8alphanum)) / "*"
alphanum         = ALPHA / DIGIT

This means that "" isn't actually a legal language tag. As Jeen Broekstra pointed out on answers.semanticweb.com, the SPARQL recommendation says:

17.2 Filter Evaluation

SPARQL provides a subset of the functions and operators defined by XQuery Operator Mapping. XQuery 1.0 section 2.2.3 Expression Processing describes the invocation of XPath functions. The following rules accommodate the differences in the data and execution models between XQuery and SPARQL: …

  • Functions invoked with an argument of the wrong type will produce a type error. Effective boolean value arguments (labeled "xsd:boolean (EBV)" in the operator mapping table below), are coerced to xsd:boolean using the EBV rules in section 17.2.2.

Since "" isn't a legal language tag, it might be considered "an argument of the wrong type [that] will produce a type error." In that case, the langMatches invocation would produce an error, and that error will be treated as false in the filter expression. Even if it doesn't return false for this reason, RFC 4647 §3.3.1, which describes how language tags and ranges are compared, doesn't say exactly what should happen in the comparison, since it's assuming legal language tags:

Basic filtering compares basic language ranges to language tags. Each basic language range in the language priority list is considered in turn, according to priority. A language range matches a particular language tag if, in a case-insensitive comparison, it exactly equals the tag, or if it exactly equals a prefix of the tag such that the first character following the prefix is "-". For example, the language-range "de-de" (German as used in Germany) matches the language tag "de-DE-1996" (German as used in Germany, orthography of 1996), but not the language tags "de-Deva" (German as written in the Devanagari script) or "de-Latn-DE" (German, Latin script, as used in Germany).

Based on your comments and my local experiments, it appears that langMatches(lang(?obj),"") for literals without language tags (so really, langMatches("","")) is returning true in Virtuoso (as it's installed on DBpedia), Jena's ARQ (from my experiments), and Proégé (from our experiments), and it's returning false (or an error that's coerced to false) in RDFlib.

In either case, since lang is defined to return "" for the literals without a language tag, , you should be able to reliably include them in your results by changing langMatches(lang(?obj),"") with lang(?obj) = "".

Issues with the data that you're using

You're not querying the same data. The data that you download from

is from DBpedia, but when you run a query against

you're running it against DBpedia Live, which may have different data. If you run this query on the DBpedia Live endpoint and on the DBpedia endpoint, you get a different number of results:

SELECT count(*) WHERE { 
  dbpedia:Johann_Sebastian_Bach ?pred ?obj
  FILTER( langMatches(lang(?obj), "")  || langMatches(lang(?obj), "EN" ) )
}

DBpedia Live results 31
DBpedia results 34

Issues with distinct

Another possible problem, though it doesn't seem to be the one that you're running into, is that your second query has a distinct modifier, but your first one doesn't. That means that your second query could easily have fewer results than the first one.

If you run this query against the DBpedia SPARQL endpoint you should get 34 results, and that's the same whether or not you use the distinct modifiers, and it's the number that you should get if you download the data and run the same query against it.

select ?pred ?obj where { 
  dbpedia:Johann_Sebastian_Bach ?pred ?obj
  filter( langMatches(lang(?obj), "") || langMatches(lang(?obj), "EN") )
}

SPARQL results

Licenciado bajo: CC-BY-SA con atribución
No afiliado a StackOverflow
scroll top