As the question is written, there are a few possible problems. Based on the comments, the first one described here (about lang, langMatches, etc.) seems to be what you're actually running into, but I'll leave the descriptions of the other possible problems in case someone else finds them useful.
lang, langMatches, and the empty string
lang is defined to return "" for literals with no language tag. RFC 4647 §2.1 defines a basic language range, which has the same syntax as a language tag, as follows:
2.1. Basic Language Range
A "basic language range" has the same syntax as an [RFC3066] language tag or is the single character "*". The basic language range was originally described by HTTP/1.1 [RFC2616] and later [RFC3066]. It is defined by the following ABNF [RFC4234]:
language-range = (1*8ALPHA *("-" 1*8alphanum)) / "*"
alphanum       = ALPHA / DIGIT
This means that "" isn't actually a legal language tag or language range. As Jeen Broekstra pointed out on answers.semanticweb.com, the SPARQL recommendation says:
17.2 Filter Evaluation
SPARQL provides a subset of the functions and operators defined by XQuery Operator Mapping. XQuery 1.0 section 2.2.3 Expression Processing describes the invocation of XPath functions. The following rules accommodate the differences in the data and execution models between XQuery and SPARQL: …
- Functions invoked with an argument of the wrong type will produce a type error. Effective boolean value arguments (labeled "xsd:boolean (EBV)" in the operator mapping table below), are coerced to xsd:boolean using the EBV rules in section 17.2.2.
Since "" isn't a legal language tag, it might be considered "an argument of the wrong type [that] will produce a type error." In that case, the langMatches invocation would produce an error, and that error would be treated as false in the filter expression. Even if it doesn't return false for this reason, RFC 4647 §3.3.1, which describes how language tags and ranges are compared, doesn't say exactly what should happen in this comparison, since it assumes legal language tags:
Basic filtering compares basic language ranges to language tags. Each basic language range in the language priority list is considered in turn, according to priority. A language range matches a particular language tag if, in a case-insensitive comparison, it exactly equals the tag, or if it exactly equals a prefix of the tag such that the first character following the prefix is "-". For example, the language-range "de-de" (German as used in Germany) matches the language tag "de-DE-1996" (German as used in Germany, orthography of 1996), but not the language tags "de-Deva" (German as written in the Devanagari script) or "de-Latn-DE" (German, Latin script, as used in Germany).
Based on your comments and my local experiments, it appears that langMatches(lang(?obj),"") for literals without language tags (so really, langMatches("","")) returns true in Virtuoso (as it's installed on DBpedia), Jena's ARQ (from my experiments), and Protégé (from our experiments), and returns false (or an error that's coerced to false) in RDFlib.
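To make the two behaviors concrete, here is a minimal Python sketch of the basic filtering rule quoted above, next to a stricter variant that first checks the range against the ABNF from §2.1. The names is_basic_range, basic_filter, strict_filter, and filter_accepts are made up for illustration, and the wildcard range "*" is only handled by the syntax check. None of this is taken from the source of Virtuoso, ARQ, Protégé, or RDFlib; it only shows how both outcomes can follow from a reasonable reading of the specs.

```python
import re

# ABNF from RFC 4647 section 2.1:
#   language-range = (1*8ALPHA *("-" 1*8alphanum)) / "*"
_BASIC_RANGE = re.compile(r"(?:[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*|\*)")


def is_basic_range(language_range: str) -> bool:
    """Return True if the string is a syntactically legal basic language range."""
    return _BASIC_RANGE.fullmatch(language_range) is not None


def basic_filter(tag: str, language_range: str) -> bool:
    """Literal reading of RFC 4647 section 3.3.1, with no input validation.

    The range matches if it equals the tag, or equals a prefix of the tag
    that is immediately followed by "-". The comparison is case-insensitive.
    """
    tag, language_range = tag.lower(), language_range.lower()
    return tag == language_range or tag.startswith(language_range + "-")


def strict_filter(tag: str, language_range: str) -> bool:
    """Same rule, but an illegal range is rejected as a wrong-typed argument."""
    if not is_basic_range(language_range):
        raise ValueError(f"not a basic language range: {language_range!r}")
    return basic_filter(tag, language_range)


def filter_accepts(matcher, tag: str, language_range: str) -> bool:
    """Mimic a SPARQL FILTER: an error in the expression counts as false."""
    try:
        return matcher(tag, language_range)
    except ValueError:
        return False


if __name__ == "__main__":
    # The unvalidated rule accepts the empty range for an untagged literal.
    print(filter_accepts(basic_filter, "", ""))   # True
    # The validating variant raises, and the filter coerces that to false.
    print(filter_accepts(strict_filter, "", ""))  # False
```

Both variants agree on legal input such as the "de-de" examples from the RFC; they only differ on the empty string, which is exactly the case the specs leave open.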
In either case, since lang is defined to return "" for literals without a language tag, you should be able to reliably include them in your results by replacing langMatches(lang(?obj),"") with lang(?obj) = "".
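As a rough illustration of why the equality test is the safer choice, the following Python sketch evaluates both filter expressions against a lenient and a strict stand-in for langMatches. Everything here is hypothetical: lenient_match and strict_match are simplified models and not the code of any real engine, and SAMPLE is made-up data standing in for (lexical form, language tag) pairs.

```python
def lenient_match(tag: str, language_range: str) -> bool:
    """Model of an engine that applies the prefix rule without validation."""
    tag, language_range = tag.lower(), language_range.lower()
    return tag == language_range or tag.startswith(language_range + "-")


def strict_match(tag: str, language_range: str) -> bool:
    """Model of an engine where an empty range ends up as false."""
    return language_range != "" and lenient_match(tag, language_range)


def original_filter(lang: str, lang_matches) -> bool:
    # langMatches(lang(?obj), "") || langMatches(lang(?obj), "EN")
    return lang_matches(lang, "") or lang_matches(lang, "EN")


def rewritten_filter(lang: str, lang_matches) -> bool:
    # lang(?obj) = "" || langMatches(lang(?obj), "EN")
    return lang == "" or lang_matches(lang, "EN")


# Made-up sample: an English literal, a German literal, an untagged literal.
SAMPLE = [("Bach", "en"), ("Bach", "de"), ("1685", "")]


def kept(filter_fn, lang_matches):
    """Return the sample literals that pass the given filter expression."""
    return [lit for lit in SAMPLE if filter_fn(lit[1], lang_matches)]


if __name__ == "__main__":
    print(kept(original_filter, lenient_match))   # English and untagged
    print(kept(original_filter, strict_match))    # English only
    print(kept(rewritten_filter, lenient_match))  # English and untagged
    print(kept(rewritten_filter, strict_match))   # English and untagged
```

The original expression loses the untagged literal under the strict model, while the rewritten one gives the same result under both, because it never passes "" to the matcher.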
Issues with the data that you're using
You're not querying the same data. The data that you download from
is from DBpedia, but when you run a query against
you're running it against DBpedia Live, which may have different data. If you run this query on the DBpedia Live endpoint and on the DBpedia endpoint, you get a different number of results:
SELECT (COUNT(*) AS ?count) WHERE {
  dbpedia:Johann_Sebastian_Bach ?pred ?obj
  FILTER( langMatches(lang(?obj), "") || langMatches(lang(?obj), "EN") )
}
DBpedia Live: 31 results
DBpedia: 34 results
Issues with distinct
Another possible problem, though it doesn't seem to be the one that you're running into, is that your second query has a distinct modifier, but your first one doesn't. That means that your second query could easily have fewer results than the first one.
If you run this query against the DBpedia SPARQL endpoint, you should get 34 results. That's the same whether or not you use the distinct modifier, and it's the number that you should get if you download the data and run the same query against it.
select ?pred ?obj where {
dbpedia:Johann_Sebastian_Bach ?pred ?obj
filter( langMatches(lang(?obj), "") || langMatches(lang(?obj), "EN") )
}