Question

I thought it would be interesting to ask DBpedia which of its resources are the most predicate-rich.

I tried running the following query:

SELECT DISTINCT ?s (count(?p) AS ?info)
WHERE {
  ?s ?p ?o .
}
GROUP BY ?s ?p
ORDER BY desc(?info)
LIMIT 50

and it timed out, so I can't verify whether or not it was the right query.

So, I'm left with the following two questions:

  1. Is this the correct way to ask this question?
  2. Is the query too computationally expensive to run, even on smaller datasets? (DBpedia has 2.46 billion triples.)

Solution

The right way to ask this

Suppose you've got data like this:

@prefix : <http://stackoverflow.com/q/22391927/1281433/> .

:a :p 1, 2, 3 ;
   :q 4, 5 .

:b :p 1, 2 ;
   :q 3, 4 ;
   :r 5, 6 .

:c :p 1 ;
   :q 2 ;
   :r 3 .

Then you can ask how many triples each resource is the subject of with a query like this:

prefix : <http://stackoverflow.com/q/22391927/1281433/>

select ?s (count(*) as ?n) where {
  ?s ?p ?o
}
group by ?s
order by desc(?n)

----------
| s  | n |
==========
| :b | 6 |
| :a | 5 |
| :c | 3 |
----------

Notice that you only want to group by ?s if you're interested in how many triples each resource is the subject of. In your original query, where you group by ?s ?p, you're going to be sorting (subject, predicate) pairs by how many values they have. E.g.,

prefix : <http://stackoverflow.com/q/22391927/1281433/>

select ?s ?p (count(*) as ?n) where {
  ?s ?p ?o
}
group by ?s ?p
order by desc(?n)

---------------
| s  | p  | n |
===============
| :a | :p | 3 |
| :b | :p | 2 |
| :a | :q | 2 |
| :b | :q | 2 |
| :b | :r | 2 |
| :c | :p | 1 |
| :c | :q | 1 |
| :c | :r | 1 |
---------------
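
If by "predicate-rich" you actually mean the number of distinct predicates a resource uses, rather than the number of triples it's the subject of, that's a third grouping (in SPARQL 1.1 terms, count(distinct ?p) with group by ?s). Here's a minimal Python sketch, not part of the original queries, that mirrors the sample data as plain tuples so you can compare the three counts without an endpoint. The function names are made up for illustration.

```python
from collections import Counter

# The sample data from above as (subject, predicate, object) tuples.
TRIPLES = [
    ("a", "p", 1), ("a", "p", 2), ("a", "p", 3),
    ("a", "q", 4), ("a", "q", 5),
    ("b", "p", 1), ("b", "p", 2),
    ("b", "q", 3), ("b", "q", 4),
    ("b", "r", 5), ("b", "r", 6),
    ("c", "p", 1), ("c", "q", 2), ("c", "r", 3),
]


def triples_per_subject(triples):
    """Mirrors: group by ?s, count(*)."""
    return Counter(s for s, _, _ in triples)


def triples_per_subject_predicate(triples):
    """Mirrors: group by ?s ?p, count(*)."""
    return Counter((s, p) for s, p, _ in triples)


def distinct_predicates_per_subject(triples):
    """Mirrors: group by ?s, count(distinct ?p)."""
    # Collapse to unique (subject, predicate) pairs first, then count per subject.
    unique_pairs = {(s, p) for s, p, _ in triples}
    return Counter(s for s, _ in unique_pairs)


if __name__ == "__main__":
    print(triples_per_subject(TRIPLES).most_common())
    print(triples_per_subject_predicate(TRIPLES).most_common())
    print(distinct_predicates_per_subject(TRIPLES).most_common())
```

On the sample data, :b has the most triples (6), but :b and :c tie on distinct predicates (3 each), so the two readings of the question can rank resources differently.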

Doing this for DBpedia

I don't expect that you'll be able to run a query like this on DBpedia. It requires touching every triple in the data, and then ordering the resources by how many triples they're the subject of. That sounds like a lot of work. You might be able to download the data, load it into a local endpoint and run the query, and so avoid the timeout, but I wouldn't be surprised if it still takes a while.
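
If you do download the data, you may not need an endpoint at all for this particular count. Here's a rough sketch, assuming the dump is an uncompressed N-Triples file (one triple per line, subject first). The names count_subjects and top_subjects are hypothetical, and the counter keeps one entry per distinct subject, so memory use grows with the size of the dataset.

```python
from collections import Counter


def count_subjects(lines):
    """Count how many triples each subject appears in, given N-Triples lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # In N-Triples the subject is the first whitespace-delimited token
        # (an IRI in angle brackets or a blank node label).
        subject = line.split(None, 1)[0]
        counts[subject] += 1
    return counts


def top_subjects(path, n=50):
    """Stream a file line by line and return the n subjects with the most triples."""
    with open(path, encoding="utf-8") as handle:
        return count_subjects(handle).most_common(n)
```

Calling top_subjects("dump.nt", 50), where dump.nt is a placeholder file name, would then play the role of the group by ?s query with limit 50. It still has to read every line once, so expect it to take a while on a dataset of this size.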

Licensed under: CC-BY-SA with attribution