
I found some code on GitHub that uses DBpedia Lookup to send words and get candidate URIs back from DBpedia. The problem is that all the URIs come with the word "Category". For example, for the word Berlin it returns:

instead of

If I open the first URI (the one with "Category") in a browser, I don't get the page for the subject "History_of_Berlin"; I get a page listing links, one of which leads to "History_of_Berlin". If I open the second URI (the one without "Category"), I get the page for the subject "History_of_Berlin" itself. How can I keep the lookup from returning these "Category" URIs?


package com.knowledgebooks.info_spiders;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.methods.GetMethod;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.IOException;
import java.io.InputStream;
import java.util.*;

/**
 * Copyright Mark Watson 2008-2010. All Rights Reserved.
 * License: LGPL version 3 (
 */

// Use Georgi Kobilarov's DBpedia lookup web service
//    ref:
//    example:

/**
 * Searches return results that contain any of the search terms. I am going to filter
 * the results to ignore results that do not contain all search terms.
 */

public class DBpediaLookupClient extends DefaultHandler {
  public DBpediaLookupClient(String query) throws Exception {
    this.query = query;
    HttpClient client = new HttpClient();

    String query2 = query.replaceAll(" ", "+"); // URLEncoder.encode(query, "utf-8");
    HttpMethod method =
      new GetMethod("" + // the lookup service URL goes in this string
        query2);
    try {
      client.executeMethod(method); // send the request before reading the response
      InputStream ins = method.getResponseBodyAsStream();
      SAXParserFactory factory = SAXParserFactory.newInstance();
      SAXParser sax = factory.newSAXParser();
      sax.parse(ins, this);
    } catch (HttpException he) {
      System.err.println("Http error connecting to");
    } catch (IOException ioe) {
      System.err.println("Unable to connect to");
    } finally {
      method.releaseConnection(); // hand the connection back to the client
    }
  }

  private List<Map<String, String>> variableBindings = new ArrayList<Map<String, String>>();
  private Map<String, String> tempBinding = null;
  private String lastElementName = null;

  public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
    //System.out.println("startElement " + qName);
    if (qName.equalsIgnoreCase("result")) {
      tempBinding = new HashMap<String, String>();
    }
    lastElementName = qName;
  }

  public void endElement(String uri, String localName, String qName) throws SAXException {
    //System.out.println("endElement " + qName);
    if (qName.equalsIgnoreCase("result")) {
      // keep a result only if it is new and mentions every search term
      if (!variableBindings.contains(tempBinding) && containsSearchTerms(tempBinding))
        variableBindings.add(tempBinding);
    }
  }

  public void characters(char[] ch, int start, int length) throws SAXException {
    String s = new String(ch, start, length).trim();
    //System.out.println("characters (lastElementName='" + lastElementName + "'): " + s);
    if (s.length() > 0) {
      if ("Description".equals(lastElementName)) {
        if (tempBinding.get("Description") == null) {
          tempBinding.put("Description", s);
        } else {
          // the parser may deliver a description in several chunks; join them
          tempBinding.put("Description", "" + tempBinding.get("Description") + " " + s);
        }
      }
      if ("URI".equals(lastElementName)) tempBinding.put("URI", s);
      if ("Label".equals(lastElementName)) tempBinding.put("Label", s);
    }
  }

  public List<Map<String, String>> variableBindings() {
    return variableBindings;
  }

  // true only if every whitespace-separated search term occurs somewhere in the result
  private boolean containsSearchTerms(Map<String, String> bindings) {
    StringBuilder sb = new StringBuilder();
    for (String value : bindings.values()) sb.append(value);  // do not need white space
    String text = sb.toString().toLowerCase();
    StringTokenizer st = new StringTokenizer(this.query);
    while (st.hasMoreTokens()) {
      if (text.indexOf(st.nextToken().toLowerCase()) == -1) {
        return false;
      }
    }
    return true;
  }

  private String query = "";
}

Solution 2

I asked the author of the code, Mark Watson, for some help and he answered me:

You can make this simple code change:

      //if ("URI".equals(lastElementName)) tempBinding.put("URI", s);
      if ("URI".equals(lastElementName) && s.indexOf("Category") == -1
          && tempBinding.get("URI") == null) {
        tempBinding.put("URI", s);
      }

That is, comment out one line and add the if block that follows it.

That's it!
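
One caveat with the substring test above: s.indexOf("Category") == -1 also rejects any ordinary resource whose name merely contains the word "Category". A slightly stricter check might look like the sketch below. It assumes that category URIs have the form http://dbpedia.org/resource/Category:<name>; the class name CategoryUriFilter and the method isCategoryUri are hypothetical and not part of the original code.

```java
/** Tells DBpedia category URIs apart from ordinary resource URIs (illustrative sketch). */
final class CategoryUriFilter {
  // Assumption: category URIs carry this marker; adjust it if the service returns another form.
  private static final String CATEGORY_MARKER = "/resource/Category:";

  static boolean isCategoryUri(String uri) {
    return uri != null && uri.contains(CATEGORY_MARKER);
  }

  public static void main(String[] args) {
    System.out.println(isCategoryUri("http://dbpedia.org/resource/Category:History_of_Berlin")); // true
    System.out.println(isCategoryUri("http://dbpedia.org/resource/Category_theory"));            // false
    System.out.println(isCategoryUri("http://dbpedia.org/resource/History_of_Berlin"));          // false
  }
}
```

In the characters method, the condition s.indexOf("Category") == -1 could then be replaced with !CategoryUriFilter.isCategoryUri(s).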


When you do a search for, e.g., "History of Berlin", you're requesting a URL like

and you're getting back an XML result like this:

<?xml version="1.0" encoding="utf-8"?>
xmlns:xsi="" xmlns:xsd="" xmlns="">
        <Label>Museum für Naturkunde</Label>
        <Label>History of Berlin</Label>
            Berlin is the capital city of Germany. Berlin is a young city by European standards, founded in the 12th century.
                <Label>History of Berlin</Label>
                <Label>History of Germany by location</Label>

You're right that there are URI elements with category URIs, e.g.,


but what you should note is that from the root of the document, there are


elements, whereas the elements that you want are


elements. You just need to process your XML a bit differently; don't get all the content from all URI elements, but just from the URI elements that are children of Result elements. I'm not all that familiar with SAX parsing, but I think the important point is that once you've entered a Result, you should only grab the URI if you haven't entered another child element of Result.
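
One way to do that with SAX is to keep a stack of the currently open elements and accept a URI only when its parent is a Result. The following is a minimal sketch of that idea, not the original author's code: the class name ResultUriHandler, the helper extract, the root element name Results, and the example.org URIs in the sample are all made up for illustration. Only the nesting of Result, Categories, Category, and URI follows the description above.

```java
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Collects only the URI elements whose parent element is Result (hypothetical helper). */
class ResultUriHandler extends DefaultHandler {
  private final Deque<String> path = new ArrayDeque<>(); // currently open elements, innermost first
  private final List<String> uris = new ArrayList<>();
  private StringBuilder text = null; // non-null only while inside a Result/URI element

  @Override
  public void startElement(String uri, String localName, String qName, Attributes attrs) {
    // Before pushing, the top of the stack is the parent of the element being opened.
    boolean parentIsResult = "Result".equals(path.peek());
    path.push(qName);
    if (parentIsResult && "URI".equals(qName)) text = new StringBuilder();
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    // SAX may deliver one text node in several chunks, so append rather than overwrite.
    if (text != null) text.append(ch, start, length);
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    path.pop();
    if (text != null && "URI".equals(qName)) {
      uris.add(text.toString().trim());
      text = null;
    }
  }

  List<String> uris() { return uris; }

  /** Parses an XML string and returns the URIs that are direct children of Result. */
  static List<String> extract(String xml) throws Exception {
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    ResultUriHandler handler = new ResultUriHandler();
    parser.parse(new InputSource(new StringReader(xml)), handler);
    return handler.uris();
  }

  public static void main(String[] args) throws Exception {
    // Made-up sample document; the URIs are placeholders, not real lookup output.
    String xml =
        "<Results><Result>"
      + "<Label>History of Berlin</Label>"
      + "<URI>http://example.org/resource/History_of_Berlin</URI>"
      + "<Categories><Category>"
      + "<Label>History of Berlin</Label>"
      + "<URI>http://example.org/resource/Category:History_of_Berlin</URI>"
      + "</Category></Categories>"
      + "</Result></Results>";
    System.out.println(extract(xml)); // prints [http://example.org/resource/History_of_Berlin]
  }
}
```

Because the check is based on where the element sits in the document rather than on the text of the URI, it does not depend on the word "Category" appearing in the value.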

Licensed under: CC-BY-SA with attribution