Question

I found some code on GitHub that uses DBpedia Lookup to send words and get candidate URIs back from DBpedia. The problem is that all the returned URIs contain the word "Category". For example, for the word Berlin it returns:

instead of

If I open the first URI (the one with "Category") in a browser, I don't get the page for the subject "History_of_Berlin". Instead I get a page containing a list of links, one of which leads to "History_of_Berlin". If I open the second URI (the one without "Category"), I get the page for "History_of_Berlin" itself. How can I stop the lookup from returning these "Category" URIs?

Code

package com.knowledgebooks.info_spiders;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.methods.GetMethod;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLEncoder;
import java.util.*;

/**
 * Copyright Mark Watson 2008-2010. All Rights Reserved.
 * License: LGPL version 3 (http://www.gnu.org/licenses/lgpl-3.0.txt)
 */

// Use Georgi Kobilarov's DBpedia lookup web service
//    ref: http://lookup.dbpedia.org/api/search.asmx?op=KeywordSearch
//    example: http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString=Flagstaff&QueryClass=XML&MaxHits=10

/**
 * Searches return results that contain any of the search terms. I am going to filter
 * the results to ignore results that do not contain all search terms.
 */


public class DBpediaLookupClient extends DefaultHandler {
  public DBpediaLookupClient(String query) throws Exception {
    this.query = query;
    HttpClient client = new HttpClient();

    String query2 = query.replaceAll(" ", "+"); // URLEncoder.encode(query, "utf-8");
    HttpMethod method =
      new GetMethod("http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString=" +
        query2);
    try {
      client.executeMethod(method);
      System.out.println(method);
      InputStream ins = method.getResponseBodyAsStream();
      SAXParserFactory factory = SAXParserFactory.newInstance();
      SAXParser sax = factory.newSAXParser();
      sax.parse(ins, this);
    } catch (HttpException he) {
      System.err.println("Http error connecting to lookup.dbpedia.org");
    } catch (IOException ioe) {
      System.err.println("Unable to connect to lookup.dbpedia.org");
    } finally {
      method.releaseConnection(); // release the connection even if parsing throws
    }
  }

  private List<Map<String, String>> variableBindings = new ArrayList<Map<String, String>>();
  private Map<String, String> tempBinding = null;
  private String lastElementName = null;

  public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
    //System.out.println("startElement " + qName);
    if (qName.equalsIgnoreCase("result")) {
      tempBinding = new HashMap<String, String>();
    }
    lastElementName = qName;
  }

  public void endElement(String uri, String localName, String qName) throws SAXException {
    //System.out.println("endElement " + qName);
    if (qName.equalsIgnoreCase("result")) {
      if (!variableBindings.contains(tempBinding) && containsSearchTerms(tempBinding))
        variableBindings.add(tempBinding);
    }
  }

  public void characters(char[] ch, int start, int length) throws SAXException {
    String s = new String(ch, start, length).trim();
    //System.out.println("characters (lastElementName='" + lastElementName + "'): " + s);
    if (s.length() > 0) {
      if ("Description".equals(lastElementName)) {
        // characters() may be called several times per element, so append later chunks
        if (tempBinding.get("Description") == null) {
          tempBinding.put("Description", s);
        } else {
          tempBinding.put("Description", tempBinding.get("Description") + " " + s);
        }
      }
      if ("URI".equals(lastElementName)) tempBinding.put("URI", s);
      if ("Label".equals(lastElementName)) tempBinding.put("Label", s);
    }
  }

  public List<Map<String, String>> variableBindings() {
    return variableBindings;
  }
  private boolean containsSearchTerms(Map<String, String> bindings) {
    StringBuilder sb = new StringBuilder();
    for (String value : bindings.values()) sb.append(value);  // do not need white space
    String text = sb.toString().toLowerCase();
    StringTokenizer st = new StringTokenizer(this.query);
    while (st.hasMoreTokens()) {
      if (text.indexOf(st.nextToken().toLowerCase()) == -1) {
        return false;
      }
    }
    return true;
  }
  private String query = "";
}

Solution 2

I asked the author of the code, Mark Watson, for help, and he replied:

You can make this simple code change:

      //if ("URI".equals(lastElementName)) tempBinding.put("URI", s);
      if ("URI".equals(lastElementName) && s.indexOf("Category") == -1 && tempBinding.get("URI") == null) {
        tempBinding.put("URI", s);
      }

That is, comment out one line and add the next three.

That's it!
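
One caveat with the substring test, offered as a sketch rather than as part of Mark Watson's reply: checking for "Category" anywhere in the string also rejects resource URIs whose name merely contains that word, such as http://dbpedia.org/resource/Category_theory. Assuming category URIs always begin with the prefix seen in the lookup responses (http://dbpedia.org/resource/Category:), a stricter check could look like the following. The class and method names are hypothetical.

```java
/** Hypothetical helper, not part of the original DBpediaLookupClient. */
class CategoryUriFilter {
  // Assumption: every category URI starts with this prefix, as in the lookup responses.
  private static final String CATEGORY_PREFIX = "http://dbpedia.org/resource/Category:";

  /** Returns true only for URIs in the Category: namespace. */
  static boolean isCategoryUri(String uri) {
    return uri != null && uri.startsWith(CATEGORY_PREFIX);
  }
}
```

With this helper, the condition in characters() would use !CategoryUriFilter.isCategoryUri(s) in place of s.indexOf("Category") == -1.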

OTHER TIPS

When you do a search for, e.g., "History of Berlin", you're requesting a URL like

and you're getting back an XML result like this:

<?xml version="1.0" encoding="utf-8"?>
<ArrayOfResult 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://lookup.dbpedia.org/">
    <Result>
        <Label>Museum für Naturkunde</Label>
        <URI>http://dbpedia.org/resource/Museum_für_Naturkunde</URI>
        <Description></Description>
        <Classes></Classes>
        <Categories></Categories>
        <Templates></Templates>
        <Redirects></Redirects>
        <Refcount>155</Refcount>
    </Result>
    <Result>
        <Label>History of Berlin</Label>
        <URI>http://dbpedia.org/resource/History_of_Berlin</URI>
        <Description>
            Berlin is the capital city of Germany. Berlin is a young city by European standards, founded in the 12th century.
        </Description>
        <Classes></Classes>
        <Categories>
            <Category>
                <Label>History of Berlin</Label>
                <URI>http://dbpedia.org/resource/Category:History_of_Berlin</URI>
            </Category>
            <Category>
                <Label>History of Germany by location</Label>
                <URI>http://dbpedia.org/resource/Category:History_of_Germany_by_location</URI>
            </Category>
        </Categories>
        <Templates></Templates>
        <Redirects></Redirects>
        <Refcount>14</Refcount>
    </Result>
</ArrayOfResult>

You're right that there are URI elements with category URIs, e.g.,

<URI>http://dbpedia.org/resource/Category:History_of_Berlin</URI>

but what you should note is that from the root of the document, there are

ArrayOfResult/Result/Categories/Category/URI

elements, whereas the elements that you want are

ArrayOfResult/Result/URI 

elements. You just need to process your XML a bit differently; don't get all the content from all URI elements, but just from the URI elements that are children of Result elements. I'm not all that familiar with SAX parsing, but I think the important point is that once you've entered a Result, you should only grab the URI if you haven't entered another child element of Result.
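
As a rough illustration of that idea, here is a minimal SAX sketch that keeps a stack of open elements and records a URI only when its parent is Result. The class name ResultUriHandler and the method extractResultUris are hypothetical, and the sketch assumes the default (non-namespace-aware) parser, so that qName is the plain element name.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParserFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical handler: collects only URI elements that are direct children of Result. */
class ResultUriHandler extends DefaultHandler {
  private final Deque<String> openElements = new ArrayDeque<>(); // innermost element on top
  private final List<String> uris = new ArrayList<>();
  private final StringBuilder text = new StringBuilder();

  @Override
  public void startElement(String uri, String localName, String qName, Attributes attributes) {
    openElements.push(qName);
    text.setLength(0); // start collecting text for the new element
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    text.append(ch, start, length); // may be called several times per element
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    openElements.pop();
    // After the pop, peek() is the parent of the element that just closed.
    if ("URI".equals(qName) && "Result".equals(openElements.peek())) {
      uris.add(text.toString().trim());
    }
    text.setLength(0);
  }

  public List<String> uris() {
    return uris;
  }

  /** Parses a lookup response and returns the Result/URI values in document order. */
  public static List<String> extractResultUris(String xml) throws Exception {
    ResultUriHandler handler = new ResultUriHandler();
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
    return handler.uris();
  }
}
```

Because the check looks at the parent element rather than at the text of the URI, URIs under Categories/Category are skipped no matter what they contain.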

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow