我在GitHub上找到了一个代码,该代码使用DBPEDIA查找发送单词并从DBPEDIA获取候选uris。问题是:所有的uris都带有单词 Category. 。例如,对于单词 Berlin 它返回:

代替

如果我将第一个URI(带有“类别”的URI)放在浏览器上,它不会向我显示与主题“ history_of_berlin”相对应的页面,它将返回我一个包含链接列表和我可以在哪里的页面找到指向“ history_of_berlin”的链接。但是,如果我放置第二个URI(没有“类别”的一个URI),它将返回与主题“ history_of_berlin”相对应的页面。我如何避免将这些带有“类别”的URI从查找中退回?

代码

package com.knowledgebooks.info_spiders;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.methods.GetMethod;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLEncoder;
import java.util.*;

/**
 * Copyright Mark Watson 2008-2010. All Rights Reserved.
 * License: LGPL version 3 (http://www.gnu.org/licenses/lgpl-3.0.txt)
 */

// Use Georgi Kobilarov's DBpedia lookup web service
//    ref: http://lookup.dbpedia.org/api/search.asmx?op=KeywordSearch
//    example: http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString=Flagstaff&QueryClass=XML&MaxHits=10

/**
 * Searches return results that contain any of the search terms. I am going to filter
 * the results to ignore results that do not contain all search terms.
 */


public class DBpediaLookupClient extends DefaultHandler {
  public DBpediaLookupClient(String query) throws Exception {
    this.query = query;
    HttpClient client = new HttpClient();

    String query2 = query.replaceAll(" ", "+"); // URLEncoder.encode(query, "utf-8");
    HttpMethod method =
      new GetMethod("http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString=" +
        query2);
    try {
      client.executeMethod(method);
      System.out.println(method);
      InputStream ins = method.getResponseBodyAsStream();
      SAXParserFactory factory = SAXParserFactory.newInstance();
      SAXParser sax = factory.newSAXParser();
      sax.parse(ins, this);
    } catch (HttpException he) {
      System.err.println("Http error connecting to lookup.dbpedia.org");
    } catch (IOException ioe) {
      System.err.println("Unable to connect to lookup.dbpedia.org");
    }
    method.releaseConnection();
  }

  private List<Map<String, String>> variableBindings = new ArrayList<Map<String, String>>();
  private Map<String, String> tempBinding = null;
  private String lastElementName = null;

  public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
    //System.out.println("startElement " + qName);
    if (qName.equalsIgnoreCase("result")) {
      tempBinding = new HashMap<String, String>();
    }
    lastElementName = qName;
  }

  public void endElement(String uri, String localName, String qName) throws SAXException {
    //System.out.println("endElement " + qName);
    if (qName.equalsIgnoreCase("result")) {
      if (!variableBindings.contains(tempBinding) && containsSearchTerms(tempBinding))
        variableBindings.add(tempBinding);
    }
  }

  public void characters(char[] ch, int start, int length) throws SAXException {
    String s = new String(ch, start, length).trim();
    //System.out.println("characters (lastElementName='" + lastElementName + "'): " + s);
    if (s.length() > 0) {
      if ("Description".equals(lastElementName)) {
        if (tempBinding.get("Description") == null) {
          tempBinding.put("Description", s);
        }
        tempBinding.put("Description", "" + tempBinding.get("Description") + " " + s);
      }
      if ("URI".equals(lastElementName)) tempBinding.put("URI", s);
      if ("Label".equals(lastElementName)) tempBinding.put("Label", s);
    }
  }

  public List<Map<String, String>> variableBindings() {
    return variableBindings;
  }
  private boolean containsSearchTerms(Map<String, String> bindings) {
    StringBuilder sb = new StringBuilder();
    for (String value : bindings.values()) sb.append(value);  // do not need white space
    String text = sb.toString().toLowerCase();
    StringTokenizer st = new StringTokenizer(this.query);
    while (st.hasMoreTokens()) {
      if (text.indexOf(st.nextToken().toLowerCase()) == -1) {
        return false;
      }
    }
    return true;
  }
  private String query = "";
}
有帮助吗?

解决方案 2

我向代码的作者马克·沃森(Mark Watson)寻求帮助,他回答了我:

您可以更改此简单的代码:

      //if ("URI".equals(lastElementName)) tempBinding.put("URI", s);
      if ("URI".equals(lastElementName) && s.indexOf("Category")==-1
&& tempBinding.get("URI") == null) {
        tempBinding.put("URI", s);
      }

那是评论1行,添加下三个。

而已!

其他提示

当您进行搜索时,例如“柏林历史”时,您会要求一个URL

而且您会得到这样的XML结果:

<?xml version="1.0" encoding="utf-8"?>
<ArrayOfResult 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://lookup.dbpedia.org/">
    <Result>
        <Label>Museum für Naturkunde</Label>
        <URI>http://dbpedia.org/resource/Museum_für_Naturkunde</URI>
        <Description></Description>
        <Classes></Classes>
        <Categories></Categories>
        <Templates></Templates>
        <Redirects></Redirects>
        <Refcount>155</Refcount>
    </Result>
    <Result>
        <Label>History of Berlin</Label>
        <URI>http://dbpedia.org/resource/History_of_Berlin</URI>
        <Description>
            Berlin is the capital city of Germany. Berlin is a young city by European standards, founded in the 12th century.
        </Description>
        <Classes></Classes>
        <Categories>
            <Category>
                <Label>History of Berlin</Label>
                <URI>http://dbpedia.org/resource/Category:History_of_Berlin</URI>
            </Category>
            <Category>
                <Label>History of Germany by location</Label>
                <URI>http://dbpedia.org/resource/Category:History_of_Germany_by_location</URI>
            </Category>
        </Categories>
        <Templates></Templates>
        <Redirects></Redirects>
        <Refcount>14</Refcount>
    </Result>
</ArrayOfResult>

你是对的 URI 具有URI类别的元素,例如

<URI>http://dbpedia.org/resource/Category:History_of_Berlin</URI>

但是您应该注意的是,从文档的根部中,有

ArrayOfResult/Result/Categories/Category/URI

元素,而您想要的元素是

ArrayOfResult/Result/URI 

元素。您只需要以不同的方式处理XML即可;不明白 全部 来自 全部 URI 元素,但只是从 URI 是子女的元素 Result 元素。我并不那么熟悉萨克斯的解析,但我认为重要的是,一旦您进入了 Result, ,您只应该抓住 URI 如果您还没有输入另一个子元素 Result.

许可以下: CC-BY-SA归因
不隶属于 StackOverflow
scroll top