Question

I can easily get the term frequency (TF) by counting the occurrences of a term in a document. I want to know how to calculate the document frequency, i.e. the number of documents that contain this term.

So far I have been querying Solr with a very large number of rows and counting the results that come back, but this is very expensive in both time and memory. I only want the count for the term.

    SolrQuery q = new SolrQuery();
    q.setQuery("tweet_text:"+kw);
    q.addField("tweet_text");
    q.setRows(40000000);        
    SolrDocumentList results = null ;

    try {
        QueryResponse rsp = solrServer.query(q);
        results = rsp.getResults();
    } catch (SolrServerException e) {
        e.printStackTrace();
    }

    ArrayList<String> tweets = new ArrayList<String>();
    for (SolrDocument doc : results)
    {
        tweets.add(doc.getFieldValue("tweet_text").toString());
    }

Solution

In Solr, you can use a function query to get the document frequency directly, as documented here: http://wiki.apache.org/solr/FunctionQuery#docfreq

    q={!func}docfreq(tweet_text, kw)

Note that the same page also documents function queries for tf, idf and termfreq, which may be helpful as well.
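
As a rough sketch of how such a request could be put together from plain Java, without a Solr client library: the class and method names here (DocFreqRequest, buildQueryString) are made up for illustration, and the code assumes the function is registered under the lowercase name docfreq, as on the wiki page, and that a quoted term accepts backslash escapes. Because docfreq has the same value for every document, asking for a single row with fl=score should be enough; the score of that row is then the document frequency.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class DocFreqRequest {

    // Builds the query-string part of a Solr select request that asks for
    // docfreq(field,'term') as the score of a single returned document.
    // (Hypothetical helper, not part of Solr or SolrJ.)
    public static String buildQueryString(String field, String term) {
        // Assumption: backslash-escape backslashes and single quotes so the
        // term stays one quoted literal inside the function call.
        String literal = term.replace("\\", "\\\\").replace("'", "\\'");
        String func = "{!func}docfreq(" + field + ",'" + literal + "')";
        // rows=1 instead of millions: every document carries the same score.
        return "q=" + encode(func) + "&fl=score&rows=1";
    }

    // URL-encodes a parameter value as UTF-8.
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Append the result to your core's select URL, e.g. ".../select?".
        System.out.println(buildQueryString("tweet_text", "solr"));
    }
}
```

If the index is empty, no row comes back at all, so treat a missing document as a document frequency of 0.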


In retrospect, the rest of this answer is probably less relevant to the question, but I'll leave it for the time being in case it is useful to you.

In Lucene, IndexReader.docFreq(Term) could get you what you're looking for.

For example:

    reader.docFreq(new Term("tweet_text", kw));

IndexSearcher.docFreq(Term) is the same thing, by the way.
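
To make concrete what a docFreq call counts, here is a small plain-Java sketch with no Lucene dependency. It is not how Lucene implements it; DocFreqSketch and its whitespace tokenizer are hypothetical stand-ins. The point is that a term repeated inside one document raises its term frequency there but contributes only 1 to its document frequency.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class DocFreqSketch {

    // Maps each term to the ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Indexes one document using a naive lowercase whitespace tokenizer
    // (an assumption; a real analyzer does considerably more).
    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
            }
        }
    }

    // Document frequency: the number of distinct documents containing the term.
    public int docFreq(String term) {
        Set<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? 0 : docs.size();
    }

    public static void main(String[] args) {
        DocFreqSketch index = new DocFreqSketch();
        index.addDocument(0, "solr solr lucene");
        index.addDocument(1, "lucene search");
        System.out.println(index.docFreq("solr"));   // prints 1 (tf in doc 0 is 2, df is 1)
        System.out.println(index.docFreq("lucene")); // prints 2
        System.out.println(index.docFreq("tweet"));  // prints 0
    }
}
```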

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow