Extracting text using Apache Tika, then getting frequently occurring words after removing stopwords

https://stackoverflow.com/questions/17442341

02-06-2022

Question

I have extracted the text of a sample.pdf file using Tika and Lucene. I then tried to remove the stopwords and get a word count of the remaining words (excluding stopwords) in the text.

My sample.pdf contains:

This is java related information it contains java prg.

Below is my code:

String[] stopwords ={"a", "about", "above", "above", "across", "after", "afterwards", "again", "against", "all", "almost", 
                        "alone", "along", "already", "also","although","always","am","among", "amongst", "amoungst", "amount",  "an", "and", 
                        "another", "any","anyhow","anyone","anything","anyway", "anywhere", "are", "around", "as",  "at", "back","be","became", 
                        "because","become","becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", 
                        "between", "beyond", "bill", "both", "bottom","but", "by", "call", "can", "cannot", "cant", "co", "con", "could", "couldnt",
                        "cry", "de", "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven","else",
                        "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", 
                        "fifteen", "fify", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", 
                        "front", "full", "further", "get", "give", "go", "had", "has", "hasnt",
                        "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", 
                        "him", "himself", "his", "how", "however", "hundred", "ie", "if", "in", "inc", "indeed", "interest", "into", 
                        "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", 
                        "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much", "must", 
                        "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", 
                        "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", 
                        "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own","part", "per", "perhaps",
                        "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she",
                        "should", "show", "side", "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone", "something", 
                        "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", 
                        "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", 
                        "these", "they", "thickv", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", 
                        "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", 
                        "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever",
                        "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", 
                        "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet",
                        "you", "your", "yours", "yourself", "yourselves","1","2","3","4","5","6","7","8","9","10","1.","2.","3.","4.","5.","6.","11",
                        "7.","8.","9.","12","13","14","A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z",
                        "terms","CONDITIONS","conditions","values","interested.","care","sure",".","!","@","#","$","%","^","&","*","(",")","{","}","[","]",":",";",",","<",".",">","/","?","_","-","+","=",
                        "a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z",
                        "contact","grounds","buyers","tried","said,","plan","value","principle.","forces","sent:","is,","was","like",
                        "discussion","tmus","diffrent.","layout","area.","thanks","thankyou","hello","bye","rise","fell","fall","psqft.","http://","km","miles"};

                Map map = new TreeMap();
                   File file1 = new File("C://sample.pdf");
                   InputStream input = new FileInputStream(file1);           
                   Metadata metadata = new Metadata();
                  BodyContentHandler handler = new BodyContentHandler(10*1024*1024);
                  AutoDetectParser parser = new AutoDetectParser();       
                  parser.parse(input, handler, metadata);
                  Document doc = new Document();
                doc.add(new Field("contents",handler.toString(),Field.Store.NO,Field.Index.ANALYZED));
                String result = doc.toString();
                String[] res=result.split(" ");
                for (int i=0;i<res.length;i++)
                {
                int flag=1;
                    String s1=res[i].toLowerCase();

                  for(int j=0;j<stopwords.length;j++){
                      if(s1.equals(stopwords[j]))
                          {
                          flag=0;
                          }
                    if(flag!=0)
                  {
                     if (s1.length() > 0) { 

                         Integer frequency = (Integer) map.get(s1);
                              if (frequency == null) {
                                frequency = ONE;
                              } else {

                                int value = frequency.intValue();
                                frequency = new Integer(value + 1);
                            }
                              map.put(s1, frequency);
                             }  
                               }
                }
                }
                input.close();
                System.out.println("Finalresult:"+map);
                 }

I'm getting the following output, which is not correct:

Finalresult:{contains=456, document<indexed,tokenized<contents:this=456, information=456, is=139, it=140, java=912, prg=456, related=456}

I should get the following output:

information=1,java=2, prg=1, related=1

Can you please suggest how to get the required output? Thanks.

Solution

Looks like an example of why consistent code formatting is important. Good indentation would probably make the cause of this issue much more obvious to you.

for (int i=0;i<res.length;i++)
{
    int flag=1;
    String s1=res[i].toLowerCase();

    for(int j=0;j<stopwords.length;j++)
    {
        if(s1.equals(stopwords[j]))
        {
            flag=0;
        }
        // -------- We are still looping through stopwords!  This for loop should be closed here! ---------
        if(flag!=0)
        {
            if (s1.length() > 0) 
            { 
                //Now this is going to add to the list for every entry in stopwords, until we find a match!
                Integer frequency = (Integer) map.get(s1);
                if (frequency == null) 
                {
                    frequency = ONE;
                } else 
                {
                    int value = frequency.intValue();
                    frequency = new Integer(value + 1);
                }
                map.put(s1, frequency);
            }  
        }
    }
}

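You have 456 entries in stopwords, as we can see from the counts. Because the inner for loop over stopwords is not closed before the if(flag!=0) check, every word is counted once per stopword entry until a match sets flag to 0. That is why a word that appears once and is not a stopword ends up at 456, "java" (which appears twice) at 912, and a stopword such as "is" at 139, one count for each entry that precedes it in the array. The behavior you're seeing is all due to the lack of a } closing that inner loop.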
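For illustration, here is a minimal sketch of the counting step with the stopword check finished before anything is counted. It is not your exact code: the class name WordCount, the method countWords, and the three-word stopword set are hypothetical and only there for the example. It also assumes you count the extracted text itself (for instance handler.toString()) rather than doc.toString(), since the "document<indexed,tokenized<contents:this" token in your output looks like it comes from the Document's string form. A Set lookup replaces the inner loop, so there is no brace to misplace.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class WordCount {

    // Counts how often each non-stopword occurs in the given text.
    static Map<String, Integer> countWords(String text, Set<String> stopwords) {
        Map<String, Integer> counts = new TreeMap<>();
        // Lowercase, then split on anything that is not a letter or digit,
        // so trailing punctuation such as "prg." does not end up in a token.
        for (String word : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            // The stopword decision is complete before any counting happens.
            if (word.isEmpty() || stopwords.contains(word)) {
                continue;
            }
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical, shortened stopword list; use your full list here.
        Set<String> stopwords = new HashSet<>(Arrays.asList("this", "is", "it"));
        // Stand-in for the text extracted by Tika.
        String text = "This is java related information it contains java prg.";
        System.out.println("Finalresult:" + countWords(text, stopwords));
        // prints Finalresult:{contains=1, information=1, java=2, prg=1, related=1}
    }
}
```

Note that "contains" still shows up with a count of 1, because it is not in your stopword list; add it to the list if you want it excluded.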

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow