How to remove stop words in Java?

https://stackoverflow.com/questions/12469332

02-07-2021

Question

I want to remove stop words in Java.

I read the stop words from a text file and store them in a Set:

Set<String> stopWords = new LinkedHashSet<String>();
BufferedReader br = new BufferedReader(new FileReader("stopwords.txt"));
String words = null;
while ((words = br.readLine()) != null) {
    stopWords.add(words.trim());
}
br.close();

Then I read another text file.

I want to remove the duplicate strings in that text file.

How can I do that?

Solution

To remove duplicate words from a file, the high-level logic is as follows.

Read the file.
Loop through the file content (i.e. one line at a time):
- Split the line into tokens on spaces, for example with a StringTokenizer.
- Add each token to your set. This ensures that you have only one entry per word.
Close the file.

Now you have a set that contains all the unique words of the file.
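
The steps above could be sketched roughly as follows. This is only a minimal illustration, not the answerer's own code: the class name UniqueWords and the method uniqueWords are made up for the example, and it reads from an in-memory string so that it runs without any file on disk. A LinkedHashSet is assumed so the words keep their first-seen order.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical helper class, named only for this example.
class UniqueWords {

    // Collects every distinct whitespace-separated token, in first-seen order.
    static Set<String> uniqueWords(Reader source) {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // With no delimiter argument, StringTokenizer splits on whitespace.
                StringTokenizer tokens = new StringTokenizer(line);
                while (tokens.hasMoreTokens()) {
                    words.add(tokens.nextToken()); // a Set ignores repeated entries
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // In-memory sample; for a real file, pass a FileReader instead.
        Reader sample = new StringReader("the cat sat\non the mat");
        System.out.println(uniqueWords(sample)); // prints [the, cat, sat, on, mat]
    }
}
```

Note that this treats "The" and "the" as different words and leaves punctuation attached to tokens; lowercasing or stripping punctuation would be an extra step if that matters for your data.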

OTHER TIPS

Use a Set for the stop words:

Set<String> stopWords = new LinkedHashSet<String>();
BufferedReader sw = new BufferedReader(new FileReader("StopWord.txt"));
for (String line; (line = sw.readLine()) != null;) {
    stopWords.add(line.trim());
}
sw.close();

and an ArrayList for the input text file:

BufferedReader br = new BufferedReader(new FileReader("txt_file.txt"));
// build your ArrayList of words here

// deleteStopWords() removes every word listed in "StopWord.txt" from the list
public ArrayList<String> deleteStopWords(Set<String> stopWords, ArrayList<String> arraylist) {
    ArrayList<String> newList = new ArrayList<String>();
    for (String word : arraylist) {
        // keep only the words that are not stop words
        if (!stopWords.contains(word)) {
            newList.add(word);
        }
    }
    return newList;
}

arraylist = deleteStopWords(stopWords, arraylist);

Using an ArrayList may be easier.

public ArrayList<String> removeDuplicates(ArrayList<String> source) {
    ArrayList<String> newList = new ArrayList<String>();
    for (int i = 0; i < source.size(); i++) {
        String s = source.get(i);
        // add each string only the first time it is seen
        if (!newList.contains(s)) {
            newList.add(s);
        }
    }
    return newList;
}

Hope this helps.

If you simply want to remove a certain set of words from the words in a file, you can do it however you want. But if you are dealing with a problem involving natural language processing, you should use a library.

For example, using Lucene for tokenizing will seem more complicated at first, but it will deal with myriad complications that you will overlook, and allow for great flexibility should you change your mind on the specific stopwords, on how you are tokenizing, whether you care about case, etc.

You should try using StringTokenizer.
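
As a rough sketch of what that might look like for stop-word removal: the class StopWordFilter, the method removeStopWords, and the DELIMITERS constant are all hypothetical names chosen for this example. It assumes the stop-word set holds lowercase entries and that common punctuation should be treated as a separator.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical helper class, named only for this example.
class StopWordFilter {

    // Assumed separators: whitespace plus some common punctuation.
    private static final String DELIMITERS = " \t\n\r\f.,;:!?\"()";

    // Returns the tokens of 'line' that are not stop words, keeping order and repeats.
    // Tokens are lowercased before the lookup, so the set should hold lowercase words.
    static List<String> removeStopWords(String line, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line, DELIMITERS);
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(List.of("the", "is", "on"));
        System.out.println(removeStopWords("The cat is on the mat.", stopWords));
        // prints [cat, mat]
    }
}
```

A fixed delimiter string like this is a simplification; it will not handle apostrophes, hyphenated words, or non-English punctuation well, which is where a dedicated tokenizer from a library becomes worthwhile.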

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow