Tokenize Arabic text files in Java

https://stackoverflow.com/questions/22946736

30-06-2023

Question

I am trying to tokenize some text files into words, and I wrote the code below. It works perfectly for English, but when I try it on Arabic text it does not work. I added UTF-8 as the encoding to read the Arabic files. Did I miss something?

public void parseFiles(String filePath) throws FileNotFoundException, IOException {
    File[] allfiles = new File(filePath).listFiles();
    BufferedReader in = null;
    for (File f : allfiles) {
        if (f.getName().endsWith(".txt")) {
            fileNameList.add(f.getName());
            Reader fstream = new InputStreamReader(new FileInputStream(f), "UTF-8");
            // BufferedReader br = new BufferedReader(fstream);
            in = new BufferedReader(fstream);
            StringBuilder sb = new StringBuilder();
            String s = null;
            String word = null;
            while ((s = in.readLine()) != null) {
                Scanner input = new Scanner(s);
                while (input.hasNext()) {
                    word = input.next();
                    if (stopword.isStopword(word) == true) {
                        word = word.replace(word, "");
                    }

                    //String stemmed=stem.stem (word);
                    sb.append(word + "\t");
                }
                //System.out.print(sb);  // here the Arabic text is output without stopwords
            }
            String[] tokenizedTerms = sb.toString().replaceAll("[\\W&&[^\\s]]", "").split("\\W+");   //to get individual terms

            for (String term : tokenizedTerms) {
                if (!allTerms.contains(term)) {  //avoid duplicate entry
                    allTerms.add(term);
                    System.out.print(term+"\t");  //here the problem.
                }
            }
            termsDocsArray.add(tokenizedTerms);
        }
    }

}

Any ideas to help me proceed would be appreciated. Thanks.

Solution

The problem lies in your regex, which works well for English but not for Arabic, because by definition

[\\W&&[^\\s]]

means

// matches any single non-word character that is not whitespace
\W  Any character other than [a-zA-Z_0-9]. (By default, all Arabic letters satisfy this condition.)
\s  A whitespace character, short for [ \t\n\x0b\r\f]
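To see this concretely, here is a minimal sketch that applies the same character class to an English and an Arabic string. The class and method names (WordClassDemo, stripAsciiNonWord) are made up for illustration, and the Arabic text ("مرحبا بالعالم") is written with Unicode escapes so that the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

class WordClassDemo {
    // Same character class as in the question. Without extra flags,
    // \W means "anything other than [a-zA-Z_0-9]".
    static final Pattern ASCII_NON_WORD = Pattern.compile("[\\W&&[^\\s]]");

    // Hypothetical helper: removes every match of the class above.
    static String stripAsciiNonWord(String text) {
        return ASCII_NON_WORD.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Two Arabic words separated by one space (escaped code points).
        String arabic = "\u0645\u0631\u062D\u0628\u0627 "
                + "\u0628\u0627\u0644\u0639\u0627\u0644\u0645";

        // English: only the punctuation is removed.
        System.out.println("[" + stripAsciiNonWord("hello, world!") + "]");
        // Arabic: every letter counts as a non-word character,
        // so only the space between the two words survives.
        System.out.println("[" + stripAsciiNonWord(arabic) + "]");
    }
}
```

The brackets in the output are only there to make the leftover whitespace visible.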

So, by this logic, every Arabic letter is matched by this regex. So, when you call

sb.toString().replaceAll("[\\W&&[^\\s]]", "")

it means: replace every non-word character that is not whitespace with "". In the case of Arabic, that is every letter, so all the Arabic characters are replaced by "" and no output appears. You will have to tweak this regex to work for Arabic text, or just split the string on whitespace, like

sb.toString().split("\\s+")

which will give you an array of the Arabic words, separated on whitespace.
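Both options can be sketched as follows. This is only an illustration under some assumptions: the class and method names (ArabicSplitDemo, splitOnWhitespace, tokenize) are invented here, and the "tweaked" regex uses the embedded flag (?U), i.e. Pattern.UNICODE_CHARACTER_CLASS (available since Java 7), which makes \W and \s follow Unicode definitions so that Arabic letters count as word characters. Other tweaks, such as an explicit \p{L}-based class, would work too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ArabicSplitDemo {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Same class as in the question, but with (?U) so that \W no longer
    // matches Arabic letters; punctuation is still matched.
    private static final Pattern UNICODE_NON_WORD =
            Pattern.compile("(?U)[\\W&&[^\\s]]");

    // Option 1: just split on whitespace; punctuation stays attached.
    static String[] splitOnWhitespace(String text) {
        return WHITESPACE.split(text.trim());
    }

    // Option 2: strip punctuation with the Unicode-aware class, then split.
    static List<String> tokenize(String text) {
        String cleaned = UNICODE_NON_WORD.matcher(text).replaceAll("");
        List<String> tokens = new ArrayList<>();
        for (String token : splitOnWhitespace(cleaned)) {
            if (!token.isEmpty()) {   // an empty input yields one empty token
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Two Arabic words with an Arabic comma (U+060C) and a "!".
        String text = "\u0645\u0631\u062D\u0628\u0627\u060C "
                + "\u0628\u0627\u0644\u0639\u0627\u0644\u0645!";
        System.out.println(splitOnWhitespace(text).length + " tokens (punctuation kept)");
        System.out.println(tokenize(text));
    }
}
```

In the question's code, this would mean replacing the replaceAll(...).split("\\W+") line; note that split("\\W+") without the (?U) flag has the same problem, since it treats every Arabic letter as a separator.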

OTHER TIPS

In addition to worrying about character encoding as in bgth's response, tokenizing Arabic has the added complication that words are not necessarily separated by white space:

http://www1.cs.columbia.edu/~rambow/papers/habash-rambow-2005a.pdf

If you're not familiar with Arabic, you'll need to read up on some of the methods for tokenization:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.120.9748

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow