How to allow more than two words to be checked in this Java Code

https://stackoverflow.com/questions/4644630

09-10-2019

Question

I need to modify this script so that more than two words can be checked, but my knowledge of Java is too limited to make the changes myself. The script is part of an open-source grammar checker for OpenOffice (LanguageTool), and its purpose is to replace certain words with other words.

The file of words to be checked is called "coherency.txt" and its format is like this: WrongWord1=CorrectWord1 WrongWord2=CorrectWord2

When I type WrongWord1, the script flags it and tells me I should use CorrectWord1 instead.

But I need to be able to have three words or more, like this: WrongWord1=WrongWord2=CorrectWord1 WrongWord3=WrongWord4=WrongWord5=CorrectWord2 WrongWord6=CorrectWord3

That way, when I type WrongWord3, it is flagged and the script tells me I should use CorrectWord2, and when I type WrongWord2, it is also flagged and the script tells me I should use CorrectWord1.

If you can help I can put a link to your webpage at http://www.sbbic.org/lang/en-us/volunteer/

Any help you can give on how to modify this code to allow more than two words to be compared and replaced would be greatly appreciated! Thanks, Nathan

/* LanguageTool, a natural language style checker
 * Copyright (C) 2005 Daniel Naber (http://www.danielnaber.de)
 * 
 * This library is free software; you can redistribute it and/or
 * modify it under the terms of the GNU Lesser General Public
 * License as published by the Free Software Foundation; either
 * version 2.1 of the License, or (at your option) any later version.
 *
 * This library is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * Lesser General Public License for more details.
 *
 * You should have received a copy of the GNU Lesser General Public
 * License along with this library; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301
 * USA
 */
package de.danielnaber.languagetool.rules;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.ResourceBundle;

import de.danielnaber.languagetool.AnalyzedSentence;
import de.danielnaber.languagetool.AnalyzedTokenReadings;
import de.danielnaber.languagetool.JLanguageTool;
import de.danielnaber.languagetool.tools.StringTools;

/**
 * A Khmer rule that matches words or phrases which should not be used and suggests
 * correct ones instead. Loads the relevant words  from 
 * <code>rules/km/coherency.txt</code>, where km is a code of the language.
 * 
 * @author Andriy Rysin
 */
public abstract class KhmerWordCoherencyRule extends KhmerRule {

  private Map<String, String> wrongWords; // e.g. "врешті решт" -> "зрештою"

  private static final String FILE_NAME = "/km/coherency.txt";

  public abstract String getFileName();

  private static final String FILE_ENCODING = "utf-8";

  public String getEncoding() {
    return FILE_ENCODING;
  }

  /**
   * Indicates if the rule is case-sensitive. Default value is <code>false</code>.
   * @return true if the rule is case-sensitive, false otherwise.
   */
  public boolean isCaseSensitive() {
    return false;  
  }

  /**
   * @return the locale used for case conversion when {@link #isCaseSensitive()} is set to <code>false</code>.
   */
  public Locale getLocale() {
    return Locale.getDefault();
  }  

  public KhmerWordCoherencyRule(final ResourceBundle messages) throws IOException {
    if (messages != null) {
      super.setCategory(new Category(messages.getString("category_misc")));
    }
    wrongWords = loadWords(JLanguageTool.getDataBroker().getFromRulesDirAsStream(getFileName()));
  }

  public String getId() {
    return "KM_WORD_COHERENCY";
  }

  public String getDescription() {
    return "Checks for wrong words/phrases";
  }

  public String getSuggestion() {
    return " is not valid, use ";
  }

  public String getShort() {
    return "Wrong word";
  }

  public final RuleMatch[] match(final AnalyzedSentence text) {
    final List<RuleMatch> ruleMatches = new ArrayList<RuleMatch>();
    final AnalyzedTokenReadings[] tokens = text.getTokensWithoutWhitespace();

    for (int i = 1; i < tokens.length; i++) {
      final String token = tokens[i].getToken();

      final String origToken = token;
      final String replacement = isCaseSensitive()?wrongWords.get(token):wrongWords.get(token.toLowerCase(getLocale()));
      if (replacement != null) {
        final String msg = token + getSuggestion() + replacement;
        final int pos = tokens[i].getStartPos();
        final RuleMatch potentialRuleMatch = new RuleMatch(this, pos, pos
            + origToken.length(), msg, getShort());
        if (!isCaseSensitive() && StringTools.startsWithUppercase(token)) {
          potentialRuleMatch.setSuggestedReplacement(StringTools.uppercaseFirstChar(replacement));
        } else {
          potentialRuleMatch.setSuggestedReplacement(replacement);
        }
        ruleMatches.add(potentialRuleMatch);
      }
    }
    return toRuleMatchArray(ruleMatches);
  }


  private Map<String, String> loadWords(final InputStream file) throws IOException {
    final Map<String, String> map = new HashMap<String, String>();
    InputStreamReader isr = null;
    BufferedReader br = null;
    try {
      isr = new InputStreamReader(file, getEncoding());
      br = new BufferedReader(isr);
      String line;

      while ((line = br.readLine()) != null) {
        line = line.trim();
        if (line.length() < 1) {
          continue;
        }
        if (line.charAt(0) == '#') { // ignore comments
          continue;
        }
        final String[] parts = line.split("=");
        if (parts.length != 2) {
          throw new IOException("Format error in file "
              + JLanguageTool.getDataBroker().getFromRulesDirAsUrl(getFileName()) + ", line: " + line);
        }
        map.put(parts[0], parts[1]);
      }

    } finally {
      if (br != null) {
        br.close();
      }
      if (isr != null) {
        isr.close();
      }
    }
    return map;
  }

  public void reset() {
  }  

}

Solution

For small adaptations:

Consider changing the desired input format to

WrongWord = CorrectWord[, CorrectWord]*

The key is the incorrect word and the value is a comma-separated list of correct alternatives, so the existing split around = can stay as it is.

The Map should be of type Map<String, Set<String>>: each token maps to a set of alternatives.

Now you can split each line around = to get a key/value pair and each value around , to get an array of suggested tokens to replace the input.
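The two splits described above can be sketched as follows. This is only a minimal illustration under assumptions: the class name AlternativesParserSketch and the method parseLine are hypothetical and not part of LanguageTool, and the sketch assumes one entry per line.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of LanguageTool.
final class AlternativesParserSketch {

  // Parses one line such as "WrongWord = CorrectWord1, CorrectWord2"
  // and stores the wrong word with its set of alternatives in the map.
  static void parseLine(final String line, final Map<String, Set<String>> map) {
    final String[] parts = line.split("=");
    if (parts.length != 2) {
      throw new IllegalArgumentException("Format error in line: " + line);
    }
    final String wrongWord = parts[0].trim();
    // LinkedHashSet keeps the alternatives in the order they appear in the file.
    final Set<String> alternatives = new LinkedHashSet<String>();
    for (final String alternative : parts[1].split(",")) {
      final String trimmed = alternative.trim();
      if (!trimmed.isEmpty()) {
        alternatives.add(trimmed);
      }
    }
    if (wrongWord.isEmpty() || alternatives.isEmpty()) {
      throw new IllegalArgumentException("Format error in line: " + line);
    }
    map.put(wrongWord, alternatives);
  }

  public static void main(final String[] args) {
    final Map<String, Set<String>> map = new LinkedHashMap<String, Set<String>>();
    parseLine("WrongWord1 = CorrectWord1, CorrectWord2", map);
    System.out.println(map); // prints {WrongWord1=[CorrectWord1, CorrectWord2]}
  }
}
```

In the real rule, the body of parseLine would replace the split and map.put lines inside loadWords, and the IOException used there could be kept instead of IllegalArgumentException.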

Then you'll need some changes in the match method to assemble a new message, because you now expect more than one suggestion.

Change the lines after final String origToken = token; to

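// Set<String> matches the Map<String, Set<String>> type; the case handling is kept from the original lookup
final Set<String> replacements = isCaseSensitive()
    ? wrongWords.get(token) : wrongWords.get(token.toLowerCase(getLocale()));
if (replacements != null) {
  final String msg = createMessage(token, replacements);
  final int pos = tokens[i].getStartPos();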

and implement the createMessage method to return a human-readable message that tells the user the one or more alternatives for the token.
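One possible shape for createMessage is sketched below. The answer only names the method, so the wording of the message, the " or " separator, and the class name MessageSketch are assumptions. The prefix reuses the text returned by getSuggestion() in the question's code, and String.join requires Java 8 or later.

```java
import java.util.Arrays;
import java.util.Collection;

// Hypothetical sketch; in the rule this would be a private method of KhmerWordCoherencyRule.
final class MessageSketch {

  // Builds a message such as "X is not valid, use A or B".
  static String createMessage(final String token, final Collection<String> replacements) {
    // " is not valid, use " mirrors getSuggestion() from the question's code.
    return token + " is not valid, use " + String.join(" or ", replacements);
  }

  public static void main(final String[] args) {
    System.out.println(createMessage("WrongWord1", Arrays.asList("CorrectWord1", "CorrectWord2")));
    // prints: WrongWord1 is not valid, use CorrectWord1 or CorrectWord2
  }
}
```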

OTHER TIPS

The thing you have to change is this part in loadWords:

final String[] parts = line.split("=");
if (parts.length != 2) {
    throw new IOException("Format error in file " + JLanguageTool.getDataBroker().getFromRulesDirAsUrl(getFileName()) + ", line: " + line);
}
map.put(parts[0], parts[1]);

This puts the left-hand side of the equals sign into the map as the key and the right-hand side as the value, so the left-hand side has to be the wrong word. Your input should therefore become wrong1 = wrong2 = ... = correct.

With this format you could simply change it to the following:

final String[] parts = line.split("=");
if (parts.length < 2) {
    throw new IOException("Format error in file " + JLanguageTool.getDataBroker().getFromRulesDirAsUrl(getFileName()) + ", line: " + line);
}
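final String correct = parts[parts.length - 1].trim(); // the last part is the correct word
for (int i = 0; i < parts.length - 1; i++) {
    map.put(parts[i].trim(), correct); // trim so "wrong1 = correct" also works
}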

which would produce the following entries in the map:

wrong1 = correct
wrong2 = correct
wrong3 = correct
...

This is probably not the most efficient solution, but it should work roughly like this. With this map, each wrong word can be looked up and the suggestion will be the matching correct word.

(P.S.: I could not run the code, so there could be some coding errors in it)
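For anyone who wants to try the idea outside LanguageTool, here is a self-contained sketch of the same loop. It assumes one entry per line and reads from a plain Reader instead of the JLanguageTool data broker. The class name CoherencyLoaderSketch is made up for this example, and the trimming of each part is an addition so that spaces around = are tolerated.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone version of loadWords; not LanguageTool code.
final class CoherencyLoaderSketch {

  // Every part before the last "=" is a wrong word; the last part is the correct word.
  static Map<String, String> loadWords(final Reader source) throws IOException {
    final Map<String, String> map = new HashMap<String, String>();
    final BufferedReader br = new BufferedReader(source);
    try {
      String line;
      while ((line = br.readLine()) != null) {
        line = line.trim();
        if (line.isEmpty() || line.charAt(0) == '#') { // skip blank lines and comments
          continue;
        }
        final String[] parts = line.split("=");
        if (parts.length < 2) {
          throw new IOException("Format error, line: " + line);
        }
        final String correct = parts[parts.length - 1].trim();
        for (int i = 0; i < parts.length - 1; i++) {
          map.put(parts[i].trim(), correct);
        }
      }
    } finally {
      br.close();
    }
    return map;
  }

  public static void main(final String[] args) throws IOException {
    final String data = "# sample data\n"
        + "WrongWord1=WrongWord2=CorrectWord1\n"
        + "WrongWord3=WrongWord4=WrongWord5=CorrectWord2\n"
        + "WrongWord6=CorrectWord3\n";
    final Map<String, String> words = loadWords(new StringReader(data));
    System.out.println(words.get("WrongWord2")); // prints CorrectWord1
    System.out.println(words.get("WrongWord3")); // prints CorrectWord2
  }
}
```

Because the map still has type Map<String, String>, the match method from the question would not need to change with this approach.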

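import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {

    public static void main(String[] args) {
        // Here the first word of each group is the correct one; the rest are misspellings.
        String txtFromFile = "Hipopotamus=hIppoPotamus=hiiippotamus Magazine=Mazagine=Masagine";
        String searchWord = "Masagine";
        // Pattern.quote keeps any regex metacharacters in the search word literal.
        // Note that \w matches only ASCII word characters by default.
        Pattern searchPattern = Pattern.compile("(\\w+=)*?(" + Pattern.quote(searchWord) + ")");
        Matcher m = searchPattern.matcher(txtFromFile);
        String source = "";
        while (m.find()) {
            source = m.group();
            System.out.println("word pairs:" + source);
        }
        // The part before the first "=" of the matched group is the correct word.
        System.out.println("correct word:" + source.split("=")[0]);
    }
}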

Licensed under: CC-BY-SA with attribution
