Question

I have a collection of text from which I would like to extract every mention of a country. So far I have been able to populate a Set with all country names using the following code:

  Set<String> countries = new TreeSet<String>();
  Locale[] locales = Locale.getAvailableLocales();
  for (Locale locale : locales) {
        countries.add(locale.getDisplayCountry());
  }

I could of course build a regular expression for each country and run it against each line, but I was wondering whether I could do this with a single regular expression, that is, find out which countries are mentioned in a given line of text.


Solution

You can build a single regular expression by concatenating all names separated by '|', which means "any of these is fine". In your case, you can build it like this:

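// Requires java.util.regex.Pattern
StringBuilder exp = new StringBuilder();
for (String s : countries) {
  if (s.isEmpty()) continue;  // locales without a country yield ""
  exp.append(exp.length() == 0 ? "(" : "|");
  exp.append(Pattern.quote(s));  // escape '.', '(' and other metacharacters
}
Pattern countryPattern = Pattern.compile(exp.append(")").toString());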

Given countryPattern, you can then write the following to iterate over all matches:

Matcher m = countryPattern.matcher(aStringWithCountries);
while (m.find()) {
   System.err.println("Found country " + m.group(1));
}
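
For reference, the steps above can be combined into one self-contained sketch. The class and method names (CountryFinder, countryNames, buildPattern, findCountries) are hypothetical, and three details are assumptions rather than part of the answer: names are requested in English via Locale.ENGLISH so the result does not depend on the machine's default locale, empty names are skipped, and each name is passed through Pattern.quote.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CountryFinder {

    // Collects the display names of all countries known to the JVM's locales.
    static Set<String> countryNames() {
        Set<String> countries = new TreeSet<>();
        for (Locale locale : Locale.getAvailableLocales()) {
            // Assumption: English names, independent of the default locale.
            String name = locale.getDisplayCountry(Locale.ENGLISH);
            if (!name.isEmpty()) {   // language-only locales have no country
                countries.add(name);
            }
        }
        return countries;
    }

    // Builds one alternation of the form (\QFrance\E|\QJapan\E|...).
    static Pattern buildPattern(Set<String> names) {
        StringBuilder exp = new StringBuilder("(");
        for (String name : names) {
            if (exp.length() > 1) {
                exp.append('|');
            }
            exp.append(Pattern.quote(name));
        }
        return Pattern.compile(exp.append(')').toString());
    }

    // Returns every match in the order it appears in the text.
    static List<String> findCountries(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        Pattern pattern = buildPattern(countryNames());
        System.out.println(findCountries(pattern, "She flew from France to Japan."));
    }
}
```

Compiling the pattern once and reusing it for every line keeps the per-line cost down to the matching itself.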

OTHER TIPS

Concatenate all country names into one regex:

String regex = "(";
boolean first = true;

for (String name: countries) {
    regex += (first ? "" : "|") + Pattern.quote(name);
    first = false;
}

regex += ")";

(You can write more efficient code with StringBuilder.)

You will get a regex that has the form: (Country1|Country2|Country3), which will match if the text matches any of the country names.

This solution assumes that you want to match each country name exactly (down to the spaces and dots) as it is returned by getDisplayCountry. You can make the match case-insensitive by prepending (?i) to the regex.
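
A minimal sketch of the case-insensitive variant follows. It passes Pattern.CASE_INSENSITIVE to Pattern.compile, which has the same effect as prepending (?i). Two things here go beyond the answer and are assumptions: Pattern.UNICODE_CASE is added so that non-ASCII letters also fold case, and the names are sorted longest first, because Java tries alternatives left to right and a shorter name such as "Niger" would otherwise win over "Nigeria". The class name CaseInsensitiveCountries and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class CaseInsensitiveCountries {

    // Quotes each name, joins them with '|', and compiles case-insensitively.
    static Pattern build(List<String> names) {
        String alternation = names.stream()
                // Longer names first, so "Nigeria" is tried before "Niger".
                .sorted(Comparator.comparingInt(String::length).reversed())
                .map(Pattern::quote)
                .collect(Collectors.joining("|", "(", ")"));
        return Pattern.compile(alternation,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    // Returns the matched substrings exactly as they are spelled in the text.
    static List<String> find(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group(1));
        }
        return hits;
    }
}
```

Note that the matcher returns the text as it was written in the input (for example "FRANCE"), so mapping a hit back to its canonical name needs a separate lookup.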

If you only need simple matching (exact string matches), there is actually a better way than regular expressions: the Aho-Corasick string-matching algorithm. You build an Aho-Corasick automaton (a trie with failure links) from the country names and then scan the text once, which finds all occurrences in time linear in the length of the text plus the number of matches. Here is a Python implementation, and I hope there is one for Java as well.
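
To illustrate the idea in Java, here is a compact, hand-written sketch of Aho-Corasick; it is not taken from any library, and the class name AhoCorasick and its search method are hypothetical. It is case-sensitive, does not check word boundaries, and reports overlapping matches (so "Nigeria" also reports "Niger").

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class AhoCorasick {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;                                       // longest proper suffix that is also a trie path
        final List<String> outputs = new ArrayList<>();  // keywords ending at this node
    }

    private final Node root = new Node();

    AhoCorasick(Iterable<String> keywords) {
        // Phase 1: insert every keyword into a trie.
        for (String word : keywords) {
            if (word.isEmpty()) {
                continue;
            }
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(word);
        }
        // Phase 2: breadth-first pass that sets the failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                // Inherit keywords that end at the failure target.
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    // Scans the text once and returns every keyword occurrence in order.
    List<String> search(String text) {
        List<String> found = new ArrayList<>();
        Node node = root;
        for (char c : text.toCharArray()) {
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            node = node.next.getOrDefault(c, root);
            found.addAll(node.outputs);
        }
        return found;
    }
}
```

Usage would look like new AhoCorasick(countries).search(line), with the automaton built once and reused for every line of text.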

Licensed under: CC-BY-SA with attribution