Question

Let's say that I have a paragraph with different languages in it, like:

This is paragraph in English. 这是在英国段。Это пункт на английском языке. این بند در زبان انگلیسی است.

I would like to calculate what percentage (%) of this paragraph consists of English words, so I would like to ask how to do that in Python.


Solution

This offline solution uses the pyenchant spellcheck module:

# -*- coding: utf-8 -*-
import enchant

dictionary = enchant.Dict("en_US")

paragraph = u"This is paragraph in English. 这是在英国段。Это пункт на английском языке. این بند در زبان انگلیسی است."

words = paragraph.split(" ")
en_count = 0.0
for word in words:
    # Strip surrounding periods so "English." is checked as "English";
    # enchant raises an error on empty strings, so skip those.
    word = word.strip(".。")
    if word and dictionary.check(word):
        en_count += 1

percent = en_count / len(words) * 100 if words else 0
print(str(percent) + "% english words")

Output:

31.25% english words
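
If splitting on spaces and hand-stripping punctuation feels fragile, pyenchant also ships a tokenizer that yields word tokens with the punctuation already removed. This is only a sketch on top of the answer above (the tokenizer is not part of the original solution, and the denominator still depends on how the non-Latin runs happen to tokenize):

from enchant.tokenize import get_tokenizer
import enchant

dictionary = enchant.Dict("en_US")
tokenizer = get_tokenizer("en_US")

paragraph = u"This is paragraph in English. 这是在英国段。Это пункт на английском языке. این بند در زبان انگلیسی است."

# The tokenizer yields (word, position) pairs, so "English." comes back as "English".
tokens = [word for (word, pos) in tokenizer(paragraph)]
english = [word for word in tokens if dictionary.check(word)]
if tokens:
    print("%.2f%% english words" % (100.0 * len(english) / len(tokens)))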

OTHER TIPS

Some posters have found that there are 16 words in the paragraph. But are there? One problem is that it's difficult to use an English-only approach when you want to compare the number of English words to the total number of words in the sentence. Finding the number of English words is "relatively" easy, but the second part, finding the total number of words in the sentence, is harder: you need resources that can disambiguate how many words are contained in 这是在英国段 before you can express the English words as a percentage of the words in the paragraph.
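
As a hedged sketch of that disambiguation step (the segmenter is my suggestion, not something this answer uses): the third-party jieba library can split the Chinese run into words, so the denominator counts Chinese words instead of one unbroken token.

# pip install jieba -- a third-party Chinese word segmenter (an assumption, not part of this answer)
import jieba

chinese_run = u"这是在英国段"
seg = jieba.lcut(chinese_run)  # a list of segmented "words"; the exact split depends on jieba's dictionary
print(len(seg), "/".join(seg))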

Try using the Natural Language Toolkit (NLTK). NLTK is a Python library with built-in functions for exactly what you're looking for (frequency of occurrence of words, tokenizing strings, etc.), plus access to English-language corpora that you can compare words against: if you want to find the English words, check each word in the sentence against the words contained in the corpora.

The accompanying book, Natural Language Processing with Python (1st edition, for Python 2.x), is available for free online from the NLTK website. It serves as an introduction both to the NLTK library and to Python programming in general. The Wordlist Corpus or the Roget's Thesaurus Corpus might be useful. There are also libraries that try to detect the language of a piece of text, although it is not clear how well they handle mixed-language cases.
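
A rough sketch of the corpus-comparison idea, assuming NLTK and its data downloads are available (none of this code is from the original answer, and the token counts for the non-English scripts will depend on the tokenizer):

import nltk
nltk.download('punkt')  # tokenizer model (newer NLTK versions may also need 'punkt_tab')
nltk.download('words')  # the Wordlist Corpus mentioned above

from nltk.corpus import words as wordlist
from nltk.tokenize import word_tokenize

english_vocab = set(w.lower() for w in wordlist.words())

paragraph = u"This is paragraph in English. 这是在英国段。Это пункт на английском языке. این بند در زبان انگلیسی است."

tokens = [t for t in word_tokenize(paragraph) if t.isalpha()]  # drop punctuation tokens
english = [t for t in tokens if t.lower() in english_vocab]
print('%d of %d tokens found in the English wordlist' % (len(english), len(tokens)))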

First, get a list of English words. Then, iterate through the file and count!

import string
import urllib.request

punctuation = set(string.punctuation)

eng_words_url = 'https://raw.github.com/eneko/data-repository/master/data/words.txt'
with urllib.request.urlopen(eng_words_url) as response:
    # A set gives O(1) membership tests instead of scanning a list for every word.
    eng_words = {w.strip().lower() for w in response.read().decode('utf-8').splitlines()}

def remove_punc(text):
    return ''.join(c for c in text if c not in punctuation)

total_count = 0
eng_count = 0
with open('filename.txt', encoding='utf-8') as f:
    for line in f:
        words = remove_punc(line).lower().split()
        total_count += len(words)
        eng_count += sum(1 for word in words if word in eng_words)

print('%s English words found' % eng_count)
print('%s total words found' % total_count)

percentage_eng = 0 if total_count == 0 else (eng_count / total_count * 100)
print('%s%% of words were English' % percentage_eng)

For example, this is your sample text:

This is paragraph in English. 这是在英国段。Это пункт на английском языке. این بند در زبان انگلیسی است.

When I ran the above code on that, the output was this:

5 English words found

16 total words found

31.25% of words were English

As pointed out in the comments, the percentage is incorrect because the Chinese words have no spaces between them. If each of the six characters in 这是在英国段 is counted as a word, there are 22 words in total, so the percentage should really be 5/22 ≈ 22.7%.

If all of your words written in Latin letters are English, you could use regular expressions.
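
Here is a hedged sketch of that regular-expression idea (the Unicode ranges are simplifications chosen for this sample; counting each Chinese character as one word reproduces the 22-word total from the comment above):

# -*- coding: utf-8 -*-
import re

paragraph = u"This is paragraph in English. 这是在英国段。Это пункт на английском языке. این بند در زبان انگلیسی است."

latin = re.findall(u'[A-Za-z]+', paragraph)            # assumed to be the English words
cyrillic = re.findall(u'[\u0400-\u04FF]+', paragraph)  # Russian
arabic = re.findall(u'[\u0600-\u06FF]+', paragraph)    # Persian uses the Arabic script block
cjk = re.findall(u'[\u4E00-\u9FFF]', paragraph)        # one "word" per Chinese character

total = len(latin) + len(cyrillic) + len(arabic) + len(cjk)
print('%.1f%% of words were English' % (100.0 * len(latin) / total))  # 22.7%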

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow