Question

I had a similar question asking what language would be best for this task, and Perl was the answer. But I'm still curious how to resolve this with C.

I want to give this program a large text file filled with samples of German text taken from novels, newspapers, webpages. I want a frequency list of all the words in the text file, sorted by most common word. I need a list of the 3000 most common German words.

If this was just an ASCII problem, then this would be child's play for me. After reading about Unicode all morning, I'm really surprised what a minefield it is.

How is this done in C?

I had a friend who put something together in Python, but he's still a beginner and his code took about 30 minutes on a 1.4 MB text file.

Solution

It depends on the encoding. The simplest is UTF-8, in which you can simply store strings in char* arrays. Surprisingly, building a frequency list takes almost the same code as for ASCII text: no byte of a multibyte UTF-8 sequence can ever be mistaken for an ASCII character, so you can split on ASCII separators and compare words byte-for-byte. This is part of the magic of UTF-8, and it is why the encoding is so powerful!

There are a few things you should remember in this case:

  1. Unicode provides more whitespace characters than ASCII. You'll need a list of them to know where words are separated. Happily, Wikipedia has one.

  2. Unicode is not always unequivocal. There are cases where different sequences produce the same character. This usually happens with composed characters: e.g. German Ä may be represented as:

    • character U+00C4 - single letter Ä
    • sequence U+0041 U+0308 - Latin letter A and diaeresis (umlaut) over it.

Happily, German has only seven non-English characters: ÄäÖöÜüß. You'd need to check how their alternative variants look (e.g. here, on pages 4 and 5, you should find all German characters and their alternative forms).

Of course to solve both problems you will also need to know how all your findings are represented in UTF-8. This is described in RFC 3629, page 3.

In the case of other encodings (or other languages), I'd suggest not dealing with it yourself but using an existing library. If you are on Linux (or most other Unices), you can use the iconv function (man 3 iconv) to convert your text to UTF-8, then proceed as described above.
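As an illustration, here is a sketch of converting a Latin-1 (ISO-8859-1) buffer to UTF-8 with POSIX iconv. The wrapper function name is mine; real code would loop on E2BIG to handle output buffers that are too small, and encoding names can vary slightly between platforms.

```c
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Convert a NUL-terminated Latin-1 string to UTF-8 in out.
   Returns the number of bytes written, or (size_t)-1 on error. */
size_t latin1_to_utf8(const char *in, char *out, size_t outsize) {
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1) return (size_t)-1;

    char *inp = (char *)in;                 /* glibc iconv takes char** */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsize - 1;           /* reserve room for the NUL */

    size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (r == (size_t)-1) return (size_t)-1;

    *outp = '\0';
    return (size_t)(outp - out);
}
```

For example, Latin-1 "Tür" (bytes 54 FC 72) becomes the four UTF-8 bytes 54 C3 BC 72.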

Another choice is a library that already deals with the various Unicode variants. The most powerful is probably ICU - International Components for Unicode; check its manuals to see how to perform your task with it.

OTHER TIPS

You haven't specified clearly the requirements of your program, but I can only think of two aspects that might need you to care about character identity:

  1. If the input text is mixed case, you may want to map all words to the same case so that differently-cased versions of the same word are counted together.

  2. If the input is in mixed normalization form (some characters precomposed, others decomposed) then you need to perform normalization to ensure that words that differ only in this way get counted together.

If for example your input were all-lowercase NFC, a program written with just ASCII in mind would work perfectly well for your task. Since this probably isn't the case, you need to evaluate your requirements. For just issue 1 (case), you can probably get by with using wide character stdio functions (or byte-oriented stdio and mbsrtowcs) and towlower to do case mapping. For issue 2 (normalization), you would need to either use an existing Unicode library for C or roll your own.

You can use strings of wchar_t and the functions defined in wchar.h header file.

If you could do it without a problem in ASCII, it shouldn't really be much harder in Unicode (in C99, at least).

Pretty much all of the standard library functions that work on strings and characters have wide-character equivalents, and when you're working with wide characters you never have to worry about the underlying encoding - one wide character represents one code point. There's iswupper, towupper, wcslen, and so on.

That's assuming you're working in a straightforward environment (e.g. UTF-8 system, UTF-8 text), as the locale will handle everything. If not, there's more work.

You might want to use system tools for this; that works if your system locale is set correctly. AWK is one you can use quite easily, for example:

BEGIN {
    FS = "[^[:alpha:]]+"
}
{
    for (i = 1; i <= NF; i++)
        if ($i != "")    # skip the empty field a leading separator creates
            array[$i]++
}
END {
    for (w in array) { printf "%s = %d\n", w, array[w] }
}

invoke (sorting by descending frequency on the count field):

$ awk -f script.awk German.txt | sort -t= -k2 -rn

EDIT:

This is very close to what you are looking for.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow