Question

I need help with the following:

How can I detect repeated characters in tokens? For example:

If I have this sentence: كييييف نستطيع التوااصل مع الطلاب؟

I want Java code that detects each word containing repeated characters, removes the repeats, and updates the word.

So, our sentence should be: كيف نستطيع التواصل مع الطلاب؟

Notice that the word "كييييف" contains the repeated character "ي", so it should be updated to "كيف", and "التوااصل" should become "التواصل".

I appreciate your help.


Solution

Lolina, loops are not of much help here. Have you heard of regular expressions? Java supports them, as do many other languages such as Perl and Python. I am familiar with Python, but regex features are nearly the same across languages.

What you need now is to read about regular expressions in Java, especially the quantifiers * and +, which match zero or more and one or more repetitions of the preceding element, respectively.
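To make the difference between the two quantifiers concrete, here is a small sketch in Python (the patterns and the names `star` and `plus` are illustrative, not from the original answer). It checks whether a whole word matches when the ي is optional versus required:

```python
import re

# '*' allows zero or more of the preceding character;
# '+' requires at least one.
star = re.compile('كي*ف')
plus = re.compile('كي+ف')

print(bool(star.fullmatch('كف')))       # the ي may be absent with '*'
print(bool(plus.fullmatch('كف')))       # the ي is required with '+'
print(bool(plus.fullmatch('كييييف')))   # any number of ي satisfies '+'
```

For collapsing elongated letters, + is the natural choice, since the letter has to be present at least once.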

First try compiling simple regular expressions, then extend them step by step until they do what you actually want.

Finally, regular expressions are a bit confusing at the beginning, but they are worth the trouble. Remember that the Stanford Arabic POS tagger uses regular expressions to perform things similar to what you are trying to do.

I am not familiar with Java at all, but in Python I would do it as follows:

>>> import re
>>> p = re.compile('ي+')  # + matches one or more consecutive occurrences of ي
>>> p.sub('ي', 'كييييييييف نتواصل مع الطلاب')
'كيف نتواصل مع الطلاب'

In Arabic, the letters most often repeated when typing are ا, ي, and و, the long vowels. You can compile a regex for ي and collapse its repeats to a single letter, then compile another one for ا and one more for و.
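The three separate patterns can also be folded into one. The following is a minimal sketch, not the original answer's code: it assumes a character class with a backreference is acceptable, and the names `ELONGATED` and `collapse_vowels` are made up for illustration.

```python
import re

# Assumption: only the three long vowels are collapsed. Other letters are
# left alone because some words legitimately contain doubled letters.
# ([ايو]) captures one vowel; \1+ matches one or more further copies of it.
ELONGATED = re.compile(r'([ايو])\1+')

def collapse_vowels(text):
    """Replace each run of a repeated long vowel with a single copy."""
    return ELONGATED.sub(r'\1', text)

print(collapse_vowels('كييييف نستطيع التوااصل مع الطلاب؟'))
```

The same pattern should carry over to Java's `String.replaceAll`, where the replacement is written `$1` rather than `\1`. Note that even a vowel-only rule can over-correct a word that is genuinely spelled with a doubled vowel, so treat it as a heuristic.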

I hope this will help you!

OTHER TIPS

One option (bearing in mind that my knowledge of Arabic is nonexistent) is to split the string on spaces and then check each resulting token for repeated characters, using the charAt method, or indexOf with the Unicode values of the particular characters you want to check for.
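For comparison with the regex approach, the split-and-scan idea might look like the sketch below. It is written in Python, to match the snippet earlier in this thread, rather than Java; the function names and the default set of letters are assumptions for illustration. Indexing into the string plays the role of `charAt`.

```python
def collapse_repeats(word, letters="ايو"):
    """Drop a character when it repeats the previous one and is in `letters`."""
    out = []
    for ch in word:
        if out and ch == out[-1] and ch in letters:
            continue  # skip the repeated letter
        out.append(ch)
    return "".join(out)

def clean_sentence(sentence):
    """Split on spaces, clean each token, and join the tokens back."""
    return " ".join(collapse_repeats(tok) for tok in sentence.split(" "))

print(clean_sentence('كييييف نستطيع التوااصل مع الطلاب؟'))
```

This version needs no regex engine, at the cost of a few more lines than the single `sub` call.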

Licensed under: CC-BY-SA with attribution