Question

In my case word length is "2" and I am using this regex:

text = text.replace(/\b[a-zA-ZΆ-ώἀ-ῼ]{2}\b/g, '');

but I cannot make it work with Greek characters. For your convenience, here is a demo:

text = 'English: the on in to of \n Greek: πως θα το πω';
text = text.replace(/\b[0-9a-zA-ZΆ-ώἀ-ῼ]{2}\b/g, '');
console.log(text);

As far as the Greek characters are concerned, I am trying to use a range that covers two blocks: "Greek and Coptic" and "Greek Extended" (as seen on unicode-table.com).


Solution 4

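The problem with Greek characters is caused by \b, which only recognizes boundaries around characters in \w (a-z, A-Z, 0-9 and _), so Greek letters never count as word characters. You can take a look here: Javascript - regex - word boundary (\b) issue, where @Casimir et Hippolyte proposes the following solution: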

Since Javascript doesn't have the lookbehind feature and since word boundaries work only with members of the \w character class, the only way is to use groups (and capturing groups if you want to make a replacement):

//example to remove 2 letter words:
txt = txt.replace(/(^|[^a-zA-ZΆΈ-ώἀ-ῼ\n])([a-zA-ZΆΈ-ώἀ-ῼ]{2})(?![a-zA-ZΆΈ-ώἀ-ῼ])/gm, '$1');

In my own code I also added 0-9 to the first and the third character class (the leading group and the lookahead), because otherwise the regex was stripping letters out of words like "2TB" or "mp3".
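The group-based approach, with 0-9 added to the first group and the lookahead as described, could be sketched roughly as follows. The helper name removeWordsOfLength and the constants LETTERS and BOUNDARY are hypothetical names chosen for this illustration; the character ranges are the ones from the answer above.

```javascript
// Letters that may form a word (ranges taken from the answer above).
const LETTERS = 'a-zA-ZΆΈ-ώἀ-ῼ';
// Characters that count as "part of a word" on either side of a match.
// Digits are included so that "mp3" or "2TB" are left untouched.
const BOUNDARY = LETTERS + '0-9';

// Hypothetical helper: removes standalone words made of exactly n letters.
function removeWordsOfLength(text, n) {
  const re = new RegExp(
    '(^|[^' + BOUNDARY + '\\n])' +   // group 1: line start or a non-word character
    '([' + LETTERS + ']{' + n + '})' + // group 2: the n-letter word itself
    '(?![' + BOUNDARY + '])',          // not followed by another word character
    'gm'
  );
  // '$1' puts back the character consumed by group 1.
  return text.replace(re, '$1');
}

console.log(removeWordsOfLength('English: the on in to of \n Greek: πως θα το πω', 2));
```

The whitespace around the removed words is kept, so the result contains runs of spaces where the words used to be.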

OTHER TIPS

Why use a regex at all? I think your problem can be solved without one.

Check the example below; it should give you a hint on how to start:

var text = 'English: the on in to of \n Greek: πως θα το πω';
// Split on whitespace, then drop every token that is exactly two characters long.
var tokens = text.split(/\s+/);
text = tokens.filter(function (token) { return token.length !== 2; }).join(' ');
alert(text);

JavaScript has problems with Unicode support in regular expressions. To make things work, I'd suggest using the XRegExp library, which has stable Unicode support.

MORE: http://xregexp.com/plugins/#unicode
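For comparison, on runtimes that support ES2018 regular expressions (the u flag, \p{...} property escapes and lookbehind), a similar result can be had without a library. This is only a sketch under that assumption and is not XRegExp's API; the function name removeTwoLetterWords is made up for the example.

```javascript
// Assumes an ES2018-capable engine: \p{L} = any letter, \p{N} = any number.
// Hypothetical helper: removes words of exactly two letters in any script.
function removeTwoLetterWords(text) {
  // (?<!...) : not preceded by a letter or digit
  // \p{L}{2} : exactly two letters
  // (?!...)  : not followed by a letter or digit
  return text.replace(/(?<![\p{L}\p{N}])\p{L}{2}(?![\p{L}\p{N}])/gu, '');
}

console.log(removeTwoLetterWords('English: the on in to of \n Greek: πως θα το πω'));
```

Because \p{L} covers every script, no hand-written Greek ranges are needed here.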

Try this:

text = 'English: the on in to of \n Greek: πως θα το πω';
text = text.replace(/\b[0-9a-zA-ZΆ-ώἀ-ῼ]{2}\b/g, '');
alert(text);
text2 = text.split(' ');
text = text2.filter(function(text2){ return text2.length != 2}).join(' ');
alert(text);

Edit-------------------

Try this,

text = 'English: the on in to of \n Greek: πως θα το πω';
text = text.replace(/\b[\n]\b/g, '\n ').replace(/\b[\t]\b/g, '\t ');
text2 = text.split(' ');
text = text2.filter(function(text2){ return text2.length != 2}).join(' ');
alert(text);

This keeps \t and \n, and it also removes a 2-letter word that sits between two tabs or two line feeds.
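Another way to keep tabs and line feeds intact, sketched here as an assumption rather than as part of the answer above, is to split on a capturing group so that the whitespace separators stay in the token list and can be joined back unchanged. The function name dropTwoCharTokens is hypothetical.

```javascript
// Hypothetical helper: removes tokens of exactly two characters while
// leaving every run of whitespace (spaces, \t, \n) exactly as it was.
function dropTwoCharTokens(text) {
  return text
    .split(/(\s+)/) // the capturing group keeps the separators in the array
    .filter(function (token) {
      // Always keep whitespace runs; drop other tokens of length 2.
      return /^\s+$/.test(token) || token.length !== 2;
    })
    .join('');
}

console.log(dropTwoCharTokens('English: the on in to of \n Greek: πως θα το πω'));
```

Note that this compares whole tokens, so a two-letter word with attached punctuation (for example "to,") has length 3 and is not removed.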

Licensed under: CC-BY-SA with attribution