How to find whether a particular string has Unicode characters (esp. double-byte characters)

https://stackoverflow.com/questions/147824

02-07-2019

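Question

To be more precise, I need to know whether (and if possible, how) I can find out whether a given string contains double-byte characters. Basically, I need to open a pop-up to display a given text, which may contain double-byte characters such as Chinese or Japanese. In that case, the window needs a different size than it would for English or ASCII text. Does anyone have a clue?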

Solution

JavaScript holds text internally as UCS-2, which can encode a fairly extensive subset of Unicode.

But that is not really germane to your question. One solution is to loop through the string and examine the character code at each position:

function isDoubleByte(str) {
    for (var i = 0, n = str.length; i < n; i++) {
        if (str.charCodeAt( i ) > 255) { return true; }
    }
    return false;
}

This may not be as fast as you would like.
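As a quick illustration of where the 255 cutoff falls, here is a small sketch that runs the same check over a few sample strings. The sample inputs are hypothetical and not taken from the question. Note that an accented Latin-1 letter stays below the cutoff, while a typographic apostrophe is flagged even though it is no wider than an ASCII one.

```javascript
// Same check as above, restated so this snippet runs on its own.
function isDoubleByte(str) {
    for (var i = 0, n = str.length; i < n; i++) {
        if (str.charCodeAt(i) > 255) { return true; }
    }
    return false;
}

// Hypothetical sample inputs, written with escapes to avoid encoding surprises.
var samples = [
    "Hello, world",   // plain ASCII
    "caf\u00e9",      // U+00E9 is 233, still within 0-255
    "\u4e2d\u56fd",   // two CJK ideographs
    "It\u2019s"       // typographic apostrophe, U+2019
];

var results = samples.map(function (s) { return isDoubleByte(s); });
console.log(results);
```

So the check is really "contains something outside Latin-1" rather than "contains wide characters".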

Other tips

I used mikesamuel's answer for this one. However, I noticed, perhaps because of how this form handles escaping, that there should be only one backslash before the u (\u, not \\u) for it to work correctly.

function containsNonLatinCodepoints(s) {
    return /[^\u0000-\u00ff]/.test(s);
}

Works for me :)

I have benchmarked the two functions in the top answers and thought I would share the results. Here is the test code I used:

const text1 = `The Chinese Wikipedia was established along with 12 other Wikipedias in May 2001. 中文維基百科的副標題是「海納百川，有容乃大」，這是中国的清朝政治家林则徐（1785年－1850年）於1839年為`;

const regex = /[^\u0000-\u00ff]/; // Small performance gain from pre-compiling the regex
function containsNonLatinCodepoints(s) {
    return regex.test(s);
}

function isDoubleByte(str) {
    for (var i = 0, n = str.length; i < n; i++) {
        if (str.charCodeAt( i ) > 255) { return true; }
    }
    return false;
}

function benchmark(fn, str) {
    let startTime = new Date();
    for (let i = 0; i < 10000000; i++) {
        fn(str);
    }   
    let endTime = new Date();

    return endTime.getTime() - startTime.getTime();
}

console.info('isDoubleByte => ' + benchmark(isDoubleByte, text1));
console.info('containsNonLatinCodepoints => ' + benchmark(containsNonLatinCodepoints, text1));

When running this I got:

isDoubleByte => 2421
containsNonLatinCodepoints => 868

So for this particular string the regex solution is about 2.8 times faster.

However, note that for a string whose first character is above U+00FF, isDoubleByte() returns right away and so is much faster than the regex (which still carries the overhead of the regular expression).

For instance for the string 中国, I got these results:

isDoubleByte => 51
containsNonLatinCodepoints => 288

To get the best of both worlds, it's probably better to combine the two:

var regex = /[^\u0000-\u00ff]/; // Small performance gain from pre-compiling the regex
function containsDoubleByte(str) {
    if (!str.length) return false;
    if (str.charCodeAt(0) > 255) return true;
    return regex.test(str);
}

In that case, if the first character is Chinese (which is likely if the whole text is Chinese), the function will be fast and return right away. If not, it will run the regex, which is still faster than checking each character individually.
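To tie this back to the pop-up in the question, here is a minimal sketch of how the combined check might drive the window size. The helper name popupFeaturesFor and all pixel values are placeholders I made up for illustration, so tune them to your layout.

```javascript
var nonLatin1 = /[^\u0000-\u00ff]/;

// Combined check from above, restated so this snippet runs on its own.
function containsDoubleByte(str) {
    if (!str.length) return false;
    if (str.charCodeAt(0) > 255) return true;
    return nonLatin1.test(str);
}

// Hypothetical helper: builds a window.open() features string.
// The sizes are arbitrary placeholders, not measured values.
function popupFeaturesFor(text) {
    var wide = containsDoubleByte(text);
    var width = wide ? 600 : 400;
    var height = wide ? 400 : 300;
    return "width=" + width + ",height=" + height;
}

console.log(popupFeaturesFor("Hello"));
console.log(popupFeaturesFor("\u4e2d\u56fd"));

// In a browser, the result would be passed as the third argument:
// window.open(url, "popup", popupFeaturesFor(text));
```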

Actually, all of the characters are Unicode, at least from the JavaScript engine's perspective.

Unfortunately, the mere presence of characters in a particular Unicode range is not enough to determine that you need more space. Many characters with code points well above the ASCII range take up roughly the same amount of space as ASCII characters. Typographic quotes, characters with diacritics, certain punctuation symbols, and various currency symbols are outside of the low ASCII range and are allocated in quite disparate places on the Unicode Basic Multilingual Plane.

Generally, projects that I've worked on either provide extra space for all languages, or sometimes use JavaScript to determine whether a window with auto-scrollbar CSS attributes actually has content tall enough to trigger a scrollbar.
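The scrollbar check could look roughly like the sketch below. It assumes the container has a constrained height and overflow: auto, and relies on the standard scrollHeight and clientHeight element properties. The function name contentOverflows is my own, not something from those projects.

```javascript
// Returns true when the element's content is taller than its visible box,
// which is the situation in which overflow: auto would show a scrollbar.
function contentOverflows(el) {
    return el.scrollHeight > el.clientHeight;
}

// Plain objects stand in for DOM elements here so the snippet runs anywhere.
var fits = contentOverflows({ scrollHeight: 300, clientHeight: 300 });
var overflows = contentOverflows({ scrollHeight: 500, clientHeight: 300 });
console.log(fits, overflows);

// In a browser, something like:
// if (contentOverflows(document.getElementById("test"))) { /* enlarge the window */ }
```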

If detecting the presence of, or a count of, CJK characters is adequate to determine that you need a bit of extra space, you could construct a regex using the following ranges: [\u3300-\u9fff\uf900-\ufaff], and use it to count the characters that match. (This is rather coarse: it misses all the non-BMP cases, probably excludes some other relevant ranges, and most likely includes some irrelevant characters, but it's a starting point.)
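A minimal sketch of that counting approach, using exactly the ranges given above. The function name countCjk is an assumption of mine. As the caveat says, the ranges are incomplete: Hiragana, for example, sits at U+3040 to U+309F and is not counted.

```javascript
// Ranges copied from the paragraph above; the g flag makes match() return every hit.
var cjkPattern = /[\u3300-\u9fff\uf900-\ufaff]/g;

// Counts the UTF-16 code units in str that fall inside the ranges.
function countCjk(str) {
    var matches = str.match(cjkPattern);
    return matches ? matches.length : 0; // match() returns null when nothing matches
}

console.log(countCjk("abc"));                 // no CJK characters
console.log(countCjk("a\u4e2d\u56fdb"));      // two ideographs between ASCII letters
console.log(countCjk("\u3042"));              // Hiragana "a", outside the ranges
```

You could then scale the window by the ratio of countCjk(text) to text.length, or simply treat any non-zero count as "needs more room".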

Again, you're only going to be able to manage a rough heuristic without something along the lines of a full text rendering engine, because what you really want is something like GDI's MeasureString (or any other text rendering engine's equivalent). It's been a while since I've done so, but I think the closest HTML/DOM equivalent is setting a width on a div and requesting the height (cut and paste reuse, so apologies if this contains errors):

var o = document.getElementById("test");

// Computed height as a CSS string, e.g. a pixel value
var height = document.defaultView.getComputedStyle(o, "").getPropertyValue("height");

Here is a benchmark test: http://jsben.ch/NKjKd

This is much faster:

function containsNonLatinCodepoints(s) {
    return /[^\u0000-\u00ff]/.test(s);
}

than this:

function isDoubleByte(str) {
    for (var i = 0, n = str.length; i < n; i++) {
        if (str.charCodeAt( i ) > 255) { return true; }
    }
    return false;
}

Why not let the window resize itself based on the runtime height/width?

Run something like this in your pop-up:

window.resizeTo(document.body.clientWidth, document.body.clientHeight);

License: CC-BY-SA with attribution

Not affiliated with Stack Overflow