How do I truncate a Java String so that it fits in a given number of bytes once UTF-8 encoded?

https://stackoverflow.com/questions/119328

02-07-2019

Question

How do I truncate a Java String so that I know it will fit in a given number of bytes of storage once it is UTF-8 encoded?

Solution

Here is a simple loop that counts how big the UTF-8 representation is going to be, and truncates when the limit is exceeded:

public static String truncateWhenUTF8(String s, int maxBytes) {
    int b = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);

        // ranges from http://en.wikipedia.org/wiki/UTF-8
        int skip = 0;
        int more;
        if (c <= 0x007f) {
            more = 1;
        }
        else if (c <= 0x07FF) {
            more = 2;
        } else if (c <= 0xd7ff) {
            more = 3;
        } else if (c <= 0xDFFF) {
            // surrogate area, consume next char as well
            more = 4;
            skip = 1;
        } else {
            more = 3;
        }

        if (b + more > maxBytes) {
            return s.substring(0, i);
        }
        b += more;
        i += skip;
    }
    return s;
}

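This does handle surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs a surrogate pair as a single 4-byte sequence instead of two 3-byte sequences, so truncateWhenUTF8() returns the longest truncated string it can. If the implementation ignored surrogate pairs, the truncated strings could be shorter than they need to be.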
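The surrogate-pair claim above is easy to check directly. The following is a minimal sketch; the class name SurrogatePairBytes and the helper utf8Length are made up for illustration and are not part of the original answer.

```java
import java.nio.charset.StandardCharsets;

class SurrogatePairBytes {
    /** Hypothetical helper: number of bytes Java produces when encoding s as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) is stored as two chars in a Java String.
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());    // prints 2 (UTF-16 code units)
        System.out.println(utf8Length(clef)); // prints 4 (one 4-byte UTF-8 sequence)
    }
}
```

Because the pair costs 4 bytes rather than 6, a truncation routine that budgets 3 bytes per surrogate char would stop earlier than necessary.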

I haven't done a lot of testing on that code, but here are some preliminary tests:

private static void test(String s, int maxBytes, int expectedBytes) {
    String result = truncateWhenUTF8(s, maxBytes);
    byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
    if (utf8.length > maxBytes) {
        System.out.println("BAD: our truncation of " + s + " was too big");
    }
    if (utf8.length != expectedBytes) {
        System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
    }
    System.out.println(s + " truncated to " + result);
}

public static void main(String[] args) {
    test("abcd", 0, 0);
    test("abcd", 1, 1);
    test("abcd", 2, 2);
    test("abcd", 3, 3);
    test("abcd", 4, 4);
    test("abcd", 5, 4);

    test("a\u0080b", 0, 0);
    test("a\u0080b", 1, 1);
    test("a\u0080b", 2, 1);
    test("a\u0080b", 3, 3);
    test("a\u0080b", 4, 4);
    test("a\u0080b", 5, 4);

    test("a\u0800b", 0, 0);
    test("a\u0800b", 1, 1);
    test("a\u0800b", 2, 1);
    test("a\u0800b", 3, 1);
    test("a\u0800b", 4, 4);
    test("a\u0800b", 5, 5);
    test("a\u0800b", 6, 5);

    // surrogate pairs
    test("\uD834\uDD1E", 0, 0);
    test("\uD834\uDD1E", 1, 0);
    test("\uD834\uDD1E", 2, 0);
    test("\uD834\uDD1E", 3, 0);
    test("\uD834\uDD1E", 4, 4);
    test("\uD834\uDD1E", 5, 4);

}

Updated: the code example has been modified and now handles surrogate pairs.

Other tips

You should use CharsetEncoder. The simple approach of calling getBytes() and copying as many bytes as will fit can cut a UTF-8 character in half.

Something like this:

public static int truncateUtf8(String input, byte[] output) {

    ByteBuffer outBuf = ByteBuffer.wrap(output);
    CharBuffer inBuf = CharBuffer.wrap(input.toCharArray());

    Charset utf8 = Charset.forName("UTF-8");
    utf8.newEncoder().encode(inBuf, outBuf, true);
    System.out.println("encoded " + inBuf.position() + " chars of " + input.length() + ", result: " + outBuf.position() + " bytes");
    return outBuf.position();
}
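The method above reports only how many bytes were written. If the truncated String itself is wanted, the input buffer's position after encoding marks where the encoder stopped. The following is a sketch under that assumption; the class EncoderTruncate and the method truncateToUtf8Bytes are hypothetical names, and maxBytes is assumed to be non-negative.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class EncoderTruncate {
    /** Hypothetical helper: longest prefix of input whose UTF-8 form fits in maxBytes. */
    static String truncateToUtf8Bytes(String input, int maxBytes) {
        ByteBuffer out = ByteBuffer.allocate(maxBytes); // assumes maxBytes >= 0
        CharBuffer in = CharBuffer.wrap(input);
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        // encode() returns OVERFLOW instead of throwing when the next character
        // no longer fits; in.position() then indexes the first char not encoded.
        // It also stops at an unpaired surrogate, so the result ends there.
        encoder.encode(in, out, true);
        return input.substring(0, in.position());
    }

    public static void main(String[] args) {
        System.out.println(truncateToUtf8Bytes("a\u0800b", 3).length()); // prints 1
        System.out.println(truncateToUtf8Bytes("a\u0800b", 4).length()); // prints 2
    }
}
```

The encoder never writes a partial character, so no extra boundary check is needed.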

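Here is what I came up with. It uses the standard Java APIs, so it should be safe and compatible with all Unicode oddities and surrogate pairs. The solution is taken from http://www.jroller.com/holy/entry/truncating_utf_string_to_the, with an added null check and a shortcut that avoids decoding when the string already fits in maxBytes.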

/**
 * Truncates a string to the number of characters that fit in X bytes avoiding multi byte characters being cut in
 * half at the cut off point. Also handles surrogate pairs where 2 characters in the string is actually one literal
 * character.
 *
 * Based on: http://www.jroller.com/holy/entry/truncating_utf_string_to_the
 */
public static String truncateToFitUtf8ByteLength(String s, int maxBytes) {
    if (s == null) {
        return null;
    }
    Charset charset = Charset.forName("UTF-8");
    CharsetDecoder decoder = charset.newDecoder();
    byte[] sba = s.getBytes(charset);
    if (sba.length <= maxBytes) {
        return s;
    }
    // Ensure truncation by having byte buffer = maxBytes
    ByteBuffer bb = ByteBuffer.wrap(sba, 0, maxBytes);
    CharBuffer cb = CharBuffer.allocate(maxBytes);
    // Ignore an incomplete character
    decoder.onMalformedInput(CodingErrorAction.IGNORE);
    decoder.decode(bb, cb, true);
    decoder.flush(cb);
    return new String(cb.array(), 0, cb.position());
}

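UTF-8 encoding has a neat property that lets you tell where in a character any given byte sits.

Check the byte in the stream at the position where you want to cut.

If its high bit is 0, it is a single-byte char; just replace it with 0 and you are fine.
If its high bit is 1 and so is the next bit, you are at the start of a multi-byte char, so just set that byte to 0 and you are good.
If its high bit is 1 but the next bit is 0, you are in the middle of a character; travel back along the buffer until you reach a byte with two or more 1s in the high bits, and replace that byte with 0.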

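Example: if your stream is 31 31 33 C1 A3 32 33 00, you can make your string 1, 2, 3, 5, 6, or 7 bytes long, but not 4, because that would put the 0 after C1, which is the start of a multi-byte char. (Strictly, C1 never occurs in well-formed UTF-8, where two-byte lead bytes run from C2 to DF; it stands in for a lead byte here.)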
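The byte-level rule above can be sketched in Java. Java strings are not zero-terminated, so instead of writing a 0 this version returns the index at which to cut. The names Utf8ByteCut, safeCut and truncate are invented for this sketch, and the input is assumed to be well-formed UTF-8.

```java
import java.nio.charset.StandardCharsets;

class Utf8ByteCut {
    /** Largest cut point <= limit that does not fall inside a multi-byte character. */
    static int safeCut(byte[] utf8, int limit) {
        if (limit >= utf8.length) {
            return utf8.length; // everything fits
        }
        int cut = Math.max(limit, 0);
        // Continuation bytes have the bit pattern 10xxxxxx. If utf8[cut] is one,
        // cutting here would split a character, so step back to its lead byte.
        while (cut > 0 && (utf8[cut] & 0xC0) == 0x80) {
            cut--;
        }
        return cut;
    }

    /** Hypothetical convenience wrapper: encode, find the cut, decode the prefix. */
    static String truncate(String s, int maxBytes) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, 0, safeCut(utf8, maxBytes), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] stream = {0x31, 0x31, 0x33, (byte) 0xC1, (byte) 0xA3, 0x32, 0x33};
        System.out.println(safeCut(stream, 4)); // prints 3: a cut at 4 would split C1 A3
    }
}
```

At most three steps back are ever needed, since a UTF-8 character is at most four bytes long.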

You can use new String(data.getBytes("UTF-8"), 0, maxLen, "UTF-8").
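One thing to watch with this one-liner: when the cut lands inside a multi-byte character, the decoder substitutes a replacement character rather than dropping the fragment, so the result can re-encode to more than maxLen bytes. The sketch below illustrates this with the Charset overload of the constructor, whose replacement behaviour is documented; the class name, the method naiveTruncate, and the Math.min guard are additions for illustration.

```java
import java.nio.charset.StandardCharsets;

class NaiveByteSlice {
    /** The one-liner above, with a guard for strings that are already short enough. */
    static String naiveTruncate(String data, int maxLen) {
        byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
        int len = Math.min(maxLen, bytes.length); // avoids an out-of-bounds slice
        return new String(bytes, 0, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "a\u0800b" encodes as 61 E0 A0 80 62; two bytes cut inside U+0800.
        String cut = naiveTruncate("a\u0800b", 2);
        System.out.println(cut.endsWith("\uFFFD")); // prints true
        // The replacement character itself takes 3 bytes when re-encoded.
        System.out.println(cut.getBytes(StandardCharsets.UTF_8).length > 2); // prints true
    }
}
```

If that trailing U+FFFD is unacceptable, a boundary check before slicing is needed.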

You can calculate the number of bytes without doing any conversion:

foreach character in the Java string
  if 0 <= character <= 0x7f
     count += 1
  else if 0x80 <= character <= 0x7ff
     count += 2
  else if 0x800 <= character <= 0xd7ff // excluding the surrogate area
     count += 3
  else if 0xdc00 <= character <= 0xffff
     count += 3
  else { // surrogate, a bit more complicated
     count += 4
     skip one extra character in the input stream
  }

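You need to detect surrogate pairs (U+D800 to U+DBFF followed by U+DC00 to U+DFFF) and count 4 bytes for each valid pair. If the first value falls in the first range and the second value in the second range, all is well. Otherwise the pair is invalid; I'm not sure how Java deals with that, but your algorithm will have to count correctly in that (unlikely) case too.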
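A Java version of this count can lean on String.codePointAt, which already combines valid surrogate pairs and returns an unpaired surrogate as is. The sketch below assumes that String.getBytes with UTF-8 replaces each unpaired surrogate with a single '?' byte, which matches its documented replace-on-malformed behaviour; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {
    /** Counts the bytes getBytes(UTF_8) would produce, without encoding. */
    static int utf8Length(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i); // joins a valid surrogate pair into one code point
            if (cp <= 0x7F) {
                count += 1;
            } else if (cp <= 0x7FF) {
                count += 2;
            } else if (cp >= 0xD800 && cp <= 0xDFFF) {
                count += 1; // unpaired surrogate; assumed to be encoded as '?'
            } else if (cp <= 0xFFFF) {
                count += 3;
            } else {
                count += 4; // supplementary code point, i.e. a valid pair
            }
            i += Character.charCount(cp); // advance 1 or 2 chars
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "a\u0080\u0800\uD834\uDD1E";
        System.out.println(utf8Length(s));                             // prints 10
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // prints 10
    }
}
```

Comparing the count against getBytes on a handful of inputs, including strings with a lone surrogate, is a cheap way to confirm the assumption on a given JDK.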

License: CC-BY-SA with attribution

Not affiliated with StackOverflow