How do I truncate a Java String to fit in a given number of bytes, once UTF-8 encoded?

https://stackoverflow.com/questions/119328

02-07-2019

Question

How do I truncate a Java String so that I know it will fit in a given number of bytes once it is UTF-8 encoded?

Solution

Here is a simple loop that counts how big the UTF-8 representation will be, and truncates when the limit would be exceeded:

public static String truncateWhenUTF8(String s, int maxBytes) {
    int b = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);

        // ranges from http://en.wikipedia.org/wiki/UTF-8
        int skip = 0;
        int more;
        if (c <= 0x007f) {
            more = 1;
        }
        else if (c <= 0x07FF) {
            more = 2;
        } else if (c <= 0xd7ff) {
            more = 3;
        } else if (c <= 0xDFFF) {
            // surrogate area, consume next char as well
            more = 4;
            skip = 1;
        } else {
            more = 3;
        }

        if (b + more > maxBytes) {
            return s.substring(0, i);
        }
        b += more;
        i += skip;
    }
    return s;
}

This does handle surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs a surrogate pair as a single 4-byte sequence instead of two 3-byte sequences, so truncateWhenUTF8() will return the longest truncated string it can. If you ignore surrogate pairs in the implementation, the truncated strings may be shorter than they need to be.
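To see the encoder behavior described above, a minimal check can compare lengths directly. The class name SurrogateBytesDemo and the helper utf8Length are hypothetical names for this sketch; U+1D11E is the same character used in the surrogate-pair tests in this answer.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateBytesDemo {
    /** Returns the number of bytes Java produces when encoding s as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // U+1D11E, one code point stored as two chars
        System.out.println(clef.length());                         // 2 chars
        System.out.println(clef.codePointCount(0, clef.length())); // 1 code point
        System.out.println(utf8Length(clef));                      // 4 bytes for the pair
        System.out.println(utf8Length("\u0800"));                  // 3 bytes for a BMP char
    }
}
```

Because the pair costs 4 bytes rather than 6, a byte budget of 4 is enough to keep it whole.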

I haven't done a lot of testing on this code, but here are some preliminary tests:

private static void test(String s, int maxBytes, int expectedBytes) {
    String result = truncateWhenUTF8(s, maxBytes);
    byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
    if (utf8.length > maxBytes) {
        System.out.println("BAD: our truncation of " + s + " was too big");
    }
    if (utf8.length != expectedBytes) {
        System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
    }
    System.out.println(s + " truncated to " + result);
}

public static void main(String[] args) {
    test("abcd", 0, 0);
    test("abcd", 1, 1);
    test("abcd", 2, 2);
    test("abcd", 3, 3);
    test("abcd", 4, 4);
    test("abcd", 5, 4);

    test("a\u0080b", 0, 0);
    test("a\u0080b", 1, 1);
    test("a\u0080b", 2, 1);
    test("a\u0080b", 3, 3);
    test("a\u0080b", 4, 4);
    test("a\u0080b", 5, 4);

    test("a\u0800b", 0, 0);
    test("a\u0800b", 1, 1);
    test("a\u0800b", 2, 1);
    test("a\u0800b", 3, 1);
    test("a\u0800b", 4, 4);
    test("a\u0800b", 5, 5);
    test("a\u0800b", 6, 5);

    // surrogate pairs
    test("\uD834\uDD1E", 0, 0);
    test("\uD834\uDD1E", 1, 0);
    test("\uD834\uDD1E", 2, 0);
    test("\uD834\uDD1E", 3, 0);
    test("\uD834\uDD1E", 4, 4);
    test("\uD834\uDD1E", 5, 4);

}

Updated: modified the code example; it now handles surrogate pairs.

Other answers

You should use CharsetEncoder. The simple approach of calling getBytes() and copying as many bytes as fit can cut UTF-8 characters in half.

Something like this:

public static int truncateUtf8(String input, byte[] output) {

    ByteBuffer outBuf = ByteBuffer.wrap(output);
    CharBuffer inBuf = CharBuffer.wrap(input.toCharArray());

    Charset utf8 = Charset.forName("UTF-8");
    utf8.newEncoder().encode(inBuf, outBuf, true);
    System.out.println("encoded " + inBuf.position() + " chars of " + input.length() + ", result: " + outBuf.position() + " bytes");
    return outBuf.position();
}
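If a truncated String is wanted rather than a filled byte array, the same encoder idea can be adapted. The following is a sketch, not the answer author's code: the class name EncoderTruncate and the choice to replace unpaired surrogates with '?' (mirroring what String.getBytes does) are assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncoderTruncate {
    /** Returns the longest prefix of input whose UTF-8 encoding fits in maxBytes. */
    static String truncateUtf8(String input, int maxBytes) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                // Assumption: treat unpaired surrogates like String.getBytes does
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(Math.max(maxBytes, 0));
        CharBuffer in = CharBuffer.wrap(input);
        // encode() stops before any character that does not fit completely
        encoder.encode(in, out, true);
        // in.position() is the number of chars that were fully encoded
        return input.substring(0, in.position());
    }

    public static void main(String[] args) {
        System.out.println(truncateUtf8("a\u0800b", 3)); // only "a" fits
        System.out.println(truncateUtf8("abcd", 5));     // whole string fits
    }
}
```

The output buffer acts as the byte budget, so no per-character size table is needed.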

Here's what I came up with. It uses standard Java APIs, so it should be safe and compatible with all the Unicode weirdness, surrogate pairs, and so on. The solution is taken from http://www.jroller.com/holy/entry/truncating_utf_string_to_the with checks added for null and for avoiding decoding when the string is fewer bytes than maxBytes.

/**
 * Truncates a string to the number of characters that fit in X bytes avoiding multi byte characters being cut in
 * half at the cut off point. Also handles surrogate pairs where 2 characters in the string is actually one literal
 * character.
 *
 * Based on: http://www.jroller.com/holy/entry/truncating_utf_string_to_the
 */
public static String truncateToFitUtf8ByteLength(String s, int maxBytes) {
    if (s == null) {
        return null;
    }
    Charset charset = Charset.forName("UTF-8");
    CharsetDecoder decoder = charset.newDecoder();
    byte[] sba = s.getBytes(charset);
    if (sba.length <= maxBytes) {
        return s;
    }
    // Ensure truncation by having byte buffer = maxBytes
    ByteBuffer bb = ByteBuffer.wrap(sba, 0, maxBytes);
    CharBuffer cb = CharBuffer.allocate(maxBytes);
    // Ignore an incomplete character
    decoder.onMalformedInput(CodingErrorAction.IGNORE);
    decoder.decode(bb, cb, true);
    decoder.flush(cb);
    return new String(cb.array(), 0, cb.position());
}

UTF-8 encoding has a neat trait that lets you tell where you are in a sequence of bytes.

Check the stream at the byte limit you want.

If the high bit of that byte is 0, it is a single-byte character; just replace it with 0 and you're fine.
If the high bit is 1 and so is the next bit, you are at the start of a multi-byte character, so just set that byte to 0 and you're fine.
If the high bit is 1 but the next bit is 0, you are in the middle of a character; travel back along the buffer until you hit a byte that has two or more 1s in the high bits, and replace that byte with 0.

Example: if your stream is 31 33 31 C1 A3 32 33 00, you can make your string 1, 2, 3, 5, 6, or 7 bytes long, but not 4, because that would put the 0 after C1, which is the start of a multi-byte character. (In well-formed UTF-8 the lead byte here would be C2, since C1 never occurs; the reasoning is the same.)
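The three rules above can be sketched in Java by backing up over continuation bytes (those of the form 10xxxxxx) instead of writing a 0 terminator. This is a minimal sketch that assumes the bytes are valid UTF-8; the names Utf8Boundary, safeCut, and truncate are hypothetical, and the byte array in main mirrors the example stream with C2 A3 (the encoding of U+00A3) assumed as the two-byte character.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Boundary {
    /** Returns the largest cut <= maxBytes that does not split a character. */
    static int safeCut(byte[] utf8, int maxBytes) {
        if (maxBytes >= utf8.length) {
            return utf8.length;
        }
        int cut = Math.max(maxBytes, 0);
        // utf8[cut] is the first byte that would be dropped. While it is a
        // continuation byte (10xxxxxx), the cut is inside a character, so
        // step back until it lands on a lead byte or a single-byte char.
        while (cut > 0 && (utf8[cut] & 0xC0) == 0x80) {
            cut--;
        }
        return cut;
    }

    /** Truncates s so that its UTF-8 form is at most maxBytes long. */
    static String truncate(String s, int maxBytes) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, 0, safeCut(utf8, maxBytes), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] stream = {0x31, 0x33, 0x31, (byte) 0xC2, (byte) 0xA3, 0x32, 0x33};
        System.out.println(safeCut(stream, 4)); // 3: a cut at 4 would split C2 A3
        System.out.println(safeCut(stream, 5)); // 5: the cut falls between characters
    }
}
```

At most three steps back are ever needed, since a UTF-8 character has at most three continuation bytes.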

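You can use: new String(data.getBytes("UTF-8"), 0, maxLen, "UTF-8"); Note that if maxLen falls in the middle of a multi-byte character, the decoder replaces the incomplete trailing bytes with U+FFFD rather than dropping them, and maxLen must not exceed the encoded length.

You can calculate the number of bytes without doing any conversion.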

foreach character in the Java string
  if 0 <= character <= 0x7f
     count += 1
  else if 0x80 <= character <= 0x7ff
     count += 2
  else if 0x800 <= character <= 0xd7ff // excluding the surrogate area
     count += 3
  else if 0xdc00 <= character <= 0xffff
     count += 3
  else { // surrogate, a bit more complicated
     count += 4
     skip one extra character in the input stream
  }

You will have to detect surrogate pairs (U+D800 to U+DBFF followed by U+DC00 to U+DFFF) and count 4 bytes for each valid surrogate pair. If the first value is in the first range and the second is in the second range, all is well: skip both and add 4. If not, it is an invalid surrogate pair. I am not sure how Java deals with that, but your algorithm will have to count correctly in that (unlikely) case.
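The counting scheme can be written in Java roughly as follows. This is a sketch with hypothetical names (Utf8Count, utf8Length). It assumes the goal is to match String.getBytes(StandardCharsets.UTF_8), which substitutes a single '?' byte for each unpaired surrogate, so the invalid case is counted as 1 byte.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Count {
    /** Counts the bytes String.getBytes(UTF_8) would produce, without encoding. */
    static int utf8Length(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= 0x7F) {
                count += 1;
            } else if (c <= 0x7FF) {
                count += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                count += 4; // valid pair: one 4-byte sequence
                i++;        // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                count += 1; // unpaired surrogate: getBytes substitutes '?'
            } else {
                count += 3;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] samples = {"abcd", "a\u0080b", "a\u0800b", "\uD834\uDD1E", "\uD834x"};
        for (String s : samples) {
            int actual = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.println(utf8Length(s) + " counted, " + actual + " encoded");
        }
    }
}
```

If a different encoder configuration is used (for example, one that reports malformed input instead of replacing it), the unpaired-surrogate branch would need to change accordingly.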

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow