How do I truncate a Java String to fit in a given number of bytes once it is UTF-8 encoded?

https://stackoverflow.com/questions/119328

02-07-2019
Question

How can I truncate a Java String so that I know it will fit in a given number of bytes once it is UTF-8 encoded?

Solution

Here is a simple loop that counts how big the UTF-8 representation will be and truncates as soon as the limit would be exceeded:

public static String truncateWhenUTF8(String s, int maxBytes) {
    int b = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);

        // ranges from http://en.wikipedia.org/wiki/UTF-8
        int skip = 0;
        int more;
        if (c <= 0x007f) {
            more = 1;
        } else if (c <= 0x07FF) {
            more = 2;
        } else if (c <= 0xd7ff) {
            more = 3;
        } else if (c <= 0xDFFF) {
            // surrogate area, consume next char as well
            more = 4;
            skip = 1;
        } else {
            more = 3;
        }

        if (b + more > maxBytes) {
            return s.substring(0, i);
        }
        b += more;
        i += skip;
    }
    return s;
}

This handles surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs a surrogate pair as a single 4-byte sequence instead of two 3-byte sequences, so truncateWhenUTF8() returns the longest truncated string it can. If you ignore surrogate pairs in the implementation, the truncated strings may be shorter than they need to be.
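To make the 4-byte claim concrete, here is a minimal sketch that measures a surrogate pair with the standard encoder. The class name SurrogatePairDemo and the helper utf8Length are made up for illustration; they are not part of the answer above.

```java
import java.nio.charset.StandardCharsets;

class SurrogatePairDemo {
    // Hypothetical helper: size of the UTF-8 encoding of s, in bytes.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef): one code point, stored as two Java chars.
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());    // 2 chars in the String
        System.out.println(utf8Length(clef)); // 4 bytes once encoded
    }
}
```

A loop that charged 3 bytes per surrogate char would budget 6 bytes for this character and could cut it off even though it fits in 4.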

I haven't done much testing on that code, but here are some preliminary tests:

private static void test(String s, int maxBytes, int expectedBytes) {
    String result = truncateWhenUTF8(s, maxBytes);
    byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
    if (utf8.length > maxBytes) {
        System.out.println("BAD: our truncation of " + s + " was too big");
    }
    if (utf8.length != expectedBytes) {
        System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
    }
    System.out.println(s + " truncated to " + result);
}

public static void main(String[] args) {
    test("abcd", 0, 0);
    test("abcd", 1, 1);
    test("abcd", 2, 2);
    test("abcd", 3, 3);
    test("abcd", 4, 4);
    test("abcd", 5, 4);

    test("a\u0080b", 0, 0);
    test("a\u0080b", 1, 1);
    test("a\u0080b", 2, 1);
    test("a\u0080b", 3, 3);
    test("a\u0080b", 4, 4);
    test("a\u0080b", 5, 4);

    test("a\u0800b", 0, 0);
    test("a\u0800b", 1, 1);
    test("a\u0800b", 2, 1);
    test("a\u0800b", 3, 1);
    test("a\u0800b", 4, 4);
    test("a\u0800b", 5, 5);
    test("a\u0800b", 6, 5);

    // surrogate pairs
    test("\uD834\uDD1E", 0, 0);
    test("\uD834\uDD1E", 1, 0);
    test("\uD834\uDD1E", 2, 0);
    test("\uD834\uDD1E", 3, 0);
    test("\uD834\uDD1E", 4, 4);
    test("\uD834\uDD1E", 5, 4);

}

Updated: modified the code example, it now handles surrogate pairs.

Other suggestions

You should use CharsetEncoder; the simple approach of getBytes() plus copying as many bytes as fit can cut UTF-8 characters in half.

Something like this:

public static int truncateUtf8(String input, byte[] output) {

    ByteBuffer outBuf = ByteBuffer.wrap(output);
    CharBuffer inBuf = CharBuffer.wrap(input.toCharArray());

    Charset utf8 = Charset.forName("UTF-8");
    utf8.newEncoder().encode(inBuf, outBuf, true);
    System.out.println("encoded " + inBuf.position() + " chars of " + input.length() + ", result: " + outBuf.position() + " bytes");
    return outBuf.position();
}
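The same encoder idea can hand back the truncated String instead of a byte count. The following is only a sketch under a few assumptions: the class EncoderTruncate and its truncate method are hypothetical names, unpaired surrogates are replaced (as String.getBytes does) rather than reported, and a negative maxBytes is left to fail in ByteBuffer.allocate.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncoderTruncate {
    // Hypothetical helper: longest prefix of s whose UTF-8 form fits in maxBytes.
    static String truncate(String s, int maxBytes) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                // Assumption: replace unpaired surrogates instead of stopping at them.
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(maxBytes); // throws if maxBytes < 0
        CharBuffer in = CharBuffer.wrap(s);
        // The encoder only writes whole characters; when the output buffer is
        // full it stops, leaving in.position() at the first char it did not encode.
        encoder.encode(in, out, true);
        return s.substring(0, in.position());
    }

    public static void main(String[] args) {
        System.out.println(truncate("a\u0800b", 3)); // only "a" fits; U+0800 needs 3 more bytes
    }
}
```

The encoded bytes in the output buffer are discarded here; only the input position is used to decide where to cut.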

Here is what I came up with. It uses standard Java APIs, so it should be safe and compatible with all the Unicode weirdness, surrogate pairs and so on. The solution is taken from http://www.jroller.com/holy/entry/truncating_utf_string_to_the with added checks for null and for skipping the decoding step when the string already has no more bytes than maxBytes.

/**
 * Truncates a string to the number of characters that fit in X bytes avoiding multi byte characters being cut in
 * half at the cut off point. Also handles surrogate pairs where 2 characters in the string is actually one literal
 * character.
 *
 * Based on: http://www.jroller.com/holy/entry/truncating_utf_string_to_the
 */
public static String truncateToFitUtf8ByteLength(String s, int maxBytes) {
    if (s == null) {
        return null;
    }
    Charset charset = Charset.forName("UTF-8");
    CharsetDecoder decoder = charset.newDecoder();
    byte[] sba = s.getBytes(charset);
    if (sba.length <= maxBytes) {
        return s;
    }
    // Ensure truncation by having byte buffer = maxBytes
    ByteBuffer bb = ByteBuffer.wrap(sba, 0, maxBytes);
    CharBuffer cb = CharBuffer.allocate(maxBytes);
    // Ignore an incomplete character
    decoder.onMalformedInput(CodingErrorAction.IGNORE);
    decoder.decode(bb, cb, true);
    decoder.flush(cb);
    return new String(cb.array(), 0, cb.position());
}

UTF-8 encoding has a neat trait that lets you tell where you are within a sequence of bytes.

Check the byte in the stream at the length limit you want.

If its high bit is 0, it is a single-byte character: replace it with 0 and you are done.
If its high bit is 1 and so is the next bit, you are at the start of a multi-byte character: set that byte to 0 and you are done.
If its high bit is 1 but the next bit is 0, you are in the middle of a character: walk back along the buffer until you hit a byte that has two or more 1s in its high bits, and replace that byte with 0.

Example: if your stream is 31 33 31 C1 A3 32 33 00, you can make the string 1, 2, 3, 5, 6 or 7 bytes long, but not 4, because that would put the 0 right after C1, which is the start of a multi-byte character.
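In Java there is no terminating 0 to write, but the same bit test can pick the cut position. This is a rough sketch that assumes the input is valid UTF-8; the class Utf8Boundary and its methods are hypothetical names. The sample string uses C3 A3 (the character U+00E3) in place of the C1 A3 pair from the example above.

```java
import java.nio.charset.StandardCharsets;

class Utf8Boundary {
    // Hypothetical helper: largest length <= limit that does not split a character.
    // utf8[limit] is the first byte that would be dropped.
    static int boundary(byte[] utf8, int limit) {
        if (limit >= utf8.length) {
            return utf8.length; // everything fits
        }
        int cut = Math.max(limit, 0);
        // Continuation bytes look like 10xxxxxx. If the first dropped byte is one,
        // the cut is inside a character, so back up to that character's lead byte.
        while (cut > 0 && (utf8[cut] & 0xC0) == 0x80) {
            cut--;
        }
        return cut;
    }

    static String truncate(String s, int maxBytes) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, 0, boundary(utf8, maxBytes), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = "131\u00e323".getBytes(StandardCharsets.UTF_8); // 31 33 31 C3 A3 32 33
        System.out.println(boundary(data, 4)); // a cut at 4 would split C3 A3, so this backs up to 3
    }
}
```

Because a UTF-8 character has at most three continuation bytes, the loop never steps back more than three positions on valid input.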

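You can use new String(data.getBytes("UTF-8"), 0, maxLen, "UTF-8"); note that if maxLen falls inside a multi-byte sequence, the result ends in a replacement character (U+FFFD) rather than a clean cut.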
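A small sketch of what that one-liner does when the limit lands in the middle of a character. NaiveCutDemo and naiveCut are hypothetical names, and the Math.min guard is an addition so that a limit larger than the encoded length does not throw.

```java
import java.nio.charset.StandardCharsets;

class NaiveCutDemo {
    // Hypothetical helper: encode, keep the first maxLen bytes, decode again.
    static String naiveCut(String s, int maxLen) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        int len = Math.min(maxLen, utf8.length); // added guard, not in the one-liner
        return new String(utf8, 0, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "a\u0800b" encodes as 61 E0 A0 80 62; keeping 2 bytes leaves a lone E0.
        String cut = naiveCut("a\u0800b", 2);
        System.out.println(cut.endsWith("\uFFFD")); // the dangling E0 decodes to U+FFFD
        System.out.println(cut.getBytes(StandardCharsets.UTF_8).length); // re-encoded size
    }
}
```

Since U+FFFD itself takes 3 bytes in UTF-8, the decoded result can be larger than the limit once it is encoded again, so this approach needs an extra step to drop the trailing replacement character.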

You can calculate the number of bytes without doing any conversion.

foreach character in the Java string
  if 0 <= character <= 0x7f
     count += 1
  else if 0x80 <= character <= 0x7ff
     count += 2
  else if 0x800 <= character <= 0xd7ff // excluding the surrogate area
     count += 3
  else if 0xdc00 <= character <= 0xffff
     count += 3
  else { // surrogate, a bit more complicated
     count += 4
     skip one extra character in the input stream
  }

You would have to detect surrogate pairs (U+D800 to U+DBFF followed by U+DC00 to U+DFFF) and count 4 bytes for each valid pair. If the first value is in the first range and the second is in the second range, all is well: skip both and add 4. Otherwise it is an invalid surrogate pair. I am not sure how Java deals with that, but your algorithm will have to count correctly in that (unlikely) case.
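The counting rules above could be sketched in Java roughly as follows. Utf8Counter and utf8Length are hypothetical names. The handling of an unpaired surrogate is an assumption: it is counted as 1 byte, on the basis that String.getBytes substitutes a single '?' for input it cannot encode.

```java
class Utf8Counter {
    // Hypothetical helper: number of bytes s would occupy in UTF-8, without encoding it.
    static int utf8Length(CharSequence s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= 0x7F) {
                count += 1;
            } else if (c <= 0x7FF) {
                count += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                count += 4; // valid pair: one 4-byte sequence
                i++;        // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                count += 1; // assumption: unpaired surrogate becomes a single '?'
            } else {
                count += 3;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a\u0800b"));     // 1 + 3 + 1 = 5
        System.out.println(utf8Length("\uD834\uDD1E")); // one valid pair = 4
    }
}
```

If the bytes are produced by an encoder configured differently (for example one that reports malformed input instead of replacing it), the unpaired-surrogate branch would need to change accordingly.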

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow