Byte order mark screws up file reading in Java

https://stackoverflow.com/questions/1835430

11-09-2019

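Question

I'm trying to read CSV files using Java. Some of the files may have a byte order mark (BOM) at the beginning, but not all. When present, the BOM is read along with the rest of the first line, which causes problems with string comparisons.

Is there an easy way to skip the byte order mark when it is present?

Thanks!

Solution

EDIT: I've made a proper release on GitHub: https://github.com/gpakosz/UnicodeBOMInputStream

Here is a class I coded a while ago; I only edited the package name before pasting. Nothing special, it is quite similar to the solutions posted in SUN's bug database. Incorporate it in your code and you're fine.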

    /* ____________________________________________________________________________
     *
     * File:    UnicodeBOMInputStream.java
     * Author:  Gregory Pakosz.
     * Date:    02 - November - 2005
     * ____________________________________________________________________________
     */
    package com.stackoverflow.answer;

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.PushbackInputStream;

    /**
     * The <code>UnicodeBOMInputStream</code> class wraps any
     * <code>InputStream</code> and detects the presence of any Unicode BOM
     * (Byte Order Mark) at its beginning, as defined by
     * <a href="http://www.faqs.org/rfcs/rfc3629.html">RFC 3629 - UTF-8, a transformation format of ISO 10646</a>
     *
     * The
     * <a href="http://www.unicode.org/unicode/faq/utf_bom.html">Unicode FAQ</a>
     * defines 5 types of BOMs:<ul>
     * <li><pre>00 00 FE FF  = UTF-32, big-endian</pre></li>
     * <li><pre>FF FE 00 00  = UTF-32, little-endian</pre></li>
     * <li><pre>FE FF        = UTF-16, big-endian</pre></li>
     * <li><pre>FF FE        = UTF-16, little-endian</pre></li>
     * <li><pre>EF BB BF     = UTF-8</pre></li>
     * </ul>
     *
     * Use the {@link #getBOM()} method to know whether a BOM has been detected
     * or not.
     *
     * Use the {@link #skipBOM()} method to remove the detected BOM from the
     * wrapped <code>InputStream</code> object.
     */
    public class UnicodeBOMInputStream extends InputStream {

      /**
       * Type safe enumeration class that describes the different types of Unicode
       * BOMs.
       */
      public static final class BOM {

        /** NONE. */
        public static final BOM NONE = new BOM(new byte[]{}, "NONE");

        /** UTF-8 BOM (EF BB BF). */
        public static final BOM UTF_8 = new BOM(
            new byte[]{(byte)0xEF, (byte)0xBB, (byte)0xBF}, "UTF-8");

        /** UTF-16, little-endian (FF FE). */
        public static final BOM UTF_16_LE = new BOM(
            new byte[]{(byte)0xFF, (byte)0xFE}, "UTF-16 little-endian");

        /** UTF-16, big-endian (FE FF). */
        public static final BOM UTF_16_BE = new BOM(
            new byte[]{(byte)0xFE, (byte)0xFF}, "UTF-16 big-endian");

        /** UTF-32, little-endian (FF FE 00 00). */
        public static final BOM UTF_32_LE = new BOM(
            new byte[]{(byte)0xFF, (byte)0xFE, (byte)0x00, (byte)0x00},
            "UTF-32 little-endian");

        /** UTF-32, big-endian (00 00 FE FF). */
        public static final BOM UTF_32_BE = new BOM(
            new byte[]{(byte)0x00, (byte)0x00, (byte)0xFE, (byte)0xFF},
            "UTF-32 big-endian");

        /**
         * Returns a <code>String</code> representation of this <code>BOM</code>
         * value.
         */
        public final String toString() {
          return description;
        }

        /**
         * Returns the bytes corresponding to this <code>BOM</code> value.
         */
        public final byte[] getBytes() {
          final int length = bytes.length;
          final byte[] result = new byte[length];

          // Make a defensive copy
          System.arraycopy(bytes, 0, result, 0, length);

          return result;
        }

        private BOM(final byte bom[], final String description) {
          assert(bom != null) : "invalid BOM: null is not allowed";
          assert(description != null) : "invalid description: null is not allowed";
          assert(description.length() != 0) : "invalid description: empty string is not allowed";

          this.bytes = bom;
          this.description = description;
        }

        final byte bytes[];
        private final String description;

      } // BOM

      /**
       * Constructs a new <code>UnicodeBOMInputStream</code> that wraps the
       * specified <code>InputStream</code>.
       *
       * @param inputStream an <code>InputStream</code>.
       *
       * @throws NullPointerException when <code>inputStream</code> is
       * <code>null</code>.
       * @throws IOException on reading from the specified <code>InputStream</code>
       * when trying to detect the Unicode BOM.
       */
      public UnicodeBOMInputStream(final InputStream inputStream)
          throws NullPointerException, IOException {
        if (inputStream == null)
          throw new NullPointerException("invalid input stream: null is not allowed");

        in = new PushbackInputStream(inputStream, 4);

        final byte bom[] = new byte[4];
        final int read = in.read(bom);

        // The cases deliberately fall through: when no 4-byte BOM matches,
        // the 3-byte and then the 2-byte signatures are tried.
        switch (read) {
          case 4:
            if ((bom[0] == (byte)0xFF) &&
                (bom[1] == (byte)0xFE) &&
                (bom[2] == (byte)0x00) &&
                (bom[3] == (byte)0x00)) {
              this.bom = BOM.UTF_32_LE;
              break;
            } else if ((bom[0] == (byte)0x00) &&
                       (bom[1] == (byte)0x00) &&
                       (bom[2] == (byte)0xFE) &&
                       (bom[3] == (byte)0xFF)) {
              this.bom = BOM.UTF_32_BE;
              break;
            }

          case 3:
            if ((bom[0] == (byte)0xEF) &&
                (bom[1] == (byte)0xBB) &&
                (bom[2] == (byte)0xBF)) {
              this.bom = BOM.UTF_8;
              break;
            }

          case 2:
            if ((bom[0] == (byte)0xFF) &&
                (bom[1] == (byte)0xFE)) {
              this.bom = BOM.UTF_16_LE;
              break;
            } else if ((bom[0] == (byte)0xFE) &&
                       (bom[1] == (byte)0xFF)) {
              this.bom = BOM.UTF_16_BE;
              break;
            }

          default:
            this.bom = BOM.NONE;
            break;
        }

        // Push the bytes back so the stream still starts at the BOM (if any).
        if (read > 0)
          in.unread(bom, 0, read);
      }

      /**
       * Returns the <code>BOM</code> that was detected in the wrapped
       * <code>InputStream</code> object.
       *
       * @return a <code>BOM</code> value.
       */
      public final BOM getBOM() {
        // BOM type is immutable.
        return bom;
      }

      /**
       * Skips the <code>BOM</code> that was found in the wrapped
       * <code>InputStream</code> object.
       *
       * @return this <code>UnicodeBOMInputStream</code>.
       *
       * @throws IOException when trying to skip the BOM from the wrapped
       * <code>InputStream</code> object.
       */
      public final synchronized UnicodeBOMInputStream skipBOM() throws IOException {
        if (!skipped) {
          in.skip(bom.bytes.length);
          skipped = true;
        }
        return this;
      }

      /** {@inheritDoc} */
      public int read() throws IOException {
        return in.read();
      }

      /** {@inheritDoc} */
      public int read(final byte b[]) throws IOException, NullPointerException {
        return in.read(b, 0, b.length);
      }

      /** {@inheritDoc} */
      public int read(final byte b[], final int off, final int len)
          throws IOException, NullPointerException {
        return in.read(b, off, len);
      }

      /** {@inheritDoc} */
      public long skip(final long n) throws IOException {
        return in.skip(n);
      }

      /** {@inheritDoc} */
      public int available() throws IOException {
        return in.available();
      }

      /** {@inheritDoc} */
      public void close() throws IOException {
        in.close();
      }

      /** {@inheritDoc} */
      public synchronized void mark(final int readlimit) {
        in.mark(readlimit);
      }

      /** {@inheritDoc} */
      public synchronized void reset() throws IOException {
        in.reset();
      }

      /** {@inheritDoc} */
      public boolean markSupported() {
        return in.markSupported();
      }

      private final PushbackInputStream in;
      private final BOM bom;
      private boolean skipped = false;

    } // UnicodeBOMInputStream

And you use it this way:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    public final class UnicodeBOMInputStreamUsage {

      public static void main(final String[] args) throws Exception {
        FileInputStream fis = new FileInputStream("test/offending_bom.txt");
        UnicodeBOMInputStream ubis = new UnicodeBOMInputStream(fis);

        System.out.println("detected BOM: " + ubis.getBOM());

        System.out.print("Reading the content of the file without skipping the BOM: ");
        InputStreamReader isr = new InputStreamReader(ubis);
        BufferedReader br = new BufferedReader(isr);

        System.out.println(br.readLine());

        br.close();
        isr.close();
        ubis.close();
        fis.close();

        fis = new FileInputStream("test/offending_bom.txt");
        ubis = new UnicodeBOMInputStream(fis);
        isr = new InputStreamReader(ubis);
        br = new BufferedReader(isr);

        // Drop the detected BOM before the first read.
        ubis.skipBOM();

        System.out.print("Reading the content of the file after skipping the BOM: ");
        System.out.println(br.readLine());

        br.close();
        isr.close();
        ubis.close();
        fis.close();
      }

    } // UnicodeBOMInputStreamUsage

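Other tips

The Apache Commons IO library has an InputStream that can detect and discard BOMs: BOMInputStream (javadoc):

    // import org.apache.commons.io.input.BOMInputStream;
    BOMInputStream bomIn = new BOMInputStream(in);
    int firstNonBOMByte = bomIn.read(); // Skips BOM
    if (bomIn.hasBOM()) {
        // has a UTF-8 BOM
    }

If you also need to detect different encodings, it can distinguish among various byte order marks, e.g. UTF-8 vs. UTF-16 big and little endian; the details are in the javadoc mentioned above. You can then use the detected ByteOrderMark to choose a Charset to decode the stream. (There is probably a more streamlined way to do this if you need all of this functionality, maybe the UnicodeReader in BalusC's answer?) Note that, in general, there is no very good way to detect what encoding some bytes are in, but if the stream starts with a BOM, that obviously helps.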
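To make the "detect the BOM, then choose a Charset" step concrete, here is a minimal stdlib-only sketch of the idea. It is not the Commons IO implementation: the class name BomCharsets and the method open are made up for illustration, and it assumes only the UTF-8 and UTF-16 BOMs matter (UTF-32 is left out, so a UTF-32LE file would be mistaken for UTF-16LE).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Commons IO.
final class BomCharsets {

    /** Returns a Reader whose charset comes from a leading BOM, or from fallback if there is none. */
    static Reader open(InputStream raw, Charset fallback) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];

        // Read up to 3 bytes; a single read() call may return fewer than requested.
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }

        Charset charset = fallback;
        int bomLength = 0;
        if (n >= 3 && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF) {
            charset = StandardCharsets.UTF_8;
            bomLength = 3;
        } else if (n >= 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            bomLength = 2;
        } else if (n >= 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE) {
            charset = StandardCharsets.UTF_16LE;
            bomLength = 2;
        }

        // Push back everything that was read past the BOM so no content is lost.
        if (n > bomLength) {
            in.unread(head, bomLength, n - bomLength);
        }
        return new InputStreamReader(in, charset);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0}; // UTF-16LE BOM + "hi"
        try (BufferedReader r = new BufferedReader(
                open(new ByteArrayInputStream(data), StandardCharsets.ISO_8859_1))) {
            System.out.println(r.readLine()); // prints "hi"
        }
    }
}
```

The fallback charset is only used when no BOM is found, which mirrors the caveat above: without a BOM the encoding still has to be known or guessed some other way.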

EDIT: If you need to detect the BOM in UTF-16, UTF-32, etc., then the constructor call should be:

    new BOMInputStream(is,
            ByteOrderMark.UTF_8,
            ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE,
            ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)

Upvote @martin-charlesworth's comment :)

A simpler solution:

    import java.io.IOException;
    import java.io.Reader;

    public class BOMSkipper {

        // The reader must support mark/reset (e.g. a BufferedReader).
        public static void skip(Reader reader) throws IOException {
            reader.mark(1);
            char[] possibleBOM = new char[1];
            reader.read(possibleBOM);

            // Not a BOM: rewind so the first character is not lost.
            if (possibleBOM[0] != '\ufeff') {
                reader.reset();
            }
        }
    }

Usage sample:

    BufferedReader input = new BufferedReader(
            new InputStreamReader(new FileInputStream(file), fileExpectedCharset));

    BOMSkipper.skip(input);
    // Now the UTF prefix is not present:
    input.readLine();
    ...

It works with all 5 UTF encodings!
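Why one comparison against '\ufeff' can be enough: once the bytes are decoded with the matching charset, the BOM shows up as the single character U+FEFF, whatever its byte form was. The in-memory sketch below illustrates this for Java's UTF-8 and UTF-16LE decoders, which pass the BOM through as a character. The class name BomSkipDemo and the method firstLine are assumptions made for this example, not part of the answer above.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration.
final class BomSkipDemo {

    /** Decodes data with the given charset and returns the first line without a leading BOM. */
    static String firstLine(byte[] data, Charset charset) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset));
        reader.mark(1);
        int first = reader.read();
        // Anything other than U+FEFF (including end of stream) is real content: rewind.
        if (first != '\uFEFF') {
            reader.reset();
        }
        return reader.readLine();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'i', 'd'};
        System.out.println(firstLine(utf8, StandardCharsets.UTF_8)); // prints "id"
    }
}
```

Note that the plain "UTF-16" charset in Java already consumes a BOM to pick the byte order, so the check mainly matters for UTF-8 and for the explicit UTF-16BE/UTF-16LE charsets.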

The Google Data API has a UnicodeReader which automagically detects the encoding.

You can use it instead of InputStreamReader. Here's a slightly compacted extract of its source, which is pretty straightforward:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.PushbackInputStream;
    import java.io.Reader;

    public class UnicodeReader extends Reader {
        private static final int BOM_SIZE = 4;
        private final InputStreamReader reader;

        /**
         * Construct UnicodeReader
         * @param in Input stream.
         * @param defaultEncoding Default encoding to be used if BOM is not found,
         * or <code>null</code> to use system default encoding.
         * @throws IOException If an I/O error occurs.
         */
        public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
            byte bom[] = new byte[BOM_SIZE];
            String encoding;
            int unread;
            PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
            int n = pushbackStream.read(bom, 0, bom.length);

            // Read ahead four bytes and check for BOM marks.
            // Note: as written, FF FE 00 00 matches the UTF-16LE test first,
            // so the UTF-32LE branch below is never reached.
            if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
                encoding = "UTF-8";
                unread = n - 3;
            } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
                encoding = "UTF-16BE";
                unread = n - 2;
            } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
                encoding = "UTF-16LE";
                unread = n - 2;
            } else if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00)
                    && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
                encoding = "UTF-32BE";
                unread = n - 4;
            } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)
                    && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
                encoding = "UTF-32LE";
                unread = n - 4;
            } else {
                encoding = defaultEncoding;
                unread = n;
            }

            // Unread bytes if necessary and skip BOM marks.
            if (unread > 0) {
                pushbackStream.unread(bom, (n - unread), unread);
            } else if (unread < -1) {
                pushbackStream.unread(bom, 0, 0);
            }

            // Use given encoding.
            if (encoding == null) {
                reader = new InputStreamReader(pushbackStream);
            } else {
                reader = new InputStreamReader(pushbackStream, encoding);
            }
        }

        public String getEncoding() {
            return reader.getEncoding();
        }

        public int read(char[] cbuf, int off, int len) throws IOException {
            return reader.read(cbuf, off, len);
        }

        public void close() throws IOException {
            reader.close();
        }
    }

The Apache Commons IO library's BOMInputStream has already been mentioned by @rescdsk, but I did not see any mention of how to get an InputStream without the BOM.

Here's how I did it in Scala.

    import java.io._
    import org.apache.commons.io.input.BOMInputStream

    val file = new File(path_to_xml_file_with_BOM)
    val fileInpStream = new FileInputStream(file)
    val bomIn = new BOMInputStream(fileInpStream, false) // false means don't include BOM

To simply remove the BOM characters from your file, I recommend using Apache Commons IO:

    public BOMInputStream(InputStream delegate, boolean include)

    Constructs a new BOM InputStream that detects a ByteOrderMark.UTF_8 and optionally includes it.

    Parameters:
        delegate - the InputStream to delegate to
        include - true to include the UTF-8 BOM or false to exclude it

Set include to false and your BOM characters will be excluded.

Regrettably not. You have to identify and skip it yourself. This page details what you have to watch for. Also see this SO question for more details.
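As a rough illustration of "identify and skip it yourself" with nothing but the JDK, the sketch below peeks at the first three bytes through mark/reset and leaves them consumed only when they are the UTF-8 BOM (EF BB BF). It assumes UTF-8 input only; the class name Utf8BomStripper and the method strip are hypothetical names chosen for this example.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; handles the UTF-8 BOM only.
final class Utf8BomStripper {

    /** Returns a stream positioned after a leading UTF-8 BOM, or at the start if there is none. */
    static InputStream strip(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(3); // remember the start so we can rewind after peeking 3 bytes

        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }

        boolean hasBom = n == 3
                && head[0] == (byte) 0xEF
                && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF;
        if (!hasBom) {
            in.reset(); // not a BOM: give the peeked bytes back
        }
        return in;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'x'};
        InputStream in = strip(new ByteArrayInputStream(data));
        System.out.println((char) in.read()); // prints "x"
    }
}
```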

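I had the same problem, and because I wasn't reading in a bunch of files I went for a simpler solution. I think my encoding was UTF-8, because when I printed out the offending character with the help of this page: Get unicode value of a character, I found that it was \ufeff. I used the code System.out.println( "\\u" + Integer.toHexString(str.charAt(0) | 0x10000).substring(1) ); to print out the offending unicode value.

Once I had the offending unicode value, I replaced it in the first line of my file before I went on reading. The business logic of that section:

    String str = reader.readLine().trim();
    str = str.replace("\ufeff", "");

This fixed my problem. Then I was able to go on processing the file with no issue. I added trim() just in case of leading or trailing whitespace; you can do that or not, depending on your specific needs.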
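One caveat with the snippet above: replace removes every U+FEFF in the line, not only a leading one, and readLine() returns null for an empty file. A slightly more defensive variant, as a minimal sketch (the class name FirstLineBom and the method stripLeadingBom are made up for illustration):

```java
// Hypothetical helper for illustration.
final class FirstLineBom {

    /** Removes a single leading U+FEFF, if present; returns null for null input. */
    static String stripLeadingBom(String line) {
        if (line != null && line.startsWith("\uFEFF")) {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) {
        System.out.println(stripLeadingBom("\uFEFFid,name")); // prints "id,name"
    }
}
```

It would be applied to the first line only, e.g. stripLeadingBom(reader.readLine()), leaving later lines untouched.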

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow