Read new line character across different OS

https://stackoverflow.com/questions/23467913

15-07-2023

Question

I am reading a log file and counting the number of lines in it with the following code snippet.

byte[] c = new byte[1024];
long count = 0;
int readChars = 0;
while ((readChars = is.read(c)) != -1) {
    for (int i = 0; i < readChars; ++i) {
        if (c[i] == '\n') {
            ++count;
        }
    }
}

My problem is that when I read most files (CSV, Syslog, or any other wild format), it runs just fine and gives me the right result. But when I read a file that was generated on a Mac, it goes haywire and reports that only a single line was read.

My log file is large, and I know it contains a few thousand lines of logs, but only a single line was counted. I opened the file in Sublime and could see all the separate lines. However, when I viewed it in Vim, it displayed a single line with a '^M' character at the end of each record (my guess is that the file uses this as its line terminator).

A sample of two lines is below. You can see that Vim displays the ^M character where there should have been a new line.

15122,25Dec2013,19:42:25,192.168.5.1,log,allow,,eth0,outbound,Application Control,,Network,Bob(+),Bob(+),,,,59857d77,,,,,,,,570033,,,,,,,,,,,,,192.168.5.7,176.32.96.190,tcp,80,56305,15606,554427,60461741,**,,,,,,,1,**,**,**,**,**,**,**,**,**,Other: Wget/1.13.4 (linux-gnu),Other: Server,192.168.5.7,60461741:1,,,,,,**,**,**,,,**,,,,^M359,23Dec2013,18:54:03,192.168.5.1,log,allow,,eth0,outbound,Application Control,,Network,Charlie(+),Charlie(+),,,,c0fa2dac,,,,,,,,1171362,,,,,,,,,,,,,192.168.5.6,205.251.242.54,tcp,80,45483,31395,1139967,60340847,**,,,,,,,2,**,**,**,**,**,**,**,**,**,Other: Wget/1.13.4 (linux-gnu),Other: Server,192.168.5.6,60340847:1,,,,,,,**,**,**,,,**,,,,^M

Any suggestions on how to tackle this problem?

Solution 2

Both line feed (LF, ^J, 0x0a) and carriage return (CR, ^M, 0x0d) are used as line separators: Unix uses the former, classic (pre-OS X) Mac OS the latter, and Windows both in combination (CR-LF).

If you don't have a file input library that abstracts this, and you absolutely have to support the old Mac format, treat both LF and CR as line separators, but don't count the CR-LF used by Windows twice. (Modern macOS, whose kernel is Unix-based, also uses LF.)
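
The rule above could be sketched at the byte level roughly as follows. This is only an illustration, not code from the question: the class and method names are hypothetical, and it assumes an encoding in which the bytes 0x0a and 0x0d only ever mean LF and CR (ASCII, ISO-8859-1, UTF-8, but not UTF-16). Like the loop in the question, it counts terminators, so a final line without a terminator is not counted.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for illustration only.
final class LineTerminatorCounter {

    // Counts line terminators, treating LF, CR, and CR-LF each as one terminator.
    static long countLineTerminators(InputStream in) throws IOException {
        byte[] buffer = new byte[1024];
        long count = 0;
        // Remembered across buffer refills so a CR-LF pair split
        // between two reads is still counted only once.
        boolean previousWasCr = false;
        int bytesRead;
        while ((bytesRead = in.read(buffer)) != -1) {
            for (int i = 0; i < bytesRead; i++) {
                byte b = buffer[i];
                if (b == '\r') {
                    // A CR always ends a line (old Mac, or first half of CR-LF).
                    count++;
                    previousWasCr = true;
                } else {
                    // An LF ends a line unless it completes a CR-LF pair.
                    if (b == '\n' && !previousWasCr) {
                        count++;
                    }
                    previousWasCr = false;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "unix\nold mac\rwindows\r\n".getBytes(StandardCharsets.US_ASCII);
        // One terminator of each kind, so this prints 3.
        System.out.println(countLineTerminators(new ByteArrayInputStream(sample)));
    }
}
```

Keeping the "previous byte was CR" flag outside the read loop is the important design choice here: without it, a CR at the end of one 1024-byte chunk and an LF at the start of the next would be counted as two lines.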

Vim

What Vim detects is determined by the 'fileformats' option. You can make it detect Mac as well via

:set fileformats+=mac

Other tips

The first problem, even before you get to line breaks, is that you're reading bytes and then treating them as characters. You're effectively assuming an encoding of ISO-8859-1, which may well not be correct. You should be using an InputStreamReader instead.

Then there's the issue of operating systems having different line breaks... use BufferedReader.readLine() to read a line in a way that handles line breaks of \n, \r or \r\n.

So your code would become:

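// 'is' is the InputStream from the question; 'charset' is the file's encoding
int count = 0;
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(is, charset))) {
    // readLine() returns null at end of stream
    while (reader.readLine() != null) {
        count++;
    }
}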
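
To try this on mixed line endings without a real file, a self-contained variant might look like the sketch below. The class name, the method name, and the choice of UTF-8 are assumptions made for the example; the actual charset has to match the log file. Note one difference from the byte loop in the question: readLine() also returns a final line that has no terminator, so that line is counted too.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for illustration only.
final class ReadLineCounter {

    // Counts lines using BufferedReader.readLine(), which accepts \n, \r, or \r\n.
    static long countLines(InputStream in, Charset charset) throws IOException {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Three terminated lines (LF, CR, CR-LF) plus one unterminated last line.
        String sample = "unix\nold mac\rwindows\r\nlast line without terminator";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(countLines(in, StandardCharsets.UTF_8)); // prints 4
    }
}
```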

License: CC-BY-SA with attribution

Not affiliated with StackOverflow