Read new line character across different OS

https://stackoverflow.com/questions/23467913

15-07-2023

Question

I have come across a situation where I am reading a log file and counting the number of lines in it with the following code snippet.

byte[] c = new byte[1024];
long count = 0;
int readChars = 0;
while ((readChars = is.read(c)) != -1) {
    for (int i = 0; i < readChars; ++i) {
        if (c[i] == '\n') {
            ++count;
        }
    }
}

My problem is that when I read most files (CSV, syslog, or any other format), this runs fine and gives me the right result. But when I read a file that was generated on a Mac, it goes haywire and reports that only a single line was read.

My log file is large, and I know it contains several thousand lines of logs, yet only a single line is counted. When I opened the file in Sublime I could see all the separate lines. When I viewed it in Vim, however, it displayed a single line with a '^M' character at the end of each record (my guess is that the file uses this as the line terminator).

A sample of two lines is below. You can see that Vim displays the ^M character where a new line should have been.

15122,25Dec2013,19:42:25,192.168.5.1,log,allow,,eth0,outbound,Application Control,,Network,Bob(+),Bob(+),,,,59857d77,,,,,,,,570033,,,,,,,,,,,,,192.168.5.7,176.32.96.190,tcp,80,56305,15606,554427,60461741,**,,,,,,,1,**,**,**,**,**,**,**,**,**,Other: Wget/1.13.4 (linux-gnu),Other: Server,192.168.5.7,60461741:1,,,,,,**,**,**,,,**,,,,^M359,23Dec2013,18:54:03,192.168.5.1,log,allow,,eth0,outbound,Application Control,,Network,Charlie(+),Charlie(+),,,,c0fa2dac,,,,,,,,1171362,,,,,,,,,,,,,192.168.5.6,205.251.242.54,tcp,80,45483,31395,1139967,60340847,**,,,,,,,2,**,**,**,**,**,**,**,**,**,Other: Wget/1.13.4 (linux-gnu),Other: Server,192.168.5.6,60340847:1,,,,,,,**,**,**,,,**,,,,^M

Any suggestions on how to tackle this problem?

Solution 2

Both line feed (LF, ^J, 0x0a) and carriage return (CR, ^M, 0x0d) are used as line separators: Unix uses the former, old Mac OS the latter, and Windows both in combination (CR-LF).

If you don't have a file input library that abstracts this, and you absolutely have to support the old Mac format (modern macOS, whose kernel is Unix-based, also uses LF), treat both LF and CR as line separators, but don't count the CR-LF used by Windows twice.
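As a rough illustration of that advice, the byte loop from the question could be adapted along the following lines. This is only a sketch: the class name LineTerminatorCounter and the method countLineTerminators are hypothetical, and it assumes an ASCII-compatible encoding (such as UTF-8 or ISO-8859-1) in which the bytes 0x0a and 0x0d never occur inside another character. Like the original loop, it counts terminators, so a final line without a terminator is not counted.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class LineTerminatorCounter {

    // Counts line terminators, treating LF, CR, and CR-LF each as one.
    static long countLineTerminators(InputStream in) throws IOException {
        byte[] buffer = new byte[1024];
        long count = 0;
        // Remembered across buffer refills so a CR-LF pair split
        // between two reads is still counted only once.
        boolean previousWasCr = false;
        int read;
        while ((read = in.read(buffer)) != -1) {
            for (int i = 0; i < read; i++) {
                byte b = buffer[i];
                if (b == '\r') {
                    // A CR always ends a line (old Mac, or first half of CR-LF).
                    count++;
                    previousWasCr = true;
                } else {
                    // An LF ends a line unless it completes a CR-LF pair.
                    if (b == '\n' && !previousWasCr) {
                        count++;
                    }
                    previousWasCr = false;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // One Unix, one old Mac, and one Windows line ending.
        byte[] sample = "unix\nold mac\rwindows\r\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(countLineTerminators(new ByteArrayInputStream(sample)));
    }
}
```

Running main on the sample above should report three terminators, one per line-ending style.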

Vim

Which file formats Vim detects is determined by the 'fileformats' option. You can make it detect the Mac format as well via

:set fileformats+=mac

Other suggestions

The first problem, even before you get to line breaks, is that you're reading bytes and then treating them as characters. You're effectively assuming an encoding of ISO-8859-1, which may well not be correct. You should be using an InputStreamReader instead.

Then there's the issue of operating systems using different line breaks: use BufferedReader.readLine() to read a line in a way that handles line breaks of \n, \r or \r\n.

So your code would become:

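// `is` is the InputStream from the question; `charset` is the file's actual encoding.
int count = 0;
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(is, charset))) {
    // readLine() treats \n, \r, and \r\n each as a single line terminator.
    while (reader.readLine() != null) {
        count++;
    }
}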
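To check this behaviour without a real log file, the snippet can be wrapped in a small self-contained program. The following is a sketch only: the class name ReadLineDemo, the method countLines, and the choice of UTF-8 for the in-memory sample are assumptions made for illustration, not part of the original answer.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper around the readLine() loop shown above.
class ReadLineDemo {

    // Counts lines, decoding the stream with the given charset.
    static int countLines(InputStream is, Charset charset) throws IOException {
        int count = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(is, charset))) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Mixed line endings; the last line has no terminator at all.
        byte[] sample = "first\nsecond\rthird\r\nfourth".getBytes(StandardCharsets.UTF_8);
        System.out.println(countLines(new ByteArrayInputStream(sample), StandardCharsets.UTF_8));
    }
}
```

Unlike the byte loop in the question, readLine() also returns a final line that ends at end-of-file without a terminator, so the sample above counts as four lines.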

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow