Text Lines are missed when reading a file Line by Line in Perl. <cr> <lf> mismatch

StackOverflow https://stackoverflow.com/questions/15149118

  •  16-03-2022

Question

I want to extract and log various parameters from a 3G modem as there are intermittent dropouts. As such I am using wget to read 3Ginfo.html from a 3G modem and placing the contents into a file contents.txt. Using Notepad++ to open this file shows all of the data.

Due to my reputation, I cannot post pictures, therefore the code below is the best I can do; from Notepad++ (with View All Characters turned on), I get:

<tr>[LF]

<td class='hd'>Signal Strength:</td>[LF]

<td>[LF]

-72[CR]

&nbsp(dBm)&nbsp(High)</td>[LF]

</tr>[LF]

However, when the file is read line by line from Perl, there are clearly fewer lines than Notepad++ reports, and data is missing. In this case the actual signal strength value is missing.

Here is the Perl code to read the file:

open hLOGFILE, "<output.txt";
while (<hLOGFILE>) 
{ 
    print "Line no $.  Text is $_ ";
}

Here is the output (as text, because I cannot post pictures yet):

Line no 98  Text is <tr>

Line no 99  Text is <td class='hd'>Signal Strength:</td>

Line no 100  Text is <td>

&nbsp(dBm)&nbsp(High)</td>

Line no 102  Text is </tr>

It is clear that there are missing lines and it is related to the <cr> end of line terminator. I have tried slurping the file and the lines are still missing.

Apart from reading byte by byte and then trying to parse the file that way (which is not very appealing), I cannot find a solution.

My plan is to simply extract and log the lines of interest every minute or so.

I have tried opening the file specifying various encodings, but still no joy. If Notepad++ can read and display all the data, why does it not work in Perl? When using more from the Windows XP command line, it shows that the data is also missing.

When I view source from Chrome I get:

<tr>
    <td class='hd'>Received Signal Code Power(RSCP):</td>
    <td align='center'> -78 dBm</td>
</tr>

Solution

The -72[CR] line isn't missing. You're just not seeing it.

It isn't a separate line, because a lone carriage return character isn't normally recognized as a line break. What is happening is that you're reading this as one line:

-72[CR]&nbsp(dBm)&nbsp(High)</td>[LF]

When you print it, this goes out first:

Line no 101  Text is -72

Then that carriage return character is being printed which makes the cursor go back to the beginning of the line. Then, the rest of the line is printed. This covers up what you printed out, and thus you see:

&nbsp(dBm)&nbsp(High)</td>

because that overwrote the previous text on that line.
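One way to confirm this is to make the control characters visible before printing, much as Notepad++ does. The following is a minimal sketch, not part of the original answer; the helper name show_line_endings is made up for illustration, and the sample string is the record from the question.

```perl
use strict;
use warnings;

# The record as Perl reads it: everything up to the LF is one "line",
# with a carriage return embedded in the middle.
my $record = "-72\r&nbsp(dBm)&nbsp(High)</td>\n";

# Hypothetical helper: replace each CR and LF with a visible tag.
sub show_line_endings {
    my ($text) = @_;
    $text =~ s/\r/[CR]/g;
    $text =~ s/\n/[LF]/g;
    return $text;
}

print "Text is ", show_line_endings($record), "\n";
# prints: Text is -72[CR]&nbsp(dBm)&nbsp(High)</td>[LF]
```

With the tags in place the terminal no longer jumps back to column one, so the -72 stays visible.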

I used VI to create three different files with three different file formats ("mac" = "\r", "unix" = "\n", and "dos" = "\r\n"), then I used the Unix cat command to combine them into a single bastardized file.

Here's my program:

use 5.12.0;
use autodie;

open my $test_fh, "<:crlf", "new_test";

local ($/);               #Enable "slurp" mode
my $file = <$test_fh>;    #Whole file is read in.

$file =~ s/[\r\n]+/\n/g;  #Make all line endings just \n

#
# Now "rewrite" the file
#
my @file = split /\n/, $file;
for my $line (@file) {
    say qq(Line: "$line");
}

This prints out:

Line: "MAC FILE"
Line: "this"
Line: "is"
Line: "a"
Line: "test of my"
Line: "program"
Line: "this"
Line: "WINDOWS FILE"
Line: "is"
Line: "a"
Line: "test of my"
Line: "program"
Line: "UNIX FILE"
Line: "this"
Line: "is"
Line: "a"
Line: "test of my"
Line: "program"

As you can see, the MAC FILE did show all the lines, but the word Line: didn't print out with all of them. That's because Perl read it in as one big line. My s/\r+/\n/g converted it to print on multiple lines, but the while loop read it in as a single line.

Take a look at my open statement. I use the three-argument form, which solves some minor issues in Perl. The nice thing is that you can attach layers or encodings to the file handle. For example, <:crlf automatically converts Windows line endings from \r\n to just \n, but won't touch Unix files. It's a lifesaver for those who work in mixed Unix/Windows environments.

I was hoping to find a similar layer for old Mac-style text files. (In pre-Mac OS X days, Macintosh text files ended each line with just a \r and no \n at all.) That would have really solved the issue. Unfortunately, I didn't find any documentation on one. It's been a long time since you had pre-OS X Macintosh text files.
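In the absence of such a layer, one possible approach is to slurp the text and split it on all three conventions at once. This is only a sketch under assumptions: the function name split_any_eol is invented here, and the sample string stands in for the slurped file contents.

```perl
use strict;
use warnings;

# Hypothetical helper: split text on CRLF, lone CR, or lone LF.
# \r\n is listed first so a Windows ending is not counted as two breaks.
sub split_any_eol {
    my ($text) = @_;
    return split /\r\n|\r|\n/, $text;
}

# Stand-in for the slurped file: mixes all three line-ending styles.
my $sample = "<td>\n-72\r&nbsp(dBm)&nbsp(High)</td>\n</tr>\r\n";
my @lines  = split_any_eol($sample);

printf "%d: %s\n", $_ + 1, $lines[$_] for 0 .. $#lines;
```

Unlike s/[\r\n]+/\n/g, this pattern does not collapse runs of line endings, so blank lines in the middle of the text survive as empty strings (split still drops trailing empty fields). For a file that uses only \r, setting $/ to "\r" before reading line by line is another option.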

Other suggestions

Carriage return is \r. It is listed in perldoc perlreref. Removing it from your input, for example in that loop of yours, can be done like so:

while (<hLOGFILE>) { 
    s/\r//g;
    print "Line no $.  Text is $_ ";
}

Alternatives

tr/\r//d;        # same thing as above, really
s/[\r\n]+$//;    # remove all line endings
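The three statements do not behave identically when the carriage return sits in the middle of a record, as it does in the question. A small sketch of the difference, applied to copies of that record (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $record = "-72\r&nbsp(dBm)&nbsp(High)</td>\n";

(my $no_cr   = $record) =~ s/\r//g;        # every CR removed, LF kept
(my $tr_cr   = $record) =~ tr/\r//d;       # same result using tr
(my $trimmed = $record) =~ s/[\r\n]+$//;   # only trailing CR/LF removed

print "s///  result contains CR: ", ($no_cr   =~ /\r/ ? "yes" : "no"), "\n";
print "tr    result contains CR: ", ($tr_cr   =~ /\r/ ? "yes" : "no"), "\n";
print "trim  result contains CR: ", ($trimmed =~ /\r/ ? "yes" : "no"), "\n";
```

The first two join the value and its unit into one line of text, while the anchored substitution leaves the embedded carriage return in place, so it would not cure the overwritten output on its own.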

You could chomp() it off...

open hLOGFILE, "<output.txt";
while (<hLOGFILE>)
{
    chomp(); 
    print "Line no $.  Text is $_ \n" if( $_ );
}

On some systems I've seen the need to call chomp() twice to get rid of multiple end-of-line characters... yes, they do exist. You may also want to add something to strip out all those HTML tags. See: How can I strip HTML in a string using Perl?
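Note that chomp() removes only a trailing copy of $/ (by default "\n"), so a carriage return embedded in the middle of a record is still there afterwards. A hedged sketch combining chomp with an explicit \r removal follows; it reads from an in-memory filehandle so that it is self-contained, and the sample data is a stand-in for the modem page rather than real output.

```perl
use strict;
use warnings;

# Stand-in for the modem page, read through an in-memory filehandle.
my $data = "<td>\n-72\r&nbsp(dBm)&nbsp(High)</td>\n</tr>\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @seen;
while (my $line = <$fh>) {
    chomp $line;          # strips the trailing "\n" only
    $line =~ s/\r//g;     # strips any CR that chomp left behind
    next if $line eq '';  # skip lines that are now empty
    push @seen, "Line no $.  Text is $line";
}
close $fh;

print "$_\n" for @seen;
```

Testing $line eq '' rather than the truth of the line avoids skipping a line that contains only 0.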

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow