Question

I am using a Perl script to read in a file, but I'm not sure what encoding the file is in. My file is a list of book titles, and each book has other info associated with it (author, publication date, etc.), so each title sits inside a discrete chunk of data for that book. I iterate through the file line by line until I find a match for the regular expression '/Book Title: (.*)/' and take what's in the parentheses. Then I create a separate .txt file named after the book. However, on my Unix server, when I look at the name of the file, it's not, for example, 'LordOfTheFlies.txt' but rather 'LordOfTheFlies^M.txt'.

What is this '^M'? Is it a weird end-of-line encoding I'm not taking into account? I tried chomp but it doesn't seem to work. What is the best file encoding for working with Perl?


Solution

It's the additional carriage return character that Windows systems insert before line feed characters (M == 13th letter, hence ASCII 13 is visualised as ^M).

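It has nothing to do with file encoding; it's just the line-ending convention biting you. Perl is usually good at handling line endings, but on Unix chomp() only removes the trailing "\n", so the "\r" from a Windows file stays in the string. Because "." matches any character except "\n", your (.*) captures that "\r" as part of the title. You can use s/\r?\n$// instead of chomp() to remove both characters, or s/\r//g to strip carriage returns wherever they occur.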
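A minimal sketch of that fix, assuming the data is read from an in-memory string so the example is self-contained. The sample records and the variable names are made up for illustration; in the real script the file handle would come from the book file.

```perl
use strict;
use warnings;

# Hypothetical sample data with Windows (CRLF) line endings.
my $data = "Author: William Golding\r\nBook Title: LordOfTheFlies\r\n";

# Open the string as a file handle so no external file is needed.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @titles;
while (my $line = <$fh>) {
    # Remove the LF and an optional preceding CR.
    # chomp alone would leave the "\r" at the end of the title.
    $line =~ s/\r?\n\z//;
    if ($line =~ /Book Title: (.*)/) {
        push @titles, $1;
    }
}
close $fh;

# Each title can now be used safely as a file name.
print "$_.txt\n" for @titles;
```

Another option, not mentioned in the answer above, is to open the real file with PerlIO's `:crlf` layer (`open my $fh, '<:crlf', $path`), which translates CRLF to LF while reading so that a plain chomp is enough.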

OTHER TIPS

Before processing the file, you need to know its encoding, which is determined by whoever produced the file.
That "^M" is Control-M, a carriage return, which is not needed in Unix text files.
It looks like the file was created on Windows and transferred to Unix. The carriage returns also survive when a text file is transferred over FTP in binary mode instead of ASCII mode.

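chomp removes only the trailing newline character, so the carriage return is left behind. You can call chop after chomp to remove it, but chop deletes the last character whatever it is, so it will damage lines that have no carriage return. s/\r// is the safer choice. For your general question, you might want to use an appropriate module for the file type you have, to make your life with Perl easier.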
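The difference between the two approaches can be sketched as follows. The helper names `chomp_then_chop` and `strip_eol` are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Return a copy of the line after chomp followed by chop.
sub chomp_then_chop {
    my ($line) = @_;
    chomp $line;   # removes a trailing "\n" only (the default $/)
    chop $line;    # removes the last character, whatever it is
    return $line;
}

# Return a copy of the line with any trailing CR/LF characters removed.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/[\r\n]+\z//;
    return $line;
}

# One line with a Windows ending, one with a Unix ending.
for my $line ("Title\r\n", "Title\n") {
    printf "chomp+chop: '%s'  regex: '%s'\n",
        chomp_then_chop($line), strip_eol($line);
}
```

With the Windows-style line both helpers return 'Title', but with the Unix-style line chomp followed by chop returns 'Titl', which is why the substitution is the more robust choice when the source of the file is not known.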

Licensed under: CC-BY-SA with attribution