Question

I am working on Windows Vista with GnuWin32 (sed 4.2.1 and core utilities 5.3.0). I also have the ActivePerl 5.14.2 package.

I have a large multi-record file. The end of each record in the file is denoted with four dollar signs ($$$$). Within each logical record are many CRLF line breaks.

I would like to replace all instances of CRLF with a symbol such as |+|. Then I will replace $$$$ with CRLF. The result: one record per row for import into Excel for further manipulation.

I've tried several methods for transforming CRLF to |+| but without success.

For example, one method was: sed -e "s/[\r\n]/|+|/g" source_file_in target_file_out

Another method used tr -d to delete \r and then a second statement: sed -e "s/\n/|+|/g" source_file_in target_file_out

The tr statement worked; the sed statement did not.

I've read the following articles but don't see how to adapt them to replace \r\n with a symbol like |+|.

sed: how to replace CR and/or LF with "\r" "\n", so any file will be in one line

Replace string that contains CRLF?

How can I replace a newline (\n) using sed?

If this problem cannot be solved easily using sed (and tr), then I'll use Perl if someone shows me how.


Thank you Ed for your recommendation.

The awk script is not yet working completely, so I'll add some missing detail with the hope that you can fine tune your recommendation.

First, I'm running gawk v3.1.6.2962. I believe there may be differences in awk implementations, so this may be a useful bit of information.

Next, some more information about the type of data and origin of the data.

The data is about chemicals (text data that is input to a stereo-chemical drawing program).

The chemical files are in an .sdf format.

When I open "133711.sdf" in NotePad++ (using View/Show symbol/Show all characters), I see data that is shown in the screen shot: https://dl.dropbox.com/u/3094317/_master_1_screen_shot_.png

As you see, LF only - no CR. I believe this means that the origin of the .sdf files is a UNIX system.

Next, I run the Windows command COPY *.sdf _master_2_.txt. That creates the very large file-of-files that I want to parse into records.

_master_2_.txt has the same structure as 133711.sdf - LF only; no CR.

Then, I run your awk recommendation in a .BAT file. I need to replace your single quotes with double quotes because Microsoft made me.

awk -v FS="\r\n" -v OFS="|+|" -v RS="\$\$\$\$" -v ORS="\r\n" "{$1=$1}1" C:_master_2_.txt >C:\output.txt

I've attached a screenshot of output.txt: https://dl.dropbox.com/u/3094317/output.txt.png

As you can see, the awk command did not successfully replace "\r\n" with "|+|".

Further, Windows created the output.txt with CRLF.

It did successfully replace the four $ with CRLF.

Is this information adequate to update your awk recommendation to handle the Windows-related issues?


Solution

Try this with GNU awk:

awk -v FS='\r\n' -v OFS='|+|' -v RS='\\$\\$\\$\\$' -v ORS='\r\n' '{$1=$1}1' file

I see from your updated question that you're on Windows. To avoid ridiculous quoting rules and issues, put this in a file named "whatever.awk":

BEGIN{FS="\r\n"; OFS="|+|"; RS="\\$\\$\\$\\$"; ORS="\r\n"} {$1=$1}1

and run it as

awk -f whatever.awk file

and see if that does what you want.
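The updated question says the data contains LF only, so a field separator of "\r\n" would never match in that file. A minimal sketch that accepts either line ending might look like the following. It is written as a POSIX shell session purely for illustration; the file names sample.sdf, flatten.awk, and flat.txt are hypothetical, and the sample data is invented just to exercise the script. It assumes an awk that treats a multi-character RS as a regular expression, as gawk does.

```shell
# Build a tiny LF-only sample: two records, each terminated by a $$$$ line.
printf 'a\nb\n$$$$\nc\nd\n$$$$\n' > sample.sdf

# flatten.awk (hypothetical name): join the lines of each record with |+|.
cat > flatten.awk <<'EOF'
BEGIN {
    FS  = "\r?\n"                    # split on LF, tolerating an optional CR
    OFS = "|+|"                      # separator written between the joined lines
    RS  = "\r?\n[$][$][$][$]\r?\n"   # the $$$$ line plus the line breaks around it
    ORS = "\r\n"                     # one CRLF-terminated row per record
}
{ $1 = $1 } 1                        # rebuild the record with OFS, then print it
EOF

awk -f flatten.awk sample.sdf > flat.txt
```

Because RS here also consumes the line breaks on either side of $$$$, the records should not begin or end with a stray |+|. The bracket form [$] is just another way to match a literal dollar sign without backslash escaping.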

Licensed under: CC-BY-SA with attribution