Removing newline characters from a pipe-delimited file for lines not starting with a timestamp

StackOverflow https://stackoverflow.com/questions/23659648

Question

Here is an example of the data:

2013-06-22 00:00:49.307121|147374 |PHONE HOME|SDRKRKS|REAS|something|KRISTCOS 11:13 AM 6/22/2013
NUM: 90834098
data: 0394884
cX: 90h010f03040f
mR: 034050t0ds0
cNUM: 034050t0ds0
2013-06-22 00:00:49.307121|0950704421406        |PHONE HOME|SDRKRKS|REAS|something|MRS
2013-06-22 00:00:50.379487|0441813679603        |PHONE HOME|SDRKRKS|REAS|something|TN 90210

I need a script that removes the newline character in front of every line that does not begin with a timestamp. In the example above, lines 2-6 would be appended to the last field of the first line as a sort of text blob. I know how to detect the good lines,

grep '^[0-9][0-9][0-9][0-9].*' testfile

and also the bad lines,

grep '^[^0-9][^0-9][^0-9][^0-9].*' testfile

The question now is, how do I apply this (using sed?) in order to put the lines following a 'good' line back into the last field of this line. Any help here would be much appreciated.

Here is an example of the desired output:

2013-06-22 00:00:49.307121|147374 |PHONE HOME|SDRKRKS|REAS|something|KRISTCOS 11:13 AM 6/22/2013 NUM: 90834098 data: 0394884 cX: 90h010f03040f mR: 034050t0ds0 cNUM: 034050t0ds0
2013-06-22 00:00:49.307121|0950704421406 |PHONE HOME|SDRKRKS|REAS|something|MRS
2013-06-22 00:00:50.379487|0441813679603 |PHONE HOME|SDRKRKS|REAS|something|TN 90210

Edit:

There is some disagreement as to which is the most appropriate tool. At the moment I am leaning towards notepad++. This is close to the kind of thing I want to do but it is not quite working, maybe someone out there can help me tune it to my use case:

(?! [0-9]{4}\-[0-9]{2}-[0-9]{2}).*

(?! [0-9]{4}\-[0-9]{2}-[0-9]{2})  - searches for a line not like a timestamp
.*                                  - followed by anything else

The problem is that the .* catches the timestamp that I am attempting to negate. Any thoughts?

Edit 2: Thanks everyone for the helpful advice, it's definitely moving me in the right direction! The following regex finds the problematic \n char in notepad++, but when I try to perform the substitution nothing happens:

Find: (.*)(\n)(?![0-9]{4}\-[0-9]{2}\-[0-9]{2})
Replace: \1

Does anyone have any ideas here as to how to force notepad++ to remove the problematic \n?

Edit 3: Here is additional sample data that does not seem to work with the proposed solutions:

2013-06-22 00:00:02.540298|0238704723874        |SMELL TEST|HAKEKJ  |REAS|No cooking|tcna / ncc
2013-06-22 00:00:04.302887|3289749873342        |SMELL TEST|ICNIDF  |REAS|No cooking|JINUJ/CVGIND/NASR
6:13 AM 6/22/2013
VERIFIED CURLING
TN :- 834974978398
XX and YY updated
THIS IS A SENTENCE
2013-06-22 00:00:06.937545|30874987392838        |SMELL TEST|KCIDKD  |REAS|No cooking|SrutiD/cvgind/nasr
tn 4887839847

Solution

Simplest solution:

echo $(cat file) | sed -re 's/(2013-06)/@@@\1/g' | sed -re 's/@@@/\n/g'

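This works because echo with an unquoted command substitution puts everything on one line; we then insert @@@ before each timestamp and replace every @@@ with a newline character. Be aware that the pattern is hard-coded to 2013-06, that the unquoted expansion also squeezes runs of spaces (visible in the output below) and would expand any glob characters in the data, and that \n in the replacement needs GNU sed. The output starts with an empty line because the first timestamp gets a marker too.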

tiago@dell:~$ echo $(cat file) | sed -re 's/(2013-06)/@@@\1/g' | sed -re 's/@@@/\n/g'

2013-06-22 00:00:49.307121|147374 |PHONE HOME|SDRKRKS|REAS|something|KRISTCOS 11:13 AM 6/22/2013 NUM: 90834098 data: 0394884 cX: 90h010f03040f mR: 034050t0ds0 cNUM: 034050t0ds0 
2013-06-22 00:00:49.307121|0950704421406 |PHONE HOME|SDRKRKS|REAS|something|MRS 
2013-06-22 00:00:50.379487|0441813679603 |PHONE HOME|SDRKRKS|REAS|something|TN 90210 
2013-06-22 00:00:02.540298|0238704723874 |SMELL TEST|HAKEKJ |REAS|No cooking|tcna / ncc 
2013-06-22 00:00:04.302887|3289749873342 |SMELL TEST|ICNIDF |REAS|No cooking|JINUJ/CVGIND/NASR 6:13 AM 6/22/2013 VERIFIED CURLING TN :- 834974978398 XX and YY updated THIS IS A SENTENCE 
2013-06-22 00:00:06.937545|30874987392838 |SMELL TEST|KCIDKD |REAS|No cooking|SrutiD/cvgind/nasr tn 4887839847
tiago@dell:~$ cat file
2013-06-22 00:00:49.307121|147374 |PHONE HOME|SDRKRKS|REAS|something|KRISTCOS 11:13 AM 6/22/2013
NUM: 90834098
data: 0394884
cX: 90h010f03040f
mR: 034050t0ds0
cNUM: 034050t0ds0
2013-06-22 00:00:49.307121|0950704421406        |PHONE HOME|SDRKRKS|REAS|something|MRS
2013-06-22 00:00:50.379487|0441813679603        |PHONE HOME|SDRKRKS|REAS|something|TN 90210
2013-06-22 00:00:02.540298|0238704723874        |SMELL TEST|HAKEKJ  |REAS|No cooking|tcna / ncc
2013-06-22 00:00:04.302887|3289749873342        |SMELL TEST|ICNIDF  |REAS|No cooking|JINUJ/CVGIND/NASR
6:13 AM 6/22/2013
VERIFIED CURLING
TN :- 834974978398
XX and YY updated
THIS IS A SENTENCE
2013-06-22 00:00:06.937545|30874987392838        |SMELL TEST|KCIDKD  |REAS|No cooking|SrutiD/cvgind/nasr
tn 4887839847
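
For what it's worth, the same flatten-then-split idea can be written without the hard-coded month and without the unquoted echo. What follows is only a sketch: join_by_marker is a made-up name, it assumes GNU sed (for -E and for \n in the replacement), and it assumes every record starts with a timestamp shaped like the sample data followed by a |. Free text that happens to contain such a timestamp would still be split.

```shell
# join_by_marker: flatten the input, then start a new line before each timestamp.
# Assumption: GNU sed, and timestamps of the form "YYYY-MM-DD hh:mm:ss.ffffff|".
join_by_marker() {
  cat "$@" |
    tr '\n' ' ' |
    sed -E -e 's/ +$//' \
        -e 's/ ([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]+\|)/\n\1/g'
  echo   # tr turned the final newline into a space, so add one back
}
```

Call it as join_by_marker file. Because tr only touches newlines, the padding inside the fields is kept as it is.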

Other tips

Using all of your posted sample input concatenated in one file:

$ cat file
2013-06-22 00:00:49.307121|147374 |PHONE HOME|SDRKRKS|REAS|something|KRISTCOS 11:13 AM 6/22/2013
NUM: 90834098
data: 0394884
cX: 90h010f03040f
mR: 034050t0ds0
cNUM: 034050t0ds0
2013-06-22 00:00:49.307121|0950704421406        |PHONE HOME|SDRKRKS|REAS|something|MRS
2013-06-22 00:00:50.379487|0441813679603        |PHONE HOME|SDRKRKS|REAS|something|TN 90210
2013-06-22 00:00:02.540298|0238704723874        |SMELL TEST|HAKEKJ  |REAS|No cooking|tcna / ncc
2013-06-22 00:00:04.302887|3289749873342        |SMELL TEST|ICNIDF  |REAS|No cooking|JINUJ/CVGIND/NASR
6:13 AM 6/22/2013
VERIFIED CURLING
TN :- 834974978398
XX and YY updated
THIS IS A SENTENCE
2013-06-22 00:00:06.937545|30874987392838        |SMELL TEST|KCIDKD  |REAS|No cooking|SrutiD/cvgind/nasr
tn 4887839847


$ awk 'NR>1{pre = (/^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}/ ? ORS : OFS)} {printf "%s%s",pre,$0} END{print ""}' file
2013-06-22 00:00:49.307121|147374 |PHONE HOME|SDRKRKS|REAS|something|KRISTCOS 11:13 AM 6/22/2013 NUM: 90834098 data: 0394884 cX: 90h010f03040f mR: 034050t0ds0 cNUM: 034050t0ds0
2013-06-22 00:00:49.307121|0950704421406        |PHONE HOME|SDRKRKS|REAS|something|MRS
2013-06-22 00:00:50.379487|0441813679603        |PHONE HOME|SDRKRKS|REAS|something|TN 90210
2013-06-22 00:00:02.540298|0238704723874        |SMELL TEST|HAKEKJ  |REAS|No cooking|tcna / ncc
2013-06-22 00:00:04.302887|3289749873342        |SMELL TEST|ICNIDF  |REAS|No cooking|JINUJ/CVGIND/NASR 6:13 AM 6/22/2013 VERIFIED CURLING TN :- 834974978398 XX and YY updated THIS IS A SENTENCE
2013-06-22 00:00:06.937545|30874987392838        |SMELL TEST|KCIDKD  |REAS|No cooking|SrutiD/cvgind/nasr tn 4887839847

If that's not your expected output, please update your question to show what it is.
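
In case the one-liner above is hard to read, here is roughly the same logic spread over several lines. Treat it as a sketch: join_continuations is just a name picked for illustration, and the date test is written without {n} intervals on the assumption that some older awks do not support them.

```shell
# join_continuations: glue lines that do not start with a date onto the previous line.
join_continuations() {
  awk '
    # A line starting with a date such as 2013-06-22 begins a new record.
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/ {
      if (NR > 1) printf "\n"      # close the previous record first
      printf "%s", $0
      next
    }
    # Anything else is a continuation: append it after a single space.
    { printf "%s%s", (NR > 1 ? " " : ""), $0 }
    # Terminate the last record, but print nothing for empty input.
    END { if (NR > 0) printf "\n" }
  ' "$@"
}
```

Use it as join_continuations file, or at the end of a pipeline.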

I am not sure what you want to do, since you have not provided an example of the output.
But if you want to join the lines, you can try this awk:

awk '{printf "%s%s", (NR==1 ? "" : (/^[0-9][0-9][0-9][0-9]-/ ? RS : " ")), $0} END {print ""}' file

2013-06-22 00:00:49.307121|147374 |PHONE HOME|SDRKRKS|REAS|something|KRISTCOS 11:13 AM 6/22/2013 NUM: 90834098 data: 0394884 cX: 90h010f03040f mR: 034050t0ds0 cNUM: 034050t0ds0
2013-06-22 00:00:49.307121|0950704421406        |PHONE HOME|SDRKRKS|REAS|something|MRS
2013-06-22 00:00:50.379487|0441813679603        |PHONE HOME|SDRKRKS|REAS|something|TN 90210

This might work for you (GNU sed):

sed ':a;$!N;/^[^|]*$/Ms/\n/ /;ta;P;D' file

If the line just appended does not contain a |, replace the newline with a space and repeat. Otherwise print the first line and start over with the line that was appended.
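
A commented, multi-line sketch of this kind of loop may make it easier to follow. It assumes GNU sed (the M flag on the address is a GNU extension) and that continuation lines never contain a |; join_unpiped is a hypothetical name. It ends with P and D so that only the completed first line is printed and the next timestamp line stays in the pattern space for further joining.

```shell
# join_unpiped: join every line that has no "|" onto the line before it.
join_unpiped() {
  sed '
    # Top of the loop.
    :a
    # Append the next input line, unless this is already the last one.
    $!N
    # M lets ^ and $ match around the embedded newline, so this asks whether
    # some line in the pattern space has no "|". If so, join at the newline.
    /^[^|]*$/M s/\n/ /
    # If a join happened, go back and try the following line as well.
    ta
    # Otherwise print the finished first line and restart with the rest.
    P
    D
  ' "$@"
}
```

Run it as join_unpiped file.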

Here is one way using GNU sed:

sed -nr ':a;N;/\n[0-9]{4}-[0-9]{2}-[0-9]{2}/{P;$!D;s/.*\n//p};s/\n/ /g;$!ba;p' file

Explanation:

  • :a creates a label.
  • N appends the next line to the pattern space.
  • /\n[0-9]{4}-[0-9]{2}-[0-9]{2}/{P;$!D;s/.*\n//p} tests whether the appended line starts with a date. If it does, print up to the first newline; if this is not the last line, delete up to the first newline and restart. If it is the last line, delete up to the newline and print what remains.
  • s/\n/ /g replaces the newlines of all other (continuation) lines with spaces.
  • $!ba branches back to the label and repeats unless this is the last line; the final p prints what is left in the pattern space.
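
The steps above amount to holding the current record in a buffer and flushing it when the next dated line arrives. If the sed syntax gets in the way, that logic can be sketched as a plain POSIX shell loop. This is an illustration rather than a replacement: join_records is a made-up name, it reads standard input only, and it will be much slower than sed or awk on large files.

```shell
# join_records: buffer the current record, print it when the next date line shows up.
join_records() {
  record=''
  have_record=0
  # The extra test after || keeps a last line that has no trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]*)
        # A new record starts: flush the buffered one, if any.
        if [ "$have_record" -eq 1 ]; then
          printf '%s\n' "$record"
        fi
        record=$line
        have_record=1
        ;;
      *)
        # Continuation line: append it to the buffer after a space.
        if [ "$have_record" -eq 1 ]; then
          record="$record $line"
        else
          record=$line
          have_record=1
        fi
        ;;
    esac
  done
  # Flush whatever is still buffered at end of input.
  if [ "$have_record" -eq 1 ]; then
    printf '%s\n' "$record"
  fi
}
```

Usage would be join_records < file.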
Licensed under: CC-BY-SA with attribution