Reverse newline tokenization in one-token per line files? - Unix

https://stackoverflow.com/questions/21779272

11-10-2022

Question

The question "How to separate tokens in line using Unix?" showed that a file can be tokenized using sed or xargs.

Is there a way to do the reverse?

[in:]

some
sentences
are
like
this.

some
sentences
foo
bar
that

[out:]

some sentences are like this.
some sentences foo bar that

The only delimiter between sentences is \n\n (a blank line). I could do the following in Python, but is there a Unix way?

def per_section(it):
  """Read lines and yield sections, using an empty line as the delimiter."""
  section = []
  for line in it:
    if line.strip('\n'):
      section.append(line)
    else:
      yield ''.join(section)
      section = []
  # yield any remaining lines as a section too
  if section:
    yield ''.join(section)

# Python 3: open() handles the encoding, so codecs is not needed
with open('outfile.txt', encoding='utf8') as f:
  print([s.replace("\n", " ") for s in per_section(f)])

[out:]

['some sentences are like this. ', 'some sentences foo bar that ']

Solution

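Using awk makes this kind of task easier. Setting RS to the empty string puts awk in paragraph mode, where records are separated by blank lines; $1=$1 forces awk to rebuild the record with single spaces between fields; and the trailing 7 is an always-true pattern that triggers the default print action: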

awk -v RS="" '{$1=$1}7' file

If you want to keep runs of multiple spaces within each line, make the newline the only field separator:

awk -v RS="" -F'\n' '{$1=$1}7' file

With your example:

kent$  cat f
some
sentences
are
like
this.

some
sentences
foo
bar
that

kent$  awk -v RS=""  '{$1=$1}7' f   
some sentences are like this.
some sentences foo bar that
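
The two awk variants above differ only in how whitespace inside a line is treated. The following sketch makes that visible; the sample text (with a double space in its first line) and the variable names are made up for illustration and are not from the question.

```shell
# Hypothetical sample: two blank-line-separated paragraphs, the first
# line containing a double space.
sample=$(printf 'some  odd\nspacing\n\nfoo\nbar\n')

# Default FS: any run of blanks or newlines separates fields, so the
# double space collapses when $1=$1 rebuilds the record.
squeezed=$(printf '%s\n' "$sample" | awk -v RS="" '{$1=$1}7')

# FS set to newline: each line is a single field, so spaces inside a
# line survive and only the newlines become spaces.
kept=$(printf '%s\n' "$sample" | awk -v RS="" -F'\n' '{$1=$1}7')

printf '%s\n---\n%s\n' "$squeezed" "$kept"
```

The first result starts with "some odd spacing", the second with "some  odd spacing"; both end with "foo bar" on its own line.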

Other Tips

You can do this with awk as follows:

awk -v RS="\n\n" '{gsub("\n"," ",$0);print $0}' file.txt

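Setting the record separator to \n\n makes awk treat each group of lines separated by a blank line as one record. Each record is then printed after every \n in it has been replaced by a space. Note that a multi-character RS is an extension (supported by gawk and mawk, for example); POSIX leaves its behavior unspecified.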
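
Where a multi-character RS is not available, the same grouping can be done line by line, much like the Python generator in the question. This is only a sketch: join_paragraphs is a hypothetical helper name, and it assumes that whitespace-only lines should count as blank and that no trailing space is wanted.

```shell
# join_paragraphs: hypothetical helper, not part of the original answers.
# Reads the named files (or stdin) and prints one line per paragraph.
join_paragraphs() {
  awk '
    # Non-blank line: append it to the buffer, space-separated.
    NF  { buf = buf sep $0; sep = " "; next }
    # Blank line: flush the buffer if it holds anything, then reset.
        { if (buf != "") print buf; buf = sep = "" }
    # End of input: flush the last paragraph.
    END { if (buf != "") print buf }
  ' "$@"
}

printf 'some\nsentences\nare\nlike\nthis.\n\nsome\nsentences\nfoo\nbar\nthat\n' |
  join_paragraphs
```

Because empty buffers are never printed, several consecutive blank lines behave like a single separator.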

sed -n --posix 'H;$ {x;s/\n\([^[:cntrl:]]\{1,\}\)/\1 /gp;}' YourFile

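This relies on blank lines as the separator, so each sentence may span any number of lines. Note that --posix is a GNU sed option, that the whole file is collected in the hold space before it is processed, and that each output line ends with a trailing space.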
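
As a variation on the sed idea, the hold space can be flushed at every blank line instead of once at the end of the file. The sketch below is a different script from the one in the answer, not a rewrite of it; join_with_sed is a hypothetical helper name, and it assumes separators are truly empty lines (a line holding only spaces is treated as text).

```shell
# join_with_sed: hypothetical helper, not part of the original answers.
join_with_sed() {
  # First script: a non-empty line is appended to the hold space and,
  # unless it is the last line of input, the cycle ends there.
  # Second script: reached on an empty line or on the last line; swap
  # the collected paragraph in, drop its leading newline, and turn the
  # remaining newlines into spaces before it is printed.
  sed -e '/./{H;$!d;}' -e 'x;s/^\n//;s/\n/ /g' "$@"
}

printf 'some\nsentences\nare\nlike\nthis.\n\nsome\nsentences\nfoo\nbar\nthat\n' |
  join_with_sed
```

One caveat of this sketch: two or more consecutive empty lines produce an empty output line, because every empty line triggers a flush.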

Licensed under: CC-BY-SA with attribution