sed: replace spaces within quotes with underscores

Question 1

For a sed-only solution (which I don't necessarily advocate), try:

echo 'a b "c d e" f g "h i"' |\
sed ':a;s/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta'
a b "c_d_e" f g "h_i"

Translation:

Start at the beginning of the line.
Look for the pattern junk"junk", repeated zero or more times, where junk doesn't have a quote, followed by junk"junk space.
Replace the final space with _.
If successful, jump back to the beginning.

Question 2

try this:

awk -F'"' '{for(i=2;i<=NF;i++)if(i%2==0)gsub(" ","_",$i);}1' OFS="\"" file

it works for multi quotation parts in a line:

echo '"first part" foo "2nd part" bar "the 3rd part comes" baz'| awk -F'"' '{for(i=2;i<=NF;i++)if(i%2==0)gsub(" ","_",$i);}1' OFS="\"" 
"first_part" foo "2nd_part" bar "the_3rd_part_comes" baz

EDIT alternative form:

awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' file

Question 3

Another awk to try:

awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\"

Removing the quotes:

awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=

Some additional testing with a triple size test file further to the earlier tests done by @steve. I had to transform the sed statement a little bit so that non-GNU seds could process it as well. I included awk (bwk) gawk3, gawk4 and mawk:

$ for i in {1..1500000}; do echo 'a b "c d e" f g "h i" j k l "m n o "p q r" s t" u v "w x" y z' ; done > test
$ time perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge' test >/dev/null

real    0m27.802s
user    0m27.588s
sys 0m0.177s
$ time awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null

real    0m6.565s
user    0m6.500s
sys 0m0.059s
$ time gawk3 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null

real    0m21.486s
user    0m18.326s
sys 0m2.658s
$ time gawk4 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null

real    0m14.270s
user    0m14.173s
sys 0m0.083s
$ time mawk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null

real    0m4.251s
user    0m4.193s
sys 0m0.053s
$ time awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null

real    0m13.229s
user    0m13.141s
sys 0m0.075s
$ time gawk3 '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null

real    0m33.965s
user    0m26.822s
sys 0m7.108s
$ time gawk4 '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null

real    0m15.437s
user    0m15.328s
sys 0m0.087s
$ time mawk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null

real    0m4.002s
user    0m3.948s
sys 0m0.051s
$ time sed -e :a -e 's/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' test > /dev/null

real    5m14.008s
user    5m13.082s
sys 0m0.580s
$ time gsed -e :a -e 's/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' test > /dev/null

real    4m11.026s
user    4m10.318s
sys 0m0.463s

mawk rendered the fastest results...

Question 4

You'd be better off with perl. The code is much more readable and maintainable:

perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge'

With your input, the results are:

a b "c_d_e" f g "h_i"

Explanation:

-p            # enable printing
-e            # the following expression...

s             # begin a substitution

:             # the first substitution delimiter

"[^"]*"      # match a double quote followed by anything not a double quote any
              # number of times followed by a double quote

:             # the second substitution delimiter

($x=$&)=~s/ /_/g;      # copy the pattern match ($&) into a variable ($x), then 
                       # substitute a space for an underscore globally on $x. The
                       # variable $x is needed because capture groups and
                       # patterns are read only variables.

$x            # return $x as the replacement.

:             # the last delimiter

g             # perform the nested substitution globally
e             # make sure that the replacement is handled as an expression

Some testing:

for i in {1..500000}; do echo 'a b "c d e" f g "h i" j k l "m n o "p q r" s t" u v "w x" y z' >> test; done

time perl -pe 's:"[^"]*":($x=$&)=~s/ /_/g;$x:ge' test >/dev/null

real    0m8.301s
user    0m8.273s
sys     0m0.020s

time awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(" ","_",$i)} 1' test >/dev/null

real    0m4.967s
user    0m4.924s
sys     0m0.036s

time awk '!(NR%2){gsub(FS,"_")}1' RS=\" ORS=\" test >/dev/null

real    0m4.336s
user    0m4.244s
sys     0m0.056s

time sed ':a;s/^\(\([^"]*"[^"]*"[^"]*\)*[^"]*"[^"]*\) /\1_/;ta' test >/dev/null

real    2m26.101s
user    2m25.925s
sys     0m0.100s

Question 5

NOT AN ANSWER, just posting awk equivalent code for @steve's perl code in case anyone's interested (and to help me remember this in future):

@steve posted:

perl -pe 's:"[^\"]*":($x=$&)=~s/ /_/g;$x:ge'

and from reading @steve's explanation the briefest awk equivalent to that perl code (NOT the preferred awk solution - see @Kent's answer for that) would be the GNU awk:

gawk '{
   head = ""
   while ( match($0,"\"[^\"]*\"") ) {
      head = head substr($0,1,RSTART-1) gensub(/ /,"_","g",substr($0,RSTART,RLENGTH))
      $0 = substr($0,RSTART+RLENGTH)
   }
   print head $0
}'

which we get to by starting from a POSIX awk solution with more variables:

awk '{
   head = ""
   tail = $0
   while ( match(tail,"\"[^\"]*\"") ) {
      x = substr(tail,RSTART,RLENGTH)
      gsub(/ /,"_",x)
      head = head substr(tail,1,RSTART-1) x
      tail = substr(tail,RSTART+RLENGTH)
   }
   print head tail
}'

and saving a line with GNU awk's gensub():

gawk '{
   head = ""
   tail = $0
   while ( match(tail,"\"[^\"]*\"") ) {
      x = gensub(/ /,"_","g",substr(tail,RSTART,RLENGTH))
      head = head substr(tail,1,RSTART-1) x
      tail = substr(tail,RSTART+RLENGTH)
   }
   print head tail
}'

and then getting rid of the variable x:

gawk '{
   head = ""
   tail = $0
   while ( match(tail,"\"[^\"]*\"") ) {
      head = head substr(tail,1,RSTART-1) gensub(/ /,"_","g",substr(tail,RSTART,RLENGTH))
      tail = substr(tail,RSTART+RLENGTH)
   }
   print head tail
}'

and then getting rid of the variable "tail" if you don't need $0, NF, etc, left hanging around after the loop:

gawk '{
   head = ""
   while ( match($0,"\"[^\"]*\"") ) {
      head = head substr($0,1,RSTART-1) gensub(/ /,"_","g",substr($0,RSTART,RLENGTH))
      $0 = substr($0,RSTART+RLENGTH)
   }
   print head $0
}'