Parse Sitemap Quickly

Question 1

I wouldn't recommend regular expressions as a general way to parse arbitrary XML or HTML, but since you said the input is this well-formed, the usual warning can probably be ignored here:

sed -n '/^<url>$/{n;N;N;N;s/\n/ /g;s/ *<[a-z]*>//g;s/<\/[a-z]*>/ /g;p}' test.txt

Here is a commented version that explains what is going on:

sed -n '/^<url>$/ {  # if this line contains only <url>
  n;N;N;N              # replace <url> with the next line, then append 3 more
  s/\n/ /g             # replace newlines with spaces
  s/ *<[a-z]*>//g      # remove opening tags and the spaces before them
  s/<\/[a-z]*>/ /g     # replace closing tags with a space
  p                    # print the pattern space
}' test.txt

The -n option suppresses the automatic printing of the pattern space.
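
To try the command, it helps to have a concrete input. The sample below is an assumption, reconstructed from the usual sitemap tag names and from the layout the script expects (<url> alone on its line, one child tag per line); the variable names sample and out are only for illustration.

```shell
# Hypothetical sample input: the exact layout is an assumption, not taken from the question.
sample='<urlset>
<url>
  <loc>http://www.A.com/a</loc>
  <lastmod>2013-08-01</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.6</priority>
</url>
<url>
  <loc>http://www.A.com/b</loc>
  <lastmod>2013-08-01</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.6</priority>
</url>
</urlset>'

# Feed the sample to the one-liner on stdin and keep the result for inspection.
out=$(printf '%s\n' "$sample" |
  sed -n '/^<url>$/{n;N;N;N;s/\n/ /g;s/ *<[a-z]*>//g;s/<\/[a-z]*>/ /g;p}')

printf '%s\n' "$out"
```

Each printed line ends with a trailing space, because the last closing tag is replaced by a space like the others. Appending s/ $// before the p would remove it if that matters.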

Question 2

This might work for you (GNU sed):

sed '/^<url>/!d;:a;N;/<\/url>/!ba;s/<[^>]*>\s*<[^>]*>/ /g;s/^ \| $//g' file

This gathers each <url> ... </url> block into the pattern space, replaces every pair of adjacent tags (and any whitespace between them) with a single space, and then removes the leading and trailing space. All other lines are deleted.
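
The same script can be spread over several lines with comments, which makes the gathering loop easier to follow. This is only a sketch: the sample input is a hypothetical layout (one tag per line inside each <url> block), and sample and out are illustrative names.

```shell
# Hypothetical sample input: the layout is an assumption for illustration.
sample='<urlset>
<url>
  <loc>http://www.A.com/a</loc>
  <lastmod>2013-08-01</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.6</priority>
</url>
<url>
  <loc>http://www.A.com/b</loc>
  <lastmod>2013-08-01</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.6</priority>
</url>
</urlset>'

out=$(printf '%s\n' "$sample" | sed '
  # Delete every line that does not start a <url> block.
  /^<url>/!d
  # Loop: append input lines until the closing </url> has been read.
  :a
  N
  /<\/url>/!ba
  # Replace each "tag, optional whitespace, tag" run with one space.
  s/<[^>]*>\s*<[^>]*>/ /g
  # Strip the single leading and trailing space left behind.
  s/^ \| $//g
')

printf '%s\n' "$out"
```

The \s and \| used here are GNU extensions, which is why the answer is qualified with "GNU sed".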

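If you know there will always be exactly 4 lines between the url tags, read a fixed number of lines instead (five N commands: one per inner line plus one for the closing </url>, so the final closing tag is also paired up and removed):

sed '/^<url>/!d;N;N;N;N;N;s/<[^>]*>\s*<[^>]*>/ /g;s/^ \| $//g' file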

Question 3

sed is an excellent tool for simple substitutions on a single line; for anything else, just use awk:

$ awk -F'[<>]' '
    /^<\/url>/ { inUrl=0; print line }
    inUrl      { line = line (line?" ":"") $3 }
    /^<url>/   { inUrl=1; line="" }
' file
http://www.A.com/a 2013-08-01 weekly 0.6
http://www.A.com/b 2013-08-01 weekly 0.6
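
The script relies on how -F'[<>]' splits a tagged line: every < or > is a field separator, so the text between the opening and closing tag lands in $3. A small sketch that prints the fields of one line makes this visible; the input line and the variable name fields are just for illustration.

```shell
# Split one hypothetical sitemap line on "<" and ">" and show each field.
fields=$(printf '%s\n' '  <loc>http://www.A.com/a</loc>' |
  awk -F'[<>]' '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }')

printf '%s\n' "$fields"
```

Here $1 is the leading indentation, $2 the opening tag name, $3 the value, $4 the closing tag name, and $5 the empty string after the final >.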