Question

I have 30 sitemap files that look like this:

<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
    <loc>http://www.A.com/a</loc>
    <lastmod>2013-08-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.6</priority>
</url>
<url>
    <loc>http://www.A.com/b</loc>
    <lastmod>2013-08-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.6</priority>
</url>
...
</urlset>

The output I want is one row per url tag with four columns, printed to the screen:

http://www.A.com/a 2013-08-01 weekly 0.6
http://www.A.com/b 2013-08-01 weekly 0.6 

I am currently using Python's BeautifulSoup to parse the tags out, but it is horribly slow, since there are 30+ files with about 300,000 lines each. Would it be possible to do this with a shell tool such as AWK or sed, or am I just using the wrong tool for the job?

Since the sitemap is so well formatted, there might be some regular expression trick that gets the job done.

Does anyone have experience splitting records/rows in AWK or sed on multi-line blocks instead of on the newline character?

Thanks a lot!

Solution

I definitely wouldn't suggest regular expressions as a general way of parsing arbitrary XML or HTML, but since you say the files are so consistently formatted, the usual warning can probably be ignored in this case:

sed -n '/^<url>$/{n;N;N;N;s/\n//g;s/ *<[a-z]*>//g;s/<\/[a-z]*>/ /g;p}' test.txt

Here is a commented version that explains what is going on:

sed -n '/^<url>$/ {  # if this line contains only <url>
  n;N;N;N              # read the next 4 lines into the pattern space
  s/\n//g              # remove newlines
  s/ *<[a-z]*>//g      # remove opening tags and the spaces before them
  s/<\/[a-z]*>/ /g     # replace closing tags with a space
  p                    # print the pattern space
}' test.txt

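The -n option suppresses the automatic printing of the pattern space, so only the rows printed explicitly by p appear in the output. Each row ends with a trailing space left behind by the last closing tag.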
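Since the question started out in Python, the same steps can also be sketched there. This is only an illustration of the sed logic above, not a benchmarked replacement; the function name `rows_fixed_layout` and the `SAMPLE` string are made up for the example, and it assumes, like the sed command, that every `<url>` line is followed by exactly four child lines.

```python
import re
from itertools import islice

# Sample data copied from the question's layout.
SAMPLE = """\
<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
    <loc>http://www.A.com/a</loc>
    <lastmod>2013-08-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.6</priority>
</url>
<url>
    <loc>http://www.A.com/b</loc>
    <lastmod>2013-08-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.6</priority>
</url>
</urlset>
"""


def rows_fixed_layout(lines):
    """Yield one row per <url> block, assuming exactly four child lines."""
    it = iter(lines)
    for line in it:
        if line.rstrip("\n") != "<url>":           # like /^<url>$/
            continue
        # Like n;N;N;N followed by s/\n//g: take the next 4 lines, joined.
        block = "".join(s.rstrip("\n") for s in islice(it, 4))
        block = re.sub(r" *<[a-z]*>", "", block)   # drop opening tags and indentation
        block = re.sub(r"</[a-z]*>", " ", block)   # closing tags become a space
        yield block.rstrip()                       # unlike sed, drop the trailing space


if __name__ == "__main__":
    for row in rows_fixed_layout(SAMPLE.splitlines()):
        print(row)
```

To run it on a real file, pass an open file object instead of `SAMPLE.splitlines()`, so the lines are streamed rather than loaded all at once.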

OTHER TIPS

This might work for you (GNU sed):

sed '/^<url>/!d;:a;N;/<\/url>/!ba;s/<[^>]*>\s*<[^>]*>/ /g;s/^ \| $//g' file

This gathers each url block into the pattern space, replaces every pair of adjacent tags (and the whitespace between them) with a single space, and removes the leading and trailing spaces. All other lines are deleted.
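The loop in that command (`:a;N;/<\/url>/!ba`) is terse, so here is a rough Python sketch of the same idea: keep appending lines until the closing tag shows up, then strip the tags in one go. The name `rows_until_close` and the sample data are hypothetical; the second block is deliberately shorter to show that the number of child lines does not matter.

```python
import re

# Hypothetical sample: the second block has only two child lines.
SAMPLE = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
    <loc>http://www.A.com/a</loc>
    <lastmod>2013-08-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.6</priority>
</url>
<url>
    <loc>http://www.A.com/b</loc>
    <lastmod>2013-08-01</lastmod>
</url>
</urlset>
"""


def rows_until_close(lines):
    """Collect each block from a '<url>' line up to '</url>', then strip tags."""
    block = None                                   # None: not inside a block
    for raw in lines:
        line = raw.rstrip("\n")
        if block is None:
            if line.startswith("<url>"):           # like /^<url>/!d
                block = [line]
            continue
        block.append(line)                         # like N
        if "</url>" in line:                       # like /<\/url>/!ba
            text = "\n".join(block)
            # Each pair of adjacent tags, with any whitespace between, -> one space.
            text = re.sub(r"<[^>]*>\s*<[^>]*>", " ", text)
            yield text.strip(" ")                  # like s/^ \| $//g
            block = None


if __name__ == "__main__":
    for row in rows_until_close(SAMPLE.splitlines()):
        print(row)
```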

If you know there will always be exactly 4 lines between the url tags, read 5 more lines (the 4 child lines plus the closing tag):

sed '/^<url>/!d;N;N;N;N;N;s/<[^>]*>\s*<[^>]*>/ /g;s/^ \| $//g' file

sed is an excellent tool for simple substitutions on a single line; for anything else, just use awk:

$ awk -F'[<>]' '
    /^<\/url>/ { inUrl=0; print line }
    inUrl      { line = line (line?" ":"") $3 }
    /^<url>/   { inUrl=1; line="" }
' file
http://www.A.com/a 2013-08-01 weekly 0.6
http://www.A.com/b 2013-08-01 weekly 0.6
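
If the files ever stop being one tag per line, the line-based tricks above will break. As a hedge, here is a minimal sketch of a streaming approach in Python using the standard library's `xml.etree.ElementTree.iterparse`, which is a real XML parser rather than the text matching used in the answers. The helper name `rows_from_sitemap` and the `FIELDS` tuple are made up for the example, and no claim is made here about how its speed compares with sed or awk on the files in question.

```python
import io
import xml.etree.ElementTree as ET

# Namespace taken from the urlset element in the question.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
FIELDS = ("loc", "lastmod", "changefreq", "priority")

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
    <loc>http://www.A.com/a</loc>
    <lastmod>2013-08-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.6</priority>
</url>
<url>
    <loc>http://www.A.com/b</loc>
    <lastmod>2013-08-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.6</priority>
</url>
</urlset>
"""


def rows_from_sitemap(source):
    """Yield one space-separated row per <url> element.

    `source` is a filename or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != NS + "url":
            continue
        values = (elem.findtext(NS + name) for name in FIELDS)
        row = " ".join(v.strip() for v in values if v)  # skip missing fields
        elem.clear()                                    # drop the children we no longer need
        yield row


if __name__ == "__main__":
    for row in rows_from_sitemap(io.BytesIO(SAMPLE)):
        print(row)
```

Because `iterparse` handles one element at a time and each `url` element is cleared after use, the whole document never has to be held as a fully populated tree. Passing a file path instead of the `BytesIO` object works the same way.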
Licensed under: CC-BY-SA with attribution