Strip comments from an XML file and pretty-print it

https://stackoverflow.com/questions/1464697

13-09-2019

Question

I have a huge XML file that contains a lot of comments.

What's the "best way" to strip out all the comments and nicely format the XML from the Linux command line?

Solution

You can use tidy:

$ tidy -quiet -asxml -xml -indent -wrap 1024 --hide-comments 1 tomcat-users.xml
<?xml version='1.0' encoding='utf-8'?>
<tomcat-users>
  <user username="qwerty" password="ytrewq" roles="manager-gui" />
</tomcat-users>

OTHER TIPS

Run your XML through an identity transform XSLT, with an empty template for comments.

All of the XML content, except for the comments, will be passed through to the output.

To format the output nicely, set indent="yes" on xsl:output:

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

<!--Match on Attributes, Elements, text nodes, and Processing Instructions-->
<xsl:template match="@*| * | text() | processing-instruction()">
   <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
   </xsl:copy>
</xsl:template>

<!--Empty template prevents comments from being copied into the output -->
<xsl:template match="comment()"/>

</xsl:stylesheet>
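
To apply the stylesheet from the command line, an XSLT 1.0 processor such as xsltproc (part of libxslt) can be used. This is a minimal sketch, not part of the original answer: the helper name and the file names are hypothetical, and it assumes the stylesheet above has been saved to a file.

```shell
# Hypothetical helper: apply an XSLT stylesheet to an XML file.
#   $1 = stylesheet (e.g. the identity transform above, saved to a file)
#   $2 = input XML file
# xsltproc writes the transformed document to stdout.
apply_stylesheet() {
  xsltproc "$1" "$2"
}

# Example (file names are placeholders):
#   apply_stylesheet strip-comments.xsl old.xml > new.xml
```

If the input is already indented, the processor may keep the existing whitespace instead of re-indenting. Adding <xsl:strip-space elements="*"/> to the stylesheet is a common remedy, at the cost of dropping whitespace-only text nodes.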

You might want to look at the xmllint tool. It has several options (one of which, --format, pretty-prints the output), but I can't figure out how to remove the comments using this tool.

Also, check out XMLStarlet, a set of command-line tools for doing almost anything you would want to do with XML (the binary is installed as xml or xmlstarlet, depending on the system). Then do:

xml c14n --without-comments # XML file canonicalization w/o comments

EDIT: OP eventually used this line:

xmlstarlet c14n --without-comments old.xml > new.xml
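
Canonicalization removes the comments but preserves the existing whitespace, so the result is not re-indented. One way to get both, sketched here with the two tools already mentioned, is to pipe the canonical form through xmllint --format. The function name is hypothetical, and the binary may be called xml rather than xmlstarlet on some systems.

```shell
# Hypothetical helper: strip comments, then re-indent.
#   $1 = input XML file; the result goes to stdout
# Step 1: c14n --without-comments drops every comment node
#         (canonical XML also omits the XML declaration).
# Step 2: xmllint --format re-indents; "-" makes it read from stdin.
strip_and_format() {
  xmlstarlet c14n --without-comments "$1" | xmllint --format -
}

# Example: strip_and_format old.xml > new.xml
```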

To tidy up something simple like Tomcat's server.xml, I use

sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' | grep -zv '^<!--' | tr -d '\0' | grep -v "^\s*$"

I.e.

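function tidy() {
  # Note: this name shadows the tidy binary in the current shell.
  # Quote "$1" so paths with spaces work; sed reads the file directly.
  sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' "$1" | grep -zv '^<!--' | tr -d '\0' | grep -v '^\s*$'
}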

tidy server.xml

... will print the xml without comments.

NOTE: while this works reasonably well for simple files, it fails on CDATA sections that contain comment delimiters and in some other situations. Only use it on XML you control that does not, and never will, contain a literal <!-- or --> outside a real comment!

The first sed marks each comment's start and end with 0x0 (NUL) characters. grep -z then treats 0x0 as the only line delimiter and matches the records that start with a comment, and -v inverts the filter, leaving only the meaningful records. Finally, tr -d '\0' deletes the 0x0 markers, and, to polish it up, another grep removes the empty lines: voilà.
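
To make the stages concrete, here is a small sketch that runs the same pipeline on inline sample input: first on a document with a multi-line comment, then on a CDATA section that happens to contain comment delimiters. It assumes GNU sed and GNU grep (for \x0 and -z). The function name and the sample data are made up for illustration.

```shell
# Same pipeline as above, but reading XML from stdin (hypothetical name).
strip_comments_quick() {
  sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' |  # fence each comment with NUL bytes
    grep -zv '^<!--' |                      # drop NUL-delimited records that start a comment
    tr -d '\0' |                            # remove the NUL markers again
    grep -v '^\s*$'                         # drop lines that are now blank
}

# Case 1: the multi-line comment and the blank line it leaves are removed.
printf '%s\n' '<Server>' '  <!-- a' '       multi-line comment -->' \
  '  <Listener className="x"/>' '</Server>' | strip_comments_quick

# Case 2: the text inside CDATA is not a comment, yet it is deleted as well.
# With GNU tools this prints: <a><![CDATA[x  y]]></a>
printf '%s\n' '<a><![CDATA[x <!-- keep me --> y]]></a>' | strip_comments_quick
```

The second case is the kind of silent data loss the note above warns about: the pipeline only sees byte patterns, not XML structure.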

The best way would be to use an XML parser to handle all the obscure corner cases correctly. But if you need something quick and dirty, there are a variety of short solutions using Perl regexes which may be sufficient.

Licensed under: CC-BY-SA with attribution
