Parsing XML using unix terminal

https://stackoverflow.com/questions/29004

09-06-2019
|

Question

Sometimes I need to quickly extract some arbitrary data from XML files to put into a CSV format. What's your best practices for doing this in the Unix terminal? I would love some code examples, so for instance how can I get the following problem solved?

Example XML input:

<root>
<myel name="Foo" />
<myel name="Bar" />
</root>

My desired CSV output:

Foo,
Bar,

Solution

If you just want the name attributes of any element, here is a quick but incomplete solution.

(Your example text is in the file example)

grep "name" example | cut -d"\"" -f2,2 | xargs -I{} echo "{},"

OTHER TIPS

Peter's answer is correct, but it outputs a trailing line feed.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="root">
    <xsl:for-each select="myel">
      <xsl:value-of select="@name"/>
      <xsl:text>,</xsl:text>
      <xsl:if test="not(position() = last())">
        <xsl:text>&#xA;</xsl:text>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

Just run e.g.

xsltproc stylesheet.xsl source.xml

to generate the CSV results into standard output.

Use a command-line XSLT processor such as xsltproc, saxon or xalan to parse the XML and generate CSV. Here's an example, which for your case is the stylesheet:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>

    <xsl:template match="root">
        <xsl:apply-templates select="myel"/>
    </xsl:template>

    <xsl:template match="myel">
        <xsl:for-each select="@*">
            <xsl:value-of select="."/>
            <xsl:value-of select="','"/>
        </xsl:for-each>
        <xsl:text>&#10;</xsl:text>
    </xsl:template> 
</xsl:stylesheet>

XMLStarlet is a command line toolkit to query/edit/check/transform XML documents (for more information see http://xmlstar.sourceforge.net/)

No files to write, just pipe your file to xmlstarlet and apply an xpath filter.

cat file.xml | xml sel -t -m 'xpathExpression' -v 'elemName' 'literal' -v 'elname' -n

-m expression -v value '' included literal -n newline

So for your xpath the xpath expression would be //myel/@name which would provide the two attribute values.

Very handy tool.

Here's a little ruby script that does exactly what your question asks (pull an attribute called 'name' out of elements called 'myel'). Should be easy to generalize

#!/usr/bin/ruby -w

require 'rexml/document'

xml = REXML::Document.new(File.open(ARGV[0].to_s))
xml.elements.each("//myel") { |el| puts "#{el.attributes['name']}," if el.attributes['name'] }

Answering the original question, assuming xml file is "test.xml" that contains:

<root> <myel name="Foo" /> <myel name="Bar" /> </root>

cat text.xml | tr -s "\"" " " | awk '{printf "%s,\n", $3}'

your test file is in test.xml.

sed -n 's/^\s`*`&lt;myel\s`*`name="\([^"]`*`\)".`*`$/\1,/p' test.xml

It has it's pitfalls, for example if it is not strictly given that each myel is on one line you have to "normalize" the xml file first (so each myel is on one separate line)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow