Question

I have a file named test.txt with the following content

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<test time="60" id="01">
<java.lang.String value="cat"/><java.lang.String value="dog"/>
<java.lang.String value="mouse"/>
<java.lang.String value="cow"/>
</test>

What I would like to do is edit the file so that whenever I find something like <java.lang.String value="something"/>, I change that part to <animal>something</animal>.

So for the previous example, after applying a script with a sed/awk/grep command, either the file content is changed or a new file is created, like the following:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<test time="60" id="01">
<animal>cat</animal><animal>dog</animal>
<animal>mouse</animal>
<animal>cow</animal>
</test>

I tried to extract that particular part using the following command:

$ less test.txt | grep -Po 'java.lang.String value="\K[^"]*' | awk -F: '{print "<animal>" $1 "</animal>"}'

This outputs only the changed parts, shown below, whereas I want them in place with the rest of the file left unchanged:

<animal>cat</animal>
<animal>dog</animal>
<animal>mouse</animal>
<animal>cow</animal>

I am new to scripting and don't know how to write the complete output to a file.

Solution

sed -r 's#<java.lang.String value="([^"]*)"/>#<animal>\1</animal>#g' test.txt

And you should not do XML transformations with regular expressions...

EDIT about how it works

By default sed uses "basic regular expressions", where many special characters have to be prefixed with \. The -r flag switches to "extended regular expressions", where the syntax is less cumbersome (-r is the GNU sed spelling; -E is the equivalent that BSD sed and newer GNU sed versions also accept). See OpenGroup for details.

By default sed prints every input line, whether or not a command modified it. The substitution command has the form s#search_regexp#replacement#flags. The delimiter can be almost any character, such as /, #, or ,. I chose # so it doesn't clash with the / character in the XML.
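To illustrate the delimiter choice, here is a small sketch. The sample line comes from the question; the variable names are only for illustration, and -E is used on the assumption that your sed accepts it. Both commands perform the same substitution, but with / as the delimiter every literal / has to be escaped:

```shell
# One sample line from the question's file.
line='<java.lang.String value="cat"/>'

# '/' as delimiter: each literal '/' must be written as '\/'.
with_slash=$(printf '%s\n' "$line" |
  sed -E 's/<java\.lang\.String value="([^"]*)"\/>/<animal>\1<\/animal>/g')

# '#' as delimiter: the slashes need no escaping.
with_hash=$(printf '%s\n' "$line" |
  sed -E 's#<java\.lang\.String value="([^"]*)"/>#<animal>\1</animal>#g')

# Both variables hold the same result.
printf '%s\n%s\n' "$with_slash" "$with_hash"
```

Both lines of output should read <animal>cat</animal>; the second command is simply easier to read.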

Then we match things like <java.lang.String value="anything_except_quotes"/>. The part that we want to reuse is wrapped in parentheses; this is called a capture group. In the replacement, \1 refers to whatever the first capture group matched. Strictly speaking, an unescaped . matches any character; write \. if you want to match only a literal dot.

The g flag makes sed replace all occurrences of the search pattern on each line, not only the first one.
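The effect of g is easiest to see on a line with two matches, like the third line of the question's file. A minimal sketch (variable names are illustrative, and -E is assumed to be supported by your sed):

```shell
# A line with two matches, as in the question's file.
line='<java.lang.String value="cat"/><java.lang.String value="dog"/>'

# Without g: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" |
  sed -E 's#<java\.lang\.String value="([^"]*)"/>#<animal>\1</animal>#')

# With g: every match on the line is replaced.
all_matches=$(printf '%s\n' "$line" |
  sed -E 's#<java\.lang\.String value="([^"]*)"/>#<animal>\1</animal>#g')

printf '%s\n%s\n' "$first_only" "$all_matches"
```

Without g, the dog element is left as <java.lang.String value="dog"/>; with g, both elements become <animal> elements.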

OTHER TIPS

OK, there are some problems with your command:

less test.txt | grep -Po 'java.lang.String value="\K[^"]*' | awk -F: '{print "<animal>" $1 "</animal>"}'

To begin with, there's a useless use of less; grep can take a file as a parameter:

grep -Po 'java.lang.String value="\K[^"]*' test.txt | awk -F: '{print "<animal>" $1 "</animal>"}'

Then, you're using grep -o, which prints only the parts of each line that match the pattern, so your sequence of commands explicitly keeps only the java.lang... values and drops everything else. A simpler solution is to use sed:

sed -r 's,<java.lang.String value="([^"]*)"\s*/>,<animal>\1</animal>,g' test.txt

This uses sed's substitution syntax to replace the match, while capturing what's between the parentheses ( and ) as \1 for use in the replacement. The [^"] part matches any character that is not a ", and the * operator applies the match 0 or more times. Likewise, \s matches a whitespace character (a GNU extension; [[:space:]] is the portable form), and the * after it repeats that 0 or more times.
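Since the question also asks how to get the complete output into a file, here is a hedged sketch of the whole round trip. It recreates the question's test.txt in a scratch directory; the name result.txt is made up for this example, and -E plus [[:space:]] are used on the assumption that you want something that also works outside GNU sed:

```shell
# Work in a scratch directory so nothing of yours is overwritten.
dir=$(mktemp -d)

# Recreate the question's test.txt.
cat > "$dir/test.txt" <<'EOF'
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<test time="60" id="01">
<java.lang.String value="cat"/><java.lang.String value="dog"/>
<java.lang.String value="mouse"/>
<java.lang.String value="cow"/>
</test>
EOF

# sed prints every line, changed or not, so redirecting its output
# captures the whole transformed file.
# [[:space:]]* is the POSIX spelling of \s*.
sed -E 's,<java\.lang\.String value="([^"]*)"[[:space:]]*/>,<animal>\1</animal>,g' \
  "$dir/test.txt" > "$dir/result.txt"

cat "$dir/result.txt"
```

Once you have checked result.txt, you can move it over the original. GNU sed also has an -i option that edits the file in place, but its argument syntax differs on BSD sed, so the redirect is the safer habit.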

A regex is an automaton that uses states and transitions to match a given string.

[Figure: regular expression visualization, with a demo of the regex on an example]

Though that simple regex works in your particular case, keep in mind that it is only a hack. You should instead use an XML parser and replace the nodes to match your needs, using XSLT, a language designed to transform an XML document into another one (or into something else).

To do that, you could use a tool such as xsltproc. Adapting an example that transforms all foo nodes into bar nodes in an XML tree, here's how to do it:

test.xsl:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!--Identity Template. This will copy everything as-is.-->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <!--Change "java.lang.String" element to "animal" element.-->
  <xsl:template match="java.lang.String">
    <animal>
      <!-- copy all attributes of java.lang.String (here, 'value') onto animal -->
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates/>
    </animal>
  </xsl:template>

</xsl:stylesheet>

run:

xsltproc test.xsl test.txt

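result (the value stays an attribute; to get <animal>cat</animal> as in the question, emit <xsl:value-of select="@value"/> inside <animal> instead of copying the attributes):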

<?xml version="1.0"?>
<test time="60" id="01">
  <animal value="cat"/>
  <animal value="dog"/>
  <animal value="mouse"/>
  <animal value="cow"/>
</test>

By the way, your XML looks like it was generated by Java, and there are multiple ways to apply that XSL from within your code, before you even need to handle it using command-line tools.

Licensed under: CC-BY-SA with attribution