Replacing unique identifiers in a file

https://stackoverflow.com/questions/19846909

29-07-2022

Question

I have an xml file that looks like this:

<species compartment="compartment" id="alpha_dash_D_dash_glucose_dash_6P" initialAmount="0" hasOnlySubstanceUnits="true" constant="false" boundaryCondition="false">
     </species>
     <species compartment="compartment" id="six_dash_Phospho_dash_D_dash_gluconate" initialAmount="0" hasOnlySubstanceUnits="true" constant="false" boundaryCondition="false">
     </species>
     <species compartment="compartment" id="beta_dash_D_dash_Fructose_dash_6P2" initialAmount="0" hasOnlySubstanceUnits="true" constant="false" boundaryCondition="false">
     </species>
     <species compartment="compartment" id="beta_dash_D_dash_Glucose" initialAmount="0" hasOnlySubstanceUnits="true" constant="false" boundaryCondition="false">
     </species>

I want to replace the value of each id attribute with my own identifier, so that the end file looks like this:

<species compartment="compartment" id="id1" initialAmount="0" hasOnlySubstanceUnits="true" constant="false" boundaryCondition="false">
     </species>
     <species compartment="compartment" id="id2" initialAmount="0" hasOnlySubstanceUnits="true" constant="false" boundaryCondition="false">
     </species>
     <species compartment="compartment" id="id3" initialAmount="0" hasOnlySubstanceUnits="true" constant="false" boundaryCondition="false">
     </species>
     <species compartment="compartment" id="id4" initialAmount="0" hasOnlySubstanceUnits="true" constant="false" boundaryCondition="false">

However the id attribute is referenced in other places in the file:

 <speciesReference constant="true" stoichiometry="1" species="alpha_dash_D_dash_glucose_dash_6P">

this line should be updated to:

 <speciesReference constant="true" stoichiometry="1" species="id1">

I tried using sed with 's/id="(*)"/id="$IdCOUNTER"/g', but this makes all id attributes the same. How can I solve this? Any help is appreciated, thank you.

Solution

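# Build a map file whose lines read "idN original_id" (\s and \+ need GNU sed).
sed -n 's/\s*<species [^>]* id="\([^"]*\).*/\1/p' species.xml |\
  cat -n |\
  sed 's/\s*\([0-9]\+\)\s*/id\1 /' > ids.txt

# Work on a copy so the original file is left untouched.
cp species.xml my_species.xml

# Replace every quoted occurrence of each original id with its numbered id.
while read -r new old
do
  sed -i 's/"'"$old"'"/"'"$new"'"/g' my_species.xml
done < ids.txt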

Assuming your XML file is nicely formatted (i.e., each tag is all on one line), you can get away with sed and bash. Otherwise, you'll need a language with an XML parser. The same approach will work, but the details will vary.

Make a map of ids to replacements. Then, each time you encounter an id you've seen before, you look it up and replace it.

The first pipeline above extracts each id from a <species> tag and pairs it with a numbered id, one pair per line in ids.txt (the backslashes allow the command to be split over several lines for readability).
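To make the format of the map file concrete, here is a small sketch that runs the same extraction pipeline on a two-species sample and prints the result. The file name sample.xml and the trimmed attributes are assumptions made for illustration, and the patterns rely on GNU sed.

```shell
#!/usr/bin/env bash
# Sketch: run the id-extraction pipeline on a small sample and show the map.
# Assumes GNU sed (for \s and \+); sample.xml is a hypothetical file name.
set -eu

cat > sample.xml <<'EOF'
<species compartment="compartment" id="alpha_dash_D_dash_glucose_dash_6P" initialAmount="0">
     </species>
     <species compartment="compartment" id="beta_dash_D_dash_Glucose" initialAmount="0">
     </species>
EOF

# 1. print only the id of each <species> line
# 2. number the ids with cat -n
# 3. turn the "     N<TAB>" prefix into "idN "
sed -n 's/\s*<species [^>]* id="\([^"]*\).*/\1/p' sample.xml |
  cat -n |
  sed 's/\s*\([0-9]\+\)\s*/id\1 /' > ids.txt

cat ids.txt
# Each line pairs a new id with the original one:
# id1 alpha_dash_D_dash_glucose_dash_6P
# id2 beta_dash_D_dash_Glucose
```

The new id comes first on each line so that a later `read` can split the line on the first space.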

The file is copied to prevent modifying the original.

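As each line is read from the id map file, every quoted occurrence of the original id is replaced with the new, numbered id. Matching the surrounding double quotes prevents an id that is a prefix of a longer id from being partially replaced.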
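The loop rewrites the whole file once per id. A possible variation, not part of the original answer, is to turn the map into a sed script and apply every substitution in a single pass. The sketch below assumes the ids contain no characters that are special to sed (such as / or .) and that no original id already looks like "idN". The file names sample.xml, map.sed and renamed.xml are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build the id map once, then apply all replacements in one sed pass.
# sample.xml, map.sed and renamed.xml are hypothetical file names.
set -eu

cat > sample.xml <<'EOF'
<species compartment="compartment" id="alpha_dash_D_dash_glucose_dash_6P" initialAmount="0">
</species>
<species compartment="compartment" id="beta_dash_D_dash_Glucose" initialAmount="0">
</species>
<speciesReference constant="true" stoichiometry="1" species="alpha_dash_D_dash_glucose_dash_6P">
EOF

# Extract each <species> id and emit one substitution command per id, e.g.
#   s/"alpha_dash_D_dash_glucose_dash_6P"/"id1"/g
sed -n 's/.*<species [^>]* id="\([^"]*\)".*/\1/p' sample.xml |
  awk '{ printf "s/\"%s\"/\"id%d\"/g\n", $0, NR }' > map.sed

# Apply all substitutions in a single pass; the input file is not modified.
sed -f map.sed sample.xml > renamed.xml
cat renamed.xml
```

For a handful of ids the difference is negligible, but with thousands of ids one pass avoids rewriting the file thousands of times.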

Licensed under: CC-BY-SA with attribution
