Question

I have a few XML files in which some users have added extra spaces in the middle of lines (inside element tags or text nodes), and this makes it really hard to compare multiple versions of the files.

Example (XML file):

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author   >
      <title>XML Developer's Guide      </title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
   <book id="bk102"     >
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
</catalog>

As you can see in the example above, the closing author tag and the title text node in the first book element have extra spaces. Similarly, the opening tag of the second book element has extra spaces.

I want a regular expression that finds this kind of whitespace (more than one adjacent whitespace character) but skips the leading whitespace. If I also replace the leading whitespace at the start of each line with a single space, the indentation will be lost.

There are other ways I can handle this (such as first removing all runs of two or more spaces and then running xmllint --format on the file), but it would be helpful if someone could give me a regular expression for spaces in the middle of lines.

I tried combinations of ^, \s and ^\s, but I cannot seem to find the solution, so any suggestion would be really helpful. (Multiple spaces in text nodes are incorrect values under our project's design, so removing them will not cause any adverse effect.)

Solution

This might work for you (GNU sed):

sed -r 's/(\S)\s+([<>])/\1\2/g' file

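This matches a non-whitespace character, followed by one or more whitespace characters, followed by a < or a >, and deletes the whitespace in between. The g flag applies the substitution to every match on the line. Because each match must start with a non-whitespace character, the leading indentation is left untouched. Note that this only removes whitespace directly before a < or >; runs of spaces elsewhere in a text node are not changed.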
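For comparison, here is a minimal Python sketch of the same idea, with one addition that is an assumption about what the asker wants: runs of two or more spaces or tabs elsewhere in the middle of a line are collapsed to a single space. The names tidy_line, BEFORE_BRACKET and INNER_RUN are made up for this illustration. It works on raw text and does not parse XML, so whitespace inside attribute values, comments, or CDATA sections is treated the same way.

```python
import re

# Whitespace that follows a non-whitespace character and precedes < or >.
# This mirrors the sed expression s/(\S)\s+([<>])/\1\2/g.
BEFORE_BRACKET = re.compile(r"(\S)[ \t]+([<>])")

# Two or more spaces/tabs that follow a non-whitespace character.
# Assumption: such mid-line runs should collapse to a single space.
INNER_RUN = re.compile(r"(?<=\S)[ \t]{2,}")


def tidy_line(line: str) -> str:
    """Remove extra mid-line whitespace while keeping leading indentation."""
    # Drop whitespace sitting directly before a tag delimiter.
    line = BEFORE_BRACKET.sub(r"\1\2", line)
    # Collapse the remaining mid-line runs; the lookbehind requires a
    # preceding non-whitespace character, so indentation never matches.
    return INNER_RUN.sub(" ", line)


if __name__ == "__main__":
    sample = [
        "      <author>Gambardella, Matthew</author   >",
        "      <title>XML Developer's Guide      </title>",
        '   <book id="bk102"     >',
    ]
    for raw in sample:
        print(tidy_line(raw))
```

Both patterns require a non-whitespace character before the run, which is what keeps the indentation at the start of each line intact.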

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow