Question

I am trying to change some XML tags in a file to make them easier to read into R, but some of the tags are the same, which xmlToDataFrame does not seem to like. See below:

<DATE calender="Western">1996-06-22</DATE>
<DATE calender="Persian">1375/04/02</DATE>
<CAT xml:lang="fa">ادب و هنر</CAT>
<CAT xml:lang="en">Literature and Art</CAT>

I'm trying to rename some of the tags using regular expressions so that it looks more like this:

<DATE_Western>1996-06-22</DATE_Western>
<DATE_Persian>1375/04/02</DATE_Persian>
<CAT_Persian>ادب و هنر</CAT_Persian>
<CAT_English>Literature and Art</CAT_English>

I tried using a positive lookbehind, but I would need a quantifier inside the lookbehind to get past the tag's attributes, and most regex implementations do not seem to support that.

Any suggestions?

Also, what is the best command-line tool for doing search and replace on a large number of files (sed, awk)?

Thanks!

Solution

Using GNU awk for gensub():

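$ gawk '
BEGIN {
    # Map xml:lang codes to the names wanted in the tags
    map["fa"]="Persian"
    map["en"]="English"
}
{
    # Expand the language code inside the xml:lang attribute
    for (abbr in map)
        $0 = gensub("(xml:lang=\")" abbr "(\")","\\1" map[abbr] "\\2",1)
    # Move the attribute value into the opening and closing tag names
    $0 = gensub(/(<[^[:space:]]+)[^"]+"([^"]+)">(.*)>$/,"\\1_\\2>\\3_\\2>",1)
}
1' file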
<DATE_Western>1996-06-22</DATE_Western>
<DATE_Persian>1375/04/02</DATE_Persian>
<CAT_Persian>ادب و هنر</CAT_Persian>
<CAT_English>Literature and Art</CAT_English>

Other suggestions

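You can do this without lookbehinds; a plain substitution with capture groups is enough. The attribute value is copied into the tag name as-is, so the language codes come out as CAT_fa and CAT_en unless they are mapped to full names in a separate step.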

Regex

<(\w+)[^"]*"(.*?)">(.*?)<\/\1>

Replacement

<\1_\2>\3</\1_\2>

Example

http://regex101.com/r/kM0rA4
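
As a rough illustration, the substitution above can be applied in Python with the re module. This is only a sketch: the LANG_NAMES table and the rename_tags function are names made up for this example, and the code-to-name mapping is an addition to the regex answer, taken from the desired output in the question.

```python
import re

# Hypothetical lookup table (not part of the regex answer): maps xml:lang
# codes to the names wanted in the tags. Other values pass through unchanged.
LANG_NAMES = {"fa": "Persian", "en": "English"}

# Same pattern as above: tag name, first quoted attribute value, element body.
TAG_RE = re.compile(r'<(\w+)[^"]*"(.*?)">(.*?)</\1>')


def rename_tags(text: str) -> str:
    """Fold the attribute value of each matching element into its tag name."""
    def repl(match: re.Match) -> str:
        name, value, body = match.groups()
        suffix = LANG_NAMES.get(value, value)
        return f"<{name}_{suffix}>{body}</{name}_{suffix}>"

    return TAG_RE.sub(repl, text)


sample = (
    '<DATE calender="Western">1996-06-22</DATE>\n'
    '<DATE calender="Persian">1375/04/02</DATE>\n'
    '<CAT xml:lang="fa">ادب و هنر</CAT>\n'
    '<CAT xml:lang="en">Literature and Art</CAT>'
)
print(rename_tags(sample))
```

Because "." does not match a newline by default, this only handles elements that sit on a single line, as in the sample. An attribute value containing spaces or other characters that are not legal in an XML name would produce an invalid tag, so the pattern is best kept to inputs shaped like the ones in the question.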

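You can also try this sed command. It first expands the language codes inside the xml:lang attribute only, so text elsewhere on the line is left alone, then moves the attribute value into the tag names:

sed 's/xml:lang="fa"/xml:lang="Persian"/; s/xml:lang="en"/xml:lang="English"/; s|^<\(.*\) .*="\(.*\)">\(.*\)<\(.*\)>|<\1_\2>\3<\4_\2>|g' file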

Output:

<DATE_Western>1996-06-22</DATE_Western>
<DATE_Persian>1375/04/02</DATE_Persian>
<CAT_Persian>ادب و هنر</CAT_Persian>
<CAT_English>Literature and Art</CAT_English>
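
For the second part of the question (running the same replacement over many files), one possible approach is a small Python script. This is a sketch under stated assumptions: the files are UTF-8 and match a glob such as "*.xml", and rewrite_files, rename_tags and LANG_NAMES are hypothetical names chosen for this example.

```python
import re
from pathlib import Path

# Hypothetical mapping from xml:lang codes to the names wanted in the tags.
LANG_NAMES = {"fa": "Persian", "en": "English"}
TAG_RE = re.compile(r'<(\w+)[^"]*"(.*?)">(.*?)</\1>')


def rename_tags(text: str) -> str:
    """Fold the attribute value of each single-line element into its tag name."""
    def repl(match: re.Match) -> str:
        name, value, body = match.groups()
        suffix = LANG_NAMES.get(value, value)
        return f"<{name}_{suffix}>{body}</{name}_{suffix}>"

    return TAG_RE.sub(repl, text)


def rewrite_files(directory: str, pattern: str = "*.xml") -> list[str]:
    """Rewrite matching files in place; return the names of files that changed."""
    changed = []
    for path in sorted(Path(directory).glob(pattern)):
        original = path.read_text(encoding="utf-8")
        updated = rename_tags(original)
        if updated != original:  # leave untouched files alone
            path.write_text(updated, encoding="utf-8")
            changed.append(path.name)
    return changed
```

Calling rewrite_files("corpus"), where "corpus" stands for whatever directory holds the files, overwrites them in place, so it is safer to try it on a copy first.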

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow