Question

I am trying to change some XML tags in a file to make them easier to read into R, but some of the tags have the same name, which xmlToDataFrame does not seem to like. See below:

<DATE calender="Western">1996-06-22</DATE>
<DATE calender="Persian">1375/04/02</DATE>
<CAT xml:lang="fa">ادب و هنر</CAT>
<CAT xml:lang="en">Literature and Art</CAT>

I'm trying to rename some of the tags using regular expressions so that they look more like this:

<DATE_Western>1996-06-22</DATE_Western>
<DATE_Persian>1375/04/02</DATE_Persian>
<CAT_Persian>ادب و هنر</CAT_Persian>
<CAT_English>Literature and Art</CAT_English>

I tried using a positive lookbehind, but I would need a quantifier inside it to match the rest of the opening tag, and variable-length lookbehinds don't seem to be supported by many regex implementations.

Any suggestions?

Also, what is the best command line tool for doing search and replace on a large number of files (sed, awk)?

Thanks!

Solution

Using GNU awk for gensub():

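$ gawk '
BEGIN {
    # Map language codes to the names wanted in the new tags.
    map["fa"]="Persian"
    map["en"]="English"
}
{
    # Replace the code in xml:lang="..." with its mapped name.
    for (abbr in map)
        $0 = gensub("(xml:lang=\")" abbr "(\")","\\1" map[abbr] "\\2",1)
    # Append the attribute value to the opening and closing tag names.
    $0 = gensub(/(<[^[:space:]]+)[^"]+"([^"]+)">(.*)>$/,"\\1_\\2>\\3_\\2>",1)
}
1' file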
<DATE_Western>1996-06-22</DATE_Western>
<DATE_Persian>1375/04/02</DATE_Persian>
<CAT_Persian>ادب و هنر</CAT_Persian>
<CAT_English>Literature and Art</CAT_English>

Other tips

You can do this without lookbehinds; a plain substitution with capture groups is enough.

Regex

<(\w+)[^"]*"(.*?)">(.*?)<\/\1>

Replacement

<\1_\2>\3</\1_\2>

Example

http://regex101.com/r/kM0rA4
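
For trying the substitution outside regex101, here is a minimal sketch that ports the pattern to sed. It assumes a sed that accepts -E and allows back-references inside an extended regex (GNU sed does). Since sed has no lazy quantifiers, [^"]* stands in for .*? when capturing the attribute value. The function name rename_tags and the variables input and result are made up for illustration. No language-code mapping is applied here, so the attribute value is reused as-is.

```shell
# Two sample lines taken from the question.
input='<DATE calender="Western">1996-06-22</DATE>
<CAT xml:lang="en">Literature and Art</CAT>'

# Group 1 = tag name, group 2 = attribute value, group 3 = element content.
# The back-reference \1 in the pattern forces the closing tag to match the opening one.
rename_tags() {
    sed -E 's|<([[:alnum:]_]+)[^"]*"([^"]*)">(.*)</\1>|<\1_\2>\3</\1_\2>|'
}

result=$(printf '%s\n' "$input" | rename_tags)
printf '%s\n' "$result"
```

With GNU sed this should print <DATE_Western>1996-06-22</DATE_Western> followed by <CAT_en>Literature and Art</CAT_en>; the language codes would still need to be mapped to get CAT_English.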

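You can also try this sed command. The first two substitutions map the language codes, matching the whole attribute so that an "fa" or "en" elsewhere on the line is left alone; the third moves the attribute value into the tag names:

sed 's/ xml:lang="fa"/ xml:lang="Persian"/; s/ xml:lang="en"/ xml:lang="English"/; s|^<\([^ ]*\) [^=]*="\([^"]*\)">\(.*\)</\1>|<\1_\2>\3</\1_\2>|' file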

Output:

<DATE_Western>1996-06-22</DATE_Western>
<DATE_Persian>1375/04/02</DATE_Persian>
<CAT_Persian>ادب و هنر</CAT_Persian>
<CAT_English>Literature and Art</CAT_English>
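
As for the second part of the question, one common way to run a search and replace over many files is to combine find with sed's in-place option. The sketch below is only an illustration under a few assumptions: it uses a variant of the sed approach above with patterns anchored on the whole attribute, the temporary directory and the file names a.xml and b.xml are invented, and -i.bak (edit in place, keeping a .bak copy) is assumed to be available, as it is in GNU and BSD sed.

```shell
# Work in a throwaway directory so the sketch is safe to run as-is.
workdir=$(mktemp -d)
printf '%s\n' '<DATE calender="Western">1996-06-22</DATE>' > "$workdir/a.xml"
printf '%s\n' '<CAT xml:lang="en">Literature and Art</CAT>' > "$workdir/b.xml"

# Edit every .xml file in place; -i.bak keeps a backup copy of each original.
find "$workdir" -name '*.xml' -exec sed -i.bak \
    -e 's/ xml:lang="fa"/ xml:lang="Persian"/' \
    -e 's/ xml:lang="en"/ xml:lang="English"/' \
    -e 's|^<\([^ ]*\) [^=]*="\([^"]*\)">\(.*\)</\1>|<\1_\2>\3</\1_\2>|' {} +

# Show the rewritten files.
cat "$workdir/a.xml" "$workdir/b.xml"
```

Keeping the backups until the results have been checked makes it easy to undo a bad pattern across the whole set of files.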
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow