Question

I have a question about XML parsing. I have tags with spaces in them, e.g.:

<item1 id=rt name ="th">
<point1>1254</point1>
<point2>1254</point2>
</item>

How do I extract the id and name from these tags?

I'm currently using R, since I need it for the rest of my analysis, but I can also do the file parsing in Perl or Python. What is the best solution?


Solution

For example, you can do this with the XML package:

tt <- '<?xml version="1.0" encoding="utf-8"?>
<item id="rt" name ="th">
  <point1>1254</point1>
  <point2>1254</point2>
</item>
'

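library(XML)
# Parse the string into a document object before querying it
doc <- xmlParse(tt, asText = TRUE)
xpathSApply(doc,'//item',xmlGetAttr,'id')
[1] "rt"
xpathSApply(doc,'//item',xmlGetAttr,'name')
[1] "th"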
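Since the question also allows Python, here is a minimal sketch of the same idea using the standard library's xml.etree.ElementTree. It assumes the input is well-formed XML, exactly as in the R example; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Same well-formed document as in the R example.
tt = """<?xml version="1.0" encoding="utf-8"?>
<item id="rt" name ="th">
  <point1>1254</point1>
  <point2>1254</point2>
</item>
"""

# Parse from bytes so the encoding declaration is honoured by the parser.
root = ET.fromstring(tt.encode("utf-8"))

# Attributes of the root <item> element as a plain dict.
attrs = dict(root.attrib)

# Child elements mapped to their text content.
points = {child.tag: child.text for child in root}

print(root.get("id"), root.get("name"))  # rt th
```

This only works once the attribute values are quoted and the opening and closing tags match; the raw sample from the question would raise a parse error.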

EDIT

If your data is not well-formed XML, you should either reformat it as I did above, or read it line by line and extract the information with a regular expression (although using regex on XML tags is not recommended).

tt <- '<item1 id=rt name ="th">
<point1>1254</point1>
<point2>1254</point2>
</item>
'

ll <- readLines(textConnection(tt))
gsub('.*id=(.*)[ ]name.*','\\1',ll[1])
[1] "rt"
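The line-by-line approach could also be sketched in Python, which the question mentions as an option. This is a rough illustration rather than a real parser: the pattern ATTR_RE and the helper parse_attributes are hypothetical names, and the sketch assumes attribute values are either double-quoted or a bare token without spaces.

```python
import re

# Malformed sample from the question: unquoted id value, mismatched closing tag.
tt = """<item1 id=rt name ="th">
<point1>1254</point1>
<point2>1254</point2>
</item>
"""

# Attribute name, optional spaces around '=', then either a double-quoted
# value or a bare token that stops at whitespace, a quote, or '>'.
ATTR_RE = re.compile(r'([\w:-]+)\s*=\s*(?:"([^"]*)"|([^\s">]+))')


def parse_attributes(line):
    """Return a dict of attribute name -> value found on one line."""
    attrs = {}
    for name, quoted, bare in ATTR_RE.findall(line):
        # Exactly one of the two value groups is non-empty for a given match
        # (both are empty only for an empty quoted value such as id="").
        attrs[name] = quoted or bare
    return attrs


first_line = tt.splitlines()[0]
attrs = parse_attributes(first_line)
print(attrs)  # {'id': 'rt', 'name': 'th'}
```

Unlike the gsub call, this keeps the attribute names alongside the values, so both id and name come out in one pass.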

Other tips

How about a regex?

/=\K\W?\K\w+/g

=\K matches the = but drops it from the match.

\W?\K matches an optional non-word character (such as the quotation mark before the value) and drops it as well.

\w+ is the attribute value.

You can read the file line by line and save your matches into an array, something like:

my @matches = $line =~ /=\K\W?\K\w+/g;

And then use $matches[0], $matches[1], and so on to access the individual elements.

Here is the regex in action if you want to play with it further: http://regexr.com?37im8
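If you would rather do this in Python, note that the standard re module does not support \K. A rough equivalent, offered as a sketch, is to put the \w+ part in a capturing group so that findall returns only the values. The names VALUE_RE and matches are illustrative.

```python
import re

# Opening tag from the question, with one unquoted and one quoted value.
line = '<item1 id=rt name ="th">'

# Perl: /=\K\W?\K\w+/g
# Python's re has no \K, so capture the value instead of resetting the match.
VALUE_RE = re.compile(r'=\W?(\w+)')

# With a single group, findall returns the captured text of every match.
matches = VALUE_RE.findall(line)
print(matches)  # ['rt', 'th']
```

As with the Perl version, this returns the values in order of appearance without their attribute names, and it will stop at the first non-word character inside a value.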

License: CC-BY-SA with attribution
Not affiliated with StackOverflow