Question

I have an XML file of a book. The main tree has the levels Body/Pagecolumn/Region/Paragraph/Line/Word. However, I am not interested in the Line level. Is there a way to collapse the Line level without destroying the Word level, in R, using the XML package or any other package? After the conversion, the main tree would be Body/Pagecolumn/Region/Paragraph/Word.

A sample of the XML data is provided below:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE DjVuXML>
<DjVuXML>
<BODY>
<OBJECT data="file://localhost//book1.djvu" height="1650" type="image/x.djvu" usemap="book1.djvu" width="1275">
<PARAM name="PAGE" value="book1_001.djvu"/>
<PARAM name="DPI" value="300"/>
<HIDDENTEXT>
<PAGECOLUMN>
<REGION>
<PARAGRAPH>
<LINE>
<WORD coords="1,2,3,4,5">Title</WORD>
</LINE>
</PARAGRAPH>
</REGION>
</PAGECOLUMN>
<PAGECOLUMN>
<REGION>
<PARAGRAPH>
<LINE>
<WORD coords="30,564,90,545,559">This</WORD>
<WORD coords="97,559,109,545,559">is</WORD>
<WORD coords="115,564,162,545,559">a</WORD>
</LINE>
</PARAGRAPH>
<PARAGRAPH>
<LINE>
<WORD coords="30,589,80,570,584">First</WORD>
<WORD coords="88,584,115,570,584">line</WORD>
<WORD coords="123,584,146,574,584">is</WORD>
</LINE>
<LINE>
<WORD coords="30,614,90,598,609">Second</WORD>
<WORD coords="97,609,143,595,609">line</WORD>
<WORD coords="148,614,168,595,609">is</WORD>
</LINE>
<LINE>
<WORD coords="30,640,56,626,640">Third</WORD>
<WORD coords="63,640,95,626,640">line</WORD>
<WORD coords="101,640,128,626,640">is</WORD>
</LINE>
</PARAGRAPH>
</REGION>
</PAGECOLUMN>
</HIDDENTEXT>
</OBJECT>
<MAP name="book1.djvu"/>
</BODY>
</DjVuXML>

Thanks.


Solution

I like the easy regexp solutions, and in this case they are probably the way to go. In general, though, the usual tool for transforming XML is XSLT, a language designed for exactly that. The R package Sxslt can apply XSLT stylesheets. The idea is to define two templates:

  1. The first template is what's known as the identity transform. It copies all attributes and nodes. If a more specific template matches a particular element, XSLT uses that one instead.
  2. The second template matches LINE specifically. It does not copy the LINE element itself and only processes its children. So every node and attribute except the LINE wrapper is copied to the output.

Here is my code:

# install package if needed
# install.packages('Sxslt', repos = "http://www.omegahat.org/R")
require(Sxslt)
# define a transformation
sltTemp <- '<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
 <xsl:template match="@* | node()">
   <xsl:copy>
     <xsl:apply-templates select="@* | node()"/>
   </xsl:copy>
</xsl:template>

<xsl:template match="LINE">
   <xsl:apply-templates/>
</xsl:template>

</xsl:stylesheet>'

# assume your XML is text variable named xdata
# you can also work on a parsed file if you like
# xD <- xmlParse(xdata)
# xsltApplyStyleSheet(xD, sltTemp)
# gives same result

require(XML)
newxdata <- saveXML(xsltApplyStyleSheet(xdata, sltTemp))
xmlParse(newxdata)

The result:

<?xml version="1.0"?>
<DjVuXML>
  <BODY>
    <OBJECT data="file://localhost//book1.djvu" height="1650" type="image/x.djvu" usemap="book1.djvu" width="1275">
      <PARAM name="PAGE" value="book1_001.djvu"/>
      <PARAM name="DPI" value="300"/>
      <HIDDENTEXT>
        <PAGECOLUMN>
          <REGION>
            <PARAGRAPH>
              <WORD coords="1,2,3,4,5">Title</WORD>
            </PARAGRAPH>
          </REGION>
        </PAGECOLUMN>
        <PAGECOLUMN>
          <REGION>
            <PARAGRAPH>
              <WORD coords="30,564,90,545,559">This</WORD>
              <WORD coords="97,559,109,545,559">is</WORD>
              <WORD coords="115,564,162,545,559">a</WORD>
            </PARAGRAPH>
            <PARAGRAPH>
              <WORD coords="30,589,80,570,584">First</WORD>
              <WORD coords="88,584,115,570,584">line</WORD>
              <WORD coords="123,584,146,574,584">is</WORD>
              <WORD coords="30,614,90,598,609">Second</WORD>
              <WORD coords="97,609,143,595,609">line</WORD>
              <WORD coords="148,614,168,595,609">is</WORD>
              <WORD coords="30,640,56,626,640">Third</WORD>
              <WORD coords="63,640,95,626,640">line</WORD>
              <WORD coords="101,640,128,626,640">is</WORD>
            </PARAGRAPH>
          </REGION>
        </PAGECOLUMN>
      </HIDDENTEXT>
    </OBJECT>
    <MAP name="book1.djvu"/>
  </BODY>
</DjVuXML>
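
Sxslt is distributed through Omegahat rather than CRAN and can be awkward to install. As a possible alternative, the same two templates can be applied with the CRAN packages xml2 and xslt. This is only a minimal sketch: it assumes both packages are installed, and the shortened `xml_src` document is made up for illustration.

```r
library(xml2)
library(xslt)

# Shortened, made-up sample with the same nesting as the question's file
xml_src <- '<BODY><PARAGRAPH>
<LINE><WORD coords="1,2,3,4,5">First</WORD><WORD coords="6,7,8,9,10">line</WORD></LINE>
<LINE><WORD coords="11,12,13,14,15">Second</WORD></LINE>
</PARAGRAPH></BODY>'

# Identity copy plus a LINE rule that keeps only the children of LINE
style_src <- '<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="@* | node()">
    <xsl:copy><xsl:apply-templates select="@* | node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="LINE">
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>'

# xml_xslt() takes the parsed document and the parsed stylesheet
result <- xml_xslt(read_xml(xml_src), read_xml(style_src))

# WORD elements should now be direct children of PARAGRAPH
words <- xml_text(xml_find_all(result, "/BODY/PARAGRAPH/WORD"))
print(words)
```

For a real file, `read_xml("book1.xml")` (a hypothetical file name) would replace `read_xml(xml_src)`, and `write_xml()` can save the result.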

Other tips

The proper way would be to use the XML package and fuse the nodes in the parsed tree.
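
A node-level version could look roughly like the sketch below. It uses the xml2 package instead of XML (an assumption on my part, chosen because its functions for moving nodes are simple), and the small `xml_src` document is made up for illustration. Each WORD is copied in front of its LINE parent, then the LINE is removed.

```r
library(xml2)

# Shortened, made-up sample with the same nesting as the question's file
xml_src <- '<BODY><PARAGRAPH>
<LINE><WORD coords="1,2,3,4,5">First</WORD><WORD coords="6,7,8,9,10">line</WORD></LINE>
<LINE><WORD coords="11,12,13,14,15">Second</WORD></LINE>
</PARAGRAPH></BODY>'

doc <- read_xml(xml_src)

for (line in xml_find_all(doc, "//LINE")) {
  # Copy each child element in front of its LINE parent; inserting
  # "before" the LINE every time keeps the original word order
  for (word in xml_children(line)) {
    xml_add_sibling(line, word, .where = "before")
  }
  # Drop the LINE wrapper together with the original children
  xml_remove(line)
}

words <- xml_find_all(doc, "/BODY/PARAGRAPH/WORD")
print(xml_text(words))
print(xml_attr(words, "coords"))
```

Because the document is modified in place, `write_xml(doc, "out.xml")` (a hypothetical output path) would save the flattened tree.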

However, for the sample XML that you have provided, you might be able to get away with a simple gsub (find and replace).

Something along the lines of:

xmlfile <- readLines("test.xml")
newfile <- gsub("<LINE>|</LINE>", "", xmlfile)

And go from there.
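
Note that the pattern `<LINE>|</LINE>` only matches tags without attributes. If the real file has LINE tags that carry attributes, a slightly more tolerant pattern may help. The sketch below uses base R only; the `id="l1"` attribute and the sample lines are made up for illustration, and the approach still assumes LINE never appears inside comments or CDATA sections.

```r
# Made-up sample lines; one LINE tag carries a hypothetical attribute
xml_lines <- c(
  "<PARAGRAPH>",
  '<LINE id="l1">',
  '<WORD coords="1,2,3,4,5">First</WORD>',
  "</LINE>",
  "<LINE>",
  '<WORD coords="6,7,8,9,10">Second</WORD>',
  "</LINE>",
  "</PARAGRAPH>"
)

# Opening or closing LINE tag, optionally followed by attributes
line_tag <- "</?LINE([[:space:]][^>]*)?>"

# Remove the tags, then drop the lines that are left empty
stripped <- gsub(line_tag, "", xml_lines)
cleaned <- stripped[nzchar(trimws(stripped))]
print(cleaned)
```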

A similar solution uses grepl to drop every line of text that contains a LINE tag (this assumes each LINE tag sits on its own line, as in the sample):

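library(XML)
# txt is assumed to hold the whole XML document as one string
ll <- readLines(textConnection(txt))
# drop the lines that hold an opening or closing LINE tag
ll <- ll[!grepl("<LINE>|</LINE>", ll)]
txt <- paste(ll, collapse = "\n")
xmlParse(txt, asText = TRUE)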
Licensed under: CC-BY-SA with attribution