I like the easy regexp solutions, and in this case they are probably the way to go. In general with XML we would look to use XSLT. This is a language for transforming XML. There is an R package, Sxslt, which can be used to transform XML. The idea is to define two templates:
- The first template is what's known as the identity transform. It copies all attributes and nodes. If there is a more specific template for a particular element, XSLT will use that instead.
- The second template is more specific: it matches LINE. It does not copy the LINE element itself and only applies templates to its children. So for all nodes and attributes except LINE the transform performs a copy, and the LINE wrapper is dropped while its WORD children are kept.
Here is my code:
# install package if needed
# install.packages('Sxslt', repos = "http://www.omegahat.org/R")
require(Sxslt)
# define a transformation
sltTemp <- '<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="LINE">
<xsl:apply-templates/>
</xsl:template>
</xsl:stylesheet>'
# load XML for xmlParse() and saveXML()
require(XML)
# assume your XML is a text variable named xdata
# you can also work on a parsed document if you like:
# xD <- xmlParse(xdata)
# xsltApplyStyleSheet(xD, sltTemp)
# this gives the same result
newxdata <- saveXML(xsltApplyStyleSheet(xdata, sltTemp))
xmlParse(newxdata)
<?xml version="1.0"?>
<DjVuXML>
<BODY>
<OBJECT data="file://localhost//book1.djvu" height="1650" type="image/x.djvu" usemap="book1.djvu" width="1275">
<PARAM name="PAGE" value="book1_001.djvu"/>
<PARAM name="DPI" value="300"/>
<HIDDENTEXT>
<PAGECOLUMN>
<REGION>
<PARAGRAPH>
<WORD coords="1,2,3,4,5">Title</WORD>
</PARAGRAPH>
</REGION>
</PAGECOLUMN>
<PAGECOLUMN>
<REGION>
<PARAGRAPH>
<WORD coords="30,290,65,276,290">This</WORD>
<WORD coords="73,290,84,276,290">is</WORD>
<WORD coords="92,290,100,280,290">a</WORD>
</PARAGRAPH>
<PARAGRAPH>
<WORD coords="30,290,65,276,290">First</WORD>
<WORD coords="73,290,84,276,290">line</WORD>
<WORD coords="92,290,100,280,290">is</WORD>
<WORD coords="30,290,65,276,290">Second</WORD>
<WORD coords="73,290,84,276,290">line</WORD>
<WORD coords="92,290,100,280,290">is</WORD>
<WORD coords="30,290,65,276,290">Third</WORD>
<WORD coords="73,290,84,276,290">line</WORD>
<WORD coords="92,290,100,280,290">is</WORD>
</PARAGRAPH>
</REGION>
</PAGECOLUMN>
</HIDDENTEXT>
</OBJECT>
<MAP name="book1.djvu"/>
</BODY>
</DjVuXML>
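To see the two templates at work on something smaller, here is a minimal sketch. It assumes the CRAN packages xml2 and xslt instead of Sxslt (a swap on my part, in case Sxslt is hard to install), and the toy PARAGRAPH document is made up for illustration.

```r
# Assumption: the CRAN packages xml2 and xslt are installed;
# they stand in for Sxslt in this sketch.
library(xml2)
library(xslt)

# Hypothetical toy input: two LINE wrappers around three WORD elements
doc <- read_xml(
  "<PARAGRAPH><LINE><WORD>First</WORD><WORD>line</WORD></LINE><LINE><WORD>Second</WORD></LINE></PARAGRAPH>"
)

# Same two templates as in the answer: identity transform plus a LINE override
style <- read_xml('<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="LINE">
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>')

# Apply the stylesheet to the document
result <- xml_xslt(doc, style)

# Count which elements survive the transform
n_line <- length(xml_find_all(result, "//LINE"))
n_word <- length(xml_find_all(result, "//WORD"))
cat("LINE elements:", n_line, " WORD elements:", n_word, "\n")
```

The LINE template wins over the identity template because a plain element name has a higher default priority than `node()`, so the counts should show the LINE wrappers gone and all WORD elements still in place.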