How to fuse a specific type of nodes in XML data in R?

Question 1

I like the easy regexp solutions, and in this case they are probably the way to go. In general, though, the standard tool for transforming XML is XSLT, a language designed for exactly that purpose. The R package Sxslt can apply XSLT stylesheets to XML. The idea is to define two templates:

The first template is what is known as the identity transform. It copies all attributes and nodes. If a more specific template matches a particular element, XSLT uses that one instead.
The second template is that more specific match, for LINE. It does not copy the LINE element itself; it only applies the templates to its children. So every node and attribute is copied except the LINE wrappers, whose contents move up into the parent.

Here is my code:

# install package if needed
# install.packages('Sxslt', repos = "http://www.omegahat.org/R")
require(Sxslt)
# define a transformation
sltTemp <- '<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
 <xsl:template match="@* | node()">
   <xsl:copy>
     <xsl:apply-templates select="@* | node()"/>
   </xsl:copy>
</xsl:template>

<xsl:template match="LINE">
   <xsl:apply-templates/>
</xsl:template>

</xsl:stylesheet>'

# xdata is assumed to hold your XML as a single text string.
# You can also work on a parsed document, which gives the same result:
# xD <- xmlParse(xdata)
# xsltApplyStyleSheet(xD, sltTemp)

require(XML)
# apply the stylesheet, serialise the result, then parse it again to print it
newxdata <- saveXML(xsltApplyStyleSheet(xdata, sltTemp))
xmlParse(newxdata)
<?xml version="1.0"?>
<DjVuXML>
  <BODY>
    <OBJECT data="file://localhost//book1.djvu" height="1650" type="image/x.djvu" usemap="book1.djvu" width="1275">
      <PARAM name="PAGE" value="book1_001.djvu"/>
      <PARAM name="DPI" value="300"/>
      <HIDDENTEXT>
        <PAGECOLUMN>
          <REGION>
            <PARAGRAPH>
              <WORD coords="1,2,3,4,5">Title</WORD>
            </PARAGRAPH>
          </REGION>
        </PAGECOLUMN>
        <PAGECOLUMN>
          <REGION>
            <PARAGRAPH>
              <WORD coords="30,290,65,276,290">This</WORD>
              <WORD coords="73,290,84,276,290">is</WORD>
              <WORD coords="92,290,100,280,290">a</WORD>
            </PARAGRAPH>
            <PARAGRAPH>
              <WORD coords="30,290,65,276,290">First</WORD>
              <WORD coords="73,290,84,276,290">line</WORD>
              <WORD coords="92,290,100,280,290">is</WORD>
              <WORD coords="30,290,65,276,290">Second</WORD>
              <WORD coords="73,290,84,276,290">line</WORD>
              <WORD coords="92,290,100,280,290">is</WORD>
              <WORD coords="30,290,65,276,290">Third</WORD>
              <WORD coords="73,290,84,276,290">line</WORD>
              <WORD coords="92,290,100,280,290">is</WORD>
            </PARAGRAPH>
          </REGION>
        </PAGECOLUMN>
      </HIDDENTEXT>
    </OBJECT>
    <MAP name="book1.djvu"/>
  </BODY>
</DjVuXML>
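
If Sxslt is hard to install, the same two-template stylesheet can probably be applied with the CRAN packages xml2 and xslt instead. This is only a sketch under that assumption: the packages are a substitution, not part of the answer above, and the short xdata string is a made-up stand-in for the real DjVu XML.

```r
# Sketch only: assumes the CRAN packages xml2 and xslt are installed.
library(xml2)
library(xslt)

# Hypothetical miniature input standing in for the real document
xdata <- '<PARAGRAPH><LINE><WORD>First</WORD><WORD>line</WORD></LINE><LINE><WORD>Second</WORD></LINE></PARAGRAPH>'

# Same two templates as above: identity transform plus a LINE override
sltTemp <- '<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="@* | node()">
    <xsl:copy><xsl:apply-templates select="@* | node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="LINE">
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>'

doc    <- read_xml(xdata)    # parse the input
style  <- read_xml(sltTemp)  # parse the stylesheet
result <- xml_xslt(doc, style)

# Count what is left: LINE wrappers should be gone, WORD nodes kept
n_line <- length(xml_find_all(result, "//LINE"))
n_word <- length(xml_find_all(result, "//WORD"))
```

Calling as.character(result) or write_xml(result, "out.xml") should then give the transformed document as text or as a file.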

Question 2

The proper way would be to parse the document with the XML package and fuse the nodes in the parsed tree.
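
A minimal sketch of what fusing the nodes in a parsed tree could look like. It uses the xml2 package rather than XML, which is a substitution made here for illustration, and the txt string is a hypothetical miniature of the real file. Each LINE element has its children copied in front of it and is then removed.

```r
# Sketch only: assumes the CRAN package xml2 is installed.
library(xml2)

# Hypothetical miniature input standing in for the real document
txt <- '<PARAGRAPH><LINE><WORD>First</WORD><WORD>line</WORD></LINE><LINE><WORD>Second</WORD></LINE></PARAGRAPH>'
doc <- read_xml(txt)

line_nodes <- xml_find_all(doc, "//LINE")
for (i in seq_along(line_nodes)) {
  ln   <- line_nodes[[i]]
  kids <- xml_children(ln)  # element children only (the WORD nodes)
  for (j in seq_along(kids)) {
    # insert a copy of each child just before the LINE node, keeping order
    xml_add_sibling(ln, kids[[j]], .where = "before")
  }
  xml_remove(ln)            # drop the now redundant LINE wrapper
}

words <- xml_text(xml_find_all(doc, "//WORD"))
```

Because the document is modified in place, doc itself holds the fused tree once the loop has finished.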

However, for the sample XML that you have provided, you might be able to get away with a simple gsub (find and replace).

Something along the lines of:

xmlfile <- readLines("test.xml")
newfile <- gsub("<LINE>|</LINE>", "", xmlfile)

And go from there.
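
One way to "go from there", sketched in base R only. The xmlfile vector below is a hypothetical stand-in for the result of readLines("test.xml"), and the temporary output file is likewise just for illustration. After the substitution, lines that held only a LINE tag are left as whitespace, so they are dropped before writing the result out.

```r
# Hypothetical stand-in for readLines("test.xml")
xmlfile <- c("<PARAGRAPH>",
             "  <LINE>",
             "    <WORD>First</WORD>",
             "  </LINE>",
             "</PARAGRAPH>")

# Strip the opening and closing LINE tags
newfile <- gsub("<LINE>|</LINE>", "", xmlfile)

# Drop lines that are now empty or whitespace only
newfile <- newfile[nzchar(trimws(newfile))]

# Write the cleaned XML to a temporary file (path chosen for illustration)
out <- tempfile(fileext = ".xml")
writeLines(newfile, out)
```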

Question 3

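A similar solution uses grepl to drop every line that contains a LINE tag. Note that this removes whole lines, so it only works when each LINE tag sits on a line of its own: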

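library(XML)
# txt is assumed to hold the XML as a single string
ll <- readLines(textConnection(txt))
# keep only the lines that contain no LINE tag
ll <- ll[!grepl("<LINE>|</LINE>", ll)]
txt <- paste(ll, collapse = "\n")
xmlParse(txt, asText = TRUE)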