Extracting text from Lotus Notes XML rich text element

Question 1

For now I have simply stripped the rich text XML element of all XML tags and unwanted embedded elements, using Regex with the following expressions:

        //Removes all attachmentref elements (lazy match, so text between two separate elements is kept)
        newString = new Regex(@"<attachmentref.*?</attachmentref>", RegexOptions.Singleline).Replace(newString, "");
        //Removes all formula elements
        newString = new Regex(@"<formula.*?</formula>", RegexOptions.Singleline).Replace(newString, "");
        //Removes all remaining xml tags (<par>, <pardef>, <table> etc). Text inside table cells is kept, but the table layout is lost.
        //[^>]+ stops at the first '>', so text between two tags on the same line is not swallowed.
        newString = new Regex("<[^>]+>").Replace(newString, "");

        //Trims the text to tidy up the many \n, \r and white-spaces introduced by removing the xml tags. 
        newString = new Regex(@"\r").Replace(newString, "\n");
        newString = new Regex(@"[ \f\r\t\v]+\n").Replace(newString, "\n");
        newString = new Regex(@"\n{2,}").Replace(newString, "\n");

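        //Decodes the predefined XML entities so <, >, ", ' and & appear correctly in the text. &amp; must be decoded last.
        newString = newString.Replace("&lt;", "<").Replace("&gt;", ">").Replace("&quot;", "\"").Replace("&apos;", "'").Replace("&amp;", "&");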

It's not pretty, but at least the text is readable and some sense of the line breaks is preserved.

Question 2

The XML you are dealing with is DXL (Domino XML). A more elegant method would be to convert it to HTML with an XSL transformation. A suitable XSLT stylesheet is supplied with the PD4ML tool. From HTML, the document can then be converted to PDF, RTF or an image with PD4ML (or probably to another format using another tool).
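As a rough sketch of what the XSL transformation step could look like in C#, the snippet below uses `XslCompiledTransform` from `System.Xml.Xsl`. The class name `DxlToHtml`, the `DemoStylesheet` constant and the sample DXL fragment are made up for illustration. The tiny inline stylesheet is only a stand-in so the example runs on its own, and it is not the stylesheet shipped with PD4ML. In practice you would load the real stylesheet instead.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class DxlToHtml
{
    // Hypothetical stand-in stylesheet, for illustration only.
    // It wraps the rich text in a <div>, turns each <par> into a <p>
    // and drops <pardef> elements. A real DXL stylesheet handles far more.
    public const string DemoStylesheet = @"<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:dxl='http://www.lotus.com/dxl'
    exclude-result-prefixes='dxl'>
  <xsl:output method='html' indent='no'/>
  <xsl:template match='dxl:richtext'><div><xsl:apply-templates/></div></xsl:template>
  <xsl:template match='dxl:pardef'/>
  <xsl:template match='dxl:par'><p><xsl:apply-templates/></p></xsl:template>
</xsl:stylesheet>";

    // Applies an XSLT stylesheet (passed as a string) to a DXL fragment
    // and returns the transformation result as a string.
    public static string Transform(string dxl, string stylesheet)
    {
        var xslt = new XslCompiledTransform();
        using (var xsltReader = XmlReader.Create(new StringReader(stylesheet)))
        {
            xslt.Load(xsltReader);
        }

        using (var input = XmlReader.Create(new StringReader(dxl)))
        using (var output = new StringWriter())
        {
            xslt.Transform(input, null, output);
            return output.ToString();
        }
    }

    public static void Main()
    {
        // Made-up sample fragment; real exported DXL is much larger.
        const string dxl = @"<richtext xmlns='http://www.lotus.com/dxl'><pardef id='1'/><par def='1'>Hello &lt;world&gt;</par></richtext>";

        Console.WriteLine(Transform(dxl, DemoStylesheet));
    }
}
```

Because the transformation works on parsed XML rather than on raw text, paragraphs and escaped characters such as `&lt;` are handled by the XML parser and no regex clean-up is needed.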

Question 3

You could convert the rich text item contents to HTML/MIME, which is the other supported format for rich text items.

Or you could create an XPage or form that serves the rich text content at an HTTP URL, and refer to that URL in the exported XML.

Panu