I am to migrate the contents of a Lotus Notes database to SharePoint. The entire database is exported to XML files (this requirement cannot be changed) and I have to parse these XML files and insert the data into SharePoint.

What's tripping me up are the elements that contain rich text. These elements hold a DXL (Domino XML) representation of the exact rich text format used in the Lotus Notes field, as described in http://publib.boulder.ibm.com/infocenter/domhelp/v8r0/index.jsp?topic=%2Fcom.ibm.designer.domino.main.doc%2FH_PARAGRAPH_DEFINITIONS_ELEMENT_XML.html

I don't need to keep the actual formatting of the text (unless that is just as easy as retrieving the plain text), but if I simply extract the value of the XML element containing the rich text (using LINQ to XML), I get the plain text without line breaks, which is not acceptable. Additionally, embedded images show up in the retrieved text as base64-encoded strings (they are embedded in the XML as such).

Can anyone give me guidance on how to extract the text from the XML element, either as proper RTF that can be inserted into an RTF file or as plain text that includes the correct line breaks and doesn't contain the embedded images?

Solution

I have (for now) just stripped the rich text XML element of all XML tags and unwanted embedded elements, using regular expressions:

        // Requires: using System.Text.RegularExpressions;
        // newString holds the XML of the rich text element.

        //Removes all attachmentref elements (lazy match, so text between two such elements is kept)
        newString = Regex.Replace(newString, @"<attachmentref.*?</attachmentref>", "", RegexOptions.Singleline);
        //Removes all formula elements
        newString = Regex.Replace(newString, @"<formula.*?</formula>", "", RegexOptions.Singleline);
        //Removes embedded images (assuming they are exported as <picture> elements holding the base64 data)
        newString = Regex.Replace(newString, @"<picture.*?</picture>", "", RegexOptions.Singleline);
        //Removes all xml tags (<par>, <pardef>, <table> etc). Only the tags go; the text between them, including table cell text, is kept.
        newString = Regex.Replace(newString, @"<[^>]+>", "");

        //Trims the text to tidy up the many \n, \r and white-spaces introduced by removing the xml tags.
        newString = newString.Replace("\r", "\n");
        newString = Regex.Replace(newString, @"[ \f\t\v]+\n", "\n");
        newString = Regex.Replace(newString, @"\n{2,}", "\n");

        //Makes <, > and & appear correctly in the text (&amp; is decoded last to avoid double-decoding).
        newString = newString.Replace("&lt;", "<").Replace("&gt;", ">").Replace("&amp;", "&");

It's not pretty, but at least the text is readable and some sense of the line breaks is preserved.
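
As an alternative to regular expressions, the same idea can be sketched with LINQ to XML by walking the element tree. This is only a sketch under assumptions about the export: that paragraphs are `par` elements, explicit line breaks are `break` elements, and images sit inside `picture` elements. The class name `DxlText` and the list of skipped elements are made up for this illustration and may need adjusting for your files.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Xml.Linq;

public static class DxlText
{
    // Elements whose content should never reach the output
    // (assumed names; extend this set to match your own export).
    private static readonly HashSet<string> Skipped =
        new HashSet<string> { "picture", "attachmentref", "formula", "pardef" };

    // Returns the text of a rich text element: a newline after every <par>
    // and one for every <break/>, with skipped elements left out.
    public static string ExtractPlainText(XElement richText)
    {
        var sb = new StringBuilder();
        Append(richText, sb);
        return sb.ToString().Trim();
    }

    private static void Append(XElement element, StringBuilder sb)
    {
        foreach (XNode node in element.Nodes())
        {
            if (node is XText text)
            {
                sb.Append(text.Value);
            }
            else if (node is XElement child)
            {
                // Compare local names so the XML namespace does not matter.
                string name = child.Name.LocalName;
                if (Skipped.Contains(name)) continue;
                if (name == "break") { sb.Append('\n'); continue; }
                Append(child, sb);
                if (name == "par") sb.Append('\n');
            }
        }
    }
}
```

Calling `DxlText.ExtractPlainText(XElement.Parse(xml))` on the rich text element then gives plain text with one line per paragraph. Because entity references are decoded by the XML parser, no manual `&lt;` / `&gt;` replacement is needed with this approach.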

Other tips

Obviously the XML you are dealing with is DXL. A more elegant method would be to convert it to HTML with an XSL transformation. A suitable XSLT stylesheet is supplied with the PD4ML tool. From HTML, the document can be converted to PDF, RTF or an image with PD4ML (or probably to another format using another tool).
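
For illustration, an XSL transformation of this kind can be run from C# with `XslCompiledTransform`. The stylesheet below is a tiny hand-written stand-in, not the one shipped with PD4ML; it assumes the DXL namespace `http://www.lotus.com/dxl` and handles only `par`, `break` and a few elements to drop. The class name `DxlToHtml` is hypothetical.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class DxlToHtml
{
    // Illustrative stylesheet only: paragraphs become <p>, breaks become <br>,
    // and pictures, attachment references, formulas and pardefs are dropped.
    private const string Stylesheet = @"<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:dxl='http://www.lotus.com/dxl'
    exclude-result-prefixes='dxl'>
  <xsl:output method='html' indent='no'/>
  <xsl:template match='dxl:richtext'><div><xsl:apply-templates/></div></xsl:template>
  <xsl:template match='dxl:par'><p><xsl:apply-templates/></p></xsl:template>
  <xsl:template match='dxl:break'><br/></xsl:template>
  <xsl:template match='dxl:picture | dxl:attachmentref | dxl:formula | dxl:pardef'/>
</xsl:stylesheet>";

    // Transforms the XML of one rich text element into an HTML fragment.
    public static string Transform(string richTextXml)
    {
        var xslt = new XslCompiledTransform();
        using (var styleReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(styleReader);
        }

        var output = new StringWriter();
        using (var input = XmlReader.Create(new StringReader(richTextXml)))
        // OutputSettings carries the xsl:output choices (HTML method) to the writer.
        using (var writer = XmlWriter.Create(output, xslt.OutputSettings))
        {
            xslt.Transform(input, writer);
        }
        return output.ToString();
    }
}
```

With a real stylesheet, `xslt.Load` can be pointed at the stylesheet file instead of the inline string; the rest of the method stays the same.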

You could convert the rich text item contents to HTML/MIME which is the other supported format for rich text items.

Or you could create an XPage or form that serves the rich text content over an HTTP URL and refer to that URL in the export XML.

  • Panu
Licensed under: CC-BY-SA with attribution