Converting XML to plain text

https://stackoverflow.com/questions/1050644

20-08-2019

Question

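My goal is to build an engine that takes the latest HL7 3.0 CDA documents and makes them backward compatible with HL7 2.5, which is a radically different beast.

A CDA document is an XML file that, when paired with its matching XSL file, renders an HTML document fit for display to the end user.

For HL7 2.5 I need to get the rendered text, devoid of any markup, and fold it into a text stream (or similar) that I can write out in 80-character lines to populate the HL7 2.5 message.

So far, my approach is to use XslCompiledTransform to transform my XML document with XSLT and produce the resulting HTML document.

My next step is to take that document (or perhaps work from a step before it) and render the HTML as text. I have searched for a while but can't figure out how to do this. I'm hoping it is something easy that I'm just overlooking, or that I just can't find the magic search terms. Can anyone offer some help?

FWIW, I've read the 5 or 10 other questions on SO that either embrace or warn against using regular expressions for this, and I don't think I want to go down that road. I need the rendered text.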

using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;
using System.Xml.XPath;

public class TransformXML
{

    public static void Main(string[] args)
    {
        try
        {

            string sourceDoc = "C:\\CDA_Doc.xml";
            string resultDoc = "C:\\Result.html";
            string xsltDoc = "C:\\CDA.xsl";

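            XPathDocument myXPathDocument = new XPathDocument(sourceDoc);
            XslCompiledTransform myXslTransform = new XslCompiledTransform();
            myXslTransform.Load(xsltDoc);

            // Write the transformed HTML to disk; "using" closes the writer even if Transform throws.
            using (XmlTextWriter writer = new XmlTextWriter(resultDoc, null))
            {
                myXslTransform.Transform(myXPathDocument, null, writer);
            }

            // Open the generated HTML; rendering it as plain text is the open question.
            using (StreamReader stream = new StreamReader(resultDoc))
            {
                // TODO: render the HTML as text.
            }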

        }

        catch (Exception e)
        {
            Console.WriteLine ("Exception: {0}", e.ToString());
        }
    }
}

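Solution

Since you have the XML source, consider writing an XSL that gives you the output you want directly, without the intermediate HTML step. That will be far more reliable than trying to transform the HTML.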
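As a minimal sketch of that idea, the program below applies a stylesheet that declares `method='text'`, so the transform emits plain text with no HTML in between. The element names `document`, `title`, and `paragraph`, the class name `XmlToText`, and the inline stylesheet are all hypothetical and only illustrate the shape; a real CDA document lives in the HL7 v3 namespace (`urn:hl7-org:v3`), so its templates would need namespace-qualified match patterns.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XmlToText
{
    // Hypothetical stylesheet: one output line per <title> or <paragraph>, no markup.
    const string Stylesheet = @"<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='text'/>
  <xsl:strip-space elements='*'/>
  <xsl:template match='title | paragraph'>
    <xsl:value-of select='normalize-space(.)'/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>";

    public static string Transform(string xml)
    {
        var xslt = new XslCompiledTransform();
        using (var styleReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(styleReader);
        }

        // Writing to a TextWriter lets the stylesheet's xsl:output settings take effect.
        using (var input = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            xslt.Transform(input, null, output);
            return output.ToString();
        }
    }

    public static void Main()
    {
        // Simplified stand-in for a CDA document (assumption, not real CDA markup).
        string xml = "<document><title>Discharge note</title>"
                   + "<paragraph>First   paragraph.</paragraph>"
                   + "<paragraph>Second paragraph.</paragraph></document>";
        Console.Write(Transform(xml));
    }
}
```

Elements without an explicit template fall through to XSLT's built-in rules, which recurse into children and copy text nodes, so the stylesheet only has to name the elements whose layout matters.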

Other tips

This will leave you with just the text:

using System;
using System.Text;
using System.Xml;

class Program
{
    static string sourceDoc = "<html><body><p>this is a paragraph</p><p>another paragraph</p></body></html>";

    static void Main(string[] args)
    {
        StringBuilder result = new StringBuilder();

        // The input must be well-formed XML. Element nodes have an empty Value,
        // so for this input only the text content accumulates.
        using (var source = new System.IO.StringReader(sourceDoc))
        using (var reader = XmlReader.Create(source))
        {
            while (reader.Read())
            {
                result.Append(reader.Value);
            }
        }
        Console.WriteLine(result);
    }
}

Alternatively, you could use a regular expression:

// Requires: using System.Text.RegularExpressions;
public static string StripHtml(string htmlText)
{
    // replace all tags with spaces...
    htmlText = Regex.Replace(htmlText, @"<(.|\n)*?>", " ");

    // ...turn non-breaking-space entities into plain spaces...
    htmlText = htmlText.Replace("&nbsp;", " ");

    // ...then collapse runs of spaces
    while (htmlText.Contains("  "))
    {
        htmlText = htmlText.Replace("  ", " ");
    }

    // decode &amp; last, so that "&amp;nbsp;" stays the literal text "&nbsp;"
    htmlText = htmlText.Replace("&amp;", "&");

    return htmlText;
}

Could you use something like this, which uses Lynx and Perl to render the HTML and then convert it to plain text?

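This is a great use case for XSL-FO and FOP. FOP is not just for PDF output; one of the other major outputs it supports is text. You should be able to construct a simple XSLT + FO stylesheet that has the specifications you want (i.e. line width).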

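This solution is a bit more heavyweight than the plain XML -> XSLT -> text route that ScottSEA suggested, but if you have any more complex formatting requirements (e.g. indentation), they will be much easier to express in FO than to mock up in XSLT.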

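I would avoid regexes for extracting the text. They are too low-level and guaranteed to be brittle. If you just want text in 80-character lines, the default XSLT templates will print only the element text. Once you have only the text, you can apply whatever text processing is necessary.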
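The "text processing" step for the 80-character requirement can be sketched as a simple word wrap over text that has already been extracted. The class name `LineFolder`, the `Fold` method, and the choice to hard-split words longer than the line width are assumptions for illustration; how the folded lines are placed into HL7 2.5 segments is outside this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LineFolder
{
    // Folds whitespace-separated text into lines of at most `width` characters.
    public static List<string> Fold(string text, int width = 80)
    {
        var lines = new List<string>();
        var current = new StringBuilder();

        // A null separator array splits on any whitespace.
        foreach (string word in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            string rest = word;

            // Hard-split a word that cannot fit on a single line.
            while (rest.Length > width)
            {
                if (current.Length > 0)
                {
                    lines.Add(current.ToString());
                    current.Clear();
                }
                lines.Add(rest.Substring(0, width));
                rest = rest.Substring(width);
            }

            // Start a new line if the word plus a separating space would overflow.
            if (current.Length > 0 && current.Length + 1 + rest.Length > width)
            {
                lines.Add(current.ToString());
                current.Clear();
            }

            if (current.Length > 0)
            {
                current.Append(' ');
            }
            current.Append(rest);
        }

        if (current.Length > 0)
        {
            lines.Add(current.ToString());
        }
        return lines;
    }

    public static void Main()
    {
        string text = "Rendered text from the transform, with   irregular\nwhitespace, "
                    + "is collapsed and folded into fixed-width lines.";
        foreach (string line in Fold(text, 40))
        {
            Console.WriteLine(line);
        }
    }
}
```

Because the fold works on whitespace-separated words, it also collapses the stray newlines and runs of spaces that a text-only transform tends to leave behind.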

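Incidentally, I work for a company that produces CDAs as part of our product (voice recognition for dictations). I would look into an XSLT that transforms 3.0 directly into 2.5. Depending on how much fidelity you want to keep between the two versions, the full XSLT route may well be your easiest option if what you really want is conversion between the formats. That is what XSLT was built to do.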

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow