Best way to process non-standard quotes in XML parser [duplicate]

https://stackoverflow.com/questions/17010150

31-05-2022

Question

I am creating a program that processes text with XML formatting. I found that when the tag values contain non-ASCII quotes instead of the standard ones (double quote / ASCII 34, single quote / ASCII 39), parsing throws an exception. Such quotes may come from editing software such as MS Word (automatic formatting).

Currently I parse each line of my text box and replace the quotes before processing the XML. Here is the code (in C#):

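int nLines = textBox1.Lines.Length;

for (int i = 0; i < nLines; i++)
{
    // get the current line and replace typographic quotes with standard ones
    // (no '|' inside the brackets: in a character class it matches a literal pipe)
    line = Regex.Replace(textBox1.Lines[i], "[\u2018\u2019\u201A]", "'");
    line = Regex.Replace(line, "[\u201C\u201D\u201E]", "\"");
    // ... XML processing of the line continues here
}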

I wonder if there is a better, more correct, or faster way to achieve this. By "more correct" I mean that the method should cover almost all possible quote characters (I heard that \d matches not only 0-9 but also other Unicode digits). Thanks in advance!

Solution

The \p{Pi} and \p{Pf} Unicode category classes can be useful for matching this kind of quote. However, they do not distinguish between single and double quotes.

\p{Pi} -> opening quotes

\p{Pf} -> closing quotes
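
As a rough illustration of how these classes could be combined with the replacement from the question, here is a minimal sketch. The names QuoteNormalizer, QuoteLike and DoubleLike are made up for this example, and the list of double-quote-like characters is an assumption to adjust for your input. U+201A and U+201E are added explicitly because Unicode puts them in the open punctuation category (Ps), so \p{Pi} and \p{Pf} alone would miss them. Mapping every other match to a single quote is a simplification, since those categories also contain a few characters that are not quotes in the everyday sense.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteNormalizer
{
    // \p{Pi} = initial quote punctuation, \p{Pf} = final quote punctuation.
    // U+201A and U+201E are listed explicitly: they are category Ps, not Pi/Pf.
    private static readonly Regex QuoteLike =
        new Regex(@"[\p{Pi}\p{Pf}\u201A\u201E]", RegexOptions.Compiled);

    // Assumption: the characters treated as double quotes.
    // Any other match is mapped to a single quote.
    private const string DoubleLike = "\u201C\u201D\u201E\u201F\u00AB\u00BB";

    public static string Normalize(string text)
    {
        // One pass over the whole text; each match is a single UTF-16 char.
        return QuoteLike.Replace(
            text,
            m => DoubleLike.IndexOf(m.Value[0]) >= 0 ? "\"" : "'");
    }
}

public static class Program
{
    public static void Main()
    {
        string input = "<a href=\u201Cx.html\u201D title=\u2018t\u2019/>";
        // prints: <a href="x.html" title='t'/>
        Console.WriteLine(QuoteNormalizer.Normalize(input));
    }
}
```

Running the replacement once over the full text (for example textBox1.Text) instead of line by line avoids the loop, though whether that is measurably faster would need to be checked on real input.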

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow