Best way to parse space-separated text

https://stackoverflow.com/questions/54866

09-06-2019

Question

I have a string like this:

 /c SomeText\MoreText "Some Text\More Text\Lol" SomeText

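I want to tokenize it, but I can't just split on the spaces. I have a somewhat ugly parser that works, but I'm wondering whether anyone has a more elegant design.

This is in C#, by the way.

EDIT: My ugly version, while ugly, is O(N) and may actually be faster than using a RegEx.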

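// Requires: using System.Collections.Generic;
private string[] tokenize(string input)
{
    string[] tokens = input.Split(' ');
    List<string> output = new List<string>();

    for (int i = 0; i < tokens.Length; i++)
    {
        // A piece that opens a quote without also closing it starts a multi-word token.
        bool opensQuote = tokens[i].StartsWith("\"")
            && !(tokens[i].Length > 1 && tokens[i].EndsWith("\""));
        if (opensQuote)
        {
            string temp = tokens[i];
            int k;
            // Append the following pieces until one ends with the closing quote.
            for (k = i + 1; k < tokens.Length; k++)
            {
                temp += " " + tokens[k];
                if (tokens[k].EndsWith("\""))
                {
                    break;
                }
            }
            output.Add(temp);
            i = k; // the loop's i++ then moves to the piece after the closing quote
        }
        else
        {
            output.Add(tokens[i]);
        }
    }

    return output.ToArray();
}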

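Solution

The computer-science term for what you are doing is lexical analysis; it is worth reading a good summary of this common task.

Based on your example, I'm guessing that you want whitespace to separate your words, but anything inside quotation marks should be treated as a single "word", without the quotes.

The simplest way to do this is to define a word as a regular expression: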

([^"^\s]+)\s*|"([^"]+)"\s*

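This expression states that a "word" is either (1) non-quote, non-whitespace text surrounded by whitespace, or (2) non-quote text surrounded by quotes (followed by some whitespace). Note the use of capturing parentheses to pick out the desired text.

Armed with that regex, your algorithm is simple: search your text for the next "word" as defined by the capturing parentheses, and return it. Repeat until you run out of "words".

Here's the simplest bit of working code I could come up with, in VB.NET. Note that we have to check both groups for data, since there are two sets of capturing parentheses.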

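' Requires: Imports System.Text.RegularExpressions
Dim token As String
Dim r As Regex = New Regex("([^""^\s]+)\s*|""([^""]+)""\s*")
Dim m As Match = r.Match("this is a ""test string""")

While m.Success
    ' Group 1 holds an unquoted word; group 2 holds quoted text without the quotes.
    token = m.Groups(1).ToString
    If token.Length = 0 AndAlso m.Groups.Count > 1 Then
        token = m.Groups(2).ToString
    End If
    ' Use token here, e.g. add it to a list.
    m = m.NextMatch
End While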
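
Since the question is about C#, the same loop might look roughly like this there. It is only a sketch: the class name `RegexTokenizer` and the method name `Tokenize` are made up for illustration, and the pattern is the one from this answer, unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexTokenizer
{
    // Same pattern as in the answer: group 1 = bare word, group 2 = quoted text without quotes.
    private static readonly Regex WordPattern =
        new Regex(@"([^""^\s]+)\s*|""([^""]+)""\s*");

    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        for (Match m = WordPattern.Match(input); m.Success; m = m.NextMatch())
        {
            // Only one of the two groups takes part in any given match.
            tokens.Add(m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value);
        }
        return tokens;
    }
}
```

Calling `RegexTokenizer.Tokenize` on the string from the question should give four tokens, with the quoted part returned as one token without its quotes. One quirk of the pattern as written: the second `^` inside `[^"^\s]` is a literal caret, so a word containing `^` would be split at that character.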

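Note 1: Will's answer, above, is the same idea as this one. Hopefully this answer explains the details behind the scenes a little better :)

Other tips

The Microsoft.VisualBasic.FileIO namespace (in Microsoft.VisualBasic.dll) has a TextFieldParser you can use to split space-delimited text. It handles strings within quotes (i.e., "this is one token" thisistokentwo).

Note that just because the DLL says VisualBasic doesn't mean you can only use it in a VB project. It is part of the whole Framework.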
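
For what it's worth, using TextFieldParser from C# might look something like the sketch below. The wrapper class `FieldParserTokenizer` and its `Tokenize` method are hypothetical names; the parser members used (`TextFieldType`, `SetDelimiters`, `HasFieldsEnclosedInQuotes`, `ReadFields`) belong to `TextFieldParser` itself. It assumes the project references Microsoft.VisualBasic.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class FieldParserTokenizer
{
    public static string[] Tokenize(string line)
    {
        // TextFieldParser reads from a TextReader, so wrap the string in a StringReader.
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(" ");
            parser.HasFieldsEnclosedInQuotes = true; // quoted runs become one field, quotes removed

            // ReadFields returns null when there is nothing left to read.
            return parser.ReadFields() ?? new string[0];
        }
    }
}
```

Because the parser was designed for delimited files, repeated spaces will probably come back as empty fields, and a quote in the middle of a field can raise a MalformedLineException, so this fits best when the input is reasonably well formed.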

There is also the state machine approach.

    private enum State
    {
        None = 0,
        InTokin,
        InQuote
    }

    private static IEnumerable<string> Tokinize(string input)
    {
        input += ' '; // ensure we end on whitespace
        State state = State.None;
        State? next = null; // setting the next state implies that we have found a tokin
        StringBuilder sb = new StringBuilder();
        foreach (char c in input)
        {
            switch (state)
            {
                default:
                case State.None:
                    if (char.IsWhiteSpace(c))
                        continue;
                    else if (c == '"')
                    {
                        state = State.InQuote;
                        continue;
                    }
                    else
                        state = State.InTokin;
                    break;
                case State.InTokin:
                    if (char.IsWhiteSpace(c))
                        next = State.None;
                    else if (c == '"')
                        next = State.InQuote;
                    break;
                case State.InQuote:
                    if (c == '"')
                        next = State.None;
                    break;
            }
            if (next.HasValue)
            {
                yield return sb.ToString();
                sb = new StringBuilder();
                state = next.Value;
                next = null;
            }
            else
                sb.Append(c);
        }
    }

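It can easily be extended to handle things like nested quotes and escaping. Returning IEnumerable<string> allows your code to parse only as much as you need. There is no real downside to this kind of lazy approach: strings are immutable, so you know that input is not going to change before you have parsed the whole thing.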
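
As an illustration of the kind of extension meant here, the sketch below adds one escaping rule to a state machine with the same three states: inside quotes, `\"` stands for a literal quote. This is not the code above but a rewritten variant; the class name `EscapingTokenizer`, the escaping rule itself, and the decision to return an unterminated quoted token at the end of input are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EscapingTokenizer
{
    private enum State { None, InToken, InQuote }

    // Lazily yields tokens; each one is produced only when the caller asks for it.
    public static IEnumerable<string> Tokenize(string input)
    {
        var state = State.None;
        var sb = new StringBuilder();

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            switch (state)
            {
                case State.None:
                    if (char.IsWhiteSpace(c)) break;              // skip separators
                    if (c == '"') { state = State.InQuote; break; }
                    state = State.InToken;
                    sb.Append(c);
                    break;

                case State.InToken:
                    if (char.IsWhiteSpace(c) || c == '"')
                    {
                        // Whitespace or an opening quote ends the current bare token.
                        yield return sb.ToString();
                        sb.Clear();
                        state = c == '"' ? State.InQuote : State.None;
                    }
                    else sb.Append(c);
                    break;

                case State.InQuote:
                    if (c == '\\' && i + 1 < input.Length && input[i + 1] == '"')
                    {
                        sb.Append('"');   // assumed rule: \" is an escaped quote
                        i++;              // skip the quote that was just consumed
                    }
                    else if (c == '"')
                    {
                        yield return sb.ToString();
                        sb.Clear();
                        state = State.None;
                    }
                    else sb.Append(c);    // other backslashes are kept as ordinary characters
                    break;
            }
        }

        // Flush whatever is left, including an unterminated quoted token.
        if (state != State.None)
            yield return sb.ToString();
    }
}
```

Because the method is an iterator, something like `EscapingTokenizer.Tokenize(line).First()` (with `System.Linq`) stops scanning after the first token. Note the trade-off in the escaping rule: a quoted Windows path that ends in a backslash, such as `"C:\dir\"`, would be misread, so a real implementation would need a rule suited to its input.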

See also: http://en.wikipedia.org/wiki/Automata-Based_Programming

You might also want to look into regular expressions; they may help you here. Here is a sample lifted from MSDN...

using System;
using System.Text.RegularExpressions;

public class Test
{

    public static void Main ()
    {

        // Define a regular expression for repeated words.
        Regex rx = new Regex(@"\b(?<word>\w+)\s+(\k<word>)\b",
          RegexOptions.Compiled | RegexOptions.IgnoreCase);

        // Define a test string.        
        string text = "The the quick brown fox  fox jumped over the lazy dog dog.";

        // Find matches.
        MatchCollection matches = rx.Matches(text);

        // Report the number of matches found.
        Console.WriteLine("{0} matches found in:\n   {1}", 
                          matches.Count, 
                          text);

        // Report on each match.
        foreach (Match match in matches)
        {
            GroupCollection groups = match.Groups;
            Console.WriteLine("'{0}' repeated at positions {1} and {2}",  
                              groups["word"].Value, 
                              groups[0].Index, 
                              groups[1].Index);
        }

    }

}
// The example produces the following output to the console:
//       3 matches found in:
//          The the quick brown fox  fox jumped over the lazy dog dog.
//       'The' repeated at positions 0 and 4
//       'fox' repeated at positions 20 and 25
//       'dog' repeated at positions 50 and 54

Craig is right: use regular expressions. Regex.Split may be more concise for your needs.

[^ ]+ |"[^"]+"

Using a Regex definitely looks like the best bet, but this one just returns the whole string. I have tried to tweak it, but without much luck so far.

string[] tokens = System.Text.RegularExpressions.Regex.Split(this.BuildArgs, @"[^\t]+\t|""[^""]+""\t");
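
One possible way around this: Regex.Split treats every match as a separator and throws it away, so a pattern that describes the tokens themselves tends to work better with Regex.Matches. The sketch below takes that route. The class name `MatchTokenizer` and the pattern are illustrative assumptions, not something taken from the answers above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class MatchTokenizer
{
    // A quoted run (tried first) or a run of non-whitespace characters.
    private static readonly Regex TokenPattern = new Regex(@"""[^""]+""|\S+");

    public static string[] Tokenize(string input)
    {
        return TokenPattern.Matches(input)
            .Cast<Match>()                    // MatchCollection is non-generic on older frameworks
            .Select(m => m.Value.Trim('"'))   // drop the surrounding quotes, if any
            .ToArray();
    }
}
```

Since `\S+` stops at any whitespace, the same pattern should also cope with tab-separated input like the `BuildArgs` string above. A quote that appears in the middle of a word is not treated specially here.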

Licensed under: CC-BY-SA with attribution
