Best way to parse space-separated text

https://stackoverflow.com/questions/54866

09-06-2019
Question

I have a string like this:

 /c SomeText\MoreText "Some Text\More Text\Lol" SomeText

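I want to tokenize it, but I can't just split on the spaces. I've come up with a somewhat ugly parser that works, but I'm wondering if anyone has a more elegant design.

This is in C#, by the way.

Edit: My ugly version, while ugly, is O(N) and may actually be faster than using a RegEx.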

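// Requires: using System.Collections.Generic;
private string[] tokenize(string input)
{
    string[] tokens = input.Split(' ');
    List<string> output = new List<string>();

    for (int i = 0; i < tokens.Length; i++)
    {
        // A piece that opens a quote without also closing it starts a multi-piece token.
        bool opensQuote = tokens[i].StartsWith("\"")
            && !(tokens[i].Length > 1 && tokens[i].EndsWith("\""));
        if (opensQuote)
        {
            string temp = tokens[i];
            int k;
            // Re-join the pieces up to and including the one that closes the quote.
            for (k = i + 1; k < tokens.Length; k++)
            {
                temp += " " + tokens[k];
                if (tokens[k].EndsWith("\""))
                {
                    break;
                }
            }
            output.Add(temp);
            i = k; // the loop's i++ then steps past the closing piece
        }
        else
        {
            output.Add(tokens[i]);
        }
    }

    return output.ToArray();
}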

Solution

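The computer term for what you're doing is lexical analysis; reading up on it gives a good summary of this common task.

Based on your example, I'm guessing that you want whitespace to separate your words, but anything inside quotation marks should be treated as a single "word", without the quotes.

The simplest way to do this is to define a word as a regular expression: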

([^"^\s]+)\s*|"([^"]+)"\s*

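This expression states that a "word" is either (1) non-quote, non-whitespace text surrounded by whitespace, or (2) non-quote text surrounded by quotes (followed by some whitespace). Note the use of capturing parentheses to highlight the desired text. Also note that the second ^ inside the character class is a literal caret, so the first alternative stops at caret characters as well; [^"\s] is probably what was intended.

Armed with that regex, your algorithm is simple: search your text for the next "word" as defined by the capturing parentheses, and return it. Repeat that until you run out of "words".

Here's the simplest bit of working code I could come up with, in VB.NET. Note that we have to check both groups for data, since there are two sets of capturing parentheses.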

Dim token As String
Dim r As Regex = New Regex("([^""^\s]+)\s*|""([^""]+)""\s*")
Dim m As Match = r.Match("this is a ""test string""")

While m.Success
    token = m.Groups(1).ToString
    If token.length = 0 And m.Groups.Count > 1 Then
        token = m.Groups(2).ToString
    End If
    m = m.NextMatch
End While

Note 1: Will's answer, above, is the same idea as this one. Hopefully this answer explains the details behind the scenes a little better :)
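Since the question is about C#, the same match-and-repeat loop might look roughly like this in C#. This is a sketch under a few assumptions rather than a port of the VB.NET code: the class and method names (RegexTokenizer, Tokenize) are made up for illustration, the character class is written as [^"\s] so that carets are allowed in words, and the quoted alternative uses * instead of + so that an empty pair of quotes yields an empty token.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexTokenizer
{
    // Group 1 captures an unquoted word; group 2 captures the inside of a quoted one.
    // Assumption: [^"\s] replaces the answer's [^"^\s], and * replaces + in group 2.
    private static readonly Regex Word =
        new Regex(@"([^""\s]+)\s*|""([^""]*)""\s*");

    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        for (Match m = Word.Match(input); m.Success; m = m.NextMatch())
        {
            // Only one of the two groups takes part in any given match.
            tokens.Add(m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value);
        }
        return tokens;
    }
}
```

For the string in the question, this should produce four tokens: /c, SomeText\MoreText, Some Text\More Text\Lol and SomeText. Text that the pattern cannot match, such as a stray unbalanced quote, is skipped silently.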

Other tips

The Microsoft.VisualBasic.FileIO namespace (in Microsoft.VisualBasic.dll) has a TextFieldParser that you can use to split space-delimited text. It handles strings within quotes (e.g. "this is one token" thisistokentwo) well.

Note that just because the DLL says VisualBasic doesn't mean you can only use it in a VB project. It's part of the entire Framework.
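A minimal sketch of how that might be used from C#, assuming the project references Microsoft.VisualBasic.dll. The wrapper class and method names (FieldParserDemo, Tokenize) are hypothetical; TextFieldParser, FieldType, SetDelimiters, HasFieldsEnclosedInQuotes and ReadFields are the members of the real class.

```csharp
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class FieldParserDemo
{
    // Splits a single line on spaces, keeping quoted runs together.
    public static string[] Tokenize(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(" ");
            parser.HasFieldsEnclosedInQuotes = true;

            // ReadFields returns null when there is no data left to read.
            return parser.ReadFields() ?? new string[0];
        }
    }
}
```

The parser was designed for delimited files such as CSV, so it is fairly strict: a closing quote that is not followed by a delimiter or the end of the line is reported as a malformed line, and runs of consecutive spaces are likely to show up as empty fields, so the result may need filtering.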

There is the state machine approach.

    private enum State
    {
        None = 0,
        InTokin,
        InQuote
    }

    private static IEnumerable<string> Tokinize(string input)
    {
        input += ' '; // ensure we end on whitespace
        State state = State.None;
        State? next = null; // setting the next state implies that we have found a tokin
        StringBuilder sb = new StringBuilder();
        foreach (char c in input)
        {
            switch (state)
            {
                default:
                case State.None:
                    if (char.IsWhiteSpace(c))
                        continue;
                    else if (c == '"')
                    {
                        state = State.InQuote;
                        continue;
                    }
                    else
                        state = State.InTokin;
                    break;
                case State.InTokin:
                    if (char.IsWhiteSpace(c))
                        next = State.None;
                    else if (c == '"')
                        next = State.InQuote;
                    break;
                case State.InQuote:
                    if (c == '"')
                        next = State.None;
                    break;
            }
            if (next.HasValue)
            {
                yield return sb.ToString();
                sb = new StringBuilder();
                state = next.Value;
                next = null;
            }
            else
                sb.Append(c);
        }
    }

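It can easily be extended for things like nested quotes and escaping. Returning an IEnumerable<string> allows your code to parse only as much as it needs. There are no real downsides to this kind of lazy approach: strings are immutable, so you know that input isn't going to change before you have parsed the whole thing.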
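To illustrate the laziness point, here is a small self-contained sketch. It uses a deliberately simplified, hypothetical iterator (Words, which only splits on whitespace and ignores quotes) plus a counter, so that the amount of input actually scanned can be observed; none of these names come from the answer above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class LazyDemo
{
    // Counts how many characters the iterator has examined so far.
    public static int CharsScanned;

    // Simplified stand-in for a tokenizer: whitespace-separated words only.
    public static IEnumerable<string> Words(string input)
    {
        var sb = new StringBuilder();
        foreach (char c in input)
        {
            CharsScanned++;
            if (!char.IsWhiteSpace(c))
            {
                sb.Append(c);
            }
            else if (sb.Length > 0)
            {
                yield return sb.ToString(); // execution pauses here until the caller asks for more
                sb.Clear();
            }
        }
        if (sb.Length > 0)
        {
            yield return sb.ToString();
        }
    }

    // Asks for one token only; the rest of the input is never scanned.
    public static string FirstWord(string input)
    {
        CharsScanned = 0;
        return Words(input).First();
    }
}
```

Calling LazyDemo.FirstWord("/c SomeText more") returns /c and leaves CharsScanned at 3, because the iterator stops right after the first separator. A caller that needs every token can still use foreach or ToList() on the same method.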

See: http://en.wikipedia.org/wiki/Automata-Based_Programming

You might also want to look into regular expressions. They might help you out. Here is a sample taken from MSDN:

using System;
using System.Text.RegularExpressions;

public class Test
{

    public static void Main ()
    {

        // Define a regular expression for repeated words.
        Regex rx = new Regex(@"\b(?<word>\w+)\s+(\k<word>)\b",
          RegexOptions.Compiled | RegexOptions.IgnoreCase);

        // Define a test string.        
        string text = "The the quick brown fox  fox jumped over the lazy dog dog.";

        // Find matches.
        MatchCollection matches = rx.Matches(text);

        // Report the number of matches found.
        Console.WriteLine("{0} matches found in:\n   {1}", 
                          matches.Count, 
                          text);

        // Report on each match.
        foreach (Match match in matches)
        {
            GroupCollection groups = match.Groups;
            Console.WriteLine("'{0}' repeated at positions {1} and {2}",  
                              groups["word"].Value, 
                              groups[0].Index, 
                              groups[1].Index);
        }

    }

}
// The example produces the following output to the console:
//       3 matches found in:
//          The the quick brown fox  fox jumped over the lazy dog dog.
//       'The' repeated at positions 0 and 4
//       'fox' repeated at positions 20 and 25
//       'dog' repeated at positions 50 and 54

Craig is right: use regular expressions. Regex.Split may be more concise for your needs.

[^ ]+ |"[^"]+"

Using a Regex definitely looks like the best bet, but this one just returns the whole string. I'm trying to tweak it, but I haven't had much luck so far.

string[] tokens = System.Text.RegularExpressions.Regex.Split(this.BuildArgs, @"[^\t]+\t|""[^""]+""\t");
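One likely reason is that Regex.Split treats its pattern as the separator and returns the text between the matches. A pattern that describes the tokens themselves is a better fit for Regex.Matches. If the input contains no tab characters at all, the pattern above never matches and Split hands back the input as a single element, which may be what happened here. The sketch below assumes tab-separated input with optional double-quoted fields; the class name TabTokenizer and the exact pattern are illustrative choices, not taken from the post.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TabTokenizer
{
    // Either a quoted run (group 1 holds the text without the quotes)
    // or a run of characters that contains no tab.
    private static readonly Regex Field = new Regex(@"""([^""]*)""|[^\t]+");

    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        foreach (Match m in Field.Matches(input))
        {
            // Use the capture for quoted fields, the whole match otherwise.
            tokens.Add(m.Groups[1].Success ? m.Groups[1].Value : m.Value);
        }
        return tokens;
    }
}
```

With the input a, a tab, "b c" in quotes, a tab, and d, this yields the three tokens a, b c and d. Empty fields between two adjacent tabs are dropped, which may or may not be what the caller wants.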

License: CC-BY-SA with attribution

Not affiliated with StackOverflow