Splitting a string while ignoring quoted sections

https://stackoverflow.com/questions/6209

08-06-2019

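Question

Given a string like this:

a,"string, with",various,"values, and some",quoted

What is a good algorithm to split this on commas while ignoring the commas inside the quoted sections?

The output should be an array: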

[ "a", "string, with", "various", "values, and some", "quoted" ]

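Solution

If my language of choice didn't offer a way to do this without thinking, I would initially consider two options as the easy way out:

Pre-parse the string and replace the commas inside quoted sections with another control character, split on commas, then post-parse the array to turn the control character back into a comma.
Alternatively, split on commas, then post-parse the resulting array into another array, checking each entry for a leading quote and concatenating entries until a terminating quote is reached.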
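As a rough illustration of the second option, here is a minimal Python sketch. The function name `split_then_rejoin` and its parameters are made up for this example, and it assumes quotes only appear at the very start and end of a field (no escaped quotes).

```python
def split_then_rejoin(text, sep=",", quote='"'):
    """Split on sep, then glue back together pieces that belong to one quoted field."""
    result = []
    pending = None  # pieces of a quoted field still waiting for its closing quote
    for piece in text.split(sep):
        if pending is not None:
            pending.append(piece)
            if piece.endswith(quote):
                # Closing quote reached: rejoin and drop the surrounding quotes.
                result.append(sep.join(pending)[1:-1])
                pending = None
        elif piece.startswith(quote) and len(piece) > 1 and piece.endswith(quote):
            # Quoted field that contains no separator at all.
            result.append(piece[1:-1])
        elif piece.startswith(quote):
            # Opening quote without a closing one yet: start collecting.
            pending = [piece]
        else:
            result.append(piece)
    if pending is not None:
        raise ValueError("unterminated quoted section")
    return result


print(split_then_rejoin('a,"string, with",various,"values, and some",quoted'))
# ['a', 'string, with', 'various', 'values, and some', 'quoted']
```

Splitting first and repairing afterwards keeps the code short, but it only works when the separator is a fixed string that can be reinserted on rejoin.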

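This is a hack, however, and if this is a purely 'mental' exercise I suspect it will not be very helpful. If this is a real-world problem, it would help to know the language so that we could offer some specific advice.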

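Other tips

Looks like you've got some good answers here.

For those of you looking to handle your own CSV file parsing, heed the advice of the experts and don't roll your own CSV parser.

Your first thought is, "I need to handle commas inside of quotes."

Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."

That way lies madness. Don't write your own. Find a library with extensive unit test coverage that handles all the hard parts and has gone through hell for you. For .NET, use the free FileHelpers library.

Python: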

import csv

# newline="" lets the csv module handle line endings itself
with open("some.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)
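The same module can also be pointed at an in-memory string, which is closer to the question as asked. The sketch below uses only the standard library. The sample strings are the one from the question plus an invented one with doubled quotes.

```python
import csv

line = 'a,"string, with",various,"values, and some",quoted'

# csv.reader accepts any iterable of strings, so a one-element list works.
fields = next(csv.reader([line]))
print(fields)
# ['a', 'string, with', 'various', 'values, and some', 'quoted']

# Doubled quotes inside a quoted field are the default escape (doublequote=True).
escaped = next(csv.reader(['a,"He said ""hi"", ok",b']))
print(escaped)
# ['a', 'He said "hi", ok', 'b']
```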

Of course using a CSV parser is better, but just for the fun of it you could:

Loop on the string letter by letter.
    If current_letter == quote : 
        toggle inside_quote variable.
    Else if (current_letter ==comma and not inside_quote) : 
        push current_word into array and clear current_word.
    Else 
        append the current_letter to current_word
When the loop is done push the current_word into array

The author here has dropped in a blob of C# code that handles the scenario you're having a problem with:

CSV File Imports In .Net

It shouldn't be too difficult to translate.

What if an odd number of quotes appears in the original string?
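One possible policy is to treat an unbalanced quote as an error. The sketch below is a toggle-based splitter with that check added. The name `split_checked` is hypothetical, and rejecting the input is only one of several reasonable choices.

```python
def split_checked(text, sep=",", quote='"'):
    """Split on sep outside quotes; reject input with an odd number of quotes."""
    fields = []
    current = []
    inside_quote = False
    for ch in text:
        if ch == quote:
            inside_quote = not inside_quote  # quote characters themselves are dropped
        elif ch == sep and not inside_quote:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    if inside_quote:
        # Still inside a quoted section at the end: the quotes were unbalanced.
        raise ValueError("odd number of quote characters")
    fields.append("".join(current))
    return fields


print(split_checked('a,"b,c",d'))
# ['a', 'b,c', 'd']
```

Calling `split_checked('a,"b,c')` raises `ValueError` instead of silently returning a merged last field.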

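This looks uncannily like CSV parsing, which has some peculiarities in how it handles quoted fields. A field is only escaped if the whole field is delimited with double quotes, so:

field1, "field2, field3", field4, "field5, field6" field7

becomes

field1

field2, field3

field4

"field5

field6" field7

Notice that if a field does not both start and end with a quote, it is not a quoted field, and the double quotes are simply treated as double quotes.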
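The rule above can be sketched in Python roughly as follows. The function name `split_csv_like` is invented for illustration. The sketch assumes spaces around fields are insignificant and does not handle escaped quotes.

```python
def split_csv_like(line):
    """Treat a field as quoted only if it both starts and ends with a double quote."""
    fields = []
    i, n = 0, len(line)
    while i <= n:
        # Look past leading spaces to see whether the field opens with a quote.
        start = i
        while start < n and line[start] == " ":
            start += 1
        if start < n and line[start] == '"':
            close = line.find('"', start + 1)
            if close != -1:
                after = close + 1
                while after < n and line[after] == " ":
                    after += 1
                # Quoted only if the closing quote is the last thing in the field.
                if after == n or line[after] == ",":
                    fields.append(line[start + 1:close])
                    i = after + 1
                    continue
        # Plain field: everything up to the next comma, quotes kept literally.
        comma = line.find(",", i)
        if comma == -1:
            comma = n
        fields.append(line[i:comma].strip())
        i = comma + 1
    return fields


print(split_csv_like('field1, "field2, field3", field4, "field5, field6" field7'))
# ['field1', 'field2, field3', 'field4', '"field5', 'field6" field7']
```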

Incidentally, if I recall correctly, my code that someone linked to doesn't actually handle this properly.

Here's a simple Python implementation based on Pat's pseudocode:

def splitIgnoringSingleQuote(string, split_char, remove_quotes=False):
    string_split = []
    current_word = ""
    inside_quote = False
    for letter in string:
        if letter == "'":
            # Keep the quote character unless the caller asked to strip it.
            if not remove_quotes:
                current_word += letter
            inside_quote = not inside_quote
        elif letter == split_char and not inside_quote:
            string_split.append(current_word)
            current_word = ""
        else:
            current_word += letter
    string_split.append(current_word)
    return string_split

I use this to parse strings. I'm not sure whether it helps here, but perhaps it would with some minor modifications:

function getstringbetween($string, $start, $end){
    $string = " ".$string;
    $ini = strpos($string,$start);
    if ($ini == 0) return "";
    $ini += strlen($start);   
    $len = strpos($string,$end,$ini) - $ini;
    return substr($string,$ini,$len);
}

$fullstring = "this is my [tag]dog[/tag]";
$parsed = getstringbetween($fullstring, "[tag]", "[/tag]");

echo $parsed; // (result = dog)

/mp

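This is standard CSV-style parsing. A lot of people try to do this with regular expressions. You can get to about 90% with regexes, but you really need a real CSV parser to do it properly. A few months ago I found a fast, excellent C# CSV parser on CodeProject that I highly recommend!

Here's one in pseudocode (a.k.a. Python), in one pass :-P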

def parsecsv(instr):
    i = 0
    j = 0

    outstrs = []

    # i is fixed until a match occurs, then it advances
    # up to j. j inches forward each time through:

    while i < len(instr):

        if j < len(instr) and instr[j] == '"':
            # skip the opening quote...
            j += 1
            # then iterate until we find a closing quote,
            # stopping at the end of the input if there is none.
            while j < len(instr) and instr[j] != '"':
                j += 1
            if j == len(instr):
                raise Exception("Unmatched double quote at end of input.")

        if j == len(instr) or instr[j] == ',':
            s = instr[i:j]  # get the substring we've found
            s = s.strip()    # remove extra whitespace

            # remove surrounding quotes if they're there
            if len(s) > 2 and s[0] == '"' and s[-1] == '"':
                s = s[1:-1]

            # add it to the result
            outstrs.append(s)

            # skip over the comma, move i up (to where
            # j will be at the end of the iteration)
            i = j+1

        j = j+1

    return outstrs

def testcase(instr, expected):
    outstr = parsecsv(instr)
    print(outstr)
    assert expected == outstr

# Doesn't handle things like '1, 2, "a, b, c" d, 2' or
# escaped quotes, but those can be added pretty easily.

testcase('a, b, "1, 2, 3", c', ['a', 'b', '1, 2, 3', 'c'])
testcase('a,b,"1, 2, 3" , c', ['a', 'b', '1, 2, 3', 'c'])

# odd number of quotes gives a "unmatched quote" exception
#testcase('a,b,"1, 2, 3" , "c', ['a', 'b', '1, 2, 3', 'c'])

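Here's a simple algorithm:

Determine whether the string begins with a '"' character.
Split the string into an array delimited by the '"' character.
Mark the quoted commas with a placeholder, #COMMA#.
- If the input starts with a '"', mark the items in the array whose index % 2 == 0.
- Otherwise mark the items in the array whose index % 2 == 1.
Concatenate the items in the array to form a modified input string.
Split the string into an array delimited by the ',' character.
Replace every instance of the #COMMA# placeholder in the array with the ',' character.
The array is your output.

Here's the Python implementation:
(fixed to handle '"a,b",c,"d,e,f,h","i,j,k"')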

def parse_input(input):

    # 0 if the input starts with a quote (quoted parts at even indices), else 1
    quote_mod = int(not input.startswith('"'))

    # split on quotes and drop the empty pieces
    input = [item for item in input.split('"') if item != '']
    for i in range(len(input)):
        if i % 2 == quote_mod:
            input[i] = input[i].replace(",", "#COMMA#")

    # split on the remaining (unquoted) commas and drop the empty pieces
    input = [item for item in "".join(input).split(",") if item != '']
    for i in range(len(input)):
        input[i] = input[i].replace("#COMMA#", ",")
    return input

# parse_input('a,"string, with",various,"values, and some",quoted')
#  -> ['a', 'string, with', 'various', 'values, and some', 'quoted']
# parse_input('"a,b",c,"d,e,f,h","i,j,k"')
#  -> ['a,b', 'c', 'd,e,f,h', 'i,j,k']

I just couldn't resist seeing if I could make it work in a Python one-liner:

import re

arr = [i.replace("|", ",") for i in re.sub(r'"([^"]*),([^"]*)"', r"\g<1>|\g<2>", str_to_test).split(",")]

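Returns ['a', 'string, with', 'various', 'values, and some', 'quoted'].

It works by first replacing the ',' inside quotes with another separator (|), then splitting the string on ',' and finally turning the | separator back into ','.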
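Note that this pattern replaces only one comma per quoted section, so an input such as "d,e,f,h" would still be split apart. A hedged variant that protects every comma inside a quoted section could pass a function to `re.sub`. The name `split_outside_quotes` and the NUL placeholder are assumptions made for this sketch, and the placeholder must not occur in the input.

```python
import re


def split_outside_quotes(text, placeholder="\x00"):
    """Hide every comma inside double quotes, split on commas, then restore them."""
    protected = re.sub(
        r'"([^"]*)"',
        # Drop the surrounding quotes and hide all commas inside the section.
        lambda m: m.group(1).replace(",", placeholder),
        text,
    )
    return [part.replace(placeholder, ",") for part in protected.split(",")]


print(split_outside_quotes('"a,b",c,"d,e,f,h","i,j,k"'))
# ['a,b', 'c', 'd,e,f,h', 'i,j,k']
```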

Since you said language agnostic, I wrote my algorithm in the language that is as close to pseudocode as possible:

def find_character_indices(s, ch):
    return [i for i, ltr in enumerate(s) if ltr == ch]


def split_text_preserving_quotes(content, include_quotes=False):
    quote_indices = find_character_indices(content, '"')

    output = content[:quote_indices[0]].split()

    for i in range(1, len(quote_indices)):
        if i % 2 == 1: # end of quoted sequence
            start = quote_indices[i - 1]
            end = quote_indices[i] + 1
            output.extend([content[start:end]])

        else:
            start = quote_indices[i - 1] + 1
            end = quote_indices[i]
            split_section = content[start:end].split()
            output.extend(split_section)

    # after the loop: add whatever follows the last quote
    output += content[quote_indices[-1] + 1:].split()

    return output

License: CC-BY-SA with attribution

Not affiliated with StackOverflow