Parsing a CSV file using gawk

https://stackoverflow.com/questions/314384

10-07-2019

Question

How do you parse a CSV file using gawk? Simply setting FS="," is not enough, because a quoted field with a comma inside it will be treated as multiple fields.

Example using FS="," which does not work:

File contents:

one,two,"three, four",five
"six, seven",eight,"nine"

gawk script:

BEGIN { FS="," }
{
  for (i=1; i<=NF; i++) printf "field #%d: %s\n", i, $(i)
  printf "---------------------------\n"
}

Bad output:

field #1: one
field #2: two
field #3: "three
field #4:  four"
field #5: five
---------------------------
field #1: "six
field #2:  seven"
field #3: eight
field #4: "nine"
---------------------------

Desired output:

field #1: one
field #2: two
field #3: "three, four"
field #4: five
---------------------------
field #1: "six, seven"
field #2: eight
field #3: "nine"
---------------------------

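Solution

The short answer is "I wouldn't use gawk to parse CSV if the CSV contains awkward data", where 'awkward' means things like commas in the CSV field data.

The next question is "What other processing are you going to be doing?", since that will influence which alternatives you use.

I'd probably use Perl and the Text::CSV or Text::CSV_XS modules to read and process the data. Remember, Perl was originally written in part as an awk and sed killer, hence the a2p and s2p programs, long distributed with Perl (they were dropped from the core distribution in Perl 5.22), which convert awk and sed scripts (respectively) into Perl.

Other tips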

The gawk version 4 manual says to use FPAT = "([^,]*)|(\"[^\"]+\")"

When FPAT is defined, it disables the FS separator and specifies fields by content instead of by separator.
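
To make "fields by content" concrete, here is a rough Python analogue of that pattern (it is not gawk, only an illustration of the idea). Python's `re` module tries alternatives left to right instead of choosing the longest match as awk does, so the quoted alternative is listed first, and `+` is used in place of `*`, which means this sketch silently skips empty fields. The names `FIELD_PATTERN` and `split_by_content` are made up for the example.

```python
import re

# Hypothetical Python counterpart of FPAT: describe what a field looks like
# (a double-quoted run, or a run of non-comma characters) instead of
# describing the separator.
FIELD_PATTERN = re.compile(r'"[^"]+"|[^,]+')

def split_by_content(line):
    """Return the fields of one line, matched by content; quotes are kept."""
    return FIELD_PATTERN.findall(line)

for line in ['one,two,"three, four",five', '"six, seven",eight,"nine"']:
    for number, field in enumerate(split_by_content(line), start=1):
        print(f"field #{number}: {field}")
    print("---------------------------")
```

Like the FPAT pattern itself, this does not handle doubled quotes ("") inside a quoted field.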

You can use a simple wrapper program called csvquote to sanitize the input and restore it after awk is done processing it. Pipe your data through it at the start and at the end, and everything should work out OK:

Before:

gawk -f myprogram.awk input.csv

After:

csvquote input.csv | gawk -f myprogram.awk | csvquote -u

See https://github.com/dbro/csvquote for code and documentation.
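
The sanitize-then-restore idea can be sketched in Python as follows. This only illustrates the approach and is not csvquote's actual code: the function names and the choice of the ASCII unit separator (`\x1f`) as the placeholder are assumptions for this example, and newlines embedded in quoted fields are not handled.

```python
PLACEHOLDER = "\x1f"  # assumed stand-in character; must not occur in the data

def sanitize(line, sep=","):
    """Replace separators that sit inside double-quoted fields with PLACEHOLDER."""
    out = []
    in_quotes = False
    for ch in line:
        if ch == '"':
            # A doubled quote ("") toggles twice, so the state is unchanged.
            in_quotes = not in_quotes
        elif ch == sep and in_quotes:
            ch = PLACEHOLDER
        out.append(ch)
    return "".join(out)

def restore(text, sep=","):
    """Undo sanitize() by putting the separators back."""
    return text.replace(PLACEHOLDER, sep)

line = 'one,two,"three, four",five'
safe = sanitize(line)
fields = safe.split(",")  # naive splitting is safe on the sanitized line
print([restore(field) for field in fields])
```

Between the two steps, any tool that splits on plain commas sees the expected number of fields.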

If permissible, I would use the Python csv module to parse your CSV file, paying special attention to the dialect used and the formatting parameters required.
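
A minimal sketch of that approach, using the sample data from the question and the csv module's default "excel" dialect (comma delimiter, double-quote quotechar). The helper name `parse_rows` and the in-memory `SAMPLE` string are illustrative; for a real file you would pass `open(path, newline="")` instead.

```python
import csv
import io

# Sample data from the question, held in memory for the example.
SAMPLE = 'one,two,"three, four",five\n"six, seven",eight,"nine"\n'

def parse_rows(stream):
    """Return every record in the stream as a list of fields."""
    # csv.reader handles quoted commas; the default dialect is "excel".
    return list(csv.reader(stream))

for row in parse_rows(io.StringIO(SAMPLE)):
    for number, field in enumerate(row, start=1):
        print(f"field #{number}: {field}")
    print("---------------------------")
```

Note that, unlike the desired output in the question, the reader strips the surrounding quotes from each field.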

csv2delim.awk

# csv2delim.awk converts comma delimited files with optional quotes to delim separated file
#     delim can be any character, defaults to tab
# assumes no repl characters in text, any delim in line converts to repl
#     repl can be any character, defaults to ~
# changes two consecutive quotes within quotes to '

# usage: gawk -f csv2delim.awk [-v delim=d] [-v repl=`"] input-file > output-file
#       -v delim    delimiter, defaults to tab
#       -v repl     replacement char, defaults to ~

# e.g. gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > test.txt

# abe 2-28-7
# abe 8-8-8 1.0 fixed empty fields, added replacement option
# abe 8-27-8 1.1 used split
# abe 8-27-8 1.2 inline rpl and "" = '
# abe 8-27-8 1.3 revert to 1.0 as it is much faster, split most of the time
# abe 8-29-8 1.4 better message if delim present

BEGIN {
    if (delim == "") delim = "\t"
    if (repl == "") repl = "~"
    print "csv2delim.awk v.m 1.4 run at " strftime() > "/dev/stderr"
}

{
    #if ($0 ~ repl) {
    #   print "Replacement character " repl " is on line " FNR ":" $0 ";" > "/dev/stderr"
    #}
    if ($0 ~ delim) {
        print "Temp delimiter character " delim " is on line " FNR ":" $0 ";" > "/dev/stderr"
        print "    replaced by " repl > "/dev/stderr"
    }
    gsub(delim, repl)

    $0 = gensub(/([^,])\"\"/, "\\1'", "g")
#   $0 = gensub(/\"\"([^,])/, "'\\1", "g")  # not needed above covers all cases

    out = ""
    #for (i = 1;  i <= length($0);  i++)
    n = length($0)
    for (i = 1;  i <= n;  i++)
        if ((ch = substr($0, i, 1)) == "\"")
            inString = (inString) ? 0 : 1 # toggle inString
        else
            out = out ((ch == "," && ! inString) ? delim : ch)
    print out
}

END {
    print NR " records processed from " FILENAME " at " strftime() > "/dev/stderr"
}

test.csv

"first","second","third"
"fir,st","second","third"
"first","sec""ond","third"
" first ",sec   ond,"third"
"first" , "second","th  ird"
"first","sec;ond","third"
"first","second","th;ird"
1,2,3
,2,3
1,2,
,2,
1,,2
1,"2",3
"1",2,"3"
"1",,"3"
1,"",3
"","",""
"","""aiyn","oh"""
"""","""",""""
11,2~2,3

test.bat

rem test csv2delim
rem default is: -v delim={tab} -v repl=~
gawk                      -f csv2delim.awk test.csv > test.txt
gawk -v delim=;           -f csv2delim.awk test.csv > testd.txt
gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > testdr.txt
gawk            -v repl=` -f csv2delim.awk test.csv > testr.txt

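I am not exactly sure whether this is the right way to do things. I would rather work on a CSV file in which either all values are quoted or none are. By the way, awk allows regexes to be field separators. Check whether that is useful.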

{
  ColumnCount = 0
  $0 = $0 ","                           # Assures all fields end with comma
  while($0)                             # Get fields by pattern, not by delimiter
  {
    match($0, / *"[^"]*" *,|[^,]*,/)    # Find a field with its delimiter suffix
    Field = substr($0, RSTART, RLENGTH) # Get the located field with its delimiter
    gsub(/^ *"?|"? *,$/, "", Field)     # Strip delimiter text: comma/space/quote
    Column[++ColumnCount] = Field       # Save field without delimiter in an array
    $0 = substr($0, RLENGTH + 1)        # Remove processed text from the raw data
  }
}

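Patterns that follow this one can access the fields in Column[]. ColumnCount indicates the number of elements found in Column[]. If not all rows contain the same number of columns, Column[] contains leftover data after Column[ColumnCount] when processing the shorter rows.

This implementation is slow, but it appears to emulate the FPAT/patsplit() feature found in gawk >= 4.0.0 mentioned in a previous answer.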

Here is what I came up with. Any comments and/or better solutions would be appreciated.

BEGIN { FS="," }
{
  n = 0                                   # reset the field count for each record
  for (i=1; i<=NF; i++) {
    f[++n] = $i
    if (substr(f[n],1,1)=="\"") {
      # keep appending pieces until the field ends in an unescaped quote
      while (i<NF && (substr(f[n], length(f[n]))!="\"" || substr(f[n], length(f[n])-1, 1)=="\\")) {
        f[n] = sprintf("%s,%s", f[n], $(++i))
      }
    }
  }
  for (i=1; i<=n; i++) printf "field #%d: %s\n", i, f[i]
  print "----------------------------------\n"
}

The basic idea is that I loop through the fields, and any field that starts with a quote but does not end with a quote gets the next field appended to it.
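
The same re-joining idea, sketched in Python for comparison. This is an illustration rather than a translation of the awk script: `rejoin_quoted` is a made-up name, and the length check for a piece consisting of a lone quote is an extra assumption added here.

```python
def rejoin_quoted(pieces):
    """Merge pieces of a naive comma split that belong to one quoted field."""
    fields = []
    i = 0
    while i < len(pieces):
        field = pieces[i]
        if field.startswith('"'):
            # Append following pieces until the field ends with an unescaped
            # quote. A lone '"' both starts and ends with the same character,
            # so it is treated as still open.
            while (len(field) < 2
                   or not field.endswith('"')
                   or field.endswith('\\"')) and i + 1 < len(pieces):
                i += 1
                field = field + "," + pieces[i]
        fields.append(field)
        i += 1
    return fields

for line in ['one,two,"three, four",five', '"six, seven",eight,"nine"']:
    for number, field in enumerate(rejoin_quoted(line.split(",")), start=1):
        print(f"field #{number}: {field}")
    print("----------------------------------")
```

The `i + 1 < len(pieces)` guard stops the inner loop at the end of the line when a quote is never closed.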

Perl has the Text::CSV_XS module, which is purpose-built to handle the quoted-comma weirdness.
Alternatively, try the Text::CSV module.

perl -MText::CSV_XS -ne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){@f=$csv->fields();for $n (0..$#f) {print "field #$n: $f[$n]\n"};print "---\n"}' file.csv

This produces the following output:

field #0: one
field #1: two
field #2: three, four
field #3: five
---
field #0: six, seven
field #1: eight
field #2: nine
---

Here is a human-readable version.
Save it as parsecsv, chmod +x it, and run it as "parsecsv file.csv".

#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new();
open(my $data, '<', $ARGV[0]) or die "Could not open '$ARGV[0]' $!\n";
while (my $line = <$data>) {
    if ($csv->parse($line)) {
        my @f = $csv->fields();
        for my $n (0..$#f) {
            print "field #$n: $f[$n]\n";
        }
        print "---\n";
    }
}

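The Text::CSV_XS module may not be installed for your default version of Perl, so you may need to point to a different version of Perl on your machine. If the module is missing, Perl fails with an error like this: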

Can't locate Text/CSV_XS.pm in @INC (@INC contains: /home/gnu/lib/perl5/5.6.1/i686-linux /home/gnu/lib/perl5/5.6.1 /home/gnu/lib/perl5/site_perl/5.6.1/i686-linux /home/gnu/lib/perl5/site_perl/5.6.1 /home/gnu/lib/perl5/site_perl .).
BEGIN failed--compilation aborted.

If none of your versions of Perl has Text::CSV_XS installed, you will need to run:
sudo apt-get install cpanminus
sudo cpanm Text::CSV_XS

License: CC-BY-SA with attribution
