Question
How do you parse a CSV file using GAWK? Simply setting FS="," is not enough,
because a quoted field containing an internal comma will be treated as multiple fields.
Example where FS="," does not work:
File contents:
one,two,"three, four",five
"six, seven",eight,"nine"
GAWK script:
BEGIN { FS="," }
{
for (i=1; i<=NF; i++) printf "field #%d: %s\n", i, $(i)
printf "---------------------------\n"
}
Bad output:
field #1: one
field #2: two
field #3: "three
field #4: four"
field #5: five
---------------------------
field #1: "six
field #2: seven"
field #3: eight
field #4: "nine"
---------------------------
Desired output:
field #1: one
field #2: two
field #3: "three, four"
field #4: five
---------------------------
field #1: "six, seven"
field #2: eight
field #3: "nine"
---------------------------
Solution
The short answer is "I wouldn't use GAWK to parse CSV if the CSV contains awkward data", where "awkward" means things like commas inside the CSV field data.
The next question is "What other processing are you going to be doing?", since that will influence which alternatives you use.
I'd probably use Perl and the Text::CSV or Text::CSV_XS modules to read and process the data. Remember, Perl was originally written in part as an awk and sed killer - hence the a2p and s2p programs still distributed with Perl, which convert awk and sed scripts (respectively) into Perl.
Other tips
The GAWK version 4 manual says to use FPAT = "([^,]*)|(\"[^\"]+\")".
When FPAT is defined, it disables FS and specifies fields by content instead of by separator.
You can use a simple wrapper program called csvquote to sanitize the input and then restore it after awk has finished processing. Pipe your data through it at the beginning and at the end, and everything should work out OK:
Before:
gawk -f myprogram.awk input.csv
After:
csvquote input.csv | gawk -f myprogram.awk | csvquote -u
See https://github.com/dbro/csvquote for the code and documentation.
If it is permissible, I would use the Python csv module, paying special attention to the dialect used and the formatting parameters (https://docs.python.org/3/library/csv.html#dialects-and-formatting-parameters), to parse the CSV file you have.
csv2delim.awk
# csv2delim.awk converts comma delimited files with optional quotes to delim separated file
# delim can be any character, defaults to tab
# assumes no repl characters in text, any delim in line converts to repl
# repl can be any character, defaults to ~
# changes two consecutive quotes within quotes to '
# usage: gawk -f csv2delim.awk [-v delim=d] [-v repl=r] input-file > output-file
# -v delim delimiter, defaults to tab
# -v repl replacement char, defaults to ~
# e.g. gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > test.txt
# abe 2-28-7
# abe 8-8-8 1.0 fixed empty fields, added replacement option
# abe 8-27-8 1.1 used split
# abe 8-27-8 1.2 inline rpl and "" = '
# abe 8-27-8 1.3 revert to 1.0 as it is much faster, split most of the time
# abe 8-29-8 1.4 better message if delim present
BEGIN {
if (delim == "") delim = "\t"
if (repl == "") repl = "~"
print "csv2delim.awk v.m 1.4 run at " strftime() > "/dev/stderr" ###########################################
}
{
#if ($0 ~ repl) {
#	print "Replacement character " repl " is on line " FNR ": " $0 > "/dev/stderr"
#}
if ($0 ~ delim) {
	print "Temp delimiter character " delim " is on line " FNR ": " $0 > "/dev/stderr"
	print "    replaced by " repl > "/dev/stderr"
}
gsub(delim, repl)
$0 = gensub(/([^,])\"\"/, "\\1'", "g")
# $0 = gensub(/\"\"([^,])/, "'\\1", "g") # not needed above covers all cases
out = ""
#for (i = 1; i <= length($0); i++)
n = length($0)
for (i = 1; i <= n; i++)
if ((ch = substr($0, i, 1)) == "\"")
inString = (inString) ? 0 : 1 # toggle inString
else
out = out ((ch == "," && ! inString) ? delim : ch)
print out
}
END {
print NR " records processed from " FILENAME " at " strftime() > "/dev/stderr"
}
test.csv
"first","second","third"
"fir,st","second","third"
"first","sec""ond","third"
" first ",sec ond,"third"
"first" , "second","th ird"
"first","sec;ond","third"
"first","second","th;ird"
1,2,3
,2,3
1,2,
,2,
1,,2
1,"2",3
"1",2,"3"
"1",,"3"
1,"",3
"","",""
"","""aiyn","oh"""
"""","""",""""
11,2~2,3
test.bat
rem test csv2delim
rem default is: -v delim={tab} -v repl=~
gawk -f csv2delim.awk test.csv > test.txt
gawk -v delim=; -f csv2delim.awk test.csv > testd.txt
gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > testdr.txt
gawk -v repl=` -f csv2delim.awk test.csv > testr.txt
I am not sure whether this is the right way to do things. I would rather work on a CSV file in which either all values are quoted or none are. By the way, awk allows a regular expression to be a field separator. Check whether that is useful.
{
ColumnCount = 0
$0 = $0 "," # Assures all fields end with comma
while($0) # Get fields by pattern, not by delimiter
{
match($0, / *"[^"]*" *,|[^,]*,/) # Find a field with its delimiter suffix
Field = substr($0, RSTART, RLENGTH) # Get the located field with its delimiter
gsub(/^ *"?|"? *,$/, "", Field) # Strip delimiter text: comma/space/quote
Column[++ColumnCount] = Field # Save field without delimiter in an array
$0 = substr($0, RLENGTH + 1) # Remove processed text from the raw data
}
}
Following this pattern, the fields are accessible in Column[], and ColumnCount indicates the number of elements found in Column[]. If not all rows contain the same number of columns, Column[] still holds leftover data beyond Column[ColumnCount] from a longer row while a shorter row is being processed.
This implementation is slow, but it appears to emulate the FPAT / patsplit() feature found in gawk >= 4.0.0, mentioned in a previous answer.
This is what I came up with. Any comments and/or better solutions would be appreciated.
BEGIN { FS="," }
{
for (i=1; i<=NF; i++) {
f[++n] = $i
if (substr(f[n],1,1)=="\"") {
while (substr(f[n], length(f[n]))!="\"" || substr(f[n], length(f[n])-1, 1)=="\\") {
f[n] = sprintf("%s,%s", f[n], $(++i))
}
}
}
for (i=1; i<=n; i++) printf "field #%d: %s\n", i, f[i]
print "----------------------------------\n"
}
The basic idea is that I loop through the fields, and any field which starts with a quote but does not end with one gets the next field appended to it.
Perl has the Text::CSV_XS module, which is purpose-built to handle the quoted-comma weirdness.
Alternatively, try the Text::CSV module.
perl -MText::CSV_XS -ne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){@f=$csv->fields();for $n (0..$#f) {print "field #$n: $f[$n]\n"};print "---\n"}' file.csv
That produces this output:
field #0: one
field #1: two
field #2: three, four
field #3: five
---
field #0: six, seven
field #1: eight
field #2: nine
---
Here is a human-readable version.
Save it as parsecsv, chmod +x it, and run it as "parsecsv file.csv".
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new();
open(my $data, '<', $ARGV[0]) or die "Could not open '$ARGV[0]' $!\n";
while (my $line = <$data>) {
if ($csv->parse($line)) {
my @f = $csv->fields();
for my $n (0..$#f) {
print "field #$n: $f[$n]\n";
}
print "---\n";
}
}
You may need to point at a different version of perl on your machine, since the Text::CSV_XS module may not be installed for the default version of perl:
Can't locate Text/CSV_XS.pm in @INC (@INC contains: /home/gnu/lib/perl5/5.6.1/i686-linux /home/gnu/lib/perl5/5.6.1 /home/gnu/lib/perl5/site_perl/5.6.1/i686-linux /home/gnu/lib/perl5/site_perl/5.6.1 /home/gnu/lib/perl5/site_perl .).
BEGIN failed--compilation aborted.
If Text::CSV_XS is not installed for your version of Perl, you will need to run:
sudo apt-get install cpanminus
sudo cpanm Text::CSV_XS
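A quick way to check whether the module is already available to the perl on your PATH before installing anything (this is just a load test, not part of the answer above):

```shell
# Try to load Text::CSV_XS; report whether this perl can find it.
if perl -MText::CSV_XS -e 1 2>/dev/null; then
    echo "Text::CSV_XS installed"
else
    echo "Text::CSV_XS missing"
fi
```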