I am trying to read a huge CSV file into R, but I am having trouble: the string columns are not enclosed in quotes, so every newline inside a field starts a new row. My data is delimited by ~.

For example, my data looks something like this:

a ~ b ~ c ~ d ~ e
1 ~ name1 ~ This is a paragraph. 

This is a second paragraph.

~ num1 ~ num2 ~

2 ~ name2 ~ This is an new set of paragraph.

~ num1 ~ num2 ~

I hope to get something like this:

a |      b     |         c                                        |  d     |   e   |
____________________________________________________________________________________
1 |    name1  | This is a paragraph. This is a second paragraph.  |  num1  | num2  |

2 |    name2  | This is a new set of paragraph.                   |  num1  | num2  |

But I end up with something ugly like this:

a                          | b     | c                               | d    | e    |
______________________________________________________________________________________
1                          | name1 | This is a paragraph.            |      |      |
This is a second paragraph |       |                                 |      |      |
                           | num1  | num2                            |      |      |
2                          | name2 | This is a new set of paragraph. | num1 | num2 |

I tried to set allowEscapes = TRUE in read.csv but that didn't do the trick. My input currently looks like this:

read.csv(filename, header = TRUE, sep = '~', stringsAsFactors = FALSE, fileEncoding = "latin1", quote = "", strip.white = TRUE)

My next idea is to insert a quotation mark after each ~, but I am hoping there is a better method.

Any help would be appreciated.

Solution

Something like this, for example:

ll = readLines(textConnection('a ~ b ~ c ~ d ~ e
1 ~ name1 ~ This is a paragraph. 
This is a second paragraph.
~ num1 ~ num2 ~
2 ~ name2 ~ This is an new set of paragraph.
~ num1 ~ num2 ~'))
## Each record begins with a number followed by a space;
## use that pattern to separate the records
llines <- split(ll[-1],cumsum(grepl('^[0-9]+ ',ll[-1])))
## Prepend the header to the split and concatenated lines.
## Each record ends with a ~, so it has one more field than the header,
## and read.table therefore uses the first field as row names
read.table(text=unlist(c(ll[1],lapply(llines,paste,collapse=''))),
           sep='~',header=TRUE)


         a                                                 b      c      d  e
1   name1   This is a paragraph. This is a second paragraph.  num1   num2  NA
2   name2                   This is an new set of paragraph.  num1   num2  NA

Other tips

Here is an approach in R that depends on (1) ~ being a true delimiter that doesn't appear in any of your paragraphs and (2) ~ appearing at the end of each record.

But first, some sample data (in a way that others can also reproduce your problem).

cat("a ~ b ~ c ~ d ~ e",
    "1 ~ name1 ~ This is a paragraph.",
    "",
    "This is a second paragraph.",
    "",
    "~ num1 ~ num2 ~",
    "",
    "2 ~ name2 ~ This is an new set of paragraph.",
    "",
    "~ num1 ~ num2 ~", sep = "\n", file = "test.txt")

We'll start with readLines to get the data in. We'll also add a ~ at the end of the header row.

x <- readLines("test.txt")
x[1] <- paste(x[1], "~") ## Add a ~ at the end of the first line

Now, we'll paste everything into a nice long string.

y <- paste(x, collapse = " ")

Use scan to quickly "read" the data again, but instead of using the file argument, we'll use the text argument and refer to the "y" object we just created. Since the last line ends with a ~ there will be an extra "" at the end, which we will remove before proceeding.

z <- scan(text = y, what = character(), sep = "~", strip.white = TRUE)
# Read 16 items
z <- z[-length(z)]

Since we now have a character vector, we can easily convert this to a matrix, and then to a data.frame. We know the colnames are the first 5 values, so we'll drop those when creating the matrix, and reinsert them as the names of the data.frame.

df <- setNames(data.frame(
  matrix(z[6:length(z)], ncol = 5, byrow = TRUE)), z[1:5])
df
#   a     b                                                 c    d    e
# 1 1 name1 This is a paragraph.  This is a second paragraph. num1 num2
# 2 2 name2                  This is an new set of paragraph. num1 num2
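
For comparison, the same idea can be sketched in Python under the same two assumptions (~ never appears inside a paragraph, and every record ends with ~). The function name parse_tilde_text is made up for this illustration, and the column count is taken from the header instead of being hard-coded to 5. Whitespace inside each field is collapsed to single spaces, which is a small departure from the R output above.

```python
def parse_tilde_text(text):
    """Parse '~'-delimited text whose records end with '~' and may span lines."""
    lines = text.splitlines()
    # The header has no trailing '~', so its field count is the column count.
    header = [name.strip() for name in lines[0].split("~")]
    ncol = len(header)
    # Join the remaining lines with spaces, like paste(x, collapse = " ") in R.
    body = " ".join(lines[1:])
    # Split on the delimiter and collapse runs of whitespace inside each field.
    fields = [" ".join(field.split()) for field in body.split("~")]
    # The last record ends with '~', which leaves one empty trailing field.
    if fields and fields[-1] == "":
        fields.pop()
    if len(fields) % ncol != 0:
        raise ValueError("field count is not a multiple of the column count")
    # Cut the flat list of fields into rows of ncol fields each.
    rows = [fields[i:i + ncol] for i in range(0, len(fields), ncol)]
    return header, rows


# Sample data from the question, including the blank lines.
sample = "\n".join([
    "a ~ b ~ c ~ d ~ e",
    "1 ~ name1 ~ This is a paragraph.",
    "",
    "This is a second paragraph.",
    "",
    "~ num1 ~ num2 ~",
    "",
    "2 ~ name2 ~ This is an new set of paragraph.",
    "",
    "~ num1 ~ num2 ~",
])
header, rows = parse_tilde_text(sample)
print(header)
for row in rows:
    print(row)
```

The ValueError check is there because a stray ~ inside a paragraph would silently shift every later field by one column; failing early makes that easier to notice.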

When I saw this was a text-processing problem, I decided Python would be much easier. Apologies if you aren't familiar with it or don't have access to it:

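import csv

def split_record(record):
    # Split on '~', trim each field, and drop the empty field
    # left by the '~' that ends every record
    fields = [field.strip() for field in record.split('~')]
    return fields[:-1] if fields and fields[-1] == '' else fields

all_rows = []
with open('tilded_csv.txt') as in_file:
    header = split_record(next(in_file))
    current_record = ''
    for line in in_file:
        line = line.strip()
        if not line:
            continue  # skip the blank lines between paragraphs
        # Assume that a number at the start of a line
        # signals a new record
        if line[0].isdigit():
            if current_record:
                all_rows.append(split_record(current_record))
            current_record = line
        else:
            # Continuation of the current record
            current_record += ' ' + line
# Add the last record
if current_record:
    all_rows.append(split_record(current_record))

# newline='' lets the csv module control line endings (Python 3)
with open('standard_csv.csv', 'w', newline='') as out_file:
    out_csv = csv.writer(out_file, dialect='excel')
    out_csv.writerow(header)
    out_csv.writerows(all_rows)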
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow