read.csv lines with no quotes in R

Question 1

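For example, you could read the raw lines, group them into records wherever a line starts with a number, collapse each group into a single line, and pass the result to read.table: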

ll = readLines(textConnection('a ~ b ~ c ~ d ~ e
1 ~ name1 ~ This is a paragraph. 
This is a second paragraph.
~ num1 ~ num2 ~
2 ~ name2 ~ This is an new set of paragraph.
~ num1 ~ num2 ~'))
## Each record begins with a number followed by a space;
## use that pattern to split the lines into records
## ([0-9]+ also matches record numbers with more than one digit)
llines <- split(ll[-1],cumsum(grepl('^[0-9]+ ',ll[-1])))
## Collapse each record into one line, prepend the header, and parse
read.table(text=unlist(c(ll[1],lapply(llines,paste,collapse=''))),
           sep='~',header=TRUE)


         a                                                 b      c      d  e
1   name1   This is a paragraph. This is a second paragraph.  num1   num2  NA
2   name2                   This is an new set of paragraph.  num1   num2  NA

Note that each record ends with a trailing ~ and so has one more field than the header. In that case read.table uses the first field as row names, which is why the values appear shifted one column to the left and e is NA.

Question 2

Here is an approach in R that depends on (1) ~ being a true delimiter that doesn't appear in any of your paragraphs and (2) ~ appearing at the end of each record.

But first, some sample data (in a way that others can also reproduce your problem).

cat("a ~ b ~ c ~ d ~ e",
    "1 ~ name1 ~ This is a paragraph.",
    "",
    "This is a second paragraph.",
    "",
    "~ num1 ~ num2 ~",
    "",
    "2 ~ name2 ~ This is an new set of paragraph.",
    "",
    "~ num1 ~ num2 ~", sep = "\n", file = "test.txt")

We'll start with readLines to get the data in. We'll also add a ~ at the end of the header row.

x <- readLines("test.txt")
x[1] <- paste(x[1], "~") ## Add a ~ at the end of the first line

Now, we'll paste everything into a nice long string.

y <- paste(x, collapse = " ")

Use scan to quickly "read" the data again, but instead of using the file argument, we'll use the text argument and refer to the "y" object we just created. Since the last line ends with a ~ there will be an extra "" at the end, which we will remove before proceeding.

z <- scan(text = y, what = character(), sep = "~", strip.white = TRUE)
# Read 16 items
z <- z[-length(z)]

Since we now have a character vector, we can easily convert this to a matrix, and then to a data.frame. We know the colnames are the first 5 values, so we'll drop those when creating the matrix, and reinsert them as the names of the data.frame.

df <- setNames(data.frame(
  matrix(z[6:length(z)], ncol = 5, byrow = TRUE)), z[1:5])
df
#   a     b                                                 c    d    e
# 1 1 name1 This is a paragraph.  This is a second paragraph. num1 num2
# 2 2 name2                  This is an new set of paragraph. num1 num2
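
For comparison, the same flatten-and-reshape idea can be sketched in Python. This is only an illustration under the same two assumptions as above (~ never appears inside a paragraph, and every record ends with ~). The function name parse_tilde_text and its ncol parameter are made up for this example, and unlike the R version it collapses runs of whitespace inside each field.

```python
# Sample text mirroring the test.txt file created above.
SAMPLE = """a ~ b ~ c ~ d ~ e
1 ~ name1 ~ This is a paragraph.

This is a second paragraph.

~ num1 ~ num2 ~

2 ~ name2 ~ This is an new set of paragraph.

~ num1 ~ num2 ~"""


def parse_tilde_text(text, ncol=5):
    """Flatten the text, split on '~', and regroup into rows of ncol fields.

    Hypothetical helper: assumes '~' is a true delimiter and that every
    record (but not the header) ends with '~'.
    """
    lines = text.splitlines()
    lines[0] += " ~"  # give the header the same trailing delimiter as the records
    flat = " ".join(lines)
    # Split on the delimiter and collapse whitespace runs inside each field
    fields = [" ".join(field.split()) for field in flat.split("~")]
    fields = fields[:-1]  # the final '~' leaves one empty field at the end
    if len(fields) % ncol != 0:
        raise ValueError("field count is not a multiple of ncol")
    header, body = fields[:ncol], fields[ncol:]
    rows = [body[i:i + ncol] for i in range(0, len(body), ncol)]
    return header, rows


header, rows = parse_tilde_text(SAMPLE)
for row in rows:
    print(row)
```

The length check is there because a stray ~ inside a paragraph would silently shift every later field into the wrong column.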

Question 3

When I saw this was a text-processing problem, I decided Python would be much easier. Apologies if you aren't familiar with it or don't have access to it:

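import csv

all_rows = []
with open('tilded_csv.txt') as in_file:
    header_line = next(in_file)
    # Strip the whitespace around each column name
    header = [name.strip() for name in header_line.split('~')]
    current_record = ''
    for line in in_file:
        # Assume that a number at the start of a line
        # signals a new record
        if line[0].isdigit():
            if current_record:
                all_rows.append(current_record)
            current_record = line.strip()
        else:
            # Continuation line: join with a space so that
            # paragraphs do not run together
            current_record = (current_record + ' ' + line.strip()).strip()
# Add the last record
if current_record:
    all_rows.append(current_record)

# newline='' (Python 3) stops the csv module from writing blank rows on Windows
with open('standard_csv.csv', 'w', newline='') as out_file:
    out_csv = csv.writer(out_file, dialect='excel')
    out_csv.writerow(header)
    for record in all_rows:
        fields = [field.strip() for field in record.split('~')]
        # Each record ends with '~', so drop the empty field after it
        if fields and fields[-1] == '':
            fields.pop()
        out_csv.writerow(fields)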
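
One reason to write the result with the csv module rather than joining fields by hand is that it adds quotes only where a field needs them. Here is a small sketch of that behaviour using an in-memory buffer; the sample row is invented for illustration and is not from the original data.

```python
import csv
import io

# Hypothetical rows: the third field contains both a comma and double quotes.
rows = [
    ["a", "b", "c", "d", "e"],
    ["1", "name1", 'He said "hi", then left.', "num1", "num2"],
]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
csv.writer(buffer, dialect="excel").writerows(rows)
written = buffer.getvalue()
print(written)

# Read the text back to confirm the fields survive the round trip
round_tripped = list(csv.reader(io.StringIO(written), dialect="excel"))
```

Fields without commas or quotes are written as-is, so the output stays readable while still loading cleanly with read.csv in R.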