Here is an approach in R that depends on (1) ~ being a true delimiter that doesn't appear in any of your paragraphs and (2) ~ appearing at the end of each record.
But first, some sample data, written out with cat so that others can reproduce your problem:
cat("a ~ b ~ c ~ d ~ e",
"1 ~ name1 ~ This is a paragraph.",
"",
"This is a second paragraph.",
"",
"~ num1 ~ num2 ~",
"",
"2 ~ name2 ~ This is an new set of paragraph.",
"",
"~ num1 ~ num2 ~", sep = "\n", file = "test.txt")
We'll start with readLines to get the data in. We'll also add a ~ at the end of the header row, so that it ends with the delimiter just like every record does.
x <- readLines("test.txt")
x[1] <- paste(x[1], "~") ## Add a ~ at the end of the first line
Now, we'll paste everything into one long string.
y <- paste(x, collapse = " ")
Use scan to quickly "read" the data again, but instead of using the file argument, we'll use the text argument and refer to the "y" object we just created. Since the last line ends with a ~, there will be an extra "" at the end, which we will remove before proceeding.
z <- scan(text = y, what = character(), sep = "~", strip.white = TRUE)
# Read 16 items
z <- z[-length(z)]
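Before reshaping, it may be worth confirming that the number of fields is a whole multiple of the number of columns, since a stray ~ inside a paragraph or a record missing its trailing ~ would shift every later value. The following is only a sketch of such a check: it rebuilds the collapsed string from the sample data so that it runs on its own, and the names n_cols and n_records are introduced here for illustration.

```r
# Rebuild the collapsed string from the sample data (header already has its trailing ~).
# paste() joins its arguments with a single space, like paste(x, collapse = " ").
y <- paste("a ~ b ~ c ~ d ~ e ~",
           "1 ~ name1 ~ This is a paragraph.", "",
           "This is a second paragraph.", "",
           "~ num1 ~ num2 ~", "",
           "2 ~ name2 ~ This is an new set of paragraph.", "",
           "~ num1 ~ num2 ~")

z <- scan(text = y, what = character(), sep = "~",
          strip.white = TRUE, quiet = TRUE)
z <- z[-length(z)]   # drop the trailing "" produced by the final ~

n_cols <- 5          # assumed: the number of column names in the header row

# Stop early if the fields cannot be arranged into complete rows.
stopifnot(length(z) %% n_cols == 0)

# Number of data records, not counting the header row.
n_records <- length(z) %/% n_cols - 1
n_records
```

If the stopifnot call fails, the two assumptions stated at the top probably do not hold for the file, and the matrix step would silently misalign the data.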
Since we now have a character vector, we can easily convert it to a matrix, and then to a data.frame. We know the colnames are the first 5 values, so we'll drop those when creating the matrix and reinsert them as the names of the data.frame.
df <- setNames(data.frame(
matrix(z[6:length(z)], ncol = 5, byrow = TRUE)), z[1:5])
df
# a b c d e
# 1 1 name1 This is a paragraph. This is a second paragraph. num1 num2
# 2 2 name2 This is an new set of paragraph. num1 num2
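If you need to do this for more than one file, the steps above could be wrapped in a small function. This is a rough sketch rather than a definitive implementation: read_tilde_records is a made-up name, the column count is inferred from the header row instead of being hard-coded as 5, and three things are added that the steps above do not do, namely passing quote = "" so that quote characters in the paragraphs are kept as literal text, squeezing the runs of spaces left where blank lines were joined, and converting numeric-looking columns with type.convert.

```r
# Hypothetical helper; the name and the extra clean-up steps are assumptions.
read_tilde_records <- function(lines, sep = "~") {
  # Column names come from the header row, so the column count is not hard-coded.
  header <- scan(text = lines[1], what = character(), sep = sep,
                 strip.white = TRUE, quote = "", quiet = TRUE)
  n_cols <- length(header)

  lines[1] <- paste(lines[1], sep)        # give the header a trailing delimiter
  txt <- paste(lines, collapse = " ")     # one long string

  z <- scan(text = txt, what = character(), sep = sep,
            strip.white = TRUE, quote = "", quiet = TRUE)
  # Drop the empty field that follows the final delimiter, if there is one.
  if (length(z) > 0 && z[length(z)] == "") z <- z[-length(z)]
  stopifnot(length(z) %% n_cols == 0)

  body <- z[-seq_len(n_cols)]             # everything after the header values
  df <- setNames(data.frame(matrix(body, ncol = n_cols, byrow = TRUE),
                            stringsAsFactors = FALSE), header)

  # Squeeze internal whitespace, then let R pick a type for each column.
  df[] <- lapply(df, function(col) {
    type.convert(gsub("\\s+", " ", col), as.is = TRUE)
  })
  df
}

sample_lines <- c("a ~ b ~ c ~ d ~ e",
                  "1 ~ name1 ~ This is a paragraph.", "",
                  "This is a second paragraph.", "",
                  "~ num1 ~ num2 ~", "",
                  "2 ~ name2 ~ This is an new set of paragraph.", "",
                  "~ num1 ~ num2 ~")
df <- read_tilde_records(sample_lines)
str(df)
```

For a real file you would pass readLines("test.txt") instead of sample_lines. The same two assumptions about ~ still apply.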