Question

I have a combined table made up of hundreds of sub-tables that are separated by *. The sub-tables all have the same structure: say, col1 is name, col2 is weight, col3 is eye color, etc. I want to remove the * but add a new column to the combined table that tells which sub-table each row originally came from. The new column would look like:

subtable1
subtable1
subtable1
subtable2
subtable2
subtable3
subtable3
subtable3
subtable3

How can I do it in R?


Solution

Here is how I would do this. First, I generate some data to simulate the problem.

text='Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
*******
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
*******
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1'

## read all lines; in your case, pass the file name to readLines() instead
ll <- readLines(textConnection(text))
## flag the sub-table delimiter lines
id <- grepl('\\*+',ll)
## drop the delimiter lines and parse the rest with read.table
dat <- read.table(text=ll[!id])
## label each row with the number of delimiters seen before it, using cumsum
dat$table <- paste0('subtable',cumsum(id)[!id])


     V1  V2   V3 V4  V5  V6   V7   V8    V9 V10 V11 V12 V13     table
1  Mazda RX4 21.0  6 160 110 3.90 2.62 16.46   0   1   4   4 subtable0
2 Datsun 710 22.8  4 108  93 3.85 2.32 18.61   1   1   4   1 subtable0
3  Mazda RX4 21.0  6 160 110 3.90 2.62 16.46   0   1   4   4 subtable1
4 Datsun 710 22.8  4 108  93 3.85 2.32 18.61   1   1   4   1 subtable1
5  Mazda RX4 21.0  6 160 110 3.90 2.62 16.46   0   1   4   4 subtable2
6 Datsun 710 22.8  4 108  93 3.85 2.32 18.61   1   1   4   1 subtable2
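
The labels above start at subtable0 because no delimiter comes before the first sub-table. If numbering from 1 is preferred, as in the question, one possible variation is to add 1 to the running count. The sketch below also anchors the pattern so that only lines consisting entirely of asterisks count as delimiters. The sample lines and the names lines_in and is_sep are made up for illustration, and the sketch assumes the file does not begin with a delimiter line.

```r
# Hypothetical sample input: three sub-tables separated by lines of asterisks
lines_in <- c("a 1", "b 2", "*******", "c 3", "*******", "d 4", "e 5")

# TRUE only for lines made up entirely of asterisks
is_sep <- grepl("^\\*+$", lines_in)

# Parse the data lines only
dat <- read.table(text = lines_in[!is_sep])

# Delimiters seen so far, plus 1, so the first sub-table is "subtable1"
dat$table <- paste0("subtable", cumsum(is_sep)[!is_sep] + 1)
```

Here dat$table comes out as subtable1, subtable1, subtable2, subtable3, subtable3.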

OTHER TIPS

Assuming you're reading from a file:

f <- read.table("filename", fill=TRUE, ....) # insert the required arguments here

# identify separator lines: assume 1st column is '*' and others are all blank
# tweak specifics to fit
sep <- f[,1] == "*" & rowSums(!is.na(f[,-1])) == 0

f$subtable <- cumsum(sep) + 1
f <- f[!sep, ]

The idea is to read in the entire file, then identify the separator lines as those containing nothing but a *. Since you haven't said what the actual contents of your file are, it's hard to provide anything more specific. You'll need to tweak this to handle whatever your file contains.
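
As a rough, self-contained sketch of that idea, the example below writes a small made-up file to a temporary location and runs the same steps on it. The file contents and the column names name, weight and eye are assumptions for illustration only. One detail to watch: with fill=TRUE, padded fields come back as NA in numeric columns but as empty strings in character columns, so this sketch treats both as blank.

```r
# Hypothetical sample file: each separator line holds a single "*"
tmp <- tempfile()
writeLines(c("anna 60 blue", "bob 80 brown", "*", "carl 75 green"), tmp)

# fill = TRUE pads the short separator row so every row has three fields
f <- read.table(tmp, fill = TRUE, stringsAsFactors = FALSE,
                col.names = c("name", "weight", "eye"))

# A field counts as blank if it is NA (numeric columns) or "" (character columns)
rest  <- f[, -1, drop = FALSE]
blank <- is.na(rest) | rest == ""

# Separator rows: "*" in the first column and nothing in the others
sep <- f[, 1] == "*" & rowSums(!blank) == 0

# Number the sub-tables, then drop the separator rows
f$subtable <- cumsum(sep) + 1
f <- f[!sep, ]
```

With this input, f ends up with three rows whose subtable values are 1, 1 and 2.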

Based on my understanding of the question, I will illustrate using the mtcars data set that ships with R:

library(plyr) # for rbind.fill 

# split the data into 2 data frames, which stand in for 2 sub-tables
data1<-subset(mtcars,am==0)
data2<-subset(mtcars,am==1)

# let s be your special sign, *, separating data frame 1 and data frame 2
data1$s<-rep("*",(dim(data1)[1]))


data3<-rbind.fill(data1,data2) # append data2 to data1; s is filled with NA for data2
tablename<-rep(paste0("subtable",1:2),c(dim(data1)[1],dim(data2)[1])) 
tablename<-as.data.frame(tablename) # convert the table names to a data frame

mydata<-cbind(data3,tablename) # bind data3 and tablename column-wise
finaldata<-mydata[,-(dim(mydata)[2]-1)] # remove the separator column s (second to last)

> head(finaldata,n=20)
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb tablename
1  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1 subtable1
2  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2 subtable1
3  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1 subtable1
4  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4 subtable1
5  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2 subtable1
6  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2 subtable1
7  19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4 subtable1
8  17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4 subtable1
9  16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3 subtable1
10 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3 subtable1
11 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3 subtable1
12 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4 subtable1
13 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4 subtable1
14 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4 subtable1
15 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1 subtable1
16 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2 subtable1
17 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2 subtable1
18 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4 subtable1
19 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2 subtable1
20 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 subtable2
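
If the sub-tables are already available as separate data frames, the separator column and plyr are not strictly needed. A minimal base R sketch, reusing the same mtcars split, might look like this; the names parts and combined are made up for illustration.

```r
# Same split as above: 2 data frames standing in for 2 sub-tables
data1 <- subset(mtcars, am == 0)
data2 <- subset(mtcars, am == 1)

# Keep the sub-tables in a named list; the names become the labels
parts <- list(subtable1 = data1, subtable2 = data2)

# Stack the sub-tables, then repeat each name once per row of its sub-table
combined <- do.call(rbind, parts)
combined$tablename <- rep(names(parts), vapply(parts, nrow, integer(1)))
```

Calling table(combined$tablename) afterwards is a quick way to check that each sub-table contributed the expected number of rows.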
Licensed under: CC-BY-SA with attribution