Question

I've got data with the following form:

    brt_id          ADDRESS             OWNERNAME year PRINCIPAL INTEREST PENALTY OTHER        TOTAL LIEN STATUS
1 11000600 00108 WHARTON ST PRUSINOWSKI JOSEPHINE 2001         0        0       0     0     0            
2 11000600 00108 WHARTON ST PRUSINOWSKI JOSEPHINE 2002         0        0       0     0     0            
3 11000600 00108 WHARTON ST PRUSINOWSKI JOSEPHINE 2003         0        0       0     0     0            
4 11000600 00108 WHARTON ST PRUSINOWSKI JOSEPHINE 2004         0        0       0     0     0            
5 11000600 00108 WHARTON ST PRUSINOWSKI JOSEPHINE 2005         0        0       0     0     0            
6 11000600 00108 WHARTON ST PRUSINOWSKI JOSEPHINE 2006         0        0       0     0     0            

I want to reshape it "wide by year" (as is my instinct from similar exercises in STATA), so that I get variables like PRINCIPAL_2001, PRINCIPAL_2002, etc.

However, when I run:

data2m<-melt(data2, id=c("brt_id", "year"))
data2c<-dcast(data2m, brt_id+year~...)

The resulting data (which should be identical to the original data) looks like:

    brt_id year ADDRESS OWNERNAME PRINCIPAL INTEREST PENALTY OTHER TOTAL LIEN STATUS
1 11000600 2001       1         1         1        1       1     1     1    1      1
2 11000600 2002       1         1         1        1       1     1     1    1      1
3 11000600 2003       1         1         1        1       1     1     1    1      1
4 11000600 2004       1         1         1        1       1     1     1    1      1
5 11000600 2005       1         1         1        1       1     1     1    1      1
6 11000600 2006       1         1         1        1       1     1     1    1      1

I get a warning message when I melt the data:

Warning message:
attributes are not identical across measure variables; they will be dropped 

And another when I cast the data:

Aggregation function missing: defaulting to length

It looks like the problem is happening with casting, as a peek at the melted data seems fine:

            brt_id year variable      value
70000000 621506800 2005     LIEN           
70000001 621506800 2006     LIEN           
70000002 621506800 2007     LIEN           
70000003 621506800 2008     LIEN           
70000004 621506800 2009     LIEN           

The result is similar (though worse) if I use acast:

              ADDRESS OWNERNAME PRINCIPAL INTEREST PENALTY OTHER TOTAL LIEN STATUS
11000600_2001       1         1         1        1       1     1     1    1      1
11000600_2002       1         1         1        1       1     1     1    1      1
11000600_2003       1         1         1        1       1     1     1    1      1
11000600_2004       1         1         1        1       1     1     1    1      1
11000600_2005       1         1         1        1       1     1     1    1      1
11000600_2006       1         1         1        1       1     1     1    1      1

Any idea what might be going wrong here? I also lose one observation when trying to bring it back to normal, for some reason...


Solution

Here's a solution using base R's reshape function, applied to @MrFlick's sample data. This avoids having to melt your data first and then dcast it to get it into a "wide" format.

reshape(data2, direction = "wide", 
        idvar = c("brt_id", "ADDRESS", "OWNERNAME"), 
        timevar = "year")
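
For reference, here is a sketch of that call on the sample data, redefined so the snippet runs on its own. By default reshape joins the variable name and the year with a dot (PRINCIPAL.2001); passing sep = "_" should give the PRINCIPAL_2001 style asked for in the question. The name `wide` is only illustrative.

```r
# Sample data copied from @MrFlick's answer so the snippet is self-contained.
data2 <- data.frame(brt_id = 11000600,
                    ADDRESS = "00108-WHARTON-ST",
                    OWNERNAME = "PRUSINOWSKI-JOSEPHINE",
                    year = 2001:2006,
                    PRINCIPAL = 0, INTEREST = 0, PENALTY = 0, OTHER = 0,
                    TOTAL.LIEN.STATUS = 0)

# Same call as above, with sep = "_" so the new columns are named
# <variable>_<year> instead of <variable>.<year>.
wide <- reshape(data2, direction = "wide",
                idvar = c("brt_id", "ADDRESS", "OWNERNAME"),
                timevar = "year", sep = "_")

dim(wide)     # one row per property; 3 id columns plus 5 variables x 6 years
names(wide)   # inspect the generated column names
```

Every column that is not listed in idvar or timevar is treated as a time-varying measure, which is why no v.names argument is needed here.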

Now, regarding your warnings: @MrFlick showed you the way to do this with the "reshape" package (why not "reshape2"? Better to keep updated!), but he didn't really explain the warnings in his answer.

The first warning is basically telling you that the variables you are trying to put in the "value" column (the measure variables) are of different types (some may be character, others factors, others numeric). In this particular case, "ADDRESS" and "OWNERNAME" (factors) are being put into the same column as the numeric values from the remaining columns, hence the warning. @MrFlick's suggestion to treat those columns as keys, even though they might not be, solves that problem.
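
A minimal sketch of that first warning, assuming the reshape2 package is installed. The data frame `df` and its columns are made up for illustration, and the factor is created explicitly because recent versions of R no longer turn strings into factors by default.

```r
library(reshape2)

# Hypothetical toy data: one factor column and one numeric column.
df <- data.frame(id = 1:2,
                 name = factor(c("a", "b")),  # factor: carries levels and a class
                 amount = c(10, 20))          # plain numeric: no such attributes

# Melting both columns into a single "value" column mixes a factor with a
# numeric, so melt should warn that the attributes are not identical.
m_mixed <- melt(df, id.vars = "id")
class(m_mixed$value)

# Keeping the factor as an id variable leaves only numeric measures,
# so the value column stays numeric and the warning should not appear.
m_clean <- melt(df, id.vars = c("id", "name"))
class(m_clean$value)
```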

The second warning is one you usually get when the combination of ID variables isn't unique. If your data is like the sample data here, and you follow @MrFlick's advice, then you should be OK. Otherwise, you would need to add another column to make the ID variables unique, so that dcast does not fall back to length as its fun.aggregate function. That fallback is also why your cast output is full of 1s: each cell holds a count of values rather than the values themselves.
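
One way to check whether the ID combination is unique before casting, sketched in base R. The data frame `dup` and the helper column `rep` are hypothetical; with the real data you would use your own ID columns in place of brt_id and year.

```r
# Hypothetical data in which the brt_id/year pair 1/2002 appears twice.
dup <- data.frame(brt_id = c(1, 1, 1),
                  year = c(2001, 2002, 2002),
                  PRINCIPAL = c(0, 5, 7))

keys <- dup[c("brt_id", "year")]

# anyDuplicated() returns 0 when every key combination is unique.
has_dups <- anyDuplicated(keys) > 0

# Pull out every row that shares its key with another row.
offenders <- dup[duplicated(keys) | duplicated(keys, fromLast = TRUE), ]

# Number the repeats within each key; adding this column to the ID
# variables makes each combination unique again.
dup$rep <- ave(seq_len(nrow(dup)), dup$brt_id, dup$year, FUN = seq_along)
```

If the repeated rows turn out to be exact duplicates, dropping them with unique() may be simpler than adding a counter column.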

OTHER TIPS

Well, using this sample data.frame

data2<-data.frame(brt_id=11000600, 
    ADDRESS = "00108-WHARTON-ST",
    OWNERNAME = "PRUSINOWSKI-JOSEPHINE",
    year=2001:2006,
    PRINCIPAL =0,
    INTEREST =0,
    PENALTY =0,
    OTHER =0,
    TOTAL.LIEN.STATUS=0
)

Then I think you will find that

library(reshape2)
data2m <- melt(data2, id=c("brt_id","ADDRESS","OWNERNAME","year"))
data2c <- dcast(data2m, brt_id+ADDRESS+OWNERNAME+year~...)

will produce the original data.frame. The idea is that even though address and owner name are not necessarily part of the key, you want to treat them as such so they don't get melted as well.

And finally, to get it wide by year as you desire, use

dcast(data2m, brt_id+ADDRESS+OWNERNAME~...)
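
A small variation, sketched under the assumption that reshape2 is installed: with ~... the remaining variables (year and variable) both go to the columns, and the names should come out year-first. Writing the right-hand side explicitly as variable + year puts the variable name first, which matches the PRINCIPAL_2001 style from the question. The name `wide` is only illustrative.

```r
library(reshape2)

# Sample data from above, repeated so the snippet is self-contained.
data2 <- data.frame(brt_id = 11000600,
                    ADDRESS = "00108-WHARTON-ST",
                    OWNERNAME = "PRUSINOWSKI-JOSEPHINE",
                    year = 2001:2006,
                    PRINCIPAL = 0, INTEREST = 0, PENALTY = 0, OTHER = 0,
                    TOTAL.LIEN.STATUS = 0)

# Keep address and owner as id variables so only the numeric columns are melted.
data2m <- melt(data2, id = c("brt_id", "ADDRESS", "OWNERNAME", "year"))

# One row per property; columns are named <variable>_<year>.
wide <- dcast(data2m, brt_id + ADDRESS + OWNERNAME ~ variable + year)
names(wide)
```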
Licensed under: CC-BY-SA with attribution