Question

I have a data frame that contains a large number of symbols, dates, and values

date         symbol value
2014-01-03     A      2.5
2014-01-04     A      3.1
2014-01-06     A      4.5
2014-01-03     B      2.6
2014-01-05     B      3.2
2014-01-06     B      4.3

I want to split the data by symbol, compute the percentage change between the two most recent dates for each symbol, and then bin the symbols into a variable number of groups, where the first group holds the symbols with the largest percentage changes, the next group the second largest, and so on. Each group should contain approximately the same number of symbols.

Ideally, I would like my new data frame to look something like this

date         symbol value       pctchg     bin
2014-01-03     A      2.5       .45161      1
2014-01-04     A      3.1       .45161      1
2014-01-06     A      4.5       .45161      1
2014-01-03     B      2.6       .34375      2
2014-01-05     B      3.2       .34375      2
2014-01-06     B      4.3       .34375      2

This seems like a perfect task for ddply, but I'm struggling to get something to work. Any suggestions would be very much appreciated. Thank you for your time and help.


Solution

I'm not an experienced coder, but I'll field this candidate:

df <- read.table(sep=" ", header=TRUE, text="
date symbol value
2014-01-03 A 2.5
2014-01-04 A 3.1
2014-01-06 A 4.5
2014-01-03 B 2.6
2014-01-05 B 3.2
2014-01-06 B 4.3")

library(plyr)
df <- df[order(df$symbol, df$date),]
df <- ddply(df, "symbol", transform, pctchg=value[length(value)]/value[length(value)-1]-1)
df <- df[order(-df$pctchg),]

bins <- 2

library(ggplot2)
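# Rank the symbols rather than the distinct pctchg values, so that tied
# pctchg values cannot make groups shorter than the list of symbols.
groups <- cut_number(seq_along(unique(df$symbol)), n=bins)
levels(groups) <- seq_along(levels(groups))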
df <- merge(x=df, y=cbind.data.frame(symbol=unique(df$symbol), bin=groups))
df[order(-df$pctchg),]
#   symbol       date value    pctchg bin
# 1      A 2014-01-03   2.5 0.4516129   1
# 2      A 2014-01-04   3.1 0.4516129   1
# 3      A 2014-01-06   4.5 0.4516129   1
# 4      B 2014-01-03   2.6 0.3437500   2
# 5      B 2014-01-05   3.2 0.3437500   2
# 6      B 2014-01-06   4.3 0.3437500   2

OTHER TIPS

Adapted from LukeA's answer as a more canonical plyr solution.

If you are going to use plyr, call arrange rather than order.

df <- arrange(df, symbol, date)

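The pctchg expression can also be written explicitly as (last - previous) / previous, which is algebraically the same as last / previous - 1 and reproduces the values in the question: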

df2 <- ddply(
  df,
  .(symbol), 
  mutate, 
  pctchg = (value[length(value)] - value[length(value)-1]) / value[length(value)-1]
)

Also note the use of mutate rather than transform. Bins can be generated using a hack based on the factor function, which gives every distinct pctchg value its own bin, with the largest change in bin 1:

mutate(df2, bin = as.integer(factor(-pctchg)))
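
When a fixed number of roughly equal-sized groups is wanted, as in the question, one possible base R sketch is to rank the symbols by their change and scale the ranks to the number of bins. The names last_change, chg, rk and bin_of_symbol are made up for this illustration, and bins <- 2 is simply the value used in the Solution above. It also assumes bins is no larger than the number of symbols.

```r
# Toy data from the question.
df <- data.frame(
  date   = c("2014-01-03", "2014-01-04", "2014-01-06",
             "2014-01-03", "2014-01-05", "2014-01-06"),
  symbol = c("A", "A", "A", "B", "B", "B"),
  value  = c(2.5, 3.1, 4.5, 2.6, 3.2, 4.3),
  stringsAsFactors = FALSE
)
df <- df[order(df$symbol, df$date), ]

# Relative change between the last two values (NA with fewer than two).
last_change <- function(x) {
  n <- length(x)
  if (n > 1) x[n] / x[n - 1] - 1 else NA_real_
}

# One value per symbol, as a named numeric vector.
chg <- sapply(split(df$value, df$symbol), last_change)

bins <- 2  # assumed number of groups

# Rank 1 = largest change. Ties are broken by position, so every symbol
# gets a distinct rank and group sizes differ by at most one.
rk <- rank(-chg, ties.method = "first")
bin_of_symbol <- setNames(ceiling(rk * bins / length(rk)), names(chg))

# Copy the per-symbol results back onto every row.
df$pctchg <- unname(chg[as.character(df$symbol)])
df$bin    <- unname(bin_of_symbol[as.character(df$symbol)])
df[order(df$bin, df$symbol, df$date), ]
```

With the two symbols from the question this puts A in bin 1 and B in bin 2. With more symbols than bins, each bin receives either floor(n / bins) or ceiling(n / bins) symbols.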

Assuming the data frame DF is already sorted by symbol and date, as it is in the question, f computes the relative change between the last two values of its vector argument (or NA if there are fewer than two), and ave applies f to each symbol's values. Finally, we use order to re-sort by descending relchg and append bin. We use the heading relchg rather than pctchg because the values shown in the question are fractions, not percentages.

f <- function(x) { n <- length(x); if (n > 1) x[n] / x[n-1] - 1 else NA }
DF2 <- transform(DF, relchg = ave(value, symbol, FUN = f))
o <- with(DF2, order(-relchg, symbol, date))
transform(DF2[o, ], bin = as.numeric(factor(symbol, levels = unique(symbol))))

The result is:

        date symbol value    relchg bin
1 2014-01-03      A   2.5 0.4516129   1
2 2014-01-04      A   3.1 0.4516129   1
3 2014-01-06      A   4.5 0.4516129   1
4 2014-01-03      B   2.6 0.3437500   2
5 2014-01-05      B   3.2 0.3437500   2
6 2014-01-06      B   4.3 0.3437500   2
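
To see how ave spreads the single number returned by f over every row of its group, and what the n > 1 guard does for a symbol with only one observation, here is a small sketch. The symbol C and its value are hypothetical and not part of the question's data.

```r
value  <- c(2.5, 3.1, 4.5, 2.6, 3.2, 4.3, 1.0)
symbol <- c("A", "A", "A", "B", "B", "B", "C")  # "C" is a made-up symbol

# Relative change between the last two values, NA if there is only one.
f <- function(x) { n <- length(x); if (n > 1) x[n] / x[n - 1] - 1 else NA_real_ }

# f returns one number per group; ave recycles it to the group's length.
relchg <- ave(value, symbol, FUN = f)
data.frame(symbol, value, relchg)
```

Rows for A and B each repeat their group's change, while the single row for C gets NA.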
Licensed under: CC-BY-SA with attribution