
df <- data.frame(var1=c('a', 'b', 'c'), var2=c('d', 'e', 'f'), freq=1:3)

What is the simplest way to expand the first two columns of the data.frame above, so that each row appears the number of times specified in the column 'freq'?

In other words, go from this:

  var1 var2 freq
1    a    d    1
2    b    e    2
3    c    f    3

To this:

  var1 var2
1    a    d
2    b    e
3    b    e
4    c    f
5    c    f
6    c    f


Here's one solution:

df.expanded <- df[rep(row.names(df), df$freq), 1:2]


    var1 var2
1      a    d
2      b    e
2.1    b    e
3      c    f
3.1    c    f
3.2    c    f
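To see why this works, it may help to look at the index vector on its own. The sketch below is base R only; `idx` and `out` are illustrative names, and resetting the row names at the end is an optional extra step that the answer above does not take.

```r
df <- data.frame(var1 = c('a', 'b', 'c'), var2 = c('d', 'e', 'f'), freq = 1:3)

# rep() with a vector 'times' argument repeats element i exactly freq[i] times
idx <- rep(row.names(df), df$freq)
idx
# [1] "1" "2" "2" "3" "3" "3"

# index rows by the repeated row names and keep only the first two columns
out <- df[idx, 1:2]

# optional: replace the "2.1", "3.1", ... row names with plain 1..n
rownames(out) <- NULL
```

Without the last step, the duplicated indices are what produce row names such as 2.1 and 3.2 in the output above.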


Use expandRows() from the splitstackshape package:

library(splitstackshape)
expandRows(df, "freq")

Simple syntax, very fast, works on data.frame or data.table.


    var1 var2
1      a    d
2      b    e
2.1    b    e
3      c    f
3.1    c    f
3.2    c    f

This is an old question, but tidyr now has a verb for it, uncount():

library(tidyr) # version >= 0.8.0
df <- data.frame(var1=c('a', 'b', 'c'), var2=c('d', 'e', 'f'), freq=1:3)
df %>% uncount(freq)

    var1 var2
1      a    d
2      b    e
2.1    b    e
3      c    f
3.1    c    f
3.2    c    f

@neilfws's solution works great for data.frames, but not for data.tables since they lack the row.names property. This approach works for both:

df.expanded <- df[rep(seq(nrow(df)), df$freq), 1:2]
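If this is needed in more than one place, the integer-index approach can be wrapped in a small function. The following is only a sketch: `expand_by_freq`, its `freq_col` argument and the input checks are hypothetical and not part of any answer here. It uses seq_len() rather than seq(), because seq(nrow(df)) evaluates to c(1, 0) when the data frame has no rows.

```r
# hypothetical helper; the name, argument and checks are illustrative only
expand_by_freq <- function(df, freq_col = "freq") {
  freq <- df[[freq_col]]
  # counts must be non-missing and non-negative for rep() to accept them
  stopifnot(is.numeric(freq), !anyNA(freq), all(freq >= 0))
  # repeat row position i freq[i] times; a count of 0 drops the row
  idx <- rep(seq_len(nrow(df)), freq)
  # keep every column except the count column, and stay a data.frame
  out <- df[idx, setdiff(names(df), freq_col), drop = FALSE]
  rownames(out) <- NULL
  out
}

df0 <- data.frame(var1 = c('a', 'b', 'c'), var2 = c('d', 'e', 'f'), freq = c(2, 0, 1))
res <- expand_by_freq(df0)   # 'a' appears twice, 'b' is dropped, 'c' appears once
```

Selecting columns by name instead of 1:2 means the helper does not depend on the count column being last.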

The code for data.table is a tad cleaner:

library(data.table)
# convert to data.table by reference
setDT(df)
df.expanded <- df[rep(seq(.N), freq), !"freq"]

If you have to do this on very large data.frames, I would recommend converting them to a data.table and using the following, which should run much faster:

library(data.table)
dt <- data.table(df)
dt.expanded <- dt[ ,list(freq=rep(1,freq)),by=c("var1","var2")]
dt.expanded[ ,freq := NULL]

See how much faster this solution is:

df <- data.frame(var1=1:2e3, var2=1:2e3, freq=1:2e3)
system.time(df.exp <- df[rep(row.names(df), df$freq), 1:2])
##    user  system elapsed 
##    4.57    0.00    4.56
dt <- data.table(df)
system.time(dt.expanded <- dt[ ,list(freq=rep(1,freq)),by=c("var1","var2")])
##    user  system elapsed 
##    0.05    0.01    0.06
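When comparing approaches like this, it can be worth checking that they return the same rows before reading much into the timings. The sketch below does that for the two base R variants from this thread (row names versus integer positions) on a smaller input. The names `t_names`, `t_index`, `by_names` and `by_index` are illustrative, and no timings are shown because they depend on the machine and R version.

```r
df <- data.frame(var1 = 1:300, var2 = 1:300, freq = 1:300)

# variant 1: index by repeated (character) row names
t_names <- system.time(by_names <- df[rep(row.names(df), df$freq), 1:2])

# variant 2: index by repeated integer positions
t_index <- system.time(by_index <- df[rep(seq_len(nrow(df)), df$freq), 1:2])

# drop the generated row names so that only the data is compared
rownames(by_names) <- NULL
rownames(by_index) <- NULL
same_rows <- identical(by_names, by_index)

# compare the elapsed times on your own machine
rbind(row_names = t_names, integer_index = t_index)[, "elapsed"]
```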

Another dplyr alternative uses slice(), repeating each row number freq times:

library(dplyr)

df %>%
  slice(rep(seq_len(n()), freq)) %>%
  select(-freq)

#  var1 var2
#1    a    d
#2    b    e
#3    b    e
#4    c    f
#5    c    f
#6    c    f

The seq_len(n()) part can be replaced with any of the following:

df %>% slice(rep(1:nrow(df), freq)) %>% select(-freq)
df %>% slice(rep(row_number(), freq)) %>% select(-freq)
df %>% slice(rep(seq_len(nrow(.)), freq)) %>% select(-freq)

Another possibility is using tidyr::expand:


library(dplyr)
library(tidyr)

df %>% group_by_at(vars(-freq)) %>% expand(temp = 1:freq) %>% select(-temp)
#> # A tibble: 6 x 2
#> # Groups:   var1, var2 [3]
#>   var1  var2 
#>   <fct> <fct>
#> 1 a     d    
#> 2 b     e    
#> 3 b     e    
#> 4 c     f    
#> 5 c     f    
#> 6 c     f

One-liner version of vonjd's answer:


library(data.table)

setDT(df)[ ,list(freq=rep(1,freq)),by=c("var1","var2")][ ,freq := NULL][]
#>    var1 var2
#> 1:    a    d
#> 2:    b    e
#> 3:    b    e
#> 4:    c    f
#> 5:    c    f
#> 6:    c    f

Created on 2019-05-21 by the reprex package (v0.2.1)

Licensed under: CC-BY-SA with attribution