Function to define yearly intervals on big data frame in R

https://stackoverflow.com/questions/22159554

19-10-2022

Question

I'm preparing a big data frame for a Boosted Regression Tree model project. Since I'm new to R and to programming in general, I'm stuck on one step of the data preparation. I've spent hours thinking about this problem and know how I would like to solve it; I just can't do it in R. My data frame basically looks like the one below.

start.date and end.date mark a time interval during which a company (e.g. C1) was a customer of my (hypothetical) company. Company C1 was a customer from 01/01/2009 to 31/12/2009, as well as in the following two years. The variable amount.x is the amount that was paid to be a customer of my company.

> df <- data.frame(company,start.date,end.date,amount.x)
> df

      company start.date   end.date amount.x
    1      C1 01/01/2009 31/12/2009       10
    2      C1 01/01/2010 31/12/2010       20
    3      C1 01/01/2011 31/12/2011        5
    4      C2 01/01/2009 31/12/2009        7
    5      C2 01/01/2010 31/12/2010       12
    6      C2 01/01/2011 31/12/2011       11
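The vectors passed to data.frame() are not shown in the question. A minimal reconstruction, assuming the values printed in the table above and assuming the dates are stored as day/month/year character strings, could look like this:

```r
# Assumed reconstruction of the example data; the original vectors are not shown.
company    <- c("C1", "C1", "C1", "C2", "C2", "C2")
start.date <- c("01/01/2009", "01/01/2010", "01/01/2011",
                "01/01/2009", "01/01/2010", "01/01/2011")
end.date   <- c("31/12/2009", "31/12/2010", "31/12/2011",
                "31/12/2009", "31/12/2010", "31/12/2011")
amount.x   <- c(10, 20, 5, 7, 12, 11)

# Keep the dates as plain strings so they can be parsed with as.Date() later.
df <- data.frame(company, start.date, end.date, amount.x,
                 stringsAsFactors = FALSE)
str(df)
```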

What I'm trying to do is to add a new column showing how many years different companies have been customers of my company. The problem is that the time interval between start.date and end.date is not always exactly one year. Sometimes companies have been a customer for a month, but this should still be displayed as 1 year as a customer. It should look like this:

> df <- data.frame(company,start.date,end.date,amount.x,Years.as.customer)
> df
      company start.date   end.date amount.x Years.as.customer
    1      C1 01/01/2009 31/12/2009       10                 1
    2      C1 01/01/2010 31/12/2010       20                 2
    3      C1 01/01/2011 31/12/2011        5                 3
    4      C2 01/01/2009 31/12/2009        7                 1
    5      C2 01/01/2010 31/12/2010       12                 2
    6      C2 01/01/2011 31/12/2011       11                 3

I thought this could be achieved by defining a starting date for each company. So when a new name occurs in df$company, take the date from start.date in the same row and keep it for all rows of that company. The next step would be to calculate the time difference between each end.date and that starting date. If the difference is <= 1 year, write 1 in df$Years.as.customer; if 1 < difference <= 2 years, write 2; and so on.
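The steps described above can be sketched in base R. This is only an illustration under stated assumptions: the data are the six rows from the table in the question, a year is approximated as 365.25 days, and the helper names first.start and days.elapsed are made up for this sketch.

```r
# Example data copied from the question's table (dates are day/month/year).
df <- data.frame(
  company    = rep(c("C1", "C2"), each = 3),
  start.date = as.Date(rep(c("01/01/2009", "01/01/2010", "01/01/2011"), 2),
                       format = "%d/%m/%Y"),
  end.date   = as.Date(rep(c("31/12/2009", "31/12/2010", "31/12/2011"), 2),
                       format = "%d/%m/%Y"),
  amount.x   = c(10, 20, 5, 7, 12, 11)
)

# Step 1: earliest start date per company (as a day count), repeated on
# every row that belongs to that company.
first.start <- ave(as.numeric(df$start.date), df$company, FUN = min)

# Step 2: days elapsed between that first start date and each row's end date.
days.elapsed <- as.numeric(df$end.date) - first.start

# Step 3: round up to whole years (365.25 days per year is an approximation).
df$Years.as.customer <- ceiling(days.elapsed / 365.25)
df
```

Because ave() works on whole vectors, no explicit loop over the roughly 3000 companies is needed.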

This should be done for a huge data frame with varying dates (not always exactly one year between the two dates, and with different start and end dates) and around 3000 companies.

I'm struggling with defining a working function and applying it on the whole data frame.

I hope I could briefly explain the problem and what I want to do about it. Feel free to ask questions if there's anything unclear. I will try to answer them clearly.

Thanks for your help, guys.

Edit: Problems with overlapping years. (@Hugh)

One last issue remains before my problem is completely solved. I used Hugh's solution (see comments), which combines the dplyr and lubridate packages. The results are shown below:

    company  start.date  end.date    Years.as.customer
    C20      2010-07-10  2010-09-30  1
    C20      2010-07-10  2011-06-30  2
    C20      2010-07-10  2011-06-30  2
    C20      2010-07-10  2011-06-30  2
    C20      2010-07-10  2011-06-30  2
    C20      2010-07-10  2011-06-30  2
    C20      2010-10-01  2010-12-31  1
    C20      2011-01-01  2011-03-31  2
    C20      2011-04-01  2011-06-30  2

The problem is that company C20 has been a customer for only one year. All dates (from the first date in column start.date to the last date in column end.date) are within one year if one takes the first row as the start. I guess that when the year in column end.date changes from 2010 to 2011, the value in column Years.as.customer changes from 1 to 2 as well. It should stay at 1 for all given rows, since the time interval is still <= 1 year. Any ideas how this can be done?
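The effect can be reproduced on the distinct C20 rows from the table above. The sketch below assumes that the counts were produced by a calendar-year difference, and compares it with elapsed time since the first start date; the names year.of, by.calendar and by.elapsed are made up for illustration.

```r
# Distinct rows for company C20, copied from the output above.
c20 <- data.frame(
  start.date = as.Date(c("2010-07-10", "2010-07-10", "2010-10-01",
                         "2011-01-01", "2011-04-01")),
  end.date   = as.Date(c("2010-09-30", "2011-06-30", "2010-12-31",
                         "2011-03-31", "2011-06-30"))
)

# Helper (hypothetical name): extract the calendar year as an integer.
year.of <- function(d) as.integer(format(d, "%Y"))

# Calendar-year difference: jumps to 2 as soon as end.date falls in 2011.
by.calendar <- year.of(c20$end.date) - min(year.of(c20$start.date)) + 1

# Elapsed time since the first start date, rounded up to whole years
# (365.25 days per year is an approximation).
days.elapsed <- as.numeric(c20$end.date - min(c20$start.date))
by.elapsed   <- ceiling(days.elapsed / 365.25)

cbind(c20, by.calendar, by.elapsed)
```

With these rows the calendar-year column reproduces the 1 and 2 values shown above, while the elapsed-time column stays at 1, because the longest span (2010-07-10 to 2011-06-30) is 355 days.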

Thanks in advance.

Solution

I think this gives what you're after:

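library(dplyr)
library(lubridate)

# Convert the day/month/year strings to Date objects
df$start.date <- as.Date(df$start.date, format = "%d/%m/%Y")
df$end.date <- as.Date(df$end.date, format = "%d/%m/%Y")

# %>% replaces the %.% operator, which has been removed from dplyr
df %>%
  group_by(company) %>%
  # Calendar-year version, which counts a new year whenever the year number changes:
  # mutate(Years.as.customer = year(end.date) - min(year(start.date)) + 1)
  # Elapsed time since the company's first start date, rounded up to whole years
  mutate(Years.as.customer =
           ceiling(as.numeric(end.date - min(start.date)) / 365.25))


# months
library(zoo)
df %>%
  group_by(company) %>%
  # A difference of yearmon values is measured in years, so multiply by 12
  mutate(Months.as.customer =
           round(12 * (as.yearmon(end.date) - min(as.yearmon(start.date)))) + 1)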
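One edge case may be worth checking: if an end.date equals the company's first start.date, the elapsed time is 0 days and ceiling() returns 0 rather than 1. The question asks for even short periods to count as one year, so a possible adjustment (an assumption, not part of the original answer) is to clamp the result with pmax(). The dates and variable names below are made up for illustration.

```r
# Hypothetical dates: a same-day interval, a one-day interval, and 365 days.
first.start <- as.Date("2010-07-10")
end.date    <- as.Date(c("2010-07-10", "2010-07-11", "2011-07-10"))

days.elapsed <- as.numeric(end.date - first.start)

# Plain ceiling: the same-day interval yields 0 years.
raw <- ceiling(days.elapsed / 365.25)

# Clamped version: every row counts as at least one year.
clamped <- pmax(1, raw)

data.frame(end.date, days.elapsed, raw, clamped)
```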

Licensed under: CC-BY-SA with attribution
