Pergunta

I have data about thousands of customers who visited stores in the 3 past years. For each customer, I have :

  • ID
  • Combination of a year and the first store visited in this year.
Customer_Id | Year_*_Store 
1            2010_A
1            2011_B
1            2012_C
2            2010_A
2            2011_B
2            2012_D

What I’d like to have is the following structure of data in order to visualize the evolution of the customers’behaviour with a riverplot( aka Sankey plot)

For instance the 2 customers, who firstly visited the store A in 2010, firstly visited the store B in 2011:

SOURCE |     TARGET |   NB_CUSTOMERS
2010_A      2011_B      2
2011_B      2012_C      1
2011_B      2012_D      1

I don't want links between two years which are not consecutive like 2010_A and 2012_D

How can I do that in R ?

Foi útil?

Solução 2

Note that you can't have a * in column names (see ?make.names). Here is a basic approach:

  1. Split Year_store into two separate columns Year and Store in your data frame; at the moment it contains two completely different kinds of data and you actually need to process them separately.

  2. Make a NextYear column, defined as Year + 1

  3. Make a NextStore column, in which you assign the store code matching Customer_Id and for which Year is the same as this row's NextYear, assigning NA if there is no record of the customer visiting a store the next year, and throwing an error if the data do not meet the required specification (are ambiguous about which store was visited first the next year).

  4. Strip out any of the rows in which NextStore is NA, and combine the NextYear and NextStore columns into a NextYear_NextStore column.

  5. Summarize your data frame by the Year_store and NextYear_NextStore columns e.g. using ddply in the plyr package.

Some sample data:

# same example data as question
customer.df <- data.frame(Customer_Id = c(1, 1, 1, 2, 2, 2),
    Year_Store = c("2010_A", "2011_B", "2012_C", "2010_A", "2011_B", "2012_D"),
    stringsAsFactors = FALSE)

# alternative data should throw error, customer 2 is inconsistent in 2011
badCustomer.df <- data.frame(Customer_Id = c(1, 1, 1, 2, 2, 2),
    Year_Store = c("2010_A", "2011_B", "2012_C", "2010_A", "2011_B", "2011_D"),
    stringsAsFactors = FALSE)

And an implementation:

require(plyr)

splitYearStore <-  function(df) {
    df$Year <- as.numeric(substring(df$Year_Store, 1, 4))
    df$Store <- as.character(substring(df$Year_Store, 6))
    return(df) 
}

findNextStore <- function(df, matchCust, matchYear) {
    matchingStore <- with(df,
        df[Customer_Id == matchCust & Year == matchYear, "Store"])
    if (length(matchingStore) == 0) {
        return(NA)
    } else if (length(matchingStore) > 1) {
        errorString <- paste("Inconsistent store results for customer",
            matchCust, "in year", matchYear)
        stop(errorString)
    } else {
        return(matchingStore)
    }
}

tabulateTransitions <-  function(df) {
    df <- splitYearStore(df)
    df$NextYear <- df$Year + 1
    df$NextStore <- mapply(findNextStore, matchCust = df$Customer_Id,
        matchYear = df$NextYear, MoreArgs = list(df = df)) 
    df$NextYear_NextStore <- with(df, paste(NextYear, NextStore, sep = "_"))
    df <- df[!is.na(df$NextStore),]
    df <- ddply(df, .(Source = Year_Store, Target = NextYear_NextStore),
        summarise, No_Customers = length(Customer_Id))
    return(df) 
}

Results:

> tabulateTransitions(customer.df)
  Source Target No_Customers
1 2010_A 2011_B            2
2 2011_B 2012_C            1
3 2011_B 2012_D            1
> tabulateTransitions(badCustomer.df)
Error in function (df, matchCust, matchYear)  : 
  Inconsistent store results for customer 2 in year 2011

No attempt has been made to optimise; if your data set is massive then perhaps you should investigate a data.table solution.

Outras dicas

I would do this with dplyr (faster)

df<-read.table(header=T,text="Customer_Id  Year_Store 
1            2010_A
1            2011_B
1            2012_C
2            2010_A
2            2011_B
2            2012_D")

require(dplyr)             # for aggregation
require(riverplot)         # for Sankey

targets<-
group_by(df,Customer_Id) %.%           # group by Customer
mutate(source=Year_Store,target=c(as.character(Year_Store)[-1],NA)) %.%   # add a lag to show the shift
filter(!is.na(target)) %.%                                                # filter out empty edges
regroup(list("source","target")) %.%                                      # regroup by source & target
summarise(len=length(Customer_Id)) %.%                                    # count customers for relationship
mutate(step=as.integer(substr(target,1,4))-as.integer(substr(source,1,4))) %.%   # add a step to show how many years
filter(step==1)                                                            # filter out relationships for non consec years

topnodes <- c(as.character(unique(df$Year_Store)))                         # unique nodes

nodes <- data.frame( ID=topnodes,                                          # IDs
                   x=as.numeric(substr(topnodes,1,4)),                     # x value for plot
                   col= rainbow(length(topnodes)),                         # color each different
                   labels= topnodes,                                       # labels
                   stringsAsFactors= FALSE )

edges<-                                                                    # create list of list 
  lapply(unique(targets$source),function(x){
      l<-as.list(filter(targets,source==x)$len)                            # targets per source
      names(l)<-filter(targets,source==x)$target                           # name of target
      l
  })

names(edges)<-unique(targets$source)                                       # name top level nodes

r <- makeRiver( nodes, edges)                                              # make the River 
plot( r )                                                                  # plot it!

enter image description here

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top