Question

I have a list of individuals, charities, and years. I am trying to find out how many times individual i overlaps with individual j in a given charity and year. I would like to make a square matrix for every year and have any given cell tell me the number of overlaps.

Example of Data:

Individual    Year    Charity
    1         2003       A
    2         2003       A
    2         2003       B
    2         2005       A
   ...        ...       ...
   17         2003       A
   17         2003       B

Wanted Result 2003 (for every year):

    Individual       Individual_1    Individual_2    ...       Individual_17
        1                 .               1                      1
        2                 1               .                      2
       ...               ...             ...                    ...
        17                1               2                      .

I have heard that R is best for network data, but for now I am using Stata. I created a variable for each individual, and I am running an if condition that looks in the [_n+x] row for the individual in the given column and places a one. I was then going to aggregate these data. This seems to work, but it is very time-intensive and I suspect it is error-prone.

qui forval j = 1/1750 { 
gen individual_`j'= 0
}

qui forval j = 1/1750 {
replace individual_`j' = 1 if individual[_n+`j'] == 1 & year == 2002 & charity == "A"
}

qui forval j = 1/1750 {
replace individual_`j' = 1 if individual[_n+`j'] == 1 & year == 2003 & charity == "A"
}

qui forval j = 1/1750 {
replace individual_`j' = 1 if individual[_n+`j'] == 1 & year == 2004 & charity == "A"
}

qui forval j = 1/1750 {
replace individual_`j' = 1 if individual[_n+`j'] == 1 & year == 2005 & charity == "A"
}

I would then sum over each charity. The data are too numerous for this brute-force approach to work; hopefully there is an easier way.

I am open to doing this outside of Stata.


Solution

I recently did something similar. First add a column combining year and charity, then convert the data frame into a list of year-charity events per individual. I called your example data x:

# Combine year and charity into one event label, e.g. "2003_A"
x$info <- paste(x$Year, x$Charity, sep = "_")
# Build one character vector of events per individual, named by individual ID
ids <- as.character(unique(x$Individual))
All_Groups.list <- vector(mode = "list", length = length(ids))
names(All_Groups.list) <- ids
for (i in seq_along(All_Groups.list)) {
  All_Groups.list[[i]] <- as.character(x$info[x$Individual == ids[i]])
}
# For every pair of individuals, count the events they have in common
Self.Cor.table <- sapply(All_Groups.list, function(a) {
  sapply(All_Groups.list, function(b) {
    sum(a %in% b)
  })
})

The output is a square table, laid out like a correlation table, in which each cell counts the events two individuals both attended:

> Self.Cor.table
   1 2 17
1  1 1  1
2  1 3  2
17 1 2  2

This differs from your desired output in that the diagonal gives the number of events each individual attended instead of a ".", which I think is important because each individual attends a different number of events.
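If the "." on the diagonal is still wanted, one option is to save the diagonal and then blank it. This is only a sketch: the matrix below is a hand-typed copy of the table printed above, and the names overlap and attended are made up for illustration.

```r
# Hand-typed copy of the overlap table shown above (illustrative only)
ids <- c("1", "2", "17")
overlap <- matrix(c(1, 1, 1,
                    1, 3, 2,
                    1, 2, 2),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(ids, ids))

# Keep the per-individual event counts before discarding them
attended <- diag(overlap)

# Blank the diagonal so only overlaps between different individuals remain
diag(overlap) <- NA
overlap
```

NA plays the role of the "." in the requested layout, and attended keeps the event counts in case they are needed later.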

If you want it per year, subset the data frame by year and repeat for each subset.
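The per-year split could be sketched roughly as follows. The data frame contains only the rows shown in the question, and overlap_one_year and by_year are hypothetical names. Turning Individual into a factor first is my own assumption, made so that every year yields a matrix of the same size, with zeros for individuals who do not appear in that year.

```r
# Only the rows shown in the question; the real data has many more
x <- data.frame(
  Individual = c(1, 2, 2, 2, 17, 17),
  Year       = c(2003, 2003, 2003, 2005, 2003, 2003),
  Charity    = c("A", "A", "B", "A", "A", "B")
)

# Assumption: a factor keeps all individuals as levels in every yearly subset
x$Individual <- factor(x$Individual)

# Hypothetical helper: pairwise overlap counts within one year's rows
overlap_one_year <- function(d) {
  groups <- split(as.character(d$Charity), d$Individual)
  m <- sapply(groups, function(a) sapply(groups, function(b) sum(a %in% b)))
  # Force a square matrix even when only one individual is present
  matrix(m, nrow = length(groups),
         dimnames = list(names(groups), names(groups)))
}

# One square matrix per year, stored in a list named by year
by_year <- lapply(split(x, x$Year), overlap_one_year)
by_year[["2003"]]
```

Each element of by_year is then the square matrix for one year, for example by_year[["2005"]].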

OTHER TIPS

As an alternative, you might want to consider benchmarking the following. First, tabulate all triplets (each entry will be 1 or 0 depending on whether the individual contributed to the charity in that year, provided no row is duplicated):

tbl <- table(dat$Individual, dat$Charity, dat$Year)

Now we want to loop through each Year (which is the third dimension of tbl) and, for each pair of rows (individuals), calculate the number of shared 1's. This can be done as follows:

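# Each slice x is an individual-by-charity matrix, so x %*% t(x) counts shared charities
res <- apply(tbl, 3, function(x) x %*% t(x))
# apply() flattens each result into a column; reshape to individual x individual x year
dim(res) <- c(dim(tbl)[1], dim(tbl)[1], dim(tbl)[3])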
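
As a small worked check, the same steps can be run on the rows shown in the question. The data frame below contains only those rows. Restoring the dimnames is an extra step of mine, added because assigning dim drops the labels.

```r
# Only the rows shown in the question; the real data has many more
dat <- data.frame(
  Individual = c(1, 2, 2, 2, 17, 17),
  Year       = c(2003, 2003, 2003, 2005, 2003, 2003),
  Charity    = c("A", "A", "B", "A", "A", "B")
)

# Individual x Charity x Year array of counts
tbl <- table(dat$Individual, dat$Charity, dat$Year)

# Pairwise shared-charity counts for each year
res <- apply(tbl, 3, function(x) x %*% t(x))
dim(res) <- c(dim(tbl)[1], dim(tbl)[1], dim(tbl)[3])

# Extra step: put the individual and year labels back on the array
dimnames(res) <- list(dimnames(tbl)[[1]], dimnames(tbl)[[1]], dimnames(tbl)[[3]])

# Overlap matrix for 2003; individuals 2 and 17 share two charities
res[, , "2003"]
```

In base R, tcrossprod(x) computes the same product as x %*% t(x), so it may be worth including in the benchmark.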
Licensed under: CC-BY-SA with attribution