Question

I'm a social science researcher, working to visualize how people move through various roles over time in a community.

I have clustered people's monthly behavior into role categories, and now I want to visualize the number and ratio of people that are in each role at each (relative) time period.

Right now, the data is in a CSV that looks something like this:

ID  T1  T2  T3 ...
1   2   2   3
2   1   0   2
3   1   2   1
...

Here X(i, j) is the ID of the cluster that person i was in during their jth month.

What I would like is something like this (which I created in LibreOffice): [image]

I believe I will need to use ggplot2, but I have really been struggling to figure out how to get the data in a format that ggplot likes.

I guess my first task would be to summarize each cluster at each time period? Is there an easy way to do that?

I can do this with the following code, but it's terrible and messy, and there must be a better way to do it?

library(reshape2)  # provides melt()
library(ggplot2)
clus1 <- apply(clusters, 2, function(x) {sum(x=='1', na.rm=TRUE)})
clus2 <- apply(clusters, 2, function(x) {sum(x=='2', na.rm=TRUE)})
clus3 <- apply(clusters, 2, function(x) {sum(x=='3', na.rm=TRUE)})
clus0 <- apply(clusters, 2, function(x) {sum(x=='0', na.rm=TRUE)})
clusters2 <- data.frame(clus0, clus1, clus2, clus3)
c2 <- t(clusters2)
c3 <- as.data.frame(c2)
c3$id = c('Low Activity Cluster', 'Cluster 1', 'Cluster 2', 'Cluster 3')
c3 <- c3[order(c3$'id'),]
print(ggplot(melt(c3, id.vars="id")) +
  geom_area(aes(x=variable, y=value, fill=id, group=id), position="fill"))

This results in something like this for the sample data:

id                      T1  T2  T3
Low Activity Cluster     0   1   0
Cluster 1                2   0   1
Cluster 2                1   2   1
Cluster 3                0   0   1

Is that the right strategy?


Solution

Edit, trying to address the comments:

`rownames<-`(
  as.data.frame(lapply(df[-1], function(x) as.numeric(table(x)))), 
  paste("Clust ", 0:3)
)

Produces:

         T1 T2 T3 T4 T5 T6 T7 T8 T9 T10
Clust  0  4  3  5  8 11  6  2  4  5   7
Clust  1  5  9  8  6  3  7  7  8  7   4
Clust  2  5  6  2  3  3  3  2  3  4   4
Clust  3  6  2  5  3  3  4  9  5  4   5

This counts the number of occurrences of each cluster type (0:3) at each time period using table. The key piece of code is the lapply(...) call; the code around it is only there to make the output display nicely.

With data:

set.seed(1)
labels <- paste("Clust ", 0:3)
df <- as.data.frame(c(list(ID=1:20), setNames(replicate(10, factor(sample(0:3, 20, replace=TRUE)), simplify=FALSE), paste0("T", 1:10))))
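
As a minimal sketch of the same counting idea applied to the three-person sample from the question: the columns there are plain numbers rather than factors, so this version fixes the factor levels explicitly, which keeps a zero for any cluster that is absent in a given month. The names cluster_ids and shares are illustrative, and the sketch assumes the full set of cluster IDs (0 to 3) is known in advance.

```r
# Sample data from the question: one row per person, one column per month.
clusters <- data.frame(
  ID = 1:3,
  T1 = c(2, 1, 1),
  T2 = c(2, 0, 2),
  T3 = c(3, 2, 1)
)

cluster_ids <- 0:3  # assumption: the full set of cluster IDs is known up front

# Count people per cluster in each period. Fixing the factor levels makes
# table() report 0 for clusters that do not occur in a given month.
counts <- sapply(clusters[-1], function(x) table(factor(x, levels = cluster_ids)))
rownames(counts) <- paste("Cluster", cluster_ids)

# Share of people in each cluster per period (each column sums to 1).
shares <- prop.table(counts, margin = 2)

counts
shares
```

Without the fixed levels, table() would return vectors of different lengths for different months, and combining them into one table would fail or misalign the rows.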

Here is a ggplot solution. First, get the data into long format with melt from the reshape2 package; you can then aggregate it (and optionally re-cast it), and then plot it:

library(reshape2)
library(ggplot2)
df.mlt <- melt(df, id.vars="ID")
df.agg <- aggregate(. ~ ID + variable, df.mlt, sum)
dcast(df.agg, ID ~ variable)  # just for show, we don't use the result anyplace

#   ID T1 T2 T3 T4 T5 T6 T7 T8 T9 T10
# 1  0 25 18 29 23 16 15 14 22 29  19
# 2  1  7  7 14 18 19 11 21 17 15  22
# 3  2 16 15 16 20 23 20 16 13 15  12
# 4  3 14 13 20 17 25 14 13  7 21  24

ggplot(df.agg) +
  geom_area(aes(x=variable, y=value, fill=ID, group=ID), position="fill") 


ggplot takes a little getting used to, but once you are used to it, it is mostly intuitive. Look at the result of melt(df, id.vars="ID") first to see what I mean by "long format". Then, in this case, we use geom_area and specify the "aesthetics" (values that change with the data) inside aes: the x value (variable is a column name produced by melt; here it contains the time periods), the y value (value is also created by melt), and the fill color of the areas, which is derived from ID. Note that because the time we're using here is categorical (T1, T2, etc., instead of actual dates), we must use group in addition to fill so that ggplot knows that points at different times should be connected.
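
To make "long format" concrete for the question's own data, here is a rough sketch that reshapes the three-person sample with base R's stack() (standing in for melt, so the column names differ), counts people per cluster and period, and passes the result to geom_area. The column names cluster, period and n are illustrative choices, and the plotting step is guarded so the reshaping part still runs when ggplot2 is not installed.

```r
# Sample data from the question (wide format: one column per month).
clusters <- data.frame(ID = 1:3, T1 = c(2, 1, 1), T2 = c(2, 0, 2), T3 = c(3, 2, 1))

# Long format: one row per (person, month) pair. stack() is base R and
# returns columns `values` and `ind`; melt() would call them `value` and `variable`.
long <- stack(clusters[-1])
names(long) <- c("cluster", "period")
long$cluster <- factor(long$cluster, levels = 0:3)  # assumed cluster IDs 0 to 3

# One row per (cluster, period) with the number of people, zeros included.
counts <- as.data.frame(table(cluster = long$cluster, period = long$period),
                        responseName = "n")

# Proportional stacked area chart; skipped if ggplot2 is unavailable.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(counts) +
    geom_area(aes(x = period, y = n, fill = cluster, group = cluster),
              position = "fill")
  print(p)
}
```

Keeping the zero counts matters here: every cluster then has a value at every period, so the stacked areas line up across the x axis.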

Note that you do not need to aggregate ahead of plotting; ggplot can handle it internally. The following command is equivalent (note that it uses df.mlt, not df.agg):

ggplot(df.mlt) +
  # fun= replaces fun.y=, which was deprecated in ggplot2 3.3.0; use fun.y=sum on older versions
  stat_summary(aes(x=variable, y=value, fill=ID, group=ID), fun=sum, position="fill", geom="area")

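This is the data I used for the ggplot example (note that it replaces the df from the edit at the top: here ID is a factor with levels 0 to 3, and the T columns hold numeric values that get summed):

df <- as.data.frame(c(list(ID=rep(factor(0:3), 3)), setNames(replicate(10, sample(1:10, 12, replace=TRUE), simplify=FALSE), paste0("T", 1:10))))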
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow