Question

As I have a big dataset and only limited computational resources, I want to use aggregated sequence objects for a discrepancy analysis with the R packages TraMineR and WeightedCluster, but I am struggling to find the right syntax.

The example code below runs two discrepancy analyses: the first tree diagram uses the original dataset, the second uses aggregated data (that is, only the unique sequences, weighted by their frequencies).
Unfortunately, the results do not match. Do you have any idea why?

Example code

library(TraMineR) 
library(WeightedCluster) 

## Load example data and assign labels
data(mvad)
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.labels <- c("Employment", "Further Education", "Higher Education", 
                 "Joblessness", "School", "Training")
mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR")

## Aggregate example data
mvad.agg <- wcAggregateCases(mvad[, 17:86], weights=mvad$weight)
mvad.agg

## Define sequence object 
mvad.seq <- seqdef(mvad[, 17:86], alphabet=mvad.alphabet, states=mvad.scodes,
                   labels=mvad.labels, weights=mvad$weight, xtstep=6)
mvad.agg.seq <- seqdef(mvad[mvad.agg$aggIndex, 17:86], alphabet=mvad.alphabet,
                       states=mvad.scodes, labels=mvad.labels,
                       weights=mvad.agg$aggWeights, xtstep=6)

## Computing OM dissimilarities
mvad.dist <- seqdist(mvad.seq, method="OM", indel=1.5, sm="CONSTANT")
mvad.agg.dist <- seqdist(mvad.agg.seq, method="OM", indel=1.5, sm="CONSTANT")

## Discrepancy analysis
tree <- seqtree(mvad.seq ~ gcse5eq + Grammar + funemp, 
                data=mvad, diss=mvad.dist, weight.permutation="diss")
seqtreedisplay(tree, type="d", border=NA)
tree.agg <- seqtree(mvad.agg.seq ~ gcse5eq + Grammar + funemp, 
                    data=mvad[mvad.agg$aggIndex, ], diss=mvad.agg.dist, 
                    weight.permutation="diss")
seqtreedisplay(tree.agg, type="d", border=NA)

This question is related to big data and the computation of sequence distances.


Solution

The procedure you are using for the aggregated data is wrong because you do not take the explanatory covariates into account when aggregating. As a result, each unique sequence is attributed the covariate profile of just one of the cases that share it, which is almost random, and the results are wrong.
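To see the problem on the example data, one can count how many distinct covariate profiles occur among the cases sharing each unique sequence. This is only a diagnostic sketch in base R; the names seq.key, cov.key and profiles.per.seq are introduced here for illustration and are not part of either package.

```r
library(TraMineR)
data(mvad)

## One string key per case: the monthly states (columns 17 to 86) pasted together
seq.key <- do.call(paste, c(mvad[, 17:86], sep = "-"))
## One string key per case for the three covariates used in the tree
cov.key <- do.call(paste, c(mvad[, c("gcse5eq", "Grammar", "funemp")], sep = "/"))

## Number of distinct covariate profiles observed for each unique sequence
profiles.per.seq <- tapply(cov.key, seq.key, function(x) length(unique(x)))

## Any count above 1 marks a sequence whose cases differ on the covariates,
## so aggregating on sequences alone would discard that variation
table(profiles.per.seq)
```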

What you need to do is aggregate the sequences and the covariates together. Here the covariates "Grammar", "funemp" and "gcse5eq" are located in columns 10 to 12, so:

## Aggregate example data
mvad.agg <- wcAggregateCases(mvad[, c(10:12, 17:86)], weights=mvad$weight)
mvad.agg
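Continuing from this aggregation, a minimal sketch of how the sequence object and the distance matrix could then be built on the unique (covariates, sequence) cases. It reuses the settings from the question; mvad.uniq, table.orig and table.agg are names introduced here for illustration. The cross-tabulation at the end is a sanity check that the aggregated weights reproduce the original weighted covariate distribution.

```r
library(TraMineR)
library(WeightedCluster)
data(mvad)
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR")

## Aggregate on covariates (columns 10 to 12) and sequences (columns 17 to 86)
mvad.agg <- wcAggregateCases(mvad[, c(10:12, 17:86)], weights = mvad$weight)

## Keep one row per unique (covariates, sequence) combination
mvad.uniq <- mvad[mvad.agg$aggIndex, ]

## Sequence object and OM distances on the unique cases only
mvad.agg.seq <- seqdef(mvad.uniq[, 17:86], alphabet = mvad.alphabet,
                       states = mvad.scodes, weights = mvad.agg$aggWeights,
                       xtstep = 6)
mvad.agg.dist <- seqdist(mvad.agg.seq, method = "OM", indel = 1.5, sm = "CONSTANT")

## Sanity check: weighted covariate tables from original and aggregated data
table.orig <- xtabs(mvad$weight ~ gcse5eq + Grammar + funemp, data = mvad)
table.agg <- xtabs(mvad.agg$aggWeights ~ gcse5eq + Grammar + funemp, data = mvad.uniq)
all.equal(as.vector(table.orig), as.vector(table.agg))
```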

We then come to the next problem: the permutation test. If you do nothing, only the aggregates are permuted (permutations within aggregates are omitted), which gives wrong p-values. Two solutions can be used:

  • If you do not have sampling weights, use weight.permutation="replicate", which tells the procedure to permute inside aggregates using a case unit of one.
  • If you have sampling weights, there is no perfect procedure. You can use weight.permutation="random-sampling" (random assignment of covariate profiles to the objects using distributions defined by the weights).
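The two options above could look as follows on the example data. This is a sketch, not a definitive recipe: the first case pretends that mvad has no sampling weights, so the aggregation weights are plain integer counts, and all object names with the .n and .w suffixes are introduced here for illustration.

```r
library(TraMineR)
library(WeightedCluster)
data(mvad)
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR")

## Case 1: no sampling weights, so aggWeights are integer frequencies
agg.n <- wcAggregateCases(mvad[, c(10:12, 17:86)])
uniq.n <- mvad[agg.n$aggIndex, ]
seq.n <- seqdef(uniq.n[, 17:86], alphabet = mvad.alphabet, states = mvad.scodes,
                weights = agg.n$aggWeights, xtstep = 6)
dist.n <- seqdist(seq.n, method = "OM", indel = 1.5, sm = "CONSTANT")
## "replicate" permutes the individual cases hidden inside each aggregate
tree.n <- seqtree(seq.n ~ gcse5eq + Grammar + funemp, data = uniq.n,
                  diss = dist.n, weight.permutation = "replicate")
print(tree.n)

## Case 2: sampling weights, so aggWeights are sums of non-integer weights
agg.w <- wcAggregateCases(mvad[, c(10:12, 17:86)], weights = mvad$weight)
uniq.w <- mvad[agg.w$aggIndex, ]
seq.w <- seqdef(uniq.w[, 17:86], alphabet = mvad.alphabet, states = mvad.scodes,
                weights = agg.w$aggWeights, xtstep = 6)
dist.w <- seqdist(seq.w, method = "OM", indel = 1.5, sm = "CONSTANT")
## "random-sampling" draws covariate profiles according to the weights
tree.w <- seqtree(seq.w ~ gcse5eq + Grammar + funemp, data = uniq.w,
                  diss = dist.w, weight.permutation = "random-sampling")
print(tree.w)
```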

In all cases, you may observe small differences in the p-values, both because the procedure differs and because p-values are estimated by permutation tests. To get more precise p-values, use a higher R value (number of permutations). In the tree procedure, the p-value threshold required to make a split can be changed with the pval argument. Try setting it slightly higher to see whether the differences come from there.
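As an illustration of these two arguments, the sketch below grows the aggregated tree with more permutations and a looser split threshold. The values R = 5000 and pval = 0.05 are arbitrary choices for the example, not recommendations, and the data are treated as unweighted so that "replicate" applies.

```r
library(TraMineR)
library(WeightedCluster)
data(mvad)
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR")

## Aggregate covariates and sequences together (unweighted, integer counts)
agg <- wcAggregateCases(mvad[, c(10:12, 17:86)])
uniq <- mvad[agg$aggIndex, ]
agg.seq <- seqdef(uniq[, 17:86], alphabet = mvad.alphabet, states = mvad.scodes,
                  weights = agg$aggWeights, xtstep = 6)
agg.dist <- seqdist(agg.seq, method = "OM", indel = 1.5, sm = "CONSTANT")

## More permutations (R) for steadier p-values, and a slightly higher
## threshold (pval) so that borderline splits are not lost to sampling noise
tree.agg <- seqtree(agg.seq ~ gcse5eq + Grammar + funemp, data = uniq,
                    diss = agg.dist, weight.permutation = "replicate",
                    R = 5000, pval = 0.05)
print(tree.agg)
```

If the trees from the original and the aggregated data agree under these settings, the earlier mismatch probably came from p-values close to the threshold.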

I hope it helps.

Licensed under: CC-BY-SA with attribution