Strange number of subsequences?

https://stackoverflow.com/questions/20718879

traminer

20-09-2022

Question

I have a sequence object created like this:

subsequences <- function(data){
  slmax <- max(data$time)
  sequences.seqe <- seqecreate(data)
  sequences.sts <- seqformat(data, from="SPELL", to="DSS", begin="time", end="end", id="id", status="event", limit=slmax)
  sequences.sts <- seqdef(sequences.sts, right = "DEL", left = "DEL", gaps = "DEL")
  (sequences.sts)
}

data <- subsequences(data)

head(data)

Which gives the output:

    Sequence                                                                     
[1] discussed-subscribed-*-discussed-*-discussed-*-discussed-*-discussed-*-closed
[2] *-opened-*-reviewed-*-discussed-*-discussed-*-discussed-*-merged             
[3] *-discussed-*-discussed-*-discussed-*-discussed                              
[4] *-opened-*-discussed-merged-discussed                                        
[5] *-discussed-*-referenced-discussed-closed-discussed-referenced-discussed     
[6] *-referenced-*-referenced-*-referenced-assigned-*-closed

But when I calculate the subsequences, I get seemingly ridiculous answers:

seqsubsn(head(data))
 [!] found missing state in the sequence(s), adding missing state to the alphabet
    Subseq.
[1]    1036
[2]    1248
[3]      88
[4]      49
[5]     294
[6]     240

How can the number of subsequences be so much larger than the number of events in each sequence?

A 'dput()' of the dataset can be found here. The issue seems to be that the original data has time stamps in seconds. However, I've used the function below in order to change the timestamps to simply be sequential:

read_seqdata <- function(data, startdate, stopdate){
  data <- read.table(data, sep = ",", header = TRUE)
  data <- subset(data, select = c("pull_req_id", "action", "created_at"))
  colnames(data) <- c("id", "event", "time")
  data <- sqldf(paste0("SELECT * FROM data WHERE strftime('%Y-%m-%d', time,
    'unixepoch', 'localtime') >= '",startdate,"' AND strftime('%Y-%m-%d', time,
    'unixepoch', 'localtime') <= '",stopdate,"'"))
  data$end <- data$time
  data <- data[with(data, order(time)), ]
  data$time <- match( data$time , unique( data$time ) )
  data$end <- match( data$end , unique( data$end ) )
  slmax <- max(data$time)
  (data)
}

This makes it possible to create appropriate measures for entropy, sequence length etc., but the number of subsequences is still problematic.

Solution

The numbers of subsequences returned are not surprising at all. It is a matter of how 'subsequence' is defined: a subsequence should not be confused with a substring, whose elements must be contiguous.

A sequence $x = (x_1, x_2, \ldots, x_k)$ is a subsequence of $y$ if its elements $x_i$ are all in $y$ and occur in the same order as in $y$, though not necessarily next to each other. For instance, A-B-A is a subsequence of C-A-D-B-C-D-A-D.
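
As a minimal sketch of this definition in base R, the function below scans $y$ once and matches the elements of $x$ in order. The name is_subsequence is a hypothetical helper for illustration; it is not part of TraMineR.

```r
# Hypothetical helper (not a TraMineR function):
# returns TRUE if x is a subsequence of y.
is_subsequence <- function(x, y) {
  i <- 1                                  # next element of x to match
  for (state in y) {
    if (i <= length(x) && state == x[i]) i <- i + 1
  }
  i > length(x)                           # TRUE once all of x was matched in order
}

y <- c("C", "A", "D", "B", "C", "D", "A", "D")
is_subsequence(c("A", "B", "A"), y)       # TRUE
is_subsequence(c("B", "A", "B"), y)       # FALSE: no second B after the A
```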

To illustrate, consider the 'mvad' example from the TraMineR package.

library(TraMineR)
data(mvad)
mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR")
mvad.seq <- seqdef(mvad, 17:86, states = mvad.scodes)
print(mvad.seq[1:3,], format="SPS")

##    Sequence                      
##[1] (EM,4)-(TR,2)-(EM,64)         
##[2] (FE,36)-(HE,34)               
##[3] (TR,24)-(FE,34)-(EM,10)-(JL,2)

seqsubsn(mvad.seq)[1:3]

##[1]  7  4 16

By default, seqsubsn computes the number of subsequences of the distinct successive states (DSS). The DSS of the first sequence, for example, is EM-TR-EM. The seven subsequences of EM-TR-EM are:

the empty sequence
the two sequences made of a single element: EM and TR
the two-length subsequences: EM-TR, EM-EM, TR-EM
the three-length sequence: EM-TR-EM
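
The same count can be reproduced without TraMineR. The sketch below uses the standard dynamic-programming recurrence for counting distinct subsequences (empty one included): each new state doubles the count, minus whatever was already counted at the previous occurrence of that state. count_subseq is a hypothetical helper, and this is not necessarily how seqsubsn is implemented internally.

```r
# Hypothetical helper: number of distinct subsequences of a vector of states,
# including the empty subsequence.
count_subseq <- function(states) {
  n <- length(states)
  dp <- numeric(n + 1)        # dp[k + 1] = count for the first k states
  dp[1] <- 1                  # empty prefix: only the empty subsequence
  last <- integer(0)          # named vector: last position of each state
  for (k in seq_len(n)) {
    s <- states[k]
    dp[k + 1] <- 2 * dp[k]
    if (s %in% names(last)) {
      # drop subsequences already counted at the previous occurrence of s
      dp[k + 1] <- dp[k + 1] - dp[last[[s]]]
    }
    last[s] <- k
  }
  dp[n + 1]
}

count_subseq(c("EM", "TR", "EM"))                                    # 7
count_subseq(c("*", "opened", "*", "discussed", "merged", "discussed"))  # 49
```

Applied to the DSS of the sequences in the question, with * treated as an ordinary state, the recurrence gives the same values that seqsubsn reported.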

Proceeding the same way, you can verify that your fourth sequence (which is equal to its DSS)

*-opened-*-discussed-merged-discussed

has 49 subsequences, including the empty one. Ten of them have length two (the missing state * counts as a state, so *-* is among them):

*-opened, *-*, *-discussed, *-merged, opened-*, opened-discussed, opened-merged, discussed-merged, discussed-discussed, merged-discussed
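
To check such a list by machine, the distinct two-length subsequences can be enumerated in base R. This is only a sketch: it assumes the missing state * is treated as an ordinary state (as the seqsubsn message in the question indicates), and the variable names are illustrative.

```r
dss <- c("*", "opened", "*", "discussed", "merged", "discussed")

# All position pairs i < j, one pair per column
pairs <- combn(length(dss), 2)

# Build "first-second" labels and keep each distinct one once
two_len <- unique(paste(dss[pairs[1, ]], dss[pairs[2, ]], sep = "-"))

two_len
length(two_len)
```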

Hope this helps

Licensed under: CC-BY-SA with attribution
