The internal seqdist function is written in C++ and is heavily optimized. For this reason, if you want to parallelize seqdist, you need to do it in C++. The relevant code is in the source file "distancefunctions.cpp": look at the two loops located around line 300 in the function "cstringdistance" (sorry, all comments are in French). Unfortunately, another important optimization is that memory is shared between all the distance computations. For this reason, I think that parallelization would be very complicated.
Apart from selecting a sample, you should consider the following optimizations:
- Aggregate identical sequences, so that each distinct sequence is compared only once (see the question "Problem with big data (?) during computation of sequence distances using TraMineR").
- If relevant, you can try to reduce the time granularity. Distance computation time depends strongly on sequence length: it grows quadratically, O(n^2) for sequences of length n. See https://stats.stackexchange.com/questions/43601/modifying-the-time-granularity-of-a-state-sequence
- Reducing the time granularity may also increase the number of identical sequences, and hence the benefit of the first optimization (aggregation of identical sequences).
- There is a hidden option in seqdist to use an optimized version of the optimal matching algorithm. It is still in the testing phase (that's why it is hidden), but it should replace the current algorithm in a future version. To use it, set method="OMopt" instead of method="OM". Depending on your sequences, it may reduce computation time.
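
To make the aggregation and granularity suggestions more concrete, here is a minimal, language-neutral sketch in Python. It is not TraMineR code and does not reflect how seqdist is implemented. It assumes constant indel and substitution costs, and the helper names (om_distance, aggregate, coarsen, pairwise) are hypothetical, chosen only for illustration. The idea is to compute distances between the distinct sequences only, then map them back to the original cases.

```python
def om_distance(a, b, indel=1.0, sub=2.0):
    """Edit-type (optimal matching) distance with constant costs.

    The double loop is what makes the cost quadratic in sequence length.
    """
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * indel] + [0.0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub
            cur[j] = min(prev[j] + indel,      # deletion
                         cur[j - 1] + indel,   # insertion
                         prev[j - 1] + cost)   # match or substitution
        prev = cur
    return prev[len(b)]


def aggregate(sequences):
    """Return the distinct sequences, their counts, and for each original
    case the position of its sequence in the distinct list."""
    unique, position, weights, back = [], {}, [], []
    for s in sequences:
        key = tuple(s)
        if key not in position:
            position[key] = len(unique)
            unique.append(key)
            weights.append(0)
        weights[position[key]] += 1
        back.append(position[key])
    return unique, weights, back


def coarsen(seq, step):
    """Keep one state every `step` positions (e.g. monthly to quarterly, step=3)."""
    return tuple(seq[::step])


def pairwise(unique):
    """Symmetric distance matrix over the distinct sequences only."""
    n = len(unique)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = om_distance(unique[i], unique[j])
    return d


# Toy data: five cases, but only three distinct sequences.
sequences = ["AAAABBBB", "AAAABBBB", "AAABBBBB", "CCCCCCCC", "AAAABBBB"]

unique, weights, back = aggregate(sequences)
dist_unique = pairwise(unique)  # 3 x 3 instead of 5 x 5

# Expand back to one row and column per original case.
n_cases = len(back)
full = [[dist_unique[back[i]][back[j]] for j in range(n_cases)]
        for i in range(n_cases)]

# A coarser time grid shortens the sequences and merges more of them.
coarse_unique, coarse_weights, _ = aggregate(coarsen(s, 4) for s in sequences)
```

On this toy data, aggregation reduces five cases to three distinct sequences, and coarsening by a factor of four leaves only two. The gain on real data depends entirely on how many duplicates the sequences contain.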