Question

I'm using ELKI to cluster, in a hierarchical way, a dataset of geolocations using OPTICSXi. The result of the execution of the algorithm is a set of files.

The content of a file could be:

# Cluster: nameOfCluster
# OPTICSModel
# Parents: nameOfParents (this element doesn't exist for the root cluster)
# Children: nameOfChild_0, nameOfChild_1 ... nameOfChild_n, (optional) 
ID=1 lat0 lon0 reachability=?
ID=3062 lat1 lon1 reachability=1.30972586 predecessor=1
ID=7383 lat2 lon2 reachability=2.56784445 predecessor=3062
ID=42839 lat3 lon3 reachability=4.05510623 predecessor=1

I don't understand if the elements that are in each file (in the example there are four elements) belong to the same cluster or could belong to different clusters. In the latter case, I need to write some code that builds the clusters ( for example looking at the predecessor of each node), or there are some parameters that could I specify in Elki to obtain each single cluster?

Was it helpful?

Solution

By default, ELKI will produce a directory with one file per cluster. Unless the output file already exists, in which case you will get all the clusters written into the same file, separated with comments as seen above.

With a hierarchical result, such as OPTICSXi, your should however also treat all members of the child clusters to be also part of the parent. These are clusters nested into the parent. They are not repeated in the parent, to reduce redundancy in the output.

Compare the output of OPTICSXi to OPTICS output. What the Xi approach does, is split the data for you, based on sudden drops in reachability-distance. All clusters of Xi should be subsequences of the original OPTICS cluster order.

In your case, you may have chosen minPts too small, if your cluster has just 4 elements. (Although, you may have truncated the file, or you may have a lot of elements in child clusters; so the output may be fine).

Also note that you will usually want to validate whether you want the first element(s) of your cluster to belong to the cluster or not; similarly the last elements. OPTICSXi tends to err on the first elements, but not in a systematic way that would be trivial to fix. The first and last elements are those that bridge the gap from one cluster to another. You really should verify these manually (which is a good reason to not choose minPts too small).

I strongly recommend to build/use a visualization for your specific use case. Then you could just load such a cluster into your visualization and visually inspect if the result makes sense to you. I have used OPTICSXi on geographic data, and that worked very well for me.

OTHER TIPS

So, if I've understood well, in the example above, the cluster is composed of the elements ID=1, ID=3062, ID=7383, ID=42839, and all the elements in nameOfChild_0, nameOfChild_1 ... nameOfChild_n. Maybe, I don't have to join the children in the root element, because I guess I'll obtain a unique big cluster contained all my geo-locations, in fact I have 903 child elements and 18795 node (ID).

I've done a lot of tests, choosing minPoint = {2,5,10} and xi = {0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001}. I use a visualization of my clusters, but I can't find a good result. I'm having a lot of trouble.

Thanks to your reply I've understood that I split my elements too much, in the sense that for me each file is a cluster, and for this reason I don't consider the child elements in the parent, but I consider them as separated clusters.

Moreover, I noticed that the first and the last element sometimes are wrong, I've thought to verify if this elements are predecessor of at least one element in the cluster, or at least one element in the cluster is a predecessor of those. Does this make sense?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top