Question

I have a large dataset I am trying to do cluster analysis on using SOM. The dataset is HUGE (~ billions of records) and I am not sure what the number of neurons and the SOM grid size should be to start with. Any pointers to material that discusses estimating the number of neurons and the grid size would be greatly appreciated.

Thanks!


Solution 3

I don't have a reference for it, but I would suggest starting off by using approximately 10 SOM neurons per expected class in your dataset. For example, if you think your dataset consists of 8 separate components, go for a map with 9x9 neurons. This is completely just a ballpark heuristic though.
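As a rough sketch of that heuristic (assuming the kohonen R package and a hypothetical numeric matrix X, neither of which is part of the original answer):

# ~10 units per expected class: 8 classes -> ~80 units -> a 9x9 hexagonal grid
library(kohonen)

n_classes <- 8                      # expected number of components (assumption)
n_units   <- 10 * n_classes         # ballpark: 10 neurons per class
side      <- ceiling(sqrt(n_units)) # square-ish layout -> 9 x 9

som_grid  <- somgrid(xdim = side, ydim = side, topo = "hexagonal")
som_model <- som(scale(X), grid = som_grid, rlen = 100)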

If you'd like the data to drive the topology of your SOM a bit more directly, try one of the SOM variants that change topology during training, such as the Growing Self-Organizing Map (GSOM) or Growing Grid.

Unfortunately these algorithms involve even more parameter tuning than plain SOM, but they might work for your application.

Other tips

Quoting from the som_make function documentation of the SOM Toolbox:

It uses a heuristic formula of 'munits = 5*dlen^0.54321'. The 'mapsize' argument influences the final number of map units: a 'big' map has x4 the default number of map units and a 'small' map has x0.25 the default number of map units.

where dlen is the number of records in your dataset.
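To get a feel for the scale this implies, here is a small R snippet that simply evaluates the quoted heuristic for a few dataset sizes; for ~1e9 records it suggests on the order of a few hundred thousand map units, which is why the WEBSOM work below is relevant:

# Map units suggested by the SOM Toolbox heuristic munits = 5 * dlen^0.54321
dlen   <- c(1e4, 1e6, 1e9)                   # example record counts
munits <- ceiling(5 * dlen^0.54321)
data.frame(dlen    = dlen,
           default = munits,                 # default map size
           big     = 4 * munits,             # 'big' map: x4 the default
           small   = ceiling(0.25 * munits)) # 'small' map: x0.25 the default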

You can also read about the classic WEBSOM, which addresses the issue of large datasets:

http://www.cs.indiana.edu/~bmarkine/oral/self-organization-of-a.pdf
http://websom.hut.fi/websom/doc/ps/Lagus04Infosci.pdf

Keep in mind that the map size is also application specific: it depends on what you want to do with the generated clusters. Large maps produce a large number of small but "compact" clusters (the records assigned to each cluster are quite similar). Small maps produce fewer but more generalized clusters. A "right number of clusters" doesn't exist, especially in real-world datasets. It all depends on the level of detail at which you want to examine your dataset.

I have written a function that takes the data set as input and returns the grid size. I rewrote it from the som_topol_struct() function of MATLAB's Self-Organizing Map Toolbox into an R function.

topology <- function(data) {
  # For a hexagonal lattice, determines the number of neurons (munits)
  # and their layout (msize), following MATLAB's som_topol_struct()
  D <- data
  dlen <- nrow(data)   # dlen: number of records (subjects)
  ndim <- ncol(data)   # number of variables

  # MATLAB heuristic formula (som_topol_struct uses 5*dlen^0.54321)
  munits <- ceiling(5 * dlen^0.5)

  # Center each column, ignoring non-finite values
  for (i in 1:ndim) {
    D[, i] <- D[, i] - mean(D[is.finite(D[, i]), i])
  }

  # Covariance-like matrix computed over finite values only
  A <- matrix(Inf, nrow = ndim, ncol = ndim)
  for (i in 1:ndim) {
    for (j in i:ndim) {
      prod_ij <- D[, i] * D[, j]
      prod_ij <- prod_ij[is.finite(prod_ij)]
      A[i, j] <- sum(prod_ij) / length(prod_ij)
      A[j, i] <- A[i, j]
    }
  }

  # The ratio of the two largest eigenvalues sets the grid's aspect ratio
  eigval <- sort(eigen(A)$values)
  n <- length(eigval)
  if (eigval[n] == 0 || eigval[n - 1] * munits < eigval[n]) {
    ratio <- 1
  } else {
    ratio <- sqrt(eigval[n] / eigval[n - 1])
  }

  # sqrt(0.75) accounts for the spacing of a hexagonal lattice
  size1 <- min(munits, round(sqrt(munits / ratio * sqrt(0.75))))
  size2 <- round(munits / size1)

  return(list(munits = munits, msize = sort(c(size1, size2), decreasing = TRUE)))
}
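A quick usage sketch; the kohonen calls at the end are my assumption about how you might consume the result, not part of the original function:

# X is a hypothetical numeric matrix of records x variables
topo <- topology(X)
topo$munits            # suggested number of map units
topo$msize             # suggested grid dimensions (rows, cols)

library(kohonen)       # assumption: feeding the result into the kohonen package
fit <- som(scale(X),
           grid = somgrid(xdim = topo$msize[1], ydim = topo$msize[2],
                          topo = "hexagonal"))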

hope it helps...

Iván Vallés-Pérez

Kohonen has written on the issue of selecting parameters and map size for SOM in his book "MATLAB Implementations and Applications of the Self-Organizing Map". He suggests that in some cases the initial values can be arrived at by testing several sizes of the SOM, checking that the cluster structures are shown with sufficient resolution and statistical accuracy.
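A hedged sketch of that "test several sizes" advice, assuming the kohonen R package and an already scaled numeric matrix X (both assumptions): it compares the mean quantization error (average distance of each record to its best-matching unit) across a few candidate grids.

# Compare candidate grid sizes by mean quantization error
library(kohonen)

sizes <- list(c(5, 5), c(10, 10), c(20, 20))    # candidate grids (assumption)
qe <- sapply(sizes, function(s) {
  fit <- som(X, grid = somgrid(s[1], s[2], "hexagonal"), rlen = 100)
  mean(fit$distances)   # distance of each record to its winning unit
})
names(qe) <- sapply(sizes, paste, collapse = "x")
qe  # larger maps tend to lower this, so look at the trend rather than the raw minimum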

My suggestions would be the following:

  1. SOM is distantly related to correspondence analysis. In statistics, 5*r^2 is used as a rule of thumb, where r is the number of rows/columns in a square setup.
  2. Usually, one should use some criterion that is based on the data itself, meaning that you need some criterion for estimating the homogeneity. If a certain threshold is violated, you need more nodes. For checking the homogeneity you need a certain number of records per node. Again, from statistics you can learn that for simple tests (a small number of variables) you need around 20 records, and for more advanced tests on several variables at least 8 records.
  3. Remember that the SOM represents a predictive model, so validation is key, absolutely mandatory. Yet validation of predictive models (see the Type I/II error entry on Wikipedia) is a subject of its own, and the acceptable risk as well as the risk structure also depend fully on your purpose.
  4. You may test the dynamics of the model's error rate by reducing its size more and more, then take the smallest map with an acceptable error.
  5. It is a strength of the SOM to allow for empty nodes. Yet there should not be too many of them; say, less than 5%.

Taken all together, from experience I would recommend the following criterion: a minimum of 8 to 10 records per node, with the nodes falling below that making up no more than 5% of all clusters. The 5% rule is of course a heuristic, but it can be justified by the general usage of confidence levels in statistical tests. You may choose any percentage from 1% to 5%.
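A minimal sketch of those two checks (records per node and the share of empty nodes), assuming a SOM fitted with the kohonen package as in the earlier snippets; fit$unit.classif holds each record's winning unit:

# Records per node and percentage of empty nodes for a fitted kohonen SOM
n_units <- nrow(fit$grid$pts)                     # total number of nodes
counts  <- table(factor(fit$unit.classif, levels = seq_len(n_units)))

empty_pct <- 100 * mean(counts == 0)              # aim for < 5% per the heuristic above
thin_pct  <- 100 * mean(counts > 0 & counts < 8)  # nodes below ~8 records

empty_pct
thin_pct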

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow