Question

I have a list of dates I'd like to cluster into 3 clusters. Now, I can see hints that I should be looking at k-means, but all the examples I've found so far are related to coordinates, in other words, pairs of list items.

I want to take this list of dates and append them to three separate lists indicating whether they were before, during or after a certain event. I don't have the time for this event, but that's why I'm guessing it by breaking the date/times into three groups.

Can anyone please help with a simple example on how to use something like numpy or scipy to do this?


Solution

Here are some workaround methods that may not be the best answer but should help.

You can convert each date to a number, for example the minutes or hours elapsed since a chosen starting point (such as the earliest date in the list, or the start of the week), and plot those durations.

All of the points then lie along a single x-axis, but k-means still works on them, and any clustering should still be visible on a graph.
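A minimal sketch of that idea, assuming NumPy is available. The sample `dates`, the helper name `kmeans_1d`, and the quantile-based initialisation are illustrative assumptions, not part of the original answer. The loop is plain Lloyd's algorithm applied to one-dimensional values (hours since the earliest date):

```python
from datetime import datetime

import numpy as np


def kmeans_1d(values, k=3, n_iter=100):
    """Plain Lloyd's algorithm on 1-D data; returns (labels, centers)."""
    x = np.asarray(values, dtype=float)
    # Deterministic start (an assumption): spread the centers over the quantiles.
    centers = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        labels = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        # Move each center to the mean of its points (keep it if the cluster is empty).
        new_centers = np.array([
            x[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Hypothetical sample data: three bursts of timestamps.
dates = [
    datetime(2023, 5, 1, 9, 0), datetime(2023, 5, 1, 9, 20), datetime(2023, 5, 1, 10, 5),
    datetime(2023, 5, 3, 14, 0), datetime(2023, 5, 3, 14, 30), datetime(2023, 5, 3, 15, 10),
    datetime(2023, 5, 6, 8, 0), datetime(2023, 5, 6, 8, 45),
]

# Convert each date to hours elapsed since the earliest date.
start = min(dates)
hours = [(d - start).total_seconds() / 3600.0 for d in dates]

k = 3
labels, centers = kmeans_1d(hours, k=k)

# Rank the clusters by center so that rank 0 is the earliest group.
order = np.argsort(centers)
rank = np.empty(k, dtype=int)
rank[order] = np.arange(k)

before, during, after = [], [], []
groups = (before, during, after)
for d, lab in zip(dates, labels):
    groups[rank[lab]].append(d)
```

If SciPy is installed, `scipy.cluster.vq.kmeans2` could stand in for the hand-written loop; it is written out here only to keep the sketch deterministic and dependency-light.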

More NumPy examples can be found in the question "Python k-means algorithm".

Other tips

k-means is exclusively for coordinates. And more precisely: for continuous and linear values.

The reason is the mean function. Many people overlook the role of the mean in k-means (despite it being in the name...)

On non-numerical data, how do you compute the mean?

Some variants exist for binary or categorical data. IIRC there is k-modes, for example, and there is k-medoids (PAM, partitioning around medoids).

It's unclear to me what you want to achieve overall... your data seems to be 1-dimensional, so you may want to look at the many questions here about 1-dimensional data (as the data can be sorted, it can be processed much more efficiently than multidimensional data).
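As one illustration of how sorting simplifies the 1-dimensional case, the sketch below sorts the dates and cuts the list at the two largest gaps between neighbours. This is a swapped-in heuristic, not something the answer prescribes, and it rests on the assumption that the three phases are separated by pauses in the data. The function name `split_at_largest_gaps` and the sample `dates` are hypothetical:

```python
from datetime import datetime


def split_at_largest_gaps(dates, n_groups=3):
    """Sort the dates and cut at the (n_groups - 1) widest gaps between neighbours."""
    ordered = sorted(dates)
    # gaps[i] is the time between ordered[i] and ordered[i + 1].
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    # Indices of the widest gaps, put back into chronological order.
    widest = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    cut_after = sorted(widest[:n_groups - 1])

    groups, begin = [], 0
    for i in cut_after:
        groups.append(ordered[begin:i + 1])
        begin = i + 1
    groups.append(ordered[begin:])
    return groups


# Hypothetical sample data; assumes at least n_groups distinct dates.
dates = [
    datetime(2023, 5, 3, 14, 30), datetime(2023, 5, 1, 9, 0), datetime(2023, 5, 6, 8, 45),
    datetime(2023, 5, 1, 9, 20), datetime(2023, 5, 3, 14, 0), datetime(2023, 5, 6, 8, 0),
    datetime(2023, 5, 1, 10, 5), datetime(2023, 5, 3, 15, 10),
]

before, during, after = split_at_largest_gaps(dates, n_groups=3)
```

The only real cost is the sort; no iteration or random initialisation is involved. If the data has no clear pauses, this heuristic will not give meaningful groups.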

In general, even if you converted your data to Unix time (seconds since 1 January 1970), k-means will likely return only mediocre results for you. The reason is that it will try to make the three intervals have the same length.

Do you have any reason to suspect that "before", "during" and "after" have the same duration? If not, don't use k-means.

You may, however, want to have a look at kernel density estimation (KDE) and plot the estimated density. Once you have understood the role of density for your task, you can start looking at appropriate algorithms (e.g. take the derivative of your density estimate and look for the largest increase / decrease, or estimate an "average" level and look for the longest above-average interval).
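The "average level" variant could be sketched roughly as follows, assuming NumPy and timestamps already converted to hours. The synthetic `samples`, the 2-hour `bandwidth`, the evaluation `grid`, and both helper names are assumptions chosen for illustration; real data would need its own bandwidth. The density is a hand-written Gaussian KDE, and "during" is taken to be the longest stretch where the density exceeds its mean over the grid:

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Gaussian kernel density estimate of `samples`, evaluated at `grid`."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    norm = len(samples) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z ** 2).sum(axis=1) / norm


def longest_run_above(mask):
    """Return (start, stop) of the longest run of True values, stop exclusive."""
    best, run_start = (0, 0), None
    for i, flag in enumerate(list(mask) + [False]):
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            if i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    return best


# Synthetic timestamps in hours (an assumption): sparse, then dense, then sparse.
samples = np.concatenate([
    np.arange(0.0, 50.0, 10.0),    # sparse activity before the event
    np.arange(50.0, 60.25, 0.5),   # dense activity during the event
    np.arange(70.0, 110.0, 10.0),  # sparse activity after the event
])

grid = np.linspace(0.0, 100.0, 1001)
density = gaussian_kde_1d(samples, grid, bandwidth=2.0)

# "Average" level: the mean of the density over the grid.
level = density.mean()
lo, hi = longest_run_above(density > level)
event_start, event_end = grid[lo], grid[hi - 1]

before = samples[samples < event_start]
during = samples[(samples >= event_start) & (samples <= event_end)]
after = samples[samples > event_end]
```

Unlike k-means, nothing here pushes the three intervals toward equal length; the split follows where the data is dense. The result does depend on the bandwidth, so plotting `density` against `grid` before trusting the boundaries is advisable.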

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow