Question

Suppose I had a data set as follows:

168.95, 176.83, 178.13, 179.61, 179.44, 172.83, 173.37, 174.06, 174.94, 175.43, 175.73, 175.5, 173.78, 174.06, 172.71, 174.3, 178.38, 178.43, 177.18, 175.34, 176.16, 177.03, 178.2, 178.19, 176.42, 177.4, 178.15, 173.1, 173.4, 174.42, 173.87, 171.85, 172.37, 172.73, 172.28, 173.65, 172.35, 171.66, 172.8, 171.97, 168.56, 168.74, 168.38, 169.78, 169.79, 170.07, 171.57, 169.46, 169.32, 170.3, 154.83, 171.33, 169.61, 170.08, 170.11, 170.22, 172.16, 169.64, 169.86, 170.56, 171.64, 166.91, 167.93, 170.35, 170.56, 168.16, 168.41, 170.54, 168.89, 168.03, 169.28, 169.47, 171.24, 169.15, 170.43, 168.36, 169.19, 168.58, 168.67, 169.19, 167.69, 168.4, 168.27, 167.89, 168.18, 168.91, 168.16, 168.98, 168.79, 166.87, 168.36, 170.22, 168.36, 169.01, 168.45, 168.73, 168.24, 170.12, 169.13, 168.95, 169.06, 168.47, 168.68, 170.22, 170.04, 167.92, 168.38, 168.14, 169.27, 170.31, 168.68, 170.59, 168.95, 170.35, 152.99, 153.69, 156.8, 156.45, 156.03, 156.01, 156.63, 156.68, 155.33, 156.31, 155.65, 154.64, 156.02, 155.94, 154.5, 154.53, 154.81, 156.15, 155.58, 155.55, 154.54, 154.66, 155.09, 155.99, 155.27, 155.11, 155.22, 155.98, 156.04, 153.86, 158.48, 158.34, 158.3, 159.29, 158.4, 158.69, 159.17, 159.13, 158.02, 158.7, 157.94, 158.81, 159.14, 159.1

And another data set as follows:

125.14, 130.07, 130.45, 127.8, 128.12, 128.3, 129.18, 128, 127.84, 128.4, 128.86, 128.95, 128.29, 129.24, 128.33, 127.9, 128.96, 127.96, 128.08, 128.96, 128.6, 129.27, 128.74, 129.82, 129.72, 129.65, 130.12, 129.02, 129.8, 129.5, 129.51, 129.97, 129.95, 130.45, 130.51, 130.51, 129.44, 128.42, 129.33, 128.65, 129.15, 129.71, 128.63, 130.17, 128.96, 127.8, 128.12, 128.3, 129.18, 128, 127.84, 128.4, 128.86, 128.95, 128.29, 129.24, 128.33, 127.9, 128.96, 127.96, 128.08, 128.96, 128.6, 129.27, 128.74, 129.82, 129.72, 129.65, 130.12, 129.02, 129.8, 129.5

Given that the data cannot be filtered or sorted, since they correspond to real-world, real-time values, what algorithmic method would you suggest to group the data into the smallest number of statistically significant groups, one that can be used for both data sets (and possibly more)?

My initial idea was to compute a global mean, standard deviation, and variance, then split the data into two groups and compute the same statistics locally. I would then compare the local values against the global ones, and any group that failed the comparison would be further split into two. This would continue until the standard deviation and variance within each group were small enough not to introduce statistical bias (see the sketch below).
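For concreteness, here is a minimal Python sketch of that splitting idea; the `sd_threshold` value of 2.0 and the midpoint split point are illustrative assumptions, not tuned choices:

```python
import statistics

def split_groups(data, sd_threshold=2.0):
    """Recursively split a sequence in half until each group's
    standard deviation falls at or below sd_threshold."""
    if len(data) < 2 or statistics.stdev(data) <= sd_threshold:
        return [data]
    mid = len(data) // 2  # split at the midpoint, preserving order
    return (split_groups(data[:mid], sd_threshold)
            + split_groups(data[mid:], sd_threshold))

# Usage: groups = split_groups(values), where values is one of the
# datasets above, kept in its original order.
```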

If anyone can think of a method that is more computationally efficient and/or mathematically rigorous, I would appreciate it greatly.

Also, I am new to numerical methods in general and would welcome any advice you have in this endeavor.


Solution

The example you give may be misleading, because by eye the two datasets already look separated into two significantly different groups.

If your question is really: what if the two datasets were one big dataset and you needed an algorithm to discover these groups? Then the answer is that you want to look up clustering methods. Wikipedia has a very nice write-up on k-means clustering, which is a good place to start: http://en.wikipedia.org/wiki/K-means_clustering
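As a rough illustration, here is a minimal sketch using scikit-learn's KMeans; the library choice and k=2 are assumptions for demonstration, since the answer only points at the general technique. In practice, when the number of groups is unknown, k is usually chosen with a heuristic such as the elbow method or silhouette scores.

```python
import numpy as np
from sklearn.cluster import KMeans

# A small hypothetical sample of values in the same range as the
# question's datasets; 1-D data must be reshaped to a column vector.
data = np.array([168.95, 176.83, 154.83, 152.99, 156.80, 158.48,
                 168.36, 170.22, 155.33, 159.29]).reshape(-1, 1)

# k=2 is illustrative; vary n_clusters and compare fits to pick k.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # one center (mean) per group
```

Note that k-means does not require the data to be sorted or filtered first, which matches the constraint in the question.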

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow