Categorizing points using known distributions

https://stackoverflow.com/questions/23393456

12-07-2023
Question

My problem is as follows:

I am given a number of chi-squared values for the same collection of data sets, fitted with different models. For example, for 5 collections of points, each fitted with either a single binomial distribution or both binomial and normal distributions, I would have 10 chi-squared values.

I would like to use machine learning categorization to categorize the data sets into "models":

For example, data sets 1, 2, 5 and 7 are best fitted using only binomial distributions, whereas sets 3, 4, 6, 8, 9 and 10 need a normal distribution as well.

Notably, the number of degrees of freedom is likely to differ between the two chi-squared distributions and is always known, as is the number of models.

My (probably) naive guess for a solution would be as follows:

1. Randomly distribute the points (10 chi-squared values in this case) into the number of categories (2).
2. Fit each of the categories using the particular chi-squared distributions (in this case with different numbers of degrees of freedom).
3. Move outlying points from one distribution to the next.
4. Repeat steps 2 and 3 until happy with result.
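
The loop above can be sketched in plain Python. This is only an illustration under stated assumptions, not an established algorithm from the question: it assumes each chi-squared value really follows the chi-squared distribution of its category, it treats "fitting" as estimating the share of points in each category (the degrees of freedom are known, so nothing else is free), and it calls a point "outlying" when another category gives it a higher weighted density. The names chi2_logpdf and hard_assign are made up for this sketch.

```python
import math
import random


def chi2_logpdf(x, k):
    """Log density of a chi-squared distribution with k degrees of freedom (x > 0)."""
    half_k = k / 2.0
    return ((half_k - 1.0) * math.log(x) - x / 2.0
            - half_k * math.log(2.0) - math.lgamma(half_k))


def hard_assign(values, dofs, max_iter=100, seed=0):
    """Assign each chi-squared value to one category (hypothetical helper).

    values: observed chi-squared values
    dofs:   known degrees of freedom, one entry per category
    """
    rng = random.Random(seed)
    n_cat = len(dofs)
    # Step 1: random initial assignment.
    labels = [rng.randrange(n_cat) for _ in values]
    for _ in range(max_iter):
        # Step 2: "fit" the categories. Only the category shares are unknown.
        # Add-one smoothing (an assumption) keeps an empty category alive.
        counts = [labels.count(c) for c in range(n_cat)]
        weights = [(cnt + 1.0) / (len(values) + n_cat) for cnt in counts]
        # Step 3: move every point to the category with the highest
        # weighted log density.
        new_labels = [
            max(range(n_cat),
                key=lambda c: math.log(weights[c]) + chi2_logpdf(x, dofs[c]))
            for x in values
        ]
        # Step 4: stop once no point changes category.
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Made-up example values: three small and three large chi-squared values.
example_labels = hard_assign([1.0, 2.0, 3.0, 30.0, 35.0, 40.0], dofs=[3, 30])
print(example_labels)
```

With hard assignments like this, points near the boundary between the two distributions are forced into one category, so the result says nothing about how uncertain each assignment is.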

However, I don't know how I would select the outlying points or, for that matter, whether there already is an algorithm that does this.

I am extremely new to machine learning and fairly new to statistics, so any relevant keywords would be appreciated too.

Solution

The principled way to do this is to assign probabilities to the different model types and to the different parameter values within each model type. Look for "Bayesian model estimation"; related keywords are "Bayesian model selection" and "Bayesian model comparison".
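
As a rough illustration of "assigning probabilities to model types", the sketch below applies Bayes' rule to a single chi-squared value. It is a simplification built on assumptions the answer does not state: each model type is represented only by a chi-squared distribution with known degrees of freedom, the prior probabilities default to equal, and parameters within a model type are ignored. The function names are hypothetical.

```python
import math


def chi2_logpdf(x, k):
    """Log density of a chi-squared distribution with k degrees of freedom (x > 0)."""
    half_k = k / 2.0
    return ((half_k - 1.0) * math.log(x) - x / 2.0
            - half_k * math.log(2.0) - math.lgamma(half_k))


def model_posteriors(x, dofs, priors=None):
    """Posterior probability of each model type given one chi-squared value x.

    dofs:   known degrees of freedom, one entry per model type
    priors: prior probability of each model type (assumed equal if omitted)
    """
    if priors is None:
        priors = [1.0 / len(dofs)] * len(dofs)
    # Bayes' rule in log space: log prior + log likelihood for each model.
    log_joint = [math.log(p) + chi2_logpdf(x, k) for p, k in zip(priors, dofs)]
    # Subtract the maximum before exponentiating to avoid underflow.
    shift = max(log_joint)
    unnormalised = [math.exp(v - shift) for v in log_joint]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Made-up example: one small and one large chi-squared value.
post_small = model_posteriors(2.0, dofs=[3, 30])
post_large = model_posteriors(35.0, dofs=[3, 30])
print(post_small, post_large)
```

Unlike a hard assignment, these probabilities show how strongly each value favours one model type, which makes borderline cases visible.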

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow