Question

Suppose that your supervised learning training set is made up of 3 different datasets, merged into one big one. Because of the way each of those was labeled before merging, you might suspect that one of them (maybe the smallest one) is more "important" than the others, meaning that its labels are more reliable. The others might contain more labeling errors.

How could you weight the most reliable data points so that the ML model pays more attention to them, increasing the loss when it makes a mistake on those samples? And is there a simple way to implement this using scikit-learn?

Solution

In scikit-learn, most estimators (SVMs, decision trees, SGD, etc.) accept a sample_weight argument in their fit method. In your case, you could assign each data point a weight based on which of the 3 datasets it comes from.
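A minimal sketch of this idea, using a toy dataset where the first 50 rows stand in for the small, reliable dataset (the data, the weight value of 3.0, and the choice of DecisionTreeClassifier are all illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the merged training set (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Suppose the first 50 rows come from the small, reliable dataset:
# give them 3x the weight of the noisier rows (the factor is arbitrary).
sample_weight = np.where(np.arange(len(X)) < 50, 3.0, 1.0)

clf = DecisionTreeClassifier(random_state=0)
# Misclassifying a highly weighted sample now costs 3x as much,
# so the tree's splits favor getting those points right.
clf.fit(X, y, sample_weight=sample_weight)
print(clf.score(X, y))
```

The same pattern works for any estimator whose fit method accepts sample_weight; you can check this in the estimator's documentation or signature.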

If the algorithm you want to use doesn't provide a sample_weight argument, you can always sample with replacement. Simply put, you give each sample a weight, and then you build your training set by drawing from the samples with replacement, with probability proportional to those weights. This means that instances with higher weights may appear multiple times in the resampled dataset.
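A sketch of this weighted resampling, under the same illustrative assumptions as before (first 50 rows are the reliable ones, weight factor of 3.0 chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))  # toy stand-in for the merged dataset
y = (X[:, 0] > 0).astype(int)

# Reliable rows are 3x more likely to be drawn than noisy ones.
weights = np.where(np.arange(len(X)) < 50, 3.0, 1.0)
p = weights / weights.sum()  # normalize to a probability distribution

# Draw a same-size bootstrap sample, with replacement, biased by p.
idx = rng.choice(len(X), size=len(X), replace=True, p=p)
X_resampled, y_resampled = X[idx], y[idx]

# Any estimator can now be fit on (X_resampled, y_resampled),
# even one whose fit method has no sample_weight argument.
```

In expectation the reliable rows make up 150/400 of the draws here instead of their raw 50/300 share, which has the same effect on the loss as weighting them directly.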

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange