Laplacian smoothing to Biopython
25-09-2019
Problem
I am trying to add Laplacian smoothing support to Biopython's Naive Bayes code for my bioinformatics project.
I have read many documents about the Naive Bayes algorithm and Laplacian smoothing, and I think I understand the basic idea, but I just can't integrate it with that code (in particular, I cannot see where to add the +1 Laplacian pseudocount).
I am not familiar with Python and I am a newbie coder. I would appreciate it if anyone familiar with Biopython could give me some suggestions.
Solution
Try using this definition of the _contents() function instead:
def _contents(items, laplace=False):
    # Count occurrences of each value.
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1.0
    # Normalize counts into probabilities.
    for k in counts:
        if laplace:
            # Add-one (Laplace) smoothing: add 1 to every count and
            # enlarge the denominator by the number of distinct values.
            counts[k] += 1.0
            counts[k] /= len(items) + len(counts)
        else:
            counts[k] /= len(items)
    return counts
Then change the call on line 194 into:
# Estimate P(value|class,dim)
nb.p_conditional[i][j] = _contents(values, True)
Use True to enable the smoothing, and False to disable it.
Here's a comparison of the output with and without the smoothing:
# without
>>> carmodel.p_conditional
[[{'Red': 0.40000000000000002, 'Yellow': 0.59999999999999998},
{'SUV': 0.59999999999999998, 'Sports': 0.40000000000000002},
{'Domestic': 0.59999999999999998, 'Imported': 0.40000000000000002}],
[{'Red': 0.59999999999999998, 'Yellow': 0.40000000000000002},
{'SUV': 0.20000000000000001, 'Sports': 0.80000000000000004},
{'Domestic': 0.40000000000000002, 'Imported': 0.59999999999999998}]]
# with
>>> carmodel.p_conditional
[[{'Red': 0.42857142857142855, 'Yellow': 0.5714285714285714},
{'SUV': 0.5714285714285714, 'Sports': 0.42857142857142855},
{'Domestic': 0.5714285714285714, 'Imported': 0.42857142857142855}],
[{'Red': 0.5714285714285714, 'Yellow': 0.42857142857142855},
{'SUV': 0.2857142857142857, 'Sports': 0.7142857142857143},
{'Domestic': 0.42857142857142855, 'Imported': 0.5714285714285714}]]
Aside from the above, I think there might be a bug in the code:
The code splits the instances according to their class, and then, for each class and each dimension, it counts how many times each of that dimension's values appears.
The problem is that if, for the subset of instances belonging to one class, not all values of a dimension appear in that subset, then when the _contents() function is called it will not see all possible values, and it will return the wrong probabilities...
I think you need to keep track of all the unique values for each dimension (from the entire dataset) and take them into consideration during the counting process.