Adding Laplace smoothing to Biopython

https://stackoverflow.com/questions/4011115

25-09-2019

Question

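I am trying to add Laplace smoothing support to Biopython's Naive Bayes code for my bioinformatics project.

I have read many documents about the Naive Bayes algorithm and Laplace smoothing, and I think I understand the basic idea, but I cannot integrate it with that code (in fact, I cannot see where I should add the 1, the Laplace count).

I am not familiar with Python and I am a novice programmer. I would appreciate it if anyone familiar with Biopython could give me some suggestions.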

Solution

Try using this definition of the _contents() function instead:

def _contents(items, laplace=False):
    # count occurrences of values
    counts = {}
    for item in items:
        counts[item] = counts.get(item,0) + 1.0
    # normalize
    for k in counts:
        if laplace:
            counts[k] += 1.0
            counts[k] /= (len(items)+len(counts))
        else:
            counts[k] /= len(items)
    return counts

Then change the call on line 194 to:

# Estimate P(value|class,dim)
nb.p_conditional[i][j] = _contents(values, True)

Pass True to enable smoothing and False to disable it.

Here is a comparison of the output without and with smoothing:

# without
>>> carmodel.p_conditional
[[{'Red': 0.40000000000000002, 'Yellow': 0.59999999999999998},
  {'SUV': 0.59999999999999998, 'Sports': 0.40000000000000002},
  {'Domestic': 0.59999999999999998, 'Imported': 0.40000000000000002}],
 [{'Red': 0.59999999999999998, 'Yellow': 0.40000000000000002},
  {'SUV': 0.20000000000000001, 'Sports': 0.80000000000000004},
  {'Domestic': 0.40000000000000002, 'Imported': 0.59999999999999998}]]

# with
>>> carmodel.p_conditional
[[{'Red': 0.42857142857142855, 'Yellow': 0.5714285714285714},
  {'SUV': 0.5714285714285714, 'Sports': 0.42857142857142855},
  {'Domestic': 0.5714285714285714, 'Imported': 0.42857142857142855}],
 [{'Red': 0.5714285714285714, 'Yellow': 0.42857142857142855},
  {'SUV': 0.2857142857142857, 'Sports': 0.7142857142857143},
  {'Domestic': 0.42857142857142855, 'Imported': 0.5714285714285714}]]
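
The smoothed figures are what add-one smoothing gives for five instances and two observed values per dimension: (2 + 1) / (5 + 2) = 3/7 and (3 + 1) / (5 + 2) = 4/7. A minimal check, assuming the first dimension of the first class contained 'Red' twice and 'Yellow' three times (counts inferred from the 0.4 / 0.6 output, not taken from the original data set):

```python
def _contents(items, laplace=False):
    # Count occurrences of each value.
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1.0
    # Normalize, optionally with add-one (Laplace) smoothing.
    for k in counts:
        if laplace:
            counts[k] = (counts[k] + 1.0) / (len(items) + len(counts))
        else:
            counts[k] /= len(items)
    return counts


# Assumed sample: 2 x 'Red', 3 x 'Yellow' (inferred, not the original data).
colors = ["Red", "Red", "Yellow", "Yellow", "Yellow"]

plain = _contents(colors)           # relative frequencies 2/5 and 3/5
smoothed = _contents(colors, True)  # smoothed estimates 3/7 and 4/7
print(plain)
print(smoothed)
```

In both cases the probabilities for a dimension still sum to 1, because the denominator grows by one for every observed value that receives an extra count.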

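Besides the above, I think there might be a bug in the code:

The code splits the instances according to their class, and then, for each class and each dimension, it counts how many times each value of that dimension appears.

The problem is that if some value of a dimension never appears in the subset of instances belonging to one class, then _contents() will not see all possible values when it is called, and it will therefore return wrong probabilities: the missing value gets no entry at all, and the smoothing denominator is too small.

I think you need to keep track of all unique values for each dimension (from the entire data set) during the counting process and take them into account.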
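
One way to follow that suggestion is sketched below. It is only an illustration: `contents_with_vocab`, its `all_values` and `alpha` parameters, and the tiny `training_set` are hypothetical names and data invented here, not part of Biopython. The idea is to collect the set of values per dimension from the whole data set first, then give every one of those values a count (possibly zero) before smoothing.

```python
def contents_with_vocab(items, all_values, alpha=1.0):
    """Hypothetical helper (not Biopython API): smoothed probabilities
    over every value in all_values, including values absent from items."""
    # Start every known value at zero so unseen values still get an entry.
    counts = {value: 0.0 for value in all_values}
    for item in items:
        counts[item] = counts.get(item, 0.0) + 1.0
    # Additive smoothing: add alpha to each count, alpha * V to the total.
    total = len(items) + alpha * len(counts)
    return {value: (count + alpha) / total for value, count in counts.items()}


# Made-up data: each row is one instance, each column one dimension.
training_set = [["Red", "Sports"], ["Yellow", "SUV"], ["Red", "SUV"]]

# Unique values per dimension, gathered from the entire data set.
all_values = [set(column) for column in zip(*training_set)]

# Instances of one class in which 'Sports' never occurs.
class_rows = [["Yellow", "SUV"], ["Red", "SUV"]]
body_type = contents_with_vocab([row[1] for row in class_rows], all_values[1])
print(body_type)  # 'Sports' gets a small nonzero probability
```

Here 'SUV' gets (2 + 1) / (2 + 2) = 0.75 and the unseen 'Sports' gets (0 + 1) / (2 + 2) = 0.25, so a test instance containing 'Sports' no longer hits a missing key or a zero probability for this class.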

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow