Python, dictionaries, and chi-square contingency tables

https://stackoverflow.com/questions/3029600

26-09-2019

Question

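This is a problem I have been racking my brain over for a long time, so any help would be great. I have a file containing several lines in the following format (the word, the time at which the word occurred, and the number of documents containing that word at that time). Below is an example of what the input file looks like.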

#inputfile
<word, time, frequency>
apple, 1, 3
banana, 1, 2
apple, 2, 1
banana, 2, 4
orange, 3, 1

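My Python class is below. I use it to create 2-D dictionaries that store the file above, with (word, time) as the key and the frequency as the value: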

class Ddict(dict):
    '''
    2D dictionary class
    '''
    def __init__(self, default=None):
        dict.__init__(self)
        self.default = default

    def __getitem__(self, key):
        # Create a default entry the first time a key is accessed
        if key not in self:
            self[key] = self.default()
        return dict.__getitem__(self, key)


wordtime=Ddict(dict) # Store each inputfile entry with a <word,time> key
timeword=Ddict(dict) # Store each inputfile entry with a <time,word> key

# Loop over every line of the inputfile
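for line in open('inputfile'):
    word, time, count = [s.strip() for s in line.split(',')]
    count = int(count)  # convert so that += adds numbers instead of concatenating strings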

    # If <word,time> already a key, increment count
    try:
        wordtime[word][time]+=count
    # Otherwise, create the key
    except KeyError:
        wordtime[word][time]=count

    # If <time,word> already a key, increment count     
    try:
        timeword[time][word]+=count
    # Otherwise, create the key
    except KeyError:
        timeword[time][word]=count

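The problem I have is calculating certain things while iterating over the items in this 2D dictionary. For each word 'w' at each time 't', calculate:

The number of documents with word 'w' within time 't'. (a)
The number of documents without word 'w' within time 't'. (b)
The number of documents with word 'w' outside time 't'. (c)
The number of documents without word 'w' outside time 't'. (d)

Each of the items above represents one cell of a chi-square contingency table for a given word and time. Can all of these be calculated in a single loop, or do they need to be done one at a time?

Ideally, I would like the output to look like the line below, where a, b, c, d are the items calculated above: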

print("%s, %s, %s, %s" % (a, b, c, d))

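For the input file above, the contingency table for the word 'apple' at time '1' would be (3,2,1,6). I'll explain how each cell is calculated:

'3' documents contain 'apple' within time '1'.
There are '2' documents within time '1' that don't contain 'apple'.
There is '1' document containing 'apple' outside time '1'.
There are 6 documents outside time '1' that don't contain the word 'apple' (1+4+1).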

Solution

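Your 4 numbers for apple/1 add up to 12, which is more than the total number of observations (11)! Only 5 documents outside time '1' don't contain the word 'apple'.

You need to partition the observations into 4 disjoint subsets:
a: apple and 1 => 3
b: not-apple and 1 => 2
c: apple and not-1 => 1
d: not-apple and not-1 => 5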
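
As a quick sanity check of this partition, the four cells can be computed straight from their definitions and compared with the grand total. This is only a minimal sketch on the sample data from the question; the names `observations` and `cells` are hypothetical and are not part of the original code.

```python
# Sample observations from the question: (word, time, frequency).
observations = [
    ("apple", 1, 3),
    ("banana", 1, 2),
    ("apple", 2, 1),
    ("banana", 2, 4),
    ("orange", 3, 1),
]

def cells(observations, word, time):
    """Return (a, b, c, d) for one word/time pair, directly from the definitions."""
    a = sum(n for w, t, n in observations if w == word and t == time)
    b = sum(n for w, t, n in observations if w != word and t == time)
    c = sum(n for w, t, n in observations if w == word and t != time)
    d = sum(n for w, t, n in observations if w != word and t != time)
    return a, b, c, d

a, b, c, d = cells(observations, "apple", 1)
total = sum(n for _, _, n in observations)

# The four subsets are disjoint and cover everything, so they sum to the total.
assert a + b + c + d == total
print("%s, %s, %s, %s" % (a, b, c, d))  # 3, 2, 1, 5
```

This scans the data four times per word/time pair, so it is only meant to make the partition explicit, not to be efficient.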

Here is some code that shows one way of doing this:

from collections import defaultdict

class Crosstab(object):

    def __init__(self):
        self.count = defaultdict(lambda: defaultdict(int))
        self.row_tot = defaultdict(int)
        self.col_tot = defaultdict(int)
        self.grand_tot = 0

    def add(self, r, c, n):
        self.count[r][c] += n
        self.row_tot[r] += n
        self.col_tot[c] += n
        self.grand_tot += n

def load_data(line_iterator, conv_funcs):
    ct = Crosstab()
    for line in line_iterator:
        r, c, n = [func(s) for func, s in zip(conv_funcs, line.split(','))]
        ct.add(r, c, n)
    return ct

def display_all_2x2_tables(crosstab):
    for rx in crosstab.row_tot:
        for cx in crosstab.col_tot:
            a = crosstab.count[rx][cx]
            b = crosstab.col_tot[cx] - a
            c = crosstab.row_tot[rx] - a
            d = crosstab.grand_tot - a - b - c
            assert all(x >= 0 for x in (a, b, c, d))
            print(",".join(str(x) for x in (rx, cx, a, b, c, d)))

if __name__ == "__main__":

    # inputfile
    # <word, time, frequency>
    lines = """\
    apple, 1, 3
    banana, 1, 2
    apple, 2, 1
    banana, 2, 4
    orange, 3, 1""".splitlines()

    ct = load_data(lines, (str.strip, int, int))
    display_all_2x2_tables(ct)

Here is the output:

orange,1,0,5,1,5
orange,2,0,5,1,5
orange,3,1,0,0,10
apple,1,3,2,1,5
apple,2,1,4,3,3
apple,3,0,1,4,6
banana,1,2,3,4,2
banana,2,4,1,2,4
banana,3,0,1,6,4
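
Since each output row is one 2x2 contingency table, a possible next step is to turn (a, b, c, d) into a chi-square statistic. The sketch below uses the standard shortcut formula for a 2x2 table, N(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)), with no continuity correction. The function name `chi_square_2x2` is hypothetical and does not appear in the answer above, and returning None for a table with an empty row or column is an assumption made here.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A zero marginal total leaves the statistic undefined.
        return None
    return n * (a * d - b * c) ** 2 / float(denom)

# The apple/1 table from the output above.
print(chi_square_2x2(3, 2, 1, 5))
```

With counts as small as the ones in this sample, the chi-square approximation is unreliable, so the value is illustrative only.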

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow