Calculating the upper 5% of my discrete distribution

https://stackoverflow.com/questions/19162083

30-06-2022

Question

So, I'm trying to do a rather simple statistical significance calculation.

My program creates datasets as lists of tuples:

example_dataset = [(0, 629), (1, 546), (2, 255), (3, 72), (4, 27), (5, 2), (6, 4), (7, 0), (8, 0), (9, 0), (10, 0), (11, 0), (12, 0), (13, 0), (14, 0), (15, 0)]

Every data set has the same structure: a list of 16 tuples in which the first item is for 0 appearances and the last item is for 15 appearances.

For example, in the data set above, the first tuple means 629 of my DNA sequences appeared 0 times, the second means 546 of my DNA sequences appeared 1 time, etc.

The data sets also share the same total: the number of sequences is always 1535.

5% of the sequences is 76.75. For every data set, I want to know where the upper 5% (appearances-wise) begins. In the data set above, counting down from 15 appearances to 4 appearances I have 33 sequences (4 + 2 + 27), and counting down from 15 appearances to 3 appearances I have 105 sequences (33 + 72).

That means the 76.75-sequence mark falls somewhere between 3 and 4 appearances.

How do I discover this information for each data set and not by manual calculation?

I need a function that takes a list of tuples like the example above as input and outputs 4 (because including 3 appearances already puts me over 76.75 sequences).

another_example_dataset = [(0, 331), (1, 532), (2, 398), (3, 180), (4, 74), (5, 17), (6, 3), (7, 0), (8, 0), (9, 0), (10, 0), (11, 0), (12, 0), (13, 0), (14, 0), (15, 0)]

As another example, for the data set above the output should be 5 (because at 4 appearances we have already crossed 76.75, so the 76.75 mark falls somewhere between 4 and 5).

Not asking anybody to code this for me, but a helpful command or hint would be appreciated. :)

Thanks,

Eyal

Solution

You have to write the calculation yourself. Here is a simple example:

example_dataset = [(0, 629), (1, 546), (2, 255), (3, 72), (4, 27), (5, 2), (6, 4), (7, 0), (8, 0), (9, 0), (10, 0), (11, 0), (12, 0), (13, 0), (14, 0), (15, 0)]
another_example_dataset = [(0, 331), (1, 532), (2, 398), (3, 180), (4, 74), (5, 17), (6, 3), (7, 0), (8, 0), (9, 0), (10, 0), (11, 0), (12, 0), (13, 0), (14, 0), (15, 0)]

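def CalculateIndex(dataset):
    # Walk from 15 appearances down to 0, accumulating the sequence counts.
    sum5 = 0
    for i in range(15, -1, -1):
        sum5 += dataset[i][1]
        # 76.75 is 5% of the 1535 sequences. Once the running total passes
        # it, the upper 5% starts one step above the current index.
        if sum5 > 76.75:
            return i + 1

# print() with a single formatted string works in both Python 2 and Python 3.
print("index for example_dataset is: %d" % CalculateIndex(example_dataset))
print("index for another_example_dataset is: %d" % CalculateIndex(another_example_dataset))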
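
The function above hard-codes both the 16 rows and the 76.75 threshold. As a rough sketch of how it could be generalized, the version below derives the threshold from the data set's own total. The name `upper_tail_cutoff` and the `fraction` parameter are made up for illustration and do not come from the original answer.

```python
example_dataset = [(0, 629), (1, 546), (2, 255), (3, 72), (4, 27), (5, 2), (6, 4), (7, 0),
                   (8, 0), (9, 0), (10, 0), (11, 0), (12, 0), (13, 0), (14, 0), (15, 0)]
another_example_dataset = [(0, 331), (1, 532), (2, 398), (3, 180), (4, 74), (5, 17), (6, 3), (7, 0),
                           (8, 0), (9, 0), (10, 0), (11, 0), (12, 0), (13, 0), (14, 0), (15, 0)]


def upper_tail_cutoff(dataset, fraction=0.05):
    """Hypothetical helper: return the appearance count one above the point
    where the running total, taken from the top, first exceeds
    `fraction` of all sequences. Returns None if that never happens."""
    total = sum(count for _, count in dataset)
    threshold = total * fraction  # roughly 76.75 when total is 1535
    running = 0
    # Sort by appearances, highest first, so the input order does not matter.
    for appearances, count in sorted(dataset, reverse=True):
        running += count
        if running > threshold:
            return appearances + 1
    return None


print(upper_tail_cutoff(example_dataset))
print(upper_tail_cutoff(another_example_dataset))
```

Like the original, this returns one more than the row that crosses the threshold, so if the highest row alone crosses it, the result is a count that has no row in the data set.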

OTHER TIPS

One possible way to do it is to iterate from the highest number of appearances down to the lowest, adding up the sequence counts, and stop when the running total reaches 77. Use that as your 5% point, save it, and move on to the next set of tuples. If the data sets are stored in a dictionary, 2D array, list, etc., just loop over them, append each 5% point to a list, and print the list. It's a somewhat naive way of doing it, but it could solve your problem.
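
A minimal sketch of that idea, assuming the data sets are kept in a plain list and each one is ordered from 0 to 15 appearances as in the question. The names `CUTOFF`, `cutoff_point`, and `all_datasets` are invented for this example. It returns one above the row where 77 is reached, to match the output the question asks for.

```python
CUTOFF = 77  # smallest whole number of sequences above 5% of 1535 (76.75)


def cutoff_point(dataset, cutoff=CUTOFF):
    """Hypothetical helper: walk from the most appearances to the fewest and
    stop once the running total of sequences reaches `cutoff`."""
    running = 0
    # reversed() assumes the tuples are ordered by ascending appearances.
    for appearances, count in reversed(dataset):
        running += count
        if running >= cutoff:
            return appearances + 1
    return None  # the total never reached the cutoff


all_datasets = [
    [(0, 629), (1, 546), (2, 255), (3, 72), (4, 27), (5, 2), (6, 4), (7, 0),
     (8, 0), (9, 0), (10, 0), (11, 0), (12, 0), (13, 0), (14, 0), (15, 0)],
    [(0, 331), (1, 532), (2, 398), (3, 180), (4, 74), (5, 17), (6, 3), (7, 0),
     (8, 0), (9, 0), (10, 0), (11, 0), (12, 0), (13, 0), (14, 0), (15, 0)],
]

# Save the 5% point of every data set to a list, then print the list.
points = [cutoff_point(dataset) for dataset in all_datasets]
print(points)
```

Because the counts are whole numbers, stopping at 77 is the same as stopping once the total exceeds 76.75. The fixed cutoff only holds while every data set totals 1535 sequences.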

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow