So, I'm trying to do a rather simple statistical significance calculation.
My program creates datasets as lists of tuples:
example_dataset = [(0, 629), (1, 546), (2, 255), (3, 72), (4, 27), (5, 2), (6, 4), (7, 0), (8, 0), (9, 0), (10, 0), (11, 0), (12, 0), (13, 0), (14, 0), (15, 0)]
Every dataset has the same structure: a list of 16 tuples, where the first tuple is for 0 appearances and the last is for 15 appearances.
In the example above, the first tuple means that 629 of my DNA sequences appeared 0 times, the second that 546 of them appeared 1 time, and so on.
The total number of sequences is also the same in every dataset: it is always 1535.
5% of the sequences is 76.75. I want to know where the upper 5% (appearances-wise) is situated for every dataset. In the dataset above, from 15 appearances down to 4 appearances I have 33 sequences (4 + 2 + 27), and from 15 appearances down to 3 appearances I have 105 sequences (33 + 72).
That means the 76.75-sequence mark falls somewhere between 3 and 4 appearances.
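To make the manual calculation concrete, this is roughly what I'm doing by hand, written as a small Python sketch. The variable names (`tail_totals`, `tail_by_appearances`) are just ones I made up for illustration.

```python
from itertools import accumulate

example_dataset = [(0, 629), (1, 546), (2, 255), (3, 72), (4, 27), (5, 2),
                   (6, 4), (7, 0), (8, 0), (9, 0), (10, 0), (11, 0),
                   (12, 0), (13, 0), (14, 0), (15, 0)]

# Walk from 15 appearances down to 0, keeping a running total of sequences.
from_top = list(reversed(example_dataset))
tail_totals = accumulate(count for _, count in from_top)

# Map each appearance value to the number of sequences at or above it.
tail_by_appearances = {
    appearances: tail
    for (appearances, _), tail in zip(from_top, tail_totals)
}

threshold = 0.05 * sum(count for _, count in example_dataset)  # 5% of 1535

print(tail_by_appearances[4])  # 33, still under the threshold
print(tail_by_appearances[3])  # 105, already over the threshold
```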
How do I find this for each dataset without calculating it by hand?
I somehow need to create a function that takes a list of tuples like the example above as input and outputs 4 (because including 3 appearances already puts me over 76.75 sequences).
another_example_dataset = [(0, 331), (1, 532), (2, 398), (3, 180), (4, 74), (5, 17), (6, 3), (7, 0), (8, 0), (9, 0), (10, 0), (11, 0), (12, 0), (13, 0), (14, 0), (15, 0)]
As another example, for the dataset above the output should be 5 (because at 4 appearances we already cross 76.75, so the cutoff is somewhere between 4 and 5).
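For reference, here is a naive sketch of the behaviour I'm after, checked against the two examples. The function name `upper_tail_cutoff` and the `fraction` parameter are hypothetical, and I'm assuming "crossing" means the running total is strictly greater than 5% of the total.

```python
def upper_tail_cutoff(dataset, fraction=0.05):
    """Return the lowest appearance value whose upper tail is still within
    `fraction` of all sequences, or None if even the top bin exceeds it.

    Hypothetical helper: assumes `dataset` is a list of
    (appearances, count) tuples.
    """
    threshold = fraction * sum(count for _, count in dataset)
    running = 0
    cutoff = None
    # Go from the highest appearance value down to the lowest.
    for appearances, count in sorted(dataset, reverse=True):
        running += count
        if running > threshold:
            # Including this bin crosses the threshold, so stop at the
            # previous (higher) appearance value.
            return cutoff
        cutoff = appearances
    return cutoff


example_dataset = [(0, 629), (1, 546), (2, 255), (3, 72), (4, 27), (5, 2),
                   (6, 4), (7, 0), (8, 0), (9, 0), (10, 0), (11, 0),
                   (12, 0), (13, 0), (14, 0), (15, 0)]
another_example_dataset = [(0, 331), (1, 532), (2, 398), (3, 180), (4, 74),
                           (5, 17), (6, 3), (7, 0), (8, 0), (9, 0), (10, 0),
                           (11, 0), (12, 0), (13, 0), (14, 0), (15, 0)]

print(upper_tail_cutoff(example_dataset))          # 4
print(upper_tail_cutoff(another_example_dataset))  # 5
```

This loops over every tuple explicitly, so I suspect there is a shorter or more standard way to do it.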
Not asking anybody to code this for me, but a helpful command or hint would be appreciated. :)
Thanks,
Eyal