data processing, correlation calculation

https://datascience.stackexchange.com/questions/1225

16-10-2019

Question

I have product purchase count data which looks like this:

user item1 item2
   a     2     4
   b     1     3
   c     5     6
   ...   ...   ...

These data are imported into Python using numpy.genfromtxt. Now I want to process them to get the correlation between the item1 purchase amount and the item2 purchase amount. Basically, for each value x of item1, I want to find all the users who bought item1 in quantity x, then average item2 over those same users. What is the best way to do this? I could do it with for loops, but I thought there might be something more efficient. Thanks!

Solution

Pandas is the best thing since sliced bread (for data science, at least).

An example:

In [21]: import pandas as pd
In [22]: df = pd.read_csv('yourexample.csv')

In [23]: df
Out[23]:
   user   item1   item2
0     a        2      4
1     b        1      3
2     c        5      6

In [24]: df.columns
Out[24]: Index([u'user ', u'item1 ', u'item2'], dtype='object')

In [25]: df.corr()
Out[25]:
          item1      item2
item1   1.000000  0.995871
item2   0.995871  1.000000

In [26]: df.cov()
Out[26]:
          item1      item2
item1   4.333333  3.166667
item2   3.166667  2.333333

Bingo!
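
The question also asks for the average of item2 among the users who bought each quantity of item1. A minimal sketch of that with groupby, assuming the columns are named item1 and item2 with no stray whitespace. The DataFrame is built inline rather than read from a file, and user "d" is a hypothetical extra row added only so that one item1 value has more than one buyer:

```python
import pandas as pd

# Toy data matching the question's table, plus one hypothetical
# user "d" so that item1 == 2 has two buyers.
df = pd.DataFrame({
    "user": ["a", "b", "c", "d"],
    "item1": [2, 1, 5, 2],
    "item2": [4, 3, 6, 8],
})

# For each distinct item1 quantity, average item2 over the users
# who bought item1 in exactly that quantity.
mean_item2_by_item1 = df.groupby("item1")["item2"].mean()
print(mean_item2_by_item1)
```

The result is a Series indexed by the item1 quantity, so no explicit for loop is needed.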

OTHER TIPS

Use one of Pandas' built-in functions: http://pandas.pydata.org/pandas-docs/stable/computation.html#correlation
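
If only the single correlation coefficient between the two columns is needed, rather than the full matrix, something like the following should work. This is a sketch using the three rows from the question, built inline; the variable names are only illustrative:

```python
import numpy as np
import pandas as pd

# The three rows shown in the question.
df = pd.DataFrame({"item1": [2, 1, 5], "item2": [4, 3, 6]})

# Pearson correlation between two columns as a single number.
r_pandas = df["item1"].corr(df["item2"])

# The same value with plain numpy: corrcoef returns a 2x2 matrix,
# and the off-diagonal entry is the correlation between the inputs.
r_numpy = np.corrcoef(df["item1"], df["item2"])[0, 1]

print(r_pandas, r_numpy)
```

The numpy route may be convenient if the data stay in the array returned by numpy.genfromtxt and are never converted to a DataFrame.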

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange