data processing, correlation calculation
16-10-2019
Question
I have product purchase count data which looks like this:
user item1 item2
a 2 4
b 1 3
c 5 6
... ... ...
These data are imported into Python using numpy.genfromtxt. Now I want to process them to get the correlation between the item1 purchase amount and the item2 purchase amount. Basically, for each value x of item1, I want to find all the users who bought item1 in quantity x, then average item2 over those same users. What is the best way to do this? I can do it with for loops, but I thought there might be something more efficient. Thanks!
Solution
Pandas is the best thing since sliced bread (for data science, at least).
An example:
import pandas as pd
In [22]: df = pd.read_csv('yourexample.csv')
In [23]: df
Out[23]:
user item1 item2
0 a 2 4
1 b 1 3
2 c 5 6
In [24]: df.columns
Out[24]: Index([u'user ', u'item1 ', u'item2'], dtype='object')
In [25]: df.corr()
Out[25]:
item1 item2
item1 1.000000 0.995871
item2 0.995871 1.000000
In [26]: df.cov()
Out[26]:
item1 item2
item1 4.333333 3.166667
item2 3.166667 2.333333
Bingo!
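The correlation matrix summarises the overall linear relationship, but the question also asks for the average item2 for each distinct item1 quantity. One way to get that in pandas is a groupby. The following is only a minimal sketch: the first three rows are the ones from the question, while users d and e are hypothetical rows added so that some item1 values repeat. The header names are stripped first because Out[24] shows trailing spaces in them.

```python
import io

import pandas as pd

# Rows a, b, c come from the question; rows d and e are hypothetical,
# added only so that some item1 quantities occur more than once.
raw = io.StringIO(
    "user,item1,item2\n"
    "a,2,4\n"
    "b,1,3\n"
    "c,5,6\n"
    "d,2,6\n"
    "e,1,5\n"
)
df = pd.read_csv(raw)

# Remove stray whitespace from the header names before using them as keys.
df.columns = df.columns.str.strip()

# For each distinct item1 quantity, average item2 over the users
# who bought item1 in exactly that quantity.
mean_item2 = df.groupby("item1")["item2"].mean()
print(mean_item2)
```

The result is a Series indexed by the item1 quantity, so mean_item2.loc[2] gives the average item2 among users who bought two of item1. No explicit for loop is needed.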
OTHER TIPS
Use one of Pandas' built-in functions: http://pandas.pydata.org/pandas-docs/stable/computation.html#correlation
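As a rough illustration of those built-in functions, the sketch below computes the correlation and covariance between the two item columns directly, using the three rows from the question. Selecting only the numeric columns before calling corr() is a precaution on my part: recent pandas versions may refuse to correlate a frame that still contains a text column such as user.

```python
import pandas as pd

# The three example rows from the question.
df = pd.DataFrame(
    {"user": ["a", "b", "c"], "item1": [2, 1, 5], "item2": [4, 3, 6]}
)

# Pearson correlation and sample covariance between the two columns.
pearson = df["item1"].corr(df["item2"])
covariance = df["item1"].cov(df["item2"])

# Full correlation matrix, restricted to the numeric columns.
corr_matrix = df[["item1", "item2"]].corr()

print(pearson)
print(covariance)
print(corr_matrix)
```

Series.corr and Series.cov return a single number, which is convenient when only one pair of columns matters; DataFrame.corr returns the whole matrix.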