This might work (using numpy.histogram2d):
import numpy as np
# np.random.random_integers is deprecated; randint's upper bound is exclusive
col1 = np.random.randint(1, 6, 100)
col2 = np.random.randint(1, 11, 100)
bin1 = np.arange(0.5, 6.5, 1)
bin2 = np.arange(0.5, 11.5, 1)
mat = np.histogram2d(col1, col2, bins=[bin1, bin2])[0]
print(mat.shape)
print(mat)
which yields
(5, 10)
array([[ 4., 2., 0., 6., 2., 2., 1., 2., 1., 2.],
[ 3., 3., 3., 0., 3., 0., 1., 4., 1., 0.],
[ 4., 2., 1., 1., 3., 2., 5., 2., 2., 2.],
[ 1., 1., 3., 2., 3., 1., 4., 4., 0., 0.],
[ 0., 2., 1., 4., 3., 1., 3., 2., 0., 1.]])
I haven't tested it on lists longer than a couple of thousand values, but it should scale fine to a few million.
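If you also want the bin labels attached to the counts, the matrix from histogram2d can be wrapped in a pandas DataFrame. This is a sketch under the same setup as above (integer values 1-5 and 1-10, unit-width bins centered on the integers); the labels range(1, 6) and range(1, 11) are my assumption about what the bin centers mean:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
col1 = rng.integers(1, 6, 100)   # values 1..5
col2 = rng.integers(1, 11, 100)  # values 1..10

# bin edges at 0.5, 1.5, ..., so each integer gets its own bin
bin1 = np.arange(0.5, 6.5, 1)
bin2 = np.arange(0.5, 11.5, 1)
mat = np.histogram2d(col1, col2, bins=[bin1, bin2])[0]

# label rows/columns with the integer bin centers (assumed labels)
table = pd.DataFrame(mat.astype(int),
                     index=range(1, 6),
                     columns=range(1, 11))
```

Every input value falls into exactly one bin, so the table's entries sum to the number of observations.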
Edit:
Hi @nahsivar. I'm not too familiar with pandas (though I probably should be), so I spent a few minutes playing around. Here are a few ways to get what you want (I think):
# instantiate dataframe
import pandas as pd
from random import choice
x_choice = 'ABCDE'
y_choice = 'ABCDEFGHIJ'
# note: x[i] = choice(...) on an empty list raises IndexError, so append instead
x = [choice(x_choice) for _ in range(100)]
y = [choice(y_choice) for _ in range(100)]
df = pd.DataFrame(data={'col1': x, 'col2': y})
# 1
df.pivot_table(index='col1', columns='col2', aggfunc='size')
# (rows=/cols= were renamed index=/columns= in later pandas versions;
#  use fill_value=0 to replace the NaNs with 0)
# Output:
col2    A    B    C    D    E    F    G    H    I    J
col1
A     NaN    1    3    1    2    2    2    1    4    2
B       1    1  NaN    3    5    1    2    3    1  NaN
C       4    1    2  NaN  NaN    4    3    2    1    2
D       2    2    2    1    1    3    3    4    4    2
E       1    1    1  NaN    4    2    6    3    2    2
# 2
df.groupby('col2')['col1'].value_counts().unstack(level=0)
# Output:
col2    A    B    C    D    E    F    G    H    I    J
A     NaN    1    3    1    2    2    2    1    4    2
B       1    1  NaN    3    5    1    2    3    1  NaN
C       4    1    2  NaN  NaN    4    3    2    1    2
D       2    2    2    1    1    3    3    4    4    2
E       1    1    1  NaN    4    2    6    3    2    2
# 3
pd.crosstab(df.col1, df.col2)
# Output:
col2  A  B  C  D  E  F  G  H  I  J
col1
A     0  1  3  1  2  2  2  1  4  2
B     1  1  0  3  5  1  2  3  1  0
C     4  1  2  0  0  4  3  2  1  2
D     2  2  2  1  1  3  3  4  4  2
E     1  1  1  0  4  2  6  3  2  2
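crosstab can also turn the counts into proportions directly via its normalize argument (available in pandas 0.18.1 and later). A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'col1': list('AABBC'), 'col2': list('XYXYX')})

# raw co-occurrence counts, zeros instead of NaNs by default
counts = pd.crosstab(df.col1, df.col2)

# normalize='index' divides each row by its total, so rows sum to 1
props = pd.crosstab(df.col1, df.col2, normalize='index')
```

normalize also accepts 'columns' (column-wise proportions) and 'all' (grand-total proportions).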