Flatten a Series in pandas, i.e. a series whose elements are lists

https://stackoverflow.com//questions/24027723

21-12-2019

Question

I have a series of the form:

import pandas as pd

s = pd.Series([['a','a','b'],['b','b','c','d'],[],['a','b','e']])

which looks like

0       [a, a, b]
1    [b, b, c, d]
2              []
3       [a, b, e]
dtype: object

I would like to count how many elements I have in total. My naive attempts, such as

s.values.hist()

s.values.flatten()

didn't work. What am I doing wrong?

Solution

s.map(len).sum()

does the trick. s.map(len) applies len() to each element and returns a Series of the lengths; calling sum() on that Series then gives the total number of elements.
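
As a minimal sketch of the above, using the Series from the question (the names lengths, total and flat are only illustrative). It also shows one way to get a genuinely flat list: s.values is already a 1-D object array whose items are lists, so flatten() has nothing to unnest.

```python
from itertools import chain

import pandas as pd

s = pd.Series([['a', 'a', 'b'], ['b', 'b', 'c', 'd'], [], ['a', 'b', 'e']])

# Length of each list, one entry per row of the original Series.
lengths = s.map(len)

# Total number of elements across all lists.
total = int(lengths.sum())

# Chain the lists together to obtain a flat Python list of all elements.
flat = list(chain.from_iterable(s))

print(lengths.tolist())  # [3, 4, 0, 3]
print(total)             # 10
print(flat)
```

If only the total is needed, the map/sum version is enough; the flat list is useful when you want to work with the individual elements afterwards.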

OTHER TIPS

Personally, I prefer to unpack arrays like these into a DataFrame, with one column per item position. That gives you much more functionality. So, here's my alternative approach:

>>> import pandas as pd
>>> raw = [['a', 'a', 'b'], ['b', 'b', 'c', 'd'], [], ['a', 'b', 'e']]
>>> df = pd.DataFrame(raw)
>>> df
Out[217]: 
      0     1     2     3
0     a     a     b  None
1     b     b     c     d
2  None  None  None  None
3     a     b     e  None

Now, see how many values we have in each row

>>> df.count(axis=1)
Out[226]: 
0    3
1    4
2    0
3    3

Applying sum() to this result gives you the total you asked for.
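
Put together, that might look like the sketch below. The names per_row and total are made up for illustration, and it assumes pandas pads the shorter rows with None as in the output shown earlier.

```python
import pandas as pd

raw = [['a', 'a', 'b'], ['b', 'b', 'c', 'd'], [], ['a', 'b', 'e']]

# Ragged rows are padded with None, giving one column per item position.
df = pd.DataFrame(raw)

# count() ignores missing values, so this is the number of real items per row.
per_row = df.count(axis=1)

# Summing the per-row counts gives the total number of elements.
total = int(per_row.sum())

print(per_row.tolist())
print(total)
```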

Second, you mentioned in a comment that you also want the distribution. There may be a cleaner approach, but I still prefer the following to the hint you were given in the comments:

>>> foo = [col.value_counts() for _, col in df.items()]  # items() replaces iteritems(), which was removed in pandas 2.0
>>> foo
Out[246]: 
[a    2
 b    1
 dtype: int64, b    2
 a    1
 dtype: int64, b    1
 c    1
 e    1
 dtype: int64, d    1
 dtype: int64]

foo now contains the distribution for every column. Each column still means "the x-th value", so column 0 holds the distribution of all the first values in your arrays.

Next step, "sum them up".

>>> df2 = pd.DataFrame(foo)
>>> df2
Out[266]: 
    a   b   c   d   e
0   2   1 NaN NaN NaN
1   1   2 NaN NaN NaN
2 NaN   1   1 NaN   1
3 NaN NaN NaN   1 NaN
>>> df2.sum(axis=0)
Out[264]: 
a    3
b    4
c    1
d    1
e    1
dtype: float64
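
The two steps above (a value_counts per column, then a sum) can also be sketched in one pass with DataFrame.apply. This is just one possible variant, not the only way to do it; per_position and totals are illustrative names.

```python
import pandas as pd

raw = [['a', 'a', 'b'], ['b', 'b', 'c', 'd'], [], ['a', 'b', 'e']]
df = pd.DataFrame(raw)

# One value_counts per column; rows are the distinct values, columns are the
# item positions, and NaN marks a value that never occurs at that position.
per_position = df.apply(pd.Series.value_counts)

# Sum across positions (NaN is skipped) to get the overall distribution.
totals = per_position.sum(axis=1).astype(int)

print(totals.sort_index())
```

Casting back to int avoids the float64 result that the NaN padding would otherwise produce.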

Note that for simple problems like these, the difference between a Series of lists and a DataFrame with one column per item is small, but once you want to do real data work, the latter gives you far more functionality. It can also be more efficient, since you can use pandas' built-in methods.
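
As one illustration of leaning on built-in methods, newer pandas versions (0.25 and later, which postdates this question) also let you stay with the Series of lists and flatten it with Series.explode. This is a sketch under that version assumption; counts and expected are illustrative names.

```python
from collections import Counter
from itertools import chain

import pandas as pd

s = pd.Series([['a', 'a', 'b'], ['b', 'b', 'c', 'd'], [], ['a', 'b', 'e']])

# explode() gives every list element its own row; the empty list becomes a
# single NaN row, which value_counts() drops by default.
counts = s.explode().value_counts().sort_index()

# Cross-check of the same distribution in plain Python.
expected = Counter(chain.from_iterable(s))

print(counts)
print(dict(expected))
```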

Licensed under: CC-BY-SA with attribution
