Question

I'm getting started with pandas, and have one column of data in a larger DataFrame such as

0                  one two
1            two seven six
2           three one five
3    seven five five eight
4                 six four
5                    three
dtype: object

and what I'd like to do is split the sequences of words into their component parts, then get a unique set or counts for the words. I can do the split just fine

numbers.str.split(' ')

0                    [one, two]
1             [two, seven, six]
2            [three, one, five]
3    [seven, five, five, eight]
4                   [six, four]
5                       [three]
dtype: object

However, I'm not sure where to go from here. Again, I'd like to have output such as

['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight']

or the same in a dictionary with counts, or in a Series/DataFrame equivalent of one of these two.

The best I've been able to do so far is to use apply() in combination with a set to get the unique words. pandas seems like a very elegant package from what I've seen so far, so this is probably within easy reach for someone who knows it better than I do.

Thanks in advance!


Solution

If I understand you correctly, I think you could do it as follows using pandas. I'll start with the series before you split the strings:

print(s)

0                  one two
1            two seven six
2           three one five
3    seven five five eight
4                 six four
5                    three

stacked = pd.DataFrame(s.str.split().tolist()).stack()
print(stacked)

0  0      one
   1      two
1  0      two
   1    seven
   2      six
2  0    three
   1      one
   2     five
3  0    seven
   1     five
   2     five
   3    eight
4  0      six
   1     four
5  0    three

Now just compute the value counts of the Series:

print(stacked.value_counts())

five     3
one      2
three    2
six      2
two      2
seven    2
eight    1
four     1
dtype: int64
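On more recent pandas versions the same idea can probably be written without the intermediate DataFrame. The following is only a sketch, assuming a pandas version that provides Series.explode (0.25 or later); the variable names words, counts, unique_words and counts_dict are my own and not from the question.

```python
import pandas as pd

s = pd.Series([
    "one two",
    "two seven six",
    "three one five",
    "seven five five eight",
    "six four",
    "three",
])

# One row per word: split each string into a list, then explode the lists.
words = s.str.split().explode()

# Counts per word, as a Series indexed by word.
counts = words.value_counts()

# Unique words in order of first appearance, as a plain list.
unique_words = words.unique().tolist()

# The same counts as a plain dictionary.
counts_dict = counts.to_dict()

print(counts)
print(unique_words)
```

This covers the three outputs asked for in the question: a list of unique words, a dictionary of counts, and a Series of counts.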

Other tips

This code makes a dictionary of all of your words and their counts.

x = ['one two', 'two seven six', 'three one five', 'seven five five eight', 'six four', 'three']

# flatten all strings into a single list of words
x_list = [j for i in x for j in i.split()]
print(x_list)

# ['one', 'two', 'two', 'seven', 'six', 'three', 'one', 'five', 'seven', 'five', 'five', 'eight', 'six', 'four', 'three']

d = {}

# initialize every key with a count of zero
for e in set(x_list):
    d[e] = 0

# store counts in dict
for e in x_list:
    d[e] += 1

print(d)

The result is a dictionary with counts:

{'seven': 2, 'six': 2, 'three': 2, 'two': 2, 'four': 1, 'five': 3, 'eight': 1, 'one': 2}
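If you don't need pandas for this step, the standard library can do the counting for you. This is a minimal sketch using collections.Counter instead of the hand-written loops above; the names counts and unique_in_order are mine, and the ordered unique list relies on dictionaries preserving insertion order (Python 3.7+).

```python
from collections import Counter

x = ['one two', 'two seven six', 'three one five',
     'seven five five eight', 'six four', 'three']

# Flatten all strings into a single list of words.
x_list = [word for line in x for word in line.split()]

# Counter is a dict subclass that maps each word to its count.
counts = Counter(x_list)

# Unique words in order of first appearance.
unique_in_order = list(dict.fromkeys(x_list))

print(dict(counts))
print(unique_in_order)
```

Counter also has a most_common() method if you want the words sorted by frequency.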

I was recently working on a similar task where I wanted to count space-separated strings. Applied to your data, it would look like this:

import pandas as pd
data = [['one two'],['two seven six'],['three one five'],['seven five five eight'],['six four'],['three']]
numbers = pd.DataFrame(data)

# collect the unique words across all rows
uniq_groups = set(x for l in numbers[0].str.split(' ') for x in l)
# {'eight', 'five', 'four', 'one', 'seven', 'six', 'three', 'two'}

# add a DataFrame column with the per-row count of each word
for gr in uniq_groups:
    numbers[gr] = numbers[0].map(lambda x: x.split(' ').count(gr))

# add a 'Total' row that sums every count column
numbers.loc['Total'] = numbers.sum(axis=0, numeric_only=True)
# display floats without decimals
pd.options.display.float_format = '{:,.0f}'.format

resulting in:

[Image: the resulting DataFrame, with one count column per word and a Total row]
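A per-row table of word counts like this can probably also be built without the explicit loop. What follows is a sketch rather than a drop-in replacement, assuming a pandas version with Series.explode (0.25 or later); the names per_row and totals are hypothetical, and it keeps the totals in a separate Series instead of appending a row.

```python
import pandas as pd

s = pd.Series([
    "one two",
    "two seven six",
    "three one five",
    "seven five five eight",
    "six four",
    "three",
])

# One row per word; the index still points back at the original row.
exploded = s.str.split().explode()

# Count each word within its original row, then pivot the words into columns.
per_row = exploded.groupby(level=0).value_counts().unstack(fill_value=0)

# Column sums give the overall count of each word.
totals = per_row.sum()

print(per_row)
print(totals)
```

Keeping the totals separate means the count columns stay integers, so no display format tweak is needed.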

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow