Question

I have a Pandas dataframe, named "impression_data," which includes a column called "site.id," like this:

>>> impression_data['site.id']
0      62
1     189
2     191
3      62
...

Each item in this column has the datatype numpy.int64, like this:

>>> for i in impression_data['site.id']:
    print type(i)

<type 'numpy.int64'>
<type 'numpy.int64'>
<type 'numpy.int64'>
...

And as expected, membership testing works well so long as I test integers:

>>> 62 in impression_data['site.id']
True

But here's the unexpected result: I was under the impression that a column of np.int64's ought not to include any decimal values whatsoever. Apparently I'm wrong. What's going on here?

>>> 62.5 in impression_data['site.id']
True

Edit 1: All values in the column ought to be integers by construction. For completeness, I have also performed the following casting operation and encountered no errors:

impression_data['site.id'] = impression_data['site.id'].astype('int')

As per @BremBam's suggestions in the comments, I tried

impression_data['site.id'].map(type).unique()

which produces

[<type 'numpy.int64'>]

The real data file I'm working with is here:

https://dl.dropboxusercontent.com/u/28347262/SE%20Pandas%20Int64%20Membership%20Testing/cm_impression.csv

and a minimal example is here:

https://dl.dropboxusercontent.com/u/28347262/SE%20Pandas%20Int64%20Membership%20Testing/ExampleCode.py


Solution

This is a bug in pandas. The value is cast to the type of the index before the containment test is done, so 62.5 is converted to 62. (Note that the "in" operator on a Series checks whether the value is in the index, not in the values.)

I believe you can get what you want by doing 62.5 in impression_data['site.id'].values, which tests the underlying NumPy array of values rather than the index.
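As a rough sketch of testing against the values rather than the index: the small DataFrame below is a hypothetical stand-in for the real data, and the names site_ids, in_values_int, in_values_float and any_equal are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real data in cm_impression.csv
impression_data = pd.DataFrame({'site.id': [62, 189, 191, 62]})
site_ids = impression_data['site.id']

# Test against the underlying NumPy array, not the index
in_values_int = 62 in site_ids.values
in_values_float = 62.5 in site_ids.values

# Equivalent elementwise comparison that stays inside pandas
any_equal = (site_ids == 62.5).any()

print(in_values_int, in_values_float, any_equal)
```

Both approaches compare against the stored values, so a float such as 62.5 is never matched against an integer index label.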

OTHER TIPS

First, membership tests on a Series check the index, not the values:

>>> s = pd.Series([10,20,30])
>>> s
0    10
1    20
2    30
dtype: int64
>>> 0 in s
True
>>> 10 in s
False

But you're right:

>>> 1.5 in s
True

After some work, this seems to be because of __contains__ in Int64HashTable:

cdef class Int64HashTable: #(HashTable):
    [...]
    def __contains__(self, object key):
        cdef khiter_t k
        k = kh_get_int64(self.table, key)
        return k != self.table.n_buckets

key comes in as a float, but we have

inline khint_t kh_get_int64(kh_int64_t*, int64_t)

and so I think it's coerced to an integer before the comparison is made.
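To make the suspected coercion concrete, here is a pure-Python analogy. It is not pandas' actual code: TruncatingIntTable is a hypothetical toy class that assumes the key is truncated to an integer before the lookup, as the signature above suggests.

```python
class TruncatingIntTable:
    """Toy stand-in for an integer-keyed hash table (hypothetical, not pandas code)."""

    def __init__(self, keys):
        self._keys = set(int(k) for k in keys)

    def __contains__(self, key):
        # Assumption: mimic a C-style float -> int64 conversion, which truncates
        return int(key) in self._keys


table = TruncatingIntTable([0, 1, 2])  # same labels as the index of s above
print(1.5 in table)  # True, because 1.5 is truncated to 1
print(3.5 in table)  # False, because 3 is not a key
```

Under that assumption, any float whose integer part is an existing index label passes the membership test, which matches the 1.5 in s result above.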

Licensed under: CC-BY-SA with attribution