I have to filter a pandas DataFrame by matching a complex regular expression against a text index.

The DataFrame is multi-indexed and contains more than 2 million records.

My approach is:

identifiers = self._data.index.get_level_values('identifier')
filt = ...  # an array of np.bool_ with the same length as my data

pattern = ... # a complex regular expression, as a string
filt = filt & np.array(identifiers.str.contains(pat=pattern, case=False, regex=True), dtype=bool)

... # other filterings

Unfortunately, the line starting with filt = filt & is very slow.

I was wondering whether you have some ideas to make it faster. I suppose it is because of identifiers.str.contains.

Thanks a lot!


Edit:

Thanks @emre.

I am not allowed to share the data, but the code below demonstrates the problem:

  • Step 0: 0:00:00.013527
  • Step 1: 0:00:00.010127
  • Step 2: 0:00:04.468114
  • Step 3: 0:00:02.109594
  • Step 4: 0:00:00.027437

Actually, my feeling is that the regular expression is applied to every single value of identifiers, whereas I would like the filter to be applied only to the possible values of the index (the same values are reused many times).
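To make the point concrete, here is a minimal sketch of what I have in mind, reusing identifiers, pattern and filt from the snippet at the top of the question (these names are only illustrative; this is not the reproduction code below):

import pandas as pd

# Sketch only: evaluate the regex once per unique identifier instead of once per row,
# then broadcast the per-unique result back to all rows.
unique_ids = pd.Series(identifiers.unique())
matching = unique_ids[unique_ids.str.contains(pat=pattern, case=False, regex=True)]
filt = filt & identifiers.isin(matching)  # one boolean per row, reusing the per-unique matches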

import pandas as pd
import numpy as np
import datetime

N = 2000000
N_DVC = 10000

def getData():
    identifiers = np.random.choice(np.array([
        "need", "need: foo", "need: bar", "need: foo: bar", "foo: need", "bar: need",
        "not: need", "not: need: foo", "not: need: bar", "not: need: foo: bar", "foo: need: not", "bar: need: not",
        "need ign", "need: foo ign", "need: bar ign", "need: foo: bar ign", "foo: need ign", "bar: need ign",
        "ign need", "need: ign foo", "need: ign bar", "need: foo: ign bar",
    ]), N)
    devices = np.random.choice(np.arange(N_DVC), N)
    timestamps = np.random.choice(pd.date_range('1/1/2016 00:00:00', periods=60*60*24, freq='s'), N)
    x = np.random.rand(N)
    y = np.random.rand(N)
    data = pd.DataFrame({'identifier': identifiers, 'device': devices, 'timestamp': timestamps, 'x': x, 'y': y})
    data.set_index(['device', 'identifier', 'timestamp'], drop=True, inplace=True)
    return data

def filterData(data):
    # I know those regular expressions are not perfect for the example,
    # but they mimic the real expressions I have
    rexpPlus = r'^(?:[^\s]+:\s)*need(?:(?::\s[^\s]+)*:\s[^\s]+)?$'
    rexpMinus = r'(?::\s)(?:(?:not)|(?:ign))(?::\s)'

    tic = datetime.datetime.now()
    identifiers = data.index.get_level_values('identifier')
    print("- Step 0: %s" % str(datetime.datetime.now() - tic))

    tic = datetime.datetime.now()
    filt = np.repeat(np.False_, data.shape[0])
    print("- Step 1: %s" % str(datetime.datetime.now() - tic))

    tic = datetime.datetime.now()
    filt = filt | np.array(identifiers.str.contains(pat=rexpPlus, case=False, regex=True), dtype=bool)
    print("- Step 2: %s" % str(datetime.datetime.now() - tic))

    tic = datetime.datetime.now()
    filt = filt & (~np.array(identifiers.str.contains(pat=rexpMinus, case=False, regex=True), dtype=bool))
    print("- Step 3: %s" % str(datetime.datetime.now() - tic))

    tic = datetime.datetime.now()
    data = data.loc[filt, :]
    print("- Step 4: %s" % str(datetime.datetime.now() - tic))

    return data

if __name__ == "__main__":
    filterData(getData())

Solution

I found a solution: instead of filtering all the values, I compute a hash table (actually a data frame) of the possible values and apply the filter to those values only:

  • Step 0: 0:00:00.014251
  • Step 1: 0:00:00.010097
  • Init fast filter: 0:00:00.353645
  • Step 2: 0:00:00.027462
  • Step 3: 0:00:00.002550
  • Step 4: 0:00:00.027452
  • Global: 0:00:00.435839

It runs much faster because there are lots of duplicated values.
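For reference, the core of the idea can be sketched with pandas built-ins alone; the helper below is hypothetical (not part of my code) and assumes the 'identifier' level holds strings:

import pandas as pd

# Hypothetical sketch: run str.contains once per unique value (the "hash table"
# of possible values), then map the per-unique result back to every row.
def fast_contains(data, pattern):
    values = pd.Series(data.index.get_level_values('identifier').astype(str))
    uniques = pd.Series(values.unique())
    matches = uniques.str.contains(pattern, case=False, regex=True)
    return values.map(dict(zip(uniques, matches))).to_numpy()

getFilter in the class below does essentially the same thing, with extra bookkeeping (an 'order' column and a merge) so that the result stays aligned with the original row order and NaN or non-string values are handled.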

Here is a class that does all of this. It works for indexes as well as plain columns:

import pandas as pd
import numpy as np


class DataFrameFastFilterText(object):
    """This class allows fast filtering of text column or index of a data frame
    Instead of filtering directy the values, it computes a hash of the values
    and applies the filtering to it.
    Then, it filters the values based on the hash mapping.
    This is faster than direct filtering if there are lots of duplicated values.
    Please note that missing values will be replaced by "" and every value will
    be converted to string.
    Input data won't be changed (working on a copy).
    """

    def __init__(self, data, column, isIndex):
        """Constructor
        - data: pandas Data Frame
        - column: name of the column - the column can be an index
        - isIndex: True if the column is an index
        """
        self._data = data.copy()  # working with a copy because we will make some changes
        self._column = column
        self._isIndex = isIndex
        self._hash = None
        self._initHashAndData()

    def _initHashAndData(self):
        """Initialize the hash and transforms the data
        1. data: a. adds an order column
                 b. creates the column from the index if needed,
                 c. rename the column as 'value' and
                 d. drops all the other columns
                 e. fill na with ""
                 f. convert to strings
        2. hash: data frame of 'hashId', 'hashValue'
        3. data: a. sort by initial order
                 b. delete order
        """
        self._data['order'] = np.arange(self._data.shape[0])
        if self._isIndex:
            self._data[self._column] = self._data.index.get_level_values(self._column)
        self._data = self._data.loc[:, [self._column, 'order']]
        self._data.rename(columns={self._column: 'value'}, inplace=True)
        self._data['value'] = self._data['value'].fillna("", inplace=False)
        self._data['value'] = self._data['value'].astype(str)
        self._hash = pd.DataFrame({'value': pd.Series(self._data['value'].unique())})
        self._hash['hashId'] = self._hash.index
        self._data = pd.merge(
            left=self._data,
            right=self._hash,
            on='value',
            how='inner',
        )
        self._data.sort_values(by='order', inplace=True)
        del self._data['order']

    def getFilter(self, *args, **kwargs):
        """Returns the filter as a boolean array. It doesn't applies it.
        - args: positional arguments to pass to pd.Series.str.contains
        - kwargs: named arguments to pass to pd.Series.str.contains
        """
        assert "na" not kwargs.keys()  # useless in this context because all the NaN values are already replaced by empty strings
        h = self._hash.copy()
        h = h.loc[h['value'].str.contains(*args, **kwargs), :].copy()
        d = self._data.copy()
        d['keep'] = d['hashId'].isin(h['hashId'])
        return np.array(
            d['keep'],
            dtype=bool
        ).copy()

And the example code then becomes:

...
tic = datetime.datetime.now()
fastFilter = DataFrameFastFilterText(data, 'identifier', isIndex=True)
print("- Init fast filter: %s" % str(datetime.datetime.now() - tic))

tic = datetime.datetime.now()
filt = filt | fastFilter.getFilter(rexpPlus)
print("- Step 2: %s" % str(datetime.datetime.now() - tic))

tic = datetime.datetime.now()
filt = filt & (~fastFilter.getFilter(rexpMinus))
print("- Step 3: %s" % str(datetime.datetime.now() - tic))
...
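The same class works for a regular (non-index) text column as well; in that case the constructor call would look like the following, where some_text_column is a hypothetical column name:

# Hypothetical usage on a plain column instead of an index level:
fastFilter = DataFrameFastFilterText(data, 'some_text_column', isIndex=False)
filt = fastFilter.getFilter('some pattern', case=False, regex=True)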
Licensed under: CC-BY-SA with attribution