Ignoring case, punctuation, and whitespace in Strings

https://stackoverflow.com/questions/2169170

24-09-2019
|

Question

What is the most efficient way of ignoring case, punctuation, and whitespace in strings? These strings should be divided into words instead of characters should ignore the aforementioned details on comparisons, and slices of these word-strings should be as efficient as possible with speed in mind.

I was going to use case and punctuation insensitive strings for the following code, but after seeing how long it would take to evaluate class Slice: def __eq__(self, other): return self.root == other.root, I have decided to work with data = tuple(string.split()) instead. Having strings that are insensitive to case, punctuation, and spacing and that work over words instead of characters was too expensive into the computationally expensive algorithms already expressed in the code below.

class Slice:

    def __init__(self, data, offset, length):
        self.prefix = data[:offset]
        self.root = data[offset:offset+length]
        self.suffix = data[offset+length:]

    def __eq__(self, other):
        return self.root == other.root

    def __len__(self):
        return len(self.root)

################################################################################

class Match:

    def __init__(self, data, key, prefix_tree, suffix_tree):
        self.data = data
        self.key = key
        self.prefix_tree = prefix_tree
        self.suffix_tree = suffix_tree
        self.__value = len(key) + prefix_tree.value() + suffix_tree.value()

    def value(self):
        return self.__value

################################################################################

class Tree(tuple):

    def __new__(cls, nodes):
        tree = super().__new__(cls, nodes)
        tree.__value = max(map(Match.value, tree)) if tree else 0
        return tree

    def value(self):
        return self.__value

    def find(self, value):
        for index, match in enumerate(self):
            if match.value() == value:
                return index
        raise ValueError()

################################################################################

def search(data, key):
    length = 0
    nodes = []
    for d_block in shrink(data, len(key)):
        block_len = len(d_block)
        if length > block_len:
            return Tree(nodes)
        for k_block in slide(key, block_len):
            if d_block == k_block:
                length = block_len
                prefix_tree = search(d_block.prefix, k_block.prefix)
                suffix_tree = search(d_block.suffix, k_block.suffix)
                match = Match(d_block, k_block, prefix_tree, suffix_tree)
                nodes.append(match)
    return Tree(nodes)

def shrink(data, max_len):
    for length in range(min(len(data), max_len), 0, -1):
        for block in slide(data, length):
            yield block

def slide(data, length):
    for offset in range(len(data) - length + 1):
        yield Slice(data, offset, length)

################################################################################

def build_tree(nodes):
    match = nodes[nodes.find(nodes.value())]
    node = match.key
    if match.prefix_tree:
        node.prefix = build_tree(match.prefix_tree)
    if match.suffix_tree:
        node.suffix = build_tree(match.suffix_tree)
    return node

def flatten_tree(node):
    array = [0]
    _flatten(node, array)
    return tuple(array)

def _flatten(node, array):
    if isinstance(node.prefix, Slice):
        _flatten(node.prefix, array)
    else:
        array.append(node.prefix)
    array[0] += 1
    array.append((array[0], node.root))
    if isinstance(node.suffix, Slice):
        _flatten(node.suffix, array)
    else:
        array.append(node.suffix)

Solution

"What is the best way to go about fixing this problem?"

The best -- and only -- way is to define what this object "means" and what the length of this object "means".

The object appears to be a list of words. Nothing more. That seems to be the value in _string.

It's not clear what _simple is, other than an inaccessible filtered subset of the words in _string.

So what's the length? The length of the words or the length of the words in the filtered subset?

Only you can define what this class means. The meaning will then determine how to implement __len__. Until you define the meaning, it's impossible to determine how anything should be implemented.

OTHER TIPS

If you want iteration on a String instance to iterate on its self.__string, as your __iter__ method indicates, the only sensible choice for length is also to return the length of __string -- it would be truly peculiar if len(x) and sum(1 for _ in x) resulted in different values.

I have to admit I don't understand the purpose of this class (and in particular why you made the terrible choice of having it old-style, and why you use such a contorted way to build __simple), but internal consistency is important anyway. So, either change __iter__, or make __len__ logically compatible with it.

Your slicing logic also totally escapes me -- why are you building the slice's __simple in a way that's likely to be different from what you'd get by rebuilding it from the slice's __string? E.g., if self.__string is '?Boh!' and therefore self.__simple is 'boh', why would you want self[1:-1] to have a __string of 'Boh' but with a __simple of 'o', so incompatible, different, and inconsistent from the __simple you'd get by recomputing it from the slice...?

I guess that's not germane to this Q about length, but I'm just curious about these many, extremely peculiar design choices that you're making...

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow