Python: preguiçoso Cordas Decoding

https://stackoverflow.com/questions/1656048

11-09-2019
|

Pergunta

Eu estou escrevendo um analisador, e há muito texto para decodificar, mas a maioria dos meus usuários só se preocupam com alguns campos de todos os dados. Então eu só quero fazer a decodificação quando um usuário realmente usa alguns dos dados. Será esta uma boa maneira de fazê-lo?

class LazyString(str):
    def __init__(self, v) :
        self.value = v
    def __str__(self) :
        r = ""
        s = self.value
        for i in xrange(0, len(s), 2) :
            r += chr(int(s[i:i+2], 16))
        return r

def p_buffer(p):
    """buffer : HASH chars"""
    p[0] = LazyString(p[2])

É esse o único método que eu preciso substituir?

Solução

I'm not sure how implementing a string subclass is of much benefit here. It seems to me that if you're processing a stream containing petabytes of data, whenever you've created an object that you don't need to you've already lost the game. Your first priority should be to ignore as much input as you possibly can.

You could certainly build a string-like class that did this:

class mystr(str):
    def __init__(self, value):
        self.value = value
        self._decoded = None
    @property
    def decoded(self):
        if self._decoded == None:
            self._decoded = self.value.decode("hex")
            return self._decoded
    def __repr__(self):
        return self.decoded
    def __len__(self):
        return len(self.decoded)
    def __getitem__(self, i):
        return self.decoded.__getitem__(i)
    def __getslice__(self, i, j):
        return self.decoded.__getslice__(i, j)

and so on. A weird thing about doing this is that if you subclass str, every method that you don't explicitly implement will be called on the value that's passed to the constructor:

>>> s = mystr('a0a1a2')
>>> s
 ¡¢
>>> len(s)
3
>>> s.capitalize()
'A0a1a2'

Outras dicas

I don't see any kind on lazy evaluation in your code. The fact that you use xrange only means that the list of integers from 0 to len(s) will be generated on demand. The whole string r will be decoded during string conversion anyway.

The best way to implement lazy sequence in Python is using generators. You could try something like this:

def lazy(v):
    for i in xrange(0, len(v), 2):
        yield int(v[i:i+2], 16)

list(lazy("0a0a0f"))
Out: [10, 10, 15]

What you're doing is built in already:

s =  "i am a string!".encode('hex')
# what you do
r = ""
for i in xrange(0, len(s), 2) :
    r += chr(int(s[i:i+2], 16))
# but decoding is builtin
print r==s.decode('hex') # => True

As you can see your whole decoding is s.decode('hex').

But "lazy" decoding sounds like premature optimization to me. You'd need gigabytes of data to even notice it. Try profiling, the .decode is 50 times faster that your old code already.

Maybe you want somthing like this:

class DB(object): # dunno what data it is ;)
    def __init__(self, data):
        self.data = data
        self.decoded = {} # maybe cache if the field data is long
    def __getitem__(self, name):
        try:
            return self.decoded[name]
        except KeyError:
            # this copies the fields data
            self.decoded[name] = ret = self.data[ self._get_field_slice( name ) ].decode('hex')
            return ret
    def _get_field_slice(self, name):
        # find out what part to decode, return the index in the data
        return slice( ... )

db = DB(encoded_data)    
print db["some_field"] # find out where the field is, get its data and decode it

The methods you need to override really depend on how are planning to use you new string type.

However you str based type looks a little suspicious to me, have you looked into the implementation of str to check that it has the value attribute that you are setting in your __init__()? Performing a dir(str) does not indicate that there is any such attribute on str. This being the case the normal str methods will not be operating on your data at all, I doubt that is the effect you want otherwise what would be the advantage of sub-classing.

Sub-classing base data types is a little strange anyway unless you have very specific requirements. For the lazy evaluation you want you are probably better of creating your class that contains a string rather than sub-classing str and write your client code to work with that class. You will then be free to add the just in time evaluation you want in a number of ways an example using the descriptor protocol can be found in this presentation: Python's Object Model (search for "class Jit(object)" to get to the relevant section)

The question is incomplete, in that the answer will depend on details of the encoding you use.

Say, if you encode a list of strings as pascal strings (i.e. prefixed with string length encoded as a fixed-size integer), and say you want to read the 100th string from the list, you may seek() forward for each of the first 99 strings and not read their contents at all. This will give some performance gain if the strings are large.

If, OTOH, you encode a list of strings as concatenated 0-terminated stirngs, you would have to read all bytes until the 100th 0.

Also, you're speaking about some "fields" but your example looks completely different.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a StackOverflow