Question

>>> import sys
>>> sys.getsizeof("")
40

Why does the empty string use so many bytes? Does anybody know what is stored in those 40 bytes?

Solution

In Python, strings are objects, so that value is the size of the whole object, not just of its characters. It will therefore always be larger than the string data alone.

From stringobject.h:

typedef struct {
    PyObject_VAR_HEAD
    long ob_shash;
    int ob_sstate;
    char ob_sval[1];

    /* Invariants:
     *     ob_sval contains space for 'ob_size+1' elements.
     *     ob_sval[ob_size] == 0.
     *     ob_shash is the hash of the string or -1 if not computed yet.
     *     ob_sstate != 0 iff the string object is in stringobject.c's
     *       'interned' dictionary; in this case the two references
     *       from 'interned' to this object are *not counted* in ob_refcnt.
     */
} PyStringObject;

From here you can get some clues on how those bytes are used:

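  • len(str)+1 bytes to store the string itself in ob_sval, including the terminating null byte;
  • 8 bytes for the hash (ob_shash, a long, which is 8 bytes on a typical 64-bit Unix build);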
  • (...)
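
The breakdown above can be checked approximately with the struct module. This is only a sketch: it assumes a non-debug Python 2 build in which PyObject_VAR_HEAD expands to a reference count, a type pointer, and a size field. The format strings and the helper name string_object_layout are illustrative, not taken from CPython.

```python
import struct

# Assumed field layout of PyStringObject (Python 2, non-debug build):
#   n = Py_ssize_t ob_refcnt   \
#   P = pointer    ob_type      > PyObject_VAR_HEAD
#   n = Py_ssize_t ob_size     /
#   l = long       ob_shash
#   i = int        ob_sstate
#   c = char       ob_sval[1]
HEADER_FORMAT = "nPn"
FIELDS_FORMAT = HEADER_FORMAT + "lic"


def string_object_layout():
    """Return (header, unpadded, padded) sizes in bytes for this platform."""
    header = struct.calcsize(HEADER_FORMAT)
    # Native alignment pads between fields but not after the last one.
    unpadded = struct.calcsize(FIELDS_FORMAT)
    # A trailing "0n" rounds the total up to Py_ssize_t alignment,
    # approximating what sizeof() would report for the whole struct.
    padded = struct.calcsize(FIELDS_FORMAT + "0n")
    return header, unpadded, padded


header, unpadded, padded = string_object_layout()
print("PyObject_VAR_HEAD:", header)
print("fields without trailing padding:", unpadded)
print("fields with trailing padding:", padded)
```

On a platform where pointers and long are both 8 bytes, this should give 24, 37, and 40 bytes. That lines up with the 37 and 40 reported for the empty string elsewhere on this page, depending on whether trailing padding is counted.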

Other tips

You can find some information about the implementation of Python strings in a weblog article by Laurent Luce. Additionally, you can browse the source.

The size of a string object depends on the OS, the machine type, and some configuration choices. On 64-bit FreeBSD, with unicode string literals enabled (from __future__ import unicode_literals):

In [1]: dir(str)
Out[1]: ['__add__', '__class__', '__contains__', '__delattr__', '__doc__',
 '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', 
'__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__', '__le__', 
'__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', 
'__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', 
'__sizeof__', '__str__', '__subclasshook__', '_formatter_field_name_split', 
'_formatter_parser', 'capitalize', 'center', 'count', 'decode', 'encode', 
'endswith', 'expandtabs', 'find', 'format', 'index', 'isalnum', 'isalpha', 
'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 
'lower', 'lstrip', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 
'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 
'swapcase', 'title', 'translate', 'upper', 'zfill']

In [2]: import sys

In [3]: sys.getsizeof("")
Out[3]: 52

In [4]: sys.getsizeof("test")
Out[4]: 68

In [7]: sys.getsizeof("t")
Out[7]: 56

In [8]: sys.getsizeof("te")
Out[8]: 60

In [9]: sys.getsizeof("tes")
Out[9]: 64

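Each additional character takes 4 more bytes in this case, which is consistent with unicode strings being stored as 4-byte code units on this build.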
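
The same measurement can be wrapped in a small helper. This is a sketch only: bytes_per_char is a made-up name, sys.getsizeof reports CPython-specific numbers, and the result depends on the interpreter version and build, so it will not necessarily reproduce the figures above.

```python
import sys


def bytes_per_char(ch, count=8):
    """Estimate how many bytes each extra copy of `ch` adds to a string.

    Compares two strings that differ by exactly one character, so the
    fixed per-object overhead cancels out.
    """
    shorter = ch * count
    longer = ch * (count + 1)
    return sys.getsizeof(longer) - sys.getsizeof(shorter)


print("bytes per character:", bytes_per_char("t"))
```

Differencing two sizes, rather than dividing one size by the length, keeps the fixed object header out of the estimate.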

sys.getsizeof("") gives the size of a str object with an empty value. Evaluating "" creates an instance of the str class, which carries several fields in addition to the character data, and getsizeof measures that whole object. It is equivalent to:

import sys

x = str()
sys.getsizeof(x)  # prints 37 in my environment

Each additional character then takes 1 byte:

x = "r"
sys.getsizeof(x)  # prints 38
x = "ros"
sys.getsizeof(x)  # prints 40
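
One way to see the fixed part and the per-character part separately is to subtract the length from the reported size. A minimal sketch, assuming CPython and plain ASCII strings; size_breakdown is an illustrative name, and the absolute numbers will differ between versions and platforms.

```python
import sys


def size_breakdown(words):
    """Return (word, length, total size, size minus length) for each word."""
    rows = []
    for word in words:
        total = sys.getsizeof(word)
        rows.append((word, len(word), total, total - len(word)))
    return rows


for word, length, total, overhead in size_breakdown(["", "r", "ros"]):
    # If every character costs one byte, the last column stays constant.
    print(repr(word), length, total, overhead)
```

If the last column is the same for every row, the size is a fixed per-object overhead plus one byte per character, which matches the 37, 38, and 40 shown above.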
Licensed under: CC-BY-SA with attribution