Is there a better way to do csv/namedtuple with urlopen?

https://stackoverflow.com/questions/16374913

14-04-2022

Question

Using the namedtuple documentation example as my template in Python 3.3, I have the following code to download a csv and turn it into a series of namedtuple subclass instances:

from collections import namedtuple
from csv import reader
from urllib.request import urlopen

SecurityType = namedtuple('SecurityType', 'sector, name')

url = 'http://bsym.bloomberg.com/sym/pages/security_type.csv'
for sec in map(SecurityType._make, reader(urlopen(url))):
    print(sec)

This raises the following exception:

Traceback (most recent call last):
  File "scrap.py", line 9, in <module>
    for sec in map(SecurityType._make, reader(urlopen(url))):
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

I know that the issue is that urlopen is returning bytes and not strings and that I need to decode the output at some point. Here's how I'm doing it now, using StringIO:

from collections import namedtuple
from csv import reader
from urllib.request import urlopen
import io

SecurityType = namedtuple('SecurityType', 'sector, name')

url = 'http://bsym.bloomberg.com/sym/pages/security_type.csv'
reader_input = io.StringIO(urlopen(url).read().decode('utf-8'))

for sec in map(SecurityType._make, reader(reader_input)):
    print(sec)

This smells funny because I'm basically iterating over the bytes buffer, decoding, rebuffering, then iterating over the new string buffer. Is there a more Pythonic way to do this without two iterations?

Solution

Use io.TextIOWrapper() to decode the urllib response:

reader_input = io.TextIOWrapper(urlopen(url), encoding='utf8', newline='')

Now csv.reader is passed the exact same interface that it would get when opening a regular file on the filesystem in text mode.
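
To try the wrapper without touching the network, here is a minimal sketch that substitutes an in-memory io.BytesIO for the urlopen response. The sample bytes and the helper name parse_security_types are made up for illustration; only the TextIOWrapper call mirrors the line above.

```python
import io
from collections import namedtuple
from csv import reader

SecurityType = namedtuple('SecurityType', 'sector, name')

# Hypothetical stand-in for urlopen(url): any readable binary stream will do.
fake_response = io.BytesIO(
    b'Market Sector,Security Type\r\n'
    b'Comdty,Calendar Spread Option\r\n'
)

def parse_security_types(binary_stream):
    """Return an iterator of SecurityType rows decoded lazily from a binary stream."""
    # newline='' leaves line endings untouched, as the csv module expects.
    text_stream = io.TextIOWrapper(binary_stream, encoding='utf8', newline='')
    return map(SecurityType._make, reader(text_stream))

rows = list(parse_security_types(fake_response))
for sec in rows:
    print(sec)
```

The wrapper decodes as the reader pulls lines, so the payload is not read into one string and rebuffered first. With a real response, passing urlopen(url) in place of fake_response should behave the same way.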

With this change your example URL works for me on Python 3.3.1:

>>> for sec in map(SecurityType._make, reader(reader_input)):
...     print(sec)
... 
SecurityType(sector='Market Sector', name='Security Type')
SecurityType(sector='Comdty', name='Calendar Spread Option')
SecurityType(sector='Comdty', name='Financial commodity future.')
SecurityType(sector='Comdty', name='Financial commodity generic.')
SecurityType(sector='Comdty', name='Financial commodity option.')
...
SecurityType(sector='Muni', name='ZERO COUPON, OID')
SecurityType(sector='Pfd', name='PRIVATE')
SecurityType(sector='Pfd', name='PUBLIC')
SecurityType(sector='', name='')
SecurityType(sector='', name='')
SecurityType(sector='', name='')
SecurityType(sector='', name='')
SecurityType(sector='', name='')
SecurityType(sector='', name='')
SecurityType(sector='', name='')
SecurityType(sector='', name='')
SecurityType(sector='', name='')

The last rows are not empty tuples but tuples whose two fields are empty strings; the original file does end with lines containing nothing but a comma.
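
If the header row and the comma-only rows are unwanted, one option is to filter them out before building the tuples. This is a sketch under assumptions: the sample text below is invented to mimic the shape of the file, and the generator name security_types and its skip_header parameter are hypothetical.

```python
import io
from collections import namedtuple
from csv import reader

SecurityType = namedtuple('SecurityType', 'sector, name')

# Invented sample: a header, one data row, then trailing comma-only lines.
sample = io.StringIO(
    'Market Sector,Security Type\r\nPfd,PUBLIC\r\n,\r\n,\r\n',
    newline='',
)

def security_types(text_stream, skip_header=True):
    """Yield SecurityType rows, dropping the header and rows with no content."""
    rows = reader(text_stream)
    if skip_header:
        next(rows, None)  # discard the header row if there is one
    for row in rows:
        # Skips both [] (a truly blank line) and ['', ''] (a lone comma).
        if any(field.strip() for field in row):
            yield SecurityType._make(row)

cleaned = list(security_types(sample))
print(cleaned)
```

The same generator accepts the io.TextIOWrapper object from the answer in place of sample, since both are text streams.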

Licensed under: CC-BY-SA with attribution

Not affiliated with Stack Overflow