Question

I am reading data files that are stored online as Excel workbooks. My current process downloads each file to disk with the retrieveFile function below, which uses the urllib2 library, and then parses it with the traverseWorkbook function, which uses the xlrd library.

I would like to do the same thing without writing the file to disk, keeping it in memory and parsing it there instead.

I'm not sure how to proceed, but I'm sure it's possible.

import urllib2

def retrieveFile(url, filename):
    try:
        req = urllib2.urlopen(url)
        CHUNK = 16 * 1024
        with open(filename, 'wb') as fp:
            # Copy the response to disk in 16 KB chunks.
            while True:
                chunk = req.read(CHUNK)
                if not chunk:
                    break
                fp.write(chunk)
        return True
    except Exception:
        return None


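from xlrd import open_workbook

def traverseWorkbook(filename):
    values = []

    wb = open_workbook(filename)
    for s in wb.sheets():
        for row in range(s.nrows):
            # Skip the first 11 rows (indices 0 through 10).
            if row > 10:
                # processRow is defined elsewhere.
                rowData = processRow(s, row, type)
                if rowData:
                    values.append(rowData)
    return values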

Solution

You can read the entire file into memory using:

data = urllib2.urlopen(url).read()

Once the file is in memory, you can load it into xlrd using the file_contents argument of open_workbook:

wb = xlrd.open_workbook(url, file_contents=data)

Pass the URL as the filename argument: the documentation states that it may be used in messages; otherwise it is ignored.

Thus, your traverseWorkbook function can be rewritten as:

import urllib2
import xlrd

def traverseWorkbook(url):
    values = []
    # Read the whole response into memory instead of writing it to disk.
    data = urllib2.urlopen(url).read()
    wb = xlrd.open_workbook(url, file_contents=data)
    for s in wb.sheets():
        for row in range(s.nrows):
            if row > 10:
                rowData = processRow(s, row, type)
                if rowData:
                    values.append(rowData)
    return values
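
The code above targets Python 2. On Python 3, where urllib2 was folded into urllib.request, the same approach might look like the sketch below. The names fetch_bytes and open_workbook_from_url and the timeout parameter are illustrative assumptions, not part of the original answer. Note also that xlrd 2.0 and later read only .xls files.

```python
import urllib.request


def fetch_bytes(url, timeout=30):
    """Return the full response body for `url` as bytes, without touching disk."""
    # fetch_bytes and its timeout default are hypothetical, chosen for illustration.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def open_workbook_from_url(url):
    """Download `url` into memory and hand the bytes to xlrd."""
    import xlrd  # third-party; imported here so fetch_bytes works without it

    data = fetch_bytes(url)
    # The first argument only serves as a label here; the workbook content
    # comes from file_contents.
    return xlrd.open_workbook(url, file_contents=data)
```

The row loop from traverseWorkbook can then run unchanged on the workbook returned by open_workbook_from_url.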

OTHER TIPS

You could use the StringIO library and write the downloaded data to a file-like StringIO object, rather than a normal file.

import urllib2
import cStringIO as cs
from contextlib import closing

def retrieveFile(url):
    try:
        req = urllib2.urlopen(url)
        CHUNK = 16 * 1024
        with closing(cs.StringIO()) as fp:
            while True:
                chunk = req.read(CHUNK)
                if not chunk:
                    break
                fp.write(chunk)
            # full_str contains the full contents of the downloaded file.
            full_str = fp.getvalue()
        return full_str
    except Exception:
        return None
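
On Python 3 the cStringIO module is gone and binary data belongs in io.BytesIO. A minimal sketch of the same chunked read follows, assuming a hypothetical helper named read_all that accepts any object with a read(n) method. An in-memory stream stands in for the network response so the example runs offline.

```python
import io


def read_all(response, chunk_size=16 * 1024):
    """Copy a file-like `response` into memory chunk by chunk and return bytes."""
    buffer = io.BytesIO()
    while True:
        chunk = response.read(chunk_size)
        if not chunk:
            break
        buffer.write(chunk)
    return buffer.getvalue()


# Stand-in for a network response; any object with read(n) works.
fake_response = io.BytesIO(b"x" * 40000)
payload = read_all(fake_response)
```

The returned bytes can be passed straight to the file_contents argument of xlrd.open_workbook.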

You can also use pandas for this. pandas.ExcelFile accepts a URL directly, so the download is handled for you, and the parsed sheet is a DataFrame that you manipulate in memory. Much of pandas' numerical work runs in compiled code rather than pure Python.

import pandas as pd

xl = pd.ExcelFile(url, engine='xlrd')
sheets = xl.sheet_names

# Work with the first sheet, or iterate through the sheets if there is more than one.
df = xl.parse(sheets[0])

# The file is now a dataframe.
# You can manipulate the data in memory using the Pandas API
# ...
# ...

# After massaging the data, write it to an xls file:
out_file = '~/Documents/out_file.xls'
df.to_excel(out_file, encoding='utf-8', index=False)
Licensed under: CC-BY-SA with attribution