Question

I stumbled upon a little coding problem. I have to basically read data from a .csv file which looks a lot like this:

2011-06-19 17:29:00.000,72,44,56,0.4772,0.3286,0.8497,31.3587,0.3235,0.9147,28.5751,0.3872,0.2803,0,0.2601,0.2073,0.1172,0,0.0,0,5.8922,1,0,0,0,1.2759

Now, I need to basically an entire file consisting of rows like this and parse them into numpy arrays. Till now, I have been able to get them into a big string type object using code similar to this:

order_hist = np.loadtxt(filename_input,delimiter=',',dtype={'names': ('Year', 'Mon', 'Day', 'Stock', 'Action', 'Amount'), 'formats': ('i4', 'i4', 'i4', 'S10', 'S10', 'i4')})

The format for this file consists of a set of S20 data types as of now. I need to basically extract all of the data in the big ORDER_HIST data type into a set of arrays for each column. I do not know how to save the date time column (I've kept it as String for now). I need to convert the rest to float, but the below code is giving me an error:

    temparr=float[:len(order_hist)]
    for x in range(len(order_hist['Stock'])): 
        temparr[x]=float(order_hist['Stock'][x]);

Can someone show me just how I can convert all the columns to the arrays that I need??? Or possibly direct me to some link to do so?

Was it helpful?

Solution

Boy, have I got a treat for you. numpy.genfromtxt has a converters parameter, which allows you to specify a function for each column as the file is parsed. The function is fed the CSV string value. Its return value becomes the corresponding value in the numpy array.

Morever, the dtype = None parameter tells genfromtxt to make an intelligent guess as to the type of each column. In particular, numeric columns are automatically cast to an appropriate dtype.

For example, suppose your data file contains

2011-06-19 17:29:00.000,72,44,56

Then

import numpy as np
import datetime as DT

def make_date(datestr):
    return DT.datetime.strptime(datestr, '%Y-%m-%d %H:%M:%S.%f')

arr = np.genfromtxt(filename, delimiter = ',',
                    converters = {'Date':make_date},
                    names =  ('Date', 'Stock', 'Action', 'Amount'),
                    dtype = None)
print(arr)
print(arr.dtype)

yields

(datetime.datetime(2011, 6, 19, 17, 29), 72, 44, 56)
[('Date', '|O4'), ('Stock', '<i4'), ('Action', '<i4'), ('Amount', '<i4')]

Your real csv file has more columns, so you'd want to add more items to names, but otherwise, the example should still stand.

If you don't really care about the extra columns, you can assign a fluff-name like this:

arr = np.genfromtxt(filename, delimiter=',',
                    converters={'Date': make_date},
                    names=('Date', 'Stock', 'Action', 'Amount') +
                    tuple('col{i}'.format(i=i) for i in range(22)),
                    dtype = None)

yields

(datetime.datetime(2011, 6, 19, 17, 29), 72, 44, 56, 0.4772, 0.3286, 0.8497, 31.3587, 0.3235, 0.9147, 28.5751, 0.3872, 0.2803, 0, 0.2601, 0.2073, 0.1172, 0, 0.0, 0, 5.8922, 1, 0, 0, 0, 1.2759)

You might also be interested in checking out the pandas module which is built on top of numpy, and which takes parsing CSV to an even higher level of luxury: It has a pandas.read_csv function whose parse_dates = True parameter will automatically parse date strings (using dateutil).

Using pandas, your csv could be parsed with

df = pd.read_csv(filename, parse_dates = [0,1], header = None,
                    names=('Date', 'Stock', 'Action', 'Amount') +
                    tuple('col{i}'.format(i=i) for i in range(22)))

Note there is no need to specify the make_date function. Just to be clear --pands.read_csvreturns aDataFrame, not a numpy array. The DataFrame may actually be more useful for your purpose, but you should be aware it is a different object with a whole new world of methods to exploit and explore.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top