Boy, have I got a treat for you. numpy.genfromtxt has a converters parameter, which allows you to specify a function for each column as the file is parsed. The function is fed the CSV string value. Its return value becomes the corresponding value in the numpy array.
Moreover, the dtype=None parameter tells genfromtxt to make an intelligent guess as to the type of each column. In particular, numeric columns are automatically cast to an appropriate dtype.
For example, suppose your data file contains
2011-06-19 17:29:00.000,72,44,56
Then
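import numpy as np
import datetime as DT

def make_date(datestr):
    # Convert the raw CSV field into a datetime.datetime object
    return DT.datetime.strptime(datestr, '%Y-%m-%d %H:%M:%S.%f')

# An explicit encoding ensures make_date receives str rather than bytes
arr = np.genfromtxt(filename, delimiter=',',
                    converters={'Date': make_date},
                    names=('Date', 'Stock', 'Action', 'Amount'),
                    dtype=None, encoding='utf-8')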
print(arr)
print(arr.dtype)
yields
(datetime.datetime(2011, 6, 19, 17, 29), 72, 44, 56)
[('Date', '|O4'), ('Stock', '<i4'), ('Action', '<i4'), ('Amount', '<i4')]
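To try this without a file on disk, here is a minimal self-contained sketch. The in-memory sample (including its second row) and the bytes check inside make_date are assumptions added for illustration, not part of your data. It also shows that the result is a structured array whose columns can be pulled out by name:

```python
import datetime as DT
import io

import numpy as np


def make_date(datestr):
    # Some NumPy versions hand converters bytes; decode defensively.
    if isinstance(datestr, bytes):
        datestr = datestr.decode('utf-8')
    return DT.datetime.strptime(datestr, '%Y-%m-%d %H:%M:%S.%f')


# Hypothetical stand-in for the data file (second row is made up).
sample = io.StringIO(
    '2011-06-19 17:29:00.000,72,44,56\n'
    '2011-06-20 09:15:30.500,73,45,57\n'
)

arr = np.genfromtxt(sample, delimiter=',',
                    converters={'Date': make_date},
                    names=('Date', 'Stock', 'Action', 'Amount'),
                    dtype=None, encoding='utf-8')

# Each named field behaves like an ordinary one-dimensional array.
dates = arr['Date']
stocks = arr['Stock']
print(dates)
print(stocks)
```

The Date field holds plain datetime.datetime objects (object dtype), while the remaining fields are integer arrays.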
Your real CSV file has more columns, so you'd want to add more items to names, but otherwise the example should still stand.
If you don't really care about the extra columns, you can assign placeholder names like this:
arr = np.genfromtxt(filename, delimiter=',',
                    converters={'Date': make_date},
                    names=('Date', 'Stock', 'Action', 'Amount') +
                          tuple('col{i}'.format(i=i) for i in range(22)),
                    dtype=None, encoding='utf-8')
yields
(datetime.datetime(2011, 6, 19, 17, 29), 72, 44, 56, 0.4772, 0.3286, 0.8497, 31.3587, 0.3235, 0.9147, 28.5751, 0.3872, 0.2803, 0, 0.2601, 0.2073, 0.1172, 0, 0.0, 0, 5.8922, 1, 0, 0, 0, 1.2759)
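In case the names expression looks opaque, this small sketch shows what it builds. It uses only the standard library; the variable names base and filler are mine, chosen for illustration:

```python
# The four meaningful column names from the example above.
base = ('Date', 'Stock', 'Action', 'Amount')

# Placeholder names col0 ... col21 for the 22 remaining columns.
filler = tuple('col{i}'.format(i=i) for i in range(22))

names = base + filler
print(len(names))   # 26, one name per column in the row shown above
print(names[:6])    # ('Date', 'Stock', 'Action', 'Amount', 'col0', 'col1')
```

The total has to match the number of columns in the file, so adjust the range(22) to fit your real data.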
You might also be interested in checking out the pandas module, which is built on top of numpy and takes CSV parsing to an even higher level of luxury: it has a pandas.read_csv function whose parse_dates parameter will automatically parse date strings (using dateutil).
Using pandas, your CSV could be parsed with
import pandas as pd

# Column 0 holds the date strings, so that is the column to parse
df = pd.read_csv(filename, parse_dates=[0], header=None,
                 names=('Date', 'Stock', 'Action', 'Amount') +
                       tuple('col{i}'.format(i=i) for i in range(22)))
Note there is no need to specify the make_date function. Just to be clear: pandas.read_csv returns a DataFrame, not a numpy array. The DataFrame may actually be more useful for your purpose, but you should be aware it is a different object with a whole new world of methods to exploit and explore.
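If you end up needing a numpy array after all, a DataFrame can be converted back. The following is a rough sketch, assuming a one-row in-memory sample in place of your file and only the first four columns:

```python
import io

import pandas as pd

# Hypothetical stand-in for the data file, reduced to four columns.
sample = io.StringIO('2011-06-19 17:29:00.000,72,44,56\n')

df = pd.read_csv(sample, header=None, parse_dates=[0],
                 names=['Date', 'Stock', 'Action', 'Amount'])
print(type(df).__name__)   # DataFrame
print(df.dtypes)

# Plain 2-D array built from the numeric columns only.
values = df[['Stock', 'Action', 'Amount']].to_numpy()

# Structured (record) array that keeps the column names as field names.
records = df.to_records(index=False)
print(values)
print(records.dtype.names)
```

to_numpy gives an ordinary homogeneous array, while to_records is closer in spirit to what genfromtxt returns with names.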