Question

I have a large CSV file with ~90k rows and 355 columns. The first 354 columns indicate the presence of different words (each is 1 or 0), and the last column holds a numerical value.

E.g.:

table, box, cups, glasses, total
1,0,0,1,30
0,1,1,1,28
1,1,0,1,55

When I use:

d = np.recfromcsv('clean.csv', dtype=None, delimiter=',', names=True)
d.shape
# I get: (89460,)

So my questions are:

  1. How do I get a 2d array/matrix? Does it matter?
  2. How can I separate the 'total' column so I can create train, cross_validation and test sets and train a model?

Solution

OK, after much googling and time wasting, this is what anyone needs to do if they just want NumPy out of the way so they can read a CSV and get on with scikit-learn:

import numpy as np

# Say your csv file has 10 columns: columns 1-9 are the features and
# column 10 is the Y you're trying to predict.
cols = range(0, 9)  # zero-based indices 0-8, i.e. the 9 feature columns only
X = np.loadtxt('data.csv', delimiter=',', dtype=float, usecols=cols, ndmin=2, skiprows=1)
Y = np.loadtxt('data.csv', delimiter=',', dtype=float, usecols=(9,), ndmin=2, skiprows=1)
# Note that for Y, usecols is given as a one-element sequence even though
# only one column is wanted; older NumPy versions require a sequence here.
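
Reading the file twice is not strictly necessary. As a minimal sketch, assuming the whole table fits in memory and every column is numeric, the data can be loaded once and the last column sliced off. The inline `csv_text` sample is only a stand-in for the real file, built from the rows in the question:

```python
import io
import numpy as np

# Stand-in for the real file: the sample rows from the question.
csv_text = """table,box,cups,glasses,total
1,0,0,1,30
0,1,1,1,28
1,1,0,1,55
"""

# Load everything once, skipping the header row.
data = np.loadtxt(io.StringIO(csv_text), delimiter=',', dtype=float,
                  skiprows=1, ndmin=2)

X = data[:, :-1]   # every column except the last: the features
Y = data[:, -1:]   # the last column, kept 2D with shape (n_rows, 1)

print(X.shape, Y.shape)
```

With a real file, the `io.StringIO(csv_text)` argument would simply be replaced by the file name.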

Other tips

np.recfromcsv returns a 1-dimensional record array.

When you have a structured array, you can access the columns by field name:

d['total']

returns the total column.

You can access rows using integer indexing:

d[0]

returns the first row, for example.
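
To make the field and row access concrete, here is a small sketch using the sample rows from the question. It assumes `np.genfromtxt` with `names=True` as a stand-in for `np.recfromcsv`, and it uses `structured_to_unstructured` from `numpy.lib.recfunctions` (available in reasonably recent NumPy versions) to turn the records into a plain 2D array. The `csv_text` variable is illustrative only:

```python
import io
import numpy as np
from numpy.lib import recfunctions as rfn

# Stand-in for the real file: the sample rows from the question.
csv_text = """table,box,cups,glasses,total
1,0,0,1,30
0,1,1,1,28
1,1,0,1,55
"""

# names=True reads the header row as field names, giving a 1D structured array.
d = np.genfromtxt(io.StringIO(csv_text), delimiter=',', names=True,
                  dtype=None, encoding='utf-8')

print(d.shape)      # one entry per row, hence 1-dimensional
print(d['total'])   # a single column, selected by field name
print(d[0])         # a single row (one record)

# Convert the records to a plain 2D array, one column per field.
arr = rfn.structured_to_unstructured(d)
print(arr.shape)
```

This is why `d.shape` in the question shows only the row count: each row is a single record, and the columns live inside the dtype rather than in a second axis.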


If you wish to select all the columns except the last one, then you'd be better off using a plain 2D NumPy array. With a plain NumPy array (as opposed to a structured array) you can select all the columns except the last one using slicing.

You could use np.genfromtxt to load the data into a 2D array:

import numpy as np

d = np.genfromtxt('data', dtype=None, delimiter=',', skip_header=1)
print(d.shape)
# (3, 5)
print(d)
# [[ 1  0  0  1 30]
#  [ 0  1  1  1 28]
#  [ 1  1  0  1 55]]

This selects the last column:

print(d[:,-1])
# [30 28 55]

This selects everything but the last column:

print(d[:,:-1])
# [[1 0 0 1]
#  [0 1 1 1]
#  [1 1 0 1]]
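
For the second part of the question, the same slicing can feed a train / cross-validation / test split. The following is only a sketch: the 60/20/20 proportions, the random stand-in data, and all variable names are assumptions for illustration, not something stated in the question.

```python
import numpy as np

# Hypothetical stand-in data: 10 rows, 4 binary feature columns plus a total column.
rng = np.random.default_rng(0)
d = np.column_stack([rng.integers(0, 2, size=(10, 4)),
                     rng.integers(20, 60, size=10)])

# Features are every column but the last; the target is the last column.
X, y = d[:, :-1], d[:, -1]

# Shuffle the row indices, then cut them into 60% / 20% / 20% pieces.
idx = rng.permutation(len(d))
n_train = len(d) * 6 // 10
n_val = len(d) * 2 // 10
train_idx, val_idx, test_idx = np.split(idx, [n_train, n_train + n_val])

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_val.shape, X_test.shape)
```

If scikit-learn is already installed, its `train_test_split` function in `sklearn.model_selection` can do the shuffling and cutting instead; calling it twice yields three sets.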
Licensed under: CC-BY-SA with attribution