Question

I wonder what the best way of normalizing/standardizing a numpy recarray is. To make it clear, I'm not talking about a mathematical matrix, but a record array that also has e.g. textual columns (such as labels).

a = np.genfromtxt("iris.csv", delimiter=",", dtype=None)
print(a.shape)
> (150,)

As you can see, I cannot process e.g. a[:, :-1], since the shape is one-dimensional.

The best I found is to iterate over all columns:

for nam in a.dtype.names[:-1]:
    col = a[nam]
    a[nam] = (col - col.min()) / (col.max() - col.min())

Any more elegant way of doing this? Is there some method such as "normalize" or "standardize" somewhere?


Solution

There are a number of ways to do it, but some are cleaner than others.

Usually, in numpy, you keep the string data in a separate array.

(Things are a bit more low-level than, say, R's data frame. You typically just wrap things up in a class for the association, but keep different data types separate.)

Honestly, numpy isn't optimized for handling "flexible" datatypes such as this (though it can certainly do it). Libraries like pandas provide a better interface for "spreadsheet-like" data (pandas is built on top of numpy).

However, structured arrays (which is what you have here) will allow you to slice them column-wise when you pass in a list of field names. (e.g. data[['col1', 'col2', 'col3']])
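To make the field-name slicing concrete, here is a small self-contained illustration using a synthetic structured array (made-up field names and values, standing in for the iris data):

```python
import numpy as np

# A small structured array standing in for the CSV data (synthetic example).
data = np.array(
    [(5.1, 3.5, "setosa"), (6.2, 2.9, "virginica")],
    dtype=[("sepal_len", "f8"), ("sepal_wid", "f8"), ("species", "U10")],
)

# Passing a list of field names selects several columns at once.
floats_only = data[["sepal_len", "sepal_wid"]]
print(floats_only.dtype.names)  # ('sepal_len', 'sepal_wid')
```

Note that in NumPy 1.16 and later, this multi-field selection returns a view into the original array rather than a copy.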

At any rate, one way is to do something like this:

import numpy as np

# Note: np.recfromcsv was removed in NumPy 2.0;
# np.genfromtxt('iris.csv', delimiter=',', dtype=None, names=True)
# is the modern equivalent.
data = np.recfromcsv('iris.csv')

# In this case, it's just all but the last, but we could be more general
# This must be a list and not a tuple, though.
float_fields = list(data.dtype.names[:-1])

float_dat = data[float_fields]

# Now we just need to view it as a "regular" 2D array...
float_dat = float_dat.view(np.float64).reshape((data.size, -1))

# And we can normalize columns as usual.
normalized = (float_dat - float_dat.min(axis=0)) / np.ptp(float_dat, axis=0)

However, this is far from ideal. If you want to do the operation in-place (as you currently are) the easiest solution is what you already have: Just iterate over the field names.
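On newer NumPy versions (1.16+), multi-field indexing returns a view that keeps the original itemsize, so the `.view(...)` trick above can fail; `numpy.lib.recfunctions.structured_to_unstructured` is the supported way to get a plain 2-D array. A sketch, again using synthetic data in place of iris.csv:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Synthetic stand-in for the float fields of the iris data.
data = np.array(
    [(5.1, 3.5), (6.2, 2.9), (4.7, 3.2)],
    dtype=[("sepal_len", "f8"), ("sepal_wid", "f8")],
)

# Convert the structured array to an ordinary (n, 2) float array.
float_dat = rfn.structured_to_unstructured(data)

# Min-max normalize each column (np.ptp is max - min per column).
normalized = (float_dat - float_dat.min(axis=0)) / np.ptp(float_dat, axis=0)
print(normalized.min(axis=0), normalized.max(axis=0))
```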

Incidentally, using pandas, you'd do something like this:

import pandas
data = pandas.read_csv('iris.csv', header=None)

float_dat = data[data.columns[:-1]]
dmin, dmax = float_dat.min(axis=0), float_dat.max(axis=0)

data[data.columns[:-1]] = (float_dat - dmin) / (dmax - dmin)
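To see the pandas version run end-to-end without the CSV file, here is a minimal sketch with an in-memory frame (the column names and values are made up):

```python
import pandas as pd

# Synthetic stand-in for the CSV: two float columns plus a label column.
data = pd.DataFrame({
    "a": [5.1, 6.2, 4.7],
    "b": [3.5, 2.9, 3.2],
    "label": ["x", "y", "z"],
})

# Min-max normalize every column except the last, in place.
float_dat = data[data.columns[:-1]]
dmin, dmax = float_dat.min(axis=0), float_dat.max(axis=0)
data[data.columns[:-1]] = (float_dat - dmin) / (dmax - dmin)

print(data["a"].min(), data["a"].max())  # 0.0 1.0
```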

OTHER TIPS

What version of NumPy are you using? With version 1.5.1, I don't get this behavior. I made a short text file as an example, saved as test.txt:

last,first,country,state,zip
tyson,mike,USA,Nevada,89146
brady,tom,USA,Massachusetts,02035

When I then execute the following code, this is what I get:

>>> import numpy as np
>>> a = np.genfromtxt("/home/ely/Desktop/Python/test.txt",delimiter=',',dtype=None)
>>> print a.shape
(3, 5)
>>> print a
[['last' 'first' 'country' 'state' 'zip']
 ['tyson' 'mike' 'USA' 'Nevada' '89146']
 ['brady' 'tom' 'USA' 'Massachusetts' '02035']]
>>> print a[0,:-1]
['last' 'first' 'country' 'state']
>>> print a.dtype.names
None

I'm just wondering what's different about your data.
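The difference comes down to whether genfromtxt produced a homogeneous 2-D array or a 1-D structured array: when every column parses as the same type (here, all strings), you get a plain 2-D array with `dtype.names` equal to `None`; with `names=True` (or genuinely mixed column types), you get the 1-D structured result the question describes. A sketch with inline data (synthetic, based on the test file above):

```python
import io
import numpy as np

text = "last,first,zip\ntyson,mike,89146\nbrady,tom,02035\n"

# All-string parse: a plain 2-D array, so dtype.names is None.
a = np.genfromtxt(io.StringIO(text), delimiter=",", dtype="U16")
print(a.shape, a.dtype.names)  # (3, 3) None

# Structured parse: the header row becomes field names, shape is 1-D.
b = np.genfromtxt(io.StringIO(text), delimiter=",", dtype=None,
                  names=True, encoding="utf-8")
print(b.shape, b.dtype.names)  # (2,) ('last', 'first', 'zip')
```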

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow