How to hstack arrays of numpy records?

https://stackoverflow.com/questions/14944275

10-03-2022

Question

[An earlier version of this post had the inaccurate title "How to add one column to an array of numpy records?" The question asked in that earlier title has already been partially answered, but that answer is not quite what the body of the earlier version of this post was asking for. I've reworded the title and edited the post substantially to make the distinction clearer. I also explain why the answer mentioned earlier falls short of what I'm looking for.]

Suppose I have two numpy arrays x and y, each consisting of r "record" (aka "structured") arrays. Let the shape of x be (r, c_x) and the shape of y be (r, c_y). Let's also assume that there's no overlap between x.dtype.names and y.dtype.names.

For example, for r = 2, c_x = 2, and c_y = 1:

import numpy as np
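# list(...) is needed on Python 3, where zip returns an iterator; 'S10' replaces the old 'a10' alias
x = np.array(list(zip((1, 2), (3., 4.))), dtype=[('i', 'i4'), ('f', 'f4')])
y = np.array(list(zip(('a', 'b'))), dtype=[('s', 'S10')])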
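A quick check of what these arrays look like in NumPy's own terms, assuming Python 3 and a reasonably recent NumPy: each array is one-dimensional with r records, and the "columns" counted by c_x and c_y are fields of the dtype rather than a second axis.

```python
import numpy as np

# list(...) is needed on Python 3, where zip returns an iterator
x = np.array(list(zip((1, 2), (3., 4.))), dtype=[('i', 'i4'), ('f', 'f4')])
y = np.array(list(zip(('a', 'b'))), dtype=[('s', 'S10')])

# The shape only counts records; the field names live in the dtype
print(x.shape, x.dtype.names)   # (2,) ('i', 'f')
print(y.shape, y.dtype.names)   # (2,) ('s',)
```

So "horizontal" concatenation here means building a new dtype that holds the fields of both arrays, not stacking along an axis.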

I would like to "horizontally" concatenate x and y to produce a new array of records z, having shape (r, c_x + c_y). This operation should not modify x or y at all.

In general, z = np.hstack((x, y)) won't do, because the dtypes of x and y won't necessarily match. E.g., continuing the example above:

z = np.hstack((x, y))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-def477e6c8bf> in <module>()
----> 1 z = np.hstack((x, y))
TypeError: invalid type promotion

Now, there is a function, numpy.lib.recfunctions.append_fields, that looks like it may do something close to what I'm looking for, but I have not been able to get anything out of it: everything I have tried with it either fails with an error, or produces something other than what I'm trying to get.

Can someone please show me explicitly the code (using n.l.r.append_fields or otherwise¹) that would generate, from the x and y defined in the example above, a new array of records, z, equivalent to the horizontal concatenation of x and y, and do so without modifying either x or y?

I assume that this will require only one or two lines of code. Of course, I am looking for code that does not require building z, record by record, by iterating over x and y. Also, the code may assume that x and y have the same number of records, and that there is no overlap between x.dtype.names and y.dtype.names. Other than this, the code I'm looking for should know nothing about x and y. Ideally, it should be agnostic also about the number of arrays to join. IOW, leaving out error checking, the code I'm looking for could be the body of a function hstack_rec so that the new array z would be the result hstack_rec((x, y)).

¹ ...although I have to admit that, after my so-far perfect record of failure with numpy.lib.recfunctions.append_fields, I've become a bit curious about how this function could be used at all, irrespective of its relevance to this post's question.

Solution

I never use recarrays, and so someone else is going to come up with something slicker, but maybe merge_arrays would work?

>>> import numpy as np
>>> import numpy.lib.recfunctions as nlr
>>> x = np.array(list(zip((1, 2), (3., 4.))), dtype=[('i', 'i4'), ('f', 'f4')])
>>> y = np.array(list(zip(('a', 'b'))), dtype=[('s', 'S10')])
>>> x
array([(1, 3.0), (2, 4.0)], 
      dtype=[('i', '<i4'), ('f', '<f4')])
>>> y
array([('a',), ('b',)], 
      dtype=[('s', '|S10')])
>>> z = nlr.merge_arrays([x, y], flatten=True)
>>> z
array([(1, 3.0, 'a'), (2, 4.0, 'b')], 
      dtype=[('i', '<i4'), ('f', '<f4'), ('s', '|S10')])
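The hstack_rec function described in the question could be sketched along these lines. This is a minimal version assuming Python 3, equal-length inputs, and no overlapping field names; hstack_rec is the question's hypothetical name, not a NumPy function. The last line also shows one way append_fields might be used for the single-column case.

```python
import numpy as np
import numpy.lib.recfunctions as nlr

def hstack_rec(arrays):
    """Join structured arrays of equal length and distinct field names."""
    # flatten=True produces one flat list of fields instead of nested dtypes;
    # usemask=False asks for a plain ndarray rather than a masked array
    return nlr.merge_arrays(list(arrays), flatten=True, usemask=False)

x = np.array([(1, 3.), (2, 4.)], dtype=[('i', 'i4'), ('f', 'f4')])
y = np.array([('a',), ('b',)], dtype=[('s', 'S10')])

z = hstack_rec((x, y))
print(z.dtype.names)   # ('i', 'f', 's')
print(z.shape)         # (2,)

# Single extra column: append_fields(base, name, data) returns a new array
z2 = nlr.append_fields(x, 's', y['s'], usemask=False)
print(z2.dtype.names)  # ('i', 'f', 's')
```

Both calls return new arrays and leave x and y untouched. On Python 3 the 's' field holds bytes (b'a', b'b'), so the printed records will look slightly different from the session above.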

OTHER TIPS

This is a very late answer, but maybe it will be helpful to someone else. I used this solution after asking the same question with most of the same criteria.

It doesn't generate a new numpy array; instead it pairs up the rows with zip and joins each pair with itertools.chain, which was roughly 15 times faster in the timings below. In my case, I needed to access every value of every row in sequential order. Here is a benchmark that simulates this use case:

import numpy
from numpy.lib.recfunctions import merge_arrays
from itertools import chain

a = numpy.empty(3, [("col1", int), ("col2", float)])
b = numpy.empty(3, [("col3", int), ("col4", "U1")])

Results:

%timeit [i for i in (row for row in merge_arrays([a,b], flatten=True))]
52.9 µs ± 2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit [i for i in (row for row in (chain(i,k) for i,k in zip(a,b)))]
3.47 µs ± 52 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
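One caveat when reading the second timing: chain(i, k) is lazy, so that list comprehension collects chain objects without visiting any field values. A sketch of how the joined rows might actually be consumed follows; it uses numpy.zeros instead of numpy.empty so the values are deterministic, which is my substitution and not part of the benchmark.

```python
import numpy as np
from itertools import chain

a = np.zeros(3, [("col1", int), ("col2", float)])
b = np.zeros(3, [("col3", int), ("col4", "U1")])

# Each element of a and b is a record that iterates over its field values;
# tuple(...) forces the lazy chain to be consumed
rows = [tuple(chain(i, k)) for i, k in zip(a, b)]
print(len(rows), len(rows[0]))   # 3 4

# Or visit every value in row order without building any rows
values = list(chain.from_iterable(chain(i, k) for i, k in zip(a, b)))
print(len(values))               # 12
```

Timing a variant like this against merge_arrays would give a fairer comparison for the "access every value" use case.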

Licensed under: CC-BY-SA with attribution
