Merging numpy ndarray from CSVs

https://stackoverflow.com/questions/13105156

14-07-2021

Question

I have the following code:

import os
from numpy import genfromtxt
nysedatafile = os.getcwd() + '/nyse.txt'
nysedata = genfromtxt(nysedatafile, delimiter='\t', names=True, dtype=None)
nasdaqdatafile = os.getcwd() + '/nasdaq.txt'
nasdaqdata = genfromtxt(nasdaqdatafile, delimiter='\t', names=True, dtype=None)

Now I would like to merge the data from the two CSVs, and I have tried various functions.

For example:

import numpy as np
alldata = np.array(np.concatenate((nysedata, nasdaqdata)))
print('NYSE stocks:' + str(nysedata.shape[0]))
print('NASDAQ stocks:' + str(nasdaqdata.shape[0]))
print('ALL stocks:' + str(alldata.shape[0]))

returns:

TypeError: invalid type promotion

I also tried numpy.vstack and wrapping the result in an array call. I expect the last print to show the sum of the row counts of the two CSV files.

EDIT: These commands:

print('NYSE shape:' + str(nysedata.shape))
print('NASDAQ shape:' + str(nasdaqdata.shape))
print('NYSE dtype:' + str(nysedata.dtype))
print('NASDAQ dtype:' + str(nasdaqdata.dtype))

returns:

NYSE shape:(3257,)
NASDAQ shape:(2719,)
NYSE dtype:[('Symbol', 'S14'), ('Name', 'S62'), ('LastSale', 'S9'), ('MarketCap', '<f8'), ('ADR_TSO', 'S3'), ('IPOyear', 'S4'), ('Sector', 'S21'), ('industry', 'S62'), ('Summary_Quote', 'S38')]
NASDAQ dtype:[('Symbol', 'S14'), ('Name', 'S62'), ('LastSale', 'S7'), ('MarketCap', '<f8'), ('ADR_TSO', 'S3'), ('IPOyear', 'S4'), ('Sector', 'S21'), ('industry', 'S62'), ('Summary_Quote', 'S34')]

Solution

np.vstack (or np.concatenate) raises an error because the dtypes of the two arrays do not match.

Notice the very last field: ('Summary_Quote', 'S38') versus ('Summary_Quote', 'S34'). nysedata's Summary_Quote column is 38 bytes wide, while nasdaqdata's column is only 34 bytes wide. (Edit: The LastSale column has the same problem: 'S9' versus 'S7'.)

This happened because genfromtxt guesses the dtype of each column when dtype=None is passed. For string columns, genfromtxt uses the minimum number of bytes needed to hold the longest string in that column, so the width depends on the contents of each file.
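The width inference can be reproduced on a small scale. The sketch below is only an illustration: the two inline tables, their symbols, and the example.com URLs are made up, and encoding='utf-8' is passed so the string columns come back as unicode ('U') rather than the bytes ('S') fields shown in the question.

```python
import io
import numpy as np

# Made-up stand-ins for nyse.txt and nasdaq.txt (tab-separated, with a header row).
nyse_text = (
    "Symbol\tMarketCap\tSummary_Quote\n"
    "AAA\t1.5\thttp://example.com/aaaa\n"
    "BB\t2.5\thttp://example.com/bb\n"
)
nasdaq_text = (
    "Symbol\tMarketCap\tSummary_Quote\n"
    "CCC\t3.5\thttp://example.com/c\n"
    "DDD\t4.0\thttp://example.com/d\n"
)

# Same call as in the question, reading from in-memory text instead of files.
nyse = np.genfromtxt(io.StringIO(nyse_text), delimiter='\t',
                     names=True, dtype=None, encoding='utf-8')
nasdaq = np.genfromtxt(io.StringIO(nasdaq_text), delimiter='\t',
                       names=True, dtype=None, encoding='utf-8')

# Each string field is sized to the longest value seen in its own file,
# so the two structured dtypes differ even though the field names agree.
print(nyse.dtype['Summary_Quote'], nasdaq.dtype['Summary_Quote'])
print(nyse.dtype == nasdaq.dtype)
```

The field names match, but the Summary_Quote widths do not, which is the same kind of mismatch as 'S38' versus 'S34' above.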

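So to stack the two arrays, the one with the narrower string fields has to be cast to the wider dtype. Here nysedata's fields are at least as wide as nasdaqdata's in every column, so nasdaqdata is cast to nysedata's dtype:

import numpy.lib.recfunctions as recfunctions
recfunctions.stack_arrays([nysedata, nasdaqdata.astype(nysedata.dtype)], usemask=False)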

(My previous answer used np.vstack. This results in a 2-dimensional array of shape (N,1). recfunctions.stack_arrays returns a 1-dimensional array of shape (N,). Since nysedata and nasdaqdata are 1-dimensional, I think it is better to return a 1-dimensional array too.)
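Casting to nysedata.dtype is safe here only because nysedata happens to be the wider array in every field. If each array were wider in a different column, casting one to the other's dtype would truncate strings. A minimal sketch of one way around that follows. It assumes both arrays have the same field names in the same order and fields of the same kind; widest_dtype is a hypothetical helper, not a NumPy function, and the two one-row arrays are made-up data.

```python
import numpy as np

def widest_dtype(a, b):
    """Hypothetical helper: for each field, keep whichever dtype has the larger itemsize.

    Assumes a and b share field names, order, and field kinds.
    """
    fields = []
    for name in a.dtype.names:
        da, db = a.dtype[name], b.dtype[name]
        fields.append((name, da if da.itemsize >= db.itemsize else db))
    return np.dtype(fields)

# Made-up data: a is wider in Summary_Quote, b is wider in Symbol.
a = np.array([(b'AAA', 1.0, b'long-url')],
             dtype=[('Symbol', 'S3'), ('MarketCap', '<f8'), ('Summary_Quote', 'S8')])
b = np.array([(b'BBBBB', 2.0, b'url')],
             dtype=[('Symbol', 'S5'), ('MarketCap', '<f8'), ('Summary_Quote', 'S3')])

common = widest_dtype(a, b)

# Once both arrays share one dtype, a plain concatenate gives a 1-dimensional result.
merged = np.concatenate([a.astype(common), b.astype(common)])
print(merged.dtype)
print(merged)
```

Because both arrays are cast to the common dtype before stacking, no string is shortened in either direction.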

Possibly an easier solution would be to concatenate the two csv files first and then call genfromtxt:

import numpy as np
import os

cwd = os.getcwd()    
nysedatafile = os.path.join(cwd, 'nyse.txt')
nasdaqdatafile = os.path.join(cwd, 'nasdaq.txt')
alldatafile = os.path.join(cwd, 'all.txt')
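with open(nysedatafile) as f1, open(nasdaqdatafile) as f2, open(alldatafile, 'w') as g:
    # Copy the first file in full, including its header row.
    for line in f1:
        g.write(line)
    # Skip the second file's header so it is not read as a data row.
    next(f2)
    for line in f2:
        g.write(line)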

alldata = np.genfromtxt(alldatafile, delimiter='\t', names=True, dtype=None)
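The same idea can be tried without writing all.txt, since genfromtxt also accepts a list of lines. This is a variation on the approach above rather than the code from the answer: the in-memory files and their contents are made up, and encoding='utf-8' is an added assumption so the strings are read as unicode.

```python
import io
import numpy as np

# Made-up stand-ins for nyse.txt and nasdaq.txt.
nyse_file = io.StringIO(
    "Symbol\tMarketCap\tSummary_Quote\n"
    "AAA\t1.5\thttp://example.com/aaaa\n"
    "BB\t2.5\thttp://example.com/bb\n"
)
nasdaq_file = io.StringIO(
    "Symbol\tMarketCap\tSummary_Quote\n"
    "CCC\t3.5\thttp://example.com/c\n"
    "DDDD\t4.0\thttp://example.com/d\n"
)

lines = list(nyse_file)           # header plus all NYSE rows
lines += list(nasdaq_file)[1:]    # NASDAQ rows only, header dropped

# A single genfromtxt call sees every row, so each string field is sized once
# for the combined data and there is only one dtype to deal with.
alldata = np.genfromtxt(lines, delimiter='\t', names=True,
                        dtype=None, encoding='utf-8')
print(alldata.shape)
print(alldata.dtype)
```

Since the column widths are inferred from the combined rows, no casting step is needed afterwards.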

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow