Question

I am learning "large" data set calculations using Matlab. I have a txt file consisting of every trade made for a stock called MTB. My goal is to turn this tick data into daily data. For example, over 15,000 transactions took place on the first day; my program turns that tick data into the open, high, low, close, total volume, and net transaction for each day.

My questions: Can you help me make the code faster? And do you have any practical techniques to verify the calculations, since they are made on such a large data set?

My program took 20.7757 seconds, and I got the following warning, which I don't understand:

Warning: 'rows' flag is ignored for cell arrays.
  In cell.unique at 32
  In ex5 at 16
Warning: 'rows' flag is ignored for cell arrays.
  In cell.unique at 32
  In ex5 at 17

%DESCRIPTION: Turn tick data into daily data
%INPUT: stock tick data(tradeDate,tradeTime,open,high,low,
%close,upVol,downVol)
%OUTPUT: openDay,highDay,lowDay,closeDay,volumeDay,netTransaction
%net transaction traded = sum(price*upVol - price*downVol)

clear;
startTime=tic;
%load data from MTB_db2
[tradeDate, tradeTime,open,high,low,close,upVol,downVol]=textread('MTB_db2.txt','%s   %u    %f %f %f %f %f %f','delimiter',',');


%begIdx:Index the first trade for the day from tick database and
%endIdx:index for the last trade for that day
[dailyDate begIdx]=unique(tradeDate,'rows','first');
[dailyDate2 endIdx]=unique(tradeDate,'rows','last');

%the number of daily elements, useful for the loop.
n=numel(dailyDate);

%initialize arrays
highDay=[];
lowDay=[];openDay=[];closeDay=[];
volumeDay=[];netTransaction=[];
priceChange(1)=NaN; mfChange(1)=NaN;

%loop: bottleneck is here!!
for j=1:n
    openDay(j)=open(begIdx(j));
    closeDay(j)=close(endIdx(j));
    highDay(j)=max(high(begIdx(j):endIdx(j)));
    lowDay(j)=min(low(begIdx(j):endIdx(j)));
    volumeDay(j)=sum(upVol(begIdx(j):endIdx(j)))+sum(downVol(begIdx(j):endIdx(j)));
  cumSum=0;
  for i=begIdx(j):endIdx(j)
  cumSum=cumSum+close(i)*(upVol(i)-downVol(i));
  end
  netTransaction(j)=cumSum;
end

elapsedTimeNonVectorized=toc(startTime)

Solution 2

For the code, I would replace the internal for loop

cumSum=0;
for i=begIdx(j):endIdx(j)
    cumSum=cumSum+close(i)*(upVol(i)-downVol(i));
end
netTransaction(j)=cumSum;

by

cs = cumsum( close(begIdx(j):endIdx(j)) .* ...
    (upVol(begIdx(j):endIdx(j)) - downVol(begIdx(j):endIdx(j))) );
netTransaction(j) = cs(end);

which is slightly faster. (Since only the last element cs(end) is used, replacing cumsum with a plain sum over the same expression gives the identical result without storing the intermediate vector.)

You can also pre-allocate your variables with highDay=zeros(1,n); or highDay(1,n)=0;.

I am not sure you can vectorize it much more, since the algorithm already relies on built-in functions and the number of transactions per day is not constant. One option left is parallelization (e.g. a parfor loop), since your daily computations are independent of each other.
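That said, the per-day loop can in principle be removed entirely by grouping on the day boundaries. Here is a sketch of that grouping idea in Python/NumPy (the data values are made up for illustration, and the logic mirrors the MATLAB code): first-occurrence indices mark each day's start, and `reduceat` aggregates each day's segment in a single call.

```python
import numpy as np

# Made-up tick data, sorted by date (as the textread output is).
dates   = np.array(["d1", "d1", "d1", "d2", "d2"])
open_   = np.array([10.0, 11.0, 10.5, 12.0, 12.5])
high    = open_ + 0.5
low     = open_ - 0.5
close   = open_          # illustrative only
upVol   = np.array([100.0,  0.0, 50.0, 200.0,   0.0])
downVol = np.array([  0.0, 80.0,  0.0,   0.0, 150.0])

# First tick of each day (mirrors unique(tradeDate,'first')).
_, begIdx = np.unique(dates, return_index=True)
# Last tick of each day: one before the next day's start, plus the final tick.
endIdx = np.append(begIdx[1:], len(dates)) - 1

openDay        = open_[begIdx]
closeDay       = close[endIdx]
highDay        = np.maximum.reduceat(high, begIdx)          # per-day max
lowDay         = np.minimum.reduceat(low, begIdx)           # per-day min
volumeDay      = np.add.reduceat(upVol + downVol, begIdx)   # per-day sum
netTransaction = np.add.reduceat(close * (upVol - downVol), begIdx)
```

In MATLAB itself, accumarray with per-day group subscripts plays a similar role to reduceat here, computing grouped sums, maxima, and minima without an explicit loop.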

For the test part, you can

  • work on fake data, the desired output being known in advance (e.g. data consisting only of ones, or of 1:n);
  • program a second algorithm yourself with a different technique and compare the results;
  • ask a colleague to kindly program the algorithm in the language of their choice (the best option if reliability is fundamental).
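The fake-data check can be made concrete. In the sketch below (Python/NumPy, made-up data), every price is 1 and every tick contributes exactly one unit of up-volume, so both the daily volume and the daily net transaction must equal that day's tick count, an answer you can verify by eye:

```python
import numpy as np

# Fake ticks: 3 on day 1, 2 on day 2; price 1 everywhere, one unit of
# up-volume per tick, no down-volume.
dates   = np.array(["d1", "d1", "d1", "d2", "d2"])
close   = np.ones(5)
upVol   = np.ones(5)
downVol = np.zeros(5)

_, begIdx = np.unique(dates, return_index=True)
volumeDay      = np.add.reduceat(upVol + downVol, begIdx)
netTransaction = np.add.reduceat(close * (upVol - downVol), begIdx)

# Both must equal the per-day tick counts [3, 2].
assert volumeDay.tolist() == [3.0, 2.0]
assert netTransaction.tolist() == [3.0, 2.0]
```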

OTHER TIPS

Simply read the documentation for unique.

The 'rows' option does not support cell arrays.

Your input is a cell array, so you cannot use the 'rows' flag. Since each element of tradeDate is a single string, row-wise uniqueness is not needed anyway: if the output matches your expectation, remove 'rows' (e.g. [dailyDate, begIdx] = unique(tradeDate, 'first');) and everything is fine.

Licensed under: CC-BY-SA with attribution