Question

Would be grateful for some pointers. I am reading about 1M rows of data and it is taking almost 24 hours with the following code. How can I improve the execution time?

The array Day contains the value of the nth day from the start and there are more then one record for a particular day. The program checks if a particular id (stored in unique_id) is repeated within 180 days.

%// calculating the number of repeats within 180 days
fid2 = 'data_050913/Unique_id_repeat_count1.xlsx';
fid1 = 'data_050913/data_050913_2000.csv';

fid_data = fopen(fid1);
data     = fgetl(fid_data); %// the first line, title line
ep       = 0; %// position point number

while 1
    data = fgetl(fid_data);
    if(length(data)<10)
      break;
    end
    ep = ep+1;
    id = find(data == ',');
    unique_id(ep) = str2num(data(1:id(1)-1)); 
    day(ep) = str2num(data(id(8)+1:id(9)-1)); 
end

repeat = zeros(ep,1);

tic
i = 1; 
count = 0;
while i <= ep
    j = i+1;
    while ( (j<=ep) && (day(j)<= day(i)+179) )
        if unique_id(i) == unique_id(j)
           count = 1;
           break;
        end
        j = j+1;
    end

    repeat(i,1) = count;

    count = 0;
    i = i+1;
end
toc

i = 1;
k = 1;
while i<=ep
    count = repeat(i,1);
    j=i;
    while (day(j) == day(i))
        count = repeat(j,1)+count;
        j = j+1;
        if j > ep
            break;
        end
    end

    day_final(k,1)= day(i);
    repeat_final(k,1) = count;

    k = k+1;
    i = j;
end

xlswrite(fid2,day_final,'Repeat_Count','A2');
xlswrite(fid2,repeat_final,'Repeat_Count','B2');

Thanks

Was it helpful?

Solution 2

The code below runs about 200x faster than your original code and gives the same result.

Of course, the speed-up is dependant on the distribution of the input data and my assumptions may be incorrect (I have 1000 unique IDs and on average 19 records per day).

I've also written some code to generate data similar to what I believe your input data is.

% Generate Input data 
ep = 100000;

isRepeatedDay = rand(1,ep) < 0.95;
day = cumsum(~isRepeatedDay);
unique_ids = 1:1000;
unique_id_indices = round(rand(ep,1)*length(unique_ids));
unique_id_indices(unique_id_indices < 1) = 1;
unique_id_indices(unique_id_indices > length(unique_id_indices) ) = length(unique_id_indices);

unique_id = unique_ids(unique_id_indices);

%Process the input data to find repeats
tic
repeat = zeros(ep,1);
[unique_values,~,indices] = unique(unique_id);
for uv_index = 1:length(unique_values)
    uv = unique_values(uv_index);
    uv_indices = find(indices == uv_index);
    for i=1:length(uv_indices)-1
        daysDifference = day(uv_indices(i+1)) - day(uv_indices(i));
        if daysDifference <= 179
            repeat(uv_indices(i),1) = 1;
        end
    end
end
toc

OTHER TIPS

if not already doing this, ensure you allocate all memory up-front where possible. I've seen Matlab scripts go from 24-hours to 8 minutes by doing this.

Use the zeros function to preallocate memory for all growing arrays (day, unique_id, repeat, day_final and repeat_final).

x = zeros(1000); %// Creates a 1000 element array of all zeros

Here is the way I would do it if unique_id can have many different values (and maybe even if it doesn't).

The operation takes under 5 seconds on my system:

x = round(rand(1000000,1)*10);
result = zeros(size(x));
windowsize = 180;
for t = 1:(numel(x)-windowsize)
    result(t) = sum(x(t+1:t+windowsize)==x(t));
end

I think this is what you need, make sure to check whether you want to look 'forward' or 'backward'.

  • You can check which line needs most time with MATLAB's profiler.
  • One thing you should know: MATLAB ist fast with vectors and matrices but it's usually very slow in loops. You should always try to work with the whole vector (or matrice) if possible. There are various ways to do this.
  • One way is logical indexing. I think this should work for a part of your Problem. At

first I will show you a little example how logical indexing works:

vector=[0 4 5 2 4]

logicalIndex=(vector==4) %the type of logicalIndex is bool!

excerpOfVector=vector(logicalIndex) %some other ways to use logial Indexing 
excerpOfVectorSecondVariation=zeros(1,length(vector)) 
excerpOfVectorSecondVariation(logicalIndex)=vector(logicalIndex)
vector(vector < 5) = 11;            %implicit use of logical indexing
  • You should change your messy while loops into for loops! ;-)
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top