Word search algorithm using an m-file

Question

Suppose your data file is called data.txt and its content is:

string1 string2 string3 string4
string2 string3 
string4 string5 string6

A very easy way to keep a single occurrence of each string is:

% Parse everything in one go
fid = fopen('C:\Users\ok1011\Desktop\data.txt');
out = textscan(fid,'%s');
fclose(fid);

unique(out{1})
ans = 
    'string1'
    'string2'
    'string3'
    'string4'
    'string5'
    'string6'

As already mentioned, this approach might not work if:

your data file has irregularities
you actually need the comparison indices
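
If the comparison indices are needed, one option is to request the extra outputs of unique. The following is a minimal sketch, assuming a reasonably recent MATLAB release (the 'stable' option and first-occurrence indices are not available in old versions). The variable words is a hypothetical stand-in for out{1}:

```matlab
% Hypothetical stand-in for the parsed words (out{1} in the code above)
words = {'string1','string2','string3','string4', ...
         'string2','string3', ...
         'string4','string5','string6'};

% 'stable' keeps the order of first appearance instead of sorting
[u, ia, ic] = unique(words, 'stable');

% ia: position in words of the first occurrence of each unique string
% ic: for every word, its position in u, so that u(ic) reproduces words
firstPos = ia(:).';
rebuilt  = u(ic(:).');
```

Here firstPos tells you where each unique string was first seen, and ic maps every original word back to its entry in u.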

EDIT: solution for performance

% Parse in bulk and split (assuming you don't know the maximum
% number of strings per line; otherwise you can use textscan alone)

fid = fopen('C:\Users\ok1011\Desktop\data.txt');
out = textscan(fid,'%s','Delimiter','\n');
out = regexp(out{1},' ','split');
fclose(fid);

% Unique strings across all lines
comb = unique([out{:}]);
% Drop empty strings (e.g. produced by trailing spaces)
comb(cellfun('isempty',comb)) = [];

% Preallocate idx
m   = size(out,1);
idx = false(m,size(comb,2));

% Loop over the lines (rows)
for ii = 1:m
    idx(ii,:) = ismember(comb,out{ii});
end

The resulting idx, with one row per line of the file and one column per entry of comb, is:

idx =
     1     1     1     1     0     0
     0     1     1     0     0     0
     0     0     0     1     1     1
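
As a short illustration of how this matrix can be queried, the sketch below rebuilds comb and idx by hand from the values shown above (so it runs without the data file) and looks up strings by line and lines by string. The names line2 and rows are only illustrative:

```matlab
% Values copied from the example above
comb = {'string1','string2','string3','string4','string5','string6'};
idx  = logical([1 1 1 1 0 0
                0 1 1 0 0 0
                0 0 0 1 1 1]);

% Strings that appear on line 2 of the file
line2 = comb(idx(2,:));

% Lines of the file that contain 'string4'
rows = find(idx(:, strcmp(comb,'string4')));
```

Indexing comb with a row of idx gives the words on that line; indexing a column of idx gives the lines on which a word occurs.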

The advantage of keeping the result in this form is that it takes less space than a cell array (which imposes 112 bytes of overhead per cell). You can also store it as a sparse array, which may reduce storage further when most entries are false.
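
Whether sparse storage pays off depends on the data, so it is worth measuring. A rough sketch, assuming a hypothetical membership matrix far larger and emptier than the 3-by-6 example (for a matrix that small, the sparse copy can actually be bigger):

```matlab
% Hypothetical, mostly-false membership matrix: 1000 lines, 5000 unique strings
big = false(1000, 5000);
big(1:1000, 1:3) = true;      % pretend every line holds the same 3 strings

S = sparse(big);              % sparse logical copy of the same data

d = whos('big');
s = whos('S');
fprintf('full: %d bytes, sparse: %d bytes\n', d.bytes, s.bytes);

% Row indexing works the same way on the sparse matrix
sameRow = isequal(full(S(1,:)), big(1,:));
```

The byte counts printed by whos depend on the MATLAB version and on how many entries are true, so treat them as a measurement rather than a guarantee.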

Another thing to note is that a logical index may be longer than the array it indexes (e.g. a double array), as long as the excess elements are all false (and by construction of the problem above, idx satisfies this requirement). An example to clarify:

A = 1:3;
A([true false true false false])
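
To make the behaviour explicit, here is a small sketch of both cases: the indexing above, which should select the first and third elements, and a variant with a true element past the end of A, which should raise an error. The variable names B, C and failed are only illustrative:

```matlab
A = 1:3;

% Extra trailing elements are all false: allowed, selects A(1) and A(3)
B = A([true false true false false]);

% A true element beyond numel(A) is out of range and raises an error
failed = false;
try
    C = A([true false true false true]);
catch
    failed = true;
end
```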