Question

The purpose of this feature is to check whether 2 or 3 URLs are hidden within 1 URL; if yes, return 1, else return 0. E.g. www.applee.com/www.samsunge.com, http://www.samsungds.http://comwww.samsung.com
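
The idea can be stated as a minimal sketch (an assumed heuristic, not the asker's final method): count the scheme markers `://` and `www.` host prefixes, and flag the URL when more than one of either appears.

```matlab
% Minimal sketch (assumed heuristic): a URL is "double" when it carries
% more than one scheme marker '://' or more than one 'www.' host prefix.
url = 'http://www.applee.com/www.samsunge.com';
n_schemes = numel(strfind(url, '://'));   % 1 in this example
n_www     = numel(strfind(url, 'www.'));  % 2 in this example
is_double = (n_schemes > 1) || (n_www > 1);   % true here
```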

I have solved the importdata problem, but now I am having difficulty checking the data below. (I have modified the 'is_double_url.m' file, but it returns an error.)

http://encuestanavemotors.com.ar/doc/newgoogledoc2013/2013gdocs/
http://totalwhiteboard.com.au/.pp/0053d4ae3e2c78154d29d413c1236341/192.186.237.145/H/
http://www.wwwwwwwwwws2.com/
http://www.paypal.com.cy.cgi.bin.webscr.cmd.login.submit.dispatch.5885d80a1faee8d48a116ba977951b3435308b8c4.turningpoint.in/f044c94b4394939f4a1a75798875f78c/
http://www.celebramania.cl/web/cc/personal/cards/5d0d5c5af4f12c319d47872fabe11262/Pool=0/?cmd=_home&dispatch=5885d80a13c0db1f8e&ee=5cd428ee24c5037dda298a4762735a94
http://joannalindsay.com/wp-content/uploads/aloo/aaleor.php?bidderblocklogin&hc=1&hm=uk%601d72f%2Bj2b2vi%3C265bidderblocklogin&hc=1&hm=uk%601d72f%2Bj2b2vi%3C265bidderblocklogin&hc=1&hm=uk%601d72f%2Bj2b2vi%3C265
http://bluedominoes.com/~kosalbco/paypal.de/

is_double_url.m file

function out = is_double_url(url_path1)

f1 = strfind(url_path1,'www.');
if isempty(f1)
    out = 0;
    return;
end
f2 = strfind(url_path1,'/');
f3 = bsxfun(@minus,f2,f1');

count_dots = zeros(size(f3,1),1);
for k = 1:size(f3,1)
    [x,y] = find(f3(k,:)>0,1);
    str2 = url_path1(f1(k):f2(y));
    if ~isempty(strfind(str2,'..'))
        continue
    end
    count_dots(k) = nnz(strfind(str2,'.'));
end
out = ~any(count_dots(2:end)<2);

if any(strfind(url_path1,'://')>f2(1))
    out = true;
end

return;

f10.m file

data = importdata('url');
[sizeData b] = size(data);

for i = 1:sizeData
    feature10(i) = is_double_url(data{i});
end

Solution

Code

function out = is_double_url(url_path1)

% Ensure the URL ends with '/' so the last segment is delimited
if url_path1(end)~='/'
    url_path1(end+1)='/';
end

% Normalize: insert 'www.' after every '//', then collapse any
% accidental 'www.www.' doubling back to a single 'www.'
url_path1 = regexprep(url_path1,'//','//www.');
url_path1 = regexprep(url_path1,'//www.www.','//www.');

f1 = strfind(url_path1,'www.');
if numel(f1)<2
    % Fewer than two 'www.' markers: cannot be a double URL
    out = false;
else
    f2 = strfind(url_path1,'/');
    f3 = bsxfun(@minus,f2,f1');   % f3(k,j): distance from k-th 'www.' to j-th '/'

    count_dots = zeros(size(f3,1),1);
    for k = 1:size(f3,1)
        [~,y] = find(f3(k,:)>0,1);        % first '/' after the k-th 'www.'
        str2 = url_path1(f1(k):f2(y));    % candidate host substring
        if ~isempty(strfind(str2,'..'))   % skip degenerate hosts with '..'
            continue
        end
        count_dots(k) = nnz(strfind(str2,'.'));
    end
    % Double URL if every extra 'www.' segment has at least 2 dots
    out = ~any(count_dots(2:end)<2);

    % A second scheme ('://') after the first '/' also marks a double URL
    if any(strfind(url_path1,'://')>f2(1))
        out = true;
    end
end

return;
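
The two regexprep calls at the top normalize the input before any counting: every `//` gets a `www.` appended, and the accidental doublings this creates are then collapsed. A small demo with a made-up URL:

```matlab
% Normalization demo for the two regexprep calls (example.com is a
% placeholder domain, not from the original data)
s = 'http://example.com/https://www.apple.com/';
s = regexprep(s, '//', '//www.');
% s is now 'http://www.example.com/https://www.www.apple.com/'
s = regexprep(s, '//www.www.', '//www.');
% s is now 'http://www.example.com/https://www.apple.com/'
```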

Runs

is_double_url('http://www.farthingalescorsetmakingsupplies.com/files/files/www.apple.com/')
is_double_url('http://www.farthingalescorsetmakingsupplies.com/files/files/www.com/')
is_double_url('http://www.farthingalescorsetmakingsupplies.com/files/files/https://www.com/')
is_double_url('http://www.farthingalescorsetmakingsupplies.com/files/files/https://www.dfdsf.my/')

These return 1, 0, 1, and 1, respectively.

If you have a list of URLs in a text file, use this to check each of them -

fid = fopen('text2.txt');   % 'text2.txt' has the URLs on a line-by-line basis
C = textscan(fid, '%s\n');
fclose(fid);

for k = 1:numel(C{1})
    out(k) = is_double_url(C{1}{k});   % out stores the checked statuses
end
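
The same loop can also be written without the explicit `for`, using cellfun over the cell array that textscan returns (a sketch, assuming `is_double_url` is on the path):

```matlab
% One-liner alternative: apply is_double_url to every URL string in C{1}
out = cellfun(@is_double_url, C{1});
```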
Licensed under: CC-BY-SA with attribution