Question

Can MATLAB strip the path from a URL and leave only the domain part? Is there a built-in function that removes everything after the domain?

For example:

 input  :http://www.mathworks.com/help/images/removing-noise-from-images.html
 output :http://www.mathworks.com

Solution

This regexp pattern should do the trick:

>> str = 'http://www.mathworks.com/help/images/removing-noise-from-images.html';
>> out = regexp(str,'\w*://[^/]*','match','once')
out = 
    'http://www.mathworks.com'

The search pattern '\w*://[^/]*' says: look for a string that starts with some "word" characters ('\w*') corresponding to the protocol (e.g. http, https, rtsp), followed by the ubiquitous ://, and then any number of characters that are not a forward slash ('[^/]*').

Edit: The 'once' option returns only the first match, as a character vector rather than nested inside a cell array.
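
As a rough illustration of how this pattern behaves on more than one input, here is a minimal sketch. The URLs and the variable names (urls, pat, noMatch) are made up for the example and are not part of the original answer. It shows that an input without a protocol does not match at all, which is what motivates the updates below.

```matlab
% Apply the protocol-plus-host pattern to several inputs at once.
% The URLs below are illustrative only.
urls = {'https://www.mathworks.com/help/matlab/';
        'rtsp://example.com/stream';
        'google.com/voice'};            % no protocol

pat = '\w*://[^/]*';                    % protocol, '://', then everything up to the next '/'
out = regexp(urls, pat, 'match', 'once');

% Inputs without '://' yield an empty result rather than an error.
noMatch = cellfun(@isempty, out);
```

With a cell array as input, regexp returns a cell array of the same size, so the empty entries can be filtered out or handled separately.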


UPDATE: just the hostname, allowing inputs with no protocol.

>> str = {'http://www.mathworks.com/help/images/removing-noise-from-images.html';
          'https://www.mathworks.com/help/matlab/ref/strcmpi@dfvfv.html';
          'google.com/voice'};
>> out = regexp(str,'([^/]*)(?=/[^/])','match','once')
out = 
    'www.mathworks.com'
    'www.mathworks.com'
    'google.com'

UPDATE 2: regexp madness!

>> str = {'http://www.mathworks.com/help/images/removing-noise-from-images.html';
          'https://www.mathworks.com/help/matlab/ref/strcmpi@dfvfv.html';
          'google.com/voice';
          'http://monkey.org/';
          'stackoverflow.com/';
          'meta.stackoverflow.com'};
>> out = regexp(str,'.*?[^/](?=(/([^/]|$)|$))','match','once')
out = 
    'http://www.mathworks.com'
    'https://www.mathworks.com'
    'google.com'
    'http://monkey.org'
    'stackoverflow.com'
    'meta.stackoverflow.com'

% hostname.m
function hostnames = hostname(str)

hostnames = regexp(str,'.*?[^/](?=(/([^/]|$)|$))','match','once');
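
The pattern used in hostname.m is dense, so the sketch below splits it into commented pieces and runs it on two of the inputs from above. The final regexprep step is an optional extra that is not in the original answer: it assumes you want the bare hostname without the protocol. The variable names are hypothetical.

```matlab
% Same pattern as hostname.m, annotated piece by piece:
%   .*?[^/]      shortest run of characters that ends in a non-slash
%   (?= ... )    lookahead: what follows is checked but not included in the match
%   /([^/]|$)    a single '/' followed by a non-slash or by the end of the string
%   |$           or the end of the string itself
pat = '.*?[^/](?=(/([^/]|$)|$))';

urls  = {'http://monkey.org/'; 'meta.stackoverflow.com'};
hosts = regexp(urls, pat, 'match', 'once');   % {'http://monkey.org'; 'meta.stackoverflow.com'}

% Optional extra step (an assumption, not in the original answer):
% drop a leading protocol, if there is one.
bare = regexprep(hosts, '^\w+://', '');       % {'monkey.org'; 'meta.stackoverflow.com'}
```

The lookahead is what stops the match at the first single slash while skipping over the double slash in '://'.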

OTHER TIPS

Code:

function output_url = domain_name(input_url)

c1 = strfind(input_url,'//');
ind1 = strfind(input_url,'/');

if isempty(c1) && isempty(ind1) 
    output_url = input_url; % For case like - www.mathworks.com
    return;
end

if ~isempty(c1)
    if numel(ind1)>2
        output_url = input_url(1:ind1(3)-1); % For cases like - http://www.mathworks.com/ or http://www.mathworks.com/something/
    else
        output_url = input_url; % For case like - http://www.mathworks.com
    end
else
    output_url = input_url(1:ind1(1)-1); % For cases like - www.mathworks.com/ or www.mathworks.com/something/
end

return;

Example runs:

%% Long URLs with extensions
disp(domain_name('www.mathworks.com/help/images/removing-noise-from-images.html'))
disp(domain_name('http://www.mathworks.com/help/images/removing-noise-from-images.html'))

%% Short URLs without HTTP://
disp(domain_name('www.mathworks.com'))
disp(domain_name('www.mathworks.com/'))

%% Short URLs with HTTP://
disp(domain_name('http://www.mathworks.com'))
disp(domain_name('http://www.mathworks.com/'))

Output:

www.mathworks.com
http://www.mathworks.com
www.mathworks.com
www.mathworks.com
http://www.mathworks.com
http://www.mathworks.com

An alternative, and probably more efficient, method would be to use regexp, but I evidently prefer working with numeric indices.

Edit 1: If you want to process a bunch of URLs at the same time, you can use a cell array. The output is then a cell array too. The following MATLAB script gives a feel for it:

% Input
in_urls_cell = {'http://mathworks.com/', 'mathworks.com/help/matlab/ref/strcmpi.html', 'mathworks.com/help/matlab/ref/strcmpi@dfvfv.html'};

% Get domain name
out_urls_cell = cell(size(in_urls_cell));
for count = 1:numel(in_urls_cell)
    out_urls_cell{count} = domain_name(in_urls_cell{count});
end

% Display only domain name
for count = 1:numel(out_urls_cell)
    disp(out_urls_cell{count});
end

The script above prints:

http://mathworks.com
mathworks.com
mathworks.com
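
The cell-array loop in Edit 1 could also be written with cellfun. This is only a sketch of that variant: it assumes domain_name.m from this answer is on the MATLAB path, and it uses a shortened input list.

```matlab
% Sketch: apply domain_name to every URL in a cell array without an explicit loop.
% Assumes domain_name.m (defined earlier in this answer) is on the MATLAB path.
in_urls_cell = {'http://mathworks.com/', ...
                'mathworks.com/help/matlab/ref/strcmpi.html'};

% 'UniformOutput', false makes cellfun return a cell array the same size as the input.
out_urls_cell = cellfun(@domain_name, in_urls_cell, 'UniformOutput', false);

% Print one domain per line; fprintf reuses the format for each argument.
fprintf('%s\n', out_urls_cell{:});
```

Whether this is faster than the loop depends on the input size; the main gain is brevity.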
Licensed under: CC-BY-SA with attribution