Question

I need to extract only parts of a URL with PHP but I am struggling to the set point where the extraction should stop. I used a regex to extract the entire URL from a longer string like this:

$regex = '/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i';
preg_match_all($regex, $href, $matches);

The result is the following string:

http://www.cambridgeenglish.org/test-your-english/&sa=U&ei=a4rbU8agB-zY0QWS_IGYDw&ved=0CFEQFjAL&usg=AFQjCNGU4FMUPB2ZuVM45OoqQ39rJbfveg

Now I want to extract only this bit http://www.cambridgeenglish.org/test-your-english/. I basically need to get rid off everything starting at &amp onwards.

Anyone an idea how to achieve this? Do I need to run another regex or can I add it to the initial one?

Était-ce utile?

La solution

The below regex would get ridoff everything after the string &amp. Your php code would be,

<?php
echo preg_replace('~&amp.*$~', '', 'http://www.cambridgeenglish.org/test-your-english/&amp;sa=U&amp;ei=a4rbU8agB-zY0QWS_IGYDw&amp;ved=0CFEQFjAL&amp;usg=AFQjCNGU4FMUPB2ZuVM45OoqQ39rJbfveg');
?> //=> http://www.cambridgeenglish.org/test-your-english/

Explanation:

  • &amp Matches the string &amp.
  • .* Matches any character zero or more times.
  • $ End of the line.

Autres conseils

I would suggest you abandon regex and let PHP's own parse_url function do this for you:

http://php.net/manual/en/function.parse-url.php

$parsed = parse_url($url);
$my_url = $parsed['scheme'] . '://' . $parsed['hostname'] . $parsed['path'];

to get the substring of the path up to the &amp, try:

$parsed = parse_url($url);
$my_url = $parsed['scheme'] . '://' . $parsed['hostname'] . substr($parsed['path'], 0, strpos($parsed['path'],'&amp'));
Licencié sous: CC-BY-SA avec attribution
Non affilié à StackOverflow
scroll top