Question

I'm receiving a string from the Wikipedia API which looks like this:

{{Wikibooks|Wikijunior:Countries A-Z|France}} {{Sister project links|France}} * [http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]] * [http://ucblibraries.colorado.edu/govpubs/for/france.htm France] at ''UCB Libraries GovPubs'' *{{dmoz|Regional/Europe/France}} * [http://www.britannica.com/EBchecked/topic/215768/France France] ''Encyclopædia Britannica'' entry * [http://europa.eu/about-eu/countries/member-countries/france/index_en.htm France] at the [[European Union|EU]] *{{Wikiatlas|France}} *{{osmrelation-inline|1403916}} * [http://www.ifs.du.edu/ifs/frm_CountryProfile.aspx?Country=FR Key Development Forecasts for France] from [[International Futures]] ;Economy *{{INSEE|National Institute of Statistics and Economic Studies}} * [http://stats.oecd.org/Index.aspx?QueryId=14594 OECD France statistics] 

I need both the actual URLs and the description of each URL. So, for example, for [http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]] I need "http://www.bbc.co.uk/news/world-europe-17298730" and also "France] from the [[BBC News]] ", but without the brackets, like so: "France from the BBC News".

I managed to get the first parts, by doing the following:

// Capture each external URL: everything from "http" up to the first whitespace.
if (preg_match_all('/\[(http.*?)\s/', $result, $extmatch)) {
    $mt = str_replace("[[", "", $extmatch[1]);
}

But I don't know how to go about getting the second part (I'm quite weak at regex unfortunately :-( ).

Any ideas?

Solution

A solution not using regex:

  1. Explode the string at '*'.
  2. Discard the parts that do not start with '[' (this drops the '{{...}}' templates).
  3. Remove all the brackets.
  4. Explode each remaining part at ' ' (space).
  5. The first piece is the link.
  6. Glue the rest back together for the description.

The code:

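$parts = explode('*', $str);
$links = array();
foreach ($parts as $k => $v) {
    $parts[$k] = ltrim($v);
    // Keep only the parts that start with an external link.
    if (substr($parts[$k], 0, 1) !== '[') {
        unset($parts[$k]);
        continue;
    }
    // Remove every square bracket.
    $parts[$k] = preg_replace('/\[|\]/', '', $parts[$k]);
    // The first space-separated piece is the URL; the rest is the description.
    $subparts = explode(' ', $parts[$k]);
    $links[$k][0] = $subparts[0];
    unset($subparts[0]);
    $links[$k][1] = implode(' ', $subparts);
}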

echo '<pre>'.print_r($links,true).'</pre>';

The result:

Array
(
    [1] => Array
        (
            [0] => http://www.bbc.co.uk/news/world-europe-17298730
            [1] => France from the BBC News 
        )

    [2] => Array
        (
            [0] => http://ucblibraries.colorado.edu/govpubs/for/france.htm
            [1] => France at ''UCB Libraries GovPubs'' 
        )

    [4] => Array
        (
            [0] => http://www.britannica.com/EBchecked/topic/215768/France
            [1] => France ''Encyclopædia Britannica'' entry 
        )

    [5] => Array
        (
            [0] => http://europa.eu/about-eu/countries/member-countries/france/index_en.htm
            [1] => France at the European Union|EU 
        )

    [8] => Array
        (
            [0] => http://www.ifs.du.edu/ifs/frm_CountryProfile.aspx?Country=FR
            [1] => Key Development Forecasts for France from International Futures ;Economy 
        )

    [10] => Array
        (
            [0] => http://stats.oecd.org/Index.aspx?QueryId=14594
            [1] => OECD France statistics 
        )

)
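
The result above still carries some wiki markup: the '' quote markers, the piped "European Union|EU", and the stray ";Economy" heading. If that matters, one possible cleanup is sketched below. The function name cleanDescription() is made up for illustration, it expects the text after the URL with the brackets still in place (so it would run before step 3), and the rule for cutting ";Economy" is an assumption about how headings appear in this flattened string.

```php
<?php
// cleanDescription() is a hypothetical helper, not part of the answer above.
// It expects the raw text after the URL, with brackets still in place.
function cleanDescription(string $text): string
{
    // [[Target|Label]] becomes Label; [[Target]] becomes Target.
    $text = preg_replace('/\[\[(?:[^\]|]*\|)?([^\]]*)\]\]/', '$1', $text);
    // Assumption: whitespace followed by ";Word" starts a section heading
    // (";Economy" in the sample), so cut everything from there on.
    $text = preg_replace('/\s;\S.*$/s', '', $text);
    // Drop remaining square brackets and runs of two or more single quotes.
    $text = preg_replace('/[\[\]]|\'{2,}/', '', $text);
    // Collapse runs of whitespace and trim the ends.
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo cleanDescription("France] at the [[European Union|EU]] "), "\n";
```

For the sample call this prints "France at the EU". A description that legitimately contains " ;" followed by text would be cut short, so the heading rule may need adjusting for other pages.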

OTHER TIPS

PHP:

$input = "{{Wikibooks|Wikijunior:Countries A-Z|France}} {{Sister project links|France}} * [http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]] * [http://ucblibraries.colorado.edu/govpubs/for/france.htm France] at ''UCB Libraries GovPubs'' *{{dmoz|Regional/Europe/France}} * [http://www.britannica.com/EBchecked/topic/215768/France France] ''Encyclopædia Britannica'' entry * [http://europa.eu/about-eu/countries/member-countries/france/index_en.htm France] at the [[European Union|EU]] *{{Wikiatlas|France}} *{{osmrelation-inline|1403916}} * [http://www.ifs.du.edu/ifs/frm_CountryProfile.aspx?Country=FR Key Development Forecasts for France] from [[International Futures]] ;Economy *{{INSEE|National Institute of Statistics and Economic Studies}} * [http://stats.oecd.org/Index.aspx?QueryId=14594 OECD France statistics]";
$regex = '/\[(http\S+)\s+([^\]]+)\](?:\s+from(?:\s+the)?\s+\[\[(.*?)\]\])?/';

preg_match_all($regex, $input, $matches, PREG_SET_ORDER);
var_dump($matches);

Output:

array(6) {
  [0]=>
  array(4) {
    [0]=>
    string(78) "[http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]]"
    [1]=>
    string(47) "http://www.bbc.co.uk/news/world-europe-17298730"
    [2]=>
    string(6) "France"
    [3]=>
    string(8) "BBC News"
  }
  ...
  ...
  ...
  ...
  ...
}
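
Because the last group in this regex is optional, an entry without a "from [[...]]" suffix may have no index 3 at all in PREG_SET_ORDER mode. A minimal sketch of reading the matches defensively follows; the $rows array and its keys are names chosen here for illustration, and the input is shortened to two entries from the question.

```php
<?php
$regex = '/\[(http\S+)\s+([^\]]+)\](?:\s+from(?:\s+the)?\s+\[\[(.*?)\]\])?/';
$input = "* [http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]] "
       . "* [http://ucblibraries.colorado.edu/govpubs/for/france.htm France] at ''UCB Libraries GovPubs''";

preg_match_all($regex, $input, $matches, PREG_SET_ORDER);

$rows = [];
foreach ($matches as $m) {
    $rows[] = [
        'url'    => $m[1],
        'label'  => $m[2],
        // Group 3 only exists when the "from [[...]]" suffix matched.
        'source' => $m[3] ?? '',
    ];
}

print_r($rows);
```

The second entry ends up with an empty 'source', since "at ''UCB Libraries GovPubs''" is not covered by the optional "from" group.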

Explanation:

\[       (?# match [ literally)
(        (?# start capture group)
  http   (?# match http literally)
  \S+    (?# match 1+ non-whitespace characters)
)        (?# end capture group)
\s+      (?# match 1+ whitespace characters)
(        (?# start capture group)
  [^\]]+ (?# match 1+ non-] characters)
)        (?# end capture group)
\]       (?# match ] literally)
(?:      (?# start non-capturing group)
  \s+    (?# match 1+ whitespace characters)
  from   (?# match from literally)
  (?:    (?# start non-capturing group)
    \s+  (?# match 1+ whitespace characters)
    the  (?# match the literally)
  )?     (?# end optional non-capturing group)
  \s+    (?# match 1+ whitespace characters)
  \[\[   (?# match [[ literally)
  (      (?# start capturing group)
    .*?  (?# lazily match 0+ characters)
  )      (?# end capturing group)
  \]\]   (?# match ]] literally)
)?       (?# end optional non-capturing group)

Let me know if you need a more thorough explanation, but the comments above should help. If you have any specific questions, I'd be more than happy to help. The link below will help you visualize what the expression is doing.

Regex101
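
The expression above only picks up trailing text of the form "from [the] [[...]]". If the whole description is wanted, as in the question ("France from the BBC News"), a variation might capture everything up to the next "*" and strip the brackets afterwards. This is a sketch under assumptions: extractLinks() is a hypothetical name, and it assumes no description contains a literal "*". Piped links such as [[European Union|EU]] would come out as "European Union|EU".

```php
<?php
// extractLinks() is a hypothetical helper, not from the answers above.
function extractLinks(string $wikitext): array
{
    // Group 1: URL. Group 2: link label. Group 3: trailing text up to the next "*".
    $pattern = '/\[(https?:\/\/[^\s\]]+)\s+([^\]]+)\]([^*]*)/';
    preg_match_all($pattern, $wikitext, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $m) {
        // Join label and trailing text, then drop every square bracket.
        $description = str_replace(['[', ']'], '', $m[2] . $m[3]);
        $links[] = ['url' => $m[1], 'description' => trim($description)];
    }
    return $links;
}

$sample = "* [http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]] "
        . "* [http://stats.oecd.org/Index.aspx?QueryId=14594 OECD France statistics]";
print_r(extractLinks($sample));
```

For the two sample entries, the descriptions come out as "France from the BBC News" and "OECD France statistics".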

Licensed under: CC-BY-SA with attribution