Question

I am trying to match a series of text strings with PCRE on PHP, and am having trouble getting all the matches in between the first and second.

If anyone wonders why on Earth I would want to do this, it's because of Doc Comments. Oh, how I wish Zend would make native/plugin functions to read Doc Comments from a PHP file...

The following example (plain) text will be used for the problem. It will always be pure PHP code, with only one opening tag at the beginning of the file, no closing. You can assume that the syntax will always be correct.

<?php
  class someClass extends someExample
  {
    function doSomething($someArg = 'someValue')
    {
      // Nested code blocks...
      if($boolTest){}
    }
    private function killFurbies(){}
    protected function runSomething(){}
  }

  abstract
  class anotherClass
  {
    public function __construct(){}
    abstract function saveTheWhales();
  }

  function globalFunc(){}

Problem

Trying to match all methods in a class; my RegEx does not find the method killFurbies() at all. Letting it be greedy means it only matches the last method in a class, and letting it be lazy means it only matches the first method.

$part = '.*';  // Greedy
$part = '.*?'; // Lazy

$regex = '%class(?:\\n|\\r|\\s)+([a-zA-Z_\\x7f-\\xff][a-zA-Z0-9_\\x7f-\\xff]*)'
       . '.*?\{' . $part .'(?:(public|protected|private)(?:\\n|\\r|\\s)+)?'
       . 'function(?:\\n|\\r|\\s)+([a-zA-Z_\\x7f-\\xff][a-zA-Z0-9_\\x7f-\\xff'
       . ']*)(?:\\n|\\r|\\s)*\\(%ms';

preg_match_all($regex, file_get_contents(__EXAMPLE__), $matches, PREG_SET_ORDER);
var_dump($matches);

Results in:

// Lazy:
array(2) {
  [0]=>
  array(4) {
    [0]=>
    // Omitted.
    [1]=>
    string(9) "someClass"
    [2]=>
    string(0) ""
    [3]=>
    string(11) "doSomething"
  }
  [1]=>
  array(4) {
    [0]=>
    // Omitted.
    [1]=>
    string(12) "anotherClass"
    [2]=>
    string(6) "public"
    [3]=>
    string(11) "__construct"
  }
}

// Greedy:
array(2) {
  [0]=>
  array(4) {
    [0]=>
    // Omitted.
    [1]=>
    string(9) "someClass"
    [2]=>
    string(0) ""
    [3]=>
    string(13) "saveTheWhales"
  }
  [1]=>
  array(4) {
    [0]=>
    // Omitted.
    [1]=>
    string(12) "anotherClass"
    [2]=>
    string(0) ""
    [3]=>
    string(13) "saveTheWhales"
  }
}

How do I match all? :S

Any help would be gratefully appreciated, as I already feel this question is ridiculous as I'm typing it out. Anyone attempting to answer a question like this is braver than me!

Was it helpful?

Solution

Better use token_get_all to get the tokens of a PHP code and iterate them. PHPDoc style comments tokens can be identified with T_DOC_COMMENT.

OTHER TIPS

Err, can't you just parse the source using token_get_all and look for the tokens of type T_DOC_COMMENT (changed from T_COMMENT to T_DOC_COMMENT, see Gumnbo's post)?

An example of how to use this token_get_all function can be found here.

Solution

I've come up with a class to extract Doc Comments for classes and methods in a file. Thanks to all the people who answered this question, and the other on matching code blocks.

The average benchmarks for the following example is between 0.00495 and 0.00505 seconds.

<?php

$file = 'path/to/libraries/tokenizer.php';
include $file;
$tokenizer = new Tokenizer;
// Start Benchmarking here.
$tokenizer->load($file);
// End Benchmarking here.
// The following will output 'bool(false)'.
var_dump($tokenizer->get_doc('Tokenizer', 'get_tokens'));
// The following will output 'string(18) "/** load method */"'.

Tokenizer (yes, I still haven't thought of a better name for it...) Class:

<?php

class Tokenizer
{

  private $compiled = false, $path = false, $tokens = false, $classes = array();

  /** load method */
  public function load($path)
  {
    $path = realpath($path);
    if(!file_exists($path) || !function_exists('token_get_all'))
    {
      return false;
    }
    $this->compiled = false;
    $this->classes = array();
    $this->path = $path;
    $this->tokens = false;

    $this->get_tokens();
    $this->get_classes();
    $this->class_blocks();
    $this->class_functions();
    return true;
  }

  protected function get_tokens()
  {
    $tokens = token_get_all(file_get_contents($this->path));
    $compiled = '';
    foreach($tokens as $k => $t)
    {
      if(is_array($t) && $t[0] != T_WHITESPACE)
      {
        $compiled .= $k . ':' . $t[0] . ',';
      }
      else
      {
        if($t == '{' || $t == '}')
        {
          $compiled .= $t . ',';
        }
      }
    }
    $this->tokens = $tokens;
    $this->compiled = trim($compiled, ',');
  }

  protected function get_classes()
  {
    if(!$this->compiled)
    {
      return false;
    }
    $regex = '%(?:(\\d+)\\:366,)?(?:\\d+\\:(?:345|344|353),)?\\d+\\:352,(\\d+)\\:307,(?:\\d+\\:(?:354|355),\\d+\\:307,)*{%';
    preg_match_all($regex, $this->compiled, $classes, PREG_SET_ORDER);
    if(is_array($classes))
    {
      foreach($classes as $class)
      {
        $this->classes[$this->tokens[$class[2]][1]] = array('token' => $class[2]);
        $this->classes[$this->tokens[$class[2]][1]]['doc'] = isset($this->tokens[$class[1]][1]) ? $this->tokens[$class[1]][1] : false;
      }
    }
  }

  private function class_blocks()
  {
    if(!$this->compiled)
    {
      return false;
    }
    foreach($this->classes as $class_name => $class)
    {
      $this->classes[$class_name]['block'] = $this->get_block($class['token']);
    }
  }

  protected function get_block($name_token)
  {
    if(!$this->compiled || ($pos = strpos($this->compiled, $name_token . ':')) === false)
    {
      return false;
    }
    $section= substr($this->compiled, $pos);
    $len = strlen($section);
    $block = '';
    $opening = 1;
    $closing = 0;
    for($i = 0; $i < $len; $i++)
    {
      if($section[$i] == '{')
      {
        $opening++;
      }
      elseif($section[$i] == '}')
      {
        $closing++;
        if($closing == $opening)
        {
          break;
        }
      }
      if($opening > 0)
      {
        $block .= $section[$i];
      }
    }
    return trim($block, ',');
  }

  protected function class_functions()
  {
    if(!$this->compiled)
    {
      return false;
    }
    foreach($this->classes as $class_name => $class)
    {
      $regex = '%(?:(\d+)\:366,)?(?:\d+\:(?:344|345),)?(?:\d+\:(?:341|342|343),)?\d+\:333,(\d+)\:307,\{%';
      preg_match_all($regex, $class['block'], $functions, PREG_SET_ORDER);
      foreach($functions as $function)
      {
        $function_name = $this->tokens[$function[2]][1];
        $this->classes[$class_name]['functions'][$function_name] = array('token' => $function[2]);
        $this->classes[$class_name]['functions'][$function_name]['doc'] = isset($this->tokens[$function[1]][1]) ? $this->tokens[$function[1]][1] : false;
        $this->classes[$class_name]['functions'][$function_name]['block'] = $this->get_block($function[2]);
      }
    }
  }

  public function get_doc($class, $function = false)
  {
    if(!is_string($class) || !isset($this->classes[$class]))
    {
      return false;
    }
    if(!is_string($function))
    {
      return $this->classes[$class]['doc'];
    }
    else
    {
      if(!isset($this->classes[$class]['functions'][$function]))
      {
        return false;
      }
      return $this->classes[$class]['functions'][$function]['doc'];
    }
  }

}

Any thoughts or comments on this? All criticism welcome!

Thanks, mniz.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top