Regex to strip comments and multi-line comments and empty lines

https://stackoverflow.com/questions/643113

22-07-2019

Question

I want to parse a file using PHP and regex, and strip:

blank or empty lines
single line comments
multi line comments

Basically, I want to remove any line containing

/* text */

or multi-line comments such as

/***
some
text
*****/

If possible, I would also like another regex that checks whether a line is empty (to remove blank lines).

Is that possible? Can somebody post a regex that does just that?

Thanks a lot.

Solution

$text = preg_replace('!/\*.*?\*/!s', '', $text);
$text = preg_replace('/\n\s*\n/', "\n", $text);
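
To see what these two calls do, here is a small sketch run on a made-up input string (the sample text is an assumption for illustration, not from the question). The `s` modifier lets `.` match newlines so a `/* ... */` block can span several lines, and the lazy `.*?` stops at the first `*/` instead of the last one.

```php
<?php
// Hypothetical sample input, not from the original question.
$text = "first line\n/* one-line comment */\nsecond line\n"
      . "/***\nsome\ntext\n*****/\n\n\nthird line\n";

// Remove every /* ... */ block; 's' makes '.' match newlines, '.*?' is lazy.
$text = preg_replace('!/\*.*?\*/!s', '', $text);

// Collapse each run of blank (whitespace-only) lines into a single newline.
$text = preg_replace('/\n\s*\n/', "\n", $text);

echo $text;
```

With this input, only the three "line" rows should remain, each on its own line.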

OTHER TIPS

Keep in mind that any regex you use will fail if the file you're parsing has a string containing something that matches these conditions. For example, it would turn this:

print "/* a comment */";

Into this:

print "";

Which is probably not what you want. But maybe it is, I don't know. Anyway, regexes technically can't parse data in a way that avoids this problem. I say "technically" because modern PCRE regexes have tacked on a number of hacks that make them capable of doing this and, more importantly, no longer regular expressions, but whatever. If you want to avoid stripping these things inside quotes or in other situations, there is no substitute for a full-blown parser (although it can still be a pretty simple one).
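
As a rough illustration of the "pretty simple" end of that spectrum, the sketch below matches quoted strings and block comments in one pass and only deletes the comments. The function name is hypothetical, and this is a minimal sketch, not a real parser: it assumes backslash-escaped quotes and will still be fooled by things like heredocs, `//` comments, or a stray apostrophe in plain text.

```php
<?php
// Hypothetical helper (the name is ours, not from the answer above).
function strip_block_comments_outside_strings(string $text): string
{
    // A double- or single-quoted string literal with backslash escapes.
    $string  = '"(?:[^"\\\\]|\\\\.)*"|\'(?:[^\'\\\\]|\\\\.)*\'';
    // A block comment, matched lazily up to the first closing marker.
    $comment = '/\*.*?\*/';

    // String literals are tried first and captured in group 1.
    $pattern = '~(' . $string . ')|' . $comment . '~s';

    return preg_replace_callback(
        $pattern,
        function (array $m): string {
            // Group 1 is present only when a string literal matched: keep it.
            // Otherwise a comment matched: replace it with nothing.
            return $m[1] ?? '';
        },
        $text
    ) ?? $text;
}

$out = strip_block_comments_outside_strings('print "/* a comment */"; /* real comment */');
echo $out, "\n";
```

Here the quoted `/* a comment */` survives, while the real comment after the semicolon is removed.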

//  Removes multi-line comments without leaving a blank line behind;
//  also handles surrounding spaces and tabs.
//  The 'm' modifier is needed so that ^ matches at the start of every line.
$text = preg_replace('!^[ \t]*/\*.*?\*/[ \t]*[\r\n]!sm', '', $text);

//  Removes single-line '//' comments, including leading spaces and tabs
$text = preg_replace('![ \t]*//.*[ \t]*[\r\n]!', '', $text);

//  Strips blank lines
$text = preg_replace("/(^[\r\n]*|[\r\n]+)\s*[\r\n]+/", "\n", $text);
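
One caveat with a `//` pattern like the one above: it also matches the `//` in URLs, and because it consumes the line break, a trailing comment joins its line with the next one. The variant below is only a sketch of one way around both issues. The negative lookbehind `(?<!:)` and the sample input are our assumptions, not part of the answer, and `//` inside other strings is still not handled.

```php
<?php
// Hypothetical sample input for illustration.
$text = "\$url = 'http://example.com'; // trailing comment\n"
      . "// full-line comment\n"
      . "\n"
      . "echo \$url;\n";

// Remove '//' comments unless the slashes directly follow a colon (as in
// 'http://'). The newline is left in place so code lines are never joined.
$text = preg_replace('~[ \t]*(?<!:)//[^\r\n]*~', '', $text);

// Then drop the lines that are now empty or whitespace-only.
$text = preg_replace('~^[ \t]*(?:\r\n|\r|\n)~m', '', $text);

echo $text;
```

Note the `~` delimiter: the lookbehind contains a `!`, so `!` can no longer be used to delimit the pattern.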

It is possible, but I wouldn't do it. You need to parse the whole PHP file to make sure that you're not removing any necessary whitespace (inside strings, between keywords and identifiers (publicfunctiondoStuff()), etc.). Better to use PHP's tokenizer extension.

This should work for replacing everything from /* to */.

$string = preg_replace('/(\s+)\/\*([^\/]*)\*\/(\s+)/s', "\n", $string);

$string = preg_replace('#/\*[^*]*\*+([^/][^*]*\*+)*/#', '', $string);
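
The second pattern needs neither the `s` modifier nor a lazy quantifier: after `/*` it consumes non-star characters, then a run of stars, and repeats until a run of stars is followed directly by `/`. A quick sketch on a made-up input (the sample string is an assumption for illustration):

```php
<?php
// Hypothetical sample input: one multi-line and one inline comment.
$string = "keep 1\n/***\nsome\ntext\n*****/\nkeep 2 /* inline */ keep 3\n";

// Match a block comment without '.' so no 's' modifier is required.
$string = preg_replace('#/\*[^*]*\*+([^/][^*]*\*+)*/#', '', $string);

echo $string;
```

Only the comments themselves are removed. The surrounding newlines and spaces stay, so a blank-line pass is still needed afterwards.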

This is my solution for anyone who is not used to regexes. The following code removes all comments delimited by # and retrieves the values of variables written in the style NAME=VALUE.

$reg = array();
$handle = @fopen("/etc/chilli/config", "r");
if ($handle) {
    while (($buffer = fgets($handle, 4096)) !== false) {
        // Drop everything from the first '#' to the end of the line
        $start = strpos($buffer, "#");
        $res = ($start !== false) ? substr($buffer, 0, $start) : $buffer;

        // Split on the first '=' only, so values may themselves contain '='
        $a = explode("=", $res, 2);
        $name = trim($a[0]);

        // Skip blank and comment-only lines
        if ($name === "") {
            continue;
        }

        // A name without '=' is stored with an empty value
        $reg[$name] = isset($a[1]) ? trim($a[1]) : "";
    }

    if (!feof($handle)) {
        echo "Error: unexpected fgets() fail\n";
    }
    fclose($handle);
}

This is a good function, and WORKS!

<?php
// On PHP 5 and later, T_COMMENT covers //, # and /* */ comments,
// and T_DOC_COMMENT covers /** */ doc comments.
function strip_comments($source) {
    $tokens = token_get_all($source);
    $ret = "";
    foreach ($tokens as $token) {
       if (is_string($token)) {
          // Single-character tokens such as ; or { come back as plain strings
          $ret.= $token;
       } else {
          list($id, $text) = $token;

          switch ($id) {
             case T_COMMENT:
             case T_DOC_COMMENT:
                // Drop the comment
                break;

             default:
                $ret.= $text;
                break;
          }
       }
    }
    // Also strip the PHP open and close tags ('<?php' must come before '<?')
    return trim(str_replace(array('<?php','<?','?>'),'',$ret));
}
?>

Now use the strip_comments function on code held in a variable:

<?php
$code = '
<?php
    /* this is comment */
   // this is also a comment
   # me too, am also comment
   echo "And I am some code...";
?>';

$code = strip_comments($code);

echo htmlspecialchars($code);
?>

Viewed in a browser, this displays:

echo "And I am some code...";

Loading from a php file:

<?php
$code = file_get_contents("some_code_file.php");
$code = strip_comments($code);

echo htmlspecialchars($code);
?>

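Loading a php file, stripping comments and saving it back:

<?php
$file = "some_code_file.php";
$code = file_get_contents($file);
$code = strip_comments($code);

// strip_comments() also removes the PHP tags, so restore the opening tag
// (this assumes the file is a single PHP block with no inline HTML)
$f = fopen($file,"w");
fwrite($f,"<?php\n".$code);
fclose($f);
?>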

Source: http://www.php.net/manual/en/tokenizer.examples.php

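I found this one to suit me better: (\s+)\/\*([^\/]*)\*/\n*. It removes multi-line comments, whether indented or not, along with the whitespace before them and the newlines after them. Note that [^\/]* cannot cross a /, so a comment containing a slash will not match. Here is an example of a comment that this regex matches.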

/**
 * The AdditionalCategory
 * Meta informations extracted from the WSDL
 * - minOccurs : 0
 * - nillable : true
 * @var TestStructAdditionalCategorizationExternalIntegrationCUDListDataContract
 */
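
A small sketch of that regex in use. The answer does not say what to replace the match with, so the single-newline replacement, the `~` delimiter, and the sample class body are all assumptions for illustration.

```php
<?php
// Hypothetical input: a doc comment like the one above, inside a class body.
$source = "class Foo\n{\n    /**\n     * The AdditionalCategory\n"
        . "     * - minOccurs : 0\n     */\n    public \$AdditionalCategory;\n}\n";

$pattern = '~(\s+)\/\*([^\/]*)\*/\n*~';

// Replacing each match with a single newline is our choice for this sketch.
$stripped = preg_replace($pattern, "\n", $source);
echo $stripped;

// A comment that contains a slash is left alone, because [^\/]* stops at '/'.
$withSlash = "\$a = 1; /* see a/b */\n";
$unchanged = preg_replace($pattern, "\n", $withSlash);
```

The second call shows the limitation: any comment containing a URL or a path will be skipped.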

Licensed under: CC-BY-SA with attribution
