Parsing CSS by regex

https://stackoverflow.com/questions/236979

04-07-2019

Question

I'm creating a CSS editor and am trying to create a regular expression that can get data from a CSS document. This regex works if I have one property but I can't get it to work for all properties. I'm using preg/perl syntax in PHP.

Regex

(?<selector>[A-Za-z]+[\s]*)[\s]*{[\s]*((?<properties>[A-Za-z0-9-_]+)[\s]*:[\s]*(?<values>[A-Za-z0-9#, ]+);[\s]*)*[\s]*}

Test case

body { background: #f00; font: 12px Arial; }

Expected Outcome

Array(
    [0] => Array(
            [0] => body { background: #f00; font: 12px Arial; }
            [selector] => Array(
                [0] => body
            )
            [1] => Array(
                [0] => body
            )
            [2] => font: 12px Arial; 
            [properties] => Array(
                [0] => font
            )
            [3] => Array(
                [0] => font
            )
            [values] => Array(
                [0] => 12px Arial
                [1] => background: #f00
            )
            [4] => Array(
                [0] => 12px Arial
                [1] => background: #f00
            )
        )
)

Real Outcome

Array(
    [0] => Array
        (
            [0] => body { background: #f00; font: 12px Arial; }
            [selector] => body 
            [1] => body 
            [2] => font: 12px Arial; 
            [properties] => font
            [3] => font
            [values] => 12px Arial
            [4] => 12px Arial
        )
    )

Thanks in advance for any help - this has been confusing me all afternoon!

Solution

That just seems too convoluted for a single regular expression. Well, I'm sure that with the right extensions, an advanced user could create the right regex. But then you'd need an even more advanced user to debug it.

Instead, I'd suggest using a regex to pull out the pieces, and then tokenising each piece separately. e.g.,

/([^{]+)\s*\{\s*([^}]*?)\s*}/

Then you end up with the selector and the attributes in separate fields, and then you split those up. (Even the selector will be fun to parse.) Note that even this will have pains if a } can appear inside quotes or something. You could, again, convolute the heck out of it to avoid that, but it's probably even better to avoid regexes altogether here and parse one field at a time, perhaps with a recursive-descent parser or yacc/bison or whatever.
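A minimal sketch of that two-step idea in PHP: one regex pulls out each "selector { body }" pair, then plain string splitting breaks the body into property/value pairs. The function name parseCssBlocks is made up for illustration, and the sketch assumes the CSS has no comments, no nested blocks, and no braces or semicolons inside quoted strings.

```php
<?php
// Hypothetical helper: step 1 pulls out "selector { body }" pairs,
// step 2 splits each body into property/value pairs.
function parseCssBlocks(string $css): array
{
    $rules = [];
    // PREG_SET_ORDER gives one array per matched rule.
    preg_match_all('/([^{]+)\{([^}]*)\}/', $css, $blocks, PREG_SET_ORDER);
    foreach ($blocks as $block) {
        $selector = trim($block[1]);
        foreach (explode(';', $block[2]) as $declaration) {
            // A limit of 2 keeps colons inside the value (e.g. URLs) intact.
            $parts = explode(':', $declaration, 2);
            if (count($parts) === 2) {
                $rules[$selector][trim($parts[0])] = trim($parts[1]);
            }
        }
    }
    return $rules;
}

print_r(parseCssBlocks('body { background: #f00; font: 12px Arial; }'));
```

Because the repetition happens in a PHP loop rather than inside a repeated capture group, every declaration is kept instead of only the last one.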

OTHER TIPS

You are trying to pull structure out of the data, and not just individual values. Regular expressions could be painfully stretched to do the job, but you are really entering parser territory, and should be pulling out the big guns, namely parsers.

I have never used the PHP parser generating tools, but they look okay after a light scan of the docs. Check out LexerGenerator and ParserGenerator. LexerGenerator will take a bunch of regular expressions describing the different types of tokens in a language (in this case, CSS) and spit out some code that recognizes the individual tokens. ParserGenerator will take a grammar, a description of what things in a language are made up of what other things, and spit out a parser, code that takes a bunch of tokens and returns a syntax tree (the data structure that you are after).

Do not use your own regex for parsing CSS. Why reinvent the wheel when there is code waiting for you, ready to use and (hopefully) bug-free?

There are two generally available classes that can parse CSS for you:

HTML_CSS PEAR package at pear.php.net

and

CSS Parser class at PHPClasses:

http://www.phpclasses.org/browse/package/1289.html

I would recommend against using regexes to parse CSS - especially in a single regex!

If you insist on doing the parsing with regexes, split it up into sensible sections - use one regex to split out all the body{..} blocks, then another to parse the color:rgb(1,2,3); attributes.

If you are actually trying to write something "useful" (not trying to learn regular expressions), look for a prewritten CSS parser.

I found this cssparser.php which seems to work very well:

$cssp = new cssparser;
$cssp -> ParseStr("body { background: #f00;font: 12px Arial; }");
print_r($cssp->css);

..which outputs the following:

Array
(
    [body] => Array
        (
            [background] => #f00
            [font] => 12px arial
        )
)

The parser is pretty simple, so should be easy to work out what it's doing. Oh, I had to remove the lines that read if($this->html) {$this->Add("VAR", "");} (it seems to be a debugging thing that was left in)

I've mirrored the script here, with the above changes in

I am using the regex below and it pretty much works... of course this question is old now and I see that you've abandoned your efforts... but in case someone else runs across it:

(?<selector>(?:(?:[^,{]+),?)*?)\{(?:(?<name>[^}:]+):?(?<value>[^};]+);?)*?\}

(hafta remove all of the /* comments */ from your CSS first to be safe)
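A rough sketch of that comment-stripping step in PHP, assuming a naive approach is acceptable. The helper name stripCssComments is made up for illustration, and it will also remove a "/* ... */" sequence that appears inside a quoted string.

```php
<?php
// Hypothetical helper: remove /* ... */ comments before matching rules.
function stripCssComments(string $css): string
{
    // The "s" flag lets "." cross newlines; ".*?" stops at the first "*/".
    // preg_replace returns null on failure, so fall back to the input.
    return preg_replace('~/\*.*?\*/~s', '', $css) ?? $css;
}

echo stripCssComments("body { /* red */ color: #f00; }\n/* multi\nline */ p { margin: 0; }");
```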

I wrote a piece of code that easily parses CSS. All you really have to do is a couple of explodes. The $css variable is a string of the CSS. Afterwards, do a print_r($css_array) to get a nice array of the CSS, fully parsed.

$css_array = array(); // master array to hold all values
$elements = explode('}', $css);
foreach ($elements as $element) {
    // split the name of the CSS element from its styles
    $a_name = explode('{', $element, 2);
    if (count($a_name) < 2) {
        continue; // leftover fragment after the last '}'
    }
    $name = trim($a_name[0]);
    // get all the key:value pair styles
    $a_styles = explode(';', $a_name[1]);
    // loop through each style and split apart the key from the value
    foreach ($a_styles as $style) {
        // a limit of 2 keeps colons inside values (e.g. URLs) intact
        $a_key_value = explode(':', $style, 2);
        if (count($a_key_value) == 2) {
            // build the master css array
            $css_array[$name][trim($a_key_value[0])] = trim($a_key_value[1]);
        }
    }
}

Gives you this:

Array
(
    [body] => Array
        (
            [background] => #f00
            [font] => 12px arial
        )
)

Building off of the current answer by Tanktalus, there are a couple of improvements and edge cases to note.

CSS Parsing Regex

\s*([^{]+)\s*\{\s*([^}]*?)\s*}

This regex does some space trimming and handles some additional edge cases, as listed in this example: https://regex101.com/r/qQRIHx/5

key:value pairs; Pitfalls of Further Complicated Regex

I too started to work on delimiting the key:value pairs, but quickly saw that with multiple styles per selector things got trickier than I wanted. You can view version 1 of the regex, where I tried to delimit the key:values, and how it failed with multiple declarations here: https://regex101.com/r/qQRIHx/1

Implementation

As others mentioned, you should break this up into multiple steps to parse and tokenize your css. This regex will help you obtain the declarations, but you will need to then parse those out.

Declaration Parser

You could use something like this to parse the declarations after you get your first set of matches.

([^:\s]+)*\s*:\s*([^;]+);

Example: https://regex101.com/r/py9OKO/1/

Edge Case

The above example works great with multiple declarations, but a block may contain just one declaration with no terminating semicolon. That renders in [most] browsers but breaks this regex.
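One possible way around that, sketched in PHP: let each declaration end at either a semicolon or the end of the block. This is a variation on the declaration pattern above, not the pattern from the regex101 link, and the helper name parseDeclarations is made up for illustration. It still assumes no semicolons inside quoted values.

```php
<?php
// Hypothetical helper: split a declaration block into property => value.
function parseDeclarations(string $block): array
{
    $declarations = [];
    // (?:;|$) accepts either a terminating semicolon or the end of the block.
    $pattern = '/([^:;\s]+)\s*:\s*([^;]+?)\s*(?:;|$)/';
    preg_match_all($pattern, $block, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $declarations[$m[1]] = $m[2];
    }
    return $declarations;
}

print_r(parseDeclarations('background: #f00; font: 12px Arial'));
```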

Noted Cases

You may also need to account for nested rules, as in a media query. In this case I would try to run the CSS matching regex against the declarations that are extracted. If you get matches, you could recurse on them (although I'm not sure there are cases where you would have more than 1 level of nesting in vanilla CSS).

Edge Cases

This doesn't handle a right curly bracket inside a quoted string, e.g. content: "}".
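If you want to stay in PHP, a small hand-written scanner can cover both the nested-block case and the brace-in-a-string case, which a [^}]* pattern cannot. The sketch below is only an illustration under the assumption that comments have already been removed. The helper name findMatchingBrace is made up.

```php
<?php
// Hypothetical helper: given the offset of a "{", return the offset of its
// matching "}", skipping braces inside quoted strings and nested blocks.
function findMatchingBrace(string $css, int $open): int
{
    $depth = 0;
    $quote = null; // the quote character we are currently inside, if any
    $length = strlen($css);
    for ($i = $open; $i < $length; $i++) {
        $ch = $css[$i];
        if ($quote !== null) {
            if ($ch === '\\') {
                $i++; // skip the escaped character
            } elseif ($ch === $quote) {
                $quote = null; // closing quote
            }
        } elseif ($ch === '"' || $ch === "'") {
            $quote = $ch;
        } elseif ($ch === '{') {
            $depth++;
        } elseif ($ch === '}' && --$depth === 0) {
            return $i;
        }
    }
    return -1; // unbalanced input
}

$css = 'a::after { content: "}"; } b { color: red; }';
$open = strpos($css, '{');
$close = findMatchingBrace($css, $open);
echo substr($css, $open + 1, $close - $open - 1);
```

The text between the two offsets is the block body, which can then be handed to whatever declaration splitting you use, or scanned again for nested rules.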

Tomorrow's Research

I've decided to instead use an npm package like css or cssom. I know this is in PHP but it's going to do a lot of heavy lifting for me and handle edge cases I keep running into.

Edit:

I ended up using Jotform's public css.js library. It has a really small footprint which was one of the main requirements I had when choosing libraries to parse CSS.

https://github.com/jotform/css.js/tree/master
They also published this article explaining their process:
- https://stories.jotform.com/writing-a-css-parser-in-javascript-3ecaa1719a43

Try this

function trimStringArray($stringArray){
    $result = array();
    for($i=0; $i < count($stringArray); $i++){
        $trimmed = trim($stringArray[$i]);
        if($trimmed != '') $result[] = $trimmed;
    }
    return $result;
}
$regExp = '/\{|\}/';
$rawCssData = preg_split($regExp, $style);

$cssArray = array();
for($i=0; $i < count($rawCssData); $i++){
    if($i % 2 == 0){
        // even chunks are selector lists; split() was removed in PHP 7, so use explode()
        $cssStyle = array();
        $selectors = explode(',', $rawCssData[$i]);
        $cssStyle['selectors'] = trimStringArray($selectors);
    }
    if($i % 2 == 1){
        // odd chunks are the declaration blocks
        $attributes = explode(';', $rawCssData[$i]);
        $cssStyle['attributes'] = trimStringArray($attributes);
        $cssArray[] = $cssStyle;
    }
}
//return false;
echo '<pre>'."\n";
print_r($cssArray);
echo '</pre>'."\n";

Licensed under: CC-BY-SA with attribution
