Question

I am analysing informal chat-style messages for sentiment and other information. I need every emoticon replaced with its actual meaning, to make the messages easier for the system to parse.

At the moment I have the following code:

$str = "Am I :) or :( today?";

$emoticons = array(
    ':)'    =>  'happy',
    ':]'    =>  'happy',
    ':('    =>  'sad',
    ':['    =>  'sad',
);

$str = str_replace(array_keys($emoticons), array_values($emoticons), $str);

This does a direct string replacement, so it does not take into account whether the emoticon is surrounded by other characters.

How can I use regex and preg_replace to determine if it is actually an emoticon and not part of a string?

Also, how can I restructure my array so that a single element, for example happy, can hold several emoticons, such as both :) and :]?


Solution

For maintainability and readability, I would change your emoticons array to:

$emoticons = array(
    'happy' => array( ':)', ':]'),
    'sad'   => array( ':(', ':[')
);

Then, you can form a look-up table just like you originally had, like this:

$emoticon_lookup = array();
foreach( $emoticons as $name => $values) {
    foreach( $values as $emoticon) {
        $emoticon_lookup[ $emoticon ] = $name;
    }
}

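Now you can build a regex dynamically from the emoticon lookup array. Note that this regex requires a non-word boundary (\B) on each side of the emoticon, so with the emoticons above, one attached directly to a letter, digit, or underscore (as in foo:)) is not matched. Change these assertions to whatever you need.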

$escaped_emoticons = array_map( 'preg_quote', array_keys( $emoticon_lookup), array_fill( 0, count( $emoticon_lookup), '/'));
$regex = '/\B(' . implode( '|', $escaped_emoticons) . ')\B/';
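If \B turns out to be too permissive (it still accepts an emoticon that touches other punctuation), one possible alternative is to require whitespace or the start/end of the string on both sides, using lookarounds. The following is only a sketch for comparing the two rules; the names $group, $nonword_regex, $whitespace_regex, $samples and $results are made up for illustration and are not part of the original answer.

```php
<?php
// Same lookup table as built above, written out so the snippet runs on its own.
$emoticon_lookup = array(':)' => 'happy', ':]' => 'happy', ':(' => 'sad', ':[' => 'sad');

// Escape every emoticon for use inside a '/'-delimited pattern.
$escaped = array();
foreach (array_keys($emoticon_lookup) as $emoticon) {
    $escaped[] = preg_quote($emoticon, '/');
}
$group = '(' . implode('|', $escaped) . ')';

// Rule from the answer: no word boundary on either side.
$nonword_regex = '/\B' . $group . '\B/';
// Stricter rule (assumption): no non-whitespace character on either side.
$whitespace_regex = '/(?<!\S)' . $group . '(?!\S)/';

// Record whether each rule finds an emoticon in each sample (1 = yes, 0 = no).
$samples = array('Am I :) today?', 'foo:) bar', 'Great :)!');
$results = array();
foreach ($samples as $sample) {
    $results[$sample] = array(
        'nonword'    => preg_match($nonword_regex, $sample),
        'whitespace' => preg_match($whitespace_regex, $sample),
    );
    printf("%-16s \\B: %d  whitespace: %d\n", $sample,
        $results[$sample]['nonword'], $results[$sample]['whitespace']);
}
```

Both rules accept the first sample and reject the second; they differ on the third, where the emoticon is followed by "!". Whichever pattern fits your data can be assigned to $regex.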

And then use preg_replace_callback() with a custom callback to implement the replacement:

$str = preg_replace_callback( $regex, function( $match) use( $emoticon_lookup) {
    return $emoticon_lookup[ $match[1] ];
}, $str);

For the example string from the question, this outputs:

Am I happy or sad today? 
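For reuse, the steps above could be wrapped in a single function. This is a minimal sketch, not a definitive implementation: the function name replace_emoticons is hypothetical, and sorting the emoticons longest-first is an added assumption, so that if you later add an emoticon that begins with a shorter one (say ":))" next to ":)"), the longer one is tried first.

```php
<?php
// Hypothetical helper: replaces emoticons in $str with their names.
// $emoticons maps a name to a list of emoticons, e.g. 'happy' => array(':)', ':]').
function replace_emoticons($str, array $emoticons)
{
    // Flatten into emoticon => name.
    $lookup = array();
    foreach ($emoticons as $name => $values) {
        foreach ($values as $emoticon) {
            $lookup[$emoticon] = $name;
        }
    }
    if (count($lookup) === 0) {
        return $str; // nothing to replace; an empty alternation would match everywhere
    }

    // Assumption: try longer emoticons first so a shorter prefix cannot win.
    $keys = array_keys($lookup);
    usort($keys, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    // Escape each emoticon for a '/'-delimited pattern.
    $escaped = array();
    foreach ($keys as $emoticon) {
        $escaped[] = preg_quote($emoticon, '/');
    }
    $regex = '/\B(' . implode('|', $escaped) . ')\B/';

    return preg_replace_callback($regex, function ($match) use ($lookup) {
        return $lookup[$match[1]];
    }, $str);
}

$emoticons = array(
    'happy' => array(':)', ':]'),
    'sad'   => array(':(', ':['),
);
echo replace_emoticons("Am I :) or :( today?", $emoticons), "\n";
```

An emoticon glued to a word, as in "foo:) bar", is left untouched by this version because of the \B assertions.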
Licensed under: CC-BY-SA with attribution