Question

I have a source of html, and an array of keywords. I'm trying to find all words which begin with any keyword in the keywords array and wrap it in a link tag.

For example, the keyword array has two values: [ABC, DEF]. It should match ABCDEF, DEFAD, etc. and wrap each word with hyperlink markup.

Here is the code I've got so far:

$_keys = array('ABC', 'DEF');
$text = 'Some ABCDD <strong>HTML</strong> text. DEF';

function search_and_replace(($key,$text)
{
    $words = preg_split('/\s+/', trim($text)); //to seprate words in $_text
    for($words as $word) 
    {
        if(strpos($word,$key) !== false)
        {
            if($word.startswith($key)) 
            {
                str_replace($word,'<a href="">'.$word.'</a>,$_text);
            }
        }

    }
    return text;
}


for($_keys as $_key)
{
    $text = search_and_replace($key,$text);
}

My questions:

  1. Would this algorithm work?
  2. How would I modify this to work with UTF-8?
  3. How can I recognize hyperlinks in the html and ignore them (don't want to put a hyperlink in a hyperlink).
  4. Is this algorithm safe?

Solution

Is the algorithm "true"? (I'm reading that as "accurate".)

No, it is not. According to the PHP manual, str_replace returns:

a string or an array with all occurrences of search in subject replaced with the given replace value.

So the word you matched is not the only thing that gets replaced: every occurrence of it in the text is, including occurrences inside longer words and inside links you have already added. With your example, you would end up wrapping each occurrence of ABC in multiple tags. Run your code to see it (you will have to fix the syntax errors first).
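A minimal sketch of that failure, assuming a hypothetical two-word sample rather than the question's exact data. The loop mimics the question's approach with the syntax errors fixed; the variable names are illustrative only.

```php
<?php
// Hypothetical sample: "ABC" appears on its own and as the start of "ABCDD".
$text = 'ABCDD ABC';

// Mimic the question's loop: every word that starts with the key is passed
// to str_replace(), which rewrites ALL occurrences in the whole string.
$result = $text;
foreach (preg_split('/\s+/', trim($text)) as $word) {
    if (strpos($word, 'ABC') === 0) {
        $result = str_replace($word, '<a href="">' . $word . '</a>', $result);
    }
}

echo $result, "\n";
// prints: <a href=""><a href="">ABC</a>DD</a> <a href="">ABC</a>
```

The second pass finds "ABC" inside the link that the first pass created around "ABCDD", which is how the nested tags appear.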

Work with UTF-8 alphabets?

Not sure, but as written, I don't think so. See Preg_Replace and UTF8. The preg_* functions should be multibyte safe, provided the pattern uses the u (UTF-8) modifier.
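As a rough illustration of the u modifier, here is a small sketch. The sample text and keyword list are assumptions, and the file itself has to be saved as UTF-8 for the literals to work.

```php
<?php
// Hypothetical sample text containing non-ASCII letters after the keywords.
$text = 'for ABČogr and ABCDđD';

// The u modifier makes PCRE treat pattern and subject as UTF-8; in PHP it
// also lets \w match non-ASCII letters such as Č and đ.
preg_match_all('/\b(?:ABC|ABČ)\w*/u', $text, $matches);

print_r($matches[0]);
```

Without the u modifier, \w would typically stop at the first multibyte character, so the matched words would be cut short.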

I want to ignore all words in each a tag for the search operation

That's awfully hard. You'll have to avoid <a ...>word</a>, which starts to make a big mess fast. Matching HTML reliably with regex is a fool's errand.

Probably the best approach is to parse the page as XML or HTML. Have you considered doing this in JavaScript? Why do it on the server side? The advantage of JS is twofold: first, it runs on the client side, so you're offloading and distributing the work; second, since the DOM is already parsed, you can find all text nodes and replace them fairly easily. In fact, I was helping a friend working on a Chrome extension to do almost exactly what you're describing; you could modify it to do what you're looking for easily.
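If the work has to stay on the server, the same text-node idea can be sketched in PHP with DOMDocument and DOMXPath. This is only a sketch under assumptions: the function name link_keywords is hypothetical, the input is assumed to be a UTF-8 HTML fragment, and the wrapper div and encoding hint are common workarounds rather than anything the answer prescribes.

```php
<?php
// Hypothetical helper (not from the answer): wrap words starting with any of
// $keys in a link, but only in text nodes that are not already inside an <a>.
function link_keywords(string $html, array $keys): string
{
    $doc = new DOMDocument();
    // Assumption: $html is a UTF-8 fragment. The XML declaration is a common
    // encoding hint for libxml; the <div> gives us one root to read back.
    // The @ hides libxml warnings about imperfect markup.
    @$doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');

    $quoted = array_map(function ($key) { return preg_quote($key, '/'); }, $keys);
    $pattern = '/\b((?:' . implode('|', $quoted) . ')\w*)/u';

    $xpath = new DOMXPath($doc);
    // Snapshot the text nodes first, because the loop rewrites the tree.
    $nodes = iterator_to_array($xpath->query('//text()[not(ancestor::a)]'));

    foreach ($nodes as $node) {
        // With PREG_SPLIT_DELIM_CAPTURE, odd indexes hold the matched words.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no keyword in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $link = $doc->createElement('a');
                $link->setAttribute('href', '');
                $link->appendChild($doc->createTextNode($part));
                $fragment->appendChild($link);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of our wrapper <div>.
    $out = '';
    foreach ($doc->getElementsByTagName('div')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$out = link_keywords('Some <a href="#">ABC</a> ABCDD <strong>HTML</strong> text. DEF', ['ABC', 'DEF']);
echo $out, "\n";
```

Because the XPath query skips text with an a ancestor, the existing link around ABC is left alone, while ABCDD and DEF each get a new link.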

A better alternative method?

Definitely. What you're showing here is one of the worse methods of doing this. I'd push for you to use preg_replace (another answer has a good start for the regex you'd want, matching word breaks rather than whitespace), but since you want to avoid changing some elements, I now think doing this client-side in JS is far better.

Other answers

To maximize performance, you should look into the Trie (also called a retrieval tree) data structure: http://en.wikipedia.org/wiki/Trie. If I were you, I would first build a Trie containing the words in the HTML page. At this step you could also check whether a word is inside an <a> tag and, if it is, not add it to the Trie. You can easily do that with a regex match.
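A minimal array-based sketch of such a Trie in PHP. The function names and the '$end' marker are illustrative choices, and the sample words are assumed to have been extracted already from text outside any <a> tag.

```php
<?php
// Insert one word into the trie, one character per level.
function trie_insert(array &$trie, string $word): void
{
    $node = &$trie;
    // Splitting with //u keeps multibyte UTF-8 characters intact.
    foreach (preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        if (!isset($node[$char])) {
            $node[$char] = [];
        }
        $node = &$node[$char];
    }
    $node['$end'] = true; // marks that a complete word ends here
}

// Return every stored word that begins with $prefix, sorted.
function trie_words_with_prefix(array $trie, string $prefix): array
{
    $node = $trie;
    foreach (preg_split('//u', $prefix, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        if (!isset($node[$char])) {
            return []; // no stored word starts with this prefix
        }
        $node = $node[$char];
    }

    // Walk the subtree below the prefix and collect the complete words.
    $found = [];
    $stack = [[$node, $prefix]];
    while ($stack) {
        [$current, $word] = array_pop($stack);
        foreach ($current as $key => $child) {
            if ($key === '$end') {
                $found[] = $word;
            } else {
                $stack[] = [$child, $word . $key];
            }
        }
    }
    sort($found);
    return $found;
}

$trie = [];
foreach (preg_split('/\s+/', 'Some ABCDD text ABC DEFAD webABC') as $word) {
    trie_insert($trie, $word);
}

print_r(trie_words_with_prefix($trie, 'ABC')); // ABC and ABCDD, but not webABC
```

Each keyword lookup then costs time proportional to the keyword length plus the size of the matching subtree, instead of a scan over every word on the page.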

How about regex?

// $word is one keyword, $text is the HTML source to search
preg_match_all('/\b' . preg_quote($word, '/') . '\w*/u', $text, $matches);
foreach ($matches[0] as $match) {
    print($match);
}

(Sorry, my PHP is a bit rusty)

For a simple task like this, PHP regular expressions will serve well. The idea is to find all hyperlinks (and optionally some other HTML elements) and replace them with unique tokens. After that we are free to search for and replace the desired keywords, and at the end we restore the removed HTML elements.

$_keys = array( 'ABC', 'DEF', 'ABČ' );

$text = 
'Some <a href="#" >ABC</a> ABCDđD <strong>ABCDEF</strong> text. DEF
<p class="test">
    <a href="#">PHP</a> is <em>the</em> most ABCwidely used 
    langČuage ABC for ABČogr ammDEFing on the webABC DEFABC.
</p>';

// array for holding html items replaced with tokens
$tokens = array();
$id = 0;

// we will replace all links and strong elements (a|strong)
$text = preg_replace_callback( '/<(a|strong)[^>]*>.*?<\/\1\s*>/s', 
    function( $matches ) use ( &$tokens, &$id ) 
    {
        // store matches into the tokens array
        $tokens[ '#'.++$id.'#' ] = $matches[0];
        // replace matches with the unique id
        return '#'.$id.'#';
    }, 
    $text 
);

echo htmlentities( $text );
/* - outputs: Some #1# ABCDđD #2# text. DEF <p class="test"> #3# is <em>the</em> most ABCwidely used langČuage ABC for ABČogr ammDEFing on the webABC DEFABC. </p>
   - note the #1# #2# #3# tokens
*/

// wrap the words that start with any item in the $_keys array ( the u modifier is PCRE_UTF8 )
$text = preg_replace( '/\b('. implode( '|', $_keys ) . ')\w*\b/u', '<a href="">$0</a>', $text );

// replace the tokens with values
$text = str_replace( array_keys($tokens), array_values($tokens), $text );       

echo $text;

Info about UTF-8 strings in PHP regex:

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow