Question

I need a way to read content posted by users, find any hyperlinks that might have been included, wrap them in anchor tags, and add target and rel="nofollow" attributes to all those links.

I have come across some regex solutions like this:

 (?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

But other questions on SO about the same problem highly recommend NOT using regexes and using PHP's DOMDocument instead.

Whichever way is best, I need to add the attributes mentioned above in order to harden all external links on the website.


Solution

First of all, the guidelines you mentioned advise against parsing HTML with regexes. As far as I understand, what you are trying to do is parse plain text from users and convert it into HTML. For that purpose, regexes are usually just fine.

(Note that I assume you parse the text into links yourself and aren't using an external library for that. In the latter case, you'd need to fix the HTML the library outputs, and for this you should use DOMDocument to iterate over all <a> tags and add the proper attributes to them.)
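For that library case, a minimal DOMDocument sketch might look like the following. The function name addLinkAttributes is my own, and the LIBXML_HTML_NOIMPLIED / LIBXML_HTML_NODEFDTD flags (which stop DOMDocument from wrapping the fragment in html/body tags) require libxml 2.7.8 or newer:

```php
<?php
// Sketch: post-process HTML produced by a library, adding target and
// rel="nofollow" to every <a> tag via DOMDocument (no regexes involved).
function addLinkAttributes($html)
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect user HTML
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $anchor->setAttribute('target', '_blank');
        $anchor->setAttribute('rel', 'nofollow');
    }
    return trim($dom->saveHTML());
}
```

Unlike a regex, this touches every anchor the library produced, including ones whose markup you didn't anticipate.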

Now, you can parse it in two ways: server side, or client side.

Server side

Pros:

  • It outputs ready-to-use HTML.
  • It doesn't require users to enable JavaScript.

Cons:

  • You need to add the rel="nofollow" attribute so that bots don't follow the links.

Client side

Pros:

  • You don't need to add the rel="nofollow" attribute for bots, since they don't see the links in the first place: the links are generated with JavaScript, and bots usually don't execute JavaScript.

Cons:

  • Creating links that way requires users to enable JavaScript.
  • Implementing this kind of thing in JavaScript can make the site feel slow, especially if there is a lot of text to parse.
  • It makes caching the parsed text difficult.

I'll focus on implementing it server-side.

Server-side implementation

So, in order to parse links from user input and add any attributes you want to them, you can use something like this:

<?php
function replaceLinks($text)
{
    $regex = '/'
      . '(?<!\S)'                        // not preceded by a non-whitespace character
      . '(((ftp|https?)?:?)\/\/|www\.)'  // scheme, bare :// or //, or www.
      . '(\S+?)'                         // rest of the URL, non-greedy
      . '(?=$|\s|[,]|\.\W|\.$)'          // stop before trailing punctuation
      . '/m';

    return preg_replace_callback($regex, function($match)
    {
        return '<a'
          . ' target=""'
          . ' rel="nofollow"'
          . ' href="' . $match[0] . '">'
          . $match[0]
          . '</a>';
    }, $text);
}

Explanation:

  • (?<!\S): not preceded by a non-whitespace character.
  • (((ftp|https?)?:?)\/\/|www\.): accepts ftp://, http://, https://, ://, // and www. as the beginning of a URL.
  • (\S+?): matches everything that is not whitespace, in a non-greedy fashion.
  • (?=$|\s|[,]|\.\W|\.$): every URL must be followed by either the end of a line, a whitespace character, a comma, a dot followed by a non-word character (this is to allow .com, .co.jp etc. to match), or a dot followed by the end of a line.
  • m flag: makes $ in the lookahead match at the end of each line in multiline text.

Testing

Now, to support my claim that it works, I added a few test cases:

$tests = [];
$tests []= ['http://example.com', '<a target="" rel="nofollow" href="http://example.com">http://example.com</a>'];
$tests []= ['https://example.com', '<a target="" rel="nofollow" href="https://example.com">https://example.com</a>'];
$tests []= ['ftp://example.com', '<a target="" rel="nofollow" href="ftp://example.com">ftp://example.com</a>'];
$tests []= ['://example.com', '<a target="" rel="nofollow" href="://example.com">://example.com</a>'];
$tests []= ['//example.com', '<a target="" rel="nofollow" href="//example.com">//example.com</a>'];
$tests []= ['www.example.com', '<a target="" rel="nofollow" href="www.example.com">www.example.com</a>'];
$tests []= ['user@www.example.com', 'user@www.example.com'];
$tests []= ['testhttp://example.com', 'testhttp://example.com'];
$tests []= ['example.com', 'example.com'];
$tests []= [
    'test http://example.com',
    'test <a target="" rel="nofollow" href="http://example.com">http://example.com</a>'];
$tests []= [
    'multiline' . PHP_EOL . 'blah http://example.com' . PHP_EOL . 'test',
    'multiline' . PHP_EOL . 'blah <a target="" rel="nofollow" href="http://example.com">http://example.com</a>' . PHP_EOL . 'test'];
$tests []= [
    'text //example.com/slashes.php?parameters#fragment, some other text',
    'text <a target="" rel="nofollow" href="//example.com/slashes.php?parameters#fragment">//example.com/slashes.php?parameters#fragment</a>, some other text'];
$tests []= [
    'text //example.com. new sentence',
    'text <a target="" rel="nofollow" href="//example.com">//example.com</a>. new sentence'];

Each test case is composed of two parts: the source input and the expected output. I used the following code to determine whether the function passes the tests above:

foreach ($tests as $test)
{
    list ($source, $expected) = $test;
    $actual = replaceLinks($source);
    if ($actual !== $expected)
    {
        echo 'Test ' . $source . ' failed.' . PHP_EOL;
        echo 'Expected: ' . $expected . PHP_EOL;
        echo 'Actual:   ' . $actual . PHP_EOL;
        die;
    }
}
echo 'All tests passed' . PHP_EOL;

I think this gives you an idea of how to solve the problem. Feel free to add more tests and experiment with the regex itself to make it suit your specific needs.

OTHER TIPS

You might be interested in Goutte, where you can define your own filters, etc.

Get the content to post using jQuery and process it before posting it to PHP:

$('#idof_content').val(
  $('#idof_content').val().replace(/\b(http(s|):\/\/|)(www\.\S+)/ig,
    "<a href='http$2://$3' target='_blank' rel='nofollow'>$3</a>"));
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow