Question

I have a PHP script that looks for links on pages that it downloads with the curl_multi functions. The downloading is fine and I get the data, but my script randomly crashes when it encounters a page that has the URL listed as plain text instead of a link. This is the code:

$fishnof = strpos($nofresult, $supshorturl, 0);
$return[0] = ''; $return[1] = ''; // always good to cleanset

// Make sure we grabbed a link instead of a text url(no href)
if ($fishnof !== false) {
    $linkcheck = rev_strpos($nofresult,'href',$fishnof);
    $endthis = false;
    while($endthis !== true) {
        if($linkcheck > ($fishnof - 25)){ // 19 accounts for href="https://blog. 25 just in case
            $endthis = true;
            break;
        }
        $lastfishnof = $fishnof;
        $fishnof = strpos($nofresult,$supshorturl,$fishnof+1);
        if($fishnof === false){$fishnof = $lastfishnof;$linkcheck = rev_strpos($nofresult,'href',$fishnof);$endthis = true;break;}// This is the last occurrence of our URL on this page
        if($linkcheck > $fishnof){$linkcheck = rev_strpos($nofresult,'href',$fishnof);$endthis = true;break;} // We went around past the end of the string(probably don't need this)      
        $linkcheck = rev_strpos($nofresult,'href',$fishnof);
    }
    if($linkcheck < ($fishnof - 25)){ // 19 accounts for href="https://blog. 25 just in case
        $return[0] = 'Non-link.';
        $return[1] = '-';
        $nofresult = NULL; // Clean up our memory
        unset($nofresult); // Clean up our memory
        return $return;
    }
}

This is the custom rev_strpos, which just does a reverse strpos():

// Does a reverse strpos() (case-sensitive)
function rev_strpos(&$haystack, $needle, $foffset = 0){
    $length = strlen($haystack);
    $offset = $length - $foffset - 1;
    $pos = strpos(strrev($haystack), strrev($needle), $offset);
    return ($pos === false)?false:( $length - $pos - strlen($needle) );
}

so if:

$nofresult = '
Some text.Some text.Some text.Some text.Some text.Some text.
Some text.Some text.Some text.Some text.Some text.Some text.
Some text.Some text.Some text.Some text.Some text.Some text.
google.com Some text.Some text.Some text.Some text.Some text.
Some text.Some text.Some text.Some text.Some text.Some text.
Some text.Some text.Some text.Some text.Some text.Some text.
<a href="http://www.google.com">Google</a> Some text.Some text.
Some text.Some text.Some text.Some text.Some text.Some text.';

and

$supshorturl = "google.com";

This should find the position of the second occurrence of google.com, where it is inside an HTML href attribute. The problem is that it does not report any error before the crash. My error settings:

ini_set("display_errors", 1);
error_reporting(E_ALL & ~E_NOTICE);
set_error_handler('handle_errors');

My handle_errors() function logs all errors to a file. However, no errors are reported before the script crashes. Also, my curl_multi code processes many URLs, and sometimes it crashes on one URL and other times on another... I am ready to pull out my hair because this seems like such an easy deal... but here I am. Another point of note: if I remove the while loop there is no crash, and if the page has the URL in an href tag first it doesn't crash either. Please help me figure this thing out. Thanks a million!


Solution

I think you're making it harder than it needs to be. If rev_strpos is only needed to return the last instance of your search string, and if you aren't worried about case, use strripos instead.

From the PHP docs...

strripos — Find position of last occurrence of a case-insensitive string in a string

Description

int strripos ( string $haystack , string $needle [, int $offset = 0 ] )

Find position of last occurrence of a string in a string. Unlike strrpos(), strripos() is case-insensitive.
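
The case-sensitive strrpos() accepts a negative offset, which makes it search right to left from a given point. As a rough sketch (the helper name last_pos_before and the sample string are made up for illustration, and it assumes the goal is "the last href that starts at or before the URL"), the backward lookup could be done without reversing anything:

```php
<?php
// Hypothetical helper (not from the question): returns the start of the last
// occurrence of $needle that begins at or before byte index $from, or false.
function last_pos_before(string $haystack, string $needle, int $from)
{
    $length = strlen($haystack);
    if ($length === 0 || $needle === '') {
        return false;
    }
    // Clamp $from into the valid index range 0..$length-1.
    $from = max(0, min($from, $length - 1));
    // A negative offset tells strrpos() to search right to left;
    // $from - $length is index $from counted from the end of the string.
    return strrpos($haystack, $needle, $from - $length);
}

// Illustrative data only: one plain-text URL, then the same URL in a link.
$html = 'x.com and <a href="http://x.com">x</a>';
$urlPos = strpos($html, 'x.com', 1);                // occurrence inside the href
$hrefPos = last_pos_before($html, 'href', $urlPos); // nearest href before it
var_dump($urlPos, $hrefPos);
```

Because nothing is copied or reversed, this may also be gentler on memory than calling strrev() on a whole page inside a loop.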

If you need it to be case-sensitive (strrpos() is the built-in for that), or just want to use your own function for some reason, the problem is in how you are calculating the offset. Specifically in these 2 lines:

$offset = $length - $foffset - 1;
$pos = strpos(strrev($haystack), strrev($needle), $offset);

Using your sample "Some text..." and searching for "google.com", if we don't specify an offset it calculates the offset as length (roughly 500 chars) - $foffset (0 chars) - 1. Then you use strpos on a roughly 500-char string starting at about character 499, which is the very last character of the reversed string. You're never going to find anything that way.

Since you are reversing your haystack and also your needle, you need to "reverse" your offset. Change the line to:

$pos = strpos(strrev($haystack), strrev($needle), $length - $offset);

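(Actually, you should change your prior line to calculate the $offset where you want it to be, but you get the point... Keep in mind that $length - $offset equals $foffset + 1, so with this change $foffset counts characters to skip from the end of the string, whereas the original formula searches backward from position $foffset, which is how the loop in the question calls it.)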
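
To make the index arithmetic concrete, here is a minimal sketch of a strrev()-based reverse search with the offset semantics spelled out. The function name and the choice that $from means "search backward from this index" (with null meaning the end of the string) are assumptions for illustration, not the asker's exact function:

```php
<?php
// Illustrative sketch: start of the last occurrence of $needle that ends at
// or before index $from, or false. $from = null means "end of the string".
function rev_strpos_sketch(string $haystack, string $needle, ?int $from = null)
{
    $length = strlen($haystack);
    if ($length === 0 || $needle === '') {
        return false;
    }
    if ($from === null || $from > $length - 1) {
        $from = $length - 1;
    }
    if ($from < 0) {
        return false;
    }
    // Index p in the original is index $length - 1 - p in the reversed copy.
    $start = $length - 1 - $from;
    $pos = strpos(strrev($haystack), strrev($needle), $start);
    // A match at reversed index $pos starts at $length - $pos - strlen($needle).
    return ($pos === false) ? false : $length - $pos - strlen($needle);
}

$haystack = 'ab href cd href ef';
var_dump(rev_strpos_sketch($haystack, 'href'));     // int(11): whole string
var_dump(rev_strpos_sketch($haystack, 'href', 10)); // int(3): only up to index 10
var_dump(rev_strpos_sketch($haystack, 'href', 0));  // bool(false): nothing fits
```

The last call shows why an offset of 0 under these semantics finds nothing: there is no room for the needle at or before index 0.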

UPDATE:

Further to the recommendations about using Regex, it's really trivial to get locations:

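function getOffsets( $url, $baseRegex, $text ){
    $results = array();
    // Escape regex metacharacters in the URL (e.g. the "." in google.com)
    $regex = str_replace( '%URL%', preg_quote( $url, '/' ), $baseRegex );
    preg_match_all( $regex, $text, $matches, PREG_OFFSET_CAPTURE );

    // $match[0] is the matched text, $match[1] is its byte offset in $text
    foreach ( $matches[0] as $match ) {
        // stripos() because the patterns are case-insensitive (/i)
        $results[] = $match[1] + stripos( $match[0], $url );
    }

    return $results;
}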

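// $url and $text stand in for the question's $supshorturl and $nofresult
$linkRegex = '/<a[^>]*href="[^"]*%URL%[^"]*"[^>]*>/i';
$linkLocations = getOffsets( $url, $linkRegex, $text );
// The offsets below assume the sample text uses CRLF line endings;
// with LF-only line endings they come out as 388 and 184 instead.
//Array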
//(
//    [0] => 395
//)

$anyRegex = '/%URL%/i';
$allLocations = getOffsets( $url, $anyRegex, $text );
$nonlinkLocations = array_diff( $allLocations, $linkLocations );  //all non-links
//Array
//(
//    [0] => 188
//)

This really should be preferable to the rev_strpos & while loop gimmicks.
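
One thing to watch when a URL is spliced into a pattern is that regex metacharacters in it are interpreted, most commonly the dot. A small sketch of what preg_quote() changes, using a made-up test string:

```php
<?php
// Illustrative text: one near-miss and one real occurrence of the URL.
$text = 'see googleXcom and google.com';

// Unescaped: "." matches any character, so both spots match.
preg_match_all('/google.com/i', $text, $loose, PREG_OFFSET_CAPTURE);

// Escaped with preg_quote(): only the literal "google.com" matches.
$pattern = '/' . preg_quote('google.com', '/') . '/i';
preg_match_all($pattern, $text, $strict, PREG_OFFSET_CAPTURE);

var_dump(count($loose[0]));  // int(2)
var_dump(count($strict[0])); // int(1)
var_dump($strict[0][0][1]);  // int(19), offset of the literal match
```

Passing the delimiter as the second argument to preg_quote() also escapes any "/" in the URL, which matters for full URLs such as http://www.google.com.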

OTHER TIPS

The problem is this parse error: the unescaped double quotes around the href value end the double-quoted string early.

$nofresult = "
Some text.Some text.Some text.Some text.Some text.Some text.
Some text.Some text.Some text.Some text.Some text.Some text.
Some text.Some text.Some text.Some text.Some text.Some text.
google.com Some text.Some text.Some text.Some text.Some text.
Some text.Some text.Some text.Some text.Some text.Some text.
Some text.Some text.Some text.Some text.Some text.Some text.
<a href="http://www.google.com">Google</a> Some text.Some text.
Some text.Some text.Some text.Some text.Some text.Some text.";

... it should be

$nofresult = "
Some text.Some text.Some text.Some text.Some text.Some text.
Some text.Some text.Some text.Some text.Some text.Some text.
Some text.Some text.Some text.Some text.Some text.Some text.
google.com Some text.Some text.Some text.Some text.Some text.
Some text.Some text.Some text.Some text.Some text.Some text.
Some text.Some text.Some text.Some text.Some text.Some text.
<a href=\"http://www.google.com\">Google</a> Some text.Some text.
Some text.Some text.Some text.Some text.Some text.Some text.";
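
Escaping is one option; as a brief sketch, single quotes or a nowdoc avoid the backslashes entirely. The variable names here are illustrative only:

```php
<?php
// Single-quoted string: inner double quotes need no escaping.
$single = '<a href="http://www.google.com">Google</a>';

// Double-quoted string: inner double quotes must be escaped.
$double = "<a href=\"http://www.google.com\">Google</a>";

// Nowdoc: no escaping and no variable interpolation.
$nowdoc = <<<'HTML'
<a href="http://www.google.com">Google</a>
HTML;

// All three hold the same bytes.
var_dump($single === $double, $single === $nowdoc);
```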
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow