Question

So I have an interesting problem: I have a string, and for the most part I know what to expect:

http://www.someurl.com/st=????????

Except in this case, the ?'s are either upper case letters or numbers. The problem is, the string has garbage mixed in: the string is broken up into 5 or 6 pieces, and in between there's lots of junk: unprintable characters, foreign characters, as well as plain old normal characters. In short, stuff that's apt to look like this: Nyþ=mî;ëMÝ×nüqÏ

Usually the last 8 characters (the ?'s) are together right at the end, so at the moment I just have PHP grab the last 8 chars and hope for the best. Occasionally, that doesn't work, so I need a more robust solution.

The problem is technically unsolvable, but I think the best solution is to grab characters from the end of the string while they are upper case or numeric. If I get 8 or more, assume that is correct. Otherwise, find the st= and grab characters going forward, as many as I need to fill up the 8-character quota. Is there a regex way to do this, or will I need to roll up my sleeves and go nested-loop style?

Update:

To clear up some confusion, I get an input string that's like this:

[garbage]http:/[garbage]/somewe[garbage]bsite.co[garbage]m/something=[garbage]????????

except the garbage is in unpredictable locations in the string (except the end is never garbage), and has unpredictable length (at least, I haven't been able to find a pattern in either). Usually the ?s are all together, hence me just grabbing the last 8 chars, but sometimes they aren't, which results in some missing data and returned garbage :-\

Solution

$var = '†http://þ=www.ex;üßample-website.î;ëcomÝ×ü/joy_hÏere.html'; // test case

$clean = join(
    array_filter(
        str_split($var, 1),
        function ($char) {
            return (
                array_key_exists(
                    $char,
                    array_flip(array_merge(
                        range('A','Z'),
                        range('a','z'),
                        range((string)'0',(string)'9'),
                        array(':','.','/','-','_')
                    ))
                )
            );
        }
    )
);

Hah, that was a joke. Here's a regex for you:

$clean = preg_replace('/[^A-Za-z0-9:.\/_-]/','',$var);
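If the garbage also contains ordinary letters, stripping characters alone won't isolate the code, so here is a rough sketch of the heuristic described in the question: take the trailing run of upper-case letters and digits, and if it is shorter than 8, fill the gap with the first such characters after the marker. The function name extract_code and its $marker and $length parameters are made up for illustration, and the sketch assumes the marker (st=) itself arrives intact, which the question does not guarantee.

```php
<?php
// Hypothetical helper: the name and parameters are illustrative, not from the question.
function extract_code(string $raw, string $marker = 'st=', int $length = 8): ?string
{
    // Trailing run of upper-case letters and digits (byte mode, so
    // multibyte garbage never matches the character class).
    preg_match('/[A-Z0-9]*\z/', $raw, $m);
    $tail = $m[0];
    if (strlen($tail) >= $length) {
        return substr($tail, -$length);
    }

    // Not enough at the end: look for the marker (assumed to be undamaged).
    $pos = strpos($raw, $marker);
    if ($pos === false) {
        return null;
    }

    // Region between the marker and the trailing run.
    $start  = $pos + strlen($marker);
    $middle = substr($raw, $start, max(0, strlen($raw) - $start - strlen($tail)));

    // Take the first missing characters that look like part of the code.
    $missing = $length - strlen($tail);
    preg_match_all('/[A-Z0-9]/', $middle, $mm);
    if (count($mm[0]) < $missing) {
        return null;
    }
    return implode('', array_slice($mm[0], 0, $missing)) . $tail;
}

echo extract_code('http://www.someurl.com/st=AB12CD34'), "\n";     // AB12CD34
echo extract_code('http://www.someurl.com/st=AB1î;ëx2CD34'), "\n"; // AB12CD34
```

This is still a guess: any upper-case letter or digit in the garbage between the marker and the tail will be mistaken for part of the code.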

OTHER TIPS

As stated, the problem is unsolvable. If the garbage can contain "plain old normal characters", and the garbage can fall at the end of the string, then you cannot know whether the target string from this sample is "ABCDEFGH" or "BCDEFGHI":

__http:/____/somewe___bsite.co____m/something=__ABCDEFGHI__

What do these values represent? If you want to retain all of it, just without having to deal with garbage in your database, maybe you should hex-encode it using bin2hex().
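A minimal sketch of the hex-encoding idea, assuming the raw value is an arbitrary byte string (the sample bytes below are just a Latin-1 reading of the garbage shown in the question):

```php
<?php
// Assumed sample input: arbitrary bytes, including unprintable ones.
$raw = "Ny\xFE=m\xEE;\xEBM\xDD\xD7n\xFCq\xCF";

// bin2hex() yields plain ASCII (0-9, a-f), two characters per input byte,
// so it is safe to store in any text column.
$hex = bin2hex($raw);

// hex2bin() reverses the encoding, so nothing is lost.
$restored = hex2bin($hex);

echo $hex, "\n";
var_dump($restored === $raw);
```

The stored value is twice as long as the original, which is the price of keeping every byte.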

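You can use this regular expression to check whether the string contains any of these special characters:

if (preg_match('/[\'^£$%&*()}{@#~?><>,|=_+¬-]/', $string) === 1) { /* $string contains a special character */ }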

Licensed under: CC-BY-SA with attribution