Question

So, I'm going to admit, I've never actually looked into regular expressions. What I'm trying to do is capture the ID of a Reddit URL. The URLs will be formatted like /r/AskReddit/comments/1234 or /r/AskReddit/1234/ or some variation (missing end slash) - it shouldn't match the dsada/... in /r/AskReddit/comments/1234/dsada/...

Here's what I've tried so far:

/r/.*/[comments/]?([a-z0-9])/?

It matches some strange things, though:

When trying to match /r/sdifsas/sdfad it will actually match /r/sdifsas/sd and it will even match /r/sdifsas/sdfad/aasdasd/a and /r/sdifsas/comments/a/d

I know for a fact I'm doing something wrong, and I have a feeling it's to do with the .*. How can I replace .* while still matching everything? Also, how do I make the regex capture more than one (or two in some of the random matches above) of the ending letters?

One more thing, if it isn't too much bother can you explain what each of the things you used do please? I'm a bit of a newbie to this.


Solution

Description

This regex validates the string by requiring a /r/ followed by the name of a subreddit, then captures the id, provided it appears directly after the subreddit name or after comments/. It relies on the case-insensitive i option, because the character classes and the literal Comments are each written in a single case. By also using the m option, so that ^ matches the start of a line and $ matches the end of a line, this regex can be run against a long string of text containing any number of newline-delimited reddit links, as demonstrated in the PHP example.

^\/r\/([a-z0-9]*)\/(?:Comments\/)?([a-z0-9]*)(?:\/?.*?)?$


Groups

  0. matches the entire string
  1. captures the subreddit name
  2. captures the id

PHP Code Example:

You didn't specify a language so I picked PHP to show how this regex would work.

<?php
$sourcestring="/r/AskReddit/comments/1234
/r/AskReddit/2345/
/r/AskReddit/comments/3456/dsada/
/r/IHeartKittens/comments/4567/dsada/
/r/cats/comments/i2sz9/we_rescued_a_kitten_last_month/
/r/IAmA/comments/18pik4/astronaut_chris_hadfield_comments/c8gud3h";
preg_match_all('/^\/r\/([a-z0-9]*)\/(?:Comments\/)?([a-z0-9]*)(?:\/?.*?)?$/im',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
 

$matches Array:
(
    [0] => Array
        (
            [0] => /r/AskReddit/comments/1234
            [1] => /r/AskReddit/2345/
            [2] => /r/AskReddit/comments/3456/dsada/
            [3] => /r/IHeartKittens/comments/4567/dsada/
            [4] => /r/cats/comments/i2sz9/we_rescued_a_kitten_last_month/
            [5] => /r/IAmA/comments/18pik4/astronaut_chris_hadfield_comments/c8gud3h
        )

    [1] => Array
        (
            [0] => AskReddit
            [1] => AskReddit
            [2] => AskReddit
            [3] => IHeartKittens
            [4] => cats
            [5] => IAmA
        )

    [2] => Array
        (
            [0] => 1234
            [1] => 2345
            [2] => 3456
            [3] => 4567
            [4] => i2sz9
            [5] => 18pik4
        )

)
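Since the question asks what each piece of the pattern does, here is a commented breakdown. It is only a sketch: Python is an arbitrary choice (the question names no language), the name REDDIT_LINK is made up for illustration, and the input is a subset of the sample links above. Python's re.VERBOSE flag allows whitespace and comments inside the pattern, and the unescaped / is fine because Python has no pattern delimiters.

```python
import re

# Verbose-mode rewrite of the pattern above; REDDIT_LINK is a hypothetical name.
REDDIT_LINK = re.compile(
    r"""
    ^/r/                # the line must start with the literal "/r/"
    ([a-z0-9]*)         # group 1: subreddit name (letters and digits)
    /                   # the slash after the subreddit name
    (?:comments/)?      # optional literal "comments/", grouped but not captured
    ([a-z0-9]*)         # group 2: the id
    (?:/?.*?)?          # optionally a slash and anything else, up to...
    $                   # ...the end of the line
    """,
    re.IGNORECASE | re.MULTILINE | re.VERBOSE,
)

text = """/r/AskReddit/comments/1234
/r/AskReddit/2345/
/r/cats/comments/i2sz9/we_rescued_a_kitten_last_month/"""

# Collect group 2 (the id) from every line that matches.
ids = [m.group(2) for m in REDDIT_LINK.finditer(text)]
print(ids)  # ['1234', '2345', 'i2sz9']
```

re.IGNORECASE and re.MULTILINE play the role of the i and m options in the PHP call.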

Other tips

First, the .* in your regex matches everything up to the end of the string and then backtracks, one character at a time, until the rest of the pattern can succeed.

Second, [...] is a character class: it matches any single one of the characters inside it, not the word as a whole. The ? after it makes that single character optional.

So, in your test case, /r/sdifsas/sdfad matches as /r/sdifsas/sd: the .*/ matches up to the last forward slash, the following s is one of the characters inside [comments/], and the d after it falls in the range a-z.

The test /r/sdifsas/sdfad/aasdasd/a is similar: .*/ matches up to the last forward slash, the letter a is not inside [comments/], so that optional part is skipped and the a is matched by the range a-z. The same happens with /r/sdifsas/comments/a/d.
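The walkthrough above can be checked with a short script. This is a sketch in Python (an arbitrary choice, since the question names no flavour); the variable names are illustrative only, and the pattern is the asker's original, unchanged.

```python
import re

# The asker's original pattern, unchanged.
original = re.compile(r"/r/.*/[comments/]?([a-z0-9])/?")

urls = [
    "/r/sdifsas/sdfad",
    "/r/sdifsas/sdfad/aasdasd/a",
    "/r/sdifsas/comments/a/d",
]

# Map each URL to (whole match, captured group) to see where the match lands.
results = {}
for url in urls:
    m = original.search(url)
    results[url] = (m.group(0), m.group(1))
    print(url, "->", results[url])

# The first URL matches only as "/r/sdifsas/sd" and captures "d":
# the "s" was consumed by [comments/], which is a character class, not a word.
```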

I don't know which flavour of regex you are using, but a shot in the dark would be something like:

/r/.*?/(?:comments/)?([a-z0-9]*)/? 

It uses a lazy .*? so that the subreddit part stops at the first forward slash, a non-capturing group (?:...) for the optional comments/ part of the path, and a * to match zero or more letters and/or digits.
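To try that pattern out, here is a minimal sketch in Python (assumed only for illustration; the variable names are made up). It runs the pattern against the URL shapes from the question.

```python
import re

# The pattern suggested above, without the trailing space.
suggested = re.compile(r"/r/.*?/(?:comments/)?([a-z0-9]*)/?")

urls = [
    "/r/AskReddit/comments/1234",
    "/r/AskReddit/1234/",
    "/r/AskReddit/comments/1234/dsada/",
]

# Group 1 is the id in each case; the trailing "dsada/" is never reached.
captured = [suggested.search(u).group(1) for u in urls]
print(captured)  # ['1234', '1234', '1234']

# Because * allows zero characters, a URL with no id still matches
# and simply captures an empty string.
empty = suggested.search("/r/AskReddit/").group(1)
print(repr(empty))  # ''
```

If an empty id should count as no match, swapping the * for a + (one or more) is one way to require at least one character.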

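Try

/r/[^/]+/(?:comments/)?([a-z0-9]+)/?

instead. Here [^/]+ matches the subreddit name without crossing a /, (?:comments/)? is an optional literal comments/, and ([a-z0-9]+) captures the whole id rather than a single character.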

Your solution suffers from 2 flaws:

  1. Your .* portion matches everything, including the / characters that separate the parts of your URLs.
  2. You're matching greedily, which is the default for most regex engines as far as I know. 'Greedily' means that a subpattern gobbles up as many characters as it can while still allowing the overall match to succeed.

Together, 1 and 2 make the pattern match larger portions of the URLs than you intend.
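The difference is easy to see side by side. The following is a small sketch in Python (chosen arbitrarily; the variable names are illustrative) comparing a greedy .*, a lazy .*?, and a negated character class on one of the URLs from the question.

```python
import re

url = "/r/AskReddit/comments/1234/dsada/"

# Greedy: .* runs to the end, then backs off only as far as the LAST slash.
greedy = re.match(r"/r/(.*)/", url).group(1)

# Lazy: .*? grows one character at a time and stops at the FIRST slash.
lazy = re.match(r"/r/(.*?)/", url).group(1)

# Negated class: [^/]+ cannot cross a slash at all, greedy or not.
no_slash = re.match(r"/r/([^/]+)/", url).group(1)

print(greedy)    # AskReddit/comments/1234/dsada
print(lazy)      # AskReddit
print(no_slash)  # AskReddit
```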

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow