Question

I'm trying to write a rule to match on a top-level domain (TLD) followed by five digits. My problem is that my existing PCRE matches what I have described, but much later in the URL than I want it to. I want it to match on the first occurrence of a TLD, not anywhere else. The easy way to check for this is to match on the TLD only when it has not been preceded at some point by the "/" character. I tried using a negative lookbehind, but that doesn't work because it only looks back a single character.

e.g.: How it is currently working

domain.net/stuff/stuff=www.google.com/12345

matches .com/12345 even though I do not want this match because it is not the first TLD in the URL

e.g.: How I want it to work

domain.net/12345/stuff=www.google.com/12345

matches on .net/12345 and ignores the later match on .com/12345

My current expression

(\.[a-z]{2,4})/\d{5}

EDIT: rewrote it so perhaps the problem is clearer in case anyone in the future has this same issue.


Solution

You're pretty close :)

You just need to make sure that, before matching what you're looking for (i.e. (\.[a-z]{2,4})/\d{5}), no / has appeared since the beginning of the line.

I suggest simply prepending ^[^\/]* to your current regex (and moving the \. out of the capture group, so the group holds only the letters of the TLD). The resulting regex is:

^[^\/]*\.([a-z]{2,4})/\d{5}

How does it work?

  • ^ asserts that this is the beginning of the tested String
  • [^\/]* accepts any sequence of characters that doesn't contain /
  • \.([a-z]{2,4})/\d{5} is the pattern you want to match (a . followed by 2 to 4 lowercase letters, then a / and five digits; more digits may follow, since the pattern is not anchored at the end).
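
As a quick check of the explanation above, here is a minimal sketch in Python, whose re module accepts this pattern unchanged. The function name first_tld is made up for illustration, and the two URLs are the ones from the question.

```python
import re

# Pattern from the answer: "^[^/]*" forbids any "/" before the TLD,
# so only the first TLD in the URL can take part in the match.
# (The "\/" escape is unnecessary in Python, so a plain "/" is used.)
FIRST_TLD = re.compile(r"^[^/]*\.([a-z]{2,4})/\d{5}")


def first_tld(url):
    """Return the TLD letters if the first TLD is followed by /ddddd, else None.

    Hypothetical helper, not part of the original answer.
    """
    match = FIRST_TLD.search(url)
    return match.group(1) if match else None


# First TLD is followed by "/stuff", so the later ".com/12345" is ignored.
print(first_tld("domain.net/stuff/stuff=www.google.com/12345"))  # None

# First TLD is followed by five digits, so it matches.
print(first_tld("domain.net/12345/stuff=www.google.com/12345"))  # net
```

One caveat with this sketch: a URL that starts with a scheme, such as http://domain.net/12345, will not match, because the // in the scheme comes before any dot.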

Here is a permalink to a working example on regex101.
Cheers!

OTHER TIPS

You can use this regex:

'|^(\w+://)?([\w-]+\.)+\w+/\d{5}|'
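
The surrounding | characters appear to be PHP-style pattern delimiters rather than part of the expression. A minimal sketch of the same pattern in Python, with the delimiters stripped (in Python a leading | would mean alternation with an empty pattern), might look like this; the helper name host_and_digits is an assumption for illustration only.

```python
import re

# Same pattern without the "|" delimiters: an optional scheme, one or more
# dot-terminated host labels, a final label, then "/" and five digits.
HOST_THEN_DIGITS = re.compile(r"^(\w+://)?([\w-]+\.)+\w+/\d{5}")


def host_and_digits(url):
    """Return the matched prefix of the URL, or None (hypothetical helper)."""
    match = HOST_THEN_DIGITS.match(url)
    return match.group(0) if match else None


print(host_and_digits("domain.net/12345/stuff=www.google.com/12345"))
# domain.net/12345
print(host_and_digits("http://domain.net/12345/x"))
# http://domain.net/12345
print(host_and_digits("domain.net/stuff/stuff=www.google.com/12345"))
# None
```

Unlike the pattern in the accepted answer, this one also accepts a leading scheme such as http://.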

Online Demo: http://regex101.com/

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow