Need a regex to match a general URL

https://stackoverflow.com/questions/307141

08-07-2019

Question

I need to test for a general URL using any protocol (http, https, shttp, ftp, svn, mysql, and others I don't know about).

My first pass is this:

\w+://(\w+\.)+[\w+](/[\w]+)(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?

(PCRE and .NET, so nothing too fancy)

Solution

According to RFC 2396:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
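To make the capture groups concrete, here is a minimal Python sketch of that pattern in use. The group numbering (2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment) follows Appendix B of the RFC; the helper name `split_uri` and the sample URL are only illustrative assumptions. Note that the pattern splits a string into parts rather than validating it, so it matches almost any input.

```python
import re

# The pattern quoted above, from RFC 2396, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Hypothetical helper: return the five URI components.

    Every part of the pattern is optional, so match() always succeeds;
    components that are absent come back as None.
    """
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

parts = split_uri("svn://example.org/repo/trunk?rev=42#readme")
print(parts)
```

Because nothing is mandatory, a string such as "not a url" still matches (it is treated as a bare path with no scheme), which is why the answers below tighten the pattern.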

Other tips

Adding that regex as a wiki answer:

[\w+-]+://([a-zA-Z0-9]+\.)+[a-zA-Z0-9]+(/[%\w]+)(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?

Option 2 (re: CMS)

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

But that is too lax for anything sane. Here it is trimmed down to be more restrictive and to tell the individual parts apart:

proto      ://  name      : pass      @  server    :port      /path     ? args
^([^:/?#]+)://(([^/?#@:]+(:[^/?#@:]+)?@)?[^/?#@:]+(:[0-9]+)?)(/[^?#]*)(\?([^#]*))?
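As a rough illustration of what the stricter pattern accepts, the Python sketch below maps its groups to the labels in the diagram. The function name `parse`, the dictionary keys, and the sample URLs are assumptions made for the example. Two consequences of the pattern are worth noting: the `://` separator and a path starting with `/` are both mandatory, so a bare `http://example.com` is rejected.

```python
import re

# The restrictive pattern quoted above.
STRICT_RE = re.compile(
    r"^([^:/?#]+)://(([^/?#@:]+(:[^/?#@:]+)?@)?[^/?#@:]+(:[0-9]+)?)(/[^?#]*)(\?([^#]*))?"
)

def parse(url):
    """Hypothetical helper: return the matched parts, or None if there is no match."""
    m = STRICT_RE.match(url)
    if m is None:
        return None
    return {
        "proto": m.group(1),
        "authority": m.group(2),  # name:pass@server:port as one string
        "path": m.group(6),
        "args": m.group(8),       # query string without the leading '?'
    }

full = parse("ftp://user:secret@example.com:21/pub/file.txt?x=1")
no_path = parse("http://example.com")             # no '/' after the host
no_slashes = parse("mailto:someone@example.com")  # no '://'
print(full, no_path, no_slashes)
```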

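I came at this from a slightly different direction. I wanted to emulate gchat's ability to match something.co.uk and turn it into a link. So I went with a regex that looks for a "." that is not followed by another period and has no whitespace on either side, and then grabs everything around it until it hits whitespace. It does match a period at the end of a URI, but I strip that off later. So if you prefer false positives over missing some potential matches, this may be an option.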

import re

url_re = re.compile(r"""
           [^\s]             # not whitespace
           [a-zA-Z0-9:/\-]+  # the protocol and domain name
           \.(?!\.)          # A literal '.' not followed by another
           [\w\-\./\?=&%~#]+ # country and path components
           [^\s]             # not whitespace""", re.VERBOSE) 

url_re.findall('http://thereisnothing.com/a/path adn some text www.google.com/?=query#%20 https://somewhere.com other-countries.co.nz. ellipsis... is also a great place to buy. But try text-hello.com ftp://something.com')

['http://thereisnothing.com/a/path',
 'www.google.com/?=query#%20',
 'https://somewhere.com',
 'other-countries.co.nz.',
 'text-hello.com',
 'ftp://something.com']
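The answer mentions stripping the trailing period afterwards but does not show that step. One possible way to do it, as a sketch only (the helper name `find_links` and the use of `str.rstrip` are assumptions, not the author's code):

```python
import re

# Same pattern as above, repeated so this snippet runs on its own.
url_re = re.compile(r"""
    [^\s]             # not whitespace
    [a-zA-Z0-9:/\-]+  # the protocol and domain name
    \.(?!\.)          # a literal '.' not followed by another
    [\w\-\./\?=&%~#]+ # country and path components
    [^\s]             # not whitespace""", re.VERBOSE)

def find_links(text):
    """Hypothetical helper: find candidate links and drop any trailing periods."""
    return [match.rstrip(".") for match in url_re.findall(text)]

links = find_links("see other-countries.co.nz. or ftp://something.com")
print(links)
```

This removes the sentence-ending period from matches such as 'other-countries.co.nz.' while leaving periods inside the link untouched.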

License: CC-BY-SA with attribution