Need a regular expression that matches general URLs

https://stackoverflow.com/questions/307141

08-07-2019
Question

I need to test for general URLs using any protocol (http, https, shttp, ftp, svn, mysql, and ones I don't know about).

My first pass is this:

\w+://(\w+\.)+[\w+](/[\w]+)(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?

(PCRE and .NET, so nothing too fancy.)

Solution

According to RFC 2396:
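
One detail of the first pass is worth checking before going further: `[\w+]` is a character class, so it matches exactly one character (a word character or a literal `+`), not a run of word characters. The sketch below, written in Python purely for illustration, compares the pattern as posted with a variant that assumes `\w+` was intended. The sample URL and the variable names are hypothetical.

```python
import re

# The first-pass pattern exactly as posted in the question.
as_posted = re.compile(r"\w+://(\w+\.)+[\w+](/[\w]+)(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?")

# Assumption: the last host label was meant to be one or more word characters.
assumed_intent = re.compile(r"\w+://(\w+\.)+\w+(/[\w]+)(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?")

sample = "http://example.com/path"  # hypothetical test URL

print(as_posted.fullmatch(sample) is not None)       # False: only one char allowed after the last dot
print(assumed_intent.fullmatch(sample) is not None)  # True
```

Both variants still require exactly one `/segment` after the host, so a bare `http://example.com` is rejected either way.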

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
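
As a rough illustration of how this pattern is used, here is a minimal Python sketch. RFC 2396 (Appendix B) assigns group 2 to the scheme, 4 to the authority, 5 to the path, 7 to the query, and 9 to the fragment. The helper name `split_uri` is made up for this example. Note that the pattern splits a URI reference into parts rather than validating it: every component is optional, so it matches any string.

```python
import re

# The URI-splitting pattern from RFC 2396, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Split a URI reference into its five components (hypothetical helper)."""
    m = URI_RE.match(uri)  # always matches, because every part is optional
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

parts = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")
print(parts["scheme"])     # http
print(parts["authority"])  # www.ics.uci.edu
print(parts["path"])       # /pub/ietf/uri/
print(parts["query"])      # None
print(parts["fragment"])   # Related

# Plain text is accepted too: it simply ends up in the path component.
print(split_uri("not a url")["scheme"])  # None
```

Because of that, any check such as "has a scheme and an authority" has to be done on the captured groups afterwards.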

Other tips

Adding that RegEx as a wiki answer:

[\w+-]+://([a-zA-Z0-9]+\.)+[a-zA-Z0-9]+(/[%\w]+)(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?

Option 2 (Re CMS)

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

But that is too lax for anything sane, so here it is trimmed to be more restrictive and to distinguish the individual parts:

proto      ://  name      : pass      @  server    :port      /path     ? args
^([^:/?#]+)://(([^/?#@:]+(:[^/?#@:]+)?@)?[^/?#@:]+(:[0-9]+)?)(/[^?#]*)(\?([^#]*))?
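
To see what the trimmed pattern captures, here is a small Python sketch. The sample URLs and the name `TRIMMED_RE` are hypothetical. Counting opening parentheses, group 1 is the protocol, group 2 the whole `name:pass@server:port` block, group 5 the port (with its colon), group 6 the path, and group 8 the query arguments.

```python
import re

# The trimmed pattern from above, unchanged.
TRIMMED_RE = re.compile(
    r"^([^:/?#]+)://(([^/?#@:]+(:[^/?#@:]+)?@)?[^/?#@:]+(:[0-9]+)?)(/[^?#]*)(\?([^#]*))?"
)

m = TRIMMED_RE.match("ftp://user:secret@example.org:2121/pub/file.txt?x=1")
print(m.group(1))  # ftp
print(m.group(2))  # user:secret@example.org:2121
print(m.group(5))  # :2121
print(m.group(6))  # /pub/file.txt
print(m.group(8))  # x=1

# Unlike the RFC pattern, this one rejects input without "://" or without a path.
print(TRIMMED_RE.match("http://example.com"))  # None (no "/" after the host)
print(TRIMMED_RE.match("example.com/path"))    # None (no protocol)
```

If bare hosts such as `http://example.com` should be accepted, the path group would have to be made optional. That is a design choice, not something the pattern above does.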

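I came at this from a slightly different direction. I wanted to emulate gchat's ability to match something.co.uk and turn it into a link. So I used a regex that looks for a . that is not followed by another period and has no whitespace on either side, then grabs everything around it until it hits whitespace. It does match a period at the end of the URI, but I strip that off later. So this could be an option if you prefer false positives over missing some possibilities.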

import re

url_re = re.compile(r"""
           [^\s]             # not whitespace
           [a-zA-Z0-9:/\-]+  # the protocol and domain name
           \.(?!\.)          # A literal '.' not followed by another
           [\w\-\./\?=&%~#]+ # country and path components
           [^\s]             # not whitespace""", re.VERBOSE) 

url_re.findall('http://thereisnothing.com/a/path adn some text www.google.com/?=query#%20 https://somewhere.com other-countries.co.nz. ellipsis... is also a great place to buy. But try text-hello.com ftp://something.com')

['http://thereisnothing.com/a/path',
 'www.google.com/?=query#%20',
 'https://somewhere.com',
 'other-countries.co.nz.',
 'text-hello.com',
 'ftp://something.com']
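
The answer mentions removing the trailing period afterwards but does not show that step. One possible way to do it, as a sketch only (the helper name `find_urls` and the sample sentence are made up here), is to strip trailing periods from each match:

```python
import re

# Same pattern as above.
url_re = re.compile(r"""
           [^\s]             # not whitespace
           [a-zA-Z0-9:/\-]+  # the protocol and domain name
           \.(?!\.)          # A literal '.' not followed by another
           [\w\-\./\?=&%~#]+ # country and path components
           [^\s]             # not whitespace""", re.VERBOSE)

def find_urls(text):
    """Return URL-like matches with any trailing periods removed (hypothetical helper)."""
    return [match.rstrip(".") for match in url_re.findall(text)]

print(find_urls("see other-countries.co.nz. or www.google.com/?=query#%20"))
# ['other-countries.co.nz', 'www.google.com/?=query#%20']
```

This assumes a URL never legitimately ends in a period, which fits the false-positive-tolerant approach described above.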

License: CC-BY-SA with attribution

Not affiliated with StackOverflow