Getting parts of a URL (Regex)

https://stackoverflow.com/questions/27745

09-06-2019

Question

Given the URL (single line):
http://test.example.com/dir/subdir/file.html

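How can I extract the following parts using regular expressions:

The subdomain (test)
The domain (example.com)
The path without the file (/dir/subdir/)
The file (file.html)
The path with the file (/dir/subdir/file.html)
The URL without the path (http://test.example.com)
(add any other part that you think would be useful)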

The regex should work correctly even if I enter the following URL:

http://example.example.com/example/example/example.html

Solution

A single regex to parse and break up a full URL, including query parameters and anchors, e.g.

https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$

RegExp positions:

url: RegExp['$&'],
protocol: RegExp.$2,
host: RegExp.$3,
path: RegExp.$4,
file: RegExp.$6,
query: RegExp.$7,
hash: RegExp.$8

You could then further parse the host ('.' delimited) quite easily.
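As a rough illustration, the pattern above can be tried in Python with the same group numbers. The helper name `parse_url` and the naive "last two labels" split of the host are assumptions made for this sketch; that split would mislabel hosts under multi-part suffixes such as co.uk.

```python
import re

# The single-regex pattern from this answer, unchanged.
URL_RE = re.compile(
    r"^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$"
)

def parse_url(url):
    """Hypothetical helper: map the numbered groups to names."""
    m = URL_RE.match(url)
    if m is None:
        return None
    host = m.group(3)
    labels = host.split(".")
    return {
        "protocol": m.group(2),
        "host": host,
        "path": m.group(4),
        "file": m.group(6),
        "query": m.group(7),
        "hash": m.group(8),
        # Naive split (assumption): the domain is the last two labels.
        "subdomain": ".".join(labels[:-2]),
        "domain": ".".join(labels[-2:]),
    }

parts = parse_url("http://test.example.com/dir/subdir/file.html")
print(parts)
```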

What I would do is use something like this:

/*
    ^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$
*/
proto $1
host $2
port $3
the-rest $4

and then further parse "the rest" to be as specific as possible. Doing it all in one regex is a bit crazy.
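A minimal sketch of that two-pass idea in Python: the coarse regex runs first, then "the rest" is peeled apart with plain string operations. The function name `split_url`, the dictionary keys and the sample URL are assumptions for illustration only.

```python
import re

# First pass: the coarse pattern from the answer above.
SPLIT_RE = re.compile(r"^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$")

def split_url(url):
    m = SPLIT_RE.match(url)
    if m is None:
        return None
    proto, host, port, rest = m.groups()
    # Second pass over "the rest": fragment first, then query, then file name.
    rest, _, fragment = rest.partition("#")
    path, _, query = rest.partition("?")
    directory, sep, filename = path.rpartition("/")
    return {
        "proto": proto,
        "host": host,
        "port": port,
        "directory": directory + sep,
        "file": filename,
        "query": query,
        "fragment": fragment,
    }

result = split_url("http://www.example.com:123/foo/bar.html?fox=trot#foo")
print(result)
```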

Other tips

I realize I'm late to the party, but there is a simple way to let the browser parse a URL for you without a regex:

var a = document.createElement('a');
a.href = 'http://www.example.com:123/foo/bar.html?fox=trot#foo';

['href','protocol','host','hostname','port','pathname','search','hash'].forEach(function(k) {
    console.log(k+':', a[k]);
});

/*//Output:
href: http://www.example.com:123/foo/bar.html?fox=trot#foo
protocol: http:
host: www.example.com:123
hostname: www.example.com
port: 123
pathname: /foo/bar.html
search: ?fox=trot
hash: #foo
*/
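Outside the browser the same idea applies: many standard libraries ship a URL parser. As a sketch, Python's `urllib.parse.urlsplit` exposes comparable fields, although the attribute names differ from the DOM ones and `port` comes back as an integer.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.example.com:123/foo/bar.html?fox=trot#foo")

# Print the rough equivalents of the anchor-element properties shown above.
for name in ("scheme", "netloc", "hostname", "port", "path", "query", "fragment"):
    print(name + ":", getattr(parts, name))
```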

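I'm a few years late to the party, but I'm surprised no one has mentioned that the Uniform Resource Identifier specification has a section on parsing URIs with a regular expression. The regular expression, written by Berners-Lee et al., is: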

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
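The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to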

http://www.ics.uci.edu/pub/ietf/uri/#Related

results in the following subexpression matches:
$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related

For what it's worth, I found that I had to escape the forward slashes in JavaScript:

^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
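As an illustration, the same expression can be used in Python, where the forward slashes need no escaping. The variable names are assumptions; the group numbers follow the numbering shown above.

```python
import re

# The URI-splitting expression quoted above from the specification.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

m = URI_RE.match("http://www.ics.uci.edu/pub/ietf/uri/#Related")

# Groups 2, 4, 5, 7 and 9 carry the five URI components.
scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)
print(scheme, authority, path, query, fragment)
```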

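I found that the highest voted answer (hometoast's answer) doesn't work perfectly for me. Two problems:

It cannot handle port numbers.
The hash part is broken.

The following is a modified version: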

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$

The positions of the parts are as follows:

int SCHEMA = 2, DOMAIN = 3, PORT = 5, PATH = 6, FILE = 8, QUERYSTRING = 9, HASH = 12
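A small sketch of how those positions can be used, written in Python for illustration. The sample URL with a port, query and hash is an assumption chosen to exercise the two fixes.

```python
import re

# Group positions as listed above.
SCHEMA, DOMAIN, PORT, PATH, FILE, QUERYSTRING, HASH = 2, 3, 5, 6, 8, 9, 12

# The modified pattern from this answer, unchanged.
URL_RE = re.compile(
    r"^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)"
    r"([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$"
)

m = URL_RE.match("http://www.example.com:8080/dir/subdir/file.html?a=1#top")
for name, index in [("schema", SCHEMA), ("domain", DOMAIN), ("port", PORT),
                    ("path", PATH), ("file", FILE),
                    ("querystring", QUERYSTRING), ("hash", HASH)]:
    print(name, "=", m.group(index))
```

Note that with these positions the query string keeps its leading "?" while the hash is returned without the "#".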

Edit posted by an anonymous user:

function getFileName(path) {
    return path.match(/^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/[\w\/-]+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$/i)[8];
}

I needed a regular expression to match all URLs and made this one:

/(?:([^\:]*)\:\/\/)?(?:([^\:\@]*)(?:\:([^\@]*))?\@)?(?:([^\/\:]*)\.(?=[^\.\/\:]*\.[^\.\/\:]*))?([^\.\/\:]*)(?:\.([^\/\.\:]*))?(?:\:([0-9]*))?(\/[^\?#]*(?=.*?\/)\/)?([^\?#]*)?(?:\?([^#]*))?(?:#(.*))?/

It matches all URLs, with any protocol, even URLs like

ftp://user:pass@www.cs.server.com:8080/dir1/dir2/file.php?param1=value1#hashtag

The result (in JavaScript) looks like this:

["ftp", "user", "pass", "www.cs", "server", "com", "8080", "/dir1/dir2/", "file.php", "param1=value1", "hashtag"]

And a URL like

mailto://admin@www.cs.server.com

looks like this:

["mailto", "admin", undefined, "www.cs", "server", "com", undefined, undefined, undefined, undefined, undefined]
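For reference, a sketch of the same pattern in Python, applied to the ftp example above. The pattern is copied without the surrounding slashes. Unmatched optional groups may be reported differently than in JavaScript (for instance as an empty string rather than undefined), so only the fully populated example is shown.

```python
import re

# The pattern from this answer, minus the JavaScript /.../ delimiters.
ANY_URL_RE = re.compile(
    r"(?:([^\:]*)\:\/\/)?(?:([^\:\@]*)(?:\:([^\@]*))?\@)?"
    r"(?:([^\/\:]*)\.(?=[^\.\/\:]*\.[^\.\/\:]*))?([^\.\/\:]*)(?:\.([^\/\.\:]*))?"
    r"(?:\:([0-9]*))?(\/[^\?#]*(?=.*?\/)\/)?([^\?#]*)?(?:\?([^#]*))?(?:#(.*))?"
)

url = "ftp://user:pass@www.cs.server.com:8080/dir1/dir2/file.php?param1=value1#hashtag"
groups = ANY_URL_RE.match(url).groups()
print(groups)
```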

I was trying to solve this in JavaScript, where it should be handled by:

var url = new URL('http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang');

since (in Chrome, at least) it parses to:

{
  "hash": "#foobar/bing/bo@ng?bang",
  "search": "?foo=bar&bingobang=&king=kong@kong.com",
  "pathname": "/path/wah@t/foo.js",
  "port": "890",
  "hostname": "example.com",
  "host": "example.com:890",
  "password": "b",
  "username": "a",
  "protocol": "http:",
  "origin": "http://example.com:890",
  "href": "http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang"
}

However, this isn't cross-browser (https://developer.mozilla.org/en-US/docs/Web/API/URL), so I cobbled this together to pull out the same parts as above:

^(?:(?:(([^:\/#\?]+:)?(?:(?:\/\/)(?:(?:(?:([^:@\/#\?]+)(?:\:([^:@\/#\?]*))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((?:\/?(?:[^\/\?#]+\/+)*)(?:[^\?#]*)))?(\?[^#]+)?)(#.*)?

Credit for this regex goes to https://gist.github.com/rpflorence, who posted this jsperf http://jsperf.com/url-parsing (originally found here: https://gist.github.com/jlong/2428561#comment-310066) and who came up with the regex this was originally based on.

The parts are in this order:

var keys = [
    "href",                    // http://user:pass@host.com:81/directory/file.ext?query=1#anchor
    "origin",                  // http://user:pass@host.com:81
    "protocol",                // http:
    "username",                // user
    "password",                // pass
    "host",                    // host.com:81
    "hostname",                // host.com
    "port",                    // 81
    "pathname",                // /directory/file.ext
    "search",                  // ?query=1
    "hash"                     // #anchor
];
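A sketch of how the key list lines up with the match, written in Python for illustration. The whole match is "href" and capture groups 1 to 10 follow in the listed order. The sample URL is the one from the comments above.

```python
import re

KEYS = ["href", "origin", "protocol", "username", "password", "host",
        "hostname", "port", "pathname", "search", "hash"]

# The pattern from this answer, split across lines only for readability.
URL_RE = re.compile(
    r"^(?:(?:(([^:\/#\?]+:)?(?:(?:\/\/)(?:(?:(?:([^:@\/#\?]+)(?:\:([^:@\/#\?]*))?)@)?"
    r"(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?"
    r"((?:\/?(?:[^\/\?#]+\/+)*)(?:[^\?#]*)))?(\?[^#]+)?)(#.*)?"
)

m = URL_RE.match("http://user:pass@host.com:81/directory/file.ext?query=1#anchor")
# Group 0 is the full match; groups() returns groups 1..10.
parsed = dict(zip(KEYS, (m.group(0),) + m.groups()))
print(parsed)
```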

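There is also a small library that wraps it and provides query parameters:

https://github.com/sadams/lite-url (also available on bower)

If you have an improvement, please create a pull request with more tests and I will accept and merge it. Thanks.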

Proposing a much more readable solution (in Python, but it applies to any regex):

import re

def url_path_to_dict(path):
    pattern = (r'^'
               r'((?P<schema>.+?)://)?'
               r'((?P<user>.+?)(:(?P<password>.*?))?@)?'
               r'(?P<host>.*?)'
               r'(:(?P<port>\d+?))?'
               r'(?P<path>/.*?)?'
               r'(?P<query>[?].*?)?'
               r'$'
               )
    regex = re.compile(pattern)
    m = regex.match(path)
    d = m.groupdict() if m is not None else None

    return d

def main():
    print(url_path_to_dict('http://example.example.com/example/example/example.html'))

if __name__ == '__main__':
    main()

Prints:

{
'host': 'example.example.com', 
'user': None, 
'path': '/example/example/example.html', 
'query': None, 
'password': None, 
'port': None, 
'schema': 'http'
}

Subdomains and domains are difficult because the subdomain can have several parts, as can the top-level domain, e.g. http://sub1.sub2.domain.co.uk/

 the path without the file : http://[^/]+/((?:[^/]+/)*(?:[^/]+$)?)  
 the file : http://[^/]+/(?:[^/]+/)*((?:[^/.]+\.)+[^/.]+)$  
 the path with the file : http://[^/]+/(.*)  
 the URL without the path : (http://[^/]+/)

(Markdown isn't very friendly to regexes)

This improved version should work as reliably as a parser.

   // Applies to URI, not just URL or URN:
   //    http://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Relationship_to_URL_and_URN
   //
   // http://labs.apache.org/webarch/uri/rfc/rfc3986.html#regexp
   //
   // (?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?
   //
   // http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax
   //
   // $@ matches the entire uri
   // $1 matches scheme (ftp, http, mailto, mshelp, ymsgr, etc)
   // $2 matches authority (host, user:pwd@host, etc)
   // $3 matches path
   // $4 matches query (http GET REST api, etc)
   // $5 matches fragment (html anchor, etc)
   //
   // Match specific schemes, non-optional authority, disallow white-space so can delimit in text, and allow 'www.' w/o scheme
   // Note the schemes must match ^[^\s|:/?#]+(?:\|[^\s|:/?#]+)*$
   //
   // (?:()(www\.[^\s/?#]+\.[^\s/?#]+)|(schemes)://([^\s/?#]*))([^\s?#]*)(?:\?([^\s#]*))?(#(\S*))?
   //
   // Validate the authority with an orthogonal RegExp, so the RegExp above won’t fail to match any valid urls.
   function uriRegExp( flags, schemes/* = null*/, noSubMatches/* = false*/ )
   {
      if( !schemes )
         schemes = '[^\\s:\/?#]+'
      else if( !RegExp( /^[^\s|:\/?#]+(?:\|[^\s|:\/?#]+)*$/ ).test( schemes ) )
         throw TypeError( 'expected URI schemes' )
      return noSubMatches ? new RegExp( '(?:www\\.[^\\s/?#]+\\.[^\\s/?#]+|' + schemes + '://[^\\s/?#]*)[^\\s?#]*(?:\\?[^\\s#]*)?(?:#\\S*)?', flags ) :
         new RegExp( '(?:()(www\\.[^\\s/?#]+\\.[^\\s/?#]+)|(' + schemes + ')://([^\\s/?#]*))([^\\s?#]*)(?:\\?([^\\s#]*))?(?:#(\\S*))?', flags )
   }

   // http://en.wikipedia.org/wiki/URI_scheme#Official_IANA-registered_schemes
   function uriSchemesRegExp()
   {
      return 'about|callto|ftp|gtalk|http|https|irc|ircs|javascript|mailto|mshelp|sftp|ssh|steam|tel|view-source|ymsgr'
   }

Try the following:

^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?

It supports HTTP / FTP, subdomains, folders, files, etc.

I found it with a quick Google search:

http://geekswithblogs.net/casualjim/archive/2005/12/01/61722.aspx

/^((?P<scheme>https?|ftp):\/)?\/?((?P<username>.*?)(:(?P<password>.*?)|)@)?(?P<hostname>[^:\/\s]+)(?P<port>:([^\/]*))?(?P<path>(\/\w+)*\/)(?P<filename>[-\w.]+[^#?\s]*)?(?P<query>\?([^#]*))?(?P<fragment>#(.*))?$/

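From my answer on a similar question. It works better than some of the others mentioned because they had some bugs (such as not supporting username/password, not supporting single-character filenames, or broken fragment identifiers).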
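The `(?P<name>...)` syntax used above is what Python's `re` module understands, so as a sketch the pattern can be used there directly (without the leading and trailing slashes) and read back by name. The sample URL is an assumption.

```python
import re

# The named-group pattern from this answer, minus the /.../ delimiters.
NAMED_RE = re.compile(
    r"^((?P<scheme>https?|ftp):\/)?\/?((?P<username>.*?)(:(?P<password>.*?)|)@)?"
    r"(?P<hostname>[^:\/\s]+)(?P<port>:([^\/]*))?(?P<path>(\/\w+)*\/)"
    r"(?P<filename>[-\w.]+[^#?\s]*)?(?P<query>\?([^#]*))?(?P<fragment>#(.*))?$"
)

url = "ftp://user:pass@www.cs.server.com:8080/dir1/dir2/file.php?param1=value1#hashtag"
named = NAMED_RE.match(url).groupdict()
print(named)
```

Note that in this pattern the port, query and fragment groups keep their leading ":", "?" and "#".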

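You can get the scheme (http/https), host, port, path and query by using the Uri object in .NET. The only difficult task is breaking the host down into subdomain, domain name and TLD.

There is no standard for doing so, and you can't simply use string parsing or a regex to produce the correct result. At first I used a regex function, but not all URLs could be parsed into subdomains correctly. The practical way is to use a list of TLDs. Once the TLD of a URL is identified, the label to its left is the domain and the remainder is the subdomain.

However, the list needs to be maintained, since new TLDs can appear. As far as I know, publicsuffix.org maintains the latest list, and you can use the domainname-parser tool from Google Code to parse the public suffix list and easily get the subdomain, domain and TLD through the DomainName object: domainName.SubDomain, domainName.Domain and domainName.TLD.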
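The list-based approach can be sketched as follows in Python. `KNOWN_SUFFIXES` is a tiny hand-written stand-in for the real Public Suffix List and `split_host` is a hypothetical helper; neither is part of the .NET library mentioned above.

```python
# Tiny illustrative stand-in for the real Public Suffix List (assumption).
KNOWN_SUFFIXES = {"com", "org", "uk", "co.uk"}

def split_host(host):
    """Return (subdomain, domain, suffix) using the longest known suffix."""
    labels = host.lower().split(".")
    # Candidates run from longest to shortest, so the first hit is the longest suffix.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in KNOWN_SUFFIXES:
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[: i - 1]) if i > 1 else ""
            return subdomain, domain, suffix
    # No known suffix: treat the whole host as the domain.
    return "", host, ""

print(split_host("sub1.sub2.domain.co.uk"))
print(split_host("test.example.com"))
```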

This answer is also helpful: Get the subdomain from a URL

卡拉梅兰

Here is one that is complete and does not rely on any protocol.

function getServerURL(url) {
        var m = url.match("(^(?:(?:.*?)?//)?[^/?#;]*)");
        console.log(m[1]) // Remove this
        return m[1];
    }

getServerURL("http://dev.test.se")
getServerURL("http://dev.test.se/")
getServerURL("//ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js")
getServerURL("//")
getServerURL("www.dev.test.se/sdas/dsads")
getServerURL("www.dev.test.se/")
getServerURL("www.dev.test.se?abc=32")
getServerURL("www.dev.test.se#abc")
getServerURL("//dev.test.se?sads")
getServerURL("http://www.dev.test.se#321")
getServerURL("http://localhost:8080/sads")
getServerURL("https://localhost:8080?sdsa")

Prints

http://dev.test.se

http://dev.test.se

//ajax.googleapis.com

//

www.dev.test.se

www.dev.test.se

www.dev.test.se

www.dev.test.se

//dev.test.se

http://www.dev.test.se

http://localhost:8080

https://localhost:8080

None of the above worked for me. Here's what I ended up using:

/^(?:((?:https?|s?ftp):)\/\/)([^:\/\s]+)(?::(\d*))?(?:\/([^\s?#]+)?([?][^?#]*)?(#.*)?)?/
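A quick sketch of this pattern in Python, to show which part lands in which group. The two sample URLs and the variable names are assumptions for illustration.

```python
import re

# The pattern from this answer, minus the /.../ delimiters.
FINAL_RE = re.compile(
    r"^(?:((?:https?|s?ftp):)\/\/)([^:\/\s]+)(?::(\d*))?"
    r"(?:\/([^\s?#]+)?([?][^?#]*)?(#.*)?)?"
)

# Groups: protocol, host, port, path (no leading slash), query, fragment.
full = FINAL_RE.match("https://example.com:8080/path/file.html?x=1#frag").groups()
bare = FINAL_RE.match("http://example.com").groups()
print(full)
print(bare)
```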

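I like the regex that was published in "JavaScript: The Good Parts". It is not too short and not too complex. This page on GitHub also has JavaScript code that uses it, but it can be adapted to any language. https://gist.github.com/voodooGQ/4057330

Java offers a URL class that will do this. Query URL Objects.

As a side note, PHP offers parse_url().

I would recommend not using a regex. An API call like WinHttpCrackUrl() is less error prone.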

http://msdn.microsoft.com/en-us/library/aa384092%28VS.85%29.aspx

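I tried a few of these and they didn't cover my needs, especially the highest voted one, which doesn't catch a URL without a path (http://example.com/).

The lack of group names also made it unusable in Ansible (or perhaps my Jinja2 skills are lacking).

So this is my slightly modified version, based on the highest voted version here: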

^((?P<protocol>http[s]?|ftp):\/)?\/?(?P<host>[^:\/\s]+)(?P<path>((\/\w+)*\/)([\w\-\.]+[^#?\s]+))*(.*)?(#[\w\-]+)?$

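Using http://www.fileformat.info/tool/regex.htm, hometoast's regex works great.

But here is the deal: I want to use different regex patterns in different situations in my program.

For example, I have this URL, and I have an enumeration that lists all supported URLs in my program. Each object in the enumeration has a getRegexPattern method that returns the regex pattern, which is then compared against a URL. If a particular regex pattern returns true, I know that my program supports that URL. So each enumeration value has its own regex, depending on where it should look inside the URL.

Hometoast's suggestion is great, but in my case I think it wouldn't help (unless I copy and paste the same regex into all enumerations).

That is why I wanted the answer to give the regex for each situation separately. Still, +1 for hometoast. ;)

I know you're claiming language-agnostic on this, but can you tell us what you're using, so we know what regex capabilities you have?

If you have support for non-capturing matches, you can modify hometoast's expression so that the subexpressions you aren't interested in capturing are written like this: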

(?:SOMESTUFF)

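You'd still have to copy and paste (and slightly modify) the regex into multiple places, but this makes sense: you're not just checking whether the subexpression exists, but whether it exists as part of a URL. Using the non-capturing modifier for subexpressions gives you what you need and nothing more, which, if I'm reading you correctly, is what you want.

Just as a small note, hometoast's expression doesn't need to put brackets around the 's' in 'https', since there is only one character in there. A quantifier applies to the one character (or character class or subexpression) directly preceding it. So: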

https?

would match 'http' or 'https' just fine.
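Both points can be checked with a short Python sketch. The shortened patterns below are simplified stand-ins written for this example, not hometoast's full expression.

```python
import re

# "https?" and "http[s]?" accept exactly the same strings.
same = all(
    bool(re.fullmatch(r"https?", s)) == bool(re.fullmatch(r"http[s]?", s))
    for s in ("http", "https", "httpss", "htt")
)

# Simplified stand-in patterns: capturing vs. non-capturing scheme prefix.
capturing = re.compile(r"^((https?|ftp)://)?([^:/\s]+)")
non_capturing = re.compile(r"^(?:(?:https?|ftp)://)?([^:/\s]+)")

url = "https://test.example.com/dir"
host_a = capturing.match(url).group(3)      # host is the third group
host_b = non_capturing.match(url).group(1)  # host is now the only group
print(same, capturing.groups, non_capturing.groups, host_a, host_b)
```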

A regexp to get the URL path without the file.

url = 'http://domain/dir1/dir2/somefile'
url.scan(%r{^(http://[^/]+)((?:/[^/]+)+(?=/))?/?(?:[^/]+)?$}i).to_s

It can be useful for adding a relative path to this URL.
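A sketch of the same idea in Python: the regex yields the base without the file, and, as an alternative that is not part of the original answer, the standard library's `urljoin` resolves a relative path directly. The relative name `other.html` is an assumption.

```python
import re
from urllib.parse import urljoin

url = "http://domain/dir1/dir2/somefile"

# Same pattern as above; groups 1 and 2 together form the URL without the file.
BASE_RE = re.compile(r"^(http://[^/]+)((?:/[^/]+)+(?=/))?/?(?:[^/]+)?$", re.IGNORECASE)
m = BASE_RE.match(url)
base = m.group(1) + (m.group(2) or "")

# Standard-library alternative: resolve a relative reference against the URL.
joined = urljoin(url, "other.html")
print(base)
print(joined)
```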

String s = "https://www.thomas-bayer.com/axis2/services/BLZService?wsdl";

String regex = "(^http.?://)(.*?)([/\\?]{1,})(.*)";

System.out.println("1: " + s.replaceAll(regex, "$1"));
System.out.println("2: " + s.replaceAll(regex, "$2"));
System.out.println("3: " + s.replaceAll(regex, "$3"));
System.out.println("4: " + s.replaceAll(regex, "$4"));

will provide the following output:
1: https://
2: www.thomas-bayer.com
3: /
4: axis2/services/BLZService?wsdl

If you change the URL to
String s = "https://www.thomas-bayer.com?wsdl=qwerwer&ttt=888"; the output will be the following:
1: https://
2: www.thomas-bayer.com
3: ?
4: wsdl=qwerwer&ttt=888

Enjoy..
Yosi Lev

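A regex that does full parsing is quite horrendous. I've included named backreferences for legibility and broken each part onto a separate line, but it still looks like this: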

^(?:(?P<protocol>\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?
(?P<file>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)
(?:\?(?P<querystring>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?
(?:#(?P<fragment>.*))?$

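The reason it has to be so verbose is that, except for the protocol and the port, any of the parts can contain HTML entities, which makes delineating the fragment quite tricky. So in the last few cases (the host, path, file, querystring and fragment) we allow either any HTML entity or any character that isn't a ? or #. The regex for an HTML entity looks like this: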

$htmlentity = "&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);"

When that is extracted (I used a mustache syntax to represent it), it becomes a bit more legible:

^(?:(?P<protocol>(?:ht|f)tps?|\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:{{htmlentity}}|[^\/?#:])+(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:{{htmlentity}}|[^?#])+)\/)?
(?P<file>(?:{{htmlentity}}|[^?#])+)
(?:\?(?P<querystring>(?:{{htmlentity}}|[^#])+))?
(?:#(?P<fragment>.*))?$
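One way to realize the placeholder idea is plain string substitution before compiling. The sketch below does this in Python, whose `re` module accepts the `(?P<name>...)` syntax. The names `HTMLENTITY`, `TEMPLATE` and `URL_RE` and the sample URL are assumptions, and the placeholder in the querystring line is used without a trailing ";" because the entity pattern already ends with one.

```python
import re

# The HTML-entity fragment from the answer above.
HTMLENTITY = (r"&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|"
              r"\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);")

# The mustache-style template, one part per line.
TEMPLATE = (
    r"^(?:(?P<protocol>(?:ht|f)tps?|\w+(?=:\/\/))(?::\/\/))?"
    r"(?:(?P<host>(?:{{htmlentity}}|[^\/?#:])+(?::(?P<port>[0-9]+))?)\/)?"
    r"(?:(?P<path>(?:{{htmlentity}}|[^?#])+)\/)?"
    r"(?P<file>(?:{{htmlentity}}|[^?#])+)"
    r"(?:\?(?P<querystring>(?:{{htmlentity}}|[^#])+))?"
    r"(?:#(?P<fragment>.*))?$"
)

# Expand the placeholder, then compile once.
URL_RE = re.compile(TEMPLATE.replace("{{htmlentity}}", HTMLENTITY))

fields = URL_RE.match("http://example.com:8080/a/b.txt?x=1#y").groupdict()
print(fields)
```

In this pattern the host group includes the port, and the path is captured without its leading and trailing slashes.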

In JavaScript, of course, you can't use named backreferences, so the regex becomes

^(?:(\w+(?=:\/\/))(?::\/\/))?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::([0-9]+))?)\/)?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)(?:\?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?(?:#(.*))?$

and in each match, the protocol is \1, the host is \2, the port is \3, the path \4, the file \5, the querystring \6, and the fragment \7.

//USING REGEX
/**
 * Parse URL to get information
 *
 * @param   url     the URL string to parse
 * @return  parsed  the URL parsed or null
 */
var UrlParser = function (url) {
    "use strict";

    var regx = /^(((([^:\/#\?]+:)?(?:(\/\/)((?:(([^:@\/#\?]+)(?:\:([^:@\/#\?]+))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((\/?(?:[^\/\?#]+\/+)*)([^\?#]*)))?(\?[^#]+)?)(#.*)?/,
        matches = regx.exec(url),
        parser = null;

    if (null !== matches) {
        parser = {
            href              : matches[0],
            withoutHash       : matches[1],
            url               : matches[2],
            origin            : matches[3],
            protocol          : matches[4],
            protocolseparator : matches[5],
            credhost          : matches[6],
            cred              : matches[7],
            user              : matches[8],
            pass              : matches[9],
            host              : matches[10],
            hostname          : matches[11],
            port              : matches[12],
            pathname          : matches[13],
            segment1          : matches[14],
            segment2          : matches[15],
            search            : matches[16],
            hash              : matches[17]
        };
    }

    return parser;
};

var parsedURL=UrlParser(url);
console.log(parsedURL);

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow