Question

In my C# program I wrote a Google search function that fetches the source of each page and extracts the URLs with a regex.

My current regex is:

(?:(?:(?:http)://)(?:w{3}\\.)?(?:[a-zA-Z0-9/;\\?&=:\\-_\\$\\+!\\*'\\(\\|\\\\~\\[\\]#%\\.])+)

This works well at the moment, but I get URLs such as http://www.example.com/forums/arcade.php?efdf=332

In this case I just want the URL without the ?efdf=332 at the end.

So how should I change the regex?


Solution

http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+

does the same as your regex (I've removed a lot of unnecessary cruft) but stops matching a link before a ?.

In C#:

Regex regexObj = new Regex(@"http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+");
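
For illustration, here is a minimal sketch of how that pattern could be applied to a page's source. The class name LinkExtractor and the method ExtractUrls are hypothetical names chosen for this example, not part of the original program.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class LinkExtractor
{
    // Same pattern as above. '?' is not in the character class,
    // so each match ends before any query string.
    private static readonly Regex UrlPattern =
        new Regex(@"http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+");

    // Returns every match found in the page source, in order of appearance.
    public static List<string> ExtractUrls(string pageSource)
    {
        var urls = new List<string>();
        foreach (Match m in UrlPattern.Matches(pageSource))
        {
            urls.Add(m.Value);
        }
        return urls;
    }
}
```

Given the input `<a href="http://www.example.com/forums/arcade.php?efdf=332">`, ExtractUrls should return the single entry http://www.example.com/forums/arcade.php.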

That said, I'm not sure this is such a good way of matching URLs (what about https, ftp, mailto etc.?)

Other tips

You can use the Uri class to access various parts of the URL and either remove the query string from the end, or concatenate the parts you want.
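
A minimal sketch of the first option, assuming the input is an absolute URL. UrlCleaner and StripQuery are hypothetical names used only for this example. Uri.GetLeftPart(UriPartial.Path) returns the scheme, authority, and path, so it drops any fragment as well as the query string.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class UrlCleaner
{
    // Returns the URL up to and including the path,
    // dropping the query string and any fragment.
    // Throws UriFormatException if the input is not a valid absolute URL.
    public static string StripQuery(string url)
    {
        var uri = new Uri(url);
        return uri.GetLeftPart(UriPartial.Path);
    }
}
```

For example, UrlCleaner.StripQuery("http://www.example.com/forums/arcade.php?efdf=332") yields http://www.example.com/forums/arcade.php. Because this does not depend on a hand-written character class, it should also work for https and other schemes that Uri can parse.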

License: CC-BY-SA with attribution
Not affiliated with StackOverflow