Question

I have a huge text file, 20k+ lines, and I want to extract links from it.

What I need is a regular expression that generates a clean list of links.

The links I need start with http:// (without www) and end with .html

What would the expression look like?


Solution

For links to any website (http or https) that end in .html, it would look like this:

(http|https)://[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,}\/\S+?\.html

And to match exactly what you specified:

http://[a-zA-Z0-9\-]+\.[a-z]{2,}\/[a-zA-Z0-9\-]+\.html

Then just cut the matches (Ctrl+X), paste them (Ctrl+V) into a new file, and you have your list.

This works in JavaScript, Notepad++ and so on.
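As a rough sketch of the JavaScript side, assuming the file contents are already loaded into a string: the helper name extractLinks and the sample text are made up for illustration, and the pattern is a variant of the stricter one above with the literal dots escaped and the g flag added so that every match is returned.

```javascript
// Hypothetical helper: returns every http://...html link found in `text`.
function extractLinks(text) {
  // Variant of the stricter pattern; the "g" flag makes match() return all hits.
  const pattern = /http:\/\/[a-zA-Z0-9\-]+\.[a-z]{2,}\/[a-zA-Z0-9\-]+\.html/g;
  // match() returns null when nothing is found, so fall back to an empty list.
  return text.match(pattern) || [];
}

// Made-up sample input.
const sample = [
  "ewkgml http://test.com/a.html lamklwmwtmk",
  "skipped: http://www.test.com/b.html and https://test.com/c.html",
  "two on one line: http://foo.org/x-1.html http://bar.net/y.html",
].join("\n");

// Prints the three matching links, one per line.
console.log(extractLinks(sample).join("\n"));
```

Note that this pattern only accepts a host of the form name.tld and a single path segment, which is why the www and https links in the sample are skipped.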

\b matches a word boundary, so a pattern wrapped in \b finds the link only when it stands on its own as a whole word, as in: ewkgml http://test.com/a.html lamklwmwtmk. \B is the negation of \b, so a pattern wrapped in \B finds the link when it is glued to other word characters, as in: wegniwgnwkjnhttp://test.com/a.htmllmwtlkmt34lt. | is the "or" operator, as in (http|https).
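To make the difference concrete, here is a small sketch using the two example strings from the paragraph above. The three patterns are illustrative only (a deliberately loose http://...html match wrapped in \b, wrapped in \B, and left unanchored); they are not the patterns given earlier.

```javascript
const spaced = "ewkgml http://test.com/a.html lamklwmwtmk";
const glued = "wegniwgnwkjnhttp://test.com/a.htmllmwtlkmt34lt";

// Illustrative patterns: same core, different anchoring.
const bounded = /\bhttp:\/\/\S+?\.html\b/;   // link must stand as a whole word
const embedded = /\Bhttp:\/\/\S+?\.html\B/;  // link must be glued to word characters
const plain = /http:\/\/\S+?\.html/;         // no anchoring at all

console.log(bounded.test(spaced), bounded.test(glued));   // true false
console.log(embedded.test(spaced), embedded.test(glued)); // false true
console.log(plain.test(spaced), plain.test(glued));       // true true
```

If both kinds of occurrence should be found, leaving out \b and \B altogether is the simplest option.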

OTHER TIPS

In Notepad++, open the Replace dialog (Ctrl+H) and insert

.*?(http://.*?\.html).*?

in the Find what: input field and

$1\n

in the Replace with: input field.

You have to select the Regular expression search mode and check the ". matches newline" checkbox.

After you click Replace All, you get a list of all links, one per line.
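The same find-and-replace can be sketched in JavaScript to see what it does to a piece of text. This assumes the s (dotAll) flag as a stand-in for the ". matches newline" checkbox and the g flag as a stand-in for Replace All; the sample text is made up.

```javascript
// Made-up sample text spanning two lines.
const text = "foo http://a.com/x.html bar\nmore http://b.com/y.html baz";

// Everything up to and including each link is replaced by the link plus a newline.
const result = text.replace(/.*?(http:\/\/.*?\.html).*?/gs, "$1\n");

console.log(JSON.stringify(result));
// "http://a.com/x.html\nhttp://b.com/y.html\n baz"
```

As the output shows, whatever follows the last link (here " baz") is not part of any match and stays at the end, so it may need to be deleted by hand.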

Licensed under: CC-BY-SA with attribution