Question
I have to build a list in Excel of every country on a page that I fetch with Open-URI. Everything is working properly, but I can't figure out how to get my regexp to match single-word country names (like "Sweden") as well as names such as "South Africa" that contain whitespace. I hope I've made myself understood; the relevant pieces of code are below.
The text I want to match looks like this (for example):
<a href="wf.html">Wallis and Futuna</a>
<a href="ym.html">Yemen</a>
I am currently stuck with this Regexp:
/a.+="\w{2}.html">(\w*)<.+{1}/
As you can see, matching "Yemen" is no problem, but I want the code to match both "Wallis and Futuna" and "Yemen". Perhaps there is a way to capture everything between the ">" and the "<"? Any thoughts? I would be very grateful!
Solution 2
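For your test sample, this works (the dot before "html" is escaped so that it only matches a literal dot, and [\w\s]+ lets the captured name contain whitespace):
/<a[^>]+href="\w{2}\.html">([\w\s]+)<\/a>/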
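A minimal sketch of how that pattern could be applied with String#scan, using the two sample lines from the question. The names html, COUNTRY_LINK and countries are only illustrative and do not come from the original code, and the dot before "html" is escaped here.

```ruby
# Sample markup copied from the question (the real page would come from Open-URI).
html = <<~HTML
  <a href="wf.html">Wallis and Futuna</a>
  <a href="ym.html">Yemen</a>
HTML

# The pattern from the answer, with the dot before "html" escaped.
# COUNTRY_LINK is a hypothetical name chosen for this example.
COUNTRY_LINK = /<a[^>]+href="\w{2}\.html">([\w\s]+)<\/a>/

# With a single capture group, scan returns an array of one-element arrays,
# so flatten turns it into a plain list of country names.
countries = html.scan(COUNTRY_LINK).flatten

puts countries
```

Note that [\w\s]+ only accepts letters, digits, underscores and whitespace, so a name containing a hyphen or an apostrophe (for example "Guinea-Bissau") would not match. If the page contains such names, capturing with ([^<]+) instead is one possible alternative.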
OTHER TIPS
It is generally a bad idea to use regular expressions to extract data from HTML. An HTML parser such as Nokogiri is more robust:
require 'nokogiri'
# your_html holds the page source as a String
parser = Nokogiri::HTML.parse(your_html)
country_links = parser.css("a")
country_links.each do |link|
  puts link['href']
  puts link.text
end