Question

I have a problem where I need to list every country in a list in Excel, using Open-URI. Everything works properly, but I can't figure out how to get my regular expression to match single-word countries (like "Sweden") as well as countries like "South Africa" whose names contain whitespace. I hope I've made myself understood; the relevant pieces of code are included below.

The text I want to match is the following (for example):

<a href="wf.html">Wallis and Futuna</a>
<a href="ym.html">Yemen</a>

I am currently stuck with this Regexp:

/a.+="\w{2}.html">(\w*)<.+{1}/

As you can see, there is no problem matching "Yemen". However, I want the code to match both "Wallis and Futuna" and "Yemen". Perhaps there is a way to capture everything between the ">" and the "<", as in ">blabla bla<"? Any thoughts? I would be very grateful!


Solution 2

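For your test sample, the following pattern works. The character class [\w\s]+ allows whitespace inside the captured name, and the dot before "html" is escaped so that it matches only a literal dot:

/<a[^>]+href="\w{2}\.html">([\w\s]+)<\/a>/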
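As a rough sketch of how such a pattern could be applied, the snippet below runs it over the two sample lines from the question using String#scan. The variable names (html, country_link, countries) are made up for illustration, and the capture group is swapped to [^<]+ (everything up to the next "<"), which is an assumption on my part so that names containing hyphens or apostrophes would also be captured.

```ruby
# Sample markup copied from the question; in practice this would be
# the page body fetched with Open-URI (assumption for illustration).
html = <<~HTML
  <a href="wf.html">Wallis and Futuna</a>
  <a href="ym.html">Yemen</a>
HTML

# \. matches a literal dot. [^<]+ captures everything up to the next "<",
# so spaces, hyphens and apostrophes inside a name are all accepted.
country_link = /<a[^>]+href="\w{2}\.html">([^<]+)<\/a>/

# String#scan returns one array of capture groups per match;
# flatten turns that into a plain list of names.
countries = html.scan(country_link).flatten

puts countries
```

With the sample input this collects "Wallis and Futuna" and "Yemen". If you prefer to stay with [\w\s]+, only the capture group needs to change.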

OTHER TIPS

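It is generally a bad idea to use regular expressions to extract data from HTML. An HTML parser such as Nokogiri copes with attribute order, extra whitespace and nested tags, which a regex does not: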

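require 'nokogiri'

# your_html is assumed to hold the downloaded page as a String
parser = Nokogiri::HTML.parse(your_html)
country_links = parser.css("a")
country_links.each do |link|
  puts link['href']  # the link target, e.g. "wf.html"
  puts link.text     # the country name, e.g. "Wallis and Futuna"
end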
Licensed under: CC-BY-SA with attribution