Regex matching extra text

https://stackoverflow.com/questions/23208233

07-07-2023

Question

I'm writing a little program that finds email addresses given a URL, but there seems to be something wrong with my regex. It's printing out the same thing multiple times and matching text that I'm not looking for.

Cleaner cleaner = new Cleaner(Whitelist.basic());
String url = "http://www.fon.hum.uva.nl/paul/";
Document doc = cleaner.clean(Jsoup.connect(url).get());
Elements emails = doc.select(":matches(" + 
                "[0-9a-zA-Z_-]+@[0-9a-zA-Z_-]+\\.[a-zA-Z]{2,4}"
                +")");
for (Element e : emails) {
   System.out.println(e.text());
}

I won't post the complete result here, because it's too long, but it's matching an email, and also a bunch of repeated text that doesn't follow the pattern.

"Paul Boersma Professor of Phonetic Sciences University of Amsterdam "...
"Paul Boersma Professor of Phonetic Sciences University of Amsterdam "...
"Paul Boersma Professor of Phonetic Sciences University of Amsterdam "...

Does anyone know what the problem could be? Is it the regex, or does it have something to do with printing e.text()?

Thank you.

Edit: I have also tried a more complicated expression:

[\\w-]+(\\.[\\w-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,4})

But I have had the same issue with it.

Edit 2: I have used this regex in Notepad++, and it seems to work well. I only have this issue when matching text from webpages.

Edit 3: I tried running it on regexplanet.com and interestingly enough, it matches correctly. So is this a Jsoup thing then? Something having to do with Elements, maybe?

Solution 2

I solved this by running java.util.regex.Pattern over the document's text instead of using Jsoup's selector for the pattern matching:

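// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
// `cleaner` and `url` are the variables defined in the question.
Pattern pattern = Pattern.compile("[\\w-]+(\\.[\\w-]+)*\\s?@\\s?[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,4})");
Document doc = cleaner.clean(Jsoup.connect(url).get());
// Flatten the whole page to plain text, then scan it once.
String text = doc.text();
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    // group() returns only the matched substring, not the enclosing element's text.
    System.out.println(matcher.group());
}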
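
For anyone who wants to try the regex without fetching a page, here is a minimal self-contained sketch of the same Pattern/Matcher loop. The class name EmailExtractor, the method findEmails, and the sample sentence are made up for illustration; only the regex is taken from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailExtractor {
    // Same expression as in the answer above; it tolerates one optional
    // whitespace character on either side of the '@'.
    private static final Pattern EMAIL = Pattern.compile(
            "[\\w-]+(\\.[\\w-]+)*\\s?@\\s?[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,4})");

    // Hypothetical helper: returns every matched substring, in order of appearance.
    static List<String> findEmails(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Invented sample text standing in for doc.text().
        String text = "Write to alice@example.org or bob.smith @ example.com for details.";
        for (String email : findEmails(text)) {
            System.out.println(email);
        }
        // prints:
        // alice@example.org
        // bob.smith @ example.com
    }
}
```

Because matcher.group() returns only the matched substring, the surrounding page text never shows up in the output, which is the difference from printing e.text() on a selected element.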

OTHER TIPS

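The problem comes from the CSS query. Because the query names no specific element, :matches is tested against every element in the document, and it compares the regex with an element's text including the text of all its descendants. What you get is the node containing an email and ALL its ancestors up to the root node (<html>), which is why the same text is printed several times.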

I can see two options for you:

1. Use a more specific CSS query (here, restricted to <a> elements)

a:matches([0-9a-zA-Z_-]+@[0-9a-zA-Z_-]+\\.[a-zA-Z]{2,4})

Demo: http://try.jsoup.org/~fsXXqnQtTNEOSTR3TPvyONtWS64

2. Extract the node immediately containing the email

:matchesOwn([0-9a-zA-Z_-]+@[0-9a-zA-Z_-]+\\.[a-zA-Z]{2,4})

Demo: http://try.jsoup.org/~RgbUgekyWIoSe5bvFhZdQju9ibM
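
To see why the two pseudo-selectors behave differently, here is a toy model in plain Java. It is not Jsoup code: the Node class, the select method, and the sample tree are invented for illustration, and the model only assumes what is described above, namely that :matches looks at an element's text including its descendants while :matchesOwn looks at the element's own text only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class MatchesVsMatchesOwn {
    // Hypothetical stand-in for an HTML element; this is not a Jsoup class.
    static final class Node {
        final String tag;
        final String ownText;
        final List<Node> children = new ArrayList<>();

        Node(String tag, String ownText, Node... kids) {
            this.tag = tag;
            this.ownText = ownText;
            for (Node k : kids) {
                children.add(k);
            }
        }

        // Own text plus the text of every descendant.
        String text() {
            StringBuilder sb = new StringBuilder(ownText);
            for (Node c : children) {
                sb.append(' ').append(c.text());
            }
            return sb.toString().trim();
        }
    }

    // The simpler regex from the question.
    static final Pattern EMAIL =
            Pattern.compile("[0-9a-zA-Z_-]+@[0-9a-zA-Z_-]+\\.[a-zA-Z]{2,4}");

    // ownTextOnly = false models :matches, true models :matchesOwn.
    // Returns the tag names of the matching nodes in document order.
    static List<String> select(Node root, boolean ownTextOnly) {
        List<String> hits = new ArrayList<>();
        collect(root, ownTextOnly, hits);
        return hits;
    }

    private static void collect(Node n, boolean ownTextOnly, List<String> hits) {
        String haystack = ownTextOnly ? n.ownText : n.text();
        if (EMAIL.matcher(haystack).find()) {
            hits.add(n.tag);
        }
        for (Node c : n.children) {
            collect(c, ownTextOnly, hits);
        }
    }

    // Invented page: <html><body><p>...</p><p>Contact: <a>...</a></p></body></html>
    static Node sampleTree() {
        return new Node("html", "",
                new Node("body", "",
                        new Node("p", "Paul Boersma"),
                        new Node("p", "Contact:",
                                new Node("a", "someone@example.org"))));
    }

    public static void main(String[] args) {
        Node root = sampleTree();
        System.out.println(select(root, false)); // prints [html, body, p, a]
        System.out.println(select(root, true));  // prints [a]
    }
}
```

In the first call every ancestor of the <a> node matches, because its combined text contains the address. That corresponds to the repeated output in the question. The second call returns only the node that directly holds the address.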

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow