Question

I'm writing a little program that finds email addresses given a URL, but there seems to be something wrong with my regex. It prints the same thing multiple times and matches text that I'm not looking for.

Cleaner cleaner = new Cleaner(Whitelist.basic());
String url = "http://www.fon.hum.uva.nl/paul/";
Document doc = cleaner.clean(Jsoup.connect(url).get());
Elements emails = doc.select(":matches(" + 
                "[0-9a-zA-Z_-]+@[0-9a-zA-Z_-]+\\.[a-zA-Z]{2,4}"
                +")");
for (Element e : emails) {
   System.out.println(e.text());
}

I won't post the complete result here, because it's too long, but it's matching an email, and also a bunch of repeated text that doesn't follow the pattern.

"Paul Boersma Professor of Phonetic Sciences   University of Amsterdam   "...
"Paul Boersma Professor of Phonetic Sciences   University of Amsterdam   "...
"Paul Boersma Professor of Phonetic Sciences   University of Amsterdam   "...

Does anyone know what the problem could be? Is it the regex, or does it have something to do with printing e.text()?

Thank you.

Edit: I have also tried a more complicated expression:

[\\w-]+(\\.[\\w-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,4})

But I have had the same issue with it.

Edit 2: I have used this regex in Notepad++, and it seems to work well. I only have this issue when matching text from webpages.

Edit 3: I tried running it on regexplanet.com and interestingly enough, it matches correctly. So is this a Jsoup thing then? Something having to do with Elements, maybe?


Solution

I solved this using Pattern instead of JSoup for pattern matching:

Pattern pattern = Pattern.compile("[\\w-]+(\\.[\\w-]+)*\\s?@\\s?[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,4})");
Document doc = cleaner.clean(Jsoup.connect(url).get());
String text = doc.text();
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    System.out.println(matcher.group());
}
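For anyone who wants to try the regex part without a network call, here is a minimal, self-contained sketch of the same approach. It runs the pattern from the snippet above against a plain string. The class name EmailExtractor, the extract helper, and the example.org address are made up for illustration, and collecting matches into a LinkedHashSet to drop repeats is my own addition, not part of the original answer.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailExtractor {
    // Same pattern as in the answer, including the optional whitespace around '@'.
    private static final Pattern EMAIL = Pattern.compile(
        "[\\w-]+(\\.[\\w-]+)*\\s?@\\s?[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,4})");

    // Hypothetical helper: returns each distinct match once, in order of first appearance.
    static Set<String> extract(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for doc.text().
        String text = "Contact: jane.doe@example.org or jane.doe@example.org (office)";
        System.out.println(extract(text)); // [jane.doe@example.org]
    }
}
```

In the real program you would pass doc.text() to extract instead of the sample string. Because the whole page is flattened to one string first, each address is searched for once per occurrence in the text rather than once per enclosing element.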

Other tips

The problem comes from the CSS query. Because it names no specific element, :matches selects every element whose text, including the text of its descendants, matches the regex. What you get is the element containing the email and ALL of its ancestors up to the root node (<html>).

I can see two options for you:

1. Use a more specific CSS query

a:matches([0-9a-zA-Z_-]+@[0-9a-zA-Z_-]+\\.[a-zA-Z]{2,4})

Demo: http://try.jsoup.org/~fsXXqnQtTNEOSTR3TPvyONtWS64

2. Extract the node immediately containing the email

:matchesOwn([0-9a-zA-Z_-]+@[0-9a-zA-Z_-]+\\.[a-zA-Z]{2,4})

Demo: http://try.jsoup.org/~RgbUgekyWIoSe5bvFhZdQju9ibM
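
To make the difference between the two selectors concrete, here is a toy model that uses only the standard library. It is not Jsoup's implementation. Node, select, and the ownTextOnly flag are hypothetical names, and the tiny tree and the example.org address are invented. The idea is just that a test against an element's full text (own text plus descendants) also succeeds for every ancestor, while a test against own text only does not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class MatchesDemo {
    // Hypothetical stand-in for a DOM element: a tag, its own text, and child nodes.
    static final class Node {
        final String tag;
        final String ownText;
        final List<Node> children = new ArrayList<>();

        Node(String tag, String ownText, Node... kids) {
            this.tag = tag;
            this.ownText = ownText;
            for (Node k : kids) children.add(k);
        }

        // Own text followed by the text of every descendant.
        String text() {
            StringBuilder sb = new StringBuilder(ownText);
            for (Node c : children) sb.append(' ').append(c.text());
            return sb.toString();
        }
    }

    // The regex from the question.
    static final Pattern EMAIL =
        Pattern.compile("[0-9a-zA-Z_-]+@[0-9a-zA-Z_-]+\\.[a-zA-Z]{2,4}");

    // Collects the tags (in document order) of nodes whose text contains a match.
    static List<String> select(Node root, boolean ownTextOnly) {
        List<String> hits = new ArrayList<>();
        collect(root, ownTextOnly, hits);
        return hits;
    }

    private static void collect(Node n, boolean ownTextOnly, List<String> hits) {
        String haystack = ownTextOnly ? n.ownText : n.text();
        if (EMAIL.matcher(haystack).find()) hits.add(n.tag);
        for (Node c : n.children) collect(c, ownTextOnly, hits);
    }

    public static void main(String[] args) {
        Node page = new Node("html", "",
            new Node("body", "",
                new Node("p", "Paul Boersma",
                    new Node("a", "someone@example.org"))));
        System.out.println(select(page, false)); // [html, body, p, a]
        System.out.println(select(page, true));  // [a]
    }
}
```

The first call behaves like the :matches query in the question: four elements are reported for a single address, which is why the output repeats. The second call behaves like option 2 and reports only the element that directly holds the address.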

License: CC-BY-SA with attribution
Not affiliated with StackOverflow