Question

I want to capture dates that are written with month names. A date can take any of the forms below:

  • Jan 1st 2013
  • Jan 1 , 2013
  • Jan 1
  • 1st Jan
  • 1st Jan , 2013
  • January

Additionally, the dates will occur inside sentences. For example:

"Can we meet sometime in January in afternoon."

I am using the following regexes in Java:

((?<month>jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?|aug(ust)?|sep(t?|tember)?|oct(ober)?|nov(ember)?|dec(ember)?)((\\s+)?(?<date>\\d+)?(st|nd|rd|th))?(\\s+)?,?(\\s+)?(?<year>(20)\\d\\d)?)

((?<date>\\d+)?(st|nd|rd|th)?\\s+(?<month>jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?|aug(ust)?|sep(t?|tember)?|oct(ober)?|nov(ember)?|dec(ember)?)(\\s+)?,?(?<year>(19|20)\\d\\d)?)

I need to report the exact position of the tokens in the string after the regex has matched.

When I look at the index returned by Matcher.end(), it seems my expression also captures the space after "January". I do want to capture expressions like "Jan 1st", but the whitespace should only be consumed when the following capture group can actually match.

Is it possible to modify the regexes above to do this?

Solution

Expanding the pattern for more readability:

(
    (?<month>
        jan(uary)?
      | feb(ruary)?
      | mar(ch)?
      | apr(il)?
      | may
      | jun(e)?
      | jul(y)?
      | aug(ust)?
      | sep(t?|tember)?
      | oct(ober)?
      | nov(ember)?
      | dec(ember)?
    )
    (
        (\\s+)?
        (?<date>\\d+)?
        (st|nd|rd|th)
    )?
    (\\s+)?
    ,?
    (\\s+)?
    (?<year>(20)\\d\\d)?
)

The spaces before the year can match even when the year does not. Additionally, the date suffix can match even when the date does not.
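Both effects can be reproduced with a short test program. This is only a sketch: the class name TrailingSpaceDemo and the helper firstMatch are made up for illustration, and compiling with Pattern.CASE_INSENSITIVE is an assumption, since the question does not show how the pattern is compiled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingSpaceDemo {
    // First pattern from the question, unchanged.
    static final String ORIGINAL =
        "((?<month>jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?"
      + "|aug(ust)?|sep(t?|tember)?|oct(ober)?|nov(ember)?|dec(ember)?)"
      + "((\\s+)?(?<date>\\d+)?(st|nd|rd|th))?(\\s+)?,?(\\s+)?(?<year>(20)\\d\\d)?)";

    // Assumption: case-insensitive matching, so that "January" matches "jan(uary)?".
    static final Pattern PATTERN = Pattern.compile(ORIGINAL, Pattern.CASE_INSENSITIVE);

    /** Returns the text of the first match, or null if nothing matches. */
    static String firstMatch(String input) {
        Matcher m = PATTERN.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Brackets make the trailing whitespace visible.
        // The optional (\s+)? before the year consumes the space: prints [January ]
        System.out.println("[" + firstMatch("Can we meet sometime in January in afternoon.") + "]");
        // The suffix matches without any digits, taking "th" from "then": prints [March th]
        System.out.println("[" + firstMatch("Let's talk in March then.") + "]");
    }
}
```

The second input shows why the optional date is risky: the first two letters of an unrelated word are taken as an ordinal suffix.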

Cleaning and fixing the pattern I got this:

\\b
(?<month>
    jan(uary)?
  | feb(ruary)?
  | mar(ch)?
  | apr(il)?
  | may
  | jun(e)?
  | jul(y)?
  | aug(ust)?
  | sep(t|tember)?
  | oct(ober)?
  | nov(ember)?
  | dec(ember)?
)
(
    \\s*
    (?<date>\\d+)
    (st|nd|rd|th)?
)?
(
    \\s*
    ,?
    \\s*
    (?<year>(19|20)\\d\\d)
)?
\\b

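I removed the outer group, since you get that as group 0 anyway. The t? in sep(t?|tember)? was changed to just t. Every (\\s+)? was changed to the equivalent \\s*. I moved the ? from (?<date>\\d+)? to (st|nd|rd|th), so the suffix is optional but the digits are not. I wrapped the year, together with its leading whitespace and comma, in a group and moved the ? from (?<year>20\\d\\d) to that group, so the separator is only consumed when a year actually follows. I also allowed years starting with 19, as your second pattern does. Finally, I added word boundaries (\\b), so that the match won't start or end in the middle of a word.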
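The expanded layout does not have to be collapsed by hand before use: java.util.regex ignores whitespace in a pattern when Pattern.COMMENTS is set. A minimal sketch, assuming case-insensitive matching is wanted as well (the class and method names are invented for the example):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExpandedPatternDemo {
    // The cleaned-up pattern in its expanded layout. With Pattern.COMMENTS the
    // spaces and newlines used for indentation are not part of the regex.
    static final Pattern EXPANDED = Pattern.compile(String.join("\n",
        "\\b",
        "(?<month>",
        "    jan(uary)? | feb(ruary)? | mar(ch)?  | apr(il)?",
        "  | may        | jun(e)?     | jul(y)?   | aug(ust)?",
        "  | sep(t|tember)? | oct(ober)? | nov(ember)? | dec(ember)?",
        ")",
        "(",
        "    \\s*",
        "    (?<date>\\d+)",
        "    (st|nd|rd|th)?",
        ")?",
        "(",
        "    \\s*",
        "    ,?",
        "    \\s*",
        "    (?<year>(19|20)\\d\\d)",
        ")?",
        "\\b"),
        Pattern.COMMENTS | Pattern.CASE_INSENSITIVE);

    /** Returns the text of the first match, or null if nothing matches. */
    static String firstMatch(String input) {
        Matcher m = EXPANDED.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // No trailing space any more: prints [January]
        System.out.println("[" + firstMatch("Can we meet sometime in January in afternoon.") + "]");
        // prints [Sept 3rd, 2013]
        System.out.println("[" + firstMatch("See you Sept 3rd, 2013!") + "]");
    }
}
```

Because \\s is an escape sequence rather than literal whitespace, it keeps its meaning in this mode.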

As one line:

\\b(?<month>jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?|aug(ust)?|sep(t|tember)?|oct(ober)?|nov(ember)?|dec(ember)?)(\\s*(?<date>\\d+)(st|nd|rd|th)?)?(\\s*,?\\s*(?<year>(19|20)\\d\\d))?\\b
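Since the goal is to locate each token, the named groups can be queried for their own positions with Matcher.start(String) and Matcher.end(String) (available since Java 8). The following is a sketch only: MonthFirstDemo, span and matchEnd are hypothetical names, and the CASE_INSENSITIVE flag is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MonthFirstDemo {
    // The one-line pattern from above, split across lines only for width.
    static final Pattern MONTH_FIRST = Pattern.compile(
        "\\b(?<month>jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?"
      + "|aug(ust)?|sep(t|tember)?|oct(ober)?|nov(ember)?|dec(ember)?)"
      + "(\\s*(?<date>\\d+)(st|nd|rd|th)?)?"
      + "(\\s*,?\\s*(?<year>(19|20)\\d\\d))?\\b",
        Pattern.CASE_INSENSITIVE);

    /** Returns {start, end} of a named group in the first match, or {-1, -1} if it is absent. */
    static int[] span(String input, String group) {
        Matcher m = MONTH_FIRST.matcher(input);
        if (!m.find()) {
            return new int[] {-1, -1};
        }
        // start/end return -1 for a group that did not take part in the match.
        return new int[] {m.start(group), m.end(group)};
    }

    /** Returns the end index of the first match, or -1 if nothing matches. */
    static int matchEnd(String input) {
        Matcher m = MONTH_FIRST.matcher(input);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String sentence = "Can we meet sometime in January in afternoon.";
        // "January" occupies indices 24..30, so this prints 31 (no trailing space).
        System.out.println(matchEnd(sentence));

        int[] year = span("Jan 1 , 2013", "year");
        // prints 8-12
        System.out.println(year[0] + "-" + year[1]);
    }
}
```

One thing to watch for: because the date is \\d+, an input such as "Jan 2013" is read as date 2013 with no year. Limiting the date to one or two digits avoids that.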

Combining it with your second pattern:

\\b
(
    (?<month1>
        jan(uary)?
      | feb(ruary)?
      | mar(ch)?
      | apr(il)?
      | may
      | jun(e)?
      | jul(y)?
      | aug(ust)?
      | sep(t|tember)?
      | oct(ober)?
      | nov(ember)?
      | dec(ember)?
    )
    (
        \\s*
        (?<date1>\\d+)
        (st|nd|rd|th)?
    )?
  |
    (?<date2>\\d+)
    (st|nd|rd|th)?
    \\s*
    (?<month2>
        jan(uary)?
      | feb(ruary)?
      | mar(ch)?
      | apr(il)?
      | may
      | jun(e)?
      | jul(y)?
      | aug(ust)?
      | sep(t|tember)?
      | oct(ober)?
      | nov(ember)?
      | dec(ember)?
    )
)
(
    \\s*
    ,?
    \\s*
    (?<year>(19|20)\\d\\d)
)?
\\b

As one line:

\\b((?<month1>jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?|aug(ust)?|sep(t|tember)?|oct(ober)?|nov(ember)?|dec(ember)?)(\\s*(?<date1>\\d+)(st|nd|rd|th)?)?|(?<date2>\\d+)(st|nd|rd|th)?\\s*(?<month2>jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?|aug(ust)?|sep(t|tember)?|oct(ober)?|nov(ember)?|dec(ember)?))(\\s*,?\\s*(?<year>(19|20)\\d\\d))?\\b
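Java does not allow the same group name twice in one pattern, which is why the combined version needs month1/month2 and date1/date2. Reading the result then means checking both names. A possible helper, as a sketch (CombinedDateDemo, parse and firstNonNull are invented names; the month alternation is shared through a constant purely to keep the listing short, and case-insensitive matching is assumed):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CombinedDateDemo {
    static final String MONTHS =
        "jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?"
      + "|aug(ust)?|sep(t|tember)?|oct(ober)?|nov(ember)?|dec(ember)?";

    // Same structure as the combined one-line pattern above.
    static final Pattern COMBINED = Pattern.compile(
        "\\b((?<month1>" + MONTHS + ")(\\s*(?<date1>\\d+)(st|nd|rd|th)?)?"
      + "|(?<date2>\\d+)(st|nd|rd|th)?\\s*(?<month2>" + MONTHS + "))"
      + "(\\s*,?\\s*(?<year>(19|20)\\d\\d))?\\b",
        Pattern.CASE_INSENSITIVE);

    /** Returns whichever of the two values is set; null if neither is. */
    static String firstNonNull(String a, String b) {
        return a != null ? a : b;
    }

    /** Returns {month, date, year} for the first match (missing parts are null), or null if nothing matches. */
    static String[] parse(String input) {
        Matcher m = COMBINED.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] {
            firstNonNull(m.group("month1"), m.group("month2")),
            firstNonNull(m.group("date1"), m.group("date2")),
            m.group("year")
        };
    }

    public static void main(String[] args) {
        // prints Jan / 1 / 2013
        System.out.println(String.join(" / ", parse("1st Jan , 2013")));
        String[] monthOnly = parse("Can we meet sometime in January in afternoon.");
        // prints January / null / null
        System.out.println(monthOnly[0] + " / " + monthOnly[1] + " / " + monthOnly[2]);
    }
}
```

The same two-name lookup works for positions: use start("date1") and start("date2") and take the one that is not -1.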

OTHER TIPS

Another version:

static private String month = "(?<month>jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?|aug(ust)?|sep(t|tember)?|oct(ober)?|nov(ember)?|dec(ember)?)";
static private String suffix = "(?:st|nd|rd|th)";
static private String date = "(?<date>\\d{1,2})";
static private String year = "(?<year>\\d{4})";

// A month name (optionally followed by space followed by a date (optionally
// followed by a suffix or space and a comma) (optionally followed by space
// followed by a year))
static private String order1 = String.format(
        "%s(?:\\s+%s(?:%s|\\s+,)?(?:\\s+%s)?)?", month, date, suffix,
        year);

// A date followed by a suffix followed by a month (optionally followed by
// space and a comma) optionally followed by space and a year
static private String order2 = String.format(
        "%s%s\\s+%s(?:\\s+,)?(?:\\s+%s)?", date, suffix, month, year);

Yes, there's not a lot of reason for the String.format, but since it's static, it shouldn't be brutal performance-wise, and it makes the regex easier to read than any other way I could think of in Java.

It matches all your example patterns (and gets correct output, IIRC), including the version in a sentence. The only issue that you may have is that it will eat the comma immediately following a date of the form "Let's meet on Jan 1 , okay?", although it will not match the comma if it's written "Let's meet on Jan 1, okay?". When I say "match the comma", I mean that the overall match will include the comma; the named captures will still be correct.

I did change the year to simply match four digits. I also changed the date to only match one or two digits. Like @MarkusJarderot, I changed "september" to not have an optional "t", since the whole suffix is optional. I've tried to write both regexes so that logical blocks can be added and removed: compare with the version below, and notice how I was able to change it without rewriting the whole expression.

Something to be careful of: in some cases, both regexes will match (order1 just matching the single month, order2 matching a date of the form "1st Jan"). You may want to figure out how to choose which expression to follow in such cases.
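If the exact token positions matter, the comma behaviour can be sidestepped by taking positions from the named groups instead of from the overall match. A small sketch of that idea, using order1 as defined above (CommaDemo, matchedText and lastTokenEnd are made-up names, and the CASE_INSENSITIVE flag is an assumption):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommaDemo {
    static final String MONTH = "(?<month>jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?"
        + "|jul(y)?|aug(ust)?|sep(t|tember)?|oct(ober)?|nov(ember)?|dec(ember)?)";
    static final String SUFFIX = "(?:st|nd|rd|th)";
    static final String DATE = "(?<date>\\d{1,2})";
    static final String YEAR = "(?<year>\\d{4})";

    // order1 exactly as composed in the answer.
    static final Pattern ORDER1 = Pattern.compile(
        String.format("%s(?:\\s+%s(?:%s|\\s+,)?(?:\\s+%s)?)?", MONTH, DATE, SUFFIX, YEAR),
        Pattern.CASE_INSENSITIVE);

    /** Returns the text of the first match, or null if nothing matches. */
    static String matchedText(String input) {
        Matcher m = ORDER1.matcher(input);
        return m.find() ? m.group() : null;
    }

    /** Returns the end index of the last named token in the first match, or -1 if nothing matches. */
    static int lastTokenEnd(String input) {
        Matcher m = ORDER1.matcher(input);
        if (!m.find()) {
            return -1;
        }
        int end = -1;
        for (String name : new String[] {"month", "date", "year"}) {
            // end(name) is -1 for groups that did not take part in the match.
            end = Math.max(end, m.end(name));
        }
        return end;
    }

    public static void main(String[] args) {
        String spaced = "Let's meet on Jan 1 , okay?";
        // The overall match includes the comma: prints [Jan 1 ,]
        System.out.println("[" + matchedText(spaced) + "]");
        // The last real token ("1") ends at index 19: prints 19
        System.out.println(lastTokenEnd(spaced));
        // Without a space before the comma it is left alone: prints [Jan 1]
        System.out.println("[" + matchedText("Let's meet on Jan 1, okay?") + "]");
    }
}
```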

Now, these regexes have been written to try to avoid matching any dates not of the provided formats. I'd suggest modifying them to allow the following forms (# indicates item in original list):

  • Jan 1st 2013 #
  • Jan 1st, 2013 // Note comma
  • Jan 1st , 2013
  • Jan 1 , 2013 #
  • Jan 1, 2013 // Note no space before comma
  • Jan 1 #
  • January #
  • Jan // (Already supported by original example)

  • 1 Jan
  • 1st Jan #
  • 1 Jan 2013
  • 1 Jan, 2013
  • 1 Jan , 2013
  • 1st Jan 2013
  • 1st Jan, 2013
  • 1st Jan , 2013 #

The version of the code below supports the forms above. It's also better: the months have been converted to use only non-capturing groups (so there are no extra captures being created for no reason), and I've removed the capture around the whole regex per @MarkusJarderot's answer. The extended number of date formats also allows a less-contorted regex. One small issue introduced by these forms is that v1 will now try to match dates of the form "1 Jan 2013" as being "Jan 20", while v2 matches them correctly. This is the same problem I mentioned above as "something to be careful of"; you'll probably want to figure out how to decide which regex is better to use (perhaps try both and use the one that matches more date pieces).

static private String month = "(?<month>jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|jun(?:e)?|jul(?:y)?|aug(?:ust)?|sep(?:t|tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)";
static private String suffix = "(?:st|nd|rd|th)";
static private String date = "(?<date>\\d{1,2})";
static private String year = "(?<year>\\d{4})";

// A month name (optionally followed by space followed by a date (optionally
// followed by a suffix) (optionally followed by a comma, possibly with space
// before it) (optionally followed by space followed by a year))
static private String v1 = String.format(
        "%s(?:\\s+%s%s?(?:\\s*,)?(?:\\s+%s)?)?", month, date, suffix, year);

// A date (optionally followed by a suffix) followed by space followed by a
// month (optionally followed by a comma, possibly with space before it)
// optionally followed by space and a year
static private String v2 = String.format(
        "%s%s?\\s+%s(?:\\s*,)?(?:\\s+%s)?", date, suffix, month, year);

Or, as plain regexes without the Java string escaping (the output of String.format):

(?<month>jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|jun(?:e)?|jul(?:y)?|aug(?:ust)?|sep(?:t|tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)(?:\s+(?<date>\d{1,2})(?:st|nd|rd|th)?(?:\s*,)?(?:\s+(?<year>\d{4}))?)?
(?<date>\d{1,2})(?:st|nd|rd|th)?\s+(?<month>jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|jun(?:e)?|jul(?:y)?|aug(?:ust)?|sep(?:t|tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)(?:\s*,)?(?:\s+(?<year>\d{4}))?
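The suggestion to "try both and use the one that matches more date pieces" could look roughly like this. It is a sketch under assumptions: DateChooser, pieces and best are hypothetical names, both patterns are compiled case-insensitively, only the first match of each pattern is considered, and ties go to v1 for no deeper reason than needing some rule.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateChooser {
    static final String MONTH = "(?<month>jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may"
        + "|jun(?:e)?|jul(?:y)?|aug(?:ust)?|sep(?:t|tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)";
    static final String SUFFIX = "(?:st|nd|rd|th)";
    static final String DATE = "(?<date>\\d{1,2})";
    static final String YEAR = "(?<year>\\d{4})";

    // v1 and v2 as composed in the answer.
    static final Pattern V1 = Pattern.compile(
        String.format("%s(?:\\s+%s%s?(?:\\s*,)?(?:\\s+%s)?)?", MONTH, DATE, SUFFIX, YEAR),
        Pattern.CASE_INSENSITIVE);
    static final Pattern V2 = Pattern.compile(
        String.format("%s%s?\\s+%s(?:\\s*,)?(?:\\s+%s)?", DATE, SUFFIX, MONTH, YEAR),
        Pattern.CASE_INSENSITIVE);

    /** Counts how many of month, date and year took part in the current match. */
    static int pieces(Matcher m) {
        int count = 0;
        for (String name : new String[] {"month", "date", "year"}) {
            if (m.group(name) != null) {
                count++;
            }
        }
        return count;
    }

    /** Returns the text matched by whichever pattern captured more pieces, or null if neither matches. */
    static String best(String input) {
        Matcher m1 = V1.matcher(input);
        Matcher m2 = V2.matcher(input);
        boolean found1 = m1.find();
        boolean found2 = m2.find();
        if (!found1 && !found2) {
            return null;
        }
        if (!found2) {
            return m1.group();
        }
        if (!found1) {
            return m2.group();
        }
        // Both matched: prefer the one with more pieces; v1 wins ties (arbitrary choice).
        return pieces(m2) > pieces(m1) ? m2.group() : m1.group();
    }

    public static void main(String[] args) {
        // v1 alone would give "Jan 20"; v2 has three pieces, so this prints 1 Jan 2013
        System.out.println(best("1 Jan 2013"));
        // Only v1 matches: prints January
        System.out.println(best("Can we meet sometime in January in afternoon."));
    }
}
```

A text containing several dates would need a loop over find() and a comparison of match positions as well; that is left out here.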
Licensed under: CC-BY-SA with attribution