Question

What is the best way to extract each field from each line when there is no clear separator (delimiter) between the fields?

Here is a sample of the lines whose fields I need to extract:

3/3/2010 11:00:46 AM                      BASEMENT-IN          
3/3/2010 11:04:04 AM 2, YaserAlNaqeb      BASEMENT-OUT         
3/3/2010 11:04:06 AM                      BASEMENT-IN          
3/3/2010 11:04:18 AM                      BASEMENT-IN          
3/3/2010 11:14:32 AM 4, Dhileep              BASEMENT-OUT         
3/3/2010 11:14:34 AM                      BASEMENT-IN          
3/3/2010 11:14:41 AM                      BASEMENT-IN          
3/3/2010 11:15:33 AM 4, Dhileep           BASEMENT-IN          
3/3/2010 11:15:42 AM                      BASEMENT-IN          
3/3/2010 11:15:42 AM                      BASEMENT-IN          
3/3/2010 11:30:22 AM 34, KumarRaju        BASEMENT-IN          
3/3/2010 11:31:28 AM 39, Eldrin           BASEMENT-OUT         
3/3/2010 11:31:31 AM                      BASEMENT-IN          
3/3/2010 11:31:39 AM                      BASEMENT-IN          
3/3/2010 11:32:38 AM 39, Eldrin           BASEMENT-IN          
3/3/2010 11:32:47 AM                      BASEMENT-IN          
3/3/2010 11:32:47 AM                      BASEMENT-IN          
3/3/2010 11:33:26 AM 34, KumarRaju        BASEMENT-OUT         
3/3/2010 11:33:28 AM                      BASEMENT-IN    

There are 6 fields in each line and some of them can be empty. What is the best way to approach this problem?

  • I'm using Java

Edit 01

  • Field 5 can be empty (however, its existence should be recognized in all cases)
  • The number of spaces can change
  • The last word can change

Solution

To me there seem to be 3 meta-fields:

3/3/2010 11:32:38 AM 39, Eldrin           BASEMENT-IN          
3/3/2010 11:32:47 AM                      BASEMENT-IN 

MF1: 3/3/2010 11:32:38 AM

MF2: 39, Eldrin

MF3: BASEMENT-IN

of which MF2 is optional. My delimiters then would be:

MF1: everything up to and including AM or PM

MF2: a number, a comma, then anything except BASEMENT-*

MF3: BASEMENT-*

I'm not all that good at regexes but I would extract those 3 groups as something like

(anything)(AM|PM)(number,anything)?(BASEMENT-anything)

where the ? marks the preceding group as optional.
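The grouping above could be written as a real Java regular expression along these lines. This is only a sketch: the class name LogLineRegex is made up, and it assumes the date looks like M/d/yyyy, the time like h:mm:ss, the ID is digits followed by a comma, and the last field is a single word with no whitespace (it is not hard-coded to BASEMENT-*, since the question says the last word can change).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogLineRegex {
    // Group 1: date, time and AM/PM marker.
    // Groups 2 and 3: optional numeric ID and name (null when absent).
    // Group 4: the last whitespace-free word, e.g. BASEMENT-IN.
    private static final Pattern LINE = Pattern.compile(
        "^\\s*(\\d{1,2}/\\d{1,2}/\\d{4}\\s+\\d{1,2}:\\d{2}:\\d{2}\\s+(?:AM|PM))"
        + "\\s+(?:(\\d+)\\s*,\\s*(.*?)\\s+)?"
        + "(\\S+)\\s*$");

    /** Returns {timestamp, id, name, location}; id and name are null if missing. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized line: " + line);
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String[] f = parse("3/3/2010 11:04:04 AM 2, YaserAlNaqeb      BASEMENT-OUT         ");
        System.out.println(String.join(" | ", f));
    }
}
```

Because the ID/name group is optional, a line without it still matches and simply yields null for groups 2 and 3, so the missing field is still recognized.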

OTHER TIPS

Well you can strip off the date and the BASEMENT-FOO data by column number, since they always appear at the same point in the line. Then you can split the remainder based on commas. Whether you need to handle escaped commas \, or commas in quotes "foo, bar" is up to you and your business requirements.
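A rough sketch of the column-based idea follows. The class name FixedColumnParser and both column parameters are assumptions, not something given in the question; you would have to measure the real offsets in your file, and the approach only works if those offsets really are the same on every line.

```java
class FixedColumnParser {
    /**
     * Cuts the line at two caller-supplied columns (0-based):
     * the timestamp occupies [0, timestampEnd), the last field starts at locationStart.
     * Returns {timestamp, id, name, location}; id and name are "" if missing.
     */
    static String[] parse(String line, int timestampEnd, int locationStart) {
        if (line.length() < locationStart) {
            throw new IllegalArgumentException("Line too short: " + line);
        }
        String timestamp = line.substring(0, timestampEnd).trim();
        String middle = line.substring(timestampEnd, locationStart).trim();
        String location = line.substring(locationStart).trim();

        String id = "";
        String name = "";
        int comma = middle.indexOf(',');
        if (comma >= 0) {
            // Split the remainder on the first comma: "39, Eldrin" -> "39" and "Eldrin".
            id = middle.substring(0, comma).trim();
            name = middle.substring(comma + 1).trim();
        }
        return new String[] { timestamp, id, name, location };
    }
}
```

If the number of spaces varies between lines, as the question's edit suggests it can, the fixed offsets will cut fields in the wrong place, so treat this as an option for strictly fixed-width input only.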

You can do:

  • Read an entire line as a string.
  • Split the line on whitespace (\s+). With the sample data you should get 4 pieces (no ID and name) or 6 pieces.
  • piece0, piece1 and piece2 will be the date, time and AM/PM.
  • Check whether piece3 starts with a number (it will carry a trailing comma, e.g. "2,"): if so, read the next piece as the name.
  • The last piece is the BASEMENT-* value.
  • Convert the pieces from strings to, say, date, time and int as needed.
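The steps above might look roughly like this in Java. The class name LogLineSplitter is made up, and the sketch assumes the ID token is digits with an optional trailing comma (as in "39,") and that everything between the ID and the last piece belongs to the name.

```java
import java.util.Arrays;

class LogLineSplitter {
    /** Returns {date, time, amPm, id, name, location}; id and name are "" if missing. */
    static String[] split(String line) {
        // trim() first so leading whitespace does not produce an empty first piece.
        String[] p = line.trim().split("\\s+");
        if (p.length < 4) {
            throw new IllegalArgumentException("Too few pieces: " + line);
        }
        String date = p[0];
        String time = p[1];
        String amPm = p[2];
        String location = p[p.length - 1];

        String id = "";
        String name = "";
        // The ID piece looks like "39," in the sample; strip the comma.
        if (p.length > 4 && p[3].matches("\\d+,?")) {
            id = p[3].replace(",", "");
            // Join whatever sits between the ID and the last piece as the name.
            name = String.join(" ", Arrays.copyOfRange(p, 4, p.length - 1));
        }
        return new String[] { date, time, amPm, id, name, location };
    }

    public static void main(String[] args) {
        String[] f = split("3/3/2010 11:14:32 AM 4, Dhileep              BASEMENT-OUT         ");
        System.out.println(String.join("|", f));
    }
}
```

The result always has six slots, so the optional fields are present (as empty strings) even when the line does not contain them. Converting the date and time pieces to real date types is left out here.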

Find the columns in each line where blank characters are adjacent to non-blank ones, then do a statistical analysis on those numbers: those which occur in every line or almost every line are very probably the field boundaries.

Similarly for punctuation adjacent to letters, but in general it is impossible to guess whether a - or a , is meant to delimit a field or not. If it occurs in the same position in every line, it might be a delimiter, but in lists of things such as D-FL R-TX D-NY it probably isn't. So there can be no fully automatic solution for arbitrary data.
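To make the blank/non-blank idea concrete, here is a small sketch that counts, for every column, how many lines switch between whitespace and non-whitespace at that position, and reports the columns where this happens in at least a given fraction of the lines. The class name BoundaryFinder and the minFraction parameter are my own choices for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class BoundaryFinder {
    /**
     * Returns the 0-based column indexes i where, in at least minFraction of the
     * lines, the character at i-1 and the character at i differ in blankness.
     */
    static List<Integer> findBoundaries(List<String> lines, double minFraction) {
        int width = 0;
        for (String l : lines) {
            width = Math.max(width, l.length());
        }
        int[] counts = new int[width + 1];
        for (String l : lines) {
            for (int i = 1; i < l.length(); i++) {
                boolean prevBlank = Character.isWhitespace(l.charAt(i - 1));
                boolean currBlank = Character.isWhitespace(l.charAt(i));
                if (prevBlank != currBlank) {
                    counts[i]++;
                }
            }
        }
        List<Integer> result = new ArrayList<>();
        for (int i = 1; i < width; i++) {
            if (counts[i] >= minFraction * lines.size()) {
                result.add(i);
            }
        }
        return result;
    }
}
```

Columns that show up with a high fraction are candidates for field boundaries, not proof of them; as noted above, a human still has to judge whether a recurring position really separates fields.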

Since each field is very distinct (at least in the example you pasted above), you can do this:

  1. Split the string into tokens.
  2. Run each element of the tokenized array through a Regex Pattern.
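As a sketch of step 2, each token could be checked against a small table of patterns, tried in order. The class name TokenClassifier, the kind names and the patterns themselves are assumptions based only on the sample lines; in particular the LOCATION pattern (uppercase words joined by a hyphen) is a guess, since the last word can change.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class TokenClassifier {
    // LinkedHashMap keeps insertion order, so patterns are tried top to bottom.
    private static final Map<String, Pattern> KINDS = new LinkedHashMap<>();
    static {
        KINDS.put("DATE", Pattern.compile("\\d{1,2}/\\d{1,2}/\\d{4}"));
        KINDS.put("TIME", Pattern.compile("\\d{1,2}:\\d{2}:\\d{2}"));
        KINDS.put("AMPM", Pattern.compile("AM|PM"));
        KINDS.put("ID", Pattern.compile("\\d+,?"));
        KINDS.put("LOCATION", Pattern.compile("[A-Z]+-[A-Z]+"));
    }

    /** Returns the first kind whose pattern matches the whole token, or "NAME" as a fallback. */
    static String classify(String token) {
        for (Map.Entry<String, Pattern> e : KINDS.entrySet()) {
            if (e.getValue().matcher(token).matches()) {
                return e.getKey();
            }
        }
        return "NAME";
    }

    public static void main(String[] args) {
        for (String t : "3/3/2010 11:04:04 AM 2, YaserAlNaqeb BASEMENT-OUT".split("\\s+")) {
            System.out.println(t + " -> " + classify(t));
        }
    }
}
```

Treating anything unrecognized as a name is a simplification; a stricter version would reject tokens that match no pattern.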

You can use StrTokenizer from Commons Lang and specify multiple delimiters to split on. It supports a number of built-in matcher types via StrMatcher:

StrTokenizer(char[] input, StrMatcher delim) 

e.g.

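// Needs org.apache.commons.lang.text.StrMatcher and StrTokenizer
// (org.apache.commons.lang3.text in Commons Lang 3).
// input is a String holding the lines to tokenize.
StrMatcher delims = StrMatcher.charSetMatcher(new char[] {' ', ',', '\n'});
StrTokenizer str = new StrTokenizer(input, delims);
while (str.hasNext()) {
    System.out.println("Token:[" + str.nextToken() + "]");
}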

will give (from the example above):

Token:[3/3/2010]
Token:[11:00:46]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:04:04]
Token:[AM]
Token:[2]
Token:[YaserAlNaqeb]
Token:[BASEMENT-OUT]
Token:[3/3/2010]
Token:[11:04:06]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:04:18]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:14:32]
Token:[AM]
Token:[4]
Token:[Dhileep]
Token:[BASEMENT-OUT]
Token:[3/3/2010]
Token:[11:14:34]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:14:41]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:15:33]
Token:[AM]
Token:[4]
Token:[Dhileep]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:15:42]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:15:42]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:30:22]
Token:[AM]
Token:[34]
Token:[KumarRaju]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:31:28]
Token:[AM]
Token:[39]
Token:[Eldrin]
Token:[BASEMENT-OUT]
Token:[3/3/2010]
Token:[11:31:31]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:31:39]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:32:38]
Token:[AM]
Token:[39]
Token:[Eldrin]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:32:47]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:32:47]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:33:26]
Token:[AM]
Token:[34]
Token:[KumarRaju]
Token:[BASEMENT-OUT]
Token:[3/3/2010]
Token:[11:33:28]
Token:[AM]
Token:[BASEMENT-IN]
Licensed under: CC-BY-SA with attribution