Question

I'm having trouble extracting terms from each line of a file. To be more specific, I have a file with a .csv extension that is not really in CSV format: each line is stored as a single string (everything ends up in line[0]).

Here are a few example lines out of thousands:

test.csv

"31451  CID005319044      15939353      C8H14O3S2      beta-lipoic acid     C1C[S@](=O)S[C@@H]1CCCCC(=O)O "
"12232 COD05374044 23439353  C924O3S2    saponin   CCCC(=O)O "
"9048   CTD042032 23241  C3HO4O3S2 Berberine  [C@@H]1CCCCC(=O)O "

I want to extract only "beta-lipoic acid", "saponin" and "Berberine", which are in the 5th position. There are large gaps between the terms, which is why I call it the 5th position.

In this case, how can I extract the term in the 5th position of each line?

One more thing: the amount of whitespace between the six terms is not always the same. It could be one, two, three, four, five, or more characters. Because the amount of whitespace varies, I cannot use the .split() function. For example, in the first line I would get "beta-lipoic" instead of "beta-lipoic acid".

Solution

Here is an algorithm for this:

  • Read each line of your file.
  • For each line read:
    • Split it by the separator (not sure if spaces or tab \t character, it depends on your file content).
    • Retrieve the 5th element.
    • Store it in a collection, usually a List<String>.

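You can easily accomplish this using the Scanner class:

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

List<String> desiredContent = new ArrayList<>();
// try-with-resources closes the scanner (and the file) when done
try (Scanner scanner = new Scanner(new File("/path/to/file.csv"))) {
    while (scanner.hasNextLine()) {
        String line = scanner.nextLine();
        // " " assumes exactly one space between fields; adjust the separator to your file
        String[] contents = line.split(" ");
        desiredContent.add(contents[4]);
    }
}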
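
If the separator really is a run of one or more spaces and only the 5th term can contain spaces itself, one possible way to apply the steps above is to split on any whitespace, treat the first four tokens as terms 1 to 4 and the last token as term 6, and join whatever is left in between. This is only a sketch under those assumptions (six terms per line, surrounding quotes as in the sample lines); FifthField and fifthField are made-up names for illustration.

```java
import java.util.Arrays;
import java.util.List;

class FifthField {
    // Hypothetical helper: assumes six terms per line and that only the 5th may contain spaces.
    static String fifthField(String line) {
        // Drop the quotes and outer whitespace seen in the sample lines.
        String cleaned = line.replace("\"", "").trim();
        String[] tokens = cleaned.split("\\s+");
        if (tokens.length < 6) {
            throw new IllegalArgumentException("Expected at least 6 tokens: " + line);
        }
        // Tokens 0..3 are terms 1-4 and the last token is term 6; the rest is term 5.
        return String.join(" ", Arrays.copyOfRange(tokens, 4, tokens.length - 1));
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "\"31451  CID005319044      15939353      C8H14O3S2      beta-lipoic acid     C1C[S@](=O)S[C@@H]1CCCCC(=O)O \"",
            "\"12232 COD05374044 23439353  C924O3S2    saponin   CCCC(=O)O \"",
            "\"9048   CTD042032 23241  C3HO4O3S2 Berberine  [C@@H]1CCCCC(=O)O \"");
        for (String line : lines) {
            System.out.println(fifthField(line));
        }
    }
}
```

Note that a multi-word name is rebuilt with single spaces, so any longer gap inside the name itself would be collapsed.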

OTHER TIPS

You could use a scanner and the next methods.

http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html

http://www.tutorialspoint.com/java/util/scanner_next.htm

Hopefully this puts you on the right track!

You could use the split method of String.

First, read the file line by line.

Example:

String[] result = scanner.nextLine().split(" ");
System.out.print(result[4]);

split gives you an array of strings, cut at every space. Index 4 is the 5th position. Note that with several spaces in a row, split(" ") produces empty strings between them, so index 4 is only the 5th term when the terms are separated by exactly one space.

You could try using a regular expression.

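import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

List<String> extracted = new ArrayList<String>();
Scanner scanner = new Scanner(new File("filepath/file.csv"));

while (scanner.hasNextLine())  // hasNextLine() pairs with nextLine()
{
    String line = scanner.nextLine();
    String[] contents = line.split("\\s\\s+");  //matches two or more whitespace characters
    extracted.add(contents[4]);
}
scanner.close();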

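\\s\\s+ will split only where there are two or more whitespace characters in a row.

Note: \\s also matches tabs, so a single tab (like a single space) is not treated as a separator. In the second and third sample lines from the question, some terms are separated by a single space, so this split returns fewer than five elements there and contents[4] would be out of bounds.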
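
If single-space separators do occur, a regular expression with a capture group might work better than split. The sketch below assumes six terms per line, that terms 1 to 4 and term 6 never contain whitespace, and that lines may be wrapped in quotes as in the samples. FifthFieldRegex and fifthField are made-up names for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FifthFieldRegex {
    // Optional quote, four space-free terms, a lazily matched 5th term (group 1),
    // then a space-free 6th term and an optional closing quote.
    private static final Pattern LINE = Pattern.compile(
        "^\"?\\s*(?:\\S+\\s+){4}(.+?)\\s+[^\\s\"]+\\s*\"?\\s*$");

    // Hypothetical helper: returns the 5th term or throws if the line does not fit the assumed layout.
    static String fifthField(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected line layout: " + line);
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(fifthField(
            "\"12232 COD05374044 23439353  C924O3S2    saponin   CCCC(=O)O \""));
    }
}
```

Unlike joining tokens, the capture group keeps the 5th term exactly as it appears in the line, including any internal spacing.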

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow