Question

I have a serious problem extracting terms from each line of a file. To be more specific, I have a file with a .csv extension that is not really in CSV format: each whole line is stored as a single field (everything ends up in line[0]).

Here are a few example lines out of thousands (split() doesn't work on them!):

test.csv

"31451  CID005319044      15939353      C8H14O3S2      beta-lipoic acid     C1C[S@](=O)S[C@@H]1CCCCC(=O)O "
"12232 COD05374044 23439353  C924O3S2    saponin   CCCC(=O)O "
"9048   CTD042032 23241  C3HO4O3S2 Berberine  [C@@H]1CCCCC(=O)O "

I want to extract only "beta-lipoic acid", "saponin" and "Berberine", which are in the 5th position. There are large gaps between the terms, which is why I call it the 5th position.

In this case, how can I extract terms located in 5th position for each line?

One more thing: the amount of whitespace between the six terms is not always the same. It could be one, two, three, four, or five characters, or something like that. Because the amount of whitespace is random, I cannot use the .split() function. For example, in the first line I would get "beta-lipoic" instead of "beta-lipoic acid".


Solution

Here is a solution to your problem using String.split and indexOf:

import java.util.ArrayList;
import java.util.List;

public class StringSplit {

    public static void main(String[] args) {
        List<String> strList = new ArrayList<>();
        strList.add("31451  CID005319044      15939353      C8H14O3S2    beta-lipoic acid   C1C[S@](=O)S[C@@H]1CCCCC(=O)O ");
        strList.add("12232 COD05374044 23439353   C924O3S2   saponin       CCCC(=O)O ");
        strList.add("9048   CTD042032 23241 C3HO4O3S2  Berberine    [C@@H]1CCCCC(=O)O ");

        for (String item : strList) {
            // Split on runs of whitespace; the name may end up in several tokens
            String[] separated = item.trim().split("\\s+");
            String fourth = separated[3];
            String last = separated[separated.length - 1];
            // The name lies between the end of the 4th token and the start of the last token
            int start = item.indexOf(fourth) + fourth.length();
            int end = item.lastIndexOf(last);
            System.out.println(item.substring(start, end).trim());
        }
    }
}

Output:

beta-lipoic acid

saponin

Berberine
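One weakness of locating the name with indexOf is that it can land in the wrong place if the text of the fourth token also appears earlier in the line. A minimal sketch of an alternative that works on the token array instead, assuming every line has six fields and only the fifth can contain spaces; the class and method names (FifthField, extract) are made up for illustration:

```java
import java.util.Arrays;

public class FifthField {

    // Returns the fifth field of a whitespace-separated line, assuming the
    // first four fields and the last field never contain spaces.
    static String extract(String line) {
        // Drop the surrounding double quotes seen in test.csv, if present
        String cleaned = line.replace("\"", "").trim();
        String[] tokens = cleaned.split("\\s+");
        if (tokens.length < 6) {
            throw new IllegalArgumentException("Expected at least 6 tokens: " + line);
        }
        // Everything between the 4th token and the last token is the name
        String[] nameParts = Arrays.copyOfRange(tokens, 4, tokens.length - 1);
        return String.join(" ", nameParts);
    }

    public static void main(String[] args) {
        String line = "\"31451  CID005319044      15939353      C8H14O3S2"
                + "      beta-lipoic acid     C1C[S@](=O)S[C@@H]1CCCCC(=O)O \"";
        System.out.println(extract(line)); // beta-lipoic acid
    }
}
```

Note that this joins the parts of the name with a single space, so any wider gap inside a name would be collapsed.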

OTHER TIPS

Option 1: Use String.split and split only on two or more consecutive whitespace characters, as in the code below:

String[] s = str.split("\\s\\s+");
for (String string : s) {
    System.out.println(string);
}
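To see where this option works and where it does not, the sketch below runs the same kind of split on two of the sample lines from the question. The class and method names are hypothetical. In the first line every column gap is at least two spaces wide, so the fifth field comes out intact; in the second line the first three columns are separated by single spaces and stay glued together, which shifts the indexes.

```java
import java.util.Arrays;
import java.util.List;

public class DoubleSpaceSplit {

    // Splits on runs of two or more whitespace characters, so a single
    // space inside a term such as "beta-lipoic acid" is preserved.
    static List<String> fields(String line) {
        return Arrays.asList(line.trim().split("\\s{2,}"));
    }

    public static void main(String[] args) {
        // Every column gap is at least two spaces wide: six fields
        System.out.println(fields("31451  CID005319044      15939353      C8H14O3S2"
                + "      beta-lipoic acid     C1C[S@](=O)S[C@@H]1CCCCC(=O)O "));
        // Single spaces between the first three columns: they stay merged
        System.out.println(fields("12232 COD05374044 23439353  C924O3S2    saponin   CCCC(=O)O "));
    }
}
```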

Option 2: Implement your own split logic by walking through the characters one by one. Sample code below (it is only meant to give the idea; I did not test it).

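import java.util.ArrayList;
import java.util.List;

public static List<String> getData(String str) {
    List<String> list = new ArrayList<>();
    StringBuilder s = new StringBuilder();
    int count = 0; // consecutive spaces seen since the last non-space character
    for (char c : str.toCharArray()) {
        if (c == ' ') {
            count++;
            // Two spaces in a row end the current term
            if (count == 2 && s.length() > 0) {
                list.add(s.toString());
                s.setLength(0);
            }
        } else {
            // A single space inside a term is kept
            if (count == 1 && s.length() > 0) {
                s.append(' ');
            }
            s.append(c);
            count = 0;
        }
    }
    // Add the last term if the line did not end with two spaces
    if (s.length() > 0) {
        list.add(s.toString());
    }
    return list;
}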

This would be a relatively easy fix if it weren't for beta-lipoic acid...

Assuming that only spaces/tabs/other whitespace separate terms, you could split on whitespace.

Pattern whitespace = Pattern.compile("\\s+");
String[] terms = whitespace.split(line); // Pattern.split(CharSequence) returns the tokens
// Your desired term should be index 4 of the terms array

While this would work for the majority of your terms, this would also result in you losing the "acid" in "beta-lipoic acid"...

Another hacky solution would be to add in a check for the 6th spot in the array produced by the above code and see if it matches English letters. If so, you can be reasonably confident that the 6th spot is actually part of the same term as the 5th spot, so you can then concatenate those together. This falls apart pretty quickly though if you have terms with >= 3 words. So something like

Pattern possibleEnglishWord = Pattern.compile("[a-zA-Z]+"); // Can add dashes and such as needed
if (possibleEnglishWord.matcher(terms[5]).matches()) {
    // return terms[4] + " " + terms[5] or something like that
}

Another thing you can try is to replace all groups of spaces with a single space, and then remove every token that isn't made up of just English letters/dashes:

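line = whitespace.matcher(line).replaceAll(" ");
// Matches any token that contains a character other than a letter or a dash
Pattern notEnglishWord = Pattern.compile("\\S*[^-a-zA-Z\\s]\\S*");
line = notEnglishWord.matcher(line).replaceAll("");
line = whitespace.matcher(line).replaceAll(" ").trim(); // tidy up the leftover spaces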

Then hopefully the only thing that is left would be the term you're looking for.

Hopefully this helps, but I do admit it's rather convoluted. One of the issues is that it appears that non-term words may have only one space between them, which would fool Option 1 as presented by Hirak... If that weren't the case that option should work.

Oh by the way, if you do end up doing this, put the Pattern declarations outside of any loops. They only need to be created once.
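As a sketch of that advice, the pattern below is compiled once as a constant and reused for every line. The regular expression itself is my own assumption rather than something taken from the answers above: it skips four space-free tokens, captures everything up to the final token, and therefore only works if the fifth field is the only one that can contain spaces and any surrounding quotes have already been stripped. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NameExtractor {

    // Compiled once: four leading tokens, the captured name, one trailing token
    private static final Pattern FIFTH_FIELD =
            Pattern.compile("^\\s*(?:\\S+\\s+){4}(.+?)\\s+\\S+\\s*$");

    static List<String> extractAll(List<String> lines) {
        List<String> names = new ArrayList<>();
        for (String line : lines) {
            Matcher m = FIFTH_FIELD.matcher(line);
            // Lines that do not fit the assumed six-field shape are skipped
            if (m.matches()) {
                names.add(m.group(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "31451  CID005319044      15939353      C8H14O3S2      beta-lipoic acid     C1C[S@](=O)S[C@@H]1CCCCC(=O)O ",
                "9048   CTD042032 23241  C3HO4O3S2 Berberine  [C@@H]1CCCCC(=O)O ");
        System.out.println(extractAll(lines));
    }
}
```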

Licensed under: CC-BY-SA with attribution