How can I extract specific terms from string lines in Java?

Question 1

Here is a solution to your problem using String.split and String.indexOf:

import java.util.ArrayList;

public class StringSplit {

    public static void main(String[] args) {
        String[] seperatedStr = null;
        int fourthStrIndex = 0;
        String modifiedStr = null, finalStr = null;
        ArrayList<String> strList = new ArrayList<String>();
        strList.add("31451  CID005319044    　　15939353　　    C8H14O3S2    beta-lipoic acid   C1C[S@](=O)S[C@@H]1CCCCC(=O)O ");
        strList.add("12232 COD05374044 23439353   C924O3S2   saponin       CCCC(=O)O ");
        strList.add("9048   CTD042032 23241 C3HO4O3S2  Berberine    [C@@H]1CCCCC(=O)O ");

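        for (String item: strList) {
            // Split on runs of whitespace; the first four tokens are the ID and formula columns.
            seperatedStr = item.split("\\s+");
            // Position just past the fourth token (the molecular formula).
            fourthStrIndex = item.indexOf(seperatedStr[3])  + seperatedStr[3].length();
            modifiedStr = item.substring(fourthStrIndex, item.length());
            // Cut off the last token (the structure string); lastIndexOf avoids
            // matching an earlier occurrence of the same text inside the name.
            finalStr = modifiedStr.substring(0, modifiedStr.lastIndexOf(seperatedStr[seperatedStr.length - 1]));
            System.out.println(finalStr.trim());
        }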
    }
}

Output:

beta-lipoic acid
saponin
Berberine

Question 2

Option 1: Use String.split and split on two or more consecutive whitespace characters, as in the code below:

String[] s = str.split("\\s\\s+"); // one whitespace character followed by at least one more
for (String string : s) {
    System.out.println(string);
}
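To make Option 1 concrete, here is a minimal runnable sketch. The class name TwoSpaceSplitDemo, the helper splitColumns and the two sample lines (written with plain ASCII spaces) are assumptions for illustration only. It shows the split working when every column gap is at least two spaces wide, and what happens when some columns are separated by a single space.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class TwoSpaceSplitDemo {
    // Compiled once; matches a run of two or more whitespace characters.
    static final Pattern COLUMN_GAP = Pattern.compile("\\s{2,}");

    // Hypothetical helper: splits a line into columns at gaps of 2+ whitespace characters.
    static String[] splitColumns(String line) {
        return COLUMN_GAP.split(line.trim());
    }

    public static void main(String[] args) {
        // Every column gap is at least two spaces wide.
        String wide = "31451  CID005319044    15939353    C8H14O3S2    beta-lipoic acid   C1C[S@](=O)S[C@@H]1CCCCC(=O)O";
        // The first three columns are separated by single spaces only.
        String narrow = "12232 COD05374044 23439353   C924O3S2   saponin       CCCC(=O)O";

        System.out.println(Arrays.toString(splitColumns(wide)));
        System.out.println(Arrays.toString(splitColumns(narrow)));
    }
}
```

For the first line the name ends up intact at index 4 of the six columns. For the second line the first three columns stay glued together, so only four columns come back and a fixed index no longer points at the name.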

Option 2: Implement your own split logic by iterating over all the characters. Sample code below (this is only meant to give the idea; I did not test it):

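// Requires: import java.util.ArrayList; import java.util.List;
public static List<String> getData(String str) {
    List<String> list = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    int spaces = 0; // length of the current run of consecutive spaces
    for (char c : str.toCharArray()) {
        if (c == ' ') {
            spaces++;
            continue;
        }
        if (spaces > 1 && current.length() > 0) {
            // Two or more spaces end the current field.
            list.add(current.toString());
            current.setLength(0);
        } else if (spaces == 1 && current.length() > 0) {
            // A single space belongs to the field, e.g. "beta-lipoic acid".
            current.append(' ');
        }
        spaces = 0;
        current.append(c);
    }
    if (current.length() > 0) {
        list.add(current.toString()); // do not lose the last field
    }
    return list;
}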

Question 3

This would be a relatively easy fix if it weren't for beta-lipoic acid...

Assuming that only spaces/tabs/other whitespace separate terms, you could split on whitespace.

Pattern whitespace = Pattern.compile("\\s+"); // requires import java.util.regex.Pattern
String[] terms = whitespace.split(line);
// Your desired term should be at index 4 of the terms array

While this would work for the majority of your terms, this would also result in you losing the "acid" in "beta-lipoic acid"...

Another hacky solution would be to check the 6th spot in the array produced by the above code and see whether it consists only of English letters. If so, you can be reasonably confident that the 6th spot is part of the same term as the 5th spot, so you can concatenate the two. This falls apart pretty quickly, though, if you have terms with 3 or more words. So something like:

Pattern possibleEnglishWord = Pattern.compile("[a-zA-Z]+"); // Can add dashes and such as needed
if (terms.length > 5 && possibleEnglishWord.matcher(terms[5]).matches()) {
    // terms[5] is probably the second word of the term, so join the two
    String term = terms[4] + " " + terms[5];
}

Another thing you can try is to replace every group of whitespace with a single space, and then remove every token that isn't made up of just English letters/dashes:

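line = whitespace.matcher(line).replaceAll(" ");
// Matches a whole whitespace-delimited token that contains at least one
// character other than an English letter or a dash
Pattern notEnglishWord = Pattern.compile("(?<!\\S)(?=\\S*[^a-zA-Z\\s-])\\S+");
// Strings are immutable, so the result has to be assigned back
line = notEnglishWord.matcher(line).replaceAll("").trim();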

Then hopefully the only thing that is left would be the term you're looking for.

Hopefully this helps, but I do admit it's rather convoluted. One of the issues is that the non-term columns may have only one space between them, which would fool Option 1 as presented by Hirak. If that weren't the case, that option should work.

Oh by the way, if you do end up doing this, put the Pattern declarations outside of any loops. They only need to be created once.
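
As an illustration of keeping the Pattern outside the loop, here is a minimal sketch. The class name TermExtractor, the helper extractName and the sample lines (written with plain ASCII spaces) are hypothetical. Instead of the letter check, it assumes that the first four columns and the last column never contain whitespace, so every token in between belongs to the term, however many words it has.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class TermExtractor {
    // Compiled once, outside any loop, and reused for every line.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Hypothetical helper: joins the tokens from index 4 up to (but not including)
    // the last token. Assumes four leading columns and one trailing column.
    static String extractName(String line) {
        String[] terms = WHITESPACE.split(line.trim());
        if (terms.length < 6) {
            return ""; // too few columns to contain a term
        }
        return String.join(" ", Arrays.copyOfRange(terms, 4, terms.length - 1));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "31451  CID005319044    15939353    C8H14O3S2    beta-lipoic acid   C1C[S@](=O)S[C@@H]1CCCCC(=O)O ",
                "12232 COD05374044 23439353   C924O3S2   saponin       CCCC(=O)O ",
                "9048   CTD042032 23241 C3HO4O3S2  Berberine    [C@@H]1CCCCC(=O)O ");
        for (String line : lines) {
            System.out.println(extractName(line));
        }
    }
}
```

This no longer depends on how many spaces separate the columns, but it breaks if any of the other columns can itself contain whitespace.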