Split line in words and remove String & Numeric words in Java Regex

Question 1

I got a solution in 3 phases.

Phase 1: Remove all Strings first

String line = srcLine.replaceAll("\"(?:[^\"]|\"\")*\"|'(?:[^']|'')*'", "");

Phase 2: Break the line in words

Pattern pattern = Pattern.compile("\\b(?:(?<=\")[^\"]*(?=\")|(?<=\\')[^\\']*(?=\\')|[\\w\\d-]+)\\b");

Phase 3: Finally, discard the numeric data.
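Put together, the three phases might look like the sketch below. It is only an illustration: the class name WordSplitter, the simplified word pattern `[\w-]+`, and the definition of "numeric" as an optionally signed run of digits are my assumptions, and the literal pattern uses a non-capturing group so that a doubled quote inside a literal is treated as an escaped quote.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordSplitter {
    // Phase 1: quoted literals; a doubled quote inside a literal counts as an escaped quote.
    private static final Pattern LITERAL =
            Pattern.compile("\"(?:[^\"]|\"\")*\"|'(?:[^']|'')*'");
    // Phase 2: a word is a run of letters, digits, underscores, or hyphens (simplified).
    private static final Pattern WORD = Pattern.compile("[\\w-]+");
    // Phase 3: a word made only of digits, optionally signed (assumed definition of "numeric").
    private static final Pattern NUMERIC = Pattern.compile("[+-]?\\d+");

    static List<String> words(String srcLine) {
        // Replace each literal with a space so the neighbouring words stay separated.
        String line = LITERAL.matcher(srcLine).replaceAll(" ");
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            String word = m.group();
            if (!NUMERIC.matcher(word).matches()) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("05  ECPRF-057  PIC X(5) VALUE 'AB''C' 12."));
    }
}
```

For the sample line in main this keeps ECPRF-057, PIC, X and VALUE, and drops the literal, the level number, the length and the trailing 12.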

Question 2

No, you didn't.

05  ECPRF-057     PIC S9(4) VALUE +0057 COMP-3.
05  ECPRF-057     COMP-3 PIC S9(4) VALUE +0057.
05  ECPRF-057     VALUE +0057 PIC S9(4) USAGE COMP-3.
05  ECPRF-057     VALUE +0057 PIC S9(4) USAGE IS COMP-3.
05  ECPRF-057     VALUE +0057 PICTURE IS S9(4) USAGE IS COMP-3.
05  ECPRF-057     VALUE +0057 PICTURE S9(4) USAGE IS COMP-3.
05  ECPRF-057     VALUE +0057 PIC S9(4) COMP-3.
05  ECPRF-057     VALUE +0057 COMP-3 PIC S9(4).
05  ECPRF-057     VALUE +0057 PACKED-DECIMAL PIC S9(4).
05  ECPRF-057     VALUE +0057 PIC S9(4).

In addition, COMP-3 can be written as COMPUTATIONAL-3 or PACKED-DECIMAL. And none of these have to be on the same line, and very often won't be.

These are all equivalent, and there are many, many more possible combinations. Am I sure even about that last one? Yes, because somewhere before that (immediately, or any number of lines before) is:

02  ECPRF-057-GROUP COMP-3. (which may also have the combinations relating to COMP-3)
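To show how far a line-by-line approach gets with those ten definitions, here is a rough sketch. It assumes each definition sits on a single line, which is exactly the normalisation real code won't give you, and the class ClauseProbe and its two methods are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClauseProbe {
    // PIC or PICTURE, an optional IS, then the picture string (minus any closing period).
    private static final Pattern PICTURE = Pattern.compile(
            "(?<![\\w-])PIC(?:TURE)?\\s+(?:IS\\s+)?(\\S+?)\\.?(?=\\s|$)");
    // The three spellings of packed decimal, wherever they appear in the entry.
    private static final Pattern PACKED = Pattern.compile(
            "(?<![\\w-])(?:COMP-3|COMPUTATIONAL-3|PACKED-DECIMAL)(?![\\w-])");

    // Returns the picture string, or null when the entry has no PICTURE clause.
    static String picture(String entry) {
        Matcher m = PICTURE.matcher(entry);
        return m.find() ? m.group(1) : null;
    }

    // True only when the packed-decimal usage is written on this entry itself.
    static boolean isPacked(String entry) {
        return PACKED.matcher(entry).find();
    }

    public static void main(String[] args) {
        String withUsage = "05  ECPRF-057     VALUE +0057 PICTURE IS S9(4) USAGE IS COMP-3.";
        String inherited = "05  ECPRF-057     VALUE +0057 PIC S9(4).";
        System.out.println(picture(withUsage) + " " + isPacked(withUsage));
        System.out.println(picture(inherited) + " " + isPacked(inherited));
    }
}
```

Both entries report the picture S9(4), but the second reports no packed-decimal usage, because that usage lives on the group-level entry. Anything working one line at a time cannot see it.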

That's without getting to the 88-level in your second example.

That's without:

05  PIC X(20) VALUE SPACE.

That's without duplicate data-names which are valid when "qualified" by a higher-level data-name, requiring the use of either IN or OF.

That's without REDEFINES.

That's without COMP/COMP-4/COMP-5/BINARY where things like the maximum value that can be held depend upon a compiler option.

Please don't attempt to do this unless all the data you are processing is already rigorously normalised.

Plus, the word VALUE itself is useless to you. What you want is the actual content of the VALUE clause. On Levels 01-49 the VALUE clause is optional and, when present, holds a single value; on a Level 88 it can list an unlimited number of items. You are also ignoring the number of digits, or bytes (which varies depending on the PICture, on whether the number of digits is even or odd, and on compile options).

You were previously asked what you were doing looking at COBOL programs programmatically, and you didn't mention this.

If you want a program to understand COBOL on the Mainframe, it already exists, it is the Enterprise COBOL compiler.

If you really want to do something by trying to "understand" a COBOL program, at least make your task orders of magnitude easier and use the compile listing which the compiler produces. You will still have to work out the number of decimal places, and the number of times something OCCURS, but these are minor things which can be specifically sought within a limited context which can be provided by data on the compile listing.

And, if you genuinely need to ignore the values associated with VALUE, you have the figurative-constants (SPACE(S), LOW-VALUES, HIGH-VALUES, ZERO(S/ES), QUOTE(S)) to deal with as well, plus NULL, which you may find as a VALUE on a USAGE POINTER item. You also need to be aware that these can be specified on the group that a given data-item is part of.
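As a small illustration of what a "discard the numeric data" step would miss, the sketch below recognises the figurative constants listed above when they appear as a word. The class name is hypothetical, and the singular forms LOW-VALUE and HIGH-VALUE are my addition to the list; it does nothing about values specified on a containing group.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class FigurativeConstants {
    // The words named in the text, plus the singular LOW-VALUE and HIGH-VALUE forms.
    private static final Set<String> WORDS = new HashSet<>(Arrays.asList(
            "SPACE", "SPACES", "LOW-VALUE", "LOW-VALUES", "HIGH-VALUE", "HIGH-VALUES",
            "ZERO", "ZEROS", "ZEROES", "QUOTE", "QUOTES", "NULL"));

    // True when the word, ignoring case and one trailing period, is a figurative constant.
    static boolean isFigurative(String word) {
        String w = word.endsWith(".") ? word.substring(0, word.length() - 1) : word;
        return WORDS.contains(w.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        for (String word : "05  PIC X(20) VALUE SPACE.".trim().split("\\s+")) {
            System.out.println(word + " -> " + isFigurative(word));
        }
    }
}
```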

Time now allows some expansion, so have a look at these:

   01  A-GROUP VALUE ZERO. 
       05  PIC 9. 
       05  A-NAME-1 PIC S9(4). 
       05  A-NAME-2 PIC S9999. 
       05  A-NAME-3 REDEFINES A-NAME-2 PIC 9999.
   01  B-GROUP BINARY. 
       05  PIC 9. 
       05  B-NAME-1 PIC S9(4). 
       05  B-NAME-2 PIC S9999. 
       05  B-NAME-3 REDEFINES B-NAME-2 PIC 9999.
   01  C-GROUP COMPUTATIONAL-3. 
       05  PIC 9. 
       05  C-NAME-1 PIC S9(4). 
       05  C-NAME-2 PIC S9999. 
       05  C-NAME-3 REDEFINES C-NAME-2 PIC 9999.

   01  D-GROUP SIGN LEADING SEPARATE. 
       05  PIC 9. 
       05  D-NAME-1 PIC S9(4). 
       05  D-NAME-2 PIC S9999. 
       05  FILLER  REDEFINES D-NAME-2. 
           10  FILLER PIC X. 
           10  D-NAME-3 PIC 9999.

If you look at the 05-level definitions, all these fields look the same from group to group. They are not: they all differ because of the additional clauses on the 01-level.
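To put numbers on that, here is a rough sketch of the storage a field would occupy under each of the four 01-level clauses. It assumes the usual IBM Enterprise COBOL rules: zoned decimal takes one byte per digit, binary takes 2, 4 or 8 bytes, packed decimal takes digits / 2 + 1 bytes, and a separate sign adds one byte to a signed field. StorageSize and the Usage enum are names made up for this illustration.

```java
class StorageSize {
    enum Usage { DISPLAY, DISPLAY_SIGN_SEPARATE, BINARY, PACKED_DECIMAL }

    // Bytes occupied by a numeric field with the given digit count (assumed sizing rules).
    static int bytes(int digits, boolean signed, Usage usage) {
        switch (usage) {
            case DISPLAY:
                return digits;                          // sign shares the last byte
            case DISPLAY_SIGN_SEPARATE:
                return signed ? digits + 1 : digits;    // extra byte only for signed fields
            case BINARY:
                return digits <= 4 ? 2 : digits <= 9 ? 4 : 8;
            case PACKED_DECIMAL:
                return digits / 2 + 1;                  // two digits per byte plus sign nibble
            default:
                throw new IllegalArgumentException("Unknown usage: " + usage);
        }
    }

    public static void main(String[] args) {
        // PIC S9(4) as it would be laid out in A-GROUP, D-GROUP, B-GROUP and C-GROUP.
        for (Usage usage : Usage.values()) {
            System.out.println(usage + ": " + bytes(4, true, usage) + " bytes");
        }
    }
}
```

Under these assumptions the same PIC S9(4) takes 4 bytes in A-GROUP, 2 in B-GROUP, 3 in C-GROUP and 5 in D-GROUP, and none of that is visible on the 05-level line.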

I've not even scratched the surface. COBOL has a very wide range of data-definitions which can be easily applied to produce complex data-structures.

COBOL is an old language. Many COBOL programs are old programs changed already by many people with different coding styles and different levels of knowledge of COBOL. Will you find definitions like the above in all your programs? No. Will you find them in some? Maybe. Can't have Maybes when you are processing data.

The data you are extracting does not make sense to me. The level-number is significant, the content of the value is significant. The number of digits in a field is significant as well as the size of a field in bytes. Perhaps you don't need any of these, but I doubt it.

Abandon this route.

If you seriously need to "understand" a COBOL program on an IBM Mainframe, compile it, with all the listing options, and use the listing. Or look at the SYSADATA appendix in the Enterprise COBOL Programming Guide and use the compiler option to generate that data. This will take longer to compile, but will leave you with less work to do if you need to accomplish several distinct tasks (you have two already).

If you try to do anything else, you are looking at a very considerable amount of work. If you are not knowledgeable in COBOL and have no source of good knowledge available for the design, your results will be "patchy" at best.

If you'd answered more fully on your previous question relating to this, you'd have saved yourself all of the above as well.

Here are some links to SO questions which may aid you if you look to continue with other solutions:

Generating Record Layouts for EBCDIC Data Files.

Is there a Python library to parse and manipulate COBOL code?

Is there a free (as in beer) Flow chart generator for COBOL Code?