Question

I am working with unstructured text exported from a PDF. The section I'm looking at was originally a table, but the conversion to text kept only its general layout.

For example, here is some sample input:

  A        B     C     D         E
 1        2                     3
 4              6     7    

The first line indicates the headers, and the following lines are the values.

Fortunately, the spacing is (somewhat) preserved: there will always be at least two spaces between adjacent columns. However, the actual number of spaces varies depending on how the parser handled the structure of the original table.

I want to parse these lines into the following arrays. I would first parse the header to get the columns, and then use it as a template for parsing the remaining lines.

{"A", "B", "C", "D", "E"}
{"1", "2",  "",  "", "3"}
{"4",  "", "6", "7",  ""}

Is it possible to accurately do this, given only this information?

Solution

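You could record the index of each header (A, B, ...) in the header string, then compare the index of each value in the following lines against those header indexes and assign the value to the closest one. Note that this quick attempt treats every non-space character as a separate value, so it only handles single-character headers and values: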

import java.util.HashMap;
import java.util.Map;

public static void main(String[] args) {
    String headerColumn = "  A        B     C     D         E";
    String firstLine = " 1        2                     3";
    String secondLine = " 4              6     7    ";

    // Map each header's character index to its name
    Map<Integer, String> indexHeaderMap = new HashMap<>();
    for (int i = 0; i < headerColumn.length(); i++) {
        String currChar = String.valueOf(headerColumn.charAt(i));
        if (!currChar.equals(" ")) {
            indexHeaderMap.put(i, currChar);
        }
    }

    // Parse first line
    parseLine(firstLine, indexHeaderMap);
    // Parse second line
    parseLine(secondLine, indexHeaderMap);
}

And the helper methods:

private static void parseLine(String pLine, Map<Integer, String> pHeaderMap) {
    for (int i = 0; i < pLine.length(); i++) {
        String currChar = String.valueOf(pLine.charAt(i));
        if (!currChar.equals(" ")) {
            int valueColumnIndex = getNearestColumnIndex(i, pHeaderMap);
            System.out.println("Value " + currChar + " is on column " + pHeaderMap.get(valueColumnIndex));
        }
    }
}

// Returns the header index closest to pIndex, or -1 if the map is empty
private static int getNearestColumnIndex(int pIndex,
        Map<Integer, String> pHeaderMap) {
    // Start from the largest possible distance so lines of any length work
    int minDiff = Integer.MAX_VALUE;
    int nearestColumnIndex = -1;
    for (Map.Entry<Integer, String> mapEntry : pHeaderMap.entrySet()) {
        int diff = Math.abs(mapEntry.getKey() - pIndex);
        if (diff < minDiff) {
            minDiff = diff;
            nearestColumnIndex = mapEntry.getKey();
        }
    }

    return nearestColumnIndex;
}

Here's the output:

Value 1 is on column A
Value 2 is on column B
Value 3 is on column E
Value 4 is on column A
Value 6 is on column C
Value 7 is on column D
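
The quick version above only handles single-character values and just prints the matches. Here is a minimal sketch of the same nearest-header idea that also copes with longer values and builds the arrays asked for in the question. The class name ColumnAligner, the methods columnStarts and parseRow, and the regex are my own choices for illustration. It assumes that a cell never contains two consecutive spaces and that each value starts closest to the start of its own header, which may not hold for right-aligned or centered columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColumnAligner {
    // A cell is a run of non-space text that may contain single spaces;
    // two or more spaces end the cell ("at least two spaces" between columns).
    private static final Pattern CELL = Pattern.compile("\\S+(?: \\S+)*");

    // Start offset of every cell in the header line, in left-to-right order.
    static List<Integer> columnStarts(String header) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = CELL.matcher(header);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    // Puts each cell of the line into the column whose header start is nearest.
    // Columns without a value stay as empty strings. If two cells map to the
    // same column, the later one overwrites the earlier one (an assumption of
    // this sketch, not something the question specifies).
    static String[] parseRow(String line, List<Integer> starts) {
        String[] row = new String[starts.size()];
        Arrays.fill(row, "");
        if (starts.isEmpty()) {
            return row;
        }
        Matcher m = CELL.matcher(line);
        while (m.find()) {
            int best = 0;
            for (int c = 1; c < starts.size(); c++) {
                int diff = Math.abs(starts.get(c) - m.start());
                int bestDiff = Math.abs(starts.get(best) - m.start());
                if (diff < bestDiff) {
                    best = c;
                }
            }
            row[best] = m.group();
        }
        return row;
    }

    public static void main(String[] args) {
        String header = "  A        B     C     D         E";
        List<Integer> starts = columnStarts(header);

        System.out.println(Arrays.toString(parseRow(header, starts)));
        // [A, B, C, D, E]
        System.out.println(Arrays.toString(
                parseRow(" 1        2                     3", starts)));
        // [1, 2, , , 3]
        System.out.println(Arrays.toString(
                parseRow(" 4              6     7    ", starts)));
        // [4, , 6, 7, ]
    }
}
```

Comparing start offsets is the simplest choice. If the values in your PDF are right-aligned, comparing end offsets (or cell centers) instead would probably match the columns more reliably.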

I hope this is enough to get you to the result you expect!

Licensed under: CC-BY-SA with attribution