Question

I have code that creates and uses a collection such as:

List<Map<String, Object>> tableData;

This list of maps gets populated with n maps, each representing one row in a database. Each row is represented as a map from field name to the object holding that field's value (the type is irrelevant in this case). Some of the fields might be missing. The number of fields, m, is always much smaller than the number of rows (n ≈ 10000 × m). I need to reuse the same collection a few times to read through all the rows, so I can't just use some kind of lazy iterator.

Is there an efficient data structure to store this? Guava provides a Table collection but that doesn't seem to fit the bill. I am thinking about creating an interface such as:

interface TableData {
    int size();
    Map<String, Object> get(int i);
    // ... (iterators, etc.)
}

I would then create an implementation backed by a single Map<String, List<Object>>, so that I instantiate only m lists instead of n maps and build the row maps on the fly, only when needed. But I was wondering whether there is a more general-purpose data structure out there.
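
For concreteness, here is a minimal sketch of that idea (the class name and constructor are placeholders of mine; missing fields are stored as nulls so that the column lists stay aligned):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Column-oriented storage: m column lists instead of n row maps.
class ColumnarTableData implements TableData {
    private final Map<String, List<Object>> columns;
    private final int rowCount;

    ColumnarTableData(Map<String, List<Object>> columns, int rowCount) {
        this.columns = columns;
        this.rowCount = rowCount;
    }

    public int size() {
        return rowCount;
    }

    public Map<String, Object> get(int i) {
        // The row map is built on the fly, only when asked for.
        Map<String, Object> row = new HashMap<String, Object>();
        for (Map.Entry<String, List<Object>> e : columns.entrySet()) {
            Object value = e.getValue().get(i);
            if (value != null) {
                row.put(e.getKey(), value);
            }
        }
        return row;
    }
}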

Thanks


Solution 2

I ran some tests (not conclusive by any means but very indicative) to establish the memory footprint of different List<Map<String, Object>> implementations. The baseline is Java's ArrayList<> with the elements being instances of Guava's ImmutableMap.

The implementations I compared to are the following:

  1. Implementation based on a Map<String,List<Object>> using a HashMap and ArrayLists;
  2. Implementation based on a List<Object[]> using an ArrayList;
  3. Guava's HashBasedTable<Integer,String,Object>;
  4. Guava's ArrayTable<Integer,String,Object>;

My test consisted of generating n random rows, each having m columns and a "fill factor" of k, where the fill factor is the probability that a row contains a value for a given column. For simplicity, the values are random strings of length l generated using Apache Commons RandomStringUtils.
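
For reference, the rows can be generated along these lines (a sketch, not the exact benchmark code; the column names are made up):

import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import org.apache.commons.lang3.RandomStringUtils;

// Each of the m columns is present with probability k; values are random
// alphanumeric strings of length l from Apache Commons RandomStringUtils.
static Map<String, Object> randomRow(int m, int l, double k, Random random) {
    Map<String, Object> row = new HashMap<String, Object>();
    for (int col = 0; col < m; col++) {
        if (random.nextDouble() < k) {
            row.put("col" + col, RandomStringUtils.randomAlphanumeric(l));
        }
    }
    return row;
}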

But let's get to the results. With n = 200000, m = 50, l = 10 and k in (1.0, 0.75, 0.5), I got the following memory footprints as a percentage of the baseline:

    | k = 1.0  | k = 0.75 | k = 0.5  |
----------------------------------------
1.  |     71 % |     71 % |     71 % |
2.  |     71 % |     72 % |     73 % |
3.  |    111 % |    107 % |    109 % |
4.  |     71 % |     73 % |     76 % |

I tried reducing n to 20000 with about the same results.

I found the results above quite interesting. First of all, it looks like there isn't much room for improvement beyond roughly 70% of the baseline. Second, I was pleasantly surprised to find that Guava's efficient ArrayTable is as good as the two implementations proposed in this question. I'll keep digging, but I'm leaning towards solution 1.

Thanks

Other tips

First please make sure you really need to optimize.

Assuming that on average no more than maybe 50% of the columns are missing, List<Object[]> is the clear winner:

class TableDataImpl implements TableData {
    private List<Object[]> data;
    private Map<String, Integer> columnNameToIndexMap;

    public int size() {
        return data.size();
    }

    public Map<String, Object> get(int i) {
        return new ArrayMap(data.get(i));
    }

    private class ArrayMap implements Map<String, Object> {

        private final Object[] row;

        ArrayMap(Object[] row) {
            this.row = row;
        }

        // Note: Map.get takes Object, not String, so this is the signature
        // needed to actually override it.
        public Object get(Object key) {
            Integer index = columnNameToIndexMap.get(key);
            if (index == null) return null;
            return row[index];
        }

        // all the other Map stuff... a lot of code!
    }
}

I wouldn't call it simple, so make sure you really need to optimize.

Otherwise, assuming that on average no more than maybe 95% of the columns are missing, a slightly more complicated construction should do: for each row, use a home-grown BitSet (a long[]) to store which columns are present. This way you waste only a single bit per missing column, rather than a whole entry (32 or 64 bits) in the Object[].
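
A sketch of what such a row could look like (the names are illustrative; the essential part is the rank computation with Long.bitCount):

// A sparse row: one bit per column says whether the column is present,
// and the present values are packed densely into a smaller array.
class SparseRow {
    private final long[] presentBits; // home-grown BitSet, 1 bit per column
    private final Object[] values;    // present values only, in column order

    SparseRow(long[] presentBits, Object[] values) {
        this.presentBits = presentBits;
        this.values = values;
    }

    Object get(int column) {
        int word = column >>> 6;
        long bit = 1L << (column & 63);
        if ((presentBits[word] & bit) == 0) {
            return null; // column missing in this row
        }
        // Rank: count the present columns before this one to find the
        // value's position in the packed array.
        int index = Long.bitCount(presentBits[word] & (bit - 1));
        for (int w = 0; w < word; w++) {
            index += Long.bitCount(presentBits[w]);
        }
        return values[index];
    }
}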

This is even more complicated, so make sure you really need to optimize.

Assuming that many rows share the same set of columns, you could store a reference to a columnNameToIndexMap within each row, with rows that have the same column set sharing a single map instance.
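
For example (a sketch; building and interning one index map per distinct column set is left out):

import java.util.Map;

// Rows with the same column set share one index map, so each row pays
// only for a reference plus its packed values.
class CompactRow {
    private final Map<String, Integer> columnNameToIndexMap; // shared instance
    private final Object[] values; // one slot per column actually present

    CompactRow(Map<String, Integer> sharedIndexMap, Object[] values) {
        this.columnNameToIndexMap = sharedIndexMap;
        this.values = values;
    }

    Object get(String key) {
        Integer index = columnNameToIndexMap.get(key);
        return index == null ? null : values[index];
    }
}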

Well, if it is important to have all of the table data in memory at once, it isn't going to make much of a difference which direction you store the data structures (as a List of Maps or a Map of Lists). The List of Maps is obviously much more intuitive, so I'd keep that.

If you are concerned about the efficiency of Object creation and cleanup, I'd suggest using an Object pool. Here's a basic idea of how it might work:

import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

public class TableRowPool {

    private static final int INITIAL_CAPACITY = 10000;

    private Queue<Map<String, Object>> mapObjects;

    public TableRowPool() {
        mapObjects = new LinkedList<Map<String, Object>>();
        for (int i = 0; i < INITIAL_CAPACITY; i++) {
            mapObjects.add(new HashMap<String, Object>());
        }
    }

    public Map<String, Object> getTableRowObject() {
        if (mapObjects.isEmpty()) {
            mapObjects.add(new HashMap<String, Object>());
        }
        return mapObjects.remove();
    }

    public void returnTableRowObject(Map<String, Object> obj) {
        obj.clear(); // drop stale entries before the map is handed out again
        mapObjects.add(obj);
    }

}

The LinkedList performs well as a queue, so object retrieval will be fast. It is also fast at appending new objects if you want the pool to grow dynamically. You may need to change the data structures if the pool has to be thread-safe, however.

To use the object pool, you would do something like the following:

// Load data
while ((row = getResultSetRow()) != null) {
    Map<String, Object> rowObj = tableRowPool.getTableRowObject();
    // Fill in data
    myRows.add(rowObj);
}

// ... Do all your business logic ...

// Cleanup: hand the row maps back to the pool
for (Map<String, Object> rowObj : myRows) {
    tableRowPool.returnTableRowObject(rowObj);
}
myRows = null;

If I had data so large that I was afraid of getting an OOM, then instead of hunting for an optimal data structure to hold it I would look at how I could use SIMD parallelism or something like MapReduce. However much you optimize your data structure, you can always run out of memory: even a structure that works in one particular machine configuration might not work on a machine with slightly less RAM.

But if you still want to stick with your current approach, why not normalize the data so that missing fields are represented by null? When you read the data and build a row, just add null for the absent fields. That way you at least do not need a key-value structure like a HashMap and can simply use a List<List<Object>>.
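
A rough sketch of that normalization (rawRow here stands for whatever per-row source the data is read from):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// A fixed column order plus null-padded rows: no per-row key-value
// structure is needed, just a List<List<Object>>.
static List<Object> normalizeRow(List<String> columnOrder, Map<String, Object> rawRow) {
    List<Object> row = new ArrayList<Object>(columnOrder.size());
    for (String column : columnOrder) {
        row.add(rawRow.get(column)); // null for absent fields
    }
    return row;
}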
