Question

I have a huge file in the format:

x a
y c
x d
z a
z s
y k

I want the output to be sorted by key and in the form

x a,d
y c,k
z a,s

For this type of task a map is the best fit.

The straightforward TreeMap solution would be this:

    // Aggregate values per key; TreeMap keeps the keys sorted.
    Map<String, StringBuilder> agg = Maps.newTreeMap();

    String line;
    while ((line = r.readLine()) != null) {
        String[] arr = line.split("\t");
        String key = arr[0];
        String value = arr[1];

        if (agg.containsKey(key)) {
            agg.get(key).append(",").append(value);
        } else {
            agg.put(key, new StringBuilder(value));
        }
    }
    r.close();

    System.out.println("Printing results");
    FileWriter f = new FileWriter("out.txt");

    for (String key : agg.keySet()) {
        f.write(key + "\t" + agg.get(key) + "\n");
    }
    f.close();
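As an aside (not part of the original question): on Java 8 and later, the aggregation loop can be written more compactly with Map.merge, using plain String values instead of StringBuilder. A minimal sketch, reusing the reader r from above:

    Map<String, String> agg = new TreeMap<>();

    String line;
    while ((line = r.readLine()) != null) {
        String[] arr = line.split("\t");
        // merge() stores the value for a new key, or combines it with
        // the existing value via the remapping function.
        agg.merge(arr[0], arr[1], (a, b) -> a + "," + b);
    }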

Another option would be to use a HashMap, take its key set, sort it, and iterate over that.

The part that differs would be:

    System.out.println("Sorting array");
    List<String> keys = Lists.newArrayList(agg.keySet());
            Collections.sort(keys);
        System.out.println("Printing results");
        FileWriter f = new FileWriter("out.txt");

        for (String key : keys) {
            f.write(key+"\t"+agg.get(key)+"\n");
        }

Since this is a batch job, the big-O running time is less important to me.

Memory usage matters much more.

Which strategy is more efficient in terms of memory?

The HashMap approach, with its memory peak during the sort phase, or the TreeMap approach?


Solution 2

If it's a batch job, a TreeMap won't use extra memory the way a HashMap does. A HashMap's default load factor is, I believe, 0.75 (i.e. the map can be 75% full before its backing array is grown).

A TreeMap would be more straightforward too, provided its O(log n) operations don't become a bottleneck. If they do, you could use a List with your own Tuple object and a custom Comparator, as sketched below, but then you don't get an O(1) get().
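A minimal sketch of that List-based variant, assuming the writer f from the question's code; the Tuple class and its field names are illustrative, not from the original answer:

    // Hypothetical Tuple holding one aggregated row.
    class Tuple {
        final String key;
        final StringBuilder values;
        Tuple(String key, String firstValue) {
            this.key = key;
            this.values = new StringBuilder(firstValue);
        }
    }

    List<Tuple> rows = new ArrayList<>();
    // ... fill rows while reading the file; finding an existing key now
    // takes a linear scan (or a separate index), hence no O(1) get() ...

    // One sort at the end, ordered by key via a custom Comparator.
    Collections.sort(rows, Comparator.comparing(t -> t.key));

    for (Tuple t : rows) {
        f.write(t.key + "\t" + t.values + "\n");
    }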

OTHER TIPS

HashMap is not optimized for memory utilization but rather for fast operations. It is backed by an array, which is allocated on initialization and resized when its size reaches certain thresholds; consequently, memory is allocated eagerly. Its capacity is always a power of 2, which is a computational optimization for faster bucket-index calculation. As a result, the (unused) allocated memory may significantly exceed the memory your program actually needs.
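If a rough estimate of the number of unique keys were available, that eager allocation could be tamed by presizing the map. A small sketch; the 16_000 estimate is made up for illustration:

    // Ask for enough slots that expectedKeys entries fit without a resize:
    // the table grows once size exceeds capacity * loadFactor.
    int expectedKeys = 16_000;                          // illustrative guess
    int initialCapacity = (int) (expectedKeys / 0.75f) + 1;
    // The capacity is rounded up to a power of 2 internally.
    Map<String, StringBuilder> agg = new HashMap<>(initialCapacity, 0.75f);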

TreeMap provides better memory utilization, at the cost of worse performance for get, put, and remove operations compared to HashMap. This is reflected in its constructors, which take no capacity or load-factor parameters that could affect its behavior. All its entries are allocated lazily and linked to the existing entries to form a tree.

Since the number of unique keys is unknown, and could be large, the hash-based approach may consume more memory: whenever the number of keys exceeds the product of the hash table's current capacity and its load factor, the capacity simply gets doubled, increasing memory usage.
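To settle the question for a concrete workload, one could compare the two maps' heap footprints directly. A rough, JVM-dependent sketch (System.gc() is only a hint, so treat the numbers as approximate):

    Runtime rt = Runtime.getRuntime();
    System.gc();
    long before = rt.totalMemory() - rt.freeMemory();

    Map<String, String> m = new TreeMap<>();    // swap in new HashMap<>() to compare
    for (int i = 0; i < 1_000_000; i++) {
        m.put("key" + i, "value");
    }

    System.gc();
    long after = rt.totalMemory() - rt.freeMemory();
    System.out.println("approx bytes used: " + (after - before));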

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow