Question

I'm working on a Java project that will allow users to parse multiple files, each with potentially thousands of lines. The parsed information will be stored in different objects, which will then be added to a collection.

Since the GUI won't require loading ALL of these objects at once and keeping them in memory, I'm looking for an efficient way to load/unload data from files, so that data is only loaded into the collection when a user requests it.

I'm just evaluating options right now. I've also thought about the case where, after loading a subset of the data into the collection and presenting it in the GUI, I need the best way to reload previously viewed data. Re-run the parser, repopulate the collection, and repopulate the GUI? Or find a way to keep the collection in memory, or serialize/deserialize the collection itself?

I know that loading/unloading subsets of data can get tricky if some sort of data filtering is performed. Let's say I filter on ID, so my new subset would contain data from two previously analyzed subsets. This would be no problem if I kept a master copy of the whole data set in memory.

I've read that google-collections is good and efficient when handling large amounts of data, and offers methods that simplify lots of things, so it might offer an alternative that lets me keep the collection in memory. This is just general talk; the question of which collection to use is a separate and complex matter.

Do you know what the general recommendation is for this type of task? I'd like to hear what you've done in similar scenarios.

I can provide more specifics if needed.


Solution

You can embed a database into the application, such as HSQLDB. That way you parse the files the first time and then use SQL to run simple and complex queries.

HSQLDB (HyperSQL DataBase) is the leading SQL relational database engine written in Java. It has a JDBC driver and supports nearly full ANSI-92 SQL (BNF tree format) plus many SQL:2008 enhancements. It is a small, fast database engine that offers both in-memory and disk-based tables and supports embedded and server modes. Additionally, it includes tools such as a command-line SQL tool and GUI query tools.
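For illustration, here is a minimal sketch of the embedded approach using HSQLDB through plain JDBC. The `records` table, its columns, and the database path are assumptions made up for the example; adapt them to whatever your parser actually produces.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class EmbeddedDbSketch {
    public static void main(String[] args) throws Exception {
        // "jdbc:hsqldb:file:..." keeps the data on disk between runs;
        // "jdbc:hsqldb:mem:..." would give a purely in-memory store instead.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hsqldb:file:data/parsed;shutdown=true", "SA", "")) {

            // Hypothetical table layout; assumes a fresh database file.
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, "
                        + "source VARCHAR(255), payload VARCHAR(1024))");
            }

            // Insert rows as the parser produces them instead of holding
            // everything in one big collection on the heap.
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO records (id, source, payload) VALUES (?, ?, ?)")) {
                insert.setInt(1, 1);
                insert.setString(2, "file-a.txt");
                insert.setString(3, "first parsed record");
                insert.executeUpdate();
            }

            // Later, load only the subset the GUI asks for, e.g. a filter on ID.
            try (PreparedStatement query = conn.prepareStatement(
                    "SELECT id, payload FROM records WHERE id BETWEEN ? AND ?")) {
                query.setInt(1, 1);
                query.setInt(2, 100);
                try (ResultSet rs = query.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("id") + ": " + rs.getString("payload"));
                    }
                }
            }
        }
    }
}
```

This also covers the filtering concern from the question: a `WHERE` clause spanning several previously viewed subsets is just another query, with no need to keep a master copy of everything in memory.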

OTHER TIPS

If you have tons of data, lots of files, and you are short on memory, you can do an initial scan of each file to index it. If the file is divided into records by line feeds, and you know how to read a record, you can index your records by byte location. Later, when you want to read a certain set of indices, you do a fast lookup to find which byte ranges you need and read those from the file's InputStream. When you no longer need those items, they will be garbage-collected. You will never hold more items than you need on the heap.

This would be a simple solution. I'm sure you can find a library to provide you with more features.
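A rough sketch of that byte-offset indexing idea in plain Java is below, assuming one record per line and a single-byte text encoding. The class and method names are made up for illustration.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class LineIndexSketch {
    private final RandomAccessFile file;
    private final List<Long> lineOffsets = new ArrayList<>();

    public LineIndexSketch(String path) throws IOException {
        this.file = new RandomAccessFile(path, "r");
        // Initial scan: remember the byte position where each record (line) starts.
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(path))) {
            long pos = 0;
            lineOffsets.add(0L);
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') {
                    lineOffsets.add(pos);
                }
            }
            // Drop a trailing offset if the file ends with a newline (no record follows it).
            if (!lineOffsets.isEmpty() && lineOffsets.get(lineOffsets.size() - 1) == pos) {
                lineOffsets.remove(lineOffsets.size() - 1);
            }
        }
    }

    /** Loads a single record on demand; nothing else has to stay on the heap. */
    public String readRecord(int index) throws IOException {
        file.seek(lineOffsets.get(index));
        // Note: RandomAccessFile.readLine() assumes a single-byte (latin-1) encoding.
        return file.readLine();
    }

    public int recordCount() {
        return lineOffsets.size();
    }
}
```

The index itself is tiny (one `long` per record), so even files with millions of lines cost only a few megabytes of bookkeeping while each record is parsed lazily when the GUI asks for it.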
