Question

This is my first time creating a program that involves file reading and writing, and I'm wondering what the best technique for this is, because when I compared my work with my classmate's, our logic was very different.

You see, our teacher asked us to make a simple student list system where users can add, edit and delete records. He also required us to save all the records to a file so that we can access them the next time we use the program.

My solution to this problem is: before the program opens its menu, I read all the records from the file and store them in an array. That way I can manipulate all the records in memory. Then, before the user exits the program, I write them back to the same file, overwriting all the old records.

My classmate's solution is like this: when she adds a record, she accesses the file and appends the data; when she edits a record, she accesses the file and edits that particular record; and when she deletes a record, she accesses the file and deletes the record. So for every function she made, she's accessing the file.

Both of our approaches can, of course, be coded. But I am wondering which is more efficient and effective if we are dealing with thousands or millions of records. Or are there other solutions better than what we did? Maybe you could share your file handling experiences with us... Thank you.


Solution

This is a classic case you'll encounter time and time again in programming: do I optimize for speed or memory usage?

And, like all such conundrums, there is no "correct" answer or perfect solution. In other words, you and your classmate are both right in your solutions to the problem.

With your solution of loading all of the records into memory, you "spend" memory in order to make accessing and modifying each of those records faster at run time. Storing all of the records in an array in memory takes up space, but because memory access is orders of magnitude faster than disk access, your approach is going to run a lot faster than your classmate's.
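A minimal sketch of this "load everything, save on exit" pattern might look like the following (the one-record-per-line `id,name` text format and all names here are invented for illustration, not taken from the question):

```python
# Sketch of the load-all / save-all approach: the file is read exactly
# once at startup and written exactly once at exit; every add/edit/delete
# in between is a pure in-memory operation.
import os

FILENAME = "students.txt"

def load_records(path=FILENAME):
    """Read every record into a list of (id, name) tuples."""
    if not os.path.exists(path):
        return []
    with open(path, "r", encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split(",", 1))
                for line in f if line.strip()]

def save_records(records, path=FILENAME):
    """Overwrite the file with the current in-memory records."""
    with open(path, "w", encoding="utf-8") as f:
        for student_id, name in records:
            f.write(f"{student_id},{name}\n")

records = load_records()          # read once, before showing the menu
records.append(("42", "Alice"))   # add/edit/delete happen purely in memory
save_records(records)             # write once, before the program exits
```

The cost of this design is exactly what the answer describes: the whole data set must fit in `records` at once.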

By way of contrast, your classmate conserves RAM by waiting to load the data on demand from the hard disk. But that's going to cost her: hitting the hard disk is a terribly expensive process compared to fetching data that's already in memory, and she's going to be stuck doing this each time the user makes a change. Think about how long it takes to start a program versus switching to one that's already open.
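For contrast, the per-operation approach might be sketched like this (again with an invented `id,name` text format). Note that with variable-length text records, only *adding* is a cheap append; editing or deleting still forces a rewrite of the file:

```python
# Sketch of the "touch the file on every operation" approach. Adding is a
# cheap append, but deleting still rewrites the whole file, because text
# records are not fixed-size and can't be removed in place.
FILENAME2 = "students_per_op.txt"

def add_record(student_id, name, path=FILENAME2):
    with open(path, "a", encoding="utf-8") as f:   # append just this record
        f.write(f"{student_id},{name}\n")

def delete_record(student_id, path=FILENAME2):
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    with open(path, "w", encoding="utf-8") as f:   # rewrite everything else
        for line in lines:
            if not line.startswith(student_id + ","):
                f.write(line)

add_record("1", "Alice")
add_record("2", "Bob")
delete_record("1")   # every call above opens the file again
```

The upside, as noted below, is that almost nothing lives only in memory, so a crash loses very little.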

And therein lies the tradeoff. Some of the important things to ask yourself here are:

  1. Is the data set (in the common configurations you'll be dealing with) too large (or going to become too large) to fit completely in memory? If you're dealing with typically small sets of data, computers now have enough RAM that it's probably worth it.

  2. How fast do you need to be able to access the data? Is real-time access important? Is it a particularly large or complex data set that would take too long to load from the hard disk on demand? What kind of performance do your users expect?

  3. What kind of system is your application targeting? Sometimes embedded systems and other special cases necessitate their own unique design approaches. You might have an abundance of RAM and very limited amounts of fixed storage, or you might have exactly the opposite. If you're using standard, modern PC hardware, what do your users want/need/already have? If most of your target users are using relatively "beefy" hardware already, you might make different design decisions than if you're aiming to target a larger potential audience—you've surely seen these trade offs made explicit before through a program's expressed system requirements.

  4. Do you need to allow for special situations? Things like concurrent access by multiple users make keeping all of your data in memory much more difficult. How are other users going to be able to read in the data that's only stored in memory on a local computer? Sharing a common file (perhaps even on a shared server) is probably going to be necessary here.

  5. Are there certain portions of your data that are accessed more frequently than others? Consider keeping those specific portions always in memory and lazy-loading the rest (meaning, you only attempt to fetch them into memory when/if they are accessed by the user).

And as that last point hints, something of a balanced or combined approach is probably about as close as you'll come to an "ideal" solution. You could store as much of the data in RAM as possible, while periodically writing any edits or modifications back to the file on disk during your application's idle state. There's plenty of time that the average program spends waiting on the user to do something, as opposed to the other way around. You can take advantage of these idle CPU cycles to flush out things being held in memory back to the disk without incurring any noticeable speed penalty. This approach is used all the time in software development, and helps to avoid the pitfall pointed out by EClaesson's answer. If your application crashes or otherwise quits unexpectedly, only a very small portion of data is likely to be lost because most of it was already committed to disk behind the scenes.
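The combined approach above could be sketched roughly as follows. The class and method names are hypothetical; the point is the `dirty` flag plus a flush method called during idle time:

```python
# Sketch of the hybrid approach: work entirely in memory, but flush
# pending changes to disk whenever the application is idle (simulated
# here by explicitly calling flush_if_dirty() from the main loop).
class StudentStore:
    def __init__(self, path="students_hybrid.txt"):
        self.path = path
        self.records = {}      # id -> name, the in-memory working copy
        self.dirty = False     # True whenever memory and disk disagree

    def add(self, student_id, name):
        self.records[student_id] = name
        self.dirty = True      # don't hit the disk yet

    def delete(self, student_id):
        self.records.pop(student_id, None)
        self.dirty = True

    def flush_if_dirty(self):
        """Call during idle time; a crash loses only unflushed edits."""
        if self.dirty:
            with open(self.path, "w", encoding="utf-8") as f:
                for sid, name in self.records.items():
                    f.write(f"{sid},{name}\n")
            self.dirty = False

store = StudentStore()
store.add("1", "Alice")
store.flush_if_dirty()   # e.g. while waiting for the next menu choice
```

Operations stay memory-fast, while the window of unsaved data shrinks to whatever changed since the last idle flush.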

Postscript: Of course, Dark Falcon's answer is correct that in a production application, you would more than likely use something like a database to handle the data. But since this appears to be for educational purposes, I think understanding the basic trade offs behind each approach is far more important.

OTHER TIPS

In any serious application, a good programmer would probably use an existing library to manage the data. Choosing this tool depends on the exact requirements:

  1. Does it need to be accessed concurrently by multiple users?
  2. Does it need to be accessed from multiple machines?

The most common choice for storing a significant amount of information would be a SQL-based database, such as MySQL, Postgres, Microsoft SQL Server, SQLite, etc. These resemble your classmate's solution more than yours.
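As a taste of what delegating to a database looks like, here is a minimal example using SQLite via Python's standard-library `sqlite3` module; the database takes care of the on-disk layout, indexing, and atomic updates for you:

```python
# Minimal SQLite example: add, edit, and delete records without managing
# the file format yourself.
import sqlite3

conn = sqlite3.connect(":memory:")  # use a filename to persist between runs
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO students (name) VALUES (?)", ("Alice",))
conn.execute("UPDATE students SET name = ? WHERE id = ?", ("Alicia", 1))
conn.execute("DELETE FROM students WHERE name = ?", ("Bob",))
conn.commit()
rows = conn.execute("SELECT id, name FROM students").fetchall()
# rows == [(1, "Alicia")]
```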

Your version (keeping all records in memory) will most probably be faster. It requires that you have enough memory if the record count grows, though. The bad thing is that a program crash or incorrect exit will make you lose all the data, as it was never saved to a file.

Your classmate's version will not be as fast, as file I/O isn't the fastest thing you can do. But it will require less memory and is safer against crashes, as most of the data will already be in the file.

This is a question that cannot be answered without knowing the details of the system on which it is to run, the size of the data set, and the relative cost of development time vs. CPU time. If the system has sufficient memory, working on a copy in RAM is probably preferable. In a small system with extremely limited RAM (today found mostly in embedded applications) you may have to update the disk file directly. Other things to think about are any buffering the operating system may do before actually writing to the disk, what happens to the file's consistency if the program crashes, and whether writing to the disk is "expensive", either because it's really slow or because the medium has a limited number of write cycles (as with some flash storage technologies).

If this were a small practical problem on today's desktop computers you might also want to consider the time spent developing various solutions against the relatively insignificant time they might take to run on small data sets.

Also, today it might be better to solve the problem using an existing database that's good at handling the relevant issues rather than making your own database in the file system.

Editing records in place is subtle if they aren't of fixed size. It is only really practical with a binary format and support for marking a record as unused (for example, with an outside index or with tombstone markers). File writes also aren't atomic, so you can't be sure that what you wrote ends up on disk in its entirety.
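To see why fixed-size records are what makes in-place editing possible at all, consider this sketch (the 32-byte record format is invented for illustration): record *i* always lives at byte offset `i * RECORD_SIZE`, so it can be overwritten with a seek and a write without touching its neighbours.

```python
# Fixed-size binary records: each record is exactly RECORD_SIZE bytes,
# padded with NUL bytes, so any record can be rewritten in place.
RECORD_SIZE = 32   # name padded/truncated to exactly 32 bytes

def write_record(f, index, name):
    f.seek(index * RECORD_SIZE)
    f.write(name.encode("utf-8")[:RECORD_SIZE].ljust(RECORD_SIZE, b"\x00"))

def read_record(f, index):
    f.seek(index * RECORD_SIZE)
    return f.read(RECORD_SIZE).rstrip(b"\x00").decode("utf-8")

with open("students.dat", "w+b") as f:
    write_record(f, 0, "Alice")
    write_record(f, 1, "Bob")
    write_record(f, 1, "Robert")   # in-place edit: only record 1 is touched
```

Even with this layout, the atomicity problem above remains: a crash mid-write can leave a half-updated record on disk.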

This makes the problem way more complex than the rest of your student list application, and it is best delegated to a database (SQLite and TokyoCabinet are among the more lightweight options). If you can't use a database, go with a simple implementation: it will have fewer bugs, and you won't get attached to it when the time comes to replace it with a database. So your approach of reading the whole file into memory sounds like the best choice.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow