Question

I am working on a project with a few file formats. Some formats are specified by .xsds, others by documentation on their respective websites, and some are custom in-house formats that have no documentation. Mwahahahaha.

What's the problem?

I would like to test my file readers, but I'm not entirely sure how to go about doing this. The flow of the application is as such:

file.___  ===> read by FileReader.java ===> which creates a Model object

where the FileReader interface is

public interface FileReader {
    public Model read(String filename);
}

The Model has a number of attributes that are populated when the file is read. It looks something like

public class Model {
    List<String> as;
    List<String> bs;
    boolean isAPain = true;
    // ...
}

What have I tried?

My only idea was to create file "generators" for each file format. These generators are basically builders which take in a few variables (e.g. the number of comments to generate in a file) and output a sample file, which I then read in so I can compare the resulting Model against the variables I used to generate the file.
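A stripped-down sketch of what such a generator looks like, using the graph format from the second edit below as the example (class and method names are simplified for this post):

import java.io.IOException;
import java.nio.file.Path;

public class GraphFileGenerator {

    private int commentCount;
    private int vertexCount;

    public GraphFileGenerator withComments(int commentCount) {
        this.commentCount = commentCount;
        return this;
    }

    public GraphFileGenerator withVertices(int vertexCount) {
        this.vertexCount = vertexCount;
        return this;
    }

    // writes a sample file into the given directory and returns its path, so a
    // test can read it back and compare the resulting Model with these values
    public String generate(Path outputDir) throws IOException {
        Path file = outputDir.resolve("sample.graph");
        // ... write commentCount comment lines, vertexCount vertices,
        //     and randomly generated edges between them ...
        return file.toString();
    }
}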

This has a few problems, though:

  • The files that it generates don't look like real files. The generator is in no way aware of context.
  • It's hard to tell whether the generator has covered the edge cases, since I'm the one manually setting the variables. This method is barely better than creating a dozen sample files by hand.

Are there any better ways to do this?

EDIT: Changed unit to integration since that's what I actually mean.

EDIT2: Here is an example of the edge cases I mentioned.

Each file represents a graph made up of vertices and edges. These vertices and edges can be attached in different ways, so:

v1 -- e1 --> v2 <-- e2 -- v3

is different from

v1 -- e1 --> v2 -- e2 --> v3

in that the direction of the edges matters. I'm not sure whether this is within the scope of the question, but it's hard to think up all of the pertinent edge cases when I manually set the number of vertices and edges and just generate the connections randomly.

Solution

First, let's talk about what your goals are:

  • you obviously don't want to test "file formats" - you want to test your different FileReader implementations

  • you want to find as many different types of errors as possible with automated tests

To reach that goal in full, IMHO you have to combine different strategies:

  • first, real unit testing: your FileReader implementations will consist of many different parts and functions. Write small tests which exercise each of them in isolation, and design your functions so they don't need to read the data out of a file (see the first sketch after this list). These kinds of tests will help you to cover most of your edge cases.
  • second, generated files: these are what I would call integration tests. Such files will help you to track down failures different from point 1, for example combinations of specific parameters, errors in file access, etc. To create good test cases, it will also be helpful to learn about some classic techniques like grouping test cases into equivalence classes or boundary value testing (see the second sketch after this list). Get a copy of Glenford Myers' "The Art of Software Testing" to learn more about that. The Wikipedia article about software testing is a good resource, too.
  • third, try to get real-world data: it can be hard to verify that such files are evaluated correctly by your FileReaders, but it is often worth the effort, since it tends to find bugs not revealed by the first two strategies. Some people would also call these kinds of tests "integration tests", others prefer "acceptance tests", but the exact term does not really matter.
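To illustrate the first point: if you factor the parsing logic out so it accepts the file content directly instead of a filename, edge cases like the edge-direction example from the question become plain unit tests. A minimal sketch, assuming JUnit 4 and a hypothetical GraphFileReader with a parse(String) method; the incomingEdges/outgoingEdges accessors on Model are invented as well:

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class GraphFileReaderTest {

    @Test
    public void edgeDirectionIsPreserved() {
        GraphFileReader reader = new GraphFileReader();

        // no file involved - the content is passed in directly
        Model model = reader.parse("v1 -- e1 --> v2 <-- e2 -- v3");

        // the two layouts from the question must not produce the same Model:
        // here v2 has two incoming edges and no outgoing ones
        assertEquals(2, model.incomingEdges("v2").size());
        assertEquals(0, model.outgoingEdges("v2").size());
    }
}

The same style of test scales down to each small parsing function inside the reader, not just a top-level parse method.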
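For the second point, those classic techniques translate directly into the parameter values you feed the generator described in the question. Another sketch, reusing the hypothetical GraphFileGenerator and GraphFileReader names from above; vertexCount() on Model is invented too:

import static org.junit.Assert.assertEquals;

import java.nio.file.Files;
import java.nio.file.Path;

import org.junit.Test;

public class GeneratedFileIntegrationTest {

    // boundary values for the vertex count: empty graph, a single vertex and a
    // "large" graph; equivalence-class thinking adds typical mid-range values
    // and deliberately invalid files on top of these
    private static final int[] VERTEX_COUNTS = {0, 1, 1000};

    @Test
    public void vertexCountSurvivesRoundTrip() throws Exception {
        Path tempDir = Files.createTempDirectory("generated-graph-files");
        for (int count : VERTEX_COUNTS) {
            String file = new GraphFileGenerator()
                    .withVertices(count)
                    .generate(tempDir);

            // the reader under test, used via the FileReader interface
            Model model = new GraphFileReader().read(file);

            assertEquals(count, model.vertexCount());
        }
    }
}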

IMHO there is no "short-cut" approach which would bring you the benefit of all three strategies "for the price of one". If you want to detect edge cases, failures in standard cases, and failures with real-world data, you have to invest at least some effort - more probably a lot. Luckily, all of these approaches can be used to create automated, repeatable tests.

Beyond that, you should make sure your FileReaders don't mask any errors when reading the data - build in checks and assertions, throw exceptions when something goes wrong internally, and so on. This gives your testing code a much better chance of detecting errors, even when you don't have an explicit test file or test case for an unexpected situation.
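As a rough illustration (the strictness check is made up and much simpler than any real format rule), a reader that fails loudly instead of quietly skipping a malformed line might look like this, assuming it lives next to the FileReader interface and Model class from the question:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StrictGraphFileReader implements FileReader {

    @Override
    public Model read(String filename) {
        Model model = new Model();
        try {
            int lineNumber = 0;
            for (String line : Files.readAllLines(Paths.get(filename))) {
                lineNumber++;
                if (line.trim().isEmpty()) {
                    continue; // blank lines are fine
                }
                if (!line.contains("--")) {
                    // fail loudly instead of quietly ignoring the line, so a test
                    // (or a user) sees exactly what went wrong and where
                    throw new IllegalArgumentException(
                            "Line " + lineNumber + " is not a vertex/edge definition: " + line);
                }
                // ... parse the vertices and the edge direction, populate the model ...
            }
        } catch (IOException e) {
            throw new UncheckedIOException("Could not read " + filename, e);
        }
        return model;
    }
}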

Licensed under: CC-BY-SA with attribution