Writing tests for code that takes files as input

Question 1

The unit testing approach is to eliminate external dependencies in tests. That way, running the tests doesn't require your environment to do anything other than host your test program.

Inside your code, for the most part you should not care about "testfile.epub". That is the job of the OpenEpubFile() routine alone. Instead, you want to test specific logic: "here is a pointer to some data, how do I unzip it?" or "how do I process a title tag?" So your unit tests would provide sample zipped data to test how your unzip logic works, and sample title tags to see how your logic handles titles. You'd pass it titles that are just fine, very long, very short, or malformed; you'd present whatever kinds of data are required to exercise the logic you need to test. That data doesn't have to come from a file every single time. It can come from a test harness.
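As a rough sketch of what that can look like in Python, assuming the title lives in a Dublin Core dc:title element: the names read_member, parse_title, make_zip and SAMPLE_OPF are made up for illustration and are not from any real EPUB library. The zipped data is built in memory, so nothing touches the disk.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Assumption: titles are stored in a Dublin Core <dc:title> element.
DC_NS = "{http://purl.org/dc/elements/1.1/}"

SAMPLE_OPF = (
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Moby-Dick</dc:title>"
    "</metadata>"
)


def read_member(archive_bytes: bytes, name: str) -> bytes:
    """Unzip one member from zipped data that is already in memory."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return zf.read(name)


def parse_title(xml_text: str) -> Optional[str]:
    """Return the stripped title, or None if it is missing, blank, or malformed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    node = root.find(f".//{DC_NS}title")
    if node is None or node.text is None:
        return None
    return node.text.strip() or None


def make_zip(members: Dict[str, str]) -> bytes:
    """Test harness helper: build a zip archive in memory from name -> text."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, text in members.items():
            zf.writestr(name, text)
    return buf.getvalue()


def test_unzip_from_memory():
    data = make_zip({"content.opf": SAMPLE_OPF})
    assert read_member(data, "content.opf").decode("utf-8") == SAMPLE_OPF


def test_titles():
    wrap = '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">{}</metadata>'
    long_title = "x" * 10_000
    assert parse_title(SAMPLE_OPF) == "Moby-Dick"
    assert parse_title(wrap.format(f"<dc:title>{long_title}</dc:title>")) == long_title
    assert parse_title(wrap.format("<dc:title>   </dc:title>")) is None
    assert parse_title(wrap.format("")) is None
    assert parse_title("<metadata><dc:title>Unclosed") is None
```

The test functions use plain assert statements, so a runner such as pytest would pick them up, but they can also be called directly.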

If you find your logic is hard to test, it's probably a sign that it's time to modularize it. You need to separate the code that opens the file from the code that reads the data. You need to separate the code that reads the data from the code that unzips the data. You need to separate the code that unzips the data from the code that parses the XML. You need to separate the code that constructs the screen from the code that paints the screen. The Extract Method refactoring will be very helpful here, as will Rename Method. And you'll come to rely heavily on the Dependency Inversion principle.
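Here is one way the "separate opening from reading" split might look, sketched in Python. The names looks_like_zip, open_from_disk, check_book and fake_opener are hypothetical. The logic only ever sees a binary stream; the code that knows about the file system is passed in, so a test can swap it for an in-memory stand-in.

```python
import io
from typing import BinaryIO, Callable

# A zip archive with at least one entry starts with the local file header signature.
ZIP_LOCAL_HEADER = b"PK\x03\x04"


def looks_like_zip(stream: BinaryIO) -> bool:
    """Pure logic: inspect the first four bytes of whatever stream it is given."""
    return stream.read(4) == ZIP_LOCAL_HEADER


def open_from_disk(path: str) -> BinaryIO:
    """The only function here that touches the file system."""
    return open(path, "rb")


def check_book(path: str, opener: Callable[[str], BinaryIO] = open_from_disk) -> bool:
    """Depends on an 'opener' abstraction rather than on open() directly."""
    with opener(path) as stream:
        return looks_like_zip(stream)


def fake_opener(payload: bytes) -> Callable[[str], BinaryIO]:
    """Test double: ignores the path and serves the given bytes from memory."""
    return lambda _path: io.BytesIO(payload)
```

In a unit test you would call something like check_book("any.epub", opener=fake_opener(b"PK\x03\x04...")), and only the integration tests would use the default opener.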

Every time you can break it down into stateless code that implements pure logic, you can test just those rules very easily. Better, as you break it into the needed modules, you'll find that adding new modules to handle the new cases becomes easier as you can repeat your existing patterns.

Yes, at some point you're going to have one test that assures you that your open() statement can actually open a file. After your unit-tested code is passing all its tests, it's time to move on to integration testing. That's where you can feed it a real set of .epub files and check that the output is as desired.

Question 2

No need to reimplement .epub just for testing.

Create some tiny .epub files to show off certain traits, and create tests against those.
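A fixture like that can be generated with a few lines of Python, for instance. This is only a sketch: write_tiny_epub and build_fixtures are made-up names, and the archive is deliberately incomplete (no manifest, spine, or navigation document), so it is not a valid EPUB, just enough for a test that reads the title. What it does follow is the container layout: a "mimetype" entry stored first and uncompressed, and a META-INF/container.xml that points at the package file.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import List
from xml.sax.saxutils import escape

CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

PACKAGE_TEMPLATE = """<?xml version="1.0"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{title}</dc:title>
  </metadata>
</package>
"""


def write_tiny_epub(path: Path, title: str) -> Path:
    """Write a minimal, intentionally incomplete EPUB-like archive to path."""
    with zipfile.ZipFile(path, "w") as zf:
        # The mimetype entry goes first and must not be compressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML,
                    compress_type=zipfile.ZIP_DEFLATED)
        zf.writestr("content.opf", PACKAGE_TEMPLATE.format(title=escape(title)),
                    compress_type=zipfile.ZIP_DEFLATED)
    return path


def build_fixtures(directory: Path) -> List[Path]:
    """One tiny file per trait you want to exercise (titles, in this sketch)."""
    return [
        write_tiny_epub(directory / "plain_title.epub", "A Plain Title"),
        write_tiny_epub(directory / "long_title.epub", "x" * 5000),
        write_tiny_epub(directory / "special_chars.epub", "Cats & Dogs <2nd ed.>"),
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for fixture in build_fixtures(Path(tmp)):
            print(fixture.name, fixture.stat().st_size, "bytes")
```

Whether you generate the fixtures on the fly like this or check a handful of small hand-made files into the repository is a judgment call; either way each file should demonstrate one trait.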