Design Pattern to extract arbitrary field from arbitrary file format

https://softwareengineering.stackexchange.com/questions/378414

07-02-2021

Question

Let's say I have multiple file types:

.json, .csv ... etc

These file types come in different formats:

Second json structure
Extra column added to csv
etc.

I need to extract fields from these files; however, sometimes one format doesn't have all of the same fields.

JSON format 2 has a new element that JSON1 didn't have
CSV 2 swaps out an element from CSV1
etc.

The fields are sometimes shared across file formats

csv1 and json2 both have element A

The outcome should be that if the file has the field requested, it is returned, otherwise a notification is sent that the field does not exist.

I am aiming to make my code extensible for new formats and new fields to be added.

Is there any design pattern that might point in the right direction for this? I am having a hard time coming up with a good design.

I was thinking of using a strategy pattern for loading different file types (csv, json, etc.), but then extracting arbitrary fields loses me.

If each file format had the same fields this would be trivial. Perhaps I am trying too hard to clump objects together?

Solution

Regardless of the pattern you use, the fundamental thing here is deciding what you're going to model: whatever the file has in it, or what your application needs.

There are use cases for either. A text editor can load either a .json or a .csv just fine, but while it can show you your arbitrary field and its value, it has no idea what to do with it other than show it to you. The text editor is only regurgitating the file without understanding it. Your application could do the same. In this situation you don't run any logic against the field. You detect the file format and present the file according to the format, and strategy works fine for that if you want to present different formats differently. Here you let the user deal with understanding the arbitrary field.

If you have some need to use the arbitrary field in some business logic then you're in a different situation. You have an expectation of some qualifying field existing, logic to run against it, and ways of dealing with it not existing. You still need to detect and deal with the file format, and strategy still works for that, but now you have a need to build an accessible data structure that understands this arbitrary field and, regardless of file format, works the same for all the business logic you're running against this arbitrary field.

It might be useful to understand that it's rare to need only one arbitrary field in this latter case. Usually you find a few that can be grouped together to form one coherent idea. There might be several of these groups in one file. These groupings become data structures, data transfer objects, POJOs. If they have an identity beyond simply their own values, they become entities.

Getting the data from the file into memory isn't trivial. Sometimes it's simple because the file closely follows what you need in memory, but that isn't always the case, and some conversions need to be done. When you have this problem with databases, it's called object-relational impedance mismatch.

Regardless of all that, when you're in the situation where your code must understand the arbitrary field, it's best to start your design by ignoring the file and concentrating on your app's needs. Assume you'll get the field somehow if it exists. Express that you need it by letting it be passed in by something. Use it however you need to.

Only once that's all done do you write the code that goes and finds it and passes it where it's needed. A good technique for this is dependency injection. That lets you separate your use of the arbitrary field from the construction of whatever data structure you use to model it.
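As a minimal sketch of that separation in Python: the field name `price`, the class `PriceReport`, and the `notify` callback are all hypothetical names chosen for illustration, not anything from the question. The business logic receives the value (or `None` when it is absent) through its constructor and never opens a file itself.

```python
from typing import Callable, Optional


class PriceReport:
    """Business logic that needs one field; it never opens a file itself."""

    def __init__(self, price: Optional[float], notify: Callable[[str], None]):
        # Both the value and the notification channel are injected.
        self._price = price
        self._notify = notify

    def total(self, quantity: int) -> Optional[float]:
        if self._price is None:
            # The field was absent in whatever file was loaded.
            self._notify("field 'price' does not exist")
            return None
        return self._price * quantity


def build_report(record: dict, notify: Callable[[str], None]) -> PriceReport:
    """Composition code: finds the field and passes it where it is needed."""
    raw = record.get("price")
    return PriceReport(float(raw) if raw is not None else None, notify)
```

Here `record` stands for whatever a format-specific loader produced. Only `build_report` knows where the value came from, so `PriceReport` can be tested without any file at all.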

Other suggestions

There's no magical solution to this. If the file formats have some version information built into them, you can use that data, feed it into a Factory, and create the appropriate instance of a Strategy for reading. You may also have separate Strategies for processing that data. The degree to which you'll be able to share logic between them will depend on the number of useful abstractions you can come up with (based on your knowledge of what the application does) to build your system around.
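One way the Factory-plus-Strategy idea might look in Python is sketched below. The three layouts (`json` version 1 and 2, `csv` version 1) and every function name are assumptions made up for the example; a real system would derive the format and version from the file itself.

```python
import csv
import io
import json
from typing import Callable, Dict, Tuple

# A reading strategy turns raw text into a flat dict of field name -> value.
Reader = Callable[[str], dict]


def read_json_v1(text: str) -> dict:
    # Hypothetical first layout: fields at the top level of the document.
    return json.loads(text)


def read_json_v2(text: str) -> dict:
    # Hypothetical second layout: fields nested under a "data" key.
    return json.loads(text)["data"]


def read_csv_v1(text: str) -> dict:
    # Hypothetical layout: a header row followed by a single data row.
    return next(csv.DictReader(io.StringIO(text)))


# The factory table: (format, version) -> strategy. New formats go here.
READERS: Dict[Tuple[str, int], Reader] = {
    ("json", 1): read_json_v1,
    ("json", 2): read_json_v2,
    ("csv", 1): read_csv_v1,
}


def make_reader(fmt: str, version: int) -> Reader:
    """Return the reading strategy for a known format and version."""
    try:
        return READERS[(fmt, version)]
    except KeyError:
        raise ValueError(f"no reader for {fmt} version {version}") from None
```

Supporting a new format or version then means writing one function and adding one entry to `READERS`; the calling code does not change.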

The core idea behind the Strategy pattern, and any other pattern that needs to be agnostic of implementation details, is to write code against an abstraction (like an interface), and have some other implementation-specific object support that abstraction.

You'll have to come up with an interface that's general enough, but at the same time useful enough that your application can still do the work it needs done, because at a higher level you cannot refer to data fields that might be unique to a specific file format and version (or to other format-specific details, like structure, ordering, etc.). Otherwise, you would have to handle each case differently, and that may limit extensibility and maintainability; the idea is to contain this format-specific behavior in a lower-level class, or a collection of related classes. So you have to think through what it is that your application does, and whether it's possible to express the logic in higher-level terms (i.e. you should be able to just tell the code DoATask() and let some format-specific subclass handle it, instead of ReadDataField(...); DoStuffWithDataField(...);).

Another possible approach (that may be complementary to what I described above) is to come up with some sort of a unified data model that your application will use internally. Your various format-specific data readers would read the data and convert it to this unified format, and the "core" of your application would take that (it would work exclusively with data in that form). This would require writing some boilerplate code (to translate to/from the internal format), and there may or may not be a performance hit. But design is all about trade-offs, so you have to weigh pros and cons yourself.
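A rough sketch of such a unified model in Python follows. The `Record` class, its fields `a` and `b`, and the two converter functions are hypothetical; they only illustrate format-specific readers translating into one internal shape, with a missing field surfacing as an explicit error rather than a silent default.

```python
from dataclasses import dataclass
from typing import Optional


class FieldMissing(Exception):
    """Raised when a record's source format did not supply a field."""


@dataclass(frozen=True)
class Record:
    # Hypothetical unified model; None means "this format has no such field".
    a: Optional[str] = None
    b: Optional[str] = None

    def require(self, name: str) -> str:
        # Look only at stored field values, not at methods or other attributes.
        value = vars(self).get(name)
        if value is None:
            raise FieldMissing(name)
        return value


def from_json2(doc: dict) -> Record:
    # Hypothetical JSON layout 2 carries both fields.
    return Record(a=doc.get("A"), b=doc.get("B"))


def from_csv1(row: dict) -> Record:
    # Hypothetical CSV layout 1 names the column "a" and has no B column.
    return Record(a=row.get("a"))
```

The core of the application would call `record.require("b")` and catch `FieldMissing` to send the "field does not exist" notification, without knowing which file format produced the record.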

P.S. If there's no version information available, then you may be able to use your knowledge of the domain to inspect the file and determine (well, guess) the file format, based on something like the structure of the file. But this (1) may not always be possible, and (2) even when it is possible, it can be unreliable and may backfire.
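To make the "guess" concrete, here is a deliberately crude Python sketch. The function name `guess_format` and its two heuristics are assumptions invented for illustration, and they show exactly the unreliability mentioned above: a one-line CSV file containing `[1,2]` would be reported as JSON.

```python
import json
from typing import Optional


def guess_format(text: str) -> Optional[str]:
    """Best-effort guess; returns "json", "csv", or None when unsure."""
    try:
        parsed = json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        parsed = None
    if isinstance(parsed, (dict, list)):
        return "json"
    first_line = text.splitlines()[0] if text.strip() else ""
    # Crude assumption: a comma in the first line means a CSV header.
    return "csv" if "," in first_line else None
```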

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange