Is a class with a high number of member variables bad and is there a design structure for data processing?

https://softwareengineering.stackexchange.com/questions/411253

12-03-2021
|

Question

Context: Java, fairly new developer

I have inherited code from a friend for a project that processes variables. The first thing i notice is the class has a ton of member variables. I have always been under the impression that a lot of member variables means the class is designed poorly.

The class itself only has one function, to "process" the variables from an input map of varying variable types and with more variables than the class has (it is one step on a multi step variable processing journey) and then return an output map with separate derivation variables. This is currently done by taking in a map type (String, Object), initializing 50 member variables (all related to a the specific function of the class) using the input map, then running helper methods to change the class wide member variables (example later), then using a write method to write the variables to an output map.

I was wondering about the design of this class. Is there a variable processing design structure in Java? Ive heard of Plain Old Data but this is actually doing something with the data and has methods. Still, seeing 50 member variables made my stomach crawl a little. but I dont think I have enough experience to know if it is correct for this functionality or not. It seems the class has one function, it seems like the variables have high cohesion, all the variables are private, the only thing I can see that seems wrong is the helper methods seem to have tight coupling because they use the member variables so if you change any youd have to change it a lot.

I looked up SO and SE questions and many people who said having a lot of member variables usually means low cohesion which i dont think is the case here. I know that global variables are a no no and that passing variables into methods are generally preferred, but I was wondering what the design should be of a class that does processing on a large amount of variables and if there is any best practices. Every example ive seen only deals with like 3-5 variables per class and a couple methods.

public class DataProcessor 
{
    private var1;
    private var2;
    private var3;
    private var4;
    ...
    private var50;


    public Map<String,Object> calculate(Map<String,Object> input) {
        initialize() //wont show but writes each input to the var as the var name same as write
        run()
        write()
        return outputMap
    }
    private run() {
        outputVar = helper();
        outputVar2 = helper2();
        ...
    }
    private helper() {
        var1 = var1 + 10
    }
    private write() {
        writeToMap(outputVar, outputMap)
    }
}

No correct solution

OTHER TIPS

It's hard to say for sure because we don't know what var1 through var50 are exactly but in general, yes, this is a code smell. It might mean that the class is trying to do too many things. Another possibility is that this is a 'property bag' a.k.a Data Transfer Object (DTO), a.k.a 'value object'.

The former is pretty widely considered to be a bad idea. See the Single Responsibility Principle (SRP) and the God Object anti-pattern. One word of caution on the SRP: many a developer has wrapped themselves around that axle. It's a good principle but not terribly well-defined.

The latter is something I consider to be typically a sign of an Anemic Domain Model but a lot of people advocate it's use. Personally, I don't see much use in creating hardcoded dictionaries but I'm not going to make too fine of a point on this. There are valid use cases if you treat them like structs. I just think there are usually better options.

The methods you show suggest this is more of the 'too many things' style but it could be a mixture of the two. My suggestion would be to start grouping the variables logically and see if you start seeing some other classes you could break out. You can then make this class a composition of those. Rinse and repeat until you get a decent set of class structures. Then you can go back and refactor out this class.

You cannot generally say that having fifty variables is bad. Let’s say you are a pilot who has a checklist of 200 items to go through and check. And the test results need to be stored. 200 variables. Of course you can split this into 20 classes of 10 instance variables each, but that doesn’t improve things.

In your case, I dont see any reason for the fifty variables. You have a map with fifty values, each is extracted into an instance variable, each instance variable is replaced with a new value, then each value is written to an output map.

I’d do the simplest thing: Get the first value from the input, calculate the new value, write it to the output. Repeat fifty times. Your decision whether the helper functions read a value, calculate the replacement, write the new value, or just write the new value.

On the other hand, it may be better not to use these dictionaries with fifty elements at all, but use a class with 50 variables everywhere. Not that 50 instance variables is good, but a map containing 50 values is often worse.

It's unusual for a single object to have 50 variables that wouldn't be better grouped into sub objects.

For example a UK address has 14 fields. You could imagine that an address class that covered addresses in many countries might have 50 fields to cover the different formats. But you could obviously split it into multiple country specific address classes with different fields.

Similarly if I model something complex, for example a cricket match, there will be more than 50 parameters, but I can group them into players, teams, weather etc. objects.

On your question about data processing classes. Yes, there are special classes designed to deal with large datasets where you simply cant hold the entire dataset in memory, but want to write code as if its a simple array. But these tend to deal with large collections of objects rather than single objects with many fields.

for example a DataFrame in python

https://www.geeksforgeeks.org/python-pandas-dataframe/#:~:text=Python%20%7C%20Pandas%20DataFrame,fashion%20in%20rows%20and%20columns.

https://docs.dask.org/en/latest/dataframe.html

From your description it sounds like the variables aren't class members but rather input parameters for the processing step. If this is the case they should be passed in as an argument to the run method. Not as 50 separate arguments of course but as a single container object.

Another thing that comes to mind is the question whether you should go through the trouble of defining a typed variable with a specific meaning for each parameter or not. If those variables can be dealt with as a collection of key-value pairs it could make your life a lot easier. In your processor class, what do you really need to know about these parameters? Can you encode them all in strings?

If you do need the types after all, see Ewan's answer.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange