SRP processing file line-by-line

https://softwareengineering.stackexchange.com/questions/358962

21-01-2021
|

Pergunta

I'm writing a class that read lines from a file, process them, store them in an accumulator and when it reaches a threshold bulk inserts in a SqlServer database. Like this:

class FooImporterToSqlServer
{
    private string ConnectionString { get;} //initilized via constuctor
    private AccumulatorCollection Accumulators { get;} //initilized via constuctor
    private FooProcessor Processor { get;} //initilized via constuctor
    private int LastHierarchyId { get; set;} //initilized via constuctor, starts at 0

    public void Import(string path)
    {
        var lines = File.ReadAllLines(path);
        foreach(var line in lines)
        {
            var processedLine = Processor.Process(line, ref LastHierarchyId);
            Accumulators.Add(processedLine);
            if(Accumulators.HaveAnyFull())
                UnloadAccumulators();
        }
    }

    private void UnloadAccumulator()
    {
        var fullAccumulators = Accumulators.GetFullAccumulators();
        foreach(var fullAccumulator in fullAccumulators)
        {
            //bulk insert black magic
        }
    }
}

But I can't help but fell I'm breaking single responsibility with the FooImporterToSqlServer class. It does too much: it reads the file, uses the accumulator and then inserts in the database. But I also don't know how to separate these concerns in this case where I need to stop he process halfway. Specially maintaining state between different insertions. Normally there would be something like FooReaderToMemory then FooInsertToSqlServer but in this case they're one in the same. Any ideas on how can I refactor process like this to not break SRP?

EDIT: for clarification, I can't read, process and then import the file because the processed lines are too much to hold in memory all at once.

Please feel free to note anything else wrong with the example I provided!

Solução

Here goes my suggestion in the following code. I'm trying to solve the problems below:

If I need to change the way of extracting the data (currently is from a file), I want to change only one class (no problem if additional class needs to be created);
If I need to change the database or way of importing the data, I want to change only one class (no problem if additional class needs to be created);

If I want to change the overall process (by adding or changing steps) I want change only one class;

class FooImporter
{
    private IFooImporter Importer { get;} //no more connection string; now the importer is injected (doesn't matter how it imports)

    private AccumulatorCollection Accumulators { get;} //initilized via constuctor
    private FooProcessor Processor { get;} //initilized via constuctor
    private IFooExtractor Extractor {get;} //this class does not open files anymore, this is done by the extractor injected
    private int LastHierarchyId { get; set;} //initilized via constuctor, starts at 0

    public void Import() //now the path is private detail of the Extractor injected
    {
        using (Extractor.BeginExtract())
        {
            while (Extractor.HasNext())
            {
                var line = Extractor.GetNext();
                var processedLine = Processor.Process(line, ref LastHierarchyId);
                Accumulators.Add(processedLine);
                if(Accumulators.HaveAnyFull())
                    UnloadAccumulators();
            }
        }
    }

    private void UnloadAccumulator()
    {
        var fullAccumulators = Accumulators.GetFullAccumulators();
        foreach(var fullAccumulator in fullAccumulators)
        {
            Importer.Import(fullAcumulator);
        }
    }
}

class SqlServerFooImporter : IFooImporter  //You could have different importers
{
    private string ConnectionString { get;} //initilized via constuctor

    private void Import(Accumulator fullAcumulator)
    {
        //bulk insert black magic
    }
}

To summarize: your example code shows a class that might have multiple reasons to change. Therefore I suggested a design on which each class has one reasons to have its code changed; also for new features you implement new classes and inject them in the importer (just like you're already doing, but not for all steps of the process).

Outras dicas

But I can't help but fell I'm breaking encapsulation with the FooImporterToSqlServer class. It does too much: it reads the file, uses the accumulator and then inserts in the database.

Encapsulation? That's not an encapsulation concern. Doing to much is a Single Responsibility Principle concern. The encapsulation concern comes from publicly accessible getters. Those would expose your state. That would leave me wondering what code, that isn't even mentioned by this class, has a need to use your ConnectionString, Accumulators, Processor, and LastHierarchyId?

As for following the Single Responsibility Principle, you are. The single responsibility of the FooImporterToSqlServer is to import files. I don't expect to find anything in here that doesn't help it do that.

As @CandiedOrange pointed out, I do not see you are breaking SRP here. You have a class for processing data, class for accumulating data, and a class for storing the data to the database. Each class does its own work. The only way you could decouple it more would be to have a class that would do the file parsing, and that class would encapsulate the above four classes, but in this case, I think it would be an overkill.

I will, though, address the paragraph below:

EDIT: for clarification, I can't read, process and then import the file because the processed lines are too much to hold in memory all at once.

One way to go around that problem is the one you implemented. The other one would be to implement a consumer-producer pattern, where one thread would populate the collection of lines, and the other one would empty it while importing that data into the database. That way, the buffer would 'breathe', and the memory consumption would be reduced.

The approach you took could have better performance, though, if you are using some bulk import techniques. With your approach, there would be less round-trips to the database.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a softwareengineering.stackexchange