What is the most efficient way to store this data?

https://softwareengineering.stackexchange.com/questions/305186

10-12-2020
|

Question

I am in charge of rewriting some old VB code. I understand how it works, but I feel like there is a far-more efficient way to do what they did. I just can't figure out what it is. Here is a contrived example that in terms of data requirements is really similar to what I need to do.

The user has to pick out the manufacturer, make, model and color of their car in a GUI. I have a large text file that looks something like this:

Ford Truck F150 red
Ford Truck F150 blue
Ford Truck F150 black
Ford Truck F150 silver
Ford Truck F250 red
Ford Truck F250 green
Ford Sedan Taurus red
Ford Sedan Taurus green
Ford Sedan Taurus white
Ford...
...

Subaru SUV Forester blue
Subaru SUV Forester red
Subaru SUV Outback Black
Subaru SUV Outback Green
Subaru SUV Outback Blue
Subaru SUV Outback Red
Subaru...
...

etc.

So if the first selection is Subaru, the second box (make) should not have an option to select Truck because none of the Subarus are trucks. Similarly, if they select Ford, Sedan and Taurus, then the last box (color) should not show an option to select blue. Or Black. Or anything other than red, green or white.

The people who wrote the code before me came up with this (in python-y psuedocode):

def getValidOptions():
    items = []
    for i from 0 to numRows:
        options = getLine().split()
        if selectingManufacturer:
            if options[0] not in items:
                items.append(options[0])
        else if selectingMake:
            if selectedManufacturer == options[0] and options[1] not in items:
               items.append(options[1])
        else if selectingModel:
            if selectedManufacturer == options[0] and selectedMake == options[1] and options[2] not in items:
                items.append(options[2])
        else if selectingColor:
            if selectedManufacturer == options[0] and selectedMake == options[1] and selectedModel == options[2] and options[3] not in items:
                items.append(options[3])
    return items

I think that is just hideous, both on an algorithm level, and on a syntax level. For one, it parses through the entire file, when it only needs to read through a couple of lines if done right. To make this even more inefficient, my real data has 6 options to select through, rather than just 4. This is also storing more data then it needs to, given the amount of data duplication.

I'm looking for either a different way of storing the data in the file, or a different way of parsing it to make the getValidOptions function both more pretty and more efficient. Are there any ways I could do this?

Solution

All the other answers I read seem to ignore two very basic rules of software development:

clarify the requirements first (especially the performance and storage requirements)
Keep It Simple, Stupid (see KISS)

You wrote "the text file is large", but did not write too large, so I assume there is actually nothing wrong with its size except your gut feeling. So if loading the file does actually not take too long, and your IT department or someone else does not complain about the wasted disk space, and nobody complains about having problems maintaining the file, let the file format as it is - do not underestimate the value of simplicity.

You are also complaining about the efficiency of the algorithm, which is actually not as efficient as it could be, but it has one very big advantage: it is brain-dead simple and works. So as long as it is efficient enough, do not apply any premature optimization.

So lets assume the program works fast enough, what you should ask first is how can you improve the code in terms of simplicity and the DRY principle? And that is indeed a point which should be improved, since the current code is not DRY. In your example, there appears four times the almost same code block testing if the options on the "higher levels" match the current line, which results in four times the same kind of "append" call (in your real code, the repetition occurs at least six times, as you wrote). You can easily avoid that by introducing a numeric selection level, or passing the already selected options in an ordered list. Using your pseudo-code, this leads to something along the lines of

# selectedOptions is a list, containing either nothing, or "selectedManufacturer"
# or [selectedManufacturer, selectedMake], ..., and so on
def getValidOptions(selectedOptions):
    items = []
    level = selectedOptions.size()
    for i from 0 to numRows:
        options = getLine().split()
        if selectedOptions == options[0:level-1] and options[level] not in item:
            items.append(options[level])
    return items

So this is essentially the same algorithm with no repeated code any more.

Since it is obvious getValidOptions has to be called more than once (at least once per level), I suggest to apply only one optimization (if this is not already the case): make sure the getLine function pulls its data from main memory, and does not read the file from disk again and again.

OTHER TIPS

Well, the data you have has a tree-like structure, where for each manufacturer you have a tree of models, and for each model you have a color (and so on).

So, you could separate the process of this data in a two step way:

After any update to the text file, you must process that file file and convert it to a tree structure.
When loading the application, you only load the tree structure.

The tree structure can be implemented with what is called a hash, an associative array or a dictionary in languages like Java, PHP, Javascript or Python. With this structure, you have:

The first keys are the manufacturers.
Their values are just another hashes or dictionaries where each key is the model.
Their values are the colors. Or another structure keeping in their keys the third level and as value the fourth level.
And so on...

Depending on your programming language, this can be implemented faster or slower. For example:

C#: You can implement a Tree structure¹, and then mark it as serializable.
VB.Net: you can generate the tree structure in one application, and serialize it in a file.
- For this, something like Runtime.Serialization.Formatters.Binary.BinaryFormatter could be useful, but I'm no expert in serializing with VB.Net.
Javascript: you can generate the tree structure in a JSON file, that must be loaded every time the app is loaded.
PHP: you can generate a serialized version of the tree data structure, or you can load a JSON as well.
Java: you can serialize that data structure, creating a Class that implements the interface java.io.Serializable.

References:

^{1: https://dvanderboom.wordpress.com/2008/03/15/treet-implementing-a-non-binary-tree-in-c/

- A complete explanation on implementing a tree in C#.

- Look for a comment where someone asks about serializing that object, and the answer for that comment.}

One simple way to store the data is to just stuff it into a SQLite database. SQLite, unlike most SQL databases, is well-suited for use as an application file format. This approach has several benefits:

No need to code a serializer or a deserializer.
File is editable and queryable by numerous existing programs.
The conditional messiness you mention in the question is avoided. To limit dropdowns, generate a simple where clause against each column (e.g., select distinct model where manufacturer='ford' and color = 'red').

This does force you to learn SQL, but avoids the need to learn a custom file format.

I assume you can randomly access lines in the file, and thus can treat the file as an array of records. If you can't randomly access the lines, then the algorithm you have is the best you can do.

For speediest access, store the data in 6 files, where each file is an index into the next.

There are mny ways to create flatfile indices. I usually use a subscript range. As the user makes each selection, use the range to limit the reading of the next file.

Here is how I would create the indices for the sample data you provided.

Of course the file must be sorted. I've numbered the lines for illustration; the line numbers should not appear in the file.

--| file3.dat |--
 0 Ford Truck F150 red
 1 Ford Truck F150 blue
 2 Ford Truck F150 black
 3 Ford Truck F150 silver
 4 Ford Truck F250 red
 5 Ford Truck F250 green
 6 Ford Sedan Taurus red
 7 Ford Sedan Taurus green
 8 Ford Sedan Taurus white
 9 Subaru SUV Forester blue
10 Subaru SUV Forester red
11 Subaru SUV Outback Black
12 Subaru SUV Outback Green
13 Subaru SUV Outback Blue
14 Subaru SUV Outback Red

To create the first index, make a record for each unique combination of the first three fields in the file. In each record, store the first and last line numbers in which that combination appears.

--| file2.dat |--
 0 Ford Truck F150       0   3
 1 Ford Truck F250       4   5
 2 Ford Sedan Taurus     6   8
 3 Subaru SUV Forester   9  10
 4 Subaru SUV Outback   11  14

The second index is constructed similarly, using the first two fields of the first index.

--| file1.dat |--
 0 Ford Truck        0   1
 1 Ford Sedan        2   2
 2 Subaru SUV        3   4

And the third, the top level in this case, index.

--| file0.dat |--
 0 Ford          0   1
 1 Subaru        2   2

I think I'm overexplaining the concept, but in general you create an index by dropping the last field and eliminating duplicate records.

You can further reduce the file storage requirement by eliminating some redundant data.

For example, the "first" subscript in each index record is always one more than the "last" subscript of the prior record, or zero if there is no prior record. So you don't need to store the "first" subscripts. I'm leaving them in place below for illustration.

Also, since you will use only the last field in each record to fill the selection list, you don't need to store the other fields.

The minimum rendition of the index cascade ends up looking like this, where the tick ' indicates a number not actually stored in the file.

--| file0.dat |--
 0' Ford         0'   1
 1' Subaru       2'   2

--| file1.dat |--
 0' Truck        0'   1
 1' Sedan        2'   2
 2' SUV          3'   4

--| file2.dat |--
 0' F150         0'   3
 1' F250         4'   5
 2' Taurus       6'   8
 3' Forester     9'  10
 4' Outback     11'  14

--| file3.dat |--
 0' red
 1' blue
 2' black
 3' silver
 4' red
 5' green
 6' red
 7' green
 8' white
 9' blue
10' red
11' Black
12' Green
13' Blue
14' Red

When you fill a selection list from an index, you use the "first" and "last" subscripts from the user's selection in the previous index to limit the lines read.

Example:

You fill the first selection list from all of file0.dat. (Ford, Subaru)

The user selects "Ford". The corresponding subscripts are 0 and 1.

You fill the second selection list from lines 0 thru 1 of file1.dat. (Truck, Sedan)

The user selects "Sedan". The corresponding subscripts are 2 and 2.

As you can see, by the time the user has selected, eg, "Ford" "Sedan" "Taurus" you will find that you need read only lines 6 thru 8 of file3.dat to fill the fourth selection list.

I apologize for the rather long description but it's very late here and I don't have time to write a short one.

ADDED: On further thought, the files can be concatenated into one.

--| file.dat |--
 0' -            1'   2
 1' Ford         3'   4
 2' Subaru       5'   5
 3' Truck        6'   7
 4' Sedan        8'   8
 5' SUV          9'  10
 6' F150        11'  14
 7' F250        15'  16
 8' Taurus      17'  19
 9' Forester    20'  21
10' Outback     22'  25
11' red          -'   -
12' blue         -'   -
13' black        -'   -
14' silver       -'   -
15' red          -'   -
16' green        -'   -
17' red          -'   -
18' green        -'   -
19' white        -'   -
20' blue         -'   -
21' red          -'   -
22' Black        -'   -
23' Green        -'   -
24' Blue         -'   -
25' Red          -'   -

This is used exactly like the multiple file version, except you need the dummy first line to contain the first subscript range.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange