Advantages and disadvantages of using a concatenated key rather than nested Map Container in C++

StackOverflow https://stackoverflow.com/questions/22673370

  •  22-06-2023
  •  | 
  •  

Domanda

I have been given a project to investigate alternatives to the current container that is used to store Data in order to make it more efficient.

The current design involved 4 nested maps like that map< string, map< string, map< int, map< string, string> > > >

Lets Name each data field as Company, Department, ID_of_employee, Name

The time complexity at the moment to retrieve the the name of an employee given Company, Dept, ID is O(log N) and more precisely it involves three lookups.

Space complexity for now is not an issue.

My initial alternatives were the followings:

  • Use nested pairs to represent Company, Dept, Id and then use this nested pairs as key to a map. This seems not easily readable though.
  • Instead of nested pairs i thought about using a tuple or a struct which essentially as I read are not that different. After creating the new struct EmployeeKey that will contain fields for Company, Dept, ID. I can use it as a Key to a map. (i guess i have to write custom compare and less than operators also).
  • Use a concatenated key from company+Dept+ID by converting the int to string and concatenating them. Then feed this key to a map<ConcatenatedKey, Data>
  • Use Boost.MultiIndex. Even though this seems the best option I abandoned this option cause I found it a bit complicated to.

To give a little more info that is necessary. This Container is usually used to retrieve the final nested data and that is why I concluded to use the concatenated key approach. My question basically is, are there any caveats in using this sort of concatenated string? Is it a bad design or something we should avoid?

To my understanding this will improve the lookup time, still keep it logarithmic but perform one rather than four lookups thus it seems like an improvement.

È stato utile?

Soluzione

Since std::map<> is a Red-black tree, which is still a binary tree, the lookup speed is not that fast compared to hashmaps - especially if the number of entries is large.

Using an std::unordered_map<> (a hashmap) will give better performance, assuming the hash spread is good. I recommend fnv or MurmurHash3, since they have the nicest distribution of values.

Now, talking about nested containers - you should never, ever do such a thing! The overall performance can be quite horrible and the memory usage will definitely be very large, since it's essentially a 4-Dimensional RB-tree:

Lets put this into context, where you have 20 Companies, each company has 5 departments, each department has 12 EmployeeID's and each EmployeeID maps to a map of <Name, some_string> (the last bit seems a bit redundant, don't you think?).

  1. each Company leaf node is an std::map => 20 std::map instances
  2. each Department leaf node is an std::map => 20 + 20*5 = 120 std::map instances
  3. each EmployeeID leaf node is an std::map => 120 + 20*5*12 = 1320 std::map instances
  4. each Name leaf node is an std::map => 1320 + 20*5*12*1 = 2520 std::map instances

So you see, nesting containers is very dangerous, since even with a small data set you can end up with a huge amount of container instances. This has ridiculously bad performance, especially when the objects are being destroyed or new elements are inserted.

What I recommend: Use an EmployeeKey struct combined with std::unordered_map. This will give you a good lookup speed and only one std::unordered_map instance.

struct EmployeeKey
{
    int         CompanyID;  // if you want speed, CompanyID is the way to go
    std::string Department;
    int         EmployeeID;
    std::string Name;

    inline bool operator==(const EmployeeKey& key) const {
        return CompanyID != key.CompanyID && ... /* etc */;
    }
};

template<> struct hash<EmployeeKey> {
    size_t operator()(const EmployeeKey& key) const {
        /* perform hash combine here */
    }
};

And this should be enough to get you started. The final layout would look like this:

std::unordered_map<EmployeeKey, std::string> EmployeeData;
// usage:
auto it = EmployeeData.find(selectedEmployee);
if (it != EmployeeData.end())
    it->second = "Good employee";

If you really have to 'speed up' your lookups by every possible means, you can keep in mind that if CompanyID's are integers from [0 .. N], you can use an std::vector to get a fast first level index to right company:

std::vector<std::unordered_map<EmployeeKey2, std::string>> EmployeeData;
// usage:
auto& companyMap = EmployeeData[EBuyNLargeCorp]; // or [selectedCompany(0..N)]
auto it = companyMap.find(selectedEmployee);
if (it != companyMap.end())
    it->second = "Good employee!";

Where EmployeeKey2 would be missing the CompanyID field and selectedCompany would be the Index into the vector instead. But this is only something you do for really crucial performance gains.

Altri suggerimenti

It looks like you are forgetting to use the right tool for the right problem. You tried to emulate a database with map. An easier solution is to use a real DB, SQLite3 is easy to integrate as it works with a file.

You will be able to query a lot of different information in an efficient way. You could even investigate the database with external tools.

If you still do not want to use a DB, imagine each table as a vector, id is the index. And last tables is a map of tuple of ids to value, but i do not recommend as it would be harder to obtain different kind of information.

Below an exemple of DB, notes that i did not write SQL for years, there is probably better design, you may also for example add a table to register valid department for a company and add constraint to catch an invalid registration of an employee to a missing department of a company.

SQL Fiddle

SQLite (SQL.js) Schema Setup:

CREATE TABLE Company(
     id integer primary key autoincrement, 
     name varchar(20) not null unique
);

INSERT INTO Company (name) values ("google");
INSERT INTO Company (name) values ("facebook");

CREATE TABLE Department(
     id integer primary key autoincrement, 
     name varchar(20) not null unique
);

INSERT INTO Department (name) values ( "research");
INSERT INTO Department (name) values ( "development");
INSERT INTO Department (name) values ( "marketing");
INSERT INTO Department (name) values ( "hell");

CREATE TABLE Employee
(
     social_id integer primary key, 
     name varchar(20) not null
);

INSERT INTO Employee  values ( 1,"mark");
INSERT INTO Employee  values ( 2,"john");
INSERT INTO Employee  values ( 3,"david");

CREATE TABLE Assigment(
     emp_id  primary key references Employee(social_id), /* employee have only one job */
     comp_id not null references Company(id), 
     dep_id  not null references Department(id)
);

INSERT INTO Assigment select 1,c.id,d.id from Company c join Department d where (c.name='google' and d.name='hell');
INSERT INTO Assigment select 2,c.id,d.id from Company c join Department d where (c.name='google' and d.name='marketing');
INSERT INTO Assigment select 3,c.id,d.id from Company c join Department d where (c.name='facebook' and d.name='research');

Query 1:

SELECT c.name,d.name,e.name
    FROM assigment a JOIN company c ON a.comp_id=c.id
    JOIN department d ON d.id=dep_id
    JOIN employee e ON a.emp_id=social_id

Results:

|     name |      name |  name |
|----------|-----------|-------|
|   google |      hell |  mark |
|   google | marketing |  john |
| facebook |  research | david |
Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top