searching through subsets, algorithm (optimal or heuristic)

https://stackoverflow.com/questions/19440472

01-07-2022
|

Question

I’m looking for help with search algorithm for a problem. The problem can be simplified to the following.

We have an object that can take certain attributes – X1, X2 … XN. N is of the order of 5000. However a particular object takes a subset of these attributes say Xi .. Xj (about 50).

A configuration is a particular subset of the attributes. There are certain configurations, numbering Z (order of 0.1 million), that are optimal.

Input:

Configuration 1: X1, X2, X3.. Xf
Configuration 2: X4, X6, X7, .. Xz
:
:
Configuration Z: X10, X200… XN

Problem: Given a certain object, ALPHA with a subset of attributes {Xi…Xj} the goal is to find the configuration that is closest to this object. The configuration can be a superset of the object ALPHA’s configuration. It could also be that no configuration has all of ALPHA’s attributes. Closest is defined as the configuration that has the most number attributes of ALPHA.

The naïve solution I have basically does the following

1. Take each configuration
2. Loop through each attribute of ALPHA
3. Keep track of the configuration with maximum number of matches to ALPHA
4. Pop out the configuration maximum matches.

I think the naïve solution is correct, however it is too slow. Is there an efficient way to do the search across the configurations? Even approximate heuristic is fine if it is very fast.

Adding C++, Java tags to see if there is software that does this.

Thanks.

Solution

Here's a better non-heuristic approach. Store your optimal configurations in reverse. ie mapping each attribute to the configurations it appears in. eg (unrelated to your your example)

X1 : C1 C4 C5 C99999
X2 : C1 C2
X3 : C1
X4 : C1 C2
...
XN : C100 C200 C300

Now, you make an array of length Z to store how many matches you found. You iterate through ALPHA, and for each attribute you go through the associated configurations. For each configuration, you add one to the corresponding position in your array.

Finally, you run through the array and select the configuration with the largest count. You could keep a running maximum instead if you like, but that would only be more efficient if you expected to do fewer than Z comparisons in total (number of elements in ALPHA multiplied by the average number of associated configurations).

This will find the best match, and ought to be about 50 times faster than your naive solution.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow