Finding the best combination of sets that gives the maximum number of unique items

https://softwareengineering.stackexchange.com/questions/326378

23-12-2020
|

Question

A couple of months ago I went to a Bruce Springsteen concert. I hadn't heard much of his songs before that, but really liked the concert and bought the live recording of it afterwards (which he sells via live.brucespringsteen.net). A bit later I even bought another live recording of a different concert and after listening to them on repeat I'm liking his music more and more. I like it in fact so much that I want to buy more of his live recordings... and this is where I ran into my problem: he has a lot of recordings!

So now as a developer I want to write an piece of code that will give me the minimum/optimum numbers of recordings to buy that will give me a maximum number of different songs with the least duplicates (taking into account the 2 recordings I already have).

I could brute force it of course, but I'm hoping that there is some algorithm or library that could help me determine the correct set of recordings to buy? Scraping the set lists from the site is the easy part.

I've already tried searching for algorithms and it looks like it might be some sort of Cover Set Problem, which would make it NP-complete, which is a problem, but I'm not looking for a combination that gives me all different songs, just which x recordings to buy additionally to get the best amount of different songs in total (song length doesn't matter).

Clarification: as Doc Brown points out I need to be even more specific. I'd like to know which recordings to buy that give me the most, but not all, songs, with a fixed maximum of duplicates.

Possibility? After thinking about this some more I'm starting to wonder if maybe Genetic Programming can be applied to this? You could start off with a population of randomly generated individuals (combinations of recordings) and then applying a fitness function that gives a score based on the amount of unique songs, duplicates & total recordings needed. After enough generations this could also maybe give a good approximation?

Solution

You can use genetic algorithms with a simple binary representation for the individuals:

individual = 01010100000111000011010111001110

Every locus of an individual is related to the recordings of a specific show. If:

individual[k] == 1

then you're going to buy the k-th album.

You should code two / three functions:

unique_songs(individual) = number of unique songs

duplicates(individual) = number of duplicates

cost(individual) = total purchase cost

The fitness function is something like:

fitness(individual) =        unique_songs(individual)
                      - k1 *   duplicates(individual)
                      - k2 *         cost(individual)

k1 / k2 have to be tuned. This will treat duplicates(individual) as a soft constraint.

If you want as many songs as possible with a fixed maximum number of duplicates (limit) you need a hard constraint. The simplest approach is changing the fitness from a scalar value to a bi-dimensional vector:

fitness(individual) = [-penalty(individual),
                       unique_songs(individual) - k * cost(individual)]


                       / duplicates(individual) - limit   constraint not satisfied
penalty(individual) = {
                       \ 0                                otherwise

For further details about this approach you can read An Efficient Constraint Handling Method for Genetic Algorithms - Kalyanmoy Deb.

You can use the same library / framework, just change the fitness comparison operator to a lexicographic comparison function.

OTHER TIPS

Simulated annealing is a technique that is useful for generating approximate results for computationally difficult problems where it is easy to determine if one potential solution is better than another.

At its simplest, you pick a random solution and generate a small change (eg swapping one recording for another) and see whether this has got you closer to your result. If it has, you keep the new solution and repeat. If it hasn't, you have a random chance of accepting the change anyway. The process is governed by a "temperature" variable that decreases as you approach your iteration limit; the chance of accepting a change that is less optimal reduces as the temperature decreases. As well as tracking the current solution you also track the best ever solution.

For many problem types, this approach usually converges to a reasonable approximation of the best solution.

This looks like a variation of the Knapsack problem for me. I solved the problem with the Jenetics GA library:

import static java.util.Objects.requireNonNull;

import java.util.function.Function;
import java.util.stream.Collectors;

import org.jenetics.BitGene;
import org.jenetics.engine.Codec;
import org.jenetics.engine.Engine;
import org.jenetics.engine.EvolutionResult;
import org.jenetics.engine.Problem;
import org.jenetics.engine.codecs;
import org.jenetics.util.ISeq;

public class Springsteen
    implements Problem<ISeq<Springsteen.Record>, BitGene, Double>
{

    public static final class Record {
        final String name;
        final double price;
        final ISeq<String> songs;

        public Record(
            final String name,
            final double price,
            final ISeq<String> songs
        ) {
            this.name = requireNonNull(name);
            this.price = price;
            this.songs = requireNonNull(songs);
        }
    }

    private final ISeq<Record> _records;
    private final double _maxPricePerUniqueSong;

    public Springsteen(final ISeq<Record> records, final double maxPricePerUniqueSong) {
        _records = requireNonNull(records);
        _maxPricePerUniqueSong = maxPricePerUniqueSong;
    }

    @Override
    public Function<ISeq<Record>, Double> fitness() {
        return records -> {
            final double cost = records.stream()
                .mapToDouble(r -> r.price)
                .sum();

            final int uniqueSongCount = records.stream()
                .flatMap(r -> r.songs.stream())
                .collect(Collectors.toSet())
                .size();

            final double pricePerUniqueSong = cost/uniqueSongCount;

            return pricePerUniqueSong <= _maxPricePerUniqueSong
                ? uniqueSongCount
                : 0.0;
        };
    }

    @Override
    public Codec<ISeq<Record>, BitGene> codec() {
        return codecs.ofSubSet(_records);
    }

    public static void main(final String[] args) {
        final double maxPricePerUniqueSong = 2.5;

        final Springsteen springsteen = new Springsteen(
            ISeq.of(
                new Record("Record1", 25, ISeq.of("Song1", "Song2", "Song3", "Song4", "Song5", "Song6")),
                new Record("Record2", 15, ISeq.of("Song2", "Song3", "Song4", "Song5", "Song6", "Song7")),
                new Record("Record3", 35, ISeq.of("Song5", "Song6", "Song7", "Song8", "Song9", "Song10")),
                new Record("Record4", 17, ISeq.of("Song9", "Song10", "Song12", "Song4", "Song13", "Song14")),
                new Record("Record5", 29, ISeq.of("Song1", "Song2", "Song13", "Song14", "Song15", "Song16")),
                new Record("Record6", 5, ISeq.of("Song18", "Song20", "Song30", "Song40"))
            ),
            maxPricePerUniqueSong
        );

        final Engine<BitGene, Double> engine = Engine.builder(springsteen)
            .build();

        final ISeq<Record> result = springsteen.codec().decoder().apply(
            engine.stream()
                .limit(10)
                .collect(EvolutionResult.toBestGenotype())
        );

        final double cost = result.stream()
            .mapToDouble(r -> r.price)
            .sum();

        final int uniqueSongCount = result.stream()
            .flatMap(r -> r.songs.stream())
            .collect(Collectors.toSet())
            .size();

        final double pricePerUniqueSong = cost/uniqueSongCount;

        System.out.println("Overall cost:  " + cost);
        System.out.println("Unique songs:  " + uniqueSongCount);
        System.out.println("Cost per song: " + (cost/uniqueSongCount));
        System.out.println("Records:       " + result.map(r -> r.name).toString(", "));

    }
}

The fitness function is simple the number of unique songs, restricted by the maximal price per song, which you are willing to pay. Running the code will print the following output:

Overall cost:  37.0
Unique songs:  15
Cost per song: 2.466666666666667
Records:       Record2, Record4, Record6

If you have a limit to the number of albums you are willing to buy and that number is not large, I think there's a pretty straightforward approach. First you want to create a distance function between one set and another. At the simplest level, it's the count of songs that are not in both sets. You say you want to limit the number of duplicates but I question that since it could also limit the number of different songs you end up with. For example, if one recording has all new songs except one that you already have a bunch of times, you would reject it and possibly choose an album with very few new songs. Given the artist, I imagine a few songs (e.g. "Born in the USA", "Born to Run", "Born ...") are going to show up very often if not in every recording.

Once you have that calculation. You take your two recordings and combine them into one set. Calculate the distance of that set and every other recording. Then take those sets and repeat until you reach the max number recordings you are willing to buy.

That could be really costly if you are willing to buy a lot of recordings and I'm not sure how many there are but it sounds like a bunch. If that will take too long, you could cull the starting sets based on their score for each iteration. It's not guaranteed that you will get the best result this way but it's likely to be pretty close.

Here's an example to demonstrate why you might not want to set a hard limit on duplicates:

Let's say your starting set is:

Apple
Banana
Carrot
Durian

And you have the following 4 sets to choose from.

Apple       Apple        Banana  Carrot
Escarole    Horseradish  Durian  Escarole
Fig         Jalapeno        
Grapefruit  Kale        
            Lemon

And you want to select the best 2 sets from those four to add to your set. Here are the 6 options:

 1 & 2          1 & 3     1 & 4
Apple       3  Apple  2  Apple    2
Banana      1  Banana 2  Banana   2
Carrot      1  Carrot 1  Carrot   2
Durian      1  Durian 2  Durian   1
Escarole    1            Escarole 1
Fig         1           
Grapefruit  1           
Horseradish 1           
Jalapeno    1           
Kale        1           
Lemon       1           

 2 & 3          2 & 4          3 & 4    
Apple       2  Apple       2  Apple    1
Banana      2  Banana      1  Banana   2
Carrot      1  Carrot      2  Carrot   2
Durian      2  Durian      1  Durian   2
Horseradish 1  Horseradish 1  Escarole 1
Jalapeno    1  Jalapeno    1        
Kale        1  Kale        1        
Lemon       1  Lemon       1        
               Escarole    1

Now let's say you started with the rule that no item could be repeated more than twice. This would exclude the the option of 1 & 2. I would say that excludes the best option since it provides the largest number of unique items.

Obviously this is a contrived example but as you get lots of sets where the same values appear often, this situation becomes more likely.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange