I might not understand what you exactly want. But here is my suggestion. Let me know if my solution is inaccurate/irrelevant and I will edit/delete accordingly.
Assume you have a certain metric (say compatibility of the pairs or waiting time). If you just have the average or total number for this metric over all the users, it is kind of useless. Instead you might want to find the distribution of of this metric over all users. If nothing, you should always keep track of the variance. Once you have the distribution, you can calculate a probability that particular algorithm A is better than B for a certain metric.
If you do not have the distribution of the metric within an experiment, you can always run multiple experiments, and the number of experiments you need to run depends on the variance of the metric and difference between two algorithms.