How to efficiently check if a binary schema is a sub-schema of another binary schema

Question 1

Not exactly sure what you are asking, but I assume that you have a starting list of schemas that you are working off of, and that list is unique (no subsets, etc).

Define a simple IsSubsetOf() function, and then call that as part of a Linq 'any' call, or you can do it in a for-loop:

var startingSchemas = new [] { "100**101", "110*101*", "1111*000" };

startingSchemas.Any(x => IsSubsetOf(x, "11****11")); // false
startingSchemas.Any(x => IsSubsetOf(x, "11011010")); // true

public bool IsSubsetOf(string main, string sub)
{
    for (var i = 0; i < main.Length; i++)
    {
        if (main[i] == '*') continue;    // main is '*', so anything is ok
        if (main[i] == sub[i]) continue; // equal is ok, both 1/0/*
        return false; // if not equal, sub[i] could be *, or the opposite of main[i]
    }
    return true;
}

One issue that I think you might need to clarify is what you want to do when you find something that is NOT a subset, but when combined with another schema then it is.

1*1 => 101 or 111
0*1 => 001 or 011  (not a subset of 1*)

But these two combined = the **1 schema or {001, 011, 101, 111}

Do you want to take a string list and then reduce it to the minimal set of schemas that will still match the same inputs? IsSubset(x,y) = false, but IsSubset(y,x) = true

Addition: Making the starting data unique is pretty easy, if it isn't already:

var uniqueSchemas = startingSchemas.Distinct().ToList();
uniqueSchemas.RemoveAll(x => uniqueSchemas.Any(y => y != x && IsSubsetOf(y, x)));

Compiled (release, no pdb, optimize):

for (int index = 0; index < main.Length; ++index)
{
    if ((int) main[index] != 42 && (int) main[index] != (int) sub[index])
      return false;
}

Performance

Very crude performance check. Run in parallels VM on an i7/4gb 1 core allocated to vm, other processes running, etc.

200 schemas (randomly generated, 800 length)
1 test string (800 length, randomly generated the same way)
1 millions runs each

Output: (all runs were +/- 500ms at the outside, usually in unison)

// unsafe = new 'public static void unsafe IsSubsetOf' function using pointers
//   safe = standard method
Any() warmup : elapsed = 11965 (.0120 ms)
Any() safe   : elapsed = 11300 (.0113 ms)
Any() unsafe : elapsed = 10754 (.0108 ms)
for() safe   : elapsed = 11480 (.0115 ms)
for() unsafe : elapsed =  7964 (.008  ms)

So, that what I get from this. If there is a clever data structure for this, i've no clue.

Unsafe Version

This isn't guaranteed to be 100% correct. I don't normally do this, and I don't know if the discrepancy I saw was because of the test harness or the code. Also, disclaimer, it's been a good 6 years since I wrote a tiny bit of unsafe code. But I don't pump .net for performance this way, there is usually a bigger bottleneck... If you do use unsafe code, my only advice would be to NOT modify anything. If you just read you should be pretty safe. Check all your bounds!

private unsafe static bool IsSubsetOfUnsafe(String main, String sub)
{
    if (main == null && sub == null) return true; // is this what you want? decide
    if (main == null || sub == null) return false; // this too? maybe if null then say "true" and discard?
    if (main.Length != sub.Length) return false;

    fixed (char* m = main)
    fixed (char* s = sub)
    {
        var m1 = m;
        var s1 = s;
        int len = main.Length;
        for (int i = 0; i < len; ++i)
        {
            if ((int)m1 != 42 && m1 != s1) return false;
            m1++;
            s1++;
        }
        return true;
    }
}

Question 2

If I understand the question correctly this should do what you want and it performs as little work as possible assuming mismatching positions are unbiased. It might however be faster not to use LINQ. If you need the resulting list only once you can probably get away without turning the result into a list.

var s = "0*10*";

var sscs = new[] { "11010", "0010*", "0*11*", "1*100" };

var sss = sscs.Where(ssc => s.Zip(ssc, (a, b) => (a == b) || (a == '*')).All(_ => _)).ToList();

Every subschema candidate is compared symbol by symbol with the specified schema. If all symbols match or the schema has a wildcard in case of an mismatch the subschema candidate is a subschema. The comparison is aborted immediately if there is a mismatch and schema has no wildcard.

I heavily abbreviated the variable names to make it (almost) fit.

s     schema
sscs  subschema candidates
ssc   subschema candidate
sss   subschemas
a     symbol in schema
b     symbol in subschema candidate

Question 3

Unfortunately I still don't fully understand what you are doing but I will present my idea anyway, maybe it is useful.

The central idea is to replace your string representation with a more compact bit representation - your string 1*10 for example gets turned into 11001110 or 0xCE. Because one symbol takes up 2 bits you can pack 32 symbols into one UInt64, longer strings become arrays of UInt64s.

0 => 10
1 => 11
* => 00
     01 => unused

Now you can find subschemas with the following LINQ expression

var sss = sscs.Where(ssc => s.Zip(ssc, (a, b) => (a ^ b) & ((a & 0xAAAAAAAAAAAAAAAA) | ((a & 0xAAAAAAAAAAAAAAAA) >> 1))).All(x => x == 0)).ToList();

that is structure like my previous answer to make the comparison more meaningful. The obvious advantage is that it processes 32 symbols in parallel and indeed it is 30 times faster than my previous answer. But I am actually a bit disappointed because I hopped for maybe a 100 times speedup because the compacter representation also means less memory traffic but maybe the overhead from using LINQ is the actual bottleneck. So I turned it into plain for loops and this made it 130 times faster than the LINQ string version. But this is only really useful if it can be deeply integrated into your application because the conversion between the string representation and this representation is quite expensive.