Pregunta

I have 2 Files:

  1. Contains text which is under monoalphabetic cypher ( it is output out of another decypherated cypher)
  2. Is my language analysis file, it contains tetragrams there is 53000 of them(AAAA,AAAB,...,ZZZZ), every tetragram have assigned how often it is found in selected language + percentage, example: (AAAA*:*2*0.71188)

Problem is:

I am generating random keys[it is shuffled alphabet line a=j,k=u,...], decrypting text and comparing it to language analysis, it means if text is 1000 char long that means i must compare 1000-3= 997 tetragrams with language analysis file that means maximum of 997*52000 = 51.8mil comparisons in single try, it takes around 20 000 keys to get original text => up to 51.8mil * 20 000 comparisons. Currently i have language analysis file in string[][].I need to have fastest comparison possible.


My ideas:

create "help" array which contains possition of first letter of language analysis(later only LA) like this A=0; B=3564 (first tetragram starting with B(BAAA) is 3564. string in LA); ... when i will have tetragram out of cyphrated text for example BECA i will start my FOR cycle searching for tetragram BECA out from 3564. string from LA not from 0. Is there any type of multidimensional array - hashtable, directory, which would solve this for me and would be fast enough?

¿Fue útil?

Solución

It sounds like you're referring to an approach similar to the Bucket Sort.

Because there are only 26^4 (456,976) possible combinations in a Tetragram, it should be possible to represent your entire collection of tetragram values in a single array structure. (You could make it multi-dimensional just to make life easier). That should make for super-fast lookup.

void Main()
{
    var buckets = new double[26,26,26,26];
    PopulateBucketsFromFile(buckets);

    foreach(string tetragram in input)
    {
        var frequency = buckets[
            GetBucketIndex(tetragram, 1),
            GetBucketIndex(tetragram, 2),
            GetBucketIndex(tetragram, 3),
            GetBucketIndex(tetragram, 4)];
    }
}

// This reduces repetetive code: if your program still isn't fast
// enough you may try inlining this method (although it's likely
// that the JITter will just do this for you.
int GetBucketIndex(string tetragram, int i)
{
    return tetragram[i] - 'A';
}

Is there any type of multidimensional array - hashtable, directory, which would solve this for me and would be fast enough?

How fast is "fast enough?" It's impossible to say whether this would be fast enough for you without testing it.

My benchmarks show that I can compare all the 997 tetragrams in a 1000-character string upwards of 50,000 times per second. Is that fast enough?

Using a Dictionary instead, I can only do this about 8,000 times per second. That's 5x slower, but the code is simpler. As you mention in your comments, that's still a great deal faster than what you were doing before. "Fast enough" will have to depend on your business case.

Benchmarks, for reference:

/* This is a benchmarking template I use in LINQPad when I want to do a
 * quick performance test. Just give it a couple of actions to test and
 * it will give you a pretty good idea of how long they take compared
 * to one another. It's not perfect: You can expect a 3% error margin
 * under ideal circumstances. But if you're not going to improve
 * performance by more than 3%, you probably don't care anyway.*/
void Main()
{
    // Enter setup code here
    var random = new Random();
    var chars = new char[1000];
    for (int i = 0; i < chars.Length; i++)
    {
        chars[i] = (char)random.Next('A', 'Z' + 1);
    }
    var str = new string(chars);



    var probabilities = Enumerable.Range(0, (int)Math.Pow(26, 4)).Select(i => random.NextDouble()).ToList();
    var probabArray = new double[26,26,26,26];
    for (int i = 0; i < probabilities.Count; i++)
    {
        int j1 = i % 26,
            j2 = i / 26 % 26,
            j3 = i / 26 / 26 % 26,
            j4 = i / 26 / 26 / 26 % 26;
        probabArray[j1, j2, j3, j4] = probabilities[i];
    }

    var probabDict = probabilities.Select((p, i) => new{
        prob = p,
        str = new string(new char[]{(char)(i%26 + 'A'), (char)(i / 26 % 26 + 'A'), (char)(i / 26 / 26 % 26 + 'A'), (char)(i / 26 / 26 / 26 % 26 + 'A')})
    }).ToDictionary(e => e.str, e => e.prob);

    var actions = new[]
    {
        new TimedAction("first", () =>
        {
            for(int i = 0; i < str.Length - 3; i++)
            {
                var prob = probabArray[str[i] - 'A', str[i + 1] - 'A', str[i + 2] - 'A', str[i + 3] - 'A'];
            }
        }),
        new TimedAction("second", () =>
        {
            for(int i = 0; i < str.Length - 3; i++)
            {
                var prob = probabDict[str.Substring(i, 4)];
            }
        }),
        // Enter additional tests here
    };
    const int TimesToRun = 10000; // Tweak this as necessary
    TimeActions(TimesToRun, actions);
}

#region timer helper methods
// Define other methods and classes here
public void TimeActions(int iterations, params TimedAction[] actions)
{
    Stopwatch s = new Stopwatch();
    int length = actions.Length;
    var results = new ActionResult[actions.Length];
    // Perform the actions in their initial order.
    for(int i = 0; i < length; i++)
    {
        var action = actions[i];
        var result = results[i] = new ActionResult{Message = action.Message};
        // Do a dry run to get things ramped up/cached
        result.DryRun1 = s.Time(action.Action, 10);
        result.FullRun1 = s.Time(action.Action, iterations);
    }
    // Perform the actions in reverse order.
    for(int i = length - 1; i >= 0; i--)
    {
        var action = actions[i];
        var result = results[i];
        // Do a dry run to get things ramped up/cached
        result.DryRun2 = s.Time(action.Action, 10);
        result.FullRun2 = s.Time(action.Action, iterations);
    }
    results.Dump();
}

public class ActionResult
{
    public string Message {get;set;}
    public double DryRun1 {get;set;}
    public double DryRun2 {get;set;}
    public double FullRun1 {get;set;}
    public double FullRun2 {get;set;}
}

public class TimedAction
{
    public TimedAction(string message, Action action)
    {
        Message = message;
        Action = action;
    }
    public string Message {get;private set;}
    public Action Action {get;private set;}
}

public static class StopwatchExtensions
{
    public static double Time(this Stopwatch sw, Action action, int iterations)
    {
        sw.Restart();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        sw.Stop();

        return sw.Elapsed.TotalMilliseconds;
    }
}
#endregion

Results:

Message DryRun1 DryRun2 FullRun1  FullRun2
first   0.5502  0.3217  221.5343  222.9831 
second  2.4481  1.063   1384.6288 1138.6466 
Licenciado bajo: CC-BY-SA con atribución
No afiliado a StackOverflow
scroll top