I'm doing a similar sort of fuzzy phrase matching in Lucene over tokenized sequences. Token distances are computed with Levenshtein or Jaro-Winkler, and Smith-Waterman is then used to find the best sequence alignment. If I were to adapt that approach to your case, the trouble would be that the alignment scoring has no way to (directly) favor token swaps (displaced token substitution). The best I could do would be to charge a lower insertion cost for tokens that appear in the source than for tokens that do not.
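Here's a rough sketch of that idea in plain Java, without the Lucene plumbing. The class and method names and the specific costs (a -1.0 gap versus a -0.25 "soft" gap for query tokens that occur anywhere in the source) are just illustrative choices for the example, not my production setup:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TokenAlignment {

    // Normalized Levenshtein similarity between two tokens, in [0, 1].
    static double tokenSimilarity(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) d[a.length()][b.length()] / maxLen;
    }

    // Smith-Waterman local alignment over token sequences.
    // Inserting a query token that occurs *somewhere* in the source is
    // penalized less, which softens (but cannot directly reward) swaps.
    static double align(String[] source, String[] query) {
        Set<String> sourceVocab = new HashSet<>(Arrays.asList(source));
        double gap = -1.0;      // cost of inserting/deleting an unrelated token
        double softGap = -0.25; // cheaper insertion for tokens present in the source

        double[][] h = new double[source.length + 1][query.length + 1];
        double best = 0.0;
        for (int i = 1; i <= source.length; i++) {
            for (int j = 1; j <= query.length; j++) {
                // match/substitution score rescaled to [-1, 1]
                double match = 2.0 * tokenSimilarity(source[i - 1], query[j - 1]) - 1.0;
                double insCost = sourceVocab.contains(query[j - 1]) ? softGap : gap;
                h[i][j] = Math.max(0.0, Math.max(
                        h[i - 1][j - 1] + match,
                        Math.max(h[i - 1][j] + gap, h[i][j - 1] + insCost)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] source = "quick brown fox jumps".split(" ");
        String[] swapped = "brown quick fox jumps".split(" ");
        String[] unrelated = "lazy dogs sleep all day".split(" ");
        System.out.println(align(source, swapped));   // higher: swapped token gets the soft gap
        System.out.println(align(source, unrelated)); // lower
    }
}
```

The soft gap keeps a displaced token from being punished as hard as a genuinely missing one, but as noted it still can't reward the swap itself.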
That's why I like the n-gram approach: it gives scoring that is less sensitive to non-local reordering. I suggest checking out BLEU, METEOR, and ROUGE, which are standard n-gram metrics for sentence similarity with various ways of handling order sensitivity. They can be used with either character-level n-grams, as in your proposal, or token-level n-grams, as I'm doing. A small character-level baseline is sketched below.
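For comparison, here's a minimal character n-gram baseline (a Dice coefficient over padded trigram multisets) that already shows the order-insensitivity. To be clear, this is a simpler stand-in, not BLEU/METEOR/ROUGE themselves, and the names are made up for the example:

```java
import java.util.HashMap;
import java.util.Map;

public class CharNGramSimilarity {

    // Multiset of character n-grams, with padding so edges still contribute.
    static Map<String, Integer> ngrams(String s, int n) {
        Map<String, Integer> counts = new HashMap<>();
        String padded = " " + s.toLowerCase() + " ";
        for (int i = 0; i + n <= padded.length(); i++) {
            counts.merge(padded.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Dice coefficient over character n-gram multisets, in [0, 1].
    // Reordering whole phrases barely moves this, unlike edit distance.
    static double dice(String a, String b, int n) {
        Map<String, Integer> na = ngrams(a, n), nb = ngrams(b, n);
        int overlap = 0, total = 0;
        for (Map.Entry<String, Integer> e : na.entrySet()) {
            overlap += Math.min(e.getValue(), nb.getOrDefault(e.getKey(), 0));
        }
        for (int c : na.values()) total += c;
        for (int c : nb.values()) total += c;
        return total == 0 ? 0.0 : 2.0 * overlap / total;
    }

    public static void main(String[] args) {
        System.out.println(dice("the quick brown fox", "brown fox the quick", 3));  // high despite reordering
        System.out.println(dice("the quick brown fox", "a slow purple turtle", 3)); // low
    }
}
```

The real metrics add precision/recall weighting, brevity penalties, and (for METEOR) explicit alignment, but the core intuition about order-robust overlap is the same.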