The observed answer points to something that seems to want credentials from Google, and I'm not keen on "papers, please". In any case, I think this is best solved empirically, because how long each choice of parameter takes will depend on the details of caching and other memory-access behaviour. When we work out an algorithm's running time in theory we don't normally use such a detailed model: we usually just count operations or memory accesses, and we often discard constant factors so we can use notations like O(n) vs O(n^2).
If you were doing a lot of similar radix sorts within a long-running program, you could have it time a series of test runs at startup to choose the best setting. That way it would use the fastest setting even when different computers require different settings because they have different-sized caches, or a different ratio of access times between main memory and cache.
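As a rough sketch of what I mean (assumptions: an LSD radix sort on non-negative integers, with the tunable parameter being the number of bits per digit, and a made-up helper name `best_digit_width`), the startup calibration might look like this:

```python
import time
import random

def radix_sort(values, bits):
    """LSD radix sort of non-negative ints, using `bits` bits per digit."""
    mask = (1 << bits) - 1
    max_val = max(values)
    shift = 0
    while (max_val >> shift) > 0:
        # Distribute into 2**bits buckets by the current digit, then
        # concatenate; stable, so earlier digits stay correctly ordered.
        buckets = [[] for _ in range(1 << bits)]
        for v in values:
            buckets[(v >> shift) & mask].append(v)
        values = [v for bucket in buckets for v in bucket]
        shift += bits
    return values

def best_digit_width(sample, candidates=(4, 8, 11, 16)):
    """Time each candidate digit width on representative data and
    return the fastest one for this machine."""
    timings = {}
    for bits in candidates:
        start = time.perf_counter()
        radix_sort(sample, bits)
        timings[bits] = time.perf_counter() - start
    return min(timings, key=timings.get)

# At program startup: calibrate once, then reuse the winner.
sample = random.sample(range(1_000_000), 5_000)
bits = best_digit_width(sample)
```

The important part is that the calibration runs on the actual machine with representative data, so whatever cache effects exist are captured automatically rather than modelled.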