It seems to me you have a few options here. Note that I'm primarily familiar with XNA 4.0 on the PC, so not all of these may be possible or performant in your case.
The Easy, Hacky Way
You don't appear to be using the color channel when drawing your sprites; this technique assumes that your example is representative of your real code.
If you don't need the sprite color for tinting your sprites, you can hijack it as a way to pass per-sprite data into a custom vertex/pixel shader. For example, you could do this:
var shearX = MathHelper.ToRadians(33) / MathHelper.TwoPi;
var shearY = MathHelper.ToRadians(33) / MathHelper.TwoPi;
var color = new Color(shearX, shearY, 0f, 0f);
_spriteBatch.Draw(_texture, rectangle, color);
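This stores the x- and y-shear angles as fractions of a full turn (2 * pi radians) in the red and green color channels, respectively. Bear in mind that Color holds 8 bits per channel, so each angle is quantized to steps of about 1.4 degrees, and negative angles can't be represented without adding an offset.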
Then you can create a custom vertex shader that retrieves these values and performs the shearing calculations on the fly. See Shawn Hargreaves's article here for information on how to do that.
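To make the C# side of this a bit more concrete, here is a minimal sketch of how the encoding and the draw call might fit together. The ShearDrawing class, its method names, and the shearEffect parameter are all hypothetical; shearEffect stands in for whatever custom effect you write, and I'm assuming its vertex shader multiplies the red and green channels by 2 * pi to recover the angles.

```csharp
using Microsoft.Xna.Framework;
using Microsoft.Xna.Framework.Graphics;

// Hypothetical helper; none of these names come from XNA itself.
public static class ShearDrawing
{
    // Packs two shear angles (radians, expected in [0, 2*pi]) into the red
    // and green channels as fractions of a full turn. Values outside that
    // range are clamped, and the 8-bit channels quantize the result.
    public static Color EncodeShear(float shearXRadians, float shearYRadians)
    {
        float r = MathHelper.Clamp(shearXRadians / MathHelper.TwoPi, 0f, 1f);
        float g = MathHelper.Clamp(shearYRadians / MathHelper.TwoPi, 0f, 1f);
        return new Color(r, g, 0f, 0f);
    }

    // Draws one sprite through a custom effect. Passing null for the state
    // objects makes SpriteBatch fall back to its defaults.
    // ASSUMPTION: shearEffect decodes the color channels and applies the
    // shear itself; SpriteBatch knows nothing about shearing.
    public static void DrawSheared(
        SpriteBatch spriteBatch, Effect shearEffect, Texture2D texture,
        Rectangle destination, float shearXRadians, float shearYRadians)
    {
        spriteBatch.Begin(SpriteSortMode.Deferred, null, null, null, null, shearEffect);
        spriteBatch.Draw(texture, destination, EncodeShear(shearXRadians, shearYRadians));
        spriteBatch.End();
    }
}
```

In real code you would call Begin and End once around all of your sprites rather than once per sprite; I've wrapped a single draw here only to keep the sketch short.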
Hybrid Approach
Another relatively straightforward possibility is to combine traditional sprite batching with your DrawUserIndexedPrimitives code.
The key to good performance is to minimize state changes, so careful ordering of your sprites can go a long way. Organize your sprites such that you can draw all non-skewed sprites in a single pass using SpriteBatch, then only use the slower DrawUserIndexedPrimitives technique to draw the sprites that actually need it. This should significantly reduce the number of batches being sent to the GPU, assuming that most of the sprites in a given frame aren't skewed.
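A rough sketch of that split might look like the following. The Sprite class and the drawSkewed callback are hypothetical; the callback stands in for your existing DrawUserIndexedPrimitives code, which I haven't tried to reproduce.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Xna.Framework;
using Microsoft.Xna.Framework.Graphics;

// Hypothetical sprite description; adapt it to whatever your game stores.
public sealed class Sprite
{
    public Texture2D Texture;
    public Rectangle Destination;
    public float ShearX;
    public float ShearY;

    public bool IsSkewed
    {
        get { return ShearX != 0f || ShearY != 0f; }
    }
}

public static class HybridRenderer
{
    // Draws every non-skewed sprite in one SpriteBatch pass, then hands the
    // remaining sprites to the caller's slower per-sprite routine.
    public static void DrawAll(
        SpriteBatch spriteBatch, IList<Sprite> sprites, Action<Sprite> drawSkewed)
    {
        spriteBatch.Begin();
        foreach (Sprite sprite in sprites)
        {
            if (!sprite.IsSkewed)
                spriteBatch.Draw(sprite.Texture, sprite.Destination, Color.White);
        }
        spriteBatch.End();

        foreach (Sprite sprite in sprites)
        {
            if (sprite.IsSkewed)
                drawSkewed(sprite);
        }
    }
}
```

One caveat with this ordering: the skewed sprites always end up drawn on top of the non-skewed ones. If your layering depends on draw order, you would need to split the frame into more passes or rely on the depth buffer instead.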
Batching + Custom Vertex Format
This is probably the best technique, but it also involves writing the most code, though none of it is particularly complex.
The way SpriteBatch works internally is that it maintains a dynamic vertex buffer which is populated on the CPU and then drawn all in a single call. Shawn Hargreaves provides a high-level overview of how this sort of thing is done here.
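For illustration, the general pattern can be sketched roughly like this. To be clear, this is not SpriteBatch's actual implementation; QuadBatcher, MaxQuads, and the method names are hypothetical, and I'm using the built-in VertexPositionColorTexture format just to keep it short.

```csharp
using System;
using Microsoft.Xna.Framework;
using Microsoft.Xna.Framework.Graphics;

// Hypothetical minimal batcher: quads accumulate in a CPU-side array and are
// uploaded and drawn with a single DrawIndexedPrimitives call per pass.
public sealed class QuadBatcher
{
    private const int MaxQuads = 1024;

    private readonly GraphicsDevice _device;
    private readonly DynamicVertexBuffer _vertexBuffer;
    private readonly IndexBuffer _indexBuffer;
    private readonly VertexPositionColorTexture[] _vertices =
        new VertexPositionColorTexture[MaxQuads * 4];
    private int _quadCount;

    public QuadBatcher(GraphicsDevice device)
    {
        _device = device;
        _vertexBuffer = new DynamicVertexBuffer(
            device, VertexPositionColorTexture.VertexDeclaration,
            MaxQuads * 4, BufferUsage.WriteOnly);

        // The index pattern never changes, so it is built once up front:
        // two triangles per quad, over corners TL, TR, BL, BR.
        var indices = new short[MaxQuads * 6];
        for (int i = 0; i < MaxQuads; i++)
        {
            int v = i * 4, n = i * 6;
            indices[n + 0] = (short)(v + 0);
            indices[n + 1] = (short)(v + 1);
            indices[n + 2] = (short)(v + 2);
            indices[n + 3] = (short)(v + 1);
            indices[n + 4] = (short)(v + 3);
            indices[n + 5] = (short)(v + 2);
        }
        _indexBuffer = new IndexBuffer(
            device, IndexElementSize.SixteenBits, indices.Length, BufferUsage.WriteOnly);
        _indexBuffer.SetData(indices);
    }

    // Queues one quad; corners are given as top-left, top-right,
    // bottom-left, bottom-right.
    public void Add(
        VertexPositionColorTexture topLeft, VertexPositionColorTexture topRight,
        VertexPositionColorTexture bottomLeft, VertexPositionColorTexture bottomRight)
    {
        if (_quadCount == MaxQuads)
            throw new InvalidOperationException("Batch is full; call Flush first.");

        int v = _quadCount * 4;
        _vertices[v + 0] = topLeft;
        _vertices[v + 1] = topRight;
        _vertices[v + 2] = bottomLeft;
        _vertices[v + 3] = bottomRight;
        _quadCount++;
    }

    // Uploads the queued vertices and draws them all at once.
    // ASSUMPTION: the effect's parameters (matrices, texture) are already
    // set, and the cull mode suits the winding order used above.
    public void Flush(Effect effect)
    {
        if (_quadCount == 0)
            return;

        // Unbind first, since writing to a buffer that is still set on the
        // device can throw.
        _device.SetVertexBuffer(null);
        _vertexBuffer.SetData(_vertices, 0, _quadCount * 4, SetDataOptions.Discard);
        _device.SetVertexBuffer(_vertexBuffer);
        _device.Indices = _indexBuffer;

        foreach (EffectPass pass in effect.CurrentTechnique.Passes)
        {
            pass.Apply();
            _device.DrawIndexedPrimitives(
                PrimitiveType.TriangleList, 0, 0, _quadCount * 4, 0, _quadCount * 2);
        }
        _quadCount = 0;
    }
}
```

A real batcher would also handle texture changes (flushing whenever the texture differs) and a lost device, both of which I've left out.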
The problem with extending your DrawUserIndexedPrimitives code to use this technique is that pesky world matrix; shaders don't really have a good way to attach a particular world matrix to a particular sprite (unless you're using hardware instancing, which I don't think your platform supports). So what can you do?
If you create a custom vertex format, you can attach shearing values to each vertex, and use those to perform the shearing in the vertex shader, as in the first technique. This will allow you to draw all of your game's sprites in a single call, which should be very fast.
You can find information on custom vertex declarations here.
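As a starting point, a vertex format carrying per-vertex shear values might look something like this sketch. The struct name and the choice to put the shear in a second texture-coordinate slot are my own assumptions; your vertex shader would need a matching input (TEXCOORD1 in HLSL terms).

```csharp
using Microsoft.Xna.Framework;
using Microsoft.Xna.Framework.Graphics;

// Hypothetical vertex format: the usual position/color/texcoord data plus a
// Vector2 holding the x- and y-shear for the sprite this vertex belongs to.
public struct VertexPositionColorTextureShear : IVertexType
{
    public Vector3 Position;          // byte offset 0, 12 bytes
    public Color Color;               // byte offset 12, 4 bytes
    public Vector2 TextureCoordinate; // byte offset 16, 8 bytes
    public Vector2 Shear;             // byte offset 24, 8 bytes

    // The offsets must match the field layout above. The shear travels as
    // texture coordinate set 1 because there is no dedicated usage for it.
    public static readonly VertexDeclaration VertexDeclaration = new VertexDeclaration(
        new VertexElement(0, VertexElementFormat.Vector3, VertexElementUsage.Position, 0),
        new VertexElement(12, VertexElementFormat.Color, VertexElementUsage.Color, 0),
        new VertexElement(16, VertexElementFormat.Vector2, VertexElementUsage.TextureCoordinate, 0),
        new VertexElement(24, VertexElementFormat.Vector2, VertexElementUsage.TextureCoordinate, 1));

    public VertexPositionColorTextureShear(
        Vector3 position, Color color, Vector2 textureCoordinate, Vector2 shear)
    {
        Position = position;
        Color = color;
        TextureCoordinate = textureCoordinate;
        Shear = shear;
    }

    VertexDeclaration IVertexType.VertexDeclaration
    {
        get { return VertexDeclaration; }
    }
}
```

Unlike the color-channel trick, the shear values here are full 32-bit floats, so there is no quantization and negative values work without any offset. All four vertices of a sprite would carry the same Shear value.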