1. Split compression parameter generation from compression parameter
packing. This gives a good performance boost, since we no longer pack
on every compression attempt: the error is computed for each candidate,
and only the best parameters are packed (see the first sketch after
this list).
2. Allow the shape selection function to specify up to ten shapes to
try for compression. We were already doing this in a hacky way by
allowing both a three-partition and a two-partition shape. This change
makes it a little cleaner and exposes it to the user (see the second
sketch after this list).
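A minimal sketch of the generation/packing split from item 1. The names
here (CompressionParams, GenerateParams, ComputeError, PackParams) are
hypothetical stand-ins for the actual FasTC interfaces:

    #include <cstdint>
    #include <limits>
    #include <vector>

    struct CompressionParams { double quality; };  // endpoints, indices, etc.

    // Cheap: derives candidate parameters for one shape, no bit packing.
    static CompressionParams GenerateParams(int shapeIdx) {
      CompressionParams p;
      p.quality = static_cast<double>(shapeIdx);
      return p;
    }

    // Computed for every candidate.
    static double ComputeError(const CompressionParams &p) {
      return 100.0 - p.quality;
    }

    // Expensive bit packing: now runs exactly once per block.
    static void PackParams(const CompressionParams &p, uint8_t out[16]) {
      (void)p; (void)out;
    }

    void CompressBlock(const std::vector<int> &shapes, uint8_t out[16]) {
      double bestErr = std::numeric_limits<double>::max();
      CompressionParams best = {};
      for (int shape : shapes) {
        CompressionParams p = GenerateParams(shape);
        const double err = ComputeError(p);  // error computed each time...
        if (err < bestErr) { bestErr = err; best = p; }
      }
      PackParams(best, out);  // ...but only the best parameters get packed
    }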
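And a sketch of the shape-selection hook from item 2. The callback type
and the constant below are illustrative assumptions, not necessarily
the actual FasTC API:

    #include <cstdint>

    static const int kMaxShapes = 10;  // up to ten shapes per block

    struct ShapeSelection {
      uint32_t shapes[kMaxShapes];  // shape indices to try
      int numShapes;                // how many entries are valid
    };

    // User-supplied callback: inspect the block's pixels, pick shapes.
    typedef ShapeSelection (*ShapeSelectionFn)(const uint32_t pixels[16]);

    // Mirrors the old hard-coded behavior: try one two-partition shape
    // and one three-partition shape (indices here are arbitrary).
    static ShapeSelection ChooseTwoAndThree(const uint32_t pixels[16]) {
      (void)pixels;
      ShapeSelection s;
      s.numShapes = 2;
      s.shapes[0] = 13;  // a two-partition shape
      s.shapes[1] = 6;   // a three-partition shape
      return s;
    }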
Changed RGBACluster to be a class that persists once per block. When we
switch shapes and operate on them, we really only need to change which
points in the block are accessed. We don't need to do this very often,
so we just update the mask whenever we need it. This brings us back
closer to our original performance, but we're still not where we were
when we started refactoring.
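A minimal sketch of the single-persistent-cluster idea, assuming 4x4
blocks and a 16-bit point mask; names and layout are illustrative, not
the actual FasTC code:

    #include <cassert>
    #include <cstdint>

    class RGBACluster {
     public:
      explicit RGBACluster(const uint32_t pixels[16]) : m_PointMask(0xFFFF) {
        for (int i = 0; i < 16; ++i) { m_Pixels[i] = pixels[i]; }
      }

      // Switching shapes only changes the mask; pixels are never copied.
      void SetPartitionMask(uint16_t mask) { m_PointMask = mask; }

      // Accessors walk the mask, so callers see only the selected points.
      uint32_t GetNthPoint(int n) const {
        int seen = -1;
        for (int i = 0; i < 16; ++i) {
          if (m_PointMask & (1 << i)) {
            if (++seen == n) { return m_Pixels[i]; }
          }
        }
        assert(!"Point index out of range for this partition");
        return 0;
      }

     private:
      uint32_t m_Pixels[16];
      uint16_t m_PointMask;
    };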
We suffered another performance hit. This time it comes from copying
around a lot of data based on which partition we're choosing. We can
mitigate this by copying the data we need only once and then using
getters/setters that selectively pull from an array based on our shape
index.
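A rough sketch of the single-copy approach with shape-indexed
accessors; the class and member names are hypothetical:

    static const int kNumShapes = 64;  // e.g. BC7 has 64 two-partition shapes

    class ShapeResults {
     public:
      // Selecting a shape is just an index change, not a data copy.
      void SetShapeIndex(int idx) { m_ShapeIndex = idx; }

      // Getters/setters pull from the per-shape array based on the
      // currently selected shape index.
      float GetError() const { return m_Error[m_ShapeIndex]; }
      void SetError(float e) { m_Error[m_ShapeIndex] = e; }

     private:
      int m_ShapeIndex = 0;
      float m_Error[kNumShapes] = {};
    };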
Changed the RGBAEndpoints to use the vector/matrix classes in
FasTCBase. This caused a ~20ms performance hit on an 8-core machine,
likely because the compiler has difficulty optimizing away some
procedure-call overhead. Upon profiling, the biggest bottleneck is
still by far the QuantizedError function, so any and all further
optimization should focus there.
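For illustration, the kind of overhead in question: small vector
operator overloads that return temporaries. This Vector4 is a
simplified stand-in for the FasTCBase classes; if calls like these
aren't fully inlined, each component-wise op costs a function call plus
a temporary, which adds up inside a hot loop like QuantizedError:

    struct Vector4 {
      float c[4];

      Vector4 operator-(const Vector4 &o) const {
        Vector4 r;
        for (int i = 0; i < 4; ++i) { r.c[i] = c[i] - o.c[i]; }
        return r;  // temporary the compiler must optimize away
      }

      float Dot(const Vector4 &o) const {
        float s = 0.0f;
        for (int i = 0; i < 4; ++i) { s += c[i] * o.c[i]; }
        return s;
      }
    };

    // Two calls and a temporary per evaluation unless the compiler
    // inlines everything and elides the intermediate Vector4.
    inline float SqDist(const Vector4 &a, const Vector4 &b) {
      Vector4 d = a - b;
      return d.Dot(d);
    }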
I'm not completely sure what the best strategy is in this case.
Ultimately, it's good that the format itself carries the block
dimensions. It does make a lot of the code somewhat uglier, though;
really the only thing we're sullying is the ability to succinctly
determine what large-scale format a texture is in (PVRTC vs. ASTC
instead of 2bpp PVRTC vs. 4bpp PVRTC).
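As a rough illustration of the trade-off (the enum values and helpers
below are hypothetical, not FasTC's actual definitions): with the block
dimensions folded into each format variant, the block size falls out of
the format directly, but every family check has to enumerate variants.

    enum class ECompressionFormat {
      PVRTC2BPP,  // 8x4 blocks
      PVRTC4BPP,  // 4x4 blocks
      ASTC4x4,
      ASTC8x8
    };

    // The uglier part: family checks must spell out each variant...
    inline bool IsPVRTC(ECompressionFormat f) {
      return f == ECompressionFormat::PVRTC2BPP ||
             f == ECompressionFormat::PVRTC4BPP;
    }

    // ...but in exchange the block dimensions come straight from the format.
    inline void GetBlockDims(ECompressionFormat f, int &w, int &h) {
      switch (f) {
        case ECompressionFormat::PVRTC2BPP: w = 8; h = 4; break;
        case ECompressionFormat::ASTC8x8:   w = 8; h = 8; break;
        default:                            w = 4; h = 4; break;
      }
    }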