When we detect that a partition has a single color in each subset, we can generate a nearly exact representation of that value for most compression modes. However, during this subset matching we were ignoring the error introduced by modes with completely opaque representations when the data contained transparent pixels. This fix includes that error in our "best fit" calculations, which improves the results across the board.
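For illustration, here is a minimal sketch of the adjusted metric, with Pixel and the function name invented for this example: an opaque-only mode always decodes alpha as 255, and that alpha error now counts toward the best-fit score instead of being ignored:

    #include <cstdint>

    struct Pixel { uint8_t r, g, b, a; };

    // Squared error of representing pixel p with the mode's closest color q.
    double SingleColorError(const Pixel &p, const Pixel &q, bool modeIsOpaque) {
      int dr = p.r - q.r, dg = p.g - q.g, db = p.b - q.b;
      // Opaque-only modes always decode alpha as 255; previously this error
      // was skipped, so transparent data matched opaque modes too well.
      int da = p.a - (modeIsOpaque ? 255 : q.a);
      return double(dr*dr + dg*dg + db*db + da*da);
    }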
With the old code, it was possible to skip a compression entirely if our
threads were preempted at an unlucky moment. I'm not exactly sure why, but in
some very unfortunate circumstances that caused deadlock (livelock?). The new
algorithm should work regardless of how many threads execute at once and
should also prevent textures in the compression job list from being skipped.
It seems to be an improvement on low-core-count machines (around 4 cores),
but it is slower on high-core-count machines (40 cores or more)...
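A minimal sketch of the claiming scheme this relies on, with names invented for illustration: each texture index is handed out exactly once by an atomic fetch_add, so an unluckily preempted thread can delay a texture but can no longer cause it to be skipped:

    #include <atomic>
    #include <cstddef>
    #include <vector>

    struct Texture { /* ... */ };
    void CompressTexture(Texture &t);  // hypothetical per-texture entry point

    void WorkerLoop(std::vector<Texture> &jobs, std::atomic<size_t> &next) {
      for (;;) {
        size_t i = next.fetch_add(1);  // claim the next unprocessed texture
        if (i >= jobs.size()) break;   // every texture has been claimed
        CompressTexture(jobs[i]);
      }
    }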
In general, we want to use this algorithm only with self-contained compression
lists. As such, we've added all of the proper synchronization primitives to
the list object itself, so that different threads working on the same list
can communicate. Ideally, this should reduce the number of user-space context
switches that happen. Whether or not this is faster than the other
synchronization algorithms that we've tried remains to be seen...
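A hedged sketch of what "self-contained" means here, with member names invented for illustration: the claim counter, the completion counter, and the condition variable used to wait on the whole list all live inside the list object, so cooperating threads never touch external locks:

    #include <atomic>
    #include <condition_variable>
    #include <cstddef>
    #include <mutex>
    #include <vector>

    struct Texture { /* ... */ };
    void CompressTexture(Texture &t);  // hypothetical per-texture entry point

    class CompressionList {
     public:
      explicit CompressionList(std::vector<Texture> jobs)
        : m_Jobs(std::move(jobs)) {}

      // Called by every participating thread; claims jobs until none remain.
      void Work() {
        for (;;) {
          size_t i = m_Next.fetch_add(1);
          if (i >= m_Jobs.size()) break;
          CompressTexture(m_Jobs[i]);
          if (m_Done.fetch_add(1) + 1 == m_Jobs.size()) {
            std::lock_guard<std::mutex> lock(m_Mutex);  // pairs with Wait()
            m_Finished.notify_all();  // last texture: wake any waiters
          }
        }
      }

      // Blocks until every texture in this list has been compressed.
      void Wait() {
        std::unique_lock<std::mutex> lock(m_Mutex);
        m_Finished.wait(lock, [this] { return m_Done.load() == m_Jobs.size(); });
      }

     private:
      std::vector<Texture> m_Jobs;
      std::atomic<size_t> m_Next{0}, m_Done{0};
      std::mutex m_Mutex;
      std::condition_variable m_Finished;
    };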
This is a first pass at what I believe to be a not-too-terrible
implementation of a cooperative thread-based compressor. The idea is
simple... If the compressor is invoked with the same parameters on multiple
threads, then the threads cooperate via an atomic counter to compress the
texture, and each thread keeps working until the texture is finished.
If a caller invokes a compression routine with different parameters, it
first helps the current compression finish before starting its own. In this
way, we can split the textures up among the threads and guarantee that we
maximize resource usage across them (see the sketch further below). I.e.,
this becomes more efficient:
    Thread 1:    Thread 2:    ...    Thread N:
      tex0         texN                tex(N-1)N
      tex1         texN+1              tex(N-1)N+1
      ...          ...                 ...
      texN-1       tex2N-1             texNN-1
I have not tested this for bugs, so I'm still not completely convinced that
it is deadlock-free, although it should be...
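Here is a minimal sketch of the cooperating part only, assuming the texture is divided into blocks and that all threads entering with matching parameters share one Job. The handoff that lets a thread with different parameters install its own Job afterwards is omitted, and every name is illustrative rather than the actual API:

    #include <atomic>
    #include <cstddef>

    struct Params { int quality; /* compression settings */ };
    void CompressBlock(size_t block, const Params &p);  // per-block kernel (assumed)

    struct Job {
      size_t numBlocks;
      Params params;
      std::atomic<size_t> nextBlock{0};   // next unclaimed block
      std::atomic<size_t> blocksDone{0};  // blocks fully written out
    };

    // Claim blocks until every one is taken, then spin until the last
    // claimed block has actually finished, so the texture is complete
    // before the caller moves on to its own work.
    void Help(Job &job) {
      for (;;) {
        size_t b = job.nextBlock.fetch_add(1);
        if (b >= job.numBlocks) break;
        CompressBlock(b, job.params);
        job.blocksDone.fetch_add(1);
      }
      while (job.blocksDone.load() < job.numBlocks) { /* spin */ }
    }

In this shape the only wait is the bounded spin on blocksDone, which terminates as soon as the last claimed block finishes, which is why I expect it to be deadlock-free.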
Atomic operations must be supported by both the platform and the compiler. If
we want to provide a threadsafe implementation of our compression function,
we need to make sure that the proper settings are available.
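As a sketch of the kind of guard this implies (the macro names here are mine, not the project's):

    // Only compile the lock-free path when both the compiler and the
    // platform expose the atomic intrinsics we rely on.
    #if defined(_MSC_VER)
      // MSVC: the Interlocked* family is always available.
      #define HAS_ATOMICS 1
    #elif defined(__GNUC__) && defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4)
      // GCC/Clang: the __sync builtins exist for 4-byte operands.
      #define HAS_ATOMICS 1
    #else
      #define HAS_ATOMICS 0
    #endif

    #if HAS_ATOMICS
      // ...threadsafe compression entry point...
    #else
      // ...fall back to a single-threaded implementation...
    #endif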
- Computed the eigenvalues of the covariance matrix associated with the
cluster.
- Compared results of using the eigenvalue ratio as a measurement of
'linearity' for the different shapes, and output the statistics.
- Added a #define that controls whether shape estimation uses quantized
AABB error or eigenvalue ratios. The former seems to be better.
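For reference, a sketch of the eigenvalue-ratio measurement under the assumption that a cluster is a set of RGB points; none of this is the actual code, and the helpers are written out only for illustration. The cluster's covariance matrix is built, its two largest eigenvalues are estimated by power iteration with deflation, and a small lambda2/lambda1 reads as "linear":

    #include <cmath>
    #include <vector>

    struct Vec3 { double x, y, z; };

    static Vec3 MatVec(const double m[3][3], const Vec3 &v) {
      return { m[0][0]*v.x + m[0][1]*v.y + m[0][2]*v.z,
               m[1][0]*v.x + m[1][1]*v.y + m[1][2]*v.z,
               m[2][0]*v.x + m[2][1]*v.y + m[2][2]*v.z };
    }

    // Repeatedly applies m and renormalizes; v converges to the dominant
    // eigenvector and the return value to its eigenvalue (m is PSD here).
    static double PowerIterate(const double m[3][3], Vec3 &v, int iters = 50) {
      double lambda = 0.0;
      for (int i = 0; i < iters; ++i) {
        Vec3 w = MatVec(m, v);
        lambda = std::sqrt(w.x*w.x + w.y*w.y + w.z*w.z);
        if (lambda < 1e-12) return 0.0;
        v = { w.x / lambda, w.y / lambda, w.z / lambda };
      }
      return lambda;
    }

    // Returns lambda2 / lambda1; near zero means the points hug a line.
    double LinearityRatio(const std::vector<Vec3> &pts) {
      Vec3 mu = {0, 0, 0};
      for (const Vec3 &p : pts) { mu.x += p.x; mu.y += p.y; mu.z += p.z; }
      mu.x /= pts.size(); mu.y /= pts.size(); mu.z /= pts.size();

      double cov[3][3] = {};
      for (const Vec3 &p : pts) {
        const double d[3] = { p.x - mu.x, p.y - mu.y, p.z - mu.z };
        for (int i = 0; i < 3; ++i)
          for (int j = 0; j < 3; ++j)
            cov[i][j] += d[i] * d[j];
      }

      Vec3 v1 = {1.0, 1.0, 1.0};
      const double l1 = PowerIterate(cov, v1);
      if (l1 < 1e-12) return 0.0;  // degenerate cluster (single point)

      // Deflate: cov - l1 * v1 v1^T removes the dominant component, so the
      // next power iteration converges to the second eigenvalue.
      const double e1[3] = { v1.x, v1.y, v1.z };
      double defl[3][3];
      for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
          defl[i][j] = cov[i][j] - l1 * e1[i] * e1[j];

      Vec3 v2 = {1.0, -1.0, 0.5};  // arbitrary start, unlikely parallel to v1
      return PowerIterate(defl, v2) / l1;
    }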