ganization of your data structures to make maximum use of the SIMD instructions.
The new instructions also provide techniques for turning control dependencies into data dependencies. For example, there are instructions that set a variable to a mask of zeros or ones based on the comparison of two values. Boolean operations can then be used to set a pointer to one of two values based on the mask. Viewport clipping can be implemented by this technique without using a branch instruction. Because unpredictable branches are relatively expensive on modern processors, removing them can increase performance.
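As an illustration of the mask technique, the following is a minimal scalar sketch in C, not the render cache's actual code. The names `in_range_mask` and `splat_points`, the image layout, and the use of a dummy sink for clipped points are all assumptions made for this example. SIMD compare instructions (for instance, the SSE intrinsic `_mm_cmplt_ps`) produce the same kind of all-ones or all-zeros mask for four lanes at once.

```c
#include <stddef.h>
#include <stdint.h>

/* All-ones mask if 0 <= v < limit, all-zeros otherwise (limit must be > 0).
   A single unsigned compare covers both bounds, because a negative v
   wraps around to a very large unsigned value. */
static uintptr_t in_range_mask(int32_t v, int32_t limit)
{
    return (uintptr_t)0 - (uintptr_t)((uint32_t)v < (uint32_t)limit);
}

/* Hypothetical helper: write point colors into a width x height image.
   Points outside the viewport are written to a dummy slot rather than
   being skipped with an if statement. Returns the number of points
   that landed inside the image. */
static size_t splat_points(uint32_t *image, int32_t width, int32_t height,
                           const int32_t *px, const int32_t *py,
                           const uint32_t *color, size_t n)
{
    uint32_t dummy = 0;            /* sink for clipped points */
    size_t visible = 0;

    for (size_t i = 0; i < n; i++) {
        uintptr_t mask = in_range_mask(px[i], width)
                       & in_range_mask(py[i], height);

        /* Force the offset to 0 for clipped points so that the address
           computed below never leaves the image. */
        size_t offset = ((size_t)py[i] * (size_t)width + (size_t)px[i])
                      & (size_t)mask;

        /* Select the destination pointer with boolean operations only. */
        uintptr_t real = (uintptr_t)(image + offset);
        uintptr_t sink = (uintptr_t)&dummy;
        uint32_t *dst = (uint32_t *)((real & mask) | (sink & ~mask));

        *dst = color[i];
        visible += (size_t)(mask & 1u);
    }
    return visible;
}
```

The store executes for every point, so the loop body has no data-dependent branch; the cost is one wasted write per clipped point. Whether this beats a conditional store depends on how predictable the clipping test is and should be measured.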
As mentioned earlier, memory latency is becoming more and more of a bottleneck. We have tried to carefully organize our data structures to minimize the amount of memory that must be accessed by any single computation stage and to ensure that the memory is accessed in a predictable linear manner. Adding prefetch instructions can also help to hide memory latency, though this is somewhat less important on Pentium 4 processors because the automatic hardware prefetch mechanism often works quite well for linear access patterns.
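The sketch below illustrates both ideas under stated assumptions: a structure-of-arrays layout, so that a stage streams only the field it needs, and an optional software prefetch a fixed distance ahead. The `PointArrays` type, the `count_near` stage, and the prefetch distance of 64 elements are hypothetical and chosen only for illustration. `__builtin_prefetch` is a GCC/Clang builtin; on other compilers the macro expands to nothing.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch((addr), 0, 0)  /* read, low reuse */
#else
#define PREFETCH(addr) ((void)0)                         /* no-op fallback */
#endif

/* Assumed prefetch distance in elements; tune by measurement. */
enum { PREFETCH_AHEAD = 64 };

/* Hypothetical structure-of-arrays layout: each field lives in its own
   contiguous array, so a stage that needs only depth never pulls color
   data into the cache. */
typedef struct {
    const float    *depth;
    const unsigned *color;
    size_t          count;
} PointArrays;

/* Example stage: count points nearer than a threshold. It touches only
   depth[], in strictly increasing address order. */
static size_t count_near(const PointArrays *p, float threshold)
{
    size_t near = 0;
    for (size_t i = 0; i < p->count; i++) {
        /* Hint at data we will need soon; the bounds check keeps the
           address inside the array. */
        if (i + PREFETCH_AHEAD < p->count)
            PREFETCH(&p->depth[i + PREFETCH_AHEAD]);

        near += (size_t)(p->depth[i] < threshold);
    }
    return near;
}
```

For a purely linear scan like this one, a hardware prefetcher will often make the explicit hint redundant, which is consistent with the observation above about Pentium 4 processors.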
4. Results
Timings for the render cache to generate one frame at 512x512 on a 1.7 GHz Pentium 4 machine are shown in Table 1. Despite the fact that we have added additional computation stages and are using images with four times as many pixels, the frame time is slightly faster than the original results reported in [10]. We estimate that roughly half the speedup comes from using a faster processor and half from the SIMD and other optimizations that we have applied.