Tuesday, January 26, 2010

Example of Cache Coherency

In my almost popular article The State of Ray Tracing in Games, I've frequently mentioned the coherency of rasterization being much better than ray tracing, which is pretty incoherent in many respects. I think programmers often address algorithms in a very high-level manner focusing on the big-O, which is important, but I wanted to present how coherency can directly and significantly affect run-time performance.

First, what is coherency? Coherency is defined by one dictionary as being logically or aesthetically ordered or integrated. The word comes up frequently when someone hears a disorganized speech or discussion, where the speaker jumps back and forth between topics. No one sentence or idea is wrong per se, but as a listener it's easier to bundle ideas together on a topic instead of having to jump around.

Cache coherency diagram I stole from wikipedia



Processors are incredibly complex now a days, but generally speaking there are registers which do the operations, the memory which holds the data and the program, and the cache, which sits in between. One purpose of the cache is to provide a buffer of surrounding data, so I don't have to go all the way back to system memory to get it. That trip might seem relatively fast, but it's pretty slow compared to just looking in the cache.

The Diligent Secretary Example
Imagine I was typing at my office desk and wanted to know the names on a numbered list down the hall. I send my secretary down the hall to pull the name off the list and come back with it. I use the name. I move on to the next name. This trip down the hall is relatively slow since I'm waiting for the secretary to come back. It doesn't make sense for him/her to go all the way down and come back with just one name--even if at that moment I just want one name. My secretary being the witty person that he/she is, goes and writes down ten names on this long list and comes back. I ask for the requested name and he/she gives it to me. I ask for the next name and the secretary instantly gives it to me without going down the hall again. This happens eight more times before he/she has to go down the hall to get more names off the list, but he/she has saved both of us lots of time just by grabbing the names around what I wanted. This is coherency. Because I was asking for names in order down the list, my secretary could easily cache extra names to save the trip down there. It didn't work every time, but when it did it was significantly faster.

Now, what happens if I asked for the names in an incoherent manner? My secretary comes back with the first group of names. I ask for name #3, which he/she quickly gives me. I then ask for name #47. My secretary sees he/she only has 1-10 on his/her list, so he/she needs to make another trip. This time my secretary remembers the names 40-49 and comes back. I then ask him/her for number #212. Once again the secretary has to go back to the list down the hall. I'm not asking anything more complex. I'm just asking in a seemingly incoherent order. This same phenomenon is easily presentable in a real-world example.

Real Example
Just as the secretary grabs the numbers around what I asked for with the possibility of not making another trip down the hall, processors will often pull out blocks or pages of memory around what the program requested hoping not to make the trip to system memory again. If I ask for data close (in terms of memory layout) to what I previously requested, there's a possible chance the cache will already have it and provide it faster.

Here's a really simple python script that tries to add up a bunch of integers in a coherent and incoherent manner. In the coherent function, the numbers are added up sequentially based on their order in memory. In the incoherent function, the numbers are pulled randomly from memory. Both functions provide the exact same result, but with a noticeable difference in performance. Let's see how the two compare for different sizes of data.

import random, time

n = 10000000

# make an array of coherent and incoherent indices
coherentIndices = [number for number in xrange(0,n)]
incoherentIndices = [number for number in xrange(0,n)]
random.shuffle(incoherentIndices)

def print_timing(func):
    def wrapper(*arg):
        t1 = time.time()
        res = func(*arg)
        t2 = time.time()
        print '%s took %0.3f ms' % (func.func_name, (t2-t1)*1000.0)
        return res
    return wrapper

# create some random values
values = []
for i in xrange(0,n):
    values.append(random.randint(-10,10))


# add up all the numbers coherently
@print_timing
def coherentAdding():
    total = 0
    for i in xrange(0,n):
        total += values[coherentIndices[i]]

# add up all the numbers incoherently
@print_timing
def incoherentAdding():
    total = 0
    for i in xrange(0,n):
        total += values[incoherentIndices[i]]

coherentAdding()
incoherentAdding()

Results
48e80eea-0aa4-11df-ae8a-000255111976 Blog_this_caption

With a quick graph using Many Eyes, you can see the performance differences. For small numbers, the performance between the coherent and incoherent data is neglible. In the secretary example, this would be the equivalent of the secretary having all the names in his/her head so regardless of the order I ask them, no extra trips down the hall are necessary.

As the problem scales, however, the memory fetching becomes less and less efficient. For the largest test I did, incoherent access is almost 13 seconds longer than the 8 seconds it took for coherent access. In this simple example where both functions are running at O(n), it's easy to see coherency makes a huge difference.

Relevance to Ray Tracing
From a memory perspective, ray tracing is incredibly incoherent. Firing a ray from the camera to shade a pixel, the next ray over might hit a different geometry requiring different textures and instructions to be fetched. That's bad geometry and texture coherency. Plus ray tracing is touted for it's secondary effects like reflections and refractions. These rays are extremely incoherent. One ray might self-intersect the object being shaded while another ray next to it might fly off into the distant corners of the scene. Rasterization on the other hand is extremely coherent. A model is pulled in once for every pass and thrown out again. When rasterizing textured surfaces, the texture cache can grab large chunks of texture because there's good chance the next fragment being rendered is going to be in that cache. In fact the biggest reason rasterization is probably so much faster than ray tracing is its coherency. It makes hardware extremely easy and cheap to build.