Sunday, October 10, 2010

Incorporating Scala, Java, sbt, JOGL, Qt, and Ruby/Python

For several years now, I've been iterating on a small project that lets the user build a Sunflow scene file. When I started, I tried to pick the right technologies for the task.

The language of choice
First, Sunflow is written in Java and so runs on the JVM. I don't just want to build a Sunflow file; I also want to render it interactively using Sunflow's libraries. There are many languages available on the JVM right now, but only a few sparked my interest. I had done everything in Java previously, which was okay but at times a little tedious. The languages I looked at as possible replacements were Scala and Clojure. Although I was very interested in Clojure, I thought at the time that it was too foreign to me and that I would end up doing something extremely naive in my design.

Scala is a statically typed language that at times closely matches Java. It has a large type system and a collections library whose data structures come in both mutable and immutable flavors, so you can pick whichever suits what you're doing (functional programming, for example, would probably favor the immutable structures). Scala treats its functions as first-class citizens, so they're easily passed to other functions. I can even write large anonymous functions that fit right inside another function call. Its syntax is also slightly less verbose than Java's, and it provides an excellent getter/setter infrastructure that's extremely clean.

Scala isn't all roses, however. First, Scala seems to be growing faster than my app: as I tried to stay up to date with the latest Scala features, new releases often broke compatibility in significant ways, and fixing that has been a bit of a distraction. Second, some Scala code is illegible to me. The language has a complicated type system and sometimes uses odd symbols that seem to be pulled from esoteric languages for their own sake instead of aiming for a simpler style. My own Scala is probably pretty Java-like, so it's fairly easy to read.

Building with Scala
Another advantage of Scala is that it can be compiled alongside Java in the same project. I was previously using Ant, which was my first foray into a Java build system. I don't like working in XML for a myriad of reasons, and after the Ant Scala compiler task started throwing bogus errors, I was advised to move over to simple-build-tool, or sbt.

After adding a launcher to my class path, sbt let me quickly set up a new project whose project file is written in Scala, which I much prefer for configuring my app. I simply put all my Scala code in ./src/scala and all my Java code in ./src/java, and sbt combines them. Although it took a moderate amount of time to set up my project, in the end it's been a joy to work with sbt.

JOGL
OpenGL rendering on the JVM requires a binding such as JOGL. This can be tedious to set up, as I'm used to just tossing a jar onto the classpath and having my build system include it. JOGL also requires native libraries, which have to be added separately to the Java library path. This took a while for me to figure out, but it ended up being a simple setup in the end.

import sbt._
import java.io.File

class SunshineProject(info:ProjectInfo) extends DefaultProject(info)
{
  // tells sbt which class to run
  override def mainClass:Option[String] = Some("com.googlecode.sunshine.Sunshine")

  def nativeJOGL:String = {
    var os = System.getProperty("os.name").toLowerCase
    var arch = System.getProperty("os.arch").toLowerCase

    // this is to match the JOGL builds
    if (arch.matches("i386")) arch = "i586"

    if (os.contains("windows")) {
      os = "windows"
      arch = "i586"
    }
    println("OS: %s".format(os))
    println("JOGL Path: %s".format("./lib/jogl-2.0-%s-%s".format(os, arch)))

    "./lib/jogl-2.0-%s-%s".format(os, arch)
  }

  // forking lets sbt start a new JVM with the JOGL natives on java.library.path
  override def fork = forkRun("-Djava.library.path=%s".format(nativeJOGL) :: Nil)
}


All this does is query the OS and architecture from the JVM and add the matching JOGL directory to the library path. This required me to fork the JVM, as adding paths to the already-running JVM doesn't seem to work. Notice that I only had to override fork and sbt knows I want to fork; overriding the run command alone won't fork the JVM at startup. I did something similar in Ant, but it was much longer and more difficult to read.
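To make the mapping concrete, here is the same decision logic as a small Python sketch that can be run without sbt or a JVM. The function name and the sample os.name/os.arch values are my own illustrative assumptions; only the directory pattern and the two special cases come from the build class above.

```python
def native_jogl_dir(os_name, os_arch):
    """Return the JOGL native-library directory for an os.name / os.arch pair."""
    os_key = os_name.lower()
    arch = os_arch.lower()

    # The 32-bit JOGL builds are labelled i586, so map i386 onto that name.
    if arch == "i386":
        arch = "i586"

    # Any Windows variant collapses to the single 32-bit Windows build.
    if "windows" in os_key:
        os_key = "windows"
        arch = "i586"

    return "./lib/jogl-2.0-%s-%s" % (os_key, arch)


if __name__ == "__main__":
    # Sample inputs are assumptions about what a JVM might report.
    for pair in [("Linux", "i386"), ("Linux", "amd64"), ("Windows 7", "amd64")]:
        print(pair, "->", native_jogl_dir(*pair))
```

Anything other than Windows passes through lowercased, so an OS name containing spaces would end up in the directory name unchanged.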

Using Qt
Java Swing is usually an okay GUI library for my small Java projects, but it's slightly cumbersome when trying to do something complex. Qt, originally from Trolltech, is a very popular framework that has steadily gained users over the years thanks to its great documentation, intuitive API, and event handling system.

QtJambi is the Java binding for the Qt libraries. A few months ago Nokia, which acquired Trolltech, dropped official support for QtJambi and handed it off to the community to keep updating. So far the community seems to be doing a decent job, although they have been continually asking for help.

sbt supports automatic library management via Apache Ivy. Instead of shipping every build of Qt for every architecture, I can set up Qt as a managed library. By pointing sbt at the QtJambi Maven repository, I can have sbt fetch the jars for me automatically.

val qtDepSnapshots = "Qt Maven2 Snapshots Repository" at "http://qtjambi.sourceforge.net/maven2/"
val qtDep = "net.sf.qtjambi" % "qtjambi" % "4.5.2_01"
val qtjambiBase = "net.sf.qtjambi" % "qtjambi-base-linux32" % "4.5.2_01"
val qtjambiPlatform = "net.sf.qtjambi" % "qtjambi-platform-linux32" % "4.5.2_01"

Right now, I'm only fetching the Linux x86 libraries, as that's what I'm working on. Adding the block above directly into my build class tells sbt to grab them for me. There's a bit of magic going on here, as I don't understand how sbt knows whether these vals are library dependencies or just values I've created in my program. Regardless, it's enough to get Qt downloaded and onto the classpath.
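My best understanding of the magic, which is worth checking against the sbt documentation, is that sbt looks over the members of the project class by reflection and sorts them by type: the % operator builds a module ID and at builds a repository, so plain strings are never mistaken for dependencies. The Python sketch below only illustrates that idea. ModuleID, Resolver, SunshineProject and members_of_type are hypothetical stand-ins written for this example, not sbt's real classes.

```python
from collections import namedtuple

# Hypothetical stand-ins for the values that % and at would build.
ModuleID = namedtuple("ModuleID", "organization name revision")
Resolver = namedtuple("Resolver", "label url")


class SunshineProject:
    # A plain string: ignored when looking for dependencies.
    main_class = "com.googlecode.sunshine.Sunshine"
    qt_repo = Resolver("Qt Maven2 Snapshots Repository",
                       "http://qtjambi.sourceforge.net/maven2/")
    qt_dep = ModuleID("net.sf.qtjambi", "qtjambi", "4.5.2_01")
    qtjambi_base = ModuleID("net.sf.qtjambi", "qtjambi-base-linux32", "4.5.2_01")


def members_of_type(project_class, wanted_type):
    """Collect class members by type, the way a reflective scan would."""
    return [v for v in vars(project_class).values() if isinstance(v, wanted_type)]


if __name__ == "__main__":
    for dep in members_of_type(SunshineProject, ModuleID):
        print("dependency:", dep.organization, dep.name, dep.revision)
    for repo in members_of_type(SunshineProject, Resolver):
        print("repository:", repo.label)
```

The point of the sketch is that the member's type, not its name, is what marks it as a dependency.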

Incorporating a scripting language
At this point, I could just start coding my application, but since I'm doing a lot of designing on the fly, recompiling the application just to see a small change takes a long time. I wanted to incorporate a scripting language that lets me change the interface quickly without exiting the program.

As much as I complain about Python, I use it a lot at work and am fairly productive in it. The Jython project provides a Python implementation that runs on the JVM. I used it about a year ago but had to scrap it because its latency was too high. I'd heard it had gotten significantly faster since, so I gave it another shot and found it to be much faster. Working with Qt, however, turned up a blocking bug.

[error] Exception caught after invoking slot
[error] Traceback (most recent call last):
[error]   File "", line 15, in 
[error]   File "", line 9, in __init__
[error] TypeError: Proxy instance reused


I initially wondered whether working with Qt from a scripting language was simply not stable, but this seems to be only a Jython issue. I reported it to the Jython team, who seem to be resolving it for their upcoming build.

In the meantime, I thought I would give Ruby a spin. I'm rather unfamiliar with the Ruby language, although I haven't been oblivious to the huge success it has garnered in the web community with Ruby on Rails. I've also heard great things about JRuby, the Ruby implementation on the JVM. At one point it was reportedly faster than the standard C implementation of Ruby, although I'm not sure that's still true.

The Ruby language seems to be something between Python and Perl, although that's another comment that will possibly get me shot by a Ruby developer. It's a purely object-oriented language (more so than Python) and provides a bit more syntactic flexibility than Python (for better or for worse), including some parsing syntax from Perl.

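import java.io.InputStreamReader
import javax.script.ScriptEngineManager

// look up the JRuby engine through the standard javax.script API
val factory = new ScriptEngineManager()
val ruby = factory.getEngineByName("jruby")

// load the script from the classpath and evaluate it
val urlFile = "clear_scene.rb"
val url = getClass().getResource(urlFile)

ruby.eval(new InputStreamReader(url.openStream()))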

From my main Scala code, I open a Qt main window and then hand things over to JRuby. My Ruby code clears out the window and fills the UI programmatically. I can add menus and event handlers as I go, and when I'm ready to see a change, I simply restart the Ruby evaluator and my UI is rebuilt instantly with no recompilation. I notice no latency at all when repeatedly reloading my Ruby code and watching the interface change on the fly. I plan to do most of my designing in Ruby and gradually move the classes I create over to the Scala side for a performance boost once they become stable.

I like to keep one terminal open running sbt with the "~copy-resources" action, which continually copies my Ruby changes over to the build path, while the other terminal compiles and runs my app as I go through the code changes in Scala and Ruby.

I've just barely started with it, but I've been enjoying Ruby for the most part. A Ruby developer would probably say I'm programming like a Python programmer (a Scala developer would probably say my Scala looks like Java). Anyway, I haven't hit any roadblocks importing the various Qt or Scala classes besides two minor inconveniences.

Scala hashmaps
The classes I've built in Scala are compiled to bytecode, and I've been able to use them from Java and JRuby without any problem. However, I use a few hashmaps in Scala and I'd like to be able to iterate through them in my Ruby code. JRuby provides hooks for iterating through Java collections, but not Scala ones, so I made a simple wrapper for my Scala hashmaps so I can iterate through them in Ruby normally.

class ScalaHashMap
  def initialize(hash)
    @hash = hash
  end

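  # Assumes the keys are the integers 0...size. Scala's get returns an
  # Option, so the trailing .get unwraps the stored value.
  def each
    @hash.size().times{|key| yield key,@hash.get(key).get}
  end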
end

Getting QtJambi to see the JOGL Context
JOGL provides some useful widgets that tie directly into Java Swing. No such widgets exist for Qt, so it was slightly confusing trying to find documentation on getting a GL3 context from JOGL while inside a QGLWidget.

class PanelGL < QGLWidget   
  def initialize(parent = nil)     
    super     
    @camera = Register.cameras.get(0).get      
    profile = GLProfile.get(GLProfile::GL3)     
    glCaps = GLCapabilities.new(profile)     
    glCaps.setPBuffer true     
    @pBuffer = GLDrawableFactory.getFactory(profile).createGLPbuffer(
      glCaps, DefaultGLCapabilitiesChooser.new(), 1, 1, nil)
  end   
  attr_reader :camera    

  def initializeGL     
    @ctx = @pBuffer.getContext()   
  end   
  attr_reader :gl    

  def resizeGL(width,height)     
    @ctx.makeCurrent() # this line isn't required for the JOGL Swing components     
    gl = @pBuffer.getContext().getGL.getGL3     
    gl.glViewport(0,0,width,height)   
  end    

  def paintGL     
    @gl = @pBuffer.getContext().getGL.getGL3     
    @gl.glClearColor(0.3, 0.3, 0.3, 0.0)     
    @gl.glClear(GL.GL_COLOR_BUFFER_BIT | GL.GL_DEPTH_BUFFER_BIT)                  
    @gl.glEnable(GL.GL_DEPTH_TEST)      

    ...      

    @gl.glDisable(GL.GL_DEPTH_TEST)   
  end 
end 

I first create a pBuffer so I can use its context across GL widgets and share things like display lists and VBOs. The big difference in the Qt code is having to call makeCurrent() in the first GL callback that my program executes, which happens to be resizeGL. Calling makeCurrent() there makes the GL context current on the thread the Qt GUI is running on (or vice versa, I guess).

Conclusion
That's my basic setup and now I've got a lot of coding to do.

Tuesday, January 26, 2010

Example of Cache Coherency

In my almost popular article The State of Ray Tracing in Games, I frequently mentioned that rasterization is much more coherent than ray tracing, which is pretty incoherent in many respects. I think programmers often look at algorithms at a very high level, focusing on big-O, which is important, but I wanted to show how coherency can directly and significantly affect run-time performance.

First, what is coherency? Coherency is defined by one dictionary as being logically or aesthetically ordered or integrated. The word comes up frequently when someone hears a disorganized speech or discussion, where the speaker jumps back and forth between topics. No one sentence or idea is wrong per se, but as a listener it's easier to bundle ideas together on a topic instead of having to jump around.

Cache coherency diagram I stole from wikipedia



Processors are incredibly complex nowadays, but generally speaking there are the registers, which hold the values being operated on; the memory, which holds the data and the program; and the cache, which sits in between. One purpose of the cache is to keep a buffer of surrounding data so the processor doesn't have to go all the way back to system memory to get it. That trip might seem relatively fast, but it's pretty slow compared to just looking in the cache.

The Diligent Secretary Example
Imagine I'm typing at my office desk and want to know the names on a numbered list down the hall. I send my secretary down the hall to pull a name off the list and come back with it. I use the name. I move on to the next name. This trip down the hall is relatively slow, since I'm waiting for the secretary to come back. It doesn't make sense for him/her to go all the way down and come back with just one name, even if at that moment I only want one name. My secretary, being the clever person that he/she is, writes down ten names from this long list and comes back. I ask for the requested name and he/she gives it to me. I ask for the next name and the secretary gives it to me instantly without going down the hall again. This happens eight more times before he/she has to go down the hall to get more names off the list, but he/she has saved both of us lots of time just by grabbing the names around what I wanted. This is coherency. Because I was asking for names in order down the list, my secretary could easily cache extra names to save the trip down there. It didn't work every time, but when it did it was significantly faster.

Now, what happens if I ask for the names in an incoherent manner? My secretary comes back with the first group of names. I ask for name #3, which he/she quickly gives me. I then ask for name #47. My secretary sees he/she only has 1-10 on his/her list, so he/she needs to make another trip. This time my secretary writes down names 40-49 and comes back. I then ask him/her for name #212. Once again the secretary has to go back to the list down the hall. I'm not asking for anything more complex. I'm just asking in a seemingly incoherent order. The same phenomenon is easy to demonstrate with a real example.
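The secretary's trips can be counted with a toy model. This is only a sketch under simplifying assumptions of my own: the secretary holds a single sheet of ten names at a time, and the sheets are aligned as 1-10, 11-20, and so on. A real cache holds many blocks at once, so treat the numbers as an illustration of the idea rather than a model of hardware.

```python
def count_trips(requests, block_size=10):
    """Count trips down the hall for a sequence of 1-based name numbers."""
    trips = 0
    held_block = None  # which sheet of names the secretary currently holds
    for number in requests:
        block = (number - 1) // block_size
        if block != held_block:
            # The requested name is not on the sheet in hand: walk down the hall.
            trips += 1
            held_block = block
    return trips


coherent = list(range(1, 101))  # names 1..100, asked for in order
incoherent = [3, 47, 212]       # the jumpy requests from the story

print(count_trips(coherent))    # 10 trips for 100 names
print(count_trips(incoherent))  # 3 trips for only 3 names
```

In order, each trip pays for ten names; out of order, every single request costs a trip.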

Real Example
Just as the secretary grabs the names around the one I asked for in the hope of not making another trip down the hall, processors pull in whole blocks of memory around what the program requested, hoping to avoid the trip to system memory again. (Hardware designers usually call this spatial locality.) If I ask for data close, in terms of memory layout, to what I previously requested, there's a good chance the cache will already have it and can provide it faster.

Here's a really simple Python script that adds up a bunch of integers in a coherent and an incoherent manner. In the coherent function, the numbers are added up sequentially based on their order in memory. In the incoherent function, the numbers are pulled randomly from memory. Both functions produce the exact same result, but with a noticeable difference in performance. Let's see how the two compare for different sizes of data.

import random, time

n = 10000000

# make an array of coherent and incoherent indices
coherentIndices = list(range(n))
incoherentIndices = list(range(n))
random.shuffle(incoherentIndices)

# decorator that prints how long the wrapped function took
def print_timing(func):
    def wrapper(*arg):
        t1 = time.time()
        res = func(*arg)
        t2 = time.time()
        print('%s took %0.3f ms' % (func.__name__, (t2-t1)*1000.0))
        return res
    return wrapper

# create some random values
values = []
for i in range(n):
    values.append(random.randint(-10,10))


# add up all the numbers coherently
@print_timing
def coherentAdding():
    total = 0
    for i in range(n):
        total += values[coherentIndices[i]]
    return total

# add up all the numbers incoherently
@print_timing
def incoherentAdding():
    total = 0
    for i in range(n):
        total += values[incoherentIndices[i]]
    return total

coherentAdding()
incoherentAdding()

Results

With a quick graph made in Many Eyes, you can see the performance differences. For small numbers, the performance difference between coherent and incoherent access is negligible. In the secretary example, this would be the equivalent of the secretary having all the names in his/her head, so regardless of the order I ask for them, no extra trips down the hall are necessary.

As the problem scales, however, the memory fetching becomes less and less efficient. For the largest test I did, incoherent access took almost 13 seconds longer than the 8 seconds coherent access needed, so roughly 21 seconds against 8, or about 2.6 times as long. In this simple example, where both functions run in O(n), it's easy to see that coherency makes a huge difference.
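To collect timings at several sizes without editing n by hand, the experiment can be wrapped in a function. This is a minimal sketch rather than the script that produced the graph: the helper names, the fixed seed and the list of sizes are my own choices, and the timings will vary from machine to machine.

```python
import random
import time


def timed_sum(values, indices):
    """Sum values in the order given by indices; return (total, seconds)."""
    start = time.perf_counter()
    total = 0
    for i in indices:
        total += values[i]
    return total, time.perf_counter() - start


def compare(n, seed=0):
    """Time an in-order and a shuffled pass over the same n values."""
    rng = random.Random(seed)
    values = [rng.randint(-10, 10) for _ in range(n)]
    in_order = list(range(n))
    shuffled = in_order[:]
    rng.shuffle(shuffled)

    coherent_total, coherent_s = timed_sum(values, in_order)
    incoherent_total, incoherent_s = timed_sum(values, shuffled)
    # Same numbers, different visiting order: the sums must agree.
    assert coherent_total == incoherent_total
    return coherent_s, incoherent_s


if __name__ == "__main__":
    for n in (10**3, 10**4, 10**5, 10**6):
        coherent_s, incoherent_s = compare(n)
        print("n=%8d  coherent %.4fs  incoherent %.4fs" % (n, coherent_s, incoherent_s))
```

Iterating directly over the index list, instead of indexing it inside the loop, keeps the measurement focused on the lookups into values.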

Relevance to Ray Tracing
From a memory perspective, ray tracing is incredibly incoherent. When a ray is fired from the camera to shade a pixel, the next ray over might hit different geometry, requiring different textures and instructions to be fetched. That's bad geometry and texture coherency. Plus, ray tracing is touted for its secondary effects like reflections and refractions, and those rays are extremely incoherent. One ray might self-intersect the object being shaded while the ray next to it flies off into a distant corner of the scene. Rasterization, on the other hand, is extremely coherent. A model is pulled in once for every pass and thrown out again. When rasterizing textured surfaces, the texture cache can grab large chunks of texture because there's a good chance the next fragment being rendered will be in that cache. In fact, coherency is probably the biggest reason rasterization is so much faster than ray tracing. It makes the hardware extremely easy and cheap to build.