SLI Testing

I was recently looking to upgrade my GPU from my current 9-series card (a GTX 970 G1 Gaming edition) to a 10-series card (namely the 1080). The main reason was to get higher rendering speeds in raymarching applications such as Fragtool or tools I’ve built myself. A 10-series card would have provided a clock boost of around 500-600 MHz, given my 970 clocks at around 1200 MHz and the 1080 runs at up to 1800 MHz (both non-overclocked). I decided to take the opportunity to explore the SLI avenue instead, given the 1080 is so hard to come by right now in Australia and since SLI has been a grey zone in the TD community.

My thinking was that at a modest 50% performance scaling I could possibly match or outperform the 1080, and then possibly get much higher speeds if scaling exceeded 50%. I did quite a bit of reading on SLI and decided to take the leap. The overall results were grim at best for SLI. I’d expected that for some rendering environments, but certainly not for all of them, which seems to be the final verdict. In fact a lot of the research I did afterwards indicated this is the case with pretty much all visual programming tools, including Unity and Unreal Engine.
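The break-even reasoning can be written out as a quick back-of-envelope calculation. This is only a sketch: it uses the approximate clock figures quoted above and treats clock speed as a stand-in for rendering throughput, which is a big simplification (it ignores core counts, memory bandwidth and so on). The function name `sli_relative_speed` is made up for illustration.

```python
# Approximate, non-overclocked clocks as quoted in the post.
GTX_970_MHZ = 1200
GTX_1080_MHZ = 1800


def sli_relative_speed(scaling: float, cards: int = 2) -> float:
    """Throughput relative to one card if each extra card adds `scaling` of a card."""
    return 1.0 + scaling * (cards - 1)


# How much faster a single 1080 is than a single 970, by clock alone.
single_1080 = GTX_1080_MHZ / GTX_970_MHZ

# Two 970s at the "modest" 50% scaling.
dual_970_at_50 = sli_relative_speed(0.50)

# Scaling needed for two 970s to match one 1080 under this crude model.
break_even_scaling = single_1080 - 1.0

print(f"1080 vs 970:           {single_1080:.2f}x")
print(f"2x 970 at 50% scaling: {dual_970_at_50:.2f}x")
print(f"break-even scaling:    {break_even_scaling:.0%}")
```

Under these assumptions 50% scaling is exactly the break-even point, so anything less than that leaves the single 1080 ahead.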

How SLI works (the rough idea version):

SLI is, as you’ve probably guessed, a way for two or more GPUs to combine their processing power. Some applications apparently boast 200% ‘scaling’ of performance, however this is probably an unrealistic outcome given the ‘law of diminishing returns’. For SLI to work, the primary card must send and receive information to and from its slaves through the SLI bridge, which is an added overhead that will of course reduce the scaling, depending on the configuration and the data that needs to be sent to each slave GPU. It cannot scale memory, nor can the primary card access the GPUs of its slaves.
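To put a rough number on the diminishing returns, here is a minimal sketch using Amdahl's law. That is a generic parallel-speedup model, not something taken from NVIDIA's documentation. The `serial_fraction` parameter is an assumption standing in for whatever share of each frame cannot be split across cards (bridge transfers, synchronisation and so on), and the values tried are invented for illustration.

```python
def amdahl_speedup(serial_fraction: float, gpus: int) -> float:
    """Ideal speedup over one GPU when `serial_fraction` of the work cannot be split."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / gpus)


# Hypothetical serial fractions, chosen only to show the trend.
for serial_fraction in (0.0, 0.1, 0.25, 0.5):
    speedup = amdahl_speedup(serial_fraction, gpus=2)
    print(f"serial share {serial_fraction:.0%}: 2 GPUs -> {speedup:.2f}x")
```

With nothing serial, two cards give the advertised 2x (the ‘200%’ figure). Once half of each frame is serial, the model tops out at about 1.33x.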

SLI is handled completely by the Nvidia driver (unless you have an NDA version of their SDK) and there doesn’t appear to be a lot of information or techniques readily available to developers in order to maximise the benefit of it. In fact most companies will hand their system over to Nvidia so they can do the profiling for them.

In systems such as TouchDesigner and other visual programming tools, it would be pretty much impossible to create a universal SLI profile, given the user can implement many different rendering techniques, which would work in some scenarios and lag considerably in others. Hopefully I can clearly explain some of those scenarios below.

Alternate Frame Rendering (AFR):

Alternate frame rendering is generally the most scalable option in the SLI family. The idea is that (as the name suggests) the GPUs take turns rendering frames. However it does come at a cost, given the CPU needs to update both GPUs with things like geometry, textures and other data needed in the render call. I believe that in TouchDesigner one of the benefits of putting your objects inside Geos is that the GPU will kind of ‘cache’ your geometry so your CPU doesn’t need to push it through each frame. From what I can gather, that benefit disappears once an AFR scenario is established.

For my test case I rendered several thousand instanced high resolution spheres (see attached). I figured this would provide the smallest data footprint that would still bottleneck the GPU, given the transfer would be a single geometry element plus the array data fed into the instancing. I used 4 render passes (same geometry with different cameras). A second scenario used the same setup with two additional Render TOPs. In both cases AFR 1 and AFR 2 either kept the same frame rate or dropped it by around 15%, and each GPU load was at around 97%. I was kind of hoping that by having several Render TOPs / render passes I could give each CPU cycle several rendering frames in order to give AFR a chance to kick in. No such luck. Quite often this reduced the frame rate considerably, got both of my GPUs screaming, and that’s about it. I also tried some scenarios with some of my other patches to investigate things like multiple transparency passes, heavy and light texture loads etc. Same outcome nonetheless (actually more often than not a lot slower).
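One way to picture why AFR can end up no faster, or even slower, is a toy throughput model. This is a sketch under stated assumptions, not a description of how the Nvidia driver actually schedules frames: CPU-side work (`cpu_ms` plus a per-frame `upload_ms` for re-sending data to the cards) is treated as serial, while GPU work overlaps across the cards. All names and numbers are hypothetical.

```python
def afr_frame_interval_ms(cpu_ms: float, upload_ms: float, gpu_ms: float, gpus: int) -> float:
    """Steady-state time between finished frames in a toy AFR pipeline.

    The CPU side is serial (cpu_ms + upload_ms per frame); the GPU side is
    shared by `gpus` cards taking turns, so it contributes gpu_ms / gpus.
    The slower of the two sides sets the frame rate.
    """
    return max(cpu_ms + upload_ms, gpu_ms / gpus)


def fps(interval_ms: float) -> float:
    return 1000.0 / interval_ms


# Hypothetical numbers: a GPU-bound frame with geometry cached on one card.
single = afr_frame_interval_ms(cpu_ms=2.0, upload_ms=0.0, gpu_ms=20.0, gpus=1)

# AFR where nothing extra has to be re-sent each frame (the ideal case).
afr_ideal = afr_frame_interval_ms(cpu_ms=2.0, upload_ms=0.0, gpu_ms=20.0, gpus=2)

# AFR where the cached data must be pushed to the cards every frame.
afr_costly = afr_frame_interval_ms(cpu_ms=2.0, upload_ms=20.0, gpu_ms=20.0, gpus=2)

for label, interval in (("single GPU", single), ("AFR ideal", afr_ideal), ("AFR costly", afr_costly)):
    print(f"{label:11s} {interval:5.1f} ms/frame  {fps(interval):5.1f} fps")
```

In this model AFR only doubles the frame rate when the per-frame upload is negligible. As soon as the serial side exceeds the single-card GPU time, the two-card setup is slower than one card.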

Split Frame Rendering (SFR):

Actually it was this method that I was really pinning my hopes on for ray marching applications such as Fragtool. Ray marching requires very little information to be passed to the GPU - feed it a quad and a relatively small pixel shader and let the GPU run the millions of calculations on each frame. SFR works by splitting the frame into a top and a bottom section: the primary GPU buses the render data to the second and collects its half after doing its own bit in parallel. The results were often confusing. The top section would render as expected and often show almost a doubling of performance, but the bottom section would generally cache the original image and strobe it significantly. I can’t help thinking that double buffering (or window draw calls) was leading SLI to work only on the perform mode window and not the internal rendering chain, but the GPU loads seem to show that was not necessarily the case. It is still not clear to me whether SFR is available in OpenGL or whether it’s a DirectX-only thing. Nvidia won’t let you enable it through their control panel, so Nvidia Inspector is the only profiler I’ve found that will let you enable it.
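A minimal CPU-side sketch of why ray marching looks like a natural fit for SFR: every pixel is computed from its own coordinates alone, so the top and bottom of the frame can be rendered independently and simply stitched together. This is plain Python rather than a pixel shader, it is not Fragtool's code, and the scene (one unit sphere, camera at z = -3) and all function names are made up for illustration.

```python
import math

WIDTH, HEIGHT = 16, 16


def sphere_sdf(x: float, y: float, z: float, radius: float = 1.0) -> float:
    """Signed distance from (x, y, z) to a sphere centred on the origin."""
    return math.sqrt(x * x + y * y + z * z) - radius


def march_pixel(u: float, v: float, max_steps: int = 64, eps: float = 1e-3, far: float = 10.0) -> int:
    """Sphere-trace one ray from a camera at (0, 0, -3); return 1 on a hit, else 0."""
    length = math.sqrt(u * u + v * v + 1.0)
    dx, dy, dz = u / length, v / length, 1.0 / length
    t = 0.0
    for _ in range(max_steps):
        distance = sphere_sdf(dx * t, dy * t, dz * t - 3.0)
        if distance < eps:
            return 1
        t += distance  # safe step: nothing is closer than `distance`
        if t > far:
            break
    return 0


def render_rows(row_start: int, row_end: int) -> list:
    """Render rows [row_start, row_end); each pixel depends only on its own (u, v)."""
    rows = []
    for y in range(row_start, row_end):
        v = (y + 0.5) / HEIGHT * 2.0 - 1.0
        rows.append([march_pixel((x + 0.5) / WIDTH * 2.0 - 1.0, v) for x in range(WIDTH)])
    return rows


full_frame = render_rows(0, HEIGHT)
top_half = render_rows(0, HEIGHT // 2)          # what "GPU 0" would render
bottom_half = render_rows(HEIGHT // 2, HEIGHT)  # what "GPU 1" would render

# Stitching the two halves reproduces the single-pass frame exactly.
assert top_half + bottom_half == full_frame

for row in full_frame:
    print("".join("#" if hit else "." for hit in row))
```

The only input each half needs is the row range, which is why so little data should have to cross the bridge in principle. The sketch says nothing about what the driver does with the application's buffers, which is where the strobing described above would come from.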

SLI Anti Aliasing:

This is usually the safe fallback for applications that are not SLI optimised / profiled. Anti-aliasing is a relatively heavy operation and can chew up to 40% of your GPU depending on the method and amount applied (2x, 4x, 8x, 16x, 32x etc). I used my original AFR test .toe with different AA scenarios, including render level anti-aliasing (varied levels) and post anti-aliasing (using the Anti Alias TOP). Pretty much the same result again, in that no benefit was achieved, even when AA was overridden by the Nvidia driver (sigh). Apparently SLI AA only really works on the final draw pass. So I disabled all internal AA and used Nvidia’s override variants, and saw no improvement in the overall image quality.

Conclusion

I don’t know if I’m done testing yet. I’m liable to be stubborn and plod along trying different scenarios, and I’ve already tried out oodles of profile configurations using Nvidia Inspector. But realistically the process is super tedious, and finding one magic bullet configuration would not be super useful to someone like myself who might use many varying render scenarios in a VJ software application etc. SLI seems pretty hit and miss in the gaming community, and even more so in the authorware realm for applications such as TouchDesigner. Derivative are certainly not out of line for not supporting it.

It was a pretty expensive test for me, as I had to pretty much upgrade my entire computer, given my previous hardware supplier didn’t give me an SLI capable motherboard as I’d requested and socket 1150 motherboards with SLI availability just don’t exist. But I still learned a lot, and now know that buying cheaper cards with the aim of putting them in SLI for a cheap power boost in future really isn’t an option. I’d be happy to try out some other scenarios in the coming week if anyone wants to push up a file. Otherwise, I hope someone found this useful!

For anyone interested in SLI testing themselves:


  • Generally speaking, your SLI performance indicator seems to only show a result when you are in perform mode (quite often a full screen window is not enough).
  • Triple buffering X
  • Nvidia Inspector is the only way to enable features such as SFR. A lot of options are missing from the Nvidia control panel.
    sli.toe (6.59 KB)

It was my understanding that it is Nvidia that enables SLI for specific games and applications by providing a driver profile. Not sure why it’s done that way, perhaps some app-specific testing or optimization is necessary. In any case there are several hundred games they’ve enabled it for, so it can’t be that hard. Perhaps you (or Derivative) could request it for TD?

From my understanding, the way the profiling is done is that Nvidia can analyze how a game does a frame and create a profile that matches its use case. For example:

  • Render Shadow Maps
  • Render Scene
  • Do a post-process pass (motion blur etc.)

That is predictable and won’t ever change for the game.

However with TD nothing is predictable. There could be 50 TOP operations that occur, followed by a Render TOP, followed by another 25 TOP operations, followed by another Render TOP. So there is really no way to make a profile for TD.