
Write a CUDA DLL

From Derivative wiki


Overview

CUDA is a parallel computing platform and programming model developed by NVIDIA. It allows developers to use the power of GPUs for general-purpose computation, not only for graphics. More details about CUDA can be found at the CUDA Homepage.

Using Touch as a tool to program CUDA has many benefits. It lets you use all the tools Touch already has to create, load, manipulate, save and visualize the data you pass into and receive from your CUDA program. For example, suppose you want to write a CUDA program that processes audio data. Instead of writing your own code to load audio through an external API, you can simply load the audio into a CHOP and pass that data into the CUDA program. Similarly, instead of writing your own OpenGL code to visualize the output of a CUDA program, you can visualize it using the tools Touch already has.

If you are interested in programming CUDA, a good starting place is the CUDA Programming Guide.

If you just want to use a CUDA .dll that someone else wrote for Touch, you don't need to follow this guide. You can simply load the .dll in the CUDA TOP. Be warned, though, that loading a .dll is just like running an executable on your system, so make sure you only use .dll files from trusted sources.

Runtime API vs. Driver API

CUDA has two APIs that can be used: the Driver API and the Runtime API. Functions from the Driver API start with cu, while functions from the Runtime API start with cuda.

Touch uses the Runtime API, so you should also only call functions from the Runtime API.

Supported Graphics Cards

A list of graphics cards that support CUDA can be found here. Note that you will need the latest drivers to enable CUDA support.

Brands of graphics cards other than NVIDIA do NOT support CUDA.

Interface Summary

The function that runs on the GPU is called a CUDA kernel, and will be referred to as such from now on.

Touch interfaces with your code by loading a .dll you compile. The .dll will consist of a set of query functions that Touch will call so it knows what the .dll intends to do and what kind of information it needs. The .dll will also contain the CUDA kernel. Touch will then call functions in that .dll, so it can prepare the input and output data, and then call a function that tells the .dll to execute the kernel. It is up to the .dll to pass the correct information into the kernel, and ensure the kernel writes the correct information out to the given output data buffer.

A detailed description of the API used for Touch to interface with your .dll can be found in the TouchDesigner CUDA API Reference.

Compiling the sample .dll

The first step to compiling this sample project is to install Visual C++ 2005. If you don't already have it installed, Microsoft provides a free Express Edition. It may also be possible to compile a CUDA kernel .dll with Visual C++ 2003, and this article should give all the information needed to do so. Visual C++ 2008 is not currently supported by the CUDA compiler; NVIDIA says they'll add support by the end of the year.

Note: In the next step, download the 32-bit version of the CUDA Toolkit, even if you are on a 64-bit OS. There seems to be a problem with compiling a 32-bit .dll using the 64-bit CUDA Toolkit.

The second step is to install the CUDA Toolkit and the CUDA SDK. The download page also provides a graphics driver, but if you already have the latest graphics driver installed, you don't need to install it. Restart your computer after installing the SDK.

Included with Touch is a sample Microsoft Visual C++ 2005 project that creates a .dll Touch can interface with. This .dll contains a kernel that takes an input TOP, applies a monochrome operation to it, and outputs the result. The project is located in the Touch install directory under touch/CUDA/cudaTOPTemplate. Normally this will be C:/Program Files/Derivative/TouchDesignerPro.077/touch/CUDA/cudaTOPTemplate.

Note: You should copy this project somewhere else so that your changes don't get overwritten when you install a newer version of Touch.

Open the cudaTOPTemplate.sln file and press F7 to compile the .dll. If it compiles successfully, the .dll file will be created in the debug directory of your Visual Studio project. You can now open Touch, connect a CUDA TOP to the default Movie In TOP, and load the .dll you just created in the CUDA TOP's parameters. You should see a monochrome version of the image.

Troubleshooting problems with the compile

If you are unable to compile the .dll, here are a few hints. Make sure $CUDA_INC_PATH, $CUDA_BIN_PATH and $CUDA_LIB_PATH correctly point to the include, bin and lib directories where you installed the CUDA Toolkit. $NVSDKCUDA_ROOT should point to the root of where the CUDA SDK was installed. You may need to reboot your machine after installing to make sure these environment variables are accessible.

If you get an error like "Visual Studio configuration file '(null)' could not be found...", this is likely because you are building with the 64-bit CUDA Toolkit. Try installing the 32-bit toolkit instead (even if you are on a 64-bit OS). That should fix the problem.

If you have other compile errors (which may occur if you are using a different version of Visual C++), try pasting the errors into Google to see if you can find out how to fix them.

Post other issues you have to the forum.

Sample project walk-through

The template project contains two files.

cudaTOPTemplate.cpp - This is a C++ file that contains most of the query functions Touch will call.

TOPkernelTemplate.cu - This is a CUDA file that contains the CUDA kernel and one C function, tCudaExecuteKernel, which Touch calls to tell the .dll to execute the kernel.

Every C function in these files needs to be inside an extern "C" { } block; otherwise Touch won't be able to find the functions inside the .dll. Also notice that all functions are preceded by DLLEXPORT. This is necessary to export the functions from the .dll so Touch can find them (DLLEXPORT is defined as #define DLLEXPORT __declspec(dllexport)).
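Here is a minimal sketch of the export pattern described above. The #ifdef _WIN32 guard is an addition so the snippet also compiles outside Visual C++. exampleQuery is a hypothetical function, not part of the Touch API; the real tCuda* signatures are declared in TCUDA_Types.h.

```cpp
#include <cassert>

// DLLEXPORT as described above; the non-Windows branch is an assumption
// added only so this sketch compiles with other compilers.
#ifdef _WIN32
#define DLLEXPORT __declspec(dllexport)
#else
#define DLLEXPORT
#endif

// extern "C" disables C++ name mangling, so the symbol is exported under
// the plain name "exampleQuery" and can be looked up by name at load time.
extern "C"
{

// Hypothetical query function, for illustration only.
DLLEXPORT int exampleQuery()
{
    return 1;
}

} // extern "C"
```

Without extern "C", the compiler would export a mangled name such as ?exampleQuery@@YAHXZ, and a lookup by the plain name would fail.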

The cudaTOPTemplate.cpp file is compiled using the normal Visual C++ compiler.

TOPkernelTemplate.cu, however, is compiled using a special CUDA compiler called nvcc.exe, which comes with the CUDA Toolkit. To see the command that compiles this file, right-click on TOPkernelTemplate.cu in the left pane of Visual C++ and go to Properties. In the dialog that comes up, go to the Custom Build Step section. In the right-hand pane of the dialog you will see a line that starts with $CUDA_BIN_PATH/nvcc.exe. This is the command that is run to compile TOPkernelTemplate.cu.

This program does two things: it compiles the CUDA kernel, and it passes the remaining C++ code to the Visual C++ compiler. In particular, nvcc compiles the CUDA kernel itself, but tCudaExecuteKernel is ordinary C++ code that must be compiled by the Visual C++ compiler. The command line passes the -ccbin and -Xcompiler options to nvcc.exe. -ccbin tells it where to find the host C++ compiler, and -Xcompiler forwards options to that compiler.

The template file contains a lot of function declarations, but few of them are actually necessary for this simple example. The rest are just placeholders and don't need to be declared for the .dll to work.

When the OP first loads the .dll, it will call the tCudaGetAPIVersion() function. This function tells the OP which version of TCUDA_Types.h the .dll was written against. This header is defined by Touch; it is not part of CUDA itself.

Next the OP will call tCudaNodeAttached. This is called once for each node that loads this .dll. For this example this isn't too important, but for more complex CUDA kernels the .dll may store intermediate data between cooks, and if more than one node is using the same .dll, then the .dll will need to manage multiple copies of the intermediate data.
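If the .dll keeps intermediate data between cooks, one way to keep several nodes apart is a table keyed by a per-node identifier. This sketch assumes the identifier is an opaque pointer handed to attach and detach callbacks. onNodeAttached, onNodeDetached, stateFor and NodeState are hypothetical names. The actual arguments of tCudaNodeAttached are defined in TCUDA_Types.h.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical per-node data that must survive between cooks.
struct NodeState
{
    int cookCount = 0;
    std::vector<float> history;  // e.g. results from the previous cook
};

// One entry per node that has loaded this .dll, keyed by an opaque node id.
static std::unordered_map<const void*, NodeState> gNodeStates;

void onNodeAttached(const void* nodeId)
{
    gNodeStates[nodeId] = NodeState{};  // fresh state for each new node
}

void onNodeDetached(const void* nodeId)
{
    gNodeStates.erase(nodeId);  // release this node's data only
}

NodeState& stateFor(const void* nodeId)
{
    return gNodeStates.at(nodeId);  // throws if the node never attached
}
```

Because every node has its own entry, two CUDA TOPs that load the same .dll never overwrite each other's intermediate data.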

Each cook, the OP will call a few of the .dll's functions to essentially ask it questions. The first is tCudaGetTOPOutputInfo(), in which the OP asks for the resolution, pixel format and aspect ratio the TOP should output. This is similar to setting these options on the Common page of the TOP. If you return 'true' from this function, the TOP will use the settings you provide as its output info. If you return 'false', the TOP will use the parameters on the Common page to set its output resolution etc. If this function isn't declared, the TOP behaves as if it were declared and returned 'false'.
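The true/false contract can be modeled like this. OutputInfoSketch, its fields and getTOPOutputInfo are simplified hypothetical stand-ins; the real structure also carries pixel format and aspect. resolveOutput mimics the fallback to the Common page that the TOP performs.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for the output-info structure.
struct OutputInfoSketch
{
    int width = 0;
    int height = 0;
};

// What a .dll might do: force a fixed 640x480 output and return true.
bool getTOPOutputInfo(OutputInfoSketch* info)
{
    info->width = 640;
    info->height = 480;
    return true;  // returning false would defer to the Common page
}

// Models the TOP's side: use the .dll's answer only if the function
// exists and returned true; otherwise fall back to the Common page.
OutputInfoSketch resolveOutput(bool (*query)(OutputInfoSketch*),
                               OutputInfoSketch commonPage)
{
    OutputInfoSketch fromDll;
    if (query != nullptr && query(&fromDll))
        return fromDll;
    return commonPage;
}
```

Passing nullptr as the query models a .dll that doesn't declare the function at all.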

Next the TOP will call tCudaTOPKernelOutputInfo. This is related to, but different from, the previous OutputInfo call: it describes the specific data format the kernel will output. The width and height are assumed to be the same as those set by tCudaGetTOPOutputInfo() or the Common page. The data format (unsigned byte or float) and the channel order (BGRA, A only etc.) don't need to match those of the final output image. For example, if your TOP outputs an 8-bit RGBA image, your kernel can still output floating point data. It will be converted automatically when the kernel's output is converted into a texture.
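To make the automatic conversion more concrete, here is what turning a float kernel output into an 8-bit value typically involves. This sketch assumes normalized floats are clamped to [0, 1], scaled to 0-255 and rounded. Touch's exact conversion rules are not described here and may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Convert one normalized float channel to an unsigned byte, assuming
// values outside [0, 1] are clamped and the result is rounded.
std::uint8_t floatToUnorm8(float v)
{
    float clamped = std::min(std::max(v, 0.0f), 1.0f);
    return static_cast<std::uint8_t>(clamped * 255.0f + 0.5f);
}
```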

Next the OP will call tCudaGetParamInfo() for every input that it could potentially upload to the GPU and pass to the .dll for use in the kernel. Parameters, TOPs, Objects and CHOPs are currently supported. tCudaGetParamInfo will be called once for each of the following:

* each parameter
* each connected TOP
* each CHOP listed in the CHOP DAT parameter
* each OBJ listed in the OBJ DAT parameter

If you are interested in a particular input, return 'true'; otherwise return 'false'. You can also optionally fill in the TCUDA_ParamRequestResult structure with different options. In this example we tell the OP we are only interested in the first connected TOP, and that we want its data in UNSIGNED_BYTE format. We return 'false' for every other input, which avoids a lot of work uploading unneeded inputs to the GPU.
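The request-filtering logic from the template can be sketched as follows. The enums and structs are simplified hypothetical stand-ins for TCUDA_ParamRequest and TCUDA_ParamRequestResult. In particular, inputIndex is an assumed field used to identify the first connected TOP; check TCUDA_Types.h for the real member names.

```cpp
#include <cassert>

// Hypothetical stand-ins for the TCUDA_DATA_TYPE_* and data-format enums.
enum class DataType { Par, Top, Chop, Obj };
enum class DataFormat { UnsignedByte, Float };

struct ParamRequestSketch
{
    DataType dataType;
    int inputIndex;  // assumed: 0 for the first connected TOP
};

struct ParamRequestResultSketch
{
    struct TopResult { DataFormat dataFormat = DataFormat::Float; };
    TopResult top;
};

// Accept only the first TOP, as 8-bit data; decline everything else so
// the OP does not upload unneeded inputs to the GPU.
bool getParamInfo(const ParamRequestSketch* request,
                  ParamRequestResultSketch* reqResult)
{
    if (request->dataType == DataType::Top && request->inputIndex == 0)
    {
        reqResult->top.dataFormat = DataFormat::UnsignedByte;
        return true;
    }
    return false;
}
```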

Finally, tCudaExecuteKernel will be called. This is the OP telling the .dll it's time to execute the kernel. The OP passes in all of the requested inputs in the params function parameter. It also passes an output parameter that contains all of the information about the memory this kernel will write its output into. In this example we find the first TOP input and use it as the source image for the CUDA kernel.
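Inside tCudaExecuteKernel, locating the first TOP among the passed-in parameters might look like this. ParamInfoSketch and findFirstTop are hypothetical. The real function receives TCUDA_ParamInfo entries whose exact layout, and the way their count is passed, are defined in TCUDA_Types.h.

```cpp
#include <cassert>
#include <cstddef>

enum class DataType { Par, Top, Chop, Obj };

// Hypothetical, simplified stand-in for TCUDA_ParamInfo.
struct ParamInfoSketch
{
    DataType dataType;
    void* data;  // meaning depends on dataType
};

// Return the first TOP input, or nullptr if none was passed in.
const ParamInfoSketch* findFirstTop(const ParamInfoSketch* params,
                                    std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
    {
        if (params[i].dataType == DataType::Top)
            return &params[i];
    }
    return nullptr;  // caller should skip launching the kernel
}
```

Checking for nullptr before launching the kernel avoids reading from a missing input when no TOP is connected.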

Getting data into the CUDA DLL

As mentioned above, tCudaGetParamInfo() will be called once for each piece of data that could potentially be given to you. It receives a TCUDA_ParamRequest structure describing the data the OP is offering. For example, if it's offering a CHOP, request->dataType will be set to TCUDA_DATA_TYPE_CHOP. Once you know the dataType, the matching sub-structure contains more information. For CHOPs, for example, request->chop.length contains the length of the CHOP.

If you are interested in the data, fill in the TCUDA_ParamRequestResult structure and return true; otherwise just return false. Which section of the result structure you fill in again depends on the dataType. For an Object (like a Camera COMP), you fill in reqResult->obj. If you want the Object's transform, set reqResult->obj.interestedInTransforms = true. For a camera there is a further sub-field, reqResult->obj.cam, whose members projWidth and projHeight control how the projection matrix is created.
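Here is a sketch of dispatching on request->dataType, using only the members this section names. The struct layouts are hypothetical stand-ins. isCamera is an assumed flag; the real way to tell a camera apart from other Objects may differ. The projection values are placeholders.

```cpp
#include <cassert>

enum class DataType { Par, Top, Chop, Obj };

// Hypothetical stand-ins mirroring the members named in the text.
struct ParamRequestSketch
{
    DataType dataType;
    struct ChopReq { int length; } chop;
    struct ObjReq { bool isCamera; } obj;  // isCamera is an assumption
};

struct ParamRequestResultSketch
{
    struct CamResult { float projWidth = 0, projHeight = 0; };
    struct ObjResult
    {
        bool interestedInTransforms = false;
        CamResult cam;
    } obj;
};

bool getParamInfo(const ParamRequestSketch* request,
                  ParamRequestResultSketch* reqResult)
{
    switch (request->dataType)
    {
    case DataType::Chop:
        // Only accept CHOPs that actually contain samples.
        return request->chop.length > 0;
    case DataType::Obj:
        reqResult->obj.interestedInTransforms = true;
        if (request->obj.isCamera)
        {
            // Placeholder values; these control how the projection matrix is built.
            reqResult->obj.cam.projWidth = 1920.0f;
            reqResult->obj.cam.projHeight = 1080.0f;
        }
        return true;
    default:
        return false;  // not interested in other inputs
    }
}
```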

Looking at the header files is the best way to understand what these structures contain.

Now that you've told the CUDA TOP what data you want, it will be passed to tCudaExecuteKernel. There, the data member of each TCUDA_ParamInfo contains the actual data. What data points to depends on the dataType.

Data Location

The TCUDA_ParamInfo structure also contains a member called dataLocation. It tells you whether the data is located on the CPU (TCUDA_DATA_LOCATION_HOST) or the GPU (TCUDA_DATA_LOCATION_DEVICE). If the data is on the GPU, you can only read it inside a CUDA kernel; you can, however, copy it to the CPU using cudaMemcpy. Some types of data will be given to you in CPU memory, some in GPU memory, and some types let you specify where you want the data to reside. Types that can be given in either CPU or GPU memory have a dataLocation member in their sub-structure of the TCUDA_ParamRequestResult structure, for example reqResult->chop.dataLocation.
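When data can arrive in either memory space, code that needs it on the CPU has to branch on dataLocation. In this sketch the copy routine is passed in as a function pointer so it runs without a GPU. In a real .dll it would be a thin wrapper around cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost). The enum, the CopyToHostFn type and readFloatsOnHost are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-ins for TCUDA_DATA_LOCATION_HOST / _DEVICE.
enum class DataLocation { Host, Device };

// Signature of a device-to-host copy; in the .dll this would wrap
// cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost).
using CopyToHostFn = bool (*)(void* dst, const void* src, std::size_t bytes);

// Return a CPU-side copy of `count` floats, wherever they live.
std::vector<float> readFloatsOnHost(const void* data, std::size_t count,
                                    DataLocation where, CopyToHostFn copy)
{
    std::vector<float> out(count);
    if (count == 0)
        return out;
    if (where == DataLocation::Host)
    {
        // Host memory can be read directly.
        std::memcpy(out.data(), data, count * sizeof(float));
    }
    else if (!copy(out.data(), data, count * sizeof(float)))
    {
        out.clear();  // copy failed; signal with an empty result
    }
    return out;
}
```

Copying to the host like this is mainly useful for inspection or CPU-side logic. Data that only feeds a kernel can stay on the GPU.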

Data Format

Some data will be given to you as unsigned char (8-bit bytes), and some as 32-bit floats. You can tell what the data format is from the dataFormat member of the TCUDA_ParamInfo structure. Much like the data location, some types let you specify the format you want the data in. Types that support this have a dataFormat member in their sub-structure of the TCUDA_ParamRequestResult structure.
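Because the same input can arrive as bytes or floats, a small accessor that branches on dataFormat keeps the rest of the code format-agnostic. The enum is a hypothetical stand-in. Mapping bytes to [0, 1] by dividing by 255 is an assumption about how you want to interpret 8-bit data.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the unsigned-byte and float data formats.
enum class DataFormat { UnsignedByte, Float };

// Read element i as a float, normalizing 8-bit values to [0, 1].
float readAsFloat(const void* data, DataFormat format, std::size_t i)
{
    if (format == DataFormat::UnsignedByte)
        return static_cast<const unsigned char*>(data)[i] / 255.0f;
    return static_cast<const float*>(data)[i];
}
```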

CHOPs

For CHOPs the data pointer will be an array of floats. The array will be a concatenation of each channel in the CHOP. For example if you have a CHOP that has 3 channels, with a length of 2, then the first two floats will be the values of the first channel, the 3rd and 4th float will be the values of the 2nd channel, and the 5th and 6th float will be the values of the 3rd channel. CHOPs can be given either in CPU or GPU memory.
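With this channel-by-channel layout, sample s of channel c sits at index c * length + s. Here is a minimal accessor, assuming data already points to host memory. Device data would first need a cudaMemcpy, or the same indexing could be used inside a kernel. chopIndex and chopSample are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>

// Index of sample `s` in channel `c` for a CHOP whose channels are stored
// one after another, each `length` floats long.
std::size_t chopIndex(std::size_t c, std::size_t s, std::size_t length)
{
    return c * length + s;
}

float chopSample(const float* data, std::size_t c, std::size_t s,
                 std::size_t length)
{
    return data[chopIndex(c, s, length)];
}
```

For the 3-channel, length-2 example above, chopIndex(2, 0, 2) is 4, the 5th float.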

Object COMPs

For Objects you can get two pieces of data: the transforms and the node's parameters (such as focal length etc.). The obj sub-structure of TCUDA_ParamInfo contains a member called paramType, which tells you whether the data is transforms or parameters. If the data is transforms, dataFormat will be set to TCUDA_PARAM_DATA_FORMAT_STRUCT, and data will be one of TCUDA_GeoTransformInfo or TCUDA_CameraTransformInfo, depending on the type of COMP. These structures contain different matrices that you can use as you see fit.

TOPs

A TOP can be given to you either as a single linear array of data or as a CUDA memory array. A CUDA memory array is used to create a CUDA texture. You select which one you get by setting the memType member in the top sub-structure of TCUDA_ParamRequestResult. It defaults to TCUDA_MEM_TYPE_LINEAR.
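For a TOP delivered as a linear array of 8-bit BGRA pixels, addressing a pixel is a matter of row-major arithmetic. This sketch makes several assumptions. It assumes tightly packed rows with no padding between them, and four bytes per pixel in B, G, R, A order. It also assumes the first row in memory comes first. Verify the actual row pitch, channel order and row origin of the buffer Touch gives you. BGRA8 and readPixel are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct BGRA8 { std::uint8_t b, g, r, a; };

// Assumes a tightly packed, row-major buffer with 4 bytes per pixel.
BGRA8 readPixel(const std::uint8_t* data, int width, int x, int y)
{
    std::size_t offset = (static_cast<std::size_t>(y) * width + x) * 4;
    return BGRA8{data[offset], data[offset + 1], data[offset + 2], data[offset + 3]};
}
```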

Node Parameters

The parameters that are on the CUDA OP will always be passed to you in CPU memory. Like CHOPs the data will be given to you in an array of floats. Each parameter will be an array of 4 floats.
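Since each parameter occupies four consecutive floats, component k of parameter p is at index p * 4 + k. Here is a small helper, assuming parameters appear in the array in the same order as on the OP; check that assumption in the debugger. paramValue and kFloatsPerParam are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kFloatsPerParam = 4;

// Component k (0-3) of parameter p in the host-side parameter array.
float paramValue(const float* params, std::size_t p, std::size_t k)
{
    assert(k < kFloatsPerParam);
    return params[p * kFloatsPerParam + k];
}
```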

Understanding all of the TCUDA_* structures

The best way to understand what all of the TCUDA_* structures mean is to put breakpoints in the various tCuda* functions and inspect the contents of the structures in Visual Studio.

Making the node cook every frame

If the inputs to the CUDA OP aren't cooking, but code inside the .dll changes data every frame, you can make the node cook every frame via the tCudaGetGeneralInfo function. This function gives you a TCUDA_GeneralInfo structure to fill in. Setting its timeDependent member to true will cause the node to cook every frame.
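Here is a sketch of the idea, using a simplified, hypothetical stand-in for TCUDA_GeneralInfo that contains only the member named above. The real function's signature and the structure's other members are in TCUDA_Types.h.

```cpp
#include <cassert>

// Hypothetical stand-in containing only the member described in the text.
struct GeneralInfoSketch
{
    bool timeDependent = false;
};

// Ask the OP to cook every frame even when no inputs change.
void getGeneralInfo(GeneralInfoSketch* ginfo)
{
    ginfo->timeDependent = true;
}
```

Leave timeDependent false when the output depends only on the inputs. Otherwise the node does GPU work every frame for nothing.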

Rules for writing a CUDA DLL

When the .dll is loaded by Touch, it essentially becomes part of Touch's runtime. This means that anything done inside the .dll that has global ramifications can affect how Touch runs. For example, calling exit(0) from the .dll will cause Touch to exit. The .dll also shares Touch's virtual memory space, so calling malloc() uses memory from Touch's memory space. This is fine to do, but be aware of it if you are allocating a lot of memory.

A very important rule is: Don't call any OpenGL or Direct3D functions! Touch is an OpenGL application, so calling any OpenGL function will change the OpenGL state to something Touch doesn't expect. This will likely result in undefined behavior (expect Touch's UI to stop rendering correctly). This point is especially important because most standalone CUDA applications contain OpenGL code to store or display CUDA data. When porting these apps into Touch, this code needs to be trimmed out. See Porting to a CUDA DLL for more information.

You can call cuda* functions as you see fit. For example, if you want to allocate your own CUDA memory, you can allocate it with cudaMalloc and then use cudaMemcpy to copy memory around. Of course, it's your job to free any memory you allocate when the node detaches itself from the .dll.
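One way to make the clean-up hard to forget is to tie each allocation to an owning object that frees it in its destructor. You then destroy that object when the node detaches. In this sketch the allocator and deallocator are passed in as function pointers so it runs without a GPU; in the .dll they would wrap cudaMalloc and cudaFree. DeviceBuffer is a hypothetical helper, not part of the Touch or CUDA API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

using AllocFn = void* (*)(std::size_t bytes);  // would wrap cudaMalloc
using FreeFn = void (*)(void* ptr);             // would wrap cudaFree

// Owns one allocation and releases it exactly once.
class DeviceBuffer
{
public:
    DeviceBuffer(std::size_t bytes, AllocFn alloc, FreeFn release)
        : ptr_(alloc(bytes)), release_(release) {}
    ~DeviceBuffer() { if (ptr_) release_(ptr_); }

    // Copying is disabled so two objects can never free the same pointer.
    DeviceBuffer(const DeviceBuffer&) = delete;
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;

    void* get() const { return ptr_; }

private:
    void* ptr_;
    FreeFn release_;
};
```

If each node's buffers live inside that node's per-node state, erasing the state on detach releases the GPU memory automatically.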

Visual Studio Custom Build Step

Included with the CUDA SDK 2.3 is a custom build rule that makes dealing with .cu files easier. You can add this to your projects by right clicking on the project name on the left and going to Custom Build Rules. Then choose Find Existing and go to where you installed the CUDA SDK, the file you are looking for is called Cuda.Rules, and is located in {CUDASDKLOCATION}/C/Common.

Then, instead of using the manual custom build step (which is how the cudaTOPTemplate project is set up), you can just choose CUDA Build Rule 2.3.0.

Debugging Your DLL

When you are using a .dll compiled in debug mode, you can attach the Visual Studio debugger to a running instance of TouchDesigner and place breakpoints/asserts in your .dll code. These breakpoints/asserts will work correctly, even though TouchDesigner itself isn't compiled in debug mode.

Debug vs. Release Builds

By default the Visual Studio project creates a debug .dll. When you are done testing and want maximum speed, create a release .dll by going to Build->Configuration Manager and changing the Active Configuration to Release. Your .dll will be created in the release directory of your project.

CUDA Textures

CUDA can create its own special kind of textures. These can be sampled in a much more intuitive way than linear device memory is normally addressed.

Refer to the CUDA documentation on how to use textures. The one piece that may not be clear is that the memory needs to be allocated with cudaMallocArray instead of the normal cudaMalloc. To get Touch to do this for you, set the memType field of the TCUDA_ParamRequestResult structure to TCUDA_MEM_TYPE_ARRAY, like this:

 // In tCudaGetParamInfo(), when the request is for a TOP:
 reqResult->top.memType = TCUDA_MEM_TYPE_ARRAY;

If you do this then you can just bind the data pointer you get in the TCUDA_ParamInfo in tCudaExecuteKernel to your texture.