A Gentle Introduction to CUDA

The advent of programmable GPUs has opened the door to many interesting applications beyond games. GPUs are very good at solving problems in parallel thanks to their inherently parallel architecture (SIMD), making them ideal candidates for embarrassingly parallel problems. NVIDIA has raised the bar further by offering a set of tools that expose the GPU's computational power to the general public, making GPU programming accessible to everyone. In this article, we will walk through a basic example that shows how GPUs can help us solve problems at unmatched speed.


GPGPU (General Purpose GPU), also known as computing for the masses, is a term coined by NVIDIA to promote CUDA. The idea that GPUs can be used for more than just running games has appealed to scientists and graphics hobbyists alike. While the number of cores in CPUs has grown only modestly in recent years, GPUs have long been many-core devices whose computational power far exceeds that of any high-end CPU when running parallel applications. In 2001, NVIDIA introduced programmable shaders, which opened the door to a plethora of possibilities, such as the ability to inject arbitrary code into the rendering pipeline. This promoted GPUs even further and led more people from outside the gaming industry to take an interest in the technology.

Enter CUDA

CUDA (Compute Unified Device Architecture) is a library and a C-like programming language that offers a high-level interface for performing arbitrary computations on NVIDIA GPUs.
We won't go over the details of installing the CUDA SDK; you can find everything you need on NVIDIA's website. Instead, we will work our way through some basic terminology before delving into code.
The GPU's computational units are organized into a grid. A grid is divided into blocks; it cannot contain more than 65535 blocks per dimension, although this limit depends on the compute capability of the graphics card. Each block contains threads, which the hardware executes in groups of 32 called warps. The following diagram sums it up:


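This hierarchy maps directly onto built-in variables that every kernel can read. As a minimal sketch (the kernel name is illustrative, assuming a one-dimensional launch):

```cuda
// Built-in variables exposed to every kernel (1D case):
//   gridDim.x   - number of blocks in the grid
//   blockDim.x  - number of threads per block
//   blockIdx.x  - index of this thread's block within the grid
//   threadIdx.x - index of this thread within its block
__global__ void whoAmI(int *out) {
	// Each thread derives a unique global index from its coordinates
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	out[idx] = idx;
}
```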
Code meant to run on the GPU must be defined inside a kernel, a C-like function marked with a special qualifier. The __device__ qualifier tells the CUDA compiler to generate GPU code for a function, but such functions can only be called from other GPU code, not from the host. We can define as many of them as needed inside a single CUDA program; however, they cannot be invoked from host code. For that, CUDA adds the __global__ qualifier. Functions with that qualifier have two distinctive properties that distinguish them from ordinary device functions: first, they can be invoked directly from host code; second, and most importantly, they allow us to specify configuration parameters for the launch.
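To make the distinction concrete, here is a short sketch (the function names are illustrative): a __device__ helper is callable only from GPU code, while the __global__ kernel that calls it can be launched from the host.

```cuda
// __device__: GPU code, callable only from other GPU functions
__device__ float square(float x) {
	return x * x;
}

// __global__: a kernel, callable from host code with a launch configuration
__global__ void squareAll(float *data) {
	int idx = threadIdx.x;
	data[idx] = square(data[idx]); // calling the device helper
}

// Host side: squareAll<<<1, 128>>>(data_d);
```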
Let's see an example to better understand this:

The following example calculates the sum of two vectors in parallel:

#include <cstdio>

// Kernel: computes the element-wise sum of two vectors in parallel
__global__ void addVectors(float *res_d, float *vec1_d, float *vec2_d) {
	// Calculate the index value corresponding to the current thread
	int idx = threadIdx.x;
	res_d[idx] = vec1_d[idx] + vec2_d[idx];
}

void cudaExample() {
	const int N = 128;
	float vec1[N], vec2[N]; // vec1 and vec2 reside in host memory (RAM)
	// Initialize the vectors
	for (int i = 0; i < N; ++i) {
		vec1[i] = i;
		vec2[i] = N - i;
	}
	// Allocate some GPU memory
	float *vec1_d, *vec2_d, *res_d;
	cudaMalloc(&vec1_d, sizeof(float) * N);
	cudaMalloc(&vec2_d, sizeof(float) * N);
	cudaMalloc(&res_d, sizeof(float) * N);
	// Copy vector data from host to device
	cudaMemcpy(vec1_d, vec1, sizeof(float) * N, cudaMemcpyHostToDevice);
	cudaMemcpy(vec2_d, vec2, sizeof(float) * N, cudaMemcpyHostToDevice);
	// Launch the computation with N = 128 threads
	// (the order in which threads execute is not guaranteed)
	addVectors<<<1, N>>>(res_d, vec1_d, vec2_d);
	// Copy the result back to host memory
	float res[N];
	cudaMemcpy(res, res_d, sizeof(float) * N, cudaMemcpyDeviceToHost);
	// Make sure that everything went as expected
	bool success = true;
	for (int i = 0; i < N; ++i)
		if (res[i] != N) { // each element should be i + (N - i) = N
			success = false;
			break;
		}
	printf("Test %s\n", success ? "succeeded" : "failed");
	// Free device memory
	cudaFree(vec1_d);
	cudaFree(vec2_d);
	cudaFree(res_d);
}
If you have a keen eye, you will definitely have noticed a strange notation in the invocation of the global function. No, that's not an esoteric form of C++ template instantiation; it is how we tell the CUDA compiler how many parallel threads to launch. There are two parameters: the number of blocks and the number of threads per block. For best performance, the latter should be a multiple of 32, the warp size.
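When the problem size exceeds the per-block thread limit, the work must be spread across several blocks, and each thread computes its global index from both coordinates. A minimal sketch under that assumption (the kernel name and block size are illustrative, not part of the example above):

```cuda
// Sketch: covering n elements when n may exceed the per-block thread limit
__global__ void addVectorsLarge(float *res_d, float *vec1_d, float *vec2_d, int n) {
	// Global index combines block and thread coordinates
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	if (idx < n) // guard against the last, partially filled block
		res_d[idx] = vec1_d[idx] + vec2_d[idx];
}

// Host side: round the block count up so every element is covered
// int threadsPerBlock = 256; // a multiple of 32
// int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
// addVectorsLarge<<<blocks, threadsPerBlock>>>(res_d, vec1_d, vec2_d, n);
```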

We have just scratched the surface of what can be done with CUDA. There is a wealth of information to learn about GPU programming and architecture. With the newest version of the CUDA compiler, NVIDIA has added support for object-oriented programming and many C++ features, namely template meta-programming, virtual functions, and inheritance, among others. This will definitely ease the learning curve for newcomers and make it easier to port existing C++ applications to the GPU.
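As one illustration of that C++ support, a kernel can be written as a template. This is only a sketch, assuming a recent nvcc and with illustrative names:

```cuda
// A templated kernel: nvcc instantiates one GPU function per type used
template <typename T>
__global__ void scale(T *data, T factor) {
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	data[idx] *= factor;
}

// Host side: one template, several element types
// scale<float><<<blocks, threads>>>(floats_d, 2.0f);
// scale<int><<<blocks, threads>>>(ints_d, 3);
```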
