- T. Lourens. The Art of Parallel Processing using General Purpose Graphical Processing units – TiViPE software development [2.43 MB pdf]. Technical Report, 18 pages, TiViPE, September 14, 2010.
The aim of this report is to elaborate on TiViPE modules that make use of NVIDIA’s compute unified device architecture (CUDA) programming. The focus is on constructing these programs so that they make the best use of the GPU hardware through CUDA.
- T. Lourens. The Art of Parallel Processing using General Purpose Graphical Processing units – Hardware, CUDA introduction, and Software architecture [1160 KB pdf]. Technical Report, 26 pages, TiViPE, June 23, 2009.
The aim of this report is to elaborate on general purpose graphical processing unit (GP-GPU) programming and to provide a cookbook for programming NVIDIA’s compute unified device architecture (CUDA). CUDA includes a programming language that is very similar to C and is thus easy to program. CUDA implementations turned out to be at least one order of magnitude faster than a best-effort multi-core SSE implementation. Compared with plain C/C++ code, the difference was 2 to 3 orders of magnitude. The architecture of the GPU is elegant, which makes it easy to construct a hybrid model of data transfer and pure instruction (or floating point) processing that is much easier to understand than an SSE or cache-optimized CPU model. GPUs as computational units are evolving rapidly compared to CPUs; the gap between CPU and GPU in both data transfer and computational power is more than tenfold, making the GPU an excellent candidate for real-time parallel data processing. The latest generation of GPUs has become powerful enough that a single PC is sufficient to control, for instance, a medical imaging system. This implies reductions in size, cost, energy, and programming effort, as well as building on off-the-shelf consumer technology. A follow-up report on CUDA parallel computing using TiViPE is available. In that report, TiViPE programs using parallelism are discussed together with their computational times for different data types.