The new NVIDIA Tesla V100 accelerator incorporates the new Volta GV100 GPU. Equipped with 21 billion transistors, Volta delivers over 7.5 TFLOPS of double-precision performance, a ∼1.5x increase compared to its predecessor, the Pascal GP100 GPU. Moreover, architectural improvements include:
Volta's Tensor Cores were added explicitly for deep learning workloads. The NVIDIA Deep Learning SDK provides powerful tools and libraries for designing and deploying GPU-accelerated deep learning applications. It includes libraries for deep learning primitives, inference, video analytics, linear algebra, sparse matrices, and multi-GPU communications.
Several approaches have been developed to exploit the full power of GPUs. They range from CUDA 9.0, a parallel computing platform and application programming interface specific to NVIDIA GPUs, to OpenMP 4.5, the latest version of the standard, which contains directives to offload computational work from the CPU to the GPU. While CUDA is currently the most likely to achieve the best performance from the device, OpenMP allows for better portability of the code across different architectures. Finally, the OpenACC open standard sits between the two: it is more similar to OpenMP than to CUDA, but allows better usage of the GPU. Developers are strongly advised to look into these programming paradigms.
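As an illustration of the directive-based approach, the following is a minimal sketch of an OpenMP 4.5 offload region in C++. The function name saxpy_offload and the choice of a SAXPY kernel (y = a*x + y) are assumptions made for this example and do not come from any particular library. The target construct requests execution on the default device, and the map clauses describe which arrays are copied to and from it.

```cpp
#include <vector>

// Hypothetical helper for illustration: computes y = a*x + y element-wise.
// With an offload-capable OpenMP 4.5 compiler the loop may run on the GPU;
// if OpenMP is not enabled, the pragma is ignored and the loop runs on the host.
void saxpy_offload(float a, const std::vector<float>& x, std::vector<float>& y) {
    // Process only the elements present in both vectors.
    const int n = static_cast<int>(x.size() < y.size() ? x.size() : y.size());
    if (n == 0) {
        return;  // nothing to transfer or compute
    }

    // Raw pointers are used because map clauses operate on array sections.
    const float* xp = x.data();
    float* yp = y.data();

    // map(to: ...)     copies x to the device before the region starts.
    // map(tofrom: ...) copies y to the device and back to the host afterwards.
    #pragma omp target teams distribute parallel for map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (int i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

The same source builds with any C++ compiler, which is the portability argument made above: the directives are hints layered on top of ordinary code. Whether the loop actually runs on a GPU depends on the compiler and the offload options it was built and invoked with.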
Moreover, it is fundamental to consider the several issues linked to hybrid architectures, such as CPU-GPU and GPU-GPU communication bandwidth (the latter greatly improved through NVLink), direct access through Unified Virtual Addressing, and the presence of new programming APIs (such as Tensor Core multiplications, specifically designed for deep learning algorithms).
Finally, it is important to stress the improvements made by NVIDIA to the implementation of Unified Memory. The system automatically migrates data allocated in Unified Memory between host and device, so that it looks like CPU memory to code running on the CPU and like GPU memory to code running on the GPU, which greatly simplifies programming.
At this stage, GPU programming is quite mainstream and many training courses are available online; see, for example, the NVIDIA education site for material related to CUDA and OpenACC. Material for OpenMP is more limited, but as an increasing number of compilers begin to support the OpenMP 4.5 standard, we expect the amount of such material to grow (see this presentation on the performance of the Clang OpenMP 4.5 implementation on NVIDIA GPUs for a status report as of 2016).