These days, when implementing a Deep Learning model with the framework of our choice, we generally have an extensive, fast API at our disposal that covers most needs. But sometimes a situation arises where the out-of-the-box components of TensorFlow, Torch, MXNet, and others are simply not enough. This can happen when we want to implement a model so new that no combination of built-in operators yields a sufficiently fast implementation. Or we may already have a working model and want to speed up training and/or inference by replacing a combination of built-in operators with our own fused version. In this session, we will look at the basics of CUDA, show how to implement and test a real-world custom operation in CUDA, and integrate it with a Deep Learning framework.
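
To give a flavour of what "fusing" means before we dive in, here is a minimal, hypothetical sketch (not the operation developed later in this session): instead of chaining three separate built-in operators for scale, bias, and ReLU, a single kernel computes `y = max(a*x + b, 0)` in one pass over the data, avoiding intermediate memory traffic. The kernel and file layout below are illustrative assumptions only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical fused kernel: y = max(a * x + b, 0) computed in one pass,
// rather than launching separate multiply, add, and ReLU operators.
__global__ void fused_scale_bias_relu(const float* x, float a, float b,
                                      float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = a * x[i] + b;
        y[i] = v > 0.0f ? v : 0.0f;
    }
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    // Unified memory keeps host-side bookkeeping minimal for this sketch.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = static_cast<float>(i - n / 2);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    fused_scale_bias_relu<<<blocks, threads>>>(x, 2.0f, 1.0f, y, n);
    cudaDeviceSynchronize();

    printf("y[0] = %f, y[n-1] = %f\n", y[0], y[n - 1]);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

A kernel like this would then be exposed to the framework through its custom-op mechanism (e.g. a PyTorch C++/CUDA extension or a TensorFlow custom op), which is the integration step covered later.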