Sramana Mitra: Can you give an example?
Steve Scott: If you think about deep neural networks in particular, there’s training and there’s inference. Training is the learning part where you take a bunch of data and based on that, you train a model to be able to provide some function. Inference, of course, is using that model that has done the learning already to make decisions on new data. The inference problem is sort of a throughput problem.
Once you’ve got the model designed, you can run lots of data through the model and make decisions very quickly. The training problem itself takes a lot of compute and a lot of communication. This is the sort of thing that a Cray system is good at. It’s basically a high-performance computing problem because you have lots and lots of data and you have to feed all the data through the model.
You feed the data through the deep neural network in a forward way, then you calculate the error between what the model predicts and what your labeled training data says it’s supposed to predict. Then you use that to back propagate changes to the weights of your model. That algorithm itself is an HPC problem. You can apply anywhere from a single CPU or GPU up to hundreds or even thousands of GPUs to process all the data in parallel.
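Purely as an illustration of the loop Scott is describing, here is a minimal NumPy sketch of forward pass, error calculation against labeled data, and backpropagation of weight deltas. The toy data, layer sizes, and learning rate are illustrative assumptions, not anything from the interview or Cray-specific.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))        # labeled training data (inputs)
y = rng.normal(size=(256, 1))        # labels
W1 = rng.normal(size=(8, 16)) * 0.1  # model weights (one hidden layer)
W2 = rng.normal(size=(16, 1)) * 0.1
lr = 0.01                            # learning rate (illustrative)

for step in range(100):
    # Forward pass: feed the data through the network.
    h = np.tanh(X @ W1)
    pred = h @ W2
    # Error between what the model predicts and what the labels say.
    err = pred - y
    loss = np.mean(err ** 2)
    # Backpropagate the error into weight deltas.
    dpred = 2 * err / len(X)
    dW2 = h.T @ dpred
    dh = dpred @ W2.T * (1 - h ** 2)  # tanh derivative
    dW1 = X.T @ dh
    # Apply the deltas to the model weights.
    W1 -= lr * dW1
    W2 -= lr * dW2
```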
As part of that, the processors that you’re using in parallel to do the training need a lot of communication, because each processor processes a chunk of data and decides, “Based on my data, I want to update these weights by a certain delta.” They have to globally communicate all of those deltas so they can all update the model in lockstep and then continue to process.
There’s a lot of computation but also a lot of synchronization and communication. That’s the classic HPC problem. One of the things we do at Cray is build high-performance interconnects and software that’s particularly good at exchanging data at high rates and doing synchronization without much latency. That’s what makes a good supercomputer. It’s directly applicable to scaling an AI problem.
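To make that synchronization pattern concrete, here is a small NumPy sketch of data-parallel training: each simulated worker computes a weight delta from its own shard of the data, and an averaging step stands in for the global allreduce that would run over the interconnect on a real system (e.g. an MPI allreduce across GPUs). The worker count, linear model, and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_workers = 4
X = rng.normal(size=(1024, 8))
y = rng.normal(size=(1024, 1))
W = rng.normal(size=(8, 1)) * 0.1    # shared model (linear, for brevity)
lr = 0.05

X_shards = np.array_split(X, n_workers)  # each worker holds one shard
y_shards = np.array_split(y, n_workers)

for step in range(50):
    # Each worker computes its local delta ("based on my data...") in parallel.
    local_grads = []
    for Xs, ys in zip(X_shards, y_shards):
        err = Xs @ W - ys
        local_grads.append(2 * Xs.T @ err / len(Xs))
    # The allreduce: globally combine every worker's delta into one.
    global_grad = np.mean(local_grads, axis=0)
    # Every worker applies the same update, keeping the model in lockstep.
    W -= lr * global_grad
```

The loop over shards here only simulates what hundreds or thousands of processors would do simultaneously; the latency of the combine step is exactly the synchronization cost the interconnect has to keep small.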
Sramana Mitra: You mentioned in the beginning that you made a small detour to Nvidia. That probably gives us good context to talk about the high-performance computing landscape and how it is shifting in this AI era. Could you talk about who’s doing what? Who are the players with the most advanced technology in this realm?
Steve Scott: Before you think about that, it’s useful to understand why AI didn’t take root until the past few years. Back when I was in school, we learned about deep neural networks. We’ve known about them for many decades, but they just hadn’t been terribly useful in practice. That’s because you need a tremendous amount of computational power.
In order to have a useful deep neural net, it needs to be fairly large which means lots of weights. It needs to have a lot of data to feed it. The high-performance compute really becomes the engine and the data becomes the fuel, if you will. Until a few years ago, we simply didn’t have enough computational power to do a good job at training large deep neural nets.
This segment is part 3 in the series: Thought Leaders in Artificial Intelligence: Steve Scott, CTO of Cray