The recent rise of artificial intelligence (AI) can be partly attributed to improvements in graphics processing units (GPUs), mostly deployed in cloud server architectures. GPUs are massively parallel processors that map well to the large numbers of vector and matrix multiplication operations that need to be performed in deep learning. GPUs were originally designed to perform matrix operations for three-dimensional (3D) computer graphics, but deep learning applications turn out to have similar requirements, and GPUs have been successful in accelerating both the training and inference of AI algorithms.
Google’s Tensor Processing Unit Server Farm (Source: Google)
Hyperscale internet companies, including Google, Facebook, Amazon, and Microsoft, have built massive cloud server farms that can perform industrial-scale AI training and inference, fueled by the troves of consumer data they collect, which in turn further improves their AI algorithms. NVIDIA has been the main beneficiary of this trend, as its GPUs power the majority of these cloud-based AI data centers, although Google is pushing its own tensor processing unit (TPU) chipset for running AI in the cloud in competition with NVIDIA’s GPUs. The hyperscalers are also adept at spending vast amounts of research and development (R&D) dollars on improving the algorithms that run on these cloud servers: at the International Conference on Machine Learning (ICML) 2017, one of the leading AI academic conferences, Google and Microsoft sat at the top of the pile ahead of Carnegie Mellon, Stanford, and the Massachusetts Institute of Technology (MIT). This is a self-sustaining cycle for the hyperscalers as they ride the AI wave, and some would call it an unequal advantage over universities and traditional enterprises. As a result, we are seeing a widening AI gap between the hyperscalers and traditional enterprises, many of which are still trying to understand AI transformation while they grapple with digital transformation.
While the cloud is good at processing AI algorithms at scale, using GPUs that do not face tight power consumption limits, there is an inherent delay built into the process: data must first be sent up to the cloud, the AI is trained on those datasets, and the trained model is then sent back to the device for inference. The latency of cloud-based AI processing also depends on the network the edge device uses, and for a self-driving car that needs to make real-time decisions about the objects it sees, a cloud-based AI architecture does not make sense. There are also privacy concerns around how the hyperscalers use consumer data to train their AI algorithms. For example, Amazon’s Echo and Google Home are under scrutiny for the voice data they collect to improve their voice recognition and conversational engines.
Edge- or device-based processing of AI algorithms has been difficult until now because of the heavy processing requirements and the limits on power consumption. Running NVIDIA’s Pascal GPUs consumes hundreds of watts, which can be addressed with cooling mechanisms in a data center, but would be unthinkable in a mobile device or a car.
Hyperscalers Pushing Edge-Based Processing
However, we are beginning to see several trends that suggest edge-based processing of AI algorithms is starting to happen. It is being pushed at one level by the hyperscalers themselves, who are aware of the privacy concerns and want to enable real-time, device-based AI training and inference. At the same time, software startups are coming up with innovative ideas, while hardware startups are developing custom solutions for embedded AI applications. Both software and hardware approaches are feeding into edge-based AI processing.
- Federated learning from Google allows mobile devices to train on local data gathered at the end device and send the new learning to the cloud as a small update, which is then averaged with other users’ updates to improve the shared model. Google is already using this in its Gboard keyboard application on Android devices.
Federated Learning on Mobile (Source: Google)
- Apple’s latest Core ML toolkit allows pre-trained models to run on the iPhone using local data, without needing to send data to the cloud. Apple currently supports the Caffe and Keras frameworks for neural network models. Both Apple and Google are taking advantage of the existing processing capabilities of mobile devices, without needing a specialized application-specific integrated circuit (ASIC) or GPU.
- Machine learning (ML) frameworks like Caffe and TensorFlow have compressed, embedded versions that can run on the device, using a reduced code footprint and lower-precision arithmetic. TensorFlow Lite is expected to become available to Android developers soon, enabling on-device speech processing, vision, and augmented reality (AR) capabilities. Facebook is already using Caffe2Go to power on-device style transfer for videos and images in real time.
- NVIDIA has released the Jetson TX2 processor for embedded platforms like security cameras, robots, and sensor platforms. These are mid-power applications in the 5 W to 10 W range, which is still not ideal for mobile devices.
- ARM is improving the performance of its Mali GPU line with the upcoming Mali-G72 for use in mobile phones like the Huawei Mate 9 to drive device-based inference. ARM has also launched two new Cortex-A processors, the Cortex-A75 and the Cortex-A55, based on its DynamIQ technology, which is expected to boost AI performance by 50X. These can be used for both server and mobile device applications.
- Intel’s Movidius (acquired in 2016) is actively pushing AI into embedded applications like cameras, wearables, robots, and drones. Intel is clearly betting on an embedded future for AI, and this is one area in which it can compete effectively against NVIDIA, the AI hardware market leader.
- French startup Snips has launched a device-based chat and voice assistant platform that performs inference on the mobile device itself, and it has managed to run TensorFlow on a Raspberry Pi with some smart tweaks. Device-based inference is not the hard part, however; it is training that normally requires customer data to be sent to the cloud. Snips claims to have the secret sauce to generate artificial training data for chat and voice recognition using generative adversarial networks (GANs) and other AI techniques. In essence, it trains its AI in the cloud on artificial data and then sends a pre-trained model down to the device. If Snips’ claims hold up, this could truly be a game changer for how AI is trained.
- Microsoft’s Anirudh Koul offers a number of tips on running ML on mobile phones, a great repository for any developer who wants to experiment with on-device training and inference.
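Google has not published Gboard's exact update scheme, but the core federated averaging step described above can be sketched in a few lines of plain Python (a minimal illustration; the function name and the example-count weighting are assumptions, not Google's actual implementation):

```python
def federated_average(global_weights, client_updates, client_sizes):
    """Combine per-device model updates into a new shared model.

    Each device trains locally on its own data and sends back only a
    small weight delta, never the raw data. The deltas are averaged,
    weighted by how many local examples each device trained on.
    """
    total = sum(client_sizes)
    new_weights = list(global_weights)
    for update, n in zip(client_updates, client_sizes):
        share = n / total  # this device's fraction of all training examples
        for i, delta in enumerate(update):
            new_weights[i] += share * delta
    return new_weights

# Toy example: a 3-parameter shared model and two devices
global_w = [0.0, 0.0, 0.0]
updates = [[0.2, 0.0, 0.4], [0.0, 0.1, 0.2]]
sizes = [100, 300]  # the second device trained on more local examples

new_global = federated_average(global_w, updates, sizes)
# new_global ≈ [0.05, 0.075, 0.25]
```

The privacy benefit comes from the fact that only the small, aggregated deltas ever leave the device; the raw keystrokes or audio stay local.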
Training Artificial Intelligence Algorithms with GAN-Based Artificial Data Sets
Most current activity centers on the use of pre-trained models, the application of compressed frameworks, and techniques like pruning, weight sharing, and low-precision computation to run inference on the end device. From a device perspective, it is becoming clear that many voice, image processing, and AR applications can perform device-based inference using existing processors. The emergence of GAN-based artificial data sets for training AI algorithms is an area to watch: it could be a game changer not just for solving privacy issues, but for enabling a mass market for AI and disrupting the hyperscalers’ hold on it, as anyone could then gain access to quality training datasets.
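Two of the compression techniques mentioned above, magnitude pruning and low-precision computation, can be sketched as follows (a simplified illustration in plain Python, not the actual TensorFlow Lite or Caffe2Go implementation; the threshold and the 8-bit linear scheme are arbitrary choices for the example):

```python
def prune(weights, threshold):
    """Magnitude pruning: zero out weights below a threshold so the
    sparse model needs less storage and fewer multiplications."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def quantize_int8(weights):
    """Linear quantization: store each weight as an 8-bit integer in
    [-127, 127] plus one shared floating-point scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate floating-point weights for computation."""
    return [q * scale for q in quantized]

weights = [0.8, -0.03, 0.41, 0.002, -0.77]
sparse = prune(weights, threshold=0.05)   # small weights dropped
q, scale = quantize_int8(weights)         # 1 byte per weight vs. 4
approx = dequantize(q, scale)             # close to the originals
```

Pruning and quantization compose: a pruned, 8-bit model can be several times smaller than its dense 32-bit original, which is what makes mobile-class inference feasible without new silicon.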
Even with quality artificial data sets, many would argue that there is nothing better than training the AI in real time on the end device. Federated learning looks like a great solution to this problem, but it might not be ideal for vision- and speech-based applications, which are far more processing-heavy than training a keyboard model. Other devices like closed-circuit television (CCTV) cameras and robots are also likely to see device-based inference using existing processors, with device-based training being a massive area of opportunity for chipset vendors.
Expanding Artificial Intelligence into Industry Sector Use Cases
However, one of the biggest challenges and gaps in the market remains ultra-low-power applications like the Internet of Things (IoT) and sensors. Intel’s Movidius is clearly targeting this space, but it should see more competition emerge in the near future. Tractica maintains a large database of use cases, and we are specifically researching embedded AI and deep learning applications and how they map to hardware chipsets.
The big emerging question is how the trend toward device-based training and inference will affect investment in AI hardware for the cloud. Will the rise of AI continue to correlate directly with investment in cloud-based AI hardware? On one hand, model complexity keeps increasing: Baidu’s Deep Speech 2 has 300 million parameters, while Google’s Neural Machine Translation model has an enormous 8.7 billion parameters. Applications like these will clearly require cloud-based processing. At the same time, the rise of device-based processing using pre-trained models, and the eventual emergence of device-based training, is likely to have an impact on cloud-based AI hardware. Are improvements in compressed frameworks and pruning techniques also likely to delay the need for specialized processors? Our current research and analysis at Tractica is focused on finding answers to these questions.
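A back-of-the-envelope calculation using the parameter counts above shows why models of this size stay in the cloud (a rough sketch that counts only raw weight storage, ignoring activations and runtime overhead):

```python
def model_size_gb(num_params, bytes_per_param):
    """Raw storage for model weights, in gigabytes."""
    return num_params * bytes_per_param / 1e9

deep_speech_2 = 300e6  # parameters, as cited above
google_nmt = 8.7e9

print(model_size_gb(deep_speech_2, 4))  # 1.2 GB at 32-bit precision
print(model_size_gb(google_nmt, 4))     # 34.8 GB at 32-bit precision
print(model_size_gb(google_nmt, 1))     # 8.7 GB even quantized to 8 bits
```

Even aggressive quantization leaves a multi-billion-parameter model far too large for a phone, which is why compressed, pre-trained models dominate on-device deployments today.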
Ultimately, the focus of AI will move from hyperscaler cloud-based AI, dominated by consumer applications, to the long tail of embedded applications, enabling a wider range of use cases in industry sectors like manufacturing, healthcare, business, agriculture, and automotive, among many others.