Hardware-Aware AutoML Research: Leveraging AI Automation for Model Optimization

Highlights:

  • Advanced automated machine learning (AutoML) techniques are needed to sustain the trajectory of AI, given the increasing number and complexity of AI applications and a limited workforce of ML experts.

  • Intel Labs is developing HW-Aware AutoML technologies for model optimizations such as mixed-precision quantization and hardware-aware Neural Architecture Search (NAS) to improve developer productivity and the efficiency of AI models on Intel platforms.

  • We continue to explore a variety of strategies, such as dynamic co-optimization techniques and the co-design of neural network (NN) architectures and neural accelerators to maximize AI system efficiency and address productivity challenges.

The future of scalable AI lies in automation, particularly when it comes to machine learning (ML). Emerging technology from Intel Labs uses a novel approach called automated machine learning (AutoML) to automate the time-consuming process of AI optimization and address ML bottlenecks that prevent the scaling of AI innovations.

Current hand-crafted approaches to AI model optimization rely heavily on a highly skilled but very limited workforce. This is not sustainable given the increasing volume and complexity of AI applications. Automation is necessary to facilitate the highly iterative process of optimization and make development scalable.

Scalable and Efficient AI

Intel Labs created an approach that leverages AI methods to optimize AI algorithms under the supervision of a high-level programmer rather than a deep learning expert. This approach improves the efficiency of the algorithms and optimizes them for deployment on various types of hardware, such as CPUs and GPUs.

Our methods bring optimization closer to data scientists, shortening the iterative cycle of design and optimization. These automation technologies will create opportunities to accelerate AI software and hardware innovations and improve the performance of AI. The rapid development of efficient “tiny AI models” will further accelerate the adoption of AI in the Internet of Things (IoT), edge, and embedded applications where compute resources are often limited.

Intel Labs is developing several technologies to realize this vision, some of which are available externally, including AutoQ and BootstrapNAS (described below).

AutoQ: Automated Mixed-Precision Quantization

Quantization is an optimization method that helps reduce model size and bandwidth requirements while improving compute efficiency. Mixed-precision quantization shows promise for improved power efficiency and performance, as well as reduced model size and memory footprint, compared to uniform 8-bit quantization.

AutoQ employs AutoML algorithms to assign a precision to each layer of a deep neural network (DNN), minimizing accuracy degradation while maximizing performance. This significantly improves productivity and reduces the reliance on programming experts.

AutoQ is available as part of OpenVINO/NNCF—a model compression framework. More information about AutoQ, including performance metrics, can be found at Automated Mixed-Precision Quantization for Next-Generation Inference Hardware and SigOpt featuring AutoQ.
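To make the underlying idea concrete, the sketch below assigns a bit-width to each layer of a toy model and scores every assignment with a simple proxy combining task loss and model size. It is a minimal illustration only: it uses a brute-force enumeration rather than AutoQ's search algorithm, it does not use the NNCF API, and the helper names (such as `fake_quantize`) are hypothetical.

```python
# Illustrative sketch only (not AutoQ or the NNCF API): toy per-layer
# bit-width search scored by a task-loss and model-size proxy.
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(w, bits):
    """Uniform symmetric fake quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x, y = torch.randn(64, 16), torch.randint(0, 4, (64,))
layers = [m for m in model if isinstance(m, nn.Linear)]

best = None
for bit_plan in itertools.product([2, 4, 8], repeat=len(layers)):       # per-layer precision
    originals = [l.weight.data.clone() for l in layers]
    for layer, bits in zip(layers, bit_plan):
        layer.weight.data = fake_quantize(layer.weight.data, bits)
    loss = F.cross_entropy(model(x), y).item()                          # accuracy proxy
    size = sum(b * l.weight.numel() for b, l in zip(bit_plan, layers))  # weight size in bits
    score = loss + 1e-5 * size                                          # toy multi-objective score
    if best is None or score < best[0]:
        best = (score, bit_plan)
    for layer, w in zip(layers, originals):                             # restore original weights
        layer.weight.data = w

print("best per-layer bit-widths:", best[1])
```

In practice, exhaustive enumeration like this does not scale to real DNNs with many layers; AutoQ replaces it with a far more sample-efficient automated search over the per-layer precision assignments.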

BootstrapNAS: Automated Hardware-Aware Neural Architecture Search

BootstrapNAS is an automated hardware-aware model optimization tool that Intel has developed to simplify the optimization of models on Intel hardware, including Intel® Xeon® Scalable processors, which deliver built-in AI acceleration and flexibility. It has been released as part of OpenVINO/NNCF v2.3.

This game-changing technology enables faster training time and efficient discovery of high-performing models that are specialized for the required target platform. It allows developers to transform a pre-trained AI model into a super-network and then use state-of-the-art neural architecture search (NAS) techniques to train it.

BootstrapNAS has extensible training and search modules that allow for efficient super-network optimization and subnetwork search. It improves developers’ productivity, requiring a fraction of the time that hand-crafted approaches require.

The transformation of the original model into a super-network guarantees that the maximal subnetwork has a similar cost to, and maintains comparable accuracy with, the original model. Intel Labs research demonstrates that optimizing AI models with BootstrapNAS can result in 11x better performance while maintaining a high level of accuracy.
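The super-network idea can be sketched in a few lines. The example below is a conceptual illustration, not the BootstrapNAS or NNCF API; the `ElasticLinear` class and its width choices are hypothetical. It wraps a pre-trained layer so that its output width becomes elastic: the maximal subnetwork reuses the original weights unchanged, which is why it keeps the cost and accuracy of the original model, while smaller subnetworks share slices of those weights.

```python
# Conceptual sketch (not the BootstrapNAS API): a weight-sharing "elastic" layer
# derived from a pre-trained layer; subnetworks are obtained by changing the
# active width at run time, with the maximal width equal to the original layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticLinear(nn.Module):
    """Linear layer whose active output width can be changed without retraining from scratch."""
    def __init__(self, pretrained: nn.Linear, width_choices):
        super().__init__()
        self.weight = nn.Parameter(pretrained.weight.detach().clone())
        self.bias = nn.Parameter(pretrained.bias.detach().clone())
        self.width_choices = width_choices
        self.active_width = max(width_choices)   # maximal subnetwork == original layer

    def forward(self, x):
        w = self.weight[: self.active_width]     # smaller subnetworks slice the shared weights
        b = self.bias[: self.active_width]
        return F.linear(x, w, b)

# Build a toy "pre-trained" layer and convert it into an elastic one.
hidden = nn.Linear(16, 64)
elastic = ElasticLinear(hidden, width_choices=[16, 32, 64])

x = torch.randn(8, 16)
for width in elastic.width_choices:              # sample subnetworks of the super-network
    elastic.active_width = width
    print(f"active width {width}: output shape {tuple(elastic(x).shape)}")
```

In a full super-network, every adapted layer exposes such choices (for example width, depth, or kernel size), the training module fine-tunes the shared weights so that all subnetworks perform well, and the search module evaluates sampled subnetworks against accuracy and cost on the target platform.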

For a more detailed look at the optimization capabilities of BootstrapNAS, please watch our video.

Next Steps

Intel Labs continues to look for ways to mitigate the bottlenecks caused by increasingly large models and search spaces. In partnership with the OpenVINO team, Intel Labs is designing an approach that combines pruning, quantization, and distillation into a single optimization that can be directly deployed on most Intel devices without any manual model translation. This approach is expected to deliver significant throughput improvements on Intel Xeon processors for transformer models across different data modalities. The capability will be released soon as part of the OpenVINO™ toolkit.

As part of our broader vision, Intel Labs is focused on improving the efficiency of AI automation methods with techniques to make AI models more efficient. Our goals include the development of algorithmic techniques that enable sample-efficient, multi-objective search and extend our capability for joint AI accelerator-algorithm optimization. For example, our recent work on Zero-Shot NAS introduces neural architecture scoring metrics (NASM) to identify good NN designs without training them. Techniques like Zero-Shot NAS can help narrow the search space to efficient solutions and significantly reduce the time required to find an efficient AI model.
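The sketch below illustrates the general flavor of a training-free scoring metric. It uses a simple gradient-saliency proxy computed on one mini-batch at initialization; this is an assumed stand-in for illustration, not the specific NASM introduced in our Zero-Shot NAS work.

```python
# Minimal sketch of a training-free (zero-shot) architecture score:
# rank randomly initialized candidates by a gradient-saliency proxy,
# using a single forward/backward pass and no training.
import torch
import torch.nn as nn
import torch.nn.functional as F

def zero_shot_score(model: nn.Module, x, y) -> float:
    """Sum of |w * dL/dw| over all parameters at initialization (a saliency-style proxy)."""
    model.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return sum((p * p.grad).abs().sum().item()
               for p in model.parameters() if p.grad is not None)

# Rank two candidate architectures with one batch of (random) data.
x, y = torch.randn(32, 16), torch.randint(0, 4, (32,))
candidates = {
    "narrow": nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 4)),
    "wide":   nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4)),
}
for name, net in candidates.items():
    print(name, zero_shot_score(net, x, y))
```

Because each candidate is scored with a single forward and backward pass at initialization, a large pool of architectures can be ranked in a fraction of the time it would take to train even one of them.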

We believe that automated approaches for co-designing NN architectures and neural accelerators are needed to maximize AI system efficiency and address productivity challenges. There has been growing interest in differentiable NN-hardware co-design to enable the joint optimization of NNs and neural accelerators. To enable efficient and realizable co-design of configurable hardware accelerators with arbitrary NN search spaces, Intel Labs has created Realizable Hardware and Neural Architecture Search (RHNAS), a method that combines reinforcement learning for hardware optimization with differentiable NAS.
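As a rough illustration of how such a co-design loop can be organized, the toy sketch below alternates a differentiable architecture update (penalized by an analytical latency model) with a policy-gradient update over a discrete hardware parameter. It is not the RHNAS implementation: the operator set, the hardware knob, and the cost model are all invented for illustration.

```python
# Toy co-design loop (not RHNAS): differentiable NAS over two candidate ops,
# alternated with a REINFORCE-style update over a discrete hardware choice.
import torch
import torch.nn as nn
import torch.nn.functional as F

HW_CHOICES = [16, 32, 64]   # hypothetical knob: number of processing elements (PEs)

class MixedOp(nn.Module):
    """Differentiable mixture over two candidate convolutions (small vs. large kernel)."""
    def __init__(self, ch):
        super().__init__()
        self.ops = nn.ModuleList([nn.Conv2d(ch, ch, 3, padding=1),
                                  nn.Conv2d(ch, ch, 5, padding=2)])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))   # architecture logits

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

def latency_proxy(alpha, num_pe):
    """Toy analytical cost model: larger kernels cost more; more PEs reduce latency."""
    op_cost = torch.tensor([1.0, 2.5])
    return (F.softmax(alpha, dim=0) * op_cost).sum() / num_pe

model = MixedOp(ch=8)
hw_logits = torch.zeros(len(HW_CHOICES), requires_grad=True)    # policy over HW configs
arch_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
hw_opt = torch.optim.Adam([hw_logits], lr=1e-2)
x, target = torch.randn(4, 8, 16, 16), torch.randn(4, 8, 16, 16)

for step in range(50):
    # (1) Differentiable NAS step: task loss plus a latency penalty under the
    #     currently expected hardware configuration.
    hw_probs = F.softmax(hw_logits, dim=0)
    expected_pe = (hw_probs * torch.tensor([float(c) for c in HW_CHOICES])).sum()
    loss = F.mse_loss(model(x), target) + 0.1 * latency_proxy(model.alpha, expected_pe.detach())
    arch_opt.zero_grad(); loss.backward(); arch_opt.step()

    # (2) Reinforcement-learning step over the discrete hardware choice:
    #     reward favors low latency and few PEs (a stand-in for accelerator area).
    dist = torch.distributions.Categorical(logits=hw_logits)
    idx = dist.sample()
    pe = HW_CHOICES[idx.item()]
    reward = -(latency_proxy(model.alpha.detach(), pe).item() + 0.001 * pe)
    hw_loss = -dist.log_prob(idx) * reward
    hw_opt.zero_grad(); hw_loss.backward(); hw_opt.step()

print("op weights:", F.softmax(model.alpha, 0).tolist(),
      "hw probs:", F.softmax(hw_logits, 0).tolist())
```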

Finally, we are exploring ways to maintain AI quality-of-service (QoS) in a multitenant environment. As part of future AutoQ research, we are exploring:

  • Dynamic co-optimization to address the noisy neighbor problem for predictable QoS.
  • Methods to dynamically scale allocated resources according to the load of parallel model serving.
  • Methods to gracefully trade off model quality for lower resource usage by switching to alternative model variants.

We believe that dynamic co-optimization of resources and models is essential for controlling QoS and optimizing total cost of ownership (TCO) for service providers.