Development of Deep Learning frameworks for Exascale



October 7, 2021 – Corey Adams of Argonne National Laboratory is leading efforts to deploy advanced deep learning frameworks on Aurora, the forthcoming exascale system scheduled for delivery next year to the Argonne Leadership Computing Facility (ALCF), a US Department of Energy (DOE) Office of Science User Facility.

Adams, a computer scientist in the data science group at ALCF, holds a joint appointment with Argonne's physics division. His research sits at the intersection of deep learning, AI, and fundamental physics. It includes contributions to Aurora non-recurring engineering (NRE) efforts that target Aurora Early Science Program (ESP) projects, including connectomics brain-mapping work and the CANDLE application for virtual drug response prediction and cancer treatment, in addition to applications for astrophysics, neutrino physics, lattice quantum chromodynamics, Argonne's Advanced Photon Source (APS), and the Large Hadron Collider.

Collaborating with Intel to deliver Aurora

As the arrival of Aurora nears, the data science group is working to ensure that the AI applications slated for deployment on the system will run at full performance from day one, and will run and scale correctly with relatively bug-free implementations. To this end, Adams and his colleagues have selected a number of Argonne workloads that represent innovative approaches to AI for science and that stand to benefit from the Aurora architecture.

To build up the applications' capabilities from a science perspective, they relied on computer vision benchmarks established by Intel when developing various deep learning and AI frameworks. Adams serves as the ALCF point of contact for these deep learning projects.

Performance tracking is twofold: Intel reports metrics for selected applications, while the Argonne team uses GitLab CI/CD on the Joint Laboratory for System Evaluation (JLSE) testbeds to track application performance and stability, testing on a weekly basis.
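To make the idea of weekly performance tracking concrete, here is a minimal sketch of the kind of check a scheduled CI job could run after each benchmark. It is an illustration only, not the Argonne team's pipeline: the function name `check_regression`, the 10 percent tolerance, and the throughput numbers are all hypothetical.

```python
from statistics import median


def check_regression(history, latest, tolerance=0.10):
    """Return True if `latest` falls more than `tolerance` below the
    median of previously recorded throughput values in `history`."""
    if not history:
        return False  # nothing to compare against yet
    baseline = median(history)
    return latest < baseline * (1.0 - tolerance)


# Hypothetical weekly throughput samples (images/second), made up for illustration.
weekly_throughput = [1020.0, 1005.0, 1011.0, 998.0]

print(check_regression(weekly_throughput, 1000.0))  # False: within tolerance
print(check_regression(weekly_throughput, 850.0))   # True: flagged as a regression
```

Comparing against the median rather than the most recent run keeps a single noisy week from moving the baseline.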

Scaling up and scaling out with different deep learning frameworks

Deep learning frameworks can be scaled up or scaled out.

The first process, scaling up, optimizes an application for the fastest possible performance on a single graphics processing unit (GPU). Scaling out, by contrast, distributes an application across multiple GPUs. The ALCF anticipates that Aurora, like other upcoming exascale systems, will derive most of its computing power from GPUs.

High-level Python frameworks such as TensorFlow and PyTorch rely on Intel's oneDNN deep neural network (DNN) library for compute-intensive GPU operations such as convolutions, whose complex requirements frustrate attempts at out-of-the-box performance. Producing an efficient kernel or source code therefore requires extensive iterations of development and testing.
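To illustrate why convolutions are compute-intensive, the sketch below implements a naive 2D convolution in plain Python and counts its multiply-accumulate operations. It is a teaching example only and says nothing about how oneDNN implements the operation; the helper names `conv2d` and `conv2d_macs` are hypothetical. As in most deep learning frameworks, "convolution" here means cross-correlation (the kernel is not flipped).

```python
def conv2d(image, kernel):
    """Naive 2D cross-correlation with no padding and stride 1,
    operating on nested lists of floats."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for y in range(oh):
        for x in range(ow):
            acc = 0.0
            # Every output pixel needs kh * kw multiply-accumulates.
            for dy in range(kh):
                for dx in range(kw):
                    acc += image[y + dy][x + dx] * kernel[dy][dx]
            out[y][x] = acc
    return out


def conv2d_macs(ih, iw, kh, kw):
    """Multiply-accumulate count for one input channel and one filter."""
    return (ih - kh + 1) * (iw - kw + 1) * kh * kw


image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
kernel = [[1.0, 0.0],
          [0.0, -1.0]]

print(conv2d(image, kernel))        # [[-4.0, -4.0], [-4.0, -4.0]]
print(conv2d_macs(224, 224, 3, 3))  # 443556
```

The count covers a single channel and a single filter. A real layer multiplies it by the number of input channels, output filters, and images in the batch, which is why tuned kernels matter so much on a single GPU.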

Once optimal performance has been achieved on a single GPU, Intel's oneCCL collective communications library helps deliver optimal performance across multiple GPUs by providing optimized communication patterns for distributing parallel model training across an arbitrary number of nodes. oneCCL and the synchronization it supports thereby enable tasks such as the uniform collection of gradients from a training iteration.
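The uniform collection of gradients can be pictured as an allreduce: every worker contributes its local gradient, and every worker ends up with the same averaged result. The sketch below simulates that behavior in a single Python process. It does not call oneCCL or reproduce its API; `allreduce_mean` is a hypothetical helper, and a real collective library would exchange data between devices rather than loop over lists.

```python
def allreduce_mean(per_worker_grads):
    """Simulate an allreduce that leaves every worker holding the
    element-wise mean of all workers' gradients."""
    n_workers = len(per_worker_grads)
    length = len(per_worker_grads[0])
    if any(len(g) != length for g in per_worker_grads):
        raise ValueError("all workers must hold gradients of the same length")
    # Reduce step: element-wise sum across workers.
    total = [sum(g[i] for g in per_worker_grads) for i in range(length)]
    mean = [t / n_workers for t in total]
    # Broadcast step: each worker receives its own copy of the result.
    return [list(mean) for _ in range(n_workers)]


# Four simulated workers, each with a local gradient from its shard of the batch.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
synced = allreduce_mean(grads)

print(synced[0])                             # [4.0, 5.0]
print(all(g == synced[0] for g in synced))   # True: every worker agrees
```

Because every worker applies the same averaged gradient, the model replicas stay identical from one training iteration to the next.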

In other words, oneDNN provides fast performance on a single GPU, while oneCCL provides fast distributed performance across multiple GPUs.

For more detailed benchmarks, Adams and his team are working with Intel to track the performance of oneDNN and oneCCL independently of each other and of additional GPU operations.

Source: Nils Heinonen, ALCF


