ACCELERATING COMMUNICATION IN DLRM VIA FREQUENCY-AWARE LOSSY COMPRESSION

Started in September 2022

Description: Deep Learning Recommendation Model (DLRM) is an emerging method used by Meta for tasks in personalization and recommender systems, such as click-through rate (CTR) prediction. State-of-the-art DLRMs often employ embedding tables to map high-dimensional sparse vectors from raw categorical features to low-dimensional dense vector representations. Such embedding tables often have dimensions of tens of millions of rows by hundreds of columns, with sizes up to the order of GBs per table. The typical strategy to handle these large tables is to distribute them across high-performance computer nodes and leverage collective communication primitives to aggregate the looked-up vectors (up to GBs of data per process) during each minibatch iteration in forward passes and corresponding gradients in backward passes. The goal of this project is to reduce communication volume and increase communication throughput in training DLRMs by using lossy compression techniques.
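To make the table sizes above concrete, the following back-of-the-envelope sketch estimates the memory footprint of a single embedding table. The row count, embedding dimension, and 4-byte (fp32) element size are illustrative assumptions, not figures reported by this project.

```python
def embedding_table_bytes(num_rows: int, embedding_dim: int,
                          bytes_per_element: int = 4) -> int:
    """Return the dense storage size of one embedding table in bytes."""
    return num_rows * embedding_dim * bytes_per_element


# Hypothetical table: 10 million rows, 128 columns, fp32 entries.
size_bytes = embedding_table_bytes(num_rows=10_000_000, embedding_dim=128)
size_gb = size_bytes / 1e9
print(f"{size_gb:.2f} GB")  # prints "5.12 GB"
```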

Personnel:

Dingwen Tao (PI)
Tong Geng (collaborator)
Hao Feng, Chengming Zhang, Boyuan Zhang (PhD student)

Publication: to come...

Acknowledgement: This project is supported by Meta Research.

COMPRESSION-ACCELERATED DISTRIBUTED DNN TRAINING SYSTEM AT LARGE SCALES

Started in May 2020

Description: Deep learning (DL) has rapidly evolved to a state-of-the-art technique in many science and technology disciplines, such as scientific exploration, national security, smart environment, and healthcare. Many of these DL applications require using HPC resources to process large amounts of data. For example, researchers and scientists are employing extreme-scale DL applications in HPC infrastructures to classify extreme weather patterns and high-energy particles. In recent years, using GPUs to accelerate DL applications has attracted increasing attention. However, the ever-increasing scales of DL applications bring many challenges to today's GPU-based HPC infrastructures. The key challenge is the huge gap between the memory requirement and its availability on GPUs. This project aims to fill this gap by developing a novel framework to reduce the memory demand effectively and efficiently via data compression technologies for extreme-scale DL applications.

Personnel:

Dingwen Tao (PI)
Yanzhi Wang, Bin Ren, Shuaiwen Leon Song (collaborator)
Sian Jin, Chengming Zhang (PhD student)
Xintong Jiang, Dung Hoang Le (undergraduate student)

Publication: ACM PPoPP'21, ACM ICS'21, VLDB'22.

Acknowledgement: This project is supported by NSF OAC-2034169. More details can be found at https://compdnn.github.io/comptrain/.

FAIR SURROGATE BENCHMARKS SUPPORTING AI AND SIMULATION RESEARCH

Started in August 2021

Description: Computational science is being revolutionized by the integration of AI and simulation and, in particular, by deep learning surrogates that can replace all or part of traditional large-scale HPC computations. Surrogates can achieve remarkable performance improvements (e.g., several orders of magnitude) and thus save both time and energy. The Surrogate Benchmark Initiative (SBI) project will create a community repository and FAIR data ecosystem for HPC application surrogate benchmarks, including data, code, and all relevant collateral artifacts the science and engineering community needs to use and reuse these data sets and surrogates. We intend that our repositories will generate active research from both the participants in our project and the broad community of AI and domain scientists.

Personnel:

Dingwen Tao (PI)
Xiaodong Yu, Kamil Iskra, Peter Beckman (collaborator)
Baixi Sun (PhD student)

Publication: to come...

Acknowledgement: This project is supported by DOE FAIR SBI.

FAST CPU INFERENCE OF SPARSE DNN MODELS ON AMD EPYC PROCESSORS

Started in Spring 2021

Description: This project aims to develop a system-algorithm co-design method for fast inference of sparse DNN models (including CNNs, RNNs, and NLP models) on AMD EPYC CPUs. The co-design efforts include algorithm-level optimizations (such as a pattern-based pruning approach) and system-level optimizations (such as AMD BLIS library optimization). The final goal is to produce pruned models that retain high accuracy and can be deployed directly, with inference that is no slower, or even faster, than the baseline inference (without sparsification).

Personnel:

Dingwen Tao (PI)
Dave Ojika (collaborator)
Chengming Zhang (PhD student)
Xintong Jiang, Dung Hoang Le (undergraduate student)

Acknowledgement: This project is supported by AMD and FlapMax.

SZ: AN OPEN, TRANSPARENT LOSSY COMPRESSION FRAMEWORK FOR SCIENCE AND ENGINEERING DATA

Started in June 2018

Description: SZ is a modular, parametrizable lossy compressor framework for scientific data (floating point and integers). It has applications in simulations, AI, and instruments. It is both production-quality software and a research platform for lossy compression. SZ is open and transparent: open because all interested researchers and students can study or contribute to it, and transparent because all performance improvements are detailed in publications. SZ can be used for classic use cases (visualization, accelerating I/O, reducing memory and storage footprint) and for more advanced use cases such as compression of DNN models and training sets, acceleration of computation, checkpoint/restart, reducing streaming intensity, and efficiently running large problems that cannot fit in memory. Other use cases will augment this list as users find new opportunities to benefit from lossy compression of scientific data. This project aims to design and develop the SZ framework for cyberinfrastructures (e.g., exascale computers).

Personnel:

Dingwen Tao (PI)
Franck Cappello, Sheng Di, Xin Liang (collaborator)
Sian Jin, Jiannan Tian, Chengming Zhang (PhD student)
Philip Speegle, Cody Rivera (undergraduate student)

Publication: ACM HPDC'20, IEEE IPDPS'21.

Software: https://github.com/szcompressor/SZ.

Acknowledgement: This project is supported by DOE ECP VeloC/SZ. More details about SZ can be found at https://szcompressor.org/.

CUSZ: AN EFFICIENT ERROR-BOUNDED LOSSY COMPRESSION FRAMEWORK FOR SCIENTIFIC DATA ON GPU ARCHITECTURES

Started in August 2019

Description: Error-bounded lossy compression is a state-of-the-art data reduction technique for HPC applications because it not only significantly reduces storage overhead but also can retain high fidelity for post-analysis. Because supercomputers and HPC applications are becoming heterogeneous using accelerator-based architectures, in particular GPUs, several development teams have recently released GPU versions of their lossy compressors. However, existing state-of-the-art GPU-based lossy compressors suffer from either low compression and decompression throughput or low compression quality. This project aims to develop an optimized GPU-based lossy compressor for scientific data.
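As a rough illustration of what "error-bounded" means, the sketch below applies uniform scalar quantization with an absolute error bound. This is a generic textbook technique chosen for brevity; it is not the cuSZ algorithm, and the function names and sample values are hypothetical.

```python
from typing import List


def quantize(values: List[float], error_bound: float) -> List[int]:
    """Map each value to the index of the nearest bin of width 2 * error_bound."""
    step = 2.0 * error_bound
    return [round(v / step) for v in values]


def dequantize(codes: List[int], error_bound: float) -> List[float]:
    """Reconstruct each value as the center of its bin."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


def max_abs_error(original: List[float], reconstructed: List[float]) -> float:
    """Largest pointwise difference between the two sequences."""
    return max(abs(a - b) for a, b in zip(original, reconstructed))


data = [0.0, 0.123, -3.714, 42.0, 2.5004]  # hypothetical sample values
error_bound = 0.01
codes = quantize(data, error_bound)
recon = dequantize(codes, error_bound)

# Every reconstructed value lies within error_bound of the original,
# up to a tiny floating-point rounding tolerance.
within_bound = max_abs_error(data, recon) <= error_bound + 1e-12
print(codes, within_bound)
```

The small integer codes are what a subsequent lossless stage would encode; the data reduction comes from those codes being far more compressible than the original floating-point values.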

Personnel:

Dingwen Tao (PI)
Sheng Di, Xiaodong Yu (collaborator)
Jiannan Tian (PhD student)
Cody Rivera, Eric Song (undergraduate student)

Publication: ACM PACT'20, IEEE Cluster'21, IEEE IPDPS'21, IEEE IPDPS'22.

Software: https://github.com/szcompressor/cuSZ/.

Acknowledgement: This project is supported by DOE ECP VeloC/SZ project and the Expec Advanced Research Center of Saudi Aramco.

CEAZ: HARDWARE-SOFTWARE CO-DESIGN OF LOSSY COMPRESSION FOR SCIENTIFIC DATA

Started in August 2019

Description: Nowadays, many different tasks such as artificial intelligence, deep learning, graph analysis, and experimental analysis applications need to be simultaneously executed and managed along with the main simulation tasks in the supercomputer, all of which often generate huge amounts of scientific data that must be transferred for in situ processing or post analysis. To alleviate the network traffic and storage overhead, data reduction is needed by HPC in leadership computing facilities and even by edge computing in experimental and observational facilities. During the past four years, SZ compression has gained much attention as a powerful data reduction technique because of its high reduction capability. However, it suffers from low throughput and high resource utilization, which impedes its adoption in many scenarios that require high-rate streaming data or use low-power embedded processors. FPGAs, which offer configurability, high throughput, low latency, and high energy efficiency, can provide a potentially good solution to these issues. This project aims to optimize and implement FPGA-enhanced lossy compression for better scientific data management.

Personnel:

Dingwen Tao (PI)
Tong Geng, Ang Li (collaborator)
Chengming Zhang, Jiannan Tian (PhD student)

Publication: ACM PPoPP'20, ACM ICS'22.

Software: https://github.com/szcompressor/SZ_HLS.

Acknowledgement: This project is supported by Xilinx.

CEAPA: A SYSTEMATIC APPROACH TO MINIMIZE COMPRESSION ERROR PROPAGATION IN HPC APPLICATIONS

Started in August 2022

Description: Today’s high-performance computing (HPC) applications produce vast volumes of data for post-analysis, presenting a major storage and I/O burden for HPC systems. To significantly reduce this burden, researchers have explored using lossy compression techniques. While lossy compression can effectively reduce the size of data, it also introduces errors to the compressed data that often lead to incorrect computation results. As a result, scientists hesitate to use lossy compression in their scientific research. Thus, there is a critical need to develop an effective method to identify compression strategies that minimize error impact for a diversity of programs. This project aims to develop a systematic approach that helps scientists automatically select a lossy compression algorithm with the lowest error impact based on their HPC programs and target compression ratios. It also integrates educational and outreach activities, including student training and development of new curriculum on trustworthy data reduction and dependable HPC systems.

Personnel:

Dingwen Tao (PI)
Guanpeng Li, Sheng Di (collaborator)

Publication: ACM/IEEE SC'22.

Acknowledgement: This project is supported by NSF OAC-2211539.

HYLOC: OBJECTIVE-DRIVEN ADAPTIVE HYBRID LOSSY COMPRESSION FRAMEWORK FOR EXTREME-SCALE APPLICATIONS

Started in August 2020

Description: Today's extreme-scale scientific simulations and instruments are producing huge amounts of data that cannot be transmitted or stored effectively. Lossy compression, a data compression approach that introduces some data distortion, has been considered a promising solution because it can significantly reduce the data size while maintaining high data fidelity. However, the existing lossy compression methods may not always work effectively on all datasets used in specific applications because of their distinct and diverse characteristics. Moreover, the user objectives in compression quality and performance may vary with applications, datasets, or circumstances. This project aims to develop a hybrid lossy compression framework that automatically constructs the best-fit compression for diverse user objectives in data-intensive scientific research.

Personnel:

Dingwen Tao (PI)
Sheng Di, Zarija Lukic (collaborator)
Sian Jin, Daoce Wang, Baixi Sun (PhD student)

Publication: ACM HPDC'21, IEEE ICDE'22, ACM/IEEE SC'22.

Acknowledgement: This project is supported by NSF OAC-2003624. More details can be found at https://adphylc.github.io/.

ROCCI: INTEGRATED CYBERINFRASTRUCTURE FOR IN SITU LOSSY COMPRESSION OPTIMIZATION BASED ON POST HOC ANALYSIS REQUIREMENTS

Started in August 2021

Description: Today’s simulations and advanced instruments are producing vast volumes of data, presenting a major storage and I/O burden for scientists. Error-bounded lossy compressors, which can significantly reduce the data volume while controlling data distortion with a constant error bound, have been developed for years. However, a significant gap still remains in practice. On the one hand, the impact of compression errors on scientific research is not yet well understood, so setting an appropriate error bound for lossy compression is very challenging for scientists. On the other hand, selecting the best-fit compression technology and running it automatically in scientific application codes is non-trivial because of the pros and cons of different compression techniques and the diverse characteristics of applications and datasets. This project aims to develop a Requirement-Oriented Compression Cyber-Infrastructure (ROCCI) for data-intensive domains such as astrophysics and materials science, which can select and run the best-fit lossy compressor automatically at runtime according to users’ requirements for their post hoc analysis.

Personnel:

Dingwen Tao (PI)
Sheng Di, Franck Cappello, Junjing Deng, Zarija Lukic, Suren Byna (collaborator)
TBD (PhD student)
TBD (undergraduate student)

Publication: IEEE Cluster'21, IEEE TPDS'22.

Acknowledgement: This project is supported by NSF OAC-2104024. More details can be found at https://roccilab.github.io/rocci/.

IN SITU DATA REDUCTION FOR LARGE-SCALE COSMOLOGY SIMULATIONS

Started in June 2019

Description: Modern cosmological simulations are used by researchers and scientists to investigate new fundamental astrophysics ideas, develop and evaluate new cosmological probes, assist large-scale cosmological surveys, and investigate systematic uncertainties. Historically, such studies have required large simulations that are highly computation- and storage-intensive and are run on leadership supercomputers. With the increase in scale of such simulations, saving all the raw data generated to disk becomes impractical due to the limited storage capacity and I/O bandwidth. A better way to address this issue is to use data compression. While lossless compression would have been ideal, it typically only achieves a 2× compression ratio for scientific data. This project aims to investigate and optimize the use of error-bounded lossy compression techniques to achieve both high compression ratios and data fidelity for cosmological post-analysis (e.g., power spectrum and halo finder).

Personnel:

Dingwen Tao (PI)
Pascal Grosset, Jesus Pulido (collaborator)
Sian Jin, Daoce Wang (PhD student)

Publication: IEEE IPDPS'20, ACM HPDC'21, ACM HPDC'22.

Software: https://github.com/lanl/VizAly-Foresight.

Acknowledgement: This project is supported by DOE ECP ExaSky.

TSM2X: HIGH-PERFORMANCE IRREGULAR-SHAPED MATRIX COMPUTATIONS ON GPUS

Started in August 2019

Description: Linear algebra operations have been widely used in big data analytics and scientific computations, especially today's deep learning tasks. Much work has been done on optimizing linear algebra operations on GPUs with regular-shaped input. However, few works focus on fully utilizing GPU resources when the input is not regular-shaped. Current optimizations do not consider fully utilizing the memory bandwidth and computing power; therefore, they can only achieve sub-optimal performance. This project aims to design and develop efficient algorithms and implementations for irregular-shaped (e.g., tall-and-skinny or short-and-wide) matrix-matrix multiplications on GPUs, improving the performance of many big data analytics and scientific computations.
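One way to see why irregular shapes stress memory bandwidth is to estimate arithmetic intensity (floating-point operations per byte moved). The sketch below does this for a general matrix product under simplifying assumptions: each matrix is read or written exactly once, and elements are 4-byte floats. The function name and the shapes are hypothetical examples, not benchmarks from this project.

```python
def arithmetic_intensity(m: int, k: int, n: int,
                         bytes_per_element: int = 4) -> float:
    """FLOPs per byte for C (m x n) = A (m x k) @ B (k x n), assuming ideal reuse."""
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


# Regular-shaped case: all three matrices are 4096 x 4096.
square = arithmetic_intensity(4096, 4096, 4096)

# Irregular case: B is tall-and-skinny (4096 x 8).
tall_skinny = arithmetic_intensity(4096, 4096, 8)

print(f"square: {square:.1f} FLOPs/byte")
print(f"tall-and-skinny: {tall_skinny:.1f} FLOPs/byte")
```

Under these assumptions the tall-and-skinny product performs far fewer operations per byte than the square one, which suggests that its performance is limited mainly by memory bandwidth rather than by compute.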

Personnel:

Dingwen Tao (PI)
Jieyang Chen (collaborator)
Cody Rivera (undergraduate student)

Software: https://github.com/codyjrivera/tsm2x-imp.

Publication: ACM ICS'19, JPDC'21.