Resilism: Exploring the Resilience Mechanisms of HPC Applications to Soft Errors

<aside> <img src="/icons/info-alternate_gray.svg" alt="/icons/info-alternate_gray.svg" width="40px" /> Resilism investigates the mechanisms that make High-Performance Computing (HPC) applications resilient to soft errors. Soft errors, also known as transient faults, are temporary errors in electronic circuits caused by external radiation, electromagnetic interference, or other environmental factors. They can disrupt the execution of HPC applications and lead to incorrect results or system failures.

</aside>

https://github.com/hjiang13/Resilism
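
To make the notion of a soft error concrete, the following is a minimal Python sketch of a single-bit flip injected into one operand of a dot product. It is an illustration only and is not taken from the Resilism repository; the names `flip_bit` and `dot`, the input vectors, and the chosen bit positions are all hypothetical.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 binary64 encoding flipped.

    Bit 0 is the least significant mantissa bit; bit 63 is the sign bit.
    (Hypothetical helper, written for this illustration.)
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range [0, 64)")
    # Reinterpret the float as a 64-bit unsigned integer, flip, reinterpret back.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return corrupted


def dot(xs, ys):
    """Plain dot product, standing in for an HPC kernel."""
    return sum(x * y for x, y in zip(xs, ys))


xs = [1.0, 2.0, 3.0]
ys = [4.0, 5.0, 6.0]

# Fault-free reference run.
golden = dot(xs, ys)

# Fault in the lowest mantissa bit of xs[1]: the perturbation is tiny.
low_bit_run = dot([xs[0], flip_bit(xs[1], 0), xs[2]], ys)

# Fault in the sign bit of xs[1]: the result changes without any crash.
sign_bit_run = dot([xs[0], flip_bit(xs[1], 63), xs[2]], ys)

print("golden:      ", golden)
print("low-bit flip:", low_bit_run, "abs error:", abs(low_bit_run - golden))
print("sign flip:   ", sign_bit_run, "abs error:", abs(sign_bit_run - golden))
```

In this sketch the same kind of fault has very different consequences depending on which bit it hits: the low-order flip leaves the result essentially unchanged, while the sign flip silently produces a wrong answer. Differences of this kind are what a study of resilience mechanisms sets out to explain.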


DARE: DAtasets for REsilience analysis

<aside> <img src="/icons/info-alternate_gray.svg" alt="/icons/info-alternate_gray.svg" width="40px" /> DARE (Datasets for Resilience Analysis) is a curated collection of datasets for the resilience analysis of software systems, particularly High-Performance Computing (HPC) applications. The project compiles, organizes, and distributes datasets covering several fault types, including transient faults caused by environmental factors, hardware malfunctions, and software bugs. The datasets are intended for researchers and developers working on the fault tolerance and resilience of HPC systems, who can use them to benchmark and validate resilience mechanisms, run comparative studies, and develop new techniques for mitigating the effects of soft errors. The goal of DARE is to advance resilience engineering by providing shared resources that support robust and reliable software development.

</aside>

https://github.com/hjiang13/DARE

HAPPA: A Modular Platform for HPC Application Resilience Analysis with LLMs Embedded

<aside> <img src="/icons/info-alternate_gray.svg" alt="/icons/info-alternate_gray.svg" width="40px" /> HAPPA is a modular platform for the resilience analysis of High-Performance Computing (HPC) applications. It embeds Large Language Models (LLMs) in its architecture to analyze and predict how HPC applications behave in the presence of faults and errors, giving a more comprehensive view of how those applications withstand and recover from soft errors, hardware malfunctions, and other disruptions. Because the platform is modular, it can be adapted and extended to meet the needs of different HPC environments, which makes it useful to researchers and practitioners working on the fault tolerance and reliability of HPC systems. Through HAPPA, users can apply AI techniques to pursue high-assurance performance and resilience in their computational workflows.

</aside>

SRDS2024-workflow5.0.pdf

https://github.com/hjiang13/CODEBERT-REGRESSION
