Date of Award
8-19-2013
Document Type
Thesis
Degree Name
Computer Science, MS
First Advisor
Hai Jiang
Committee Members
Hung-Chi Su; Jeff Jenness
Abstract
Checkpoint/restart has been an effective mechanism for achieving fault tolerance in many scientific applications, and various implementations have been explored at different levels. However, as GPUs take on an expanding role in high-performance computing, a more effective checkpoint/restart scheme is needed; none exists yet because of the GPU's batch-mode execution. The GPU's complex memory hierarchy also means that computation states are scattered across different memory locations and are difficult to fetch, and parallel execution makes the state of each thread difficult to construct. This thesis proposes an application-level checkpoint/restart scheme to save and restore GPU computation states. A precompiler and a run-time support module have been developed to construct and save states in CPU system memory dynamically. Memory blocks are registered, and new data structures are proposed to save and restore the computation states represented by variables and pointers in the GPU. Secondary storage can be utilized for scalability and long-term fault tolerance. CUDA applications with complicated memory use are supported as well. Experimental results have demonstrated the effectiveness of the proposed scheme.
Rights Management
This work is licensed under a Creative Commons Attribution 4.0 International License.
Recommended Citation
Zhang, Yulu, "A Checkpoint/Restart Scheme for CUDA Applications with Complex Memory Hierarchy" (2013). Student Theses and Dissertations. 813.
https://arch.astate.edu/all-etd/813