Date of Award

8-19-2013

Document Type

Thesis

Degree Name

Computer Science, MS

First Advisor

Hai Jiang

Committee Members

Hung-Chi Su; Jeff Jenness

Abstract

Checkpoint/restart has been an effective mechanism for achieving fault tolerance in many scientific applications, and various implementations have been explored at different levels. However, as GPUs gain an expanding role in high-performance computing, a more effective checkpoint/restart scheme is needed; none exists yet because of the GPU's batch-mode execution. The GPU's complex memory hierarchy also means that computation states are scattered across different memory locations that are difficult to fetch, and because programs run in parallel, the state of each thread is difficult to construct. This thesis proposes an application-level checkpoint/restart scheme to save and restore GPU computation states. A precompiler and a run-time support module have been developed to construct and save states in CPU system memory dynamically. Memory blocks are registered, and new data structures are proposed to save and restore the computation states represented by variables and pointers in the GPU. Secondary storage can be utilized for scalability and long-term fault tolerance. CUDA applications with complicated memory use are supported as well. Experimental results have demonstrated the effectiveness of the proposed scheme.
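
As a rough illustration of why registering memory blocks helps when the saved state contains pointers, the sketch below stores each pointer as a (block id, offset) pair rather than a raw address, so it can be resolved again after the blocks are reallocated at different addresses. This is not the thesis's implementation: it is plain Python with simulated addresses instead of CUDA device memory, and the names `Block`, `BlockRegistry`, `encode_pointer`, `decode_pointer`, `checkpoint`, and `restart` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One registered memory block (illustrative, not the thesis's layout)."""
    base: int          # simulated address of the first byte
    data: bytearray    # simulated contents of the block


class BlockRegistry:
    """Hypothetical registry of memory blocks, keyed by block id."""

    def __init__(self):
        self.blocks = {}  # block id -> Block

    def register(self, block_id, base, data):
        self.blocks[block_id] = Block(base, bytearray(data))

    def encode_pointer(self, address):
        """Translate a raw address into a relocatable (block id, offset) pair."""
        for block_id, block in self.blocks.items():
            if block.base <= address < block.base + len(block.data):
                return block_id, address - block.base
        raise ValueError(f"address {address:#x} is not in any registered block")

    def decode_pointer(self, encoded):
        """Translate a (block id, offset) pair back into a raw address."""
        block_id, offset = encoded
        return self.blocks[block_id].base + offset

    def checkpoint(self, pointers):
        """Copy every block's contents and encode each named pointer."""
        return {
            "blocks": {bid: bytes(b.data) for bid, b in self.blocks.items()},
            "pointers": {name: self.encode_pointer(addr)
                         for name, addr in pointers.items()},
        }

    @classmethod
    def restart(cls, state, new_bases):
        """Rebuild the blocks at new base addresses and re-resolve pointers."""
        registry = cls()
        for bid, data in state["blocks"].items():
            registry.register(bid, new_bases[bid], data)
        pointers = {name: registry.decode_pointer(enc)
                    for name, enc in state["pointers"].items()}
        return registry, pointers

    def read(self, address):
        """Read one byte through a raw (simulated) address."""
        block_id, offset = self.encode_pointer(address)
        return self.blocks[block_id].data[offset]


# Usage: checkpoint with blocks at one set of addresses, restart at another.
registry = BlockRegistry()
registry.register("a", 0x1000, [10, 20, 30, 40])
registry.register("b", 0x2000, [7, 8])
state = registry.checkpoint({"cursor": 0x1002})   # points at the third byte of "a"
restored, pointers = BlockRegistry.restart(state, {"a": 0x9000, "b": 0xA000})
```

After the restart, `pointers["cursor"]` refers to the same element of block "a" even though the block now sits at a different simulated address. A raw address saved verbatim would not have survived the reallocation.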

Rights Management

This work is licensed under a Creative Commons Attribution 4.0 International License.
