Type of Document: Master's Thesis
Author: Tadepalli, Sriram Satish
URN: etd-12292003-134023
Title: GEMS: A Fault Tolerant Grid Job Management System
Degree: Master of Science
Department: Computer Science
Advisory Committee:
- Ribbens, Calvin J. (Committee Chair)
- Kafura, Dennis G. (Committee Member)
- Varadarajan, Srinidhi (Committee Member)
Keywords:
- fault tolerance
- grid computing
- grid job management systems
- local resource manager
- job migration
Date of Defense: 2003-12-19
Availability: unrestricted
Abstract:
Grid environments are inherently unstable: resources join and leave
the environment without prior notification. Application fault
detection, checkpointing, and restart are therefore of foremost importance in
Grid environments. The need for fault tolerance is especially acute
for large parallel applications, since the failure rate grows with the
number of processors and the duration of the computation.
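As a rough illustration of why the failure rate grows with both factors, consider a simple model that is an assumption here and not taken from the thesis: each processor fails independently at a constant rate $\lambda$, and the job fails if any one of its $N$ processors fails during a run of length $T$. The symbols $\lambda$, $N$, $T$ and the numbers below are illustrative only.

```latex
% Assumed model: independent processors, constant per-processor failure rate \lambda.
% Probability that an N-processor job of duration T sees at least one failure:
\[
  P_{\text{fail}}(N, T) = 1 - e^{-N \lambda T}
\]
% Illustrative numbers: \lambda = 10^{-4} failures per processor-hour, T = 24 hours.
\[
  N = 100:\quad  N \lambda T = 0.24, \qquad P_{\text{fail}} = 1 - e^{-0.24} \approx 0.21
\]
\[
  N = 1000:\quad N \lambda T = 2.4,  \qquad P_{\text{fail}} = 1 - e^{-2.4}  \approx 0.91
\]
```

Under these assumptions, a tenfold increase in processor count takes a one-day job from an occasional failure to a near-certain one.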
A Grid job management system hides the heterogeneity of the Grid and the
complexity of the Grid protocols from the user. The user submits a job
to the system, which finds an appropriate
resource, submits the job to it, and transfers the output files to the user
upon job completion. However, current Grid job management systems do
not detect application failures.
The goal of this research is to develop a Grid job management system
that can efficiently detect application failures. A failed job is either
restarted on the same resource or migrated to
another resource and restarted there. The research also aims to identify the
role of local resource managers in the fault detection and migration
of Grid applications.
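The restart-or-migrate decision described above might be sketched as follows. This is a minimal illustration and not the GEMS implementation: the `Resource` and `Job` classes, the `recover` function, and the `max_local_restarts` threshold are all hypothetical names introduced here, and the policy of trying the same resource first is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical stand-in for a Grid resource (e.g. a cluster)."""
    name: str
    healthy: bool = True


@dataclass
class Job:
    """Hypothetical record of a submitted job and its recovery history."""
    job_id: str
    resource: Resource
    restarts: int = 0
    history: list = field(default_factory=list)


def recover(job, resources, max_local_restarts=1):
    """Pick where a failed job restarts; return the resource, or None if none is usable.

    Assumed policy: restart on the same resource while it is healthy and the
    local restart budget is not exhausted; otherwise migrate to the first
    other healthy resource.
    """
    if job.resource.healthy and job.restarts < max_local_restarts:
        job.restarts += 1
        job.history.append(("restart", job.resource.name))
        return job.resource

    for candidate in resources:
        if candidate.healthy and candidate is not job.resource:
            job.resource = candidate
            job.restarts = 0  # the restart budget is counted per resource
            job.history.append(("migrate", candidate.name))
            return candidate

    return None  # no healthy resource is available
```

In this sketch, a real system would resume the job from its latest checkpoint on the chosen resource; that step is omitted because the abstract does not describe the mechanism.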
Filename: thesis.pdf
Size: 328.16 Kb
Approximate Download Time (Hours:Minutes:Seconds):
- 28.8 Modem: 00:01:31
- 56K Modem: 00:00:46
- ISDN (64 Kb): 00:00:41
- ISDN (128 Kb): 00:00:20
- Higher-speed Access: 00:00:01