The current model for fault tolerance involves collecting periodic collective checkpoints, with all processes restarting from the last successful checkpoint on failure. We have worked on virtualization-based approach to checkpointing in the context of Infiniband clusters.
Rather than rolling back all the processes in the event of a failure, we have developed strategies to restart only the failed processes while other processes co-operate in deriving a consistent state without discarding the work done by them since the last checkpoint. This work is done in the context of the task-based programming framework we have developed for dynamic load balancing.
For more information, please refer to my papers in this area.Irregular and dynamic parallel applications pose significant challenges to achieving scalable performance on large-scale multicore clusters. These applications often require ongoing, dynamic load balancing in order to maintain efficiency. Scalable dynamic load balancing on large clusters is a challenging problem. I have addressed this issue in the context of a variety of architectural environments, such as shared-memory systems, distributed-memory clusters, and GPUs. In our model, the computation is organized as a collection of tasks. Each task provides sufficient information to the runtime to enable relocation of the task, possibly also handling its data requirements. Solutions have been developed that are tailored to application classes, ranging from hypergraph partitioning-based disk I/O scheduling to distributed memory work stealing for a collection of independent tasks.
For more information, please refer to my papers in this area.