Size-Based Scheduling Policies with Inexact Job Sizes


Size-Based Scheduling with Estimated Sizes

Before starting the design of the HFSP protocol, we need to understand whether imprecise size information has a significant impact on scheduling choices. Since this problem has not been well studied in the literature, we analyze it starting from a basic configuration: the single-server case. Even if the single-server case may seem simple, it provides many insights that are useful to understand the behavior of the system. Chapter 3 is therefore dedicated to the study of scheduling policies that use estimated sizes; the insights gained in this chapter drive the design of a more complex scheduler for DISC systems such as HFSP. In the first part of the chapter, we study the scheduling problem in the presence of estimation errors and provide an overview of the current state of the art on the performance of both size-based and blind (size-oblivious) schedulers.
We then describe the impact of estimation errors on scheduling policies by defining SRPTE and FSPE, two variants of well-known size-based scheduling policies for single-server queues that work with estimated sizes. Two kinds of errors can be made when a job size is estimated: the job is overestimated or underestimated if its estimated size is, respectively, larger or smaller than its real size. These two kinds of errors have completely different impacts on size-based policies and require different strategies to deal with them. In particular, while jobs with overestimated sizes delay only themselves, jobs with underestimated sizes can potentially delay all the jobs in the queue.
Of the two, underestimation has by far the stronger impact on the scheduling policy and can lead to very poor performance.
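To make this asymmetry concrete, the following minimal sketch (our illustrative code, not the simulator used in Chapter 3; unit-length time quanta assumed) runs SRPTE on a single server. The scheduler always serves the job with the smallest *estimated* remaining size, so an underestimated job reaches an estimated remainder of zero while still unfinished and keeps the server until it really completes:

```python
def simulate_srpte(jobs):
    """Unit-step SRPTE with all jobs present at time 0.

    jobs maps a name to (real_size, estimated_size); returns each
    job's completion time. At every step the server picks the job
    with the smallest estimated remaining size, floored at zero.
    """
    attained = {name: 0 for name in jobs}
    done = {}
    t = 0
    while len(done) < len(jobs):
        running = min(
            (n for n in jobs if n not in done),
            key=lambda n: max(jobs[n][1] - attained[n], 0),
        )
        attained[running] += 1
        t += 1
        if attained[running] >= jobs[running][0]:  # real work finished
            done[running] = t
    return done

# A 10-unit job underestimated as 1 unit wins the server immediately,
# hits estimated remainder 0, and delays both short jobs to the end.
print(simulate_srpte({"big": (10, 1), "s1": (2, 2), "s2": (2, 2)}))
# → {'big': 10, 's1': 12, 's2': 14}
```

With exact sizes the short jobs would instead finish at times 2 and 4; overestimating "big" would leave their completion times unchanged, since an overestimated job only delays itself.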
The next part of the chapter defines FSPE+PS, a size-based scheduling policy for single-server queues that is tolerant to estimation errors. Our simulation-based evaluation, which varies both the workload composition and the estimation error, shows two main results:
• despite their problems with underestimated jobs, both SRPTE and FSPE perform very well in many cases and are good candidates for a real scheduler implementation;
• FSPE+PS goes further, providing better performance even in extreme cases of workload composition and estimation error.

The Hadoop Fair Sojourn Protocol

In Hadoop, a job is composed of tasks, and tasks can run in parallel. As we show in Chapter 4, most of the time the tasks of the same job have very similar sizes. A few tasks can be run first and then, based on their performance, the size of the whole job can be estimated with a small error. We implemented the Hadoop Fair Sojourn Protocol (HFSP) scheduler, a scheduling policy for DISC systems based on the FSPE policy defined for a single-server queue. While the main contribution of this chapter is a fully fledged scheduler for a distributed system, the architectural choices that lead to it are no less important. Indeed, adapting scheduling policies defined for the single-server case (such as FSPE) to a real system like Hadoop raises many challenges that must be addressed. HFSP consists of two main components, described in detail in Chapter 4: the estimation module and the aging module.
The estimation module estimates job sizes. It first provides a rough estimate when a job is submitted, then refines it into a more precise one based on the performance of a subset of the job's tasks. This estimation strategy is designed around DISC systems, in which a job is composed of tasks, and ultimately yields very accurate estimates.
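The similarity of tasks within a job is what makes such refinement workable. The sketch below is illustrative only: the function name, the sample, and the plain mean are our assumptions, not HFSP's exact procedure, but it captures the idea of scaling the observed duration of a few completed tasks by the total task count:

```python
def estimate_job_size(sample_durations, total_tasks):
    """Estimate a job's total size from a few completed tasks.

    Assumes tasks of the same job have similar sizes, so
    total size ≈ number of tasks × mean observed task duration.
    Illustrative sketch, not HFSP's actual estimator.
    """
    mean_duration = sum(sample_durations) / len(sample_durations)
    return total_tasks * mean_duration

# e.g. 5 sampled tasks out of a 200-task job
print(estimate_job_size([12.0, 11.5, 12.5, 13.0, 11.0], 200))  # → 2400.0
```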
The aging module avoids job starvation by applying what is called "aging" to jobs. HFSP does not make decisions based on size alone, but also on how long a job has been waiting in the queue. In this way, even a relatively big job will eventually obtain resources, which solves the starvation problem. To age jobs, HFSP simulates Max-Min Processor Sharing in a virtual cluster with the same characteristics as the real one.
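The allocation rule inside such a virtual cluster can be sketched as a classic max-min (water-filling) split of slots among active jobs, with each job's virtual progress then advancing at the rate of its share. The code below illustrates the allocation step only, under our own simplifying assumptions; it is not HFSP's implementation:

```python
def max_min_shares(demands, capacity):
    """Max-min fair split of `capacity` task slots among jobs.

    demands maps a job to the number of slots it can still use.
    Jobs wanting less than the equal share are fully satisfied,
    and their surplus is redistributed among the remaining jobs.
    """
    alloc = {}
    remaining = dict(demands)
    cap = capacity
    while remaining:
        share = cap / len(remaining)
        small = {j: d for j, d in remaining.items() if d <= share}
        if not small:  # every remaining job wants at least the equal share
            for j in remaining:
                alloc[j] = share
            return alloc
        for j, d in small.items():
            alloc[j] = d
            cap -= d
            del remaining[j]
    return alloc

# "a" only needs 2 of the 12 slots; "b" and "c" split the rest evenly.
print(max_min_shares({"a": 2, "b": 10, "c": 10}, 12))
# → {'a': 2, 'b': 5.0, 'c': 5.0}
```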


Preemption for MapReduce systems

A recent preemption mechanism for Hadoop is Natjam [65]. Unlike our work, which relies on the OS to suspend and resume jobs, Natjam operates at the "application layer": it saves counters about task progress, which allow tasks to be resumed by fast-forwarding to their previous state. Since the state of the Java Virtual Machine (JVM) is lost, however, Natjam cannot be applied seamlessly to arbitrary tasks: many MapReduce programming patterns involve keeping state within the task JVM [103]. This problem is exacerbated by the fact that many MapReduce jobs are generated by high-level languages such as Apache Pig [105] or Apache Hive [104]: jobs compiled by these frameworks are highly likely to rely on such "tricks", which hinders the applicability of Natjam.
Natjam proposes to handle such stateful tasks with hooks that systematically serialize and deserialize task state. Besides requiring manual intervention to support suspension, this approach always pays the overhead of serializing, writing to disk, and deserializing a state that can be large. In contrast, our approach does not incur a systematic serialization overhead, since it relies on OS paging to swap task state to disk if and when needed.
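OS-level suspension of this kind can be sketched with standard POSIX job-control signals. The snippet below is a minimal illustration, not the mechanism detailed in Chapter 6: a stopped process keeps its full in-memory state, and the kernel, not the application, decides if and when that memory is paged out:

```python
import os
import signal
import subprocess
import time

def suspend(pid):
    """Suspend a task's process; its memory is paged out lazily by
    the OS, only under memory pressure (no explicit serialization)."""
    os.kill(pid, signal.SIGSTOP)

def resume(pid):
    """Resume the process; JVM state was never lost, so any
    in-memory task state survives untouched."""
    os.kill(pid, signal.SIGCONT)

# Illustrative: suspend a child briefly, then let it finish normally.
child = subprocess.Popen(["sleep", "0.2"])
suspend(child.pid)
time.sleep(0.1)
resume(child.pid)
child.wait()
print(child.returncode)  # 0
```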


Table of Contents
1 Introduction 
1.1 Motivations
1.2 Challenges
1.3 Contributions and Thesis Organization
1.3.1 Size-Based Scheduling with Estimated Sizes
1.3.2 The Hadoop Fair Sojourn Protocol
1.3.3 OS-Assisted Task Preemption
1.3.4 Conclusion and Perspectives
2 Background and Related Work
2.1 Background
2.1.1 Performance Metrics
2.1.2 Scheduling Policies
2.1.3 MapReduce
2.2 Related Work
2.2.1 Size-Based Scheduling Policies with Inexact Job Sizes
2.2.2 Scheduling for DISC Systems
2.2.3 Job Size Estimation in MapReduce
2.2.4 Preemption for MapReduce systems
3 Size-based scheduling with estimated sizes 
3.1 Scheduling Based on Estimated Sizes
3.1.1 SRPTE and FSPE
3.1.2 FSPE+PS
3.2 Evaluation Methodology
3.2.1 Scheduling Policies Under Evaluation
3.2.2 Parameter Settings
3.2.3 Simulator Implementation Details
3.3 Experimental Results
3.3.1 Synthetic Workloads
3.3.2 Real Workloads
3.4 Summary
4 The Hadoop Fair Sojourn Protocol 
4.1 Jobs
4.2 The Estimation Module
4.2.1 Tiny Jobs
4.2.2 Initial Size
4.2.3 Final Size
4.3 The Aging Module
4.3.1 Virtual Cluster
4.3.2 Aging Speed
4.3.3 Estimated Size Update
4.3.4 Failures
4.3.5 Manual Priority and QoS
4.4 The Scheduling Policy
4.4.1 Job Submission
4.4.2 Priority To The Training Stage
4.4.3 Virtual Time Update
4.4.4 Data Locality
4.4.5 Scheduling Algorithm
4.5 Impact of Estimation Errors
4.6 Task Preemption
4.6.1 Task Selection
4.6.2 When Preemption Occurs
4.7 Summary
5 System Evaluation 
5.1 The BigFoot Platform
5.2 Experimental Setup
5.3 Macro Benchmarks
5.3.1 Response Time
5.3.2 Slowdown
5.4 Micro Benchmarks
5.5 Summary
6 OS-Assisted Task Preemption 
6.1 OS-assisted Task Preemption
6.1.1 Suspension and Paging in the OS
6.1.2 Thrashing
6.1.3 Implementation Details
6.2 Experimental Evaluation
6.2.1 Experimental Setup
6.2.2 Performance Metrics
6.2.3 Results
6.2.4 Impact of Memory Footprint
6.3 Discussion
6.3.1 Scheduling and Eviction Strategies
6.3.2 Implications on Task Implementation
6.4 Summary
7 Conclusion 
7.1 Future Work
A French Summary 
A.1 Motivations
A.2 Challenges
A.3 Contributions and Thesis Organization
A.3.1 Size-Based Scheduling with Estimated Sizes
A.3.2 The Hadoop Fair Sojourn Protocol
A.3.3 OS-Assisted Task Preemption
A.3.4 Conclusion and Perspectives
A.4 Future Work
