Get Complete Project Material File(s) Now! »

## Schedule Search Tree

The traditional goal of CSPs is to find solutions to a problem. In our case, finding a solution is trivial. A topological sort of the instructions following data dependencies would suﬃce. Instead we are looking for the best solution we can obtain in a limited amount of time. Ideally, we would generate all possible schedules and return the best one. For this, we would build a binary tree where each node is a partial order and the child of a node are the possible orderings between a pair of unordered instructions. This tree corresponds to the possible instantiations of the pairwise orderings by the CSP search procedure, with variables instantiated in a fixed order.

To build the tree, we start from the root representing the full domain. The two children of a node are the domains obtained by respectively assigning before and after to the next unspecified variable and propagating constraints. The leaves of the tree are fully specified schedules for which we can compute an execution time, while the nodes of the tree represent the set of schedules in their sub-tree. Figure 2.1 presents a schedule search tree for a basic block with three instructions (a, b and c) and no data dependencies. Each level of the tree specifies the order between two instructions, with two possible outcomes.

Building the entire tree would, of course, be infeasible when the number of instructions grows. Thus, we only build the parts of the tree in which we expect to find the best implementations. The idea is that we can reason on partially specified schedules to detect uninteresting nodes early in the search tree, thus avoiding building large parts of the tree.

**Optimistic Model of Partially Specified Schedules**

We now present a performance model of partially specified schedules to prune the search tree. The performance model gives a lower bound B(d) on the execution time of any schedule licensed by the domain d ∈ Dom. If S is the set of all valid schedules, then:

∀s ∈ S. s ∈ con(d) =⇒ Ts ≥ B(d) (2.19)

From the point of view of the search tree, the bound obtained for an internal node is a lower bound for the execution time of all the schedules in the subtree rooted in the node.

The performance model relies on a graph that represents timestamp constraints between instructions. Given a domain d, the vertices are the set of instructions I and the edges between any two instructions a, b ∈ I are defined as follows.

• If a is scheduled before b, that is d(order(a, b)) = {before}, then E contains an edge from a to b with weight 1. This corresponds to the constraint imposed by equation (2.1)

• If b depends on a (a b), then E contains an edge from a to b with weight l(a). This corresponds to the constraint imposed by equation (2.3).

We compute B(d) as the weight of the longest path in the graph plus the latency of the last instruction. This corresponds to equation (2.4). Constraints (2.15–2.16) ensure there are no loops in the graph so the longest path weight is well-defined. When the domain is fixed, the bound is equal to the execution time. Otherwise, the graph only contains edges that appear in the graph of every schedule licensed by the domain and the longest path in the graph is a lower bound of possible execution times.

This bound allows us to prune the search tree using a branch and bound strategy. The search algorithm discards branches for which the bound B is bigger than the current best candidate. Because this model works on partially specified implementations, we can prune whole parts of the schedule space at once. We also ensure than we never prune any optimal schedules.

### Summary

This chapter presents a solution to the problem of scheduling instructions within a basic block. For this it first formalized the problem and gave some background on CSPs. It them reformulated the problem of scheduling instructions as a CSP. Intermediate steps of the CSP resolution represent partial orders, while solution of the CSP represent fully specifed implementations of the basic block. In practice, scheduling instructions in straight-line code is only one of the many aspects of choosing an implementation for a kernel.

Chapter 3 builds on the ideas of representing sets of potential implementation using the CSP formal-ism. It presents a generic framework that defines a CSP for each kernel from a declarative description of implementation choices and constraints. Chapter 4 then apply this framework to represent implementa-tion spaces of linear algebra kernels on GPUs, supporting complex transformations. It shows that while the formalism of this chapter may seem overly complex, it is critical when considering several classes of optimizations, interacting with each other.

This chapter also proposes to build a performance model to prune parts of the schedule space. Chapter 5 extends this idea to compute a lower bound for the execution time of any implementation that derives from a partially specified implementation. We show that such models can extract pertinent information early in the compilation process, greatly improving the performance of the search for eﬃcient implementations.

**Representing Candidate Implementations**

This chapter introduces and formalizes the concept of candidates, that are partially specified implementa-tions, with some implementation decisions left open while others are fixed. A candidate thus represents a whole set of potential implementations of the same kernel, up to semantics-preserving implementa-tion decisions. The compilation process starts with all choices open and iteratively specifies them until reaching a fully specified implementation.

Not all combinations of decisions are valid. Candidates define a Constraint Satisfaction Problem (CSP) to model interactions between decisions, respect the semantics of the kernel and enforce hardware limitations. Because manually writing the code to enforce constraints is a hard and error-prone process, we propose a domain specific language (DSL) to describe implementation choices and express constraints between them. This defines a decision space that we can instantiate on diﬀerent kernel to represent candidates.

A dedicated compiler transforms a specification using this DSL into code that we then use to construct the search tree. This consists in generating a data structure that holds the list of available decisions, the code that builds this data structure from the kernel description and the code that propagates constraints on decisions. The DSL and its compiler make our approach independent of a particular encoding of transformations or a of a particular kernel representation. It enables us to easily try out diﬀerent decision spaces and allows external users to define their own decision space for diﬀerent application domains.

Contributions. This chapter focuses on the formalism and the generic framework to define candidate implementations, while Chapter 4 explains how we use our framework to build a decision space for linear algebra kernels. Specifically, this chapter makes the following contributions.

• It introduces the concept of candidate implementation.

• It presents a formalism to define candidates independently of a particular kernel.

• It presents a declarative language and an associated compiler that implements the formalism. Together, they make it easier to extend the implementation space with new decisions or to apply our approach to other domains of applications.

Outline. Section 3.1 gives a definition of candidate implementations and explains how we expose potential decisions. Section 3.2 then explains how we can specify the decisions and constraints that make up candidates independently of kernel instances. This section introduces the concept of decision space, that lists generic choices and constraints that can be instantiated on any kernel. It also introduces the concept of lowering, that allows us to lower some high-level constructs in the kernel definition when certain conditions are met. Section 3.3 defines a declarative language that defines decision spaces. Section 3.4 illustrate our approach by applying our formalism to the problem of Chapter 2. Then, Section 3.5 explains how we generate code from our domain specific language to represent and manipulate candidates. Finally, Sections 3.6 and 3.7 compares our work to existing approaches and discuss how our framework helps finding eﬃcient implementations.

**Candidate Implementations**

We represent the problem of choosing among possible implementations of a computational kernel as a CSP. Each variable of the CSP materializes an implementation choice. The initial domain of such a variable contains one value for each possible outcome of the choice. Not all combinations of decisions are valid, hence constraints ensure decisions are coherent. An implementation is a solution of the CSP, that is, an assignment from choices to outcomes that satisfies the constraints. This section assumes that the encoding of the implementation space into a CSP hdK , CK i for any kernel K is given. Sections 3.2 and 3.4 and Chapter 4 later explain how we build this encoding from a kernel-independent description.

#### Representation of Potential Implementations.

Following the motivating example of Chapter 2, we use domains to represent and manipulate sets of possible implementations.

Definition 3.1 (Candidate and Implementation). A candidate is a pair (K, d), composed of a kernel K to implement and a domain d for the CSP hdK , CK i. The domain d is thus a more constrained version of the initial domain dK of the CSP: d ⊆ dK . An implementation of (K, d) is a solution of hd, CK i. The implementation space J(K, d)K of a (K, d) is the set of all solutions of hd, CK i.

The domain of a candidate lists possible decisions for each choice. A search algorithm enforces a decision by restricting the domain of the corresponding choice. Constraints may forbid some combinations of decisions. Following the formalism of CSPs, we use propagators that implement the constraints to prune the domains from invalid values. To maximize the information about the validity of decisions, we only consider candidates with domains on which propagators have reached a fixed point. We thus run propagators at the creation and after each modification of the domain. Even then, some invalid value may remain undetected until further choices are specified. In fact, it might be possible that a candidate does not represent any valid implementation at all if one of the variable domains only contains invalid values. CSP solvers solve this problem by backtracking.

In our case, Section 4.5.2 shows that the constraints we use are simple enough constraints to generate accurate propagators that almost never miss invalid values, and thus accurately represent possible deci-sions. Our search strategy, which we present in Chapter 6, already backtracks when it encounters some candidates that only lead to ineﬃcient implementations. The impact of backtracking due to suboptimal propagators is negligible compared to the number of backtracking due to the search strategy. The defi-nition of propagators (Section 2.2) ensures no invalid value remains in fixed domains, thus ensuring that fully specified candidates are valid.

**Table of contents :**

**1 Introduction **

1.1 Hardware Accelerators

1.2 Challenges of Code Generation for Hardware Accelerators

1.3 Our Solution: Partially Specified Implementations

1.4 Key Contributions

1.5 Organization

**2 Motivation: Partially Specified Schedules **

2.1 Problem Statement: Scheduling a Basic Block

2.2 Background: Constraint Satisfaction Problems

2.3 Partially Specified Schedule

2.4 Optimistic Model of Partially Specified Schedules

2.5 Summary

**3 Representing Candidate Implementations **

3.1 Candidate Implementations

3.2 Decision Space Definition

3.3 Candidate Representation Description Language

3.4 Example: Instructions Scheduling Within a Basic Block

3.5 Constraint Propagation Code Generation

3.6 Discussion

3.7 Summary

**4 Application: Linear Algebra on GPUs **

4.1 Background: GPUs Architecture

4.2 Kernel Representation

4.3 Decisions Encoding

4.4 Code Generation

4.5 Discussion

4.6 Summary

**5 An Optimistic Performance Model of Candidates **

5.1 Model Overview

5.2 Single Thread Model

5.3 Instantiation of the Model for Kepler GPUs

5.4 Empirical Evaluation

5.5 Discussion

5.6 Summary

**6 Scaling the Search with the Properties of Candidates **

6.1 Search Algorithm

6.2 Sample Kernels

6.3 Empirical Evaluation

6.4 Discussion

6.5 Summary

**7 Conclusion **

7.1 Principal Contributions

7.2 Long Term Perspectives