Process-Level Power Estimation in Multi-Core Systems


CPU Power Models

Over the last decade, the design of CPU power models has been regularly considered by the research community [Bel00; Col+15b; Kan+10; McC+11; VWT13]. Currently, the closest approach to hardware-based monitoring is RAPL, introduced with the Intel “Sandy Bridge” architecture to report on the power consumption of the entire CPU package. As this feature is not available on other architectures and is not always accurate [Col+15b], CPU power models are generally designed from a wider diversity of raw metrics.
Standard operating system metrics (CPU, memory, disk, or network), directly computed by the kernel, tend to exhibit a large error rate due to their lack of precision [Kan+10; VWT13]. Contrary to usage statistics, hardware performance counters (HPCs) can be obtained directly from the processor (e.g., number of retired instructions, cache misses, non-halted cycles). Modern processors provide a variable number of HPCs, depending on architectures and generations. As shown by Bellosa [Bel00] and Bircher [BJ07], some HPCs are highly correlated with the processor power consumption, whereas the authors of [RRK08] conclude that several performance counters are not useful, as they are not directly correlated with dynamic power. This correlation nevertheless depends on the processor architecture, and a CPU power model computed from specific HPCs may therefore not be portable to different settings and architectures. Furthermore, the number of HPCs that can be monitored simultaneously is limited and depends on the underlying architecture [Int15a], which further limits the portability of a CPU power model. Selecting the relevant HPCs therefore remains a tedious task, regardless of the CPU architecture.
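On Linux, such counters are typically exposed through the perf subsystem. The sketch below is a simplified illustration, not taken from any of the cited works: it shells out to `perf stat` in machine-readable CSV mode and parses the counter values, where the helper names (`parse_perf_csv`, `sample_counters`) are hypothetical.

```python
import subprocess

def parse_perf_csv(output):
    """Parse `perf stat -x,` CSV output into {event_name: value}.

    With -x',', the first field is the counter value and the third is
    the event name; a '<not counted>' value means the counter could not
    be scheduled (e.g., more events requested than available HPC slots).
    """
    counters = {}
    for line in output.splitlines():
        fields = line.split(",")
        if len(fields) < 3 or not fields[0]:
            continue
        value, event = fields[0], fields[2]
        if value.startswith("<"):  # <not counted> / <not supported>
            counters[event] = None
        else:
            counters[event] = int(value)
    return counters

def sample_counters(events, duration_s=1.0):
    """Sample system-wide counters for `duration_s` seconds."""
    cmd = ["perf", "stat", "-a", "-x,", "-e", ",".join(events),
           "sleep", str(duration_s)]
    # perf writes its statistics to stderr, not stdout
    result = subprocess.run(cmd, capture_output=True, text=True)
    return parse_perf_csv(result.stderr)
```

A call such as `sample_counters(["instructions", "cache-misses"])` would then return one reading per event, or `None` for events the hardware could not schedule concurrently, which is precisely the multiplexing limitation discussed above.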
Power modeling often builds on these raw metrics to apply learning techniques—for example based on sampling [Ber+10]—to correlate the metrics with hardware power measurements using various regression models, which are so far mostly linear [McC+11].
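This sampling-and-regression step can be sketched as an ordinary least-squares fit, assuming NumPy is available. The HPC readings, power values, and the `estimate_power` helper below are synthetic illustrations, not data from any of the cited works.

```python
import numpy as np

# Hypothetical training data: each row is one sample period.
# Columns: HPC readings (e.g., non-halted cycles, retired
# instructions, LLC misses); y holds the matching power-meter
# readings in watts.
X = np.array([
    [1.0e9, 0.8e9, 1.2e6],
    [2.1e9, 1.9e9, 2.5e6],
    [3.0e9, 2.7e9, 4.1e6],
    [0.5e9, 0.4e9, 0.6e6],
])
y = np.array([35.0, 52.0, 68.0, 28.0])

# Append a constant column so the model also learns the static
# (idle) power alongside the per-counter dynamic coefficients.
A = np.column_stack([X, np.ones(len(X))])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

def estimate_power(hpc_sample, coeffs=coeffs):
    """Predict power (W) for one vector of HPC readings."""
    return float(np.dot(np.append(hpc_sample, 1.0), coeffs))
```

The learned `coeffs` play the role of the linear power model: at runtime, each new vector of counter readings is turned into a power estimation by a single dot product.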
A. Aroca et al. propose to model several hardware components (CPU, disk, network). It is the closest approach to our empirical learning method described in Section 4.1.1. They use the lookbusy tool to generate a CPU load for each available frequency and a fixed number of active cores, capturing active cycles per second (ACPS) and raw power measurements while loading the CPU. A polynomial regression is then used to derive one power model per combination of frequency and number of active cores. They validate their power models on a single processor (Intel Xeon W3520) using a map-reduce Hadoop application. During the validation, the authors were not able to correctly capture the input parameter of their power model—i.e., the overall CPU load—and used an estimation instead. The resulting “tuned” power model, with all components together, exhibits an error rate of 7% compared to the total amount of energy consumed. Bertran et al. [Ber+10] model the power consumption of an Intel Core2 Duo by selecting 14 HPCs based on a priori knowledge of the underlying architecture. To compute their model, the authors feed both the selected HPCs and power measurements into a multivariate linear regression. A modified version of perfmon2 is used to collect the raw HPC values. In particular, the authors developed 97 specific micro-benchmarks to stress each identified component in isolation. These benchmarks are written in C and assembly, and cannot be generalized to other architectures. They assess their solution with the SPEC CPU 2006 benchmark suite, reporting an error rate of 5% on a multi-core architecture.
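The per-configuration fitting strategy described for A. Aroca et al. can be sketched as follows, assuming NumPy. The `fit_models` and `estimate` helpers, and the shape of the sample data, are hypothetical illustrations of the idea, not the authors' actual code.

```python
import numpy as np

def fit_models(samples, degree=2):
    """Fit one polynomial power model per (frequency, #cores) pair.

    `samples` maps (freq_mhz, n_cores) -> list of (load, watts)
    pairs gathered while a load generator (such as lookbusy) holds
    the CPU at a fixed utilisation level.
    """
    models = {}
    for config, points in samples.items():
        load = np.array([p[0] for p in points])
        power = np.array([p[1] for p in points])
        # Coefficients are stored highest-degree first (polyfit order)
        models[config] = np.polyfit(load, power, degree)
    return models

def estimate(models, freq_mhz, n_cores, load):
    """Evaluate the model of the matching configuration."""
    return float(np.polyval(models[(freq_mhz, n_cores)], load))
```

Keeping one model per (frequency, active cores) combination sidesteps the need for a single formula covering DVFS effects, at the cost of a calibration run for every configuration.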
Bircher et al. [Bir+05] propose a power model for an Intel Pentium 4 processor. They provide a first model that uses the number of fetched µ-operations per cycle, reporting an average error rate of 2.6%. As this model performed better for benchmarks inducing integer operations, the authors refine it to account for floating point operations. Their second power model consequently builds on 2 HPCs: the µ-operations delivered by the trace cache and the µ-operations delivered by the µ-code ROM. This model is assessed using the SPEC CPU 2000 benchmark suite, split into 10 groups. One benchmark is selected per group to train the model and the remaining ones are used to assess the estimation. Overall, the resulting CPU power model reports an average error of 2.5%.
In [CM05], Contreras et al. propose a multivariate linear CPU power model for the Intel XScale PXA255 processor, additionally considering the processor’s different CPU frequencies to build a more accurate model. They carefully selected the HPCs with the best correlation while avoiding redundancy, resulting in the selection of only 5 HPCs. In their paper, they also consider the power drawn by the main memory, using 2 HPCs already part of the CPU power model. However, given that this processor can only monitor 2 events concurrently, they cannot implement an efficient and usable runtime power estimation. They test their solution on SPEC CPU 2000, Java CDC, and Java CLDC, and report an average error rate of 4% compared to the measured average power consumption.

VM Power Models

In data centers, the efficiency of VM consolidation, power-dependent cost modeling, and power provisioning is highly dependent on accurate power models [VAN08]. Such models are particularly needed because it is not possible to attach a power meter to a virtual machine [Kri+11]. In general, VMs can be monitored as black-box systems for coarse-grained scheduling decisions. To make fine-grained scheduling decisions, e.g., on heterogeneous hardware, we need finer-grained estimations at the sub-system level and might even need to step inside the VM.
So far, fine-grained power estimation of VMs requires profiling each application separately. One example is WattApp [KVN10], which relies on application throughput instead of performance counters as a basis for the power model. The developers of pMapper [VAN08] argue that application power estimation is not feasible and instead perform resource mapping using a centralized step-wise decision algorithm.
To generalize power estimation, some systems like Joulemeter [Kan+10] assume that each VM only hosts a single application and thus treat VMs as black boxes. In a multi-VM system, they try to compute the resource usage of each VM in isolation and feed the resulting values into a power model. Bertran et al. [Ber+12] use a sampling phase to gather data related to HPCs and compute energy models from these samples. With the gathered energy models, it is possible to predict the power consumption of a process, and therefore to estimate the power consumption of an entire VM. Their work does not, however, consider modern CPU features. Another example is given by Bohra et al. in [BC10], where the authors propose a tool named VMeter that estimates the consumption of all active VMs on a system. A linear model computes each VM’s power consumption from the statistics available on each physical node (processor utilization and I/O accesses). The total power consumption is subsequently computed by summing the VMs’ consumption with the power consumed by the infrastructure.
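The utilization-based attribution described for VMeter can be illustrated with a deliberately simplified sketch: the `vm_power` helper and its interface are hypothetical, not VMeter's actual API, and it reduces the per-node statistics to a single CPU-share term.

```python
def vm_power(host_power, idle_power, vm_cpu_shares):
    """Split a host's dynamic power across its VMs.

    `host_power`    -- measured power of the whole node (W)
    `idle_power`    -- static/infrastructure power of the node (W)
    `vm_cpu_shares` -- {vm_name: share of total CPU utilisation},
                       with shares normalised to sum to 1
    """
    dynamic = host_power - idle_power
    estimates = {vm: share * dynamic
                 for vm, share in vm_cpu_shares.items()}
    # The static part is reported separately, mirroring the idea of
    # summing VM consumption with the infrastructure's consumption.
    estimates["infrastructure"] = idle_power
    return estimates
```

For instance, `vm_power(100.0, 40.0, {"vm1": 0.75, "vm2": 0.25})` attributes the 60 W of dynamic power in a 3:1 ratio; the per-VM estimates plus the infrastructure term add back up to the measured host power.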
Janacek et al. [Jan+12] use a linear power model to compute the server consumption through postmortem analysis. The computed power consumption is then mapped to VMs depending on their load. This technique is not applicable when runtime information is required.
Stoess et al. [SLB07] argue that, in virtualized environments, energy monitoring has to be integrated within the VM as well as the hypervisor. To that end, they use the L4 micro-kernel as hypervisor and adapt a guest OS to run on L4. They assume that each device driver is able to expose the power consumption of the corresponding device, and that the guest operating system is energy-aware; their approach is limited to integer applications. For application-level power monitoring, the VM connects to the hypervisor and maps virtualized performance counters to the hardware counters.


Hardware-Level Granularity

PowerMon2 [Bed+10] uses an external power monitoring board inserted between a system’s power supply and the motherboard. This board makes it possible to retrieve the power consumption of each connected hardware component, and can be physically integrated into a target system by fitting into a 3.5″ drive bay. PowerMon2 can measure up to 8 individual DC rails, allowing several rails to be attached to the motherboard in addition to hardware components (GPUs, disks, etc.). It can read and report power measurements of hardware components at a rate of up to 3 kHz through a USB interface. All schematics and source code are freely available online, but this solution requires a rather expensive investment (up to $150).
PowerInsight [LPD13] follows the same principle as PowerMon2 and is built on top of another external board (BeagleBone) that uses an ARM Cortex processor. This board can be connected to up to 15 components and acquires power measurements from custom power sensing boards attached to it. It was first designed to work within a cluster, where one board must be installed and configured per available node. Each board is then connected through Ethernet and sends the acquired data to a master node, which is responsible for aggregating the data for postmortem analysis. PowerInsight can provide user-space samples at a rate of up to 1 kHz, but the authors only validate their approach at 1 Hz.
RAPL [Rot+12] offers specific HPCs to retrieve the power consumption of the CPU package since the “Sandy Bridge” architecture. Intel divides the system into domains (PP0, PP1, PKG, DRAM) to retrieve various power information according to the requested context. The first domain, PP0, represents the core activity of the processor (cores + L1 + L2 + L3), the PP1 domain the uncore activities (LLC, integrated graphic cards, etc.), and PKG represents the sum of PP0 and PP1. The last domain, DRAM, only exhibits the RAM power consumption. RAPL can thus easily be used on recent Intel architectures, as it does not require any hardware modification. Moreover, it can also be used as a power capping solution to limit the CPU power consumption. However, it is limited to specific processor generations and, further, to Intel processors.

Isci et al. [IM03] describe an approach for learning CPU power models based on 15 predefined HPCs for 22 selected processor subunits. In addition, they propose a live CPU power monitoring solution that involves several modules. First, a reader runs inside the system under test to collect the values of the selected HPCs. Once collected, the values are sent over the network to a logger machine. This logger combines the power model with the extracted values to produce live power estimations of the 22 processor subunits. With this approach, the authors show runtime power estimations, divided per involved subunit, for a single running application.
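On recent Linux kernels, the RAPL counters discussed above are exposed through the powercap sysfs interface. The sketch below illustrates reading them under that assumption; the domain path (`intel-rapl:0`, typically the first PKG domain, though numbering varies across machines) and the helper names are not from the cited work.

```python
RAPL_ENERGY_FILE = "/sys/class/powercap/intel-rapl:0/energy_uj"
MAX_ENERGY_FILE = "/sys/class/powercap/intel-rapl:0/max_energy_range_uj"

def read_energy_uj(path=RAPL_ENERGY_FILE):
    """Read the cumulative package energy counter, in microjoules."""
    with open(path) as f:
        return int(f.read())

def average_power(e_start_uj, e_end_uj, interval_s, max_range_uj):
    """Average power (W) between two energy-counter readings.

    The counter is cumulative and wraps around at `max_range_uj`, so
    a smaller end value means a wrap occurred (assuming the sampling
    interval is short enough for at most one wrap).
    """
    delta = e_end_uj - e_start_uj
    if delta < 0:
        delta += max_range_uj
    return delta / (interval_s * 1e6)  # microjoules -> joules -> watts
```

A monitoring loop would call `read_energy_uj` twice, `interval_s` seconds apart, and feed both readings to `average_power`; because RAPL reports energy rather than power, every power figure is necessarily an average over a sampling interval.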
Building on their CPU-usage-based power model, Lien et al. [LBL07] propose a window-based GUI to monitor, in real time, the overall power consumption of Windows streaming-media servers.

Process-Level Granularity

The PowerPack [Ge+10] framework can monitor all hardware components separately. To retrieve power measurements, a precision sensing resistor is attached to each DC power line, making it possible to measure voltage differences with a power meter. This approach is not limited to a single physical power meter but can use several, such as an NI data acquisition board, a Watt’s Up Pro, or the ACPI interface. Power measurements are collected simultaneously on all power lines to be representative of all hardware components. The retrieved data are then recorded and used in postmortem analysis. The authors consider their approach able to provide per-process power estimation, but they only consider one concurrent application during their validation. They also mention that it can target a cluster, but only one node can be monitored at a time. The authors then propose to replay the same workload m times, where m represents the number of nodes, which is not a suitable solution in practice. Furthermore, the different components used for acquiring or analyzing data are expensive.
WattProf [Ras+15] supports the profiling of High-Performance Computing applications. Like PowerMon2, PowerInsight, and PowerPack, this solution uses a custom external board as its cornerstone. This board is fully configurable and can collect raw power measurements from various connected hardware components (CPU, disk, memory, etc.) through external sensors attached to power lines. The board can be connected to up to 128 sensors, which can be sampled at up to 12 kHz. The data can then be retrieved via an Ethernet interface, or buffered inside the board for later analysis. As in [Ge+10], the authors argue that this solution can perform per-process power estimation, but they only validate their approach while running a single application.
WattWatcher [LeB+15] is a tool that can characterize the power consumption of workloads. The authors use several calibration phases to compute a power model that fits a modern CPU architecture. This power model uses a large predefined set of HPCs as input parameters. As the authors use a power model generator that can target any CPU architecture, they have to describe the underlying CPU architecture entirely in a configuration file. This file contains all the mappings required to match specific HPCs of the underlying CPU with the ones used by the generator, thus requiring a deep knowledge of the underlying architecture. To limit the overhead, the generator is located on another machine, requiring at least 2 machines in total. An efficient network connection is also required to send data to the generator and to monitor the power estimation in real time.

Table of contents:

Table of Contents
Acronyms
List of Figures
List of Tables
List of Snippets
1 Introduction 
1.1 Problem Statement
1.2 Thesis Goals
1.3 Contributions
1.4 Publications
1.5 Outline
I State-of-the-Art 
2 Learning Power Models 
2.1 CPU Power Models
2.2 VM Power Models
2.3 Disk Power Models
3 Power Measurement Granularities 
3.1 Hardware-Level Granularity
3.2 Process-Level Granularity
3.3 Code-Level Granularity
II Contributions 
4 Learning Power Models Automatically
4.1 Learning CPU Power Models
4.1.1 Empirical Approach
4.1.2 Architecture-Agnostic Approach
4.2 Learning SSD Power Models
4.2.1 Empirical Approach
5 Building Software-Defined Power Meters “à la carte” 
5.1 The Need For Software-Defined Power Meters
5.2 PowerAPI, a Middleware Toolkit
5.3 PowerAPI’s Modules
5.4 PowerAPI’s Assemblies
III Evaluations 
6 Process-Level Power Estimation in Multi-Core Systems 
6.1 Assessing CPU Power Models
6.1.1 Empirical Learning
6.1.2 Architecture-Agnostic Learning
6.2 Assessing SSD Power Models
6.2.1 Empirical Learning
6.3 Assessing Software-Defined Power Meters
6.3.1 Domain-Specific CPU Power Models
6.3.2 Real-Time Power monitoring
6.3.3 Process-Level Power Monitoring
6.3.4 Adaptive CPU Power Models
6.3.5 System Impact on CPU Power Models
7 SaaS-Level Power Estimation 
7.1 Process-Level Power Estimation in VMs
7.1.1 BitWatts, Middleware Toolkit for VMs
7.1.2 Power Consumption Communication Channels
7.1.3 Virtual CPU Power Model
7.1.4 Experimental Setup
7.1.5 Scaling the Number of VMs
7.1.6 Scaling the Number of Hosts
7.2 SD Power Monitoring of Distributed Systems
7.2.1 Case Study
7.2.2 Enabling Service-Level Power Monitoring
7.2.3 To a Service-Level Power Model
7.2.4 WattsKit, a SD Power Meter for Distributed Services
7.2.5 Monitoring the Service-Level Power Consumption
7.2.6 Analyzing the Power Consumption Per Service
8 Code-Level Power Estimation in Multi-Core Systems 
8.1 codEnergy, In-Depth Energy Analysis of Source-Code
8.1.1 codAgent, the Runtime Observer
8.1.2 codEctor, the Code-Level Software-Defined Power Meter
8.1.3 codData, the Storage Solution
8.1.4 codVizu, the Visualizer for Code Energy Distribution
8.2 codEnergy’s Overhead
8.3 Study the Methods Energy Distribution of redis
8.3.1 Comparing the Energy Evolution of redis Over Versions
8.3.2 Comparing the Energy Impacts of redis Configurations
9 Conclusion & Perspectives 
9.1 Summary of the Dissertation
9.2 Contributions
9.3 Short-Term Perspectives
9.4 Long-Term Perspectives
Bibliography
