Optimize for Energy

Ned Bingham

The concepts introduced by Von Neumann in 1945 remain the centerpiece of computer architectures to this day. His programmable model for general-purpose computation, combined with a relentless march toward increasingly efficient devices, cultivated significant long-term advancement in the performance and power-efficiency of general-purpose computers. For a long time, chip area was the limiting factor and raw instruction throughput was the goal, leaving energy largely ignored. However, technology scaling has shown diminishing returns, and the technology landscape has shifted considerably over the last 15 years.

Around 2007, three things happened. First, Apple released the iPhone, opening a new industry of mobile devices with limited access to power. Second, chips produced on technology nodes following Intel's 90nm process ceased scaling frequency () as power density collided with the limitations of air cooling (). For the first time in the industry's history, a chip could not run all of its transistors at full throughput without exceeding the thermal limits imposed by standard cooling technology. By 2011, up to 80% of transistors had to remain off at any given time .

History of the clock frequency of Intel's processors.
History of the power density in Intel's processors. Frequency, Thermal Design Point (TDP), and Die Area were scraped for all Intel processors. Frequency and TDP/Die Area were then averaged over all processors in each technology. Switching Energy was roughly estimated from and and combined with Frequency and Die Area to compute Power Density.
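The estimation method described in the caption above can be sketched as a short calculation. This is a minimal illustration of the arithmetic only; the function name, the activity-factor parameter, and the example numbers are placeholders, not the actual scraped Intel data.

```python
def power_density(switching_energy_j, frequency_hz, die_area_mm2, activity=1.0):
    """Rough dynamic power density estimate in W/mm^2.

    Dynamic power is approximated as the aggregate switching energy per
    cycle times the clock frequency, scaled by an activity factor;
    dividing by die area yields power density.
    """
    power_w = switching_energy_j * frequency_hz * activity
    return power_w / die_area_mm2

# Illustrative numbers: 33 nJ of aggregate switching energy per cycle
# at 3 GHz on a 100 mm^2 die gives roughly 1 W/mm^2.
print(power_density(33e-9, 3e9, 100.0))
```

The activity factor is what makes the "dark silicon" observation visible in such an estimate: holding power density at the air-cooling limit while energy per switch stops improving forces the activity factor down.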

Third, the growth in wire delay relative to frequency introduced new difficulties in clock distribution. Specifically, around the introduction of the 90nm process, global wire delay was just long enough relative to the clock period to prevent reliable distribution across the whole chip ().

Wire and Gate Delay across process technology nodes. These were roughly estimated from and

As a result of these factors, the throughput of sequential programs stopped scaling after 2005 (). The industry adapted, turning its focus toward parallelism. In 2006, Intel's Spec benchmark scores jumped by 135% with the transition from NetBurst to the Core microarchitecture, which dropped the base clock speed to optimize energy and doubled the width of the issue queue from two to four, targeting Instruction-Level Parallelism (ILP) instead of the raw execution speed of sequential operations . Afterward, performance grew steadily as architectures continued to optimize for ILP. While Spec2000 focused on sequential tasks, Spec2006 introduced more parallel tasks .

History of SpecINT base mean, with benchmarks scaled appropriately .

By 2012, Intel had pushed most competitors out of the desktop CPU market, and chips following Intel's 32nm process ceased scaling total transistor counts. While smaller feature sizes supported higher transistor density, they also brought higher defect density (), causing yield losses that make larger chips significantly more expensive ().

History of Intel process technology defect density. Intel's defect density trends were very roughly estimated from and .
History of transistor count in Intel chips. Transistor density was averaged over all Intel processors developed in each technology.

Today, energy has superseded area as the limiting factor, and architects must balance throughput against energy per operation. Furthermore, improvements in parallel programs have slowed due to a combination of factors (). First, for many applications, all available parallelism has already been exploited. Second, limitations in power density and device counts have put an upper bound on the number of computations that can be performed at any given time. And third, memory bandwidth has lagged behind compute throughput, introducing a bottleneck that limits the amount of data that can be communicated at any given time () .

History of memory and compute peak bandwidth.
John Von Neumann. First Draft of a Report on the EDVAC. Annals of the History of Computing, Volume 15, Number 4, Pages 27-75. IEEE, 1993.
Hadi Esmaeilzadeh, et al. Dark Silicon and the End of Multicore Scaling. 38th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2011.
Mark Bohr. Silicon Technology Leadership for the Mobility Era. Intel Developer Forum, 2012. (mirror)
SPEC CPU Subcommittee. SPEC CPU Benchmarks. 1992.
Sanjay Natarajan, et al. Process Development and Manufacturing of High-Performance Microprocessors on 300mm Wafers. Intel Technology Journal, Volume 6, Number 2. May 2002. (mirror)
Kelin J. Kuhn. CMOS Transistor Scaling Past 32nm and Implications on Variation. Advanced Semiconductor Manufacturing Conference, 2010. (mirror)
Bill Holt. Advancing Moore's Law. Investor Meeting, Santa Clara, 2015. (mirror)
Eugene S. Meieran. 21st Century Semiconductor Manufacturing Capabilities. Intel Technology Journal. 4th Quarter, 1998. (mirror)
Linley Gwennap. Estimating IC Manufacturing Costs: Die Size, Process Type Are Key Factors in Microprocessor Cost. Microprocessor Report, Volume 7. August 1993. (data mirror)
John D. McCalpin. STREAM: Sustainable Memory Bandwidth in High Performance Computers. Department of Computer Science, School of Engineering and Applied Science, University of Virginia, 1991. Accessed: August 8, 2019. Available: https://www.cs.virginia.edu/stream/.
Intel. Energy-Efficient, High Performing and Stylish Intel-Based Computers to Come with Intel® Core™ Microarchitecture. Intel Developer Forum, San Francisco, CA, March 2006. (mirror)
Venkatesan Packirisamy, et al. Exploring Speculative Parallelism in SPEC2006. International Symposium on Performance Analysis of Systems and Software. IEEE, 2009.