Lance's Publications

The Book: Chip Multiprocessor Architecture

A summary of the highlights of my research is now available in the concise book Chip Multiprocessor Architecture, by Kunle Olukotun, myself, and James Laudon (Morgan & Claypool, December 2007). Several chapters are condensed, streamlined versions of text from the publications below.

A PDF version is available directly from Morgan & Claypool's Synthesis Lectures on Computer Architecture website.

Paper versions are available from many technical booksellers, including Amazon.com.

A Guide:

I authored or co-authored all of the following publications during my years at Stanford. All of the following documents are downloadable in Adobe PDF format if you would like to read the entire article, report, or paper.

We start with publications describing general background information about my work.

Chip Multiprocessors: The Future of Microprocessors (Sept. 2005)

The following papers describe Transactional Coherence and Consistency (TCC), a method for using speculation to eliminate conventional cache coherence and consistency models.

Characterization of TCC on Chip-Multiprocessors (Sept. 2005)
TAPE: A Transactional Application Profiling Environment (June 2005)
Transactional Coherence and Consistency: Simplifying Parallel Hardware and Software (Nov.-Dec. 2004)
Programming with Transactional Coherence and Consistency (TCC) (Oct. 2004)
Transactional Memory Coherence and Consistency (June 2004)

These describe computer architecture case studies we performed that led to the Hydra design.

A Single Chip Multiprocessor (Sep. 1997)
The Case for a Single-Chip Multiprocessor (Oct. 1996)
Evaluation of Design Alternatives for a Multiprocessor Microprocessor (May 1996)

The following papers offer views of low-level aspects of Hydra, including descriptions of the originally planned chip design and our later FPGA emulation environment.

A Flexible Architecture for Simulation and Testing (FAST) Multiprocessor Systems (Feb. 2005)
The Stanford Hydra CMP (Aug. 1999 / Mar.-Apr. 2000)
Improving the Performance of Speculatively Parallel Applications on the Hydra CMP (June 1999)
Data Speculation Support for a Chip Multiprocessor (Oct. 1998)
Considerations in the Design of Hydra: A Multiprocessor-on-a-Chip Microarchitecture (Feb. 1998)

My PhD dissertation includes most of the information in the "low-level" papers, organized into a logical sequence, plus a large amount of additional details on the Hydra hardware and software design.

Hydra: A Chip Multiprocessor with Support for Speculative Thread-Level Parallelization (May 2001 / March 2002)

The following papers deal with a variant of Hydra we explored using DRAM instead of SRAM as a basis for the large, on-chip secondary memory bank.

A Single Chip Multiprocessor Integrated with High Density DRAM (Aug. 1999)
The Hierarchical Multi-Bank DRAM: A High-Performance Architecture for Memory Integrated with Processors (Aug. 1997)
A Single Chip Multiprocessor Integrated with DRAM (June/Aug. 1997)

The Publications, in reverse chronological order:

Chip Multiprocessors: The Future of Microprocessors
by Kunle Olukotun and Lance Hammond, in ACM Queue Magazine, September 2005.
This article summarizes, in a concise and easy-to-read form, our arguments from the late 1990s for why chip multiprocessors will replace conventional, monolithic microprocessors. It also includes some updates to account for changes in the marketplace since our original publications appeared (namely, the appearance of real CMPs and the "power wall" problem).
Characterization of TCC on Chip-Multiprocessors
by Austen McDonald, JaeWoong Chung, Hassan Chafi, Chi Cao Minh, Brian D. Carlstrom, Lance Hammond, Christos Kozyrakis, and Kunle Olukotun, in 14th International Conference on Parallel Architectures and Compilation Techniques, St. Louis, Missouri, September 2005.
This is a further analysis of the TCC system described in 2004's papers. However, we have been able to perform the analysis in more detail here because it is now performed on a fully execution-driven simulator, instead of the trace-based simulator used by the older papers.
TAPE: A Transactional Application Profiling Environment
by Hassan Chafi, Chi Cao Minh, Austen McDonald, Brian D. Carlstrom, JaeWoong Chung, Lance Hammond, Christos Kozyrakis, and Kunle Olukotun, in 19th ACM International Conference on Supercomputing, Cambridge, MA, June 2005.
This is a description of TAPE, a mechanism that could be added to a TCC system in order to measure runtime statistics from executing programs. These statistics could then be used to adjust how the target programs are parallelized.
A Flexible Architecture for Simulation and Testing (FAST) Multiprocessor Systems
by John D. Davis, Lance Hammond, and Kunle Olukotun, in the Workshop on "Architecture Research using FPGA Platforms" preceding the 11th International Symposium on High-Performance Computer Architecture, Feb. 2005.
This is a one-page abstract summarizing FAST, a board designed to emulate the Hydra design with several FPGAs, some SRAM, and four MIPS hard-core processors and FPUs.
Transactional Coherence and Consistency: Simplifying Parallel Hardware and Software
by Lance Hammond, Brian D. Carlstrom, Vicky Wong, Mike Chen, Christos Kozyrakis, and Kunle Olukotun, in IEEE MICRO Magazine, Nov.-Dec. 2004.
This magazine article, chosen for MICRO's "Top Picks from Computer Architecture Conferences" issue looking at 2004's most important research papers, summarizes highlights from the ISCA and ASPLOS papers in a shorter and somewhat easier-to-read format.
Programming with Transactional Coherence and Consistency (TCC)
----> The full Conference Paper
----> The Talk Slides
by Lance Hammond, Brian D. Carlstrom, Vicky Wong, Ben Hertzberg, Mike Chen, Christos Kozyrakis, and Kunle Olukotun, in Proceedings of the Eleventh ACM Conference on Architectural Support for Programming Languages and Operating Systems, Boston, Massachusetts, October 2004.
Transactional Coherence and Consistency (TCC) offers a way to simplify parallel programming by executing all code within transactions. In TCC systems, transactions serve as the fundamental unit of parallel work, communication and coherence. As each transaction completes, it writes all of its newly produced state to shared memory atomically, while restarting other processors that have speculatively read stale data. With this mechanism, a TCC-based system automatically handles data synchronization correctly, without programmer intervention. To gain the benefits of TCC, programs must be decomposed into transactions. We describe two basic programming language constructs for decomposing programs into transactions, a loop conversion syntax and a general transaction-forking mechanism. With these constructs, writing correct parallel programs requires only small, incremental changes to correct sequential programs. The performance of these programs may then easily be optimized, based on feedback from real program execution, using a few simple techniques.
Transactional Memory Coherence and Consistency
----> The full Conference Paper
----> The Talk Slides
by Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg, Manohar K. Prabhu, Honggo Wijaya, Christos Kozyrakis, and Kunle Olukotun, in Proceedings of the 31st International Symposium on Computer Architecture, June 2004.
In this paper, we propose a new shared memory model: Transactional memory Coherence and Consistency (TCC). TCC provides a model in which atomic transactions are always the basic unit of parallel work, communication, memory coherence, and memory reference consistency. TCC greatly simplifies parallel software by eliminating the need for synchronization using conventional locks and semaphores, along with their complexities.
TCC hardware must combine all writes from each transaction region in a program into a single packet and broadcast this packet to the permanent shared memory state atomically as a large block. This simplifies the coherence hardware because it reduces the need for small, low-latency messages and completely eliminates the need for conventional snoopy cache coherence protocols, as multiple speculatively written versions of a cache line may safely coexist within the system. Meanwhile, automatic, hardware-controlled rollback of speculative transactions resolves any correctness violations that may occur when several processors attempt to read and write the same data simultaneously. The cost of this simplified scheme is higher interprocessor bandwidth.
To explore the costs and benefits of TCC, we study the characteristics of an optimal transaction-based memory system, and examine how different design parameters could affect the performance of real systems. Across a spectrum of applications, the TCC model itself did not limit available parallelism. Most applications are easily divided into transactions requiring only small write buffers, on the order of 4-8 KB. The broadcast requirements of TCC are high, but are well within the capabilities of CMPs and small-scale SMPs with high-speed interconnects.
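As a rough illustration of the commit-and-restart behavior described in the TCC abstracts above, here is a minimal Python sketch. It is not code from the papers: the Transaction class, the commit function, and the three-transaction workload are hypothetical names invented for this example, and shared memory is modeled as a plain dictionary rather than as any real hardware structure.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    """Hypothetical model of one speculative transaction (not from the papers)."""
    name: str
    read_set: set = field(default_factory=set)
    write_buffer: dict = field(default_factory=dict)
    restarts: int = 0

    def read(self, memory, addr):
        # A transaction sees its own buffered writes first.
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        # Otherwise the read comes from shared memory and is tracked.
        self.read_set.add(addr)
        return memory.get(addr, 0)

    def write(self, addr, value):
        # Writes stay in the local buffer until commit.
        self.write_buffer[addr] = value

    def restart(self):
        # Discard all speculative state; the work would be re-executed.
        self.read_set.clear()
        self.write_buffer.clear()
        self.restarts += 1


def commit(memory, committer, others):
    """Apply the committer's writes as one block, then restart every other
    transaction that has already read one of the written addresses."""
    written = set(committer.write_buffer)
    memory.update(committer.write_buffer)
    violated = [t for t in others if t.read_set & written]
    for t in violated:
        t.restart()
    committer.read_set.clear()
    committer.write_buffer.clear()
    return [t.name for t in violated]


memory = {"x": 1, "y": 2}
t1, t2, t3 = Transaction("t1"), Transaction("t2"), Transaction("t3")
t1.write("x", t1.read(memory, "x") + 10)  # t1 reads x and buffers a new x
t2.write("y", t2.read(memory, "x") * 2)   # t2 also reads the old x
t3.write("z", t3.read(memory, "y"))       # t3 touches only y and z
restarted = commit(memory, t1, [t2, t3])  # t1 commits first
print(restarted, memory)
```

In this toy run only t2 is restarted, because it is the only other transaction that read an address t1 wrote; t3 keeps its buffered work. The sketch ignores commit arbitration, buffer overflow, and bandwidth, which the papers treat in detail.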
Hydra: A Chip Multiprocessor with Support for Speculative Thread-Level Parallelization
----> The full thesis
----> The orals talk slides
by Lance Hammond, orals given May 2001 and dissertation filed March 2002.
This thesis describes the design and provides a detailed analysis of Hydra, a chip multiprocessor (CMP) made up of four normal MIPS cores, each with its own primary instruction and data caches. The cores are connected to each other, a shared on-chip secondary cache, and a high-speed off-chip DRAM interface by a pair of wide, pipelined buses that are specialized to support reads and writes, with relatively simple cache coherency protocols. The basic design supports interprocessor communication latencies on the order of 10 cycles, using the shared secondary cache, allowing a much wider variety of programs to be parallelized than is possible with a conventional, multichip multiprocessor. Our simulations show that such a design allows excellent speedup on most matrix-intensive floating point and multiprogrammed applications, but achieves speedup only comparable to a superscalar processor of similar area, at best, on the large family of integer applications that can really take advantage of the low communication latencies provided.
In order to make execution of integer programs easier on Hydra, we examined the possibility of adding thread-level speculation (TLS) support to Hydra. This is a mechanism in which processors are enhanced so that they can attempt to execute threads from a sequential program in parallel without knowing in advance whether the threads are parallel or not. The speculation hardware then monitors data produced and consumed by the different threads to ensure that no thread attempts to use data too early, before it is actually produced. If such an attempt is made, the offending thread is restarted. In this manner, threads may be generated from existing program constructs such as loops or subroutines almost automatically. Such support can be added to Hydra simply, with a few extra bits attached to the primary caches and some speculation buffers attached to the shared secondary cache. In practice, we found that most of our integer applications could be sped up to a level comparable to or better than an equal-area superscalar processor or our hand-parallelized benchmarks, and with very little programmer effort. However, we usually had to apply several manual optimization techniques to the code to achieve this speedup.
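The "data used too early" check described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the Hydra hardware or its software handlers: the find_violations function, the event tuple format, and the sample trace are all hypothetical, and threads are identified by their position in the original sequential order (a lower index means earlier).

```python
def find_violations(events):
    """Return the thread indexes that must restart.

    events: list of (thread_index, op, addr) tuples in the order the
    accesses happen in time; op is "read" or "write". This format is an
    assumption made for the sketch.
    """
    reads = {}     # addr -> threads holding an exposed speculative read
    written = {}   # thread -> addresses that thread has written itself
    restart = set()
    for thread, op, addr in events:
        if op == "read":
            # A read of the thread's own earlier write is not exposed.
            if addr not in written.get(thread, set()):
                reads.setdefault(addr, set()).add(thread)
        elif op == "write":
            written.setdefault(thread, set()).add(addr)
            # Any later thread that already read this address used the
            # data before it was produced, so it must be restarted.
            for reader in reads.get(addr, set()):
                if reader > thread:
                    restart.add(reader)
        else:
            raise ValueError(f"unknown op: {op}")
    return restart


events = [
    (1, "read", "a"),   # thread 1 reads a before thread 0 produces it
    (0, "write", "a"),  # violation: thread 1 consumed a stale value
    (0, "write", "b"),
    (1, "read", "b"),   # read happens after the write, so no violation
    (2, "write", "c"),
    (2, "read", "c"),   # thread 2 reads its own write, not exposed
    (1, "write", "c"),  # no violation: thread 2 never depended on it
]
violations = find_violations(events)
print(sorted(violations))
```

The model leaves out everything that makes the real design interesting (speculation buffers, forwarding between threads, and the cost of restarting), but it shows why threads can be taken from loops or subroutines without prior dependence analysis: ordering mistakes are detected at run time and repaired by re-execution.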
The Stanford Hydra CMP
----> The full MICRO article
----> The Hot Chips Talk Slides
by Lance Hammond, Ben Hubbert, Michael Siu, Manohar Prabhu, Mike Chen, and Kunle Olukotun in IEEE MICRO Magazine, March-April 2000, and presented at Hot Chips 11, August 1999.
The Hydra chip multiprocessor (CMP) integrates four MIPS-based processors, each with pairs of primary caches, on a single chip together with a shared secondary cache. A standard CMP offers implementation and performance advantages compared to wide-issue superscalar designs. With multiprogrammed workloads or highly parallel applications a CMP can significantly outperform a uniprocessor of comparable cost by running multiple threads. However, it must be programmed with a more complicated parallel programming model to obtain maximum performance. To simplify parallel programming, the Hydra CMP supports thread-level speculation and memory renaming. This small addition allows performance similar to a uniprocessor of comparable die area on integer programs. This article motivates the design of a CMP, describes the features of the Hydra design with a focus on the speculative thread support, and finally describes our prototype implementation.
A Single Chip Multiprocessor Integrated with High Density DRAM
by Tadaaki Yamauchi, Lance Hammond, Kunle Olukotun, and Kazutami Arimoto, in the IEICE Transactions on Electronics, August 1999, pp. 1567-1577.
This is an expanded form of the earlier "A Single Chip Multiprocessor Integrated with DRAM" workshop paper and technical report. It has been expanded with additional analysis and benchmarks and turned into a full journal paper.
No online version is currently available.
Improving the Performance of Speculatively Parallel Applications on the Hydra CMP
by Kunle Olukotun, Lance Hammond, and Mark Willey, in Proceedings of the 1999 ACM International Conference on Supercomputing, Rhodes, Greece, June 1999.
Hydra is a chip multiprocessor (CMP) with integrated support for thread-level speculation. Thread-level speculation provides a way to parallelize sequential programs without the need for data dependence analysis or synchronization. This makes it possible to parallelize applications for which static memory dependence analysis is difficult or impossible. While performance of the baseline Hydra system on applications with medium to large grain parallelism is good, the performance on integer applications with fine-grained parallelism is unimpressive. In this paper, we describe a collection of software and hardware techniques for improving speculation performance over the baseline speculative Hydra CMP. These techniques focus on reducing the overheads associated with speculation and improving the speculation behavior of the applications using code restructuring. When these techniques are applied to a set of eleven integer, multimedia and floating-point benchmarks, significant performance improvements result. In particular, the overall performance of the integer benchmarks is improved by seventy-five percent.
Data Speculation Support for a Chip Multiprocessor
----> The full Conference Paper
----> The Talk Slides
by Lance Hammond, Mark Willey, and Kunle Olukotun, in Proceedings of the Eighth ACM Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, California, October 1998.
Thread-level speculation is a technique that enables parallel execution of sequential applications on a multiprocessor. This paper describes the complete implementation of the support for thread-level speculation on the Hydra chip multiprocessor (CMP). The support consists of a number of software speculation control handlers and modifications to the shared secondary cache memory system of the CMP. This support is evaluated using five representative integer applications. Our results show that the speculative support is only able to improve performance when there is a substantial amount of medium-grained loop-level parallelism in the application. When the granularity of parallelism is too small or there is little inherent parallelism in the application, the overhead of the software handlers overwhelms any potential performance benefits from speculative-thread parallelism. Overall, thread-level speculation still appears to be a promising approach for expanding the class of applications that can be automatically parallelized, but more hardware intensive implementations for managing speculation control are required to achieve performance improvements on a wide class of integer applications.
Considerations in the Design of Hydra: A Multiprocessor-on-a-Chip Microarchitecture
by Lance Hammond and Kunle Olukotun, published in CSL Technical Report #749, February 1998.
Hydra offers a promising way to build a small-scale MP-on-a-chip using a fairly simple design that still maintains excellent performance on a wide variety of applications. This report examines key parts of the Hydra design -- the memory hierarchy, the on-chip buses, and the control and arbitration mechanisms -- and explains the rationale for some of the decisions made in the course of finalizing the design of this memory system, with particular emphasis given to applications that stress the memory system with numerous memory accesses. With the balance between complexity and performance that we obtain, we feel Hydra offers a promising model for future MP-on-a-chip designs.
A Single Chip Multiprocessor
by Lance Hammond, Basem Nayfeh, and Kunle Olukotun, in IEEE Computer Magazine, September 1997.
Integrated circuit processing technology offers increasing integration density, which fuels microprocessor performance growth. Within 10 years it will be possible to integrate a billion transistors on a reasonably sized silicon chip. At this integration level, it is necessary to find parallelism to effectively utilize the transistors. Conventional processor designers advocate using the large numbers of transistors that will be available on these chips for building massive, monolithic uniprocessors using advanced out-of-order and branch prediction techniques. However, we feel that the limited amount of parallelism in conventional instruction streams will cause drastically diminishing performance returns on the transistor investments made in these huge processors, since the techniques used in advanced uniprocessors rely so heavily on extraction of this limited instruction-level parallelism. These processors will also consist of numerous closely-coupled logic blocks that will be increasingly difficult to design and implement successfully in huge future microchips. Hence, we suggest the single-chip multiprocessor as a way to use future transistor budgets in a helpful and relatively easy-to-design manner.
This article offers an easy-to-read comparison of a future chip multiprocessor, a comparable uniprocessor, and a hybrid simultaneously multithreaded design, analyzing and discussing the performance and design benefits of each before presenting a selection of simulated performance data.
The Hierarchical Multi-Bank DRAM: A High-Performance Architecture for Memory Integrated with Processors
by Tadaaki Yamauchi, Lance Hammond, and Kunle Olukotun, in the Proceedings of the 19th Conference on Advanced Research in VLSI, September 1997.
A microprocessor integrated with DRAM on the same die has the potential to improve system performance by reducing the memory latency and improving the memory bandwidth. However, a high performance microprocessor will typically send more accesses than the DRAM can handle due to the long cycle time of the embedded DRAM, especially in applications with significant memory requirements.
A multi-bank DRAM can hide the long cycle time by allowing the DRAM to process multiple accesses in parallel, but it will incur a significant area penalty and will therefore restrict the density of the embedded DRAM main memory. In this paper, we propose a hierarchical multi-bank DRAM architecture to achieve high system performance with a minimal area penalty. In this architecture, the independent memory banks are each divided into many semi-independent subbanks that share I/O and decoder resources.
A hierarchical multi-bank DRAM with 4 main banks each composed of 32 subbanks occupies approximately the same area as a conventional 4 bank DRAM while performing like a 32 bank one -- up to 65% better than a conventional 4 bank DRAM when integrated with a Hydra single-chip multiprocessor.
A Single Chip Multiprocessor Integrated with DRAM
----> The shorter Workshop Version
----> The full Tech Report Version
by Tadaaki Yamauchi, Lance Hammond, and Kunle Olukotun, in the Workshop on "Mixing Logic and DRAM" preceding the 24th International Symposium on Computer Architecture, June 1997, or, in longer form, as published in CSL Technical Report #731, August 1997.
In this paper we evaluate the performance of a single chip multiprocessor integrated with DRAM when the DRAM is organized as on-chip main memory and as on-chip cache. We compare the performance of this architecture with that of a more conventional chip which only has SRAM-based on-chip cache.
The DRAM-based architecture with four processors outperforms the SRAM-based architecture on floating point applications, which are very parallelizable and typically have large working sets. This performance difference is significantly better than that possible in a uniprocessor DRAM-based architecture, which performs only slightly faster than an SRAM-based architecture on the same applications. In addition, on multiprogrammed workloads, in which independent processes are assigned to every processor in a single chip multiprocessor, the large bandwidth of on-chip DRAM can handle the inter-access contention better. These results demonstrate that a multiprocessor takes better advantage of the large bandwidth provided by the on-chip DRAM than a uniprocessor.
The Case for a Single-Chip Multiprocessor
by Kunle Olukotun, Basem Nayfeh, Lance Hammond, Ken Wilson, and Kunyung Chang, in Proceedings of the Seventh ACM Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, Massachusetts, October 1996.
Advances in IC processing allow for more microprocessor design options. The increasing gate density and cost of wires in advanced integrated circuit technologies require that we look for new ways to use their capabilities effectively. This paper shows that in advanced technologies it is possible to implement a single-chip multiprocessor in the same area as a wide issue superscalar processor. We find that for applications with little parallelism, the performance of the two microarchitectures is comparable. For applications with large amounts of parallelism at both the fine and coarse grained levels, the multiprocessor microarchitecture outperforms the superscalar architecture by a significant margin. Single-chip multiprocessor architectures have the advantage in that they offer localized implementation of a high-clock rate processor for inherently sequential applications and low latency interprocessor communication for parallel applications.
Evaluation of Design Alternatives for a Multiprocessor Microprocessor
by Basem Nayfeh, Lance Hammond, and Kunle Olukotun, in Proceedings of the 23rd International Symposium on Computer Architecture, May 1996.
In this paper we consider three architectures: shared-primary cache, shared-secondary cache, and shared-memory. We evaluate these three architectures using a complete system simulation environment which models the CPU, memory hierarchy and I/O devices in sufficient detail to boot and run a commercial operating system. Within our simulation environment, we measure performance using representative hand and compiler generated parallel applications, and a multiprogramming workload. Our results show that when applications exhibit fine-grained sharing, both shared-primary and shared-secondary architectures perform similarly when the full costs of sharing the primary cache are included.
This older study was done when we considered building Hydra on an MCM instead of a single chip, so the latencies and overall design of the shared-L2 cache model reflect this design strategy.

Last Revised 7/21/08