A summary of the highlights of my research is now available in the concise book Chip Multiprocessor Architecture, by Kunle Olukotun, myself, and James Laudon (Morgan & Claypool, December 2007). Several chapters are condensed and streamlined versions of text from several of the publications below.
A PDF version is available directly from Morgan & Claypool's Synthesis Lectures on Computer Architecture website.
Paper versions are available from many technical booksellers, including Amazon.com.
I authored or co-authored all of the following publications during my years at Stanford. All of the following documents are downloadable in Adobe PDF format if you would like to read the entire article, report, or paper.
We start with publications describing general background information about my work.
The following papers describe Transactional Coherence and Consistency (TCC), a method for using speculation to eliminate conventional cache coherence and consistency models.
These describe computer architecture case studies we performed that led to the Hydra design.
The following papers offer views of low-level aspects of Hydra, including descriptions of the originally planned chip design and our later FPGA emulation environment.
My PhD dissertation includes most of the information in the "low-level" papers, organized into a logical sequence, plus a large amount of additional details on the Hydra hardware and software design.
The following papers deal with a variant of Hydra we explored using DRAM instead of SRAM as a basis for the large, on-chip secondary memory bank.
This concise, easy-to-read article summarizes our arguments from the late 1990s for why chip multiprocessors will replace conventional, monolithic microprocessors. It also includes some updates to account for changes in the marketplace since our original publications appeared (namely, the appearance of real CMPs and the "power wall" problem).
This is a further analysis of the TCC system described in the 2004 papers. We were able to perform the analysis in more detail here because it now runs on a fully execution-driven simulator, instead of the trace-based simulator used by the older papers.
This is a description of TAPE, a mechanism that could be added to a TCC system in order to measure runtime statistics from executing programs. These statistics could then be used to adjust how the target programs are parallelized.
This is a one-page abstract summarizing FAST, a board designed to emulate the Hydra design with several FPGAs, some SRAM, and four MIPS hard-core processors and FPUs.
This magazine article, chosen for MICRO's "Top Picks from Computer Architecture Conferences" issue looking at 2004's most important research papers, summarizes highlights from the ISCA and ASPLOS papers in a shorter and somewhat easier-to-read format.
Transactional Coherence and Consistency (TCC) offers a way to simplify parallel programming by executing all code within transactions. In TCC systems, transactions serve as the fundamental unit of parallel work, communication and coherence. As each transaction completes, it writes all of its newly produced state to shared memory atomically, while restarting other processors that have speculatively read stale data. With this mechanism, a TCC-based system automatically handles data synchronization correctly, without programmer intervention. To gain the benefits of TCC, programs must be decomposed into transactions. We describe two basic programming language constructs for decomposing programs into transactions: a loop conversion syntax and a general transaction-forking mechanism. With these constructs, writing correct parallel programs requires only small, incremental changes to correct sequential programs. The performance of these programs may then easily be optimized, based on feedback from real program execution, using a few simple techniques.
In this paper, we propose a new shared memory model: Transactional memory Coherence and Consistency (TCC). TCC provides a model in which atomic transactions are always the basic unit of parallel work, communication, memory coherence, and memory reference consistency. TCC greatly simplifies parallel software by eliminating the need for synchronization using conventional locks and semaphores, along with their complexities.
TCC hardware must combine all writes from each transaction region in a program into a single packet and broadcast this packet to the permanent shared memory state atomically as a large block. This simplifies the coherence hardware because it reduces the need for small, low-latency messages and completely eliminates the need for conventional snoopy cache coherence protocols, as multiple speculatively written versions of a cache line may safely coexist within the system. Meanwhile, automatic, hardware-controlled rollback of speculative transactions resolves any correctness violations that may occur when several processors attempt to read and write the same data simultaneously. The cost of this simplified scheme is higher interprocessor bandwidth.
To explore the costs and benefits of TCC, we study the characteristics of an optimal transaction-based memory system, and examine how different design parameters could affect the performance of real systems. Across a spectrum of applications, the TCC model itself did not limit available parallelism. Most applications are easily divided into transactions requiring only small write buffers, on the order of 4-8 KB. The broadcast requirements of TCC are high, but are well within the capabilities of CMPs and small-scale SMPs with high-speed interconnects.
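The commit-and-restart behavior described in this abstract can be illustrated with a toy software model. This is only a minimal sketch under stated assumptions: the `Transaction` class, the `commit` function, and the dictionary standing in for shared memory are hypothetical names chosen for illustration, not the TCC hardware design or any interface from the papers. It shows writes being buffered until commit, applied as one block, and any other transaction that speculatively read a committed address being flagged for restart.

```python
class Transaction:
    """Toy model of one processor's speculative transaction (hypothetical)."""

    def __init__(self, name, memory):
        self.name = name
        self.memory = memory      # shared dict standing in for committed state
        self.read_set = set()     # addresses read speculatively from memory
        self.write_buffer = {}    # address -> value, invisible until commit
        self.violated = False     # set when a committed write hits read_set

    def read(self, addr):
        # A transaction sees its own buffered writes first.
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        self.read_set.add(addr)
        return self.memory.get(addr, 0)

    def write(self, addr, value):
        # Writes stay local until the whole transaction commits.
        self.write_buffer[addr] = value

    def restart(self):
        # Discard all speculative state so the transaction can re-execute.
        self.read_set.clear()
        self.write_buffer.clear()
        self.violated = False


def commit(txn, others):
    """Apply txn's writes as one block and flag readers of stale data."""
    packet = dict(txn.write_buffer)   # the single commit "packet"
    txn.memory.update(packet)
    for other in others:
        if other is not txn and other.read_set.intersection(packet):
            other.violated = True
    txn.read_set.clear()
    txn.write_buffer.clear()
    return packet


# Two transactions both increment x; the first to commit wins.
memory = {"x": 1}
t1 = Transaction("P0", memory)
t2 = Transaction("P1", memory)
t2.write("x", t2.read("x") + 1)   # P1 speculatively reads x == 1
t1.write("x", t1.read("x") + 1)
commit(t1, [t1, t2])              # P0 commits x == 2; P1 read stale data
if t2.violated:
    t2.restart()
    t2.write("x", t2.read("x") + 1)   # re-execute against committed state
    commit(t2, [t1, t2])
```

In this sketch the second increment is not lost even though the code contains no locks: the stale read is detected at commit time and the losing transaction simply re-executes.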
This thesis describes the design and provides a detailed analysis of Hydra, a chip multiprocessor (CMP) made up of four normal MIPS cores, each with its own primary instruction and data caches. The cores are connected to each other, a shared on-chip secondary cache, and a high-speed off-chip DRAM interface by a pair of wide, pipelined buses that are specialized to support reads and writes, with relatively simple cache coherency protocols. The basic design supports interprocessor communication latencies on the order of 10 cycles, using the shared secondary cache, allowing a much wider variety of programs to be parallelized than is possible with a conventional, multichip multiprocessor. Our simulations show that such a design allows excellent speedup on most matrix-intensive floating point and multiprogrammed applications, but achieves speedup only comparable to a superscalar processor of similar area, at best, on the large family of integer applications that can really take advantage of the low communication latencies provided.
In order to make execution of integer programs easier on Hydra, we examined the possibility of adding thread-level speculation (TLS) support to Hydra. This is a mechanism in which processors are enhanced so that they can attempt to execute threads from a sequential program in parallel without knowing in advance whether the threads are parallel or not. The speculation hardware then monitors data produced and consumed by the different threads to ensure that no thread attempts to use data too early, before it is actually produced. If such an attempt is made, the offending thread is restarted. In this manner, threads may be generated from existing program constructs such as loops or subroutines almost automatically. Such support can be added to Hydra simply, with a few extra bits attached to the primary caches and some speculation buffers attached to the shared secondary cache. In practice, we found that most of our integer applications could be sped up to a level comparable to or better than an equal-area superscalar processor or our hand-parallelized benchmarks -- and with very little programmer effort. However, we usually had to apply several manual optimization techniques to the code to achieve this speedup.
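The "data used too early" check described in this abstract can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Hydra's actual hardware logic: the function name `find_violations` and the event-tuple format are hypothetical, threads are numbered in their original sequential order, and the model ignores cases such as a later thread overwriting a location itself before reading it.

```python
def find_violations(events):
    """Return the set of speculative threads that must restart.

    events: list of (thread_index, op, addr) tuples in the order the
    accesses actually happened. thread_index reflects sequential program
    order (lower index = earlier thread); op is "read" or "write".
    """
    readers = {}      # addr -> set of threads that have read it so far
    restart = set()
    for thread, op, addr in events:
        if op == "read":
            readers.setdefault(addr, set()).add(thread)
        elif op == "write":
            # Any later thread that already read this address consumed
            # the value before it was produced, so it must restart.
            for reader in readers.get(addr, ()):
                if reader > thread:
                    restart.add(reader)
    return restart


# Thread 1 reads "a" before thread 0 writes it: a true dependence violation.
# Thread 1 reads "b" only after thread 0 wrote it, which is safe.
trace = [
    (1, "read", "a"),
    (0, "write", "a"),
    (0, "write", "b"),
    (1, "read", "b"),
]
violations = find_violations(trace)
```

Only reads by later threads that precede an earlier thread's write count as violations; a later thread writing a location that an earlier thread already read is harmless, because the earlier thread was supposed to see the old value anyway.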
The Hydra chip multiprocessor (CMP) integrates four MIPS-based processors, each with pairs of primary caches, on a single chip together with a shared secondary cache. A standard CMP offers implementation and performance advantages compared to wide-issue superscalar designs. With multiprogrammed workloads or highly parallel applications a CMP can significantly outperform a uniprocessor of comparable cost by running multiple threads. However, it must be programmed with a more complicated parallel programming model to obtain maximum performance. To simplify parallel programming, the Hydra CMP supports thread-level speculation and memory renaming. This small addition allows performance similar to a uniprocessor of comparable die area on integer programs. This article motivates the design of a CMP, describes the features of the Hydra design with a focus on the speculative thread support, and finally describes our prototype implementation.
This is an expanded form of the earlier "A Single Chip Multiprocessor Integrated with DRAM" workshop paper and technical report. It has been expanded with additional analysis and benchmarks and turned into a full journal paper.
No online version is currently available.
Hydra is a chip multiprocessor (CMP) with integrated support for thread-level speculation. Thread-level speculation provides a way to parallelize sequential programs without the need for data dependence analysis or synchronization. This makes it possible to parallelize applications for which static memory dependence analysis is difficult or impossible. While performance of the baseline Hydra system on applications with medium to large grain parallelism is good, the performance on integer applications with fine-grained parallelism is unimpressive. In this paper, we describe a collection of software and hardware techniques for improving speculation performance over the baseline speculative Hydra CMP. These techniques focus on reducing the overheads associated with speculation and improving the speculation behavior of the applications using code restructuring. When these techniques are applied to a set of eleven integer, multimedia and floating-point benchmarks, significant performance improvements result. In particular, the overall performance of the integer benchmarks is improved by seventy-five percent.
Thread-level speculation is a technique that enables parallel execution of sequential applications on a multiprocessor. This paper describes the complete implementation of the support for thread-level speculation on the Hydra chip multiprocessor (CMP). The support consists of a number of software speculation control handlers and modifications to the shared secondary cache memory system of the CMP. This support is evaluated using five representative integer applications. Our results show that the speculative support is only able to improve performance when there is a substantial amount of medium-grained loop-level parallelism in the application. When the granularity of parallelism is too small or there is little inherent parallelism in the application, the overhead of the software handlers overwhelms any potential performance benefits from speculative-thread parallelism. Overall, thread-level speculation still appears to be a promising approach for expanding the class of applications that can be automatically parallelized, but more hardware intensive implementations for managing speculation control are required to achieve performance improvements on a wide class of integer applications.
Hydra offers a promising way to build a small-scale MP-on-a-chip using a fairly simple design that still maintains excellent performance on a wide variety of applications. This report examines key parts of the Hydra design -- the memory hierarchy, the on-chip buses, and the control and arbitration mechanisms -- and explains the rationale for some of the decisions made in the course of finalizing the design of this memory system, with particular emphasis given to applications that stress the memory system with numerous memory accesses. With the balance between complexity and performance that we obtain, we feel Hydra offers a promising model for future MP-on-a-chip designs.
Integrated circuit processing technology offers increasing integration density, which fuels microprocessor performance growth. Within 10 years it will be possible to integrate a billion transistors on a reasonably sized silicon chip. At this integration level, it is necessary to find parallelism to effectively utilize the transistors. Conventional processor designers advocate using the large numbers of transistors that will be available on these chips for building massive, monolithic uniprocessors using advanced out-of-order and branch prediction techniques. However, we feel that the limited amount of parallelism in conventional instruction streams will cause drastically diminishing performance returns on the transistor investments made in these huge processors, since the techniques used in advanced uniprocessors rely so heavily on extraction of this limited instruction-level parallelism. These processors will also consist of numerous closely-coupled logic blocks that will be increasingly difficult to design and implement successfully in huge future microchips. Hence, we suggest the single-chip multiprocessor as a way to use future transistor budgets in a helpful and relatively easy-to-design manner.
This article offers an easy-to-read comparison of a future chip multiprocessor, a comparable uniprocessor, and a hybrid simultaneously multithreaded design, analyzing and discussing the performance and design benefits of each before presenting a selection of simulated performance data.
A microprocessor integrated with DRAM on the same die has the potential to improve system performance by reducing the memory latency and improving the memory bandwidth. However, a high performance microprocessor will typically send more accesses than the DRAM can handle due to the long cycle time of the embedded DRAM, especially in applications with significant memory requirements.
A multi-bank DRAM can hide the long cycle time by allowing the DRAM to process multiple accesses in parallel, but it will incur a significant area penalty and will therefore restrict the density of the embedded DRAM main memory. In this paper, we propose a hierarchical multi-bank DRAM architecture to achieve high system performance with a minimal area penalty. In this architecture, the independent memory banks are each divided into many semi-independent subbanks that share I/O and decoder resources.
A hierarchical multi-bank DRAM with 4 main banks each composed of 32 subbanks occupies approximately the same area as a conventional 4 bank DRAM while performing like a 32 bank one -- up to 65% better than a conventional 4 bank DRAM when integrated with a Hydra single-chip multiprocessor.
In this paper we evaluate the performance of a single chip multiprocessor integrated with DRAM when the DRAM is organized as on-chip main memory and as on-chip cache. We compare the performance of this architecture with that of a more conventional chip which only has SRAM-based on-chip cache.
The DRAM-based architecture with four processors outperforms the SRAM-based architecture on floating point applications, which are very parallelizable and typically have large working sets. This performance difference is significantly better than that possible in a uniprocessor DRAM-based architecture, which performs only slightly faster than an SRAM-based architecture on the same applications. In addition, on multiprogrammed workloads, in which independent processes are assigned to every processor in a single chip multiprocessor, the large bandwidth of on-chip DRAM can handle the inter-access contention better. These results demonstrate that a multiprocessor takes better advantage of the large bandwidth provided by the on-chip DRAM than a uniprocessor.
Advances in IC processing allow for more microprocessor design options. The increasing gate density and cost of wires in advanced integrated circuit technologies require that we look for new ways to use their capabilities effectively. This paper shows that in advanced technologies it is possible to implement a single-chip multiprocessor in the same area as a wide issue superscalar processor. We find that for applications with little parallelism, the performance of the two microarchitectures is comparable. For applications with large amounts of parallelism at both the fine and coarse grained levels, the multiprocessor microarchitecture outperforms the superscalar architecture by a significant margin. Single-chip multiprocessor architectures have the advantage in that they offer localized implementation of a high-clock rate processor for inherently sequential applications and low latency interprocessor communication for parallel applications.
In this paper we consider three architectures: shared-primary cache, shared-secondary cache, and shared-memory. We evaluate these three architectures using a complete system simulation environment which models the CPU, memory hierarchy and I/O devices in sufficient detail to boot and run a commercial operating system. Within our simulation environment, we measure performance using representative hand and compiler generated parallel applications, and a multiprogramming workload. Our results show that when applications exhibit fine-grained sharing, both shared-primary and shared-secondary architectures perform similarly when the full costs of sharing the primary cache are included.
This older study was done when we considered building Hydra on an MCM instead of a single chip, so the latencies and overall design of the shared-L2 cache model reflect this design strategy.