Tuning the Symphony of Heterogeneous Memory Systems


Image generated by DALL·E through ChatGPT.

Heterogeneous Memory: A Journey Toward Performance

The heterogeneous memory system is pivotal in today’s computer architecture. By integrating various types of memory for specific tasks and organizing them as a hierarchy, the heterogeneous memory system achieves a balance of performance, capacity, and energy efficiency.

However, the scope of heterogeneous memory systems is limited by the cache-coherence domain. A conventional memory system only allows processors to coherently access their own caches and main memory using load and store instructions. To access remote memories, such as PCIe-attached memory or storage, processors must rely on software drivers, because PCIe and similar interconnects lack cache coherence. This limitation forces sub-optimal design choices on new memory technologies. One option is to place the memory on the main memory bus, which preserves backward-compatible access via load/store instructions but sacrifices performance to meet memory bus constraints such as power delivery and thermal dissipation. The other is to place the memory on the PCIe bus, which trades programmability for performance.

Compute Express Link (CXL) addresses such challenges by introducing a cache-coherent interconnection standard. This feature enables shared access to memory resources, while maintaining data consistency across multiple caches. In general, CXL ensures that all connected components, such as CPUs, GPUs, and memory devices, will access the most recent data without the typical delays or complexities associated with non-coherent systems. Such a coherent link may also be used to create a more composable memory system by allowing future memory technologies to be easily interconnected.

Yet CXL is just one of the new technologies on our way toward heterogeneous memory systems, which are becoming increasingly extensible to keep up with the ever-growing performance requirements of modern workloads. As an example, our recent study demonstrated that efficiently utilizing the heterogeneous memory systems enabled by such technologies effectively expands both memory capacity and bandwidth; this expansion enables the training of different types of large machine learning models with far fewer GPUs.

To fully exploit the performance of a heterogeneous memory system, the first step is to characterize it; only then can we delve into its architecture design and, finally, simulate the architecture for current and future designs. To achieve these goals, we have developed and continue to enhance a series of open-source tools: LENS allows researchers to reverse-engineer the memory architecture throughout the hierarchy, and VANS provides a simulation infrastructure to model heterogeneous memory and design future systems.

Characterization: The Quest for Uncharted Knowledge

Over the years, we have been building and enhancing LENS (a low-level heterogeneous memory profiler) for performance characterization. LENS is a software framework with a large set of carefully crafted microbenchmarks, each designed to trigger a specific hardware behavior. Because different memory architecture properties produce different latency and bandwidth patterns, analyzing these patterns lets us identify the corresponding microarchitecture properties and parameters in a heterogeneous memory system. LENS also employs mechanisms to reduce noise in the results, including running benchmarks in Linux kernel space, turning off cache prefetchers, and even changing the CPU cache write-back policy.
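To make the idea of inferring a hidden parameter from a latency pattern concrete, here is a minimal Python sketch. It is not LENS code: ToyMemory is a hypothetical stand-in for a device with an internal LRU buffer, the 100 ns and 300 ns latencies are arbitrary placeholders rather than measurements, and infer_buffer_lines is an illustrative helper. The probe sweeps working sets of growing size and reports the largest one that still runs at the fast latency.

```python
from collections import OrderedDict


class ToyMemory:
    """Hypothetical device: a small LRU buffer in front of slower media."""

    def __init__(self, buffer_lines, hit_ns=100, miss_ns=300):
        self.buffer_lines = buffer_lines
        self.hit_ns = hit_ns      # placeholder latency, not a measurement
        self.miss_ns = miss_ns    # placeholder latency, not a measurement
        self.lru = OrderedDict()

    def load(self, line):
        """Return the latency (ns) of loading one line."""
        if line in self.lru:
            self.lru.move_to_end(line)
            return self.hit_ns
        self.lru[line] = True
        if len(self.lru) > self.buffer_lines:
            self.lru.popitem(last=False)  # evict the least recently used line
        return self.miss_ns


def mean_latency(mem, working_set, passes=4):
    """Sweep `working_set` lines repeatedly; average latency after a warm-up pass."""
    for line in range(working_set):
        mem.load(line)
    total = 0
    for _ in range(passes):
        for line in range(working_set):
            total += mem.load(line)
    return total / (passes * working_set)


def infer_buffer_lines(make_mem, max_lines):
    """Return the largest power-of-two working set that still runs at the fast latency."""
    curve = []
    n = 1
    while n <= max_lines:
        curve.append((n, mean_latency(make_mem(), n)))  # fresh device per probe
        n *= 2
    fast = curve[0][1]
    fitting = [size for size, lat in curve if lat <= fast * 1.1]
    return max(fitting), curve


estimate, curve = infer_buffer_lines(lambda: ToyMemory(buffer_lines=256), max_lines=1024)
print("inferred buffer capacity (lines):", estimate)
```

The probe treats the device as a black box and only looks at the shape of the latency curve. Because it only tries power-of-two working sets, the estimate is a lower bound when the real capacity falls between two probe sizes; a finer sweep around the jump would tighten it.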

LENS allows researchers to reverse-engineer various memory architecture designs without having to rely on the public information provided by memory and systems vendors. The uncovered architecture components and their performance properties, such as latency and bandwidth, offer explanations for application-level performance characteristics.


Figure 1. Characterizing heterogeneous memory with LENS.

LENS is implemented in standard x86 assembly and C, without any specialized library or compiler. Therefore, LENS can be used on any type of memory device, as long as Linux maps it into the memory address space. As shown in Figure 1, we have been using LENS to profile CPU caches, main memory, and even FPGA memory attached to PCIe. As an ongoing effort, we are also using LENS to profile CXL memory systems. We may write another blog article to report our results in the near future.
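The point that one benchmark can target any mapped memory can be illustrated with a small Python sketch. It is only an analogy to what LENS does in C and assembly: stride_sum is a hypothetical helper, and a temporary file stands in for a device that the operating system exposes through a mappable file. The same function runs unchanged on an anonymous mapping and on the file-backed one.

```python
import mmap
import tempfile


def stride_sum(buf, stride=64):
    """Write, then read back, one byte per `stride` bytes of any writable buffer."""
    for off in range(0, len(buf), stride):
        buf[off] = (off // stride) % 256
    return sum(buf[off] for off in range(0, len(buf), stride))


size = 1 << 16

# Anonymous mapping: stands in for ordinary main memory.
anon = mmap.mmap(-1, size)

# File-backed mapping: the temporary file is a stand-in for a mapped device.
with tempfile.TemporaryFile() as f:
    f.truncate(size)
    filemap = mmap.mmap(f.fileno(), size)
    same = stride_sum(anon) == stride_sum(filemap)
    filemap.close()
anon.close()

print("identical access pattern on both mappings:", same)
```

Python timings would be dominated by interpreter overhead, so this sketch checks only that the access pattern is portable across mappings, not how fast each one is.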

Simulation: Prelude to Today’s Model, Overture to Tomorrow’s Design

Architectural characterization offers insight into the current memory design and highlights potential avenues for future improvements. To evaluate such design choices, we built a simulation infrastructure, VANS, to accurately simulate the heterogeneous memory architecture.


Figure 2. Example setup of heterogeneous memory based on VANS.

VANS consists of building blocks for memory architecture simulation. The simulation model incorporates a set of pre-built memory architecture components, such as queues, caches, buffers, and DRAM. These components can be connected and configured through VANS’s configuration file to simulate various types of memory architectures. This flexibility allows users to chain the simulated devices as a tiered memory or organize multiple types of memory devices in a flat tier to study heterogeneous memory. An example is illustrated in Figure 2. VANS also offers an interface for constructing customized components and controllers to explore new memory types, such as processing-in-memory. In addition, VANS can operate independently or support full-system simulations by integrating with other simulators, such as gem5.
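The idea of chaining configurable components into a tiered memory can be sketched in a few lines of Python. This is not VANS's API or configuration format: Component, build_chain, the spec layout, and all latency and capacity values are hypothetical and chosen only for illustration. Each component serves an access itself on a hit and forwards it to the component below on a miss, so the total latency accumulates along the chain.

```python
from collections import OrderedDict


class Component:
    """Hypothetical simulated component: an LRU-managed tier or a backing store."""

    def __init__(self, name, latency_ns, capacity=None, below=None):
        self.name = name
        self.latency_ns = latency_ns
        self.capacity = capacity  # lines held; None means "holds everything"
        self.below = below        # next component in the chain, if any
        self.lines = OrderedDict()

    def access(self, line):
        """Return the total latency (ns) of one access, forwarding misses downward."""
        if self.capacity is None:
            return self.latency_ns  # backing store: always hits
        if line in self.lines:
            self.lines.move_to_end(line)
            return self.latency_ns
        self.lines[line] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line
        return self.latency_ns + self.below.access(line)


def build_chain(spec):
    """Build a chain from a top-down list of dicts (a stand-in for a config file).

    The last entry must omit `capacity` so that every miss eventually resolves.
    """
    below = None
    for entry in reversed(spec):
        below = Component(below=below, **entry)
    return below


spec = [
    {"name": "buffer", "latency_ns": 10, "capacity": 2},
    {"name": "dram_cache", "latency_ns": 50, "capacity": 4},
    {"name": "media", "latency_ns": 300},
]
top = build_chain(spec)
trace = [0, 1, 2, 0, 0]
latencies = [top.access(line) for line in trace]
print(latencies)
```

In this trace, the first three accesses miss in every tier, the fourth finds line 0 evicted from the small buffer but still present in the middle tier, and the fifth hits in the buffer. Reordering or resizing the entries in spec changes the simulated hierarchy without touching the component code, which is the property that makes a configuration-driven design convenient for design-space exploration.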


As memory systems incorporate increasingly diverse memory architectures, a crucial factor in achieving high performance is the comprehensive understanding and simulation of these architectures. We hope our tools will serve as a platform to facilitate the research and development endeavors of future memory systems.

About the authors:

Zixuan Wang is currently a PhD candidate in the Computer Science and Engineering Department at the University of California, San Diego. His research concerns the scalability and security of heterogeneous systems. He is currently on the academic job market, seeking assistant professor positions.

Jishen Zhao is an Associate Professor in the Computer Science and Engineering Department at the University of California, San Diego. Her research spans and stretches the boundary between computer architecture and system software, with an emphasis on memory systems, machine learning for systems, and system support for smart applications.

Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.
