
Current Status, Needs and Challenges in Heterogeneous and Composable Memory from HCM Workshop (HPCA’23)


Introduction

Memory systems are evolving into heterogeneous and composable architectures. Heterogeneous and Composable Memory (HCM) offers a viable solution for terabyte- or petabyte-scale systems, addressing the performance and efficiency needs of emerging big data applications. However, building and using HCM presents challenges, including interconnecting various memory technologies (for example, using Compute Express Link, or CXL), organizing memory components for optimal performance, adapting system software that was traditionally designed for homogeneous memory systems, and developing memory abstractions and programming constructs for HCM management. To understand the current state, needs and challenges in HCM, the first Workshop on Heterogeneous and Composable Memory was held in conjunction with HPCA’23 on February 26, 2023. This article summarizes the ideas and discussions shared during the workshop.


Figure 1: Heterogeneous memory with CXL (source: Maruf et al., ASPLOS’23)

Current status, use cases and needs

In the opening talk, Irina Calciu (co-founder of Graft) explored the potential of heterogeneous and disaggregated memory in modern applications (ASPLOS’21). Memory disaggregation separates memory pools from traditional server boxes, allowing for more flexible resource allocation and potentially reducing costs. There are three common mechanisms for accessing remote memory: application modification, virtual memory modification, and hardware-level cache coherence support. A combination of these mechanisms may be required to address the challenges arising from heterogeneous memory systems and NUMA architectures, and such a combination requires new programming abstractions and models for effective management. Calciu outlined multiple use cases of CXL-based memory systems, such as in-place data transformation, encryption, checkpointing, replication, near- or in-memory processing, and detection of illegal data access patterns.

Daniel S. Berger (Microsoft) shared the state of memory systems in the public cloud and predicted two use cases of CXL-based disaggregated memory in the cloud (IEEE Micro and ASPLOS’23). He characterized the state of cloud memory systems as follows: 1) there are no page faults, because memory is mostly statically pinned; 2) as memory is increasingly stranded, adding more memory may not improve performance, because the CPU cores are already saturated; 3) customers tend to overprovision memory (on average, 45% of memory is never touched); and 4) the cloud is extremely latency sensitive, whereas bandwidth is not an issue. CXL-based disaggregated memory offers a new opportunity, with two orders of magnitude better performance than RDMA (as shown in Figure 2). The recently announced CXL 3.0 lowers latency further by introducing a multi-headed device that combines switches and memory controllers. Given this state, Daniel anticipated two use cases for CXL technology in the public cloud: 1) reusing decommissioned DDR4 in DDR5 servers (CXL 1.1) and 2) creating a flexible DRAM pool shared between hosts for peak usage (CXL 2 or 3).


Figure 2: Latency characteristics of memory technologies (source: Maruf et al., ASPLOS’23)

Junhyeok Jang presented a collaborative study between KAIST and Panmnesia that showed another application of CXL technology: improving the robustness of recommendation model training (IEEE Micro). In large recommendation models, fault tolerance is important so that training can continue after errors, but checkpointing can degrade performance. Leveraging CXL-based cache-coherent communication, they introduce a platform with a CXL GPU and multiple CXL-Mem persistent memory devices that checkpoints in the background by exploiting statically determined batch shape information of the target recommendation model.
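The general idea of taking checkpoints off the critical path can be illustrated with a minimal sketch. This is not the KAIST and Panmnesia platform: it does not model CXL, persistent memory, or the batch shape information mentioned above. The function name, the toy `params` dictionary, and the in-memory `checkpoints` list (a stand-in for a persistent device) are all assumptions made for illustration. The loop snapshots the parameters and hands the write to a background thread so the next training steps can proceed.

```python
import copy
import threading


def train_with_background_checkpoints(steps, checkpoint_every):
    """Toy training loop that persists snapshots on a background thread."""
    params = {"w": 0.0}   # hypothetical model state
    checkpoints = []      # stand-in for a persistent memory device
    pending = None        # the checkpoint write currently in flight, if any

    def persist(step, snapshot):
        # In a real system this would be a slow write to durable media.
        checkpoints.append((step, snapshot))

    for step in range(1, steps + 1):
        params["w"] += 0.1  # stand-in for one training step

        if step % checkpoint_every == 0:
            # Allow at most one outstanding checkpoint at a time.
            if pending is not None:
                pending.join()
            # Copy first, so later training steps cannot alter the snapshot.
            snapshot = copy.deepcopy(params)
            pending = threading.Thread(target=persist, args=(step, snapshot))
            pending.start()

    if pending is not None:
        pending.join()
    return params, checkpoints
```

Copying the state before starting the thread is the design choice that keeps each checkpoint consistent; the cost of that copy is what a hardware-assisted approach would aim to reduce.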

Jason Lowe-Power (UC Davis) discussed intelligent memory management and the need for an efficient interface. He stressed the need to support the differing semantics of various applications on large and heterogeneous memory systems. Drawing on the power of profile-driven optimization for tensor movement and allocation between DRAM and NVDIMM (ASPLOS’20), he predicted that heterogeneous memory systems need a combination of a high-level data management interface (DMI) and a low-level DMI for efficient data block handling.

Hasan Al Maruf (University of Michigan) introduced a Linux interface to intelligently manage CXL-based tiered memory systems (ASPLOS’23). In a heterogeneous tiered memory system, memory management should include lightweight demotion to the slow memory tier, efficient promotion of hot pages to the fast memory tier, an optimized page allocation path to reduce latency, and workload-aware page allocation policies. Based on these goals, Hasan and his collaborators propose proactively identifying inactive pages and keeping them in a separate demotion pool, allowing migration to remote nodes when local memory is low. To reduce unnecessary demotions of promoted pages, they consider whether a page is on the active LRU list before promoting it from slow to fast memory.
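The demotion and promotion policy described above can be approximated with a toy model. The sketch below is not the Linux implementation: the class `TieredMemory`, its `headroom` parameter, and the rule "promote on the second access while in the slow tier" are simplifying assumptions chosen to mirror the two ideas in the text, namely proactive demotion of inactive pages and an activity check before promotion.

```python
from collections import OrderedDict


class TieredMemory:
    """Toy two-tier page placement model (hypothetical, for illustration)."""

    def __init__(self, fast_capacity, headroom):
        assert 0 <= headroom < fast_capacity
        self.fast_capacity = fast_capacity  # pages that fit in the fast tier
        self.headroom = headroom            # free fast-tier pages to keep ready
        self.fast = OrderedDict()           # page -> active flag, oldest first
        self.slow = {}                      # page -> active flag

    def _demote_cold_pages(self):
        # Proactively demote until the fast tier has `headroom` free pages.
        while len(self.fast) > self.fast_capacity - self.headroom:
            # Prefer the oldest inactive page; fall back to the oldest page.
            victim = next((p for p, a in self.fast.items() if not a), None)
            if victim is None:
                victim = next(iter(self.fast))
            del self.fast[victim]
            self.slow[victim] = False  # demoted pages start out inactive

    def allocate(self, page):
        # New pages land in the fast tier as inactive.
        self.fast[page] = False
        self._demote_cold_pages()

    def access(self, page):
        if page in self.fast:
            self.fast[page] = True
            self.fast.move_to_end(page)
        elif page in self.slow:
            if self.slow[page]:
                # Already active in the slow tier: promote it.
                del self.slow[page]
                self.fast[page] = True
                self._demote_cold_pages()
            else:
                # First access only marks the page active; no promotion yet.
                self.slow[page] = True
```

Requiring a page to be active before promotion filters out pages touched only once, which is what limits the ping-pong between tiers in this model.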

Christina Giannoula (University of Toronto) introduced architectural support for efficient data movement in fully disaggregated systems (Proceedings of the ACM on Measurement and Analysis of Computing Systems). The proposed solution minimizes data transfer costs through selective granularity of data movement (cache lines, pages or both), guided by monitoring of in-flight sub-block and page buffers, together with page compression and prioritization of critical cache lines.
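To make the notion of selective granularity concrete, here is a small software heuristic. It is purely hypothetical: the proposal summarized above is hardware support, and the function `choose_granularity`, the `page_threshold` value, and the `link_busy` flag are assumptions invented for this sketch. It picks a transfer unit from how densely a page is being used and whether the interconnect is congested.

```python
CACHE_LINE_BYTES = 64
PAGE_BYTES = 4096
LINES_PER_PAGE = PAGE_BYTES // CACHE_LINE_BYTES  # 64 cache lines per page


def choose_granularity(lines_touched, link_busy, page_threshold=0.5):
    """Return 'cache_line', 'page', or 'both' for a remote fetch.

    lines_touched: indices of cache lines recently accessed within the page.
    link_busy:     True if the interconnect is currently congested.
    """
    density = len(set(lines_touched)) / LINES_PER_PAGE

    if density < page_threshold:
        # Sparse use: moving the whole page would mostly waste bandwidth.
        return "cache_line"
    if link_busy:
        # Dense use on a busy link: fetch the critical line now and let
        # the rest of the page follow in the background.
        return "both"
    # Dense use on an idle link: one page transfer amortizes the cost.
    return "page"
```

For example, a page with 40 of its 64 lines touched is fetched as a whole page on an idle link, while a page with only two touched lines is served line by line.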

Research challenges

Various research challenges were discussed in a panel with Jason Lowe-Power (UC Davis), Daniel S. Berger (Microsoft), Michele Gazzetti (IBM), Manoj Wadekar (Meta), Debendra Das Sharma (Intel) and Hasan Al Maruf (University of Michigan).


Figure 3: A large shared memory pool using CXL 3.0 (source: CXL Consortium)

About the different phases of memory disaggregation (especially CXL-based memory disaggregation). CXL-based memory is expected to reach production environments in several phases: in the first phase it appears as a memory expander, and in the second phase as a memory pool (for example, GFAM in Figure 3). The first phase is expected to be implemented within one to two years, while the second phase, even on a small scale, could take around five years. Regardless of the phase memory disaggregation is in, there are many commonalities.

  • Cost and power are the main driving factors. Use cases will be prioritized according to how much they benefit from memory disaggregation.
  • Memory bandwidth will be a key factor because the traditional method of adding memory bandwidth by adding memory channels is not scalable. Remote memory will provide a tier of memory bandwidth (not just a tier of memory capacity).

About application transparency. Application adoption of CXL memory (or another memory disaggregation solution) will go through multiple stages. In the first stage, CXL memory should be transparent to applications and require no changes, especially in public cloud environments. As CXL memory becomes more common, new programming models or disruptive changes may be accepted. We cannot expect the programmer to know about data placement and to deal with the different memory latencies and bandwidths in heterogeneous memory, but it might be reasonable to let the programmer pass application semantics from the application layer down to the CXL mechanism. Currently, application transparency has the higher priority.

About the availability of CXL hardware to academia. Industry can provide some FPGA-based platforms, and emulation (e.g., based on NUMA) can also work. In addition to the hardware, a software ecosystem (for example, Meta TPP) is starting to appear.

Summary

All talks and discussions agreed that disaggregated and heterogeneous memory systems can provide better resource utilization, resource scaling, and fault tolerance. Maximizing the performance benefits requires a thorough understanding of individual data objects (e.g., hot or cold data, access order, granularity, latency or bandwidth limits, and lifetime) and interfaces that exploit this semantic information for memory management. Software-hardware co-design is therefore essential. Above all else, however, the most pressing obstacle seems to be the limited availability of commercial platforms and simulators for testing ideas and developing killer applications.

About the Authors:

Hyeran Jeon is an Assistant Professor at the University of California, Merced, and directs the Merced Computer Architecture Lab, which conducts research into energy-efficient, reliable, and secure computer systems and architecture.

Dong Li is an Associate Professor at the University of California, Merced. He is the director of the Parallel System Lab at Merced and was a research scientist at the Oak Ridge National Laboratory (ORNL). His research focuses on system support for memory heterogeneity.

Jie Ren is an Assistant Professor of Computer Science at the College of William & Mary, whose research aims to improve the performance and resource efficiency of heterogeneous memory systems.

Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author, and do not represent those of ACM SIGARCH or its parent organization, ACM.
