Time to give the memory allocator a room of its own in the house?

Machine learning (ML), whether training or inference, has been driving systems and architecture research. A significant amount of work is underway in the Systems for ML space, ranging from building efficient systems for data preprocessing (e.g., Cachew, ATC'22) to timely and automatic (re)generation of accurate models (e.g., MatchMaker, MLSys'22) in the various ecosystems driven by these models. We recently looked at the implications of machine learning for memory management and thought about what next-generation memory allocators should look like. One of the students I co-advise at UT Austin, Ruihao Li, will be presenting this vision at HotOS'23 in June 2023. This blog post briefly describes our vision: we advocate offloading memory allocation (and similar management functions) from the main processing core to other processing units to increase performance and reduce power consumption.
The gap between processor speed and memory access latency is a growing barrier to running ML tasks efficiently. ML accelerators spend only 23% of their overall cycles on computation and over 50% on data preparation, as reported in the "In-Datacenter Performance Analysis of a Tensor Processing Unit" paper published at ISCA'17. Even in general warehouse-scale data centers, 50% of processing cycles sit idle waiting for memory accesses, degrading application performance.
Recognizing this challenge, multiple memory allocation libraries have been proposed and are widely used, e.g., Google's TCMalloc and Microsoft's mimalloc. To support real-world use cases with concurrent memory allocation requests, these libraries are multi-threaded and maintain, and rely heavily on, complex metadata. While this enables serving concurrent allocation requests, the same core runs both memory management and application code, and the resulting resource contention ends up hurting overall performance. We ran measurement experiments comparing the performance of four memory allocators, PTMalloc2, jemalloc, TCMalloc (OSDI'21), and mimalloc, on a set of representative workloads from SPEC CPU2017. We observed that, depending on the allocator used, performance can vary by more than 10x.
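For readers who want to see this kind of contention for themselves, a microbenchmark along the following lines captures the spirit of such a comparison (this is an illustrative sketch, not our actual SPEC CPU2017 methodology; the thread count, size range, and library path are all placeholders). Each thread interleaves malloc/free of mixed sizes with touches of the allocated memory, so allocator overhead and cache behavior both show up in the wall-clock time, and the same binary can be re-run under different allocators via LD_PRELOAD without recompiling:

```c
/* alloc_contention.c -- illustrative allocator microbenchmark.
 * Build: gcc -O2 -pthread alloc_contention.c -o alloc_contention
 * Swap allocators without recompiling, e.g. (path is illustrative):
 *   LD_PRELOAD=/usr/lib/libjemalloc.so ./alloc_contention
 */
#define _DEFAULT_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { NTHREADS = 8, ITERS = 1000000 };

static void *worker(void *arg) {
    unsigned seed = (unsigned)(size_t)arg;
    for (int i = 0; i < ITERS; i++) {
        size_t sz = 16 + rand_r(&seed) % 4096; /* mixed sizes stress size-class metadata */
        char *p = malloc(sz);
        if (!p) abort();
        p[0] = 1;
        p[sz - 1] = 1;                         /* touch memory like an application would */
        free(p);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (size_t i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (size_t i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);
    printf("elapsed: %.3f s\n",
           (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9);
    return 0;
}
```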
So, we dug deeper. The existing memory allocation mechanisms implemented in software achieve higher performance than the default glibc allocator (PTMalloc2). However, because they ignore the underlying hardware, they still fail to take advantage of system-level optimizations such as reducing cache pollution and TLB misses. Alternatively, memory allocation functions can be implemented in hardware accelerators to avoid resource contention. However, relying on custom hardware units can hinder their evolution into general-purpose memory management solutions, or require frequent hardware changes as algorithms evolve.
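To make "system-level optimizations" concrete, consider TLB misses: a hardware-aware allocator could back its large spans with huge pages so that hot allocation paths walk fewer page-table entries. A minimal sketch using Linux's mmap/madvise follows (the 64 MiB span size is arbitrary, and the MADV_HUGEPAGE hint assumes transparent huge pages are enabled on the machine):

```c
/* huge_span.c -- illustrative only: how a hardware-aware allocator might
 * request huge-page backing for a large span to reduce TLB pressure. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define SPAN_SIZE (64UL << 20) /* 64 MiB span, a plausible allocator arena */

int main(void) {
    /* Reserve an anonymous span, as an allocator would for its heap. */
    void *span = mmap(NULL, SPAN_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (span == MAP_FAILED) { perror("mmap"); return 1; }

    /* Hint the kernel to back the span with (transparent) huge pages:
     * fewer page-table entries to walk means fewer TLB misses. */
    if (madvise(span, SPAN_SIZE, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)"); /* non-fatal; THP may be disabled */

    /* ... carve the span into size classes and serve allocations ... */
    munmap(span, SPAN_SIZE);
    return 0;
}
```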
Next-generation memory allocators will likely continue to be complex. One way to manage this complexity without hurting application performance is to isolate the allocation functions from the rest of the code and give them separate resources. This would keep allocators from polluting the cache and interfering with other application metadata. However, current allocators cannot easily be "pulled" out of the code and offloaded onto dedicated cores, because their metadata is usually tightly coupled to user data. The design of next-generation memory allocators thus requires innovations in software and hardware, and in their co-design.
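To make the "separate room" idea concrete, the offload can be prototyped in software today: pin an allocator thread to its own core and let the application submit requests through shared memory, so the allocator's metadata stays warm in its own core's cache. Below is a deliberately minimal sketch; the single application thread, one-slot mailbox, and busy-waiting are simplifying assumptions for illustration, not our proposed design (a real system would batch requests over lock-free rings and could allocate ahead of demand):

```c
/* offload_alloc.c -- a toy "allocator in its own room": the application
 * thread hands allocation requests to an allocator thread pinned to
 * another core via a one-slot mailbox.
 * Build: gcc -O2 -pthread offload_alloc.c -o offload_alloc */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static _Atomic size_t req_size;  /* 0 means "no pending request" */
static void *_Atomic resp_ptr;   /* allocator's reply            */
static atomic_bool done;

static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* The "room": services requests so malloc metadata stays in this core's cache. */
static void *allocator_thread(void *arg) {
    (void)arg;
    pin_to_core(1);
    while (!atomic_load(&done)) {
        size_t sz = atomic_load(&req_size);
        if (sz) {
            void *p = malloc(sz);
            atomic_store(&req_size, 0); /* consume the request first...  */
            atomic_store(&resp_ptr, p); /* ...then publish the reply     */
        }
    }
    return NULL;
}

/* Application-side stub: post a request, spin until the reply arrives. */
static void *offload_malloc(size_t sz) {
    atomic_store(&resp_ptr, NULL);
    atomic_store(&req_size, sz);
    void *p;
    while ((p = atomic_load(&resp_ptr)) == NULL) /* inter-core round trip */
        ;
    return p;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, allocator_thread, NULL);
    pin_to_core(0);

    char *buf = offload_malloc(4096);
    buf[0] = 'x'; /* use the memory as usual */
    printf("got %p from the allocator core\n", (void *)buf);
    free(buf);    /* frees could be offloaded too */

    atomic_store(&done, true);
    pthread_join(t, NULL);
    return 0;
}
```

Even this toy exposes the central tension: every allocation now pays an inter-core round trip, but the allocator's metadata no longer competes with the application for the same cache.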
To summarize, we found a helpful analogy for thinking about offloading allocation to a separate dedicated core: like a child in a large family, the memory allocator is growing up, and the time has come to give it a room (core) of its own in our house (CPU). This growing child raises many new research questions. How do we find the right trade-off between the overhead (additional inter-core communication) and the benefits (reduced cache pollution and asynchronous execution) of offloading memory allocators? The choice of room type (i.e., the type of processing core) is another research question: should the room be like the other rooms (e.g., other CPU cores), or is a small room enough for memory allocation? And can the room be used for other functions rather than solely for memory allocation?
About the author: Neeraja J. Yadwadkar is an assistant professor in the ECE department at the University of Texas at Austin and an affiliated researcher with VMware Research. Most of her research straddles the boundaries of computer systems and machine learning.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author, and do not represent those of ACM SIGARCH or its parent organization, ACM.