Workshop #1 (RP2)
Hardware-Software Co-Design for Energy-Efficient and High-Performance Edge AI
"Maxwell: a Near-SRAM Co-Design Computing Architecture for Edge AI Applications"
Prof. David ATIENZA ALONSO (EPFL)
Abstract
Recent advances in machine learning (ML) have dramatically increased model size and computational requirements, increasingly straining computing system capabilities. This tension is particularly acute in resource-constrained edge AI scenarios, where careful hardware acceleration of compute-intensive patterns and the optimization of data reuse to limit costly data transfers are key. To address these challenges, we have co-designed at EPFL a new compute-memory architecture named Maxwell, which supports the near-memory execution of entire inference algorithms with heterogeneous quantization levels.
Leveraging the regular structure of memory arrays, Maxwell achieves a high degree of parallelization for both convolutional and fully connected layers, while supporting fine-grained quantization. Additionally, the architecture effectively minimizes data movement by performing all intermediate computations, such as scaling, quantization, activation functions, and pooling layers, near memory. We demonstrate that this approach yields speed-ups of up to 8-10x with respect to state-of-the-art edge AI accelerators that must transfer data at the boundaries of ML layers. Moreover, thanks to the proposed co-design approach, an acceleration of up to 250x with respect to pure software optimizations is observed on the X-HEEP edge AI platform, which integrates the Maxwell logic and a 32-bit RISC-V core, with Maxwell-specific components accounting for only 10.6% of the memory area.
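To make the data-movement argument concrete, the sketch below models a fused post-processing pass in which requantization, activation, and pooling are applied in a single sweep over a layer's accumulator array, so that only the final pooled tensor crosses the memory boundary. This is a minimal NumPy illustration of the general idea only; the function name, scale value, and int8 quantization scheme are our assumptions, not Maxwell's actual interface.

import numpy as np

def fused_postprocess(acc, scale, zero_point):
    """Requantize, activate (ReLU), and 2x2 max-pool in one pass,
    mimicking near-memory evaluation of all intermediate steps.
    `acc` is an int32 accumulator map produced by a conv layer."""
    # Requantize: int32 accumulators -> int8 activations (assumed scheme).
    q = np.clip(np.round(acc * scale) + zero_point, -128, 127).astype(np.int8)
    # Activation: ReLU applied in the quantized domain.
    q = np.maximum(q, zero_point)
    # 2x2 max pooling with stride 2.
    h, w = q.shape[0] // 2 * 2, q.shape[1] // 2 * 2
    q = q[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
    return q  # only this pooled tensor would leave the memory array

acc = np.random.randint(-2**20, 2**20, size=(8, 8), dtype=np.int32)
out = fused_postprocess(acc, scale=2**-12, zero_point=0)
print(out.shape)  # (4, 4): 4x fewer values to move than the raw map

In a conventional flow, each of these stages would read and write a full intermediate tensor through the memory hierarchy; fusing them near memory is what removes the layer-boundary transfers the abstract refers to.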

"LLM Acceleration based on Processing-in-memory Architectures"
Prof. Xiaoming CHEN (CAS)
Abstract
The development of large language models (LLMs) poses significant challenges to computing systems in terms of computing power and storage capacity. Processing-in-memory (PIM) technology is a promising solution with the potential to overcome these challenges. This talk will introduce preliminary attempts at building LLM accelerators using PIM technology. The different operators of LLMs exhibit different computational and memory-access characteristics. To address this, the talk mainly explores how to integrate multiple computing modes (in-memory computing, near-memory computing, traditional GPUs, etc.) into heterogeneous acceleration systems for LLMs. Simulation results demonstrate the superiority of heterogeneous PIM systems for accelerating LLMs.
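One minimal way to picture such a heterogeneous system is a dispatcher that routes each LLM operator to the compute mode whose characteristics it matches: bandwidth-bound, low-reuse operators to in-memory computing, reuse-heavy GEMMs to near-memory logic, and irregular operators to the GPU. The operator names and the routing policy below are illustrative assumptions, not the mapping proposed in the talk.

# Hypothetical operator-to-backend routing for a heterogeneous PIM system.
# Backend names and the policy itself are illustrative assumptions.
ROUTING_POLICY = {
    "attention_score": "in_memory",    # bandwidth-bound, low data reuse
    "attention_value": "in_memory",
    "qkv_projection":  "near_memory",  # compute-heavy, high weight reuse
    "ffn_gemm":        "near_memory",
    "softmax":         "gpu",          # irregular / nonlinear, small data
    "layernorm":       "gpu",
}

def dispatch(op_name: str) -> str:
    """Pick a compute mode for an operator; fall back to the GPU."""
    return ROUTING_POLICY.get(op_name, "gpu")

decoder_layer = ["qkv_projection", "attention_score", "softmax",
                 "attention_value", "ffn_gemm", "layernorm"]
for op in decoder_layer:
    print(f"{op:>16} -> {dispatch(op)}")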

"AHS: Agile Hardware Specialization"
Prof. Yun LIANG (PKU)
Abstract
Compared to software design, hardware design is more expensive and time-consuming. This is partly because the software community has developed a rich set of modern tools that help programmers start and iterate on projects easily and quickly, whereas the corresponding tools for hardware design remain seriously antiquated and lacking. In this talk, I will introduce AHS, an EDA toolbox for agile chip front-end design, which includes EDA tools for both chip design and verification. From the design perspective, AHS presents three entry points with different programming interfaces: 1) a high-level synthesis flow based on a multi-level hardware intermediate representation; 2) an embedded hardware description language; and 3) a large language model (LLM)-powered hardware design flow. These three methodologies exhibit different trade-offs between productivity and PPA (performance, power, and area). From the verification perspective, I will present agile simulation and debugging tools that check both the functional and performance behaviors of the hardware.
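For a flavor of the second entry point, the sketch below shows what an embedded hardware description language looks like in general: circuits are built as ordinary host-language objects that record a netlist-style intermediate representation. This is a generic toy in the spirit of embedded HDLs, not AHS's actual programming interface.

# A toy embedded-HDL sketch: hardware is described by host-language code
# that records a netlist-like IR. Generic illustration; not the AHS API.
class Wire:
    _count = 0
    def __init__(self, name=None):
        Wire._count += 1
        self.name = name or f"w{Wire._count}"

class Module:
    def __init__(self, name):
        self.name, self.ops = name, []
    def add(self, a, b):            # emit an adder node, return its output
        out = Wire()
        self.ops.append(("add", a.name, b.name, out.name))
        return out
    def emit(self):                 # dump the recorded IR as text
        return "\n".join(f"{o} = {op} {x}, {y}" for op, x, y, o in self.ops)

m = Module("acc3")
a, b, c = Wire("a"), Wire("b"), Wire("c")
m.add(m.add(a, b), c)               # describes a two-adder tree
print(m.emit())

The appeal of this style is that loops, functions, and parameterization of the host language generate hardware structure directly, which is one of the productivity levers the talk contrasts against the HLS- and LLM-based flows.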
