Workshop #1

Exploring the Future of AI: Device and Technology for In-Memory and In-Sensor Computing

"MINOTAUR: Enabling Transformer Models at the Edge with Posits and Resistive RAM"

Prof. Priyanka RAINA (Stanford)

Abstract

Transformer models achieve state-of-the-art accuracy but are challenging to run in resource-constrained edge environments, as they are large and difficult to quantize to 8b integers. MINOTAUR overcomes these challenges and enables both inference and training of machine learning models at the edge through (1) an alternative 8b floating-point data type, (2) a deep neural network accelerator optimized for operator fusion, and (3) temporal power gating of on-chip non-volatile resistive RAM (RRAM). MINOTAUR uses 8b posits, an alternative to the IEEE-754 floating-point standard that achieves a higher dynamic range for the same number of bits. Compared to bfloat16, the current state of the art, 8b posits reduce both the required memory capacity and the memory access energy by 2x, with a multiply-accumulate (MAC) unit that is 1.36x smaller in area and consumes 1.1x less power. Posit operations require type conversions that can introduce quantization errors, leading to accuracy degradation. MINOTAUR addresses this challenge through fusion of transformer operations, enabled by a configurable vector datapath; this improves inference accuracy and reduces the number of memory accesses. Finally, MINOTAUR uses RRAM to fit large models (e.g., the 12 MB MobileBERT-tiny) entirely on-chip without any external memory. MINOTAUR further reduces memory power by exploiting the non-volatility of RRAM with software-controlled fine-grained temporal power gating. MINOTAUR is designed in a commercial 40nm process and achieves ResNet-18 inference in 37.5 ms, 1.6x faster than CHIMERA, the current state of the art, and MobileBERT-tiny inference in 30.5 ms, with only a 0.6% accuracy drop on the QNLI question-answering task relative to bfloat16.
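To build intuition for the data type, the short Python sketch below decodes an 8-bit posit into a real value, showing how the run-length-encoded regime field trades fraction bits for dynamic range. The exponent-field width es=1 is an illustrative assumption, not necessarily MINOTAUR's configuration.

    def decode_posit8(x: int, es: int = 1) -> float:
        """Decode an 8-bit posit with es exponent bits into a Python float."""
        if x == 0x00:
            return 0.0
        if x == 0x80:
            return float("nan")             # NaR ("not a real")
        sign = -1.0 if x & 0x80 else 1.0
        if x & 0x80:
            x = (-x) & 0xFF                 # negative posits are two's complement
        payload = [(x >> i) & 1 for i in range(6, -1, -1)]  # 7 bits, MSB first
        # Regime: a run of identical bits whose length sets a power of 2**(2**es).
        r0, k = payload[0], 0
        while k < len(payload) and payload[k] == r0:
            k += 1
        m = (k - 1) if r0 == 1 else -k
        i = k + 1                           # skip the terminating (opposite) bit
        # Exponent: up to es bits; bits cut off at the right are taken as 0.
        e = 0
        for _ in range(es):
            e <<= 1
            if i < len(payload):
                e |= payload[i]
                i += 1
        # Fraction: the remaining bits, with a hidden leading 1 (no subnormals).
        f, w = 0.0, 0.5
        while i < len(payload):
            f += payload[i] * w
            w /= 2.0
            i += 1
        return sign * (1.0 + f) * 2.0 ** ((1 << es) * m + e)

    print(decode_posit8(0x50))   # 2.0
    print(decode_posit8(0x7F))   # 4096.0: maxpos, far beyond int8's 127

The long regime runs near the ends of the encoding are what give an 8b posit its wide dynamic range, at the cost of fraction bits at very large and very small magnitudes.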

"Accurate In-Memory Computing with MRAM Device Variation-Aware Adaptive Quantization"

Prof. Qiming SHAO (The Hong Kong University of Science and Technology)

Abstract

The development of large language models (LLMs), such as GPT-4, requires massive amounts of data, and current computing systems face a memory bottleneck under these large data-processing demands. In response to this challenge, in-memory computing (IMC) has emerged as a potential solution. Emerging nonvolatile memories (eNVM) are important here because they improve the performance and efficiency of IMC systems by allowing data to be stored and processed much closer to where it is computed on. Additionally, the nonvolatility of eNVM allows data to be retained at ultra-low power, which can significantly reduce static energy consumption.

However, eNVM devices face several challenges before they can be widely adopted. One of the main challenges is device variation, since IMC relies on Ohm's law and Kirchhoff's current law to perform multiplication-and-accumulation (MAC) operations in the crossbar array. This seminar will focus on exploiting the unique variation property of magnetic random-access memory (MRAM) to design a reliable IMC macro for accurate deep neural network inference. The variation of an MRAM device is spatially random but temporally fixed once the device is fabricated. Based on this property, we propose a device variation-aware adaptive quantization scheme that accurately adjusts the quantization values of the parameters mapped to the eNVM devices. Applying this scheme makes the inference accuracy comparable to that of on-chip-trained models, without any on-chip training cycles.
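A minimal NumPy sketch of that idea follows, under illustrative assumptions that are not from the talk (16 evenly spaced conductance levels, 5% multiplicative per-device deviation). Because each device's deviation is fixed in time, it can be measured once after fabrication and folded into the quantizer: each weight is mapped to the level whose actual, deviated conductance is nearest, rather than to the nearest nominal level.

    import numpy as np

    rng = np.random.default_rng(0)
    ROWS, COLS = 64, 32
    LEVELS = np.linspace(0.0, 1.0, 16)       # nominal conductance levels (assumed)
    # Per-device deviation: spatially random, but fixed once "fabricated".
    deviation = 1.0 + 0.05 * rng.standard_normal((ROWS, COLS))

    def naive_quantize(W):
        # Map each weight to the nearest nominal level, ignoring variation.
        idx = np.abs(W[..., None] - LEVELS).argmin(-1)
        return LEVELS[idx] * deviation       # what the array actually stores

    def variation_aware_quantize(W):
        # Pick, per device, the level whose *measured* conductance is nearest.
        actual = LEVELS[None, None, :] * deviation[..., None]
        idx = np.abs(W[..., None] - actual).argmin(-1)
        return np.take_along_axis(actual, idx[..., None], -1)[..., 0]

    W = rng.uniform(0.0, 1.0, (ROWS, COLS))  # weights pre-scaled to [0, 1]
    x = rng.uniform(0.0, 1.0, ROWS)          # input voltages
    exact = x @ W                            # ideal crossbar MAC (summed I = GV)
    for name, q in (("naive", naive_quantize), ("adaptive", variation_aware_quantize)):
        print(name, np.abs(x @ q(W) - exact).mean())

In this toy simulation the adaptive mapping shrinks the MAC error relative to naive quantization without any training, which is the effect the proposed scheme exploits at full network scale.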

References

  1. Shao, Q., Wang, Z., & Yang, J. J. (2022). Efficient AI with MRAM. Nature Electronics, 5(2). https://doi.org/10.1038/s41928-022-00725-x
  2. Wang, Z., Wu, H., Burr, G. W., Hwang, C. S., Wang, K. L., Xia, Q., & Yang, J. J. (2020). Resistive switching materials for information processing. Nature Reviews Materials. https://doi.org/10.1038/s41578-019-0159-3
  3. Xiao, Z., et al. (2022). Device variation-aware adaptive quantization for MRAM-based accurate in-memory computing without on-chip training. 2022 International Electron Devices Meeting (IEDM), pp. 10.5.1-10.5.4. https://doi.org/10.1109/IEDM45625.2022.10019482
  4. LeCun, Y. (2019). 1.1 Deep learning hardware: Past, present, and future. 2019 IEEE International Solid-State Circuits Conference (ISSCC), pp. 12-19. https://doi.org/10.1109/ISSCC.2019.8662396

"Building AI Accelerators using N3XT 3D MOSAIC, Illusion and Co-design"

Prof. Subhasish MITRA (Stanford)

Abstract

The computation demands of 21st-century abundant-data workloads, such as AI/machine learning, far exceed the capabilities of today's computing systems. For example, a Dream AI Chip would ideally co-locate all memory and compute on a single chip, quickly accessible at low energy. Such Dream Chips are not realizable today. Computing systems instead use large off-chip memory and spend enormous time and energy shuttling data back and forth. This memory wall worsens with growing problem sizes, especially as conventional transistor miniaturization becomes increasingly difficult. The next leap in computing requires transformative NanoSystems that exploit the unique characteristics of nanotechnologies and abundant-data workloads. We create new chip architectures through ultra-dense 3D integration of logic and memory, the N3XT 3D approach. Multiple N3XT 3D chips are integrated through a continuum of chip-stacking, interposer, and wafer-level integration: the N3XT 3D MOSAIC. To scale with growing problem sizes, new Illusion systems orchestrate workload execution on the N3XT 3D MOSAIC, creating the illusion of a Dream Chip with near-Dream energy and throughput. Several hardware prototypes, built in commercial and research fabrication facilities, demonstrate the effectiveness of our approach. We target 1,000X system-level energy-delay-product benefits, especially for abundant-data workloads.
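As a rough illustration of the Illusion idea, the toy mapper below packs consecutive layers of a network onto chips whose weights must fit in on-chip memory, so that only boundary activations ever cross chip edges. The layer sizes, the 4 MB capacity, and the greedy policy are assumptions for illustration, not the published Illusion mapping algorithm.

    from dataclasses import dataclass

    @dataclass
    class Layer:
        name: str
        weight_mb: float    # weight footprint that must live on-chip
        act_mb: float       # output activation passed to the next layer

    def partition(layers, chip_capacity_mb):
        """Greedily pack consecutive layers onto fixed-capacity chips."""
        chips, current, used = [], [], 0.0
        for layer in layers:
            if current and used + layer.weight_mb > chip_capacity_mb:
                chips.append(current)
                current, used = [], 0.0
            current.append(layer)
            used += layer.weight_mb
        if current:
            chips.append(current)
        return chips

    layers = [Layer(f"block{i}", weight_mb=1.5, act_mb=0.2) for i in range(8)]
    for i, chip in enumerate(partition(layers, chip_capacity_mb=4.0)):
        # Only the last layer's activations leave the chip; weights never move.
        print(f"chip {i}: {[l.name for l in chip]} -> {chip[-1].act_mb} MB off-chip")

Because each chip holds its weights in dense non-volatile memory and only small activations move between chips, a multi-chip system of this kind can approach the energy and throughput of the single Dream Chip it imitates.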
