China has discovered a clever alternative to NVIDIA's scaled-down AI accelerators. DeepSeek's latest project wrings remarkable performance out of the Hopper H800, with the company reporting roughly eight times the TFLOPS such workloads typically sustain.
### DeepSeek’s FlashMLA: A Game-Changer for China’s AI Ambitions with NVIDIA’s Modified GPUs
China seems to be forging its own path, especially when it comes to advancing its hardware game. Homegrown companies like DeepSeek are at the forefront, using innovative software to squeeze the most out of the hardware they already have. The latest breakthroughs from DeepSeek are nothing short of impressive: the company claims to have extracted substantial performance from NVIDIA's "cut-down" Hopper H800 GPUs by optimizing memory usage and how compute is allocated across different inference requests.
On the first day of its "Open Source Week," DeepSeek showcased FlashMLA, a decoding kernel tailored specifically for NVIDIA's Hopper GPUs. It's exciting to see these advancements shared openly on GitHub, and the developments DeepSeek has rolled out are ground-breaking, to say the least.
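For the curious, the project's README documents a compact Python API. The sketch below is adapted from that usage example; the `flash_mla` import and the two function names come from the repository, while the placeholder shapes (a batch of four requests, 1,024 cached tokens each, and a 576-wide head that splits into a 512-dim latent plus 64 RoPE dims) are illustrative assumptions of mine:

```python
# Sketch adapted from the FlashMLA README (github.com/deepseek-ai/FlashMLA).
# The flash_mla module and both function names come from the repo; every
# concrete shape below is an illustrative placeholder, not a tuned value.
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

batch, s_q, h_q, h_kv = 4, 1, 128, 1      # decoding: one query token per step
head_dim, dv = 576, 512                    # 512-dim latent + 64 RoPE dims
block_size, max_blocks = 64, 32            # paged KV cache, 64 tokens per block

cache_seqlens = torch.full((batch,), 1024, dtype=torch.int32, device="cuda")
block_table = torch.arange(batch * max_blocks, dtype=torch.int32,
                           device="cuda").view(batch, max_blocks)
kvcache = torch.randn(batch * max_blocks, block_size, h_kv, head_dim,
                      dtype=torch.bfloat16, device="cuda")
q = torch.randn(batch, s_q, h_q, head_dim,
                dtype=torch.bfloat16, device="cuda")

# Scheduling metadata is computed once per batch from the cached sequence
# lengths, then reused for every decoder layer.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv)

out, lse = flash_mla_with_kvcache(
    q, kvcache, block_table, cache_seqlens, dv,
    tile_scheduler_metadata, num_splits, causal=True)
```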
DeepSeek has asserted that it pushed the limits to reach 580 TFLOPS for BF16 matrix operations on the Hopper H800, which the company characterizes as roughly eight times what such workloads typically sustain. And it doesn't stop there. FlashMLA, through expert memory optimization, delivers memory bandwidth of up to 3,000 GB/s, close to the H800's theoretical peak. And what's mind-boggling is that all this is achieved through clever coding; no hardware changes are involved.
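As a point of reference, figures like "580 TFLOPS" are derived by timing a kernel and dividing the arithmetic it performs by the elapsed seconds. The micro-benchmark below is a generic sketch of my own (not DeepSeek's methodology) showing that calculation for a plain BF16 matrix multiply:

```python
# Generic micro-benchmark sketch (mine, not DeepSeek's methodology): time a
# BF16 matmul on the GPU and convert the measurement into TFLOPS.
import time
import torch

def bench(fn, iters=50):
    fn()                                   # warm-up / lazy init
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

m = n = k = 8192
a = torch.randn(m, k, dtype=torch.bfloat16, device="cuda")
b = torch.randn(k, n, dtype=torch.bfloat16, device="cuda")

sec = bench(lambda: a @ b)
flops = 2 * m * n * k                      # one multiply-add per (i, j, k)
print(f"{flops / sec / 1e12:.0f} TFLOPS achieved")
```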
A post on Twitter from Visionary x AI highlights the feat: 580 TFLOPS of compute and memory throughput hitting 3,000 GB/s, numbers well beyond what these workloads usually achieve.
FlashMLA employs "low-rank key-value compression," which squeezes the attention key-value cache into a much smaller latent representation, boosting processing speed while reportedly cutting memory use by 40%-60%. It also introduces a block-based paging system that allocates memory in fixed-size blocks as a sequence grows, rather than reserving a worst-case maximum up front, which is particularly beneficial for handling variable-length sequences efficiently. A sketch of both ideas follows.
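To make the two techniques concrete, here is a minimal, self-contained PyTorch sketch. It is purely illustrative rather than DeepSeek's implementation; the dimensions, class names, and helper structure are my assumptions (though the 64-token block size matches FlashMLA's paged cache):

```python
# Illustrative-only PyTorch sketch; dimensions, names, and structure are my
# assumptions, not DeepSeek's code. BLOCK = 64 matches FlashMLA's paged cache.
import torch
import torch.nn as nn

# --- Low-rank key-value compression (MLA-style) ---
# Instead of caching full per-head keys and values, cache one small latent
# vector per token and reconstruct K and V from it with up-projections.
d_model, d_latent, n_heads, d_head = 4096, 512, 32, 128

down_proj = nn.Linear(d_model, d_latent, bias=False)       # compress
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand to K
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand to V

x = torch.randn(1, 1, d_model)        # one new token during decoding
latent = down_proj(x)                 # [1, 1, 512] -- all that gets cached
# Per-token cache cost: d_latent floats instead of 2 * n_heads * d_head
# (512 vs. 8192 here), i.e. a 16x smaller KV cache in this toy setup.
k = up_k(latent).view(1, 1, n_heads, d_head)
v = up_v(latent).view(1, 1, n_heads, d_head)

# --- Block-based paging ---
# Memory is handed out in fixed-size blocks as a sequence grows, so a short
# request never reserves the worst-case maximum length.
BLOCK = 64                            # tokens per physical block

class PagedCache:
    def __init__(self, num_blocks, d):
        self.pool = torch.zeros(num_blocks, BLOCK, d)   # physical block pool
        self.free = list(range(num_blocks))             # free-block list
        self.tables = {}                                # seq_id -> block ids

    def append(self, seq_id, token_idx, vec):
        table = self.tables.setdefault(seq_id, [])
        if token_idx // BLOCK == len(table):            # crossed a boundary?
            table.append(self.free.pop())               # grab one more block
        self.pool[table[token_idx // BLOCK], token_idx % BLOCK] = vec

cache = PagedCache(num_blocks=128, d=d_latent)
cache.append(seq_id=0, token_idx=0, vec=latent.view(-1))
```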
What DeepSeek has accomplished underscores the multifaceted nature of AI computing: it's not solely about hardware, but also about innovative software strategies. FlashMLA is presently tuned around the cut-down H800, and I can't help but ponder the gains the full-fat H100 could see with the same treatment. It's an exciting time in AI computing, for sure!