LLM Split Inference - Search News

Defeating Nondeterminism in LLM Inference by Thinking Machines

A research article by Horace He and the Thinking Machines Lab (X-OpenAI CTO Mira Murati founded) addresses a long-standing issue in large language models (LLMs). Even with greedy decoding bu setting ...

Business Wire

Inception Launches Mercury 2, the Fastest Reasoning LLM — 5x Faster Than Leading Speed-Optimized LLMs, with Dramatically Lower Inference Cost

PALO ALTO, Calif.--(BUSINESS WIRE)--Inception, the company behind the first commercial diffusion large language models (dLLMs), today announced the launch of Mercury 2, the fastest reasoning LLM and ...

Semiconductor Engineering

Four Architectural Opportunities for LLM Inference Hardware (Google)

“Large Language Model (LLM) inference is hard. The autoregressive Decode phase of the underlying Transformer model makes LLM inference fundamentally different from training. Exacerbated by recent AI ...

Forbes

Cerebras Now The Fastest LLM Inference Processor; Its Not Even Close

The company tackled inferencing the Llama-3.1 405B foundation model and just crushed it. And for the crowds at SC24 this week in Atlanta, the company also announced it is 700 times faster than ...

Gizmodo

ChatGPT Can ‘Infer’ Personal Details From Anonymous Text

New research shows how popular LLMs are able to accurately guess a user’s race, occupation, or location, after being fed seemingly trivial chats. Reading time 4 minutes Quiz time: If you or your ...

Hackaday

TurboQuant: Reducing LLM Memory Usage With Vector Quantization

Large language models (LLMs) aren’t actually giant computer brains. Instead, they are massive vector spaces in which the probabilities of tokens occurring in a specific order is encoded. Billions of ...

Semiconductor Engineering

Ultra-low-bit LLM Inference Allows AI-PC CPUs And Discrete Client GPUs To Approach High-end GPU-Level (Intel)

A new technical paper titled “Pushing the Envelope of LLM Inference on AI-PC and Intel GPUs” was published by researcher at Intel. “The advent of ultra-low-bit LLM models (1/1.58/2-bit), which match ...

SDxCentral

AI inference crisis: Google engineers on why network latency and memory trump compute

Google researchers have warned that large language model (LLM) inference is hitting a wall amid fundamental problems with memory and networking problems, not compute. In a paper authored by ...

Digi Times

In-depth: Google TurboQuant cuts LLM memory 6x, resets AI inference cost curve

Google has introduced TurboQuant, a compression algorithm that reduces large language model (LLM) memory usage by at least 6x while boosting performance, targeting one of AI's most persistent ...

TweakTown

Dell PowerEdge XE9712: NVIDIA GB200 NVL72-based AI GPU cluster for LLM training, inference

Dell has just unleashed its new PowerEdge XE9712 with NVIDIA GB200 NVL72 AI servers, with 30x faster real-time LLM performance over the H100 AI GPU. Dell Technologies' new AI Factory with NVIDIA sees ...

VentureBeat

Researchers baked 3x inference speedups directly into LLM weights — without speculative decoding

As agentic AI workflows multiply the cost and latency of long reasoning chains, a team from the University of Maryland, Lawrence Livermore National Labs, Columbia University and TogetherAI has found a ...

Some results have been hidden because they may be inaccessible to you

Show inaccessible results