Why AI Chip Makers Need In-House AI Research – Now More Than Ever

Investors sometimes ask: “Why build both the chip and run an AI research team? Isn’t that two businesses?”

On the surface, it may look like two parallel efforts. But in reality, chip design and AI research are inseparable. One cannot succeed without the other.

In CPUs, you design once for a fixed ISA and a standardized toolchain, and workloads remain broadly predictable. In AI chips, the “ISA” is the model zoo: CNNs yesterday, Transformers today, LLMs on the edge tomorrow, and new architectures like RWKV or Liquid Foundation Models right around the corner. Supporting models, not just opcodes, is the real job. And that demands chip design and AI research working hand-in-hand.

Why AI Chip Design Is Not Like Traditional Chip Design

What makes designing an AI chip fundamentally different from traditional chip design?

The answer lies in five realities that push chip development into a new paradigm:

Because AI evolves at an entirely different pace. Traditional chips were designed for stability, lasting across product cycles. AI reinvents itself every few months. Chips must now adapt at this pace or risk instant irrelevance.
Because AI chips must be built for constant diversity. Workloads are no longer uniform. Vision, language, reasoning, and multi-modal tasks demand architectures that can flex and adapt. Just as insect brains evolved for insect lifestyles and mammalian brains for mammals, AI chips must specialize to suit the diversity of applications. There will never be a “one chip fits all” solution.
Because AI research is inseparable from chip design. The CPU and GPU era of generalized computing no longer serves edge AI. In this new era, hardware and AI research co-design each other. The days when software ran behind hardware are over; today, hardware and software evolve in lockstep.
Because success depends on software ecosystems co-evolving with hardware. Performance doesn’t come from silicon alone. It requires software frameworks, compilers, and optimization layers to mature alongside hardware. This co-evolution also helps reduce chip costs and increase fab unit efficiency, making edge AI scalable and sustainable.
Because hardware–software synergy is non-negotiable. Even a simple algorithm like Bubble Sort, whose adjacent compare-and-swap steps depend on one another, runs reasonably on a CPU but maps poorly to a GPU’s parallel lanes, which shows that not every workload suits every architecture. With today’s evolving AI applications, hardware and software must be co-designed as a unified system, tuned to complement each other’s strengths and demands. This synergy is essential for advancing both AI chips and AI models.

From CNNs on the Edge → LLMs Everywhere

A couple of years ago, edge AI was dominated by CNNs for vision. Computer Vision accelerators were the big driver. That continues even today.

But the world has shifted. LLMs are moving to the edge for privacy, latency, and cost. Multiple market surveys point to the same trend: in the next 2–3 years, LLMs on-device and at the edge will be a massive growth area.

The Model Landscape Is Fragmenting

Chip designers aren’t just chasing “Transformers” anymore. They’re being asked to support multiple LLM families and entirely new architectures, each with different kernel and precision needs:

RWKV – an RNN/Transformer hybrid with linear-time, stateful inference. Great for long sequences on limited memory, but it changes the attention and cache story for chips.
Liquid Foundation Models (LFM) – continuous-time models that adapt their dynamics; promising for time-series and edge deployments, but with very different numerical profiles from Transformers.
Foundation model families like Amazon Titan – enterprise-oriented, long-context, and optimized for real-world workloads. These bring their own kernel and memory-access needs.
Mixture-of-Experts (MoE) – sparse activation, router/dispatcher layers, and heavy all-to-all communication; totally different bandwidth/latency trade-offs compared to dense Transformers.
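To make the MoE point concrete, here is a minimal sketch of top-k routing and dispatch in plain Python. The function names (`route_top_k`, `dispatch`), the choice of k=2, and the toy router logits are illustrative assumptions, not taken from any particular model or from SandLogic's stack. It shows why the traffic pattern is data-dependent: which expert each token visits is only known at run time.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are renormalized over the selected experts so they sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def dispatch(batch_logits, num_experts, k=2):
    """Group (token_id, weight) pairs by expert.

    On real hardware this regrouping is what turns into all-to-all traffic:
    every token may need to travel to whichever unit hosts its experts.
    """
    buckets = {e: [] for e in range(num_experts)}
    for token_id, logits in enumerate(batch_logits):
        for expert, weight in route_top_k(logits, k):
            buckets[expert].append((token_id, weight))
    return buckets

# Toy router outputs for 3 tokens and 4 experts (made-up numbers).
batch = [
    [2.0, 0.1, 0.3, 1.5],
    [0.2, 3.0, 0.1, 0.4],
    [0.1, 0.2, 2.5, 2.4],
]
buckets = dispatch(batch, num_experts=4, k=2)
for expert, tokens in buckets.items():
    print(expert, [t for t, _ in tokens])
```

In this toy batch every token picks expert 3 as one of its two experts, so that expert receives three tokens while the others receive one each. Uneven load of this kind is one reason MoE stresses interconnect and scheduling differently from a dense model.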

Even within Transformers, “stable” blocks like attention and feed-forward are constantly being reinvented:

FlashAttention rewrites how attention uses SRAM to achieve real wall-clock speedups.
Grouped Query Attention (GQA), Variable Grouped Query Attention (VGQA), and Multi-Query Attention (MQA) shrink KV-cache bandwidth during decoding.
RoPE positional encoding, RMSNorm, and SwiGLU activations are now standard in LLaMA-style models, changing how normalization and activations must be implemented in low precision.
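A rough back-of-the-envelope sketch shows why MQA and GQA matter for decode. The model shape below (32 layers, 32 query heads, head dimension 128, 4096-token context, FP16 cache) is a hypothetical configuration chosen for illustration, and `kv_cache_bytes` is a helper name introduced here.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Size of the KV cache: keys and values (factor 2) for every layer,
    every KV head, and every cached position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem * batch

# Hypothetical model shape, not a specific published model.
cfg = dict(num_layers=32, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(num_kv_heads=32, **cfg)  # one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **cfg)   # 4 query heads share a KV head
mqa = kv_cache_bytes(num_kv_heads=1, **cfg)   # all query heads share one KV head

print(f"MHA: {mha / 2**20:.0f} MiB")
print(f"GQA: {gqa / 2**20:.0f} MiB")
print(f"MQA: {mqa / 2**20:.0f} MiB")
```

Under these assumptions the cache is 2048 MiB for standard multi-head attention, 512 MiB for GQA with 8 KV heads, and 64 MiB for MQA. Since each decoded token has to read the cache, the bytes moved per token shrink by roughly the same factor.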

And precision itself keeps shifting: FP32 → FP16 → INT8 → FP8 → INT4. A couple of years ago, we at SandLogic decided to stop at FP8. We focused on multi-precision FP8 MACs and exponent-heavy formats that are hardware-friendly and reduce memory/power overhead while still preserving model quality. Today, the industry is catching up to that same conclusion.
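As a simple illustration of what each step on that precision ladder buys in memory, the sketch below computes raw weight storage for a hypothetical 7-billion-parameter model. It deliberately ignores quantization scales, zero points, and activations, so real footprints will be somewhat larger; `weight_bytes` and the parameter count are assumptions for the example.

```python
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "FP8": 8, "INT4": 4}

def weight_bytes(num_params, fmt):
    """Raw weight storage only; excludes scales, zero points, and activations."""
    return num_params * BITS_PER_WEIGHT[fmt] // 8

params = 7_000_000_000  # hypothetical 7B-parameter model

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: {weight_bytes(params, fmt) / 1e9:.1f} GB")
```

For this model size the weights come to 28 GB in FP32, 14 GB in FP16, 7 GB in INT8 or FP8, and 3.5 GB in INT4. FP8 and INT8 cost the same bytes; the difference lies in how those 8 bits are split between range and precision.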

What This Means for a Chip Team

To run modern and upcoming LLMs well, an AI chip must ship with first-class support for:

KV-cache management and decode-optimized attention (FlashAttention, MQA/GQA).
Low-precision friendly ops like RoPE, RMSNorm, and SwiGLU.
MoE routing and dispatch kernels with efficient all-to-all communication.
Stateful inference paths for architectures like RWKV.
Quantization-aware toolchains that make INT8/INT4/FP8 practical without excessive dequant/requant overhead.
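For readers less familiar with the operators named in this list, here are reference-style definitions of RMSNorm, SwiGLU gating, and RoPE in plain Python. They are textbook formulations written for clarity, not an optimized or low-precision implementation, and the interleaved pairing used in `rope` is one common convention (some implementations pair element i with element i + d/2 instead). The point is to show which primitives the hardware has to support: a reciprocal square root, an exponential for the sigmoid, and sine/cosine.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: scale by the reciprocal root-mean-square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu(gate, up):
    """SwiGLU gating: elementwise SiLU(gate) * up.

    In a full feed-forward block, `gate` and `up` come from two separate
    linear projections of the same input.
    """
    return [silu(g) * u for g, u in zip(gate, up)]

def rope(x, position, base=10000.0):
    """Rotary position embedding on consecutive pairs (x[i], x[i+1]).

    Pair j is rotated by angle position * base**(-2j/d); with i = 2j this is
    position * base**(-i/d). Requires an even-length vector.
    """
    d = len(x)
    out = list(x)
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

normed = rms_norm([3.0, 4.0], [1.0, 1.0])
gated = swiglu([0.0, 1.0], [2.0, 2.0])
rotated = rope([1.0, 2.0, 3.0, 4.0], position=5)
```

RoPE is a pure rotation, so it preserves vector length; the accuracy question in low precision is mostly about how the sine and cosine values and the RMSNorm reciprocal are computed and stored.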

Why We Keep an In-House AI Research Team

Because these shifts aren’t theoretical – they show up every day in the conversations between our chip and AI research teams, and they directly change both the silicon and the models we build.

Sample Conversations That Shape Our Chips and Models

At SandLogic, this co-design isn’t a slogan – it’s visible in the daily back-and-forth between our chip architects and AI researchers:

INT8 reality check

Chip team: “If a model is quantized to INT8, do we stay in INT8 compute, or dequantize back to float?”

AI team: “On L40S, weights are stored in INT8 but dequantized for an FP16 GEMM. On H100, the GEMM itself runs in INT8, though the norms still run in floating point.”

This shaped how we designed our MAC datapaths.
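The two paths in that exchange can be sketched in a few lines of plain Python. This is a toy illustration of the general idea, assuming symmetric per-tensor quantization; it is not a description of how any specific GPU or the SandLogic datapath implements it, and all function names are introduced here.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: value ~= scale * q, with q in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dot_weight_only(x, w_q, w_scale):
    """Path A: weights stored as INT8, dequantized on the fly, float multiply-accumulate."""
    return sum(xi * (wq * w_scale) for xi, wq in zip(x, w_q))

def dot_int8(x_q, x_scale, w_q, w_scale):
    """Path B: integer multiply-accumulate into a wide accumulator, one rescale at the end."""
    acc = sum(xq * wq for xq, wq in zip(x_q, w_q))  # stays an integer
    return acc * x_scale * w_scale

x = [0.5, -1.25, 2.0, 0.75]   # activations (made-up values)
w = [0.02, -0.4, 0.31, 0.1]   # weights (made-up values)
exact = sum(a * b for a, b in zip(x, w))

w_q, w_scale = quantize_int8(w)
x_q, x_scale = quantize_int8(x)

print("float reference :", exact)
print("weight-only INT8:", dot_weight_only(x, w_q, w_scale))
print("full INT8 MACs  :", dot_int8(x_q, x_scale, w_q, w_scale))
```

Path A saves memory and bandwidth but still needs floating-point multipliers. Path B keeps the inner loop in integer arithmetic and pays for one rescale per output, at the cost of also quantizing the activations.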

Compression results

AI team: “We’ve built a new compression method.”

Chip team validated: DDR transactions per token dropped 30% (Qwen-7B INT8), 20% (LLaMA-8B INT8), 4.5% (Shakti-500M INT8). That didn’t boost raw TOPS, but it cut power and memory needs – exactly what matters on the edge.

Softmax optimization

Chip team: “If we fuse max, exp, and sum in one pass, and keep them in registers, we can halve DDR trips.”

AI team tested and confirmed the accuracy held.

This became part of ExSLerate’s softmax kernel.
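One well-known way to fuse the max and sum steps is the "online softmax" recurrence, sketched below in plain Python. This is offered only as an illustration of the fusion idea; it is an assumption that it resembles the approach described above, and it is not the ExSLerate kernel. The running maximum and running sum play the role of values that would be held in registers.

```python
import math

def softmax_multi_pass(xs):
    """Baseline: separate passes for max, exp, and sum."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_fused(xs):
    """Online softmax: one read of the input yields both max and sum.

    When a new maximum appears, the running sum is rescaled by
    exp(old_max - new_max) so it stays consistent. Assumes finite inputs.
    """
    running_max = float("-inf")
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    # Second pass only to produce the normalized outputs.
    return [math.exp(x - running_max) / running_sum for x in xs]

scores = [1.0, 3.0, -2.0, 3.0, 0.5]
a = softmax_multi_pass(scores)
b = softmax_fused(scores)
print(max(abs(p - q) for p, q in zip(a, b)))
```

The two functions agree to floating-point rounding, which matches the kind of check described above: the fusion changes how often the data is read, not the result.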

Chip guiding AI models

Often the chip designers tell the AI team: “This operator fires up millions of transistors – use the alternative one, it’s more efficient.”

That feedback makes our Shakti LLMs hardware-friendly by design.

FP8 early bet

Two years ago, both teams agreed FP8 was the sweet spot for multi-precision MACs. We committed early – long before FP8 became an industry buzzword.

Shakti LLMs

Every debate is validated on our in-house Shakti models (100M to 8B parameters). This way, no idea remains theoretical; everything is benchmarked end-to-end.

The Future: Agentic AI on Devices

Global research and industry forecasts consistently highlight Agentic AI moving onto devices as the next big wave. These workloads will be even more complex (multi-modal, autonomous, and context-aware), and they will demand tight silicon-software co-design.

SandLogic is positioning itself ahead of this curve:

Ensuring our chips can support emerging model families and agentic workloads.
Building the surrounding software ecosystem: model zoo, optimized kernels, and developer frameworks to shorten customer time-to-market.

TL;DR for Each Audience

VCs: This isn’t two separate bets – it’s one integrated, defensible strategy. The moat is the feedback loop between models and silicon.
Chip Designers: Throughput alone doesn’t win. Memory bandwidth, attention kernels, and quantization paths matter as much as raw TOPS.
AI Researchers: Hardware constraints aren’t a burden – they’re a design frontier. Small changes in normalization, attention, or precision swing tokens/sec, joules/token, and accuracy on real hardware.

Closing Thought

AI chipmaking is not like CPU design. It’s not “design once, run forever.” It’s a living process of co-design, where hardware and models evolve together.

That’s why the future leaders in AI silicon will always have in-house AI research teams. Not as a side project, but as a core necessity.

At SandLogic, our chip and AI teams constantly challenge each other, debating operators, compression, quantization, and even transistor counts, until both the model and the silicon become leaner, faster, and more efficient.

Because in AI silicon, accuracy, performance, and efficiency are born from collaboration.
