Google Launches Gemini 3.1 Flash-Lite at $0.25 Per Million Tokens
Google has released Gemini 3.1 Flash-Lite, its most cost-efficient AI model to date, priced at $0.25 per million input tokens, one-eighth the cost of Gemini 3.1 Pro. The model delivers 2.5 times faster responses and 45% higher output speeds than its predecessor, while supporting a one-million-token context window and multimodal inputs including text, images, audio, video, and PDFs. For operators running high-volume AI workflows, the pricing shift opens use cases that were previously too expensive to sustain.
Operator Insight
At $0.25 per million tokens, the economics of AI automation have fundamentally changed for operators. Workflows that were cost-prohibitive at $2 to $5 per million tokens (processing large documents, running bulk content moderation, generating personalised communications at scale) are now viable at a fraction of the cost. The question is no longer whether AI is affordable. It is whether you have identified the workflows worth automating.
30-Second Summary
Google released Gemini 3.1 Flash-Lite in early March 2026, pricing it at $0.25 per million input tokens and $1.50 per million output tokens. The model is 2.5 times faster than its predecessor and outperforms competing budget models from OpenAI and Anthropic on the majority of major benchmarks. It accepts a one-million-token context window and handles text, images, audio, video, and PDFs natively. For operators building or running AI-powered workflows, this is the most significant cost reduction in frontier-adjacent AI capability since budget models first emerged, and it materially changes which use cases are worth automating.
At a Glance
- Topic: Model Releases
- Company: Google
- Date: 3 March 2026 (preview release)
- Announcement: Gemini 3.1 Flash-Lite released via Google AI Studio and Vertex AI at $0.25 per million input tokens
- What Changed: A frontier-adjacent AI model with a one-million-token context window and multimodal support is now available at one-eighth the cost of Google's flagship model
- Why It Matters: High-volume automation use cases that were previously cost-prohibitive are now viable for operators running lean businesses
- Who Should Care: Operators using AI APIs for document processing, content generation, customer communications, or any high-volume workflow
Key Facts
- Company: Google
- Launch Date: 3 March 2026 (preview via Google AI Studio and Vertex AI)
- Pricing: $0.25 per million input tokens, $1.50 per million output tokens
- Comparison: One-eighth the cost of Gemini 3.1 Pro ($2/M input), lower than Gemini 2.5 Flash ($0.30/M input), significantly cheaper than Claude 4.5 Haiku ($1/M input)
- Speed: 2.5 times faster Time to First Token, 45% higher output speed versus Gemini 2.5 Flash; generates 225 tokens per second
- Context Window: One million tokens
- Modalities: Text, images, audio, video, PDFs
- Architecture: Mixture-of-experts (MoE), inherited from Gemini 3.1 Pro
- Who It Affects: Developers, operators, and enterprises running AI at volume via API
- Primary Source: Google blog and Google AI Studio, 3 March 2026
What Happened
Google released Gemini 3.1 Flash-Lite in preview on 3 March 2026, completing a tiered model strategy launched alongside Gemini 3.1 Pro in February. Flash-Lite sits at the efficiency end of the range, designed for high-volume workloads where cost and speed take priority over maximum capability.
The pricing is the headline: $0.25 per million input tokens and $1.50 per million output tokens. For context, that is one-eighth the cost of Gemini 3.1 Pro and below the previous generation Gemini 2.5 Flash. Competing budget models from Anthropic (Claude 4.5 Haiku at $1/M input) and OpenAI (GPT-5 mini) are priced higher for input, making Flash-Lite the most affordable option among frontier-adjacent models at launch.
Despite the lower price, the performance is competitive. Flash-Lite achieved the top score across six of eleven benchmark tests in independent evaluations, outperforming GPT-5 mini and Claude 4.5 Haiku. On the Arena.ai leaderboard it holds an Elo score of 1,432. It scores 86.9% on GPQA Diamond and 76.8% on MMMU Pro, both results that exceed what larger Gemini models from previous generations achieved.
The model uses a mixture-of-experts (MoE) architecture, activating only a subset of its parameters per inference call. This is the same structural approach as Gemini 3.1 Pro, which means Flash-Lite benefits from a large training base while keeping per-inference compute costs low. The result is performance that exceeds its price tier more consistently than previous budget models managed.
Developers can control the model's reasoning depth through four thinking modes: minimal, low, medium, and high. This allows operators to balance response quality against cost and latency depending on the task. The one-million-token context window is available at all thinking levels, meaning document-heavy workflows do not require chunking or pre-processing.
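The thinking mode can be selected per request, so an operator can run bulk classification at minimal reasoning depth and reserve higher levels for harder tasks. The sketch below builds a request payload in the shape Google's Gen AI SDK expects; the model id and the `thinking_level` field name are taken from this briefing and should be treated as assumptions until checked against the current API reference.

```python
# Sketch: selecting a thinking mode per task. Model id and the
# "thinking_level" field are assumptions based on this briefing.

VALID_LEVELS = {"minimal", "low", "medium", "high"}

def build_request(prompt: str, thinking_level: str = "minimal") -> dict:
    """Build a generation request, trading response quality against
    cost and latency via the thinking mode."""
    if thinking_level not in VALID_LEVELS:
        raise ValueError(f"thinking_level must be one of {sorted(VALID_LEVELS)}")
    return {
        "model": "gemini-3.1-flash-lite",   # hypothetical model id
        "contents": prompt,
        "config": {"thinking_level": thinking_level},
    }

# Bulk routing rarely needs deep reasoning, so "minimal" keeps latency low:
req = build_request("Classify this ticket: 'Refund not received'", "minimal")
```

The point of centralising this in one helper is that a single workflow can dial reasoning depth up or down per call without any other code changes.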
Why It Matters
- At $0.25 per million input tokens, operators can now run AI across millions of documents or customer interactions per month at a cost that fits inside existing operational budgets
- The one-million-token context window eliminates the chunking problem for large documents, contracts, audio transcripts, and historical data, making these workflows practical without custom engineering
- Multimodal support at this price point means a single model can process mixed content (text alongside images, audio, or PDFs), reducing the number of different tools an operator needs to manage
- The speed improvement (225 tokens per second, 2.5 times faster than predecessor) reduces latency in real-time applications like customer-facing chat, automated email responses, and live document analysis
- Budget model performance catching up to previous-generation frontier models shifts the decision calculus: operators no longer need to trade quality against cost to the degree they did 12 months ago
- Availability on both Google AI Studio and Vertex AI means operators can access Flash-Lite through Google's consumer developer tools or its enterprise-grade platform with compliance and access controls
The David and Goliath View
The release of Gemini 3.1 Flash-Lite matters because it changes the economics of what is worth automating. Twelve months ago, running AI across a large document library, a year of customer emails, or thousands of product images required either significant API budget or a willingness to accept lower-quality models. At $0.25 per million tokens with frontier-adjacent performance, that trade-off has collapsed.
For operators running businesses with 10 to 200 people, this is not an incremental improvement. It is a genuine capability shift. A workflow that processes 10 million tokens per month, roughly the equivalent of reading thousands of customer contracts or generating personalised outreach at meaningful scale, now costs $2.50 in input processing. The barrier to AI-powered operations is no longer price. It is workflow design and implementation.
The practical implication is straightforward: operators should revisit every AI use case they dismissed in the past 18 months because the economics did not stack up. Many of those decisions were correct at the time and are now wrong. The operators who move quickly to identify and implement the newly viable workflows will compound advantages over the next 12 months that will be difficult for slower movers to close.
Start with your highest-volume, most repetitive knowledge work. Calculate what it currently costs in staff time. Run the numbers at $0.25 per million tokens. The business case will often be obvious.
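That back-of-envelope calculation takes a few lines. The sketch below uses the rates published in this briefing; the 10-million-input-token workflow mentioned above, with an assumed 1 million tokens of output, lands at $4.00 per month.

```python
# Monthly API cost at Gemini 3.1 Flash-Lite's published rates.

INPUT_PER_M = 0.25    # USD per million input tokens
OUTPUT_PER_M = 1.50   # USD per million output tokens

def monthly_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated monthly API spend in USD."""
    return (input_tokens / 1_000_000) * INPUT_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PER_M

# 10M input tokens/month (thousands of contracts) plus ~1M tokens of output:
print(round(monthly_cost(10_000_000, 1_000_000), 2))  # 4.0
```

Compare that figure against the staff hours the same volume of reading or drafting currently consumes and the business case usually writes itself.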
Where This Fits in the AI Stack
AI Growth Engine: Flash-Lite makes high-volume AI for commercial workflows (outbound personalisation, lead enrichment, content generation, and customer communication) economically viable at a scale that was previously reserved for large enterprises with significant AI budgets.
Employee Amplification Systems: Document processing, internal knowledge retrieval, meeting transcript analysis, and policy review are all high-volume tasks that can now be automated at minimal API cost, freeing staff for higher-value work without material budget impact.
Questions Operators Are Asking
How does Flash-Lite compare to the models we are already using? If you are using Claude 4.5 Haiku or GPT-5 mini for high-volume tasks, Flash-Lite is priced lower and benchmarks competitively. For most document processing, content generation, and classification tasks, Flash-Lite should produce comparable results at lower cost. The only meaningful trade-off is that it is a preview release and not yet at full general availability on Vertex AI.
What is the one-million-token context window useful for in practice? It lets you pass entire documents into a single API call rather than splitting them into chunks and reassembling responses. Practically, this means processing full contracts, complete customer histories, long transcripts, or large product catalogues without custom pre-processing logic. For most operators, eliminating chunking alone reduces engineering complexity significantly.
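A rough pre-flight check is often all the "chunking logic" that remains. The sketch below uses the common 4-characters-per-token heuristic for English prose; that ratio is an assumption, not an exact tokeniser, so treat it as a sanity check rather than a guarantee.

```python
# Rough check that a document fits a single Flash-Lite call.
# The 4 chars/token ratio is a heuristic, not an exact tokeniser.

CONTEXT_WINDOW = 1_000_000   # tokens, per this briefing
CHARS_PER_TOKEN = 4          # rough average for English prose

def fits_in_one_call(text: str, reserved_for_output: int = 8_000) -> bool:
    """True if the document plus output headroom fits the context window."""
    estimated_tokens = len(text) // CHARS_PER_TOKEN
    return estimated_tokens + reserved_for_output <= CONTEXT_WINDOW

# A ~300-page contract (~600k characters, ~150k estimated tokens):
print(fits_in_one_call("x" * 600_000))  # True
```

Documents that fail the check are the rare cases that still need splitting; everything else can go through as a single request.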
Is a preview model reliable enough for production? Preview releases from Google typically reach general availability within one to three months. For non-critical, reversible workflows, running Flash-Lite in preview carries acceptable risk. For customer-facing or compliance-sensitive applications, wait for general availability or run Flash-Lite in parallel with your existing model while evaluating quality.
What should we actually use this for? High-volume classification (routing, tagging, categorisation), document summarisation across large libraries, personalised content generation at scale, multimodal analysis (images plus text, audio plus text), and first-pass research across large datasets. These are the use cases where volume is high enough that cost matters, and Flash-Lite's economics make them viable.
Does switching models require significant engineering work? If you are already using the Google AI or Vertex AI SDKs, switching to Flash-Lite is a model name change in your configuration. If you are migrating from another provider, the standard REST and Python SDKs for Vertex AI are well-documented and the migration is typically straightforward for engineers familiar with API-based AI integration.
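In practice the swap is a one-field change in whatever configuration holds your model id. The sketch below is illustrative; both model ids are assumed from this briefing and should be verified against the current Vertex AI model list.

```python
# Sketch: swapping models is a configuration change, not a rewrite.
# Model ids below are assumed from this briefing.

CONFIG = {
    "model": "gemini-2.5-flash",   # current model id
    "temperature": 0.2,
}

def switch_model(config: dict, new_model: str) -> dict:
    """Return a copy of the config pointing at a different model id."""
    updated = dict(config)
    updated["model"] = new_model
    return updated

new_config = switch_model(CONFIG, "gemini-3.1-flash-lite")
print(new_config["model"])  # gemini-3.1-flash-lite
```

Returning a copy rather than mutating in place makes it easy to run the old and new models side by side while you compare output quality.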
Citable Summary
What happened: Google released Gemini 3.1 Flash-Lite on 3 March 2026 at $0.25 per million input tokens, making it the most affordable frontier-adjacent AI model on the market, with a one-million-token context window, multimodal support, and benchmark results that exceed competing budget models.
Why it matters: The pricing shift makes high-volume AI automation economically viable for operators who previously could not justify the API cost, reopening use cases that were dismissed over the past 18 months.
David and Goliath view: The barrier to AI-powered operations is no longer price. It is workflow design. Operators who revisit previously rejected use cases and move quickly to implement the newly viable ones will compound advantages that slower movers will struggle to close.
Offer relevance:
- AI Growth Engine: high-volume commercial workflows including personalisation, content generation, and lead enrichment are now cost-effective at meaningful scale
- Employee Amplification Systems: document processing, knowledge retrieval, and transcript analysis can be automated at minimal API cost, freeing staff for higher-value work
Why This Matters for Operators
- ✓ Audit your highest-volume AI workflows and reprice them against Gemini 3.1 Flash-Lite's rates. Processes you dismissed as too expensive six months ago may now deliver a clear return.
- ✓ The one-million-token context window means you can process entire contracts, customer histories, or document batches in a single API call. This eliminates chunking workarounds that previously added complexity and cost.
- ✓ Multimodal support at this price point means operators can build AI workflows that process images, audio, and PDFs alongside text, without switching between different models or providers.
- ✓ Benchmark your current model usage against Flash-Lite before your next billing cycle. The 2.5 times speed improvement may also reduce latency in customer-facing applications where response time matters.
Related Intelligence
Related Briefings
- OpenAI's GPT-5.4 Surpasses Humans at Autonomous Desktop Tasks (OpenAI | Model Releases)
- GPT-5.4 Turns ChatGPT into an Autonomous Digital Coworker (OpenAI | Model Releases)
- Gemini 3.1 Flash-Lite Makes Powerful AI 8x Cheaper to Run (Google | AI Infrastructure)
- Meta's Llama 4 Brings Frontier AI to Self-Hosted Deployments (Meta | Model Releases)
How This Maps to David & Goliath
Want to act on this?
Every briefing connects to systems we build. If this development is relevant to your business, let us show you what it looks like in practice.
Book a Strategy Call