Google's Veo 3.1 is the latest iteration of its video generation model, and the API release has made it accessible to developers building video into their products. Whether you are prototyping a text-to-video tool or adding video generation to an existing SaaS app, understanding the API surface, pricing tiers, and practical limitations matters more than hype.
This guide breaks down what the Veo 3.1 API actually offers, how much it costs per second of generated video, and includes working examples so you can evaluate it against alternatives like Kling and Runway before committing your budget.
What Veo 3.1 Brings to the Table
Veo 3.1 generates 1080p video from text prompts or image inputs with native synchronized audio, including dialogue, ambient sound effects, and background music. This builds on Veo 3, which introduced native audio generation. Prompts can also steer camera angles, lighting, and scene composition with fine-grained control. Key capabilities include:
- Text-to-video generation at 1080p resolution
- Image-to-video conversion (single image or frames-to-video with two images)
- Native audio generation synced to visual content
- Scene extension supporting up to 20 chained clips for 140+ second narratives
- Vertical video output for short-form content
- 4K upscaling on generated clips
The model is available through Google's Vertex AI platform and the Gemini API. For developers already working within the Google Cloud ecosystem, integration is relatively straightforward.

API Access and Authentication
There are two access paths. On Vertex AI, you need a Google Cloud project with the Vertex AI API enabled and authenticate with standard Google Cloud service account credentials. Through the Gemini API, an API key is enough. If you have worked with other AI content generation APIs, the setup pattern will feel familiar.
Here is a basic text-to-video request using curl:
curl -X POST \
  "https://generativelanguage.googleapis.com/v1beta/models/veo-3.1:predictLongRunning" \
  -H "x-goog-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "instances": [{
      "prompt": "A drone shot flying over a coastal city at golden hour"
    }],
    "parameters": {
      "aspectRatio": "16:9",
      "durationSeconds": 8,
      "personGeneration": "allow_adult"
    }
  }'
The response returns an operation ID. Video generation is asynchronous, so you poll the operation endpoint until the video is ready, then download the result. Typical generation time for an 8-second clip runs between 60 and 120 seconds depending on server load and the quality tier selected.
For Python developers, the google-genai SDK simplifies the process. If you are comparing this flow to how other video generation APIs handle requests, the pattern is similar: submit a job, poll for completion, retrieve the output.
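The submit, poll, retrieve loop can be sketched independently of any SDK. The helper below is a minimal sketch under stated assumptions, not part of google-genai: `poll_until_done`, `fetch_status`, and the `"done"` key are hypothetical names chosen for illustration. It adds the one thing a bare `while` loop lacks, a timeout, so a stuck job cannot block a worker forever.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],
    interval_s: float = 10.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch_status() until it reports done=True or the timeout passes.

    fetch_status is a hypothetical callable that returns the latest
    operation state as a dict with a boolean "done" key.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status.get("done"):
            return status
        if clock() >= deadline:
            raise TimeoutError("video generation did not finish in time")
        sleep(interval_s)
```

In practice `fetch_status` would wrap whatever call returns the operation state for your access path. The `sleep` and `clock` parameters exist only so the loop can be tested without waiting.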
Pricing Breakdown
Veo 3.1 pricing varies significantly based on the quality tier and access method. Here is what each tier costs as of mid-2026. Relative to other AI video generators, the per-second cost is competitive at the Fast tier and premium at Standard.
API (pay-per-use via Vertex AI / Gemini API)
- Veo 3.1 Fast: $0.15 per second of generated video (includes audio)
- Veo 3.1 Standard: $0.40 per second (includes audio)
- Veo 3.1 Lite: approximately $0.05 per second (lower quality, faster generation)
Subscription plans (consumer access) are another option. Google AI Plus starts at $7.99/month with limited video generation credits. The Pro tier at $19.99/month includes 1,000 credits; a 10-second video uses about 125 credits, so the plan covers roughly eight such videos at about $2.50 each, or roughly $0.25/second effective cost. Google AI Ultra at $249.99/month provides the highest limits.
Several free access methods exist: a 1-month Pro trial, a 12-month student plan, $300 in Google Cloud credits for new accounts, and limited access through Google AI Flow. The per-second math is similar when comparing these tiers against Flux Pro API pricing.
For a production workload generating 100 ten-second clips per day at Standard quality, you are looking at roughly $400/day or $12,000/month. At the Fast tier, that drops to $150/day, or about $4,500/month. These numbers make sense for polished marketing video content but get expensive for high-volume applications like automated ad generation or social media content at scale.
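The workload arithmetic is easy to rerun for your own volumes. This is a small sketch using the per-second API rates listed in this section; the function name `estimate_cost` and the 30-day month are assumptions made for illustration.

```python
# Per-second API rates quoted in this section (USD); the Lite rate is approximate.
RATES_PER_SECOND = {"lite": 0.05, "fast": 0.15, "standard": 0.40}


def estimate_cost(tier: str, clips_per_day: int, seconds_per_clip: int, days: int = 30) -> dict:
    """Return daily spend and spend over `days` for a steady workload."""
    daily = RATES_PER_SECOND[tier] * clips_per_day * seconds_per_clip
    return {"daily": round(daily, 2), "period": round(daily * days, 2)}


# 100 ten-second clips per day, at each tier
for tier in RATES_PER_SECOND:
    print(tier, estimate_cost(tier, clips_per_day=100, seconds_per_clip=10))
```

For the Standard tier this reproduces the $400/day and $12,000 per 30 days figures, which makes it a quick sanity check before changing clip length or volume.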

Practical Code Examples
Image-to-Video with Python
Converting a product photo into a short video clip is one of the more practical use cases. Here is an example using the Gemini API, a pattern that also works well for creating AI avatars from photos:
import time

from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

# Submit the image-to-video job; the call returns a long-running operation
operation = client.models.generate_videos(
    model="veo-3.1",
    prompt="Slow zoom into this product with soft studio lighting",
    image=genai.types.Image.from_file(location="product-photo.jpg"),
    config=genai.types.GenerateVideosConfig(
        aspect_ratio="16:9",
        duration_seconds=6,
    ),
)

# Poll until complete
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

# Fetch the video bytes from the API, then save the result
for video in operation.response.generated_videos:
    client.files.download(file=video.video)
    video.video.save("output.mp4")
This pattern works well for e-commerce product videos, where you already have high-quality stills and want to create motion content without a full video shoot. Teams building image-to-video pipelines at scale often chain this with background removal and prompt templates for consistent output.
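A prompt template is one way to keep output consistent across a catalog. The following is a minimal sketch using only the standard library; the template wording, the placeholder names, and `build_prompt` are hypothetical and would need tuning against real output.

```python
from string import Template

# Hypothetical house template; the placeholders are illustrative only.
PRODUCT_TEMPLATE = Template(
    "$camera_move on $product, $lighting, $background background"
)


def build_prompt(product: str, camera_move: str = "Slow zoom in",
                 lighting: str = "soft studio lighting",
                 background: str = "plain white") -> str:
    """Fill the shared template so every product gets the same visual style."""
    return PRODUCT_TEMPLATE.substitute(
        product=product, camera_move=camera_move,
        lighting=lighting, background=background,
    )


catalog = ["a ceramic mug", "a leather wallet"]
prompts = [build_prompt(item) for item in catalog]
```

Each string in `prompts` can then be passed as the `prompt` argument alongside the matching product image. Keeping the style choices as defaults means a single edit changes the look of the whole batch.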
Scene Extension for Longer Narratives
One of Veo 3.1's standout features is scene extension. You can chain clips together while maintaining visual and audio continuity. This kind of sequential generation fits naturally into workflow-based platforms with node-based, multi-step pipelines, where each step feeds the next.
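import time

# Reuses `client` and `genai` from the previous example
def wait(operation):
    """Poll a generate_videos operation until it finishes."""
    while not operation.done:
        time.sleep(10)
        operation = client.operations.get(operation)
    return operation

# Generate initial scene and wait for it to finish
scene1 = wait(client.models.generate_videos(
    model="veo-3.1",
    prompt="A woman walks into a coffee shop on a rainy morning",
    config=genai.types.GenerateVideosConfig(duration_seconds=8),
))

# Extend the scene, passing the finished clip as the starting video
scene2 = wait(client.models.generate_videos(
    model="veo-3.1",
    prompt="She orders a latte and sits by the window, watching rain",
    video=scene1.response.generated_videos[0].video,
    config=genai.types.GenerateVideosConfig(duration_seconds=8),
))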
Each extension maintains the character appearance and scene lighting from the previous clip, which is a significant improvement over stitching independently generated clips together. For developers exploring no-code approaches to similar workflows, visual pipeline builders can abstract this polling logic away entirely.
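For more than two scenes, the same idea generalizes to a loop in which each finished clip seeds the next request. This is a sketch under stated assumptions: `chain_scenes` and the `generate(prompt, previous_clip)` callable are hypothetical, and the callable is assumed to submit a request, wait for completion, and return the finished clip.

```python
from typing import Callable, List, Optional

MAX_CHAINED_CLIPS = 20  # chaining limit stated in this article


def chain_scenes(
    prompts: List[str],
    generate: Callable[[str, Optional[object]], object],
) -> List[object]:
    """Generate clips in order, feeding each finished clip into the next call.

    generate(prompt, previous_clip) is a hypothetical wrapper; it receives
    None as previous_clip for the first scene.
    """
    if len(prompts) > MAX_CHAINED_CLIPS:
        raise ValueError(f"at most {MAX_CHAINED_CLIPS} clips can be chained")
    clips: List[object] = []
    previous = None
    for prompt in prompts:
        previous = generate(prompt, previous)
        clips.append(previous)
    return clips
```

Because every step depends on the previous result, the loop is inherently sequential, so total wall-clock time grows with the number of scenes.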
How Veo 3.1 Compares to Alternatives
The video generation API space has gotten crowded. Here is how Veo 3.1 stacks up against the main alternatives for developer use. For a deeper dive, the full comparison of AI video generators covers output quality and watermark policies in detail.
- Kling 2.5 / 3 Pro: Competitive quality at similar price points with strong motion coherence. Available through multiple third-party API providers. Lacks native audio generation.
- Runway Gen-4: Lower latency for short clips and good for rapid prototyping. More limited in duration and resolution options compared to Veo 3.1. Higher per-second pricing at the top tier.
- Sora (OpenAI): Strong prompt adherence and cinematic quality. API access is still limited and pricing remains unclear for production workloads.
- Seedance 2.1: Budget-friendly option for simpler video generation tasks. Lower quality ceiling but useful for high-volume workflows where cost matters more than polish.
The native audio generation in Veo 3.1 is currently its biggest differentiator. If your use case requires synced dialogue or sound effects, it eliminates a full post-production step that competitors still require you to handle separately.

Limitations and Gotchas
Before building your pipeline around Veo 3.1, consider these practical constraints. For teams working with AI photo enhancement, many of these limitations will feel familiar from the image generation side:
- Generation time is slow. An 8-second clip can take 60 to 120 seconds to generate. For real-time or near-real-time applications, this is a non-starter.
- Rate limits apply. Vertex AI imposes per-minute and per-day generation limits that vary by project tier. Burst workloads need careful queue management.
- Audio quality varies. While native audio is impressive, dialogue generation can produce uncanny results, especially with emotional delivery or overlapping speakers.
- Content filtering is aggressive. Google's safety filters reject a broad range of prompts that competing APIs handle without issue. For creative or editorial content, this can be frustrating.
- No fine-tuning yet. Unlike image models that support LoRA or DreamBooth, Veo 3.1 does not currently allow model customization.
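For the rate-limit point, one common approach is to throttle submissions on the client side so bursts are smoothed before they reach the API. The class below is a generic sliding-window sketch, not a Google API; `SlidingWindowLimiter` is a hypothetical name, and the actual per-minute quota for your project has to be looked up and passed in.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most max_calls submissions in any window_s-second window."""

    def __init__(self, max_calls: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.max_calls = max_calls
        self.window_s = window_s
        self._clock = clock
        self._sleep = sleep
        self._sent: Deque[float] = deque()  # timestamps of recent submissions

    def acquire(self) -> None:
        """Block until one more request fits inside the window."""
        while True:
            now = self._clock()
            # Forget submissions that have aged out of the window.
            while self._sent and now - self._sent[0] >= self.window_s:
                self._sent.popleft()
            if len(self._sent) < self.max_calls:
                self._sent.append(now)
                return
            # Sleep until the oldest submission leaves the window.
            self._sleep(self.window_s - (now - self._sent[0]))
```

Calling `limiter.acquire()` immediately before each submission keeps a single worker under the configured rate. It does not coordinate across processes, so a multi-worker deployment would need a shared queue or store instead.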
Developers building production video features should also plan for fallback logic. API availability has been inconsistent during high-demand periods, and having a secondary model like Kling or an alternative video generation pipeline ready is good practice.
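Fallback logic can be as simple as trying providers in order of preference. This is a minimal sketch, assuming each provider is wrapped in a callable that takes a prompt and returns the finished video bytes or raises on failure; `generate_with_fallback` and the provider wrappers are hypothetical and not tied to any vendor SDK.

```python
from typing import Callable, Sequence, Tuple


def generate_with_fallback(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], bytes]]],
) -> Tuple[str, bytes]:
    """Try each provider in order and return (provider_name, video_bytes).

    Each provider is a hypothetical callable that takes a prompt and
    returns the finished video, raising an exception on failure.
    """
    errors = []
    for name, generate in providers:
        try:
            return name, generate(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the result matters here because a fallback model may lack native audio, so downstream steps need to know which path produced the clip.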
FAQ
What does Veo 3.1 cost per second of video? The Fast tier costs $0.15/second, Standard is $0.40/second, and Lite runs approximately $0.05/second. All tiers include native audio generation in the per-second price.
Can I use Veo 3.1 for free? Yes, through limited channels. New Google Cloud accounts get $300 in credits, and there is a 1-month Pro trial that includes video generation. Student plans offer 12 months of free access.
How long does it take to generate a video? An 8-second clip typically takes 60 to 120 seconds. Longer clips with scene extension take proportionally longer since each segment generates sequentially. Batch approaches like running multiple API calls in parallel can improve throughput.
Does Veo 3.1 generate audio automatically? Yes. Unlike most competitors, Veo 3.1 generates synchronized audio natively, including dialogue, ambient sounds, and background music. You do not need a separate audio generation step.
What video resolutions does Veo 3.1 support? Base output is 1080p in 16:9 or 9:16 (vertical). 4K upscaling is available as a post-processing step for Standard tier outputs. This makes it suitable for both social media reels and full-resolution marketing content.
Can I chain multiple clips together? Yes. Scene extension lets you chain up to 20 clips while maintaining visual and audio continuity, enabling narratives over 140 seconds long. You can find practical guidance on multi-clip workflows at wireflow.ai.
Is there a Python SDK for Veo 3.1? Yes. The google-genai Python package provides native support for Veo 3.1 video generation, including image-to-video, scene extension, and batch processing. The SDK wraps job submission, operation status checks, and result download in Python calls, so you write a short polling loop rather than managing raw HTTP requests.
Conclusion
Veo 3.1 is a capable video generation API with native audio support that sets it apart from competitors. The pricing is reasonable for quality-focused use cases but adds up quickly at scale. For developers evaluating it, the free tier options make it possible to prototype without financial commitment. The main trade-offs are generation speed, content filtering strictness, and the lack of model customization.
Start with the Lite tier for testing, move to Fast for production prototypes, and reserve Standard for final output where quality justifies the $0.40/second cost. Build fallback logic for when the API is under heavy load, and keep an eye on the broader video generation landscape as competitors like Kling 3 Pro are closing the gap on audio generation.
