Transform your images into stunning AI videos with Alibaba's advanced model
From text to video or image to video, Wan 2.5 on ArtisanAI delivers cinematic visuals, synchronized audio, and flexible outputs — all at a fraction of the cost.
Alibaba Wan 2.5 is a state-of-the-art AI video generation model, designed to transform text prompts and reference images into cinematic video outputs. Originally released on Alibaba Cloud's DashScope platform, it demonstrates advanced capabilities in visual realism, motion dynamics, and audio synchronization.
To make these features easier to integrate, Alibaba offers Wan 2.5, which includes both text-to-video (T2V) and image-to-video (I2V) preview endpoints. With the wan2.5-t2v-preview and wan2.5-i2v-preview endpoints, developers can generate short videos enhanced by lip-sync and audio alignment.
Beyond DashScope, ArtisanAI now provides direct access to Wan 2.5, giving creators and developers a more flexible, cost-effective way to bring Alibaba's cutting-edge video technology into apps, workflows, and creative projects—making it a strong alternative to Google's Veo 3.
wan2.5-t2v-preview
The wan2.5-t2v-preview endpoint enables developers to generate videos directly from text prompts. By describing scenes, actions, and environments, it produces cinematic video clips with smooth motion and synchronized audio, perfect for storyboards, marketing campaigns, and social media content.
wan2.5-i2v-preview
The wan2.5-i2v-preview endpoint transforms static images into dynamic short videos. It preserves the original identity and style of the image while adding lifelike animations and perspective changes, making it ideal for portraits, product showcases, and creative storytelling.
Wan 2.5 makes it possible to generate video and audio together in a single request. Dialogues, ambient sounds, and background music are automatically synchronized with visuals, delivering immersive outputs without extra editing.
With Wan 2.5 text-to-video, complex prompts are followed more faithfully. Camera angles, lighting setups, and scene dynamics are captured with higher precision, giving developers confidence that each request will translate creative instructions into consistent video results.
Wan 2.5 supports a wide range of visual styles—from cinematic realism to anime or illustration. It preserves character identity and scene coherence, allowing developers to integrate versatile aesthetics into their applications.
Wan 2.5 provides both wan2.5-t2v-preview (text-to-video) and wan2.5-i2v-preview (image-to-video) endpoints. Both modes support multiple resolutions (720p, 1080p), while aspect ratio choices (16:9, 9:16, 1:1) are available for text-to-video generation.
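The option rules above can be sketched as a small request builder. This is a minimal illustration, not the documented request schema: the field names (model, prompt, resolution, aspect_ratio, image_url), the 16:9 default, and the rule that image-to-video rejects an explicit aspect ratio are all assumptions. Only the endpoint names, resolutions, and ratios come from the text.

```python
# Endpoint names, resolutions, and ratios come from the text above.
# The field names and defaults are illustrative assumptions.
MODELS = {"t2v": "wan2.5-t2v-preview", "i2v": "wan2.5-i2v-preview"}
RESOLUTIONS = ("720p", "1080p")
T2V_ASPECT_RATIOS = ("16:9", "9:16", "1:1")


def build_request(mode, prompt, resolution="720p", aspect_ratio=None, image_url=None):
    """Return a request dict for one generation job, or raise ValueError."""
    if mode not in MODELS:
        raise ValueError(f"mode must be one of {sorted(MODELS)}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {RESOLUTIONS}")

    prompt = prompt.strip()
    request = {"model": MODELS[mode], "prompt": prompt, "resolution": resolution}

    if mode == "t2v":
        if not prompt:
            raise ValueError("text-to-video needs a non-empty prompt")
        # Aspect ratio choices are described for text-to-video only.
        ratio = aspect_ratio or "16:9"  # the default is an assumption
        if ratio not in T2V_ASPECT_RATIOS:
            raise ValueError(f"aspect_ratio must be one of {T2V_ASPECT_RATIOS}")
        request["aspect_ratio"] = ratio
    else:
        # Assumption: image-to-video takes its framing from the source image.
        if aspect_ratio is not None:
            raise ValueError("aspect_ratio applies to text-to-video only")
        if not image_url:
            raise ValueError("image-to-video needs an image_url")
        request["image_url"] = image_url
    return request


print(build_request("t2v", "A cyclist racing downhill at sunset", "1080p", "9:16"))
```

Validating options before sending a job keeps mistakes such as an unsupported ratio from costing a generation request.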
Both Alibaba Wan 2.5 and Google Veo 3 represent the latest in AI video generation, offering text-to-video and image-to-video capabilities with audio. But their strengths are not the same. Veo 3 is built for cinematic realism, while Wan 2.5 focuses on native audio-video sync, flexible output options, and stronger multilingual performance.
Feature | Wan 2.5 (Alibaba) | Veo 3 (Google) |
---|---|---|
Generation Modes | Text-to-Video (wan2.5-t2v-preview) & Image-to-Video (wan2.5-i2v-preview) | Text-to-Video & Image-to-Video |
Audio & A/V Sync | ✓ Native audio-video generation with dialogue, ambient sound, and BGM | Audio available but less integrated; focus remains on visuals |
Prompt Adherence | ✓ Strong fidelity to complex instructions, including camera, lighting, and motion | Excellent realism, but may struggle with highly detailed or abstract prompts |
Style Adaptation | ✓ Cinematic realism, anime, illustration; strong stylization support | Focus on cinematic realism, less flexible for stylized outputs |
Multilingual Support | ✓ Reliable with Chinese & less widely spoken languages | Limited; often defaults to "unknown language" in non-English prompts |
Video Duration | Up to 10 seconds | Up to ~8 seconds |
Aspect Ratio Options | ✓ 16:9, 9:16, 1:1 (T2V) | Primarily cinematic formats; fewer ratio options |
To make the most of Wan 2.5, it's important to craft clear, detailed, and structured prompts. The model responds best when both the visual and audio instructions are spelled out. Here are practical recommendations:
When adding speech, don't just request "dialogue." Instead, provide the exact words to be spoken and specify who says them. This is especially important in multi-character scenes where order and clarity matter.
Example: Character A: "We have to keep moving." Character B: "Not until we find shelter."
By writing dialogue this way, you ensure the model assigns the right lines to the right characters.
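The speaker-and-line format above is easy to generate from structured data. The helper below is a hypothetical sketch (format_dialogue is not part of any Wan 2.5 or ArtisanAI API); it only builds the prompt text, keeping the lines in speaking order.

```python
def format_dialogue(lines):
    """Turn (speaker, text) pairs into a dialogue string, in speaking order."""
    parts = []
    for speaker, text in lines:
        speaker, text = speaker.strip(), text.strip()
        if not speaker or not text:
            raise ValueError("each line needs both a speaker and the exact words")
        parts.append(f'{speaker}: "{text}"')
    return " ".join(parts)


script = [
    ("Character A", "We have to keep moving."),
    ("Character B", "Not until we find shelter."),
]
print(format_dialogue(script))
# Character A: "We have to keep moving." Character B: "Not until we find shelter."
```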
In some videos, the atmosphere should be driven by visuals or sound effects alone. If you don't want dialogue, make that clear in your prompt. Adding phrases such as "no dialogue" or "no actors speaking" prevents unintended voices from appearing.
This small detail keeps your output aligned with the creative vision.
Beyond dialogue, ambient sound and music set the emotional tone. Be specific about the kind of environment or soundtrack you want, whether it's natural or dramatic.
Examples:
• "soft rain tapping on windows with distant thunder"
• "fast-paced action music with heavy percussion"
The clearer you are, the better the model can synchronize visuals with sound to create an immersive result.
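One way to keep audio instructions explicit is to assemble them in a single place. The sketch below is a hypothetical helper, and the "ambient sound:" and "music:" labels are an assumed wording rather than required syntax; it appends "no dialogue" whenever speech is not wanted.

```python
def audio_clause(ambient=None, music=None, allow_speech=False):
    """Build the audio part of a prompt from optional ambient and music cues."""
    parts = []
    if ambient:
        parts.append(f"ambient sound: {ambient}")
    if music:
        parts.append(f"music: {music}")
    if not allow_speech:
        # State the absence of speech explicitly to avoid unintended voices.
        parts.append("no dialogue")
    return ", ".join(parts)


print(audio_clause(ambient="soft rain tapping on windows with distant thunder"))
# ambient sound: soft rain tapping on windows with distant thunder, no dialogue
```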
Wan 2.5 excels when prompts include setting, lighting, camera perspective, and mood. Instead of writing "a person walking on a road," expand the description to capture cinematic elements.
Example: A wide shot of a mountain road at sunset, golden light flooding the sky, a cyclist racing downhill, with energetic background music.
This depth of description allows the model to produce more natural, dynamic, and visually coherent videos.
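A structured template can help make sure no cinematic element is left out. The Scene class below is a hypothetical sketch; its field names and the comma-joined output are assumptions about how one might organize a prompt, not a format the model requires.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    camera: str       # e.g. "A wide shot"
    setting: str      # e.g. "a mountain road at sunset"
    lighting: str     # e.g. "golden light flooding the sky"
    action: str       # e.g. "a cyclist racing downhill"
    audio: str = ""   # optional audio or mood cue

    def to_prompt(self) -> str:
        required = {
            "camera": self.camera,
            "setting": self.setting,
            "lighting": self.lighting,
            "action": self.action,
        }
        missing = [name for name, value in required.items() if not value.strip()]
        if missing:
            raise ValueError(f"scene is missing: {', '.join(missing)}")
        parts = [
            f"{self.camera.strip()} of {self.setting.strip()}",
            self.lighting.strip(),
            self.action.strip(),
            self.audio.strip(),
        ]
        # Skip the audio cue when it is empty.
        return ", ".join(p for p in parts if p) + "."


scene = Scene(
    camera="A wide shot",
    setting="a mountain road at sunset",
    lighting="golden light flooding the sky",
    action="a cyclist racing downhill",
    audio="with energetic background music",
)
print(scene.to_prompt())
```

A bare description such as "a person walking on a road" would fail the check for camera and lighting, which is the gap the template is meant to expose.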
© 2024 ArtisanAI. Professional AI Content Creation Platform | Image Generation, Video Production, Audio Synthesis All-in-One Service