
Hey Folks,
I’m Ray, a former software engineer, ex-TikToker/ByteDancer, and EZsite founder. I spent over 10 years in the used-car online space. One of the most popular features we built back then was auto-generated intro videos from car photos, highlighting trims, mileage, and selling points.
When Veo 3.1 and Sora dropped, a realtor friend asked if I could do the same for her Zillow listings: “Can you turn my property photos into YouTube/TikTok videos for lead gen?” Challenge accepted.
I hacked together a prototype that automatically turns property photos into clean showcase videos, and I ran into some fun issues along the way.
Here’s the breakdown.
The Idea
Automatically convert property photos into professional-looking videos.
Pull listing data and images from Zillow.
Use AI to generate a storyboard and voiceover.
Create clip-by-clip videos and stitch them together with music, subtitles, and transitions.
Demo
Sample generated video: https://chatrealtor.ai/share/XJUKdL
Overall Approach
The flow looked like this:
Scrape Zillow data
Clean image lists
User selects images/adds custom prompts
Gemini detects and removes watermarks
Gemini generates storyboard and voiceover scripts
Concurrent Veo 3.1 video generation tasks
Poll task status periodically
FFmpeg stitching, subtitles, background music
Final output
The main logic runs in JS and was mostly vibe-coded. Gemini models handle analysis, scripts, and watermark handling. Veo 3.1 generates the image-to-video clips.
Tools I Used (and Recommend)
Gemini 2.5 Image: Watermark detection and removal; image understanding for room types/features.
Veo 3.1: Image-to-video clip generation for consistent visual fidelity.
FFmpeg: Post-processing for stitching, transitions, subtitles, and audio mixing.
Crawlbase: Handling Zillow’s anti-scraping reliably during prototyping.
EZsite.ai (an AI web-coding tool similar to Lovable): Handy for spinning up quick landing pages and demo sites without wrestling with boilerplate. I used it to throw together a simple showcase page and submission form for agents, which was great for testing funnels and collecting feedback fast.
Challenges I Ran Into:
Zillow’s anti-scraping is no joke. My DIY crawler kept getting its IPs rate-limited or blocked.
I used Crawlbase for testing; they tossed me 1,000 free credits, which was enough to get things moving.
Discovery: Zillow uses Next.js with server-side rendering, so most of the listing data is already in the HTML.
Parse HTML → extract price, beds/baths, sqft, features, image URLs.
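A minimal sketch of that parsing step, not the exact code from the prototype. It assumes the page embeds its data in a `<script id="__NEXT_DATA__">` tag, which is where Next.js (pages router) puts server-rendered props; whether a given Zillow page does so is an assumption. The field names in `toListing` (`property`, `livingArea`, `photos`, and so on) are hypothetical placeholders, and `sampleHtml` is made-up data for illustration.

```javascript
// Pull the JSON payload that Next.js embeds in server-rendered HTML.
function extractNextData(html) {
  const match = html.match(
    /<script[^>]*id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/
  );
  if (!match) return null; // no payload (or the request was blocked)
  try {
    return JSON.parse(match[1]);
  } catch {
    return null; // payload was truncated or not valid JSON
  }
}

// Hypothetical mapping: the real payload is nested differently and can
// change at any time, so treat these paths as placeholders.
function toListing(data) {
  const p = data?.props?.pageProps?.property;
  if (!p) return null;
  return {
    price: p.price ?? null,
    beds: p.bedrooms ?? null,
    baths: p.bathrooms ?? null,
    sqft: p.livingArea ?? null,
    features: p.features ?? [],
    imageUrls: (p.photos ?? []).map((ph) => ph.url).filter(Boolean),
  };
}

// Made-up sample page, only to show the shape of the round trip.
const samplePayload = {
  props: {
    pageProps: {
      property: {
        price: 450000,
        bedrooms: 3,
        bathrooms: 2,
        livingArea: 1800,
        features: ["Garage"],
        photos: [{ url: "https://example.com/1.jpg" }, {}],
      },
    },
  },
};
const sampleHtml =
  '<html><body><script id="__NEXT_DATA__" type="application/json">' +
  JSON.stringify(samplePayload) +
  "</script></body></html>";

const listing = toListing(extractNextData(sampleHtml));
console.log(listing);
```

Returning `null` instead of throwing keeps a blocked or changed page from crashing the rest of the pipeline; the caller can just retry or skip that listing.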
Video Generation Model Choice
Text-only approaches couldn’t maintain visual consistency.
Switched to Veo 3.1’s image-to-video: one clip per image with pan/zoom/parallax; control transitions later in FFmpeg.
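To make the "one clip per image" idea concrete, here is a rough sketch of planning the clip jobs before any model call. It does not call Veo and the job shape is not Veo's request format; `planClips`, the `room` field, the 4-second default, and the prompt wording are all assumptions for illustration.

```javascript
// Hypothetical camera moves; the prompt wording is illustrative only.
const MOTIONS = [
  "slow pan from left to right",
  "gentle zoom in toward the focal point",
  "subtle parallax drift",
];

// One job per image. Each image anchors its own clip, and the prompt asks
// for camera motion only, leaving transitions to the FFmpeg stage.
function planClips(images, secondsPerClip = 4) {
  return images.map((img, i) => ({
    clipIndex: i,
    imageUrl: img.url,
    durationSec: secondsPerClip,
    prompt:
      `${MOTIONS[i % MOTIONS.length]}, ${img.room}, ` +
      "keep the scene exactly as in the source image, no cuts or transitions",
  }));
}

const clipJobs = planClips([
  { url: "https://example.com/kitchen.jpg", room: "kitchen" },
  { url: "https://example.com/patio.jpg", room: "patio" },
]);
console.log(clipJobs);
```

Keeping each job independent like this is what lets the clips be generated concurrently and retried one at a time.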
Built a simple JS DAG scheduler:
watermark check → removal → storyboard → parallel clip generation → completion watcher → stitching.
Lightweight DB + scheduled job to progress ready tasks.
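Roughly, a scheduler like that can be sketched as below. The stage order comes from the flow above; everything else is an assumed design, not the prototype's code. A `Map` stands in for the lightweight DB, `tick` stands in for the scheduled job, and the task bodies are fakes that pretend each clip needs two polls to finish.

```javascript
// Stand-in for the lightweight DB: task id -> task row.
function createStore() {
  return new Map();
}

function addTask(store, id, deps, run) {
  store.set(id, { id, deps, run, status: "pending", result: null });
}

// One pass of the scheduled job. A task is ready once all of its
// dependencies are done. `run(inputs)` returns { done, result };
// done=false models a remote job that should be polled again next tick.
function tick(store) {
  for (const task of store.values()) {
    if (task.status === "done" || task.status === "failed") continue;
    const deps = task.deps.map((d) => store.get(d));
    if (deps.some((d) => d.status === "failed")) {
      task.status = "failed"; // propagate failure downstream
      continue;
    }
    if (!deps.every((d) => d.status === "done")) continue;
    try {
      const out = task.run(deps.map((d) => d.result));
      if (out.done) {
        task.status = "done";
        task.result = out.result;
      } else {
        task.status = "running";
      }
    } catch (err) {
      task.status = "failed";
      task.result = String(err);
    }
  }
}

// Demo with fake task bodies (no real model calls).
const store = createStore();
addTask(store, "watermark", [], () => ({ done: true, result: "clean" }));
addTask(store, "storyboard", ["watermark"], () => ({
  done: true,
  result: ["kitchen", "patio"],
}));
for (const i of [0, 1]) {
  let polls = 0; // each clip reports "still running" on its first poll
  addTask(store, `clip-${i}`, ["storyboard"], ([shots]) =>
    ++polls < 2 ? { done: false } : { done: true, result: `${shots[i]}.mp4` }
  );
}
addTask(store, "stitch", ["clip-0", "clip-1"], (clips) => ({
  done: true,
  result: clips.join("+"),
}));

tick(store); // watermark and storyboard finish; both clips start running
tick(store); // both clips finish, so stitching runs
console.log(store.get("stitch"));
```

In a real setup `tick` would be driven by a timer or cron job and the rows would live in a database, so a crashed process can pick up where it left off.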
Preflight watermark detection; auto-removal if present.
Gemini 2.5 Image worked well, with occasional false positives.
Video Post-Processing
FFmpeg: crossfades, subtitles, background music with ducking.
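For reference, a sketch of building an FFmpeg `-filter_complex` graph for that step in JS. It uses the `xfade`, `subtitles`, `asplit`, `sidechaincompress`, and `amix` filters. The input order (clips first, then music, then voiceover), the threshold/ratio values, and the `subs.srt` filename are assumptions, and it assumes all clips share resolution and frame rate and that FFmpeg was built with subtitle (libass) support.

```javascript
// Each crossfade overlaps the next clip by `fade` seconds, so the k-th
// transition starts at (sum of the first k durations) - k * fade.
function xfadeOffsets(durations, fade) {
  const offsets = [];
  let elapsed = 0;
  for (let k = 1; k < durations.length; k++) {
    elapsed += durations[k - 1];
    offsets.push(elapsed - k * fade);
  }
  return offsets;
}

// Assumed input order: clips 0..n-1, then music (n), then voiceover (n+1).
function buildFilterGraph(durations, fade, srtPath) {
  const n = durations.length;
  const parts = [];
  let last = "0:v";
  xfadeOffsets(durations, fade).forEach((offset, i) => {
    const out = `v${i + 1}`;
    parts.push(
      `[${last}][${i + 1}:v]xfade=transition=fade:duration=${fade}:offset=${offset}[${out}]`
    );
    last = out;
  });
  // Burn subtitles into the stitched video (simple filename, no escaping).
  parts.push(`[${last}]subtitles=${srtPath}[vout]`);
  // Ducking: the voiceover is split so one copy drives the compressor on
  // the music and the other copy is mixed back in on top.
  parts.push(`[${n + 1}:a]asplit=2[vo][vokey]`);
  parts.push(`[${n}:a][vokey]sidechaincompress=threshold=0.05:ratio=8[ducked]`);
  parts.push("[ducked][vo]amix=inputs=2[aout]");
  return parts.join(";");
}

const graph = buildFilterGraph([4, 4, 4], 0.5, "subs.srt");
console.log(graph);
```

The resulting string would be passed to `ffmpeg` via `-filter_complex`, with `-map "[vout]" -map "[aout]"` selecting the outputs. Paths with special characters would need escaping before going into the `subtitles` filter.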
Voiceover: Gemini narrates room features in a calm, professional tone.
The MVP came together in about 2 hours, but prompt tuning and pacing took most of the time; all in, I spent about 2 days.
Cost: a few dollars per full video from model/API usage.
Next: smarter storyboards, richer property details in VO, optional virtual presenter.
How It Feels
The end-to-end pipeline is solid: scrape → analyze → generate → polish → publish. The “glue” matters more than any single model: reliable orchestration, visual anchoring, and thoughtful post-production.