From Games to AI: How Sandy Zhou Builds Product Discipline for Voice AI People Trust
Source: TechTimes

Sandy Zhou

As voice becomes a primary interface for work and learning, "good" is no longer a single metric. Product-ready voice AI demands a blend of aesthetic judgment and quality standards—so voices stay consistent, comfortable, and well-matched to real user intent.

Sandy Zhou is an audio designer working across interactive media and voice-first AI products. At Speechify, she leads launch readiness and expert evaluation for major voice releases, spanning model evaluation, voice design, and voice UX, to keep voices stable, usable, and easy to personalize at scale. She has led launches for key Speechify collections, including Storyteller and Christmas, and owned full launch readiness for additional high-profile releases, including celebrity voice launches. In all, her work spans 25+ voice launches on the platform.

Speechify x Stanford

Launching Christmas Collection


Voice AI is entering a new phase. For many people, voice is no longer an occasional feature—it's becoming a daily interface: listening while commuting, studying on the move, and turning long documents into something usable in real life.

This shift is happening at real scale. Speechify says its products are trusted by more than fifty million users and backed by over 500,000 five-star reviews. Stanford also provides campus-wide access to Speechify through a university site license, with Speechify referencing 30,000+ Stanford students, faculty, and staff. As voice libraries expand (Speechify highlights 1,000+ voices in more than sixty languages), voice quality becomes a product discipline: not just how speech is generated, but how reliably the experience holds up across real use cases.

Recording

In conversation, Zhou explains why voice AI quality is a multi-variable problem and why product-ready voices require standards, expert listening, and UX choices that help users find the right voice for their intent.


Q: Voice AI has progressed quickly. What's the gap between a voice that sounds "good" and a voice people actually trust?

Sandy Zhou: The gap is that "good" depends on how it's used.

A voice can sound strong in a controlled moment and still feel inconsistent in everyday use—across different text structures and contexts, like studying versus storytelling. Trust comes from stability, comfort, predictability over time, and whether the voice matches what the user is trying to do.

A product voice isn't defined by one impressive moment; it's defined by whether people keep choosing it in real life.


Q: When you say "quality," what are you actually evaluating?

Sandy Zhou: Quality isn't one knob. It's a bundle of dimensions that can break independently.

I'm listening for similarity (does it stay recognizable?), noise/artifacts (is it clean?), timbre stability (does tonal character drift?), phrasing and prosody (does it speak naturally?), pacing and pauses (does it handle structure?), and overall consistency across sessions.

But there's another piece that's just as product-critical: use-case fit. A voice can be technically strong and still be wrong for the use case. A high-energy delivery might be great for certain content, but exhausting for long-form study. A voice that's perfect for narrative might not be ideal for dense technical reading.

Quality includes intent fit. A voice can be "good" and still be the wrong voice for what a user is trying to do.


Q: You've said, "a product voice has to win an hour." What do you mean by that?

Sandy Zhou: It's about long-form trust.

A short clip is a surface test. Long-form listening is where fatigue and unpredictability show up—especially on real content: headings, lists, quotes, numbers, and technical terms. If the voice is going to live inside someone's work or learning routine, it must remain comfortable and coherent over time.

A product voice has to win an hour because that's how people actually use it.


Q: Your work spans aesthetics and product discipline. How do you explain "aesthetic judgment" without making it sound purely subjective?

Sandy Zhou: Aesthetic judgment is subjective in taste, but not subjective in outcome.

If voice is the interface, then aesthetics becomes a functional requirement. Tone, pacing, and delivery affect comprehension, comfort, and trust. A voice can be technically clean and still feel "off" in a way users immediately react to. They may not name the technical reason, but they'll stop using it.

I tell teams that users don't only hear audio quality. They feel whether the voice is coherent, comfortable, and believable.

The key is translating that perception into shared standards—clear language teams can align on, and experiences users can rely on.


Q: Where does model evaluation fit into your role?

Sandy Zhou: Model benchmarks are essential—they're how research teams measure progress and move fast.

My role complements that with product-facing evaluation: translating what improves in benchmarks into what users will feel across real content, long sessions, and different intents. That includes stability and comfort over time, clarity in phrasing and pacing, and whether the voice stays within a consistent persona range that feels recognizable as "itself."

It's a collaboration. Benchmarks tell us what moved, and listening tells us whether it moved in a direction users will actually feel.


Q: You've led launches for Storyteller and Christmas and owned full launch readiness for other high-profile releases, including celebrity voice launches. What does "launch readiness" mean to you?

Sandy Zhou: Launch readiness is about trust at the moment of adoption.

It's not only whether the voice sounds good—it's also whether it behaves predictably in real use and is positioned in a way that helps users choose it for the right intent. If a user expects "calm study mode" and gets "high-energy narrator," trust breaks instantly.

Launch readiness also means driving alignment across teams so that the quality bar is shared, decisions stay consistent as the library evolves, and what ships matches what the product promises.


Q: You often say "voice is UX." How does UX connect to personalization?

Sandy Zhou: Personalization is what turns voice into a daily habit—and discoverability is how you get there.

As libraries grow, the user problem isn't "give me any voice." It's "help me find my voice for my day." People choose voices like modes: study, story, calm focus, energetic delivery. UX has to guide users to the right match quickly—without forcing them to audition endlessly.

That's why I focus on voice UX: descriptions, tags, collection structure, and how voices are surfaced. The goal is to reduce choice overload and increase confidence—so users feel like the product understands what they're trying to do.


Q: Where do you see voice going next—especially with voice agents?

Sandy Zhou: Voice is moving from output to interaction.

Reading and productivity are already strong voice use cases. The next step is voice-first agents. Instead of just hands-free playback, there will be two-way conversation: voices that can listen, respond, and help users act on information in context.

This direction raises the bar because the voice isn't only delivering content—it's representing the product in real time.


Voice AI becomes real when it supports real work: when someone finishes a chapter on a walk, reviews a contract between meetings, or keeps learning when they otherwise wouldn't have time.

Zhou says, "The goal isn't to generate speech—it's to ship trust, so the voice is dependable in real life." As voice becomes conversational, trust is no longer a nice-to-have; it becomes a core system requirement. When the voice is the interface, consistency, intent fit, and long-form comfort are foundational rather than "polish."