Why Building with Voice Is a UX Design Challenge, Not Just a Tech One
Shola Jegede


Publish Date: Aug 4

When people think about building with voice, they think about hard problems:

  • Real-time transcription
  • Latency management
  • GPT inference pipelines
  • Audio quality, noise filtering, etc.

All valid. All difficult.

But none of them are what truly broke my first version of Learnflow AI, a voice-first tutor platform powered by Vapi.

The real challenge? UX.

Because when your user isn't looking at a screen — when they're speaking instead of typing — you lose most of the affordances we've come to rely on.

No hover states. No tooltips. No loading spinners.

And as I learned the hard way: No clarity.

This is a breakdown of what went wrong when I first shipped a real-time voice app and how I reworked it to be understandable, usable, and even delightful.

Voice Tech Is Easy (When Vapi Handles It)

I built the first version of Learnflow AI over a weekend.

  • Vapi handled the entire voice loop: speech-in, text-to-GPT, voice-out
  • Convex tracked sessions, user data, and credits
  • Kinde managed auth, billing, and plan-based access control

Thanks to Vapi, I didn’t need to stitch together Whisper, GPT-4, ElevenLabs, and a WebSocket architecture. One agent definition and a vapi.start() call handled it all.

A sample agent session start looks like this:

const assistantOverrides = {
  variableValues: { subject, topic, style },
  clientMessages: ["transcript"],
  serverMessages: [],
};

vapi.start(configureAssistant(voice, style), assistantOverrides)
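For context, configureAssistant just returns a Vapi assistant definition. Here's a simplified sketch of the kind of object it produces; the providers, model, and template variables shown are illustrative choices for my setup, not anything Vapi mandates:

const configureAssistant = (voice: string, style: string) => ({
  name: "Learnflow Tutor",
  firstMessage: "Hi! What would you like to learn today?",
  transcriber: { provider: "deepgram", model: "nova-2", language: "en" },
  voice: { provider: "11labs", voiceId: voice },
  model: {
    provider: "openai",
    model: "gpt-4",
    messages: [
      {
        // {{subject}}, {{topic}}, and {{style}} get filled in by variableValues above
        role: "system",
        content: "You are a {{style}} tutor teaching {{subject}}. Keep answers short and conversational.",
      },
    ],
  },
});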

But that just gave me the plumbing.

It didn’t solve what my users were facing.

What Actually Went Wrong

1. No Clarity on When the Session Was Active

Vapi is fast — the session starts within seconds. But users had no idea.

They’d click "Start Session"...

Then wait.

Then say, "Hello?"

Then say it again.

Why? Because I didn’t give them visual cues.

There was no feedback that their voice was being heard, transcribed, and responded to. For a voice interface, that’s a dealbreaker.
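Looking back, the missing feedback could have been driven by a tiny bit of state tied to the call lifecycle. A sketch of the idea, inside the session component, using Vapi's call-start and call-end events (the status copy is just an example):

const [status, setStatus] = useState<"connecting" | "active" | "ended">("connecting");

vapi.on("call-start", () => setStatus("active")); // safe to speak now
vapi.on("call-end", () => setStatus("ended"));    // session is over

// Then render it somewhere obvious:
// {status === "connecting" && <p>Connecting…</p>}
// {status === "active" && <p>Listening, go ahead.</p>}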

2. Muted Mic Confusion

Vapi offers a setMuted toggle, but I didn't expose that clearly.

One user turned off the mic thinking it was ending the session.

Another forgot it was off and kept talking. Silence.

3. No Transcript = No Confirmation

Even though I was getting real-time transcripts from Vapi, I didn't display them at first.

Result? Users didn’t know what was being heard, understood, or ignored.

They didn’t trust the app.

Before: The Broken Voice UX

How I Fixed It

Voice UI Is Feedback UI

I rebuilt the voice session component from scratch with one goal:

Always show users what’s happening.

Design Fix 1: Real-Time Transcript Feed

As Vapi emits transcript messages, I append them to a rolling transcript UI.

vapi.on('message', (message) => {
  if (message.type === 'transcript' && message.transcriptType === 'final') {
    const newMessage = { role: message.role, content: message.transcript };
    setMessages((prev) => [newMessage, ...prev]);
  }
});

The transcript appears like a conversation thread. This helps users feel heard.
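Rendering it is just a list of those messages, newest first since setMessages prepends. A simplified sketch of the feed markup:

<div className="transcript">
  {messages.map((msg, i) => (
    <p key={i} className={msg.role === "assistant" ? "assistant" : "user"}>
      {msg.role === "assistant" ? "Tutor: " : "You: "}
      {msg.content}
    </p>
  ))}
</div>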

Design Fix 2: Speaking Animation (Lottie)

When the assistant is speaking, I show a wave animation using Lottie.

vapi.on('speech-start', () => setIsSpeaking(true));
vapi.on('speech-end', () => setIsSpeaking(false));

This became the signal for active state.

Users now intuitively know:

  • When it's listening
  • When it's thinking
  • When it's speaking
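On the render side, the isSpeaking flag just drives the animation. A minimal sketch with lottie-react, inside the session component (the soundwave file is a placeholder):

import { useEffect, useRef } from "react";
import Lottie, { LottieRefCurrentProps } from "lottie-react";
import soundwave from "@/assets/soundwave.json"; // placeholder animation

const lottieRef = useRef<LottieRefCurrentProps>(null);

// Play the wave only while the assistant is talking
useEffect(() => {
  if (isSpeaking) lottieRef.current?.play();
  else lottieRef.current?.stop();
}, [isSpeaking]);

// In the JSX:
// <Lottie lottieRef={lottieRef} animationData={soundwave} loop autoplay={false} />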

Design Fix 3: Microphone Toggle That Makes Sense

I added a visible mic toggle button:

<button onClick={toggleMicrophone}>
  {isMuted ? "Mic Off" : "Mic On"}
</button>

Plus a tooltip: "Turn this off if you want silence. Your session continues."
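Under the hood, the toggle is just Vapi's setMuted control mirrored into local state. A sketch, assuming the same vapi instance from earlier:

const [isMuted, setIsMuted] = useState(false);

const toggleMicrophone = () => {
  const next = !isMuted;
  vapi.setMuted(next); // mutes the user's mic; the session keeps running
  setIsMuted(next);    // mirrored in React for the button label and color
};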

After: Fixed UX Flow

Real User Flow Example

Let’s say Joy signs up for Learnflow AI.

  1. She picks the free plan (10 voice sessions)
  2. Lands on the dashboard and clicks “Start Session”
  3. A Lottie animation appears
  4. She says: “Hey, what’s HTML?”
  5. Sees: “You: Hey, what’s HTML?”
  6. Hears: “The Hypertext Markup Language is the standard markup language for documents designed to be…”
  7. Credit drops from 10 → 9 in real-time

Then a nudge appears: “You have 9 sessions left. Upgrade for 100 sessions/month.”

She clicks “Upgrade”, gets routed to Kinde’s billing page, and instantly returns as a Pro user.
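That 10 → 9 drop in step 7 isn't polling. Convex queries are reactive, so the counter re-renders the moment the backend mutation patches the user. A sketch of the sticky counter (the users.getUser query name is a placeholder for whatever your query is called):

import { useQuery } from "convex/react";
import { api } from "@/convex/_generated/api";

// Re-renders automatically whenever `credits` changes in Convex
const user = useQuery(api.users.getUser, { userId });

return <span className="credit-counter">{user?.credits ?? "…"} sessions left</span>;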

Convex + Kinde: Infra That Made My App’s UX Better

Convex: Session + Credit Logic

Every time a session begins, I log it in Convex and deduct a credit:

// schema.ts
import { defineTable } from "convex/server";
import { v } from "convex/values";

// Both tables get collected into defineSchema({ users, sessions }) as the default export
export const users = defineTable({
  credits: v.number(),
  plan: v.string(),
});

export const sessions = defineTable({
  userId: v.id("users"),
  startedAt: v.number(),
});

// mutation.ts
import { mutation } from "./_generated/server";

export const startSession = mutation(async (ctx, args) => {
  const user = await ctx.db.get(args.userId);
  if (!user || user.credits <= 0) throw new Error("Out of credits");

  // Log the session with its start time (the schema requires startedAt)
  await ctx.db.insert("sessions", {
    userId: args.userId,
    startedAt: Date.now(),
  });

  // Deduct one credit in the same mutation so the two writes stay atomic
  await ctx.db.patch(args.userId, {
    credits: user.credits - 1,
  });
});


If credits hit 0:

  • The user is unable to create a new session with Vapi
  • Full-screen upgrade modal appears
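On the client, that gate runs before vapi.start() is ever called. A sketch (showUpgradeModal and the user object come from my component state; the names are illustrative):

import { useMutation } from "convex/react";
import { api } from "@/convex/_generated/api";

const startSession = useMutation(api.mutation.startSession);

const handleStartSession = async () => {
  // The backend mutation is the source of truth, but check up front for instant feedback
  if (!user || user.credits <= 0) {
    setShowUpgradeModal(true); // full-screen upgrade prompt
    return;
  }

  await startSession({ userId: user._id }); // logs the session + deducts a credit
  vapi.start(configureAssistant(voice, style), assistantOverrides);
};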

Kinde: Role Gating + Plan Sync

I use Kinde’s hosted pricing page.

After signup, each user's plan (starter, pro, or plus) is resolved from their Kinde entitlements:

const user = await getUser();

// `entitlements` comes from Kinde, fetched for the signed-in user
const plans = entitlements?.data?.plans ?? [];
console.log("Plans:", plans);

// Fall back to the free "starter" tier if no paid plan is attached yet
let plan: "starter" | "pro" | "plus" = "starter";

if (plans.some((p: any) => p.key === "pro")) {
  plan = "pro";
} else if (plans.some((p: any) => p.key === "starter")) {
  plan = "starter";
} else if (plans.some((p: any) => p.key === "plus")) {
  plan = "plus";
}

console.log("Plan:", plan);

Then I sync that in Convex for backend enforcement.
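The sync itself is a small mutation that stamps the resolved plan onto the user record, in the same style as the earlier mutation (the mutation name and credit amounts are placeholders):

// convex/users.ts
export const syncPlan = mutation(async (ctx, args) => {
  // Top up credits when the plan changes; the amounts here are illustrative
  const credits = args.plan === "pro" ? 100 : args.plan === "plus" ? 250 : 10;
  await ctx.db.patch(args.userId, { plan: args.plan, credits });
});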

Edge Cases I Had to Handle

  • User hits 0 credits mid-session: block the next attempt with a modal
  • User mutes the mic and thinks the session is paused: clearer copy + a mic color state
  • User switches tabs mid-session: a session timer auto-ends the call after 60s idle (see the sketch after this list)
  • User upgrades mid-session: a full reload refreshes the plan + credit count
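The tab-switch case is just a visibility listener plus a timer. A sketch, assuming the vapi instance and React's useEffect (60_000 ms is the idle window):

useEffect(() => {
  let idleTimer: ReturnType<typeof setTimeout>;

  const onVisibilityChange = () => {
    if (document.hidden) {
      idleTimer = setTimeout(() => vapi.stop(), 60_000); // end the Vapi call after 60s hidden
    } else {
      clearTimeout(idleTimer); // user came back in time
    }
  };

  document.addEventListener("visibilitychange", onVisibilityChange);
  return () => {
    clearTimeout(idleTimer);
    document.removeEventListener("visibilitychange", onVisibilityChange);
  };
}, []);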

Final UX Checklist Before Launch

  • Real-time transcript feed ✅
  • Visual signal for when assistant is speaking ✅
  • Sticky credit counter ✅
  • Mic toggle with explanation ✅
  • Upgrade nudge after session ✅
  • Kinde role sync across backend ✅

What I Learned

  • Building voice is not just about latency and speech quality
  • Voice-first UX is not like chatbot UX
  • UX clarity is everything when there are no visual anchors
  • Trust comes from visibility: show the transcript, show the state
  • Feedback loops build confidence

And most of all:

If your user isn't sure whether they're being heard, they won't speak again.

Takeaways If You’re Building Voice AI Apps

  1. Don’t launch voice without a feedback loop
  2. Show users their words (transcript)
  3. Show agent activity (Lottie or animation)
  4. Use a backend like Convex to gate usage in real time
  5. Use Kinde's roles to simplify access control
  6. Let something like Vapi handle the hard infra
  7. Don't assume your user will "figure it out"; measure their hesitation and guide them
  8. Build with latency in mind, but design for confidence

Your Turn

Have you tried building voice-first UX?

Did you run into any of these challenges?

Drop a comment, let’s compare notes below.

Written by Shola Jegede, building Learnflow AI

Built with late nights, live feedback, and lots of learning.

See you in the comments.
