Short answer: An "AI receptionist" in 2026 is not one product — it is a job archetype that spans three channels (phone, WhatsApp, web chat), each with different latency budgets, different failure modes, and different vendor stacks. Done well, it greets, identifies intent, routes, schedules, qualifies leads, and hands off to humans on the small percentage of conversations that need a person. Done poorly, it irritates exactly the customers a business is trying to keep. The architecture decisions made in the first month determine whether the system scales gracefully or becomes a permanent maintenance burden.
Why "receptionist" is the right frame
The market keeps describing these systems by their channel — "voice agent," "WhatsApp bot," "web chat." That is how vendors sell, but it is not how the work decomposes. A small clinic, a law firm, a real-estate agency, and a SaaS company all want the same job done: somebody (or something) that answers, understands what the caller wants, books them in, qualifies them, or routes them to the right human — across whatever channel the caller chose.
Framing the job as a "receptionist role" rather than as a channel choice has two practical consequences. First, conversations cross channels. A prospect calls, leaves a voicemail, then continues on WhatsApp; the receptionist needs to know it is the same person and pick up where the previous conversation ended. Second, the success metric is the same across channels — booked appointments, qualified leads, deflected questions — even though the technical surface is wildly different.
The three channels and why their latency budgets differ
The single biggest design constraint is how long a caller will wait for a response before the conversation feels broken. The numbers come from product research the major voice-AI vendors have published over the past two years, and they are remarkably consistent across studies.
| Channel | Target end-to-end latency | What "broken" feels like |
|---|---|---|
| Phone (synchronous voice) | 500–900 ms | Caller starts talking over the agent, or hangs up; perceived as "awkward" above 1.2 s, "broken" above 2 s |
| WhatsApp / SMS (asynchronous) | 2–4 s | Conversation feels lifeless past 8 s; users assume nobody is reading and drop off |
| Web chat (semi-synchronous) | 1–3 s for first token, 2–6 s for full answer | Below 1 s feels artificial; above 6 s users abandon |
The phone budget is brutal because human conversation runs on roughly 200 ms turn-taking gaps. Every component in the voice stack — VAD, ASR, LLM inference, TTS, network — has to fit inside the budget together. Async channels are easier on the engineering side but harder on UX, because users have nothing visual confirming the system is "working" while it thinks. A typing indicator on WhatsApp buys roughly three to five seconds of patience; nothing buys more.
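To make the "fit inside the budget together" point concrete, here is a minimal sketch that sums per-stage latencies against the phone budget. The stage names and millisecond values are illustrative assumptions, not measurements from any vendor; only the 900 ms ceiling comes from the table above.

```python
from dataclasses import dataclass

# Upper end of the phone target from the latency table.
PHONE_BUDGET_MS = 900


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: int


def check_budget(stages: list[Stage], budget_ms: int) -> tuple[int, int]:
    """Return (total latency, headroom). Headroom is negative when over budget."""
    total = sum(stage.latency_ms for stage in stages)
    return total, budget_ms - total


# Hypothetical per-stage numbers, chosen only to illustrate the arithmetic.
stages = [
    Stage("vad_endpointing", 200),
    Stage("asr_final", 150),
    Stage("llm_first_token", 300),
    Stage("tts_first_audio", 150),
    Stage("network", 50),
]

total, headroom = check_budget(stages, PHONE_BUDGET_MS)
print(f"total={total} ms, headroom={headroom} ms")
```

Tracking headroom per stage rather than only the end-to-end number makes it easier to see which component to attack first when the total creeps past the target.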
Channel-by-channel architecture
Phone
The voice stack has consolidated around a recognisable shape: a SIP or PSTN provider (Twilio, Telnyx, Vonage) terminates the call; an orchestration layer handles voice activity detection, barge-in, and turn-taking; an ASR component transcribes (Deepgram, AssemblyAI, OpenAI Whisper variants); the LLM produces a response; a TTS layer renders it (ElevenLabs, Cartesia, OpenAI TTS, PlayHT). Newer platforms collapse several of these into a single product — Vapi, Retell, Bland, Synthflow — selling on the integration rather than on individual components.
Two architectural choices matter most. First, streaming everything: partial ASR results, streaming LLM tokens, streaming TTS chunks. Any non-streaming component blows the latency budget. Second, realtime models vs cascaded models: OpenAI's Realtime API and similar speech-native models collapse ASR + LLM + TTS into a single call, cutting roundtrips at the cost of less control over intermediate steps. Cascaded stacks expose every step for instrumentation and barge-in handling, at the cost of more latency to fight.
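One common way to stream between the LLM and TTS in a cascaded stack is to flush text to the synthesiser at sentence boundaries instead of waiting for the full reply. The sketch below assumes a token iterator from the LLM; `fake_llm_stream` is a hypothetical stand-in for a real streaming client, and the punctuation-based splitting is deliberately naive (it would split on abbreviations such as "e.g.").

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def chunk_for_tts(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into sentence-sized chunks for a streaming TTS."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Flush as soon as the buffer ends a sentence, so TTS can start early.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush any trailing text that never reached sentence punctuation.
    if buffer.strip():
        yield buffer.strip()


def fake_llm_stream() -> Iterator[str]:
    # Hypothetical stand-in for a real streaming LLM response.
    yield from ["Sure", ",", " I", " can", " help", ".", " What", " day", " works", "?"]


for chunk in chunk_for_tts(fake_llm_stream()):
    print(chunk)
```

With this shape, the first sentence can be synthesised and played while the model is still generating the second.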
WhatsApp
WhatsApp uses Meta's Cloud API as the message transport. Because the channel is asynchronous, latency is not the binding constraint. The real constraints are template messages (pre-approved templates are required to initiate conversations outside the 24-hour service window), session windows (the 24-hour rolling window in which freeform replies are allowed), and media messages (voice notes, images), which require their own pipeline. A receptionist that ignores voice notes loses a meaningful percentage of inbound traffic in many markets.
The architecture is straightforward: a webhook ingests inbound messages, an agent runtime decides what to do, and the Cloud API sends outbound messages. The complexity lives in two places: tracking the 24-hour session window for billing and template management, and integrating with a CRM and calendar without leaking state across conversations.
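The session-window rule can be reduced to a small decision made before every outbound send. The sketch below assumes the runtime stores the timestamp of the contact's most recent inbound message; `outbound_mode` and its return values are hypothetical names for illustration, not part of Meta's API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

SESSION_WINDOW = timedelta(hours=24)


def outbound_mode(last_inbound_at: Optional[datetime], now: datetime) -> str:
    """Return "freeform" inside the 24-hour window, otherwise "template"."""
    if last_inbound_at is None:
        # The contact has never messaged us: only a template can open the conversation.
        return "template"
    # The window is rolling: every new inbound message resets the clock.
    return "freeform" if now - last_inbound_at < SESSION_WINDOW else "template"


now = datetime(2026, 1, 10, 12, 0, tzinfo=timezone.utc)
print(outbound_mode(now - timedelta(hours=3), now))   # inside the window
print(outbound_mode(now - timedelta(hours=30), now))  # window has lapsed
```

Keeping this check in one place, rather than scattered across handlers, also gives billing and template management a single point to hook into.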
Web chat
The easiest channel technically and the one teams most often over-engineer. A streaming LLM API, a typed state object, and a thin frontend widget cover 90% of real use cases. The right place to spend engineering effort is grounding — feeding the LLM the right knowledge base content via RAG — and handover protocols, not the chat widget itself.
Cross-channel identity and continuity
The hardest part of a multi-channel receptionist is not building any one channel. It is making "same person, different channel" work. A caller dials in, leaves a voicemail summarising what they need, then continues on WhatsApp; the WhatsApp side needs to retrieve the voicemail transcript, the inferred intent, and the contact record, then continue without making the caller repeat themselves.
The practical solution is a customer identity service that the agent runtime can call on every turn: lookup by phone number, by email, by WhatsApp ID, by webchat session token. Without it, the symptoms are duplicate contact records, lost context, and repeated greetings, which is the channel-by-channel approach showing through. The piece on custom CRM cost covers the data model side of this in more depth.
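A minimal in-memory sketch of such an identity service follows. The class, method names, and identifier kinds (`phone`, `email`, `whatsapp`, `session`) are assumptions for illustration; a real system would back this with the CRM, and the digits-only phone normalisation shown here is naive (it ignores country-code handling).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Contact:
    contact_id: str
    name: str
    history: list[str] = field(default_factory=list)  # e.g. voicemail transcripts


def normalize(kind: str, value: str) -> str:
    """Canonicalise an identifier so the same person matches across channels."""
    value = value.strip()
    if kind in ("phone", "whatsapp"):
        return "".join(ch for ch in value if ch.isdigit())
    if kind == "email":
        return value.lower()
    return value  # session tokens are compared as-is


class IdentityService:
    def __init__(self) -> None:
        self._contacts: dict[str, Contact] = {}
        self._index: dict[tuple[str, str], str] = {}

    def link(self, contact: Contact, kind: str, value: str) -> None:
        """Attach one more identifier to an existing or new contact."""
        self._contacts[contact.contact_id] = contact
        self._index[(kind, normalize(kind, value))] = contact.contact_id

    def lookup(self, kind: str, value: str) -> Optional[Contact]:
        contact_id = self._index.get((kind, normalize(kind, value)))
        return self._contacts.get(contact_id) if contact_id else None


service = IdentityService()
caller = Contact("c1", "Ana", history=["voicemail: wants to reschedule"])
service.link(caller, "phone", "+1 (555) 010-0000")
service.link(caller, "whatsapp", "15550100000")

# The WhatsApp turn resolves to the same record the phone call created.
same_person = service.lookup("whatsapp", "+1 555 010 0000")
print(same_person.history if same_person else "unknown contact")
```

Because every channel resolves to one `Contact`, the WhatsApp side can read the voicemail transcript from `history` instead of asking the caller to repeat it.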