Guide
AI Voice Agent: A Complete Glossary for UAE Businesses (STT, TTS, Latency, Barge-In)

Quick answer
An AI voice agent glossary for UAE businesses: STT converts spoken words to text, TTS converts text back to speech, latency is the delay before the agent responds, barge-in lets callers interrupt, and DNCR/TDRA are UAE regulatory frameworks governing outbound calls. Understanding these terms helps you evaluate, buy, and audit any AI voice deployment.
Why do UAE businesses need an AI voice agent glossary?
AI voice vendors talk fast — and they lean on acronyms. When a sales deck mentions "sub-300ms TTS latency" or "barge-in with intent preservation," most buyers nod and move on. That gap matters in the UAE specifically, because local deployments carry regulatory obligations (TDRA approval, DNCR screening, caller-ID registration) that have real penalty exposure under Cabinet Resolution 57 of 2024. If you can not read the spec sheet, you can not verify whether your vendor has handled compliance correctly.
This glossary covers the technical, conversational-AI, and UAE-regulatory terms you will encounter when buying, deploying, or auditing an AI voice agent. Each definition is written in plain language with enough context to be useful in a real conversation with a vendor — or a regulator. For a deeper dive into how these terms fit a real deployment, see our guide on how to choose an AI voice agent in the UAE.
What do the core AI terms (STT, TTS, LLM) mean?
STT — Speech-to-Text
Speech-to-Text (STT) is the engine that converts a caller's spoken words into a text transcript in real time. The accuracy of STT determines how well the agent understands what the caller said — especially with accented English, Khaleeji Arabic, Hindi, or Malayalam. Poor STT accuracy is one of the most common root causes of failed AI voice deployments. When evaluating a vendor, ask which STT model they use and whether it has been tested on the specific accents and languages your callers speak.
TTS — Text-to-Speech
Text-to-Speech (TTS) is the engine that converts the agent's generated text response back into audible speech for the caller. TTS quality determines whether the agent sounds natural and trustworthy, or robotic and off-putting. Modern neural TTS engines (used by platforms such as Vapi) produce voices that are difficult to distinguish from a human in casual conversation. Key variables are voice naturalness, speaking pace, and how well the engine handles punctuation and sentence rhythm across languages.
LLM — Large Language Model
Large Language Model (LLM) is the AI brain that reads the STT transcript and decides what to say next. The LLM processes the caller's intent, applies the instructions in the system prompt, and generates a response — which the TTS engine then speaks aloud. The LLM is also where conversation logic lives: qualifying questions, objection handling, escalation triggers, and booking flows are all expressed through how the system prompt instructs the LLM to behave.
System Prompt
The system prompt is the set of instructions given to the LLM before a conversation begins. It defines the agent's persona, its goals, the questions it should ask, how it should handle specific scenarios, and when it should escalate to a human. A well-written system prompt is the single biggest determinant of how useful an AI voice agent is in practice. It is authored by the deploying business (or their vendor) and is invisible to the caller.
What do latency, barge-in, and turn-taking mean for call quality?
Latency
Latency in an AI voice agent is the delay between when the caller finishes speaking and when the agent begins its response. It is the sum of STT processing time, LLM inference time, and TTS rendering time. High latency (above roughly 1.5–2 seconds) makes conversations feel broken — callers assume the line has dropped and either repeat themselves or hang up. Sub-1-second response latency is the standard target for deployments where caller experience is a priority.
Barge-In
Barge-in is the industry term for a caller's ability to interrupt the agent while it is speaking — the same way you would cut off a human mid-sentence. Without barge-in support, the agent speaks its full response even if the caller tries to correct it or give a shorter answer, which feels unnatural and frustrating. Well-implemented barge-in detects that the caller has started speaking, halts TTS playback immediately, and re-routes the audio to STT for processing.
Turn-Taking
Turn-taking refers to how the agent determines when the caller has finished speaking and it is the agent's turn to respond — and vice versa. It is more complex than it sounds: short pauses, filler words ("um", "uh"), and mid-sentence breathing all need to be handled correctly. Premature turn-taking causes the agent to interrupt; delayed turn-taking causes uncomfortable silences. The quality of turn-taking is one of the markers that separates a polished deployment from a frustrating one.
Intent
Intent is the underlying goal the caller is trying to achieve, as interpreted by the LLM. A caller might say "I need to see a doctor this week" — the intent is appointment booking, even though the words never said "appointment." Intent recognition determines whether the agent routes the conversation appropriately (booking flow, FAQ answer, escalation) or misunderstands and goes off-track.
What is code-switching, and why does it matter in UAE calls?
The UAE is home to residents of more than 200 nationalities, and in a city like Dubai, conversations frequently blend languages mid-sentence — English and Arabic, Arabic and Hindi, or English and Malayalam. This is code-switching.
Code-Switching
Code-switching is the practice of alternating between two or more languages within a single conversation — sometimes within a single sentence. A caller might ask a question in English and then confirm a detail in Arabic. A capable AI voice agent detects this shift in real time and responds in the same language the caller used, without requiring the caller to select a language option upfront. This is a meaningful capability in a multilingual market like the UAE.
Khaleeji-Neutral MSA Arabic
Modern Standard Arabic (MSA) is the formal, written form of Arabic used across the Arab world. Khaleeji Arabic is the Gulf dialect spoken across the UAE, Saudi Arabia, Kuwait, and Bahrain. A Khaleeji-neutral MSA approach uses vocabulary and pronunciation that is broadly understood across the Gulf region without targeting any single local dialect specifically. This is the practical middle ground for AI voice agents serving the UAE market.
What do SIP, WebRTC, and telephony gateway mean?
SIP — Session Initiation Protocol
SIP (Session Initiation Protocol) is the signalling protocol used to set up, manage, and terminate voice calls over an IP network. When an AI voice agent receives or places a phone call using a traditional phone number (rather than a browser-based call), SIP is almost always the protocol handling that connection. SIP trunks are the "lanes" through which calls travel between a business's phone system and the public telephone network.
Telephony Gateway
A telephony gateway is hardware or software that bridges a traditional phone line (PSTN — Public Switched Telephone Network) and an IP-based voice system. In UAE office deployments, hardware gateways convert the physical SIM card or landline connection into a SIP stream that the AI voice platform can process. For businesses that need to retain an existing landline or SIM number, a gateway is the practical way to route calls through an AI agent without changing the number callers dial.
WebRTC
WebRTC (Web Real-Time Communication) is an open standard that enables real-time audio and video calls directly through a web browser, with no additional software required. Some AI voice platforms use WebRTC for browser-based demos, testing, or web widget deployments. For phone-based deployments (the most common UAE use case), SIP rather than WebRTC typically handles the call path.
IVR — Interactive Voice Response
IVR (Interactive Voice Response) is the legacy technology most people associate with "press 1 for sales, press 2 for support" menus. IVR is rule-based and menu-driven; an AI voice agent is conversational and understands natural language. The two terms are often confused because they both handle incoming calls automatically, but the caller experience is fundamentally different. An AI voice agent replaces the IVR menu with a conversation.
What is a warm transfer, and how does escalation work?
Warm Transfer / Escalation
A warm transfer (also called a supervised transfer or escalation) is when the AI agent hands a live call to a human agent while the caller stays on the line — as opposed to a cold transfer, which drops the caller into a queue with no context. In a well-designed AI deployment, the warm transfer includes a brief spoken summary or a CRM note delivered to the human agent before they pick up, so the caller does not have to repeat themselves. Escalation is typically triggered by caller request ("I want to speak to someone"), by intent detection (high-value lead, complaint), or by a question the AI cannot answer.
Inbound vs Outbound
Inbound calls are calls initiated by the customer — they dial your number. Outbound calls are calls initiated by the AI agent — it dials the customer. This distinction is the most important legal dividing line in UAE voice AI. Under Cabinet Resolution 56 of 2024, inbound calls are largely exempt from outbound telemarketing rules (DNCR screening, calling-window restrictions, prior TDRA approval). Outbound calls to consumer numbers are subject to those rules in full. See our post on AI voice agent costs in the UAE for context on how inbound and outbound deployments are typically priced and structured.
What do TDRA, DNCR, and Cabinet Resolution 56/57 mean?
TDRA — Telecommunications and Digital Government Regulatory Authority
TDRA is the UAE federal body that regulates telecommunications and digital services. For AI voice agent deployments, TDRA matters primarily in the outbound context: businesses must obtain prior TDRA approval before running outbound telemarketing campaigns. TDRA also sets the rules on caller-ID registration, calling windows, and call recording obligations. Compliance with TDRA requirements is not optional — failure to obtain approval carries substantial penalties under Resolution 57.
DNCR — Do Not Call Registry
The DNCR (Do Not Call Registry) is the UAE national list of phone numbers whose owners have opted out of receiving telemarketing calls. Before placing any outbound telemarketing call, a business must screen the target number against the DNCR. Calling a number that is registered on the DNCR carries penalties of AED 50,000 (first offence), AED 75,000 (second), and AED 150,000 (third) under Cabinet Resolution 57. These figures are attributable to the official resolution — verify current amounts against the official text and seek legal advice before relying on them operationally.
Cabinet Resolution 56 of 2024
Cabinet Resolution 56 of 2024 is the UAE federal regulation that governs outbound telemarketing, including AI-powered calls. It sets the rules on DNCR screening, the 09:00–18:00 calling window, caller-ID registration, prior TDRA approval, and call recording with notification. It came into effect on 27 August 2024. The official text is published on the UAE legislation portal.
Cabinet Resolution 57 of 2024
Cabinet Resolution 57 of 2024 is the companion regulation that sets the penalty schedule for violations of Resolution 56. It defines fine amounts for operating without TDRA approval, using an unregistered caller ID, calling DNCR-registered numbers, calling outside the permitted window, and other breaches. The official text is published on the UAE legislation portal.
Legal caveat
PDPL — Personal Data Protection Law
The PDPL (UAE Personal Data Protection Law, Federal Decree-Law No. 45 of 2021) governs the collection, processing, and retention of personal data in the UAE. For AI voice deployments, the PDPL is relevant to call recording storage, CRM data handling, and any cross-border data transfer to cloud platforms. Businesses deploying AI voice agents should review their data retention and consent practices against PDPL requirements.
Quick-reference: all terms at a glance
| Term | What it means in plain language |
|---|---|
| STT (Speech-to-Text) | Converts caller speech to text for the AI to read |
| TTS (Text-to-Speech) | Converts the AI's text response into audible speech |
| LLM (Large Language Model) | The AI brain that decides what the agent says next |
| System prompt | Hidden instructions that define the agent's behaviour and goals |
| Latency | Delay between caller finishing a sentence and agent responding |
| Barge-in | Caller's ability to interrupt the agent mid-sentence |
| Turn-taking | How the agent detects when the caller has finished speaking |
| Intent | The underlying goal the caller is trying to achieve |
| Code-switching | Switching languages mid-conversation; agent follows automatically |
| SIP | Signalling protocol for phone calls over IP networks |
| Telephony gateway | Hardware or software bridging a phone line to an IP voice system |
| WebRTC | Browser-based real-time audio/video communication standard |
| IVR | Legacy press-1/press-2 menu system; replaced by AI voice agents |
| Warm transfer | Handing a live call to a human agent with context intact |
| Inbound | Call initiated by the customer — largely exempt from outbound rules |
| Outbound | Call initiated by the AI — subject to TDRA/DNCR obligations |
| TDRA | UAE telecom regulator; must approve outbound campaigns |
| DNCR | Do Not Call Registry; must be screened before every outbound dial |
| Resolution 56 of 2024 | UAE law governing outbound telemarketing rules |
| Resolution 57 of 2024 | UAE law setting penalties for Resolution 56 breaches |
| PDPL | UAE Personal Data Protection Law governing caller data handling |
If you are comparing vendors and want to know how these terms map to real deployment decisions — cost, compliance, and capability trade-offs — the guide on how to choose an AI voice agent in the UAE walks through each factor in detail.
Sources
Frequently asked questions
What is the difference between STT and TTS in an AI voice agent?
What does barge-in mean in AI voice agents?
What is the DNCR and do AI voice agents have to screen against it?
What is latency, and how much does it matter for caller experience?
Is an AI voice agent the same as an IVR?
What does code-switching mean for UAE AI voice deployments?
Anam Jalal
Founder & CEO, MAJ Leads
Anam Jalal is the founder of MAJ Leads, a Dubai-based AI voice agent company deploying TDRA-compliant AI receptionists and callers for UAE clinics, brokerages and SMEs — working hands-on across UAE telephony and CRM integrations, from SIP provisioning to TDRA compliance configuration.
Read more about Anam →Related articles
Guide
How to Choose an AI Voice Agent in the UAE: 12 Questions to Ask Any Vendor
Not all AI voice agents are built for the UAE. Twelve vendor questions — covering TDRA/DNCR compliance, Arabic quality, latency, CRM integration, PDPL data residency, and pricing — that separate a serious provider from a repackaged offshore product.
Guide
How Much Does an AI Voice Agent Cost in the UAE? Vapi/Retell Per-Minute Math vs Done-For-You
Vapi charges $0.05/min for its platform — but that’s just one layer. Here is the full per-minute math, plus an honest comparison of DIY build costs versus a done-for-you AI voice agent in the UAE.
