Turn detection
Build responsive voice applications by detecting when users finish speaking.
Benefits
- Create natural conversational experiences with proper turn-taking
- Reduce response latency in voice assistants and chatbots
- Improve user experience with timely system responses
- Enable more human-like interactions in voice applications
Use cases
- Voice AI - Detect when to generate responses in conversational agents
- Real-time translation - Deliver translations as soon as speakers complete thoughts
- Dictation - Determine when users have finished speaking to finalize transcription
How it works
A turn, or utterance, is a continuous piece of speech from a single speaker, typically separated by pauses. In conversation systems, detecting the end of an utterance helps determine when it's appropriate for another speaker (or AI system) to respond.
Speechmatics offers two complementary approaches to detect when a speaker has finished their turn:
- Silence-based detection - Identifies pauses between speech
- Semantic detection - Analyzes linguistic context to identify natural endpoints
Silence-based detection
Detect natural pauses in speech by configuring the silence threshold in your transcription request.
Configuration
Add the end_of_utterance_silence_trigger parameter to your StartRecognition message:
{
  "type": "transcription",
  "transcription_config": {
    "conversation_config": {
      "end_of_utterance_silence_trigger": 0.5
    },
    "language": "en"
  }
}
The end_of_utterance_silence_trigger parameter specifies the silence duration (0-2 seconds) that triggers end of utterance detection. Setting end_of_utterance_silence_trigger to 0 disables detection.
Recommended settings
- Voice AI applications: 0.5-0.8 seconds
- Dictation applications: 0.8-1.2 seconds
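For example, with the speechmatics-python SDK used in the code examples below, these thresholds map directly onto ConversationConfig. This is a minimal sketch; the exact values and max_delay are illustrative, not prescriptive:
import speechmatics

# Illustrative thresholds only - tune for your application
VOICE_AI_TRIGGER = 0.6   # within the suggested 0.5-0.8s range for voice AI
DICTATION_TRIGGER = 1.0  # within the suggested 0.8-1.2s range for dictation

conversation_config = speechmatics.models.ConversationConfig(
    end_of_utterance_silence_trigger=VOICE_AI_TRIGGER  # 0 disables detection
)

conf = speechmatics.models.TranscriptionConfig(
    language="en",
    enable_partials=True,
    max_delay=1,  # keep end_of_utterance_silence_trigger below max_delay
    conversation_config=conversation_config,
)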
Response format
When an end of utterance is detected, you'll receive:
- A Final transcript message
- An EndOfUtterance message
{
  "message": "EndOfUtterance",
  "format": "2.9",
  "metadata": {
    "start_time": 1.07,
    "end_time": 1.07
  }
}
Keep in mind:
- Keep end_of_utterance_silence_trigger lower than the max_delay value
- Messages are only sent after speech is recognized
- Duplicate messages are never sent for the same silence period
- Messages don't contain speaker information from diarization
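In a voice AI pipeline, a common pattern is to buffer Final transcripts and treat the buffered text as the complete turn once the EndOfUtterance message arrives. Below is a minimal sketch using the same event handlers as the full examples further down; respond_to_user is a hypothetical placeholder for your own response logic:
utterance_parts = []  # Final transcript segments for the current turn

def handle_final_transcript(msg):
    # AddTranscript carries the finalized text for a segment of speech
    utterance_parts.append(msg["metadata"]["transcript"])

def handle_end_of_utterance(msg):
    # EndOfUtterance marks the end of the turn - hand the full text to your app
    full_utterance = "".join(utterance_parts).strip()
    utterance_parts.clear()
    if full_utterance:
        respond_to_user(full_utterance)  # hypothetical response hook

def respond_to_user(text):
    print(f"Complete turn: {text}")

# Register on an existing WebsocketClient (ws), as in the examples below:
# ws.add_event_handler(
#     event_name=speechmatics.models.ServerMessageType.AddTranscript,
#     event_handler=handle_final_transcript,
# )
# ws.add_event_handler(
#     event_name=speechmatics.models.ServerMessageType.EndOfUtterance,
#     event_handler=handle_end_of_utterance,
# )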
Semantic end of turn
For more natural conversations, combine silence detection with linguistic context analysis. This approach understands when a speaker has completed their thought based on the content of their speech.
Semantic end of turn detection is available through our Flow service, which combines multiple signals for optimal turn detection:
- Silence duration
- Linguistic completeness
- Question detection
- Prosodic features
Try semantic end of turn detection with our free Flow service demo or read our implementation guide.
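Flow handles this combination for you, but to illustrate the idea conceptually (a toy heuristic, not how Flow works internally), a client could adapt its silence threshold based on whether the latest transcript reads like a finished thought:
import re

def looks_complete(transcript: str) -> bool:
    # Toy heuristic: sentence-final punctuation suggests a finished thought,
    # while a trailing filler or conjunction suggests the speaker will continue.
    # Real semantic detection (as in Flow) uses richer signals, including prosody.
    text = transcript.strip()
    if not text:
        return False
    if text.endswith((".", "?", "!")):
        return True
    return not re.search(r"\b(and|but|so|because|um|uh)$", text, re.IGNORECASE)

def should_respond(transcript: str, silence_seconds: float) -> bool:
    # Respond sooner when the text looks complete; wait longer when it does not
    threshold = 0.5 if looks_complete(transcript) else 1.2
    return silence_seconds >= threshold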
Code examples
Real-time streaming from microphone - ideal for voice AI applications.
import asyncio

import pyaudio
import speechmatics

API_KEY = "YOUR_API_KEY"
LANGUAGE = "en"
CONNECTION_URL = "wss://eu2.rt.speechmatics.com/v2"

# Audio recording parameters
SAMPLE_RATE = 16000
CHUNK_SIZE = 1024
FORMAT = pyaudio.paFloat32


class AudioProcessor:
    def __init__(self):
        self.wave_data = bytearray()
        self.read_offset = 0

    async def read(self, chunk_size):
        while self.read_offset + chunk_size > len(self.wave_data):
            await asyncio.sleep(0.001)
        new_offset = self.read_offset + chunk_size
        data = self.wave_data[self.read_offset : new_offset]
        self.read_offset = new_offset
        return data

    def write_audio(self, data):
        self.wave_data.extend(data)


class VoiceAITranscriber:
    def __init__(self):
        self.ws = speechmatics.client.WebsocketClient(
            speechmatics.models.ConnectionSettings(
                url=CONNECTION_URL,
                auth_token=API_KEY,
            )
        )
        self.audio = pyaudio.PyAudio()
        self.stream = None
        self.is_recording = False
        self.audio_processor = AudioProcessor()

        # Set up event handlers
        self.ws.add_event_handler(
            event_name=speechmatics.models.ServerMessageType.AddPartialTranscript,
            event_handler=self.handle_partial_transcript,
        )
        self.ws.add_event_handler(
            event_name=speechmatics.models.ServerMessageType.AddTranscript,
            event_handler=self.handle_final_transcript,
        )
        self.ws.add_event_handler(
            event_name=speechmatics.models.ServerMessageType.EndOfUtterance,
            event_handler=self.handle_end_of_utterance,
        )

    def handle_partial_transcript(self, msg):
        transcript = msg["metadata"]["transcript"]
        print(f"[Listening...] {transcript}")

    def handle_final_transcript(self, msg):
        transcript = msg["metadata"]["transcript"]
        print(f"[Complete] {transcript}")

    def handle_end_of_utterance(self, msg):
        print("🔚 End of utterance detected - ready for AI response!")
        # This is where your voice AI would process the complete utterance
        # and generate a response

    def stream_callback(self, in_data, frame_count, time_info, status):
        self.audio_processor.write_audio(in_data)
        return in_data, pyaudio.paContinue

    def start_streaming(self):
        try:
            # Set up pyaudio stream with callback
            self.stream = self.audio.open(
                format=FORMAT,
                channels=1,
                rate=SAMPLE_RATE,
                input=True,
                frames_per_buffer=CHUNK_SIZE,
                stream_callback=self.stream_callback,
            )

            # Configure audio settings
            settings = speechmatics.models.AudioSettings()
            settings.encoding = "pcm_f32le"
            settings.sample_rate = SAMPLE_RATE
            settings.chunk_size = CHUNK_SIZE

            # Configure transcription with end-of-utterance detection
            conversation_config = speechmatics.models.ConversationConfig(
                end_of_utterance_silence_trigger=0.75  # Adjust as needed
            )
            conf = speechmatics.models.TranscriptionConfig(
                operating_point="enhanced",
                language=LANGUAGE,
                enable_partials=True,
                max_delay=1,
                conversation_config=conversation_config,
            )

            print("🎤 Voice AI ready - start speaking!")
            print("Press Ctrl+C to stop...")

            # Start transcription
            self.ws.run_synchronously(
                transcription_config=conf,
                stream=self.audio_processor,
                audio_settings=settings,
            )
        except KeyboardInterrupt:
            print("\n🛑 Stopping voice AI transcriber...")
        except Exception as e:
            print(f"Error in transcription: {e}")
        finally:
            self.stop_streaming()

    def stop_streaming(self):
        self.is_recording = False
        if self.stream:
            self.stream.stop_stream()
            self.stream.close()
        self.audio.terminate()


# Usage
if __name__ == "__main__":
    transcriber = VoiceAITranscriber()
    transcriber.start_streaming()
Transcribe a pre-recorded audio file - copy in your API key and file name to get started.
import speechmatics

API_KEY = "YOUR_API_KEY"
PATH_TO_FILE = "example.wav"
LANGUAGE = "en"
CONNECTION_URL = "wss://eu2.rt.speechmatics.com/v2"

# Create a transcription client
ws = speechmatics.client.WebsocketClient(
    speechmatics.models.ConnectionSettings(
        url=CONNECTION_URL,
        auth_token=API_KEY,
    )
)

# Define an event handler to print the partial transcript
def print_partial_transcript(msg):
    print(f"[partial] {msg['metadata']['transcript']}")

# Define an event handler to print the full transcript
def print_transcript(msg):
    print(f"[ FULL] {msg['metadata']['transcript']}")

# Define an event handler for the end-of-utterance event
def print_eou(msg):
    print("EndOfUtterance")

# Register the event handler for partial transcript
ws.add_event_handler(
    event_name=speechmatics.models.ServerMessageType.AddPartialTranscript,
    event_handler=print_partial_transcript,
)

# Register the event handler for full transcript
ws.add_event_handler(
    event_name=speechmatics.models.ServerMessageType.AddTranscript,
    event_handler=print_transcript,
)

# Register the event handler for end of utterance
ws.add_event_handler(
    event_name=speechmatics.models.ServerMessageType.EndOfUtterance,
    event_handler=print_eou,
)

settings = speechmatics.models.AudioSettings()

# Define transcription parameters
# Full list of parameters described here: https://speechmatics.github.io/speechmatics-python/models
conversation_config = speechmatics.models.ConversationConfig(
    end_of_utterance_silence_trigger=0.75  # Adjust as needed
)
conf = speechmatics.models.TranscriptionConfig(
    operating_point="enhanced",
    language=LANGUAGE,
    enable_partials=True,
    max_delay=1,
    conversation_config=conversation_config,
)

print("Starting transcription (type Ctrl-C to stop):")
with open(PATH_TO_FILE, "rb") as fd:
    try:
        ws.run_synchronously(fd, conf, settings)
    except KeyboardInterrupt:
        print("\nTranscription stopped.")