Overview
Airi’s voice chat system provides low-latency, real-time voice interaction. A single audio pipeline handles audio input and output, voice activity detection (VAD), speech recognition, and text-to-speech (TTS) synthesis.
Architecture
The voice chat system consists of several integrated components:
- Audio Context Management: High-quality audio processing with configurable sample rates
- Voice Activity Detection: Client-side speech detection using Silero VAD
- Audio Pipeline: Streaming audio processing with resampling and encoding
- Speech Pipeline: Orchestrates TTS generation, playback scheduling, and intent management
Audio Context
The audio context provides the foundation for all audio operations in Airi.
Initialization
import { initializeAudioContext, getAudioContextState } from '@proj-airi/audio/audio-context'
// Initialize with high-quality sample rate
const audioContext = await initializeAudioContext(48000)
// Check state
const state = getAudioContextState()
console.log(state.isReady, state.sampleRate)
Creating Audio Nodes
import {
  createAudioSource,
  createAudioAnalyser,
  createAudioGainNode,
  createResamplingWorkletNode
} from '@proj-airi/audio/audio-context'
// Create source from MediaStream
const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true })
const source = createAudioSource(mediaStream)
// Create analyser for visualization
const analyser = createAudioAnalyser({
  fftSize: 2048,
  smoothingTimeConstant: 0.8
})
// Create gain node for volume control
const gainNode = createAudioGainNode(0.8)
// Create resampling worklet for format conversion
const worklet = createResamplingWorkletNode(source, {
  inputSampleRate: 48000,
  outputSampleRate: 16000,
  channels: 1,
  converterType: 2 // SRC_SINC_MEDIUM_QUALITY
})
// Connect nodes
source.connect(gainNode)
gainNode.connect(analyser)
analyser.connect(worklet)
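To make the 48000 to 16000 conversion concrete, here is a minimal sketch of sample-rate conversion on a raw buffer. It is not the worklet's implementation: the worklet is configured with a sinc converter, while this example substitutes plain linear interpolation because it is short enough to read at a glance. The helper name `resampleLinear` is hypothetical.

```typescript
// Hypothetical helper, not part of @proj-airi/audio.
// Linear interpolation is used here for brevity; it does no low-pass filtering,
// so it is lower quality than the sinc converter selected by converterType above.
function resampleLinear(input: Float32Array, inputRate: number, outputRate: number): Float32Array {
  if (inputRate <= 0 || outputRate <= 0) {
    throw new RangeError('sample rates must be positive')
  }
  const outputLength = Math.floor((input.length * outputRate) / inputRate)
  const output = new Float32Array(outputLength)
  const step = inputRate / outputRate // input samples advanced per output sample

  for (let i = 0; i < outputLength; i++) {
    const position = i * step
    const left = Math.floor(position)
    const right = Math.min(left + 1, input.length - 1)
    const fraction = position - left
    // Blend the two neighbouring input samples.
    output[i] = input[left] * (1 - fraction) + input[right] * fraction
  }
  return output
}

// 10 ms of audio at 48 kHz (480 samples) becomes 160 samples at 16 kHz.
const tenMsAt48k = new Float32Array(480).fill(0.5)
const tenMsAt16k = resampleLinear(tenMsAt48k, 48000, 16000)
console.log(tenMsAt16k.length) // 160
```

The ratio of the two rates fixes the output length, which is why a 512-sample chunk at 16 kHz corresponds to 1536 input samples at 48 kHz.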
Audio Context State Management
import {
  subscribeToAudioContext,
  suspendAudioContext,
  resumeAudioContext
} from '@proj-airi/audio/audio-context'
// Subscribe to state changes
const unsubscribe = subscribeToAudioContext((state) => {
  console.log('Audio context state:', state.state)
  console.log('Current time:', state.currentTime)
  console.log('Worklets loaded:', state.workletLoaded)
})
// Suspend/resume context
await suspendAudioContext()
await resumeAudioContext()
// Cleanup subscription
unsubscribe()
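The subscribe call above returns a function that removes the listener. A minimal sketch of that pattern follows; `createObservableState` is a hypothetical name, and the real store inside `@proj-airi/audio` may be built differently.

```typescript
type Listener<T> = (state: T) => void

// Hypothetical observable store, shown only to illustrate subscribe/unsubscribe.
function createObservableState<T extends object>(initial: T) {
  let current = initial
  let listeners: Listener<T>[] = []

  return {
    get: (): T => current,
    update(patch: Partial<T>): void {
      current = { ...current, ...patch }
      listeners.forEach(listener => listener(current))
    },
    subscribe(listener: Listener<T>): () => void {
      listeners.push(listener)
      // The returned function removes exactly this listener.
      return () => {
        listeners = listeners.filter(registered => registered !== listener)
      }
    },
  }
}

const contextState = createObservableState({ state: 'suspended', workletLoaded: false })
const seen: string[] = []
const stop = contextState.subscribe((s) => { seen.push(s.state) })

contextState.update({ state: 'running' }) // listener is notified
stop()
contextState.update({ state: 'closed' }) // listener is no longer notified
console.log(seen) // ['running']
```

Calling the returned function when a component unmounts prevents stale listeners from accumulating.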
Voice Activity Detection (VAD)
VAD automatically detects when the user is speaking, enabling hands-free interaction without a push-to-talk button.
VAD Configuration
import { createVAD } from '@proj-airi/stage-ui/workers/vad'
const vad = await createVAD({
  sampleRate: 16000,
  speechThreshold: 0.3, // Probability threshold for speech start
  exitThreshold: 0.1, // Probability threshold for speech end
  minSilenceDurationMs: 400, // Min silence before ending speech
  speechPadMs: 80, // Padding around speech segments
  minSpeechDurationMs: 250, // Min duration to consider as speech
  maxBufferDuration: 30, // Max recording duration in seconds
  newBufferSize: 512 // Audio chunk size
})
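The two thresholds form a hysteresis band: speech starts above `speechThreshold` and only counts as silence once the probability falls below `exitThreshold`. The sketch below shows one way these options could interact. It is an illustrative state machine under stated assumptions, not Airi's worker logic; `createVadGate`, `frameDurationMs`, and the `'speech-discarded'` event are hypothetical.

```typescript
interface VadGateOptions {
  speechThreshold: number
  exitThreshold: number
  minSilenceDurationMs: number
  minSpeechDurationMs: number
  frameDurationMs: number // assumed: 512 samples at 16 kHz = 32 ms per frame
}

type VadGateEvent = 'speech-start' | 'speech-end' | 'speech-discarded'

// Hypothetical gate that turns per-frame probabilities into segment events.
function createVadGate(options: VadGateOptions) {
  let speaking = false
  let speechMs = 0
  let silenceMs = 0

  return function onFrame(probability: number): VadGateEvent | null {
    if (!speaking) {
      if (probability >= options.speechThreshold) {
        speaking = true
        speechMs = options.frameDurationMs
        silenceMs = 0
        return 'speech-start'
      }
      return null
    }

    speechMs += options.frameDurationMs
    if (probability < options.exitThreshold) {
      silenceMs += options.frameDurationMs
    }
    else {
      silenceMs = 0 // at or above the exit threshold keeps the segment alive
    }

    if (silenceMs >= options.minSilenceDurationMs) {
      speaking = false
      const voicedMs = speechMs - silenceMs
      // Segments shorter than minSpeechDurationMs are treated as noise.
      return voicedMs >= options.minSpeechDurationMs ? 'speech-end' : 'speech-discarded'
    }
    return null
  }
}

const gate = createVadGate({
  speechThreshold: 0.3,
  exitThreshold: 0.1,
  minSilenceDurationMs: 400,
  minSpeechDurationMs: 250,
  frameDurationMs: 32,
})
console.log(gate(0.05)) // null: below speechThreshold
console.log(gate(0.6)) // 'speech-start'
console.log(gate(0.2)) // null: between the thresholds, speech continues
```

The gap between the two thresholds keeps a segment from flickering on and off when the probability hovers near a single cutoff.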
Using VAD
// Listen to VAD events
vad.on('speech-start', () => {
  console.log('User started speaking')
})
vad.on('speech-end', () => {
  console.log('User stopped speaking')
})
vad.on('speech-ready', ({ buffer, duration }) => {
  console.log(`Speech segment ready: ${duration}ms`)
  // Process the audio buffer
})
vad.on('debug', ({ data }) => {
  console.log('Speech probability:', data.probability)
})
// Process audio
await vad.initialize()
const audioBuffer = new Float32Array(512)
await vad.processAudio(audioBuffer)
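The example above feeds a single 512-sample buffer. For a longer recording, the samples would need to be split into chunks first. The sketch below assumes `processAudio` expects chunks that match `newBufferSize`, which the source does not state explicitly; `splitIntoFrames` is a hypothetical helper.

```typescript
// Hypothetical helper: split samples into fixed-size frames, zero-padding the last one.
function splitIntoFrames(samples: Float32Array, frameSize = 512): Float32Array[] {
  if (!Number.isInteger(frameSize) || frameSize <= 0) {
    throw new RangeError('frameSize must be a positive integer')
  }
  const frames: Float32Array[] = []
  for (let offset = 0; offset < samples.length; offset += frameSize) {
    const frame = new Float32Array(frameSize) // zero-filled by default
    // subarray clamps the end index, so the final frame may be partially filled.
    frame.set(samples.subarray(offset, offset + frameSize))
    frames.push(frame)
  }
  return frames
}

// One second of 16 kHz audio: 31 full frames plus one padded frame.
const oneSecond = new Float32Array(16000)
const frames = splitIntoFrames(oneSecond)
console.log(frames.length) // 32
```

Each resulting frame could then be passed to `vad.processAudio` in order, awaiting each call so that frames are processed sequentially.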
Vue Composable for VAD
import { useVAD } from '@proj-airi/stage-ui/stores/ai/models/vad'
import vadWorkletUrl from '@proj-airi/stage-ui/workers/vad/process.worklet?worker&url'
const vad = useVAD(vadWorkletUrl, {
  threshold: 0.6,
  onSpeechStart: () => {
    console.log('Speech started')
  },
  onSpeechEnd: () => {
    console.log('Speech ended')
  }
})
// Initialize and start
await vad.init()
const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
await vad.start(stream)
// Access state
console.log(vad.isSpeech.value) // Boolean
console.log(vad.isSpeechProb.value) // 0-1 probability
console.log(vad.loaded.value) // Boolean
// Cleanup
vad.dispose()
Speech Pipeline
The speech pipeline manages TTS generation, playback scheduling, and intent prioritization.
Creating a Speech Pipeline
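import { createPriorityResolver, createSpeechPipeline } from '@proj-airi/pipelines-audio'
// Assumes createTtsSegmentStream and the generateSpeech, playAudio, and stop* helpers used below are already in scope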
const pipeline = createSpeechPipeline({
  // TTS generation function
  tts: async (request, signal) => {
    // Generate audio from text
    const audio = await generateSpeech(request.text, signal)
    return audio
  },
  // Playback manager
  playback: {
    schedule: (item) => {
      // Schedule audio for playback
      playAudio(item.audio)
    },
    stopAll: (reason) => {
      // Stop all playback
      stopAllAudio()
    },
    stopByIntent: (intentId, reason) => {
      // Stop specific intent
      stopAudioByIntent(intentId)
    },
    stopByOwner: (ownerId, reason) => {
      // Stop by owner ID
      stopAudioByOwner(ownerId)
    },
    onStart: (listener) => { /* ... */ },
    onEnd: (listener) => { /* ... */ },
    onInterrupt: (listener) => { /* ... */ },
    onReject: (listener) => { /* ... */ }
  },
  logger: console,
  priority: createPriorityResolver(),
  segmenter: createTtsSegmentStream
})
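The `playback` object above delegates to application helpers that are not shown. As a rough illustration of what such a manager has to track, here is an in-memory sketch that covers scheduling, the three stop methods, and interrupt notifications. The item fields and the `createInMemoryPlayback` name are assumptions; the real pipeline's playback item may carry different fields, and actual audio output is omitted.

```typescript
// Hypothetical item shape, for illustration only.
interface PlaybackItem {
  id: string
  intentId: string
  ownerId?: string
  text: string
}

type InterruptListener = (event: { item: PlaybackItem, reason: string }) => void

function createInMemoryPlayback() {
  let active: PlaybackItem[] = []
  const interruptListeners: InterruptListener[] = []

  // Remove every matching item and notify listeners once per removed item.
  function stopWhere(match: (item: PlaybackItem) => boolean, reason: string): void {
    const stopped = active.filter(match)
    active = active.filter(item => !match(item))
    for (const item of stopped) {
      for (const listener of interruptListeners) listener({ item, reason })
    }
  }

  return {
    schedule(item: PlaybackItem): void {
      active.push(item)
    },
    stopAll(reason: string): void {
      stopWhere(() => true, reason)
    },
    stopByIntent(intentId: string, reason: string): void {
      stopWhere(item => item.intentId === intentId, reason)
    },
    stopByOwner(ownerId: string, reason: string): void {
      stopWhere(item => item.ownerId === ownerId, reason)
    },
    onInterrupt(listener: InterruptListener): void {
      interruptListeners.push(listener)
    },
    pending: (): PlaybackItem[] => active.slice(),
  }
}

const playback = createInMemoryPlayback()
const interrupted: string[] = []
playback.onInterrupt(({ item, reason }) => { interrupted.push(`${item.id}:${reason}`) })

playback.schedule({ id: 'a', intentId: 'intent-1', ownerId: 'user-message-123', text: 'Hello, ' })
playback.schedule({ id: 'b', intentId: 'intent-1', ownerId: 'user-message-123', text: 'how can I help?' })
playback.schedule({ id: 'c', intentId: 'intent-2', text: 'Unrelated notice' })

playback.stopByIntent('intent-1', 'user-interrupt')
console.log(playback.pending().map(item => item.id)) // ['c']
console.log(interrupted) // ['a:user-interrupt', 'b:user-interrupt']
```

Keeping the intent and owner IDs on every scheduled item is what makes targeted stops possible without touching unrelated audio.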
Using Speech Intents
// Open an intent for speech output
const intent = pipeline.openIntent({
  priority: 'high',
  behavior: 'interrupt', // or 'queue' or 'replace'
  ownerId: 'user-message-123'
})
// Write text tokens
intent.writeLiteral('Hello, ')
intent.writeLiteral('how can I help you today?')
// Write special tokens (emotions, delays, etc.)
intent.writeSpecial('emotion:happy')
// Flush immediately
intent.writeFlush()
// End the intent
intent.end()
// Or cancel it
intent.cancel('User interrupted')
Pipeline Events
// Listen to pipeline events
pipeline.on('onIntentStart', (intentId) => {
  console.log('Intent started:', intentId)
})
pipeline.on('onIntentEnd', (intentId) => {
  console.log('Intent completed:', intentId)
})
pipeline.on('onSegment', (segment) => {
  console.log('Text segment:', segment.text)
})
pipeline.on('onTtsRequest', (request) => {
  console.log('TTS requested:', request.text)
})
pipeline.on('onTtsResult', (result) => {
  console.log('TTS completed:', result.segmentId)
})
pipeline.on('onPlaybackStart', ({ item, startedAt }) => {
  console.log('Playback started:', item.text)
})
pipeline.on('onPlaybackEnd', ({ item, endedAt }) => {
  console.log('Playback ended')
})
Configuration
Audio Quality Settings
// High-quality audio (default)
const audioContext = await initializeAudioContext(48000)
// Lower latency (trade-off with quality): use this line instead of the one above
// const audioContext = await initializeAudioContext(24000)
VAD Sensitivity
// More sensitive (picks up quieter speech)
const sensitiveVad = await createVAD({
  speechThreshold: 0.2,
  exitThreshold: 0.05
})
// Less sensitive (reduces false positives)
const conservativeVad = await createVAD({
  speechThreshold: 0.5,
  exitThreshold: 0.2
})
Speech Pipeline Priority
import { createPriorityResolver } from '@proj-airi/pipelines-audio'
const priority = createPriorityResolver()
// Use priority levels
const intent = pipeline.openIntent({
  priority: 'high', // or 'normal', 'low', or a number
  behavior: 'interrupt'
})
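Since a priority can be a named level or a number, a resolver has to map both onto one comparable scale. The sketch below shows one plausible mapping. The numeric weights and the `shouldInterrupt` rule are assumptions made for illustration; the values that `createPriorityResolver()` actually uses are not documented here.

```typescript
type PriorityLevel = 'low' | 'normal' | 'high' | number

// Assumed weights, chosen only so that low < normal < high.
const ASSUMED_WEIGHTS = { low: 0, normal: 1, high: 2 } as const

function resolvePriority(level: PriorityLevel): number {
  return typeof level === 'number' ? level : ASSUMED_WEIGHTS[level]
}

// One plausible rule: an incoming intent pre-empts the current one
// only when its priority is at least as high.
function shouldInterrupt(current: PriorityLevel, incoming: PriorityLevel): boolean {
  return resolvePriority(incoming) >= resolvePriority(current)
}

console.log(shouldInterrupt('normal', 'high')) // true
console.log(shouldInterrupt('high', 'low')) // false
console.log(shouldInterrupt('normal', 1)) // true
```

Resolving everything to a number first keeps the comparison logic independent of how callers choose to express priority.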
Performance Considerations
- Sample Rate: Higher sample rates (48kHz) provide better quality but use more processing power
- Buffer Size: Smaller buffers reduce latency but may cause audio glitches on slower devices
- VAD Thresholds: Adjust based on microphone quality and ambient noise levels
- Worklet Processing: Audio worklets run on the audio rendering thread, off the main thread, so UI work does not stall audio processing
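The buffer-size trade-off can be quantified: the delay a buffer adds is its length divided by the sample rate. A small sketch of that arithmetic follows; `bufferLatencyMs` is a hypothetical helper name.

```typescript
// Time represented by one buffer of audio, in milliseconds.
function bufferLatencyMs(bufferSize: number, sampleRate: number): number {
  if (bufferSize <= 0 || sampleRate <= 0) {
    throw new RangeError('bufferSize and sampleRate must be positive')
  }
  return (bufferSize * 1000) / sampleRate
}

console.log(bufferLatencyMs(512, 16000)) // 32
console.log(bufferLatencyMs(2048, 48000).toFixed(2)) // '42.67'
console.log(bufferLatencyMs(128, 48000).toFixed(2)) // '2.67' (one Web Audio render quantum)
```

These figures cover buffering only. Model inference, network round trips, and TTS generation add their own delays on top.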
Best Practices
- Initialize Early: Create the audio context at startup to avoid setup delays, but expect browsers to keep it suspended until the first user gesture
- Cleanup Resources: Always disconnect and remove audio nodes when done
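- Handle Errors: On iOS and other browsers with autoplay restrictions, the audio context may not start without a user gesture, so catch failures and resume from a gesture handler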
- Monitor State: Subscribe to context state changes for debugging
- Test Across Devices: Audio behavior varies significantly across browsers and devices
Troubleshooting
Audio Context Suspended
if (audioContext.state === 'suspended') {
  // Resume on user interaction
  document.addEventListener('click', async () => {
    await resumeAudioContext()
  }, { once: true })
}
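A click is not the only gesture that can unlock audio. The sketch below generalizes the snippet above to several event types while still resuming only once. It relies only on the standard `state` and `resume()` members of an `AudioContext`; `resumeOnFirstGesture` and the two small interfaces are hypothetical, defined structurally so the logic can also be exercised outside a browser.

```typescript
// Minimal structural types; a real AudioContext and document are intended to fit them.
interface ResumableContext {
  state: string
  resume: () => Promise<void>
}

interface GestureTarget {
  addEventListener: (type: string, handler: () => void, options?: { once?: boolean }) => void
}

// Hypothetical helper: resume a suspended context on whichever gesture arrives first.
function resumeOnFirstGesture(
  context: ResumableContext,
  target: GestureTarget,
  gestures: string[] = ['pointerdown', 'keydown', 'touchend'],
): void {
  if (context.state !== 'suspended') return

  let resumed = false
  for (const gesture of gestures) {
    target.addEventListener(gesture, () => {
      if (resumed) return // another gesture already triggered the resume
      resumed = true
      void context.resume()
    }, { once: true })
  }
}

// In a browser, the call would look like: resumeOnFirstGesture(audioContext, document)
```

The `resumed` flag matters because `{ once: true }` removes each handler only after its own event fires, so without it a second gesture type would call `resume()` again.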
VAD Not Detecting Speech
// Check microphone permissions (the promise rejects if access is denied)
const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
const source = createAudioSource(stream)
// Verify audio is flowing
const analyser = createAudioAnalyser()
source.connect(analyser)
const dataArray = new Uint8Array(analyser.frequencyBinCount)
analyser.getByteTimeDomainData(dataArray)
// Samples are centered on 128, so a constant 128 means silence
console.log('Audio level:', Math.max(...dataArray))
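Because byte time-domain data is unsigned and centered on 128, the raw maximum is hard to interpret. A normalized peak is easier to read. The sketch below assumes the array was filled by `getByteTimeDomainData`; `peakLevel` is a hypothetical helper.

```typescript
// Byte time-domain samples are unsigned: 128 is the zero line, 0 and 255 are the extremes.
function peakLevel(timeDomain: Uint8Array): number {
  let peak = 0
  for (let i = 0; i < timeDomain.length; i++) {
    const deviation = Math.abs(timeDomain[i] - 128) / 128
    if (deviation > peak) peak = deviation
  }
  return peak
}

console.log(peakLevel(new Uint8Array([128, 128, 128]))) // 0 (silence)
console.log(peakLevel(new Uint8Array([128, 192, 64]))) // 0.5
```

A peak that stays at 0 across repeated reads suggests the microphone stream is muted or not connected to the analyser, rather than a VAD threshold problem.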
High Latency
// Reduce buffer sizes
const worklet = createResamplingWorkletNode(source, {
  bufferSize: 512 // Smaller buffer = lower latency
})
// Use lower sample rate
const audioContext = await initializeAudioContext(24000)