Overview
Airi provides speech recognition with support for multiple transcription providers, client-side Voice Activity Detection (VAD), and both streaming and batch transcription modes. Audio capture and VAD run in the browser; transcription goes through either the browser's Web Speech API or an optional server-side provider.
Architecture
The speech recognition system consists of:
- Voice Activity Detection (VAD): Client-side speech detection using Silero VAD (Transformers.js)
- Audio Pipeline: Real-time audio capture, processing, and streaming
- Transcription Providers: Multiple provider support (Web Speech API, Aliyun, OpenAI-compatible)
- Streaming Support: Real-time transcription as the user speaks
- Session Management: Continuous recognition with idle timeouts
Voice Activity Detection
VAD automatically detects when speech starts and ends, enabling hands-free interaction.
VAD Implementation
Airi uses the Silero VAD model via Hugging Face Transformers.js:
import { createVAD } from '@proj-airi/stage-ui/workers/vad'
// Create VAD instance
const vad = await createVAD({
sampleRate: 16000,
speechThreshold: 0.3, // Speech detection threshold (0-1)
exitThreshold: 0.1, // Silence detection threshold
minSilenceDurationMs: 400, // Min silence to end speech
speechPadMs: 80, // Padding around speech segments
minSpeechDurationMs: 250, // Min duration to count as speech
maxBufferDuration: 30, // Max recording duration (seconds)
newBufferSize: 512 // Audio chunk size
})
// VAD uses Silero model from HuggingFace
// Model: 'onnx-community/silero-vad'
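How the two thresholds and the duration limits interact can be illustrated with a small state machine. The sketch below is not Airi's implementation: `segmentSpeech`, `HysteresisConfig`, and the per-frame probability input are hypothetical, and it assumes one probability per fixed-length frame. It ignores `speechPadMs` and `maxBufferDuration` for brevity.

```typescript
interface HysteresisConfig {
  speechThreshold: number // probability at or above which speech starts
  exitThreshold: number // probability below which a frame counts as silence
  minSilenceDurationMs: number // silence needed to close a segment
  minSpeechDurationMs: number // shorter segments are discarded
  frameMs: number // duration of one probability frame (assumed constant)
}

interface Segment {
  startMs: number
  endMs: number
}

function segmentSpeech(probabilities: number[], cfg: HysteresisConfig): Segment[] {
  const segments: Segment[] = []
  let speaking = false
  let startMs = 0
  let lastVoicedEndMs = 0 // end time of the most recent non-silent frame

  // Close the open segment, keeping it only if it is long enough.
  const closeSegment = () => {
    if (lastVoicedEndMs - startMs >= cfg.minSpeechDurationMs)
      segments.push({ startMs, endMs: lastVoicedEndMs })
    speaking = false
  }

  for (let i = 0; i < probabilities.length; i++) {
    const p = probabilities[i]
    const frameStartMs = i * cfg.frameMs
    const frameEndMs = frameStartMs + cfg.frameMs

    if (!speaking) {
      // Entering speech requires the higher threshold.
      if (p >= cfg.speechThreshold) {
        speaking = true
        startMs = frameStartMs
        lastVoicedEndMs = frameEndMs
      }
    }
    else if (p >= cfg.exitThreshold) {
      // Staying in speech only requires the lower threshold.
      lastVoicedEndMs = frameEndMs
    }
    else if (frameEndMs - lastVoicedEndMs >= cfg.minSilenceDurationMs) {
      closeSegment()
    }
  }

  if (speaking)
    closeSegment()
  return segments
}
```

Using a lower exit threshold than entry threshold keeps a segment open through brief dips in probability, which is why `exitThreshold` sits below `speechThreshold` in every configuration on this page.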
VAD Events
// Speech detection events
vad.on('speech-start', () => {
console.log('User started speaking')
// Show recording indicator
})
vad.on('speech-end', () => {
console.log('User stopped speaking')
// Hide recording indicator
})
vad.on('speech-ready', ({ buffer, duration }) => {
console.log(`Speech segment: ${duration}ms`)
// buffer is Float32Array of audio data
// Process or send for transcription
})
vad.on('debug', ({ data }) => {
// Real-time probability score
const probability = data.probability // 0-1
console.log('Speech probability:', probability)
})
vad.on('status', ({ type, message }) => {
if (type === 'info') {
console.log('Status:', message)
} else if (type === 'error') {
console.error('Error:', message)
}
})
Processing Audio with VAD
// Initialize VAD model
await vad.initialize()
// Process audio buffers
const audioBuffer = new Float32Array(512) // 16kHz mono audio
await vad.processAudio(audioBuffer)
// VAD automatically detects speech and emits events
// No need to manually check thresholds
VAD Configuration
// Update VAD settings dynamically
vad.updateConfig({
speechThreshold: 0.5, // Less sensitive
exitThreshold: 0.2,
minSilenceDurationMs: 600 // Wait longer before ending
})
// Sensitive settings (quiet environments)
const sensitiveVAD = await createVAD({
speechThreshold: 0.2,
exitThreshold: 0.05,
minSilenceDurationMs: 300
})
// Conservative settings (noisy environments)
const conservativeVAD = await createVAD({
speechThreshold: 0.5,
exitThreshold: 0.25,
minSilenceDurationMs: 600
})
Transcription Providers
Airi supports multiple speech recognition providers:
Web Speech API (Browser-Native)
import { useHearingStore } from '@proj-airi/stage-ui/stores/modules/hearing'
const hearingStore = useHearingStore()
// Configure Web Speech API
hearingStore.activeTranscriptionProvider.value = 'browser-web-speech-api'
hearingStore.activeTranscriptionModel.value = 'web-speech-api'
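// Web Speech API is free to use and needs no API key
// Available in Chrome, Edge, and Safari
// Some browsers (Chrome included) send audio to a server-side recognition service, so do not assume it works offline
// Supports 50+ languages, depending on the browser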
OpenAI Whisper
import { useHearingStore } from '@proj-airi/stage-ui/stores/modules/hearing'
import { useProvidersStore } from '@proj-airi/stage-ui/stores/providers'
const hearingStore = useHearingStore()
const providersStore = useProvidersStore()
// Configure provider
await providersStore.saveProviderConfig('openai', {
apiKey: 'sk-...'
})
// Set transcription provider
hearingStore.activeTranscriptionProvider.value = 'openai'
hearingStore.activeTranscriptionModel.value = 'whisper-1'
// Supports multiple languages, high accuracy
Aliyun NLS (Streaming)
// Configure Aliyun for real-time streaming transcription
await providersStore.saveProviderConfig('aliyun-nls-transcription', {
appKey: 'your-app-key',
token: 'your-token',
url: 'wss://nls-gateway.cn-shanghai.aliyuncs.com/ws/v1'
})
hearingStore.activeTranscriptionProvider.value = 'aliyun-nls-transcription'
Streaming Transcription
Real-Time Speech Recognition
import { useHearingSpeechInputPipeline } from '@proj-airi/stage-ui/stores/modules/hearing'
const hearingPipeline = useHearingSpeechInputPipeline()
// Get microphone access
const stream = await navigator.mediaDevices.getUserMedia({
audio: {
echoCancellation: true,
noiseSuppression: true,
autoGainControl: true,
sampleRate: 16000
}
})
// Start streaming transcription
await hearingPipeline.transcribeForMediaStream(stream, {
sampleRate: 16000,
idleTimeoutMs: 15000, // Stop after 15s of idle
// Callbacks for real-time results
onSentenceEnd: (delta) => {
// Called when a sentence completes
console.log('Sentence:', delta)
// Display or append to UI
},
onSpeechEnd: (text) => {
// Called when speech session ends
console.log('Full text:', text)
// Send to chat or process
},
// Provider-specific options
providerOptions: {
language: 'en-US',
continuous: true, // Keep listening
interimResults: true // Show partial results
}
})
// Stop transcription
const finalText = await hearingPipeline.stopStreamingTranscription()
console.log('Final transcription:', finalText)
Web Speech API Streaming
import { streamWebSpeechAPITranscription } from '@proj-airi/stage-ui/stores/providers/web-speech-api'
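const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
// Call abortController.abort() to cancel the recognition session
const abortController = new AbortController()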
const result = streamWebSpeechAPITranscription(stream, {
language: 'en-US',
continuous: true,
interimResults: true,
maxAlternatives: 1,
abortSignal: abortController.signal,
onSentenceEnd: (delta) => {
console.log('New text:', delta)
},
onSpeechEnd: (text) => {
console.log('Complete:', text)
}
})
// Stream text deltas
for await (const chunk of result.textStream) {
console.log('Chunk:', chunk)
}
// Or get full text
const fullText = await result.text
Batch Transcription
Transcribe Audio File
import { useHearingSpeechInputPipeline, useHearingStore } from '@proj-airi/stage-ui/stores/modules/hearing'
const hearingStore = useHearingStore()
const hearingPipeline = useHearingSpeechInputPipeline()
// Transcribe from recorded audio
// audioData should hold the bytes of a complete WAV file (header included)
const audioBlob = new Blob([audioData], { type: 'audio/wav' })
const text = await hearingPipeline.transcribeForRecording(audioBlob)
console.log('Transcription:', text)
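The recording example expects WAV bytes, while the VAD `speech-ready` event delivers a raw `Float32Array`. One way to bridge the two is to wrap the samples in a 16-bit mono PCM WAV container. The `encodeWav` helper below is a hypothetical sketch, not part of Airi; it writes the standard 44-byte RIFF/WAVE header followed by little-endian samples.

```typescript
function encodeWav(samples: Float32Array, sampleRate = 16000): ArrayBuffer {
  const bytesPerSample = 2
  const dataSize = samples.length * bytesPerSample
  const buffer = new ArrayBuffer(44 + dataSize)
  const view = new DataView(buffer)

  const writeAscii = (offset: number, text: string) => {
    for (let i = 0; i < text.length; i++)
      view.setUint8(offset + i, text.charCodeAt(i))
  }

  writeAscii(0, 'RIFF')
  view.setUint32(4, 36 + dataSize, true) // file size minus the first 8 bytes
  writeAscii(8, 'WAVE')
  writeAscii(12, 'fmt ')
  view.setUint32(16, 16, true) // fmt chunk size for PCM
  view.setUint16(20, 1, true) // audio format 1 = linear PCM
  view.setUint16(22, 1, true) // channel count: mono
  view.setUint32(24, sampleRate, true)
  view.setUint32(28, sampleRate * bytesPerSample, true) // byte rate
  view.setUint16(32, bytesPerSample, true) // block align
  view.setUint16(34, 16, true) // bits per sample
  writeAscii(36, 'data')
  view.setUint32(40, dataSize, true)

  // Clamp each sample to [-1, 1] and scale to the signed 16-bit range.
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]))
    view.setInt16(44 + i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7FFF), true)
  }
  return buffer
}
```

For example, `new Blob([encodeWav(buffer)], { type: 'audio/wav' })` would produce a Blob of the kind passed to `transcribeForRecording`, where `buffer` is the `Float32Array` from a `speech-ready` event.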
Using Generate API
import { generateTranscription } from '@xsai/generate-transcription'
import { useProvidersStore } from '@proj-airi/stage-ui/stores/providers'
const providersStore = useProvidersStore()
const provider = await providersStore.getProviderInstance('openai')
const audioFile = new File([audioBlob], 'recording.wav')
const result = await generateTranscription({
...provider.transcription('whisper-1'),
file: audioFile,
responseFormat: 'verbose_json' // or 'json', 'text'
})
console.log('Text:', result.text)
console.log('Language:', result.language)
console.log('Duration:', result.duration)
Audio Pipeline
Audio Stream Creation
// Internal: Create audio stream from MediaStream
async function createAudioStreamFromMediaStream(
stream: MediaStream,
sampleRate = 16000,
onActivity?: () => void
) {
const audioContext = new AudioContext({
sampleRate,
latencyHint: 'interactive'
})
// Load VAD worklet
await audioContext.audioWorklet.addModule(vadWorkletUrl)
const workletNode = new AudioWorkletNode(
audioContext,
'vad-audio-worklet-processor'
)
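// Create readable stream for transcription
// The definite-assignment assertion (!) is safe: start() runs synchronously inside the ReadableStream constructor
let audioStreamController!: ReadableStreamDefaultController<ArrayBuffer>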
const audioStream = new ReadableStream<ArrayBuffer>({
start(controller) {
audioStreamController = controller
}
})
// Process audio chunks
workletNode.port.onmessage = ({ data }) => {
const buffer = data?.buffer
if (buffer) {
// Convert Float32 to Int16 PCM
const pcm16 = float32ToInt16(buffer)
audioStreamController.enqueue(pcm16.buffer.slice(0))
onActivity?.()
}
}
// Connect audio nodes
const mediaStreamSource = audioContext.createMediaStreamSource(stream)
mediaStreamSource.connect(workletNode)
// Create silent output to prevent echo
const silentGain = audioContext.createGain()
silentGain.gain.value = 0
workletNode.connect(silentGain)
silentGain.connect(audioContext.destination)
return {
audioContext,
workletNode,
mediaStreamSource,
audioStream,
controller: audioStreamController
}
}
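The function above calls `float32ToInt16` without showing it. A minimal sketch of such a conversion follows; the helper shipped in Airi may differ in rounding or clamping, so treat this as an illustration of the technique rather than the actual source.

```typescript
// Convert Web Audio samples in [-1, 1] to signed 16-bit PCM.
function float32ToInt16(input: Float32Array): Int16Array {
  const output = new Int16Array(input.length)
  for (let i = 0; i < input.length; i++) {
    // Clamp first so out-of-range samples saturate instead of wrapping around.
    const s = Math.max(-1, Math.min(1, input[i]))
    // Negative and positive halves have different magnitudes in two's complement.
    output[i] = s < 0 ? s * 0x8000 : s * 0x7FFF
  }
  return output
}
```

Assigning into an `Int16Array` truncates the fractional part, so no explicit rounding is needed for this sketch.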
Session Management
Continuous Recognition
import { useHearingSpeechInputPipeline } from '@proj-airi/stage-ui/stores/modules/hearing'
const hearingPipeline = useHearingSpeechInputPipeline()
// Start continuous recognition
const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
await hearingPipeline.transcribeForMediaStream(stream, {
// No idle timeout = continuous
idleTimeoutMs: 0,
providerOptions: {
continuous: true,
interimResults: true
},
onSentenceEnd: (delta) => {
// Process each completed sentence
appendToChat(delta)
}
})
// Session stays active until explicitly stopped
// or user disables microphone
Idle Timeout
// Auto-stop after period of inactivity
await hearingPipeline.transcribeForMediaStream(stream, {
idleTimeoutMs: 15000, // 15 seconds
onSpeechEnd: (text) => {
console.log('Session ended due to inactivity')
}
})
// Timer resets on each audio activity
// Helps save API costs and resources
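The reset-on-activity behavior can be pictured as a timer that is cancelled and rescheduled whenever audio arrives. The sketch below is hypothetical (`createIdleTimer` and `Scheduler` are illustrative names, not Airi APIs) and assumes, as in the continuous recognition example, that a timeout of 0 disables the timer. The scheduler is injectable so the logic can be exercised without real delays.

```typescript
interface Scheduler {
  set: (fn: () => void, ms: number) => unknown
  clear: (handle: unknown) => void
}

const defaultScheduler: Scheduler = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: handle => clearTimeout(handle as ReturnType<typeof setTimeout>),
}

function createIdleTimer(
  timeoutMs: number,
  onIdle: () => void,
  scheduler: Scheduler = defaultScheduler,
) {
  let handle: unknown

  // Cancel the pending countdown, if any.
  const stop = () => {
    if (handle !== undefined) {
      scheduler.clear(handle)
      handle = undefined
    }
  }

  // Call on every audio chunk: restarts the countdown from zero.
  const touch = () => {
    stop()
    if (timeoutMs > 0)
      handle = scheduler.set(onIdle, timeoutMs)
  }

  return { touch, stop }
}
```

In this sketch, `touch` would be wired to the same place the pipeline calls `onActivity`, and `onIdle` would stop the session.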
Manual Stop
// Stop current session
const finalText = await hearingPipeline.stopStreamingTranscription()
// Abort session (don't wait for final result)
await hearingPipeline.stopStreamingTranscription(true)
Language Support
Supported Languages
// Web Speech API supports 50+ languages
const languages = [
'en-US', // English (US)
'en-GB', // English (UK)
'zh-CN', // Chinese (Simplified)
'zh-TW', // Chinese (Traditional)
'ja-JP', // Japanese
'ko-KR', // Korean
'es-ES', // Spanish
'fr-FR', // French
'de-DE', // German
'it-IT', // Italian
'pt-BR', // Portuguese (Brazil)
'ru-RU', // Russian
// ... and many more
]
// Set language in provider options
await hearingPipeline.transcribeForMediaStream(stream, {
providerOptions: {
language: 'ja-JP'
}
})
Auto Language Detection
// Some providers support auto-detection
const result = await generateTranscription({
...provider.transcription('whisper-1'),
file: audioFile,
language: undefined // Auto-detect
})
console.log('Detected language:', result.language)
Sample Rate
// Lower sample rate = less bandwidth, faster processing
// 16 kHz is the standard input rate for speech models, including the VAD configured above
await hearingPipeline.transcribeForMediaStream(stream, {
sampleRate: 16000 // Standard for speech
})
// Higher rates carry more high-frequency detail but produce more data
const highQuality = 24000 // 1.5x the data of 16 kHz
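The bandwidth difference is easy to quantify for uncompressed audio. The sketch below assumes 16-bit mono PCM, matching the Int16 conversion in the audio pipeline; `pcmBytesPerSecond` is a hypothetical helper, and providers that compress audio before upload will see lower figures.

```typescript
// Uncompressed PCM data rate in bytes per second.
function pcmBytesPerSecond(sampleRate: number, bitsPerSample = 16, channels = 1): number {
  return sampleRate * (bitsPerSample / 8) * channels
}

const at16k = pcmBytesPerSecond(16000) // 32000 bytes/s
const at24k = pcmBytesPerSecond(24000) // 48000 bytes/s
console.log(`16 kHz: ${at16k} B/s, 24 kHz: ${at24k} B/s`)
```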
VAD Tuning
// Adjust VAD for faster response
const fastVAD = await createVAD({
speechThreshold: 0.3,
minSilenceDurationMs: 300, // End speech faster
minSpeechDurationMs: 150, // Shorter minimum
newBufferSize: 256 // Smaller chunks = lower latency
})
// Quality over speed
const qualityVAD = await createVAD({
speechThreshold: 0.4,
minSilenceDurationMs: 600, // Wait longer for pauses
speechPadMs: 120, // More padding
newBufferSize: 512
})
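The latency effect of `newBufferSize` follows from how much audio one chunk represents. The sketch below assumes the buffer size is counted in samples at the configured sample rate; `chunkDurationMs` is a hypothetical helper for the arithmetic only.

```typescript
// Duration of audio covered by one chunk, in milliseconds.
function chunkDurationMs(samplesPerChunk: number, sampleRate = 16000): number {
  return (samplesPerChunk * 1000) / sampleRate
}

console.log(chunkDurationMs(512)) // 32
console.log(chunkDurationMs(256)) // 16
```

A chunk cannot be scored until it is full, so the chunk duration is a lower bound on how quickly the VAD can react to a change in the signal.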
Model Selection
// Web Speech API = Low latency, free, no API key; some browsers send audio to a server-side service, so it may not work offline
// Good for: Real-time interaction, quick setup
// OpenAI Whisper = High accuracy, supports many languages
// Good for: Important transcriptions, multilingual
// Aliyun NLS = Streaming, low latency
// Good for: Real-time Chinese speech recognition
Error Handling
Common Errors
import { useHearingSpeechInputPipeline } from '@proj-airi/stage-ui/stores/modules/hearing'
const hearingPipeline = useHearingSpeechInputPipeline()
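try {
// NotAllowedError and NotFoundError are thrown by getUserMedia, so request the stream inside the try block
const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
await hearingPipeline.transcribeForMediaStream(stream)
} catch (error) {
const name = error instanceof Error ? error.name : ''
if (name === 'NotAllowedError') {
console.error('Microphone permission denied')
} else if (name === 'NotFoundError') {
console.error('No microphone found')
} else if (name === 'AbortError') {
console.log('Transcription was cancelled')
} else {
console.error('Transcription error:', error)
}
}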
// Check error state
if (hearingPipeline.error.value) {
console.error('Pipeline error:', hearingPipeline.error.value)
}
Retry Logic
async function transcribeWithRetry(
stream: MediaStream,
maxRetries = 3
) {
for (let i = 0; i < maxRetries; i++) {
try {
await hearingPipeline.transcribeForMediaStream(stream)
return
} catch (error) {
if (i === maxRetries - 1) throw error
console.log(`Retry ${i + 1}/${maxRetries}...`)
await new Promise(resolve => setTimeout(resolve, 1000 * (i + 1)))
}
}
}
Best Practices
- Check Support: Verify Web Speech API availability before using
- Request Permissions Early: Get microphone access before the user needs it
- Use VAD: Implement VAD to avoid sending silence to APIs
- Handle Errors: Always catch and handle transcription errors
- Set Timeouts: Implement idle timeouts to save resources
- Choose Appropriate Provider: Use Web Speech API for a free option with no API key, Whisper for accuracy
- Monitor Performance: Track latency and adjust buffer sizes
- Clean Up Sessions: Always stop sessions and release audio resources
Troubleshooting
Web Speech API Not Available
if (!('webkitSpeechRecognition' in window || 'SpeechRecognition' in window)) {
console.error('Web Speech API not supported')
// Fall back to batch transcription or different provider
}
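The availability check can be wrapped in a function that also chooses the fallback. In the sketch below, `getSpeechRecognitionCtor` and `pickTranscriptionProvider` are hypothetical helpers; the provider IDs are the ones used elsewhere on this page. Passing the global object as a parameter keeps the logic usable outside a browser.

```typescript
type SpeechRecognitionCtor = new () => unknown

// Return the standard or webkit-prefixed constructor, if either exists.
function getSpeechRecognitionCtor(scope: Record<string, unknown>): SpeechRecognitionCtor | undefined {
  const ctor = scope.SpeechRecognition ?? scope.webkitSpeechRecognition
  return typeof ctor === 'function' ? (ctor as SpeechRecognitionCtor) : undefined
}

// Prefer the browser-native provider and fall back when it is missing.
function pickTranscriptionProvider(scope: Record<string, unknown>, fallback: string): string {
  return getSpeechRecognitionCtor(scope) ? 'browser-web-speech-api' : fallback
}
```

In a browser this might be called as `pickTranscriptionProvider(globalThis as unknown as Record<string, unknown>, 'openai')`, with the result assigned to `hearingStore.activeTranscriptionProvider.value`.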
No Audio Detected
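// Check microphone permissions (not every browser accepts the 'microphone' permission name, so query() may reject)
const permissions = await navigator.permissions.query({ name: 'microphone' })
console.log('Microphone permission:', permissions.state)
// Test audio levels; stream is the MediaStream returned by getUserMedia
const audioContext = new AudioContext()
const source = audioContext.createMediaStreamSource(stream)
const analyser = audioContext.createAnalyser()
source.connect(analyser)
const dataArray = new Uint8Array(analyser.frequencyBinCount)
// Give the analyser time to receive audio before sampling
await new Promise(resolve => setTimeout(resolve, 500))
analyser.getByteTimeDomainData(dataArray)
// Byte samples are centered on 128 (silence), so measure the deviation from 128
const level = Math.max(...Array.from(dataArray, v => Math.abs(v - 128)))
console.log('Audio level:', level)
// A peak deviation near 0 means silence; the cutoff of 2 is a rough guess
if (level < 2) {
console.warn('No audio detected - check microphone')
}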
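A single peak value can be dominated by one click or pop. An RMS level over the same buffer gives a steadier reading. The sketch below assumes the unsigned byte format returned by `getByteTimeDomainData`, where 128 represents silence; `rmsLevel` is a hypothetical helper.

```typescript
// RMS level of byte time-domain data, normalized to the range 0..1.
function rmsLevel(timeDomain: Uint8Array): number {
  if (timeDomain.length === 0)
    return 0
  let sumSquares = 0
  for (let i = 0; i < timeDomain.length; i++) {
    // Re-center around zero and scale to roughly [-1, 1].
    const centered = (timeDomain[i] - 128) / 128
    sumSquares += centered * centered
  }
  return Math.sqrt(sumSquares / timeDomain.length)
}
```

A buffer of pure silence yields 0, and sampling this repeatedly (for example once per animation frame) shows whether the microphone signal ever rises above the noise floor.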
High Latency
// Keep the sample rate at 16 kHz to limit the data sent per second
await hearingPipeline.transcribeForMediaStream(mediaStream, {
sampleRate: 16000, // Lower sample rate
providerOptions: {
interimResults: false // Disable partial results
}
})
// Use Web Speech API for lowest latency
hearingStore.activeTranscriptionProvider.value = 'browser-web-speech-api'
VAD Not Triggering
// Lower threshold
vad.updateConfig({
speechThreshold: 0.2, // More sensitive
exitThreshold: 0.05
})
// Check VAD probability in real-time
vad.on('debug', ({ data }) => {
console.log('VAD score:', data.probability)
// Should be > threshold when speaking
})