Speech Recognition

Overview

Airi provides speech recognition with support for multiple transcription providers, client-side Voice Activity Detection (VAD), and both streaming and batch transcription modes. Audio capture and VAD run in the browser; transcription is handled either by the browser's Web Speech API or by a remote provider.

Architecture

The speech recognition system consists of:

Voice Activity Detection (VAD): Client-side speech detection using Silero VAD (Transformers.js)
Audio Pipeline: Real-time audio capture, processing, and streaming
Transcription Providers: Multiple provider support (Web Speech API, Aliyun, OpenAI-compatible)
Streaming Support: Real-time transcription as the user speaks
Session Management: Handle continuous recognition with idle timeouts

Voice Activity Detection

VAD automatically detects when speech starts and ends, enabling hands-free interaction.

VAD Implementation

Airi uses the Silero VAD model via Hugging Face Transformers.js:

import { VAD, createVAD } from '@proj-airi/stage-ui/workers/vad'
import { AutoModel, Tensor } from '@huggingface/transformers'

// Create VAD instance
const vad = await createVAD({
  sampleRate: 16000,
  speechThreshold: 0.3,      // Speech detection threshold (0-1)
  exitThreshold: 0.1,        // Silence detection threshold
  minSilenceDurationMs: 400, // Min silence to end speech
  speechPadMs: 80,           // Padding around speech segments
  minSpeechDurationMs: 250,  // Min duration to count as speech
  maxBufferDuration: 30,     // Max recording duration (seconds)
  newBufferSize: 512         // Audio chunk size
})

// VAD uses Silero model from HuggingFace
// Model: 'onnx-community/silero-vad'
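
To make the roles of speechThreshold, exitThreshold, minSilenceDurationMs, and minSpeechDurationMs concrete, here is a minimal sketch of two-threshold (hysteresis) segmentation over per-frame speech probabilities. It is an illustration under assumptions, not Airi's actual implementation: `segmentSpeech`, `SegmenterConfig`, `Segment`, and the fixed `frameMs` are hypothetical names introduced here, and padding (speechPadMs) is left out.

```typescript
// Hypothetical config: mirrors the option names above, plus an assumed frame length.
interface SegmenterConfig {
  speechThreshold: number
  exitThreshold: number
  minSilenceDurationMs: number
  minSpeechDurationMs: number
  frameMs: number
}

interface Segment {
  startMs: number
  endMs: number
}

// Turns a sequence of per-frame speech probabilities into speech segments.
function segmentSpeech(probabilities: readonly number[], config: SegmenterConfig): Segment[] {
  const segments: Segment[] = []
  let speaking = false
  let startMs = 0
  let lastVoicedEndMs = 0 // end of the latest frame at or above exitThreshold

  const closeSegment = (): void => {
    // Drop segments that are too short to count as speech
    if (lastVoicedEndMs - startMs >= config.minSpeechDurationMs)
      segments.push({ startMs, endMs: lastVoicedEndMs })
    speaking = false
  }

  for (let i = 0; i < probabilities.length; i++) {
    const p = probabilities[i]
    const frameStartMs = i * config.frameMs
    const frameEndMs = frameStartMs + config.frameMs

    if (!speaking) {
      // Entering speech requires the higher threshold
      if (p >= config.speechThreshold) {
        speaking = true
        startMs = frameStartMs
        lastVoicedEndMs = frameEndMs
      }
      continue
    }

    // Staying in speech only requires the lower threshold
    if (p >= config.exitThreshold) {
      lastVoicedEndMs = frameEndMs
      continue
    }

    // Below exitThreshold: end the segment once the silence has lasted long enough
    if (frameEndMs - lastVoicedEndMs >= config.minSilenceDurationMs)
      closeSegment()
  }

  if (speaking)
    closeSegment()
  return segments
}
```

The gap between the two thresholds is what keeps a brief dip in probability from splitting one utterance into several segments.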

VAD Events

// Speech detection events
vad.on('speech-start', () => {
  console.log('User started speaking')
  // Show recording indicator
})

vad.on('speech-end', () => {
  console.log('User stopped speaking')
  // Hide recording indicator
})

vad.on('speech-ready', ({ buffer, duration }) => {
  console.log(`Speech segment: ${duration}ms`)
  // buffer is Float32Array of audio data
  // Process or send for transcription
})

vad.on('debug', ({ data }) => {
  // Real-time probability score
  const probability = data.probability  // 0-1
  console.log('Speech probability:', probability)
})

vad.on('status', ({ type, message }) => {
  if (type === 'info') {
    console.log('Status:', message)
  } else if (type === 'error') {
    console.error('Error:', message)
  }
})

Processing Audio with VAD

// Initialize VAD model
await vad.initialize()

// Process audio buffers
const audioBuffer = new Float32Array(512)  // 512 samples of 16 kHz mono audio (32 ms)
await vad.processAudio(audioBuffer)

// VAD automatically detects speech and emits events
// No need to manually check thresholds
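
Captured audio rarely arrives in exact 512-sample blocks, so a small helper can cut it into fixed-size frames first. The sketch below assumes the VAD is fed chunks matching newBufferSize; `splitIntoFrames` is a hypothetical helper, not part of Airi.

```typescript
// Splits samples into fixed-size frames and returns whatever is left over.
// Frames are views onto the input (subarray), not copies.
function splitIntoFrames(samples: Float32Array, frameSize = 512): {
  frames: Float32Array[]
  remainder: Float32Array
} {
  const frames: Float32Array[] = []
  let offset = 0
  for (; offset + frameSize <= samples.length; offset += frameSize)
    frames.push(samples.subarray(offset, offset + frameSize))

  // The remainder should be prepended to the next batch of captured samples
  return { frames, remainder: samples.subarray(offset) }
}
```

Each frame could then be passed to vad.processAudio(frame) in order. If frames are handed to a worker, copying them with slice() instead of subarray() avoids sharing the underlying buffer.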

VAD Configuration

// Update VAD settings dynamically
vad.updateConfig({
  speechThreshold: 0.5,  // Less sensitive
  exitThreshold: 0.2,
  minSilenceDurationMs: 600  // Wait longer before ending
})

// Sensitive settings (quiet environments)
const sensitiveVAD = await createVAD({
  speechThreshold: 0.2,
  exitThreshold: 0.05,
  minSilenceDurationMs: 300
})

// Conservative settings (noisy environments)
const conservativeVAD = await createVAD({
  speechThreshold: 0.5,
  exitThreshold: 0.25,
  minSilenceDurationMs: 600
})

Transcription Providers

Airi supports multiple speech recognition providers:

Web Speech API (Browser-Native)

import { useHearingStore } from '@proj-airi/stage-ui/stores/modules/hearing'

const hearingStore = useHearingStore()

// Configure Web Speech API
hearingStore.activeTranscriptionProvider.value = 'browser-web-speech-api'
hearingStore.activeTranscriptionModel.value = 'web-speech-api'

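// Web Speech API is free and needs no API key
// Note: in some browsers (e.g. Chrome) recognition runs on a remote service, so it may not work offline
// Available in Chrome, Edge, Safari
// Supports 50+ languages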

OpenAI Whisper

import { useHearingStore } from '@proj-airi/stage-ui/stores/modules/hearing'
import { useProvidersStore } from '@proj-airi/stage-ui/stores/providers'

const hearingStore = useHearingStore()
const providersStore = useProvidersStore()

// Configure provider
await providersStore.saveProviderConfig('openai', {
  apiKey: 'sk-...'
})

// Set transcription provider
hearingStore.activeTranscriptionProvider.value = 'openai'
hearingStore.activeTranscriptionModel.value = 'whisper-1'

// Supports multiple languages, high accuracy

Aliyun NLS (Streaming)

// Configure Aliyun for real-time streaming transcription
await providersStore.saveProviderConfig('aliyun-nls-transcription', {
  appKey: 'your-app-key',
  token: 'your-token',
  url: 'wss://nls-gateway.cn-shanghai.aliyuncs.com/ws/v1'
})

hearingStore.activeTranscriptionProvider.value = 'aliyun-nls-transcription'

Streaming Transcription

Real-Time Speech Recognition

import { useHearingSpeechInputPipeline } from '@proj-airi/stage-ui/stores/modules/hearing'

const hearingPipeline = useHearingSpeechInputPipeline()

// Get microphone access
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true,
    sampleRate: 16000
  }
})

// Start streaming transcription
await hearingPipeline.transcribeForMediaStream(stream, {
  sampleRate: 16000,
  idleTimeoutMs: 15000,  // Stop after 15s of idle
  
  // Callbacks for real-time results
  onSentenceEnd: (delta) => {
    // Called when a sentence completes
    console.log('Sentence:', delta)
    // Display or append to UI
  },
  
  onSpeechEnd: (text) => {
    // Called when speech session ends
    console.log('Full text:', text)
    // Send to chat or process
  },
  
  // Provider-specific options
  providerOptions: {
    language: 'en-US',
    continuous: true,      // Keep listening
    interimResults: true   // Show partial results
  }
})

// Stop transcription
const finalText = await hearingPipeline.stopStreamingTranscription()
console.log('Final transcription:', finalText)
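
If the UI needs the running transcript rather than individual sentences, the deltas from onSentenceEnd can be collected as they arrive. This is a minimal sketch that assumes each delta is one completed sentence; `TranscriptAccumulator` is a hypothetical helper, not an Airi class.

```typescript
// Collects sentence deltas and exposes the joined transcript.
class TranscriptAccumulator {
  private parts: string[] = []

  // Intended to be called from onSentenceEnd
  append(delta: string): void {
    const trimmed = delta.trim()
    if (trimmed)
      this.parts.push(trimmed)
  }

  // Sentences are joined with a space; languages written without
  // spaces (e.g. Chinese, Japanese) would want an empty separator instead.
  get text(): string {
    return this.parts.join(' ')
  }

  reset(): void {
    this.parts = []
  }
}
```

Usage would look like `onSentenceEnd: delta => transcript.append(delta)`, with `transcript.text` read whenever the display needs refreshing.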

Web Speech API Streaming

import { streamWebSpeechAPITranscription } from '@proj-airi/stage-ui/stores/providers/web-speech-api'

const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
const abortController = new AbortController()  // Call abortController.abort() to cancel

const result = streamWebSpeechAPITranscription(stream, {
  language: 'en-US',
  continuous: true,
  interimResults: true,
  maxAlternatives: 1,
  abortSignal: abortController.signal,
  
  onSentenceEnd: (delta) => {
    console.log('New text:', delta)
  },
  
  onSpeechEnd: (text) => {
    console.log('Complete:', text)
  }
})

// Stream text deltas
for await (const chunk of result.textStream) {
  console.log('Chunk:', chunk)
}

// Or get full text
const fullText = await result.text

Batch Transcription

Transcribe Audio File

import { useHearingSpeechInputPipeline, useHearingStore } from '@proj-airi/stage-ui/stores/modules/hearing'

const hearingStore = useHearingStore()
const hearingPipeline = useHearingSpeechInputPipeline()

// Transcribe from recorded audio
const audioBlob = new Blob([audioData], { type: 'audio/wav' })
const text = await hearingPipeline.transcribeForRecording(audioBlob)
console.log('Transcription:', text)
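
The snippet above takes audioData as given. When the audio comes from the VAD as a Float32Array (as in the speech-ready event), it has to be wrapped in a WAV container first. The following is a sketch of a 16-bit mono PCM encoder; `encodeWav` is a hypothetical helper and the 16 kHz default is an assumption taken from the VAD settings.

```typescript
// Encodes mono float samples (-1..1) as a 16-bit PCM WAV file.
function encodeWav(samples: Float32Array, sampleRate = 16000): ArrayBuffer {
  const bytesPerSample = 2
  const dataSize = samples.length * bytesPerSample
  const buffer = new ArrayBuffer(44 + dataSize)
  const view = new DataView(buffer)

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++)
      view.setUint8(offset + i, text.charCodeAt(i))
  }

  // RIFF header
  writeAscii(0, 'RIFF')
  view.setUint32(4, 36 + dataSize, true)
  writeAscii(8, 'WAVE')

  // fmt chunk
  writeAscii(12, 'fmt ')
  view.setUint32(16, 16, true)                          // fmt chunk size
  view.setUint16(20, 1, true)                           // format: PCM
  view.setUint16(22, 1, true)                           // channels: mono
  view.setUint32(24, sampleRate, true)
  view.setUint32(28, sampleRate * bytesPerSample, true) // byte rate
  view.setUint16(32, bytesPerSample, true)              // block align
  view.setUint16(34, 16, true)                          // bits per sample

  // data chunk
  writeAscii(36, 'data')
  view.setUint32(40, dataSize, true)
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]))
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7FFF
    view.setInt16(44 + i * bytesPerSample, scaled, true)
  }

  return buffer
}
```

The result can be passed straight to the Blob constructor, for example `new Blob([encodeWav(buffer)], { type: 'audio/wav' })`.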

Using Generate API

import { generateTranscription } from '@xsai/generate-transcription'
import { useProvidersStore } from '@proj-airi/stage-ui/stores/providers'

const providersStore = useProvidersStore()
const provider = await providersStore.getProviderInstance('openai')

const audioFile = new File([audioBlob], 'recording.wav')

const result = await generateTranscription({
  ...provider.transcription('whisper-1'),
  file: audioFile,
  responseFormat: 'verbose_json'  // or 'json', 'text'
})

console.log('Text:', result.text)
console.log('Language:', result.language)
console.log('Duration:', result.duration)

Audio Pipeline

Audio Stream Creation

// Internal: Create audio stream from MediaStream
async function createAudioStreamFromMediaStream(
  stream: MediaStream,
  sampleRate = 16000,
  onActivity?: () => void
) {
  const audioContext = new AudioContext({
    sampleRate,
    latencyHint: 'interactive'
  })
  
  // Load VAD worklet
  await audioContext.audioWorklet.addModule(vadWorkletUrl)
  const workletNode = new AudioWorkletNode(
    audioContext,
    'vad-audio-worklet-processor'
  )
  
  // Create readable stream for transcription
  // Assigned synchronously in start(), hence the definite-assignment assertion (!)
  let audioStreamController!: ReadableStreamDefaultController<ArrayBuffer>
  const audioStream = new ReadableStream<ArrayBuffer>({
    start(controller) {
      audioStreamController = controller
    }
  })
  
  // Process audio chunks
  workletNode.port.onmessage = ({ data }) => {
    const buffer = data?.buffer
    if (buffer) {
      // Convert Float32 to Int16 PCM
      const pcm16 = float32ToInt16(buffer)
      audioStreamController.enqueue(pcm16.buffer.slice(0))
      onActivity?.()
    }
  }
  
  // Connect audio nodes
  const mediaStreamSource = audioContext.createMediaStreamSource(stream)
  mediaStreamSource.connect(workletNode)
  
  // Create silent output to prevent echo
  const silentGain = audioContext.createGain()
  silentGain.gain.value = 0
  workletNode.connect(silentGain)
  silentGain.connect(audioContext.destination)
  
  return {
    audioContext,
    workletNode,
    mediaStreamSource,
    audioStream,
    controller: audioStreamController
  }
}
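
The listing above calls float32ToInt16 without showing it. One plausible definition is sketched below; it is an assumption about what such a helper does (clamp, then scale to the signed 16-bit range), not a copy of the project's code.

```typescript
// Converts float samples in the range -1..1 to signed 16-bit PCM.
function float32ToInt16(input: Float32Array): Int16Array {
  const output = new Int16Array(input.length)
  for (let i = 0; i < input.length; i++) {
    // Clamp first so out-of-range samples saturate instead of wrapping around
    const clamped = Math.max(-1, Math.min(1, input[i]))
    // Negative values scale to -32768, positive values to 32767
    output[i] = clamped < 0 ? clamped * 0x8000 : clamped * 0x7FFF
  }
  return output
}
```

Without the clamp, a sample slightly above 1.0 would overflow the 16-bit range and wrap to a large negative value, which is audible as a click.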

Session Management

Continuous Recognition

import { useHearingSpeechInputPipeline } from '@proj-airi/stage-ui/stores/modules/hearing'

const hearingPipeline = useHearingSpeechInputPipeline()

// Start continuous recognition
const stream = await navigator.mediaDevices.getUserMedia({ audio: true })

await hearingPipeline.transcribeForMediaStream(stream, {
  // No idle timeout = continuous
  idleTimeoutMs: 0,
  
  providerOptions: {
    continuous: true,
    interimResults: true
  },
  
  onSentenceEnd: (delta) => {
    // Process each completed sentence
    appendToChat(delta)
  }
})

// Session stays active until explicitly stopped
// or user disables microphone

Idle Timeout

// Auto-stop after period of inactivity
await hearingPipeline.transcribeForMediaStream(stream, {
  idleTimeoutMs: 15000,  // 15 seconds
  
  onSpeechEnd: (text) => {
    console.log('Session ended due to inactivity')
  }
})

// Timer resets on each audio activity
// Helps save API costs and resources
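
The reset-on-activity behaviour can be pictured with a small tracker. This is a sketch of the idea only, not the pipeline's internals; `IdleTracker` is a hypothetical class, and time is passed in explicitly so the logic stays easy to test. A timeout of 0 is treated as "never idle", matching the continuous mode shown earlier.

```typescript
// Tracks the time of the last audio activity and reports when the session is idle.
class IdleTracker {
  private readonly timeoutMs: number
  private lastActivityMs: number

  constructor(timeoutMs: number, nowMs: number) {
    this.timeoutMs = timeoutMs
    this.lastActivityMs = nowMs
  }

  // Call on every audio activity to reset the idle window
  touch(nowMs: number): void {
    this.lastActivityMs = nowMs
  }

  // True once timeoutMs has elapsed without activity; never true when timeoutMs is 0
  isIdle(nowMs: number): boolean {
    if (this.timeoutMs <= 0)
      return false
    return nowMs - this.lastActivityMs >= this.timeoutMs
  }
}
```

In a real session the caller would pass Date.now() or performance.now() and stop transcription when isIdle returns true.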

Manual Stop

// Stop current session
const finalText = await hearingPipeline.stopStreamingTranscription()

// Abort session (don't wait for final result)
await hearingPipeline.stopStreamingTranscription(true)

Language Support

Supported Languages

// Web Speech API supports 50+ languages
const languages = [
  'en-US',    // English (US)
  'en-GB',    // English (UK)
  'zh-CN',    // Chinese (Simplified)
  'zh-TW',    // Chinese (Traditional)
  'ja-JP',    // Japanese
  'ko-KR',    // Korean
  'es-ES',    // Spanish
  'fr-FR',    // French
  'de-DE',    // German
  'it-IT',    // Italian
  'pt-BR',    // Portuguese (Brazil)
  'ru-RU',    // Russian
  // ... and many more
]

// Set language in provider options
await hearingPipeline.transcribeForMediaStream(stream, {
  providerOptions: {
    language: 'ja-JP'
  }
})
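
Rather than hard-coding a language, the tag can be chosen from the user's preferences. The sketch below matches preferred tags against a supported list, first exactly and then by base language; `pickLanguage` is a hypothetical helper and the 'en-US' fallback is an assumption.

```typescript
// Picks the first supported tag that matches the user's preferences.
function pickLanguage(
  preferred: readonly string[],
  supported: readonly string[],
  fallback = 'en-US'
): string {
  for (const tag of preferred) {
    // 1. Exact match, ignoring case
    const exact = supported.find(s => s.toLowerCase() === tag.toLowerCase())
    if (exact)
      return exact

    // 2. Same base language, e.g. 'ja' matches 'ja-JP'
    const base = tag.split('-')[0].toLowerCase()
    const sameBase = supported.find(s => s.split('-')[0].toLowerCase() === base)
    if (sameBase)
      return sameBase
  }
  return fallback
}
```

In the browser this could be called as `pickLanguage(navigator.languages, languages)`. Base-language matching is coarse: it would map 'zh-TW' to 'zh-CN' if only the latter is listed, so region-sensitive languages may need explicit handling.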

Auto Language Detection

// Some providers support auto-detection
const result = await generateTranscription({
  ...provider.transcription('whisper-1'),
  file: audioFile,
  language: undefined  // Auto-detect
})

console.log('Detected language:', result.language)

Performance Optimization

Sample Rate

// Lower sample rate = less bandwidth, faster processing
// 16 kHz is the usual rate for speech models, including Silero VAD
await hearingPipeline.transcribeForMediaStream(stream, {
  sampleRate: 16000  // Standard for speech
})

// Higher sample rate (more high-frequency detail, but larger data)
const highQuality = 24000  // Rarely needed for speech; check that the provider accepts it

VAD Tuning

// Adjust VAD for faster response
const fastVAD = await createVAD({
  speechThreshold: 0.3,
  minSilenceDurationMs: 300,  // End speech faster
  minSpeechDurationMs: 150,   // Shorter minimum
  newBufferSize: 256          // Smaller chunks = lower latency
})

// Quality over speed
const qualityVAD = await createVAD({
  speechThreshold: 0.4,
  minSilenceDurationMs: 600,  // Wait longer for pauses
  speechPadMs: 120,           // More padding
  newBufferSize: 512
})
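
The latency effect of newBufferSize is plain arithmetic: a chunk of N samples at rate R covers N / R seconds. The helpers below work this out, assuming the VAD makes one decision per chunk; `frameDurationMs` and `framesToCover` are hypothetical names.

```typescript
// Duration of one audio chunk in milliseconds.
function frameDurationMs(frameSize: number, sampleRate: number): number {
  return (frameSize * 1000) / sampleRate
}

// Number of whole chunks needed to cover a duration, e.g. minSilenceDurationMs.
function framesToCover(durationMs: number, frameSize: number, sampleRate: number): number {
  return Math.ceil(durationMs / frameDurationMs(frameSize, sampleRate))
}

console.log(frameDurationMs(512, 16000))    // 32
console.log(frameDurationMs(256, 16000))    // 16
console.log(framesToCover(300, 256, 16000)) // 19
console.log(framesToCover(600, 512, 16000)) // 19
```

So at 16 kHz, halving the chunk size from 512 to 256 samples cuts the per-chunk delay from 32 ms to 16 ms, which is small next to a minSilenceDurationMs of several hundred milliseconds. The silence duration is usually the larger lever for responsiveness.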

Model Selection

// Web Speech API = Fast, free; works offline only where the browser recognizes on-device
// Good for: Real-time interaction without an API key

// OpenAI Whisper = High accuracy, supports many languages
// Good for: Important transcriptions, multilingual

// Aliyun NLS = Streaming, low latency
// Good for: Real-time Chinese speech recognition

Error Handling

Common Errors

import { useHearingSpeechInputPipeline } from '@proj-airi/stage-ui/stores/modules/hearing'

const hearingPipeline = useHearingSpeechInputPipeline()

try {
  await hearingPipeline.transcribeForMediaStream(stream)
} catch (error) {
  // Caught values are `unknown` in TypeScript, so narrow before reading `name`
  const name = error instanceof Error ? error.name : ''
  if (name === 'NotAllowedError') {
    console.error('Microphone permission denied')
  } else if (name === 'NotFoundError') {
    console.error('No microphone found')
  } else if (name === 'AbortError') {
    console.log('Transcription was cancelled')
  } else {
    console.error('Transcription error:', error)
  }
}

// Check error state
if (hearingPipeline.error.value) {
  console.error('Pipeline error:', hearingPipeline.error.value)
}

Retry Logic

async function transcribeWithRetry(
  stream: MediaStream,
  maxRetries = 3
) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      await hearingPipeline.transcribeForMediaStream(stream)
      return
    } catch (error) {
      if (i === maxRetries - 1) throw error
      
      console.log(`Retry ${i + 1}/${maxRetries}...`)
      await new Promise(resolve => setTimeout(resolve, 1000 * (i + 1)))
    }
  }
}
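
Retrying only helps with transient failures. A denied permission, a missing microphone, or a deliberate cancel will fail the same way every time. The sketch below separates the two cases and factors out the linear delay used above; `isRetryable` and `retryDelayMs` are hypothetical helpers, and the list of non-retryable names is an assumption based on the errors listed under Common Errors.

```typescript
// Error names for which another attempt cannot succeed without user action.
const NON_RETRYABLE_ERROR_NAMES = new Set(['NotAllowedError', 'NotFoundError', 'AbortError'])

// Returns false for errors that retrying cannot fix.
function isRetryable(error: unknown): boolean {
  const name = error instanceof Error ? error.name : ''
  return !NON_RETRYABLE_ERROR_NAMES.has(name)
}

// Linear backoff: 1s, 2s, 3s, ... for attempt 0, 1, 2, ...
function retryDelayMs(attempt: number, baseMs = 1000): number {
  return baseMs * (attempt + 1)
}
```

Inside the catch block of a retry loop, a check such as `if (!isRetryable(error)) throw error` would stop early instead of waiting through every attempt.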

Best Practices

Check Support: Verify Web Speech API availability before using
Request Permissions Early: Get microphone access before user needs it
Use VAD: Implement VAD to avoid sending silence to APIs
Handle Errors: Always catch and handle transcription errors
Set Timeouts: Implement idle timeouts to save resources
Choose Appropriate Provider: Use Web Speech API for a free option that needs no API key, Whisper for accuracy
Monitor Performance: Track latency and adjust buffer sizes
Cleanup Sessions: Always stop and cleanup audio resources

Troubleshooting

Web Speech API Not Available

if (!('webkitSpeechRecognition' in window || 'SpeechRecognition' in window)) {
  console.error('Web Speech API not supported')
  // Fall back to batch transcription or different provider
}

No Audio Detected

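// Check microphone permissions (the 'microphone' permission name is not supported in every browser)
const permissions = await navigator.permissions.query({ name: 'microphone' as PermissionName })
console.log('Microphone permission:', permissions.state)

// Test audio levels (`stream` is the MediaStream from getUserMedia)
const audioContext = new AudioContext()
const source = audioContext.createMediaStreamSource(stream)
const analyser = audioContext.createAnalyser()
source.connect(analyser)

// Read the samples once audio has been flowing for a moment
const dataArray = new Uint8Array(analyser.fftSize)
analyser.getByteTimeDomainData(dataArray)

// Time-domain bytes are centered on 128 (silence), so measure the peak deviation from 128
let level = 0
for (const sample of dataArray) {
  level = Math.max(level, Math.abs(sample - 128))
}
console.log('Audio level (0-128):', level)

// The cutoff of 2 is an arbitrary heuristic for "effectively silent"
if (level < 2) {
  console.warn('No audio detected - check microphone')
}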

High Latency

// Keep the audio payload small
await hearingPipeline.transcribeForMediaStream(mediaStream, {
  sampleRate: 16000,  // Lower sample rate
  providerOptions: {
    interimResults: false  // Skip partial results if handling them is the bottleneck
  }
})

// Use Web Speech API for lowest latency
hearingStore.activeTranscriptionProvider.value = 'browser-web-speech-api'

VAD Not Triggering

// Lower threshold
vad.updateConfig({
  speechThreshold: 0.2,  // More sensitive
  exitThreshold: 0.05
})

// Check VAD probability in real-time
vad.on('debug', ({ data }) => {
  console.log('VAD score:', data.probability)
  // Should be > threshold when speaking
})
