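name: voice-ai-development
description: "Expert in building voice AI applications, from real-time voice agents to voice-enabled apps. Covers the OpenAI Realtime API, Vapi voice agents, Deepgram transcription, ElevenLabs synthesis, LiveKit real-time infrastructure, and WebRTC fundamentals. Knows how to build low-latency, production-ready voice experiences. Use when: voice AI, voice agent, speech-to-text, text-to-speech, real-time voice."
source: vibeship-spawner-skills (Apache 2.0)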

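Voice AI Development

Role: Voice AI Architect

You are an expert in building real-time voice applications. You think in terms of latency budgets, audio quality, and user experience. You know that voice apps feel magical when they are fast and broken when they are slow. You choose the right combination of providers for each use case and optimize relentlessly for perceived responsiveness.

Capabilities

OpenAI Realtime API
Vapi voice agents
Deepgram STT/TTS
ElevenLabs voice synthesis
LiveKit real-time infrastructure
WebRTC audio handling
Voice agent design
Latency optimization

Requirements

Python or Node.js
API keys for the providers
Audio handling knowledge

Patterns

OpenAI Realtime API

Native speech-to-speech with GPT-4o

When to use: When you want integrated voice AI without separate STT/TTS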

import asyncio
import websockets
import json
import base64

OPENAI_API_KEY = "sk-..."

async def voice_session():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1"
    }

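    # Note: in websockets 14 and later, the default client takes
    # additional_headers instead of extra_headers.
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure the session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "alloy",  # e.g. alloy, echo, shimmer; see the Realtime API docs for the current list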
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "input_audio_transcription": {
                    "model": "whisper-1"
                },
                "turn_detection": {
                    "type": "server_vad",  # server-side voice activity detection
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 500
                },
                "tools": [
                    {
                        "type": "function",
                        "name": "get_weather",
                        "description": "Get the weather for a location",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "location": {"type": "string"}
                            }
                        }
                    }
                ]
            }
        }))

        # Send audio (PCM16, 24 kHz, mono)
        async def send_audio(audio_bytes):
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(audio_bytes).decode()
            }))

        # Receive events
        async for message in ws:
            event = json.loads(message)

            if event["type"] == "resp
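
The snippet above expects PCM16, 24 kHz, mono input. As a minimal sketch of preparing audio in that shape, the helpers below convert float samples to 16-bit little-endian PCM and split the bytes into base64-encoded input_audio_buffer.append events. The names floats_to_pcm16 and append_events and the 100 ms chunk size are illustrative assumptions, not part of any SDK.

```python
import base64
import json
import struct

SAMPLE_RATE = 24_000   # Hz, matching the "PCM16, 24 kHz, mono" comment above
BYTES_PER_SAMPLE = 2   # PCM16 is 16-bit signed, little-endian here


def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))


def append_events(pcm_bytes, chunk_ms=100):
    """Yield JSON input_audio_buffer.append events, one per chunk_ms of audio."""
    chunk_size = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm_bytes), chunk_size):
        chunk = pcm_bytes[start:start + chunk_size]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


# One second of silence becomes ten 100 ms events
one_second = floats_to_pcm16([0.0] * SAMPLE_RATE)
events = list(append_events(one_second))
```

Smaller chunks reduce the delay before the server sees speech, at the cost of more messages per second; 100 ms is only a starting point to tune.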

Vapi Voice Agents

Build voice agents with the Vapi platform

When to use: Phone-based agents, rapid deployment

# Vapi provides hosted voice agents with webhooks

from flask import Flask, request, jsonify
import vapi

app = Flask(__name__)
client = vapi.Vapi(api_key="...")

# Create an assistant
assistant = client.assistants.create(
    name="Support Agent",
    model={
        "provider": "openai",
        "model": "gpt-4o",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful support agent..."
            }
        ]
    },
    voice={
        "provider": "11labs",
        "voiceId": "21m00Tcm4TlvDq8ikWAM"  # Rachel
    },
    firstMessage="Hi! How can I help you today?",
    transcriber={
        "provider": "deepgram",
        "model": "nova-2"
    }
)

# Webhook for conversation events
@app.route("/vapi/webhook", methods=["POST"])
def vapi_webhook():
    event = request.json

    if event["type"] == "function-call":
        # Handle the tool call (check_order is an application-defined helper, not shown)
        name = event["functionCall"]["name"]
        args = event["functionCall"]["parameters"]

        if name == "check_order":
            result = check_order(args["order_id"])
            return jsonify({"result": result})

    elif event["type"] == "end-of-call-report":
        # Call ended: save the transcript (save_transcript is an application-defined helper, not shown)
        transcript = event["transcript"]
        save_transcript(event["call"]["id"], transcript)

    return jsonify({"ok": True})

# Start an outbound call
call = client.calls.create(
    assistant_id=assistant.id,
    customer={
        "number": "+1234567890"
    },
    phoneNumber={
        "twilioPhoneNumber": "+0987654321"
    }
)

# Or create a web call
web_call = client.calls.create(
    assistant_id=assistant.id,
    type="web"
)
# Returns the URL for the WebRTC connection

Deepgram STT + ElevenLabs TTS

Best-in-class transcription and synthesis

When to use: High-quality voice, custom pipelines

import asyncio
from deepgram import DeepgramClient, LiveTranscriptionEvents
from elevenlabs import ElevenLabs

# Deepgram real-time transcription
deepgram = DeepgramClient(api_key="...")

async def transcribe_stream(audio_stream):
    connection = deepgram.listen.live.v("1")

    async def on_transcript(result):
        transcript = result.channel.alternatives[0].transcript
        if transcript:
            print(f"Heard: {transcript}")
            if result.is_final:
                # Handle the final transcript
                await handle_user_input(transcript)

    connection.on(LiveTranscriptionEvents.Transcript, on_transcript)

    await connection.start({
        "model": "nova-2",  # best quality
        "language": "en",
        "smart_format": True,
        "interim_results": True,  # receive partial results
        "utterance_end_ms": 1000,
        "vad_events": True,  # voice activity detection events
        "encoding": "linear16",
        "sample_rate": 16000
    })

    # Stream the audio
    async for chunk in audio_stream:
        await connection.send(chunk)

    await connection.finish()

# ElevenLabs streaming synthesis
eleven = ElevenLabs(api_key="...")

def text_to_speech_stream(text: str):
    """Stream TTS audio chunks."""
    audio_stream = eleven.text_to_speech.convert_as_stream(
        voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel
        model_id="eleven_turbo_v2_5",  # fastest
        text=text,
        output_format="pcm_24000"  # raw PCM for low latency
    )

    for chunk in audio_stream:
        yield chunk

# Or use a WebSocket for the lowest latency
async def tts_websocket(text_stream):
    async with eleven.text_to_speech.stream_async(
        voice_id="21m00Tcm4TlvDq8ikWAM",
        model_id="eleven_turbo_v2_5"
    ) as tts:
        async for text_chunk in text_stream:
            audio = await tts.send(text_chunk)
            yield audio

        # Flush any remaining audio
        final_audio = await tts.flush()
        yield final_audio
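
With interim_results enabled, the transcript handler sees provisional text that later results replace, and only is_final segments are stable. The sketch below shows one way to keep the two apart so the UI can show live text while the LLM only receives finalized text. UtteranceBuffer is a hypothetical helper, not part of the Deepgram SDK, and it assumes the handler passes in the transcript string and the is_final flag.

```python
class UtteranceBuffer:
    """Collects finalized transcript segments and tracks the latest interim one."""

    def __init__(self):
        self._final_parts = []
        self.interim = ""

    def add(self, transcript, is_final):
        """Record one transcript result from the streaming connection."""
        if is_final:
            if transcript:
                self._final_parts.append(transcript)
            self.interim = ""  # the interim text is superseded by the final segment
        else:
            self.interim = transcript  # replace, never append, provisional text

    def text(self):
        """Best current guess: finalized segments plus the latest interim one."""
        return " ".join(p for p in [*self._final_parts, self.interim] if p)

    def flush(self):
        """Return the finalized utterance and reset, e.g. when the utterance ends."""
        utterance = " ".join(self._final_parts)
        self._final_parts = []
        self.interim = ""
        return utterance


buf = UtteranceBuffer()
buf.add("hello", is_final=False)
buf.add("hello world", is_final=True)
buf.add("how are", is_final=False)
live_text = buf.text()
```

Calling flush() when the provider signals the end of an utterance hands the LLM one complete turn instead of several fragments.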

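Anti-Patterns

❌ Non-Streaming Pipeline

Why it's bad: Adds seconds of latency. Users perceive it as slow. Breaks conversational flow.

Alternative: Stream everything:

STT: interim results
LLM: token streaming
TTS: chunk streaming

Start TTS before the LLM finishes.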
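
Starting TTS before the LLM finishes usually means cutting the token stream into speakable units. A minimal sketch, assuming the LLM client exposes an async iterator of text tokens: sentences_from_tokens yields each sentence as soon as the following token shows it is complete, so synthesis can begin on the first sentence while later ones are still being generated. fake_llm_tokens is a stand-in for a real token stream, and the punctuation-based split is a simplification.

```python
import asyncio
import re

# Split on whitespace that follows sentence-ending punctuation
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


async def sentences_from_tokens(token_stream):
    """Group streamed LLM tokens into sentences as each one completes."""
    buffer = ""
    async for token in token_stream:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Every part except the last is a complete sentence
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # whatever remains when the stream ends


async def fake_llm_tokens():
    """Stand-in for a real token stream (hypothetical)."""
    for token in ["Hello", " there.", " How", " can", " I", " help?", " Thanks"]:
        await asyncio.sleep(0)
        yield token


async def collect():
    return [s async for s in sentences_from_tokens(fake_llm_tokens())]


sentences = asyncio.run(collect())
```

Each yielded sentence can be passed straight to a streaming TTS call. Shorter units such as clauses lower the time to first audio further but can make prosody less natural.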

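❌ Ignoring Interruptions

Why it's bad: Frustrating user experience. Feels like talking to a machine. Wastes time.

Alternative: Implement barge-in detection. Use VAD to detect user speech. Stop TTS immediately. Clear the audio queue.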
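
The barge-in steps above can be sketched as follows. This is an illustration under simplifying assumptions: is_speech is a crude energy threshold standing in for a real VAD, and PlaybackQueue and on_mic_frame are hypothetical names. A production system would also cancel the in-flight TTS request and stop the audio device.

```python
import struct
from collections import deque


def is_speech(pcm16_frame, threshold=500):
    """Crude energy-based VAD: mean absolute amplitude above a threshold."""
    count = len(pcm16_frame) // 2
    if count == 0:
        return False
    samples = struct.unpack(f"<{count}h", pcm16_frame[:count * 2])
    return sum(abs(s) for s in samples) / count > threshold


class PlaybackQueue:
    """Holds TTS audio chunks not yet played and lets a barge-in discard them."""

    def __init__(self):
        self._chunks = deque()
        self.interrupted = False

    def enqueue(self, chunk):
        self._chunks.append(chunk)

    def pending(self):
        return len(self._chunks)

    def interrupt(self):
        """Drop all unplayed audio; return how many chunks were dropped."""
        dropped = len(self._chunks)
        self._chunks.clear()
        self.interrupted = True
        return dropped


def on_mic_frame(frame, playback):
    """Call for each microphone frame while the agent is speaking."""
    if is_speech(frame):
        playback.interrupt()
        return True
    return False


playback = PlaybackQueue()
for chunk in (b"a", b"b", b"c"):
    playback.enqueue(chunk)
quiet = bytes(8)                                     # four zero-valued samples
loud = struct.pack("<4h", 4000, -4000, 4000, -4000)  # clearly above the threshold
```

Feeding quiet leaves the queue untouched, while loud clears it and sets interrupted, which the playback loop can check before sending the next chunk to the speaker.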

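❌ Single-Provider Lock-in

Why it's bad: May not be the best quality. Single point of failure. Harder to optimize.

Alternative: Mix best-in-class providers:

Deepgram for STT (speed + accuracy)
ElevenLabs for TTS (voice quality)
OpenAI/Anthropic for LLM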
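
Mixing providers is easier when each stage sits behind a small interface, which also gives a natural place to handle the single-point-of-failure concern. A minimal sketch, assuming each TTS backend is wrapped as a callable that takes text and returns an iterable of audio chunks: synthesize_with_fallback, primary_tts, and backup_tts are hypothetical names, and the two backends are stand-ins for real provider calls.

```python
def synthesize_with_fallback(text, backends):
    """Stream audio from the first backend that produces a chunk.

    Fallback only happens before the first chunk; a failure mid-stream
    propagates, since switching voices mid-sentence would be audible.
    """
    for backend in backends:
        try:
            stream = iter(backend(text))
            first = next(stream)
        except StopIteration:
            return  # backend succeeded but produced no audio
        except Exception:
            continue  # try the next provider
        yield first
        yield from stream
        return
    raise RuntimeError("all TTS backends failed")


def primary_tts(text):
    """Stand-in for a provider call that is currently failing."""
    raise ConnectionError("provider unavailable")


def backup_tts(text):
    """Stand-in for a second provider that streams two chunks."""
    yield b"chunk-1"
    yield b"chunk-2"


audio = list(synthesize_with_fallback("Hello", [primary_tts, backup_tts]))
```

The same pattern applies to STT and LLM stages, and the ordered backend list doubles as a place to A/B test providers for latency and quality.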

Limitations

Latency varies by provider
Per-minute costs add up
Quality depends on the network
Complex debugging
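
Because latency varies by provider and pipelines are hard to debug, it helps to measure where each turn spends its time. The sketch below is one possible approach, not a prescribed tool: TurnTimer is a hypothetical helper that records how many milliseconds after the start of a turn each stage fires, and the stage names are illustrative.

```python
import time


class TurnTimer:
    """Records elapsed time from the start of a turn to each pipeline stage."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._start = clock()
        self.marks = {}

    def mark(self, stage):
        """Store and return milliseconds elapsed since the turn started."""
        elapsed_ms = (self._clock() - self._start) * 1000.0
        self.marks[stage] = elapsed_ms
        return elapsed_ms

    def stage_durations(self):
        """Milliseconds between consecutive marks, in the order recorded."""
        durations = {}
        previous = 0.0
        for stage, elapsed in self.marks.items():
            durations[stage] = elapsed - previous
            previous = elapsed
        return durations


# Deterministic demo with a fake clock; real code would use the default clock
ticks = iter([10.0, 10.25, 10.5, 11.0])
timer = TurnTimer(clock=lambda: next(ticks))
timer.mark("stt_final")
timer.mark("llm_first_token")
timer.mark("tts_first_audio")
durations = timer.stage_durations()
```

Logging these per-stage durations for every turn makes it easier to see which provider dominates the time to first audio before swapping anything out.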
