
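6.2 Speech Models

Study time: 2-3 weeks

Speech models let AI "listen" and "speak". They are the core technology behind voice assistants, real-time translation, meeting transcription, and similar applications. This section covers the full technology stack for automatic speech recognition (ASR), text-to-speech synthesis (TTS), and speech understanding.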


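6.2.1 Speech Processing Basics

Core task categories

Task type | Description | Input | Output | Typical applications
Speech recognition (ASR) | Converts speech to text | Audio | Text | Voice input, subtitle generation
Speech synthesis (TTS) | Converts text to speech | Text | Audio | Audiobooks, voice assistants
Speaker recognition | Identifies who is speaking | Audio | Speaker ID | Voiceprint authentication, meeting minutes
Speech enhancement | Removes noise and echo | Noisy audio | Clean audio | Call noise suppression, recording cleanup
Speech emotion recognition | Identifies the emotional state | Audio | Emotion label | Customer-service quality checks, psychological analysis
Speech translation | Converts speech across languages | Audio | Translated text/audio | Real-time translation, international conferences

Basic audio concepts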

python
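"""
Audio parameters:

1. Sample rate
   - Number of samples per second, in Hz
   - Common values: 8000 (telephony), 16000 (speech recognition), 44100 (CD), 48000 (professional audio)
   - Recommended for speech recognition: 16000 Hz

2. Bit depth
   - Number of bits per sample
   - Common values: 16-bit, 24-bit, 32-bit
   - Recommended for speech recognition: 16-bit

3. Channels
   - Mono: 1 channel
   - Stereo: 2 channels
   - Recommended for speech recognition: mono

4. Audio formats
   - WAV: uncompressed (lossless), large files
   - MP3: lossy compression, small files
   - FLAC: lossless compression
   - Recommended for speech recognition: WAV or FLAC
"""

# Audio processing example
import librosa
import soundfile as sf
import numpy as np

def load_audio(file_path: str, target_sr: int = 16000):
    """Load an audio file as a mono float array at target_sr."""
    # librosa resamples to the target sample rate and downmixes to mono
    audio, sr = librosa.load(file_path, sr=target_sr, mono=True)
    return audio, sr

def save_audio(audio: np.ndarray, file_path: str, sr: int = 16000):
    """Save an audio array to a file."""
    sf.write(file_path, audio, sr)

def get_audio_duration(audio: np.ndarray, sr: int) -> float:
    """Return the audio duration in seconds."""
    return len(audio) / sr

# Usage example
audio, sr = load_audio("speech.wav", target_sr=16000)
duration = get_audio_duration(audio, sr)
print(f"Duration: {duration:.2f} s")
print(f"Sample rate: {sr} Hz")
print(f"Audio shape: {audio.shape}")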
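
The sample rate, bit depth, and channel count together determine how much raw data a recording produces. The following is a minimal sketch of that arithmetic, assuming uncompressed PCM and ignoring the container header. The `pcm_size_bytes` helper is hypothetical and not part of any library.

```python
def pcm_size_bytes(duration_s: float, sample_rate: int = 16000,
                   bit_depth: int = 16, channels: int = 1) -> int:
    """Size of uncompressed PCM audio in bytes (container header excluded)."""
    bytes_per_sample = bit_depth // 8
    return int(duration_s * sample_rate * bytes_per_sample * channels)

# One minute at the settings recommended for speech recognition
speech = pcm_size_bytes(60)                 # 16 kHz, 16-bit, mono
# One minute at CD quality
cd = pcm_size_bytes(60, 44100, 16, 2)       # 44.1 kHz, 16-bit, stereo

print(f"1 min speech audio: {speech / 1e6:.2f} MB")
print(f"1 min CD audio:     {cd / 1e6:.2f} MB")
```

Under these assumptions, the recommended speech settings need roughly a fifth of the storage of CD-quality stereo, which is one reason 16 kHz mono is a common choice for ASR pipelines.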

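6.2.2 Speech Recognition (ASR)

1. Speech recognition with Whisper (recommended)

Whisper is OpenAI's open-source multilingual speech recognition model. It supports 99 languages. Accuracy is generally high but varies with the language and the model size.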

python
# pip install openai-whisper

import whisper
import torch
from typing import Dict, List, Optional

class WhisperASR:
    """Whisper speech recognizer"""
    
    def __init__(self, model_size: str = "base"):
        """
        Args:
            model_size: model size (WER figures are rough and depend on the test set)
                - tiny: fastest, 39M parameters, English WER ~10%
                - base: 74M parameters, English WER ~7%
                - small: 244M parameters, English WER ~5%
                - medium: 769M parameters, English WER ~4%
                - large: 1550M parameters, English WER ~3%
                - large-v2: improved version
                - large-v3: latest version at the time of writing (recommended)
        """
        # Pick the device first so the model is loaded onto it explicitly
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"🔧 Loading Whisper {model_size} model...")
        self.model = whisper.load_model(model_size, device=self.device)
        print(f"✅ Model loaded on {self.device}")
    
    def transcribe(
        self,
        audio_path: str,
        language: Optional[str] = None,
        task: str = "transcribe",
        return_timestamps: bool = False
    ) -> Dict:
        """
        Transcribe an audio file
        
        Args:
            audio_path: path to the audio file
            language: language code (e.g. "zh", "en"); None means auto-detect
            task: "transcribe" or "translate" (translate into English)
            return_timestamps: whether to return timestamped segments
        
        Returns:
            {
                "text": "transcribed text",
                "language": "zh",
                "segments": [...]  # only if return_timestamps=True
            }
        """
        print(f"🎤 Transcribing audio: {audio_path}")
        
        # Transcribe; word_timestamps additionally attaches word-level timing
        result = self.model.transcribe(
            audio_path,
            language=language,
            task=task,
            verbose=False,
            word_timestamps=return_timestamps
        )
        
        # Collect the result
        output = {
            "text": result["text"].strip(),
            "language": result["language"]
        }
        
        if return_timestamps:
            output["segments"] = result["segments"]
        
        print(f"✅ Transcription done (language: {output['language']})")
        
        return output
    
    def transcribe_with_timestamps(self, audio_path: str) -> List[Dict]:
        """
        Transcribe and return timestamped segments
        
        Returns:
            [
                {"start": 0.0, "end": 2.5, "text": "你好"},
                {"start": 2.5, "end": 5.0, "text": "世界"}
            ]
        """
        result = self.transcribe(audio_path, return_timestamps=True)
        
        segments = []
        for seg in result["segments"]:
            segments.append({
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"].strip()
            })
        
        return segments
    
    def translate_to_english(self, audio_path: str) -> str:
        """
        Translate audio in any supported language into English
        
        Args:
            audio_path: path to the audio file
        
        Returns:
            English translation
        """
        result = self.transcribe(audio_path, task="translate")
        return result["text"]

# Usage example
asr = WhisperASR(model_size="base")

# Basic transcription
result = asr.transcribe("speech_chinese.wav")
print(f"Transcript: {result['text']}")
print(f"Detected language: {result['language']}")

# Transcription with timestamps
segments = asr.transcribe_with_timestamps("speech.wav")
print("\n📝 Segments:")
for seg in segments:
    print(f"[{seg['start']:.2f}s - {seg['end']:.2f}s] {seg['text']}")

# Translate into English
translation = asr.translate_to_english("speech_chinese.wav")
print(f"\n🌐 English translation: {translation}")

Example output (abridged)

🔧 Loading Whisper base model...
✅ Model loaded on cuda
🎤 Transcribing audio: speech_chinese.wav
✅ Transcription done (language: zh)
Transcript: 今天天气很好,我们一起去公园散步吧。
Detected language: zh

📝 Segments:
[0.00s - 2.50s] 今天天气很好,
[2.50s - 5.00s] 我们一起去公园散步吧。

🌐 English translation: The weather is nice today, let's go for a walk in the park together.
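
Timestamped segments like the ones above map naturally onto subtitle files, one of the ASR applications listed earlier. The following is a minimal sketch that turns a list of `{"start", "end", "text"}` dictionaries into SRT text. The helper names `format_srt_time` and `segments_to_srt` are hypothetical, and the demo segments are made up for illustration.

```python
from typing import Dict, List

def format_srt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments: List[Dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_time(seg["start"])
        end = format_srt_time(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)

# Made-up segments in the same shape as transcribe_with_timestamps() returns
demo = [
    {"start": 0.0, "end": 2.5, "text": "The weather is nice today,"},
    {"start": 2.5, "end": 5.0, "text": "let's go for a walk in the park."},
]
print(segments_to_srt(demo))
```

Rounding to whole milliseconds before splitting into hours, minutes, and seconds avoids cues such as `00:00:59,1000` that can appear when each field is rounded separately.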

2. Using Faster Whisper (accelerated version)

python
# pip install faster-whisper

from faster_whisper import WhisperModel
from typing import List, Dict, Optional

class FasterWhisperASR:
    """Faster Whisper (speed-optimized implementation)"""
    
    def __init__(
        self, 
        model_size: str = "base",
        device: str = "cuda",
        compute_type: str = "float16"
    ):
        """
        Args:
            model_size: model size
            device: "cuda" or "cpu"
            compute_type: "float16" (GPU) or "int8" (CPU)
        """
        self.model = WhisperModel(
            model_size,
            device=device,
            compute_type=compute_type
        )
        print(f"✅ Faster Whisper {model_size} loaded")
    
    def transcribe(
        self,
        audio_path: str,
        language: Optional[str] = None,
        beam_size: int = 5,
        vad_filter: bool = True
    ) -> Dict:
        """
        Transcribe audio (the project reports up to about 4x the speed of the
        original implementation; the actual gain depends on hardware and settings)
        
        Args:
            audio_path: path to the audio file
            language: language code
            beam_size: beam search width (larger is more accurate but slower)
            vad_filter: whether to use VAD to skip silence
        
        Returns:
            {"text": "...", "language": "...", "segments": [...], "duration": ...}
        """
        segments, info = self.model.transcribe(
            audio_path,
            language=language,
            beam_size=beam_size,
            vad_filter=vad_filter
        )
        
        # `segments` is a generator: decoding happens while we iterate over it
        all_segments = []
        full_text = []
        
        for segment in segments:
            all_segments.append({
                "start": segment.start,
                "end": segment.end,
                "text": segment.text.strip()
            })
            full_text.append(segment.text.strip())
        
        return {
            "text": " ".join(full_text),
            "language": info.language,
            "segments": all_segments,
            "duration": info.duration
        }

# Usage example (the default device is "cuda"; on a CPU-only machine
# pass device="cpu", compute_type="int8")
fast_asr = FasterWhisperASR(model_size="base", compute_type="float16")

result = fast_asr.transcribe("long_audio.wav", vad_filter=True)
print(f"Transcript: {result['text']}")
print(f"Duration: {result['duration']:.2f} s")
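
When comparing model sizes or a speed-optimized variant against the original, the usual yardstick is word error rate (WER): the number of word substitutions, deletions, and insertions divided by the number of words in the reference. The following is a minimal sketch using word-level edit distance on whitespace-separated text. The `word_error_rate` name is hypothetical, and real evaluations normally also normalize case and punctuation first.

```python
from typing import List

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words (classic two-row Levenshtein DP)
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.2%}")
```

For languages written without spaces, such as Chinese, the same routine is usually applied to characters instead of words (character error rate), for example by passing `" ".join(text)` for both strings.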

3. Using the Azure Speech API (cloud option)

python
# pip install azure-cognitiveservices-speech

import json

import azure.cognitiveservices.speech as speechsdk
from typing import Dict

class AzureSpeechASR:
    """Azure speech recognition"""
    
    def __init__(self, subscription_key: str, region: str):
        """
        Args:
            subscription_key: Azure subscription key
            region: region (e.g. "eastus")
        """
        self.speech_config = speechsdk.SpeechConfig(
            subscription=subscription_key,
            region=region
        )
        # Detailed output adds the NBest list (with confidence) to the JSON result
        self.speech_config.output_format = speechsdk.OutputFormat.Detailed
    
    def transcribe_file(
        self,
        audio_path: str,
        language: str = "zh-CN"
    ) -> Dict:
        """
        Transcribe a short audio file
        
        Note: recognize_once() returns after a single utterance (it stops at
        the first pause or after a short maximum duration). Use continuous
        recognition for long recordings.
        
        Args:
            audio_path: path to the audio file
            language: language code (zh-CN, en-US, ja-JP, ...)
        
        Returns:
            {"text": "...", "confidence": 0.95}
        """
        # Set the language
        self.speech_config.speech_recognition_language = language
        
        # Create the audio config
        audio_config = speechsdk.AudioConfig(filename=audio_path)
        
        # Create the recognizer
        recognizer = speechsdk.SpeechRecognizer(
            speech_config=self.speech_config,
            audio_config=audio_config
        )
        
        # Run recognition
        result = recognizer.recognize_once()
        
        if result.reason == speechsdk.ResultReason.RecognizedSpeech:
            # The JSON result property is a raw JSON string, not a score;
            # read the confidence of the top hypothesis if it is present
            raw = result.properties.get(
                speechsdk.PropertyId.SpeechServiceResponse_JsonResult
            )
            n_best = json.loads(raw).get("NBest", []) if raw else []
            return {
                "text": result.text,
                "confidence": n_best[0].get("Confidence") if n_best else None
            }
        elif result.reason == speechsdk.ResultReason.NoMatch:
            return {"error": "Speech could not be recognized"}
        else:
            return {"error": f"Recognition failed: {result.reason}"}
    
    def transcribe_realtime(self, language: str = "zh-CN"):
        """Real-time speech recognition (from the default microphone)"""
        self.speech_config.speech_recognition_language = language
        
        recognizer = speechsdk.SpeechRecognizer(
            speech_config=self.speech_config
        )
        
        print("🎤 Start speaking...")
        
        # Callback fired for each finalized phrase
        def recognized(evt):
            print(f"Recognized: {evt.result.text}")
        
        recognizer.recognized.connect(recognized)
        
        # Start continuous recognition
        recognizer.start_continuous_recognition()
        
        input("Press Enter to stop...\n")
        recognizer.stop_continuous_recognition()

# Usage example (requires an Azure account)
# asr = AzureSpeechASR(
#     subscription_key="YOUR_KEY",
#     region="eastus"
# )
# result = asr.transcribe_file("speech.wav", language="zh-CN")
# print(result["text"])

4. Hands-on: meeting transcription system

python
import os
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Optional

class MeetingTranscriber:
    """Meeting transcription system (reuses the WhisperASR class defined above)"""
    
    def __init__(self, model_size: str = "medium"):
        self.asr = WhisperASR(model_size=model_size)
    
    def transcribe_meeting(
        self,
        audio_path: str,
        output_dir: str = "./transcripts"
    ) -> str:
        """
        Transcribe meeting audio into timestamped text
        
        Args:
            audio_path: path to the meeting audio
            output_dir: output directory
        
        Returns:
            Path to the transcript file
        """
        print(f"📋 Transcribing meeting: {audio_path}\n")
        
        # Create the output directory
        os.makedirs(output_dir, exist_ok=True)
        
        # Transcribe
        segments = self.asr.transcribe_with_timestamps(audio_path)
        
        # Build the text
        transcript = self._format_transcript(segments)
        
        # Save
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        output_path = os.path.join(
            output_dir,
            f"meeting_transcript_{timestamp}.txt"
        )
        
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(transcript)
        
        print(f"\n✅ Transcription done: {output_path}")
        
        # Generate the summary (written next to the transcript)
        summary_path = self._generate_summary(segments, output_dir, timestamp)
        
        return output_path
    
    def _format_transcript(self, segments: List[Dict]) -> str:
        """Format the transcript text"""
        lines = ["# Meeting transcript\n"]
        lines.append(f"Generated at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
        lines.append("=" * 60 + "\n\n")
        
        for seg in segments:
            start_time = self._format_time(seg["start"])
            end_time = self._format_time(seg["end"])
            lines.append(f"[{start_time} - {end_time}]\n")
            lines.append(f"{seg['text']}\n\n")
        
        return "".join(lines)
    
    def _format_time(self, seconds: float) -> str:
        """Format a time (seconds -> HH:MM:SS)"""
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)
        return f"{hours:02d}:{minutes:02d}:{secs:02d}"
    
    def _generate_summary(
        self,
        segments: List[Dict],
        output_dir: str,
        timestamp: str
    ) -> Optional[str]:
        """Generate a meeting summary with an LLM; returns None on failure"""
        from openai import OpenAI
        
        client = OpenAI()
        
        # Join all text
        full_text = " ".join([seg["text"] for seg in segments])
        
        # Cap the prompt length outside the f-string, so that no code
        # comment leaks into the prompt text
        excerpt = full_text[:4000]
        
        # Ask the LLM for a summary
        prompt = f"""Please write a summary of the following meeting:

Meeting content:
{excerpt}

Please include:
1. Meeting topic
2. Key discussion points (3-5 items)
3. Action items (if any)
4. Decisions (if any)

Format: concise and clear, in Markdown."""
        
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.3
            )
            
            summary = response.choices[0].message.content
            
            # Save the summary
            summary_path = os.path.join(
                output_dir,
                f"meeting_summary_{timestamp}.md"
            )
            
            with open(summary_path, "w", encoding="utf-8") as f:
                f.write(summary)
            
            print(f"📝 Summary written: {summary_path}")
            
            return summary_path
        
        except Exception as e:
            print(f"⚠️  Summary generation failed: {e}")
            return None

# Usage example
transcriber = MeetingTranscriber(model_size="medium")
transcript_path = transcriber.transcribe_meeting("meeting_recording.wav")

print(f"\n📄 Transcript file: {transcript_path}")
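
The summary step above sends only the first 4000 characters of the transcript, so long meetings are summarized from their opening minutes only. One common alternative is to split the transcript into chunks, summarize each, and then summarize the summaries. The following is a minimal sketch of the chunking part only, assuming segments shaped like the ones returned by `transcribe_with_timestamps`. The `chunk_segments` helper and its 4000-character default are hypothetical.

```python
from typing import Dict, List

def chunk_segments(segments: List[Dict], max_chars: int = 4000) -> List[str]:
    """Group segment texts into chunks of at most max_chars characters.

    Segments are never split, so a single segment longer than max_chars
    becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for seg in segments:
        text = seg["text"].strip()
        if not text:
            continue
        # One extra character for the joining space, except at chunk start
        added = len(text) + (1 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append(" ".join(current))
            current, current_len = [], 0
            added = len(text)
        current.append(text)
        current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks

# Made-up segments: three 30-character texts with a 65-character limit
demo = [{"text": "a" * 30}, {"text": "b" * 30}, {"text": "c" * 30}]
for i, chunk in enumerate(chunk_segments(demo, max_chars=65), start=1):
    print(f"chunk {i}: {len(chunk)} chars")
```

Each chunk can then be passed to the same prompt used in `_generate_summary`, with a final call that merges the per-chunk summaries.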

6.2.3 Speech Synthesis (TTS)

1. Using Edge TTS (free, high quality)

python
# pip install edge-tts

import edge_tts
import asyncio
from typing import Dict, List, Optional

class EdgeTTS:
    """Edge TTS speech synthesis"""
    
    def __init__(self):
        self.voices = None
    
    async def list_voices(self, language: Optional[str] = None) -> List[Dict]:
        """
        List the available voices
        
        Args:
            language: locale prefix (e.g. "zh-CN", "en-US")
        
        Returns:
            List of voices
        """
        voices = await edge_tts.list_voices()
        
        if language:
            voices = [v for v in voices if v["Locale"].startswith(language)]
        
        return voices
    
    async def synthesize(
        self,
        text: str,
        output_path: str,
        voice: str = "zh-CN-XiaoxiaoNeural",
        rate: str = "+0%",
        volume: str = "+0%",
        pitch: str = "+0Hz"
    ):
        """
        Synthesize speech
        
        Args:
            text: text to synthesize
            output_path: output audio path
            voice: voice name
                Chinese:
                - zh-CN-XiaoxiaoNeural: Xiaoxiao (female, general purpose)
                - zh-CN-YunxiNeural: Yunxi (male)
                - zh-CN-YunyangNeural: Yunyang (male, news style)
                English:
                - en-US-JennyNeural: Jenny (female)
                - en-US-GuyNeural: Guy (male)
            rate: speaking rate (-50% to +100%)
            volume: volume (-50% to +100%)
            pitch: pitch (-50Hz to +50Hz)
        """
        communicate = edge_tts.Communicate(
            text=text,
            voice=voice,
            rate=rate,
            volume=volume,
            pitch=pitch
        )
        
        await communicate.save(output_path)
        print(f"✅ Audio saved: {output_path}")
    
    def synthesize_sync(self, text: str, output_path: str, **kwargs):
        """Synchronous wrapper around synthesize (do not call from a running event loop)"""
        asyncio.run(self.synthesize(text, output_path, **kwargs))

# Usage example
tts = EdgeTTS()

# List Chinese voices
voices = asyncio.run(tts.list_voices(language="zh-CN"))
print("🎤 Available Chinese voices:")
for v in voices[:5]:
    print(f"  • {v['ShortName']}: {v['FriendlyName']}")

# Synthesize speech (Chinese sample text to match the zh-CN voice)
text = "你好,我是人工智能语音助手。今天天气很好,适合出门散步。"
tts.synthesize_sync(
    text=text,
    output_path="output.mp3",
    voice="zh-CN-XiaoxiaoNeural",
    rate="+10%"  # slightly faster
)

Example output

🎤 Available Chinese voices:
  • zh-CN-XiaoxiaoNeural: Microsoft Xiaoxiao Online (Natural) - Chinese (Mainland)
  • zh-CN-YunxiNeural: Microsoft Yunxi Online (Natural) - Chinese (Mainland)
  • zh-CN-YunyangNeural: Microsoft Yunyang Online (Natural) - Chinese (Mainland)
  • zh-CN-XiaoyiNeural: Microsoft Xiaoyi Online (Natural) - Chinese (Mainland)
  • zh-CN-YunjianNeural: Microsoft Yunjian Online (Natural) - Chinese (Mainland)
✅ Audio saved: output.mp3

2. Using Coqui TTS (open source, customizable)

python
# pip install TTS

from TTS.api import TTS
import torch
from typing import Optional

class CoquiTTS:
    """Coqui TTS speech synthesis"""
    
    def __init__(self, model_name: str = "tts_models/zh-CN/baker/tacotron2-DDC-GST"):
        """
        Args:
            model_name: model name
                Chinese:
                - tts_models/zh-CN/baker/tacotron2-DDC-GST
                English:
                - tts_models/en/ljspeech/tacotron2-DDC
                - tts_models/en/vctk/vits (multi-speaker)
        """
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"🔧 Loading TTS model: {model_name}")
        self.tts = TTS(model_name=model_name).to(self.device)
        print(f"✅ Model loaded on {self.device}")
    
    def synthesize(
        self,
        text: str,
        output_path: str,
        speaker: Optional[str] = None
    ):
        """
        Synthesize speech
        
        Args:
            text: text to synthesize
            output_path: output path
            speaker: speaker name (multi-speaker models only)
        """
        if speaker:
            self.tts.tts_to_file(
                text=text,
                file_path=output_path,
                speaker=speaker
            )
        else:
            self.tts.tts_to_file(
                text=text,
                file_path=output_path
            )
        
        print(f"✅ Audio saved: {output_path}")
    
    def list_speakers(self):
        """List the available speakers (None for single-speaker models)"""
        if hasattr(self.tts, 'speakers'):
            return self.tts.speakers
        return None

# Usage example (Chinese sample text to match the default zh-CN model)
tts = CoquiTTS()
tts.synthesize(
    text="欢迎使用开源语音合成系统。",
    output_path="coqui_output.wav"
)

3. Using Azure Speech TTS (cloud option)

python
import azure.cognitiveservices.speech as speechsdk
from typing import Optional
from xml.sax.saxutils import escape

class AzureTTS:
    """Azure speech synthesis"""
    
    def __init__(self, subscription_key: str, region: str):
        self.speech_config = speechsdk.SpeechConfig(
            subscription=subscription_key,
            region=region
        )
    
    def synthesize(
        self,
        text: str,
        output_path: str,
        voice: str = "zh-CN-XiaoxiaoNeural",
        style: Optional[str] = None,
        rate: float = 1.0
    ):
        """
        Synthesize speech
        
        Args:
            text: text to synthesize
            output_path: output path
            voice: voice name
            style: speaking style (e.g. "cheerful", "sad", "angry"); availability depends on the voice
            rate: speaking-rate multiplier (0.5 - 2.0)
        """
        # Set the voice
        self.speech_config.speech_synthesis_voice_name = voice
        
        # Create the audio config
        audio_config = speechsdk.audio.AudioOutputConfig(
            filename=output_path
        )
        
        # Create the synthesizer
        synthesizer = speechsdk.SpeechSynthesizer(
            speech_config=self.speech_config,
            audio_config=audio_config
        )
        
        # Build SSML only when a style or a rate change is requested
        if style or rate != 1.0:
            ssml = self._build_ssml(text, voice, style, rate)
            result = synthesizer.speak_ssml_async(ssml).get()
        else:
            result = synthesizer.speak_text_async(text).get()
        
        if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
            print(f"✅ Audio saved: {output_path}")
        else:
            print(f"❌ Synthesis failed: {result.reason}")
    
    def _build_ssml(
        self,
        text: str,
        voice: str,
        style: Optional[str] = None,
        rate: float = 1.0
    ) -> str:
        """Build SSML"""
        # Azure reads a percentage as a relative change, so a 1.2x
        # multiplier is written as "+20%" rather than "120%"
        rate_str = f"{(rate - 1) * 100:+.0f}%"
        
        ssml = f"""
        <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" 
               xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="zh-CN">
            <voice name="{voice}">
        """
        
        if style:
            ssml += f'<mstts:express-as style="{style}">'
        
        # Escape &, <, > so the text cannot break the XML document
        ssml += f'<prosody rate="{rate_str}">{escape(text)}</prosody>'
        
        if style:
            ssml += '</mstts:express-as>'
        
        ssml += """
            </voice>
        </speak>
        """
        
        return ssml

# Usage example (requires an Azure account)
# tts = AzureTTS(subscription_key="YOUR_KEY", region="eastus")
# tts.synthesize(
#     text="今天天气真好!",
#     output_path="azure_output.wav",
#     voice="zh-CN-XiaoxiaoNeural",
#     style="cheerful",
#     rate=1.2
# )

4. Hands-on: audiobook generator

python
import re
from pathlib import Path
from typing import Dict, List

class AudiobookGenerator:
    """Audiobook generator (reuses the EdgeTTS class defined above)"""
    
    def __init__(self):
        self.tts = EdgeTTS()
    
    def generate_audiobook(
        self,
        text_file: str,
        output_dir: str = "./audiobook",
        voice: str = "zh-CN-XiaoxiaoNeural",
        chapters: bool = True
    ):
        """
        Generate an audiobook
        
        Args:
            text_file: path to the text file
            output_dir: output directory
            voice: voice name
            chapters: whether to split by chapter
        """
        print(f"📚 Generating audiobook: {text_file}\n")
        
        # Read the text
        with open(text_file, "r", encoding="utf-8") as f:
            content = f.read()
        
        # Create the output directory
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        
        if chapters:
            # Split by chapter
            chapter_list = self._split_chapters(content)
            
            for i, chapter in enumerate(chapter_list, 1):
                print(f"🎙️  Generating chapter {i}...")
                output_path = Path(output_dir) / f"chapter_{i:02d}.mp3"
                
                self.tts.synthesize_sync(
                    text=chapter["content"],
                    output_path=str(output_path),
                    voice=voice
                )
        else:
            # One file for the whole book
            print("🎙️  Generating full audio...")
            output_path = Path(output_dir) / "full_audiobook.mp3"
            
            self.tts.synthesize_sync(
                text=content,
                output_path=str(output_path),
                voice=voice
            )
        
        print(f"\n✅ Audiobook generated: {output_dir}")
    
    def _split_chapters(self, content: str) -> List[Dict]:
        """Split the text into chapters"""
        # Simple chapter split on "第X章" or "Chapter X" headings.
        # The 第...章 markers are required: without them every number
        # in the text would be treated as a chapter boundary.
        chapter_pattern = r'(第[一二三四五六七八九十百千\d]+章|Chapter\s+\d+)'
        
        chapters = []
        # With a capturing group, re.split alternates text and heading:
        # [preface, heading1, body1, heading2, body2, ...]
        parts = re.split(chapter_pattern, content)
        
        for i in range(1, len(parts), 2):
            if i + 1 < len(parts):
                chapters.append({
                    "title": parts[i].strip(),
                    "content": parts[i + 1].strip()
                })
        
        # No chapter markers found: treat the whole book as one chapter
        if not chapters:
            chapters.append({
                "title": "Full text",
                "content": content.strip()
            })
        
        return chapters

# Usage example
generator = AudiobookGenerator()
generator.generate_audiobook(
    text_file="novel.txt",
    output_dir="./my_audiobook",
    voice="zh-CN-XiaoxiaoNeural",
    chapters=True
)
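
The generator above sends each chapter to the TTS service in a single request. For very long chapters it may be more robust to synthesize smaller pieces and concatenate the audio afterwards. The following is a minimal sketch of the text-splitting step only, assuming Chinese sentence punctuation plus `!` and `?` as boundaries. The `split_for_tts` helper and its 500-character default are hypothetical, not a limit documented by any TTS service.

```python
import re
from typing import List

def split_for_tts(text: str, max_chars: int = 500) -> List[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    # Each match is a run of non-terminator characters followed by any
    # sentence-ending punctuation that closes it
    sentences = re.findall(r'[^。！？!?]+[。！？!?]*', text)

    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks

sample = "第一句。第二句！第三句？"
for i, chunk in enumerate(split_for_tts(sample, max_chars=8), start=1):
    print(i, chunk)
```

Splitting on sentence boundaries rather than at a fixed character offset keeps each request a complete utterance, which tends to preserve natural prosody at the joins.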

6.2.4 Speaker Recognition and Diarization

1. Speaker diarization

python
# pip install pyannote.audio

from pyannote.audio import Pipeline
import torch
from typing import Dict, List

class SpeakerDiarization:
    """Speaker diarization"""
    
    def __init__(self, hf_token: str):
        """
        Args:
            hf_token: HuggingFace access token
                Get one at: https://huggingface.co/settings/tokens
                Accept the user conditions of the model being loaded:
                https://huggingface.co/pyannote/speaker-diarization-3.1
                (and of the segmentation model it depends on)
        """
        self.pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.1",
            use_auth_token=hf_token
        )
        
        # Move to GPU if available
        if torch.cuda.is_available():
            self.pipeline.to(torch.device("cuda"))
    
    def diarize(self, audio_path: str) -> List[Dict]:
        """
        Find who speaks when
        
        Returns:
            [
                {"start": 0.0, "end": 2.5, "speaker": "SPEAKER_00"},
                {"start": 2.5, "end": 5.0, "speaker": "SPEAKER_01"}
            ]
        """
        print(f"🎤 Analyzing speakers: {audio_path}")
        
        # Run diarization
        diarization = self.pipeline(audio_path)
        
        # Collect the result
        segments = []
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            segments.append({
                "start": turn.start,
                "end": turn.end,
                "speaker": speaker
            })
        
        print(f"✅ Found {len(set(s['speaker'] for s in segments))} speaker(s)")
        
        return segments
    
    def transcribe_with_speakers(
        self,
        audio_path: str,
        asr_model: "WhisperASR"
    ) -> List[Dict]:
        """
        Transcribe and label speakers
        
        Args:
            audio_path: path to the audio file
            asr_model: an instance of the WhisperASR class defined earlier
        
        Returns:
            [
                {"start": 0.0, "end": 2.5, "speaker": "SPEAKER_00", "text": "你好"},
                {"start": 2.5, "end": 5.0, "speaker": "SPEAKER_01", "text": "你好"}
            ]
        """
        # 1. Speaker diarization
        speaker_segments = self.diarize(audio_path)
        
        # 2. Speech recognition
        print("\n🎤 Transcribing audio...")
        asr_segments = asr_model.transcribe_with_timestamps(audio_path)
        
        # 3. Match speakers to text
        print("\n🔗 Matching speakers to text...")
        result = self._match_segments(speaker_segments, asr_segments)
        
        return result
    
    def _match_segments(
        self,
        speaker_segments: List[Dict],
        asr_segments: List[Dict]
    ) -> List[Dict]:
        """Match speakers to transcribed text by segment midpoint"""
        result = []
        
        for asr_seg in asr_segments:
            asr_mid = (asr_seg["start"] + asr_seg["end"]) / 2
            
            # Find the speaker turn that contains the midpoint
            speaker = None
            for sp_seg in speaker_segments:
                if sp_seg["start"] <= asr_mid <= sp_seg["end"]:
                    speaker = sp_seg["speaker"]
                    break
            
            result.append({
                "start": asr_seg["start"],
                "end": asr_seg["end"],
                "speaker": speaker or "UNKNOWN",
                "text": asr_seg["text"]
            })
        
        return result

# Usage example (requires a HuggingFace token)
# diarizer = SpeakerDiarization(hf_token="YOUR_HF_TOKEN")
# asr = WhisperASR(model_size="base")

# # Transcribe and label speakers
# result = diarizer.transcribe_with_speakers("conversation.wav", asr)

# print("\n📝 Conversation transcript:")
# for seg in result:
#     print(f"[{seg['speaker']}] ({seg['start']:.2f}s - {seg['end']:.2f}s)")
#     print(f"  {seg['text']}\n")
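
Matching by midpoint labels a segment `UNKNOWN` whenever its midpoint falls in a gap between speaker turns, even if most of the segment overlaps one speaker. A common alternative is to pick the speaker with the largest total time overlap. The following is a minimal sketch of that idea, using the same dictionary shapes as above. The function names are hypothetical and the demo data is made up.

```python
from typing import Dict, List

def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers_by_overlap(
    speaker_segments: List[Dict], asr_segments: List[Dict]
) -> List[Dict]:
    """Label each ASR segment with the speaker it overlaps the longest."""
    result = []
    for asr_seg in asr_segments:
        totals: Dict[str, float] = {}
        for sp_seg in speaker_segments:
            shared = overlap(asr_seg["start"], asr_seg["end"],
                             sp_seg["start"], sp_seg["end"])
            if shared > 0:
                name = sp_seg["speaker"]
                totals[name] = totals.get(name, 0.0) + shared
        speaker = max(totals, key=totals.get) if totals else "UNKNOWN"
        result.append({**asr_seg, "speaker": speaker})
    return result

# Made-up diarization turns and ASR segments
turns = [
    {"start": 0.0, "end": 2.0, "speaker": "SPEAKER_00"},
    {"start": 2.0, "end": 6.0, "speaker": "SPEAKER_01"},
]
asr = [
    {"start": 0.0, "end": 1.8, "text": "hello"},
    {"start": 1.5, "end": 5.0, "text": "hi, nice to meet you"},
    {"start": 7.0, "end": 8.0, "text": "(no speaker turn here)"},
]
for seg in assign_speakers_by_overlap(turns, asr):
    print(seg["speaker"], seg["text"])
```

Summing overlap per speaker also handles the case where one speaker's turn is split into several short pieces inside a single ASR segment.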

6.2.5 Speech Emotion Recognition

python
# pip install transformers torch

from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification
import torch
import librosa
from typing import Dict

class SpeechEmotionRecognition:
    """Speech emotion recognition (this checkpoint is trained on English speech)"""
    
    def __init__(self):
        model_name = "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
        
        # An audio classifier needs only a feature extractor, not a tokenizer
        self.processor = AutoFeatureExtractor.from_pretrained(model_name)
        self.model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
        self.model.eval()
        
        # Read the label mapping from the model config instead of hard-coding
        # it, so the indices always match the checkpoint's classification head
        self.emotions = {
            int(i): label for i, label in self.model.config.id2label.items()
        }
    
    def recognize_emotion(self, audio_path: str) -> Dict:
        """
        Recognize the emotion in a speech recording
        
        Returns:
            {"emotion": "happy", "confidence": 0.85, "all_scores": {...}}
        """
        # Load the audio at the 16 kHz rate the model expects
        audio, sr = librosa.load(audio_path, sr=16000)
        
        # Preprocess
        inputs = self.processor(
            audio,
            sampling_rate=16000,
            return_tensors="pt",
            padding=True
        )
        
        # Inference
        with torch.no_grad():
            logits = self.model(**inputs).logits
            probs = torch.nn.functional.softmax(logits, dim=-1)[0]
        
        # Parse the result
        pred_id = torch.argmax(probs).item()
        emotion = self.emotions[pred_id]
        confidence = probs[pred_id].item()
        
        all_scores = {
            self.emotions[i]: probs[i].item()
            for i in range(len(self.emotions))
        }
        
        return {
            "emotion": emotion,
            "confidence": confidence,
            "all_scores": all_scores
        }

# Usage example
ser = SpeechEmotionRecognition()
result = ser.recognize_emotion("emotional_speech.wav")

print("🎭 Emotion recognition result:")
print(f"Top emotion: {result['emotion']} ({result['confidence']:.2%})")
print("\nAll emotion scores:")
for emotion, score in result['all_scores'].items():
    print(f"  • {emotion}: {score:.2%}")

6.2.6 Case Study: Intelligent Voice Assistant

python
"""
End-to-end example: intelligent voice assistant
Features:
1. Voice wake-up (not implemented below; recording uses fixed-length windows)
2. Speech recognition
3. Intent understanding (via an LLM)
4. Spoken reply
"""

from openai import OpenAI
import pyaudio
import wave
import numpy as np

class VoiceAssistant:
    """Intelligent voice assistant (reuses WhisperASR and EdgeTTS defined above)"""
    
    def __init__(self):
        print("🤖 Initializing voice assistant...")
        
        # Speech recognition
        self.asr = WhisperASR(model_size="base")
        
        # Speech synthesis
        self.tts = EdgeTTS()
        
        # LLM
        self.llm_client = OpenAI()
        
        # Conversation history
        self.conversation_history = []
        
        print("✅ Initialization complete")
    
    def listen(self, duration: int = 5, save_path: str = "temp_audio.wav") -> str:
        """
        Record from the microphone and transcribe
        
        Args:
            duration: recording length in seconds
            save_path: path of the temporary audio file
        
        Returns:
            Recognized text
        """
        print(f"🎤 Recording ({duration} s)...")
        
        # Recording parameters
        CHUNK = 1024
        FORMAT = pyaudio.paInt16
        CHANNELS = 1
        RATE = 16000
        
        # Initialize PyAudio
        p = pyaudio.PyAudio()
        
        # Open the audio stream
        stream = p.open(
            format=FORMAT,
            channels=CHANNELS,
            rate=RATE,
            input=True,
            frames_per_buffer=CHUNK
        )
        
        frames = []
        
        # Record
        for _ in range(0, int(RATE / CHUNK * duration)):
            data = stream.read(CHUNK)
            frames.append(data)
        
        print("✅ Recording finished")
        
        # Stop recording
        stream.stop_stream()
        stream.close()
        p.terminate()
        
        # Save the audio
        wf = wave.open(save_path, 'wb')
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(p.get_sample_size(FORMAT))
        wf.setframerate(RATE)
        wf.writeframes(b''.join(frames))
        wf.close()
        
        # Recognize
        print("🔍 Recognizing...")
        result = self.asr.transcribe(save_path)
        text = result["text"]
        
        print(f"📝 Recognized: {text}")
        
        return text
    
    def think(self, user_input: str) -> str:
        """
        Understand the input and generate a reply
        
        Args:
            user_input: user input
        
        Returns:
            Assistant reply
        """
        # Append to the history
        self.conversation_history.append({
            "role": "user",
            "content": user_input
        })
        
        # Call the LLM. The system prompt stays in Chinese ("You are a friendly
        # voice assistant; keep answers concise") to match the zh-CN voice.
        messages = [
            {"role": "system", "content": "你是一个友好的语音助手,回答要简洁明了。"}
        ] + self.conversation_history
        
        response = self.llm_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            temperature=0.7,
            max_tokens=200
        )
        
        assistant_reply = response.choices[0].message.content
        
        # Append to the history
        self.conversation_history.append({
            "role": "assistant",
            "content": assistant_reply
        })
        
        return assistant_reply
    
    def speak(self, text: str, output_path: str = "temp_response.mp3"):
        """
        Speak a reply
        
        Args:
            text: text to speak
            output_path: temporary audio file
        """
        print(f"🔊 Speaking: {text}")
        
        # Synthesize speech
        self.tts.synthesize_sync(
            text=text,
            output_path=output_path,
            voice="zh-CN-XiaoxiaoNeural",
            rate="+10%"
        )
        
        # Play the audio (requires playsound)
        try:
            from playsound import playsound
            playsound(output_path)
        except ImportError:
            print("⚠️  Cannot play audio (playsound is not installed)")
        except Exception as e:
            print(f"⚠️  Cannot play audio: {e}")
    
    def run(self):
        """Run the assistant"""
        print("\n" + "="*60)
        print("🤖 Voice assistant started")
        print("Say '退出' (exit) or '再见' (goodbye) to end the conversation")
        print("="*60 + "\n")
        
        while True:
            try:
                # 1. Listen
                user_input = self.listen(duration=5)
                
                # Check for an exit command (Chinese: exit / goodbye / bye-bye)
                if any(word in user_input for word in ["退出", "再见", "拜拜"]):
                    goodbye = "好的,再见!"  # "OK, goodbye!"
                    print(f"🤖 {goodbye}")
                    self.speak(goodbye)
                    break
                
                # 2. Think
                response = self.think(user_input)
                
                # 3. Reply
                self.speak(response)
                
                print("\n" + "-"*60 + "\n")
            
            except KeyboardInterrupt:
                print("\n\n👋 Assistant stopped")
                break
            except Exception as e:
                print(f"❌ Error: {e}")

# Usage example
assistant = VoiceAssistant()
assistant.run()

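6.2.7 Learning Resources

Recommended courses

  • Stanford CS224S: Speech Recognition and Synthesis
  • Fast.ai Audio: hands-on deep learning for audio

Open-source projects

  • Whisper: OpenAI speech recognition
  • Coqui TTS: open-source speech synthesis
  • SpeechBrain: speech processing toolkit
  • Silero Models: lightweight speech models

Paper reading

Practice exercises

  1. Build a meeting transcription system (ASR + speaker diarization + summarization)
  2. Develop an audiobook generator (TTS + chapter splitting)
  3. Implement real-time speech translation (ASR + translation + TTS)
  4. Build a speech emotion analysis system (emotion recognition + visualization)

Key takeaways

  • Whisper is a strong default for ASR: multilingual, accurate, open source
  • Edge TTS is free and high quality: well suited to rapid prototyping
  • Audio preprocessing matters: sample rate, noise reduction, volume normalization
  • Consider latency: choose an appropriate model size and optimization method
  • Multilingual support: Whisper supports 99 languages

Common mistakes

  • ❌ Ignoring audio quality (noise and echo hurt recognition)
  • ❌ Mismatched sample rates (cause recognition errors)
  • ❌ Overly long audio (exceeds the model's context limit)
  • ❌ Ignoring latency optimization (real-time applications need low latency)

Performance optimization

  • Use Faster Whisper (reported to be up to about 4x faster)
  • Process audio in batches (higher throughput)
  • Filter silence with VAD (less computation)
  • Quantize the model (lower memory use and latency)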
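
To make the VAD item concrete: voice activity detection marks which parts of a recording contain speech, so the recognizer can skip the rest. Production systems typically use a trained VAD model. The following is only a minimal energy-threshold sketch with NumPy. The function names, the 30 ms frame length, and the 0.02 RMS threshold are illustrative assumptions.

```python
import numpy as np
from typing import List, Tuple

def frame_rms(audio: np.ndarray, frame_len: int) -> np.ndarray:
    """RMS energy of consecutive non-overlapping frames (tail is dropped)."""
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1))

def speech_regions(audio: np.ndarray, sr: int = 16000, frame_ms: int = 30,
                   threshold: float = 0.02) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) spans whose frame RMS exceeds the threshold."""
    frame_len = int(sr * frame_ms / 1000)
    active = frame_rms(audio, frame_len) > threshold
    regions: List[Tuple[float, float]] = []
    start = None
    for i, is_speech in enumerate(active):
        if is_speech and start is None:
            start = i                      # a speech run begins
        elif not is_speech and start is not None:
            regions.append((start * frame_len / sr, i * frame_len / sr))
            start = None                   # the speech run ends
    if start is not None:                  # audio ends while still active
        regions.append((start * frame_len / sr, len(active) * frame_len / sr))
    return regions

# Synthetic test signal: 0.3 s silence, 0.6 s tone, 0.3 s silence
sr = 16000
t = np.arange(int(0.6 * sr)) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
signal = np.concatenate([np.zeros(int(0.3 * sr)), tone, np.zeros(int(0.3 * sr))])
print(speech_regions(signal, sr=sr))
```

A fixed threshold like this breaks down with background noise or quiet speakers, which is why trained VAD models are preferred in practice. The sketch only shows where the compute savings come from.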

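Next steps: after finishing this section, combine it with vision models (6.1) and multimodal fusion (6.3) to build complete multimodal AI applications such as video understanding and multimodal dialogue.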
