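6.2 Speech Models
Estimated study time: 2-3 weeks
Speech models let AI systems "listen" and "speak", and they are the core technology behind voice assistants, real-time translation, and meeting transcription. This section covers the full stack: automatic speech recognition (ASR), text-to-speech synthesis (TTS), and speech understanding.
6.2.1 Speech Processing Basics
Core task categories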
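| Task | Description | Input | Output | Typical applications |
|---|---|---|---|---|
| Speech recognition (ASR) | Convert speech to text | Audio | Text | Voice input, subtitle generation |
| Speech synthesis (TTS) | Convert text to speech | Text | Audio | Audiobooks, voice assistants |
| Speaker recognition | Identify who is speaking | Audio | Speaker ID | Voiceprint authentication, meeting minutes |
| Speech enhancement | Remove noise and echo | Noisy audio | Clean audio | Call noise suppression, recording cleanup |
| Speech emotion recognition | Identify emotional state | Audio | Emotion label | Call-center quality checks, psychological analysis |
| Speech translation | Convert speech across languages | Audio | Translated text/audio | Real-time translation, international conferences |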
Basic audio concepts
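Sample rate, bit depth, and channel count (described in the notes in this subsection) together determine how much raw data a recording occupies. The following is a minimal arithmetic sketch; `pcm_size_bytes` is a hypothetical helper name, and it assumes uncompressed PCM with no container header.

```python
def pcm_size_bytes(duration_s: float, sample_rate: int = 16000,
                   bit_depth: int = 16, channels: int = 1) -> int:
    """Size of uncompressed PCM audio in bytes, excluding any file header."""
    bytes_per_sample = bit_depth // 8
    return int(duration_s * sample_rate * channels * bytes_per_sample)


# One minute at the ASR-friendly settings: 16 kHz, 16-bit, mono
print(pcm_size_bytes(60))  # 1920000

# One minute at CD settings: 44.1 kHz, 16-bit, stereo
print(pcm_size_bytes(60, sample_rate=44100, channels=2))  # 10584000
```

The roughly 5x difference is one reason speech pipelines usually downmix to mono and resample to 16 kHz before recognition.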
python
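"""
Audio parameters:
1. Sample rate
   - Number of samples per second, in Hz
   - Common values: 8000 (telephony), 16000 (speech recognition), 44100 (CD), 48000 (professional audio)
   - Recommended for ASR: 16000 Hz
2. Bit depth
   - Number of bits per sample
   - Common values: 16-bit, 24-bit, 32-bit
   - Recommended for ASR: 16-bit
3. Channels
   - Mono: 1 channel
   - Stereo: 2 channels
   - Recommended for ASR: mono
4. Audio formats
   - WAV: uncompressed (lossless), large files
   - MP3: lossy compression, small files
   - FLAC: lossless compression
   - Recommended for ASR: WAV or FLAC
"""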
# Audio processing example
import librosa
import numpy as np
import soundfile as sf


def load_audio(file_path: str, target_sr: int = 16000):
    """Load an audio file as a mono float array at the target sample rate."""
    # librosa resamples to target_sr and downmixes to mono while loading
    audio, sr = librosa.load(file_path, sr=target_sr, mono=True)
    return audio, sr


def save_audio(audio: np.ndarray, file_path: str, sr: int = 16000):
    """Save an audio array to a file."""
    sf.write(file_path, audio, sr)


def get_audio_duration(audio: np.ndarray, sr: int) -> float:
    """Return the audio duration in seconds."""
    return len(audio) / sr


# Usage example
audio, sr = load_audio("speech.wav", target_sr=16000)
duration = get_audio_duration(audio, sr)
print(f"Duration: {duration:.2f} s")
print(f"Sample rate: {sr} Hz")
print(f"Audio shape: {audio.shape}")
6.2.2 Automatic Speech Recognition (ASR)
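1. Speech recognition with Whisper (recommended)
Whisper is OpenAI's open-source multilingual speech recognition model. It supports 99 languages and is highly accurate, although accuracy varies with the language and the audio quality.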
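ASR accuracy is usually reported as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the model output into the reference transcript, divided by the number of reference words. The sketch below is a plain edit-distance implementation for illustration; `word_error_rate` is a hypothetical helper, not part of Whisper, and it assumes whitespace-separated words with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the): 2 errors over 6 words
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

For Chinese, which has no spaces between words, the same computation is typically run over characters (character error rate), for example by comparing `list(reference)` with `list(hypothesis)`.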
python
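# pip install openai-whisper
from typing import Dict, List

import torch
import whisper


class WhisperASR:
    """Speech recognizer built on Whisper."""

    def __init__(self, model_size: str = "base"):
        """
        Args:
            model_size: Whisper checkpoint to load. The WER figures are rough
                English estimates and vary considerably with the test set.
                - tiny: fastest, 39M parameters, English WER ~10%
                - base: 74M parameters, English WER ~7%
                - small: 244M parameters, English WER ~5%
                - medium: 769M parameters, English WER ~4%
                - large: 1550M parameters, English WER ~3%
                - large-v2: improved version of large
                - large-v3: latest version (recommended)
        """
        print(f"🔧 Loading Whisper {model_size} model...")
        # Pick the device first so the model is loaded where we report it
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = whisper.load_model(model_size, device=self.device)
        print(f"✅ Model loaded on {self.device}")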
    def transcribe(
        self,
        audio_path: str,
        language: str = None,
        task: str = "transcribe",
        return_timestamps: bool = False
    ) -> Dict:
        """
        Transcribe an audio file.
        Args:
            audio_path: Path to the audio file
            language: Language code (e.g. "zh", "en"); None means auto-detect
            task: "transcribe" or "translate" (translate into English)
            return_timestamps: Whether to include timestamped segments
        Returns:
            {
                "text": "transcribed text",
                "language": "zh",
                "segments": [...]  # only when return_timestamps=True
            }
        """
        print(f"🎤 Transcribing: {audio_path}")
        # Run the model; word_timestamps adds word-level timing to each segment
        result = self.model.transcribe(
            audio_path,
            language=language,
            task=task,
            verbose=False,
            word_timestamps=return_timestamps
        )
        # Collect the fields we expose
        output = {
            "text": result["text"].strip(),
            "language": result["language"]
        }
        if return_timestamps:
            output["segments"] = result["segments"]
        print(f"✅ Transcription finished (language: {output['language']})")
        return output
    def transcribe_with_timestamps(self, audio_path: str) -> List[Dict]:
        """
        Transcribe and return timestamped segments.
        Returns:
            [
                {"start": 0.0, "end": 2.5, "text": "你好"},
                {"start": 2.5, "end": 5.0, "text": "世界"}
            ]
        """
        result = self.transcribe(audio_path, return_timestamps=True)
        segments = []
        for seg in result["segments"]:
            segments.append({
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"].strip()
            })
        return segments

    def translate_to_english(self, audio_path: str) -> str:
        """
        Translate speech in any supported language into English text.
        Args:
            audio_path: Path to the audio file
        Returns:
            The English translation
        """
        result = self.transcribe(audio_path, task="translate")
        return result["text"]
# Usage example
asr = WhisperASR(model_size="base")
# Basic transcription
result = asr.transcribe("speech_chinese.wav")
print(f"Transcript: {result['text']}")
print(f"Detected language: {result['language']}")
# Transcription with timestamps
segments = asr.transcribe_with_timestamps("speech.wav")
print("\n📝 Segments:")
for seg in segments:
    print(f"[{seg['start']:.2f}s - {seg['end']:.2f}s] {seg['text']}")
# Translate into English
translation = asr.translate_to_english("speech_chinese.wav")
print(f"\n🌐 English translation: {translation}")
Example output
🔧 Loading Whisper base model...
✅ Model loaded on cuda
🎤 Transcribing: speech_chinese.wav
✅ Transcription finished (language: zh)
Transcript: 今天天气很好,我们一起去公园散步吧。
Detected language: zh
📝 Segments:
[0.00s - 2.50s] 今天天气很好,
[2.50s - 5.00s] 我们一起去公园散步吧。
🌐 English translation: The weather is nice today, let's go for a walk in the park together.
2. Using Faster Whisper (accelerated version)
python
# pip install faster-whisper
from typing import Dict, List

from faster_whisper import WhisperModel


class FasterWhisperASR:
    """Faster Whisper (speed-optimized reimplementation)."""

    def __init__(
        self,
        model_size: str = "base",
        device: str = "cuda",
        compute_type: str = "float16"
    ):
        """
        Args:
            model_size: Model size
            device: "cuda" or "cpu"
            compute_type: "float16" (GPU) or "int8" (CPU)
        """
        self.model = WhisperModel(
            model_size,
            device=device,
            compute_type=compute_type
        )
        print(f"✅ Faster Whisper {model_size} loaded")
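    def transcribe(
        self,
        audio_path: str,
        language: str = None,
        beam_size: int = 5,
        vad_filter: bool = True
    ) -> Dict:
        """
        Transcribe an audio file. The project reports up to about 4x the speed
        of the original implementation; the real gain depends on hardware.
        Args:
            audio_path: Path to the audio file
            language: Language code (None means auto-detect)
            beam_size: Beam search width (larger is more accurate but slower)
            vad_filter: Whether to skip silence using voice activity detection
        Returns:
            {"text": "...", "language": "...", "segments": [...], "duration": ...}
        """
        # segments is a lazy generator: decoding happens while we iterate over it
        segments, info = self.model.transcribe(
            audio_path,
            language=language,
            beam_size=beam_size,
            vad_filter=vad_filter
        )
        # Collect all segments
        all_segments = []
        full_text = []
        for segment in segments:
            all_segments.append({
                "start": segment.start,
                "end": segment.end,
                "text": segment.text.strip()
            })
            full_text.append(segment.text.strip())
        return {
            "text": " ".join(full_text),
            "language": info.language,
            "segments": all_segments,
            "duration": info.duration
        }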
# Usage example
fast_asr = FasterWhisperASR(model_size="base", compute_type="float16")
result = fast_asr.transcribe("long_audio.wav", vad_filter=True)
print(f"Transcript: {result['text']}")
print(f"Duration: {result['duration']:.2f} s")
3. Using the Azure Speech API (cloud option)
python
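# pip install azure-cognitiveservices-speech
import json
from typing import Dict

import azure.cognitiveservices.speech as speechsdk


class AzureSpeechASR:
    """Speech recognition with Azure Speech."""

    def __init__(self, subscription_key: str, region: str):
        """
        Args:
            subscription_key: Azure subscription key
            region: Service region (e.g. "eastus")
        """
        self.speech_config = speechsdk.SpeechConfig(
            subscription=subscription_key,
            region=region
        )
        # Detailed output adds per-hypothesis confidence scores to the JSON result
        self.speech_config.output_format = speechsdk.OutputFormat.Detailed

    def transcribe_file(
        self,
        audio_path: str,
        language: str = "zh-CN"
    ) -> Dict:
        """
        Recognize speech from an audio file.
        Note: recognize_once() returns after the first utterance, so this suits
        short clips; long recordings need continuous recognition.
        Args:
            audio_path: Path to the audio file
            language: Locale code (zh-CN, en-US, ja-JP, ...)
        Returns:
            {"text": "...", "confidence": 0.95}
        """
        # Set the recognition language
        self.speech_config.speech_recognition_language = language
        # Create the audio config
        audio_config = speechsdk.AudioConfig(filename=audio_path)
        # Create the recognizer
        recognizer = speechsdk.SpeechRecognizer(
            speech_config=self.speech_config,
            audio_config=audio_config
        )
        # Run recognition
        result = recognizer.recognize_once()
        if result.reason == speechsdk.ResultReason.RecognizedSpeech:
            # The raw service response is a JSON string; confidence is under NBest
            raw = result.properties.get(
                speechsdk.PropertyId.SpeechServiceResponse_JsonResult
            )
            nbest = json.loads(raw).get("NBest", []) if raw else []
            confidence = nbest[0].get("Confidence") if nbest else None
            return {"text": result.text, "confidence": confidence}
        elif result.reason == speechsdk.ResultReason.NoMatch:
            return {"error": "No speech could be recognized"}
        else:
            return {"error": f"Recognition failed: {result.reason}"}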
    def transcribe_realtime(self, language: str = "zh-CN"):
        """Real-time recognition from the default microphone."""
        self.speech_config.speech_recognition_language = language
        recognizer = speechsdk.SpeechRecognizer(
            speech_config=self.speech_config
        )
        print("🎤 Start speaking...")

        # Callback fired for each finalized utterance
        def recognized(evt):
            print(f"Recognized: {evt.result.text}")

        recognizer.recognized.connect(recognized)
        # Start continuous recognition
        recognizer.start_continuous_recognition()
        input("Press Enter to stop...\n")
        recognizer.stop_continuous_recognition()
# Usage example (requires an Azure account)
# asr = AzureSpeechASR(
#     subscription_key="YOUR_KEY",
#     region="eastus"
# )
# result = asr.transcribe_file("speech.wav", language="zh-CN")
# print(result["text"])
4. Hands-on: meeting transcription system
python
# Assumes the WhisperASR class defined earlier in this section is in scope
import os
from datetime import datetime
from typing import Dict, List


class MeetingTranscriber:
    """Meeting transcription system."""

    def __init__(self, model_size: str = "medium"):
        self.asr = WhisperASR(model_size=model_size)

    def transcribe_meeting(
        self,
        audio_path: str,
        output_dir: str = "./transcripts"
    ) -> str:
        """
        Transcribe a meeting recording into timestamped text.
        Args:
            audio_path: Path to the meeting recording
            output_dir: Output directory
        Returns:
            Path of the transcript file
        """
        print(f"📋 Transcribing meeting: {audio_path}\n")
        # Create the output directory
        os.makedirs(output_dir, exist_ok=True)
        # Transcribe
        segments = self.asr.transcribe_with_timestamps(audio_path)
        # Build the transcript text
        transcript = self._format_transcript(segments)
        # Save it
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        output_path = os.path.join(
            output_dir,
            f"meeting_transcript_{timestamp}.txt"
        )
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(transcript)
        print(f"\n✅ Transcript saved: {output_path}")
        # Generate the summary (written to its own file next to the transcript)
        self._generate_summary(segments, output_dir, timestamp)
        return output_path
    def _format_transcript(self, segments: List[Dict]) -> str:
        """Format segments as a readable transcript."""
        lines = ["# Meeting Transcript\n"]
        lines.append(f"Generated at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
        lines.append("=" * 60 + "\n\n")
        for seg in segments:
            start_time = self._format_time(seg["start"])
            end_time = self._format_time(seg["end"])
            lines.append(f"[{start_time} - {end_time}]\n")
            lines.append(f"{seg['text']}\n\n")
        return "".join(lines)

    def _format_time(self, seconds: float) -> str:
        """Format seconds as HH:MM:SS."""
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)
        return f"{hours:02d}:{minutes:02d}:{secs:02d}"
    def _generate_summary(
        self,
        segments: List[Dict],
        output_dir: str,
        timestamp: str
    ) -> str:
        """Generate a meeting summary with an LLM; returns the file path, or None on failure."""
        from openai import OpenAI
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        # Join all segment texts
        full_text = " ".join(seg["text"] for seg in segments)
        # Truncate in code: a "#" comment inside the f-string would be sent as prompt text
        excerpt = full_text[:4000]
        prompt = f"""Please summarize the following meeting.
Meeting content:
{excerpt}
Please provide:
1. Meeting topic
2. Key discussion points (3-5 items)
3. Action items (if any)
4. Decisions (if any)
Format: concise, in Markdown."""
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.3
            )
            summary = response.choices[0].message.content
            # Save the summary
            summary_path = os.path.join(
                output_dir,
                f"meeting_summary_{timestamp}.md"
            )
            with open(summary_path, "w", encoding="utf-8") as f:
                f.write(summary)
            print(f"📝 Summary saved: {summary_path}")
            return summary_path
        except Exception as e:
            print(f"⚠️ Summary generation failed: {e}")
            return None
# Usage example
transcriber = MeetingTranscriber(model_size="medium")
transcript_path = transcriber.transcribe_meeting("meeting_recording.wav")
print(f"\n📄 Transcript file: {transcript_path}")
6.2.3 Text-to-Speech Synthesis (TTS)
1. Using Edge TTS (free, high quality)
python
# pip install edge-tts
import asyncio
from typing import Dict, List

import edge_tts


class EdgeTTS:
    """Speech synthesis with Edge TTS."""

    def __init__(self):
        self.voices = None

    async def list_voices(self, language: str = None) -> List[Dict]:
        """
        List the available voices.
        Args:
            language: Locale prefix (e.g. "zh-CN", "en-US")
        Returns:
            A list of voice descriptions
        """
        voices = await edge_tts.list_voices()
        if language:
            voices = [v for v in voices if v["Locale"].startswith(language)]
        return voices
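    async def synthesize(
        self,
        text: str,
        output_path: str,
        voice: str = "zh-CN-XiaoxiaoNeural",
        rate: str = "+0%",
        volume: str = "+0%",
        pitch: str = "+0Hz"
    ):
        """
        Synthesize speech.
        Args:
            text: Text to synthesize
            output_path: Path of the output audio file
            voice: Voice name
                Chinese:
                - zh-CN-XiaoxiaoNeural: Xiaoxiao (female, general purpose)
                - zh-CN-YunxiNeural: Yunxi (male)
                - zh-CN-YunyangNeural: Yunyang (male, news style)
                English:
                - en-US-JennyNeural: Jenny (female)
                - en-US-GuyNeural: Guy (male)
            rate: Speaking rate as a signed relative change, e.g. "-50%" or "+100%"
            volume: Volume as a signed relative change, e.g. "-50%" or "+100%"
            pitch: Pitch as a signed offset in Hz, e.g. "-50Hz" or "+50Hz"
        """
        communicate = edge_tts.Communicate(
            text=text,
            voice=voice,
            rate=rate,
            volume=volume,
            pitch=pitch
        )
        await communicate.save(output_path)
        print(f"✅ Audio saved: {output_path}")

    def synthesize_sync(self, text: str, output_path: str, **kwargs):
        """Synchronous wrapper around synthesize()."""
        # asyncio.run() fails inside an already running event loop (e.g. Jupyter);
        # in that case, await synthesize() directly
        asyncio.run(self.synthesize(text, output_path, **kwargs))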
# Usage example
tts = EdgeTTS()
# List Chinese voices
voices = asyncio.run(tts.list_voices(language="zh-CN"))
print("🎤 Available Chinese voices:")
for v in voices[:5]:
    print(f"  • {v['ShortName']}: {v['FriendlyName']}")
# Synthesize speech
text = "你好,我是人工智能语音助手。今天天气很好,适合出门散步。"
tts.synthesize_sync(
    text=text,
    output_path="output.mp3",
    voice="zh-CN-XiaoxiaoNeural",
    rate="+10%"  # slightly faster
)
Example output
🎤 Available Chinese voices:
  • zh-CN-XiaoxiaoNeural: Microsoft Xiaoxiao Online (Natural) - Chinese (Mainland)
  • zh-CN-YunxiNeural: Microsoft Yunxi Online (Natural) - Chinese (Mainland)
  • zh-CN-YunyangNeural: Microsoft Yunyang Online (Natural) - Chinese (Mainland)
  • zh-CN-XiaoyiNeural: Microsoft Xiaoyi Online (Natural) - Chinese (Mainland)
  • zh-CN-YunjianNeural: Microsoft Yunjian Online (Natural) - Chinese (Mainland)
✅ Audio saved: output.mp3
2. Using Coqui TTS (open source, customizable)
python
# pip install TTS
import torch
from TTS.api import TTS


class CoquiTTS:
    """Speech synthesis with Coqui TTS."""

    def __init__(self, model_name: str = "tts_models/zh-CN/baker/tacotron2-DDC-GST"):
        """
        Args:
            model_name: Model name
                Chinese:
                - tts_models/zh-CN/baker/tacotron2-DDC-GST
                English:
                - tts_models/en/ljspeech/tacotron2-DDC
                - tts_models/en/vctk/vits (multi-speaker)
        """
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"🔧 Loading TTS model: {model_name}")
        self.tts = TTS(model_name=model_name).to(self.device)
        print(f"✅ Model loaded on {self.device}")
    def synthesize(
        self,
        text: str,
        output_path: str,
        speaker: str = None
    ):
        """
        Synthesize speech.
        Args:
            text: Text to synthesize
            output_path: Output path
            speaker: Speaker name (multi-speaker models only)
        """
        if speaker:
            self.tts.tts_to_file(
                text=text,
                file_path=output_path,
                speaker=speaker
            )
        else:
            self.tts.tts_to_file(
                text=text,
                file_path=output_path
            )
        print(f"✅ Audio saved: {output_path}")

    def list_speakers(self):
        """List the available speakers (None if the model has no speaker list)."""
        if hasattr(self.tts, 'speakers'):
            return self.tts.speakers
        return None
# Usage example
tts = CoquiTTS()
tts.synthesize(
    text="欢迎使用开源语音合成系统。",
    output_path="coqui_output.wav"
)
3. Using Azure Speech TTS (cloud option)
python
import azure.cognitiveservices.speech as speechsdk


class AzureTTS:
    """Speech synthesis with Azure Speech."""

    def __init__(self, subscription_key: str, region: str):
        self.speech_config = speechsdk.SpeechConfig(
            subscription=subscription_key,
            region=region
        )
    def synthesize(
        self,
        text: str,
        output_path: str,
        voice: str = "zh-CN-XiaoxiaoNeural",
        style: str = None,
        rate: float = 1.0
    ):
        """
        Synthesize speech.
        Args:
            text: Text to synthesize
            output_path: Output path
            voice: Voice name
            style: Speaking style (e.g. "cheerful", "sad", "angry")
            rate: Speaking rate multiplier (0.5 - 2.0)
        """
        # Set the voice
        self.speech_config.speech_synthesis_voice_name = voice
        # Create the audio config
        audio_config = speechsdk.audio.AudioOutputConfig(
            filename=output_path
        )
        # Create the synthesizer
        synthesizer = speechsdk.SpeechSynthesizer(
            speech_config=self.speech_config,
            audio_config=audio_config
        )
        # Use SSML only when a style or a rate change is requested
        if style or rate != 1.0:
            ssml = self._build_ssml(text, voice, style, rate)
            result = synthesizer.speak_ssml_async(ssml).get()
        else:
            result = synthesizer.speak_text_async(text).get()
        if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
            print(f"✅ Audio saved: {output_path}")
        else:
            print(f"❌ Synthesis failed: {result.reason}")
    def _build_ssml(
        self,
        text: str,
        voice: str,
        style: str = None,
        rate: float = 1.0
    ) -> str:
        """Build the SSML document for a styled or rate-adjusted request."""
        from xml.sax.saxutils import escape
        # Azure reads a percentage as a relative change, so a 1.2x rate becomes "+20%"
        rate_str = f"{(rate - 1.0) * 100:+.0f}%"
        # Escape &, < and > so the text cannot break the XML
        body = f'<prosody rate="{rate_str}">{escape(text)}</prosody>'
        if style:
            body = f'<mstts:express-as style="{style}">{body}</mstts:express-as>'
        return (
            '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
            'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="zh-CN">'
            f'<voice name="{voice}">{body}</voice>'
            '</speak>'
        )
# Usage example (requires an Azure account)
# tts = AzureTTS(subscription_key="YOUR_KEY", region="eastus")
# tts.synthesize(
#     text="今天天气真好!",
#     output_path="azure_output.wav",
#     voice="zh-CN-XiaoxiaoNeural",
#     style="cheerful",
#     rate=1.2
# )
4. Hands-on: audiobook generator
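The generator in this subsection sends each chapter to the TTS engine in a single call. For long chapters it may be more robust to synthesize smaller pieces and concatenate the audio afterwards. The sketch below shows one way to cut text at sentence boundaries; `split_for_tts` and the `max_chars` default are assumptions for illustration, not limits documented by any TTS service.

```python
import re
from typing import List


def split_for_tts(text: str, max_chars: int = 1000) -> List[str]:
    """Group sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut mid-sentence.
    """
    # Split right after Chinese or Western sentence-ending punctuation, keeping it
    sentences = [s for s in re.split(r"(?<=[。!?!?.])", text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


print(split_for_tts("你好。世界!再见?", max_chars=6))
# ['你好。世界!', '再见?']
```

Because the split also triggers on ".", decimals and abbreviations can be separated when they fall on a chunk boundary; a production splitter would need a more careful sentence tokenizer.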
python
# Assumes the EdgeTTS class defined earlier in this section is in scope
import re
from pathlib import Path
from typing import Dict, List


class AudiobookGenerator:
    """Audiobook generator."""

    def __init__(self):
        self.tts = EdgeTTS()

    def generate_audiobook(
        self,
        text_file: str,
        output_dir: str = "./audiobook",
        voice: str = "zh-CN-XiaoxiaoNeural",
        chapters: bool = True
    ):
        """
        Generate an audiobook.
        Args:
            text_file: Path to the text file
            output_dir: Output directory
            voice: Voice name
            chapters: Whether to split the output by chapter
        """
        print(f"📚 Generating audiobook: {text_file}\n")
        # Read the text
        with open(text_file, "r", encoding="utf-8") as f:
            content = f.read()
        # Create the output directory
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        if chapters:
            # One audio file per chapter
            chapter_list = self._split_chapters(content)
            for i, chapter in enumerate(chapter_list, 1):
                print(f"🎙️ Generating chapter {i}...")
                output_path = Path(output_dir) / f"chapter_{i:02d}.mp3"
                self.tts.synthesize_sync(
                    text=chapter["content"],
                    output_path=str(output_path),
                    voice=voice
                )
        else:
            # The whole book in a single file
            print("🎙️ Generating full audio...")
            output_path = Path(output_dir) / "full_audiobook.mp3"
            self.tts.synthesize_sync(
                text=content,
                output_path=str(output_path),
                voice=voice
            )
        print(f"\n✅ Audiobook generated: {output_dir}")
    def _split_chapters(self, content: str) -> List[Dict]:
        """Split the text into chapters."""
        # Simple split on "第X章" or "Chapter X" headings;
        # any text before the first heading is not included
        chapter_pattern = r'(第[一二三四五六七八九十百千\d]+章|Chapter\s+\d+)'
        chapters = []
        parts = re.split(chapter_pattern, content)
        # parts alternates: [preamble, title1, body1, title2, body2, ...]
        for i in range(1, len(parts), 2):
            if i + 1 < len(parts):
                chapters.append({
                    "title": parts[i].strip(),
                    "content": parts[i + 1].strip()
                })
        # No chapter headings found: treat the whole text as one chapter
        if not chapters:
            chapters.append({
                "title": "Full text",
                "content": content.strip()
            })
        return chapters
# Usage example
generator = AudiobookGenerator()
generator.generate_audiobook(
    text_file="novel.txt",
    output_dir="./my_audiobook",
    voice="zh-CN-XiaoxiaoNeural",
    chapters=True
)
6.2.4 Speaker Recognition and Diarization
1. Speaker diarization
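The class in this subsection labels each transcribed segment with the speaker whose turn contains the segment's midpoint. A common alternative is to pick the speaker whose turns overlap the segment the most, which tends to behave better when a segment straddles a speaker change. The following is a minimal sketch of that idea; `assign_speaker_by_overlap` is a hypothetical helper, and it assumes the same dictionary shapes (`start`, `end`, `speaker`) used by the class.

```python
from typing import Dict, List


def assign_speaker_by_overlap(asr_seg: Dict, speaker_segments: List[Dict]) -> str:
    """Return the speaker whose turns overlap the ASR segment the longest."""
    totals: Dict[str, float] = {}
    for turn in speaker_segments:
        # Length of the intersection of the two time intervals (<= 0 means disjoint)
        overlap = min(asr_seg["end"], turn["end"]) - max(asr_seg["start"], turn["start"])
        if overlap > 0:
            totals[turn["speaker"]] = totals.get(turn["speaker"], 0.0) + overlap
    if not totals:
        return "UNKNOWN"
    return max(totals, key=totals.get)


turns = [
    {"start": 0.0, "end": 2.0, "speaker": "SPEAKER_00"},
    {"start": 2.0, "end": 6.0, "speaker": "SPEAKER_01"},
]
# Overlaps SPEAKER_00 for 0.5 s and SPEAKER_01 for 3.0 s
print(assign_speaker_by_overlap({"start": 1.5, "end": 5.0}, turns))  # SPEAKER_01
```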
python
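# pip install pyannote.audio
from typing import Dict, List

import torch
from pyannote.audio import Pipeline


class SpeakerDiarization:
    """Speaker diarization (who spoke when)."""

    def __init__(self, hf_token: str):
        """
        Args:
            hf_token: Hugging Face access token
                Create one at: https://huggingface.co/settings/tokens
                Accept the model terms at: https://huggingface.co/pyannote/speaker-diarization-3.1
                (the pipeline also depends on the gated pyannote/segmentation-3.0 model)
        """
        # use_auth_token is the argument name in pyannote.audio 3.x;
        # check the signature of your installed version
        self.pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.1",
            use_auth_token=hf_token
        )
        # Move to the GPU when available
        if torch.cuda.is_available():
            self.pipeline.to(torch.device("cuda"))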
    def diarize(self, audio_path: str) -> List[Dict]:
        """
        Find out who speaks when.
        Returns:
            [
                {"start": 0.0, "end": 2.5, "speaker": "SPEAKER_00"},
                {"start": 2.5, "end": 5.0, "speaker": "SPEAKER_01"}
            ]
        """
        print(f"🎤 Analyzing speakers: {audio_path}")
        # Run diarization
        diarization = self.pipeline(audio_path)
        # Collect the speaker turns
        segments = []
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            segments.append({
                "start": turn.start,
                "end": turn.end,
                "speaker": speaker
            })
        print(f"✅ Found {len(set(s['speaker'] for s in segments))} speakers")
        return segments
    def transcribe_with_speakers(
        self,
        audio_path: str,
        asr_model: "WhisperASR"
    ) -> List[Dict]:
        """
        Transcribe and label each segment with its speaker.
        Args:
            audio_path: Path to the audio file
            asr_model: A WhisperASR instance (the class defined in 6.2.2)
        Returns:
            [
                {"start": 0.0, "end": 2.5, "speaker": "SPEAKER_00", "text": "你好"},
                {"start": 2.5, "end": 5.0, "speaker": "SPEAKER_01", "text": "你好"}
            ]
        """
        # 1. Speaker diarization
        speaker_segments = self.diarize(audio_path)
        # 2. Speech recognition
        print("\n🎤 Transcribing audio...")
        asr_segments = asr_model.transcribe_with_timestamps(audio_path)
        # 3. Match speakers to text
        print("\n🔗 Matching speakers to text...")
        result = self._match_segments(speaker_segments, asr_segments)
        return result
    def _match_segments(
        self,
        speaker_segments: List[Dict],
        asr_segments: List[Dict]
    ) -> List[Dict]:
        """Match each transcribed segment to a speaker turn."""
        result = []
        for asr_seg in asr_segments:
            # Use the segment midpoint to decide which speaker turn it belongs to
            asr_mid = (asr_seg["start"] + asr_seg["end"]) / 2
            speaker = None
            for sp_seg in speaker_segments:
                if sp_seg["start"] <= asr_mid <= sp_seg["end"]:
                    speaker = sp_seg["speaker"]
                    break
            result.append({
                "start": asr_seg["start"],
                "end": asr_seg["end"],
                "speaker": speaker or "UNKNOWN",
                "text": asr_seg["text"]
            })
        return result
# Usage example (requires a Hugging Face token)
# diarizer = SpeakerDiarization(hf_token="YOUR_HF_TOKEN")
# asr = WhisperASR(model_size="base")
# # Transcribe and label speakers
# result = diarizer.transcribe_with_speakers("conversation.wav", asr)
# print("\n📝 Conversation transcript:")
# for seg in result:
#     print(f"[{seg['speaker']}] ({seg['start']:.2f}s - {seg['end']:.2f}s)")
#     print(f"  {seg['text']}\n")
6.2.5 Speech Emotion Recognition
python
# pip install transformers torch librosa
from typing import Dict

import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification


class SpeechEmotionRecognition:
    """Speech emotion recognition (this checkpoint targets English speech)."""

    def __init__(self):
        model_name = "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
        # An audio classifier needs only a feature extractor, not a tokenizer
        self.processor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
        self.model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
        self.model.eval()
        # Read the label map from the checkpoint config instead of hard-coding it,
        # so the indices always match the model's output classes
        self.emotions = self.model.config.id2label
    def recognize_emotion(self, audio_path: str) -> Dict:
        """
        Recognize the emotion expressed in a speech recording.
        Returns:
            {"emotion": "happy", "confidence": 0.85, "all_scores": {...}}
        """
        # Load the audio at the 16 kHz rate the model expects
        audio, sr = librosa.load(audio_path, sr=16000)
        # Preprocess
        inputs = self.processor(
            audio,
            sampling_rate=16000,
            return_tensors="pt",
            padding=True
        )
        # Inference
        with torch.no_grad():
            logits = self.model(**inputs).logits
        probs = torch.nn.functional.softmax(logits, dim=-1)[0]
        # Decode the prediction
        pred_id = torch.argmax(probs).item()
        emotion = self.emotions[pred_id]
        confidence = probs[pred_id].item()
        all_scores = {
            self.emotions[i]: probs[i].item()
            for i in range(len(self.emotions))
        }
        return {
            "emotion": emotion,
            "confidence": confidence,
            "all_scores": all_scores
        }
# Usage example
ser = SpeechEmotionRecognition()
result = ser.recognize_emotion("emotional_speech.wav")
print("🎭 Emotion recognition result:")
print(f"Top emotion: {result['emotion']} ({result['confidence']:.2%})")
print("\nAll emotion scores:")
for emotion, score in result['all_scores'].items():
    print(f"  • {emotion}: {score:.2%}")
6.2.6 Hands-on Case Study: Smart Voice Assistant
python
"""
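Capstone example: smart voice assistant
Features:
1. Wake word (listed as a goal; not implemented below, which records fixed-length clips)
2. Speech recognition
3. Intent understanding (via an LLM)
4. Spoken reply
"""
# Assumes WhisperASR and EdgeTTS from the earlier sections are in scope
import wave

import pyaudio
from openai import OpenAI


class VoiceAssistant:
    """Smart voice assistant."""

    def __init__(self):
        print("🤖 Initializing voice assistant...")
        # Speech recognition
        self.asr = WhisperASR(model_size="base")
        # Speech synthesis
        self.tts = EdgeTTS()
        # LLM client
        self.llm_client = OpenAI()
        # Conversation history
        self.conversation_history = []
        print("✅ Initialization complete")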
    def listen(self, duration: int = 5, save_path: str = "temp_audio.wav") -> str:
        """
        Record from the microphone and transcribe the recording.
        Args:
            duration: Recording length in seconds
            save_path: Path of the temporary audio file
        Returns:
            The recognized text
        """
        print(f"🎤 Recording ({duration} s)...")
        # Recording parameters
        CHUNK = 1024
        FORMAT = pyaudio.paInt16
        CHANNELS = 1
        RATE = 16000
        # Initialize PyAudio
        p = pyaudio.PyAudio()
        # Look up the sample width before PyAudio is terminated
        sample_width = p.get_sample_size(FORMAT)
        # Open the input stream
        stream = p.open(
            format=FORMAT,
            channels=CHANNELS,
            rate=RATE,
            input=True,
            frames_per_buffer=CHUNK
        )
        frames = []
        # Record
        for _ in range(int(RATE / CHUNK * duration)):
            frames.append(stream.read(CHUNK))
        print("✅ Recording finished")
        # Stop recording and release the device
        stream.stop_stream()
        stream.close()
        p.terminate()
        # Save as WAV
        with wave.open(save_path, "wb") as wf:
            wf.setnchannels(CHANNELS)
            wf.setsampwidth(sample_width)
            wf.setframerate(RATE)
            wf.writeframes(b"".join(frames))
        # Recognize
        print("🔍 Recognizing...")
        result = self.asr.transcribe(save_path)
        text = result["text"]
        print(f"📝 Recognized: {text}")
        return text
    def think(self, user_input: str) -> str:
        """
        Understand the request and generate a reply.
        Args:
            user_input: The user's utterance
        Returns:
            The assistant's reply
        """
        # Add the user turn to the history
        self.conversation_history.append({
            "role": "user",
            "content": user_input
        })
        # Call the LLM; the system prompt stays in Chinese to match the zh-CN voice
        # ("You are a friendly voice assistant; keep answers short and clear.")
        messages = [
            {"role": "system", "content": "你是一个友好的语音助手,回答要简洁明了。"}
        ] + self.conversation_history
        response = self.llm_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            temperature=0.7,
            max_tokens=200
        )
        assistant_reply = response.choices[0].message.content
        # Add the assistant turn to the history
        self.conversation_history.append({
            "role": "assistant",
            "content": assistant_reply
        })
        return assistant_reply
    def speak(self, text: str, output_path: str = "temp_response.mp3"):
        """
        Speak the text aloud.
        Args:
            text: Text to speak
            output_path: Path of the temporary audio file
        """
        print(f"🔊 Speaking: {text}")
        # Synthesize speech
        self.tts.synthesize_sync(
            text=text,
            output_path=output_path,
            voice="zh-CN-XiaoxiaoNeural",
            rate="+10%"
        )
        # Play the audio (requires the playsound package)
        try:
            from playsound import playsound
            playsound(output_path)
        except Exception as e:
            print(f"⚠️ Could not play audio (is playsound installed?): {e}")
    def run(self):
        """Run the assistant loop."""
        print("\n" + "=" * 60)
        print("🤖 Voice assistant started")
        print("Say '退出' or '再见' to end the conversation")
        print("=" * 60 + "\n")
        while True:
            try:
                # 1. Listen
                user_input = self.listen(duration=5)
                # Check for an exit command (spoken in Chinese)
                if any(word in user_input for word in ["退出", "再见", "拜拜"]):
                    goodbye = "好的,再见!"
                    print(f"🤖 {goodbye}")
                    self.speak(goodbye)
                    break
                # 2. Think
                response = self.think(user_input)
                # 3. Reply
                self.speak(response)
                print("\n" + "-" * 60 + "\n")
            except KeyboardInterrupt:
                print("\n\n👋 Assistant stopped")
                break
            except Exception as e:
                print(f"❌ Error: {e}")
# Usage example
assistant = VoiceAssistant()
assistant.run()
6.2.7 Learning Resources
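Recommended courses
- Stanford CS224S: Speech Recognition and Synthesis
- Fast.ai Audio: hands-on deep learning for audio
Open-source projects
- Whisper: OpenAI speech recognition
- Coqui TTS: open-source speech synthesis
- SpeechBrain: speech processing toolkit
- Silero Models: lightweight speech models
Papers
- Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
- Wav2Vec 2.0: Self-Supervised Learning for Speech Recognition
- VALL-E: Neural Codec Language Models
Practice projects
- Build a meeting transcription system (ASR + speaker diarization + summarization)
- Develop an audiobook generator (TTS + chapter splitting)
- Implement real-time speech translation (ASR + translation + TTS)
- Build a speech emotion analysis system (emotion recognition + visualization)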
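Key takeaways
- ✅ Whisper is a strong default for ASR: multilingual, accurate, open source
- ✅ Edge TTS is free and high quality: well suited to rapid prototyping
- ✅ Audio preprocessing matters: sample rate, noise reduction, volume normalization
- ✅ Plan for latency: choose a suitable model size and optimization method
- ✅ Multilingual support: Whisper supports 99 languages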
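Of the preprocessing steps listed above, volume normalization is the simplest to illustrate. The sketch below applies peak normalization: it scales the signal so that its largest absolute sample reaches a target level. `peak_normalize` and the `target_peak` default are assumptions for illustration; it works on a plain list of floats in [-1, 1] to stay dependency-free.

```python
from typing import List


def peak_normalize(samples: List[float], target_peak: float = 0.9) -> List[float]:
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


quiet = [0.0, 0.1, -0.2, 0.05]
louder = peak_normalize(quiet)
print(max(abs(s) for s in louder))  # close to 0.9
```

With NumPy arrays, as returned by `librosa.load`, the same scaling is `audio * (target_peak / np.max(np.abs(audio)))`. Peak normalization does not remove noise; it only makes levels comparable across recordings.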
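Common mistakes
- ❌ Ignoring audio quality (noise and echo degrade recognition)
- ❌ Mismatched sample rates (the model receives audio at a rate it was not trained on, which causes recognition errors)
- ❌ Audio that is too long for the model's context window (Whisper processes 30-second windows, so long recordings need chunking or a wrapper that does it for you)
- ❌ Ignoring latency optimization (real-time applications need low latency)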
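For the long-audio problem, one common approach is to cut the recording into fixed-length windows with a small overlap, so that words at a boundary appear in full in at least one window. The sketch below only computes the sample ranges; `chunk_spans` and its defaults (30-second windows, 1-second overlap) are assumptions for illustration. Note that the `transcribe()` function in openai-whisper already slides over long files internally, so manual chunking mainly matters for models or APIs that do not.

```python
from typing import List, Tuple


def chunk_spans(num_samples: int, sr: int = 16000,
                chunk_s: float = 30.0, overlap_s: float = 1.0) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping windows covering the audio."""
    chunk = int(chunk_s * sr)
    step = chunk - int(overlap_s * sr)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_s")
    spans = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        spans.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the audio
        start += step
    return spans


# 70 s of 16 kHz audio is covered by three windows
print(chunk_spans(70 * 16000))
# [(0, 480000), (464000, 944000), (928000, 1120000)]
```

Each span can then be sliced from the audio array (`audio[start:end]`) and transcribed separately; merging the text in the overlapping regions is left out of this sketch.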
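Performance optimization
- Use Faster Whisper (up to about 4x faster according to the project's benchmarks)
- Process audio in batches (higher throughput)
- Filter silence with VAD (less computation)
- Quantize the model (lower memory use and latency)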
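To make the VAD item concrete: voice activity detection marks which parts of a recording contain speech so that silence can be skipped. Faster Whisper's `vad_filter` relies on a learned model (Silero VAD); the sketch below is a much simpler energy-threshold stand-in that only illustrates the idea. `energy_vad`, the 30 ms frame size, and the threshold value are assumptions for illustration, and real recordings would need a threshold tuned to their noise floor.

```python
import math
from typing import List, Tuple


def energy_vad(samples: List[float], sr: int = 16000,
               frame_ms: int = 30, threshold: float = 0.02) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) spans whose per-frame RMS energy reaches the threshold."""
    frame_len = int(sr * frame_ms / 1000)
    num_frames = len(samples) // frame_len
    spans = []
    span_start = None
    for i in range(num_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        t = i * frame_len / sr
        if rms >= threshold and span_start is None:
            span_start = t                 # speech begins
        elif rms < threshold and span_start is not None:
            spans.append((span_start, t))  # speech ends
            span_start = None
    if span_start is not None:
        spans.append((span_start, num_frames * frame_len / sr))
    return spans


# 0.3 s of silence, 0.3 s of a 440 Hz tone, 0.3 s of silence
sr = 16000
n = sr * 3 // 10
tone = [0.5 * math.sin(2 * math.pi * 440 * k / sr) for k in range(n)]
signal = [0.0] * n + tone + [0.0] * n
print(energy_vad(signal, sr))  # [(0.3, 0.6)]
```

Only the returned spans would then be passed to the recognizer, which is where the compute savings come from.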
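Next steps: After finishing this section, combine it with vision models (6.1) and multimodal fusion (6.3) to build complete multimodal AI applications such as video understanding and multimodal dialogue.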