Whisper

Parts of this page were generated by an LLM and have not yet been verified by a human.

Whisper is an automatic speech recognition (ASR) model developed by OpenAI. It transcribes audio into text and produces timestamps.

Installation

Install with pip

pip install -U openai-whisper

Install ffmpeg (required dependency)

Whisper needs ffmpeg to decode audio files:

# Windows, using Chocolatey
choco install ffmpeg

# Windows, using Scoop
scoop install ffmpeg

# macOS, using Homebrew
brew install ffmpeg

# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg

# Arch Linux
sudo pacman -S ffmpeg
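
Before running Whisper, it can help to confirm that ffmpeg is actually reachable on `PATH`. The following is a minimal sketch using only the Python standard library; the helper name `ffmpeg_available` is made up for this example and is not part of Whisper.

```python
import shutil


def ffmpeg_available(executable: str = "ffmpeg") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


if ffmpeg_available():
    print("ffmpeg found at", shutil.which("ffmpeg"))
else:
    print("ffmpeg not found; install it with your platform's package manager")
```

The check only looks up the executable; it does not verify that the installed ffmpeg build can decode a particular file.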

Available models

Model	Parameters	English-only model	Multilingual model	Required VRAM	Relative speed
tiny	39 M	`tiny.en`	`tiny`	~1 GB	~10x
base	74 M	`base.en`	`base`	~1 GB	~7x
small	244 M	`small.en`	`small`	~2 GB	~4x
medium	769 M	`medium.en`	`medium`	~5 GB	~2x
large	1550 M	N/A	`large`	~10 GB	1x
turbo	809 M	N/A	`turbo`	~6 GB	~8x

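The .en models support English only and tend to perform better on English, especially at the tiny and base sizes. The multilingual models can detect the spoken language automatically.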
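
One way to read the VRAM column is as a filter on which models a given GPU can hold. The sketch below copies the approximate figures from the table; the names `MODEL_VRAM_GB` and `models_that_fit` are hypothetical, and real memory use also depends on audio length and decoding options.

```python
# Approximate VRAM needs in GB, copied from the model table.
MODEL_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "turbo": 6,
    "large": 10,
}


def models_that_fit(available_vram_gb: float) -> list[str]:
    """Return the model names whose approximate VRAM need fits the budget."""
    return [name for name, need in MODEL_VRAM_GB.items() if need <= available_vram_gb]


print(models_that_fit(4))
```

With a 4 GB budget, only tiny, base, and small pass the filter.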

Basic usage

Command-line interface

whisper audio.mp3 --model base

Common options

Option	Description	Example
`--model`	Model to use	`--model medium`
`--language`	Language spoken in the audio	`--language zh`
`--task`	Task type (transcribe/translate)	`--task translate`
`--output_dir`	Output directory	`--output_dir ./output`
`--output_format`	Output format	`--output_format srt`
`--device`	Device to run inference on	`--device cuda`
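
When the CLI is driven from a script, the options in this table map naturally onto an argument list. The sketch below assumes the `whisper` executable is on `PATH`; `build_whisper_command` and `run_whisper` are illustrative helper names, not part of Whisper itself.

```python
import subprocess


def build_whisper_command(audio_path, model="base", language=None,
                          task="transcribe", output_dir=None,
                          output_format=None, device=None):
    """Assemble an argument list for the whisper CLI from the common options."""
    cmd = ["whisper", audio_path, "--model", model, "--task", task]
    optional = {
        "--language": language,
        "--output_dir": output_dir,
        "--output_format": output_format,
        "--device": device,
    }
    # Only pass the flags the caller actually set.
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, value]
    return cmd


def run_whisper(audio_path, **options):
    """Run the CLI; raises CalledProcessError if whisper exits with an error."""
    return subprocess.run(build_whisper_command(audio_path, **options), check=True)


print(build_whisper_command("audio.mp3", model="medium", language="zh"))
```

Passing a list to `subprocess.run` avoids shell quoting problems with file names that contain spaces.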

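Output formats

Format	Description
`txt`	Plain text (no timestamps)
`vtt`	WebVTT subtitle format
`srt`	SRT subtitle format
`tsv`	Tab-separated values (with timestamps)
`json`	JSON (full text plus per-segment details)
`all`	Write every format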
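
To make the difference between the plain-text and subtitle formats concrete, the sketch below renders a list of segments as SRT cues: a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, and the text. It assumes each segment is a dict with `start`, `end`, and `text` keys, as in Whisper's segment output; the function names and the sample segment are made up for illustration, and this is not Whisper's own writer.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render segments (dicts with start, end, text) as SRT cue blocks."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{time_range}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


# Made-up sample segment, not real model output.
print(segments_to_srt([{"start": 0.0, "end": 2.5, "text": " Hello"}]))
```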

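Common command examples

Basic transcription

# Transcribe with the default model
whisper audio.mp3

# Specify the model
whisper audio.mp3 --model medium

# Specify the language (skips language detection)
whisper audio.mp3 --model medium --language zh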

Writing subtitle files

# Write SRT subtitles
whisper audio.mp3 --model base --output_format srt

# Write every format
whisper audio.mp3 --model base --output_format all --output_dir ./subtitles

Translating into English

# Translate speech in any supported language into English
whisper audio.mp3 --model medium --task translate

Using GPU acceleration

# Use CUDA (NVIDIA GPU)
whisper audio.mp3 --model large --device cuda

# Select a specific GPU by index
whisper audio.mp3 --model large --device cuda:0
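
A script that has to run on machines with and without a GPU can choose the `--device` value at run time. This is a minimal sketch, assuming PyTorch (installed as a dependency of openai-whisper) is importable; `pick_device` is a hypothetical helper name.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch reports a usable CUDA GPU, otherwise "cpu"."""
    try:
        import torch  # installed alongside openai-whisper
    except ImportError:
        # Without PyTorch there is nothing to query, so fall back to the CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print("whisper audio.mp3 --model large --device", pick_device())
```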

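Advanced options

Option	Description	Default
`--temperature`	Sampling temperature	0
`--best_of`	Number of candidates when sampling with a non-zero temperature	5
`--beam_size`	Number of beams in beam search (used when the temperature is 0)	5
`--patience`	Beam search patience	None (equivalent to 1.0)
`--initial_prompt`	Text supplied as a prompt for the first window	None
`--condition_on_previous_text`	Feed the previous output as a prompt for the next window	True
`--word_timestamps`	Word-level timestamps	False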

Word-level timestamps

whisper audio.mp3 --model base --word_timestamps True
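
With word timestamps enabled, the JSON output carries a `words` list inside each segment. The sketch below walks that structure; it assumes each word entry has `word`, `start`, and `end` keys, and the sample data is a hand-written stand-in for a real output file, not actual model output. `iter_words` is a hypothetical helper name.

```python
import json


def iter_words(result: dict):
    """Yield (word, start, end) for every word in a Whisper-style JSON result."""
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            yield word["word"].strip(), word["start"], word["end"]


# Hand-written sample mimicking the JSON layout; not real model output.
sample = json.loads("""
{"text": " Hello world",
 "segments": [{"start": 0.0, "end": 1.0, "text": " Hello world",
               "words": [{"word": " Hello", "start": 0.0, "end": 0.4},
                         {"word": " world", "start": 0.5, "end": 1.0}]}]}
""")

for word, start, end in iter_words(sample):
    print(f"{start:5.2f} -> {end:5.2f}  {word}")
```

In practice the dict would come from `json.load` on the `.json` file that the CLI writes to the output directory.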

Using an initial prompt

# Supply domain terms or formatting hints
whisper audio.mp3 --model medium --initial_prompt "This is a talk about machine learning"

Performance tuning

When memory is insufficient

# Use a smaller model
whisper audio.mp3 --model tiny

# Or run on the CPU (slower, but needs no GPU memory)
whisper audio.mp3 --model medium --device cpu

Speeding up processing

# Use the turbo model (a balance of speed and quality)
whisper audio.mp3 --model turbo

# Specify the language (skips language detection)
whisper audio.mp3 --model base --language en

FAQ

Supported audio formats

Whisper reads many formats through ffmpeg:

Audio files: MP3, WAV, FLAC, AAC, OGG, M4A
Video files (audio is extracted automatically): MP4, MKV, AVI, MOV
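
For batch jobs, the format list can be turned into a simple extension filter. This sketch uses only the extensions named above (ffmpeg itself can decode many more); `MEDIA_EXTENSIONS`, `is_supported_media`, and `find_media_files` are hypothetical names chosen for the example.

```python
from pathlib import Path

# Extensions taken from the format list; ffmpeg can decode many others too.
MEDIA_EXTENSIONS = {
    ".mp3", ".wav", ".flac", ".aac", ".ogg", ".m4a",  # audio
    ".mp4", ".mkv", ".avi", ".mov",                   # video
}


def is_supported_media(filename) -> bool:
    """Return True if the file extension is in the list, ignoring case."""
    return Path(filename).suffix.lower() in MEDIA_EXTENSIONS


def find_media_files(folder):
    """Return media files directly inside folder, sorted by path."""
    return sorted(
        path for path in Path(folder).iterdir()
        if path.is_file() and is_supported_media(path)
    )


print(is_supported_media("lecture.MP3"), is_supported_media("notes.txt"))
```

Each path returned by `find_media_files` can then be passed to the `whisper` command in turn.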

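Best practices

Choose an appropriate model: use base or small for general use, and medium or large when you need higher accuracy
Specify the language: when the language is known, passing --language skips detection and speeds up processing
Audio quality: clear audio noticeably improves recognition accuracy
GPU acceleration: use --device cuda when an NVIDIA GPU is available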
