Hands-on with Whisper, OpenAI's open-source speech recognition system for 99 languages


OpenAI recently released a speech recognition model called Whisper. Unlike DALL·E 2 and GPT-3, Whisper is free and open source.

An Overview of the Whisper Speech Recognition Model

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual data. According to OpenAI, the model is robust to varied accents, background noise, and technical language. It supports transcription in 99 languages as well as translation from those languages into English.

Whisper's Architecture

Whisper uses a simple end-to-end approach built on a Transformer encoder-decoder. Input audio is split into 30-second segments, converted to log-Mel spectrograms, and passed to the encoder. The decoder is trained to predict the corresponding text, with special tokens added so that a single model can perform multiple tasks: language identification, multilingual transcription, and speech-to-English translation.
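As a rough illustration of the front end, the 30-second windowing step can be sketched in plain numpy. The 16 kHz sample rate and 30-second window match Whisper's published preprocessing; the helper name below is our own, not part of the whisper library:

```python
import numpy as np

SAMPLE_RATE = 16000   # Whisper resamples all audio to 16 kHz
CHUNK_SECONDS = 30    # each encoder input covers 30 seconds

def chunk_audio(samples: np.ndarray) -> list:
    """Split a 1-D waveform into 30-second windows, zero-padding the last one."""
    window = SAMPLE_RATE * CHUNK_SECONDS
    chunks = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        if len(chunk) < window:                      # pad the final partial window
            chunk = np.pad(chunk, (0, window - len(chunk)))
        chunks.append(chunk)
    return chunks

# 70 seconds of audio -> three 30-second windows, the last one padded
audio = np.zeros(70 * SAMPLE_RATE, dtype=np.float32)
chunks = chunk_audio(audio)
print(len(chunks), len(chunks[0]))  # 3 480000
```

Each of these fixed-length windows is what gets turned into a log-Mel spectrogram and fed to the encoder.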

Whisper's Strengths

Unlike many other approaches, Whisper is trained on a large, diverse dataset and is not fine-tuned for any particular benchmark. As a result, it may not beat specialized models on a specific dataset such as LibriSpeech, but its zero-shot performance across many different datasets is far more robust, with roughly 50% fewer errors. In addition, about a third of Whisper's audio data is non-English, which helps it learn cross-lingual speech-to-text translation.
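The error figure being compared here is word error rate (WER): the word-level edit distance between the model's output and a reference transcript, divided by the reference length. A minimal generic implementation (this is the standard metric, not Whisper's own evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

print(wer("i will find you", "i will find him"))  # 0.25
```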

Whisper in Practice

Whisper comes in five model sizes, the four smaller of which are also available as English-only variants. Let's walk through using Whisper for speech recognition in practice.
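For reference, here are the five sizes with their approximate parameter counts (figures from OpenAI's Whisper model card); the English-only checkpoints carry a `.en` suffix:

```python
# Approximate parameter counts for Whisper's five model sizes
MODEL_SIZES = {
    "tiny":   "39M",
    "base":   "74M",
    "small":  "244M",
    "medium": "769M",
    "large":  "1550M",
}

# tiny/base/small/medium also ship as English-only checkpoints, e.g. "tiny.en"
english_only = [f"{name}.en" for name in MODEL_SIZES if name != "large"]
print(english_only)  # ['tiny.en', 'base.en', 'small.en', 'medium.en']
```

Smaller models are faster and lighter on VRAM; larger ones are more accurate, especially on non-English audio.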

Installing dependencies

First, install the required libraries:

```python
!pip install --upgrade pytube
!pip install -q git+https://github.com/openai/whisper.git
```

Downloading an audio file

Next, use the pytube library to download the audio track of a YouTube video:

```python
import whisper
import pytube

video = "https://www.youtube.com/watch?v=-7E-qFI"
data = pytube.YouTube(video)
audio = data.streams.get_audio_only()
audio.download()
```

Running speech recognition with Whisper

Once the audio file is downloaded, load a Whisper model and transcribe it:

```python
model = whisper.load_model("medium")
result = model.transcribe("11.mp4")
print(result['text'])
```
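Besides the full `text`, the dictionary returned by `transcribe` also contains a `segments` list with start/end timestamps, which is handy for generating subtitles. A small sketch that formats segments as SRT-style timestamps (the sample segment values below are made up for illustration, not real model output):

```python
def fmt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT-style HH:MM:SS,mmm timestamp."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

# Shape mirrors Whisper's result["segments"]; these values are invented
segments = [
    {"start": 0.0, "end": 4.2, "text": " I don't know who you are."},
    {"start": 4.2, "end": 7.9, "text": " I don't know what you want."},
]
for seg in segments:
    print(f"{fmt_timestamp(seg['start'])} --> {fmt_timestamp(seg['end'])}{seg['text']}")
```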

This code transcribes the audio file and prints the recognized text. For example:

I don't know who you are. I don't know what you want. If you are looking for ransom, I can tell you I don't have money. But what I do have are a very particular set of skills. Skills I have acquired over a very long career. Skills that make me a nightmare for people like you. If you let my daughter go now, that will be the end of it. I will not look for you. I will not pursue you. But if you don't, I will look for you. I will find you. And I will kill you. Good luck.

Multilingual recognition and translation

Whisper supports not only English but recognition and translation across 99 languages. Here is an example using Chinese speech:

First, choose the language to recognize, then run both transcription and translation on the audio file:

```python
import ipywidgets as widgets
import pandas as pd

# FLEURS-style language codes mapped to language names
languages = {
    "af_za": "Afrikaans", "am_et": "Amharic", "ar_eg": "Arabic", "as_in": "Assamese",
    "az_az": "Azerbaijani", "be_by": "Belarusian", "bg_bg": "Bulgarian", "bn_in": "Bengali",
    "bs_ba": "Bosnian", "ca_es": "Catalan", "cmn_hans_cn": "Chinese", "cs_cz": "Czech",
    "cy_gb": "Welsh", "da_dk": "Danish", "de_de": "German", "el_gr": "Greek",
    "en_us": "English", "es_419": "Spanish", "et_ee": "Estonian", "fa_ir": "Persian",
    "fi_fi": "Finnish", "fil_ph": "Tagalog", "fr_fr": "French", "gl_es": "Galician",
    "gu_in": "Gujarati", "ha_ng": "Hausa", "he_il": "Hebrew", "hi_in": "Hindi",
    "hr_hr": "Croatian", "hu_hu": "Hungarian", "hy_am": "Armenian", "id_id": "Indonesian",
    "is_is": "Icelandic", "it_it": "Italian", "ja_jp": "Japanese", "jv_id": "Javanese",
    "ka_ge": "Georgian", "kk_kz": "Kazakh", "km_kh": "Khmer", "kn_in": "Kannada",
    "ko_kr": "Korean", "lb_lu": "Luxembourgish", "ln_cd": "Lingala", "lo_la": "Lao",
    "lt_lt": "Lithuanian", "lv_lv": "Latvian", "mi_nz": "Maori", "mk_mk": "Macedonian",
    "ml_in": "Malayalam", "mn_mn": "Mongolian", "mr_in": "Marathi", "ms_my": "Malay",
    "mt_mt": "Maltese", "my_mm": "Myanmar", "nb_no": "Norwegian", "ne_np": "Nepali",
    "nl_nl": "Dutch", "oc_fr": "Occitan", "pa_in": "Punjabi", "pl_pl": "Polish",
    "ps_af": "Pashto", "pt_br": "Portuguese", "ro_ro": "Romanian", "ru_ru": "Russian",
    "sd_in": "Sindhi", "sk_sk": "Slovak", "sl_si": "Slovenian", "sn_zw": "Shona",
    "so_so": "Somali", "sr_rs": "Serbian", "sv_se": "Swedish", "sw_ke": "Swahili",
    "ta_in": "Tamil", "te_in": "Telugu", "tg_tj": "Tajik", "th_th": "Thai",
    "tr_tr": "Turkish", "uk_ua": "Ukrainian", "ur_pk": "Urdu", "uz_uz": "Uzbek",
    "vi_vn": "Vietnamese", "yo_ng": "Yoruba",
}

selection = widgets.Dropdown(
    options=[("Select language", None), ("----------", None)]
            + sorted([(f"{v} ({k})", k) for k, v in languages.items()]),
    value="cmn_hans_cn",
    description='Language:',
    disabled=False,
)

lang = selection.value
assert lang is not None, "Please select a language"
language = languages[lang]

audio = '2233.mp3'
transcriptions = []
translations = []

options = dict(language=language, beam_size=5, best_of=5)
transcribe_options = dict(task="transcribe", **options)
translate_options = dict(task="translate", **options)

transcription = model.transcribe(audio, **transcribe_options)["text"]
translation = model.transcribe(audio, **translate_options)["text"]

transcriptions.append(transcription)
translations.append(translation)

data = pd.DataFrame(dict(transcription=transcriptions, translation=translations))
print(data)
```

With the code above, you can transcribe and translate speech in different languages. For example:

| transcription | translation |
|---|---|
| 你可将此文本替换为所需的任何文本。你可在此文本框中编写或在此处粘贴你自己的文本。请尽情使用文本转语音功能。 | You can replace this document with any other document you need. You can write or paste your own documents in this document box. Please use the text translation feature. |

That concludes this introduction to the Whisper speech recognition model and its practical use. I hope you found it helpful.

Source: 图灵汇 · Author: 秦靳锦