Pros:
- General-purpose speech recognition: Whisper is a versatile speech recognition model that can handle various speech processing tasks, including multilingual speech recognition, speech translation, and language identification.
- Large-scale training: Whisper is trained on a large dataset of diverse audio, which helps improve its accuracy and robustness.
- Multitasking capability: Whisper is a multitasking model, meaning it can perform multiple speech processing tasks using a single model, eliminating the need for separate models or stages in a traditional speech-processing pipeline.
- Available pre-trained models: OpenAI provides pre-trained models of different sizes, offering a range of speed and accuracy trade-offs, allowing users to choose the model that best suits their specific needs.
- Python and command-line interfaces: Whisper can be used both from the command line and within Python code, providing flexibility in how it is integrated into different applications.
Cons:
- Resource-intensive: The larger models in Whisper require significant computational resources, including memory (VRAM) and processing power, which may pose challenges for users with limited resources.
- External dependencies: Whisper relies on external tools and libraries such as ffmpeg and PyTorch, which need to be installed and configured correctly for proper functioning. This may introduce additional complexity during setup and installation.
- Language-specific models: Whisper ships both English-only and multilingual variants; the English-only models tend to perform better on English audio, so choosing the wrong variant (for example, running an English-only model on non-English speech) can noticeably degrade transcription accuracy.
- Model customization: While Whisper provides pre-trained models, fine-tuning or customizing the models for specific domains or tasks may require additional expertise and resources.
To use Whisper, you first need to set up your environment:
- Install a supported Python version (3.8-3.11 is expected to be compatible) and PyTorch (version 1.10.1 was used to train and test the models, but recent versions should also work).
- Install the Whisper package and its dependencies using one of the following commands:
- To install or update to the latest release of Whisper, use:

```shell
pip install -U openai-whisper
```

- To pull and install the latest commit from the Whisper GitHub repository along with its Python dependencies, use:

```shell
pip install git+https://github.com/openai/whisper.git
```

- To update the package to the latest version from the repository, use:

```shell
pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git
```
- Install the command-line tool ffmpeg on your system. This can be done via most package managers. For example, on Ubuntu or Debian, you can use:

```shell
sudo apt update && sudo apt install ffmpeg
```
- You may also need to install rust if tiktoken (a Python tokenizer package the codebase depends on) does not provide a pre-built wheel for your platform. If you see installation errors during the pip install step, follow the instructions to install the Rust development environment. You may also need to configure the PATH environment variable (e.g. `export PATH="$HOME/.cargo/bin:$PATH"`) and install setuptools_rust (`pip install setuptools-rust`) if necessary.
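Once the steps above are done, a quick environment check can save debugging time later. The helper below is a convenience sketch, not part of Whisper itself; it verifies that ffmpeg is on the PATH and that the whisper package imports:

```python
# Convenience sketch (not part of the Whisper API): report whether the
# two main prerequisites -- the ffmpeg binary and the whisper package --
# are present on this machine.
import shutil
import importlib.util

def check_environment() -> dict:
    """Return a dict mapping each prerequisite to True/False."""
    return {
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "whisper": importlib.util.find_spec("whisper") is not None,
    }

print(check_environment())
```

If either value comes back False, revisit the corresponding installation step above.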
There are five model sizes for Whisper, with four offering English-only versions. These models offer various speed and accuracy trade-offs. The available models, their approximate memory requirements, and relative speeds are:
- tiny / tiny.en: ~1 GB VRAM, ~32x relative speed
- base / base.en: ~1 GB VRAM, ~16x relative speed
- small / small.en: ~2 GB VRAM, ~6x relative speed
- medium / medium.en: ~5 GB VRAM, ~2x relative speed
- large: ~10 GB VRAM, 1x relative speed (no English-only version)
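Picking a size is mostly a VRAM/accuracy trade-off. As an illustration only (this helper is hypothetical, not part of the Whisper API), one might select the largest model that fits a given memory budget, using the approximate figures listed above:

```python
# Hypothetical helper: choose the largest Whisper model that fits a
# VRAM budget, based on the approximate requirements listed above.
MODEL_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]

def pick_model(vram_gb: float, english_only: bool = False) -> str:
    """Return the name of the largest model fitting within vram_gb."""
    choice = None
    for name, need in MODEL_VRAM_GB:
        if need <= vram_gb:
            choice = name
    if choice is None:
        raise ValueError(f"No Whisper model fits in {vram_gb} GB of VRAM")
    # Only the four smaller sizes have English-only (.en) variants.
    if english_only and choice != "large":
        choice += ".en"
    return choice

print(pick_model(6))                      # medium
print(pick_model(3, english_only=True))   # small.en
```

The returned name can be passed directly to whisper.load_model or the CLI's --model flag.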
The .en models are designed for English-only applications and tend to perform better, especially the tiny.en and base.en models.
To use Whisper on the command line, you can use the following command to transcribe speech in audio files using the medium model:

```shell
whisper audio.flac audio.mp3 audio.wav --model medium
```

To transcribe an audio file containing non-English speech, you can specify the language using the --language option:

```shell
whisper japanese.wav --language Japanese
```

You can also add --task translate to translate the speech into English:

```shell
whisper japanese.wav --language Japanese --task translate
```
To view all available options, you can run whisper --help.
If you want to use Whisper in Python, you can use the following code to transcribe an audio file:

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
```
In the Python code above, the load_model function is used to load the model, and the transcribe method is used to transcribe the audio file.
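The CLI's --language and --task flags have Python equivalents: transcribe accepts decoding options as keyword arguments. A minimal sketch, assuming openai-whisper is installed and with japanese.wav as a placeholder path:

```python
# Sketch: Python equivalent of the CLI's --language and --task flags.
# Extra keyword arguments to transcribe() are forwarded as decoding options.
import whisper

model = whisper.load_model("base")
result = model.transcribe("japanese.wav", language="Japanese", task="translate")
print(result["text"])      # translated (English) text
print(result["language"])  # source language recorded in the result
```

Leaving language unset makes Whisper detect the spoken language automatically from the first portion of the audio.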