Speech recognition, also called automatic speech recognition (ASR), is the process of converting spoken audio into human-readable text. In this updated tutorial, you will learn how to convert speech to text in Python using modern, reliable tools that are suitable for real applications in 2026.
The older approach in this article used the SpeechRecognition package with recognize_google(). That is still fine for quick experiments, but it is not the best default anymore: it depends on a free/unofficial endpoint, has practical limits, and does not give you modern features such as strong multilingual accuracy, robust long-audio handling, voice activity detection, or high-quality timestamps.
As of May 14, 2026, a better Python speech-to-text stack is:
- gpt-4o-transcribe: a high-accuracy hosted API option for production transcription.
- gpt-4o-mini-transcribe: a cheaper hosted option when cost matters more than maximum accuracy.
In this tutorial, we will focus on three practical Python solutions: OpenAI for best hosted accuracy, Faster-Whisper for local/offline transcription, and Groq for fast hosted Whisper transcription. We will also handle microphone recording, long audio files, and SRT subtitle output.
Learn also: How to Convert Text to Speech in Python.
| Tool | Best for | Pros | Trade-offs |
|---|---|---|---|
| gpt-4o-transcribe | Production accuracy | Excellent accuracy, multilingual, simple API | Requires an API key and uploads audio to OpenAI |
| gpt-4o-mini-transcribe | Lower-cost API transcription | Cheaper and still strong for many use cases | May be less accurate than the full model |
| Faster-Whisper | Offline/local transcription | Private, fast, no per-minute API cost, supports VAD | Needs local CPU/GPU resources |
| Groq Whisper | Fast hosted Whisper transcription | Very low latency, OpenAI-compatible style | Hosted API, model choices depend on provider |
| WhisperX | Diarization and word timestamps | Speaker labels and better alignment | Heavier setup, often needs GPU/Hugging Face token for diarization |
If you only want the easiest reliable solution, use gpt-4o-transcribe. If you need offline transcription or you cannot upload audio to a third party, use Faster-Whisper. If you want a fast hosted Whisper API, Groq is a good option.
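The rules of thumb above can be written down as a tiny helper. This is only a sketch: choose_engine() and its two flags are hypothetical names introduced here for illustration, and the returned strings simply mirror the engine names used in the command-line examples later in this tutorial.

```python
def choose_engine(needs_offline: bool = False, wants_low_latency: bool = False) -> str:
    """Map the rules of thumb to an engine name (hypothetical helper)."""
    if needs_offline:
        # Audio must stay on this machine, so only the local option qualifies.
        return "faster-whisper"
    if wants_low_latency:
        # A fast hosted Whisper endpoint.
        return "groq"
    # Default: the easiest reliable hosted option.
    return "openai"


if __name__ == "__main__":
    print(choose_engine())
    print(choose_engine(needs_offline=True))
    print(choose_engine(wants_low_latency=True))
```

The offline check comes first on purpose: a privacy requirement should win over any preference for speed.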
Create a virtual environment first:
python -m venv .venv
source .venv/bin/activate
On Windows PowerShell, activate it with:
.\.venv\Scripts\Activate.ps1
Install the Python packages:
pip install -U openai faster-whisper groq sounddevice scipy
You should also install FFmpeg, because it lets us convert MP3, MP4, M4A, WebM, and other formats to clean mono WAV audio when needed.
On Ubuntu/Debian:
sudo apt update
sudo apt install ffmpeg
On macOS:
brew install ffmpeg
On Windows, you can install FFmpeg with Chocolatey:
choco install ffmpeg
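Once FFmpeg is installed, the conversion to clean mono WAV can be driven from Python. The sketch below is one possible way to do it, not necessarily how the full script does it: build_ffmpeg_command() and convert_to_wav() are hypothetical helper names, and the 16 kHz sample rate and the "_mono.wav" output suffix are assumptions you can change.

```python
import shutil
import subprocess
from pathlib import Path


def build_ffmpeg_command(source: str, target: str, sample_rate: int = 16_000) -> list[str]:
    """Build an FFmpeg command that converts any input to mono 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it already exists
        "-i", source,             # input file (MP3, MP4, M4A, WebM, ...)
        "-vn",                    # ignore any video stream
        "-ac", "1",               # one audio channel (mono)
        "-ar", str(sample_rate),  # resample to the requested rate
        "-c:a", "pcm_s16le",      # 16-bit PCM, readable by Python's wave module
        target,
    ]


def convert_to_wav(source: str, sample_rate: int = 16_000) -> Path:
    """Convert `source` to a mono WAV file next to it and return the new path."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("FFmpeg was not found on PATH")
    source_path = Path(source)
    # Assumed naming scheme: meeting.mp3 -> meeting_mono.wav (never overwrites the input).
    target = source_path.with_name(f"{source_path.stem}_mono.wav")
    subprocess.run(build_ffmpeg_command(source, str(target), sample_rate), check=True)
    return target
```

Keeping the command builder separate from the function that runs it makes the flags easy to inspect or log before FFmpeg is actually invoked.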
Using OpenAI's hosted API is the simplest production-ready option. Set your API key first:
export OPENAI_API_KEY="your-api-key-here"
On Windows PowerShell:
$env:OPENAI_API_KEY="your-api-key-here"
Now create a Python file called openai_transcribe.py:
from pathlib import Path

from openai import OpenAI

client = OpenAI()


def transcribe_with_openai(
    audio_path: str,
    model: str = "gpt-4o-transcribe",
    language: str | None = None,
    prompt: str | None = None,
) -> str:
    """Transcribe an audio file with OpenAI's speech-to-text API."""
    kwargs = {"model": model}
    if language:
        kwargs["language"] = language
    if prompt:
        kwargs["prompt"] = prompt
    with Path(audio_path).open("rb") as audio_file:
        transcript = client.audio.transcriptions.create(file=audio_file, **kwargs)
    return transcript.text


if __name__ == "__main__":
    text = transcribe_with_openai(
        "meeting.mp3",
        language="en",
        prompt="This is a technical meeting about Python, APIs, and machine learning.",
    )
    print(text)
Run it:
python openai_transcribe.py
The optional language parameter is useful when you already know the language. For example, use "en" for English, "fr" for French, "es" for Spanish, and so on. The optional prompt helps the model with names, acronyms, product names, or domain-specific vocabulary.
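If you keep a glossary of names and acronyms, you can build the prompt from it instead of writing it by hand. The helper below is a minimal sketch: build_vocabulary_prompt() is a hypothetical name, and the "Terms that may appear" wording is just one reasonable phrasing, not a format required by the API.

```python
def build_vocabulary_prompt(terms: list[str], context: str = "") -> str:
    """Combine optional context with a de-duplicated list of expected terms."""
    seen = set()
    unique_terms = []
    for term in terms:
        cleaned = term.strip()
        key = cleaned.lower()
        # Skip blanks and case-insensitive duplicates, keeping the first spelling.
        if cleaned and key not in seen:
            seen.add(key)
            unique_terms.append(cleaned)

    parts = []
    if context.strip():
        parts.append(context.strip())
    if unique_terms:
        parts.append("Terms that may appear: " + ", ".join(unique_terms) + ".")
    return " ".join(parts)


if __name__ == "__main__":
    prompt = build_vocabulary_prompt(
        ["CTranslate2", "Faster-Whisper", "ctranslate2", " SRT "],
        context="This is a technical meeting about Python.",
    )
    print(prompt)
```

The resulting string can be passed as the prompt argument of a transcription call. Short, relevant prompts are usually a safer starting point than long ones.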
If you want a cheaper model, change:
model="gpt-4o-transcribe"
to:
model="gpt-4o-mini-transcribe"
Faster-Whisper is a fast Whisper implementation powered by CTranslate2. It is a great choice when you want offline transcription, more control, or better privacy.
Create a file called local_transcribe.py:
from faster_whisper import WhisperModel


def transcribe_locally(audio_path: str, language: str | None = None) -> str:
    """Transcribe audio locally using Faster-Whisper."""
    model = WhisperModel(
        "large-v3",
        device="cpu",  # use "cuda" if you have an NVIDIA GPU
        compute_type="int8",  # use "float16" on CUDA for better speed
    )
    segments, info = model.transcribe(
        audio_path,
        beam_size=5,
        language=language,
        vad_filter=True,
        vad_parameters={"min_silence_duration_ms": 500},
    )
    print(f"Detected language: {info.language} ({info.language_probability:.2f})")
    return "".join(segment.text for segment in segments).strip()


if __name__ == "__main__":
    print(transcribe_locally("meeting.mp3", language="en"))
Run it:
python local_transcribe.py
If your machine is slow, start with a smaller model:
model = WhisperModel("small", device="cpu", compute_type="int8")
If you have a decent NVIDIA GPU, use:
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
For real-time-ish or faster transcription, you can also try large-v3-turbo or a Faster-Whisper-compatible large-v3-turbo checkpoint from Hugging Face.
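The CPU and GPU settings above can also be picked at runtime. The sketch below assumes that ctranslate2 (the backend Faster-Whisper is built on) is importable and exposes get_cuda_device_count(). The helper names and the idea of falling back to the small model on CPU are illustrative choices, not part of Faster-Whisper itself.

```python
def pick_whisper_settings(cuda_devices: int) -> dict[str, str]:
    """Return model, device, and compute type based on how many GPUs are visible."""
    if cuda_devices > 0:
        return {"model": "large-v3", "device": "cuda", "compute_type": "float16"}
    # No GPU: fall back to the lighter CPU configuration shown above.
    return {"model": "small", "device": "cpu", "compute_type": "int8"}


def count_cuda_devices() -> int:
    """Ask CTranslate2 how many CUDA devices it can see (0 if it is not installed)."""
    try:
        import ctranslate2  # assumption: installed alongside faster-whisper
    except ImportError:
        return 0
    return ctranslate2.get_cuda_device_count()


if __name__ == "__main__":
    settings = pick_whisper_settings(count_cuda_devices())
    print(settings)
```

You would then create the model with WhisperModel(settings["model"], device=settings["device"], compute_type=settings["compute_type"]).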
Groq provides fast hosted Whisper models such as whisper-large-v3 and whisper-large-v3-turbo. First, set your Groq API key:
export GROQ_API_KEY="your-groq-api-key"
On Windows PowerShell:
$env:GROQ_API_KEY="your-groq-api-key"
Then use the Groq SDK:
import os
from pathlib import Path

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])


def transcribe_with_groq(audio_path: str, language: str | None = None) -> str:
    kwargs = {
        "model": "whisper-large-v3-turbo",
        "temperature": 0.0,
    }
    if language:
        kwargs["language"] = language
    with Path(audio_path).open("rb") as audio_file:
        transcript = client.audio.transcriptions.create(file=audio_file, **kwargs)
    return transcript.text


print(transcribe_with_groq("meeting.mp3", language="en"))
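Hosted APIs usually have file-size limits, and long recordings are also easier to retry when they are split into smaller chunks. A reliable approach is to convert the audio to mono WAV with FFmpeg, split it into fixed-size chunks, transcribe each chunk, and join the results.
Here is the core chunking logic: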
import wave
from pathlib import Path


def chunk_wav(input_wav: str, chunk_seconds: int = 600) -> list[Path]:
    """Split a WAV file into fixed-size chunks without loading it all into memory."""
    input_wav = Path(input_wav)
    output_dir = input_wav.parent / f"{input_wav.stem}_chunks"
    output_dir.mkdir(parents=True, exist_ok=True)
    chunks = []
    with wave.open(str(input_wav), "rb") as reader:
        params = reader.getparams()
        frames_per_chunk = int(params.framerate * chunk_seconds)
        index = 1
        while True:
            frames = reader.readframes(frames_per_chunk)
            if not frames:
                break
            chunk_path = output_dir / f"chunk_{index:04d}.wav"
            with wave.open(str(chunk_path), "wb") as writer:
                writer.setparams(params)
                writer.writeframes(frames)
            chunks.append(chunk_path)
            index += 1
    return chunks
The complete script at the end of this tutorial includes transcribe_large_file_with_openai(), which converts, chunks, transcribes, and joins the results automatically.
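To make the "transcribe each chunk and join" step concrete, here is a minimal sketch that works with any engine. It is not the transcribe_large_file_with_openai() function from the full script: transcribe_chunks() is a hypothetical helper, it accepts any callable that maps a file path to text, and the retry count and backoff delays are assumptions.

```python
import time
from pathlib import Path
from typing import Callable


def transcribe_chunks(
    chunk_paths: list[Path],
    transcribe: Callable[[str], str],
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> str:
    """Transcribe chunks in order, retrying failed chunks, and join the text."""
    texts = []
    for chunk_path in chunk_paths:
        for attempt in range(1, max_attempts + 1):
            try:
                texts.append(transcribe(str(chunk_path)).strip())
                break  # this chunk succeeded, move on to the next one
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the last allowed attempt
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                time.sleep(base_delay * 2 ** (attempt - 1))
    # Skip empty results (for example, a chunk that contained only silence).
    return " ".join(text for text in texts if text)
```

Because only the failed chunk is retried, one network error does not force you to upload the whole recording again. Keep in mind that fixed-size chunks can cut a word in half at a boundary, so a few words near the joins may be wrong.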
Faster-Whisper returns timestamped segments, so we can easily write an SRT file:
from pathlib import Path


def seconds_to_srt_time(seconds: float) -> str:
    milliseconds = round(seconds * 1000)
    hours, remainder = divmod(milliseconds, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{millis:03}"


def write_srt(segments, output_path: str) -> None:
    lines = []
    for i, segment in enumerate(segments, start=1):
        lines.extend([
            str(i),
            f"{seconds_to_srt_time(segment.start)} --> {seconds_to_srt_time(segment.end)}",
            segment.text.strip(),
            "",
        ])
    Path(output_path).write_text("\n".join(lines), encoding="utf-8")
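One detail is easy to miss: Faster-Whisper returns segments as a generator, so it can only be iterated once. If you want both the plain transcript and an SRT file, convert it to a list first. The sketch below shows this with hand-made segments, so it runs without a model. FakeSegment and fake_transcribe() are stand-ins invented for this example, and the timestamp helper is repeated so that the block runs on its own.

```python
from dataclasses import dataclass


@dataclass
class FakeSegment:
    """Stand-in with the three attributes the SRT writer reads."""
    start: float
    end: float
    text: str


def seconds_to_srt_time(seconds: float) -> str:
    # Same arithmetic as the helper above, repeated so this block is self-contained.
    milliseconds = round(seconds * 1000)
    hours, remainder = divmod(milliseconds, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{millis:03}"


def fake_transcribe():
    """Yield segments lazily, the way a generator-based API does."""
    yield FakeSegment(0.0, 2.5, " Hello and welcome.")
    yield FakeSegment(3661.5, 3663.0, " One hour in.")


if __name__ == "__main__":
    segments = list(fake_transcribe())  # materialize once, reuse twice
    print("".join(s.text for s in segments).strip())
    for index, s in enumerate(segments, start=1):
        print(index)
        print(f"{seconds_to_srt_time(s.start)} --> {seconds_to_srt_time(s.end)}")
        print(s.text.strip())
        print()
```

For the second segment, 3661.5 seconds is formatted as 01:01:01,500, which is a quick way to confirm that hours, minutes, seconds, and milliseconds all roll over correctly.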
Using the full script below, you can generate subtitles like this:
python speech_to_text_2026.py video.mp4 --engine faster-whisper --model large-v3 --srt captions.srt
If you want to record from your microphone and then transcribe the recording, use sounddevice and scipy:
from pathlib import Path

import sounddevice as sd
from scipy.io.wavfile import write


def record_microphone(output_path: str = "microphone.wav", seconds: int = 8, sample_rate: int = 16_000) -> Path:
    print(f"Recording for {seconds} seconds...")
    audio = sd.rec(int(seconds * sample_rate), samplerate=sample_rate, channels=1, dtype="int16")
    sd.wait()
    write(output_path, sample_rate, audio)
    return Path(output_path)
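A muted or wrongly selected microphone produces a valid but silent WAV file, and sending that to a transcription engine wastes time or API credits. As an optional safeguard, you can measure the loudness of the saved file first. This is a sketch using only the standard library: wav_rms_level() and is_mostly_silent() are hypothetical helpers, and the 0.01 threshold is an assumption you should tune for your microphone.

```python
import array
import math
import sys
import wave


def wav_rms_level(wav_path: str) -> float:
    """Return the RMS level of a 16-bit PCM WAV file on a 0.0 to 1.0 scale."""
    with wave.open(wav_path, "rb") as reader:
        if reader.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        frames = reader.readframes(reader.getnframes())
    samples = array.array("h", frames)  # "h" = signed 16-bit integers
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    if not samples:
        return 0.0
    mean_square = sum(s * s for s in samples) / len(samples)
    return math.sqrt(mean_square) / 32768.0


def is_mostly_silent(wav_path: str, threshold: float = 0.01) -> bool:
    """Return True when the recording is quieter than the (assumed) threshold."""
    return wav_rms_level(wav_path) < threshold
```

For example, you could call is_mostly_silent("microphone.wav") right after recording and ask the user to try again when it returns True.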
With the complete script, record and transcribe 8 seconds of microphone audio using OpenAI:
python speech_to_text_2026.py --record 8 --engine openai --language en
Or record and transcribe locally:
python speech_to_text_2026.py --record 8 --engine faster-whisper --model small --language en
The full code section contains a complete script named speech_to_text_2026.py. Here are some examples:
# Best hosted accuracy
python speech_to_text_2026.py meeting.mp3 --engine openai --language en
# Cheaper OpenAI transcription
python speech_to_text_2026.py meeting.mp3 --engine openai --model gpt-4o-mini-transcribe --language en
# Long file with OpenAI chunking
python speech_to_text_2026.py long_meeting.mp3 --engine openai --long --chunk-seconds 600 --language en
# Local/offline transcription
python speech_to_text_2026.py meeting.mp3 --engine faster-whisper --model large-v3 --language en
# Local transcription with SRT subtitles
python speech_to_text_2026.py video.mp4 --engine faster-whisper --model large-v3 --srt captions.srt
# Fast hosted Whisper transcription
python speech_to_text_2026.py meeting.mp3 --engine groq --language en
- If you already know the spoken language, pass it explicitly, for example language="en".
- For accuracy, large-v3 is usually better than small.
For modern Python speech-to-text applications, you no longer need to rely on the old SpeechRecognition demo-style workflow. If you want a simple hosted API, use OpenAI's gpt-4o-transcribe or gpt-4o-mini-transcribe. If you want local and private transcription, use Faster-Whisper. If you want a very fast hosted Whisper endpoint, Groq is also a strong option.
The complete script below gives you a practical CLI that supports hosted transcription, local transcription, microphone recording, long-audio chunking, and SRT subtitle generation.
Read also: Speech Recognition using Transformers in Python.
Happy Coding ♥