How to Convert Speech to Text in Python

Learn how to convert speech to text in Python with modern 2026 STT tools: OpenAI gpt-4o-transcribe, Faster-Whisper, Groq Whisper, long-audio chunking, subtitles, and microphone recording.
9 min read · Updated May 2026 · Machine Learning · Application Programming Interfaces


Speech recognition, also called automatic speech recognition (ASR), is the process of converting spoken audio into human-readable text. In this updated tutorial, you will learn how to convert speech to text in Python using modern, reliable tools that are suitable for real applications in 2026.

The older approach in this article used the SpeechRecognition package with recognize_google(). That is still fine for quick experiments, but it is not the best default anymore: it depends on a free/unofficial endpoint, has practical limits, and does not give you modern features such as strong multilingual accuracy, robust long-audio handling, voice activity detection, or high-quality timestamps.

As of May 14, 2026, a better Python speech-to-text stack is:

  • OpenAI gpt-4o-transcribe: a high-accuracy hosted API option for production transcription.
  • OpenAI gpt-4o-mini-transcribe: a cheaper hosted option when cost matters more than maximum accuracy.
  • Faster-Whisper: a fast local/offline implementation of Whisper using CTranslate2, great when you want privacy or no per-minute API cost.
  • Groq Whisper: a very fast hosted Whisper API, useful when you want low latency and OpenAI-compatible transcription calls.
  • WhisperX: useful when you need word-level alignment or speaker diarization for podcasts, interviews, and meetings.

In this tutorial, we will focus on three practical Python solutions: OpenAI for best hosted accuracy, Faster-Whisper for local/offline transcription, and Groq for fast hosted Whisper transcription. We will also handle microphone recording, long audio files, and SRT subtitle output.

Learn also: How to Convert Text to Speech in Python.

Which Speech-to-Text Tool Should You Use?

Tool | Best for | Pros | Trade-offs
gpt-4o-transcribe | Production accuracy | Excellent accuracy, multilingual, simple API | Requires an API key and uploads audio to OpenAI
gpt-4o-mini-transcribe | Lower-cost API transcription | Cheaper and still strong for many use cases | May be less accurate than the full model
Faster-Whisper | Offline/local transcription | Private, fast, no per-minute API cost, supports VAD | Needs local CPU/GPU resources
Groq Whisper | Fast hosted Whisper transcription | Very low latency, OpenAI-compatible style | Hosted API, model choices depend on provider
WhisperX | Diarization and word timestamps | Speaker labels and better alignment | Heavier setup, often needs GPU/Hugging Face token for diarization

If you only want the easiest reliable solution, use gpt-4o-transcribe. If you need offline transcription or you cannot upload audio to a third party, use Faster-Whisper. If you want a fast hosted Whisper API, Groq is a good option.
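The rule of thumb above can be written down as a tiny helper. This is only a sketch: choose_engine and its two flags are hypothetical names introduced here, and the returned strings simply mirror the --engine values used by the CLI examples later in this tutorial.

```python
def choose_engine(must_stay_local: bool = False, needs_low_latency: bool = False) -> str:
    """Pick an engine name following the rule of thumb in this section."""
    if must_stay_local:
        # Audio cannot leave the machine, so only the local option qualifies.
        return "faster-whisper"
    if needs_low_latency:
        # A hosted Whisper endpoint optimized for speed.
        return "groq"
    # Default: the easiest reliable hosted option.
    return "openai"


print(choose_engine())                        # openai
print(choose_engine(must_stay_local=True))    # faster-whisper
print(choose_engine(needs_low_latency=True))  # groq
```

Privacy is checked first on purpose: if audio may not be uploaded, latency preferences do not matter.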

Installing the Dependencies

Create a virtual environment first:

python -m venv .venv
source .venv/bin/activate

On Windows PowerShell, activate it with:

.\.venv\Scripts\Activate.ps1

Install the Python packages:

pip install -U openai faster-whisper groq sounddevice scipy

You should also install FFmpeg, because it lets us convert MP3, MP4, M4A, WebM, and other formats to clean mono WAV audio when needed.

On Ubuntu/Debian:

sudo apt update
sudo apt install ffmpeg

On macOS:

brew install ffmpeg

On Windows, you can install FFmpeg with Chocolatey:

choco install ffmpeg

Method 1: Convert Speech to Text with OpenAI

This is the simplest production-ready option. Set your API key first:

export OPENAI_API_KEY="your-api-key-here"

On Windows PowerShell:

$env:OPENAI_API_KEY="your-api-key-here"

Now create a Python file called openai_transcribe.py:

from pathlib import Path
from openai import OpenAI

client = OpenAI()


def transcribe_with_openai(
    audio_path: str,
    model: str = "gpt-4o-transcribe",
    language: str | None = None,
    prompt: str | None = None,
) -> str:
    """Transcribe an audio file with OpenAI's speech-to-text API."""
    kwargs = {"model": model}
    if language:
        kwargs["language"] = language
    if prompt:
        kwargs["prompt"] = prompt

    with Path(audio_path).open("rb") as audio_file:
        transcript = client.audio.transcriptions.create(file=audio_file, **kwargs)

    return transcript.text


if __name__ == "__main__":
    text = transcribe_with_openai(
        "meeting.mp3",
        language="en",
        prompt="This is a technical meeting about Python, APIs, and machine learning.",
    )
    print(text)

Run it:

python openai_transcribe.py

The optional language parameter is useful when you already know the language. For example, use "en" for English, "fr" for French, "es" for Spanish, and so on. The optional prompt helps the model with names, acronyms, product names, or domain-specific vocabulary.
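If the vocabulary changes from recording to recording, it can help to build the prompt from a list of terms instead of hard-coding it. The helper below is a minimal sketch; build_context_prompt and its wording are assumptions made for illustration, not part of the OpenAI SDK.

```python
def build_context_prompt(topic: str, terms: list[str]) -> str:
    """Build a short context prompt naming the topic and the expected vocabulary."""
    # Drop empty entries and duplicates while keeping the original order.
    unique_terms = list(dict.fromkeys(t.strip() for t in terms if t.strip()))
    prompt = f"This is a recording about {topic}."
    if unique_terms:
        prompt += " Terms that may appear: " + ", ".join(unique_terms) + "."
    return prompt


prompt = build_context_prompt("Python APIs", ["CTranslate2", "Groq", "Groq", " "])
print(prompt)
# This is a recording about Python APIs. Terms that may appear: CTranslate2, Groq.
```

The resulting string can be passed as the prompt argument of transcribe_with_openai() shown above.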

If you want a cheaper model, change:

model="gpt-4o-transcribe"

to:

model="gpt-4o-mini-transcribe"

Method 2: Convert Speech to Text Locally with Faster-Whisper

Faster-Whisper is a fast Whisper implementation powered by CTranslate2. It is a great choice when you want offline transcription, more control, or better privacy.

Create a file called local_transcribe.py:

from faster_whisper import WhisperModel


def transcribe_locally(audio_path: str, language: str | None = None) -> str:
    """Transcribe audio locally using Faster-Whisper."""
    model = WhisperModel(
        "large-v3",
        device="cpu",       # use "cuda" if you have an NVIDIA GPU
        compute_type="int8" # use "float16" on CUDA for better speed
    )

    segments, info = model.transcribe(
        audio_path,
        beam_size=5,
        language=language,
        vad_filter=True,
        vad_parameters={"min_silence_duration_ms": 500},
    )

    print(f"Detected language: {info.language} ({info.language_probability:.2f})")
    return "".join(segment.text for segment in segments).strip()


if __name__ == "__main__":
    print(transcribe_locally("meeting.mp3", language="en"))

Run it:

python local_transcribe.py

If your machine is slow, start with a smaller model:

model = WhisperModel("small", device="cpu", compute_type="int8")

If you have a decent NVIDIA GPU, use:

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

For near-real-time or simply faster transcription, you can also try large-v3-turbo or a Faster-Whisper-compatible large-v3-turbo checkpoint from Hugging Face.
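The model and device choices above can be collected in one place. The sketch below is an assumption-laden helper: pick_whisper_settings is a hypothetical name, and looking for nvidia-smi on the PATH is only a rough heuristic for "an NVIDIA GPU is present", not a guarantee that CUDA works.

```python
import shutil
from typing import Optional


def pick_whisper_settings(has_cuda: Optional[bool] = None, slow_machine: bool = False) -> dict[str, str]:
    """Return model size, device, and compute type following the advice above."""
    if has_cuda is None:
        # Heuristic (assumption): nvidia-smi on PATH suggests an NVIDIA GPU.
        has_cuda = shutil.which("nvidia-smi") is not None
    if has_cuda:
        return {"model_size": "large-v3", "device": "cuda", "compute_type": "float16"}
    # On CPU, fall back to a smaller model when the machine is slow.
    model_size = "small" if slow_machine else "large-v3"
    return {"model_size": model_size, "device": "cpu", "compute_type": "int8"}


settings = pick_whisper_settings(has_cuda=False, slow_machine=True)
print(settings)
# {'model_size': 'small', 'device': 'cpu', 'compute_type': 'int8'}
```

The values can then be passed on as WhisperModel(settings["model_size"], device=settings["device"], compute_type=settings["compute_type"]).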

Method 3: Fast Hosted Whisper Transcription with Groq

Groq provides fast hosted Whisper models such as whisper-large-v3 and whisper-large-v3-turbo. First, set your Groq API key:

export GROQ_API_KEY="your-groq-api-key"

On Windows PowerShell:

$env:GROQ_API_KEY="your-groq-api-key"

Then use the Groq SDK:

import os
from pathlib import Path
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])


def transcribe_with_groq(audio_path: str, language: str | None = None) -> str:
    kwargs = {
        "model": "whisper-large-v3-turbo",
        "temperature": 0.0,
    }
    if language:
        kwargs["language"] = language

    with Path(audio_path).open("rb") as audio_file:
        transcript = client.audio.transcriptions.create(file=audio_file, **kwargs)

    return transcript.text


if __name__ == "__main__":
    print(transcribe_with_groq("meeting.mp3", language="en"))

Transcribing Long Audio Files

Hosted APIs usually have file-size limits, and long recordings can also be easier to retry if they are split into smaller chunks. A reliable approach is:

  1. Convert the input file to mono 16 kHz WAV with FFmpeg.
  2. Split the WAV into chunks, for example 10 minutes each.
  3. Transcribe each chunk.
  4. Join the partial transcripts.
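Step 1 can be sketched with a thin wrapper around the FFmpeg command line. The function names below are hypothetical, and the code assumes the ffmpeg binary is installed and on your PATH; the flags themselves (-ac for channels, -ar for sample rate, -c:a pcm_s16le for 16-bit PCM) are standard FFmpeg options.

```python
import subprocess
from pathlib import Path
from typing import Optional


def build_ffmpeg_command(input_path: str, output_path: str, sample_rate: int = 16_000) -> list[str]:
    """Build the FFmpeg command that converts any input to mono 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(input_path),    # input file (MP3, MP4, M4A, WebM, ...)
        "-vn",                    # ignore any video stream
        "-ac", "1",               # one audio channel (mono)
        "-ar", str(sample_rate),  # resample to 16 kHz by default
        "-c:a", "pcm_s16le",      # 16-bit little-endian PCM
        str(output_path),
    ]


def convert_to_wav(input_path: str, output_path: Optional[str] = None) -> Path:
    """Run FFmpeg and return the path of the converted WAV file."""
    source = Path(input_path)
    # Use a distinct name so a .wav input is never overwritten in place.
    target = Path(output_path) if output_path else source.with_name(f"{source.stem}_16k.wav")
    subprocess.run(build_ffmpeg_command(str(source), str(target)), check=True, capture_output=True)
    return target
```

With check=True, a failed conversion raises subprocess.CalledProcessError instead of silently producing an empty file.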

Here is the core chunking logic:

import wave
from pathlib import Path


def chunk_wav(input_wav: str, chunk_seconds: int = 600) -> list[Path]:
    """Split a WAV file into fixed-size chunks without loading it all into memory."""
    input_wav = Path(input_wav)
    output_dir = input_wav.parent / f"{input_wav.stem}_chunks"
    output_dir.mkdir(parents=True, exist_ok=True)

    chunks = []
    with wave.open(str(input_wav), "rb") as reader:
        params = reader.getparams()
        frames_per_chunk = int(params.framerate * chunk_seconds)
        index = 1

        while True:
            frames = reader.readframes(frames_per_chunk)
            if not frames:
                break

            chunk_path = output_dir / f"chunk_{index:04d}.wav"
            with wave.open(str(chunk_path), "wb") as writer:
                writer.setparams(params)
                writer.writeframes(frames)

            chunks.append(chunk_path)
            index += 1

    return chunks
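To pick a sensible chunk_seconds, it helps to estimate how large each chunk will be. Uncompressed PCM size is simply seconds × sample rate × channels × bytes per sample, plus a 44-byte WAV header. The sketch below assumes a 25 MB upload limit purely for illustration; check your provider's documentation for the real figure.

```python
WAV_HEADER_BYTES = 44  # canonical PCM WAV header, as written by Python's wave module


def pcm_wav_bytes(seconds: float, sample_rate: int = 16_000, channels: int = 1, sample_width: int = 2) -> int:
    """Estimate the size in bytes of an uncompressed PCM WAV file."""
    return WAV_HEADER_BYTES + int(seconds * sample_rate * channels * sample_width)


def max_chunk_seconds(limit_bytes: int, sample_rate: int = 16_000, channels: int = 1, sample_width: int = 2) -> int:
    """Longest whole-second chunk that still fits under limit_bytes."""
    bytes_per_second = sample_rate * channels * sample_width
    return (limit_bytes - WAV_HEADER_BYTES) // bytes_per_second


ASSUMED_LIMIT_BYTES = 25 * 1024 * 1024  # assumption for illustration, not a documented constant

chunk_bytes = pcm_wav_bytes(600)
print(chunk_bytes)                             # 19200044
print(chunk_bytes <= ASSUMED_LIMIT_BYTES)      # True
print(max_chunk_seconds(ASSUMED_LIMIT_BYTES))  # 819
```

So a 10-minute mono 16 kHz chunk is about 19.2 MB, which leaves some headroom under the assumed limit.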

The complete script at the end of this tutorial includes transcribe_large_file_with_openai(), which converts, chunks, transcribes, and joins the results automatically.
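Steps 3 and 4 can be sketched independently of any particular engine. The function below is not the author's transcribe_large_file_with_openai(); it is a hypothetical helper that accepts any callable taking a file path and returning text, so the same loop works with the OpenAI, Groq, or Faster-Whisper functions shown earlier.

```python
from pathlib import Path
from typing import Callable, Iterable


def transcribe_chunks(chunks: Iterable[Path], transcribe: Callable[[str], str]) -> str:
    """Transcribe each chunk in order and join the non-empty results with spaces."""
    parts = []
    for chunk in chunks:
        text = transcribe(str(chunk)).strip()
        if text:  # skip chunks that contain only silence
            parts.append(text)
    return " ".join(parts)


# Demo with a fake transcriber so the snippet runs without any API key.
fake_results = {
    "chunk_0001.wav": " Hello everyone.",
    "chunk_0002.wav": "",
    "chunk_0003.wav": "Let's begin. ",
}
joined = transcribe_chunks(
    [Path(name) for name in fake_results],
    lambda path: fake_results[Path(path).name],
)
print(joined)  # Hello everyone. Let's begin.
```

Keep in mind that fixed-size cuts can split a word at a chunk boundary, so the joined text may contain small glitches around the seams.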

Generating SRT Subtitles

Faster-Whisper returns timestamped segments, so we can easily write an SRT file:

from pathlib import Path


def seconds_to_srt_time(seconds: float) -> str:
    milliseconds = round(seconds * 1000)
    hours, remainder = divmod(milliseconds, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{millis:03}"


def write_srt(segments, output_path: str) -> None:
    lines = []
    for i, segment in enumerate(segments, start=1):
        lines.extend([
            str(i),
            f"{seconds_to_srt_time(segment.start)} --> {seconds_to_srt_time(segment.end)}",
            segment.text.strip(),
            "",
        ])
    Path(output_path).write_text("\n".join(lines), encoding="utf-8")
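One detail to watch: Faster-Whisper's transcribe() returns segments as a lazy generator, so it can only be iterated once. If you need both the plain transcript and the subtitles, materialize the segments into a list first. The sketch below uses a hypothetical FakeSegment stand-in and a build_srt helper introduced here for illustration, so it runs without a model; the time formatter is repeated from above to keep the snippet self-contained.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class FakeSegment:
    """Hypothetical stand-in with the same start/end/text fields used above."""
    start: float
    end: float
    text: str


def fake_transcribe() -> Iterator[FakeSegment]:
    """Yield segments lazily, like a generator-based transcriber."""
    yield FakeSegment(0.0, 2.5, " Hello and welcome.")
    yield FakeSegment(2.5, 5.04, " Let's get started.")


def seconds_to_srt_time(seconds: float) -> str:
    # Same helper as above, repeated so this snippet runs on its own.
    milliseconds = round(seconds * 1000)
    hours, remainder = divmod(milliseconds, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{millis:03}"


def build_srt(segments) -> str:
    """Format segments as SRT text: index, time range, text, blank line."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        time_range = f"{seconds_to_srt_time(seg.start)} --> {seconds_to_srt_time(seg.end)}"
        blocks.append(f"{i}\n{time_range}\n{seg.text.strip()}\n")
    return "\n".join(blocks)


segments = list(fake_transcribe())  # materialize once; a generator is exhausted after one pass
transcript = "".join(seg.text for seg in segments).strip()
srt_text = build_srt(segments)
print(transcript)  # Hello and welcome. Let's get started.
print(srt_text)
```

Without the list() call, the second pass over the segments would see nothing and the SRT output would be empty.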

Using the full script below, you can generate subtitles like this:

python speech_to_text_2026.py video.mp4 --engine faster-whisper --model large-v3 --srt captions.srt

Recording from the Microphone

If you want to record from your microphone and then transcribe the recording, use sounddevice and scipy:

from pathlib import Path
import sounddevice as sd
from scipy.io.wavfile import write


def record_microphone(output_path: str = "microphone.wav", seconds: int = 8, sample_rate: int = 16_000) -> Path:
    print(f"Recording for {seconds} seconds...")
    audio = sd.rec(int(seconds * sample_rate), samplerate=sample_rate, channels=1, dtype="int16")
    sd.wait()
    write(output_path, sample_rate, audio)
    return Path(output_path)
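Before sending a recording to a transcription engine, it can be worth checking that the file really has the expected duration and format. The sketch below uses only the standard-library wave module; wav_info and write_silence are hypothetical helper names, and the demo writes one second of silence to a temporary file so it runs without a microphone.

```python
import tempfile
import wave
from pathlib import Path


def wav_info(path: str) -> dict:
    """Read channel count, sample rate, and duration from a PCM WAV file."""
    with wave.open(path, "rb") as reader:
        rate = reader.getframerate()
        return {
            "channels": reader.getnchannels(),
            "sample_rate": rate,
            "seconds": reader.getnframes() / rate,
        }


def write_silence(path: str, seconds: float = 1.0, sample_rate: int = 16_000) -> None:
    """Write a mono 16-bit WAV file containing only silence (for the demo)."""
    with wave.open(path, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        writer.setframerate(sample_rate)
        writer.writeframes(b"\x00\x00" * int(seconds * sample_rate))


with tempfile.TemporaryDirectory() as tmp:
    demo_path = str(Path(tmp) / "silence.wav")
    write_silence(demo_path, seconds=1.0)
    info = wav_info(demo_path)

print(info)  # {'channels': 1, 'sample_rate': 16000, 'seconds': 1.0}
```

A duration far shorter than requested usually points to a device or permission problem rather than a transcription problem.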

With the complete script, record and transcribe 8 seconds of microphone audio using OpenAI:

python speech_to_text_2026.py --record 8 --engine openai --language en

Or record and transcribe locally:

python speech_to_text_2026.py --record 8 --engine faster-whisper --model small --language en

Complete CLI Usage

The full code section contains a complete script named speech_to_text_2026.py. Here are some examples:

# Best hosted accuracy
python speech_to_text_2026.py meeting.mp3 --engine openai --language en

# Cheaper OpenAI transcription
python speech_to_text_2026.py meeting.mp3 --engine openai --model gpt-4o-mini-transcribe --language en

# Long file with OpenAI chunking
python speech_to_text_2026.py long_meeting.mp3 --engine openai --long --chunk-seconds 600 --language en

# Local/offline transcription
python speech_to_text_2026.py meeting.mp3 --engine faster-whisper --model large-v3 --language en

# Local transcription with SRT subtitles
python speech_to_text_2026.py video.mp4 --engine faster-whisper --model large-v3 --srt captions.srt

# Fast hosted Whisper transcription
python speech_to_text_2026.py meeting.mp3 --engine groq --language en
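For readers who want to build a similar command-line interface themselves, the flags used in these examples map naturally onto argparse. The parser below is a reconstruction inferred from the commands shown above, not the author's actual script; the defaults and help texts are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser accepting the flags used in the examples above."""
    parser = argparse.ArgumentParser(
        prog="speech_to_text_2026.py",
        description="Transcribe audio with OpenAI, Faster-Whisper, or Groq.",
    )
    parser.add_argument("audio", nargs="?", help="input audio or video file")
    parser.add_argument("--engine", choices=["openai", "faster-whisper", "groq"], default="openai")
    parser.add_argument("--model", default=None, help="model name; engine default if omitted")
    parser.add_argument("--language", default=None, help='language hint such as "en"')
    parser.add_argument("--long", action="store_true", help="chunk the file before transcribing")
    parser.add_argument("--chunk-seconds", type=int, default=600, help="chunk length for --long")
    parser.add_argument("--srt", metavar="PATH", help="also write subtitles to this file")
    parser.add_argument("--record", type=int, metavar="SECONDS", help="record from the microphone first")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse arguments and require either an input file or --record."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.audio is None and args.record is None:
        parser.error("provide an audio file or --record SECONDS")
    return args


args = parse_args(["long_meeting.mp3", "--engine", "openai", "--long", "--chunk-seconds", "600", "--language", "en"])
print(args.engine, args.long, args.chunk_seconds, args.language)  # openai True 600 en
```

Making the audio argument optional is what allows the --record form, where no input file is given.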

Improving Transcription Accuracy

  • Use a language hint when possible, such as language="en".
  • Use a context prompt for product names, acronyms, people's names, and technical vocabulary.
  • Convert noisy audio to mono 16 kHz WAV before transcription.
  • Use VAD when transcribing locally to reduce silence-related hallucinations.
  • Use a larger model for difficult audio. For Faster-Whisper, large-v3 is usually more accurate than small, at the cost of speed and memory.
  • Use diarization when you need speaker labels. For that, look at WhisperX or a diarization-capable hosted API.

Conclusion

For modern Python speech-to-text applications, you no longer need to rely on the old SpeechRecognition demo-style workflow. If you want a simple hosted API, use OpenAI's gpt-4o-transcribe or gpt-4o-mini-transcribe. If you want local and private transcription, use Faster-Whisper. If you want a very fast hosted Whisper endpoint, Groq is also a strong option.

The complete script below gives you a practical CLI that supports hosted transcription, local transcription, microphone recording, long-audio chunking, and SRT subtitle generation.

Read also: Speech Recognition using Transformers in Python.

Happy Coding ♥
