How to Convert PDF to Docx in Python

Learn how to use the pdf2docx library to convert PDF files to Word (.docx) files in Python.
3 min read · Updated Jun 2023 · PDF File Handling

In this tutorial, we will use the pdf2docx library to convert PDF files to the docx format.

The goal is to build a lightweight command-line utility that converts a single PDF file, or a collection of PDF files located in a folder, using only Python modules and no external tools outside the Python ecosystem.

pdf2docx is a Python library that extracts data from PDF files with PyMuPDF, parses the layout with rules, and generates the docx file with python-docx. python-docx is a separate library for creating and updating Microsoft Word (.docx) files.

First, install the required library:

$ pip install pdf2docx==0.5.1

Let's start by importing the modules:

# Import Libraries
from pdf2docx import parse
from typing import Tuple

Let's define the function responsible for converting PDF to Docx:

def convert_pdf2docx(input_file: str, output_file: str, pages: Tuple = None):
    """Converts pdf to docx"""
    if pages:
        # Accept integers or numeric strings; silently skip anything else
        pages = [int(i) for i in pages if str(i).isnumeric()]
    result = parse(pdf_file=input_file,
                   docx_with_path=output_file, pages=pages)
    summary = {
        "File": input_file, "Pages": str(pages), "Output File": output_file
    }
    # Printing Summary
    print("## Summary ########################################################")
    print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
    print("###################################################################")
    return result

The convert_pdf2docx() function converts a PDF file into a docx file, optionally limited to the pages you specify, and prints a summary of the conversion at the end.
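The page filtering can be tried on its own, without a PDF at hand. The sketch below uses a hypothetical helper, normalize_pages (it is not part of pdf2docx), and assumes that page selections may arrive either as integers or as numeric strings:

```python
from typing import Iterable, List, Optional, Union


def normalize_pages(pages: Optional[Iterable[Union[int, str]]]) -> Optional[List[int]]:
    """Hypothetical helper: turn a mixed page selection into a list of ints.

    Returns None when nothing is selected, which the tutorial's function
    passes on unchanged so that no page restriction is applied.
    """
    if not pages:
        return None
    # str(i) lets the same check handle both 3 and "3"; other values are dropped
    return [int(i) for i in pages if str(i).isnumeric()]


print(normalize_pages(("0", "2", "x")))  # [0, 2]
print(normalize_pages((1, 3)))           # [1, 3]
print(normalize_pages(None))             # None
```

Non-numeric entries are skipped rather than reported, so a typo in the page list goes unnoticed. Raising an error instead would be a reasonable alternative.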

Let's use it now:

if __name__ == "__main__":
    import sys
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    convert_pdf2docx(input_file, output_file)

We use Python's built-in sys module to read the input and output file names from the command-line arguments. Let's convert a sample PDF file:

$ python convert_pdf2docx.py letter.pdf letter.docx

A new letter.docx file will appear in the current directory, and the output will look like this:

Parsing Page 1: 1/1...
Creating Page 1: 1/1...
--------------------------------------------------
Terminated in 0.10869679999999998s.
## Summary ########################################################
File:letter.pdf
Pages:None
Output File:letter.docx
###################################################################
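The goal stated at the top also mentions a folder of PDF files. One possible way to cover that case is sketched below. The helper names plan_folder_conversion and convert_folder are hypothetical, and the converter is passed in as a callable so the folder logic stays independent of pdf2docx. The sketch assumes each docx should be written next to its PDF and that subfolders are not searched:

```python
from pathlib import Path
from typing import Callable, List, Tuple


def plan_folder_conversion(folder: str) -> List[Tuple[Path, Path]]:
    """Pair every PDF directly inside `folder` with a .docx path next to it."""
    pdf_files = sorted(p for p in Path(folder).iterdir()
                       if p.is_file() and p.suffix.lower() == ".pdf")
    return [(pdf, pdf.with_suffix(".docx")) for pdf in pdf_files]


def convert_folder(folder: str, converter: Callable[[str, str], object]) -> int:
    """Run `converter(input_path, output_path)` for each PDF; return the count."""
    jobs = plan_folder_conversion(folder)
    for pdf, docx in jobs:
        converter(str(pdf), str(docx))
    return len(jobs)


# Usage, assuming convert_pdf2docx() from this tutorial is in scope:
# convert_folder("path/to/pdfs", convert_pdf2docx)
```

Passing the converter as an argument also makes the folder logic easy to try with a stand-in function before running real conversions.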

You can also convert only specific pages by passing them through the pages parameter of convert_pdf2docx().
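As a sketch of how page selection could be exposed on the command line, the example below swaps sys.argv indexing for argparse. The --pages option and the build_parser name are assumptions made for illustration, not part of the original script. The values are kept as strings because convert_pdf2docx() converts them to integers itself. pdf2docx appears to treat page indexes as zero-based, so check this against the version you install:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface with an optional list of page indexes."""
    parser = argparse.ArgumentParser(description="Convert a PDF file to docx")
    parser.add_argument("input_file", help="path of the PDF to read")
    parser.add_argument("output_file", help="path of the docx to write")
    # nargs="*" collects zero or more values; default None means no restriction
    parser.add_argument("--pages", nargs="*", default=None,
                        help="page indexes to convert (assumed zero-based)")
    return parser


args = build_parser().parse_args(["letter.pdf", "letter.docx", "--pages", "0", "2"])
print(args.pages)  # ['0', '2']

# In the script, parse the real command line instead and hand the values over:
# args = build_parser().parse_args()
# convert_pdf2docx(args.input_file, args.output_file, pages=args.pages)
```

With this in place, a call such as `python convert_pdf2docx.py letter.pdf letter.docx --pages 0 2` would request only the first and third pages, under the zero-based assumption above.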

I hope you enjoyed this short tutorial and found this converter useful.

Happy coding ♥
