How to Convert PDF to Docx in Python

Learn how you can use pdf2docx library to convert PDF files to docx word files in Python
  · · 3 min read · Updated jul 2022 · PDF File Handling

Disclosure: This post may contain affiliate links, meaning when you click the links and make a purchase, we receive a commission.

In this tutorial, we will dive into how we can use the pdf2docx library to convert PDF files into docx extension.

The goal of this tutorial is to develop a lightweight command-line-based utility, through Python-based modules without relying on external utilities outside the Python ecosystem in order to convert one or a collection of PDF files located within a folder.

pdf2docx is a Python library to extract data from PDF with PyMuPDF, parse layout with rules, and generate docx file with python-docx. python-docx is another library that is used by pdf2docx for creating and updating Microsoft Word (.docx) files.

Going into the requirements:

$ pip install pdf2docx==0.5.1

Let's start by importing the modules:

# Import Libraries
from pdf2docx import parse
from typing import Tuple

Let's define the function responsible for converting PDF to Docx:

def convert_pdf2docx(input_file: str, output_file: str, pages: Tuple = None):
    """Converts pdf to docx"""
    if pages:
        pages = [int(i) for i in list(pages) if i.isnumeric()]
    result = parse(pdf_file=input_file,
                   docx_with_path=output_file, pages=pages)
    summary = {
        "File": input_file, "Pages": str(pages), "Output File": output_file
    }
    # Printing Summary
    print("## Summary ########################################################")
    print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
    print("###################################################################")
    return result

The convert_pdf2docx() function allows you to specify a range of pages to convert, it converts a PDF file into a Docx file and prints a summary of the conversion process in the end.

Let's use it now:

if __name__ == "__main__":
    import sys
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    convert_pdf2docx(input_file, output_file)

We simply use Python's built-in sys module to get the input and output file names from command-line arguments. Let's try to convert a sample PDF file (get it here):

$ python convert_pdf2docx.py letter.pdf letter.docx

A new letter.docx file will appear in the current directory, and the output will be like this:

Parsing Page 1: 1/1...
Creating Page 1: 1/1...
--------------------------------------------------
Terminated in 0.10869679999999998s.
## Summary ########################################################
File:letter.pdf
Pages:None
Output File:letter.docx
###################################################################

You can also specify the pages you want in the convert_pdf2docx() function.

I hope you enjoyed this short tutorial and you found this converter useful.

Learn also: How to Replace Text in Docx Files in Python.

PDF related tutorials:

Finally, if you're a beginner and want to learn Python, I suggest you take the Python For Everybody Coursera course, in which you'll learn a lot about Python. You can also check our resources and courses page to see the Python resources I recommend on various topics!

Happy coding ♥

View Full Code
Sharing is caring!



Read Also



Comment panel