In this tutorial, we will use the pdf2docx library to convert PDF files to the .docx format.
The goal is to build a lightweight command-line utility in pure Python, without relying on tools outside the Python ecosystem, that converts a single PDF file or a collection of PDF files located in a folder.
pdf2docx is a Python library that extracts data from PDF files with PyMuPDF, parses the layout with rules, and generates a docx file with python-docx. python-docx is the library pdf2docx relies on for creating and updating Microsoft Word (.docx) files.
First, install the required library:
$ pip install pdf2docx==0.5.1
Let's start by importing the modules:
# Import Libraries
from pdf2docx import parse
from typing import Tuple
Let's define the function responsible for converting PDF to Docx:
def convert_pdf2docx(input_file: str, output_file: str, pages: Tuple = None):
    """Converts pdf to docx"""
    if pages:
        # Keep only numeric entries; str() lets both ints and numeric strings pass
        pages = [int(i) for i in list(pages) if str(i).isnumeric()]
    result = parse(pdf_file=input_file,
                   docx_with_path=output_file, pages=pages)
    summary = {
        "File": input_file, "Pages": str(pages), "Output File": output_file
    }
    # Printing Summary
    print("## Summary ########################################################")
    print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
    print("###################################################################")
    return result
The convert_pdf2docx() function converts a PDF file into a Docx file, optionally restricted to the pages you specify, and prints a summary of the conversion at the end.
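The page filtering step is easy to get subtly wrong, so here is a minimal standalone sketch of it. The helper name normalize_pages is hypothetical (it is not part of pdf2docx or of the script above), and it assumes that non-numeric entries should simply be dropped:

```python
from typing import Iterable, List, Optional, Union


def normalize_pages(pages: Optional[Iterable[Union[int, str]]]) -> Optional[List[int]]:
    """Return the numeric entries of `pages` as ints, or None if nothing usable is given.

    Hypothetical helper for illustration only.
    """
    if not pages:
        return None
    # str(i) makes the check work for both ints and numeric strings;
    # anything else (letters, negative numbers, empty strings) is skipped.
    cleaned = [int(i) for i in pages if str(i).isnumeric()]
    return cleaned or None


print(normalize_pages(("0", "2", "x")))  # [0, 2]
print(normalize_pages((1, 3)))           # [1, 3]
print(normalize_pages(None))             # None
```

Returning None when nothing usable remains mirrors the "no pages given" case, so the caller can pass the result straight through as the pages argument.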
Let's use it now:
if __name__ == "__main__":
    import sys
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    convert_pdf2docx(input_file, output_file)
We simply use Python's built-in sys module to get the input and output file names from the command-line arguments. Let's try to convert a sample PDF file:
$ python convert_pdf2docx.py letter.pdf letter.docx
A new letter.docx file will appear in the current directory, and the output will look like this:
Parsing Page 1: 1/1...
Creating Page 1: 1/1...
--------------------------------------------------
Terminated in 0.10869679999999998s.
## Summary ########################################################
File:letter.pdf
Pages:None
Output File:letter.docx
###################################################################
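The run above handles one file. To cover the other case mentioned at the start, a collection of PDF files in a folder, one possible sketch is shown below. The names plan_folder_conversion and convert_folder are hypothetical, and the converter is passed in as a parameter so the folder logic can be tried without pdf2docx installed; in practice you would hand it the convert_pdf2docx() function from this tutorial.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def plan_folder_conversion(input_folder: str, output_folder: str) -> List[Tuple[Path, Path]]:
    """Pair every PDF directly inside input_folder with a .docx path in output_folder."""
    src, dst = Path(input_folder), Path(output_folder)
    return [
        (pdf, dst / pdf.with_suffix(".docx").name)
        for pdf in sorted(src.iterdir())
        # Match the extension case-insensitively (.pdf, .PDF, ...)
        if pdf.is_file() and pdf.suffix.lower() == ".pdf"
    ]


def convert_folder(input_folder: str, output_folder: str,
                   convert: Callable[[str, str], object]) -> int:
    """Run `convert(input_path, output_path)` for each PDF; return how many were handled."""
    # Create the output folder if it does not exist yet
    Path(output_folder).mkdir(parents=True, exist_ok=True)
    jobs = plan_folder_conversion(input_folder, output_folder)
    for pdf, docx in jobs:
        convert(str(pdf), str(docx))
    return len(jobs)


# Assumed usage with the function defined in this tutorial:
# convert_folder("pdfs", "converted", convert_pdf2docx)
```

This sketch only looks at the top level of the folder; walking subfolders would need Path.rglob() or similar and a decision about how to mirror the directory layout.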
You can also specify the pages you want to convert by passing the pages argument to the convert_pdf2docx() function.
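As a rough illustration, the pages could come from an optional third command-line argument such as 0,2. The helper parse_cli_args below is hypothetical and only splits the arguments; it returns the pages as strings, which convert_pdf2docx() then turns into integers. As far as I know, pdf2docx treats page indices as zero-based, so 0,2 would select the first and third pages, but that is worth checking against the version you have installed.

```python
from typing import List, Optional, Tuple


def parse_cli_args(argv: List[str]) -> Tuple[str, str, Optional[Tuple[str, ...]]]:
    """Split argv into (input_file, output_file, pages).

    Hypothetical helper: `pages` is None when no third argument is given,
    otherwise a tuple of the comma-separated entries as strings.
    """
    if len(argv) < 3:
        raise SystemExit("usage: convert_pdf2docx.py INPUT.pdf OUTPUT.docx [PAGES]")
    pages = None
    if len(argv) > 3:
        # "0, 2" -> ("0", "2"); surrounding whitespace is stripped
        pages = tuple(p.strip() for p in argv[3].split(","))
    return argv[1], argv[2], pages


# Assumed usage inside the __main__ block of the script:
# input_file, output_file, pages = parse_cli_args(sys.argv)
# convert_pdf2docx(input_file, output_file, pages)
```

With this in place, a call might look like: python convert_pdf2docx.py letter.pdf letter.docx 0,2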
I hope you enjoyed this short tutorial and found this converter useful.
Learn also: How to Replace Text in Docx Files in Python.
Happy coding ♥