Have you ever wanted to automatically extract HTML tables from web pages and save them in a proper format on your computer? If so, you're in the right place. In this tutorial, we will use the requests and BeautifulSoup libraries to extract any table on any web page and save it to disk.
We will also use pandas to easily convert the tables to CSV (or any other format pandas supports). If you don't have requests, BeautifulSoup, and pandas installed, install them with the following command:
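The packages can be installed with pip; note that BeautifulSoup is published on PyPI under the name beautifulsoup4:

```shell
python -m pip install requests beautifulsoup4 pandas
```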
If you want to go the other way around and convert pandas DataFrames to HTML tables, check this tutorial.
Open up a new Python file and follow along. Let's import the libraries:
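A minimal import block for this tutorial might look like this (pandas is only needed later, for the CSV step):

```python
import requests                 # for downloading the web page
import pandas as pd             # for saving tables in CSV format
from bs4 import BeautifulSoup   # for parsing the HTML
```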
We need a function that accepts the target URL and gives us the proper soup object:
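Here is a sketch of such a function; the name get_soup() and the exact User-Agent string are assumptions for illustration:

```python
import requests
from bs4 import BeautifulSoup

# A typical desktop browser User-Agent; the exact string is just an example
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36")

def get_soup(url):
    """Download the page at `url` and return a parsed BeautifulSoup object."""
    session = requests.Session()
    session.headers["User-Agent"] = USER_AGENT
    html = session.get(url).content
    return BeautifulSoup(html, "html.parser")
```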
We first initialize a requests session and set the User-Agent header to indicate that we are a regular browser rather than a bot (some websites block bots). We then fetch the HTML content using the session.get() method. After that, we construct a BeautifulSoup object using html.parser.
Related tutorial: How to Make an Email Extractor in Python.
Since we want to extract every table on any page, we need to find the table HTML tag and return it. The following function does exactly that:
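One way to write it, assuming the name get_all_tables():

```python
from bs4 import BeautifulSoup

def get_all_tables(soup):
    """Return every <table> element found in the soup."""
    return soup.find_all("table")
```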
Now we need a way to get the table headers, the column names, or whatever you want to call them:
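A possible implementation, assuming the name get_table_headers():

```python
from bs4 import BeautifulSoup

def get_table_headers(table):
    """Extract the text of each <th> cell in the table's first row."""
    headers = []
    # the first <tr> of the table is assumed to hold the header cells
    for th in table.find("tr").find_all("th"):
        headers.append(th.text.strip())
    return headers
```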
The above function finds the first row of the table and extracts all the th tags (table headers).
Now that we know how to extract table headers, the remaining is to extract all the table rows:
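A sketch of that function, assuming the name get_table_rows():

```python
from bs4 import BeautifulSoup

def get_table_rows(table):
    """Extract the cell text of every data row, skipping the header row."""
    rows = []
    # [1:] skips the first <tr>, which holds the table headers
    for tr in table.find_all("tr")[1:]:
        cells = [td.text.strip() for td in tr.find_all("td")]
        rows.append(cells)
    return rows
```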
All the above function does is find the tr tags (table rows) and extract the td elements, which it then appends to a list. The reason we used table.find_all("tr")[1:] rather than all the tr tags is that the first tr tag corresponds to the table headers; we don't want to add it here.
The below function takes the table name, table headers, and all the rows and saves them in CSV format:
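A minimal sketch, assuming the name save_as_csv() and that pandas is used for the conversion:

```python
import pandas as pd

def save_as_csv(table_name, headers, rows):
    """Build a DataFrame from the rows and write it to `<table_name>.csv`."""
    pd.DataFrame(rows, columns=headers).to_csv(f"{table_name}.csv", index=False)
```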
Now that we have all the core functions, let's bring them all together in the main() function:
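A sketch of main(), assuming the helper functions discussed above (get_soup, get_all_tables, get_table_headers, get_table_rows, save_as_csv) are defined in the same file:

```python
def main(url):
    # get the soup object for the target page
    soup = get_soup(url)
    # extract all the tables on the page
    tables = get_all_tables(soup)
    print(f"[+] Found a total of {len(tables)} tables.")
    # iterate over the tables, saving each one as table-N.csv
    for i, table in enumerate(tables, start=1):
        headers = get_table_headers(table)
        rows = get_table_rows(table)
        table_name = f"table-{i}"
        print(f"[+] Saving {table_name}")
        save_as_csv(table_name, headers, rows)
```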
The above function does the following: it builds the soup object for the target URL, extracts every table on the page, and then, for each table, grabs the headers and rows and saves them as a CSV file.
Finally, let's call the main function:
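For example, reading the target URL from sys.argv (the script filename extract_tables.py is just a placeholder; main() is the function assembled in the previous step):

```python
import sys

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("usage: python extract_tables.py <url>")
    else:
        # main() is defined earlier in the same file
        main(sys.argv[1])
```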
This will accept the URL from the command-line arguments. Let's see if it works:
Nice! Two CSV files appeared in my current directory, corresponding to the two tables on that Wikipedia page. Here is part of one of the extracted tables:
Awesome! We have successfully built a Python script to extract any table from any website. Try passing other URLs and see if it works.
For JavaScript-driven websites (which load their data dynamically using JavaScript), try the requests-html library or Selenium instead. Let us know what you did in the comments below!
You can also build a web crawler that downloads all the tables from an entire website: extract all of the website's links and run this script on each URL you find.
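As a starting point for such a crawler, here is a sketch of a link extractor (the name get_page_links is an assumption); urljoin from the standard library turns relative hrefs into absolute URLs:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def get_page_links(base_url, html):
    """Return the absolute URL of every <a href=...> link found in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
```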
Also, if the website you're scraping blocks your IP address for whatever reason, you will need to use a proxy server as a countermeasure.
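With requests, a proxy can be supplied per request through the proxies parameter; the address below is a placeholder, not a working proxy:

```python
import requests

# placeholder proxy address; replace with a real proxy server
proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:3128",
}

def get_html_via_proxy(url):
    """Fetch `url`, routing the request through the proxy configured above."""
    return requests.get(url, proxies=proxies).content
```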
Read also: How to Extract and Submit Web Forms from a URL using Python.
Happy Scraping ♥