Wikipedia is no doubt the largest and most popular general reference work on the internet, and it is one of the most visited websites. Its content is entirely free. As a result, being able to access this large body of information from Python is handy. In this tutorial, you will learn how to extract information from Wikipedia without much effort.
RELATED: How to Extract All Website Links in Python.
Note that we are not going to scrape Wikipedia pages manually; the wikipedia module already does the hard work for us. Let's install it:
$ pip3 install wikipedia
Open up a Python interactive shell or an empty file and follow along.
Let's get the summary of the article on the Python programming language:
import wikipedia
# print the summary of what python is
print(wikipedia.summary("Python Programming Language"))
This extracts the summary of the corresponding Wikipedia page. More specifically, it prints the opening of the article. We can also specify the number of sentences to extract:
In [2]: wikipedia.summary("Python programming languag", sentences=2)
Out[2]: "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace."
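Notice that I misspelled the query intentionally, and it still gives an accurate result. This works because the library looks up the closest matching title by default (the auto_suggest parameter of wikipedia.summary() is True unless you change it).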
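The lookup above tolerated a typo, but not every query resolves that cleanly. Ambiguous terms make the library raise wikipedia.exceptions.DisambiguationError, which carries the candidate titles in its options attribute. Below is a minimal sketch of a fallback. The names summary_or_options and fetch_summary are hypothetical, and the function checks only for an options attribute so that it runs without the package or a network connection. In real code, catching wikipedia.exceptions.DisambiguationError explicitly is clearer.

```python
def summary_or_options(query, fetch_summary, sentences=2):
    """Return (summary, options).

    fetch_summary is expected to behave like wikipedia.summary.
    summary is None when the query turned out to be ambiguous.
    """
    try:
        return fetch_summary(query, sentences=sentences), []
    except Exception as err:
        # DisambiguationError exposes the candidate titles as `.options`;
        # any exception without that attribute is re-raised untouched.
        options = getattr(err, "options", None)
        if options is None:
            raise
        return None, list(options)


# Usage (needs network access and the wikipedia package):
# import wikipedia
# text, options = summary_or_options("Mercury", wikipedia.summary)
```

If options comes back non-empty, you can show the list to the user or pick one of the titles and query again.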
To search Wikipedia for a term:
In [3]: result = wikipedia.search("Neural networks")
In [4]: print(result)
['Neural network', 'Artificial neural network', 'Convolutional neural network', 'Recurrent neural network', 'Rectifier (neural networks)', 'Feedforward neural network', 'Neural circuit', 'Quantum neural network', 'Dropout (neural networks)', 'Types of artificial neural networks']
This returns a list of related page titles. Let's get the whole page for "Neural network", which is result[0]:
# get the page: Neural network
page = wikipedia.page(result[0])
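Taking result[0] assumes the first hit is the page you want, which depends on how the search ranks its results. Here is a small sketch of a more explicit choice. pick_title is a hypothetical helper, not part of the wikipedia module, and it works on any list of title strings.

```python
def pick_title(results, wanted):
    """Return the title matching `wanted` case-insensitively, else the first result."""
    if not results:
        raise ValueError("search returned no results")
    for title in results:
        if title.casefold() == wanted.casefold():
            return title
    # No exact match: fall back to the top-ranked result
    return results[0]


results = ["Neural network", "Artificial neural network", "Convolutional neural network"]
print(pick_title(results, "artificial neural network"))  # Artificial neural network
```

The returned title can then be passed to wikipedia.page() as before.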
Extracting the title:
# get the title of the page
title = page.title
Getting all the categories of that Wikipedia page:
# get the categories of the page
categories = page.categories
Extracting the text after removing all HTML tags (this is done automatically):
# get the whole wikipedia page text (content)
content = page.content
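The plain text in page.content usually keeps section headings as marker lines such as "== History ==". Assuming that layout (it is worth checking against your actual output), the sketch below splits the text into sections. split_sections is a hypothetical helper, and the sample string is made up for illustration.

```python
import re

# Matches heading lines like "== History ==" or "=== Early work ==="
HEADING = re.compile(r"^(={2,})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def split_sections(content):
    """Map each heading to its text; the lead text is stored under 'Introduction'."""
    sections = {}
    name, start = "Introduction", 0
    for match in HEADING.finditer(content):
        # Everything between the previous heading and this one belongs to `name`
        sections[name] = content[start:match.start()].strip()
        name, start = match.group(2), match.end()
    sections[name] = content[start:].strip()
    return sections


sample = "Lead text.\n\n== History ==\nOld stuff.\n\n=== Early work ===\nDetails."
print(list(split_sections(sample)))  # ['Introduction', 'History', 'Early work']
```

Note that sections sharing the same heading text would overwrite each other in this simple version.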
All links:
# get all the links in the page
links = page.links
The references:
# get the page references
references = page.references
Finally, the summary:
# summary
summary = page.summary
Let's print them out:
# print info
print("Page content:\n", content, "\n")
print("Page title:", title, "\n")
print("Categories:", categories, "\n")
print("Links:", links, "\n")
print("References:", references, "\n")
print("Summary:", summary, "\n")
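Printing the full content and every link can flood the terminal on long articles. As an alternative, here is a sketch that builds a compact overview instead. describe_page is a hypothetical helper. It only reads the attributes used above, so the stand-in object below (with made-up values) lets it run without a network call.

```python
from types import SimpleNamespace


def describe_page(page, preview_chars=80):
    """Build a short, printable overview of a page-like object."""
    preview = page.summary[:preview_chars]
    if len(page.summary) > preview_chars:
        preview += "..."
    return "\n".join([
        f"Title: {page.title}",
        f"Categories: {len(page.categories)}",
        f"Links: {len(page.links)}",
        f"References: {len(page.references)}",
        f"Content length: {len(page.content)} characters",
        f"Summary: {preview}",
    ])


# Stand-in with made-up values; pass the real `page` object in practice.
fake_page = SimpleNamespace(
    title="Neural network",
    categories=["Category A"],
    links=["Link 1", "Link 2"],
    references=[],
    content="x" * 1200,
    summary="A short summary.",
)
print(describe_page(fake_page))
```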
Try it out!
You can also change the language the wikipedia library uses from English to another of your choice:
# changing language
# for a list of available languages,
# check http://meta.wikimedia.org/wiki/List_of_Wikipedias link.
language = "es"
wikipedia.set_lang(language)
# get a page and print the summary in the new language
print(f"Summary of web scraping in {language}:", wikipedia.page("Web Scraping").summary)
Above, we changed the language using the wikipedia.set_lang() function and then extracted pages as usual. For a list of available languages, see the List of Wikipedias page linked in the code comment.
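A language code is easy to mistype. The library also provides wikipedia.languages(), which returns a dictionary mapping language prefixes to language names. Assuming that shape, a guard before calling set_lang() could be sketched like this. check_language and the sample dictionary are hypothetical.

```python
def check_language(code, available):
    """Return the normalized code if it is a known prefix, else raise ValueError."""
    normalized = code.strip().lower()
    if normalized not in available:
        raise ValueError(f"unsupported language prefix: {code!r}")
    return normalized


# Made-up subset for illustration; in practice pass wikipedia.languages().
sample_languages = {"en": "English", "es": "español", "fr": "français"}
print(check_language(" ES ", sample_languages))  # es
```

With the real module, this would be used as wikipedia.set_lang(check_language(language, wikipedia.languages())).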
Alright, we are done! This was a brief introduction to extracting information from Wikipedia in Python. It can be helpful if you want to automatically collect data for language models, build a question-answering chatbot, write a wrapper application around the library, and much more. The possibilities are endless; tell us what you did with this in the comments below!
If you are interested in extracting data from YouTube videos, check this tutorial.
Check the full code here and the official documentation for this library.
Learn also: How to Convert HTML Tables into CSV Files in Python.
Happy Coding ♥