A proxy is a server application that acts as an intermediary for requests between a client and the server from which the client is requesting a certain service (HTTP, SSL, etc.).
When using a proxy server, instead of connecting directly to the target server and requesting whatever you want, you direct the request to the proxy server, which evaluates the request, performs it, and returns the response (the Wikipedia article on proxy servers has a simple diagram illustrating this flow).
Web scraping experts often use more than one proxy to prevent websites from banning their IP addresses. Proxies have several other benefits, including bypassing filters and censorship, hiding your real IP address, etc.
In this tutorial, you will learn how to use proxies in Python with the requests library. We will also be using the stem library, a Python controller library for Tor. Let's install them:
pip3 install bs4 requests stem
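Before diving in, here is what routing a single request through a proxy looks like with requests; this is just a minimal sketch, and the proxy address below is a placeholder for illustration (203.0.113.10 is a reserved documentation address), so swap in a real working proxy:

import requests

# a placeholder proxy for illustration; replace with a real working proxy
proxy = "203.0.113.10:8080"
proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}

# the request goes to the proxy first, which forwards it to the target server
response = requests.get("http://icanhazip.com", proxies=proxies, timeout=5)
print(response.text.strip())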
If you're just beginning your Python programming journey, this in-depth Python web scraping tutorial is the perfect starting point. The guide will walk you through the most popular Python libraries for web scraping, including requests, BeautifulSoup4, and Selenium, showing you how to extract and save data to CSV and Excel files. It's a go-to resource for building a Python web scraper from scratch, understanding library differences, and discovering best practices for your project.
Related: How to Make a Subdomain Scanner in Python.
First, some websites offer free proxy lists to use; I have built a function to grab this list automatically:
import requests
import random
from bs4 import BeautifulSoup as bs

def get_free_proxies():
    url = "https://free-proxy-list.net/"
    # get the HTTP response and construct soup object
    soup = bs(requests.get(url).content, "html.parser")
    proxies = []
    for row in soup.find("table", attrs={"id": "proxylisttable"}).find_all("tr")[1:]:
        tds = row.find_all("td")
        try:
            ip = tds[0].text.strip()
            port = tds[1].text.strip()
            host = f"{ip}:{port}"
            proxies.append(host)
        except IndexError:
            continue
    return proxies
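For instance, you could call it and print a few of the scraped entries (the exact addresses depend on whatever the site lists at that moment):

free_proxies = get_free_proxies()
print(f"Found {len(free_proxies)} free proxies, for example:")
print(free_proxies[:5])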
However, when I tried to use them, most of them were timing out. I filtered out some working ones:
proxies = [
'167.172.248.53:3128',
'194.226.34.132:5555',
'203.202.245.62:80',
'141.0.70.211:8080',
'118.69.50.155:80',
'201.55.164.177:3128',
'51.15.166.107:3128',
'91.205.218.64:80',
'128.199.237.57:8080',
]
This list won't stay viable forever; in fact, most of these proxies will have stopped working by the time you read this tutorial (so you should execute the above function each time you want fresh proxy servers).
The below function accepts a list of proxies and creates a requests session that randomly selects one of the proxies passed:
def get_session(proxies):
    # construct an HTTP session
    session = requests.Session()
    # choose one random proxy
    proxy = random.choice(proxies)
    session.proxies = {"http": proxy, "https": proxy}
    return session
Let's test this by requesting a website that returns our IP address:
for i in range(5):
    s = get_session(proxies)
    try:
        print("Request page with IP:", s.get("http://icanhazip.com", timeout=1.5).text.strip())
    except Exception as e:
        continue
Here is my output:
Request page with IP: 45.64.134.198
Request page with IP: 141.0.70.211
Request page with IP: 94.250.248.230
Request page with IP: 46.173.219.2
Request page with IP: 201.55.164.177
As you can see, these are some IP addresses of the working proxy servers and not our real IP address (try to visit this website in your browser and you'll see your real IP address).
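Building on get_session(), one common pattern is to retry a failed request with a different random proxy until one of them responds. Here is a rough sketch; the get_with_rotation() helper is just an illustrative name, not part of any library:

def get_with_rotation(url, proxies, retries=5):
    # try up to `retries` different random proxies before giving up
    for _ in range(retries):
        session = get_session(proxies)
        try:
            return session.get(url, timeout=3)
        except Exception:
            # this proxy is dead or too slow, try another one
            continue
    raise RuntimeError(f"All {retries} attempts failed for {url}")

# example usage:
# print(get_with_rotation("http://icanhazip.com", proxies).text.strip())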
Free proxies tend to die very quickly, mostly within days or even hours, and often before your scraping project ends. To prevent that, you need to use premium proxies for large-scale data extraction projects; there are many providers out there that rotate IP addresses for you. One of the well-known solutions is Zyte. We will talk more about it in the last section of this tutorial.
You can also use the Tor network to rotate IP addresses:
import requests
from stem.control import Controller
from stem import Signal

def get_tor_session():
    # initialize a requests Session
    session = requests.Session()
    # set the proxy of both http & https to localhost:9050
    # this requires a Tor service running on your machine and listening on port 9050 (the default)
    session.proxies = {"http": "socks5://localhost:9050", "https": "socks5://localhost:9050"}
    return session

def renew_connection():
    with Controller.from_port(port=9051) as c:
        c.authenticate()
        # send NEWNYM signal to establish a new clean connection through the Tor network
        c.signal(Signal.NEWNYM)

if __name__ == "__main__":
    s = get_tor_session()
    ip = s.get("http://icanhazip.com").text
    print("IP:", ip)
    renew_connection()
    s = get_tor_session()
    ip = s.get("http://icanhazip.com").text
    print("IP:", ip)
Note: The above code only works if you have Tor installed on your machine (head to this link to properly install it) and properly configured (ControlPort 9051 enabled; check this Stack Overflow answer for further details). requests also needs SOCKS support to use socks5:// proxies, which you can get by installing the PySocks package (pip3 install requests[socks]).
This will create a session with a Tor IP address and make an HTTP request, then renew the connection by sending the NEWNYM signal (which tells Tor to establish a new, clean connection) to change the IP address and make another request. Here is the output:
IP: 185.220.101.49
IP: 109.70.100.21
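As you can see, the IP address changed between the two requests. If you want to rotate repeatedly, you can wrap the same calls in a loop; note that Tor rate-limits NEWNYM signals, so pausing between renewals (roughly ten seconds is commonly suggested) helps ensure you actually get a new circuit. A small sketch:

import time

for i in range(3):
    s = get_tor_session()
    print(f"IP {i + 1}:", s.get("http://icanhazip.com").text.strip())
    # ask Tor for a fresh circuit, then give it a moment before the next request
    renew_connection()
    time.sleep(10)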
However, when you try web scraping over the Tor network, you'll soon realize it's pretty slow most of the time. That's why the approach below is the recommended way.
Zyte's Smart Proxy Manager allows you to crawl quickly and reliably. It manages and rotates proxies internally, so if you get banned, it will automatically detect that and rotate the IP address for you.
It is specifically designed for web scraping and crawling. Its job is clear: making your life easier as a web scraper. It helps you get successful requests and extract data at scale from any website using any web scraping tool.
With its simple API, the requests you make when scraping will be routed through a pool of high-quality proxies. When necessary, it automatically introduces delays between requests and removes/adds IP addresses to overcome different crawling challenges.
Here is how you can use Zyte with requests library in Python:
import requests

url = "http://icanhazip.com"
proxy_host = "proxy.crawlera.com"
proxy_port = "8010"
proxy_auth = "<APIKEY>:"
proxies = {
    "https": f"https://{proxy_auth}@{proxy_host}:{proxy_port}/",
    "http": f"http://{proxy_auth}@{proxy_host}:{proxy_port}/"
}

r = requests.get(url, proxies=proxies, verify=False)
Once you register for a plan, you'll be provided with an API key, which you use to replace <APIKEY> in proxy_auth.
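If you're going to make many requests, you can also attach the same proxies to a requests session once and reuse it; here is a small sketch using the same proxies dictionary and placeholder API key as above:

session = requests.Session()
session.proxies = proxies
session.verify = False  # same as verify=False in the snippet above

for url in ["http://icanhazip.com", "http://example.com"]:
    r = session.get(url)
    print(url, r.status_code)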
To recap, Zyte routes your requests through a pool of high-quality proxies, rotates IP addresses when you get banned, and introduces delays between requests when necessary, so you can focus on extracting the data.
There are several proxy types, including transparent proxies, anonymous proxies, and elite proxies. If your goal of using proxies is to prevent websites from banning your scrapers, then elite proxies are your optimal choice; they will make you seem like a regular internet user who is not using a proxy at all.
Furthermore, an extra anti-detection measure is rotating user agents: you send a different spoofed User-Agent header with each request, claiming to be a regular browser.
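Here is a small sketch of that idea: pick a random User-Agent string for each request (the strings below are only examples) and send it as a header, optionally combined with the proxy techniques above:

import random
import requests

# a few example User-Agent strings; in practice you'd keep a larger, up-to-date list
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]

headers = {"User-Agent": random.choice(user_agents)}
response = requests.get("http://icanhazip.com", headers=headers)
print(response.request.headers["User-Agent"])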
Learn also: How to Extract All Website Links in Python.
Happy Coding ♥