How to Build a Web Crawler Using Python
To build a web crawler, we are going to use the Requests Python library. Requests is a Python library that greatly simplifies working with HTTP requests. Sooner or later, in a project, you may have to make web requests, whether to consume an API, extract information from a page or submit the content of a form in an automated way. If so, Requests is your great ally.
This tutorial is an introduction to the requests library. In it I will show you how to use it for the most common tasks: how to make a GET request, how to make a POST request, how to pass headers in the request, how to simulate sending data from a form, how to send and receive JSON, and how to handle the response and its status code.
Contents
- 1 Installation
- 2 How to make a GET request in Python with requests
- 3 How to make a POST request in Python with requests
- 4 And many more Request
- 5 The Response object : Request response python library
- 6 How to pass parameters in url in python with requests
- 7 How to send headers with requests
- 8 JSON requests and responses in Python with requests
- 9 How to send a Multipart-Encoded file
- 10 Cookies with requests
- 11 Conclusion
Installation
Being a third-party library, the first thing you should do is install it. The easiest way to do this is to use the pip package manager:
pip install requests
If the pip command on your system is not linked to your Python 3 installation, try:
pip3 install requests
How to make a GET request in Python with requests
One of the most common operations with the requests library is making a GET request, either to obtain the content of a website or to call an API.
To do this, you simply call the function get() with the URL to which the request should be made.
import requests
r = requests.get('https://www.google.com/')
print(r.text)
The function returns a Response object, which in this case has been assigned to the variable r, with all the information from the response.
How to make a POST request in Python with requests
Now imagine that you want to make a POST request to send the data of a form. The way to proceed is very similar to the previous one, except that you must call the function post() and pass in the parameter data a dictionary with the data for the body of the request. When the data is passed in the parameter data, requests takes care of encoding it correctly before making the request:
import requests
form_data = {'gameCategory': 'racing', 'numberOfGames': '12'}
r = requests.post('https://example.com/api/', data=form_data)
For the case where a form has one or more multivalued fields, the different values can be specified in two different ways.
In a dictionary, indicating a list of values for a key:
import requests
form_data = {'gameCategories': ['racing', 'arcade'], 'numberOfGames': '12'}
r = requests.post('https://example.com/api/', data=form_data)
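The second way is to pass a list of (key, value) tuples, repeating the key once per value. The following is a minimal sketch comparing the two forms. It relies on requests.Request(...).prepare(), which builds the request without sending it, so the encoded body can be inspected offline. The URL and field names are the illustrative ones used above.

```python
import requests

url = 'https://example.com/api/'

# Form 1: a dictionary whose value is a list.
as_dict = {'gameCategories': ['racing', 'arcade'], 'numberOfGames': '12'}

# Form 2: a list of (key, value) tuples that repeats the key.
as_tuples = [
    ('gameCategories', 'racing'),
    ('gameCategories', 'arcade'),
    ('numberOfGames', '12'),
]

# prepare() encodes the request but sends nothing over the network.
body_from_dict = requests.Request('POST', url, data=as_dict).prepare().body
body_from_tuples = requests.Request('POST', url, data=as_tuples).prepare().body

print(body_from_dict)
print(body_from_dict == body_from_tuples)
```

Both forms should produce the same form-encoded body, so the choice between them is mostly a matter of which is easier to build in your code.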
And many more Request
In the same way that there are functions get() and post(), requests has functions for the following request methods: PATCH, PUT, DELETE, HEAD and OPTIONS. They are patch(), put(), delete(), head() and options().
Timeouts
For any request, it is possible to specify a response timeout. To do this, pass in the parameter timeout the maximum number of seconds that the request should wait for the server to start responding.
import requests
r = requests.get('https://www.google.com/', timeout=0.01)
Note: If timeout is not specified, requests will wait indefinitely for a response.
If the server does not return a response before the specified time, the exception requests.exceptions.Timeout is raised.
Requests exceptions
If an error occurs when making the request, requests raises an exception. The base class for all exceptions is requests.exceptions.RequestException. However, the most common exceptions are the following:
- Timeout: If the server does not return a response before the time indicated in the parameter timeout.
- TooManyRedirects: If a request exceeds the maximum number of redirects.
- ConnectionError: If there is a network problem (no internet, DNS failure, connection rejected, ...).
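The exceptions listed above can be handled as in the following sketch. The helper fetch_status() and its return labels are hypothetical names chosen for illustration. Note that Timeout is caught before ConnectionError, because a connection timeout inherits from both classes.

```python
import requests


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code, or a short label describing the failure."""
    try:
        response = requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Checked first: a connect timeout is also a ConnectionError.
        return 'timeout'
    except requests.exceptions.TooManyRedirects:
        return 'too many redirects'
    except requests.exceptions.ConnectionError:
        return 'connection error'
    except requests.exceptions.RequestException:
        # Base class: catches every other requests error, e.g. an invalid URL.
        return 'request failed'
    return response.status_code


# This URL has no scheme, so requests rejects it before any network traffic.
result = fetch_status('example.com/no-scheme')
print(result)
```

Catching the base class last gives the crawler a single fallback for any error that the more specific handlers do not cover.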
Controlling redirects with the requests library
By default, when making a request, requests follows the redirections that the server indicates before returning the definitive response (except for HEAD requests, where following redirects must be enabled explicitly).
If this happens, the response object stores in its attribute history a list of all the intermediate responses, from the oldest to the most recent.
For example, if you make a request to http://google.com, you will get the following in response:
>>> import requests
>>> r = requests.get('http://google.com/')
>>> r.history
[<Response [301]>]
>>> r.status_code
200
>>> r.history
[<Response [301]>]
>>> r.url
'http://www.google.com/'
As you can see, the first request redirects you to http://www.google.com/.
To modify this behaviour, you must set the parameter allow_redirects to False.
>>> r = requests.get('http://google.com/', allow_redirects=False)
>>> r.status_code
301
>>> r.history
[]
The Response object : Request response python library
Now that we have reviewed the main aspects of making an HTTP request, in this section we are going to focus on the Response object, which is obtained as the result of a request.
This object contains all the information regarding the response, such as content, response code, headers, or cookies.
Response content
When the response returned by a server is text, for example HTML or XML, the content is found in the attribute text of the Response object.
Requests automatically decodes the content returned by the server, guessing the encoding to use from the response headers. To find out which encoding was used, you can access the attribute encoding.
>>> import requests
>>> r = requests.get('https://www.google.com/')
>>> r.encoding
'ISO-8859-1'
>>> r.text
'<!doctype html><html itemscope="" itemtype="h...'
For those cases in which the response is not text, such as an image or a PDF, you must access the attribute content instead, since it returns the content as a sequence of bytes.
Finally, there is a special case that allows access to the raw socket response from the server, through the attribute raw. However, instead of accessing raw directly, it is preferable to call the method iter_content using the following pattern, especially when you want to stream a download (pass stream=True when making the request):
for chunk in r.iter_content(chunk_size=128):
    pass  # replace with code that handles the downloaded byte sequence (chunk)
Response status code
To get the status code of the response, you must access its attribute status_code.
>>> import requests
>>> r = requests.get('http://www.google.com/')
>>> r.status_code
200
Response headers
The response headers are accessible through the attribute headers. This attribute is a special, case-insensitive dictionary whose keys are the names of the headers returned by the server.
>>> import requests
>>> r = requests.get('http://www.google.com/')
>>> r.headers
{'Expires': '-1', 'Cache-Control': 'private, max-age=0', ...}
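The headers attribute is an instance of requests.structures.CaseInsensitiveDict, so a header can be looked up regardless of how its name is capitalised. A minimal offline sketch, building such a dictionary by hand with an illustrative header value instead of making a real request:

```python
from requests.structures import CaseInsensitiveDict

# Stand-in for r.headers; the header value here is only an example.
headers = CaseInsensitiveDict({'Content-Type': 'text/html; charset=UTF-8'})

content_type = headers['content-type']     # lower-case lookup works
same_value = headers.get('CONTENT-TYPE')   # so does upper-case
missing = headers.get('X-Not-Sent')        # absent headers give None with get()

print(content_type)
```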
Response cookies
If you want to check the cookies returned by the server, you can do so by accessing the cookies attribute of the response. This attribute is of type RequestsCookieJar, which acts as an enhanced dictionary that can also record the domain and/or the path of a cookie, among other things:
>>> import requests
>>> r = requests.get('https://www.google.com/')
>>> r.cookies
<RequestsCookieJar[Cookie(version=0, name='1P_JAR', value='2020-10-03-10', port=None, port_specified=False, domain='.google.com', ... rest={'HttpOnly': None}, rfc2109=False)]>
>>> r.cookies['1P_JAR']
'2020-10-03-10'
How to pass parameters in url in python with requests
Sometimes it is necessary to pass a series of parameters in the URL of the request. This can be done manually by appending to the URL a string that begins with the character ? followed by pairs of the form param1=value1&param2=value2...
For example:
import requests
r = requests.get('https://example.com?page=2')
However, requests makes it easier to build a parameterized URL by passing a dictionary in the parameter params.
import requests
parameters = {'key1' : 'value1' , 'key2' : [ 'val1' , 'val2']}
r = requests.get('https://example.com' , params = parameters)
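To see the URL that requests builds from such a dictionary, the request can be prepared without being sent. This is a minimal sketch using requests.Request(...).prepare(); the second request, with a 'q' parameter, is an extra illustration of how special characters such as spaces are escaped.

```python
import requests

parameters = {'key1': 'value1', 'key2': ['val1', 'val2']}

# prepare() builds the final URL but does not contact the server.
prepared = requests.Request('GET', 'https://example.com', params=parameters).prepare()
print(prepared.url)

# Values are escaped automatically, so spaces and similar characters are safe.
search = requests.Request('GET', 'https://example.com', params={'q': 'web crawler'}).prepare()
print(search.url)
```

A list value is expanded into one key=value pair per element, which is why key2 appears twice in the resulting URL.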
How to send headers with requests
If you need to specify a header in the request, you must pass a dictionary of name: value pairs in the parameter headers. Each key is the name of a header, and the value of each item in the dictionary must be a string.
Example:
import requests
headers = {'cache-control': 'no-cache', 'accept': 'text/html'}
r = requests.get('https://example.com', headers=headers)
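The sketch below checks, without sending anything, how the headers end up on the request and what happens when a value is not a string. It assumes a reasonably recent version of requests, where a non-string header value is rejected with requests.exceptions.InvalidHeader. The header name x-retry-count is hypothetical.

```python
import requests

url = 'https://example.com'
headers = {'cache-control': 'no-cache', 'accept': 'text/html'}

# prepare() attaches the headers without making the request.
prepared = requests.Request('GET', url, headers=headers).prepare()
accept = prepared.headers['Accept']  # header lookup is case-insensitive

# An integer value is not accepted (assumption: recent requests versions).
try:
    requests.Request('GET', url, headers={'x-retry-count': 3}).prepare()
    int_value_accepted = True
except requests.exceptions.InvalidHeader:
    int_value_accepted = False

# Converting the value with str() avoids the problem.
fixed = requests.Request('GET', url, headers={'x-retry-count': str(3)}).prepare()

print(accept, int_value_accepted, fixed.headers['x-retry-count'])
```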
JSON requests and responses in Python with requests
One of the most common uses of the requests library is to make requests to an API from an application.
One of the main characteristics of consuming an API is that, generally, the data is sent and received in JSON format.
Next, I'll show you how easy it is to make API calls using the requests library.
Python JSON GET requests
To make a GET request, you simply have to call the function get().
If the response is JSON, which is the most common case, we can call the json() method of the response to decode the data and return it as the corresponding Python object, typically a dictionary with the fields of that JSON.
import requests
r = requests.get('https://example.com/json/' )
posts = r.json()
NOTE: Please check the response code to see whether the response is valid or an error has occurred. In many cases, a server can return JSON even when the request fails.
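One way to follow this advice is to call raise_for_status() before decoding the JSON. The sketch below is self-contained: it starts a small local HTTP server (a hypothetical API with a /posts endpoint, invented for this example) so it can run without internet access. It also uses a Session with trust_env set to False, so that proxy environment variables do not interfere with the local request. The helper get_json() is an illustrative name.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical API: /posts answers 200, any other path answers 404."""

    def do_GET(self):
        if self.path == '/posts':
            status, payload = 200, [{'id': 1, 'title': 'Hello'}]
        else:
            status, payload = 404, {'error': 'not found'}
        body = json.dumps(payload).encode('utf-8')
        self.send_response(status)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def get_json(session, url):
    """Return the decoded JSON, or None if the server reports an error status."""
    r = session.get(url, timeout=5)
    try:
        r.raise_for_status()  # raises HTTPError for 4xx and 5xx responses
    except requests.exceptions.HTTPError:
        return None
    return r.json()


# Start the local server on a free port chosen by the operating system.
server = HTTPServer(('127.0.0.1', 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_port

session = requests.Session()
session.trust_env = False  # ignore proxy settings from the environment

posts = get_json(session, base + '/posts')
missing = get_json(session, base + '/nope')

server.shutdown()
server.server_close()
print(posts, missing)
```

Here the 404 response also carries a JSON body, which is exactly the situation the note warns about: without the status check, the error payload would be treated as valid data.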
Python POST requests
To send data in JSON format to an API using the POST, PUT or PATCH methods, simply pass a dictionary through the parameter json. Requests takes care of setting the Content-Type header for you and serializing the data correctly.
import requests
load = {'game_name':'Friv','score':5 }
r = requests.post( 'https://friv.co.in/game/' , json = load)
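To check what the json parameter does, the request can be prepared without being sent. In this minimal sketch the URL is a placeholder on example.com, and payload is just an illustrative variable name.

```python
import json

import requests

payload = {'game_name': 'Friv', 'score': 5}

# prepare() serializes the body and sets the headers without sending anything.
prepared = requests.Request('POST', 'https://example.com/game/', json=payload).prepare()

content_type = prepared.headers['Content-Type']
decoded = json.loads(prepared.body)  # the body is the JSON-encoded payload

print(content_type)
print(decoded)
```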
How to send a Multipart-Encoded file
To send a file with requests, you just have to put the open file object in a dictionary and pass it through the parameter files:
import requests
files = {'file1' : open('gamename.pdf' , 'rb')}
r = requests.post( 'https://example.com/submit/' , files = files)
It is also possible to explicitly specify the file name and type as follows:
import requests
files = {'file1': ('gamename.pdf', open('gamename.pdf', 'rb'), 'application/pdf')}
r = requests.post('https://example.com/submit/' , files = files)
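The following sketch shows what the multipart encoding looks like, without needing a real file or a server. An in-memory io.BytesIO object with made-up content stands in for the opened PDF, and the request is only prepared, not sent.

```python
import io

import requests

# In-memory stand-in for open('gamename.pdf', 'rb'); the bytes are made up.
fake_pdf = io.BytesIO(b'%PDF-1.4 demo bytes')
files = {'file1': ('gamename.pdf', fake_pdf, 'application/pdf')}

# prepare() builds the multipart body without contacting the server.
prepared = requests.Request('POST', 'https://example.com/submit/', files=files).prepare()

content_type = prepared.headers['Content-Type']
body = prepared.body

print(content_type)
```

When sending a real file, opening it in a with block ensures that the file is closed once the request has finished.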
Cookies with requests
Finally, we are going to see how to send a cookie generated by our application to the server.
As I have mentioned before, requests handles cookies through an object of type RequestsCookieJar, which is a kind of dictionary with an interface to specify, among other things, the domain and/or path of a cookie.
Therefore, to send a cookie with requests, we can do it in the following way:
import requests
jar = requests.cookies.RequestsCookieJar()
jar.set('cookie_name_1' , 'value_1' , domain = 'example.com' , path = '/')
jar.set('cookie_name_2', 'value_2', domain='example.com', path='/admin')
r = requests.get('https://example.com' , cookies = jar)
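The domain and path of each cookie decide which requests it is attached to. As a minimal sketch, the requests below are only prepared, not sent, so the resulting Cookie header can be inspected offline. The /admin/users path is a hypothetical URL chosen for illustration.

```python
import requests

jar = requests.cookies.RequestsCookieJar()
jar.set('cookie_name_1', 'value_1', domain='example.com', path='/')
jar.set('cookie_name_2', 'value_2', domain='example.com', path='/admin')

# prepare() computes the Cookie header without making the request.
root = requests.Request('GET', 'https://example.com/', cookies=jar).prepare()
admin = requests.Request('GET', 'https://example.com/admin/users', cookies=jar).prepare()

root_cookie_header = root.headers.get('Cookie')
admin_cookie_header = admin.headers.get('Cookie')

print(root_cookie_header)
print(admin_cookie_header)
```

The request to the site root should carry only the first cookie, while the request under /admin should carry both, because the second cookie is restricted to that path.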
Conclusion
Well, it was an intense but very productive tutorial. In it, we have reviewed the main aspects of how to use the requests library to make HTTP requests, either to a web page or to consume an API.