How to Build a Web Crawler Using Python


In order to build a web crawler, we are going to use the requests Python library. Requests is a Python library that greatly simplifies working with HTTP requests. Sooner or later in a project you may have to make web requests, whether to consume an API, extract information from a page, or submit the contents of a form in an automated way. If so, Python requests is your great ally.

This tutorial is an introduction to the requests library. In it I will show you how to use it for the most common tasks: making a GET request, making a POST request, passing headers in a request, simulating the submission of form data, sending and receiving JSON, and handling the response status code.

Installation[edit]

Since requests is a third-party library, the first thing you should do is install it. The easiest way is with the pip package manager:

pip install requests

If you are using Python 3, try:

pip3 install requests

How to make a GET request in Python with requests[edit]

One of the most common operations with the requests library is making a GET request, either to obtain the content of a website or to call an API.

To do this, simply call the get() function, passing the URL you want to request.

import requests
r = requests.get('https://www.google.com/')
print(r.text)

The function returns a Response object, assigned here to the variable r, which contains all the information from the response.

How to make a POST request in Python with requests[edit]

Now imagine that you want to make a POST request to send the data of a form. In this case, the procedure is very similar to the previous one, except that you must call the post() function and pass, in the data parameter, a dictionary with the data for the body of the request. When you pass the data in the data parameter, requests takes care of encoding it correctly before making the request:

import requests
form_data = {'gameCategory': 'racing', 'numberOfGames': '12'}
r = requests.post('https://example.com/api/', data=form_data)

When a form has one or more multivalued fields, the different values can be specified in two different ways.

The first is a dictionary, indicating a list of values for a key:

import requests
form_data = {'gameCategories': ['racing', 'arcade'], 'numberOfGames': '12'}
r = requests.post('https://example.com/api/', data=form_data)
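
The other option, which requests also accepts in the data parameter, is a list of tuples, where repeating a key represents a multivalued field:

import requests
# Repeating the 'gameCategories' key sends the field twice, once per value
form_data = [('gameCategories', 'racing'), ('gameCategories', 'arcade'), ('numberOfGames', '12')]
r = requests.post('https://example.com/api/', data=form_data)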

And many more request methods[edit]

In the same way that there are get() and post() functions, requests has functions for the remaining request methods: PATCH, PUT, DELETE, HEAD and OPTIONS. They are patch(), put(), delete(), head() and options().
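
As a minimal sketch, assuming https://example.com/api/item/1 is a placeholder endpoint, these functions are used exactly like get() and post():

import requests

# Each HTTP method has a function with the same name in lowercase
r = requests.put('https://example.com/api/item/1', data={'score': '7'})
r = requests.patch('https://example.com/api/item/1', data={'score': '8'})
r = requests.delete('https://example.com/api/item/1')
r = requests.head('https://example.com/api/item/1')
r = requests.options('https://example.com/api/item/1')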

Timeouts[edit]

For any request, it is possible to specify a response timeout. To do this, indicate in the timeout parameter the maximum number of seconds the request should wait before receiving the first byte.

import requests
r = requests.get('https://www.google.com/', timeout=0.01)

Note: If no timeout is specified, requests will wait indefinitely for a response.

If the server does not return a response within the specified time, the requests.exceptions.Timeout exception will be raised.

Requests exceptions[edit]

If an error occurs when making the request, requests will raise an exception. The base class for all of them is requests.exceptions.RequestException, but the most common exceptions are the following (a handling example is shown after the list):

  • Timeout: If the server does not return a response within the time indicated in the timeout parameter.
  • TooManyRedirects: If a request exceeds the maximum number of redirects.
  • ConnectionError: If there is a network problem (no internet, DNS failure, connection rejected, ...).
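
As a minimal sketch, assuming https://example.com/ is a placeholder URL, the request can be wrapped in a try/except block to handle these exceptions:

import requests

try:
    r = requests.get('https://example.com/', timeout=2)
except requests.exceptions.Timeout:
    print('The server did not respond within 2 seconds')
except requests.exceptions.ConnectionError:
    print('Network problem: no internet, DNS failure or connection refused')
except requests.exceptions.RequestException as e:
    # Base class: catches any other error raised by requests
    print('Request failed:', e)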

Controlling redirects with the requests library[edit]

By default, when making a request with the requests library, it follows the redirects indicated by the server before returning the final response (except for HEAD requests, where redirect following must be enabled explicitly).

If this happens, the response object stores in its history attribute a list with all the intermediate responses, from the oldest to the most recent.

For example, if you make a request to http://google.com, you will get the following response:

>>> import requests
>>> r = requests.get('http://google.com/')
>>> r.history
[<Response [301]>]
>>> r.status_code
200
>>> r.url
'http://www.google.com/'

As you can see, the first request redirects you to http://www.google.com/.

To change this behaviour, set the allow_redirects parameter to False.

>>> r = requests.get('http://google.com/', allow_redirects=False)
>>> r.status_code
301
>>> r.history
[]
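
Conversely, for a HEAD request, which does not follow redirects by default, you can enable them explicitly with allow_redirects=True:

import requests

# HEAD only follows redirects if allow_redirects=True is passed explicitly
r = requests.head('http://google.com/', allow_redirects=True)
# After this call, r.history should contain the intermediate 301 response,
# just as in the GET example above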

The Response object: reading the response in Python[edit]

Once we have reviewed the main aspects of making an HTTP request, in this section we are going to focus on the Response object, which is obtained as the result of a request.

This object contains all the information regarding the response, such as content, response code, headers, or cookies.

Response content[edit]

When the response returned by the server is text, for example HTML or XML, the content is found in the text attribute of the Response object.

Requests automatically decodes the content returned by the server, guessing the encoding to use from the response headers. To find out which encoding was used, you can access the encoding attribute.

>>> import requests
>>> r = requests.get('https://www.google.com/')
>>> r.encoding
'ISO-8859-1'
>>> r.text
'<!doctype html><html itemscope="" itemtype="h...'

For those cases in which the response is not text, such as an image or a PDF, you must access the content attribute instead, which returns the content as a sequence of bytes.
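
For example, as a minimal sketch assuming https://example.com/logo.png is a placeholder image URL, the bytes in content can be written directly to a file:

import requests

r = requests.get('https://example.com/logo.png')
# r.content is a bytes object, so it can be written to a file opened in binary mode
with open('logo.png', 'wb') as f:
    f.write(r.content)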

Finally, there is a special case that gives access to the raw socket from which the server's response is read: the raw attribute. However, instead of accessing raw directly, it is preferable to call the iter_content function using the following pattern, especially when you want to stream the download (in that case, pass stream=True when making the request):

for chunk in r.iter_content(chunk_size=128):
    # code that handles the downloaded byte sequence
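
As a more complete sketch, assuming https://example.com/bigfile.zip is a placeholder URL, the pattern above is usually combined with stream=True and a file opened in binary mode:

import requests

# stream=True avoids loading the whole response body into memory at once
r = requests.get('https://example.com/bigfile.zip', stream=True)
with open('bigfile.zip', 'wb') as fd:
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)  # write each downloaded chunk of bytes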

Response status code[edit]

To get the status code of the response, access its status_code attribute.

>>> import requests
>>> r = requests.get('http://www.google.com/')
>>> r.status_code
200
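
If you prefer an exception instead of checking the code manually, the Response object also provides raise_for_status(), which raises requests.exceptions.HTTPError for 4xx and 5xx responses:

import requests

r = requests.get('http://www.google.com/')
r.raise_for_status()  # raises requests.exceptions.HTTPError if the status is 4xx or 5xx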

Response headers[edit]

The response headers are accessible through the headers attribute. This attribute is a special dictionary in which each returned header appears as a key.

>>> import requests
>>> r = requests.get('http://www.google.com/')
>>> r.headers
{'Expires': '-1', 'Cache-Control': 'private, max-age=0', ...}
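
This dictionary is case-insensitive, so a header can be looked up regardless of capitalization:

>>> r.headers['Content-Type'] == r.headers['content-type']
True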

Response cookies[edit]

If you want to check the cookies returned by the server, you can do so by accessing the response's cookies attribute. This attribute is of type RequestsCookieJar, which acts as an enhanced dictionary that can also record, among other things, the domain and/or path of each cookie:

>>> import requests
>>> r = requests.get('https://www.google.com/')
>>> r.cookies
<RequestsCookieJar[Cookie(version=0, name='1P_JAR', value='2020-10-03-10', port=None, port_specified=False, domain='.google.com', ... rest={'HttpOnly': None}, rfc2109=False)]>
>>> r.cookies['1P_JAR']
'2020-10-03-10'

How to pass parameters in url in python with requests[edit]

Sometimes it is necessary to pass a series of parameters in the URL of the request. This can be done manually by adding to the URL a string that begins with the character ? followed by pairs of the form param1=value1&param2=value2....

For example:

import requests
r = requests.get('https://example.com?page=2')

However, requests makes it easier to build a parameterized URL: just pass a dictionary in the params parameter.

import requests
parameters = {'key1': 'value1', 'key2': ['val1', 'val2']}
r = requests.get('https://example.com', params=parameters)
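
You can check the URL that requests built by looking at r.url; list values are expanded into repeated keys, so it should look similar to this:

print(r.url)
# Expected output (approximately): https://example.com/?key1=value1&key2=val1&key2=val2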

How to send headers with requests[edit]

If you need to specify a header in the request, pass a dictionary of name:value pairs in the headers parameter. The key of each item is the header name, and its value must be a string.

Example:

import requests
headers = {'cache-control': 'no-cache', 'accept': 'text/html'}
r = requests.get('https://example.com', headers=headers)

JSON requests and responses in Python with requests[edit]

One of the most common uses of the requests library is making requests to an API from an application.

One of the main characteristics of consuming an API is that, generally, the data is sent and obtained in JSON format.

Next, I'll show you how easy it is to make API calls using the requests library.

Python JSON GET requests[edit]

To make a GET request, simply call the get() function.

If the response is JSON, which is the most common case, we can call the response's json() method to decode the data and return it as a dictionary with the fields of that JSON.

import requests
r = requests.get('https://example.com/json/')
posts = r.json()

NOTE: Check the response status code to see if the request succeeded or an error occurred. In many cases a server can return JSON even when it fails; a possible check is sketched below.
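
A minimal sketch of such a check, assuming https://example.com/json/ is a placeholder endpoint, could be:

import requests

r = requests.get('https://example.com/json/')
if r.status_code == 200:
    posts = r.json()  # only decode the JSON if the request succeeded
else:
    print('The request failed with status', r.status_code)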

Python JSON POST requests[edit]

To send data in JSON format to an API using the POST, PUT or PATCH methods, simply pass a dictionary through the json parameter. Requests takes care of setting the Content-Type header for you and serializing the data correctly.

import requests
load = {'game_name': 'Friv', 'score': 5}
r = requests.post('https://friv.co.in/game/', json=load)

How to send a Multipart-Encoded file[edit]

To send a file with requests, you just have to open it, place it in a dictionary and pass that dictionary through the files parameter:

import requests

files = {'file1': open('gamename.pdf', 'rb')}
r = requests.post('https://example.com/submit/', files=files)

It is also possible to explicitly specify the file name and type as follows:

import requests
files = {'file1': ('gamename.pdf', open('gamename.pdf', 'rb'), 'application/pdf')}
r = requests.post('https://example.com/submit/', files=files)

Cookies with requests[edit]

Finally, we are going to see how to send a cookie generated by our application to the server.

As I mentioned before, requests handles cookies through a RequestsCookieJar object, a kind of dictionary with an interface to specify, among other things, the domain and/or path of a cookie.

Therefore, to send a cookie through requests, we can do it in the following way:

import requests
jar = requests.cookies.RequestsCookieJar()
jar.set('cookie_name_1', 'value_1', domain='example.com', path='/')
jar.set('cookie_name_2', 'value_2', domain='example.com', path='/admin')
r = requests.get('https://example.com', cookies=jar)

Conclusion[edit]

Well, this was an intense but very productive tutorial. In it, we have reviewed the main aspects of using the requests library to make HTTP requests, either to a web page or to consume an API.
