class: center, middle

# Web Scraping in Python: Requests

February 15, 2018

Instructor: [S. M. Masoud Sadrnezhaad](https://twitter.com/smmsadrnezh)

---

Quick Review On HTTP Methods
==========

- The **Hypertext Transfer Protocol (HTTP)** is designed to enable communications **between clients and servers**.
- HTTP works as a **request-response protocol** between a client and server.
- **A web browser** may be the **client**, and **an application** on a computer that hosts a web site may be the **server**.
- Example: A client (browser) submits an HTTP request to the server; then the server returns a response to the client. The response contains status information about the request and may also contain the requested content.
- Two commonly used methods for a request-response between a client and server are GET and POST.
    * **GET** - Requests data from a specified resource
    * **POST** - Submits data to be processed to a specified resource

---

The GET Method
==========

- Note that the query string (name/value pairs) is sent in **the URL** of a GET request:

```
/test/demo_form.php?name1=value1&name2=value2
```

- Some other notes on GET requests:
    * GET requests can be **cached**
    * GET requests remain in the **browser history**
    * GET requests can be **bookmarked**
    * GET requests should **never be used** when dealing with **sensitive data**
    * GET requests have **length restrictions**
    * GET requests **should be used only to retrieve data**

---

The POST Method
==========

- Note that the query string (name/value pairs) is sent in **the HTTP message body** of a POST request:

```http
POST /test/demo_form.php HTTP/1.1
Host: w3schools.com

name1=value1&name2=value2
```

- Some other notes on POST requests:
    * POST requests are **never cached**
    * POST requests do **not remain** in the **browser history**
    * POST requests **cannot** be **bookmarked**
    * POST requests have **no restrictions on data length**

---

Other HTTP Request Methods
==========

The following list shows some other HTTP request methods:

- **HEAD** Same as GET but returns only HTTP
  headers and no document body
- **PUT** Uploads a representation of the specified URI
- **DELETE** Deletes the specified resource
- **OPTIONS** Returns the HTTP methods that the server supports
- **CONNECT** Converts the request connection to a transparent TCP/IP tunnel

Status Code Definitions
==========

Each Status-Code is described below, including a description of which method(s) it can follow and any metainformation required in the response.

- **Successful 2xx**
- **Redirection 3xx**
- **Client Error 4xx**
- **Server Error 5xx**

---

# Python Requests

- **Requests** is an elegant and simple **HTTP library for Python**, built for human beings.
- Requests is ready for **today's web**.
- Requests officially supports Python 2.6–2.7 & **3.4–3.7**.
- Requests is released under the terms of the **Apache 2.0 License**.
- Twitter, Spotify, Microsoft, Amazon, Lyft, BuzzFeed, Reddit, The NSA, Her Majesty's Government, Google, Twilio, Runscope, Mozilla, Heroku, PayPal, NPR, Obama for America, Transifex, Native Instruments, The Washington Post, SoundCloud, Kippt, Sony, and Federal U.S. Institutions that prefer to be unnamed claim to use Requests internally.

---

Installation of Requests
==========

## pipenv method

To install Requests, run this command in your terminal of choice:

```bash
$ pipenv install requests
```

If you don't have [pipenv](http://pipenv.org) installed (tisk tisk!), head over to the Pipenv website for installation instructions.

## pip method

```bash
$ pip3 install requests
```

Requests is actively developed on GitHub, where the code is [always available](https://github.com/requests/requests). You can either clone the public repository or download the [tarball](https://github.com/requests/requests/tarball/master).

---

Make a Request
==========

Let's get started with some simple examples. Making a request with Requests is very simple. Begin by importing the Requests module:

```python
>>> import requests
```

Now, let's try to get a webpage.
For this example, let's get GitHub's public timeline:

```python
>>> r = requests.get('https://api.github.com/events')
```

Now, we have a **Response** object called `r`. We can get all the information we need from this object.

---

Make a Request (Contd)
==========

Requests' simple API means that all forms of HTTP request are just as obvious. For example, this is how you make an HTTP POST request:

```python
>>> r = requests.post('http://httpbin.org/post', data={'key': 'value'})
```

Nice, right? What about the other HTTP request types: PUT, DELETE, HEAD and OPTIONS? These are all just as simple:

```python
>>> r = requests.put('http://httpbin.org/put', data={'key': 'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')
```

That's all well and good, but it's also only the start of what Requests can do.

---

Passing Parameters In URLs
==========

You often want to send some sort of data in the **URL's query string**. If you were constructing the URL by hand, this data would be given as key/value pairs in the URL after a question mark, e.g. `httpbin.org/get?key=val`. Requests allows you to provide these arguments **as a dictionary of strings**, using the `params` keyword argument. As an example, if you wanted to pass `key1=value1` and `key2=value2` to `httpbin.org/get`, you would use the following code:

```python
>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.get('http://httpbin.org/get', params=payload)
```

You can see that the URL has been correctly encoded by printing the URL:

```python
>>> print(r.url)
http://httpbin.org/get?key2=value2&key1=value1
```

Note that any dictionary key whose value is `None` **will not be added** to the URL's query string.
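
To make the encoding step concrete, here is a minimal sketch that builds the same kind of query string using only the standard library's `urllib.parse.urlencode`, so it runs without network access. It illustrates the URL format and is not Requests' internal code; `build_url` is a hypothetical helper name, and `key3` and the value `'value 2'` are made up for the example.

```python
from urllib.parse import urlencode


def build_url(base, params):
    """Hypothetical helper: append params to base as an encoded query string."""
    # Mirror the behaviour described above: keys whose value is None are skipped.
    cleaned = {key: value for key, value in params.items() if value is not None}
    query = urlencode(cleaned)
    # Only add the question mark when there is something to put after it.
    return base + '?' + query if query else base


payload = {'key1': 'value1', 'key2': 'value 2', 'key3': None}
print(build_url('http://httpbin.org/get', payload))
```

In the printed URL the space in `'value 2'` is escaped as `+`, and `key3` does not appear because its value is `None`.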
---

Passing Parameters In URLs (Contd)
==========

You can also pass **a list of items as a value**:

```python
>>> payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
>>> r = requests.get('http://httpbin.org/get', params=payload)
>>> print(r.url)
http://httpbin.org/get?key1=value1&key2=value2&key2=value3
```

---

Response Content
==========

We can read the content of the server's response. Consider the GitHub timeline again:

```python
>>> import requests
>>> r = requests.get('https://api.github.com/events')
>>> r.text
'[{"repository":{"open_issues":0,"url":"https://github.com/...
```

When you make a request, Requests makes **educated guesses** about the **encoding of the response** based on the HTTP headers. The text encoding guessed by Requests is used when you access `r.text`. You can find out what encoding Requests is using, and **change it**, using the `r.encoding` property:

```python
>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'
```

If you change the encoding, Requests will **use the new value** of `r.encoding` whenever you access `r.text`.

---

Binary Response Content
==========

You can also access the response body as bytes, for non-text requests:

```python
>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...
```

The `gzip` and `deflate` transfer-encodings are automatically decoded for you.
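
As a rough illustration of what that automatic decoding amounts to, the sketch below compresses a made-up JSON body with the standard library's `gzip` module, then reverses the step and decodes the bytes to text. It runs without network access; the payload and variable names are assumptions for illustration, and Requests' own implementation may differ.

```python
import gzip

# Made-up response body; a real one would arrive from the server.
body = '[{"repository": {"open_issues": 0}}]'.encode('utf-8')

# Bytes as a server using gzip compression would put them on the wire.
wire_bytes = gzip.compress(body)

# Decompressing gives back the raw body, comparable to what r.content holds.
content = gzip.decompress(wire_bytes)

# Decoding with a text encoding gives a str, comparable to r.text.
text = content.decode('utf-8')

print(wire_bytes[:2])   # gzip streams start with the magic bytes 1f 8b
print(content == body)  # the round trip restores the original bytes
print(text)
```
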
For example, to create an image from binary data returned by a request, you can use the following code:

```python
>>> from PIL import Image
>>> from io import BytesIO
>>> i = Image.open(BytesIO(r.content))
```

---

Response Status Codes
==========

We can check the response status code:

```python
>>> r = requests.get('http://httpbin.org/get')
>>> r.status_code
200
```

Requests also comes with a built-in status code lookup object for easy reference:

```python
>>> r.status_code == requests.codes.ok
True
```

---

Response Status Codes (Contd)
==========

If we made a bad request (a 4XX client error or 5XX server error response), we can raise it with `Response.raise_for_status()`:

```python
>>> bad_r = requests.get('http://httpbin.org/status/404')
>>> bad_r.status_code
404
>>> bad_r.raise_for_status()
Traceback (most recent call last):
  File "requests/models.py", line 832, in raise_for_status
    raise http_error
requests.exceptions.HTTPError: 404 Client Error
```

But, since our `status_code` for `r` was `200`, calling `raise_for_status()` raises nothing and simply returns `None`:

```python
>>> print(r.raise_for_status())
None
```

All is well.

---

Response Headers
==========

We can view the server's response headers using a Python dictionary:

```python
>>> r.headers
{
    'content-encoding': 'gzip',
    'transfer-encoding': 'chunked',
    'connection': 'close',
    'server': 'nginx/1.0.4',
    'x-runtime': '148ms',
    'etag': '"e1ca502697e5c9317743dc078f67693f"',
    'content-type': 'application/json'
}
```

---

Response Headers (Contd)
==========

The dictionary is special, though: it's made just for HTTP headers. According to [RFC 7230](http://tools.ietf.org/html/rfc7230#section-3.2), HTTP header names are case-insensitive. So, we can access the headers using any capitalization we want:

```python
>>> r.headers['Content-Type']
'application/json'
>>> r.headers.get('content-type')
'application/json'
```

---

Redirection and History
==========

By default Requests will perform location redirection for all verbs except HEAD.
We can use the `history` property of the Response object to track redirection.

The `Response.history` list contains the `Response` objects that were created in order to complete the request. The list is sorted from the oldest to the most recent response.

For example, GitHub redirects all HTTP requests to HTTPS:

```python
>>> r = requests.get('http://github.com')
>>> r.url
'https://github.com/'
>>> r.status_code
200
>>> r.history
[<Response [301]>]
```

---

Redirection and History (Contd)
==========

If you're using GET, OPTIONS, POST, PUT, PATCH or DELETE, you can disable redirection handling with the `allow_redirects` parameter:

```python
>>> r = requests.get('http://github.com', allow_redirects=False)
>>> r.status_code
301
>>> r.history
[]
```

If you're using HEAD, you can enable redirection as well:

```python
>>> r = requests.head('http://github.com', allow_redirects=True)
>>> r.url
'https://github.com/'
>>> r.history
[<Response [301]>]
```

---

Timeouts
==========

You can tell Requests to **stop waiting** for a response **after a given number of seconds** with the `timeout` parameter. Nearly all production code should use this parameter in nearly all requests. Failure to do so can cause your program to **hang indefinitely**:

```python
>>> requests.get('http://github.com', timeout=0.001)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)
```

---

# References

* http://docs.python-requests.org
* https://github.com/requests/requests
* https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

---

class: center, middle

# Thank you. Any questions?