class: center, middle

.center[]

# Web Scraping in Python: Web APIs

March 14, 2018

Instructor: [S. M. Masoud Sadrnezhaad](https://twitter.com/smmsadrnezh)

---

Introduction to Web APIs
==========

- An API (Application Programming Interface) is a defined interface through which one program requests data or services from another. A web API exposes such an interface as HTTP services that can be consumed by a wide variety of clients.
- Web APIs use the HTTP protocol to handle requests between the client and the web server.
- Some of the most common APIs that enable developers to integrate with and use a provider's infrastructure include:
  - Google APIs
  - Twitter API
  - Amazon API
  - Facebook API
- One of the most important reasons to use an API rather than a static data source is that the data is **real time**.
- For example, the Twitter API we are going to use will fetch real-time data from the social network.
- Another advantage: because the data keeps changing, downloading fresh copies yourself at intervals would be time-consuming.

---

Using the Requests Library
==========

- The Python examples that follow use the Requests library, an HTTP library that lets you send HTTP requests from Python. Install it with `pip install requests`.
- The GET method is used to get information from a web server. Let's see how to make a GET request to get GitHub's public timeline.
- We use the variable `response` to store the response from our request.

```python
import requests

response = requests.get('https://github.com/timeline.json')
```

---

Using the Requests Library (Contd)
==========

- Now that we have made a request to the GitHub timeline, let's get the content contained in the response and its encoding.

```python
>>> response.text
u'{"message":"Hello there, wayfaring stranger. If you\u2019re reading this then you probably didn\u2019t see our blog post a couple of years back announcing that this API would go away: http://git.io/17AROg Fear not, you should be able to get what you need from the shiny new Events API instead.","documentation_url":"https://developer.github.com/v3/activity/events/#list-public-events"}'
>>> response.encoding
'utf-8'
```

- Requests has a built-in JSON decoder, which you can use to parse the body of a JSON response into Python objects.

```python
>>> response.json()
{u'documentation_url': u'https://developer.github.com/v3/activity/events/#list-public-events', u'message': u'Hello there, wayfaring stranger. If you\u2019re reading this then you probably didn\u2019t see our blog post a couple of years back announcing that this API would go away: http://git.io/17AROg Fear not, you should be able to get what you need from the shiny new Events API instead.'}
```

---

How to Create and Update Information on the Web API
==========

- The POST and PUT methods are both used to create and update data.
- Despite the similarities, note that POST is not idempotent: if two identical items are submitted with POST, the data store ends up with two entries. This is why updates are usually sent with PUT.
- Create data (POST request):

```python
# Assumes an API is listening locally at this address.
r = requests.post('http://127.0.0.1/api/v1/add_item', data={'task': 'Shopping'})
```

- Update data (PUT request):

```python
r = requests.put('http://127.0.0.1/api/v1/add_item', data={'task': 'Shopping at 2'})
```

---

What does curl do?
==========

- curl performs Internet transfers for resources specified as URLs, using Internet protocols.
- Everything and anything that is related to Internet protocol transfers can be considered curl's business.
- Things that are not related to that should be avoided and left for other projects and products.
- Note also that curl and libcurl try to avoid handling the actual data that is transferred.
- It has, for example, no knowledge about HTML or anything else of the content that is popular to transfer over HTTP, but it knows all about how to transfer such data over HTTP.
- curl is a client-side program, a URL client. So 'c' for Client and URL: cURL.
- curl started out as a command-line tool, and it has been invoked from shell prompts and from within scripts by thousands of users over the years.

---

curl Command line options
==========

- When telling curl to do something, you invoke curl with zero, one or several command-line options to accompany the URL or set of URLs you want the transfer to be about.
- curl supports over two hundred different options.
- Command-line options pass on information to curl about how you want it to behave.
- For example, you can ask curl to switch on verbose mode with the `-v` option:

```bash
curl -v http://example.com
```

- `-v` is used here as a "short option".
- You write those with the minus symbol and a single letter immediately following it.
- Many options are just switches that switch something on or change something between two known states. They can be used with just that option name.

---

curl Command line options (Contd)
==========

- You can also **combine several single-letter** options after the minus.
- To ask for both verbose mode and that curl follows HTTP redirects:

```bash
curl -vL http://example.com
```

- The command-line parser in curl always parses the entire line, so you can put the options anywhere you like; they can also appear after the URL:

```bash
curl http://example.com -Lv
```

- Single-letter options are convenient since they are **quick to write and use**, but as there are only a **limited number of letters** in the alphabet and there are many things to control, not all options are available like that.
- **Long option** names are therefore provided for those.
- Also, as a convenience and to allow scripts to become **more readable**, most short options have **longer name aliases**.

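
The option conventions described on this slide can be tried out without curl itself. The following is only an analogy, not curl's actual parser (which is written in C): a minimal sketch using Python's standard `argparse` module, which happens to follow similar conventions. The program name `mini-curl` and the helper `build_parser` are made up for illustration.

```python
import argparse


def build_parser():
    """Build a tiny parser that mimics two curl switches (analogy only)."""
    parser = argparse.ArgumentParser(prog="mini-curl")
    # Each switch has a short form and a longer, more readable alias.
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("-L", "--location", action="store_true")
    parser.add_argument("url")
    return parser


parser = build_parser()

# Combined short options before the URL, like `curl -vL http://example.com`.
combined = parser.parse_args(["-vL", "http://example.com"])

# The same options after the URL, like `curl http://example.com -Lv`.
trailing = parser.parse_args(["http://example.com", "-Lv"])

# The long aliases of the same two switches.
long_form = parser.parse_args(["--verbose", "--location", "http://example.com"])

# All three spellings describe the same request.
print(combined == trailing == long_form)
```

All three argument lists parse to the same settings, which is the point of the slide: the spelling and position of the options change, the meaning does not.
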
---

curl Command line options (Contd)
==========

- Long options are always written with **two minuses** (or dashes, whichever you prefer to call them) followed by the name, and you can **only write one option name per double-minus**.
- Asking for **verbose mode** using the **long option** format looks like:

```bash
curl --verbose http://example.com
```

- and asking for HTTP redirects as well using the long format looks like:

```bash
curl --verbose --location http://example.com
```

- Not all options are just simple boolean flags that enable or disable features.
- For some of them you need to **pass on data**, like perhaps a user name or a path to a file.

---

curl Command line options (Contd)
==========

- You do this by writing first the option and then the argument, separated with a space.
- For example, to send an arbitrary string of data in an HTTP POST to a server:

```bash
curl -d arbitrary http://example.com
```

- and it works the same way even if you use the long form of the option:

```bash
curl --data arbitrary http://example.com
```

- When you use the short options with arguments, you can, in fact, also write the data without the space separator:

```bash
curl -darbitrary http://example.com
```

---

GitHub REST API
==========

- Let's start by testing our setup. Open up a command prompt and enter the following command:

```bash
$ curl https://api.github.com/zen
Keep it logically awesome.
```

- The response will be a random selection from GitHub's design philosophies.
- Next, let's `GET` [Chris Wanstrath's](https://github.com/defunkt) GitHub profile:

```bash
# GET /users/defunkt
$ curl https://api.github.com/users/defunkt
{
  "login": "defunkt",
  "id": 2,
  "url": "https://api.github.com/users/defunkt",
  "html_url": "https://github.com/defunkt",
  ...
}
```

- Mmmmm, tastes like JSON.

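
The same `GET /users/defunkt` request can also be sketched in Python. This is a minimal sketch using only the standard library rather than curl; the helper names `parse_user` and `fetch_user` are hypothetical, and the live call is left to the reader because it needs network access. The sample body is just the abbreviated JSON from the curl output above.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def parse_user(body):
    """Decode a JSON user document and keep a few of the fields shown above."""
    data = json.loads(body)
    return {key: data.get(key) for key in ("login", "id", "html_url")}


def fetch_user(login):
    """GET /users/<login> from the live API (requires network access)."""
    with urllib.request.urlopen(f"{API_ROOT}/users/{login}") as response:
        return parse_user(response.read().decode("utf-8"))


# Offline check against the abbreviated body from the curl example.
sample = '{"login": "defunkt", "id": 2, "html_url": "https://github.com/defunkt"}'
print(parse_user(sample))
```

Calling `fetch_user("defunkt")` performs the actual request; keeping the parsing in its own function means it can be exercised without touching the network.
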
---

GitHub REST API (Contd)
==========

- Let's add the `-i` flag to include headers:

```bash
$ curl -i https://api.github.com/users/defunkt
HTTP/1.1 200 OK
Server: GitHub.com
Date: Sun, 11 Nov 2012 18:43:28 GMT
Content-Type: application/json; charset=utf-8
Connection: keep-alive
Status: 200 OK
ETag: "bfd85cbf23ac0b0c8a29bee02e7117c6"
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 57
X-RateLimit-Reset: 1352660008
X-GitHub-Media-Type: github.v3
Vary: Accept
Cache-Control: public, max-age=60, s-maxage=60
X-Content-Type-Options: nosniff
Content-Length: 692
Last-Modified: Tue, 30 Oct 2012 18:58:42 GMT
...
```

---

GitHub REST API (Contd)
==========

```bash
...
{
  "login": "defunkt",
  "id": 2,
  "url": "https://api.github.com/users/defunkt",
  "html_url": "https://github.com/defunkt",
  ...
}
```

- There are a few interesting bits in the response headers. As expected, the `Content-Type` is `application/json`.
- Any headers beginning with `X-` are custom headers and are not included in the HTTP spec. For example, take note of the `X-RateLimit-Limit` and `X-RateLimit-Remaining` headers.
- This pair of headers indicates how many requests a client can make in a rolling time period (typically an hour) and how many of those requests the client still has left.

---

GitHub REST API (Contd)
==========

- Unauthenticated clients can make **60 requests per hour**. To get more, we'll need to authenticate.
- In fact, doing anything interesting with the GitHub API **requires authentication**.
- The **easiest way** to authenticate with the GitHub API is by simply using your GitHub **username and password** via Basic Authentication.

```bash
$ curl -i -u your_username https://api.github.com/users/defunkt
Enter host password for user your_username:
```

- The `-u` flag sets the username, and cURL will prompt you for the password.
- You can use `-u "username:password"` to avoid the prompt, but this leaves your password in shell history and isn't recommended.
- When authenticating, you should see your rate limit bumped to **5,000 requests an hour**, as indicated in the `X-RateLimit-Limit` header.
- In addition to just getting more calls per hour, authentication is the key to **reading and writing private information** via the API.

---

GitHub REST API (Contd)
==========

- When properly authenticated, you can take advantage of the permissions associated with your GitHub account.
- For example, try getting your own user profile:

```bash
$ curl -i -u your_username https://api.github.com/user
{
  ...
  "plan": {
    "space": 2516582,
    "collaborators": 10,
    "private_repos": 20,
    "name": "medium"
  }
  ...
}
```

- This time, in addition to the same set of public information we retrieved for [@defunkt](https://github.com/defunkt) earlier, you should also see the **non-public information** for your user profile. For example, you'll see a `plan` object in the response, which gives details about the **GitHub plan** for the account.

---

Twitter REST API
==========

- In this section, you are going to learn how to obtain Twitter API credentials, authenticate to the Twitter API, and interact with the Twitter API using Python.
- You will also be able to retrieve information from public Twitter accounts, like tweets, followers, etc.

## Authenticating With Twitter

- We need to authenticate with the Twitter API before we can interact with it. To do this, follow these steps:
  - Go to the [Twitter Apps page](https://apps.twitter.com/).
  - Click on Create New App (you need to be logged in to Twitter to access this page). If you don't have a Twitter account, create one.
  - Create a name and description for your app and a website placeholder.

---

Authenticating With Twitter (Contd)
==========

- Locate the Keys and Access Tokens tab and create your access token.
- You need to take note of the `Access token` and `Access Token secret` since you will need them for the authentication process.
- You also need to take note of the `Consumer Key` and `Consumer Secret`.
- There are a few libraries that we can use to access the Twitter API, but we are going to use the `python-twitter` library here.
- To install python-twitter, use:

```bash
pip install python-twitter
```

---

Authenticating With Twitter (Contd)
==========

- The Twitter API is exposed via the `twitter.Api` class, so let's create an instance of the class by passing our tokens and secret keys:

```python
import twitter

api = twitter.Api(consumer_key=[consumer key],
                  consumer_secret=[consumer secret],
                  access_token_key=[access token],
                  access_token_secret=[access token secret])
```

- Replace the bracketed placeholders above with your credentials and make sure they are enclosed in quotes, e.g. `consumer_key='xxxxxxxxxx'`.

---

Querying Twitter
==========

There are many methods of interacting with the Twitter API, including:

```python
>>> api.PostUpdates(status)
>>> api.PostDirectMessage(user, text)
>>> api.GetUser(user)
>>> api.GetReplies()
>>> api.GetUserTimeline(user)
>>> api.GetHomeTimeline()
>>> api.GetStatus(status_id)
>>> api.DestroyStatus(status_id)
>>> api.GetFriends(user)
>>> api.GetFollowers()
```

- To **get data** from Twitter, we are going to make an **API call** with the help of the **api object** we created above.

---

Querying Twitter (Contd)
==========

- We will do the following:
  - Create a `user` variable and set it equal to a valid Twitter handle (username).
  - Call the `GetUserTimeline()` method on the `api` object and **pass in** the following **arguments**:
    - a valid Twitter handle
    - the number of tweets you want to retrieve (`count`)
    - a flag to exclude retweets (this is done using `include_rts=False`)

Let's get the latest tweets from the [smmsadrnezh](https://twitter.com/smmsadrnezh) timeline, excluding retweets.

```python
import twitter

api = twitter.Api(consumer_key="xxxxxxxxxxxx",
                  consumer_secret="xxxxxxxxxxxxxx",
                  access_token_key="314746354-xxxxx",
                  access_token_secret="xxxxxx")

user = "@smmsadrnezh"
statuses = api.GetUserTimeline(screen_name=user, count=30, include_rts=False)
for s in statuses:
    print(s.text)
```

The `GetUserTimeline()` method will return a list of up to 30 of the latest tweets.

---

Querying Twitter (Contd)
==========

To retrieve the accounts a user follows (their "friends"), we use the `GetFriends()` method; `GetFollowers()` returns the user's followers instead.

```python
import twitter

# Replace the placeholders with your own credentials.
api = twitter.Api(consumer_key="xxxxxxxxxxxx",
                  consumer_secret="xxxxxxxxxxxxxx",
                  access_token_key="314746354-xxxxx",
                  access_token_secret="xxxxxx")

user = "@smmsadrnezh"
friends = api.GetFriends(screen_name=user)
for friend in friends:
    print(friend.name)
```

- Twitter's API can be used to a great extent in **data analytics**.
- It can also be used in complex **big data** problems and for authenticating **apps**.
- Read more about the Twitter API at the [Twitter developers site](https://developer.twitter.com/en/docs/api-reference-index).

---

# References

* https://code.tutsplus.com/articles/how-to-use-restful-web-apis-in-python--cms-29493
* https://ec.haxx.se/
* https://developer.github.com/v3/guides/getting-started/

---
class: center, middle

.center[]

# Thank you. Any questions?