Interesting data sets are the fuel of a good data science project. And while websites like Kaggle offer free data sets to interested data scientists, APIs are another very common way to access and acquire interesting data.
Instead of having to download a data set, APIs allow programmers to request data directly from certain websites through what’s called an Application Programming Interface (hence, “API”). Many large websites like Reddit, Twitter and Facebook offer APIs so that data analysts and data scientists can access interesting data.
In this tutorial, we’re going to cover the basics of accessing an API using the R programming language. You don’t need any API experience, but you will need to be familiar with the fundamentals of R to follow along.
Introduction to APIs with R
“API” is a general term for the place where one computer program interacts with another, or with itself. In this tutorial, we’ll specifically be working with web APIs, where two different computers — a client and server — will interact with each other to request and provide data, respectively.
APIs offer data scientists a polished way to request clean and curated data from a website. When a website like Facebook sets up an API, they are essentially setting up a computer that waits for data requests.
Once this computer receives a data request, it will do its own processing of the data and send it to the computer that requested it. From our perspective as the requester, we will need to write code in R that creates the request and tells the computer running the API what we need. That computer will then read our code, process the request, and return nicely-formatted data that can be easily parsed by existing R libraries.
Why is this valuable? Contrast the API approach to pure web scraping. When a programmer scrapes a web page, they receive the data in a messy chunk of HTML. While there are certainly libraries out there that make parsing HTML text easy, these are all cleaning steps that need to be taken before we even get our hands on the data we want!
Often, we can immediately use the data we get from an API, which saves us time and frustration.
Making API requests in R
To work with APIs in R, we need to bring in some libraries. These libraries take all of the complexities of an API request and wrap them up in functions that we can use in single lines of code. The R libraries that we’ll be using are httr and jsonlite. They serve different roles in our introduction of APIs, but both are essential.
If you don’t have either of these libraries in your R console or RStudio, you’ll need to download them first. Use the install.packages() function to bring in these packages.
install.packages(c("httr", "jsonlite"))
After downloading the libraries, we’ll be able to use them in our R scripts or RMarkdown files.
library(httr)
library(jsonlite)
Making Our First API Request
The first step in getting data from an API is making the actual request in R. This request will be sent to the computer server that has the API, and assuming everything goes smoothly, it will send back a response. The graphic below illustrates this:
There are several types of requests that one can make to an API server. These types of requests correspond to different actions that you want the server to make.
For our purposes, we’ll just be asking for data, which corresponds to a GET request. Other types of requests are POST and PUT, but we won’t need to worry about them for the purposes of this data-science-focused R API tutorial.
In order to create a GET request, we need to use the GET() function from the httr library. The GET() function requires a URL, which specifies the address of the server that the request needs to be sent to.
The GET() function encapsulates all of the complexity of a GET request. For our example, we’ll be working with the Open Notify API, which opens up data on various NASA projects. Using the Open Notify API, we can learn about the location of the International Space Station and how many people are currently in space.
We’ll be working with the latter API first. We’ll start by making our request using the GET() function and specifying the API’s URL:
>> res = GET("http://api.open-notify.org/astros.json")
The output of the GET() function is a response object, stored as a list, which contains all of the information that is returned by the API server. In other words, the res variable contains the response of the API server to our request.
Examining the GET() output
Let’s have a look at what the res variable looks like in the R console:
>> res
Response [http://api.open-notify.org/astros.json]
Date: 2020-01-30 18:07
Status: 200
Content-Type: application/json
Size: 314 B
Investigating the res variable gives us a summary look at the resulting response. The first thing to notice is that it contains the URL that the GET request was sent to. We can also see the date and time that the request was made, as well as the size of the response.
The content type gives us an idea of what form the data takes. This particular response says that the data takes on a JSON format, which gives a hint about why we need the jsonlite library.
The status deserves some special attention. “Status” refers to the success or failure of the API request, and it comes in the form of a number. The number returned tells you whether or not the request was a success and can also detail some reasons why it might have failed.
The number 200 is what we want to see; it corresponds to a successful request, and that’s what we have here. There is more information about other status codes we might encounter here. Since we have a successful 200 status response, we know that we have the data on hand and we can start working with it.
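A script usually needs to check the status itself rather than rely on someone reading the console. The following is a minimal sketch, not part of the original tutorial: describe_status() is a hypothetical helper that groups a status code into the standard HTTP ranges. With a real response, the code could be taken from httr's status_code(res).

```r
# Hypothetical helper: classify an HTTP status code by its standard range.
describe_status <- function(code) {
  if (code >= 200 && code < 300) {
    "success"
  } else if (code >= 300 && code < 400) {
    "redirection"
  } else if (code >= 400 && code < 500) {
    "client error"
  } else if (code >= 500 && code < 600) {
    "server error"
  } else {
    "other"
  }
}

describe_status(200)  # "success"
describe_status(404)  # "client error"

# With a live response object, something like this should work:
# describe_status(httr::status_code(res))
```

httr also provides stop_for_status(res), which turns a failed request into an R error, so a script stops instead of continuing with missing data.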
Handling JSON Data
JSON stands for JavaScript Object Notation. While JavaScript is another programming language, our focus on JSON is its structure. JSON is useful because it is easily readable by a computer, and for this reason, it has become the primary way that data is transported through APIs. Most APIs will send their responses in JSON format.
JSON is formatted as a series of key-value pairs, where a particular word (“key”) is associated with a particular value. An example of this key-value structure is shown below:
{
  "name": "Jane Doe",
  "number_of_skills": 2
}
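To see how this structure maps onto R, here is a small sketch added for illustration. It uses the same made-up record and parses it with fromJSON() from jsonlite; the variable names json_text and person are placeholders. Each key becomes a name in an R list.

```r
library(jsonlite)

# The example record, written as a JSON string.
json_text <- '{"name": "Jane Doe", "number_of_skills": 2}'

# fromJSON() turns the key-value pairs into a named list.
person <- fromJSON(json_text)

names(person)            # the keys: "name" and "number_of_skills"
person$name              # the value stored under the "name" key
person$number_of_skills  # the value stored under "number_of_skills"
```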
In its current state, the data in the res variable is not usable. The actual data is stored as raw bytes in the content element of the res list, and it ultimately needs to be parsed from JSON into an R data structure.
To do this, we first need to convert the raw bytes into a character vector that resembles the JSON format shown above. The rawToChar() function performs just this task, as shown below:
>> rawToChar(res$content)
[1] "{\"people\": [{\"name\": \"Christina Koch\", \"craft\": \"ISS\"}, {\"name\": \"Alexander Skvortsov\", \"craft\": \"ISS\"}, {\"name\": \"Luca Parmitano\", \"craft\": \"ISS\"}, {\"name\": \"Andrew Morgan\", \"craft\": \"ISS\"}, {\"name\": \"Oleg Skripochka\", \"craft\": \"ISS\"}, {\"name\": \"Jessica Meir\", \"craft\": \"ISS\"}], \"number\": 6, \"message\": \"success\"}"
While the resulting string looks messy, it’s truly the JSON structure in character format.
From a character vector, we can convert it into a list data structure using the fromJSON() function from the jsonlite library.
The fromJSON() function needs a character vector that contains the JSON structure, which is what we got from the rawToChar() output. So, if we string these two functions together, we’ll get the data we want in a format that we can more easily manipulate in R.
>> data = fromJSON(rawToChar(res$content))
>> names(data)
[1] "people" "number" "message"
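The same two-step conversion can be tried without a network connection. The sketch below is an illustration rather than part of the original workflow: it fakes the raw bytes with charToRaw() and a made-up one-person payload (json_text, raw_body and parsed are placeholder names), then applies the same rawToChar() and fromJSON() calls.

```r
library(jsonlite)

# A made-up payload with the same shape as the People in Space response.
json_text <- '{"people": [{"name": "Jane Doe", "craft": "ISS"}], "number": 1, "message": "success"}'

# Simulate what a response would hold in res$content: a raw vector of bytes.
raw_body <- charToRaw(json_text)

# Step 1: bytes to character. Step 2: JSON text to an R list.
parsed <- fromJSON(rawToChar(raw_body))

names(parsed)         # "people" "number" "message"
class(parsed$people)  # the array of objects is simplified to a data frame
```

httr also offers content(res, as = "text", encoding = "UTF-8") as an alternative to rawToChar(res$content); either string can be passed to fromJSON().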
In our data variable, the data set that we’re interested in looking at is contained in the people data frame. We can use the $ operator to directly look at this data frame:
>> data$people
                 name craft
1      Christina Koch   ISS
2 Alexander Skvortsov   ISS
3      Luca Parmitano   ISS
4       Andrew Morgan   ISS
5     Oleg Skripochka   ISS
6        Jessica Meir   ISS
So, there’s our answer: at the time of writing this blog post, there are 6 people in space. But if you try it for yourself, you’ll likely get different names and a different number. That’s one of the advantages of APIs — unlike downloadable data sets, they’re generally updated in real-time or near-real-time, so they’re a great way to get access to very current data.
A data frame, like the one we see above, is how we would typically store structured data for further analysis in the tidyverse libraries that we learn in the Dataquest curriculum. (You can learn more about this in our Data Analyst in R Path if interested.)
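As a small illustration of that next step, the sketch below rebuilds the People in Space data frame by hand, so it runs offline, and asks two basic questions of it using base R. The variable name people is a placeholder; with a live response it would be data$people.

```r
# Rebuild the people data frame from the console output in this tutorial.
people <- data.frame(
  name = c("Christina Koch", "Alexander Skvortsov", "Luca Parmitano",
           "Andrew Morgan", "Oleg Skripochka", "Jessica Meir"),
  craft = rep("ISS", 6),
  stringsAsFactors = FALSE
)

nrow(people)         # how many people are in space
table(people$craft)  # how many people are aboard each craft
```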
Above, we’ve walked through a very straightforward API workflow. Most APIs will require you to follow this same general pattern, but they can be more complex.
The API URL we used above just required us to make a request from it without any extra details. Some APIs require more information from the user. In the last part of this tutorial, we will cover how to provide additional information to the API with your request.
APIs and Query Parameters
What if we wanted to know when the ISS was going to pass over a given location on earth? Unlike the People in Space API, Open Notify’s ISS Pass Times API requires us to provide additional parameters before it can return the data we want.
Specifically, we’ll need to specify the latitude and longitude of the location we’re asking about as part of our GET() request. Once a latitude and longitude are specified, they are combined with the original URL as query parameters.
Let’s use this API to find out when the ISS will be passing over the Brooklyn Bridge (which is at roughly latitude 40.7, longitude -74):
res = GET("http://api.open-notify.org/iss-pass.json",
    query = list(lat = 40.7, lon = -74))
# Checking the URL that gets used in the API request yields
# http://api.open-notify.org/iss-pass.json?lat=40.7&lon=-74
The different elements of the list that we pass into the query argument in GET() are formatted correctly into the URL, so you don’t have to do this yourself.
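One way to preview this formatting without sending a request is httr's modify_url() function, which builds a URL from a base address and a query list. This is a sketch for illustration; the variable name pass_url is a placeholder.

```r
library(httr)

# Build the request URL from the base address and the query list,
# without contacting the server.
pass_url <- modify_url(
  "http://api.open-notify.org/iss-pass.json",
  query = list(lat = 40.7, lon = -74)
)

# Print the assembled URL, including the ?lat=...&lon=... query string.
pass_url
```

After a real request, the URL that was actually used is also stored in the response, in res$url.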
You’ll need to check the documentation for the API that you’re working with to see if there are any required query parameters. Here is the API documentation for the ISS Pass Times.
The vast majority of APIs you may want to access will have documentation that you can (and should) read to get a clear understanding of what parameters your request requires. One common parameter required by many APIs, although not the ones we’re using for this tutorial, is an API key or some other form of authentication. Check the API’s documentation to learn how to get an API key and how to insert it into your request.
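How a key is supplied varies from API to API, so the sketch below is only one common pattern, and every name in it is hypothetical: the api.example.com address, the api_key parameter name, and the MY_API_KEY environment variable are placeholders, not part of Open Notify or any real service. It reads the key from an environment variable so the key does not have to be written into the script, then adds it to the query list.

```r
library(httr)

# Hypothetical setup: in practice the key would be set outside the script
# (for example in .Renviron), not with Sys.setenv() in the code itself.
Sys.setenv(MY_API_KEY = "demo-key-123")
api_key <- Sys.getenv("MY_API_KEY")

# Hypothetical endpoint and parameter names; check the API's documentation
# for the real ones.
keyed_url <- modify_url(
  "https://api.example.com/v1/data",
  query = list(city = "Brooklyn", api_key = api_key)
)

keyed_url
```

Some APIs expect the key in a request header instead of the URL. In httr, headers can be attached by passing add_headers(...) to GET(), with the header name taken from the API's documentation.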
Anyway, now that we’ve made our request, including the location parameters, we can check out the response using the same functions we used earlier. Let’s extract the data from the response:
>> res = GET("http://api.open-notify.org/iss-pass.json",
query = list(lat = 40.7, lon = -74))
>> data = fromJSON(rawToChar(res$content))
>> data$response
  duration   risetime
1      623 1580439398
2      101 1580445412
3      541 1580493826
4      658 1580499550
5      601 1580505413
This API returns times to us in the form of Unix time. Unix time is the number of seconds that have passed since midnight UTC on January 1st, 1970. There are many functions that enable easy conversion from Unix time to a familiar time format. We won’t cover working with dates and times in depth in this tutorial, but that is covered in our Data Analyst in R interactive learning path.
We can also quickly convert Unix time to a more readable format on websites like this.
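For a quick conversion inside R itself, base R's as.POSIXct() accepts a number of seconds together with an origin. The sketch below is an illustration using the rise times from the output above; the variable names risetime and rise_utc are placeholders.

```r
# Rise times copied from the response shown earlier
# (seconds since 1970-01-01 00:00:00 UTC).
risetime <- c(1580439398, 1580445412, 1580493826, 1580499550, 1580505413)

# Convert to date-time values; tz = "UTC" keeps the result independent
# of the computer's local time zone.
rise_utc <- as.POSIXct(risetime, origin = "1970-01-01", tz = "UTC")

format(rise_utc[1], "%Y-%m-%d %H:%M:%S")  # "2020-01-31 02:56:38"
```

Passing a different tz value, such as "America/New_York", would display the same instants in local time for the Brooklyn Bridge.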
You’ve Got the Basics of APIs in R!
In this tutorial, we learned what an API is, and how APIs can be useful to data analysts and data scientists.
Using our R programming skills and the httr and jsonlite libraries, we took data from an API and converted it into a familiar format for analysis.
We’ve just scratched the surface with working with APIs here, but hopefully this introduction has given you the confidence to look into some more complex and powerful APIs, and helped unlock a whole new world of data out there for you to explore!