Home Data DIY Loading and Cleaning Data with R and the tidyverse

Loading and Cleaning Data with R and the tidyverse

Audio version of the article

Messy datasets are everywhere. If you want to analyze data, it’s inevitable that you will need to clean data. In this tutorial, we’re going to take a look at how to do that using R and some nifty tidyverse tools.

The tidyverse tools provide powerful methods to diagnose and clean messy datasets in R. While there’s far more we can do with the tidyverse, in this tutorial we’ll focus on learning how to:

  • Import comma-separated values (CSV) and Microsoft Excel flat files into R
  • Combine data frames
  • Clean up column names
  • And more!

The tidyverse is a collection of R packages designed for working with data. The tidyverse packages share a common design philosophy, grammar, and data structures. Tidyverse packages “play well together”. The tidyverse enables you to spend less time cleaning data so that you can focus more on analyzing, visualizing, and modeling data.

1. Characteristics of Clean Data and Messy Data

What exactly is clean data? Clean data is accurate, complete, and in a format that is ready to analyze. Characteristics of clean data include data that are:

  • Free of duplicate rows/values
  • Error-free (e.g. free of misspellings)
  • Relevant (e.g. free of special characters)
  • The appropriate data type for analysis
  • Free of outliers (or only contain outliers have been identified/understood), and
  • Follows a “tidy data” structure

Common symptoms of messy data include data that contain:

  • Special characters (e.g. commas in numeric values)
  • Numeric values stored as text/character data types
  • Duplicate rows
  • Misspellings
  • Inaccuracies
  • White space
  • Missing data
  • Zeros instead of null values

2. Motivation

In this blog post, we will work with five property-sales datasets that are publicly available on the New York City Department of Finance Rolling Sales Data website. We encourage you to download the datasets and follow along! Each file contains one year of real estate sales data for one of New York City’s five boroughs. We will work with the following Microsoft Excel files:

  • rollingsales_bronx.xls
  • rollingsales_brooklyn.xls
  • rollingsales_manhattan.xls
  • rollingsales_queens.xls
  • rollingsales_statenisland.xls

As we work through this blog post, imagine that you are helping a friend launch their home-inspection business in New York City. You offer to help them by analyzing the data to better understand the real-estate market. But you realize that before you can analyze the data in R, you will need to diagnose and clean it first. And before you can diagnose the data, you will need to load it into R!

3. Load Data into R with readxl

Benefits of using tidyverse tools are often evident in the data-loading process. In many cases, the tidyverse package readxl will clean some data for you as Microsoft Excel data is loaded into R. If you are working with CSV data, the tidyverse readr package function read_csv() is the function to use (we’ll cover that later).

Let’s look at an example. Here’s how the Excel file for the Brooklyn borough looks:

Loading and Cleaning Data with R and the tidyverse
The Brooklyn Excel file

Now let’s load the Brooklyn dataset into R from an Excel file. We’ll use the readxl package. We specify the function argument skip = 4 because the row that we want to use as the header (i.e. column names) is actually row 5. We can ignore the first four rows entirely and load the data into R beginning at row 5. Here’s the code:

library(readxl) # Load Excel files
brooklyn <- read_excel("rollingsales_brooklyn.xls", skip = 4)

Note we saved this dataset with the variable name brooklyn for future use.

4. View the Data with tidyr::glimpse()

The tidyverse offers a user-friendly way to view this data with the glimpse() function that is part of the tibble package. To use this package, we will need to load it for use in our current session. But rather than loading this package alone, we can load many of the tidyverse packages at one time. If you do not have the tidyverse collection of packages, install it on your machine using the following command in your R or R Studio session:

install.packages("tidyverse")

Once the package is installed, load it to memory:

library(tidyverse)

Now that tidyverse is loaded into memory, take a “glimpse” of the Brooklyn dataset:

glimpse(brooklyn)
## Observations: 20,185
## Variables: 21
## $ BOROUGH <chr> "3", "3", "3", "3", "3", "3", "…
## $ NEIGHBORHOOD <chr> "BATH BEACH", "BATH BEACH", "BA…
## $ `BUILDING CLASS CATEGORY` <chr> "01 ONE FAMILY DWELLINGS", "01 …
## $ `TAX CLASS AT PRESENT` <chr> "1", "1", "1", "1", "1", "1", "…
## $ BLOCK <dbl> 6359, 6360, 6364, 6367, 6371, 6…
## $ LOT <dbl> 70, 48, 74, 24, 19, 32, 65, 20,…
## $ `EASE-MENT` <lgl> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ `BUILDING CLASS AT PRESENT` <chr> "S1", "A5", "A5", "A9", "A9", "…
## $ ADDRESS <chr> "8684 15TH AVENUE", "14 BAY 10T…
## $ `APARTMENT NUMBER` <chr> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ `ZIP CODE` <dbl> 11228, 11228, 11214, 11214, 112…
## $ `RESIDENTIAL UNITS` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1…
## $ `COMMERCIAL UNITS` <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `TOTAL UNITS` <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1…
## $ `LAND SQUARE FEET` <dbl> 1933, 2513, 2492, 1571, 2320, 3…
## $ `GROSS SQUARE FEET` <dbl> 4080, 1428, 972, 1456, 1566, 22…
## $ `YEAR BUILT` <dbl> 1930, 1930, 1950, 1935, 1930, 1…
## $ `TAX CLASS AT TIME OF SALE` <chr> "1", "1", "1", "1", "1", "1", "…
## $ `BUILDING CLASS AT TIME OF SALE` <chr> "S1", "A5", "A5", "A9", "A9", "…
## $ `SALE PRICE` <dbl> 1300000, 849000, 0, 830000, 0, …
## $ `SALE DATE` <dttm> 2020-04-28, 2020-03-18, 2019-0…

The glimpse() function provides a user-friendly way to view the column names and data types for all columns, or variables, in the data frame. With this function, we are also able to view the first few observations in the data frame. This data frame has 20,185 observations, or property sales records. And there are 21 variables, or columns.

5. Data Types

Looking at the data types for each column, we see that, in general, the data is stored in a format that is ready to use! For example:

  • NEIGHBORHOOD is “character” type, also known as a string.
  • GROSS SQUARE FEET (i.e. the size of the property) is of type “double”, which part of the “numeric” class in R.
  • SALE PRICE is also numeric.
  • SALE DATE is stored in a format that represents calendar dates and times.

So why does this matter? Because GROSS SQUARE FEET and SALE PRICE are numeric, we can perform arithmetic operations on the data right away. For example, we can calculate the average sale price for all properties:

mean(brooklyn$`SALE PRICE`)
## [1] 1098644

6. Ready to Plot!

It’s useful that SALE DATE is stored in a format that represents calendar dates and times because this enables us to use a single line of code to make a histogram of property sales by date:

qplot(`SALE DATE`, data = brooklyn)

Loading and Cleaning Data with R and the tidyverse
Property Sales Declined Sharply in April, 2020

Notice the dramatic drop in property sales in April, 2020. Might this related to the COVID-19 pandemic? As you can see, with only a few lines of code, we can begin to explore our data and ask some interesting questions!

Note that the qplot() function used to make the histogram is from the ggplot2 package, which is a core tidyverse package.

7. Comparing to read.csv()

How much has the readxl() function helped us out? Let’s compare this to the read.csv() function that is built into R. To do this, we downloaded the original Excel file, opened it in the Numbers program on a Mac, and converted the file to CSV. Sure this workflow is not ideal, but it’s not uncommon for analysts to prefer reading in tabular data as a CSV.

Here’s what we see when load the same data in CSV format with read.csv():

brooklyn_csv <- read.csv("rollingsales_brooklyn.csv", skip = 4)
glimpse(brooklyn_csv)
## Observations: 20,185
## Variables: 21
## $ BOROUGH <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ NEIGHBORHOOD <fct> BATH BEACH, BATH BEACH, BATH BEAC…
## $ BUILDING.CLASS.CATEGORY <fct> 01 ONE FAMILY DWELLINGS, 01 ONE F…
## $ TAX.CLASS.AT.PRESENT <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ BLOCK <int> 6359, 6360, 6364, 6367, 6371, 637…
## $ LOT <int> 70, 48, 74, 24, 19, 32, 65, 20, 1…
## $ EASE.MENT <lgl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ BUILDING.CLASS.AT.PRESENT <fct> S1, A5, A5, A9, A9, A9, A2, A5, B…
## $ ADDRESS <fct> 8684 15TH AVENUE, 14 BAY 10TH STR…
## $ APARTMENT.NUMBER <fct> , , , , , , , , , , , , , , , , ,…
## $ ZIP.CODE <int> 11228, 11228, 11214, 11214, 11214…
## $ RESIDENTIAL.UNITS <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, …
## $ COMMERCIAL.UNITS <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ TOTAL.UNITS <int> 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, …
## $ LAND.SQUARE.FEET <fct> "1,933", "2,513", "2,492", "1,571…
## $ GROSS.SQUARE.FEET <fct> "4,080", "1,428", "972", "1,456",…
## $ YEAR.BUILT <int> 1930, 1930, 1950, 1935, 1930, 189…
## $ TAX.CLASS.AT.TIME.OF.SALE <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ BUILDING.CLASS.AT.TIME.OF.SALE <fct> S1, A5, A5, A9, A9, A9, A2, A5, A…
## $ SALE.PRICE <fct> "1,300,000", "849,000", "0", "830…
## $ SALE.DATE <fct> 4/28/20, 3/18/20, 8/15/19, 6/26/2…

This data is messier! And here’s how:

  • Character (string) data, such as ADDRESS, has been stored as class “factor”. Think of factors as categories or buckets.
  • GROSS.SQUARE.FEET and SALE.PRICE are also stored as factors. We can’t perform arithmetic operations, like calculating the mean, on a factor!
  • SALE.DATE is not stored in a format that represents calendar dates and times. So we can’t build the histogram we saw above. (We can make a histogram, but it’s messy, and it makes no sense).
  • The GROSS.SQUARE.FEET and SALE.PRICE columns contain a special character, the comma (,).

But if we load the same dataset with the read_csv() function from the readr package, which is part of the tidyverse, we see results similar to our original approach with readxl():

brooklyn_csv <- read_csv("rollingsales_brooklyn.csv", skip = 4)
glimpse(brooklyn_csv)
## Observations: 20,185
## Variables: 21
## $ BOROUGH <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ NEIGHBORHOOD <chr> "BATH BEACH", "BATH BEACH", "BA…
## $ `BUILDING CLASS CATEGORY` <chr> "01 ONE FAMILY DWELLINGS", "01 …
## $ `TAX CLASS AT PRESENT` <chr> "1", "1", "1", "1", "1", "1", "…
## $ BLOCK <dbl> 6359, 6360, 6364, 6367, 6371, 6…
## $ LOT <dbl> 70, 48, 74, 24, 19, 32, 65, 20,…
## $ `EASE-MENT` <lgl> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ `BUILDING CLASS AT PRESENT` <chr> "S1", "A5", "A5", "A9", "A9", "…
## $ ADDRESS <chr> "8684 15TH AVENUE", "14 BAY 10T…
## $ `APARTMENT NUMBER` <chr> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ `ZIP CODE` <dbl> 11228, 11228, 11214, 11214, 112…
## $ `RESIDENTIAL UNITS` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1…
## $ `COMMERCIAL UNITS` <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `TOTAL UNITS` <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1…
## $ `LAND SQUARE FEET` <dbl> 1933, 2513, 2492, 1571, 2320, 3…
## $ `GROSS SQUARE FEET` <dbl> 4080, 1428, 972, 1456, 1566, 22…
## $ `YEAR BUILT` <dbl> 1930, 1930, 1950, 1935, 1930, 1…
## $ `TAX CLASS AT TIME OF SALE` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ `BUILDING CLASS AT TIME OF SALE` <chr> "S1", "A5", "A5", "A9", "A9", "…
## $ `SALE PRICE` <dbl> 1300000, 849000, 0, 830000, 0, …
## $ `SALE DATE` <chr> "4/28/20", "3/18/20", "8/15/19"…

To summarize, key differences of loading the data into R with readxl() or read_csv() are that none of the variables have been coerced to the factor data type. Instead. Many of the variables were loaded as character, or string data types.

Also, notice that the special characters are missing from the Sale Price column and the column has been loaded as a double or numeric data type. This means that we can perform calculations relating to sale price right away, without taking extra step to convert the column to numeric!

8. Combining Datasets

If we want to perform data analysis for all five boroughs in New York City it will be helpful to combine the datasets. Also, if the data needs any additional cleaning, it would be great to have to clean the data in only one place instead of five! We’ve verified that the column names are the same across each of the five Excel files. So we can combine the dataframes with the bind_rows() function from the dplyr package (another tidyverse package!):

brooklyn <- read_excel("rollingsales_brooklyn.xls", skip = 4)
bronx <- read_excel("rollingsales_bronx.xls", skip = 4)
manhattan <- read_excel("rollingsales_manhattan.xls", skip = 4)
staten_island <- read_excel("rollingsales_statenisland.xls", skip = 4)
queens <- read_excel("rollingsales_queens.xls", skip = 4)

# Bind all dataframes into one, save as "NYC_property_sales"
NYC_property_sales <- bind_rows(brooklyn, bronx, manhattan, staten_island, queens)

glimpse(NYC_property_sales)
## Observations: 70,870
## Variables: 21
## $ BOROUGH <chr> "3", "3", "3", "3", "3", "3", "…
## $ NEIGHBORHOOD <chr> "BATH BEACH", "BATH BEACH", "BA…
## $ `BUILDING CLASS CATEGORY` <chr> "01 ONE FAMILY DWELLINGS", "01 …
## $ `TAX CLASS AT PRESENT` <chr> "1", "1", "1", "1", "1", "1", "…
## $ BLOCK <dbl> 6359, 6360, 6364, 6367, 6371, 6…
## $ LOT <dbl> 70, 48, 74, 24, 19, 32, 65, 20,…
## $ `EASE-MENT` <lgl> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ `BUILDING CLASS AT PRESENT` <chr> "S1", "A5", "A5", "A9", "A9", "…
## $ ADDRESS <chr> "8684 15TH AVENUE", "14 BAY 10T…
## $ `APARTMENT NUMBER` <chr> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ `ZIP CODE` <dbl> 11228, 11228, 11214, 11214, 112…
## $ `RESIDENTIAL UNITS` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1…
## $ `COMMERCIAL UNITS` <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `TOTAL UNITS` <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1…
## $ `LAND SQUARE FEET` <dbl> 1933, 2513, 2492, 1571, 2320, 3…
## $ `GROSS SQUARE FEET` <dbl> 4080, 1428, 972, 1456, 1566, 22…
## $ `YEAR BUILT` <dbl> 1930, 1930, 1950, 1935, 1930, 1…
## $ `TAX CLASS AT TIME OF SALE` <chr> "1", "1", "1", "1", "1", "1", "…
## $ `BUILDING CLASS AT TIME OF SALE` <chr> "S1", "A5", "A5", "A9", "A9", "…
## $ `SALE PRICE` <dbl> 1300000, 849000, 0, 830000, 0, …
## $ `SALE DATE` <dttm> 2020-04-28, 2020-03-18, 2019-0…

This NYC_property_sales dataframe also contains 21 variables, like the brooklyn dataframe. This is a good because it confirms that all five datasets have the exact same column names, so we are able to combine them without any corrections! The bind_rows() function essentially stacked the five dataframes on top of each other to form one.

If we combined these dataframes and ended up with more columns than we had in the brooklyn dataframe, it could indicate a problem such as an erroneous column name in one of the datasets. But that did not happen here, so we can move on to cleaning up column names.

9. Clean Up Column Names with magrittr Magic!

Now it’s time for one of our favorite data cleaning tricks here at Dataquest!

The column names contain spaces, which can be harder to work with in the tidyverse. Also, the column names contain upper-case letters. We don’t want to worry about spaces or remembering to capitalize variable names during our analysis! Let’s quickly clean up the column names using a handy method from the magrittr package. Start by loading the package into memory. Install the package if you need to. The ‘magrittr` package is used in the tidyverse, but we need to load it explicitly to access one of it’s built-in functions.

library(magrittr)

We are going to use the “assignment pipe” function from the magrittr package to efficiently update all variable names. Pipes are powerful tools that allow R users to chain together multiple operations at once. Pipes also make R code more readable and easier to comprehend. Pipes are used extensively when working with tidyverse tools.

Let’s combine the assignment pipe operator with the str_replace_all() function from the tidyverse stringr package to replace all spaces with underscores. The code for the NYC_property_sales dataframe looks like this:

colnames(NYC_property_sales) %<>% str_replace_all("\\s", "_") %<>% tolower()

So what’s going on here?! Think of the %<>% to mean “and then update”. Let’s put this into context. The above line of code essentially means:

Take the column names from the NYC_property_sales data frame, and then update all column names to replace all spaces with underscores, and then update all column names to lower case.

That’s a long sentence! But that demonstrates the value of the pipe operator to chain together multiple commands. Let’s take a look at the updated column names:

colnames(NYC_property_sales)
## [1] "borough" "neighborhood" 
## [3] "building_class_category" "tax_class_at_present" 
## [5] "block" "lot" 
## [7] "ease-ment" "building_class_at_present" 
## [9] "address" "apartment_number" 
## [11] "zip_code" "residential_units" 
## [13] "commercial_units" "total_units" 
## [15] "land_square_feet" "gross_square_feet" 
## [17] "year_built" "tax_class_at_time_of_sale" 
## [19] "building_class_at_time_of_sale" "sale_price" 
## [21] "sale_date"

That looks better!

10. The Pipe Operation

Typically when working with tidyverse tools, we’ll work with the single-pipe (%>%) from magrittr. The pipe is a way to link together multiple commands. Recall from above how we can think of %<>% as “and then update”? Well, the single pipe can be thought of simply as “and then”. Here’s an example using the commands we’ve learned so far:

NYC_property_sales %>% glimpse()

This roughly translates into:

Let’s grab the NYC_property_sales dataframe and then take a glimpse at the data.

Recap

Wow, check out everything we covered in this blog post:

  • Load Microsoft Excel flat files into R with the read_excel() function from the readxl package
  • Import CSV files with the read_csv() function from the readr package
  • View dataframe characteristics with the glimpse() function from the tibble package
  • Generate a histogram with the qplot() function from the ggplot2 package
  • Combine data frames with the bind_rows() function from the dplyr package
  • Clean up column names using functions from the magrittr package and the stringr package
  • Link commands together with the single-pipe (%>%) from magrittr

As you can see, tidyverse packages are very powerful tools for loading, cleaning, and inspecting data so that you can begin analyzing your data right away! And remember, you can load all of these packages at once with library(tidyverse).

This article has been published from the source link without modifications to the text. Only the headline has been changed.

Source link

- Advertisment - Loading and Cleaning Data with R and the tidyverseLoading and Cleaning Data with R and the tidyverse

Most Popular

Meaning of ‘trustless’ in blockchain

Blockchain allegedly eliminates the need for trust — but how far is this notion actually true or helpful? Crypto community members will likely be familiar...

Humanizing of Machine Learning Technology

Digital engagement is becoming the new normal, but it can inhibit the individual connection that is foundational to building long-term customer relationships. How can...

Top AI governmemt projects in the world

The role of different governments of the world in the field of AI Government enthusiasm for AI has picked up recently, and numerous government authorities...

Using machine learning to translate lost languages

System developed at MIT CSAIL aims to help linguists decipher languages that have been lost to history. Recent research suggests that most languages that have...

Python needs to develop further

Open-source programming language Python has become one of the few languages that won't disappear anytime soon. It's the top or one of the top...

Feature Stores for MLOps with Mike del Balso

Mike, who you might remember from our last conversation on the podcast, was a foundational member of the Uber team that created their ML...
- Advertisment - Loading and Cleaning Data with R and the tidyverseLoading and Cleaning Data with R and the tidyverse