Messy datasets are everywhere. If you want to analyze data, it’s inevitable that you will need to clean data. In this tutorial, we’re going to take a look at how to do that using R and some nifty tidyverse tools.
1. Characteristics of Clean Data and Messy Data
What exactly is clean data? Clean data is accurate, complete, and in a format that is ready to analyze. Characteristics of clean data include data that are:
- Free of duplicate rows/values
- Error-free (e.g. free of misspellings)
- Relevant (e.g. free of special characters)
- The appropriate data type for analysis
- Free of outliers (or only contain outliers have been identified/understood), and
- Follows a “tidy data” structure
Common symptoms of messy data include data that contain:
- Special characters (e.g. commas in numeric values)
- Numeric values stored as text/character data types
- Duplicate rows
- Misspellings
- Inaccuracies
- White space
- Missing data
- Zeros instead of null values
2. Motivation
In this blog post, we will work with five property-sales datasets that are publicly available on the New York City Department of Finance Rolling Sales Data website. We encourage you to download the datasets and follow along! Each file contains one year of real estate sales data for one of New York City’s five boroughs. We will work with the following Microsoft Excel files:
- rollingsales_bronx.xls
- rollingsales_brooklyn.xls
- rollingsales_manhattan.xls
- rollingsales_queens.xls
- rollingsales_statenisland.xls
As we work through this blog post, imagine that you are helping a friend launch their home-inspection business in New York City. You offer to help them by analyzing the data to better understand the real-estate market. But you realize that before you can analyze the data in R, you will need to diagnose and clean it first. And before you can diagnose the data, you will need to load it into R!
3. Load Data into R with readxl
Benefits of using tidyverse tools are often evident in the data-loading process. In many cases, the tidyverse package readxl
will clean some data for you as Microsoft Excel data is loaded into R. If you are working with CSV data, the tidyverse readr
package function read_csv()
is the function to use (we’ll cover that later).
Let’s look at an example. Here’s how the Excel file for the Brooklyn borough looks:
Now let’s load the Brooklyn dataset into R from an Excel file. We’ll use the readxl
package. We specify the function argument skip = 4
because the row that we want to use as the header (i.e. column names) is actually row 5. We can ignore the first four rows entirely and load the data into R beginning at row 5. Here’s the code:
library(readxl) # Load Excel files
brooklyn <- read_excel("rollingsales_brooklyn.xls", skip = 4)
Note we saved this dataset with the variable name brooklyn
for future use.
4. View the Data with tidyr::glimpse()
The tidyverse offers a user-friendly way to view this data with the glimpse()
function that is part of the tibble
package. To use this package, we will need to load it for use in our current session. But rather than loading this package alone, we can load many of the tidyverse packages at one time. If you do not have the tidyverse collection of packages, install it on your machine using the following command in your R or R Studio session:
install.packages("tidyverse")
Once the package is installed, load it to memory:
library(tidyverse)
Now that tidyverse
is loaded into memory, take a “glimpse” of the Brooklyn dataset:
glimpse(brooklyn)
## Observations: 20,185
## Variables: 21
## $ BOROUGH <chr> "3", "3", "3", "3", "3", "3", "…
## $ NEIGHBORHOOD <chr> "BATH BEACH", "BATH BEACH", "BA…
## $ `BUILDING CLASS CATEGORY` <chr> "01 ONE FAMILY DWELLINGS", "01 …
## $ `TAX CLASS AT PRESENT` <chr> "1", "1", "1", "1", "1", "1", "…
## $ BLOCK <dbl> 6359, 6360, 6364, 6367, 6371, 6…
## $ LOT <dbl> 70, 48, 74, 24, 19, 32, 65, 20,…
## $ `EASE-MENT` <lgl> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ `BUILDING CLASS AT PRESENT` <chr> "S1", "A5", "A5", "A9", "A9", "…
## $ ADDRESS <chr> "8684 15TH AVENUE", "14 BAY 10T…
## $ `APARTMENT NUMBER` <chr> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ `ZIP CODE` <dbl> 11228, 11228, 11214, 11214, 112…
## $ `RESIDENTIAL UNITS` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1…
## $ `COMMERCIAL UNITS` <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `TOTAL UNITS` <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1…
## $ `LAND SQUARE FEET` <dbl> 1933, 2513, 2492, 1571, 2320, 3…
## $ `GROSS SQUARE FEET` <dbl> 4080, 1428, 972, 1456, 1566, 22…
## $ `YEAR BUILT` <dbl> 1930, 1930, 1950, 1935, 1930, 1…
## $ `TAX CLASS AT TIME OF SALE` <chr> "1", "1", "1", "1", "1", "1", "…
## $ `BUILDING CLASS AT TIME OF SALE` <chr> "S1", "A5", "A5", "A9", "A9", "…
## $ `SALE PRICE` <dbl> 1300000, 849000, 0, 830000, 0, …
## $ `SALE DATE` <dttm> 2020-04-28, 2020-03-18, 2019-0…
The glimpse()
function provides a user-friendly way to view the column names and data types for all columns, or variables, in the data frame. With this function, we are also able to view the first few observations in the data frame. This data frame has 20,185 observations, or property sales records. And there are 21 variables, or columns.
5. Data Types
Looking at the data types for each column, we see that, in general, the data is stored in a format that is ready to use! For example:
NEIGHBORHOOD
is “character” type, also known as a string.GROSS SQUARE FEET
(i.e. the size of the property) is of type “double”, which part of the “numeric” class in R.SALE PRICE
is also numeric.SALE DATE
is stored in a format that represents calendar dates and times.
So why does this matter? Because GROSS SQUARE FEET
and SALE PRICE
are numeric, we can perform arithmetic operations on the data right away. For example, we can calculate the average sale price for all properties:
mean(brooklyn$`SALE PRICE`)
## [1] 1098644
6. Ready to Plot!
It’s useful that SALE DATE
is stored in a format that represents calendar dates and times because this enables us to use a single line of code to make a histogram of property sales by date:
qplot(`SALE DATE`, data = brooklyn)
Notice the dramatic drop in property sales in April, 2020. Might this related to the COVID-19 pandemic? As you can see, with only a few lines of code, we can begin to explore our data and ask some interesting questions!
Note that the qplot()
function used to make the histogram is from the ggplot2 package, which is a core tidyverse
package.
7. Comparing to read.csv()
How much has the readxl()
function helped us out? Let’s compare this to the read.csv()
function that is built into R. To do this, we downloaded the original Excel file, opened it in the Numbers program on a Mac, and converted the file to CSV. Sure this workflow is not ideal, but it’s not uncommon for analysts to prefer reading in tabular data as a CSV.
Here’s what we see when load the same data in CSV format with read.csv()
:
brooklyn_csv <- read.csv("rollingsales_brooklyn.csv", skip = 4)
glimpse(brooklyn_csv)
## Observations: 20,185
## Variables: 21
## $ BOROUGH <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ NEIGHBORHOOD <fct> BATH BEACH, BATH BEACH, BATH BEAC…
## $ BUILDING.CLASS.CATEGORY <fct> 01 ONE FAMILY DWELLINGS, 01 ONE F…
## $ TAX.CLASS.AT.PRESENT <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ BLOCK <int> 6359, 6360, 6364, 6367, 6371, 637…
## $ LOT <int> 70, 48, 74, 24, 19, 32, 65, 20, 1…
## $ EASE.MENT <lgl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ BUILDING.CLASS.AT.PRESENT <fct> S1, A5, A5, A9, A9, A9, A2, A5, B…
## $ ADDRESS <fct> 8684 15TH AVENUE, 14 BAY 10TH STR…
## $ APARTMENT.NUMBER <fct> , , , , , , , , , , , , , , , , ,…
## $ ZIP.CODE <int> 11228, 11228, 11214, 11214, 11214…
## $ RESIDENTIAL.UNITS <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, …
## $ COMMERCIAL.UNITS <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ TOTAL.UNITS <int> 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, …
## $ LAND.SQUARE.FEET <fct> "1,933", "2,513", "2,492", "1,571…
## $ GROSS.SQUARE.FEET <fct> "4,080", "1,428", "972", "1,456",…
## $ YEAR.BUILT <int> 1930, 1930, 1950, 1935, 1930, 189…
## $ TAX.CLASS.AT.TIME.OF.SALE <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ BUILDING.CLASS.AT.TIME.OF.SALE <fct> S1, A5, A5, A9, A9, A9, A2, A5, A…
## $ SALE.PRICE <fct> "1,300,000", "849,000", "0", "830…
## $ SALE.DATE <fct> 4/28/20, 3/18/20, 8/15/19, 6/26/2…
This data is messier! And here’s how:
- Character (string) data, such as
ADDRESS
, has been stored as class “factor”. Think of factors as categories or buckets. GROSS.SQUARE.FEET
andSALE.PRICE
are also stored as factors. We can’t perform arithmetic operations, like calculating the mean, on a factor!SALE.DATE
is not stored in a format that represents calendar dates and times. So we can’t build the histogram we saw above. (We can make a histogram, but it’s messy, and it makes no sense).- The
GROSS.SQUARE.FEET
andSALE.PRICE
columns contain a special character, the comma (,
).
But if we load the same dataset with the read_csv()
function from the readr
package, which is part of the tidyverse, we see results similar to our original approach with readxl()
:
brooklyn_csv <- read_csv("rollingsales_brooklyn.csv", skip = 4)
glimpse(brooklyn_csv)
## Observations: 20,185
## Variables: 21
## $ BOROUGH <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ NEIGHBORHOOD <chr> "BATH BEACH", "BATH BEACH", "BA…
## $ `BUILDING CLASS CATEGORY` <chr> "01 ONE FAMILY DWELLINGS", "01 …
## $ `TAX CLASS AT PRESENT` <chr> "1", "1", "1", "1", "1", "1", "…
## $ BLOCK <dbl> 6359, 6360, 6364, 6367, 6371, 6…
## $ LOT <dbl> 70, 48, 74, 24, 19, 32, 65, 20,…
## $ `EASE-MENT` <lgl> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ `BUILDING CLASS AT PRESENT` <chr> "S1", "A5", "A5", "A9", "A9", "…
## $ ADDRESS <chr> "8684 15TH AVENUE", "14 BAY 10T…
## $ `APARTMENT NUMBER` <chr> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ `ZIP CODE` <dbl> 11228, 11228, 11214, 11214, 112…
## $ `RESIDENTIAL UNITS` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1…
## $ `COMMERCIAL UNITS` <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `TOTAL UNITS` <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1…
## $ `LAND SQUARE FEET` <dbl> 1933, 2513, 2492, 1571, 2320, 3…
## $ `GROSS SQUARE FEET` <dbl> 4080, 1428, 972, 1456, 1566, 22…
## $ `YEAR BUILT` <dbl> 1930, 1930, 1950, 1935, 1930, 1…
## $ `TAX CLASS AT TIME OF SALE` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ `BUILDING CLASS AT TIME OF SALE` <chr> "S1", "A5", "A5", "A9", "A9", "…
## $ `SALE PRICE` <dbl> 1300000, 849000, 0, 830000, 0, …
## $ `SALE DATE` <chr> "4/28/20", "3/18/20", "8/15/19"…
To summarize, key differences of loading the data into R with readxl()
or read_csv()
are that none of the variables have been coerced to the factor
data type. Instead. Many of the variables were loaded as character, or string data types.
Also, notice that the special characters are missing from the Sale Price
column and the column has been loaded as a double
or numeric data type. This means that we can perform calculations relating to sale price right away, without taking extra step to convert the column to numeric!
8. Combining Datasets
If we want to perform data analysis for all five boroughs in New York City it will be helpful to combine the datasets. Also, if the data needs any additional cleaning, it would be great to have to clean the data in only one place instead of five! We’ve verified that the column names are the same across each of the five Excel files. So we can combine the dataframes with the bind_rows()
function from the dplyr
package (another tidyverse package!):
brooklyn <- read_excel("rollingsales_brooklyn.xls", skip = 4)
bronx <- read_excel("rollingsales_bronx.xls", skip = 4)
manhattan <- read_excel("rollingsales_manhattan.xls", skip = 4)
staten_island <- read_excel("rollingsales_statenisland.xls", skip = 4)
queens <- read_excel("rollingsales_queens.xls", skip = 4)
# Bind all dataframes into one, save as "NYC_property_sales"
NYC_property_sales <- bind_rows(brooklyn, bronx, manhattan, staten_island, queens)
glimpse(NYC_property_sales)
## Observations: 70,870
## Variables: 21
## $ BOROUGH <chr> "3", "3", "3", "3", "3", "3", "…
## $ NEIGHBORHOOD <chr> "BATH BEACH", "BATH BEACH", "BA…
## $ `BUILDING CLASS CATEGORY` <chr> "01 ONE FAMILY DWELLINGS", "01 …
## $ `TAX CLASS AT PRESENT` <chr> "1", "1", "1", "1", "1", "1", "…
## $ BLOCK <dbl> 6359, 6360, 6364, 6367, 6371, 6…
## $ LOT <dbl> 70, 48, 74, 24, 19, 32, 65, 20,…
## $ `EASE-MENT` <lgl> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ `BUILDING CLASS AT PRESENT` <chr> "S1", "A5", "A5", "A9", "A9", "…
## $ ADDRESS <chr> "8684 15TH AVENUE", "14 BAY 10T…
## $ `APARTMENT NUMBER` <chr> NA, NA, NA, NA, NA, NA, NA, NA,…
## $ `ZIP CODE` <dbl> 11228, 11228, 11214, 11214, 112…
## $ `RESIDENTIAL UNITS` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1…
## $ `COMMERCIAL UNITS` <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `TOTAL UNITS` <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1…
## $ `LAND SQUARE FEET` <dbl> 1933, 2513, 2492, 1571, 2320, 3…
## $ `GROSS SQUARE FEET` <dbl> 4080, 1428, 972, 1456, 1566, 22…
## $ `YEAR BUILT` <dbl> 1930, 1930, 1950, 1935, 1930, 1…
## $ `TAX CLASS AT TIME OF SALE` <chr> "1", "1", "1", "1", "1", "1", "…
## $ `BUILDING CLASS AT TIME OF SALE` <chr> "S1", "A5", "A5", "A9", "A9", "…
## $ `SALE PRICE` <dbl> 1300000, 849000, 0, 830000, 0, …
## $ `SALE DATE` <dttm> 2020-04-28, 2020-03-18, 2019-0…
This NYC_property_sales
dataframe also contains 21 variables, like the brooklyn
dataframe. This is a good because it confirms that all five datasets have the exact same column names, so we are able to combine them without any corrections! The bind_rows()
function essentially stacked the five dataframes on top of each other to form one.
If we combined these dataframes and ended up with more columns than we had in the brooklyn
dataframe, it could indicate a problem such as an erroneous column name in one of the datasets. But that did not happen here, so we can move on to cleaning up column names.
9. Clean Up Column Names with magrittr Magic!
Now it’s time for one of our favorite data cleaning tricks here at Dataquest!
The column names contain spaces, which can be harder to work with in the tidyverse. Also, the column names contain upper-case letters. We don’t want to worry about spaces or remembering to capitalize variable names during our analysis! Let’s quickly clean up the column names using a handy method from the magrittr
package. Start by loading the package into memory. Install the package if you need to. The ‘magrittr` package is used in the tidyverse, but we need to load it explicitly to access one of it’s built-in functions.
library(magrittr)
We are going to use the “assignment pipe” function from the magrittr
package to efficiently update all variable names. Pipes are powerful tools that allow R users to chain together multiple operations at once. Pipes also make R code more readable and easier to comprehend. Pipes are used extensively when working with tidyverse tools.
Let’s combine the assignment pipe operator with the str_replace_all()
function from the tidyverse stringr
package to replace all spaces with underscores. The code for the NYC_property_sales
dataframe looks like this:
colnames(NYC_property_sales) %<>% str_replace_all("\\s", "_") %<>% tolower()
So what’s going on here?! Think of the %<>%
to mean “and then update”. Let’s put this into context. The above line of code essentially means:
Take the column names from the
NYC_property_sales
data frame, and then update all column names to replace all spaces with underscores, and then update all column names to lower case.
That’s a long sentence! But that demonstrates the value of the pipe operator to chain together multiple commands. Let’s take a look at the updated column names:
colnames(NYC_property_sales)
## [1] "borough" "neighborhood"
## [3] "building_class_category" "tax_class_at_present"
## [5] "block" "lot"
## [7] "ease-ment" "building_class_at_present"
## [9] "address" "apartment_number"
## [11] "zip_code" "residential_units"
## [13] "commercial_units" "total_units"
## [15] "land_square_feet" "gross_square_feet"
## [17] "year_built" "tax_class_at_time_of_sale"
## [19] "building_class_at_time_of_sale" "sale_price"
## [21] "sale_date"
That looks better!
10. The Pipe Operation
Typically when working with tidyverse tools, we’ll work with the single-pipe (%>%
) from magrittr
. The pipe is a way to link together multiple commands. Recall from above how we can think of %<>%
as “and then update”? Well, the single pipe can be thought of simply as “and then”. Here’s an example using the commands we’ve learned so far:
NYC_property_sales %>% glimpse()
This roughly translates into:
Let’s grab the
NYC_property_sales
dataframe and then take a glimpse at the data.
Recap
Wow, check out everything we covered in this blog post:
- Load Microsoft Excel flat files into R with the
read_excel()
function from thereadxl
package - Import CSV files with the
read_csv()
function from thereadr
package - View dataframe characteristics with the
glimpse()
function from thetibble
package - Generate a histogram with the
qplot()
function from theggplot2
package - Combine data frames with the
bind_rows()
function from thedplyr
package - Clean up column names using functions from the
magrittr
package and thestringr
package - Link commands together with the single-pipe (
%>%
) frommagrittr
As you can see, tidyverse packages are very powerful tools for loading, cleaning, and inspecting data so that you can begin analyzing your data right away! And remember, you can load all of these packages at once with library(tidyverse)
.
This article has been published from the source link without modifications to the text. Only the headline has been changed.
Source link