Home Artificial Intelligence Artificial Intelligence DIY Pandas Common Excel tasks shown

Pandas Common Excel tasks shown

January 13, 2020

Introduction

The purpose of this article is to show some common Excel tasks and how you would execute similar tasks in pandas. Some of the examples are somewhat trivial but I think it is important to show the simple as well as the more complex functions you can find elsewhere. As an added bonus, I’m going to do some fuzzy string matching to show a little twist to the process and show how pandas can utilize the full python system of modules to do something simply in python that would be complex in Excel.

Make sense? Let’s get started.

Adding a Sum to a Row

The first task I’ll cover is summing some columns to add a total column.

We will start by importing our excel data into a pandas dataframe.

import pandas as pd
import numpy as np
df = pd.read_excel("excel-comp-data.xlsx")
df.head()

	account	name	street	city	state	postal-code	Jan	Feb	Mar
0	211829	Kerluke, Koepp and Hilpert	34456 Sean Highway	New Jaycob	Texas	28752	10000	62000	35000
1	320563	Walter-Trantow	1311 Alvis Tunnel	Port Khadijah	NorthCarolina	38365	95000	45000	35000
2	648336	Bashirian, Kunde and Price	62184 Schamberger Underpass Apt. 231	New Lilianland	Iowa	76517	91000	120000	35000
3	109996	D’Amore, Gleichner and Bode	155 Fadel Crescent Apt. 144	Hyattburgh	Maine	46021	45000	120000	10000
4	121213	Bauch-Goldner	7274 Marissa Common	Shanahanchester	California	49681	162000	120000	35000

We want to add a total column to show total sales for Jan, Feb and Mar.

This is straightforward in Excel and in pandas. For Excel, I have added the formula sum(G2:I2) in column J. Here is what it looks like in Excel:

Next, here is how we do it in pandas:

df["total"] = df["Jan"] + df["Feb"] + df["Mar"]
df.head()

	account	name	street	city	state	postal-code	Jan	Feb	Mar	total
0	211829	Kerluke, Koepp and Hilpert	34456 Sean Highway	New Jaycob	Texas	28752	10000	62000	35000	107000
1	320563	Walter-Trantow	1311 Alvis Tunnel	Port Khadijah	NorthCarolina	38365	95000	45000	35000	175000
2	648336	Bashirian, Kunde and Price	62184 Schamberger Underpass Apt. 231	New Lilianland	Iowa	76517	91000	120000	35000	246000
3	109996	D’Amore, Gleichner and Bode	155 Fadel Crescent Apt. 144	Hyattburgh	Maine	46021	45000	120000	10000	175000
4	121213	Bauch-Goldner	7274 Marissa Common	Shanahanchester	California	49681	162000	120000	35000	317000

Next, let’s get some totals and other values for each month. Here is what we are trying to do as shown in Excel:

As you can see, we added a SUM(G2:G16) in row 17 in each of the columns to get totals by month.

Performing column level analysis is easy in pandas. Here are a couple of examples.

df["Jan"].sum(), df["Jan"].mean(),df["Jan"].min(),df["Jan"].max()

(1462000, 97466.666666666672, 10000, 162000)

Now, we want to add a total by month and grand total. This is where pandas and Excel diverge a little. It is very simple to add totals in cells in Excel for each month. Because pandas need to maintain the integrity of the entire DataFrame, there are a couple more steps.

First, create a sum for the month and total columns.

sum_row=df[["Jan","Feb","Mar","total"]].sum()
sum_row

Jan      1462000
Feb      1507000
Mar       717000
total    3686000
dtype: int64

This is fairly intuitive however, if you want to add totals as a row, you need to do some minor manipulations.

We need to transpose the data and convert the Series to a DataFrame so that it is easier to concat onto our existing data. The T function allows us to switch the data from being row-based to column-based.

df_sum=pd.DataFrame(data=sum_row).T
df_sum

	Jan	Feb	Mar	total
0	1462000	1507000	717000	3686000

The final thing we need to do before adding the totals back is to add the missing columns. We use reindex to do this for us. The trick is to add all of our columns and then allow pandas to fill in the values that are missing.

df_sum=df_sum.reindex(columns=df.columns)
df_sum

	account	name	street	city	state	postal-code	Jan	Feb	Mar	total
0	NaN	NaN	NaN	NaN	NaN	NaN	1462000	1507000	717000	3686000

Now that we have a nicely formatted DataFrame, we can add it to our existing one using append .

df_final=df.append(df_sum,ignore_index=True)
df_final.tail()

	account	name	street	city	state	postal-code	Jan	Feb	Mar	total
11	231907	Hahn-Moore	18115 Olivine Throughway	Norbertomouth	NorthDakota	31415	150000	10000	162000	322000
12	242368	Frami, Anderson and Donnelly	182 Bertie Road	East Davian	Iowa	72686	162000	120000	35000	317000
13	268755	Walsh-Haley	2624 Beatty Parkways	Goodwinmouth	RhodeIsland	31919	55000	120000	35000	210000
14	273274	McDermott PLC	8917 Bergstrom Meadow	Kathryneborough	Delaware	27933	150000	120000	70000	340000
15	NaN	NaN	NaN	NaN	NaN	NaN	1462000	1507000	717000	3686000

Additional Data Transforms

For another example, let’s try to add a state abbreviation to the data set.

From an Excel perspective the easiest way is probably to add a new column, do a vlookup on the state name and fill in the abbreviation.

I did this and here is a snapshot of what the results looks like:

You’ll notice that after performing the vlookup, there are some values that are not coming through correctly. That’s because we misspelled some of the states. Handling this in Excel would be really challenging (on big data sets).

Fortunately with pandas we have the full power of the python ecosystem at our disposal. In thinking about how to solve this type of messy data problem, I thought about trying to do some fuzzy text matching to determine the correct value.

Fortunately someone else has done a lot of work in this are. The fuzzy wuzzy library has some pretty useful functions for this type of situation. Make sure to get it and install it first.

The other piece of code we need is a state name to abbreviation mapping. Instead of trying to type it myself, a little googling found this code.

Get started by importing the appropriate fuzzywuzzy functions and define our state map dictionary.

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
state_to_code = {"VERMONT": "VT", "GEORGIA": "GA", "IOWA": "IA", "Armed Forces Pacific": "AP", "GUAM": "GU",
                 "KANSAS": "KS", "FLORIDA": "FL", "AMERICAN SAMOA": "AS", "NORTH CAROLINA": "NC", "HAWAII": "HI",
                 "NEW YORK": "NY", "CALIFORNIA": "CA", "ALABAMA": "AL", "IDAHO": "ID", "FEDERATED STATES OF MICRONESIA": "FM",
                 "Armed Forces Americas": "AA", "DELAWARE": "DE", "ALASKA": "AK", "ILLINOIS": "IL",
                 "Armed Forces Africa": "AE", "SOUTH DAKOTA": "SD", "CONNECTICUT": "CT", "MONTANA": "MT", "MASSACHUSETTS": "MA",
                 "PUERTO RICO": "PR", "Armed Forces Canada": "AE", "NEW HAMPSHIRE": "NH", "MARYLAND": "MD", "NEW MEXICO": "NM",
                 "MISSISSIPPI": "MS", "TENNESSEE": "TN", "PALAU": "PW", "COLORADO": "CO", "Armed Forces Middle East": "AE",
                 "NEW JERSEY": "NJ", "UTAH": "UT", "MICHIGAN": "MI", "WEST VIRGINIA": "WV", "WASHINGTON": "WA",
                 "MINNESOTA": "MN", "OREGON": "OR", "VIRGINIA": "VA", "VIRGIN ISLANDS": "VI", "MARSHALL ISLANDS": "MH",
                 "WYOMING": "WY", "OHIO": "OH", "SOUTH CAROLINA": "SC", "INDIANA": "IN", "NEVADA": "NV", "LOUISIANA": "LA",
                 "NORTHERN MARIANA ISLANDS": "MP", "NEBRASKA": "NE", "ARIZONA": "AZ", "WISCONSIN": "WI", "NORTH DAKOTA": "ND",
                 "Armed Forces Europe": "AE", "PENNSYLVANIA": "PA", "OKLAHOMA": "OK", "KENTUCKY": "KY", "RHODE ISLAND": "RI",
                 "DISTRICT OF COLUMBIA": "DC", "ARKANSAS": "AR", "MISSOURI": "MO", "TEXAS": "TX", "MAINE": "ME"}

Here are some example of how the fuzzy text matching function works.

process.extractOne("Minnesotta",choices=state_to_code.keys())

('MINNESOTA', 95)

process.extractOne("AlaBAMMazzz",choices=state_to_code.keys(),score_cutoff=80)

Now that we know how this works, we create our function to take the state column and convert it to a valid abbreviation. We use the 80 score_cutoff for this data. You can play with it to see what number works for your data. You’ll notice that we either return a valid abbreviation or an np.nan so that we have some valid values in the field.

def convert_state(row):
    abbrev = process.extractOne(row["state"],choices=state_to_code.keys(),score_cutoff=80)
    if abbrev:
        return state_to_code[abbrev[0]]
    return np.nan

Add the column in the location we want and fill it with NaN values

df_final.insert(6, "abbrev", np.nan)
df_final.head()

	account	name	street	city	state	postal-code	abbrev	Jan	Feb	Mar	total
0	211829	Kerluke, Koepp and Hilpert	34456 Sean Highway	New Jaycob	Texas	28752	NaN	10000	62000	35000	107000
1	320563	Walter-Trantow	1311 Alvis Tunnel	Port Khadijah	NorthCarolina	38365	NaN	95000	45000	35000	175000
2	648336	Bashirian, Kunde and Price	62184 Schamberger Underpass Apt. 231	New Lilianland	Iowa	76517	NaN	91000	120000	35000	246000
3	109996	D’Amore, Gleichner and Bode	155 Fadel Crescent Apt. 144	Hyattburgh	Maine	46021	NaN	45000	120000	10000	175000
4	121213	Bauch-Goldner	7274 Marissa Common	Shanahanchester	California	49681	NaN	162000	120000	35000	317000

We use apply to add the abbreviations into the approriate column.

df_final['abbrev'] = df_final.apply(convert_state, axis=1)
df_final.tail()

	account	name	street	city	state	postal-code	abbrev	Jan	Feb	Mar	total
11	231907	Hahn-Moore	18115 Olivine Throughway	Norbertomouth	NorthDakota	31415	ND	150000	10000	162000	322000
12	242368	Frami, Anderson and Donnelly	182 Bertie Road	East Davian	Iowa	72686	IA	162000	120000	35000	317000
13	268755	Walsh-Haley	2624 Beatty Parkways	Goodwinmouth	RhodeIsland	31919	RI	55000	120000	35000	210000
14	273274	McDermott PLC	8917 Bergstrom Meadow	Kathryneborough	Delaware	27933	DE	150000	120000	70000	340000
15	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1462000	1507000	717000	3686000

I think this is pretty cool. We have developed a very simple process to intelligently clean up this data. Obviously when you only have 15 or so rows, this is not a big deal. However, what if you had 15,000? You would have to do something manual in Excel to clean this up.

Subtotals

For the final section of this article, let’s get some subtotals by state.

In Excel, we would use the subtotal tool to do this for us.

The output would look like this:

Creating a subtotal in pandas, is accomplished using groupby

df_sub=df_final[["abbrev","Jan","Feb","Mar","total"]].groupby('abbrev').sum()
df_sub

	Jan	Feb	Mar	total
abbrev
AR	150000	120000	35000	305000
CA	162000	120000	35000	317000
DE	150000	120000	70000	340000
IA	253000	240000	70000	563000
ID	70000	120000	35000	225000
ME	45000	120000	10000	175000
MS	62000	120000	70000	252000
NC	95000	45000	35000	175000
ND	150000	10000	162000	322000
PA	70000	95000	35000	200000
RI	200000	215000	70000	485000
TN	45000	120000	55000	220000
TX	10000	62000	35000	107000

Next, we want to format the data as currency by using applymap to all the values in the data frame.

def money(x):
    return "${:,.0f}".format(x)

formatted_df = df_sub.applymap(money)
formatted_df

	Jan	Feb	Mar	total
abbrev
AR	$150,000	$120,000	$35,000	$305,000
CA	$162,000	$120,000	$35,000	$317,000
DE	$150,000	$120,000	$70,000	$340,000
IA	$253,000	$240,000	$70,000	$563,000
ID	$70,000	$120,000	$35,000	$225,000
ME	$45,000	$120,000	$10,000	$175,000
MS	$62,000	$120,000	$70,000	$252,000
NC	$95,000	$45,000	$35,000	$175,000
ND	$150,000	$10,000	$162,000	$322,000
PA	$70,000	$95,000	$35,000	$200,000
RI	$200,000	$215,000	$70,000	$485,000
TN	$45,000	$120,000	$55,000	$220,000
TX	$10,000	$62,000	$35,000	$107,000

The formatting looks good, now we can get the totals like we did earlier.

sum_row=df_sub[["Jan","Feb","Mar","total"]].sum()
sum_row

Jan      1462000
Feb      1507000
Mar       717000
total    3686000
dtype: int64

Convert the values to columns and format it.

df_sub_sum=pd.DataFrame(data=sum_row).T
df_sub_sum=df_sub_sum.applymap(money)
df_sub_sum

	Jan	Feb	Mar	total
0	$1,462,000	$1,507,000	$717,000	$3,686,000

Finally, add the total value to the DataFrame.

final_table = formatted_df.append(df_sub_sum)
final_table

	Jan	Feb	Mar	total
AR	$150,000	$120,000	$35,000	$305,000
CA	$162,000	$120,000	$35,000	$317,000
DE	$150,000	$120,000	$70,000	$340,000
IA	$253,000	$240,000	$70,000	$563,000
ID	$70,000	$120,000	$35,000	$225,000
ME	$45,000	$120,000	$10,000	$175,000
MS	$62,000	$120,000	$70,000	$252,000
NC	$95,000	$45,000	$35,000	$175,000
ND	$150,000	$10,000	$162,000	$322,000
PA	$70,000	$95,000	$35,000	$200,000
RI	$200,000	$215,000	$70,000	$485,000
TN	$45,000	$120,000	$55,000	$220,000
TX	$10,000	$62,000	$35,000	$107,000
0	$1,462,000	$1,507,000	$717,000	$3,686,000

You’ll notice that the index is ‘0’ for the total line. We want to change that using rename .

final_table = final_table.rename(index={0:"Total"})
final_table

	Jan	Feb	Mar	total
AR	$150,000	$120,000	$35,000	$305,000
CA	$162,000	$120,000	$35,000	$317,000
DE	$150,000	$120,000	$70,000	$340,000
IA	$253,000	$240,000	$70,000	$563,000
ID	$70,000	$120,000	$35,000	$225,000
ME	$45,000	$120,000	$10,000	$175,000
MS	$62,000	$120,000	$70,000	$252,000
NC	$95,000	$45,000	$35,000	$175,000
ND	$150,000	$10,000	$162,000	$322,000
PA	$70,000	$95,000	$35,000	$200,000
RI	$200,000	$215,000	$70,000	$485,000
TN	$45,000	$120,000	$55,000	$220,000
TX	$10,000	$62,000	$35,000	$107,000
Total	$1,462,000	$1,507,000	$717,000	$3,686,000

Conclusion

By now, most people know that pandas can do a lot of complex manipulations on data – similar to Excel. As I have been learning about pandas, I still find myself trying to remember how to do things that I know how to do in Excel but not in pandas. I realize that this comparison may not be exactly fair – they are different tools. However, I hope to reach people that know Excel and want to learn what alternatives are out there for their data processing needs. I hope these examples will help others feel confident that they can replace a lot of their crufty Excel data manipulations with pandas.

I found this exercise helpful to cement these ideas in my mind. I hope it works for you as well. If you have other Excel tasks that you would like to learn how to do in pandas, let me know via the comments below and I will try to help.

This article has been published from the source link without modifications to the text. Only the headline has been changed.

[ad_2]

Source link

Pandas Common Excel tasks shown

Introduction

Adding a Sum to a Row

Additional Data Transforms

Subtotals

Conclusion

Follow Us

POPULAR POSTS

POPULAR CATEGORY

Meta’s head of AI warns Europe on Banning AI Model Sharing