Making Excel Faster Using Python

Excel is very, very pervasive in today’s businesses. At Dataquest, we generally recommend working with code for quite a few reasons, and many of our data science courses are aimed at teaching efficient coding for data analysis and data science. But no matter how strongly you prefer working with Python, at the end of the day there is sometimes a need to present your findings or share your data using Excel.

But that doesn’t mean can’t still enjoy some of the efficiencies of Python! In fact, using a library called xlwings, you can use Python to make working in Excel faster.

In this xlwings tutorial, we’ll walk through how to use Python in Excel to execute and use some common operations like deleting rows based on certain criteria, using Excel functions and formulas, autofilling, creating sheets, charts, etc. In order to follow this post you should be familiar with basic Python concepts (objects, methods, attributes, functions) and Python’s syntax and have an intermediate knowledge of Excel and VBA.

We will be working with a data set that contains information regarding the draws of an European lottery called EuroMillions. This data set was downloaded from this link and it contains all the EuroMillions draws up to, and including, the 20th of September. The data available at that link should be updated with the latest information up to whenever you’re reading this post, but in case it’s unavailable, here’s a CSV file with the data from that link as of September 20.

Euromillions is a multinational lottery available in some European countries, specifically in Andorra, Austria, Belgium, France (including overseas regions and collectivities), Ireland, the Isle of Man, Liechtenstein, Luxembourg, Monaco, Portugal, Spain, Switzerland and the United Kingdom (source).

As of this writing, the draw consists of five numbers from a pool of 50 numbers (numbered 1 through 50) and two numbers called lucky stars from a pool of 12 numbers. In order to win the jackpot, participants must correctly choose all drawn numbers and lucky stars. The largest jackpot ever won was €190 million. (Note, though, that our data set denominates winnings in Pounds, not Euros).

In this tutorial, we’ll use Python and xlwings with Excel to clean up a data set and then generate some graphs to visualize which numbers win the EuroMillions most frequently.

import pandas as pd
import xlwings as xw
df = pd.read_csv('euromillions.csv')
df.sample(5)
  No. Day DD MMM YYYY N1 N2 N3 N4 N5 L1 L2 Jackpot Wins
627 514 Fri 24 Aug 2012 6 5 19 37 12 7 3 37709047 1
230 911 Tue 14 Jun 2016 13 50 42 39 34 11 9 11928000 0
998 143 Fri 3 Nov 2006 13 11 44 24 49 9 3 88344099 0
1071 70 Fri 10 Jun 2005 37 32 47 7 6 7 1 21734699 0
522 619 Tue 27 Aug 2013 7 40 38 43 30 6 2 12931500 0

The first column is the draw number, columns N1-L2 are the drawn numbers and lucky stars (by the order they were drawn), the Jackpot column is the jackpot in Euros and the Wins column tell us how many bets hit the jackpot.

Meet xlwings

xlwings is a Python library that makes some of the data analysis features of Python available in an Excel instance, including support for numpy arrays and pandas Series and DataFrames. Like any other Python library, it can be installed using common methods like pip or conda, but you can access the documentation for xlwings here if you need additional details.

Note that you’ll need to have a version of Microsoft Excel installed on the computer you’re using to do this xlwings tutorial.

xlwings objects

In xlwings there are four main object types which are, in decreasing hierarchical order: App (which represents an Excel instance), BookSheet and Range. In addition to these we’ll also be dealing with Chart and Shape objects. You can find useful information regarding these and other objects in the official documentation, but we’ll be looking at each of these objects one at a time.

Let’s start by creating a Book instance and naming it wb (workbook).

wb = xw.Book() # wb = xw.Book(filename) would open an existing file

When you run that code, it should look something like this.

Notice that when the code cell is run in the Jupyter Notebook, Excel is launched automatically.

By instantiating a Book object, an App object that belongs to our book object is created automatically. Here’s how we can check all opened Excel instances.

Note: we’re not going to include gifs for every step of this tutorial, because we don’t want this page to be a hassle to load for people with slow or limited internet connections. However, subsequent code-running steps should look similar to what we can see above: when we run a cell in Juypter, the Excel spreadsheet is updated in accordance with whatever code we’ve run.

print(xw.apps)
Apps([<Excel App 9536>])

The object xw.apps is an iterable. To check what workbooks belong to the unique instances in this iterable we can invoke the books method on it like so.

print(xw.apps[0].books)
Books([<Book [Book1]>])

As expected the only instance of this iterable is the workbook wb. We check this fact below.

print(xw.apps[0].books[0] == wb)
True

Similarly, we can check what sheets belong to this workbook:

print(wb.sheets)
Sheets([<Sheet [Book1]Sheet1>])

We can also refer to sheets by their name:

#creates a worksheet object assigns it to ws
ws = wb.sheets["Sheet1"]
#checks that wb.sheets[0] equals ws
print(ws == wb.sheets[0])
True

Sheet objects have a name attribute that works as expected. Let us change the name of our only sheet.

ws.name = "EuroMillions"

We can move data from certain Python objects (e.g. lists and tuples) into Excel. Let’s move the data in our dataframe into the sheet EuroMillions. To do this, we’ll make use of range to create a range object that stores the data from our DataFrame in a range of cells in Excel, starting in this case with the cell A1:

#ws.range("A1") is a Range object
ws.range("A1").value = df

Here’s how that will look:

As we can see, the index column of df was also moved to Excel. Let’s clear the contents of this sheet and copy the data without the index.

ws.clear_contents()
ws.range("A1").options(index=False).value = df

It will be useful to be able to tell where our table ends. More specifically, we’ll need the last row that has data in it. To this end, we can use the end method and the row attribute of Range objects.

The row method, not surprisingly, returns the row of the Range object.

The method end takes in as an argument a direction ("up" (or 1), "right" (or 2), "left" (or 3"down" (or 4)) and returns another range object. It mimics the very common action in Excel CTRL+Shift+Arrow.

last_row = ws.range(1,1).end('down').row
print("The last row is {row}.".format(row=last_row))
print("The DataFrame df has {rows} rows.".format(rows=df.shape[0]))
The last row is 1142.
The DataFrame df has 1141 rows.

It checks out!

The API property

Not every Excel functionality is available as a native xlwings feature. Sometimes we’ll have to find workarounds to do what we want. Fortunately xlwings makes this very easy for us. From the Missing Features section in the official documentation:

Workaround: in essence, xlwings is just a smart wrapper around pywin32 on Windows and appscript on Mac. You can access the underlying objects by calling the api property. The underlying objects will offer you pretty much everything you can do with VBA, using the syntax of pywin32 (which pretty much feels like VBA) and appscript (which doesn’t feel like VBA). But apart from looking ugly, keep in mind that it makes your code platform specific (!)Excel Visual Basic for Applications is a rich source of explanations for the various existing Excel objects.

Sorting is one such functionality that is missing from xlwings. You may have noticed that the records are ordered from the most recent to the oldest draw. In the next few steps we’ll be reversing the order.

The object ws.range("A2:N{row}".format(row=last_row)) is a Range object. Appending the api property to it yields a VBA Range object which in turn gives access to its VBA features.

We’ll be using the Sort property of this VBA object. In its simplest application, Sort takes in two arguments: the column by which we want to order the table (as a VBA Range object), and the order type (whether we want to sort it in ascending or descending order). The second argument’s parameter’s documentation can be seen here. We’ll be sorting in ascending order.

Putting all of that together looks like this:

ws.range(
"A2:N{row}".format(row=last_row)
).api.Sort(Key1=ws.range("A:A").api, Order1=1)
True

Here’s how it’ll look on your screen after running (note that the first column has changed, and it’s now sorted in ascending rather than descending order.

Analyzing Our Data

One problem we’ll encounter in trying to analyze this data set is that the dates are scattered across three different columns. We’ll need to compress it into a single column. To do this, we’ll concatenate the columns appropriately in Excel, using Python. We begin by inserting a header in the empty adjacent column.

ws.range("O1").value = "Date"

Next, we can insert the Excel formula we want to use as a string. Note: the specifics of what argument separator you should use depends on your machine’s local regional settings. On mine, the argument separator is a comma and that’s what I’ve used in this tutorial, but on yours it might be a semicolon.

ws.range("O2").value = "=C2&D2&RIGHT(E2, 2)"

Having inserted the formula in the first cell, it’s second nature in the regular Excel workflow to autofill the rest of the cells through the end of the table. Autofill is a method of the VBA Range object. It takes as parameters the destination cells as a VBA Range object and the type of fill. We’re interested in the default one whose enumeration is 0.

ws.range("O2").api.AutoFill(
ws.range("O2:O{row}".format(row=last_row)).api,
0
)
True

Here’s roughly how the screen should look after this step; note the new “Date” column on the far right.

We can also use the named form of the type of fill we want. To do this we’ll need to retrieve it from the module xlwings.constants which contains named versions of the enumerated arguments of most VBA properties. Recall that you can always check the available attributes by printing dir(xlwings.constants).

(If you’re not familiar with it, dir is a native Python function and can take several kinds of arguments (modules, classes and regular objects (like lists and strings). For example, if you print dir(some_list) it will give you all the methods and attributes that you can use with a list.)

What we did above could also be achieved with the following code snippet.

from xlwings.constants import AutoFillType
ws.range("O2").api.AutoFill(
ws.range("O2:O{row}".format(row=last_row)).api,
    AutoFillType.xlFillDefault
)

Since we’ll be using this often, we’ll create a function that applies the default fill, given:

  • a worksheet
  • a string representing a cell in the worksheet
  • a last row to fill.

To do this we’ll introduce a new Range method called get_address. It takes in four Boolean arguments and returns a string that identifies the range with varying levels of detail. Here’s a hopefully enlightening illustration of this method.

for arg4 in (0, 1):
for arg3 in (0,1):
for arg2 in (0,1):
for arg1 in (0,1):
print(ws.range("O2").get_address(arg1, arg2, arg3, arg4))
O2
O$2
$O2
$O$2
EuroMillions!O2
EuroMillions!O$2
EuroMillions!$O2
EuroMillions!$O$2
[Book1]EuroMillions!O2
[Book1]EuroMillions!O$2
[Book1]EuroMillions!$O2
[Book1]EuroMillions!$O$2
[Book1]EuroMillions!O2
[Book1]EuroMillions!O$2
[Book1]EuroMillions!$O2
[Book1]EuroMillions!$O$2

And now we define our function.

def autofill(worksheet, cell, last_row):
   rg_cell = worksheet.range(cell)
   to_fill = "{col}{top_row}:{col}{last_row}".format(
      col=rg_cell.get_address(0,0)[0],
      top_row=rg_cell.row,
      last_row=last_row
   )
   rg_cell.api.Autofill(worksheet.range(to_fill).api, 0)

In order to save Excel from making unnecessary computations, we’ll replace the formulas we just inserted on column O with hard coded values. Before we do this let us take a moment to think about what kind of Python object is Range.value when Range is an array.

print(type(ws.range("O2:O{row}".format(row=last_row)).value))
<class 'list'>

It’s a list! Let’s see its first ten elements.

print(ws.range('O2:O{row}'.format(row=last_row)).value[:10])
['13Feb04', '20Feb04', '27Feb04', '5Mar04', '12Mar04', '19Mar04', '26Mar04', '2Apr04', '9Apr04', '16Apr04']

If we insert this list into any range, it will place the values horizontally, which isn’t what we want. In order to place them vertically, we’ll need to use the options method of Range objects with the option transpose=True as a parameter, like this:

ws.range('O2').options(transpose=True).value\
= ws.range('O2:O{row}'.format(row=last_row)).value

And we can now delete columns C through E.

ws.range('C:E').api.Delete()
True

EuroMillions format has suffered some mild modifications over the years, the last one happened on September 24, 2016.

From September 24, 2016 the amount of lucky stars changed from a pool of 11 to a pool of 12 numbers. In order to make a meaningful analysis, we’ll only consider draws that occurred after the last modification. The next code snippet finds the last game that happened prior to the modification and names it to_delete.

import datetime

for day in ws.range('L2:L{}'.format(last_row)):
# checks if day is not prior to the change of the rules
    if day.value <= datetime.datetime(2016, 9, 24, 0, 0):
        # since day is past the modification date,
        # the row we want is the previous one, hence minus 1
        to_delete = int(day.get_address(0, 0)[1:])-1
        # leave the for cycle
        break

And we can now delete every row from the first game to to_delete.

ws.range('2:{}'.format(to_delete)).api.Delete()
True

Here’s where we’re at, at this point:

Having finished preparing the data, we will now format this table. We’ll begin by setting the font of the first row to bold.

ws.range('1:1').api.Font.Bold = True

We can follow that by formatting the Jackpot column in millions. Note that the string format below depends on your machine’s local regional settings. If the format looks odd on your end, try swapping the commas with the dots. More on Excel custom formats here.

ws.range('J:J').number_format = "£##.##0,,' M'"

As an auxiliary step for what will follow, we’ll find the letter corresponding to the last column that has data.

last_column = ws.range(1,1).end('right').get_address(0,0)[0]

Let’s now add a border to the bottom of the header cells. Similarly to what we have been doing, we’ll use the api property. Additionally we are going to need the Border property of Range objects, the border orientation enumeration and the style of the border. We’ll be setting a double edged border (line style -4119) on the bottom of the header cells only (orientation 9).

ws.range('A1:{}1'.format(last_column)).api.Borders(9).LineStyle = -4119

Let us now autofit by both rows and columns.

ws.autofit()

Oops! This looks a bit squished, let us set the width of all columns to that of column J which seems to be the largest. Here is the ColumnWidth documentation that we are using just below.

ws.range('A:L').api.ColumnWidth = ws.range('J:J').api.ColumnWidth

That should look better. And we’re done with this sheet!

Let us add a new blank sheet called Frequencies and let us assign it to the Python name frequencies.

wb.sheets.add('Frequencies')
frequencies = wb.sheets['Frequencies']

We will be populating this sheet with the absolute frequencies of each number and each lucky star in the data set we just organized in the sheet EuroMillions.

# add a header for the numbers
frequencies.range('A1').value = 'Number'
# populate the fifty cells immediately below with the numbers 1 through 50
# since we're starting from the second row, we'll want to substract 1 from the row
frequencies.range('A2:A51').value = '=ROW()-1'

Below we’ll be inserting a header for the frequencies in cell B1 and in cell B2 we will input a formula that will count how many times the value in A2 occurs in the range C2:G201. In other words, we will count how many times 1 occurred in the columns N1-N5. After this we will autofill the rest of the cells on column B to do the same for their respective rows.

# add a header for the frequencies
frequencies.range('B1').value = 'Frequency'
# insert on B2 the result of a standard Excel formula
frequencies.range('B2').value = '=COUNTIF(Euromillions!$C$2:$G$201,Frequencies!A2)'
autofill(frequencies, 'B2', 51)

And we do the same for the lucky stars:

frequencies.range('D1').value = 'Lucky Star'
frequencies.range('E1').value = 'Frequency'
frequencies.range('D2:D13').value = '=ROW()-1'
frequencies.range('E2').value =\
'=COUNTIF(EuroMillions!$H$2:$I$201,Frequencies!D2)'
autofill(frequencies, 'E2', 13)
frequencies.autofit()

Here’s how our new sheet should be looking at this point:

We are nearing our goal. Let’s create a sheet called Graphs.

wb.sheets.add('Graphs')
graphs = wb.sheets['Graphs']

Now we’ll create a Chart object. This will only produce a blank white box, but don’t worry! We’re going to use that box to chart our data in a moment.

nr_freq = xw.Chart()

We can name our chart similarly to what we did for sheets. The method set_source_data allows us to define the data source for our chart by passing in a range object.

nr_freq.name = 'Number Frequencies'
nr_freq.set_source_data(frequencies.range('Frequencies!B1:B51'))

Excel will try to guess what the x-axis should be, but we can force it to be the numbers we created on Frequencies by using the VBA Chart method FullSeriesCollection. We can edit the chart by using the object at index 1 of nr_freq.api:

nr_freq.api[1].FullSeriesCollection(1).XValues = '=Frequencies!A2:A51'

Excel is very good at guessing what kind of chart the user wants, but just in case it guesses incorrectly, we will force it to be a column chart. The various types of charts are listed here. Alas, the only documentation bridging these to the possible values of the chart_type attribute is the source code itself.

nr_freq.chart_type = 'column_clustered'

We will now define the height and the width of the chart. The measure unit will be points.

nr_freq.height = 250
nr_freq.width = 750

Here’s what we should be seeing at this point:

The SetElement method together with the parameter 2 sets the title above the chart. See other arguments here.

nr_freq.api[1].SetElement(2)  # Place chart title at the top
nr_freq.api[1].ChartTitle.Text = 'Number Frequencies'

And we add the final touches. We remove the legend using the HasLegend property.

nr_freq.api[1].HasLegend = 0

The xlCategory category passed in as the argument 1 to the Axes method, together with the property TickLabelSpacing set to 1, ensures that every element of our axis is displayed.

nr_freq.api[1].Axes(1).TickLabelSpacing = 1

To finish formatting this chart, we remove the outline by setting the property Visible of the Line object to 0.

graphs.shapes.api('Number Frequencies').Line.Visible = 0

And here’s what we’ll see:

Below we do nearly the same thing for the lucky stars.

ls_freq = xw.Chart()
ls_freq.top = 250
ls_freq.name = 'Lucky Star Frequencies'
ls_freq.set_source_data(frequencies.range('Frequencies!E1:E13'))
ls_freq.api[1].FullSeriesCollection(1).XValues = '=Frequencies!D2:D13'
ls_freq.chart_type = 'column_clustered'
ls_freq.height = 250
ls_freq.width = 750
ls_freq.api[1].SetElement(2)
ls_freq.api[1].ChartTitle.Text = 'Lucky Star Frequencies'
ls_freq.api[1].HasLegend = 0
ls_freq.api[1].Axes(1).TickLabelSpacing = 1
graphs.shapes.api('Lucky Star Frequencies').Line.Visible = 0

And to finalize we create a time series graph displaying the evolution of the jackpot.

jackpot = xw.Chart()
jackpot.top = 500
jackpot.name = 'Jackpot'
last_row = ws.range(1,1).end('down').row
jackpot.set_source_data(ws.range('Euromillions!J2:J{}'.format(last_row)))
jackpot.api[1].FullSeriesCollection(1).XValues\
= 'Euromillions!L2:L{}'.format(last_row)
jackpot.chart_type = 'line'
jackpot.height = 250
jackpot.width = 750
jackpot.api[1].SetElement(2)
jackpot.api[1].ChartTitle.Text = 'Jackpot'
jackpot.api[1].HasLegend = 0
graphs.shapes.api('Jackpot').Line.Visible = 0

And we fix the vertical axis’ labels’ format by setting the TickLabels property NumberFormat to the desired appearance.

jackpot.api[1].Axes(2).TickLabels.NumberFormat = "£0,,' M'"

And we’re done! Now we save the file and exit our Excel instance.

wb.save('EuroMillions.xlsx')
xw.apps[0].quit()

Hopefully, this xlwings tutorial has been helpful!

[ad_2]

This article has been published from the source link without modifications to the text. Only the deadline has been changed.

Source link