Tutorial on Python Pandas

Data is an important part of our world. In fact, by some estimates, 90% of the world’s data was created in just the last 3 years. Many tech giants have started hiring data scientists to analyze data and extract useful insights for business decisions.

Currently, Python is the most important language for data analysis, and many of the industry-standard tools are written in Python. Python Pandas is one of the most essential, in-demand tools that any aspiring data analyst needs to learn. Today, we’ll introduce you to the essentials of Pandas.

Introducing Pandas for Python

The Pandas library is one of the most important and popular tools for Python data scientists and analysts, as it is the backbone of many data projects. Pandas is an open-source Python package for data cleaning and data manipulation. It provides extended, flexible data structures to hold different types of labeled and relational data. On top of that, it is actually quite easy to install and use.

Pandas is often used in conjunction with other Python libraries. In fact, Pandas is built on the NumPy package, so the two share a lot of structure. Pandas is also used with SciPy for statistical analysis and with Matplotlib for plotting. Pandas can be used on its own with a text editor or with Jupyter Notebooks, a convenient environment for more complex data modeling. Pandas is available for most versions of Python, including Python 3.
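To make the NumPy connection concrete, here is a minimal sketch (the values are arbitrary) showing that a Series can be handed to a NumPy function and converted to a NumPy array.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0])

# Convert the Series to a plain NumPy array
arr = s.to_numpy()

# NumPy functions can be applied directly to a Series; the result is a Series
roots = np.sqrt(s)

print(type(arr))
print(roots)
```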

Think of Pandas as the home for your data where you can clean, analyze, and transform your data, all in one place. Pandas is essentially a more powerful replacement for Excel. Using Pandas, you can do things like:

  • Easily calculate statistics about data such as finding the average, distribution, and median of columns
  • Use data visualization tools, such as Matplotlib, to easily create bar plots, histograms, and more
  • Clean your data by filtering columns by particular criteria or easily removing values
  • Manipulate your data flexibly using operations like merging, joining, reshaping, and more
  • Read, write, and store your clean data as a database, txt file, or CSV file
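As a rough illustration of a few of the tasks listed above, the sketch below builds a small table, computes some statistics, filters rows, and produces CSV output. The column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical sample data, invented purely for illustration
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Cairo", "Pune"],
    "temp_c": [4.0, 19.0, 28.0, 31.0],
})

# Basic statistics on a column
mean_temp = df["temp_c"].mean()
median_temp = df["temp_c"].median()

# Filter rows by a condition
warm = df[df["temp_c"] > 20]

# Serialize the filtered data as CSV text
# (pass a file path to to_csv() to write to disk instead)
csv_text = warm.to_csv(index=False)

print(mean_temp, median_temp)
print(csv_text)
```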

Popularity of Pandas

As we learned, Python is the most popular programming language for data analytics, and many of the popular machine learning and visualization libraries are written in Python, including Pandas, NumPy, TensorFlow, Matplotlib, Scikit-learn, and more. In fact, Python ranked 4th in the 2020 Stack Overflow survey for the most popular programming language, and it is beloved for its simplicity, gentle learning curve, and broad library support.

Pandas is an important part of data analytics. It ranks 4th for most popular and loved libraries. It also consistently ranks highly for most wanted programming tools, a sure sign that Pandas is a sought-after tool for developers around the world. Learning Pandas is an important step to becoming a data analyst.

First Step: Installing Pandas

You can install Pandas with pip, Python’s package installer, by running the following command.

$ pip install pandas
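One simple way to check that the installation worked is to import the package and print its version. The version you see will depend on what pip installed.

```python
import pandas as pd

# If this import succeeds, Pandas is installed; the version string confirms which release
print(pd.__version__)
```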

Pandas Data Structures and Data Types

A data type is an internal construct that determines how Python will manipulate, use, or store your data. When doing data analysis, it’s important to use the correct data types to avoid errors. Pandas will often correctly infer data types, but sometimes we need to convert data explicitly. Let’s go over the data types available to us in Pandas, also called dtypes.

  • object: text or mixed numeric and non-numeric values
  • int64: integer numbers
  • bool: true/false values
  • float64: floating point numbers
  • category: finite list of text values
  • datetime64[ns]: date and time values
  • timedelta64[ns]: differences between two datetimes
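The sketch below shows how you might inspect the inferred dtypes and convert columns explicitly. The table is hypothetical, and the exact dtype names printed for text and datetime columns can vary between Pandas versions.

```python
import pandas as pd

# Hypothetical data; "count" holds numbers stored as text
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "count": ["1", "2", "3"],
    "price": [1.5, 2.0, 3.25],
    "in_stock": [True, False, True],
})

# See what Pandas inferred for each column
print(df.dtypes)

# Convert columns explicitly where the inferred type is not what we want
df["count"] = df["count"].astype("int64")
df["name"] = df["name"].astype("category")

# Parse date strings into datetimes, then subtract to get timedeltas
df["added"] = pd.to_datetime(["2021-01-01", "2021-01-15", "2021-02-01"])
df["age"] = df["added"].max() - df["added"]

print(df.dtypes)
```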

A data structure is a particular way of organizing our data. Pandas has two primary data structures, and all operations are based on those two objects:

  • Series
  • DataFrame

Think of this as a chart for easy storage and organization, where Series are the columns, and the DataFrame is a table composed of a collection of series. Series can be best described as the single column of a 2-D array that can store data of any type. DataFrame is like a table that stores data similar to a spreadsheet using multiple columns and rows. Each value in a DataFrame object is associated with a row index and a column index.
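The relationship between the two structures can be sketched as follows. The item names, column names, and numbers are invented for the example.

```python
import pandas as pd

# Two Series that share the same index labels (values are made up)
price = pd.Series([0.5, 0.8], index=["apple", "pear"], name="price")
quantity = pd.Series([10, 4], index=["apple", "pear"], name="quantity")

# A DataFrame can be assembled from Series that share an index
df = pd.concat([price, quantity], axis=1)
print(df)

# Selecting a single column gives back a Series
col = df["price"]
print(type(col))

# Each value is addressed by a row label and a column label
print(df.loc["pear", "quantity"])
```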

Series: the most important operations

We can get started with Pandas by creating a series. We create a series by calling pd.Series() and passing it a list of values, then print it with the print() function. By default, Pandas numbers the index starting from 0. If we want different labels, we can define them explicitly with the index argument.

import pandas as pd

series1 = pd.Series([1, 2, 3, 4])

print(series1)
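For comparison, here is a small sketch of the same series with explicit index labels. The labels "a" through "d" are arbitrary choices for the example.

```python
import pandas as pd

# Default index: 0, 1, 2, 3
series1 = pd.Series([1, 2, 3, 4])

# Explicit index labels supplied through the index argument
series2 = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

print(series1.index.tolist())
print(series2)

# Values can then be looked up by label
print(series2["c"])
```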

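Let’s look at a more complex example that loads a CSV file into a DataFrame and selects a few of its columns. The code below assumes a file named test.csv in the working directory that contains the columns Sex, Under 1, and 40-44.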

import pandas as pd

df = pd.read_csv('test.csv')

print(df.columns)

print("\nThe original DataFrame:")
print(df.head())

print("\nThe new DataFrame with selected columns is:\n")
new_df = pd.DataFrame(df, columns=['Sex', 'Under 1', '40-44'])
print(new_df.head())
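If you don’t have test.csv at hand, the following sketch reproduces the same idea with an in-memory stand-in. The column names follow the example above, but the rows are invented. It also shows the more common way to select columns, by passing a list of names in square brackets.

```python
import io

import pandas as pd

# Stand-in for test.csv; the rows are invented for illustration
csv_text = """Sex,Under 1,1-4,40-44
Male,10,40,55
Female,12,38,60
"""
df = pd.read_csv(io.StringIO(csv_text))

# Select a subset of columns with a list of column names
new_df = df[["Sex", "Under 1", "40-44"]]
print(new_df.head())
```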

Reindex data in a DataFrame

We can also reindex the data, either by its row index or by its columns. reindex() returns a new object that conforms to the labels we pass and leaves the original object unchanged. Labels that were not in the original are filled with NaN unless we supply a fill_value.

Note: The rules for reindexing are the same for Series and DataFrame objects.

#importing pandas in our program
import pandas as pd

# Defining a series object
srs1 = pd.Series([11.9, 36.0, 16.6, 21.8, 34.2], index = ['China', 'India', 'USA', 'Brazil', 'Pakistan'])

# Set Series name
srs1.name = "Growth Rate"

# Set index name
srs1.index.name = "Country"

srs2 = srs1.reindex(['China', 'India', 'Malaysia', 'USA', 'Brazil', 'Pakistan', 'England'])
print("The series with new indexes is:\n",srs2)

# Same labels as above, but missing entries are filled with 0 instead of NaN
srs3 = srs1.reindex(['China', 'India', 'Malaysia', 'USA', 'Brazil', 'Pakistan', 'England'], fill_value=0)
print("\nThe series with new indexes is:\n", srs3)
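Since the same rules apply to DataFrame objects, here is a minimal sketch of reindexing both rows and columns. The growth values for China and India come from the series above; the "rank" and "notes" columns are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"growth": [11.9, 36.0], "rank": [2, 1]},
    index=["China", "India"],
)

# Reindex rows: the new label "Malaysia" gets NaN in every column
rows = df.reindex(["China", "India", "Malaysia"])

# Reindex columns: "rank" is dropped and the new "notes" column takes the fill_value
cols = df.reindex(columns=["growth", "notes"], fill_value="")

print(rows)
print(cols)

# The original DataFrame is unchanged
print(df)
```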

Pretty simple, right? Some other common data wrangling processes that you should know are:

  • Mapping data and finding duplicates
  • Finding outliers in data
  • Data Aggregation
  • Reshaping data
  • Replace & rename
  • and more
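A few of these operations can be sketched on a small table. The data below reuses growth rates from the earlier example with one duplicated row added on purpose, and the region labels are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["China", "India", "India", "USA"],
    "rate": [11.9, 36.0, 36.0, 16.6],
})

# Find duplicate rows, then drop them
dupes = df.duplicated()
clean = df.drop_duplicates()

# Rename a column and replace a value
clean = clean.rename(columns={"rate": "growth_rate"})
clean = clean.replace({"country": {"USA": "United States"}})

# Map each country to a region (labels chosen for illustration)
clean["region"] = clean["country"].map(
    {"China": "Asia", "India": "Asia", "United States": "Americas"}
)

# Aggregate: mean growth rate per region
by_region = clean.groupby("region")["growth_rate"].mean()
print(by_region)
```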

This article has been published from the source link without modifications to the text. Only the headline has been changed.
