Data is an important part of our world. By some widely cited estimates, about 90% of the world’s data was created in just the last few years. Many tech giants have started hiring data scientists to analyze data and extract useful insights for business decisions.
Currently, Python is the most important language for data analysis, and many of the industry-standard tools are written in Python. Pandas is one of the most essential, in-demand tools that any aspiring data analyst needs to learn. Today, we’ll introduce you to the essentials of Pandas.
Introducing Pandas for Python
The Pandas library is one of the most important and popular tools for Python data scientists and analysts, as it is the backbone of many data projects. Pandas is an open-source Python package for data cleaning and data manipulation. It provides extended, flexible data structures to hold different types of labeled and relational data. On top of that, it is actually quite easy to install and use.
Pandas is often used in conjunction with other Python libraries. In fact, Pandas is built on the NumPy package, so the two share a lot of structure. Pandas is also used with SciPy for statistical analysis and with Matplotlib for plotting. Pandas can be used on its own with a text editor or with Jupyter Notebooks, the ideal environment for more complex data modeling. Pandas is available for most versions of Python, including Python 3.
Think of Pandas as the home for your data where you can clean, analyze, and transform your data, all in one place. Pandas is essentially a more powerful replacement for Excel. Using Pandas, you can do things like:
- Easily calculate statistics about data such as finding the average, distribution, and median of columns
- Use data visualization tools, such as Matplotlib, to easily create bar plots, histograms, and more
- Clean your data by filtering columns by particular criteria or easily removing values
- Manipulate your data flexibly using operations like merging, joining, reshaping, and more
- Read, write, and store your clean data as a database, a txt file, or a CSV file
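To make the list above a little more concrete, here is a minimal sketch of a few of those tasks. The two tables, their column names, and every value in them are made up purely for illustration.

```python
import pandas as pd

# Hypothetical sample data, invented purely for illustration
sales = pd.DataFrame({
    "region": ["North", "South", "East", "West"],
    "units": [120, 85, 150, 60],
})
managers = pd.DataFrame({
    "region": ["North", "South", "East", "West"],
    "manager": ["Ana", "Ben", "Chloe", "Dev"],
})

# Calculate statistics about a column
mean_units = sales["units"].mean()
median_units = sales["units"].median()

# Clean the data by keeping only the rows that meet a criterion
busy = sales[sales["units"] >= 100]

# Combine two tables on a shared column
merged = sales.merge(managers, on="region")

# Produce CSV text (pass a file path to to_csv to save to disk instead)
csv_text = merged.to_csv(index=False)

print(mean_units, median_units)
print(busy)
print(csv_text)
```

Each step returns a new object, so the original sales table is left untouched throughout.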
Popularity of Pandas
As we learned, Python is the most popular programming language for data analytics, and many of the popular machine learning and visualization libraries are written in Python, including Pandas, NumPy, TensorFlow, Matplotlib, Scikit-learn, and more. In fact, Python ranked 4th among the most popular programming languages in the 2020 Stack Overflow survey, and it is beloved for its simplicity, gentle learning curve, and extensive library support.
Pandas is an important part of data analytics. It ranks 4th for most popular and loved libraries. It also consistently ranks highly for most wanted programming tools, a sure sign that Pandas is a sought-after tool for developers around the world. Learning Pandas is an important step to becoming a data analyst.
First Step: Installing Pandas
You can install Pandas with pip, Python's package installer, by running the following command.
$ pip install pandas
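Once the install finishes, a quick way to check that it worked is to import the library and print its version. This is only a small sanity check; the version you see depends on your environment.

```python
import pandas as pd

# If the import succeeds, Pandas is installed; print its version string
print(pd.__version__)
```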
Pandas Data Structures and Data Types
A data type is like an internal construct that determines how Python will manipulate, use, or store your data. When doing data analysis, it’s important to use the correct data types to avoid errors. Pandas will often correctly infer data types, but sometimes, we need to explicitly convert data. Let’s go over the data types available to us in Pandas, also called dtypes.
- object: text or mixed numeric and non-numeric values
- int64: integer numbers
- bool: true/false values
- float64: floating point numbers
- category: finite list of text values
- datetime64: date and time values
- timedelta[ns]: differences between two datetimes
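As a rough sketch of how inference and explicit conversion look in practice, the example below builds a small table with made-up columns and values, then converts some of them. The exact dtype names printed can vary by Pandas version (for example, text columns usually appear as object, but newer releases may report a dedicated string dtype).

```python
import pandas as pd

# Hypothetical data; "count" is deliberately stored as text
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "count": ["1", "2", "3"],
    "score": [0.5, 1.5, 2.5],
    "active": [True, False, True],
})

# Show the dtypes Pandas inferred
print(df.dtypes)

# Explicitly convert text to integers and to a categorical column
df["count"] = df["count"].astype("int64")
df["name"] = df["name"].astype("category")

# Parse date strings into datetimes, then subtract to get timedeltas
df["when"] = pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-04"])
df["gap"] = df["when"] - df["when"].iloc[0]

# Show the dtypes after conversion
print(df.dtypes)
```

After the conversion, arithmetic such as df["count"].sum() works on numbers instead of concatenating text.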
A data structure is a particular way of organizing our data. Pandas has two data structures, and all operations are based on those two objects:
- Series
- DataFrame
Think of this as a chart for easy storage and organization, where Series are the columns, and the DataFrame is a table composed of a collection of Series. A Series can be best described as a single column of a 2-D array that can store data of any type. A DataFrame is like a table that stores data similar to a spreadsheet using multiple columns and rows. Each value in a DataFrame object is associated with a row index and a column index.
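Here is a minimal sketch of that relationship, using made-up row labels, column names, and values: two Series become the columns of a DataFrame, and a single value is then looked up by its row and column labels.

```python
import pandas as pd

# Two hypothetical Series that share the same row labels
col_a = pd.Series([1, 2, 3], index=["row1", "row2", "row3"])
col_b = pd.Series([10, 20, 30], index=["row1", "row2", "row3"])

# A DataFrame is a table whose columns are Series
table = pd.DataFrame({"a": col_a, "b": col_b})

# Pulling one column back out gives a Series again
column = table["b"]

# Each value is addressed by a row label and a column label
value = table.loc["row2", "b"]

print(table)
print(type(column).__name__)
print(value)
```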
Series: the most important operations
We can get started with Pandas by creating a Series. We create a Series by calling the pd.Series() constructor and passing it a list of values. We then print the Series using the print() function. By default, Pandas numbers the index starting from 0, although index labels can also be defined explicitly.
import pandas as pd
series1 = pd.Series([1, 2, 3, 4])
print(series1)
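To see the difference between the default index and an explicit one, here is a small sketch. The labels "a" through "d" are arbitrary and chosen only for illustration.

```python
import pandas as pd

# Default index: Pandas numbers the values 0, 1, 2, 3
series1 = pd.Series([1, 2, 3, 4])

# Explicit index: we supply our own labels (arbitrary, for illustration)
series2 = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# Look up values by position-style label and by custom label
first_default = series1[0]
third_labeled = series2["c"]

# Arithmetic applies to every element and keeps the index
scaled = series2 * 10

print(list(series1.index))
print(first_default, third_labeled)
print(scaled)
```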
Let’s look at a more complex example. Run the code below.
import pandas as pd

df = pd.read_csv('test.csv')
print(df.columns)

print("\nThe original DataFrame:")
print(df.head())

print("\nThe new DataFrame with selected columns is:\n")
new_df = pd.DataFrame(df, columns=['Sex', 'Under 1', '40-44'])
print(new_df.head())
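The example above needs a test.csv file on disk. If you don't have one, the sketch below shows the same idea with an in-memory stand-in. The column names reuse the ones from the example, but the rows and the extra "1-4" column are invented, so treat the numbers as placeholders.

```python
import io
import pandas as pd

# Hypothetical stand-in for test.csv; all values are invented
csv_text = """Sex,Under 1,1-4,40-44
Male,10,40,55
Female,12,38,57
"""

# read_csv accepts a file-like object as well as a path
df = pd.read_csv(io.StringIO(csv_text))

# Select a subset of columns by passing a list of column names
new_df = df[["Sex", "Under 1", "40-44"]]

print(df.columns)
print(new_df.head())
```

Selecting with a list of names, as done here, raises an error if a column is missing, which can help catch typos in column names early.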
Reindex data in a DataFrame
We can also reindex the data, either by the row index or by the columns. Reindexing with reindex() returns a new object arranged according to the new labels, so we can make changes without altering the original object.
Note: The rules for reindexing are the same for Series and DataFrame objects.
# importing pandas in our program
import pandas as pd

# Defining a series object
srs1 = pd.Series([11.9, 36.0, 16.6, 21.8, 34.2],
                 index=['China', 'India', 'USA', 'Brazil', 'Pakistan'])

# Set Series name
srs1.name = "Growth Rate"

# Set index name
srs1.index.name = "Country"

srs2 = srs1.reindex(['China', 'India', 'Malaysia', 'USA', 'Brazil', 'Pakistan', 'England'])
print("The series with new indexes is:\n", srs2)

srs3 = srs1.reindex(['China', 'India', 'Malaysia', 'USA', 'Brazil', 'Pakistan', 'England'], fill_value=0)
print("\nThe series with new indexes is:\n", srs3)
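Since the same rules apply to DataFrame objects, here is a short sketch of reindexing both rows and columns of a small table. The "rank" and "notes" columns are hypothetical and exist only to illustrate the calls.

```python
import pandas as pd

# Small table; the "rank" column is hypothetical
df = pd.DataFrame(
    {"growth": [11.9, 36.0], "rank": [2, 1]},
    index=["China", "India"],
)

# Reindex rows: labels not present in df are filled with NaN
rows = df.reindex(["India", "China", "Malaysia"])

# Reindex columns: the new "notes" column is filled with 0 instead of NaN
cols = df.reindex(columns=["rank", "growth", "notes"], fill_value=0)

print(rows)
print(cols)

# The original DataFrame is unchanged
print(df)
```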
print(pd.concat([df1, df2]))
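The concat call above relies on two DataFrames, df1 and df2, that are not defined in this article. As a minimal sketch, assuming they are two tables with the same columns, the made-up frames below show what pd.concat() does with them.

```python
import pandas as pd

# Hypothetical DataFrames standing in for the undefined df1 and df2
df1 = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
df2 = pd.DataFrame({"id": [3, 4], "value": ["c", "d"]})

# Stack the rows of both tables; each keeps its own index labels
stacked = pd.concat([df1, df2])

# ignore_index=True renumbers the rows of the result from 0
renumbered = pd.concat([df1, df2], ignore_index=True)

print(stacked)
print(renumbered)
```

Without ignore_index=True the result contains repeated index labels, which can be surprising when you later select rows by label.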
Pretty simple, right? Some other common data wrangling processes that you should know are:
- Mapping data and finding duplicates
- Finding outliers in data
- Data Aggregation
- Reshaping data
- Replace & rename
- and more
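To give a feel for a few of these, here is a rough sketch on a tiny made-up table. The data is invented, and the outlier check uses a simple interquartile-range rule, which is just one of several reasonable approaches.

```python
import pandas as pd

# Hypothetical data with one repeated row and one extreme value
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C"],
    "temp": [20, 20, 22, 95, 21],
})

# Finding duplicates, then dropping them
dupes = df.duplicated()
deduped = df.drop_duplicates()

# Mapping values to new labels with a dictionary
labels = deduped["city"].map({"A": "Alpha", "B": "Beta", "C": "Gamma"})

# Finding outliers with a simple IQR rule (an assumption, not the only way)
q1, q3 = deduped["temp"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (deduped["temp"] < q1 - 1.5 * iqr) | (deduped["temp"] > q3 + 1.5 * iqr)
outliers = deduped[is_outlier]

# Aggregating: mean temperature per city
mean_by_city = deduped.groupby("city")["temp"].mean()

# Replacing a value and renaming a column
cleaned = deduped.replace({"temp": {95: 25}}).rename(columns={"temp": "temp_c"})

print(dupes)
print(labels)
print(outliers)
print(mean_by_city)
print(cleaned)
```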