Getting Datasets for ML in Python

Unlike other types of programming exercises, a machine learning project is a combination of code and data; you need both to accomplish the goal and do something useful. Many well-known datasets have been created over the years, and many of them have become standards or benchmarks. In this tutorial, we’ll look at how to easily obtain those well-known public datasets. We will also learn how to create a synthetic dataset if none of the existing ones meets our requirements.

After completing this tutorial, you will know:

  1. Where to find free datasets for machine learning projects
  2. How to download datasets using libraries in Python
  3. How to generate synthetic datasets using scikit-learn

 

Tutorial Overview

This tutorial is divided into four sections, which are as follows:

  1. Dataset Repositories
  2. Dataset retrieval in scikit-learn and Seaborn
  3. TensorFlow dataset retrieval
  4. Creating a dataset with scikit-learn

Dataset Repositories

Because machine learning has been developed for decades, there are some datasets of historical significance. One of the best-known repositories for these datasets is the UCI Machine Learning Repository. Most of the datasets there are small because the technology at the time was insufficient to handle larger amounts of data. Two well-known datasets housed in this repository are the iris flower dataset (introduced by Ronald Fisher in 1936) and the 20 newsgroups dataset (textual data, commonly referenced in the information retrieval literature).

Newer datasets tend to be larger. The ImageNet dataset, for example, is more than 160 GB in size. Such datasets are commonly hosted on Kaggle, where we can find them by searching by name. If we need to download one, we should create an account and use Kaggle’s command-line tool.

OpenML is a newer repository that hosts a large number of datasets. It is convenient because you can search for datasets by name, and it also provides a standardized web API for retrieving data. It is especially useful if you want to use Weka, since it offers files in ARFF format.

However, many datasets are publicly available but not housed in these repositories for a variety of reasons. You may also want to check Wikipedia’s “List of datasets for machine-learning research.” That page contains a long list of datasets grouped into different categories, with links for downloading them.

Dataset retrieval in scikit-learn and Seaborn

Those datasets can be obtained manually by downloading them from the web, whether through the browser, the command line, the wget tool, or network libraries such as requests in Python. Because some of these datasets have become standards or benchmarks, many machine learning libraries provide functions to retrieve them. For practical reasons, the datasets are usually not shipped with the libraries but are downloaded on demand when the functions are invoked. Consequently, you need a reliable internet connection to use them.
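As a quick illustration, below is a minimal sketch of downloading a CSV file with requests and reading it into pandas. The UCI iris URL and the column names are assumptions for demonstration; any direct link to a CSV file works the same way.

import io

import pandas as pd
import requests

# Assumed example URL: the UCI mirror of the iris data (adjust if it has moved)
URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Download the raw CSV text over HTTP
response = requests.get(URL, timeout=30)
response.raise_for_status()

# The file has no header row, so supply column names ourselves
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
df = pd.read_csv(io.StringIO(response.text), header=None, names=columns)
print(df.head())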

Scikit-learn is an example of a library that allows you to download datasets via its API. The related functions are defined under sklearn.datasets, and you can find a list of them here:

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets

For example, you can use the function load_iris() to get the iris flower dataset as follows:

import sklearn.datasets
# Load the features and targets as pandas objects, then attach the target as a column
data, target = sklearn.datasets.load_iris(return_X_y=True, as_frame=True)
data["target"] = target
print(data)

Unless the argument as_frame=True is specified, load_iris() returns NumPy arrays (which have no column headers) rather than pandas DataFrames. We also pass return_X_y=True so that only the machine learning features and targets are returned, rather than an object that also carries metadata such as the description of the dataset. The above code prints the following:

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                  5.1               3.5                1.4               0.2       0
1                  4.9               3.0                1.4               0.2       0
2                  4.7               3.2                1.3               0.2       0
3                  4.6               3.1                1.5               0.2       0
4                  5.0               3.6                1.4               0.2       0
..                 ...               ...                ...               ...     ...
145                6.7               3.0                5.2               2.3       2
146                6.3               2.5                5.0               1.9       2
147                6.5               3.0                5.2               2.0       2
148                6.2               3.4                5.4               2.3       2
149                5.9               3.0                5.1               1.8       2

[150 rows x 5 columns]
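If these arguments are omitted, load_iris() instead returns a Bunch object, a dictionary-like container that bundles the NumPy arrays with metadata such as the feature names and a description. A minimal sketch of inspecting it:

import sklearn.datasets

# Default call: returns a Bunch (dictionary-like) object
bunch = sklearn.datasets.load_iris()
print(bunch.keys())             # includes data, target, feature_names, DESCR, ...
print(bunch["feature_names"])   # column names for the feature matrix
print(bunch["DESCR"][:200])     # the beginning of the dataset description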

While it is convenient to keep the features and targets separate for training a scikit-learn model, combining them is beneficial for visualization. For example, we can combine them into one DataFrame as above and then use Seaborn to visualize the correlogram:

import sklearn.datasets
import matplotlib.pyplot as plt
import seaborn as sns

data, target = sklearn.datasets.load_iris(return_X_y=True, as_frame=True)
data["target"] = target

sns.pairplot(data, kind="scatter", diag_kind="kde", hue="target",
             palette="muted", plot_kws={'alpha':0.7})
plt.show()

The correlogram shows that target 0 is easily distinguished, but targets 1 and 2 frequently overlap. Because this dataset is also useful for demonstrating plotting functions, Seaborn provides an equivalent data loading function. We can rewrite the above as follows:

import matplotlib.pyplot as plt
import seaborn as sns

data = sns.load_dataset("iris")
sns.pairplot(data, kind="scatter", diag_kind="kde", hue="species",
             palette="muted", plot_kws={'alpha':0.7})
plt.show()

The set of datasets bundled with Seaborn is more limited. We can see the names of all supported datasets by running:

import seaborn as sns
print(sns.get_dataset_names())

which lists all the datasets available from Seaborn:

['anagrams', 'anscombe', 'attention', 'brain_networks', 'car_crashes',
'diamonds', 'dots', 'exercise', 'flights', 'fmri', 'gammas', 'geyser',
'iris', 'mpg', 'penguins', 'planets', 'taxis', 'tips', 'titanic']

There are a handful of similar functions to load the “toy datasets” from scikit-learn. For example, load_wine() and load_diabetes() are defined in a similar fashion, as the sketch below shows.
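For instance, a minimal sketch of loading the wine dataset with the same arguments (return_X_y and as_frame behave just as they do for load_iris()):

import sklearn.datasets

# Same calling convention as load_iris()
data, target = sklearn.datasets.load_wine(return_X_y=True, as_frame=True)
print(data.shape)        # feature matrix as a DataFrame
print(target.nunique())  # number of distinct classes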

Larger datasets are handled similarly. For example, fetch_california_housing() needs to download the dataset from the internet (hence the “fetch” in the function name). Scikit-learn documentation calls these the “real-world datasets,” but the toy datasets are equally real.

import sklearn.datasets

data = sklearn.datasets.fetch_california_housing(return_X_y=False, as_frame=True)
data = data["frame"]
print(data)

The above prints:

    MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
0   8.3252  41.0      6.984127  1.023810   322.0      2.555556  37.88    -122.23      4.526
1   8.3014  21.0      6.238137  0.971880   2401.0     2.109842  37.86    -122.22      3.585
2   7.2574  52.0      8.288136  1.073446   496.0      2.802260  37.85    -122.24      3.521
3   5.6431  52.0      5.817352  1.073059   558.0      2.547945  37.85    -122.25      3.413
4   3.8462  52.0      6.281853  1.081081   565.0      2.181467  37.85    -122.25      3.422
... ...     ...       ...       ...        ...        ...       ...       ...          ...
20635 1.5603 25.0     5.045455  1.133333   845.0      2.560606  39.48    -121.09      0.781
20636 2.5568 18.0     6.114035  1.315789   356.0      3.122807  39.49    -121.21      0.771
20637 1.7000 17.0     5.205543  1.120092   1007.0     2.325635  39.43    -121.22      0.923
20638 1.8672 18.0     5.329513  1.171920   741.0      2.123209  39.43    -121.32      0.847
20639 2.3886 16.0     5.254717  1.162264   1387.0     2.616981  39.37    -121.24      0.894

[20640 rows x 9 columns]

If we require more than these, scikit-learn has a handy function that allows us to read any dataset from OpenML. For example:

import sklearn.datasets

data = sklearn.datasets.fetch_openml("diabetes", version=1, as_frame=True, return_X_y=False)
data = data["frame"]
print(data)

The above prints:

  preg  plas  pres  skin  insu  mass  pedi  age  class
0  6.0  148.0 72.0  35.0  0.0   33.6  0.627 50.0 tested_positive
1  1.0  85.0  66.0  29.0  0.0   26.6  0.351 31.0 tested_negative
2  8.0  183.0 64.0  0.0   0.0   23.3  0.672 32.0 tested_positive
3  1.0  89.0  66.0  23.0  94.0  28.1  0.167 21.0 tested_negative
4  0.0  137.0 40.0  35.0  168.0 43.1  2.288 33.0 tested_positive
..  ... ... ... ... ... ... ... ... ...
763 10.0 101.0 76.0 48.0 180.0 32.9   0.171 63.0 tested_negative
764 2.0  122.0 70.0 27.0 0.0   36.8   0.340 27.0 tested_negative
765 5.0  121.0 72.0 23.0 112.0 26.2   0.245 30.0 tested_negative
766 1.0  126.0 60.0 0.0  0.0   30.1   0.349 47.0 tested_positive
767 1.0  93.0  70.0 31.0 0.0   30.4   0.315 23.0 tested_negative

[768 rows x 9 columns]

On OpenML, we should not always identify a dataset by name, because there may be multiple datasets with the same name. Instead, we can look up the data ID on OpenML and use it in the function as follows:

import sklearn.datasets

data = sklearn.datasets.fetch_openml(data_id=42437, return_X_y=False, as_frame=True)
data = data["frame"]
print(data)

In the code above, the data ID refers to the Titanic dataset. To demonstrate how we can obtain the Titanic dataset and then run logistic regression on it, we can extend the code as follows:

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import fetch_openml

X, y = fetch_openml(data_id=42437, return_X_y=True, as_frame=False)
clf = LogisticRegression(random_state=0).fit(X, y)
print(clf.score(X,y)) # accuracy
print(clf.coef_)     # coefficients of the logistic regression

The above prints the accuracy followed by the coefficients:

0.8114478114478114
[[-0.7551392 2.24013347 -0.20761281 0.28073571 0.24416706 -0.36699113
0.4782924 ]]

TensorFlow dataset retrieval

Besides scikit-learn, TensorFlow is another tool we can use for machine learning projects. For similar reasons, there is a dataset API for TensorFlow that provides datasets in a format that works directly with TensorFlow. Unlike scikit-learn’s, this API is not included in the standard TensorFlow package. You must run the following command to install it:

pip install tensorflow-datasets

The catalog contains a list of all datasets:

https://www.tensorflow.org/datasets/catalog/overview#all_datasets

A name is assigned to each dataset. The names can be found in the above catalog. You can also obtain a list of names by doing the following:

import tensorflow_datasets as tfds
print(tfds.list_builders())

which includes over 1,000 names.

Let’s take the MNIST handwritten digits dataset as an example. We can get the data by doing the following:

import tensorflow_datasets as tfds
ds = tfds.load("mnist", split="train", shuffle_files=True)
print(ds)

This shows us that tfds.load() gives us an object of type tensorflow.data.OptionsDataset:

<_OptionsDataset shapes: {image: (28, 28, 1), label: ()}, types: {image: tf.uint8, label: tf.int64}>

In particular, this dataset holds the data instances (images) as arrays of shape (28, 28, 1), and the targets (labels) as scalars.
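For instance, a small sketch of taking a single record from the dataset and inspecting its contents (each element is a dictionary with image and label keys):

import tensorflow_datasets as tfds

ds = tfds.load("mnist", split="train", shuffle_files=True)

# Take a single record and inspect its contents
for example in ds.take(1):
    print(example["image"].shape)   # expected: (28, 28, 1)
    print(example["label"])         # a scalar tensor holding the digit class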

With minor polishing, the data is ready for use in the Keras fit() function. An example is as follows:

import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dense, AveragePooling2D, Dropout, Flatten
from tensorflow.keras.callbacks import EarlyStopping

# Read data with train-test split
ds_train, ds_test = tfds.load("mnist", split=['train', 'test'],
                              shuffle_files=True, as_supervised=True)

# Set up BatchDataset from the OptionsDataset object
ds_train = ds_train.batch(32)
ds_test = ds_test.batch(32)

# Build LeNet5 model and fit
model = Sequential([
    Conv2D(6, (5,5), input_shape=(28,28,1), padding="same", activation="tanh"),
    AveragePooling2D((2,2), strides=2),
    Conv2D(16, (5,5), activation="tanh"),
    AveragePooling2D((2,2), strides=2),
    Conv2D(120, (5,5), activation="tanh"),
    Flatten(),
    Dense(84, activation="tanh"),
    Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["sparse_categorical_accuracy"])
earlystopping = EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)
model.fit(ds_train, validation_data=ds_test, epochs=100, callbacks=[earlystopping])

Because we passed as_supervised=True, the dataset consists of records of tuples (features, targets) instead of dictionaries, which is what Keras expects. Moreover, to use the dataset in the fit() function, we need to create an iterable of batches. This is done by setting the batch size of the dataset, which converts it from an OptionsDataset object into a BatchDataset object.

We applied the LeNet5 model for image classification. But since the target in the dataset is an integer from 0 to 9 rather than a one-hot (Boolean) vector, we ask Keras to compare the softmax output vector against the integer label when computing accuracy and loss by specifying sparse_categorical_accuracy and sparse_categorical_crossentropy in the compile() function.
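Continuing from the listing above, a minimal sketch of the alternative: one-hot encode the labels yourself with a map() transform and then compile with the plain categorical loss and metric. The depth of 10 matches the ten digit classes.

import tensorflow as tf

# Continuing from the listing above: one-hot encode the integer labels ourselves
ds_train_onehot = ds_train.map(lambda image, label: (image, tf.one_hot(label, depth=10)))
ds_test_onehot = ds_test.map(lambda image, label: (image, tf.one_hot(label, depth=10)))

# With one-hot targets, the non-sparse loss and metric can be used instead
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["categorical_accuracy"])
model.fit(ds_train_onehot, validation_data=ds_test_onehot, epochs=100,
          callbacks=[earlystopping])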

The key point here is to recognize that each dataset has a unique shape. When you use it with your TensorFlow model, you must modify it to fit the dataset.

Creating a dataset with scikit-learn

There is a set of very useful functions in scikit-learn for generating a dataset with specific properties. Because we can manipulate the properties of the synthetic dataset, we can assess the performance of our models in a situation that is not commonly seen in other datasets.

The scikit-learn documentation refers to these functions as the samples generator. They are simple to use; for example:

from sklearn.datasets import make_circles
import matplotlib.pyplot as plt

data, target = make_circles(n_samples=500, shuffle=True, factor=0.7, noise=0.1)
plt.figure(figsize=(6,6))
plt.scatter(data[:,0], data[:,1], c=target, alpha=0.8, cmap="Set1")
plt.show()

The make_circles() function generates coordinates of scattered points in a 2D plane such that the two classes are positioned as concentric circles. The parameters factor and noise control the size and overlap of the circles. Because there is no linear separator, this synthetic dataset is useful for evaluating classification models such as a support vector machine, as the sketch below shows.
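For example, a minimal sketch of fitting a kernel support vector machine on this synthetic data (the RBF kernel and the training-set score here are our choices for illustration):

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Generate the concentric-circles data as above
data, target = make_circles(n_samples=500, shuffle=True, factor=0.7, noise=0.1)

# An RBF-kernel SVM can separate the two rings even though no linear separator exists
clf = SVC(kernel="rbf")
clf.fit(data, target)
print(clf.score(data, target))   # training accuracy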

The output from make_circles() is always in two classes, and the coordinates are always in 2D. But some other functions can generate points of more classes or in higher dimensions, such as make_blobs(). In the example below, we generate a dataset in 3D with 4 classes:

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

data, target = make_blobs(n_samples=500, n_features=3, centers=4,
                          shuffle=True, random_state=42, cluster_std=2.5)

fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(projection='3d')
ax.scatter(data[:,0], data[:,1], data[:,2], c=target, alpha=0.7, cmap="Set1")
plt.show()

There are also some functions to generate a dataset for regression problems. For example, make_s_curve() and make_swiss_roll() will generate coordinates in 3D with targets as continuous values.

from sklearn.datasets import make_s_curve, make_swiss_roll
import matplotlib.pyplot as plt

data, target = make_s_curve(n_samples=5000, random_state=42)

fig = plt.figure(figsize=(15,8))
ax = fig.add_subplot(121, projection='3d')
ax.scatter(data[:,0], data[:,1], data[:,2], c=target, alpha=0.7, cmap="viridis")

data, target = make_swiss_roll(n_samples=5000, random_state=42)
ax = fig.add_subplot(122, projection='3d')
ax.scatter(data[:,0], data[:,1], data[:,2], c=target, alpha=0.7, cmap="viridis")

plt.show()

If we prefer not to look at the data from a geometric perspective, there are also make_classification() and make_regression(). Compared with the other functions, these two give us more control over the feature set, such as introducing redundant or irrelevant features.

Below is an example of using make_regression() to generate a dataset and run linear regression with it:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
import numpy as np

# Generate 10-dimensional features and 1-dimensional targets
X, y = make_regression(n_samples=500, n_features=10, n_targets=1, n_informative=4,
                       noise=0.5, bias=-2.5, random_state=42)

# Run linear regression on the data
reg = LinearRegression()
reg.fit(X, y)

# Print the coefficient and intercept found
with np.printoptions(precision=5, linewidth=100, suppress=True):
    print(np.array(reg.coef_))
    print(reg.intercept_)

We created 10-dimensional features in the preceding example, but only four of them are informative. As a result of the regression, we discovered that only four of the coefficients are significantly non-zero.

[-0.00435 -0.02232 19.0113 0.04391 46.04906 -0.02882 -0.05692 28.61786 -0.01839 16.79397]
-2.5106367126731413

An example of using make_classification() similarly is as follows. A support vector machine classifier is used in this case:

from sklearn.datasets import make_classification
from sklearn.svm import SVC
import numpy as np

# Generate 10-dimensional features and 3-class targets
X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=4, n_redundant=2, n_repeated=1,
                           random_state=42)

# Run SVC on the data
clf = SVC(kernel="rbf")
clf.fit(X, y)

# Print the accuracy
print(clf.score(X, y))

 

Summary

In this tutorial, you learned about different ways to load or generate common datasets in Python.

You specifically learned:

  1. How to load common machine learning datasets using the dataset APIs in scikit-learn, Seaborn, and TensorFlow
  2. The minor differences in the format of the datasets returned by the various APIs, and how to use them
  3. How to generate a synthetic dataset with scikit-learn
