A tutorial on statistical-learning for scientific data processing

[ad_1]

 

Statistical learning

Machine learning is a technique with a growing importance, as the size of the datasets experimental sciences are facing is rapidly growing. Problems it tackles range from building a prediction function linking different observations, to classifying observations, or learning the structure in an unlabeled dataset.

This tutorial will explore statistical learning, the use of machine learning techniques with the goal of statistical inference: drawing conclusions on the data at hand.

Scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages

Datasets

Scikit-learn deals with learning information from one or more datasets that are represented as 2D arrays. They can be understood as a list of multi-dimensional observations. We say that the first axis of these arrays is the samples axis, while the second is the features axis.

A simple example shipped with scikit-learn: iris dataset>>>

>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> data = iris.data
>>> data.shape
(150, 4)

It is made of 150 observations of irises, each described by 4 features: their sepal and petal length and width, as detailed in iris.DESCR.

When the data is not initially in the (n_samples, n_features) shape, it needs to be preprocessed in order to be used by scikit-learn.

An example of reshaping data would be the digits dataset

../../_images/sphx_glr_plot_digits_last_image_001.png

The digits dataset is made of 1797 8×8 images of hand-written digits>>>

>>> digits = datasets.load_digits()
>>> digits.images.shape
(1797, 8, 8)
>>> import matplotlib.pyplot as plt 
>>> plt.imshow(digits.images[-1], cmap=plt.cm.gray_r) 
<matplotlib.image.AxesImage object at ...>

To use this dataset with scikit-learn, we transform each 8×8 image into a feature vector of length 64>>>

>>> data = digits.images.reshape((digits.images.shape[0], -1))

Estimators objects

Fitting data: the main API implemented by scikit-learn is that of the estimator. An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm or a transformer that extracts/filters useful features from raw data.

All estimator objects expose a fit method that takes a dataset (usually a 2-d array):>>>

>>> estimator.fit(data)

Estimator parameters: All the parameters of an estimator can be set when it is instantiated or by modifying the corresponding attribute:>>>

>>> estimator = Estimator(param1=1, param2=2)
>>> estimator.param1
1

Estimated parameters: When data is fitted with an estimator, parameters are estimated from the data at hand. All the estimated parameters are attributes of the estimator object ending by an underscore:>>>

>>> estimator.estimated_param_ 

[ad_2]

Source link