# Big Data Analytics Algorithm

Getting started with your advanced analytics initiatives can seem like a daunting task, but these five fundamental algorithms can make your work easier.

There is a fervor in the air when it comes to big data and advanced analytics. Top analyst firms have written extensively on how initiatives around these concepts can revolutionize businesses in a digital era. Fortune 500 companies around the world are investing heavily in big data and advanced analytics and are seeing direct benefits to their top and bottom lines. The problem is that many other companies want similar results but are not sure where to start.

Advanced analytics often starts with a single use case. This involves applying new methods of data transformation and analysis to uncover previously unknown trends and patterns within your data. When this new information is then applied to business processes and operating norms, it has the potential to transform your business.

To extract greater value from your data, put these five categories of algorithms to work.

Linear Regression

Linear regression is one of the most basic algorithms of advanced analytics. This also makes it one of the most widely used. People can easily visualize how it works and how the input data is related to the output data.

Linear regression uses the relationship between two sets of continuous quantitative measures. The first set is called the predictor or independent variable. The other is the response or dependent variable. The goal of linear regression is to express that relationship as a formula that describes the dependent variable in terms of the independent variable. Once this relationship is quantified, the dependent variable can be predicted for any value of the independent variable.

One of the most common independent variables used is time. Whether your dependent variable is revenue, costs, customers, use, or productivity, if you can define the relationship it has with time, you can forecast a value with linear regression.
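As a rough illustration of the idea above, the following sketch fits a straight line to a quantity measured over time using ordinary least squares and then forecasts the next period. The function names `fit_line` and `predict` and the monthly revenue figures are made up for this example; real data will not fall exactly on a line.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Evaluate the fitted formula y = slope * x + intercept."""
    return slope * x + intercept


# Hypothetical data: month number (independent) and revenue (dependent).
months = [1, 2, 3, 4, 5]
revenue = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months, revenue)
forecast = predict(slope, intercept, 6)
print(forecast)  # 22.0
```

Here time is the independent variable and revenue is the dependent variable; the fitted slope is the estimated change in revenue per month.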

Logistic Regression

Logistic regression sounds similar to linear regression but is actually focused on problems involving categorization instead of quantitative forecasting. Here the output variable takes a finite set of discrete values, rather than the continuous range of values predicted by linear regression.

The goal of logistic regression is to determine whether an instance of the input data fits within a category or not. The output of logistic regression is a value between 0 and 1, which can be read as the estimated probability of membership. Results closer to 1 indicate that the instance more clearly fits within the category. Results closer to 0 indicate that the instance likely does not fit within the category.
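To make the 0-to-1 output concrete, here is a minimal sketch of logistic regression with a single input, fitted by plain gradient descent on the log loss. The names `fit_logistic` and `predict_probability`, the learning rate, the number of epochs, and the purchase data are all assumptions chosen for illustration; production libraries use more robust optimizers.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, labels, lr=0.1, epochs=2000):
    """Fit a weight w and bias b for one input by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, labels):
            error = sigmoid(w * x + b) - y  # predicted probability minus label
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict_probability(w, b, x):
    """Estimated probability that x belongs to the category."""
    return sigmoid(w * x + b)


# Hypothetical data: prior purchases -> bought again (1) or not (0).
purchases = [0, 1, 2, 3, 4, 5, 6, 7]
bought_again = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(purchases, bought_again)
p_new_customer = predict_probability(w, b, 0)
p_loyal_customer = predict_probability(w, b, 7)
```

A yes or no answer is then obtained by comparing the probability to a cutoff such as 0.5; the choice of cutoff is a business decision rather than part of the algorithm.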

Logistic regression is often used to answer clearly defined yes or no questions. Will a customer buy again? Is a buyer creditworthy? Will the prospect become a customer? Predicting the answers to these questions can trigger a series of actions within the business process that help drive future revenue.

Classification and Regression Trees

Classification and regression trees use a series of decisions to categorize data. Each decision is based on a question related to one of the input variables. With each question and corresponding response, the instance of data gets moved closer to being categorized in a specific way. This set of questions and responses and the subsequent divisions of data create a tree-like structure. At the end of each line of questions is a category (or, in a regression tree, a predicted value). This is called the leaf node of the tree.
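The structure described above can be sketched as a small data structure plus a loop that follows the questions down to a leaf. This example only shows how an already-built tree classifies an instance, not how the questions are learned from data. The `Leaf` and `Question` classes, the feature names, the thresholds, and the risk categories are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    category: str


@dataclass
class Question:
    feature: str
    threshold: float
    if_true: Union["Question", Leaf]   # followed when value <= threshold
    if_false: Union["Question", Leaf]  # followed when value > threshold


def classify(node, instance):
    """Answer questions from the root down until a leaf is reached."""
    while isinstance(node, Question):
        if instance[node.feature] <= node.threshold:
            node = node.if_true
        else:
            node = node.if_false
    return node.category


# Hypothetical hand-built tree with three possible categories.
tree = Question(
    "income", 30000,
    Leaf("high risk"),
    Question("late_payments", 2, Leaf("low risk"), Leaf("medium risk")),
)

applicant = {"income": 50000, "late_payments": 0}
result = classify(tree, applicant)
print(result)  # low risk
```

Because each prediction is just a path of answered questions, the route taken for any instance can be printed or drawn, which is what makes trees easy to explain.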

These classification trees can become quite large and complex. One method of controlling the complexity is pruning the tree, that is, intentionally removing levels of questioning. What matters most is a model that works well with all instances of input values, both those seen in training and those that are not. Preventing overfitting requires a delicate balance between exact fit to the training data and abstraction.

A variant of classification and regression trees is called random forests. Instead of constructing a single tree with many branches of logic, a random forest is a collection of many small and simple trees that each evaluate the instances of data and determine a categorization. Once all of these simple trees complete their data evaluation, the process merges the individual results to create a final prediction of the category based on the composite of the smaller categorizations. This is commonly referred to as an ensemble method. Random forests often do well at balancing exact fit and abstraction and have been implemented successfully in many business cases.
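The merging step can be illustrated with a simple majority vote. Note that this sketch covers only the voting: a real random forest also trains each tree on a random sample of the data and features, which is omitted here. The three one-question "trees" and the customer fields are invented for the example.

```python
from collections import Counter


def majority_vote(trees, instance):
    """Ask every tree for a category and return the most common answer."""
    votes = [tree(instance) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical one-question trees, each looking at a different input.
def by_visits(customer):
    return "buyer" if customer["visits"] > 3 else "browser"


def by_cart_value(customer):
    return "buyer" if customer["cart_value"] > 50 else "browser"


def by_email(customer):
    return "buyer" if customer["opened_email"] else "browser"


forest = [by_visits, by_cart_value, by_email]
customer = {"visits": 5, "cart_value": 20, "opened_email": True}

prediction = majority_vote(forest, customer)
print(prediction)  # buyer
```

Two of the three trees vote "buyer", so the ensemble predicts "buyer" even though one tree disagrees.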

In contrast to logistic regression, which focuses on a yes or no categorization, classification and regression trees can be used to predict multivalue categorizations. They are also easier to visualize, since you can see the definitive path that guided the algorithm to a specific categorization.

K-Nearest Neighbors

K-nearest neighbors is also a classification algorithm. It is known as a “lazy learner” because the training phase of the process is very limited: learning consists of simply storing the training set. When a new instance is evaluated, its distance to each data point in the training set is computed, and the k closest training instances reach a consensus decision, typically a majority vote, as to which category the new instance falls into.
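A minimal sketch of this process follows, assuming Euclidean distance and a simple majority vote among the k closest points. The function name `knn_classify`, the choice of k = 3, and the sample points are assumptions for illustration. It uses `math.dist`, which requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_classify(training, new_point, k=3):
    """training is a list of (point, category) pairs kept as-is ("lazy")."""
    # Every prediction measures the distance to every stored point.
    by_distance = sorted(
        training, key=lambda item: math.dist(item[0], new_point)
    )
    nearest = [category for _, category in by_distance[:k]]
    return Counter(nearest).most_common(1)[0][0]


# Hypothetical training set: two-dimensional points with known categories.
training = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"),
    ((7.0, 5.5), "B"),
    ((6.5, 7.0), "B"),
]

label = knn_classify(training, (2.0, 2.0))
print(label)  # A
```

There is no fitting step at all: the cost is paid at prediction time, when the new point is compared against the whole training set.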

This algorithm can be computationally expensive depending on the size and scope of the training set. As each new instance has to be compared to all instances of the training data set and a distance derived, this process can use many computing resources each time it runs.

This categorization algorithm allows for multivalued categorizations of the data. On the downside, noisy training data tends to skew classifications.

K-nearest neighbors is often chosen because it is easy to use and easy to train, and its results are easy to interpret. It is often used in search applications when you are trying to find similar items.

K-Means Clustering

K-means clustering focuses on creating groups of instances with related attributes. These groups are referred to as clusters. Once these clusters are created, other instances can be evaluated against them to see where they best fit.

This technique is often used as part of data exploration. To start, the analyst specifies the number of clusters. The k-means process then breaks the data into that number of clusters by grouping similar data points around a common hub, called the centroid. These clusters are not the same as categories because initially they do not have business meaning. They are just closely related instances of input variables. Once these clusters are identified and analyzed, they can be converted to categories and given a name that has business meaning.
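The loop behind this process can be sketched as follows: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until the assignments stop changing. This is a simplified version that assumes Euclidean distance and naively uses the first k points as starting centroids; practical implementations choose starting centroids more carefully. The function name `k_means` and the sample points are hypothetical.

```python
import math


def k_means(points, k, max_iterations=100):
    """Return (centroids, labels) for points given as coordinate tuples."""
    centroids = [points[i] for i in range(k)]  # naive start: first k points
    labels = None
    for _ in range(max_iterations):
        # Assignment step: index of the nearest centroid for each point.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the clusters have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels


# Hypothetical data: two visibly separate groups of points.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
    (8.0, 8.0), (9.0, 9.5), (8.5, 9.0),
]

centroids, labels = k_means(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The labels 0 and 1 are arbitrary cluster numbers with no business meaning of their own; naming them is the analyst's job once the clusters have been examined.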

K-means clustering is often used because it is simple to use and explain and because it is fast. One caveat is that k-means clustering is extremely sensitive to outliers. These outliers can significantly shift the nature and definition of the clusters and ultimately the results of the analysis.

These are some of the most popular algorithms in use in advanced analytics initiatives. Each has pros and cons and different ways in which it can be effectively used to generate business value. The end goal in implementing these algorithms is to refine the data to a point where the resulting information can be applied to business decisions. It is this process of informing downstream processes with more refined and higher-value data that is fundamental to companies truly harnessing the value of their data and achieving the results they desire.