Machine Learning is the branch of science that studies how computers can learn without being explicitly programmed. ML is one of the most exciting technologies. As the name implies, it provides the computer with a feature that makes it more human-like: the ability to learn. Machine learning is being used actively today, possibly in many more places than one would expect.
Let’s explore this topic in detail. This blog answers the most common questions about machine learning, so stay tuned till the end; it will be worth it.
What is machine learning?
Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.
Machine Learning Formal Definitions:
· Arthur Samuel (1959) — Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed.
· Tom Mitchell (1997) — A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
How is Machine Learning related to artificial intelligence, deep learning and data science?
· Artificial Intelligence — AI is the broad concept of developing machines that can simulate human thinking, reasoning and behaviour.
· Machine Learning — ML is a subset of AI wherein computer systems learn from the environment and, in turn, use these learnings to improve experiences and processes. All machine learning is AI, but not all AI is machine learning.
· Deep Learning — DL is part of a broader family of machine learning methods based on artificial neural networks. DL uses multiple layers to progressively extract higher-level features from the raw input.
· Data Science — Data Science is the processing and analysis of data and the extraction of relevant insights from it. It’s about finding hidden patterns in the data. A Data Scientist makes use of machine learning in order to predict future events.
What is supervised learning and which ML algorithms come under it? What is unsupervised learning and which ML algorithms come under it?
Supervised learning is a machine learning approach that’s defined by its use of labelled datasets. These datasets are designed to train or “supervise” algorithms into classifying data or predicting outcomes accurately. Using labelled inputs and outputs, the model can measure its accuracy and learn over time.
Supervised learning can be separated into two types of problems: classification and regression:
1. Classification problems use an algorithm to accurately assign test data into specific categories, such as separating apples from oranges. Or, in the real world, supervised learning algorithms can be used to classify spam in a separate folder from your inbox. Linear classifiers, support vector machines, decision trees and random forest are all common types of classification algorithms.
2. Regression is another type of supervised learning method that uses an algorithm to understand the relationship between dependent and independent variables. Regression models are helpful for predicting numerical values based on different data points, such as sales revenue projections for a given business. Some popular regression algorithms are linear regression and polynomial regression. Logistic regression, despite its name, predicts class probabilities and is normally used for classification.
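To make the regression idea concrete, here is a minimal sketch in plain Python that fits a straight line to a handful of labelled points using the closed-form least-squares solution. The data (`spend`, `revenue`) and the helper names `fit_line` and `predict` are made up for illustration; a real project would more likely use a library.

```python
# Toy labelled data (made up for illustration): advertising spend -> sales revenue.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
revenue = [3.1, 4.9, 7.2, 8.8, 11.0]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both without the 1/n factor).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(spend, revenue)


def predict(x):
    """Predict a numerical value for an unseen input."""
    return slope * x + intercept


print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction for 6.0: {predict(6.0):.2f}")
```

The labels (`revenue`) are what make this supervised: the line is chosen so that its predictions match the known outputs as closely as possible.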
Unsupervised learning uses machine learning algorithms to analyse and cluster unlabelled data sets. These algorithms discover hidden patterns in data without the need for human intervention (hence, they are “unsupervised”). Unsupervised learning models are used for three main tasks: clustering, association and dimensionality reduction:
1. Clustering is a data mining technique for grouping unlabelled data based on their similarities or differences. For example, K-means clustering algorithms assign similar data points into groups, where the K value is the number of groups and so controls the granularity of the clustering. This technique is helpful for market segmentation, image compression, etc.
2. Association is another type of unsupervised learning method that uses different rules to find relationships between variables in a given dataset. These methods are frequently used for market basket analysis and recommendation engines, along the lines of “Customers Who Bought This Item Also Bought” recommendations.
3. Dimensionality reduction is a learning technique used when the number of features (or dimensions) in a given dataset is too high. It reduces the number of data inputs to a manageable size while also preserving data integrity. Often, this technique is used in the pre-processing data stage, such as when autoencoders remove noise from visual data to improve picture quality.
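As a rough illustration of clustering, the sketch below implements a tiny K-means for 2-D points in plain Python. It is a simplified version under stated assumptions: the centroids start at the first K points (real implementations usually pick them randomly), the number of iterations is fixed, and the `points` data is made up.

```python
def kmeans(points, k, iterations=10):
    """Very small k-means for 2-D points; centroids start at the first k points."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster ends up empty
                centroids[i] = (
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                )
    return centroids, clusters


# Unlabelled toy data (made up): two visually separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)

for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

No labels are passed in anywhere; the grouping comes purely from the distances between points, and K is something we choose up front.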
What Are The Popular Machine Learning Algorithms That we use often?
1. Linear regression
2. Logistic regression
3. Decision tree
4. SVM algorithm
5. Naive Bayes algorithm
6. KNN algorithm
7. K-means
8. Random forest algorithm
9. Dimensionality reduction algorithms
10. Gradient boosting algorithm and AdaBoosting algorithm
What Are The Popular Deep Learning Algorithms That we use often?
1. Convolutional Neural Networks (CNNs)
2. Long Short-term Memory Networks (LSTMs)
3. Recurrent Neural Networks (RNNs)
4. Generative Adversarial Networks (GANs)
5. Radial Basis Function Networks (RBFNs)
What are hyperparameters and how do they differ from parameters?
Hyperparameters:
Hyperparameters are parameters whose values control the learning process and determine the values of model parameters that a learning algorithm ends up learning. The prefix ‘hyper’ suggests that they are ‘top-level’ parameters that control the learning process and the model parameters that result from it.
As a machine learning engineer designing a model, you choose and set hyperparameter values that your learning algorithm will use before the training of the model even begins. In this light, hyperparameters are said to be external to the model because the model cannot change its values during learning/training.
Hyperparameters are used by the learning algorithm while it is learning, but they are not part of the resulting model. At the end of the learning process, we have the trained model parameters, which are effectively what we refer to as the model. The hyperparameters that were used during training are not part of this model. For instance, we cannot tell from the model itself what hyperparameter values were used to train it; we only know the model parameters that were learned.
Basically, anything in machine learning and deep learning whose value or configuration you choose before training begins, and whose value or configuration remains the same when training ends, is a hyperparameter.
Here are some common examples for Hyperparameters:
- Train-test split ratio
- Learning rate in optimization algorithms (e.g., gradient descent).
- Choice of optimization algorithm (e.g., gradient descent, stochastic gradient descent, or Adam optimizer).
- Choice of activation function in a neural network layer (e.g., Sigmoid, ReLU, Tanh).
- The choice of cost or loss function the model will use.
- Number of hidden layers in a neural network.
- Number of activation units in each layer.
- The drop-out rate in neural network (dropout probability).
- Number of iterations (epochs) in training a neural network.
- Number of clusters in a clustering task.
- Kernel or filter size in convolutional layers.
- Pooling size and batch size.
Parameters:
Parameters on the other hand are internal to the model. That is, they are learned or estimated purely from the data during training as the algorithm used tries to learn the mapping between the input features and the labels or targets. Model training typically starts with parameters being initialized to some values (random values or set to zeros).
As training/learning progresses, the initial values are updated using an optimization algorithm (e.g., gradient descent). The learning algorithm continuously updates the parameter values as learning progresses, but the hyperparameter values set by the model designer remain unchanged.
At the end of the learning process, model parameters are what constitute the model itself. Here are some common examples for Parameters:
· The coefficients (or weights) of linear and logistic regression models.
· Weights and biases of a neural network
· The cluster centroids in clustering
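The difference between the two is easy to see in a small training loop. In the sketch below (toy data and variable names are made up for illustration), `learning_rate` and `epochs` are hyperparameters that we fix before training, while `w` and `b` are parameters that gradient descent learns from the data.

```python
# Hyperparameters: chosen by us before training, never updated by the algorithm.
learning_rate = 0.05
epochs = 2000

# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# Parameters: initialised to zero, then learned from the data.
w, b = 0.0, 0.0

n = len(xs)
for _ in range(epochs):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # Only the parameters are updated; the hyperparameters stay as we set them.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned parameters: w={w:.3f}, b={b:.3f}")
print(f"hyperparameters (unchanged): learning_rate={learning_rate}, epochs={epochs}")
```

After training, `w` and `b` are the model; nothing in them reveals which learning rate or number of epochs was used.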
What is the purpose of feature scaling? Explain the difference between Normalization and Standardization.
Feature scaling is a method used to bring the independent variables or features of a dataset onto a comparable range, so that no feature dominates simply because of its units. In data processing, it is also known as data normalization and is generally performed during the data pre-processing step.
Normalization:
Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling. Here’s the formula for normalization:
$$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$
Here, X(max) and X(min) are the maximum and the minimum values of the feature respectively.
- When the value of X is the minimum value in the column, the numerator will be 0, and hence X’ is 0.
- On the other hand, when the value of X is the maximum value in the column, the numerator is equal to the denominator and thus the value of X’ is 1.
- If the value of X is between the minimum and the maximum value, then the value of X’ is between 0 and 1.
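The three cases above can be checked with a few lines of Python. This is a minimal sketch; the function name `min_max_scale` and the `ages` data are made up for illustration, and returning zeros for a constant column is just one possible convention.

```python
def min_max_scale(values):
    """Rescale values to the range [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: the formula would divide by zero, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [18, 25, 40, 62]
scaled = min_max_scale(ages)
print(scaled)
```

The minimum (18) maps to 0, the maximum (62) maps to 1, and everything else lands in between.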
Standardization:
Standardization is another scaling technique where the values are centred around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.
Here’s the formula for standardization:
$$X' = \frac{X - \mu}{\sigma}$$
µ is the mean of the feature values and σ is the standard deviation of the feature values. Note that in this case, the values are not restricted to a particular range.
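Here is a small sketch of standardization using only Python's standard library. The function name `standardize` and the `heights` data are made up for illustration, and the population standard deviation (`pstdev`) is an assumption on our part; some tools use the sample standard deviation instead.

```python
from statistics import mean, pstdev


def standardize(values):
    """Centre values on the mean and divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # Constant column: nothing to scale, so return zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


heights = [150.0, 160.0, 170.0, 180.0, 190.0]
z = standardize(heights)
print([round(v, 3) for v in z])
print("mean:", round(mean(z), 6), "std:", round(pstdev(z), 6))
```

Unlike Min-Max scaling, the results are not confined to 0 and 1; they are simply expressed in units of standard deviations from the mean.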
Standardisation is generally preferred over normalisation in the machine learning context as it is especially important when comparing the similarities between features based on certain distance measures. This is most prominent in Principal Component Analysis (PCA), a dimensionality reduction algorithm, where we are interested in the components that maximise the variance in the data.
Normalisation, on the other hand, also has many practical applications, particularly in computer vision and image processing, where pixel intensities have to be normalised in order to fit within the RGB colour range between 0 and 255. Moreover, neural networks often train better when the input data is scaled to a small range such as 0 to 1 before model training.
At the end of the day, there is no definitive answer as to whether you should normalise or standardise your data. One can always apply both techniques and compare the model performance under each approach for the best result.
What is the activation function and why is it used? What types of activation functions do we have?
An Activation Function decides whether a neuron should be activated or not. This means that it will decide whether the neuron’s input to the network is important or not in the process of prediction using simpler mathematical operations.
The role of the Activation Function is to derive output from a set of input values fed to a node (or a layer).
Generally, we have four different types of activation functions that we can choose from. Of course, there are many more types, but these four are the predominant ones.
1) Threshold Function:
The Threshold Function is the simplest function: if the value is less than zero the function outputs zero, and if the value is more than zero it outputs one. It is basically a yes/no type of function.
2) Sigmoid Function:
The Sigmoid Function is the function used in logistic regression. It has a smooth, gradual progression: large negative inputs are mapped close to zero, large positive inputs are mapped close to one, and an input of zero is mapped to 0.5. It is especially useful in the output layer when we are trying to predict probabilities. The logistic sigmoid function can cause a neural network to get stuck during training, because its gradient becomes very small for large positive or negative inputs.
The SoftMax Function is a more generalized logistic activation function that is used for multiclass classification. The sigmoid function is mostly used in classification problems, since it squashes the output into a fixed range to which a threshold can be applied.
3) Rectifier Function:
The Rectifier Function (ReLU) is one of the most popular functions for artificial neural networks. It outputs zero for every input below zero, and from zero onwards the output grows linearly as the input value increases. It is used in almost all convolutional neural networks and deep learning models.
4) Hyperbolic Tangent tanh Function:
The Hyperbolic Tangent tanh Function is very similar to the sigmoid function, but here the hyperbolic tangent function goes below zero, and that could be useful in some applications. The advantage is that the negative inputs will be mapped strongly negative and the zero inputs will be mapped near zero in the tanh graph. The tanh function is mainly used in classification between two classes.
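The four functions above, plus SoftMax, are short enough to write out directly. The sketch below uses only Python's `math` module; the function names are our own, the value returned by `threshold` at exactly zero is a convention we picked, and `sigmoid` is written in its simplest form, which can overflow for very large negative inputs.

```python
import math


def threshold(x):
    """Binary step: 1 for positive inputs, 0 otherwise (0 at x == 0 by our convention)."""
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    """Smoothly squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Rectifier: zero for negative inputs, the input itself otherwise."""
    return max(0.0, x)


def tanh(x):
    """Like sigmoid in shape, but the output ranges over (-1, 1)."""
    return math.tanh(x)


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # shift for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


for x in (-2.0, 0.0, 2.0):
    print(x, threshold(x), round(sigmoid(x), 3), relu(x), round(tanh(x), 3))
print([round(p, 3) for p in softmax([1.0, 2.0, 3.0])])
```

Running it for a negative, a zero and a positive input shows the main differences: only tanh goes below zero, and only ReLU is unbounded for positive inputs.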
Some Popular types of activation functions and when to use them
· Binary Step
· Linear
· Sigmoid
· Tanh
· ReLU
· Leaky ReLU
· Parameterised ReLU
· Exponential Linear Unit
· Swish
· SoftMax
What are the Top Optimisation Methods in Machine Learning that we use?
· Gradient Descent.
· Stochastic Gradient Descent.
· Adaptive Learning Rate Method.
· Conjugate Gradient Method.
· Derivative-Free Optimisation.
· Zeroth Order Optimisation.
What is stochastic gradient descent (SGD) and how does it differ from gradient descent (GD)?
In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function.
In GD, you have to run through all the samples in your training set to make a single parameter update in a given iteration. In SGD, on the other hand, you use only one training sample, or a subset of training samples, to make the update. If you use a subset, it is called mini-batch stochastic gradient descent.
Thus, if the number of training samples is very large, using gradient descent may take too long, because in every iteration when you are updating the values of the parameters, you are running through the complete training set. On the other hand, using SGD will be faster because each update uses only one training sample (or a small batch).
SGD often converges much faster than GD, but the error function is not as well minimized as in the case of GD. In most cases, the close approximation that SGD gives for the parameter values is enough, because they reach the neighbourhood of the optimal values and keep oscillating there.
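The difference in how the two methods compute one update can be sketched as follows. This is a toy example under stated assumptions: the data is made up and noise-free, there is a single weight `w` with no intercept, and the function names, learning rate and random seed are our own choices.

```python
import random

# Toy data from y = 3x (made up for illustration); one weight w, no intercept.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]


def gradient(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2 * (w * x - y) * x


def batch_gd(w, learning_rate, steps):
    for _ in range(steps):
        # One update uses the average gradient over ALL samples.
        g = sum(gradient(w, x, y) for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * g
    return w


def sgd(w, learning_rate, steps, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        # One update uses the gradient of a single randomly chosen sample.
        i = rng.randrange(len(xs))
        w -= learning_rate * gradient(w, xs[i], ys[i])
    return w


w_gd = batch_gd(0.0, learning_rate=0.01, steps=200)
w_sgd = sgd(0.0, learning_rate=0.01, steps=200)
print(f"batch GD: w={w_gd:.4f}, SGD: w={w_sgd:.4f}")
```

Each `batch_gd` step touches all four samples, while each `sgd` step touches one, so an SGD step is cheaper. Because this toy data has no noise, both end up near w = 3; with noisy data, the SGD estimate would keep hovering around the minimum instead of settling exactly on it.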
How does regularisation prevent overfitting?
Let’s use a linear regression equation to explain regularization further.
$$Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$
Y represents the value that is to be predicted. Xi represents the i-th predictor (independent variable), and βi is the coefficient estimate, that is, the weight or magnitude assigned to that predictor. Here, i runs from 1 to p, and β0 is the intercept. A loss function is involved in the fitting process. It is computed from the difference between the actual and predicted output of the model.
A loss function provides a means of assessing how well a model fits the given data. It is minimized during fitting, which in turn optimizes the weights. In this context, the loss function is referred to as the residual sum of squares (RSS). For n training observations, where y_j is the actual output and x_ji the value of predictor i for observation j:
$$\text{RSS} = \sum_{j=1}^{n} \Big( y_j - \beta_0 - \sum_{i=1}^{p} \beta_i x_{ji} \Big)^2$$
Based on the training data, the loss function will adjust the coefficients. If the presence of noise or outliers is found in the training data, the approximated coefficients will not generalize well to the unseen data. Regularization comes into play and shrinks the learned estimates towards zero.
In other words, it tunes the loss function by adding a penalty term that prevents excessive fluctuation of the coefficients, thereby reducing the chances of overfitting.
Lasso regression:
Lasso regression is a regularization technique used to reduce model complexity. It is also known as L1 regularization. Lasso stands for Least Absolute Shrinkage and Selection Operator.
We note that it has a slight variation to the previously discussed loss function, with the introduction of a penalty term. To penalize highly fluctuating coefficients, lasso uses absolute values of the regression coefficients (∣β∣), so the quantity it minimizes is:
$$\text{RSS} + \lambda \sum_{i=1}^{p} |\beta_i|$$
Lasso shrinks the regression coefficients to regularize the model parameters. Sometimes, Lasso can reduce regression coefficients to exactly zero, which is particularly important when it comes to feature selection.
Feature selection refers to the process of choosing relevant variables and predictors to construct a model. Here, the feature selection process is attributed to the ability of lasso to reduce some regression coefficients to zero. It occurs after the regression coefficients are shrunk.
The predictors whose coefficients are reduced to zero will not be included in the final model. These are the predictors considered to have less importance. This is how some features are eliminated. However, every non-zero regression coefficient is selected for use in the model. This greatly assists in minimizing prediction errors.
Lasso also helps improve the prediction accuracy of the models. Shrinking the coefficients adds a small amount of bias but reduces the variance of the model, which usually improves how well it generalizes. A context that favours the use of lasso is when we have a small dataset with a large number of features.
The tuning parameter λ controls the shrinkage. From the equation, when λ is zero, the equation is reduced to the linear regression loss function equation. The greater the value of λ, the greater the reduction of the coefficients will be towards zero.
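The effect of λ can be seen in the simplest possible case. The sketch below assumes a single predictor and no intercept, so the lasso objective becomes sum((y - β·x)²) + λ·|β|, which has a closed-form solution via soft-thresholding. The function names and the toy data are made up for illustration; real problems with many predictors need an iterative solver.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t; return 0 if |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_one_feature(xs, ys, lam):
    """Minimise sum((y - beta*x)**2) + lam*|beta| for one predictor, no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Setting the derivative to zero gives beta * sxx = sxy - (lam / 2) * sign(beta).
    return soft_threshold(sxy, lam / 2) / sxx


# Toy data from y = 2x (made up for illustration).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for lam in (0.0, 14.0, 28.0, 100.0):
    print(f"lambda={lam}: beta={lasso_one_feature(xs, ys, lam)}")
```

With λ = 0 the result is the ordinary least-squares coefficient, and as λ grows the coefficient is pulled towards zero until it is dropped entirely, which is the feature-selection behaviour described above.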
How is deep learning better than machine learning?
Deep Learning is a class of machine learning algorithms. It uses a multi-layered system of filters to hierarchically extract useful features. This means that each successive layer receives as its input the output of the previous layer. The features of a higher level are derived from the features of a lower level.
Advantages of Deep Learning:
1. High-level Performance
In many areas like computer vision, speech recognition, and natural language processing, neural networks based on deep learning often outperform the methods used by classical machine learning by a wide margin: accuracy increases while the number of errors decreases.
2. Ability to Develop New Features
Classical machine learning presumes that humans engineer the features, an approach that is very time-consuming. Deep learning can generate new features from the limited number available in its training data set. In other words, deep learning algorithms can learn the intermediate representations they need to reach the current objective.
3. Advanced Analysis Capabilities
Many classical machine learning algorithms need labelled data to operate correctly. A system based on deep learning algorithms can become “smarter” by itself in the process of problem solving and can also work with unlabelled data.
4. Adaptability and Scalability
Deep learning methods are much easier to adapt to different areas than classical ML algorithms. This is possible thanks to transfer learning, in which a model that has already been trained on one task is reused for another, in most cases helping to achieve higher productivity in a shorter period of time.
Another important advantage is scalability. Neural networks can handle data growth better than classical machine learning algorithms. It is well demonstrated in the chart provided below.
What are the best Courses and Books for Machine Learning and Deep Learning?
Machine Learning:
Courses
- Machine Learning by Andrew Ng: Coursera.
- Machine Learning A-Z : Udemy.
Books
- Pattern Recognition and Machine Learning by Christopher Bishop
- An Introduction to Statistical Learning by Gareth M. James, Daniela Witten, Trevor Hastie and Robert Tibshirani
- Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
Deep Learning:
Courses
- Deep Learning Specialization by Andrew Ng : Coursera
- Deep Learning with PyTorch by Yann LeCun : YouTube
- Deep Learning with fast.ai by Jeremy Howard : fast.ai
Books
- Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville
- Deep Learning with Python by François Chollet
- Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
Moving forward, we will discuss many more algorithms involving image processing. Till then, we wish you good health and success. Do visit the official website of AlgoZenith if you wish to master Data Structures and Algorithms for your future internship or job tests.
Do let us know in the comments if you liked the content, and do check out our blog series on Finance and Product Management. Do check this blog if you are searching for an ultimate guide for your job or internship. Stay tuned for more such blogs. You can also check out the previous blogs of this series on Everything about Data Science and How to approach a Data Science project for a beginner. Keep learning, keep shining.