Deriving the Gradient Descent Rule for Linear Regression and Adaline
Linear regression and Adaptive Linear Neurons (Adalines) are closely related to each other. In fact, the Adaline algorithm is identical to linear regression except for a threshold function that converts the continuous output into a categorical class label:

$$\hat{y} = \begin{cases} 1 & \text{if } z \geq 0 \\ -1 & \text{otherwise,} \end{cases}$$

where $z$ is the net input, which is computed as the sum of the input features $\mathbf{x}$ multiplied by the model weights $\mathbf{w}$:

$$z = \sum_j w_j x_j = \mathbf{w}^T \mathbf{x}.$$

(Note that $x_0$ refers to the bias unit so that $x_0 = 1$.)
In the case of linear regression and Adaline, the activation function $g(\cdot)$ is simply the identity function, so that $g(z) = z$.
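As a minimal illustration (the numbers below are made up), the net input and identity activation can be computed as:

```python
import numpy as np

# Illustrative 2-feature sample; w[0] is the bias weight paired with x_0 = 1.
w = np.array([0.5, 1.0, -2.0])   # [w_0 (bias), w_1, w_2]
x = np.array([1.0, 3.0, 2.0])    # [x_0 = 1 (bias unit), x_1, x_2]

z = np.dot(w, x)                 # net input z = w^T x
y_hat = z                        # identity activation: g(z) = z

print(y_hat)                     # 0.5 + 3.0 - 4.0 = -0.5
```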
Now, in order to learn the optimal model weights $\mathbf{w}$, we need to define a cost function that we can optimize. Here, our cost function is the sum of squared errors (SSE), which we multiply by $\frac{1}{2}$ to make the derivation easier:

$$J(\mathbf{w}) = \frac{1}{2} \sum_i \left(y^{(i)} - \hat{y}^{(i)}\right)^2,$$

where $y^{(i)}$ is the label or target of the $i$th training point $\mathbf{x}^{(i)}$.
(Note that the SSE cost function is convex and differentiable.)
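For example, the SSE cost for a handful of hypothetical targets and predictions can be computed as:

```python
import numpy as np

# Illustrative targets and predictions for three training points.
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 1.5, 3.0])

# SSE cost J(w) = 1/2 * sum_i (y_i - y_hat_i)^2
cost = 0.5 * np.sum((y - y_hat) ** 2)
print(cost)  # 0.5 * (0.25 + 0.25 + 0.0) = 0.25
```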
In simple words, we can summarize gradient descent learning as follows:

1. Initialize the weights to 0 or small random numbers.
2. For $k$ epochs (passes over the training set):
    - For each training sample $\mathbf{x}^{(i)}$:
        - Compute the predicted output value $\hat{y}^{(i)}$.
        - Compare it to the actual output $y^{(i)}$ and compute the "weight update" value.
        - Accumulate the "weight update" value.
    - Update the weight coefficients by the accumulated "weight update" values.
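The steps above can be sketched in NumPy as follows (the toy data, learning rate, and epoch count are illustrative, not from the original text):

```python
import numpy as np

# A minimal batch gradient descent sketch for Adaline / linear regression.
# Targets follow y = 1 + 2*x; column 0 of X is the bias unit x_0 = 1.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

eta = 0.05                           # learning rate
w = np.zeros(X.shape[1])             # initialize weights to 0

for epoch in range(200):             # k epochs (passes over the training set)
    delta_w = np.zeros_like(w)       # accumulated "weight update" values
    for x_i, y_i in zip(X, y):       # for each training sample
        y_hat = np.dot(w, x_i)       # predicted output (identity activation)
        delta_w += eta * (y_i - y_hat) * x_i
    w += delta_w                     # update weights after the full pass

print(w)                             # converges toward [1., 2.]
```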
We can translate this procedure into more mathematical notation:
1. Initialize the weights to 0 or small random numbers.
2. For $k$ epochs:
    - For each training sample $\mathbf{x}^{(i)}$:
        - $\Delta w_j := \Delta w_j + \eta \left(y^{(i)} - \hat{y}^{(i)}\right) x_j^{(i)}$ for each weight $w_j$ (where $\eta$ is the learning rate);
    - $w_j := w_j + \Delta w_j$.
Performing this global weight update

$$\mathbf{w} := \mathbf{w} + \Delta \mathbf{w}, \quad \text{where } \Delta \mathbf{w} = -\eta \nabla J(\mathbf{w}),$$

can be understood as "updating the model weights by taking an opposite step towards the cost gradient scaled by the learning rate $\eta$," where the partial derivative of the cost with respect to each weight $w_j$ can be written as

$$\frac{\partial J}{\partial w_j} = -\sum_i \left(y^{(i)} - \hat{y}^{(i)}\right) x_j^{(i)}.$$
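As a quick sanity check on this partial derivative (the data and numbers below are illustrative), the analytic gradient of the SSE cost should match a central finite-difference estimate:

```python
import numpy as np

# Compare dJ/dw_j = -sum_i (y_i - y_hat_i) * x_ij against a numerical
# finite-difference estimate of the SSE cost (toy data, made-up weights).
X = np.array([[1.0, 0.5],
              [1.0, 2.0],
              [1.0, 3.5]])
y = np.array([1.0, 2.0, 4.0])
w = np.array([0.1, 0.3])

def cost(w):
    return 0.5 * np.sum((y - X @ w) ** 2)

analytic = -(X.T @ (y - X @ w))      # analytic gradient of the SSE cost

eps = 1e-6                           # finite-difference step size
numeric = np.zeros_like(w)
for j in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j] += eps
    w_minus[j] -= eps
    numeric[j] = (cost(w_plus) - cost(w_minus)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # close to 0
```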
To summarize: in order to use gradient descent to learn the model coefficients, we simply update the weights by taking a step in the opposite direction of the gradient for each pass over the training set; that's basically it. But how do we get to the equation

$$\frac{\partial J}{\partial w_j} = -\sum_i \left(y^{(i)} - \hat{y}^{(i)}\right) x_j^{(i)} \; ?$$
Let's walk through the derivation step by step.