In gradient boosting, the initial prediction y_hat is improved step by step; a loss function quantifies the difference between the predicted and true labels and guides each improvement.
Common choices are Mean Squared Error (MSE) for regression tasks and cross-entropy loss for classification tasks.
The negative derivative of the loss function with respect to y_hat, called the pseudo-residual, indicates what needs to be added to the model to decrease the loss.
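As a minimal sketch, assuming squared-error loss of the form L(y, y_hat) = (y - y_hat)^2 / 2, the negative derivative with respect to y_hat is simply y - y_hat, so the pseudo-residuals reduce to the ordinary residuals:

```python
import numpy as np

# For L(y, y_hat) = 0.5 * (y - y_hat)**2, the negative gradient with respect
# to y_hat is (y - y_hat): the pseudo-residual equals the ordinary residual.
def pseudo_residuals(y, y_hat):
    return y - y_hat

y = np.array([3.0, 5.0, 8.0])         # true labels (illustrative values)
y_hat = np.full_like(y, y.mean())     # initial constant prediction
r = pseudo_residuals(y, y_hat)        # targets for the next weak learner
```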
A second weak learner, f_1(x), is trained on the values of the pseudo-residual r, and adding it to the current model produces the new model F_1(x) = F_0(x) + gamma_0 * f_1(x).
The constant gamma_0 scales the output of f_1(x), controlling how much of f_1(x) is added to the model so that the loss is minimized.
Line search is typically used to find the optimal value for gamma_0 that minimizes the loss.
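One way to picture this line search is as a one-dimensional optimization over gamma; the sketch below assumes squared-error loss and uses SciPy's scalar minimizer (the function name find_gamma and its arguments are illustrative, not from the text above):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def find_gamma(y, F_prev, f_new):
    """Line search: choose gamma minimizing the squared-error loss of the
    updated model F_prev(x) + gamma * f_new(x) on the training set."""
    loss = lambda gamma: np.mean((y - (F_prev + gamma * f_new)) ** 2)
    return minimize_scalar(loss).x
```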
Creating subsequent models involves repeating the process: calculate the negative derivative of the loss at the current predictions, fit a new weak learner to those pseudo-residuals, and determine the new gamma value by line search.
The goal is to iteratively improve the model's prediction by adding new weak learners to the ensemble.
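To tie the steps together, the following sketch shows the full iterative loop, assuming squared-error loss and shallow scikit-learn regression trees as the weak learners (the function names gradient_boost and predict are illustrative, and the closed-form gamma is specific to squared-error loss):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100):
    """Minimal gradient boosting loop for squared-error loss."""
    F = np.full(len(y), y.mean())            # F_0: initial constant prediction
    trees, gammas = [], []
    for _ in range(n_rounds):
        r = y - F                            # pseudo-residuals (negative gradient of MSE)
        tree = DecisionTreeRegressor(max_depth=2).fit(X, r)
        f = tree.predict(X)
        gamma = np.dot(r, f) / np.dot(f, f)  # closed-form line search for squared error
        F = F + gamma * f                    # F_m(x) = F_{m-1}(x) + gamma * f_m(x)
        trees.append(tree)
        gammas.append(gamma)
    return y.mean(), trees, gammas

def predict(X, base, trees, gammas):
    """Ensemble prediction: initial constant plus the scaled weak learners."""
    return base + sum(g * t.predict(X) for g, t in zip(gammas, trees))
```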