HyperAI超神経

Exploring Batch and Stochastic Gradient Descent in Linear Regression for Predicting Credit Card Transactions

13 hours ago

Supervised learning is a type of machine learning that uses labeled datasets to train algorithms to predict outcomes and recognize patterns. Unlike unsupervised learning, supervised algorithms are given labeled training data from which to learn the relationship between inputs and outputs. In a regression problem, the model predicts continuous values from a set of input features \(x_i\).

Hypothesis and Cost Function

The prediction, known as the hypothesis \(h\), is defined as the dot product of the transposed parameter vector \(\theta\) and the feature vector \(x\):

\[ h_\theta(x) = \theta^\top x \]

The objective of linear regression is to minimize the mean squared error (MSE) between the predicted values and the actual values in the training dataset. The cost function \(J(\theta)\) is:

\[ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \]

where \(m\) is the number of training examples.

Batch Gradient Descent

Gradient descent is an optimization algorithm that finds a local minimum of a function by iteratively moving in the direction opposite to the steepest ascent. For a single parameter \(\theta_j\), the update rule is:

\[ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \]

where \(\alpha\) is the learning rate. The gradient \(\nabla_\theta J(\theta)\) is the vector of partial derivatives with respect to all parameters:

\[ \nabla_\theta J(\theta) = \frac{1}{m} X^\top (X\theta - y) \]

so the update rule in matrix notation is:

\[ \theta := \theta - \alpha \nabla_\theta J(\theta) \]

The algorithm keeps adjusting the parameters, moving "downhill" step by step, until the cost function converges.

Least Mean Squares (LMS) Rule

The LMS rule is an iterative update that, in each epoch, subtracts from each parameter a fraction of the average error over all training examples:

\[ \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \]

Repeating this update drives the parameters toward values that minimize the cost function.

Normal Equation

Alternatively, the optimal parameter vector \(\theta^*\) can be obtained from the normal equation, the analytical solution for linear regression:

\[ \theta^* = (X^\top X)^{-1} X^\top y \]

This method converges in a single step but requires \(X^\top X\) to be invertible, which implies that the input features are linearly independent.

Simulation

A simulation demonstrates how gradient descent finds the local minimum of a quadratic cost function at learning rates of 0.1, 0.3, 0.8, and 0.9:

```python
import numpy as np

def cost_func(x):
    return x**2 - 4 * x + 1

def gradient(x):
    return 2 * x - 4

def gradient_descent(gradient, start, learn_rate, max_iter, tol):
    x = start
    steps = [start]  # records learning steps
    for _ in range(max_iter):
        diff = learn_rate * gradient(x)
        if np.abs(diff) < tol:
            break
        x = x - diff
        steps.append(x)
    return x, steps
```

The learning rate controls the size of the steps taken during optimization.

Predicting Credit Card Transactions

A practical application predicts credit card transaction amounts using linear regression trained with batch gradient descent (BGD). The dataset includes transaction amounts, per capita income, yearly income, credit limits, credit scores, and current ages. During preprocessing, outliers beyond three standard deviations are removed and the target variable 'amount' is log-transformed to normalize its distribution, as sketched below.
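A minimal sketch of that outlier filter and log transform, assuming the merged transaction data sits in a pandas DataFrame named df with an 'amount' column (the DataFrame name and the use of log1p are illustrative assumptions, not code from the original write-up):

```python
import numpy as np

# df is a hypothetical pandas DataFrame holding the merged transaction, card, and user data.
# Keep only rows whose 'amount' lies within three standard deviations of the mean.
z_scores = (df["amount"] - df["amount"].mean()) / df["amount"].std()
df = df[np.abs(z_scores) <= 3]

# Log-transform the target to normalize its skewed distribution;
# log1p keeps the transform defined at amount == 0.
df["amount"] = np.log1p(df["amount"])
```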
Data Preprocessing

Loading and sanitizing data: transaction, card, and user data are loaded and sanitized, and fraud labels are added to the dataset. Continuous variables with a linear relationship to transaction amounts are selected, outliers are filtered, and the target variable is log-transformed.

Transforming data: categorical and numerical features are transformed with preprocessing pipelines, and the data is split into train and test sets before being preprocessed.

Defining the BGD Regressor

A custom class BatchGradientDescentLinearRegressor implements BGD with early stopping and L2 regularization:

```python
import numpy as np

class BatchGradientDescentLinearRegressor:
    def __init__(self, learning_rate=0.01, n_iterations=1000, l2_penalty=0.01, tol=1e-4, patience=10):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.l2_penalty = l2_penalty
        self.tol = tol
        self.patience = patience
        self.weights = None
        self.bias = None
        self.history = {'loss': [], 'grad_norm': [], 'weight': [], 'bias': [], 'val_loss': []}
        self.best_weights = None
        self.best_bias = None
        self.best_val_loss = float('inf')
        self.epochs_no_improve = 0

    def _mse_loss(self, y_true, y_pred, weights):
        m = len(y_true)
        loss = (1 / (2 * m)) * np.sum((y_pred - y_true)**2)
        l2_term = (self.l2_penalty / (2 * m)) * np.sum(weights**2)
        return loss + l2_term

    def fit(self, X_train, y_train, X_val=None, y_val=None):
        n_samples, n_features = X_train.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        for i in range(self.n_iterations):
            y_pred = np.dot(X_train, self.weights) + self.bias
            # Gradients of the L2-regularized MSE cost over the full batch
            dw = (1 / n_samples) * np.dot(X_train.T, (y_pred - y_train)) + (self.l2_penalty / n_samples) * self.weights
            db = (1 / n_samples) * np.sum(y_pred - y_train)
            loss = self._mse_loss(y_train, y_pred, self.weights)
            gradient = np.concatenate([dw, [db]])
            grad_norm = np.linalg.norm(gradient)
            self.history['weight'].append(self.weights[0])
            self.history['loss'].append(loss)
            self.history['grad_norm'].append(grad_norm)
            self.history['bias'].append(self.bias)
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db
            if X_val is not None and y_val is not None:
                val_y_pred = np.dot(X_val, self.weights) + self.bias
                val_loss = self._mse_loss(y_val, val_y_pred, self.weights)
                self.history['val_loss'].append(val_loss)
                # Early stopping: keep the best parameters and stop if validation loss stalls
                if val_loss < self.best_val_loss - self.tol:
                    self.best_val_loss = val_loss
                    self.best_weights = self.weights.copy()
                    self.best_bias = self.bias
                    self.epochs_no_improve = 0
                else:
                    self.epochs_no_improve += 1
                    if self.epochs_no_improve >= self.patience:
                        self.weights = self.best_weights
                        self.bias = self.best_bias
                        break
            if (i + 1) % 100 == 0:
                print(f"Iteration {i+1}/{self.n_iterations}, Loss: {loss:.4f}", end="")
                if X_val is not None:
                    print(f", Validation Loss: {val_loss:.4f}")
                else:
                    print()

    def predict(self, X_test):
        return np.dot(X_test, self.weights) + self.bias
```

The model is trained and tested on the processed dataset:

```python
model = BatchGradientDescentLinearRegressor(learning_rate=0.001, n_iterations=10000, l2_penalty=0, tol=1e-5, patience=5)
model.fit(X_train_processed, y_train.values)
y_pred = model.predict(X_test_processed)
```

Results indicate that per_capita_income has the highest correlation with transaction amounts. On the test set, the model reports:

Mean Squared Error (MSE): 1.5752
R-squared: 0.0206
Mean Absolute Error (MAE): 1.0472
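These metrics can be reproduced from y_pred with scikit-learn's metrics functions; this is a sketch that assumes scikit-learn is installed and that y_test holds the held-out (log-transformed) targets from the train/test split:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# y_test is assumed to be the test-set counterpart of y_train from the split above.
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"MSE: {mse:.4f}, R-squared: {r2:.4f}, MAE: {mae:.4f}")
```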
Stochastic Gradient Descent (SGD)

Batch gradient descent is computationally expensive for large datasets, which motivates stochastic gradient descent (SGD). SGD shuffles the training data at the beginning of each epoch and updates the model's weights and bias after processing each individual training example. This introduces "noise" into the optimization process, which can help the algorithm escape shallow local minima. A custom class StochasticGradientDescentLinearRegressor implements SGD:

```python
import numpy as np

class StochasticGradientDescentLinearRegressor:
    def __init__(self, learning_rate=0.01, n_iterations=100, l2_penalty=0.01, random_state=None):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.l2_penalty = l2_penalty
        self.random_state = random_state
        self._rng = np.random.default_rng(seed=random_state)
        self.weights_history = []
        self.bias_history = []
        self.loss_history = []
        self.weights = None
        self.bias = None

    def _mse_loss_single(self, y_true, y_pred):
        return 0.5 * (y_pred - y_true)**2

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = self._rng.random(n_features)
        self.bias = 0.0
        for epoch in range(self.n_iterations):
            # Shuffle the training data at the start of each epoch
            permutation = self._rng.permutation(n_samples)
            X_shuffled = X[permutation]
            y_shuffled = y[permutation]
            epoch_loss = 0
            for i in range(n_samples):
                xi = X_shuffled[i]
                yi = y_shuffled[i]
                y_pred = np.dot(xi, self.weights) + self.bias
                # Per-example gradients with L2 regularization
                dw = xi * (y_pred - yi) + self.l2_penalty * self.weights
                db = y_pred - yi
                self.weights -= self.learning_rate * dw
                self.bias -= self.learning_rate * db
                epoch_loss += self._mse_loss_single(yi, y_pred)
                # Track the optimization trajectory after every update
                if n_features >= 2:
                    self.weights_history.append(self.weights[:2].copy())
                elif n_features == 1:
                    self.weights_history.append(np.array([self.weights[0], 0]))
                self.bias_history.append(self.bias)
                self.loss_history.append(self._mse_loss_single(yi, y_pred) + (self.l2_penalty / (2 * n_samples)) * (np.sum(self.weights**2) + self.bias**2))
            print(f"Epoch {epoch+1}/{self.n_iterations}, Loss: {epoch_loss/n_samples:.4f}")

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias
```

SGD is trained and evaluated on the same processed data:

```python
model = StochasticGradientDescentLinearRegressor(learning_rate=0.001, n_iterations=200, random_state=42)
model.fit(X=X_train_processed, y=y_train.values)
y_pred = model.predict(X_test_processed)
```

The SGD results are:

Mean Squared Error (MSE): 1.5808
R-squared: 0.0172
Mean Absolute Error (MAE): 1.0475
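To make the efficiency trade-off concrete, the two update rules can be placed side by side; the SGD rule below is read off the per-example gradient used in fit above (L2 term omitted) and is simply the LMS rule applied to a single training example:

\[ \text{BGD: } \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \qquad \text{SGD: } \theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \]

One BGD step averages over all \(m\) examples before making a single update, while SGD makes \(m\) inexpensive updates per epoch, one per example, which is why it scales better to large datasets at the cost of a noisier trajectory.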

Conclusion

While linear models are computationally efficient, they may not capture complex relationships in the data. The choice between BGD and SGD depends on the dataset size and the need for rapid convergence: BGD is more stable but slower, while SGD is faster but noisier, and its noise can sometimes lead to better solutions by escaping shallow local minima. Understanding these trade-offs is essential when selecting an optimization technique for a specific modeling objective.

Industry Insight

Industry practitioners highlight the importance of balancing computational efficiency and model performance. BGD is often used for smaller datasets or when high precision is required, whereas SGD is preferred for large-scale applications where speed and flexibility matter more. Portfolios such as Kuriko Iwai's showcase practical implementations of these algorithms in real-world scenarios, demonstrating their versatility and effectiveness across domains.