HyperAI

Regression

Regression is a supervised learning method used mainly to predict and model continuous numerical variables. It describes the relationship between inputs and an output, where the inputs are known information and the output is the predicted value.

The goal of regression is to find the line (or, more generally, the curve or surface) that best fits the data.
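
As an illustration of what "best fit" can mean, the sketch below computes the least-squares line for one input variable using the closed-form formulas. The function name `fit_line` and the sample data are assumptions made for this example, not part of the original entry.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With noisy data the points no longer lie on the line, and the same formulas return the line that minimizes the sum of squared vertical distances to the points.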

Assumptions and content

In data analysis, it is usually necessary to make some assumptions about the data:

  • Homogeneity of variance
  • Linear relationship
  • Additive effects
  • No measurement error in the variables
  • The variables follow a multivariate normal distribution
  • Independence of observations
  • The model is complete (no relevant variable is omitted and no irrelevant variable is included)
  • The error terms are independent and normally distributed with mean 0 and constant variance
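
Some of these assumptions can be checked roughly from the residuals of a fitted model. The sketch below is one crude way to probe homogeneity of variance: it compares the residual variance in the upper half of the x range with that in the lower half. The helper names, the data, and the idea of using a simple variance ratio are assumptions for illustration; formal tests exist and would normally be preferred.

```python
from statistics import mean, pvariance


def fit_line(xs, ys):
    """Least-squares slope and intercept for one input variable."""
    x_mean, y_mean = mean(xs), mean(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def residuals(xs, ys):
    """Observed value minus fitted value for every point."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


def variance_ratio(xs, ys):
    """Residual variance for large x divided by residual variance for small x.

    A value near 1 is consistent with homogeneity of variance. This sketch
    assumes the lower half does not have zero residual variance.
    """
    pairs = sorted(zip(xs, residuals(xs, ys)))
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return pvariance(high) / pvariance(low)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
signs = [1, -1, 1, -1, 1, -1, 1, -1]
# Hypothetical noise of constant size versus noise that grows with x.
steady = [2 * x + 1 + 0.5 * s for x, s in zip(xs, signs)]
growing = [2 * x + 1 + 0.1 * x * s for x, s in zip(xs, signs)]

steady_ratio = variance_ratio(xs, steady)
growing_ratio = variance_ratio(xs, growing)
print(steady_ratio, growing_ratio)
```

A ratio far from 1, as for the second dataset, suggests that the spread of the errors changes with x and that the homogeneity assumption may not hold.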

Main content of regression analysis

  1. Starting from a set of data, we can determine the quantitative relationship between certain variables, that is, establish a mathematical model and estimate the unknown parameters. The most commonly used method for estimating parameters is the least squares method.
  2. Test the credibility of these relationships.
  3. When many independent variables jointly influence a dependent variable, determine which independent variables have a significant effect and which do not, keep the variables with significant effects in the model, and remove those with insignificant effects. Methods such as stepwise regression, forward selection, and backward elimination are usually used.
  4. Use the obtained relationship to predict or control a production process. Regression analysis is applied very widely, and statistical software packages make the various regression methods easy to compute.
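
The variable-selection idea in item 3 can be sketched in a simplified form. The example below performs only the first step of a forward selection: it fits a one-variable least-squares line for each candidate and keeps the candidate with the highest coefficient of determination (R²). Using R² as the criterion is an assumption made to keep the example short; statistical packages typically rely on significance tests instead. The variable names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one input variable."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def r_squared(xs, ys):
    """Share of the variation in ys explained by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical candidate independent variables and dependent variable.
candidates = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [3, 1, 4, 1, 5, 9],
}
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

scores = {name: r_squared(xs, y) for name, xs in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

A full forward selection would repeat this step, adding one variable at a time to a multiple regression until no remaining candidate improves the model enough.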

Main issues in regression analysis research

  • Determine the quantitative relationship expression between Y and X, which is called the regression equation;
  • Test the credibility of the obtained regression equation;
  • Determine whether the independent variable X has an impact on the dependent variable Y;
  • Use the obtained regression equation for prediction and control.

Steps in regression analysis

  • Determine the variables: Clarifying the specific goal of the prediction also determines the dependent variable.
  • Establish a prediction model: Using historical data on the independent and dependent variables, establish the regression equation, that is, the regression prediction model.
  • Carry out correlation analysis: Regression analysis is a statistical analysis of influencing factors (independent variables) and a prediction target (dependent variable) that are assumed to be causally related. The regression equation is meaningful only when the independent variables are actually related to the dependent variable.
  • Calculate the prediction error: Whether the regression model can be used for actual prediction depends on the results of testing the model and on the size of its prediction error.
  • Determine the predicted value: Use the regression prediction model to calculate the predicted value, and conduct a comprehensive analysis of the predicted value to determine the final predicted value.
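
The steps above can be sketched end to end for a single independent variable. The example fits a line on the earlier part of a hypothetical history, measures the prediction error on the most recent observations with the root mean squared error (RMSE), and then produces a forecast. The data, the train/holdout split, and the choice of RMSE as the error measure are all assumptions for illustration.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope and intercept for one input variable."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def predict(model, x):
    """Evaluate the fitted regression equation at x."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical historical data: x is the period, y the observed value.
history_x = [1, 2, 3, 4, 5, 6, 7, 8]
history_y = [5.1, 7.9, 11.2, 13.8, 17.1, 19.9, 23.0, 26.1]

# Establish the model on the first six periods, hold out the last two.
model = fit_line(history_x[:6], history_y[:6])

# Calculate the prediction error on the held-out periods.
errors = [y - predict(model, x) for x, y in zip(history_x[6:], history_y[6:])]
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# Determine the predicted value for the next period.
forecast = predict(model, 9)
print(rmse, forecast)
```

If the held-out error is small relative to the scale of the data, the model may be acceptable for forecasting; otherwise the choice of variables or the form of the model should be revisited.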

Regression analysis methods

  • Linear regression (regularization): Linear regression is one of the most commonly used algorithms for regression tasks. It has a simple form and fits the dataset with a hyperplane (a straight line when there is only one input variable).
  • Regression trees (ensemble methods): Regression trees learn hierarchically by repeatedly splitting the dataset into branches. Each split is chosen to maximize the information gain, which for a continuous target usually means the largest reduction in variance or squared error.
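
To make the splitting criterion of a regression tree concrete, the sketch below searches for the single threshold on one input variable that minimizes the total squared error of the two resulting branches, where each branch predicts its own mean. This is a simplified illustration of one split, assuming squared error as the criterion; the function names and data are hypothetical, and real implementations repeat the search recursively over many variables.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, cost) of the split with the lowest total squared error."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        # Identical neighbouring x values cannot be separated by a threshold.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost


# Hypothetical data with two clearly separated groups.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.2, 4.8, 20.1, 19.9, 20.0]
threshold, cost = best_split(xs, ys)
print(threshold, cost)
```

Minimizing the squared error after the split is equivalent to maximizing the reduction in squared error relative to the unsplit data, which plays the role that information gain plays in classification trees.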

Regression and other issues

  • A prediction problem in which both the input and output variables are continuous is a regression problem;
  • A prediction problem in which the output variable takes a finite number of discrete values is a classification problem;
  • A prediction problem in which both the input and the output are sequences of variables is a labeling (sequence labeling) problem.

Related words: classification, labeling