HyperAI

27 Machine Learning Tips to Help You Avoid Pitfalls in Publishing Your Paper


This article was first published on WeChat official account: HyperAI

Contents at a glance: If you are new to machine learning and hope to conduct academic research in this field, don't miss this "pitfall-avoidance guide" tailored for you.

Keywords: machine learning, research standards, academic research


How can a newcomer to machine learning research avoid pitfalls and publish a paper smoothly?

Associate Professor Michael A. Lones from the School of Mathematics and Computer Science at Heriot-Watt University in Scotland discussed this in detail in his 2021 paper "How to avoid machine learning pitfalls: a guide for academic researchers".

Read the full paper (V2):

https://arxiv.org/pdf/2108.02497.pdf

Michael A. Lones' main research interests include optimization, machine learning, and nonstandard computing, with applications in biology, medicine, robotics, and security.

In this paper, the author writes from the perspective of academic research, drawing on his own research and teaching experience, and covers the complete workflow of applying machine learning. He identifies five frequently occurring problems that deserve special attention and proposes corresponding solutions.

Applicable people:

Students or scholars who are relatively new to the ML field and have only basic ML knowledge

Kind tips: This article focuses on issues of common concern in the academic community, such as how to rigorously evaluate and compare models so that papers can be published successfully.

Next, we will follow the complete process of training an ML model and describe it stage by stage.

Phase 1: Before creating the model

Many students are eager to train and evaluate a model right from the start, often neglecting the more important "homework". This "homework" includes:

* What is the goal of the project?

* What kind of data is needed to achieve this goal?

* Are there any limitations to the data? If so, how to solve them?

* What is the state of research and development in this field, and what has already been done?

If this preliminary work is not done well and one simply rushes to run models, the model will likely fail to support the intended conclusions, and the research will not be publishable.

1.1 Understand and analyze data

Reliable data sources, scientific collection methods, and high data quality all greatly benefit the publication of papers. A widely used dataset is not necessarily of good quality; it may simply be popular because it is easy to access. Before settling on a dataset, perform some exploratory data analysis to uncover its limitations.

1.2 Don't look at all the data: set the test data aside before you start

Information leaking from the test set into the training process is a common reason why machine learning models fail to generalize. During exploratory data analysis, therefore, do not examine the test data too closely, to avoid intentionally or unintentionally making untestable assumptions that limit the generality of the model.

Kind tips: It is fine to make assumptions, but they should inform only the training of the model, never its testing.
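This split-first discipline can be sketched with scikit-learn; the synthetic feature matrix and labels below are purely illustrative:

```python
# Split off a held-out test set before any exploratory analysis.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))   # hypothetical feature matrix
y = (X[:, 0] > 0).astype(int)    # hypothetical labels

# Hold out 20% as the test set; stratify to preserve class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# From here on, explore ONLY the training portion.
print(X_train.shape, X_test.shape)  # (400, 10) (100, 10)
```

The key point is that every exploratory plot or summary statistic afterwards is computed on `X_train` alone.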

1.3 Prepare sufficient data

Insufficient data can reduce a model's ability to generalize; how much is enough depends on the signal-to-noise ratio (SNR) of the dataset. Insufficient data volume is a common problem in machine learning research; in that case, the value of the existing data can be increased through techniques such as cross-validation and data augmentation.
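As one of the techniques mentioned, k-fold cross-validation lets every sample serve for both training and validation. A minimal sketch with scikit-learn on synthetic data (the model choice is illustrative):

```python
# With limited data, k-fold cross-validation reuses every sample across folds.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Small synthetic dataset standing in for a data-scarce problem.
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

# 5-fold CV: each sample is validated on exactly once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```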

1.4 Actively seek advice from experts in the field

Experts in the field have rich research experience. They can help us identify problems worth solving and the most appropriate feature sets and machine learning models, and they can guide the publication of our results, achieving twice the result with half the effort.

1.5 Do a good job of literature research

Scholarly progress is iterative: each study provides information that guides the next. Ignoring previous research means you will likely miss valuable information. Rather than racking your brains at writing time to explain why you studied an already-explored topic without building on existing results, do a thorough literature review before starting work.

1.6 Think about model deployment in advance

If the ultimate goal of the research is a machine learning model that can be deployed in the real world, consider deployment issues as early as possible: how environmental constraints limit model complexity, whether there are runtime restrictions, how the model will integrate with the surrounding software system, and so on.

Phase 2: Creating Models Reliably

It is important to create models in an organized way, so that the data is used correctly and modeling choices are well considered.

2.1 Test data cannot be used in model training

Once test data is involved in configuring, training, or selecting a model, the reliability and generality of the results suffer. This is a common reason why published machine learning models often fail on real-world data.

❎ Error examples (avoid them):

* During data preparation, scaling variables using the mean and range computed over the entire dataset (the correct approach is to compute these statistics on the training data only)

* Perform feature selection before splitting the data

* Evaluate the generalizability of multiple models using the same test data

* Apply data augmentation before splitting the test data

To avoid the above problems, the best approach is to set aside a subset of the data before the project starts, and at the end of the project use only this independent test set to assess the generalizability of a single final model.
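One practical way to enforce this, assuming scikit-learn, is to wrap preprocessing inside a Pipeline so that scaling and feature selection are fitted on the training folds only (an illustrative sketch, not the paper's own code):

```python
# A Pipeline re-fits scaler and feature selector inside each CV training
# fold, so no statistics ever leak from held-out data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),               # fit on training folds only
    ("select", SelectKBest(f_classif, k=10)),  # likewise
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```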

Kind tips: Time series data should be handled with special care, as random splits can easily lead to leakage and overfitting.

2.2 Try multiple different models

There is no universally best machine learning model; the task of our research is to find the model suited to a specific problem. With modern machine learning libraries in languages such as Python, R, and Julia, only minor code changes are needed to try multiple models and find the one that works best.
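A minimal sketch of this model-swapping workflow, assuming scikit-learn's uniform estimator API (the three model choices here are illustrative):

```python
# Thanks to the shared fit/score API, swapping models is a one-line change.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(random_state=0),
    "svm": SVC(),
}
val_scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    val_scores[name] = model.score(X_val, y_val)  # validation, not test!
print(val_scores)
```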

Kind tips:

* Do not use inappropriate models, and use the validation set rather than the test set to evaluate candidate models

* When comparing models, optimize each model's hyperparameters, evaluate each multiple times, and correct for multiple comparisons when publishing the results

2.3 Don’t use inappropriate models

Modern machine learning libraries have lowered the barrier to applying machine learning, but they also make it easy to choose an inappropriate model, such as applying a model designed for categorical features to a dataset of numerical features, or using a classification model where a regression model is called for. When choosing a model, pick one that fits the use case as closely as possible.

2.4 Deep learning is sometimes not the optimal solution

Although deep neural networks (DNNs) perform well on some tasks, that does not mean they suit every problem. When data is limited, the underlying pattern is fairly simple, or the model must be interpretable, DNNs may underperform old-fashioned machine learning models such as random forests and SVMs.

2.5 Optimize model hyperparameters

Hyperparameters have a huge impact on model performance and usually need to be tuned for each specific dataset. Aimless trial and error is unlikely to find good values; instead, use a hyperparameter optimization strategy such as random search or grid search.

Kind tips: For models with many hyperparameters or high training costs, these strategies may not be practical; techniques such as AutoML and data-mining pipelines can then be used to optimize the choice of model and hyperparameters together.
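As an illustration of random search, a sketch using scikit-learn's RandomizedSearchCV over hypothetical random-forest hyperparameter ranges:

```python
# Randomized hyperparameter search: sample n_iter configurations from
# the given distributions and keep the best by cross-validated score.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 200),  # illustrative ranges
        "max_depth": randint(2, 10),
    },
    n_iter=10, cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```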

2.6 Be extra careful when optimizing hyperparameters and selecting features

Hyperparameter optimization and feature selection are part of model training. Do not perform feature selection on the entire dataset before training begins, as this leaks information from the test set into the training process. Instead, perform these steps using only the same data used to train the model; a common technique for doing so is nested cross-validation (also called double cross-validation).
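Nested cross-validation can be sketched in a few lines with scikit-learn (the model and parameter grid below are illustrative): the inner loop tunes hyperparameters, while the outer loop estimates generalization, so tuning never sees the outer test folds.

```python
# Nested (double) cross-validation: tuning inside, evaluation outside.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, random_state=0)

inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)  # tuning loop
outer_scores = cross_val_score(inner, X, y, cv=5)       # evaluation loop
print(outer_scores.mean())
```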

Phase 3: Evaluating Models Robustly

Inappropriate model evaluation is very common and hinders academic progress. Careful thought must be given to how the data is used in experiments, how the true performance of the model is measured, and how that performance is reported.

3.1 Use an appropriate test set

Use a test set to measure the generalizability of your machine learning model, and make sure its data is appropriate: the test set must not overlap with the training set, and it should cover a wider range of conditions. For example, in a photographic dataset of an object, if both the training and test images were collected outdoors on sunny days, the test set is not independent, because it fails to capture broader weather conditions.

3.2 Do not perform data augmentation before splitting the data

Data augmentation helps balance a dataset and improves the generality and robustness of machine learning models. Note, however, that augmentation should be applied only to the training set, never to the test set, to prevent overfitting.

3.3 Use a validation set

Measure model performance during development with a separate validation set: a set of samples not used directly for training but used to guide it. Another benefit of a validation set is that it enables early stopping.
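Early stopping driven by an internal validation split can be sketched with scikit-learn's gradient boosting (the parameter values here are illustrative):

```python
# Early stopping: hold out a validation fraction during training and stop
# when the validation score stops improving for n_iter_no_change rounds.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=400, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=500,          # upper bound on boosting rounds
    validation_fraction=0.2,   # internal validation split
    n_iter_no_change=10,       # patience before stopping
    random_state=0,
)
gb.fit(X, y)
print(gb.n_estimators_)  # boosting rounds actually used
```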

3.4 Evaluate the model multiple times

A single evaluation of a model is unreliable and may underestimate or overestimate its true performance. The model therefore needs to be evaluated multiple times, usually by training it repeatedly on different subsets of the training data. Cross-validation is a particularly popular approach with many variants, such as ten-fold cross-validation.

Kind tips: When reporting the mean and standard deviation over multiple evaluations, it is also advisable to keep the individual scores, so that models can later be compared using statistical tests.

3.5 Reserve some data to evaluate the final model instance

The most reliable way to assess the generalizability of a model instance may simply be to use yet another test set. So if the amount of data is large enough, it is better to reserve some of it for an unbiased evaluation of the final selected model instance.

3.6 Don't use accuracy with imbalanced datasets

Choose the metrics for evaluating machine learning models carefully. For example, the most common metric for classification models is accuracy, which works well when the dataset is balanced (each class is represented by a similar number of samples). On an imbalanced dataset, however, accuracy can be highly misleading.

In that case, it is better to use metrics that are insensitive to class imbalance, such as the F1 score, Cohen's kappa coefficient (κ), or the Matthews correlation coefficient (MCC).
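A quick illustration with scikit-learn's metrics module: on a 95/5 imbalanced set, a "model" that always predicts the majority class scores 95% accuracy, while F1, kappa, and MCC all expose its uselessness.

```python
# Accuracy vs. imbalance-robust metrics on a degenerate classifier.
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             f1_score, matthews_corrcoef)

# 95 negatives, 5 positives; predictions are always the majority class.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))     # 0.95 — looks great
print(f1_score(y_true, y_pred))           # 0.0 — reveals the failure
print(cohen_kappa_score(y_true, y_pred))  # 0.0
print(matthews_corrcoef(y_true, y_pred))  # 0.0
```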

Phase 4: Comparing Models Fairly

Comparing models is fundamental to academic research, but an unfair comparison, once published, will mislead other researchers. Make sure different models are evaluated under the same conditions and that statistical tests are used appropriately.

4.1 Don't assume a higher number means a better model

Papers often claim: "The accuracy of a previous study was 94%, and this model reaches 95%, so it is better." But for various reasons, a higher number does not equate to a better model. If the models were trained or evaluated on different partitions of the same dataset, the real performance difference may be small; if entirely different datasets were used, the difference may be huge. Unequal amounts of hyperparameter optimization can also distort the apparent difference between models.

Therefore, to compare two models scientifically, optimize them to the same degree, evaluate each multiple times, and use statistical tests to determine whether the performance differences are significant.

4.2 Comparing Models Using Statistical Tests

It is recommended to use statistical tests to compare the performance of two models. Broadly speaking, tests for comparing machine learning models fall into two categories: the first compares individual model instances, e.g. the McNemar test for comparing two trained decision trees; the second suits more general comparisons, e.g. the Mann-Whitney U test for asking whether decision trees or neural networks are the better fit for a problem.
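Both kinds of test are available off the shelf. The sketch below uses SciPy, with the exact McNemar test implemented as a binomial test on the discordant pairs; the counts and score lists are made up for illustration.

```python
# Two tests for comparing models, using only SciPy.
from scipy.stats import binomtest, mannwhitneyu

# McNemar (exact form): only discordant test samples matter, i.e. those
# where exactly one of the two trained models is correct.
b = 5    # hypothetical count: model A correct, model B wrong
c = 15   # hypothetical count: model A wrong, model B correct
p_mcnemar = binomtest(b, b + c, 0.5).pvalue

# Mann-Whitney U: compare score distributions from repeated evaluations
# of two model types (hypothetical accuracy scores).
scores_a = [0.81, 0.83, 0.80, 0.84, 0.82]
scores_b = [0.78, 0.77, 0.79, 0.76, 0.80]
p_mwu = mannwhitneyu(scores_a, scores_b).pvalue

print(p_mcnemar, p_mwu)
```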

4.3 Correction for multiple comparisons

Comparing more than two models with statistical tests is somewhat complicated: multiple pairwise tests are akin to using the test set multiple times and can lead to overly optimistic interpretations of significance.

It is recommended to use a multiple testing correction, such as the Bonferroni correction, to address this issue.
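The Bonferroni correction itself is simple enough to sketch in a few lines: with m comparisons, test each at significance level alpha/m (equivalently, multiply each p-value by m). The p-values below are hypothetical.

```python
# Bonferroni correction for m pairwise comparisons.
raw_pvalues = [0.01, 0.04, 0.03]  # hypothetical pairwise test results
alpha = 0.05
m = len(raw_pvalues)

# Either scale the p-values up (capped at 1.0)...
corrected = [min(p * m, 1.0) for p in raw_pvalues]
# ...or test each raw p-value against the stricter threshold alpha/m.
significant = [p < alpha / m for p in raw_pvalues]

print(corrected)
print(significant)  # [True, False, False]
```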

4.4 Don’t trust community benchmarks too much

For problems in certain fields, many researchers evaluate new machine learning models on benchmark datasets; since everyone trains and tests on the same data, comparisons seem more direct. But this approach has major drawbacks.

First, if access to the test set is unrestricted, there is no guarantee that others have not used it as part of training, which leads to over-optimistic results. Moreover, even if each individual uses the test set only once, collectively the community uses it many times, which likewise overfits models to it. Results on benchmark datasets should therefore be interpreted with caution, and apparent performance improvements judged soberly.

Phase 5: Reporting Results

Academic research should contribute to knowledge, which requires reporting the full picture of the work, including what succeeded and what failed. Machine learning usually involves trade-offs, and one model is rarely better than another in every respect; reports of results need to reflect this.

5.1 Reporting needs to be transparent

Share all research work transparently; this makes it easier for others to repeat experiments and compare models. Clearly documenting experiments and writing clean code benefits both you and others. The machine learning community is paying increasing attention to reproducibility, and an inadequately documented workflow may jeopardize subsequent publication.

5.2 Reporting Performance in Multiple Ways

When evaluating model performance, a more rigorous approach is to use multiple datasets. This helps overcome the deficiencies of any single dataset and gives a more complete picture of the model's performance. Reporting multiple metrics for each dataset is also good practice, since different metrics present different perspectives on the results and increase the transparency of the work.

5.3 Don't generalize beyond the data

Don't draw invalid conclusions that lead other researchers astray. A common mistake is to publish generalizations unsupported by the data used to train and evaluate the model. A model performing well on one dataset does not mean it will perform well on others. Although using multiple datasets yields more reliable insights, there is always a limit to what a single experiment can show. Don't exaggerate findings, and be candid about limitations.

5.4 Reporting significant differences with caution

The statistical tests discussed above can help detect differences between models. But statistical tests are not perfect: they may underestimate or overestimate significance, producing false positives or false negatives. Moreover, a growing number of statisticians advocate abandoning confidence thresholds altogether and reporting raw p-values instead.

Beyond statistical significance, another question is whether the difference between two models actually matters. Given enough samples, you can always find a significant difference, even when the real performance gap is negligible. To judge how much a difference matters, measure the effect size, for example with Cohen's d statistic (most common) or the Kolmogorov-Smirnov statistic (better, and recommended).
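Cohen's d can be computed directly from two sets of evaluation scores; a minimal sketch using the standard pooled-standard-deviation formula (the score lists are hypothetical):

```python
# Cohen's d: standardized mean difference between two sets of scores.
import numpy as np

def cohens_d(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    # Pooled standard deviation across both groups.
    pooled = np.sqrt(((len(a) - 1) * a.var(ddof=1) +
                      (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled

scores_a = [0.82, 0.84, 0.81, 0.83, 0.85]  # hypothetical model A scores
scores_b = [0.80, 0.79, 0.81, 0.78, 0.80]  # hypothetical model B scores
print(cohens_d(scores_a, scores_b))  # well above 0.8, i.e. a large effect
```

By the usual rule of thumb, d around 0.2 is a small effect, 0.5 medium, and 0.8 or more large.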

5.5 Explain how the model works

A trained model contains a great deal of useful information, yet many authors report only its performance metrics without explaining how it works. The purpose of research is not to eke out slightly higher accuracy than everyone else, but to generate and share knowledge with the research community, which also increases the chances of the work being published. For example, for simple models such as decision trees, provide a visualization; for complex models such as deep neural networks, consider using explainable AI (XAI) techniques to extract relevant information.
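For the simple-model case, assuming scikit-learn, the learned rules of a small decision tree can be exported as text and reported alongside the metrics (an illustrative sketch on the iris dataset):

```python
# Export a small decision tree's learned rules as human-readable text.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

report = export_text(tree, feature_names=[
    "sepal length", "sepal width", "petal length", "petal width"])
print(report)  # nested if/else rules with a class at each leaf
```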


That concludes the "pitfall-avoidance guide". I hope every student new to machine learning will keep this handbook at hand and revisit it often, so as to find a research direction, choose a good topic, and publish a paper as soon as possible!

Looking forward to your good news~

Reference link: [How to avoid machine learning pitfalls: a guide for academic researchers]

https://arxiv.org/pdf/2108.02497.pdf

-- over--