HyperAI

5 Chapters, 25 Guidelines: A Comprehensive Guide to Dataset Selection and Creation

2 years ago
Information
Yinrong Huang

Contents at a glance: If you are learning how to create or choose a suitable dataset, this post offers practical advice to help you make informed decisions when selecting and creating datasets.

Keywords: Machine Learning Datasets

This article was first published on HyperAI WeChat public platform~

Author | xixi

Proofreading | Sanyang

A high-quality dataset can not only improve the accuracy and operating efficiency of the model, but also save training time and computing resources.

In this article, we draw on Jan Marcel Kezmann's piece "The Dos and Don'ts of Dataset Selection for Machine Learning You Have to Be Aware of", which explains in detail how to create and select datasets. We hope it helps data science engineers avoid common pitfalls and follow best practices for model training. Let's take a look at the tips~

Read the original English article:

https://medium.com/mlearning-ai/the-dos-and-donts-of-dataset-selection-for-machine-learning-you-have-to-be-aware-of-8b14513d94a

Table of contents

1. Best Practices for Selecting Datasets

2. Be aware of the traps to avoid

3. 5 Tips

4. Best Practices for Creating Datasets

5. Dataset Evaluation

Applicable people:

Beginners, data scientists, machine learning practitioners

1. Best Practices for Selecting Datasets

This section delves into best practices for selecting public datasets. There are six key steps to keep in mind:

1.1 Understanding the Problem 

It is important to understand the problem you want to solve, including determining the input and output variables, the type of problem (classification, regression, clustering, etc.), and the performance metric.

1.2 Defining the Problem 

Narrow the scope of the dataset by specifying the industry or domain, the type of data required (text, images, audio, etc.), and any constraints associated with the dataset.

1.3 Focus on quality 

Find datasets that are reliable, accurate, and relevant to your problem. Check for missing data, outliers, and inconsistencies, as these issues can negatively impact the performance of your model.

1.4 Consider the size of the dataset 

The size of the dataset affects the accuracy and generalization ability of the model. While larger datasets help improve model accuracy and robustness, they also demand more computing resources and longer training time.

1.5 Check Bias 

Bias in a dataset can lead to unfair or inaccurate predictions. Be aware of bias related to the data collection process, such as sampling bias, and bias related to social issues, such as gender, race, or socioeconomic status.

1.6 Seek diversity 

Choosing a diverse dataset from different sources, populations, or locations can help the model learn from a variety of different examples and avoid overfitting.

2. Be aware of the traps to avoid

This section applies to both predefined datasets and datasets you create yourself.

2.1 Insufficient Data

Insufficient data can cause the model to fail to capture the underlying patterns in the data, resulting in poor performance. If there is not enough data, you can consider using techniques such as data augmentation or transfer learning to enhance the dataset or model capabilities. If the labels are consistent, multiple datasets can be merged into one.
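When labels are consistent across sources, merging can be as simple as concatenation plus a consistency check. Below is a minimal sketch in plain Python; the datasets, file names, and the `merge_datasets` helper are hypothetical, and a real pipeline would also deduplicate samples and normalize label vocabularies first.

```python
# Sketch: merge labeled datasets whose label names are consistent.
# All data below is hypothetical, for illustration only.

def merge_datasets(*datasets):
    """Concatenate datasets of (sample, label) pairs, failing fast if
    two datasets spell the same label differently (e.g. "Cat" vs "cat")."""
    merged = []
    seen_labels = set()
    for ds in datasets:
        labels = {label for _, label in ds}
        for lab in labels:
            lowered_seen = {s.lower() for s in seen_labels}
            if lab.lower() in lowered_seen and lab not in seen_labels:
                raise ValueError(f"Inconsistent label spelling: {lab!r}")
        seen_labels |= labels
        merged.extend(ds)
    return merged

ds_a = [("img_001.png", "cat"), ("img_002.png", "dog")]
ds_b = [("img_101.png", "cat"), ("img_102.png", "bird")]
combined = merge_datasets(ds_a, ds_b)
print(len(combined))  # 4
```

The spelling check is a cheap proxy for the "consistent labels" requirement mentioned above; stricter schema validation would be needed for real merges.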

2.2 Imbalanced Classes

Class imbalance means that the number of samples of one class is significantly larger than that of another class, which can lead to prediction bias or other model errors. To address this problem, techniques such as oversampling, undersampling, or class weighting are recommended. Enhancing underrepresented classes can also reduce this problem.

Tip:

Class imbalance affects different machine learning tasks differently. For example, in anomaly detection tasks, severe class imbalance is normal; in standard image classification problems it is less common.
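Of the techniques mentioned, random oversampling is the simplest to illustrate. The sketch below duplicates minority-class samples until every class matches the majority class size; the data and the `oversample` helper are made up for illustration, and libraries such as imbalanced-learn offer more principled variants (e.g. SMOTE).

```python
# Sketch: naive random oversampling of minority classes.
import random
from collections import Counter

def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until every class
    reaches the majority class size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_s, out_l = list(samples), list(labels)
    for cls, n in counts.items():
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        for _ in range(target - n):
            i = rng.choice(idx)           # resample an original example
            out_s.append(samples[i])
            out_l.append(labels[i])
    return out_s, out_l

X = ["a", "b", "c", "d", "e"]
y = [0, 0, 0, 0, 1]                       # imbalanced: four 0s vs one 1
Xb, yb = oversample(X, y)
print(Counter(yb))                        # both classes now have 4 samples
```

Duplicating samples risks overfitting to the minority examples, which is why the text also suggests class weighting or augmenting the underrepresented class instead.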

2.3 Outliers 

Outliers are data points that are significantly different from other data samples and can negatively impact model performance. If a dataset contains too many outliers, a machine learning or deep learning model will often have difficulty learning the desired distribution.

Consider using techniques such as winsorization to remove or cap outliers, or mean/median imputation to replace missing values in a sample with the mean or median of the observed data.
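Winsorization caps extreme values at chosen percentiles rather than deleting them. Here is a minimal stdlib sketch using nearest-rank percentiles; the `winsorize` function, the percentile choices, and the sample data are illustrative assumptions (`scipy.stats.mstats.winsorize` does this more carefully).

```python
# Sketch: winsorization via nearest-rank percentiles (no interpolation).

def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Clip every value into the [5th, 95th] percentile range so that
    extreme outliers no longer dominate the distribution."""
    s = sorted(values)
    n = len(s)
    lo = s[int(lower_pct * (n - 1))]
    hi = s[int(upper_pct * (n - 1))]
    return [min(max(v, lo), hi) for v in values]

data = [1, 2, 2, 3, 3, 3, 4, 4, 100]   # 100 is an obvious outlier
clipped = winsorize(data)
print(max(clipped))  # 4 -- the outlier is capped at the upper bound
```

Unlike simply dropping outliers, winsorization preserves the sample count, which matters when data is already scarce.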

2.4 Data Snooping and Leakage 

To avoid data snooping, which can lead to overfitting and reduced performance, split your dataset into training, validation, and test sets, and use only the training set to train your model.

Conversely, training a model on data from the test set causes data leakage, which leads to overly optimistic performance estimates. To avoid data leakage, always keep the validation and test sets separate and use them only to evaluate the final model.
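The split described above can be sketched in a few lines of stdlib Python. The function name, split fractions, and seed are illustrative choices; `sklearn.model_selection.train_test_split` is the usual production tool, and stratified splitting is preferable for imbalanced labels.

```python
# Sketch: one-shot shuffled train/validation/test split.
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle indices once, then carve out disjoint validation and
    test sets so no test sample can leak into training."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test = [data[i] for i in idx[:n_test]]
    val = [data[i] for i in idx[n_test:n_test + n_val]]
    train = [data[i] for i in idx[n_test + n_val:]]
    return train, val, test

samples = list(range(100))                 # hypothetical dataset
train, val, test = train_val_test_split(samples)
print(len(train), len(val), len(test))     # 70 15 15
```

Splitting by index after a single seeded shuffle also makes the partition reproducible, which helps when comparing models later.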

3. 5 Tips

  • Use transfer learning: a model pre-trained on a related problem can be fine-tuned on a smaller dataset for your specific problem.
  • Merge multiple datasets to increase size and diversity, yielding more accurate and robust models. Be aware of data compatibility and quality issues.
  • Use crowdsourcing to quickly collect large amounts of labeled data at low cost. Pay attention to quality control and bias issues.
  • Keep an eye out for data APIs from companies and organizations, which let you access their data programmatically.
  • Check out available benchmarks, which provide standardized datasets and evaluation metrics for comparing the performance of different models on the same problem.

4. Best Practices for Creating Datasets

4.1 Define the problem and objectives 

Before collecting any data, be clear about the target variable you want to predict, the scope of the problem you want to solve, and the intended use of the dataset.

Clarifying the problem and goals helps focus collection on relevant data, avoids wasting time and resources on irrelevant or noisy data, and helps you understand the assumptions and limitations of the dataset.

4.2 Collecting Diverse and Representative Datasets 

Collecting data from different sources and domains ensures that the dataset is representative of real-world problems. This includes collecting data from different locations, demographics, and time periods, ensuring that the dataset is not biased towards specific groups or sectors.

Additionally, make sure the data does not contain confounding variables: unmeasured third variables that affect both the hypothesized cause and the hypothesized effect, distorting the results.

4.3 Carefully label your data 

Annotate data with labels that accurately reflect the ground truth, and use multiple annotators or crowdsourcing to reduce the impact of individual bias and improve label quality and reliability. It is also recommended to version-control the data so that training and evaluation can be tracked, shared, and reproduced more easily.
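With multiple annotators, the standard way to reconcile disagreements is majority voting. The sketch below shows this in plain Python; the sample names and votes are hypothetical, and real pipelines often weight annotators by agreement rate or measure inter-annotator reliability (e.g. Cohen's kappa) as well.

```python
# Sketch: resolve multi-annotator labels by majority vote.
from collections import Counter

def majority_vote(annotations):
    """Pick each sample's most frequent label. Ties resolve to the
    earliest-seen vote (Counter.most_common keeps first-encountered
    items ahead on ties)."""
    return {
        sample: Counter(votes).most_common(1)[0][0]
        for sample, votes in annotations.items()
    }

annotations = {
    "img_01": ["cat", "cat", "dog"],   # hypothetical annotator votes
    "img_02": ["dog", "dog", "dog"],
}
labels = majority_vote(annotations)
print(labels)  # {'img_01': 'cat', 'img_02': 'dog'}
```

Samples where annotators split evenly are good candidates for re-annotation rather than silent tie-breaking.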

Tip:

If only 80% of a dataset's labels are correct, then in most cases even the best model will not exceed 80% accuracy.

4.4 Ensuring data quality and integrity 

Data quality refers to the accuracy, completeness, and consistency of data. Techniques such as data cleaning, outlier detection, and missing-value interpolation can help improve the quality of a dataset. In addition, make sure the data is in a format that machine learning algorithms can easily understand and process.
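Mean imputation is the simplest of the missing-value techniques mentioned here. Below is a stdlib sketch where `None` marks a missing entry; the `impute_mean` helper and the age column are illustrative, and median imputation (more robust to outliers) is a one-line change.

```python
# Sketch: fill missing numeric values with the column mean.

def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

ages = [25, None, 35, None, 40]    # hypothetical column with gaps
print(impute_mean(ages))           # gaps filled with (25+35+40)/3 ≈ 33.33
```

Note that imputing with the mean shrinks the column's variance, so it is best reserved for columns with few missing entries.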

4.5 Ensuring data privacy and security

To protect privacy, ensure that data collection and storage are secure and that any sensitive information is anonymized or encrypted. In addition, consider using encryption to protect data both in transit and at rest.
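One common anonymization step is pseudonymization: replacing identifying fields with salted hashes so records can still be joined without exposing raw values. The sketch below uses the stdlib `hashlib`; the record, field names, and the hard-coded salt are illustrative only. A real deployment needs a secret salt kept out of the code, and truncated hashes of low-entropy fields remain vulnerable to guessing, so this is not a substitute for a proper privacy review.

```python
# Sketch: pseudonymize sensitive fields with salted SHA-256 digests.
import hashlib

def anonymize(record, sensitive_fields, salt="example-salt"):
    """Replace each sensitive field with a truncated salted digest,
    leaving the remaining fields untouched."""
    out = dict(record)
    for field in sensitive_fields:
        if field in out:
            digest = hashlib.sha256((salt + str(out[field])).encode())
            out[field] = digest.hexdigest()[:16]   # short pseudonym
    return out

user = {"name": "Alice", "email": "alice@example.com", "score": 0.91}
anon = anonymize(user, ["name", "email"])
print(anon["score"])  # 0.91 -- non-sensitive fields survive unchanged
```

Because the same input and salt always yield the same pseudonym, anonymized records can still be deduplicated or linked across tables.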

Tip:

Pay attention to the terms of use of the data you collect, and verify that collection and use comply with applicable laws and regulations.

5. Dataset Evaluation

Check whether the dataset meets the following five criteria:

  • Data size: generally speaking, the more data the better.
  • Data distribution: make sure the dataset is balanced and representative.
  • Data quality: clean, consistent, error-free data is critical.
  • Data complexity: make sure the data is not overly complex.
  • Data relevance: the data should be relevant to the problem.
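The first two criteria above lend themselves to quick automated checks. Here is a small sketch; the `quick_checks` helper and its thresholds (1,000 samples, 10:1 imbalance) are arbitrary illustrative choices, not standards, and the remaining criteria (quality, complexity, relevance) require manual inspection.

```python
# Sketch: automated sanity checks for dataset size and class balance.
from collections import Counter

def quick_checks(labels, min_size=1000, max_imbalance=10.0):
    """Return pass/fail flags for two checklist items: enough samples,
    and a majority/minority class ratio below the threshold."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return {
        "size_ok": len(labels) >= min_size,
        "balance_ok": ratio <= max_imbalance,
    }

labels = ["cat"] * 600 + ["dog"] * 500     # hypothetical label column
print(quick_checks(labels))                # both checks pass here
```

Running such checks before every training run catches regressions when a dataset is re-collected or re-merged.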

The above is the complete content of the dataset selection and creation guide. Choosing a suitable dataset is the key to machine learning. I hope this guide can help you choose or create a high-quality dataset and train accurate and robust models!

  Download massive public datasets online

To date, HyperAI's official website has launched more than 1,200 high-quality public datasets, recorded nearly 500,000 downloads, and served more than 2,000 TB of traffic, greatly lowering the barrier to accessing high-quality public datasets at home and abroad.

Visit the following link to search and download the dataset you need immediately and start your model training journey!

Visit the official website: https://orion.hyper.ai/datasets
