How Incorrectly Crafted Synthetic Data Can Trigger Model Collapse (02/23)

by Admin_Azoo 23 Feb 2024

As generative AI advances, more and more data is being produced by models, and this synthetic data is frequently used to train other AI systems. A notable advantage of generative AI is that a trained model can produce a practically unlimited amount of data. This capability is often used to obtain diverse datasets that cover rare corner cases.

However, repeatedly training on poorly generated synthetic data can degrade model performance, and many practitioners are not fully aware of this. As generative models produce ever larger volumes of data, users need to recognize the pitfalls of using poorly crafted synthetic datasets indiscriminately: training on more of such data can, paradoxically, make a model worse.

Why does Model Collapse occur?

Model collapse is a phenomenon in which a generative model loses diversity and keeps generating the same patterns instead of a variety of data. In essence, it occurs when the generator becomes overly biased toward specific types of data. In particular, information from the minority portions of the original dataset's distribution can be distorted or lost as the model goes through multiple training iterations. As a result, the generated data becomes monotonous and no longer reflects the diverse characteristics of the original data distribution.
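The loss of minority information can be illustrated with a toy simulation. The sketch below is not a real generative model: it assumes a three-category "dataset" and treats "training" as simply re-estimating category frequencies from a small sample drawn from the previous generation's model. The category names, sample size, and function names are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def refit_on_own_samples(probs, n_samples, rng):
    """Draw samples from the current model, then refit by empirical frequency."""
    categories = list(probs)
    weights = [probs[c] for c in categories]
    samples = rng.choices(categories, weights=weights, k=n_samples)
    counts = Counter(samples)
    # A category that was never sampled gets probability zero and is dropped;
    # later generations can never sample it again.
    return {c: counts[c] / n_samples for c in categories if counts[c] > 0}


def simulate_collapse(probs, generations=30, n_samples=50, seed=0):
    """Repeatedly train each generation only on the previous generation's output."""
    rng = random.Random(seed)
    history = [dict(probs)]
    for _ in range(generations):
        probs = refit_on_own_samples(probs, n_samples, rng)
        history.append(probs)
    return history


# Hypothetical original distribution with a small minority category.
original = {"common": 0.90, "uncommon": 0.08, "rare": 0.02}
history = simulate_collapse(original)
support_sizes = [len(p) for p in history]

for generation, size in enumerate(support_sizes):
    print(f"generation {generation}: {size} categories remain")
```

In this toy setup the number of surviving categories can only stay the same or shrink, because a category that drops to zero probability is never generated again. How quickly the rare category disappears depends on the seed and the sample size, but the direction of the drift is the same one described above: the tails of the distribution go first.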

How to create good synthetic data?

To avoid model collapse caused by synthetic data that deviates from the actual data distribution, the generated data itself needs to be diverse. This can require adjusting aspects such as the structure of the generator and discriminator, the learning rates, and the loss functions. Periodically including real data among the synthetic data during training can also help. Another option is to mark human-created and machine-generated data so that the model can tell them apart. Above all, what improves model performance is the use of high-quality synthetic data, not simply a large quantity of it.
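Two of these suggestions, mixing real data into training and keeping human-created and machine-generated data distinguishable, can be sketched together. The example below is only an illustration under stated assumptions: the function name `build_mixed_batch`, the `real_fraction` parameter, and the `source` field are hypothetical, and the 25% ratio is an arbitrary placeholder, not a recommended value.

```python
import random


def build_mixed_batch(real_data, synthetic_data, batch_size=8,
                      real_fraction=0.25, seed=0):
    """Build one training batch that always contains some real examples.

    Every record carries a "source" tag so that downstream code can tell
    human-created data apart from machine-generated data.
    """
    rng = random.Random(seed)
    # Guarantee at least one real example, but never more than the batch holds.
    n_real = min(batch_size, max(1, round(batch_size * real_fraction)))
    n_synthetic = batch_size - n_real

    batch = [{"value": x, "source": "real"}
             for x in rng.sample(real_data, n_real)]
    batch += [{"value": x, "source": "synthetic"}
              for x in rng.sample(synthetic_data, n_synthetic)]
    rng.shuffle(batch)
    return batch


# Hypothetical stand-ins for real and generated examples.
real_examples = list(range(10))
synthetic_examples = list(range(100, 120))

batch = build_mixed_batch(real_examples, synthetic_examples)
n_real_in_batch = sum(record["source"] == "real" for record in batch)
print(f"{n_real_in_batch} of {len(batch)} records are real")
```

Keeping the provenance tag on each record is a design choice that makes it possible to change the real-to-synthetic ratio later, or to filter out generated data entirely, without having to guess where a record came from.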
