Distribution Similarity Validation between Synthetic Combined Data and Original Data
- Validate the distribution similarity for each variable.
- Validate the overall dataset distribution similarity.
Two-Dimensional Relationship Similarity Validation between Synthetic Combined Data and Original Data
Indistinguishability Validation between Synthetic Combined Data and Original Data
Risk Validation of Personal Data Identification through Synthetic Combined Data
Risk Validation of Sensitive Information Inference through Synthetic Combined Data
Risk Validation of Personal Data Inference through Synthetic Combined Data
Metric | Description | Measurement Method |
---|---|---|
Distribution similarity by variable | Visualize and measure differences in variable-wise value distributions between synthetic data and original data. | Visualize distributions |
Distribution similarity between datasets | Embed the synthetic combined data and the original data in the same embedding space, then measure the distribution similarity between the two. | KID (Kernel Inception Distance) |
2D relationship similarity | Measure the correlation coefficient for each pair of columns in the synthetic data and in the original data, then compare the corresponding coefficients 1:1. | Visualize distributions & Pearson Correlation Coefficient |
Indistinguishability | Check whether a model trained on the original data can distinguish the synthetic data from the original data. | OCC (One-Class Classification) |
Downstream model performance | Train the model on the synthetic combined data and on the original data separately, then measure the ratio of the two models' performance on test data. | Accuracy ratio |
Identification risk | Measure the risk that records in the synthetic data are identical to records in the source data. | 1:1 record comparison |
Sensitive information inference risk | Measure the risk that an attacker who knows the quasi-identifier information in the source data can infer sensitive information about an individual in the source data from the synthetic data. | TCAP (Targeted Correct Attribute Probability) |
Personal information inference risk | Measure the risk that personal information can be inferred from synthetic records that closely resemble records in the original data. | 1:1 distance comparison |
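
The two record-level risk checks in the table can be illustrated with a short Python sketch. This is not the implementation used by the service; it is a minimal illustration under stated assumptions. The function names `exact_match_rate` and `tcap_score`, the tuple-based record layout, and the sample data are all hypothetical, and the TCAP variant shown is simplified: the attacker only guesses for quasi-identifier combinations on which every matching synthetic record agrees on a single sensitive value.

```python
from collections import defaultdict


def exact_match_rate(original, synthetic):
    """Share of synthetic records that appear verbatim in the original data."""
    if not synthetic:
        return 0.0
    original_set = {tuple(row) for row in original}
    hits = sum(1 for row in synthetic if tuple(row) in original_set)
    return hits / len(synthetic)


def tcap_score(original, synthetic, key_idx, target_idx):
    """Simplified TCAP-style score (an assumption, not the official definition).

    For each quasi-identifier key on which all synthetic records agree on one
    target value, the attacker guesses that value for every original record
    with the same key. Returns the share of correct guesses, or None when no
    original record can be targeted.
    """
    # Collect the set of target values seen for each key in the synthetic data.
    targets_by_key = defaultdict(set)
    for row in synthetic:
        key = tuple(row[i] for i in key_idx)
        targets_by_key[key].add(row[target_idx])

    guesses = 0
    correct = 0
    for row in original:
        key = tuple(row[i] for i in key_idx)
        values = targets_by_key.get(key)
        # Skip keys that are absent or ambiguous in the synthetic data.
        if values is None or len(values) != 1:
            continue
        guesses += 1
        if row[target_idx] in values:
            correct += 1
    return correct / guesses if guesses else None


# Hypothetical records: (age band, sex, diagnosis).
original = [
    ("30-39", "F", "diabetes"),
    ("30-39", "F", "diabetes"),
    ("40-49", "M", "asthma"),
    ("50-59", "F", "none"),
]
synthetic = [
    ("30-39", "F", "diabetes"),
    ("40-49", "M", "none"),
    ("40-49", "M", "asthma"),
    ("50-59", "F", "asthma"),
]

print(exact_match_rate(original, synthetic))                        # 0.5
print(round(tcap_score(original, synthetic, key_idx=[0, 1], target_idx=2), 3))  # 0.667
```

In this toy data, two of the four synthetic records also exist in the original data, and the attacker guesses the diagnosis correctly for two of the three original records that can be targeted. Lower values on both scores suggest lower disclosure risk; where to set acceptable thresholds is a policy decision that this sketch does not address.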