
Product

Issue

Pseudonymization process
Loss of large amounts of information

Process of CUBIG

Fast synthetic data creation tech
Freely shareable

Validate

Validate Usability

01
Distribution Similarity Validation between Synthetic Combined Data and Original Data
- Validate the distribution similarity for each variable.
- Validate the overall dataset distribution similarity.
02
Two-Dimensional Relationship Similarity Validation between Synthetic Combined Data and Original Data
03
Indistinguishability Validation between Synthetic Combined Data and Original Data
04
Downstream Model Performance Validation between Synthetic Combined Data and Original Data
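
As a rough illustration of the two-dimensional relationship check (item 02), the sketch below compares Pearson correlation coefficients for every column pair in the original and the synthetic data. The column-dictionary layout, the sample values, and the function names `pearson` and `correlation_gaps` are assumptions made for this example; they are not taken from CUBIG's product.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / sqrt(var_x * var_y)


def correlation_gaps(original, synthetic):
    """Map each column pair to |r_original - r_synthetic|.

    Both arguments are dicts of column name -> list of numbers
    (a hypothetical layout chosen for this sketch).
    """
    gaps = {}
    for a, b in combinations(sorted(original), 2):
        r_orig = pearson(original[a], original[b])
        r_syn = pearson(synthetic[a], synthetic[b])
        gaps[(a, b)] = abs(r_orig - r_syn)
    return gaps


# Made-up sample data, for illustration only.
original = {"age": [23, 35, 47, 59], "income": [30, 45, 60, 75], "visits": [4, 3, 2, 1]}
synthetic = {"age": [25, 33, 49, 57], "income": [32, 44, 61, 73], "visits": [4, 3, 2, 1]}
gaps = correlation_gaps(original, synthetic)
```

A gap near 0 for a column pair suggests the synthetic data preserves that pairwise relationship. What size of gap counts as acceptable is not specified here.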

Validate Safety

- Risk Validation of Personal Data Identification through Synthetic Combined Data
- Risk Validation of Sensitive Information Inference through Synthetic Combined Data
- Risk Validation of Personal Data Inference through Synthetic Combined Data
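
The identification and personal data inference checks above can be pictured with a minimal sketch: count synthetic records that exactly duplicate an original record, and measure how close each synthetic record lies to its nearest original record. The tuple-based record layout, the sample values, the use of Euclidean distance, and the function names are all assumptions for this example, not CUBIG's documented method.

```python
from math import dist


def exact_match_rate(original_rows, synthetic_rows):
    """Fraction of synthetic rows that are identical to some original row."""
    if not synthetic_rows:
        return 0.0
    original_set = {tuple(row) for row in original_rows}
    matches = sum(1 for row in synthetic_rows if tuple(row) in original_set)
    return matches / len(synthetic_rows)


def nearest_distances(original_rows, synthetic_rows):
    """For each synthetic row, Euclidean distance to the closest original row.

    Assumes all fields are numeric. In practice the columns would usually
    be scaled first so that one large-valued field does not dominate.
    """
    return [min(dist(s, o) for o in original_rows) for s in synthetic_rows]


# Made-up sample records (age, income), for illustration only.
original_rows = [(34, 52000), (45, 61000), (29, 48000)]
synthetic_rows = [(34, 52000), (44, 60500), (51, 70000)]

match_rate = exact_match_rate(original_rows, synthetic_rows)
distances = nearest_distances(original_rows, synthetic_rows)
```

A non-zero match rate, or many very small nearest distances, would flag synthetic records that sit on or near real ones.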

Evaluation metrics

Metric	Description	Measurement method
Distribution similarity by variable	Visualize and measure differences in variable-wise value distributions between the synthetic data and the original data.	Visualize distributions
Distribution similarity between datasets	Embed the synthetic combined data and the original data in the same embedding space, then measure the distribution similarity between the two.	KID (Kernel Inception Distance)
2D relationship similarity	Measure the correlation coefficient between each pair of columns in the synthetic data and in the original data, then compare them 1:1.	Visualize distributions & Pearson Correlation Coefficient
Indistinguishability	Check whether a model trained on the original data can distinguish the synthetic data from the original data.	OCC (One-Class Classification)
Downstream model performance	Train a model on the synthetic combined data and another on the original data, then measure the ratio of their performance on the same test data.	Accuracy ratio
Identification risk	Measure the risk that records in the synthetic data are identical to records in the source data.	1:1 compare records
Sensitive information inference risk	Measure the risk that an attacker who knows the quasi-identifier information in the source data can infer sensitive information about an individual in the source data from the synthetic data.	TCAP (Targeted Correct Attribute Probability)
Personal information inference risk	Measure the risk that personal information can be inferred because the synthetic data contains many records that are similar to records in the original data.	1:1 compare distance
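
To make the TCAP row more concrete, here is a small sketch of one formulation found in the disclosure-risk literature: keep the synthetic records whose target value is fully determined by their quasi-identifier key within the synthetic data (WEAP of 1), then check how often the same key and target combination holds in the original data. Whether CUBIG computes TCAP this way is not stated in the source. The record layout, the column names, the `weap_threshold` parameter, and the choice to score keys missing from the original data as 0 are assumptions.

```python
from collections import Counter


def tcap(original, synthetic, key_cols, target_col, weap_threshold=1.0):
    """Mean TCAP over synthetic records whose WEAP >= weap_threshold.

    Records are dicts. key_cols are the quasi-identifiers the attacker
    is assumed to know; target_col is the sensitive attribute.
    """
    def key_of(row):
        return tuple(row[c] for c in key_cols)

    # Count key and (key, target) combinations in each dataset.
    syn_key = Counter(key_of(r) for r in synthetic)
    syn_key_target = Counter((key_of(r), r[target_col]) for r in synthetic)
    orig_key = Counter(key_of(r) for r in original)
    orig_key_target = Counter((key_of(r), r[target_col]) for r in original)

    scores = []
    for row in synthetic:
        k, t = key_of(row), row[target_col]
        # WEAP: share of synthetic records with this key that share the target.
        weap = syn_key_target[(k, t)] / syn_key[k]
        if weap < weap_threshold:
            continue
        # Assumption: a key that never occurs in the original data scores 0.
        scores.append(orig_key_target[(k, t)] / orig_key[k] if orig_key[k] else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Made-up sample records, for illustration only.
original = [
    {"age_band": "30s", "region": "N", "diagnosis": "A"},
    {"age_band": "30s", "region": "N", "diagnosis": "A"},
    {"age_band": "40s", "region": "S", "diagnosis": "B"},
    {"age_band": "40s", "region": "S", "diagnosis": "C"},
]
synthetic = [
    {"age_band": "30s", "region": "N", "diagnosis": "A"},
    {"age_band": "40s", "region": "S", "diagnosis": "B"},
    {"age_band": "50s", "region": "E", "diagnosis": "A"},
]

risk = tcap(original, synthetic, key_cols=["age_band", "region"], target_col="diagnosis")
```

A score closer to 1 means an attacker who matches on the quasi-identifiers would more often guess the sensitive attribute correctly. A score closer to 0 means the synthetic data gives little help.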

Example of distribution
Original Data and Synthetic Data
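
The source names visualization as the method for comparing a variable's distribution in the original and synthetic data. As a numeric stand-in for such a plot, the sketch below bins both samples on a shared grid and reports the total variation distance between the two histograms. The choice of this distance, the bin count, and the function names are assumptions made for illustration.

```python
def histogram(values, lo, hi, bins):
    """Normalized histogram of values over [lo, hi] with equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # The top edge falls into the last bin; a zero-width range uses bin 0.
        idx = min(int((v - lo) / width), bins - 1) if width else 0
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]


def total_variation(original, synthetic, bins=10):
    """Total variation distance between two samples' histograms (0 to 1)."""
    lo = min(min(original), min(synthetic))
    hi = max(max(original), max(synthetic))
    p = histogram(original, lo, hi, bins)
    q = histogram(synthetic, lo, hi, bins)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Made-up sample values for one variable, for illustration only.
original_ages = [23, 35, 35, 47, 59, 61]
synthetic_ages = [25, 33, 36, 49, 57, 60]
distance = total_variation(original_ages, synthetic_ages, bins=5)
```

A distance of 0 means the two binned distributions coincide, and 1 means they do not overlap at all. The result depends on the bin count, so it complements a visual check and does not replace it.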

DTS
- Data Transform System
- Synthetic data generation program
azoo
- (Synthetic) data trading marketplace
- Nurturing data sellers and profit sharing
LLM CAPSULE
- Large language model equipped with security technology
- ChatGPT + Sensitive information protection tech