Product

Issue

Pseudonymization process
Loss of large amounts of information

Process of CUBIG

Fast synthetic data creation tech
Freely shareable

Validate

Validate Usability

  • 01

    Distribution Similarity Validation between Synthetic Combined Data and Original Data
    - Validate the distribution similarity for each variable.
    - Validate the overall dataset distribution similarity.

  • 02

    Two-Dimensional Relationship Similarity Validation between Synthetic Combined Data and Original Data

  • 03

    Indistinguishability Validation between Synthetic Combined Data and Original Data

  • 04

    Downstream Model Performance Validation between Synthetic Combined Data and Original Data
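
The two-dimensional relationship check in item 02 can be sketched roughly as follows. This is a minimal illustration, not CUBIG's implementation: the column-dictionary layout, the function names `pearson` and `correlation_gaps`, and the sample values are all assumptions made for the example. It computes the Pearson correlation coefficient for every pair of columns in each dataset and reports how far the synthetic value drifts from the original one.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Assumes neither sequence is constant (a constant column has zero variance).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_gaps(original, synthetic):
    """Absolute difference in Pearson r for every column pair.

    Both arguments map column name -> list of numeric values
    (a hypothetical layout chosen for this sketch).
    """
    gaps = {}
    for a, b in combinations(sorted(original), 2):
        r_orig = pearson(original[a], original[b])
        r_syn = pearson(synthetic[a], synthetic[b])
        gaps[(a, b)] = abs(r_orig - r_syn)
    return gaps


# Made-up illustrative values, not real data.
original = {"age": [25, 32, 47, 51, 62], "income": [30, 42, 58, 61, 75]}
synthetic = {"age": [27, 30, 45, 55, 60], "income": [33, 40, 55, 65, 72]}
gaps = correlation_gaps(original, synthetic)
```

A gap near 0 for a column pair suggests the synthetic data preserves that pairwise relationship; what counts as an acceptable gap is a judgment call that this sketch does not settle.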

Validate Safety

  • Risk Validation of Personal Data Identification through Synthetic Combined Data

  • Risk Validation of Sensitive Information Inference through Synthetic Combined Data

  • Risk Validation of Personal Data Inference through Synthetic Combined Data
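
Two of the safety checks listed above, identification risk and personal data inference risk, lend themselves to a short sketch. The code below is an assumption-laden illustration rather than CUBIG's method: records are modeled as tuples of already-scaled numeric features, and the function names `exact_match_rate` and `distance_to_closest_record` are hypothetical. The first function compares records one-to-one for exact copies; the second measures how close each synthetic record sits to its nearest original record.

```python
from math import dist


def exact_match_rate(original, synthetic):
    """Share of synthetic records that are identical to some original record."""
    original_set = set(original)
    return sum(rec in original_set for rec in synthetic) / len(synthetic)


def distance_to_closest_record(original, synthetic):
    """For each synthetic record, Euclidean distance to its nearest original record.

    Assumes all features are numeric and on comparable scales.
    """
    return [min(dist(s, o) for o in original) for s in synthetic]


# Made-up illustrative records (two numeric features each).
original = [(25.0, 30.0), (32.0, 42.0), (47.0, 58.0)]
synthetic = [(25.0, 30.0), (35.0, 46.0), (50.0, 62.0)]

match_rate = exact_match_rate(original, synthetic)       # one of three records is a copy
dcr = distance_to_closest_record(original, synthetic)    # [0.0, 5.0, 5.0]
```

Very small distances flag synthetic records that nearly reproduce a real individual. The brute-force nearest-record search is quadratic in the number of records, so a larger dataset would likely need an index structure instead.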

Evaluation metrics

| Metric | Description | Measurement method |
|---|---|---|
| Distribution similarity by variable | Visualize and measure differences in per-variable value distributions between the synthetic data and the original data. | Distribution visualization |
| Distribution similarity between datasets | Transform the synthetic combined data and the original data into the same embedding space and measure the distribution similarity between the two. | KID (Kernel Inception Distance) |
| 2D relationship similarity | Measure the correlation coefficients between columns in the synthetic data and in the original data, then compare them one-to-one. | Distribution visualization and Pearson correlation coefficient |
| Indistinguishability | Check whether a model trained on the original data can distinguish the synthetic data from the original data. | OCC (One-Class Classification) |
| Downstream model performance | Train a model on the synthetic combined data and on the original data, then measure the ratio of their performance on test data. | Accuracy ratio |
| Identification risk | Measure the risk that records in the original data also appear in the synthetic data. | One-to-one record comparison |
| Sensitive information inference risk | Measure the risk that an attacker who knows the quasi-identifiers in the original data can infer sensitive information about an individual from the synthetic data. | TCAP (Targeted Correct Attribute Probability) |
| Personal information inference risk | Measure the risk that personal information can be inferred because the synthetic data contains many records that closely resemble the original data. | One-to-one distance comparison |
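
To make the TCAP entry more concrete, here is a simplified TCAP-style score. It is a sketch under stated assumptions, not CUBIG's implementation and not a complete reproduction of the published metric: the record layout (dictionaries), the field names, and the function name `tcap` are hypothetical. The attacker is assumed to trust only quasi-identifier combinations for which every synthetic record agrees on the sensitive value, and the score is the fraction of matched original records for which that guess is correct.

```python
from collections import Counter, defaultdict


def tcap(original, synthetic, key_fields, target_field):
    """Simplified TCAP-style score in [0, 1]; higher means more inference risk."""
    # Group synthetic target values by quasi-identifier combination.
    targets_by_key = defaultdict(Counter)
    for rec in synthetic:
        key = tuple(rec[f] for f in key_fields)
        targets_by_key[key][rec[target_field]] += 1

    # Keep only keys whose synthetic records all agree on the target value.
    confident_guess = {
        key: next(iter(counts))
        for key, counts in targets_by_key.items()
        if len(counts) == 1
    }

    matched = correct = 0
    for rec in original:
        key = tuple(rec[f] for f in key_fields)
        if key in confident_guess:
            matched += 1
            correct += confident_guess[key] == rec[target_field]
    return correct / matched if matched else 0.0


# Made-up illustrative records.
original = [
    {"age_band": "30s", "region": "A", "diagnosis": "flu"},
    {"age_band": "30s", "region": "B", "diagnosis": "asthma"},
    {"age_band": "40s", "region": "A", "diagnosis": "diabetes"},
    {"age_band": "50s", "region": "B", "diagnosis": "flu"},
]
synthetic = [
    {"age_band": "30s", "region": "A", "diagnosis": "flu"},
    {"age_band": "30s", "region": "A", "diagnosis": "flu"},
    {"age_band": "30s", "region": "B", "diagnosis": "flu"},
    {"age_band": "40s", "region": "A", "diagnosis": "diabetes"},
    {"age_band": "40s", "region": "A", "diagnosis": "asthma"},
]

score = tcap(original, synthetic, ["age_band", "region"], "diagnosis")  # 0.5
```

In this toy data two original records match a confident synthetic key and one of the two guesses is right, so the score is 0.5. Whether a given score is acceptable depends on a baseline, such as the frequency of the most common sensitive value, which this sketch does not compute.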

Example of distribution
Original Data and Synthetic Data

  • DTS

    • Data Transform System

    • Synthetic data generation program

  • azoo

    • (Synthetic) data trading marketplace

    • Nurturing data sellers and profit sharing

  • LLM CAPSULE

    • Large language model equipped with security technology

    • ChatGPT + sensitive information protection technology
