Privacy-Preserving Synthetic Data for ML: The Role of Masked Language Models

8 Apr 2025
  Abstract & Introduction
  Proposal
    Classification Target
    Masked Conditional Density Estimation (MaCoDE)
  Theoretical Results
    With Missing Data
  Experiments
  Results
  Related Works
  Conclusions and Limitations
  References
  A1 Proof of Theorem 1
  A2 Proof of Proposition 1
  A3 Dataset Descriptions
  A4 Missing Mechanism
  A5 Experimental Settings for Reproduction
  A6 Additional Experiments
  A7 Detailed Experimental Results

A.1 Proof of Theorem 1

Proof. This proof is based on Theorem 6.11 of [50] and Theorem 1 of [29].

Thus, for every ϵ > 0,

(B) Furthermore, by the continuous mapping theorem and the algebra of convergence in probability, for every ϵ > 0,
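
For readers unfamiliar with the two tools named in step (B), their standard textbook statements are sketched below in generic notation. The symbols X_n, Y_n, X, g, a, and b are illustrative assumptions and are not the paper's own notation; this is background only, not a reconstruction of the proof's displayed equations.

```latex
% Generic notation (assumed for illustration): X_n, Y_n are random variables,
% X is a random variable, g is a continuous function, a and b are constants.

% Continuous mapping theorem, stated for convergence in probability:
\[
X_n \xrightarrow{p} X \ \text{and}\ g \ \text{continuous}
\;\Longrightarrow\;
g(X_n) \xrightarrow{p} g(X),
\]
\[
\text{i.e.}\quad
\lim_{n \to \infty} \Pr\bigl( \lvert g(X_n) - g(X) \rvert > \epsilon \bigr) = 0
\quad \text{for every } \epsilon > 0 .
\]

% Algebra of convergence in probability (sums and products of limits):
\[
X_n \xrightarrow{p} a,\quad Y_n \xrightarrow{p} b
\;\Longrightarrow\;
X_n + Y_n \xrightarrow{p} a + b
\quad\text{and}\quad
X_n Y_n \xrightarrow{p} ab .
\]
```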

A.2 Proof of Proposition 1

A.3 Dataset Descriptions

Download links.

• abalone: https://archive.ics.uci.edu/dataset/1/abalone

• banknote: https://archive.ics.uci.edu/dataset/267/banknote+authentication

• breast: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic

• concrete: https://archive.ics.uci.edu/dataset/165/concrete+compressive+strength

• covertype: https://www.kaggle.com/datasets/uciml/forest-cover-type-dataset

• kings: https://www.kaggle.com/datasets/harlfoxem/housesalesprediction

• letter: https://archive.ics.uci.edu/dataset/59/letter+recognition

• loan: https://www.kaggle.com/datasets/teertha/personal-loan-modeling

• redwine: https://archive.ics.uci.edu/dataset/186/wine+quality

• whitewine: https://archive.ics.uci.edu/dataset/186/wine+quality
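
The links above can be kept in a small registry so that a reproduction script knows which host each dataset comes from. The following is a minimal sketch, not part of the authors' code: the names `DATASET_URLS`, `source_of`, `uci_id`, and `group_by_source` are hypothetical helpers introduced here, and the sketch only parses the URLs listed above without downloading anything.

```python
from urllib.parse import urlparse

# Dataset name -> download page, copied from the list above.
DATASET_URLS = {
    "abalone": "https://archive.ics.uci.edu/dataset/1/abalone",
    "banknote": "https://archive.ics.uci.edu/dataset/267/banknote+authentication",
    "breast": "https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic",
    "concrete": "https://archive.ics.uci.edu/dataset/165/concrete+compressive+strength",
    "covertype": "https://www.kaggle.com/datasets/uciml/forest-cover-type-dataset",
    "kings": "https://www.kaggle.com/datasets/harlfoxem/housesalesprediction",
    "letter": "https://archive.ics.uci.edu/dataset/59/letter+recognition",
    "loan": "https://www.kaggle.com/datasets/teertha/personal-loan-modeling",
    "redwine": "https://archive.ics.uci.edu/dataset/186/wine+quality",
    "whitewine": "https://archive.ics.uci.edu/dataset/186/wine+quality",
}


def source_of(url: str) -> str:
    """Return 'uci' or 'kaggle' depending on the host of the URL (hypothetical helper)."""
    host = urlparse(url).netloc
    if host == "archive.ics.uci.edu":
        return "uci"
    if host == "www.kaggle.com":
        return "kaggle"
    raise ValueError(f"unrecognized dataset host: {host}")


def uci_id(url: str) -> int:
    """Extract the numeric id from a UCI URL of the form /dataset/<id>/<slug>."""
    if source_of(url) != "uci":
        raise ValueError("not a UCI repository URL")
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) < 2 or parts[0] != "dataset":
        raise ValueError(f"unexpected UCI path: {url}")
    return int(parts[1])


def group_by_source(urls: dict) -> dict:
    """Group dataset names by hosting site, preserving the listing order."""
    groups = {}
    for name, url in urls.items():
        groups.setdefault(source_of(url), []).append(name)
    return groups


if __name__ == "__main__":
    for source, names in group_by_source(DATASET_URLS).items():
        print(source, names)
```

Note that redwine and whitewine point to the same UCI page, so a script built on this registry would need to pick the red or white file from that single download.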

Authors:

(1) Seunghwan An, Department of Statistical Data Science, University of Seoul, S. Korea (dkstmdghks79@uos.ac.kr);

(2) Gyeongdong Woo, Department of Statistical Data Science, University of Seoul, S. Korea (dngudxor23@uos.ac.kr);

(3) Jaesung Lim, Department of Statistical Data Science, University of Seoul, S. Korea (wotjd1410@uos.ac.kr);

(4) ChangHyun Kim, Department of Statistical Data Science, University of Seoul, S. Korea (hahaha503@uos.ac.kr);

(5) Sungchul Hong, Department of Statistics, University of Seoul, S. Korea (shong@uos.ac.kr);

(6) Jong-June Jeon (corresponding author), Department of Statistics, University of Seoul, S. Korea (jj.jeon@uos.ac.kr).


This paper is available on arXiv under the CC BY-NC-SA 4.0 Deed license.