From NLP to Data Synthesis: The Surprising Power of Masked Language Models

8 Apr 2025
  1. Abstract & Introduction
  2. Proposal
    1. Classification Target
    2. Masked Conditional Density Estimation (MaCoDE)
  3. Theoretical Results
    1. With Missing Data
  4. Experiments
  5. Results
    1. Related Works
    2. Conclusions and Limitations
    3. References
  6. A1 Proof of Theorem 1
    1. A2 Proof of Proposition 1
    2. A3 Dataset Descriptions
  7. A4 Missing Mechanism
    1. A5 Experimental Settings for Reproduction
  8. A6 Additional Experiments
  9. A7 Detailed Experimental Results

2. Proposal

Figure 1: Overall structure and training process of MaCoDE. In this case, the value of the second column is masked (replaced with ‘0’) and predicted.

2.1 Classification Target (Discretization)

2.2 Masked Conditional Density Estimation (MaCoDE)

Definition 2 (Mask distribution [13, 19]). The distribution of mask vector m is defined as:

Synthetic data generation. Unlike natural language, tabular data has no inherent ordering among its columns [13]. Therefore, as outlined in Algorithm 2, MaCoDE generates one column at a time in random order, so that the number of masked entries decreases from p to 1 (p → p − 1 → · · · → 2 → 1). As shown in [13], under the mask distribution of Definition 2, the distribution of the number of masked entries is the same during training and generation.
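The generation loop described above can be sketched roughly as follows. This is an illustrative sketch, not the authors' Algorithm 2. The names `sample_mask`, `generate_row`, and `toy_conditional_probs` are hypothetical, the uniform "model" stands in for a trained network, and labeling bins 1..n_bins is an assumption made so that 0 can serve as the mask token, as in Figure 1. The training-time mask sampler assumes the number of masked entries is drawn uniformly from 1 to p and the masked subset uniformly among subsets of that size; the text does not reproduce the formula of Definition 2, so treat that as an assumption as well.

```python
import numpy as np

def sample_mask(p, rng):
    """Training-time mask (assumed scheme, not taken from the text):
    draw the number of masked entries uniformly from 1..p, then mask
    a uniformly random subset of columns of that size."""
    k = rng.integers(1, p + 1)  # upper bound is exclusive
    mask = np.zeros(p, dtype=bool)
    mask[rng.choice(p, size=k, replace=False)] = True
    return mask

def toy_conditional_probs(row, mask, col, n_bins):
    """Hypothetical stand-in for a trained model: returns a uniform
    distribution over the bins of column `col`."""
    return np.full(n_bins, 1.0 / n_bins)

def generate_row(p, n_bins, predict_probs, rng):
    """Generate one synthetic row, one column at a time in random order.
    The number of masked entries goes p -> p-1 -> ... -> 1."""
    row = np.zeros(p, dtype=int)   # 0 marks a masked entry
    mask = np.ones(p, dtype=bool)  # start with every column masked
    masked_counts = []
    for col in rng.permutation(p):  # random column order
        masked_counts.append(int(mask.sum()))
        probs = predict_probs(row, mask, col, n_bins)
        row[col] = rng.choice(n_bins, p=probs) + 1  # bins labeled 1..n_bins
        mask[col] = False  # column is now observed
    return row, masked_counts

rng = np.random.default_rng(0)
row, counts = generate_row(p=5, n_bins=4, predict_probs=toy_conditional_probs, rng=rng)
print(counts)  # number of masked entries seen at each step
```

Under these assumptions, every masked-subset size from p down to 1 that occurs during generation is also a size the training sampler can produce. This mirrors the matching property attributed to [13].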

Figure 2: Trade-off between quality and privacy. Left: feature selection performance. Right: DCR. Error bars represent standard errors. See Appendix A.6.1 for detailed results.

Authors:

(1) Seunghwan An, Department of Statistical Data Science, University of Seoul, S. Korea (dkstmdghks79@uos.ac.kr);

(2) Gyeongdong Woo, Department of Statistical Data Science, University of Seoul, S. Korea (dngudxor23@uos.ac.kr);

(3) Jaesung Lim, Department of Statistical Data Science, University of Seoul, S. Korea (wotjd1410@uos.ac.kr);

(4) ChangHyun Kim, Department of Statistical Data Science, University of Seoul, S. Korea (hahaha503@uos.ac.kr);

(5) Sungchul Hong, Department of Statistics, University of Seoul, S. Korea (shong@uos.ac.kr);

(6) Jong-June Jeon (corresponding author), Department of Statistics, University of Seoul, S. Korea (jj.jeon@uos.ac.kr).


This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.