Abstract
Clinical research relies on high-quality patient data; however,
obtaining big data sets is costly and access to existing data is
often hindered by privacy and regulatory concerns. Synthetic data
generation holds the promise of effectively bypassing these
barriers, allowing for simplified data access and the
prospect of synthetic control cohorts. We employed two different
methodologies of generative artificial intelligence, CTAB-GAN+
and normalizing flows (NFlow), to synthesize patient data
derived from 1606 patients with acute myeloid leukemia, a
heterogeneous hematological malignancy, who were treated within
four multicenter clinical trials. Both generative models
accurately captured distributions of demographic, laboratory,
molecular and cytogenetic variables, as well as patient outcomes,
yielding high performance scores regarding fidelity and usability
of both synthetic cohorts (n = 1606 each). Survival analysis
demonstrated close resemblance of survival curves between
original and synthetic cohorts. Inter-variable relationships were
preserved in univariable outcome analysis, enabling explorative
analysis in our synthetic data. Additionally, training sample
privacy is safeguarded, mitigating the risk of patient
re-identification, which we quantified using Hamming distances.
We provide not only a proof-of-concept for synthetic data
generation in multimodal clinical data for rare diseases, but
also full public access to synthetic data sets to foster further
research.
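
The abstract does not specify how the Hamming distances were computed. As a minimal sketch, assuming every variable is categorically coded and that privacy is judged by how close each synthetic record lies to its nearest original record, the check could look as follows. The function name and the toy arrays are hypothetical and are not taken from the study data.

```python
import numpy as np


def min_hamming_distances(synthetic, original):
    """For each synthetic record, return the smallest Hamming distance
    (number of differing features) to any original record."""
    synthetic = np.asarray(synthetic)
    original = np.asarray(original)
    # Compare every synthetic row with every original row;
    # the boolean result has shape (n_synthetic, n_original, n_features).
    mismatches = synthetic[:, None, :] != original[None, :, :]
    distances = mismatches.sum(axis=2)
    return distances.min(axis=1)


# Invented toy data: rows are patients, columns are categorical features.
original = np.array([[1, 0, 2, 1],
                     [0, 1, 0, 0],
                     [1, 1, 2, 0]])
synthetic = np.array([[1, 0, 2, 0],
                      [0, 1, 0, 0]])

nearest = min_hamming_distances(synthetic, original)
exact_copies = int((nearest == 0).sum())
print(nearest, exact_copies)
```

A nearest distance of 0 marks a synthetic record that exactly reproduces an original one; larger minimum distances indicate that no training record was copied.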