For ethical and scientific reasons, biomedical research must become open to all.

This implies that protocols and statistical analysis plans must be transparent, and also that datasets must be freely available. Unfortunately, this last condition is often impossible to meet because it conflicts with other ethical constraints, in particular the right to privacy.

How can good statistical properties be preserved?

The open CESP project tries to overcome this paradox. Its ultimate goal is to make most of the datasets used by CESP researchers freely available, without any conditions.

To protect the right to privacy of the patients and subjects included in our studies, the datasets offered here are NOT the original ones, but synthetic or cloned datasets.

From a formal point of view, these synthetic datasets are designed to have the same joint probability distribution as the original datasets they imitate.
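As a rough illustration of this idea, the following Python sketch fits a very simple joint model (a bivariate normal distribution) to a toy "original" dataset and then draws new records from the fitted model. Everything in it is an assumption made for illustration: the variables (age and systolic blood pressure), the invented data, and the function names fit_bivariate_normal and sample_synthetic are hypothetical and are not part of the project or of any library. Real synthesizers use much richer models, and the synthetic data match the original distribution only as well as the chosen model fits it.

```python
import math
import random
import statistics


def fit_bivariate_normal(rows):
    """Estimate means, standard deviations and correlation of (x, y) pairs."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return mx, my, sx, sy, cov / (sx * sy)


def sample_synthetic(params, n, rng):
    """Draw n new (x, y) records from the fitted bivariate normal model."""
    mx, my, sx, sy, rho = params
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    records = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mixing z1 and z2 gives y the fitted correlation with x.
        y = my + sy * (rho * z1 + residual * z2)
        records.append((x, y))
    return records


rng = random.Random(0)

# Toy "original" data, invented for this example: (age, systolic pressure).
original = []
for _ in range(2000):
    age = rng.gauss(50, 10)
    original.append((age, 100 + 0.5 * age + rng.gauss(0, 8)))

params_orig = fit_bivariate_normal(original)
synthetic = sample_synthetic(params_orig, len(original), rng)
params_syn = fit_bivariate_normal(synthetic)

print("original  (mean_x, mean_y, sd_x, sd_y, corr):", params_orig)
print("synthetic (mean_x, mean_y, sd_x, sd_y, corr):", params_syn)
```

The two printed summaries should be close to each other, while no synthetic record is copied from the original table. This is the sense in which a synthetic dataset can preserve statistical properties without exposing individual records, under the assumption that the fitted model is adequate.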

This has been made possible thanks to an incredible ecosystem of open-source libraries in R and Python.

SynthPop

The synthpop package for R allows users to create synthetic versions of confidential individual level data for use by researchers interested in making inferences about the population that the data represent. They can be used to carry out statistical analyses, though we would usually recommend conducting an analysis of the original data to confirm the results. Synthetic data are also useful for providing data sets for teaching.

https://synthpop.org.uk/

SDV

The Synthetic Data Vault (SDV) is an ecosystem of Python libraries for synthetic data generation. It allows users to easily learn single-table, multi-table and time-series datasets, and then to generate new synthetic data that has the same format and statistical properties as the original dataset.

https://sdv.dev/