A&E Synthetic Data
The synthetic A&E extract, “SynAE”, is the result of an NHS England pilot project to widen data sharing without loss of privacy for patients.
Synthetic extracts use statistical models to create sharable datasets which maintain patient confidentiality whilst retaining the characteristics, and hence value, of the real data. Interest in the creation of synthetic health data is increasing as it is a potential enabler for many health information uses, such as research studies, imputation of missing data and app development. This work is conducted in line with NHS England’s mandate from Government which gives us a responsibility for developing organisational policies within the open data and transparency agendas.
Creating synthetic data is not straightforward and, done inappropriately, it can lead to misuse of private data or allow invalid conclusions to be inferred. NHS England has been developing its knowledge and methodology over the past year and we continue to collaborate with other government organisations to seek best practice and practical applications of synthetic data. There is a range of types of synthetic data, which the Office for National Statistics has summarised.
Impact and Appropriate Use
Users of these data MUST NOT attempt to re-identify individual patient records.
This dataset was created by sampling from a statistical model. The synthetic records are not real data, and the relationships between variables have been eroded to a degree by the synthesis process.
Thus, data in this release should not be used:
- To change operational practice based on relationships inferred from these data;
- To monitor performance or manage contracts;
- For direct patient care.
These data could be used:
- As an example of type, size and complexity of real A&E activity data;
- To assist in proofs of concept or development of new tools or pieces of analysis;
- To test and develop apps or products.
A synthetic methodology must balance privacy concerns against the retention of value found in the raw data. For these data, the clear priority is that the final extract maintains absolute privacy. The use case is an open extract that includes geographic, demographic, health and time information. If we limited access to the data, or reduced the combination of information available within it, then methods could be used which maintain higher data value, but at a potential cost to privacy. Alternative methods have been considered and tried, but have proved either inappropriate or heavy-handed for this project.
To ensure absolute privacy within the data, the raw datasets were processed to remove some granularity and disclosive information. This included:
- removing geographical information in favour of average demographic information for an area;
- grouping variables with high granularity into bands (e.g. age bands);
- removing granular time information such as hour/minute of admission;
- removing rare values (e.g. Age = 0) or capping integer variables (e.g. a 200-mile upper limit on distance from residence to provider);
- removing unique values and subsets (e.g. ensuring that there are more than 7 records for any combination of demographic variables).
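The banding, capping and small-group suppression steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the processing actually applied to the raw data: the field names (`age`, `sex`, `distance_miles`), the 10-year band width and the choice to drop a whole record when it holds a rare value are all hypothetical. Only the 200-mile cap and the "more than 7" threshold come from the description above.

```python
from collections import Counter

# Assumed values for illustration; only the 200-mile cap and the
# "more than 7" rule are taken from the text.
MAX_DISTANCE_MILES = 200
MIN_GROUP_SIZE = 8      # i.e. more than 7 records per combination
AGE_BAND_WIDTH = 10     # hypothetical band width


def age_band(age):
    """Map an exact age onto a band label such as '20-29'."""
    lower = (age // AGE_BAND_WIDTH) * AGE_BAND_WIDTH
    return f"{lower}-{lower + AGE_BAND_WIDTH - 1}"


def coarsen(record):
    """Band the age and cap the distance; return None for a rare value.

    Dropping the whole record when age == 0 is an assumption.
    """
    if record["age"] == 0:
        return None
    return {
        "age_band": age_band(record["age"]),
        "sex": record["sex"],
        "distance_miles": min(record["distance_miles"], MAX_DISTANCE_MILES),
    }


def suppress_small_groups(records, keys=("age_band", "sex")):
    """Drop records whose demographic combination is too rare."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return [
        r for r in records
        if counts[tuple(r[k] for k in keys)] >= MIN_GROUP_SIZE
    ]


def preprocess(raw_records):
    """Apply coarsening, then small-group suppression."""
    coarse = [c for c in map(coarsen, raw_records) if c is not None]
    return suppress_small_groups(coarse)


# Made-up example records, not real A&E data.
raw = [{"age": 25, "sex": "F", "distance_miles": 350}] * 8
raw += [
    {"age": 0, "sex": "M", "distance_miles": 5},
    {"age": 91, "sex": "M", "distance_miles": 3},
]
cleaned = preprocess(raw)
print(len(cleaned), cleaned[0])
```

In this made-up example the eight identical records survive with their distance capped at 200 miles, while the record with Age = 0 and the single record in the 90-99 band are removed.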
Using the data transformed and grouped in the ways described above, the synthetic data were created using an R package called “bnlearn” to fit a Bayesian network. This produces a data model based on a series of conditional probabilities in a hierarchical structure. Starting at the “top” of the structure, we sample each variable in turn from its distribution; each sampled value narrows the range of allowable values from which later variables can be sampled. This maintains many of the relationships within the raw data whilst creating brand new synthetic records.
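The top-down sampling described above is often called ancestral sampling. The following minimal Python sketch shows the idea on a three-variable chain. It is not the R package's API and not the fitted SynAE model: the variable names, the network structure and every probability in the tables are invented for illustration.

```python
import random

# Hypothetical network: age_band -> arrival_mode -> disposal.
# All names and probabilities below are made up for illustration.
P_AGE = {"0-17": 0.2, "18-64": 0.5, "65+": 0.3}
P_MODE_GIVEN_AGE = {
    "0-17": {"ambulance": 0.1, "walk-in": 0.9},
    "18-64": {"ambulance": 0.2, "walk-in": 0.8},
    "65+": {"ambulance": 0.6, "walk-in": 0.4},
}
P_DISPOSAL_GIVEN_MODE = {
    "ambulance": {"admitted": 0.5, "discharged": 0.5},
    "walk-in": {"admitted": 0.1, "discharged": 0.9},
}


def draw(distribution, rng):
    """Draw one value from a {value: probability} table."""
    values = list(distribution)
    weights = list(distribution.values())
    return rng.choices(values, weights=weights, k=1)[0]


def sample_record(rng):
    """Sample parents first, then each child given its parent's value."""
    age = draw(P_AGE, rng)
    mode = draw(P_MODE_GIVEN_AGE[age], rng)
    disposal = draw(P_DISPOSAL_GIVEN_MODE[mode], rng)
    return {"age_band": age, "arrival_mode": mode, "disposal": disposal}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
synthetic = [sample_record(rng) for _ in range(5)]
for row in synthetic:
    print(row)
```

Each synthetic record is built from the conditional tables alone, so no row is copied from the source data, yet the parent-child relationships encoded in the tables carry through to the output.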
During this process we lost some value when transforming and grouping the data, and some more when creating the Bayesian network, because the model has to make simplifying assumptions in order to find a solution. For example, the Markov property means that a higher-order correlation between a node and its grandparent is lost, so the conditional probabilities do not fully match reality. The quality of the synthetic data was tested by investigating the probability distributions and variable correlations, which were found to re-create the raw data well. However, further tests using the Voas-Williamson statistic show that the variability in the data has changed dramatically, significantly reducing the value retained.
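The loss caused by the Markov property can be seen in a deliberately extreme toy case. In the sketch below, which uses invented binary variables `a`, `b` and `c` rather than anything from the A&E data, `c` is fully determined by its parent `b` and its grandparent `a` together. A chain model `a -> b -> c` only stores P(c | b), so it assigns probability to combinations that never occur in the source.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Toy "real" data: c is determined by a and b together (c = a XOR b).
real = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)]
n = len(real)

# Fit the chain a -> b -> c from counts: P(a), P(b | a), P(c | b).
count_a = Counter(a for a, _, _ in real)
count_ab = Counter((a, b) for a, b, _ in real)
count_b = Counter(b for _, b, _ in real)
count_bc = Counter((b, c) for _, b, c in real)


def chain_prob(a, b, c):
    """Joint probability of (a, b, c) under the fitted chain model."""
    p_a = Fraction(count_a[a], n)
    p_b_given_a = Fraction(count_ab[(a, b)], count_a[a])
    p_c_given_b = Fraction(count_bc[(b, c)], count_b[b])
    return p_a * p_b_given_a * p_c_given_b


real_counts = Counter(real)
impossible = (0, 0, 1)  # this combination never occurs in the toy data
print("real:", Fraction(real_counts[impossible], n))
print("chain model:", chain_prob(*impossible))
```

The pairwise distributions of (a, b) and (b, c) are reproduced exactly, which is why checks on marginals and pairwise correlations can look good while the joint behaviour of the data has still changed.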
The data and methodology have been extensively tested and tuned to ensure that any attempted re-identification of an individual’s record will produce false results (i.e. any record which appears to match known data does so purely by chance and is not a re-creation of original data). It has also been ensured that outliers are not accidentally re-created when sampling from the data model. Thus, we have produced an extract which contains new data, maintains the privacy of rare values, and ensures that no reverse-engineering can prove that any record directly matches the original data.
Two standard datasets have been used: A&E activity data and Admitted Patient Care data, both of which are taken from SUS data provided by NHS Digital.
Data File Description
| Field | Value |
|---|---|
| Coverage Start Date | 2014-03-23 |
| Coverage End Date | 2018-03-22 |
| License | UK Open Government Licence (OGL) |
| Publisher | Data Catalogue Team |
| Maintainer Email | Data Services |