In recent decades, the role of data has extended beyond research to become a key component of decision-making in the business world. For all its growing worth, though, privacy concerns continue to make direct data sharing difficult among companies, organizations, and researchers.
DoppelGANger is a tool recently proposed by researchers from Carnegie Mellon University and IBM for sharing time series data with high fidelity and promising privacy properties.
Data holders can train DoppelGANger on their data and then release either the trained model or synthetic data generated by it. The synthetic data captures the properties of the original data while hiding certain sensitive information if needed.
DoppelGANger’s key idea
The overarching question we sought to answer in developing DoppelGANger was: Can we create high-fidelity synthetic datasets for networking and systems applications that require minimal human effort in the sharing process? Such a toolkit could enhance the potential of data-driven techniques by making it easier to obtain and share data.
As part of this work, we explored if and how we could leverage recent advances in Generative Adversarial Networks (GANs). The primary benefit GANs offer is the ability to learn complicated datasets, as evidenced by the excitement around generating photo-realistic images. A secondary benefit is that GANs allow users to flexibly tune generation (such as augmenting anomalous or sparse events), which would not be possible with raw or anonymized datasets.
However, applying GANs directly does not work well, for two reasons: the time series can be much longer than those considered in prior GAN literature, and the data contains complicated correlations that vanilla GANs cannot model well. DoppelGANger introduces several new design choices in the network architecture to extend GANs' capability to the time series data we care about.
DoppelGANger focuses on time series data, a prevalent data type in networking, systems, security, finance, and many other fields. It can handle time series that are multi-dimensional and that mix continuous and categorical components.
For example, we have tested DoppelGANger on web traffic data (such as daily page views of websites), network measurements (such as traffic counters and packet loss rates of clients at different locations), and cluster traces (such as CPU/memory usage of jobs).
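To make "multi-dimensional with mixed continuous/categorical components" concrete, here is a minimal Python sketch of how one such series might be laid out as a numeric array before being handed to a generative model. The function name `encode_series`, the job states, and the choice of min-max scaling plus one-hot encoding are illustrative assumptions, not DoppelGANger's required input format; the project's own instructions remain the authoritative reference for that.

```python
import numpy as np

def encode_series(cpu_usage, job_state, states):
    """Encode one mixed-type time series as a (time_steps, dims) float array.

    Hypothetical helper for illustration only.
    cpu_usage: sequence of floats (continuous component)
    job_state: sequence of labels (categorical component)
    states:    ordered list of all possible labels
    """
    cpu = np.asarray(cpu_usage, dtype=np.float64)
    # Min-max scale the continuous component to [0, 1];
    # a constant series is mapped to all zeros to avoid dividing by zero.
    span = cpu.max() - cpu.min()
    cpu_scaled = (cpu - cpu.min()) / span if span > 0 else np.zeros_like(cpu)

    # One-hot encode the categorical component: one column per possible label.
    index = {s: i for i, s in enumerate(states)}
    one_hot = np.zeros((len(job_state), len(states)))
    one_hot[np.arange(len(job_state)), [index[s] for s in job_state]] = 1.0

    # Column 0 is the continuous value; the remaining columns are the one-hot label.
    return np.column_stack([cpu_scaled, one_hot])

STATES = ["running", "waiting", "done"]  # assumed label set for this toy example
sample = encode_series([10.0, 30.0, 20.0], ["waiting", "running", "done"], STATES)
print(sample.shape)
```

Each row is one time step, so a cluster trace with one continuous measurement and a three-valued state becomes a four-column array.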
Although DoppelGANger was originally designed for networking and systems data, it is applicable in other domains as well. For example, several independent companies/users have already used DoppelGANger for banking transactions, sensor data, road traffic, and much more.
A diverse set of evaluations shows that DoppelGANger captures the characteristics of data better than traditional time series modelling tools, such as the Hidden Markov Model and autoregressive models, and better than recent deep learning models such as TimeGAN and RCGAN (Figure 1).
Several companies have independently tried DoppelGANger on their datasets with promising results, and have written blog posts to share their experience.
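One common way to check whether synthetic time series capture temporal characteristics is to compare the average autocorrelation of the real and synthetic data. The sketch below is a generic illustration of that idea on toy data. It is not claimed to be the evaluation procedure behind the results above, and the function names and the toy weekly-pattern series are assumptions made for the example.

```python
import numpy as np

def autocorrelation(series, max_lag):
    """Sample autocorrelation of a 1-D series for lags 0..max_lag."""
    x = np.asarray(series, dtype=np.float64)
    x = x - x.mean()
    denom = np.dot(x, x)
    if denom == 0:
        # Constant series: autocorrelation is undefined, report zeros.
        return np.zeros(max_lag + 1)
    return np.array(
        [np.dot(x[: len(x) - lag], x[lag:]) / denom for lag in range(max_lag + 1)]
    )

def mean_autocorrelation(dataset, max_lag):
    """Average the autocorrelation curves of all series in a dataset."""
    return np.mean([autocorrelation(s, max_lag) for s in dataset], axis=0)

def autocorrelation_gap(real, synthetic, max_lag):
    """Mean squared error between the two average curves (lower is closer)."""
    diff = mean_autocorrelation(real, max_lag) - mean_autocorrelation(synthetic, max_lag)
    return float(np.mean(diff ** 2))

# Toy data: "real" series with a weekly pattern, one synthetic set that
# reproduces the pattern and one that is pure noise.
rng = np.random.default_rng(0)
t = np.arange(56)
weekly = np.sin(2 * np.pi * t / 7)
real = [weekly + 0.1 * rng.standard_normal(t.size) for _ in range(20)]
good_synth = [weekly + 0.1 * rng.standard_normal(t.size) for _ in range(20)]
bad_synth = [rng.standard_normal(t.size) for _ in range(20)]

gap_good = autocorrelation_gap(real, good_synth, max_lag=14)
gap_bad = autocorrelation_gap(real, bad_synth, max_lag=14)
print(gap_good < gap_bad)
```

A single number like this is only a rough summary. In practice it helps to also plot the two autocorrelation curves and to compare other properties, such as the distributions of individual values.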
In terms of privacy, DoppelGANger supports hiding the distribution of some parts of the data, which sometimes contain sensitive business secrets. Further, an important class of membership inference attacks can be mitigated by training DoppelGANger on larger datasets. This may run counter to conventional release practices, which advocate releasing smaller datasets to avoid leaking user data.
There remain some open questions in our research. Recent proposals for training machine learning models with differential privacy guarantees yield poor fidelity/privacy tradeoffs, not only for DoppelGANger but also for many other models. Beyond the privacy notions discussed here, there may be other privacy concerns that we have yet to identify. These results highlight interesting directions for future research.
Want to try DoppelGANger on your dataset?
To start using DoppelGANger, all you need to do is save the dataset in the required format; see the code and detailed instructions here.
This work was presented at the ACM Internet Measurement Conference (IMC) 2020 and was shortlisted as a finalist for the Best Paper Award.
Zinan Lin is a PhD student at Carnegie Mellon University.