Abstract

Organizations across the globe disseminate administrative data via the Web for unrestricted public use. These organizations must balance the trade-off between privacy protection and statistical inference. Recent developments in advanced disclosure avoidance techniques include the incorporation of synthetic data, which capture the essential features of the underlying data by releasing altered data generated from a posterior predictive distribution. The Census Bureau has interrelated time series data that are hierarchical and contain many zeros. Current rule-based disclosure avoidance techniques require the Census Bureau to withhold count data of small magnitudes. Motivated by this problem, we use zero-inflated Bayesian Generalized Linear Mixed Models with privacy-preserving prior distributions to protect and release synthetic data about thousands of small groups regardless of magnitude. We find that as the prior distributions of the variance components become more precise toward zero, privacy increases. We evaluate our methodology against the strict privacy measure of empirical differential privacy and a newly defined privacy measure, PORI, which is more representative of each observation in the dataset. We illustrate our results with the Census Bureau’s Quarterly Workforce Indicators and plan to implement a similar approach on their confidential disaggregated data.

Comments

Presentation given by LDI Graduate Assistant, Matthew Schneider, on big data privacy at Temple University on Nov 18, 2013.
