As conversational AI technologies continue to advance and redefine the state of the art in commercial customer experience, we should expect federal agencies to follow suit as they strive to improve citizens' experience of the government services they deliver. As with chatbots today, synthetic data will surely play a key role in developing those technologies.
2. Open data
One of the earliest federal use cases for synthetic data can be found at the U.S. Census Bureau, which regularly collects vast stores of national data of immense value to many fields of research.
In 2001, the Census Bureau was authorized to integrate person-level micro-data from its longitudinal household survey, the Survey of Income and Program Participation (SIPP), with IRS tax and earnings data and Social Security Administration retirement and disability benefits data. The resulting data trove offers the most comprehensive information available on how the nation’s economic well-being changes over time, and it’s a goldmine for academics, researchers, economists, and policymakers. With SIPP data, they can examine, for example, national income distributions, the impacts of government assistance programs, and the complex relationships between government tax policy and economic activity at the local level.
The problem for the Census Bureau, however, is that the highly detailed nature of the SIPP data makes it particularly sensitive: the micro-data could be used to identify specific individuals. To make the data safe for public use while retaining its research value, the Census Bureau chose to create synthetic data from the SIPP data sets. The result is the SIPP Synthetic Beta (SSB), a Census Bureau product that was first made public in 2007 and continues to be updated and released periodically.
3. Federal healthcare
Similarly, the NIH’s N3C Data Enclave is an open data initiative aimed primarily at advancing COVID-19 research. Since the database was opened to researchers in September 2020, it has grown to include billions of rows of data representing more than 5 million COVID-19-positive patients, making it the largest open U.S. database of patient electronic health record data. Because of its advanced informatics technologies and data linkages to demographic, mortality, and other information, the database helps researchers create clearer pictures of COVID-19 health outcomes among different communities and enables them to find patterns faster than traditional database methodologies allow. Moreover, the N3C Data Enclave has become useful for research well beyond COVID: researchers have also used it to improve our understanding of health equity, diabetes, cancer, HIV, rural mortality rates, and chronic obstructive pulmonary disease.
Under the open data initiative, scores of federal agencies and subagencies have already made data sets freely available to researchers on Data.gov, the main web portal for open government data. But some data sets cannot be shared because they could reveal personal details of specific individuals.