Synthetic Data: Public-Use Micro Data for a Big Data World


Written by: Ron S. Jarmin, Assistant Director, Research and Methodology Directorate
Thomas A. Louis, Associate Director, Research and Methodology Directorate
Javier Miranda, Principal Economist, Center for Economic Studies

Businesses, households and policymakers need timely and accurate data to make informed decisions. National statistical offices around the world have a wealth of information from survey and administrative sources to meet these needs. However, they are constrained in their ability to release these data because of the confidentiality pledge to data respondents.

Synthetic data offer a way to expand the amount of information that national statistical offices can publicly release while maintaining respondent confidentiality. In synthetic datasets, some or all data values are simulated (synthesized) using statistical models designed to mimic the (joint) distributions of the underlying data.
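As a minimal illustration of the idea, and not the method behind any Census Bureau product, the sketch below fits a simple model to a small made-up "confidential" table and then draws synthetic records from the fitted model. The variable names, the made-up data, and the choice of a multivariate normal model are all assumptions made for this example.

```python
import numpy as np

def synthesize(confidential, n_synthetic, seed=0):
    """Draw synthetic rows from a multivariate normal fitted to the input.

    Assumption: the multivariate normal is only a toy stand-in for the
    much richer models used in real synthetic data products.
    """
    rng = np.random.default_rng(seed)
    mean = confidential.mean(axis=0)            # column means
    cov = np.cov(confidential, rowvar=False)    # joint covariance of columns
    return rng.multivariate_normal(mean, cov, size=n_synthetic)

# Hypothetical "confidential" data: log income and age for 5,000 people.
rng = np.random.default_rng(42)
age = rng.normal(45, 12, size=5000)
log_income = 9.0 + 0.02 * age + rng.normal(0, 0.5, size=5000)
confidential = np.column_stack([log_income, age])

synthetic = synthesize(confidential, n_synthetic=5000)

# Synthetic rows are fresh draws from the model, not copies of records,
# yet the correlation between the two columns is approximately preserved.
real_corr = np.corrcoef(confidential, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"real corr={real_corr:.2f}, synthetic corr={synth_corr:.2f}")
```

In this toy setting, an analyst working only with the synthetic table would recover roughly the same relationship between income and age as one working with the confidential table.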

Researchers at the Census Bureau, in partnership with academic economists and statisticians through the Census Bureau’s secure research data centers, recently produced two synthetic public micro datasets. The SIPP-Synthetic Beta product combines survey data from the Survey of Income and Program Participation with administrative records from the Internal Revenue Service and the Social Security Administration (see Benedetto, Stinson and Abowd 2013). The Synthetic Longitudinal Business Database is the first business establishment-level public-use micro dataset made available by a U.S. statistical agency (see Kinney et al. 2011).

Research findings on the development, use and future applications of synthetic data were presented in a session of the World Statistical Congress held in Hong Kong in August 2013. The resulting articles are accessible in the Statistical Journal of the International Society of Official Statistics.

While synthetic data are exciting and hold great promise, there are challenges to expanding their development and use. Creating synthetic data requires significant technical expertise that is not widely available within many statistical agencies. Census Bureau progress on synthetic data has relied on robust collaboration with academic experts. Users also confront challenges. Synthetic microdata are still experimental and not as straightforward to use as conventional microdata. Because users may not understand what is involved in developing apps and online tools constructed using synthetic data, such as OnTheMap, they may understate the variance of estimates supplied by such tools.

Synthetic data are one way for national statistical organizations to take the lead in making high quality and reliable official statistics more accessible and relevant. However, creating and supporting synthetic data requires staffing and resources beyond what is generally available to them. The Census Bureau’s “two-way-street” strategy of developing partnerships with academic and funding institutions offers a way to move forward.

Posted in Uncategorized | Leave a comment

“Low Response Score” Indicator Arises Out of Crowdsourcing Solution


Written by: Nancy A. Bates, Research and Methodology Directorate; and Chandra Erdman, Center for Statistical Research and Methodology, Research and Methodology Directorate

In September 2012, the U.S. Census Bureau announced a global crowdsourcing competition. The contest – dubbed the “Census Return Rate Challenge” – encouraged teams and individuals to compete for prize money for predicting 2010 Census mail-return rates. The challenge asked participants to model geographic variations in return rates using predictive variables found in the updated 2012 Planning Database.

The challenge was a success. Over 244 teams and individual competitors submitted solutions. Bill Bame, a software developer from Maryland, submitted the winning model. The Bame model included 342 variables and employed data mining and machine learning techniques. Twenty-four of his top 25 predictors came from the 2012 Planning Database. With these variables, we developed an “ordinary least squares” regression model to predict the likelihood of self-response; the resulting predicted rate is referred to as the “low response score.” See Erdman and Bates, 2014 for a full description of the methodology.
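To make the regression step concrete, here is a minimal sketch of an ordinary least squares fit that turns area-level predictors into a predicted rate. It is not the model described in Erdman and Bates: the two predictors, the made-up data, and the choice to model a nonreturn rate (so that a higher score means lower expected response) are all assumptions made for this example.

```python
import numpy as np

def fit_ols(X, y):
    """Return OLS coefficients (intercept first) for y regressed on X."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply fitted coefficients to a predictor matrix."""
    design = np.column_stack([np.ones(len(X)), X])
    return design @ coef

# Hypothetical tract-level predictors and outcome; none of these are real.
rng = np.random.default_rng(0)
pct_renter = rng.uniform(0, 100, size=1000)
pct_65_plus = rng.uniform(0, 40, size=1000)
X = np.column_stack([pct_renter, pct_65_plus])
nonreturn_rate = (15 + 0.2 * pct_renter - 0.1 * pct_65_plus
                  + rng.normal(0, 2, size=1000))

coef = fit_ols(X, nonreturn_rate)
score = predict(coef, X)   # toy analogue of a "low response score"
print(np.round(coef, 2))
```

The fitted coefficients can then be applied to any area with the same predictors available, which is what makes a score of this kind usable for planning.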

Areas with low self-response require costly follow-up by telephone or in person. Using the low response score and the wealth of information in the planning database, we can identify areas that are likely to have low rates of self-response and develop tailored strategies to increase these rates.

The low response score is provided for each census tract and block group in the 2014 Planning Database, a publicly available database containing socioeconomic, housing and demographic variables from the 2010 Census and 2008-2012 American Community Survey. The low response score and updated 2014 Planning Database go hand-in-hand, and survey practitioners can use them in many ways. For example, one can use the score to stratify samples, distinguishing areas with a low likelihood of survey and census participation from those with a high likelihood.
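The stratification idea can be sketched as follows. This is only an illustration under stated assumptions: the scores are made up, and the use of equal-sized quantile strata with a fixed sample size per stratum is a choice made for the example, not a documented Census Bureau procedure.

```python
import numpy as np

def stratified_sample(scores, n_strata, n_per_stratum, seed=0):
    """Split units into score quantile strata and sample within each.

    Returns a dict mapping stratum index (0 = lowest scores) to the
    indices of the sampled units.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    # Interior quantile cut points, e.g. the median when n_strata == 2.
    cuts = np.quantile(scores, np.linspace(0, 1, n_strata + 1)[1:-1])
    stratum = np.searchsorted(cuts, scores, side="right")
    sample = {}
    for s in range(n_strata):
        members = np.flatnonzero(stratum == s)
        take = min(n_per_stratum, len(members))
        sample[s] = rng.choice(members, size=take, replace=False)
    return sample

# Hypothetical scores for 200 tracts (made-up values).
scores = np.random.default_rng(1).uniform(5, 40, size=200)
sample = stratified_sample(scores, n_strata=4, n_per_stratum=10)
```

A practitioner could instead sample at a higher rate in the strata where response is expected to be lowest; the fixed size per stratum here simply keeps the example short.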

Used in tandem, the score and database help survey and census planners to identify hard-to-count areas and understand why such areas are hard to count. This knowledge can then be applied to manage field resources and develop targeted self-response and nonresponse follow-up strategies more efficiently.

For questions or comments, contact census.pdb.questions@census.gov.


Planning and Predicting: The 2014 Census Planning Database


Written by: Barbara C. O’Hare, Center for Survey Measurement, Research and Methodology Directorate; and Nancy A. Bates, Research and Methodology Directorate

Predicting participation rates among the general population is a challenge for those involved in census and survey work, whether in collecting the data or in analyzing the results. The effects of nonresponse are well documented in the literature. See, for example, Groves and Couper, 1998, Nonresponse in Household Interview Surveys, or Groves, 2011, “Three Eras of Survey Research,” Public Opinion Quarterly 75(5).

We rarely have socio-economic information for all individuals or households on survey sample frames that we can use to analyze and mitigate the impacts of nonrandom nonresponse. However, the decennial census and the American Community Survey do provide extensive small area data. These data can serve as a rich source for characteristics of people living in census tracts and block groups – characteristics such as age, education, ethnicity and income – known to be related to both census and survey participation.

The Census Bureau recently released the 2014 Planning Database, which assembles a range of housing, demographic, socio-economic and census operational data useful for survey and census planning. Data are provided at both the census block group and tract levels of geography. The 2014 Planning Database draws on selected census and American Community Survey statistics, along with operational variables such as the 2010 Census mail return rate. It includes percentage calculations alongside the basic count data. In addition, a new low response score is provided that is similar in purpose to the hard-to-count scores included in the 2000 Planning Database. The low response score identifies census block groups and tracts whose characteristics predict low census mail return rates and are highly correlated (negatively) with census and survey participation. “Hard-to-count” refers to segments of the population that are often missed, such as newborns or immigrants.

Census Bureau demographers developed the first planning databases after the 1990 and 2000 censuses, and they proved to be valuable tools. With the advent of the American Community Survey replacing the census “long form” and the annual release of American Community Survey statistics covering a five-year period, the planning database content was updated and revised after the 2010 Census. The planning database will be updated and released annually.

Questions or comments? Contact census.pdb.questions@census.gov.


Outside Experts Present Research Beneficial to the Census Bureau


Written by: Barbara Downs, Lead Research Data Center Administrator, Center for Economic Studies and Lucia Foster, Chief Economist and Chief, Center for Economic Studies

Over 100 researchers from across the country gathered at Census Bureau headquarters on June 12 to participate in a conference highlighting cutting-edge research intended to foster innovation in data collection, processing and analysis.

The annual conference also featured data training sessions on health data from the Agency for Healthcare Research and Quality and the National Center for Health Statistics; demographic data from the Survey of Income and Program Participation; and business data from the Longitudinal Business Database. Participants learned about data available at the RDCs and discussed data-specific technical issues.

The day began with opening remarks from Deputy Director and Chief Operating Officer Nancy Potok. She emphasized the critical importance of research in keeping the Census Bureau on the cutting edge of economic and social measurement. She promoted the continued expansion of the RDC network as a way to help the federal statistical system meet the challenges of measuring a changing economy and population in a period of reduced response rates and limited resources.

Sue Helper, Chief Economist for the Department of Commerce, discussed another use of RDCs in her keynote speech, “Evidence-Based Governing: How RDCs Might Help.” Helper, drawing on personal experience, highlighted the benefits of bringing together two different cultures: academia and policy. She has experience in both settings, as she is currently on leave from her professorship at Case Western Reserve University. Helper also has RDC experience, having been a researcher in 1998 at the Boston RDC.

Once again, this year’s RDC conference showed how the RDC network fosters innovation in Census Bureau programs and products by building partnerships with expert researchers. We look forward to learning more about ongoing RDC research and its benefits for respondents, data users, and American taxpayers at next year’s conference, to be held at one of our western locations.

The RDC network consists of 17 secure locations across the United States, where qualified researchers from academia, federal agencies, and other institutions with approved projects receive restricted access to selected non-public files to conduct research that benefits the data-owning agency. Currently, there are over 400 researchers working on 150 projects in the network.

For more information about the RDC network, see: http://www.census.gov/ces/rdcresearch/.

For more information about the 2014 RDC conference, see: http://www.census.gov/ces/researchprograms/rdc_conference_2014.html


Race Response Changes by American Indians and Alaska Natives Between the 2000 Census and 2010 Census


Written by: Renuka Bhaskar, Researcher, Center for Administrative Records Research and Applications (CARRA); Carolyn Liebler, Ph.D., Assistant Professor of Sociology, University of Minnesota; and Sonya Rastogi, Ph.D., Senior Researcher, CARRA

Recent Census Bureau research highlights the extent to which people’s race and Hispanic origin responses changed between the 2000 and 2010 Censuses. The American Indian and Alaska Native (AIAN) group is among those with a relatively high level of race response change. In this new working paper, we examine characteristics of people who reported (or were reported by someone in the household as) AIAN in 2000 and 2010, in comparison to people who reported AIAN in only one of those censuses.

We use anonymized linked data from 2000 and 2010 to study race response stability and change for 3.1 million AIAN people.  We also use supplementary information for about 188,000 of them who participated in the American Community Survey (ACS) at some point from 2006 to 2010. Note that the linked data used in our study are not nationally representative.

We find substantial race response change among non-Hispanic and Hispanic AIAN people. Of those who reported non-Hispanic single-race AIAN in 2000 or 2010, just over half (53 percent) had the same race and Hispanic origin responses in both censuses. Among Hispanic AIANs and AIANs who reported more than one race, fewer than one in seven had the same race and Hispanic origin responses in both censuses. An implication is that the number of people who ever report AIAN is much larger than the number reporting AIAN at one point in time.

In the chart below, we show the most common response patterns. A number of AIANs reported white in both censuses but added or dropped the AIAN response in one census. Others changed from one single race (AIAN) to or from another single race (for example, white).

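Chart: Most common race response patterns among people reported as AIAN in the 2000 Census or the 2010 Census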

For those who also participated in the ACS, we explore a wide range of characteristics to understand similarities and differences between those who joined, those who left, and those who stayed in the AIAN response category.  We find, for example, those who were consistently reported as AIAN more often reported an enrolled or principal tribe, lived in an area with a relatively high number of AIANs, and reported AIAN ancestry.

We also find that people who made a particular response change (for example, from Hispanic single-race AIAN in 2000 to Hispanic single-race white in 2010) were similar to those who made the inverse change (Hispanic single-race white in 2000 to Hispanic single-race AIAN in 2010).

There are substantial dynamics in race reporting among American Indians and Alaska Natives.

Researchers studying this important group will need to take into account the possibility of response change in research design and when interpreting results.

Our paper is available at <https://www.census.gov/srd/carra/>.


Recent Findings on Trends in U.S. Entrepreneurship


Written by: Ron S. Jarmin, Assistant Director, Research and Methodology Directorate

Two recent papers highlight findings based on the Census Bureau’s Business Dynamics Statistics (BDS) that show declining rates of business dynamism over the last few decades. Census Bureau Business Dynamics Statistics provide data on number of establishments and year-to-year change in employment for births, deaths, expansions, and contractions by firm age and employment size.

The first paper, The Role of Entrepreneurship in U.S. Job Creation and Economic Dynamism, which I co-authored with the Census Bureau’s Javier Miranda and Ryan Decker and John Haltiwanger of the University of Maryland, appeared in the Summer 2014 issue of the Journal of Economic Perspectives. The second paper is a recent Brookings report by Hathaway and Litan.

A central finding in these papers is that rates of new firm start-ups have been declining since the early 1980s (see the figure from Decker et al. below). At first glance, this trend may not seem to be of great concern. However, young entrepreneurial businesses are an important source of job creation in the U.S. economy (see Haltiwanger, Jarmin and Miranda 2013). Also, the churn of firms in the economy and the associated reallocation of resources are critical components of productivity growth (for a review, see Syverson 2011).

Figure: Declining Share of Activity from Young Firms (Firms Age 5 or Less)

As these papers report, we do not have a satisfactory explanation for the declining pace of business dynamism. Nor do we fully understand the broader implications for productivity and economic growth. We need more research with data such as the BDS to close these gaps in our understanding of the economy.

Here’s a sampling of recent blogs about these findings:

Conversable Economist

Vox

FiveThirtyEight

Updated Priors

Arnold Kling


Results from America’s Churning Races: Race and Ethnic Response Changes Between the 2000 and 2010 Censuses


Written by: Sonya Rastogi, Ph.D., Senior Researcher, Center for Administrative Records Research and Applications (CARRA); Carolyn Liebler, Ph.D., Assistant Professor of Sociology, University of Minnesota; Leticia Fernandez, Ph.D., Researcher, CARRA; James Noon, Researcher, CARRA; and Sharon Ennis, Researcher, CARRA

Racial and ethnic identities are consequential in shaping people’s lives and experiences. Often these concepts are viewed as immutable and lifelong. However, researchers have documented that individuals’ race and Hispanic origin responses on censuses and surveys can and do change.

Possible explanations for why respondents change these responses include new life experiences, shifting social forces, adjustments to questionnaire design, or a change in who within a household reports the race or Hispanic origin for household members.

In a new working paper, we document the amount and patterns of race and Hispanic origin response change using anonymized linked data of over 162 million people for whom we have responses from both the 2000 Census and 2010 Census.

We find that race and Hispanic origin responses changed for about 9.8 million people (or 6 percent). In the study, we estimate that about 8 percent would have changed their race or Hispanic origin between censuses if the study data were nationally representative.

We use the term “churning” to describe population turnover between census years — the number of people who left or joined a group relative to the number of people who stayed in the group. By using the linked data, we abstract from natural sources of churn such as births, deaths and migration and isolate churn due to response changes.  Notably, across most types of response changes, we found that the (sometimes sizable) number of people joining each race/Hispanic group was similar to the number of people leaving the same group.
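The definition of churning above can be made concrete with a short sketch that counts joiners, leavers and stayers for one response group in linked records. The records are made up, and reading "relative to" as the ratio (joined + left) / stayed is an assumption of this example; the working paper may define the measure differently.

```python
from collections import Counter

def churn(responses_2000, responses_2010, group):
    """Return (joined, left, stayed, churn_ratio) for one response group.

    churn_ratio = (joined + left) / stayed, an assumed reading of the
    definition in the text; it is None when nobody stayed in the group.
    """
    counts = Counter()
    for r0, r1 in zip(responses_2000, responses_2010):
        if r0 == group and r1 == group:
            counts["stayed"] += 1
        elif r0 == group:
            counts["left"] += 1
        elif r1 == group:
            counts["joined"] += 1
    stayed = counts["stayed"]
    ratio = (counts["joined"] + counts["left"]) / stayed if stayed else None
    return counts["joined"], counts["left"], stayed, ratio

# Made-up linked records: each position is one person's response
# in the 2000 Census and in the 2010 Census.
r2000 = ["A", "A", "A", "B", "B", "A", "B", "A"]
r2010 = ["A", "B", "A", "A", "B", "A", "B", "A"]
print(churn(r2000, r2010, "A"))   # (1, 1, 4, 0.5)
```

In this toy example one person joined group A and one person left it, so the group's size is unchanged in both years even though its membership is not, which mirrors the pattern described above.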

The table below shows churning and stability among single-race groups and multiple-race groups, by Hispanic origin. Responses changed across all race and Hispanic origin groups, especially among American Indians and Alaska Natives, Native Hawaiians and Other Pacific Islanders, people who reported multiple races, and Hispanics who reported a race. Responses were generally stable among single-race non-Hispanic whites, blacks, and Asians.

Fluidity of Race Response Between the 2000 Census and the 2010 Census by Hispanic Origin

In general, people consistently report whether they are Hispanic. However, race responses among Hispanics were not as stable as among non-Hispanics. The two most common response changes were changing from “Hispanic Some Other Race” to “Hispanic White” and changing from “Hispanic White” to “Hispanic Some Other Race.”

Most of the common response changes involved adding or subtracting a race or Hispanic response. Many common response changes involved movement between the majority group and a minority group. And a number of people changed from one single-race group to another.

We conclude that a racial or ethnic group does not necessarily include the same individuals over time even if cross-sectional data show small net changes. Researchers who use race and Hispanic-origin information will need to take into account the possibility of response changes when interpreting results.

Our paper is available at <https://www.census.gov/srd/carra/>.


Counting Young Children in Censuses and Surveys


Written by: Frank Vitrano, Associate Director for 2020 Census

There is a well-documented undercount of children ages 4 and under in population censuses. Societies as varied as China, South Africa, Laos, the former Soviet Union, and Canada experience a high net undercount of young children.

As we prepare for the 2020 Census, we will continue to look at ways to produce a more accurate and cost-effective count of the nation, one that reflects our dynamic, changing society. We will research how best to reach and include historically hard-to-count populations, such as young children.

We have seen that for various reasons, young children are often not included on census forms, and as we prepare for 2020, improving this count is one focus of our research.

This coverage error is not unique to decennial censuses. Evaluations have shown that Census Bureau surveys like the American Community Survey, the Current Population Survey, and the Survey of Income and Program Participation also undercount young children, which can result in biased survey estimates.

Federal agencies, state and local governments, and advocacy groups make critical assessments of the well-being of children and distribute funds to support programs for young children based on these surveys’ estimates. Undercoverage of this population has far-reaching implications.

Demographic Analysis shows that the undercount of children under age five in the decennial census, and in Census Bureau demographic surveys, is growing. The 2010 Census, for example, undercounted children aged 0 to 4 by about 4.6 percent. When we look at the differences between census and Demographic Analysis counts for adults and young children since 1980, we see noteworthy reductions in the differences for the adult population but steady growth in the differences for the youngest children.

In 2013, the Census Bureau assembled an informal task force to study the persistent undercount of young children in censuses and surveys. This group met throughout the year to discuss the issue, brainstorm on causes, and review existing data. In the task force’s report, you can find a summary of this problem and, most critically, what the Census Bureau still needs to pursue in order to improve coverage of this population group in future censuses and surveys.

Along with my decennial managers, I stand committed to reversing this decline in coverage for young children in the 2020 Census. In fiscal year 2015, we will establish a team of experts from across the Bureau to focus on coverage improvement activities for the 2020 Census.  Their first responsibility will be to identify and prioritize key evaluation and research projects. I can promise you that improving the coverage of children under the age of five will be high among their priorities. This team will report to Census Bureau leadership and will be responsible for making sure that this problem continues to receive the attention that it deserves.

In addition, I plan to identify a point person for this specific issue – improving the coverage of young children in official statistics. This individual will serve as an advocate for high quality data for young children and work with both decennial and demographic survey managers to understand and address the causes for this undercount.

We plan to begin coverage improvement work for the 2020 Census in fiscal year 2015. If you have questions about this report, please leave a comment on the blog and we will get back to you.

Report


Visit Us at the 2014 Joint Statistical Meetings in Boston


Written by: Tommy Wright, Chief, Center for Statistical Research and Methodology, U.S. Census Bureau

Several thousand statisticians and people in related professions, including staff from the Census Bureau, will present testing and research results on many topics at the Joint Statistical Meetings (JSM) in Boston, Mass., from Aug. 2 through 7. The theme of the conference is: “Statistics: Global Impact–Past, Present, and Future.”

The American Statistical Association (ASA) and the Census Bureau share a historical link. Many cite the Census Bureau’s need to improve its data collection methods as a main reason for ASA’s founding. As early as 1844, ASA recommended to Congress that the 1840 Census “be revised and a new and accurate copy be published.” In those early years, the heads of the Census Bureau were generally ASA members or officers.

This year, Census Bureau staffers will present research findings on a spectrum of topics, such as:

  • Using administrative records in the 2020 Census to increase efficiency and reduce cost.
  • Using paradata throughout the decennial data collection life cycle.
  • Using targeted contact strategies in censuses and sample surveys for coverage improvement and cost control.
  • Using statistical modeling to reduce cost by improving targeted address canvassing activities.
  • Using small area estimation techniques.
  • Using new household sampling and estimation methodology.
  • Using statistical modeling in business and household sample surveys applications.
  • Addressing data quality, nonresponse, and missing data problems.
  • Improving users’ access to data while protecting privacy and confidentiality.

The JSM offers a unique international forum for Census Bureau staffers to present their research for professional discussion. It is a major setting for ensuring that Census Bureau statistical methodology remains at the cutting edge. We look forward to sharing ideas with you at this year’s conference. For a listing of Census Bureau presentations, see www.census.gov/research/conferences/

This year’s JSM celebrates the 175th anniversary of the ASA, which was founded in 1839 in Boston. Held annually, JSM attracts more than 6,000 attendees from around the world, making it the largest gathering of statisticians in North America. Nine international statistical organizations participate in JSM (see here for a listing). Attendees present and hear about advances in statistical methodology and applications, including statistical theory and methodological development, state-of-the-art technological advances for data processing and new advances in statistical sampling, estimation, and modeling. JSM offers professional development courses, career placement services, and opportunities to meet and collaborate with individuals conducting similar research.


Another Look at Race Response Change: American Indians and Alaska Natives between 2000 and 2010


Written by:
Sonya Rastogi, PhD, Senior Researcher, Center for Administrative Records Research and Applications (CARRA)
Carolyn Liebler, PhD, Assistant Professor of Sociology, University of Minnesota
Renuka Bhaskar, Researcher, CARRA

For half a century, the American Indian and Alaska Native (AIAN) population has been notably larger with each new census. Over the decades, this population has grown considerably without corresponding increases in immigration and births. According to demographers who study these increases, cross-time changes in race reporting at the individual level play an important role in the observed net increases.

In ongoing work, we are studying patterns in race and Hispanic origin response changes by people who reported (or were reported by someone else in the household as) AIAN (either alone or in combination with another race) in the 2000 or 2010 censuses. We presented a preliminary version of this research, entitled Dynamics of Race: Joining, Leaving, and Staying in the American Indian/Alaska Native Race Category between 2000 and 2010, at the 2014 annual meetings of the Population Association of America in early May.

This blog post is the second in a two-part series. Part one provided an overview of our related research on changes in respondent answers to the race and Hispanic origin questions across all race and Hispanic origin groups between 2000 and 2010. As we mentioned in part one, some race and Hispanic origin responses can and do change. These changes are relatively common in the AIAN population.

Our new research on the AIAN population is unique in that we use linked data to look deeper into the net changes in population size and composition. We provide substantial information about previously unseen groups – those who joined or left the AIAN response group, and those who remained AIAN between 2000 and 2010. Using data from the American Community Survey, we document patterns in the characteristics of individuals who were involved in each type of response change.

In preliminary analyses, we found generally similar demographic characteristics in two groups: (a) people who reported (or were reported as) AIAN in the 2000 Census, but not in the 2010 Census, and (b) those who reported (or were reported as) AIAN in the 2010 Census but not the 2000 Census. These two groups are similar despite substantial turnover. We also found that people who retain the same race and Hispanic origin response in both censuses have characteristics that are distinct from those who join or leave the AIAN population.

We expect to release a working paper with detailed results in the coming months.
