In a recent presentation at the Census Bureau, Dr. Steven Ruggles, director of the Minnesota Population Center (MPC) at the University of Minnesota, discussed the history of census data processing. Ruggles argued that the needs of the Census Bureau drove innovation in data processing technology up to 1960, but that the private sector, rather than the Census Bureau, has played that role over the last 50 years. During this period the costs of data collection, storage, and analysis have declined rapidly, and the quantity of data collected has grown at an extraordinary pace.
Following the 1960 census, improvements in computers brought more potential for research using census data. The Census Bureau responded to researchers’ requests for data by releasing the 1960 Public Use Microdata Sample (PUMS), a 1-in-1000 sample of the records from the 1960 long form. PUMS files are for statistical purposes only and do not contain any personal information that would allow individuals to be identified. This dataset, which was delivered on 13 UNIVAC tapes (or 18,000 punch cards), allowed researchers to address a variety of questions that would not have been possible using publicly available tabulations. Since the sample consisted of microdata—records at the person- and household-level—it offered the opportunity to develop customized measures and to do multivariate analyses.
The 1960 PUMS was well received by the research community, so following the 1970 census, the Census Bureau released a one-percent sample of the 1960 census (a tenfold increase over the initial release) along with six percent of the records from the 1970 long form. Perhaps most importantly, the concurrent release of the revised 1960 sample and the 1970 sample made it easy for researchers to examine change over time, as both samples used the same codes and formats.
The Census Bureau released a sample of records from the 1980 census, and outside researchers took up the task of producing samples of historical censuses. Hal Winsborough at the University of Wisconsin contracted with the Census Bureau to create samples from the 1940 and 1950 censuses, extending the series of available microdata to five censuses. Projects at the University of Washington and the University of Pennsylvania led by Sam Preston produced samples of the 1900 and 1910 censuses, then the latest censuses available to the public. In the late 1980s, Ruggles led efforts at the University of Minnesota to produce samples of historical censuses dating back to 1850.
Though the consistent coding of the 1960 and 1970 census samples had demonstrated the power of interoperability for studying change over time, none of the other samples were produced in a consistent manner. Ruggles initiated the Integrated Public Use Microdata Series (IPUMS) project to “harmonize” all of the census public use samples: that is, to produce new versions of these datasets with consistent codes, record layouts, and integrated documentation, without any loss of information from the original datasets. The initial release of IPUMS data came not long after the development of the first web browsers, and Ruggles was quick to take advantage of this technology to disseminate the harmonized census microdata, leading to a rapid increase in the use of the IPUMS database for research.
The IPUMS now includes data from every census from 1850 to 2000, with the exception of the 1890 census, which was destroyed by fire. Ruggles is also involved in efforts to digitize entire historical censuses. This project, known as the North Atlantic Population Project (NAPP) since most of the countries involved are in North America and northern and western Europe, currently contains 120 million person records from 24 censuses covering the period from 1800 to 1910. Ruggles predicts that by 2016 the combined IPUMS and NAPP holdings will grow to 1,150 censuses and surveys from 110 countries, comprising 1.5 billion person records. MPC has also done extensive work harmonizing recent census data from other countries: IPUMS-International currently contains data from 185 censuses from 62 countries, comprising some 400 million person records spanning the period from 1960 to 2010.
The greatest challenges to these efforts are that many datasets are inaccessible or at risk of loss, and that whatever metadata exists is often sketchy. Despite these challenges, improvements in computing technology and a rapid decline in the cost of storing data have opened new opportunities for data collection and analysis. Where it cost about $1,200 to store one megabyte of data in 1980, the same amount of storage now costs about $0.00004, a decline of roughly thirty-million-fold. These factors have combined to bring about a marked acceleration in the pace of discovery in recent years.
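The scale of that decline is easy to verify from the two figures quoted above; a quick back-of-the-envelope calculation (using only the dollar amounts given in the text) looks like this:

```python
# Back-of-the-envelope check of the storage-cost decline cited above.
cost_per_mb_1980 = 1200.0    # approx. dollars per megabyte, ca. 1980
cost_per_mb_now = 0.00004    # approx. dollars per megabyte today

decline_factor = cost_per_mb_1980 / cost_per_mb_now
print(f"Cost per megabyte fell by a factor of about {decline_factor:,.0f}")
# Cost per megabyte fell by a factor of about 30,000,000
```

In other words, storing the same megabyte costs about one thirty-millionth of what it did in 1980.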
Ruggles pointed out that, between all of MPC’s projects, the center currently has data on more than 850 million people, or roughly as many people as Facebook. MPC is now collaborating with the Census Bureau, as well as with other research organizations, to expand its projects. Current large-scale efforts include the National Historical GIS Project and Terra Populus, or “TerraPop,” an effort to preserve, integrate, and disseminate global-scale spatiotemporal data describing population and the environment.
Where the Census Bureau once drove innovation in data processing technology, it is now a beneficiary of the technological changes of recent years. The Census Bureau is now collaborating with the research community in a variety of ways to improve data collection and to produce new data products.