Feasibility Study: Measuring Population with Admin Data

Technical feasibility: Measuring population and dwellings using administrative data summarises work investigating the technical feasibility of transforming the New Zealand census to a model based primarily on administrative (admin) data supported by surveys. It indicates Stats NZ's readiness to shift towards an admin-first approach to the census, and highlights some obstacles and key areas for development.

Scope

The focus of this paper is on technical feasibility in terms of how well admin data (supported by surveys) can count core statistical units (people, dwellings, and households and families) and provide census-type information about the characteristics of these units.

The wider feasibility of an admin-first census depends on considerations beyond the technical. General acceptance of an admin-first census model is a fundamental requirement. Matters like social and cultural licence, building public trust and confidence in the re-use of administrative data, and the financial and legal aspects of a change to the census model all need to be addressed in the decision-making process surrounding future censuses. These topics are noted in the paper, but they are not the focus of discussion.

Summary

Transforming the New Zealand Census of Population and Dwellings: Issues, options, and strategy was developed in 2012 in response to concerns about the sustainability of the traditional full-field enumeration census model, as well as to realise opportunities arising from increased availability of admin data and technology innovations. The vision of a future census based primarily on admin data supported by surveys was aspirational at the time.

During the decade since, those sustainability risks have become evident. The 2018 Census was unable to achieve the high response rates of the past. Even though the 2023 Census was able to achieve higher response rates than the 2018 Census, the response rates did not return to pre-2018 levels. This was despite the increased efforts of collection operations and significantly higher investment in 2023. Our ability to still produce statistically sound census data in 2018 and 2023 has relied on using admin data where we have been unable to obtain census responses. This innovative approach to mitigating insufficient response rates was made possible by methodological research conducted by the Census transformation research programme.

The goal of the Stats NZ population-measurement research programme for 2020 to 2024 was 'to make a primarily admin-based census supported by sample surveys a genuine choice for the 2028 Census'. We now have most of the data and the methodological underpinnings needed for this shift. The breadth and quality of admin-based population information we are already able to produce is well ahead of that in countries that, like New Zealand, do not have an established population register or national population identifiers and are looking to use admin data in their census.

Unlike many countries with a register-based census, New Zealand does not have a national identity number that can be used to link person-centred admin data to a central population file. A key achievement over the last 10 years has been the ability to successfully integrate multiple datasets using probabilistic linking. The methods developed to undertake this linking on a production basis to produce the Integrated Data Infrastructure (IDI) data, and hence an admin-derived population file, might be the most important step to making this possible. Achieving high coverage of the admin-derived population depends on the ability to combine data sources that give comprehensive coverage and a largely unbiased picture of our population (NZ births, visa information, and tax data). These sources are less subject to the non-response biases we see in field-based censuses and surveys. The resulting population file is a fundamental element of the current combined census model, with individuals identified in admin data who have not responded to the census being included as admin-enumerations in the census (2018 and 2023).
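To illustrate what probabilistic linking involves when no shared identity number exists, the following is a minimal sketch in the style of the classical Fellegi-Sunter approach. It is not the production IDI linking method, which the paper does not detail here. The field names, the m- and u-probabilities, and the thresholds are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical agreement probabilities per comparison field (assumptions):
#   m = P(field agrees | the two records are the same person)
#   u = P(field agrees | the two records are different people)
FIELD_PROBS = {
    "first_name": (0.95, 0.01),
    "last_name": (0.95, 0.005),
    "birth_date": (0.98, 0.003),
    "sex": (0.99, 0.5),
}


def match_weight(rec_a, rec_b, field_probs=FIELD_PROBS):
    """Sum log2 likelihood ratios over fields for a candidate record pair."""
    total = 0.0
    for field, (m, u) in field_probs.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        if a == b:
            total += math.log2(m / u)  # agreement raises the weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


def classify(weight, upper=10.0, lower=0.0):
    """Assign a pair to link, non-link, or review using hypothetical cut-offs."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "clerical review"


tax_record = {"first_name": "mere", "last_name": "smith",
              "birth_date": "1990-04-02", "sex": "F"}
visa_record = {"first_name": "mere", "last_name": "smith",
               "birth_date": "1990-04-02", "sex": "F"}
other_record = {"first_name": "john", "last_name": "brown",
                "birth_date": "1975-11-30", "sex": "M"}

print(classify(match_weight(tax_record, visa_record)))
print(classify(match_weight(tax_record, other_record)))
```

Agreement on a rare value such as a full birth date adds far more weight than agreement on sex, which is why combining several weakly identifying fields can stand in for a single national identifier.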

Using a combined census model for the 2023 Census has more information.

The population statistics that we can produce from linked admin data are more up to date, more timely, and provide a richer ability to track demographic population shifts than a traditional census. This will enable us to provide better population data in a world where issues like climate change, high rates of inward and outward migration, and an ageing population are driving profound changes in the characteristics of people and where they live. These changes are creating demand for an increased frequency of population statistics and an increased ability to join population data with other sources of data that can only be achieved with integrated admin data. The IDI includes many admin datasets, Stats NZ surveys, and the 2013 and 2018 Censuses, linked at the individual level. The combined system, represented by these linked admin datasets, is stronger than what might be evident from the perspective of any single agency. Using multiple sources not only increases coverage but also provides some protection against quality limitations, as key demographic variables are not dependent on the quality of a single data source.

Information that can be derived from admin data can be produced every year or more frequently, at highly granular detail for small geographies. However, there are also trade-offs. For variables that are not covered by admin data we will need to rely on sample surveys, which will lead to a loss of detail, especially about small population groups. As with the census, the direct collection of high-quality survey data is increasingly becoming a challenge that involves significant cost. Innovative survey approaches will need to be developed to ensure the sustainability of delivering detailed small-area data.

The benefits of admin data within the census model will not be realised without the trust of our Tiriti partners, our customers, and the general public. The use of the combined-census model and the release of the experimental administrative population census (APC) products have increased awareness of how admin data is used to support the statistical system, but more work is needed in this space. Future Census Engagement has revealed a common theme of data users looking for more information on the quality of admin data.

Experimental administrative population census has more information about the APC.

There are several areas with significant gaps in the admin data available, or where improvements in the methodology and underlying infrastructure are needed. These are outlined in the section below, 'Critical areas to progress'. They include:

  • poor coverage of iwi affiliation and to a lesser extent Māori descent
  • a lack of admin data identifying people in rainbow communities and providing information on disability
  • improvements required in the collection of ethnicity data to support detailed population data for small ethnic communities
  • the development of the infrastructure to support a high-quality listing of dwellings.

In some situations, information gaps can be addressed through improvements in the quality of admin data and our methodology. Others will need to be supported by a survey programme.

The amount of work, change, and risk required to deliver an admin-first census supported by surveys is large, but the fundamental building blocks are in place. Across all the future census options being considered, admin data has a substantial role in improving the production of timely population measures and resilience against issues such as non-response.

Methodological progress

We have demonstrated we can derive a high-quality list of people in the resident population using linked admin data. The admin-derived resident population closely tracks the age distribution of the official estimated resident population (ERP), with a relatively evenly distributed undercount of around 2 percent. This population provides the foundation for other census information: characteristics for each person can be derived from linked admin sources or surveys and individuals are grouped within dwellings to form households. The admin-derived resident population was used in both the 2018 and 2023 Censuses to include admin enumerations in the census file. Although the admin-derived resident population is very close to the actual population as measured by official population estimates, it is not accurate enough to meet user requirements for high-value uses, such as in health funding models and for local government electoral needs. As with the current census approach, methods to adjust for coverage error in the population listing will still be required to provide the accuracy required for official population statistics.

The administrative population census (APC) is an experimental product released by Stats NZ that demonstrates to customers our current ability to derive census-type information from linked admin data in the Integrated Data Infrastructure (IDI). As well as core demographic variables (age, sex/gender, ethnicity, location, and Māori descent), the APC includes variables on the topics of birthplace, income, work, and education, and in 2023 some household variables were included.

Our population estimation methods have become distinctly more sophisticated in the last two decades (moving from a direct-weighting estimation approach to a Bayesian dual-system estimation model to adjust for census coverage errors). We have identified the key errors that our population estimation system needs to correct for under an admin-first census model supported by surveys and are progressively developing a population estimation system that can correct for these errors. Core methodology has been internationally peer-reviewed and published (Graham et al, 2024), but more investigations are required to account for the error structures inherent when working with multiple admin data sources that differ from traditional census errors. Population estimation models that account for subsets of the errors seen are currently in development, and these will eventually be merged and extended to correct for all known errors. Our current priority is to ensure that we have all the necessary components of the estimation system for a scenario where two admin lists of the target population are available.
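As a simplified illustration of the two-list scenario mentioned above, the sketch below shows the classical dual-system (Lincoln-Petersen) estimator and Chapman's small-sample variant. This is not the Bayesian model described in the paper. It assumes the two lists are independent, contain no overcoverage, and are linked without error, which are exactly the kinds of assumptions a fuller model has to relax. The counts are made up.

```python
def lincoln_petersen(n1, n2, matched):
    """Classical dual-system estimate of population size.

    n1, n2  -- number of people on list 1 and list 2
    matched -- number of people found on both lists
    """
    if matched <= 0:
        raise ValueError("need at least one matched record")
    return n1 * n2 / matched


def chapman(n1, n2, matched):
    """Chapman's variant, which reduces bias when counts are small."""
    return (n1 + 1) * (n2 + 1) / (matched + 1) - 1


# Hypothetical small area: 900 people on one admin list, 800 on a second
# list, and 720 people linked across both.
n1, n2, matched = 900, 800, 720
estimate = lincoln_petersen(n1, n2, matched)
coverage_list1 = n1 / estimate  # implied coverage of the first list

print(estimate)
print(round(chapman(n1, n2, matched), 2))
print(coverage_list1)
```

The estimator works by treating the share of list 2 that is also found on list 1 as an estimate of list 1's coverage, then scaling list 1 up accordingly.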

Constructing admin-based households and families is progressing. Seventy-two percent of comparable admin households have exactly the same household membership as the 2018 Census household, resulting from 89 percent of people being located at their census address, an improvement compared with earlier studies. Family nucleus membership aligns with the 2018 Census for approximately 82 percent of people in admin families, resulting from 89 percent of within-household relationships identified through admin data. There are still issues, particularly with placing more mobile groups (for example young people and recent migrants) in the correct households and identifying informal relationships and those formed overseas. Further work will include providing family information as well as determining the role that surveys will play in delivering household information.
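The gap between the person-level and household-level agreement figures can be seen in a small worked sketch. The functions and the toy data below are hypothetical and are not the measures used in the paper. They assume each person is assigned one address in each source, and that an admin household counts as matched only when a census household has exactly the same set of members.

```python
from collections import defaultdict


def households(assignments):
    """Group a person -> address mapping into address -> set of members."""
    groups = defaultdict(set)
    for person, address in assignments.items():
        groups[address].add(person)
    return {addr: frozenset(members) for addr, members in groups.items()}


def same_address_rate(admin, census):
    """Share of people (present in both sources) placed at the same address."""
    common = admin.keys() & census.keys()
    if not common:
        return 0.0
    agree = sum(1 for person in common if admin[person] == census[person])
    return agree / len(common)


def exact_household_rate(admin, census):
    """Share of admin households whose membership exactly matches a census one."""
    admin_hh = households(admin)
    if not admin_hh:
        return 0.0
    census_sets = set(households(census).values())
    matched = sum(1 for members in admin_hh.values() if members in census_sets)
    return matched / len(admin_hh)


# Toy data: p4 is placed at address C by admin data but at B by the census.
admin = {"p1": "A", "p2": "A", "p3": "B", "p4": "C"}
census = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}

print(same_address_rate(admin, census))
print(exact_household_rate(admin, census))
```

In this toy case one misplaced person out of four breaks two of the three admin households, so household-level agreement falls much further than person-level agreement.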

Approximately half the census attribute variables for individuals can be derived using admin data. We have built up a comprehensive and detailed understanding of the quality of census characteristics that can be derived from admin data and identified the types of information collected by census that are unlikely to be obtained (Bycroft et al, 2021). High-quality variables have been released in the third iteration of the APC (Stats NZ, 2023b). For certain variables, information is of comparatively lower quality for some groups such as recent migrants to New Zealand. Further work will focus on sourcing additional data, including the electoral roll for Māori descent. Statistical imputation methods for remaining missing values will be developed to improve the representativeness of the data.

An extended survey programme is required to deliver the full range of census information needs and to support population estimation; the initial design work for this is complete. Examples of information needs not currently well supported by admin data include gender, sexual identity, languages spoken, religious affiliation, activity limitations (disability), household tenure, and housing quality, such as dampness and mould. The survey will be designed to provide information for small area geographies (for example, SA2s), and there will be some loss of detail for those variables that rely on the sample survey programme compared with a high-coverage full-field enumeration census. However, the sample survey will also offer opportunities to improve timeliness, and to evolve content flexibly.

Critical areas to progress

While substantial admin data sources exist on properties, buildings, and addresses, the development of a places index is in the early stages. We have not yet demonstrated an ability to provide a census of dwellings and their attributes without the use of a full enumeration census. Significant work is required to source the data and develop the infrastructure and methods required to produce national and subnational dwelling counts from admin data, and this will likely not include temporary dwellings. While there is confidence in the underlying data available and the programme to develop the places index is resourced and has started work, it is not possible to say whether the quality will be of the standard required for census dwelling counts by 2028. A robust measurement strategy will be required to assess the coverage and will potentially be used for coverage adjustment.

A vital step to maximise the value we get from our admin-first census is to further improve how we form admin-based households. There is a vast array of regular address notification data in the admin data, and our methods for forming households from this data have been improving iteratively. The next step will be to expand the approach to exploit patterns of address changes that help us link individuals together to improve the chances of identifying those living together at a particular reference date. This, together with more extensive use of the family data found in admin datasets, will contribute to the next iteration of household and family data available in the admin population census.

Iwi affiliation is a core population identifier for Māori and an essential requirement for census that cannot be met through currently available admin data sources. The data collected in the census provides a rich resource for producing a wide range of population measures for iwi. These include core topics collected in the census, such as population size and location, language, and housing, as well as the ability to link population summaries to admin data. However, we also know that the current census model is not meeting all the data needs of iwi and Māori. Across all the census models, Stats NZ will need to work in partnership with iwi-Māori to develop options to better support their needs, and this will take significant investment.

Further data sources would be required to improve the coverage of Māori descent. Without a full enumeration census, the combination of Māori descent information collected in electoral roll data and DIA birth registrations would provide the Māori descent indicator for between 90 and 92 percent of the population. In the short term, historic census data will further reduce the amount of missing data. Statistical methods such as imputation and the population estimation process can account for missing data, although the level of adjustment would be higher than some users would find acceptable. Wider collection of Māori descent in admin data would reduce the reliance on existing data sources and improve the coverage. Ongoing collection of Māori descent in survey collections will still be needed to validate data collected in admin sources, and to inform statistical imputation models.

The quality of ethnicity data (down to the most detailed level of the ethnicity classification, level 4) will need to be improved to support the sustainable production of population measures for ethnic communities. Improvements in the collection and coding of ethnicity are likely to add value to different agencies' abilities to monitor outcomes within their own data systems. These improvements will take time and significant resources from the agencies involved. Where to prioritise quality improvement will need to be identified in partnership with the agencies involved, to ensure the work aligns with their data strategy.

Ethnicity Data Action Plan, developed in 2023 by Te Aka Whai Ora to identify steps to improve the collection of ethnicity data across the health sector, is one example.

The collection of representative and complete information for rainbow communities is limited in admin-data settings, and may continue to be, because in many contexts it will not be necessary or appropriate to collect. If the census model moved away from a full enumeration approach, the loss of population-level identifiers for rainbow communities would reduce the quality of detailed population counts, the ability to report on attributes collected in the census, and the ability to link to admin data to report on measures such as health outcomes, income, and education. A census attribute survey would provide support for measuring these outcomes, but it would not provide the same level of detail as a full enumeration census, particularly for small populations such as trans and non-binary populations. The development of a census survey programme does offer potential to deliver a wider set of social measures, such as wellbeing, that are not able to be provided by existing surveys due to sample size. It will be important to partner with rainbow communities to understand the impact and opportunities of a change in census model for supporting data needs for rainbow communities.

Data supplied by agencies is currently a mix of sex and gender. The mandated data standard for gender, sex, and variations of sex characteristics specifies the collection and output of gender data by default, as opposed to sex. Increasingly, agencies are collecting gender, but Stats NZ will need to work with agencies to ensure that collection aligns with the data standard, and metadata is available describing the variable that is being collected.

Data standard for gender, sex, and variations of sex characteristics has more information.

The development of a survey programme is needed to extend the range of information needs that can be met by admin data. The initial sample and collection design includes the development of a mixed-mode Census Attribute Survey (CAS), sampling around 5 percent of the population annually. This will be supported by more targeted surveying of key communities, with the continued delivery of the New Zealand Household Disability Survey and Te Kupenga, and the development of a Pacific Wellbeing Survey being examples. The development of a survey programme of this scale will introduce competing information needs and a transparent process of balancing these will be important. Stats NZ has a lot of experience running social surveys; however, the development of a survey programme of this scale will require a careful approach to ensure the collection remains in budget and is able to deliver high-quality information for small communities. Investment in methods to reduce the decline in response rates, and to mitigate the impact of non-response, is needed for a CAS and our survey programme more widely.
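A rough calculation shows why a 5 percent annual sample leads to some loss of detail for small areas and small communities. The sketch below assumes simple random sampling with full response and uses the standard approximation for the standard error of a proportion with a finite population correction. The actual CAS design is mixed-mode and will differ. The area size of 3,000 people and the 10 percent prevalence are hypothetical.

```python
import math


def expected_sample(population, rate=0.05):
    """Expected number of people sampled in one year at the given rate."""
    return population * rate


def proportion_se(p, population, rate=0.05):
    """Approximate standard error of an estimated proportion p.

    Assumes simple random sampling without replacement and full response.
    """
    n = expected_sample(population, rate)
    if n <= 1:
        raise ValueError("sample too small to estimate a standard error")
    fpc = 1 - n / population  # finite population correction
    return math.sqrt(p * (1 - p) / n * fpc)


# Hypothetical small area of 3,000 people and a characteristic held by 10%.
area_population = 3000
n = expected_sample(area_population)
se = proportion_se(0.10, area_population)

print(round(n))
print(round(se, 4))
print(round(1.96 * se, 3))  # approximate 95% margin of error
```

Under these assumptions a single year's sample gives a margin of error of several percentage points on a 10 percent estimate. Pooling years or targeted oversampling are general survey options for tightening this; the source does not specify which, if any, the CAS would use.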

The continued development of data and analytical infrastructure to support the use of integrated admin and survey data in a production setting is critical. The Integrated Data Infrastructure (IDI) and Statistical Location Register (SLR) have provided an invaluable test environment for research, but this environment is not suited as an enterprise register-based statistical system, which an admin-first census requires. Dwellings are a core function of the census and the SLR provides a list of addresses, but we do not yet have the functionality to adequately identify dwellings, nor determine changes to the dwelling stock over time. The development of the Integrated Statistical Data System (ISDS) (a register-based statistical system) that enables integrated data to be used in a production setting will be a critical component of an admin-first census.

ISBN 978-1-991307-11-8

Stats NZ Public Release.