Heck...we'll even teach you how to blend your internal data into our data product models to improve data product performance.
How We Model Our Consumer Data Product Estimates
Overview
Exceed Analysis estimates a number of consumer data products at the 6 digit postal code level for all of Canada. The process involves modeling survey data from various levels of geography down to the final Canada Post 6 digit postal code level. Although these estimates can be based on a number of data sources, they rely on two key data inputs. First, Statistics Canada releases the full list of Census demographic variables (over 2,200) at the dissemination area level (56,589 regions, or roughly one for every 250 households). Second, Canada Post annually releases 6 digit postal codes (762,696 in 2020) with individual coordinates and household counts.
The process begins with survey data for the most recent year, which can be at the Provincial/Territory level (13), Census Division (293 regions), Census Metropolitan Area (35 regions), Economic Region (76 regions) or, as is often the case, some aggregation or combination of these geographies. This data is then distributed down to the dissemination area (DA) level using the most recent Census data.
The limited availability of data at the postal code level creates a debate about the most appropriate level of detail for data products. DA level data is easily the most accurate because it's the lowest level of geography where the full census variables are released. DA level data, however, lacks the convenience of postal code level data, particularly for targeted marketing and customer level analysis. For example, postal code market data can easily be linked to customer level data or other survey data for market share and potential analysis where the postal code is known. It's effectively a convenience versus accuracy choice.
Exceed navigates this trade-off between accuracy and convenience by estimating data at the most accurate DA level. Postal code data at the 6 digit level, which includes household numbers, is then used to distribute the DA level data down to the individual latitude and longitude of each postal code. This ensures that data distribution mirrors the geographical population growth in Canada. It also means that although individual postal codes within a DA may have different population numbers, they will all have the same average value (i.e. demographics).
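The distribution step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the postal code labels and all figures are hypothetical, and a simple proportional split by household count stands in for whatever refinements the production models apply.

```python
def distribute_to_postal_codes(da_total, households_by_pc):
    """Split a DA level total across postal codes in proportion to households."""
    total_households = sum(households_by_pc.values())
    if total_households == 0:
        raise ValueError("DA has no households to allocate to")
    return {
        pc: da_total * hh / total_households
        for pc, hh in households_by_pc.items()
    }


# Hypothetical household counts for three postal codes inside one DA
households = {"PC_A": 10, "PC_B": 30, "PC_C": 60}

# Hypothetical DA level total (for example, persons in a given age group)
allocated = distribute_to_postal_codes(1200, households)

# Totals differ by postal code, but the per-household average is identical
per_household = {pc: allocated[pc] / households[pc] for pc in households}
```

Each postal code receives a different total, but dividing by its household count gives the same average everywhere in the DA, which is the property noted in the paragraph above.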
In lower population DAs, where the number of actual records for a particular variable is fewer than 4, Statistics Canada suppresses the data to protect the confidentiality of individual respondents' personal information. This shows up as missing values, which are estimated using a geographical nearest neighbour algorithm.
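As a rough illustration of nearest neighbour imputation, the sketch below fills a suppressed DA value with the value of the closest DA that has data. The details are assumptions rather than a description of the actual algorithm: it uses a single nearest neighbour, great-circle distance between hypothetical DA centroids, and invented values.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def impute_nearest(das):
    """Fill suppressed (None) values from the nearest DA with a known value.

    das maps a DA id to a (latitude, longitude, value) tuple.
    """
    known = [rec for rec in das.values() if rec[2] is not None]
    if not known:
        raise ValueError("no DA has a known value to borrow from")
    filled = {}
    for da_id, (lat, lon, value) in das.items():
        if value is None:
            nearest = min(
                known, key=lambda k: haversine_km(lat, lon, k[0], k[1])
            )
            value = nearest[2]
        filled[da_id] = value
    return filled


# Hypothetical DA centroids; DA_B is suppressed
das = {
    "DA_A": (45.40, -75.70, 52000),
    "DA_B": (45.42, -75.69, None),
    "DA_C": (43.65, -79.38, 61000),
}
filled = impute_nearest(das)
```

Here DA_B borrows its value from DA_A, which is a few kilometres away, rather than from DA_C, which is several hundred kilometres away. A production approach might instead average several neighbours or weight them by distance.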
Our US Connection
Exceed is partnered with Applied Geographic Solutions, a leading US supplier of small area data that brings over 30 years of data modeling experience to the table. We believe that collaboration makes for best-in-class solutions on both sides of the border. It's interesting to note that US small area data is modeled at the Block Group level, the equivalent of the Canadian DA, and the lowest level of geography where both countries release their full census variables. Canadian DAs have roughly half the population of a US Block Group, which makes data suppression a greater problem and, in turn, complicates the further modeling of data down to the 6 digit postal code. Nonetheless, even with the higher Block Group population, US data isn't modeled down to the ZIP+4 level, which is the equivalent of the Canadian 6 digit postal code.
Demographic Data (333 variables)
Population estimates are based on Statistics Canada's annual population survey, which is released at the most detailed level (all ages and gender) by Census Division. The distribution to lower levels of geography depends on the most recent Canada Post household counts. Household counts are aggregated and reconciled to dissemination area and dissemination block average household size control totals from the latest Census. The end result is a comprehensive estimate of population synchronized to Canada Post household counts.
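One way to picture this reconciliation is sketched below. The exact procedure is not spelled out here, so this is an assumption: household counts aggregated to each DA are multiplied by that DA's Census average household size, and the results are then scaled so they sum to the Census Division population from the annual survey. All names and figures are hypothetical.

```python
def estimate_population(households_by_da, avg_hh_size_by_da, division_population):
    """Estimate DA populations that sum to a Census Division control total."""
    # Step 1: raw population implied by household counts and household size
    raw = {
        da: households_by_da[da] * avg_hh_size_by_da[da]
        for da in households_by_da
    }
    raw_total = sum(raw.values())
    if raw_total == 0:
        raise ValueError("no households in this Census Division")
    # Step 2: scale every DA so the division total matches the control total
    scale = division_population / raw_total
    return {da: pop * scale for da, pop in raw.items()}


# Hypothetical Canada Post household counts aggregated to two DAs
households_by_da = {"DA1": 200, "DA2": 300}
# Hypothetical average household sizes from the latest Census
avg_hh_size_by_da = {"DA1": 2.5, "DA2": 3.0}

population = estimate_population(households_by_da, avg_hh_size_by_da, 1540)
```

With these numbers the raw populations are 500 and 900, and scaling to the control total of 1,540 gives approximately 550 and 990.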
Statistics Canada also releases population and dwelling counts at the more detailed dissemination block level (489,676 regions) for the Census. While this would add detail to the population distribution within each DA, we don’t use this data because of a misalignment between dwelling counts from Canada Post postal code coordinates and Statistics Canada geography. Instead, we rely on the dwelling count relationship at the more aggregated DA level for a higher level of accuracy.
Income estimates are based on Statistics Canada's annual income survey and taxfiler data when applicable. Distribution to lower levels of geography is based on the income distribution from the latest Census, combined with the impact of population and household counts from Canada Post. This assumes that population growth within a particular geographic region will have similar income and demographic characteristics to existing households.
The weakness of the income data is that its upper end category is too low. The upper end of the distribution for individuals and household income is $150,000 and over, and $200,000 and over, respectively.
Marital status, visible minority, labour force and occupation estimates follow a similar process to income and population. Annual surveys for each category at various levels of geography are modeled down to the DA level using the latest Census data and then distributed across 6 digit postal codes using the latest population estimates and household counts.
Family Structure, Education, Ethnic Origin, Mother Tongue and Language Spoken Most Often at Home variables do not have current annual surveys. As a result, these estimates are based on the latest Census data at the DA level and then ratio-adjusted to the most recent household and population counts at the postal code level. What this means is that when household and population counts increase as a result of the growth in the Canada Post postal code data, the specific ethnic mix within a DA is assumed to remain unchanged from the last Census.
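A minimal sketch of this ratio adjustment follows, assuming it amounts to holding each category's Census share fixed and applying it to the updated population. The category labels and counts are hypothetical.

```python
def ratio_adjust(census_counts, current_population):
    """Scale Census category counts to a new total, keeping shares fixed."""
    census_total = sum(census_counts.values())
    if census_total == 0:
        raise ValueError("Census counts sum to zero")
    return {
        category: current_population * count / census_total
        for category, count in census_counts.items()
    }


# Hypothetical mother tongue counts for one DA at the last Census
census_counts = {"English": 300, "French": 150, "Other": 50}

# Hypothetical current population after growth in postal code household counts
adjusted = ratio_adjust(census_counts, 600)

# Shares before and after are the same by construction
census_share = census_counts["Other"] / sum(census_counts.values())
adjusted_share = adjusted["Other"] / sum(adjusted.values())
```

The DA grows from 500 to 600 people, and every category grows by the same 20 percent, so the mix is unchanged from the Census.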
Because population growth in Canada is driven primarily by immigration, this is an important assumption. Immigration data is available by country of origin and by province of destination, but the two are not available together. Modeling actual immigration from country of origin to individual Canadian neighbourhoods would therefore require many qualitative assumptions that are open to debate, and so would introduce additional error.
Consumer Spend Data (294 variables)
Consumer spend data is based on Statistics Canada's annual Survey of Household Spending. Provincial survey expenditures are estimated by household for each income quintile (5 levels). This survey data is then modeled to the DA level based on the latest Census income quintile data and then to the postal code level using average household expenditures by DA.
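The quintile step can be illustrated with a small weighted average. This is a sketch under assumptions, not the production model: the DA's share of households in each income quintile and the provincial spend per household by quintile are invented figures, and the function name is hypothetical.

```python
def da_average_spend(quintile_shares, provincial_spend_by_quintile):
    """Average household spend for a DA, weighted by its income quintile mix."""
    if len(quintile_shares) != len(provincial_spend_by_quintile):
        raise ValueError("expected one share per quintile")
    if abs(sum(quintile_shares) - 1.0) > 1e-9:
        raise ValueError("quintile shares must sum to 1")
    return sum(
        share * spend
        for share, spend in zip(quintile_shares, provincial_spend_by_quintile)
    )


# Hypothetical provincial spend per household on one category, by quintile
provincial_spend = [5000, 7000, 9000, 11000, 14000]
# Hypothetical share of the DA's households in each quintile (from the Census)
da_shares = [0.1, 0.2, 0.4, 0.2, 0.1]

da_avg = da_average_spend(da_shares, provincial_spend)

# Postal code total: DA average spend times the postal code's household count
postal_code_households = 40
postal_code_spend = da_avg * postal_code_households
```

With these inputs the DA average works out to about 9,100 per household, so a postal code with 40 households is assigned roughly 364,000 in total spend for the category.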
The most technical part of modeling spend data involves imputing missing survey values where the number of survey responses is too low and the values are therefore suppressed. Data is imputed using a variety of methods, including previous surveys adjusted for growth using other quintiles, and aggregated regional geography ratio-adjusted for provincial differences.
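The first of these methods might look something like the sketch below. It is an assumption about the general idea rather than the actual procedure: a suppressed quintile is filled with its previous survey value, scaled by the average growth observed in the quintiles that were reported in both surveys. The figures are hypothetical and the previous survey is assumed to be complete.

```python
def impute_from_previous(current, previous):
    """Fill suppressed (None) quintile values using growth in other quintiles."""
    # Growth ratios for quintiles reported in both surveys
    ratios = [
        cur / prev
        for cur, prev in zip(current, previous)
        if cur is not None and prev
    ]
    if not ratios:
        raise ValueError("no quintile is available to estimate growth")
    growth = sum(ratios) / len(ratios)
    return [
        cur if cur is not None else prev * growth
        for cur, prev in zip(current, previous)
    ]


# Hypothetical spend by quintile; the third quintile is suppressed this year
previous_survey = [100, 200, 300, 400, 500]
current_survey = [110, 220, None, 440, 550]

imputed = impute_from_previous(current_survey, previous_survey)
```

The four reported quintiles each grew by 10 percent, so the suppressed value is estimated at about 330. Reported values are passed through unchanged.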
Consumer Wealth Data (66 variables)
Consumer wealth data is based on the Statistics Canada Survey of Financial Security, which is typically available every second year. Like consumer spend data, wealth survey results are available by income quintile and are therefore estimated at the DA level in the same way, using the latest Census data. For the Northern Territories, which are not covered by the survey, data is modeled from a financial health index survey by the Canadian Council on Social Development.
There are a number of potential issues with the wealth data. First, the data is self-reported. Financial items are easily totaled, but it is much more difficult for survey respondents to report non-financial assets, particularly real estate and business equity. Second, the timing and nature of survey results, coupled with the pace of change in variables like real estate, mean that values will lag, and be lower than, actual values in real time.
Copyright © 2020 Exceed Analysis - All Rights Reserved.