Helping Canadians to “better understand their country — its population, resources, economy, society and culture” is no small feat in a complicated, technology-driven and quickly evolving world. Still, that mandate sits at the top of Statistics Canada’s website.
The government agency (often referred to as StatCan) found itself in headlines in October 2018, when it requested private transactional data from banks. In response, the agency received complaints, and the Office of the Privacy Commissioner launched an investigation. StatCan defended its request in an October 28 blog post, saying that the data were necessary to continue its work: “Statistics Canada cannot meet the data needs of Canadians with outdated tools and processes, while rich sources of information continue to grow and reflect today’s sophisticated economy and society.”
It appears StatCan — whose main objectives include collecting statistical information and promoting statistical standards — is, like a lot of other entities, facing something of a crisis as the nature of data and the public’s expectations about data change. Data is more abundant than ever but also increasingly difficult to obtain since that rich resource now primarily resides in the private sector. Companies view data as their own property, creating an ironic situation: a national data deficit.
The Rise of Data Deficits
Governments used to be the primary holders of data, and industry depended upon government agencies to collect and provide information on the state of the marketplace, industry and society. Generally, this collection was conducted through StatCan surveys, but today this process — and the nature of the data collected — has become much more complicated. Now the roles seem to have reversed: corporations gather vast amounts of data in real time, something statistical agencies struggle to do. Corporations hold a wealth of data about their activities, the marketplace and individuals but are often unwilling to provide it for public use.
This shift has created a data deficit: a growing gap between what the government knows and has access to, and the data that the private sector holds and is willing to provide. This is a critical issue since a data deficit can have a significant impact on public policy, regulations and the relationship between governments, corporations and citizens.
In testimony at Parliament’s Industry, Science and Technology Committee, Anil Arora, Canada’s chief statistician, cited Canadians’ increased use of digital devices and services as a primary driver of the data deficit, and stressed the need for StatCan to keep up. The agency needs this data to understand the housing market, debt levels and, in particular, the emergence of the gig economy.
The “gig economy” offers a case in point. Companies like Uber and Airbnb emerge and initially ignore regulations (or simply operate in a space that has yet to be regulated) in the name of “disruption,” then have a significant hand in the establishment of regulations that accommodate their business model.
These and other companies in the gig economy have been reluctant to share the data they generate and use with governments, as it offers them a considerable advantage when it comes to depicting their role in the larger economy. Sharing that data might also compromise their market position, empower their competitors or aid in the regulation of their sector — which the companies in question are likely to resist.
Another kind of data deficit emerges when the collection of data depends upon external agencies that are prone to disruptions in their operation. For example, the shutdown of the US government in December 2018 and January 2019 impacted StatCan’s ability to gather export data normally collected and shared by the US government.
However, data deficits are not just about the data asymmetry that exists between governments and corporations. They are also about the gap between the data that exists and the data that could exist with appropriate regulations and oversight — and how these data could be used for public policy purposes. For example, the JPMorgan Chase & Company Institute has used data from their financial services operations to do interesting analysis of consumer expenditures, gig work and income, something the public sector would love to have.
The pharmaceutical industry provides an example of how a lack of regulation can create a data deficit that prevents researchers and regulators from properly protecting the public against potential harm. Whether from initial research, clinical trials or ongoing usage, there is a considerable amount of data generated around drugs that helps the industry but is not entirely or effectively shared with other relevant parties.
In particular, there is an opportunity to continue collecting data on the performance and impact of drugs after they have been approved and are in public use. This data can and should be collected, but the example raises some critical points around data governance. On the one hand, this data is valuable, and companies are reluctant to give away what could be sold or licensed. On the other hand, companies worry about the trust their customers place in them and do not want to be seen as undermining that trust by turning over customers’ information to the government. Ultimately, withholding this data can also be harmful, and serves no one well.
The Transparency Dilemma
Researchers are increasingly demonstrating that artificial intelligence (AI) is not impartial or free of bias. Often, the bias is a product of the data used to train the algorithm or construct the machine learning model. Put simply, data is always biased because it reflects the preconceptions of the individual or organization that is collecting it. It can reflect the focus or goal of that collection process or exclude information that was not deemed relevant — or inadvertently (or deliberately) ignored.
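The mechanism is easy to see in miniature. The sketch below uses entirely fabricated, illustrative numbers: a hypothetical population in which heavy technology users are more likely to use gig-economy apps, surveyed through an app-based channel that over-samples exactly those users. The resulting estimate inherits the slant of the collection method.

```python
import random

random.seed(0)

# Fabricated, illustrative population: heavy technology users (30%)
# are assumed to be far more likely to use gig-economy apps.
population = []
for _ in range(100_000):
    heavy = random.random() < 0.3
    uses_app = random.random() < (0.8 if heavy else 0.4)
    population.append({"heavy_tech_user": heavy, "uses_app": uses_app})

# An app-delivered survey reaches heavy tech users far more often,
# so the collected sample inherits that slant.
sample = [p for p in population
          if random.random() < (0.9 if p["heavy_tech_user"] else 0.1)]

true_rate = sum(p["uses_app"] for p in population) / len(population)
observed = sum(p["uses_app"] for p in sample) / len(sample)
print(f"true adoption: {true_rate:.2f}  biased survey estimate: {observed:.2f}")
```

Nothing in the collected sample is “wrong” — each record is accurate — yet the aggregate overstates app adoption substantially, purely because of who the collection process could reach.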
While transparency about how data is collected should help ease public concerns about bias, it can also do the opposite.
In an email interview, Concordia Associate Professor Fenwick McKelvey framed this problem as “the transparency dilemma.” “Because there seems to be little public awareness of data collection, any disclosures raise privacy concerns. Being proactive and explaining to the public how they collect data might lead to more blowback than just keeping quiet about it. Diminished trust in government puts federal agencies at a clear disadvantage.”
McKelvey’s point of diminished trust is an important one; lack of trust would have impacts beyond government as well. For example, amid waning trust, citizens may be reluctant to share data at all or to use certain technologies.
It’s important to note that StatCan has standards in place around data collection, use and dissemination that follow international guidelines and best practices. For example, StatCan results are only shared after personally identifiable information is removed and the data has been anonymized — but the rise of surveillance and data breaches has, perhaps rightly so, raised the public’s level of concern. The federal Office of the Privacy Commissioner is currently investigating StatCan’s request for transaction records (the purchases Canadian consumers make using their debit or credit cards, for example) as well as a mandatory survey that the agency wants to conduct. Given the heightened concern, a rethink of standards might be warranted.
The rapid rate of technological change means that standards that were acceptable yesterday may not be feasible tomorrow. Expectations around privacy largely rest around anonymization standards that protect personally identifiable information.
Advances in AI and de-anonymization techniques suggest that information that was previously thought to be secure may no longer be. A recent study published in JAMA Network Open demonstrated how researchers were able to re-identify patients whose health information had been anonymized.
One of the study’s authors, Anil Aswani of the University of California, Berkeley, spoke with Reuters: “The study shows that machine learning can successfully re-identify the de-identified physical activity data of a large percentage of individuals, and this indicates that our current practices for de-identifying physical activity data are insufficient for privacy.”
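A classic illustration of why removing names is not enough is the linkage attack: joining an “anonymized” release against a public record on quasi-identifiers such as age, postal prefix and sex. The sketch below uses entirely fabricated records (it is not the JAMA study’s machine-learning method, just the simplest form of the same underlying risk).

```python
# Illustrative linkage attack on fabricated data: names were removed
# from the health release, but quasi-identifiers remain.
anonymized_health = [
    {"age": 34, "postal": "K1A", "sex": "F", "diagnosis": "asthma"},
    {"age": 58, "postal": "M5V", "sex": "M", "diagnosis": "diabetes"},
    {"age": 41, "postal": "H2X", "sex": "F", "diagnosis": "hypertension"},
]

# A separate, public registry (also fabricated) that includes names.
public_registry = [
    {"name": "A. Tremblay", "age": 41, "postal": "H2X", "sex": "F"},
    {"name": "B. Singh",    "age": 58, "postal": "M5V", "sex": "M"},
]

def reidentify(release, registry):
    """Join on quasi-identifiers; a unique match recovers an identity."""
    hits = []
    for rec in release:
        matches = [p for p in registry
                   if (p["age"], p["postal"], p["sex"]) ==
                      (rec["age"], rec["postal"], rec["sex"])]
        if len(matches) == 1:  # unique match -> record re-identified
            hits.append((matches[0]["name"], rec["diagnosis"]))
    return hits

print(reidentify(anonymized_health, public_registry))
```

Here two of the three “anonymous” records link back to named individuals. Machine-learning approaches like the one in the study extend this idea to far messier signals, such as patterns in physical activity data.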
Of course, this concern doesn’t just apply to the practices of statistics agencies, but to the methods of any organization collecting data. While the public protested sharing anonymized data with StatCan, citizens willingly hand over personal data, financial data, geolocation data and more to the private sector (think Facebook, Amazon or Uber, among others) every single day. Moreover, StatCan makes its methodologies available — something the private sector generally does not do. Statistics Canada is also involved in the development of these methodologies and has expertise in both survey and administrative uses of data.
The existence of a single data set isn’t the problem; it’s the combination of data, its uses in decision making and its contribution to machine learning (that will later produce even more decisions) that sits at the core of our dilemma.
Reimagining Data Standards
The standards for collecting and sharing data could use some reimagining — and StatCan could be a good setting in which to carry it out.
“There seems to be a lack of imagination about what public data could be,” McKelvey wrote. Like a number of researchers, he argues that this lack stems at least in part from limited public understanding of data.
StatCan and other government entities could build out their own mandates by providing more accessible data for those researching and developing AI. From a Canadian perspective, this would seem ideal given the nation’s high concentration of AI researchers and companies, and it fits well with Statistics Canada’s modernization initiative, which aims to increase access to data to foster innovation and inclusion.
This expanded role — stewardship of data, as well as provision of data sets for the development of AI — could be an opportunity for StatCan (or researchers working with StatCan) to better monitor the misuse of data, to aid in the creation of oversight and countermeasures, and to help build a public that is active in the establishment of data standards and frameworks.
Data and statistics should no longer be considered a passive or secondary part of government operations. Rather, they should be recognized as essential elements and driving forces, not just for the public sector, but for all of society. Quality data is central to the country’s success and prosperity, especially when it comes to developing evidence-based policy.
Canadian Data Governance in 2019
Canada’s development of a national data strategy is a nod to national priorities: data, the data deficit and the government’s ability to manage data are on the agenda. Combined with an upcoming federal election, these subjects are likely to become politically charged and will require greater debate.
It is clear that data is increasingly valuable, and the government needs to address and move toward resolving the data deficit that already exists. The field of predictive analytics provides an interesting — and compelling — case for a national focus on data; the discipline operates on the premise that with enough data, one can, with alarming accuracy, predict the future. That isn’t to say that StatCan’s recent request for financial data was the keystone to predicting the next economic shift or consumer trend. But, over time, improved, thoughtful collection and use of data at a national level has the potential to chip away at the data deficit, and to support Canada’s role in the data economy.
If the goal at StatCan is to genuinely help Canadians better understand their country, then we should not underestimate what is involved in achieving that. Closing the data deficit, modernizing StatCan, and creating a data strategy roadmap for the federal public service are all essential steps. However, the real challenge will be getting popular support and buy-in.