A $1.2 million grant from NSF's Secure and Trustworthy Cyberspace program, awarded in part to a University of Michigan School of Information research team, will go toward building a search engine and other tools to help collect and classify billions of privacy documents on the web.
The overall goal of the work, led by Penn State researchers, is to make the web safer for users by helping scientists study the data practices of online services and what those services disclose in privacy policies and other documents.
The team, which also includes a member of the Future of Privacy Forum, an education and advocacy group based in Washington, D.C., will develop mechanisms for crawling the web to locate and index these documents, large-scale techniques for interpreting them, and tools for research and practical use.
"This multidisciplinary project will dramatically improve researchers', practitioners', and policymakers' capabilities for analyzing and understanding the state of digital privacy, including the effects of regulation," said Florian Schaub, assistant professor at the School of Information and principal investigator of the University of Michigan team. "The goal is to create an infrastructure and tools that researchers can readily use rather than having to build up their own data collection and analysis pipelines from scratch, as is currently the status quo."
The search engine, called PrivaSeer, will use artificial intelligence and natural language processing to help researchers collect, review and analyze documents, including privacy policies, terms of service agreements, cookie policies, privacy bills and laws, regulatory guidelines and other related texts on the web.
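To give a sense of the kind of crawl-and-classify pipeline the project describes, the following is a minimal sketch, not the project's actual method: it fetches a page and flags it as a candidate privacy document using a simple keyword heuristic. The URL, keyword list and threshold are illustrative assumptions; PrivaSeer's own classifiers rely on far more sophisticated NLP.

import re
import requests

# Hypothetical keyword heuristic; this stands in for the project's NLP-based
# document interpretation, which is not reproduced here.
PRIVACY_CUES = re.compile(
    r"\b(privacy policy|personal data|cookies?|data controller|third[- ]part(y|ies))\b",
    re.IGNORECASE,
)

def looks_like_privacy_document(url: str, min_hits: int = 5) -> bool:
    """Fetch a page and flag it if privacy-related terms appear repeatedly."""
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        return False
    return len(PRIVACY_CUES.findall(html)) >= min_hits

if __name__ == "__main__":
    # Example URLs are placeholders for pages a crawler might encounter.
    for candidate in ["https://example.com/privacy", "https://example.com/about"]:
        print(candidate, looks_like_privacy_document(candidate))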
"Privacy policies are documents that we encounter in our day-to-day lives when we visit websites and, in theory, we're supposed to read them," said Shomir Wilson, assistant professor of information sciences and technology, Penn State and Institute for Computational and Data Sciences affiliate. "But, in practice, few people do that. It's not practical and it doesn't fit into how people use the internet. People often don't have the legal knowledge to understand these documents, either."
Wilson, who serves as the lead principal investigator for the project, said numerous documents about organizations' privacy and data practices are available on the web, but researchers face the daunting challenge of identifying and gathering them through painstaking manual searches.
The search engine can also offer insights into how policies change over time and help users navigate the complex field of online privacy, said C. Lee Giles, the David Reese Professor of Information Sciences and Technology at Penn State and a co-principal investigator on the project.
"One of the reasons to have a privacy policy search engine is that you can get an idea about how different companies treat their user privacy currently and over time," said Giles, who is also an ICDS associate. "This can also inform users how they may want to react to those companies."
In addition to the search engine, the team plans to develop corpora (large datasets of text), application programming interfaces (APIs), and tools aimed at privacy researchers, advocates and policymakers.
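As a rough illustration of the kind of bulk analysis such corpora would enable for researchers, here is a small, hypothetical sketch that scans a local folder of privacy policy text files and tallies how often a few terms of interest appear. The directory name, file layout and term list are assumptions for illustration; they are not the project's published datasets or APIs.

from collections import Counter
from pathlib import Path

# Hypothetical local corpus directory of plain-text privacy policies.
CORPUS_DIR = Path("privacy_policies")

term_counts = Counter()
doc_lengths = []

for policy_file in CORPUS_DIR.glob("*.txt"):
    text = policy_file.read_text(encoding="utf-8", errors="ignore").lower()
    doc_lengths.append(len(text.split()))
    # Illustrative terms a privacy researcher might track across a corpus.
    for term in ("third party", "opt out", "data retention", "sell"):
        if term in text:
            term_counts[term] += 1

if doc_lengths:
    print(f"Documents analyzed: {len(doc_lengths)}")
    print(f"Average length (words): {sum(doc_lengths) / len(doc_lengths):.0f}")
    print("Documents mentioning each term:", dict(term_counts))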
The University of Michigan team led by Schaub will focus on designing, developing and evaluating such tools with a human-centered design perspective, and on the scientific analysis of privacy documents gathered by the project.
Gabriela Zanfir-Fortuna, director for global privacy at the Future of Privacy Forum, is a co-principal investigator on the project.
A preliminary version of the PrivaSeer search engine is available at privaseer.ist.psu.edu.