By Eric Siegel
The NSA can combine bulk data collection with predictive analytics to better target law enforcement activities. But this little-known capability both intensifies and redefines the debate over how much data governments should be collecting.
The US’s National Security Agency (NSA) has endured intense global scrutiny and suffered heavy backlash over its mass data collection, which whistleblower Edward Snowden unveiled in detail in 2013. But don’t give too much credence to the news or even the books – public discourse leaves out the greatest power law enforcement stands to gain from this data.
Summary of the mainstream debate regarding NSA data collection:
• Privacy advocates: The NSA is violating civil liberties by collecting data on a massive scale about private citizens, including the majority who are not even suspected of any wrongdoing. Access to this data, whether in-house or by proxy via telecom companies, facilitates arbitrary snooping.
• The NSA (and supportive legislators): We require comprehensive data in-house so we can rapidly investigate specific individuals when they become of interest. We do not inspect the activities of ordinary civilians in general.
This contentious dialogue only touches on half the story. Both sides – including the most visible critics of the recent NSA bulk data shutdown – fail to address what’s really at stake for law enforcement: Data empowers not only the investigation of established suspects, but also the discovery of new suspects. I would like to propose the following terminology for this emerging form of data-driven law enforcement:
Automatic Suspect Discovery (ASD) – The identification of previously unknown potential suspects by applying predictive analytics to flag and rank individuals according to their likelihood to be worthy of investigation, either because of their direct involvement in, or relationship to, criminal activities.
ASD provides a novel means to unearth new suspects. Using it, law enforcement can hunt scientifically, more effectively targeting its search by applying predictive analytics, the same state-of-the-art, data-driven technology behind fraud detection, financial credit scoring, spam filtering, and targeted marketing.
How It Works: Why the Whole Haystack Is Needed to Shrink the Haystack
To harness this potential, law enforcement needs the “whole haystack.” The government doesn’t desire data about you just to spy at will – on the off chance you turn out to be a suspect. Rather, they actually require this data as a baseline in order to pursue their greater objective with ASD. This approach relies on wide-scale data access, even including data about both you and me – a full regimen of data about normal, innocent civilian activity unrelated to crime of any sort. Mathematically speaking, the broader a swath of noncriminal cases fed into the analysis, the better it works.
Law enforcement (antiterrorism or otherwise) is a numbers game, a quest to find needles in the haystack that is the general population. The working hours spent by agents, officers, and analysts constitute a precious, finite resource that must be allocated as effectively as possible. As staff collects evidence, follows leads, and studies forensics, there is no magic oracle to focus these efforts and ensure the quest is efficient. But predictive analytics can better target a portion of the work, focusing it on individuals predicted more likely to be of interest.
As with fraud detection, prediction shrinks the haystack to be searched. This multiplies the effectiveness of available human resources. By focusing time on the top echelon, those with the highest predictive scores, an investigator is more likely to come across worthy suspects. While it is reasonable to assume ASD pays off over time, investigators must understand the odds have only shifted; it’s not a magic crystal ball. Most targets of investigation still turn out to be innocent – that is to say, the false positive rate will be lowered but by no means eliminated; the haystack is smaller but still large.
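To make the arithmetic concrete, here is a minimal sketch in Python using entirely synthetic data. The population size, the number of needles, and the score distributions are assumptions chosen for illustration, and `precision_at_top` is a hypothetical helper, not part of any real system. The sketch ranks everyone by a noisy predictive score and compares the share of needles in the top 1 percent with the share in the population as a whole.

```python
import random


def precision_at_top(scores, labels, fraction):
    """Share of true needles among the top `fraction` of cases, ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(label for _, label in ranked[:k]) / k


# Synthetic haystack (assumed numbers): 100,000 cases, 100 of them needles.
rng = random.Random(0)
labels = [1] * 100 + [0] * 99_900

# Assumed imperfect model: needles tend to score higher, but the two
# groups overlap heavily, so the ranking is far from a crystal ball.
scores = [rng.gauss(2.0, 1.0) if y else rng.gauss(0.0, 1.0) for y in labels]

base_rate = sum(labels) / len(labels)            # needles per case overall
top_rate = precision_at_top(scores, labels, 0.01)  # needles per case in top 1%
lift = top_rate / base_rate

print(f"base rate: {base_rate:.4f}")
print(f"top 1% rate: {top_rate:.4f}")
print(f"lift: {lift:.1f}x")
```

Under these assumptions the top-ranked slice is many times richer in needles than the population at large, yet most of the cases in that slice are still hay. That is the sense in which the odds have shifted without false positives being eliminated.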
But this requires bulk data. As with all application areas, predictive analytics learns from data that encodes both positive and negative cases – in this case, both known needles, e.g., perpetrators or suspects, and the vast haystack of ordinary civilians, respectively. The analytical number-crunching process builds models (e.g., patterns or other formulations) to distinguish needles from hay. The effectiveness of this process depends on access to bulk data.
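A toy sketch in Python shows why the negative cases matter. It is not a description of any agency's method. It swaps in a simple naive Bayes-style log-likelihood ratio on two made-up binary features, "A" and "B", with invented counts, and `fit_log_odds` and `score` are hypothetical names. Feature A is common among the known needles, but only the haystack reveals that it is just as common among everyone else.

```python
import math


def fit_log_odds(positives, negatives):
    """Learn per-feature log-likelihood ratios from labeled binary feature rows."""
    weights = []
    for j in range(len(positives[0])):
        # Laplace smoothing keeps the estimated rates away from 0 and 1.
        p_pos = (sum(row[j] for row in positives) + 1) / (len(positives) + 2)
        p_neg = (sum(row[j] for row in negatives) + 1) / (len(negatives) + 2)
        weight_if_present = math.log(p_pos / p_neg)
        weight_if_absent = math.log((1 - p_pos) / (1 - p_neg))
        weights.append((weight_if_present, weight_if_absent))
    return weights


def score(weights, row):
    """Sum of log-likelihood ratios; a higher score means more needle-like."""
    return sum(on if x else off for (on, off), x in zip(weights, row))


# Invented data: each row is (feature A, feature B).
positives = [(1, 1)] * 6 + [(1, 0)] * 3 + [(0, 0)] * 1        # 10 known needles
negatives = [(1, 1)] * 10 + [(1, 0)] * 890 + [(0, 0)] * 100   # 1,000 cases of hay

weights = fit_log_odds(positives, negatives)
for name, (on, off) in zip("AB", weights):
    print(f"feature {name}: present {on:+.2f}, absent {off:+.2f}")
```

Looking at the needles alone, feature A appears in nine of ten cases and seems highly suspicious. With the haystack as a baseline, its learned weight lands near zero, while the rarer feature B carries almost all of the signal. The more noncriminal cases the model sees, the more reliably it can make that distinction.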
Presumption: The NSA Uses Predictive Analytics
It’s a foregone conclusion the National Security Agency (NSA) considers predictive analytics a strategic priority. While any such activities are necessarily a secret, wondering whether they use it is like speculating whether a chef who bought flat pasta, meat sauce, mozzarella, and ricotta is making lasagna. Beyond a reasonable doubt, the world’s largest spy organisation running the country’s largest surveillance data center and employing the world’s largest number of Ph.D. mathematicians strives to analytically learn from data.
There’s much supporting the presumption that the NSA has worked with predictive analytics and will continue to do so. Official documents established NSA capabilities in machine learning; the NSA has purchased intelligence software solutions that include predictive analytics capabilities. NSA job postings for “data scientists” seek candidates experienced with machine learning and other related technologies. The FBI applies predictive analytics for terrorism, and, more generally, this technology stands clear as an increasingly common practice for law enforcement of all kinds, including US Armed Forces-funded terrorism prediction, predictive policing, and recidivism prediction, as well as fraud detection, arguably the leading government application of predictive analytics.
Example Patterns: What It Could Discover
With this in mind, let’s look at what the NSA could discover with predictive analytics. The fact is, data brims with predictive potential. Even when the data about each individual is limited to metadata – which characterises e-mail and telephone communications by their time, date, destination, and the like – this provides some of the most revealing nuts and bolts of a person’s behaviour.
A predictive model acts as a choosy, discriminating fishing net. For example, Defense Department-funded university research identified certain circumstances – characterised by the following pattern – that present an 88 percent probability of an attack by the South Asian terrorist organisation Lashkar-e-Taiba:
• PATTERN: Between five and 24 of the organisation’s operatives have been arrested and operatives are on trial in India or Pakistan.
In a similar vein, such patterns could serve to identify potential attackers rather than impending attacks. To illustrate the concept, here are two example patterns to identify possible suspects that could be generated by predictive analytics – these are fictional, for illustrative purposes only:
• PATTERN: The caller has placed calls from at least two countries per week for eight months, averaging four countries per week; has placed two calls to numbers two degrees of separation from a hotlist of numbers; and has received a call from a hotlist number within the last four hours (such a rule could trigger a real-time alert to analysts).
• PATTERN: The e-mail address, logged into at a flagged Internet café, is likely a proxy for another e-mail address that has second-degree ties to a hotlist of e-mail addresses. The proxy pairing is based on the frequency of forwards between the two that are not replied to, the overlap in the sets of correspondents, and similar geolocation login patterns.
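To show what such a pattern amounts to once a model has produced it, here is the first fictional pattern above written as a simple rule in Python. Everything in this sketch is hypothetical: the `CallerSummary` fields, the choice of 35 weeks as an approximation of eight months, and the reading of "an average of four" and "two calls" as minimums are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

WEEKS_IN_EIGHT_MONTHS = 35  # assumption: roughly 8 * 52 / 12 weeks


@dataclass(frozen=True)
class CallerSummary:
    """Hypothetical per-caller metadata summary; all field names are illustrative."""
    weekly_country_counts: Tuple[int, ...]      # countries called from, oldest week first
    second_degree_hotlist_calls: int            # calls placed to numbers two hops from a hotlist
    hours_since_hotlist_call: Optional[float]   # None if no hotlist number has called


def matches_pattern(caller: CallerSummary) -> bool:
    """Return True when every condition of the fictional pattern holds."""
    recent = caller.weekly_country_counts[-WEEKS_IN_EIGHT_MONTHS:]
    if len(recent) < WEEKS_IN_EIGHT_MONTHS:
        return False  # not enough history to evaluate the eight-month condition
    return (
        min(recent) >= 2                              # at least two countries every week
        and sum(recent) / len(recent) >= 4            # averaging four countries per week
        and caller.second_degree_hotlist_calls >= 2   # two calls near the hotlist
        and caller.hours_since_hotlist_call is not None
        and caller.hours_since_hotlist_call <= 4      # hotlist call within four hours
    )


flagged = CallerSummary((4,) * 35, 2, 1.5)
ordinary = CallerSummary((1,) * 35, 0, None)
print(matches_pattern(flagged), matches_pattern(ordinary))
```

A rule like this is cheap to evaluate against every record, which is what makes a real-time alert plausible. The hard part is not the rule itself but discovering which thresholds are worth using, and that discovery depends on the data.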
Despite claims that bulk data has thus far provided limited value for homeland security, I believe that continuing to develop predictive analytics deployments that incorporate human creativity and law enforcement expertise is bound to deliver. Although a particular pattern may “catch no fish” and come up empty, when a number of even the most arcane patterns are applied across a large population of civilians, there’s an opportunity to eventually find matches.
With analytical targeting, law enforcement has an unequaled advantage over perpetrators. Criminals lack one key resource required to compete against this form of intelligence: the data. Criminal organisations generally cannot recreate law enforcement’s surveillance of persons of interest, let alone the much larger dataset of negative examples, the civilians. So they have no means to ascertain the predictive patterns that crime fighters derive from this data, which leaves them with no insight to evade being detected by such patterns. Predictive analytics achieves a qualitatively unparalleled advancement in this escalating arms race, the ongoing competition between detection and evasion.
The Debate: Two Opposing Viewpoints
University of DC law professor Andrew Ferguson warns of where we could be headed with the high-tech use of NSA bulk data. “Predictive analytics is clearly the future of law enforcement,” he says. “The problem is that the forecast for transparency and accountability is less than clear.” Can the NSA use this technology to fight terrorism – and can other agencies do so to fight crime in general – without endangering civil liberties?
In deliberating how much data the NSA and other organisations may collect and how it is used for predictive law enforcement, we seek to balance the great value aggregated data bears against the danger it may impose on civil liberties.
But today’s debate falters because both sides are underinformed. With that in mind, here is a summary of the two opposing viewpoints.
The Argument for Increased Data Collection
Law enforcement is intrinsically destined to apply predictive analytics, which serves to discover potential suspects who would otherwise continue undetected. Just as companies screen each transaction for fraud and each employee for propensity to quit their job, so too does a government strive to screen each civilian for connection to crime.
Without an understanding of predictive analytics, privacy advocates trip up on fallacies. Wisconsin Rep. James Sensenbrenner, who himself introduced the Patriot Act in the House, argued, “The bigger haystack makes it harder to find the needle.” It’s a common misconception. As predictive analytics practitioners recognise, the data glut is not a problem – it’s an opportunity. Data encodes experience (prior events) from which to analytically learn.
Critics of bulk data collection must learn the irony intrinsic to predictive law enforcement: Wide-scale data collection can serve to identify the few who should be actively surveilled, rather than spy on the many.
A comparable controversy plays out in the field of medicine, where the potential for lifesaving insights also compels open data. Health care data-sharing proponent John Wilbanks argues that privacy protections on clinical research data slow down research. “These are tools that we created to protect us from harm, but what they’re doing is protecting us from innovation now,” he said in a TED talk.
The Counterargument: Curtail Monitoring to Protect Civil Liberties
For all its promise, mass government surveillance risks civil liberties and therefore cannot go unrestrained. When predictive targeting flags an individual, this does not necessarily mean reasonable suspicion has been established by way of specific evidence. However, when a targeted individual’s data is accessed, it may become the subject of an officer’s particular prejudices, which may in turn lead to unwarranted acts of search, seizure, or detention.
The presence of this potential infliction upon the few curtails liberties for the many. Glenn Greenwald, author of No Place to Hide: Edward Snowden, the NSA, and the U.S. Surveillance State and lead journalist on the 2013 disclosures, wrote that “it is in the realm of privacy where creativity, dissent, and challenges to orthodoxy germinate. A society in which everyone knows they can be watched by the state – where the private realm is effectively eliminated – is one in which those attributes are lost, at both the societal and the individual level.”
Brazilian President Dilma Rousseff brought this reasoning to its natural conclusion: “In the absence of the right to privacy, there can be no true freedom of expression and opinion, and therefore no effective democracy.”
The ACLU calls this profiling. In a discussion of predictive targeting with Allen Gilbert, the executive director at the American Civil Liberties Union of Vermont, he told me: “Predictive analytics is in essence a form of profiling. It provides an excuse rather than evidence to target someone as a criminal suspect. It short-circuits the Fourth Amendment’s protections against search and seizure without reasonable suspicion of crime. A civil libertarian gasps that such pre-judging – prejudice – is considered justified in modern-day crime fighting.”
A More Informed Debate
Want a productive debate rather than a purely contentious one? Then learn more – whichever side you’re on:
Data hustlers who support increased data collection by law enforcement must become deeply familiar with the principles and political history that depict how compromised privacy brings a loss of liberty.
Privacy advocates who support decreased data collection by law enforcement must come to understand why predictive targeting presents a much stronger incentive for broad-scale data collection: a means to unearth new suspects who might otherwise go undetected.
Adapted with permission of the publisher from Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die, Revised and Updated Edition (Wiley, January 2016).
About the Author
Eric Siegel, Ph.D., is the founder of the Predictive Analytics World conference series – which covers both business and government deployment – executive editor of The Predictive Analytics Times, and a former computer science professor at Columbia University. Beyond this portion of his book, Siegel also published an op-ed in Newsweek on this topic. For more information about predictive analytics, see the Predictive Analytics Guide.