IP Implications and Opportunities in Big Data
Big data has become a major driving force with broad applicability and implications. From air transport to retail to government to entertainment, everyone has access to enormous quantities of information that can be characterized as "big data." It seems almost every organization today is either currently using big data or looking for ways to use it. Those who are doing neither proceed at their considerable peril.
This relatively new and expanding opportunity adds import to the role of forward-looking intellectual property systems. Capturing and building upon intra-organizational innovation, surveying patterns of innovation in the surrounding field, and understanding the changing IP legal landscape will put an organization in the best position to capitalize on developments in big data.
But what questions should we be asking of big data? This article begins with a discussion of some of the overarching business questions raised by big data. It proceeds to examine two competing frameworks for utilizing big data findings and closes with prescriptive ideas for managing and developing IP in the big data space.
Big Picture: Asking Questions
Big data brings with it many business opportunities and challenges. It may lead to the discovery of correlations potentially helpful to the business; for example, a clothing retailer might use big data analysis to discover which day of the week 23- to 29-year-old working professionals buy the most cotton-wool blend socks online. So the first question should be: What information, or more specifically what correlations, can be ascertained from the data that would be helpful to the business?
From that question, an organization may ask: Are we prepared to change business direction based on what we learn from data? This question of business direction in turn raises more technical challenges: Where do we find useful data? How do we detect and correct inaccurate records within the data (also known as "data cleansing")? How do we gain search access to unstructured and heterogeneous data? How do we parse and analyze the data in order to gain insight from it?
Throughout this process, questions must be asked as to how the data was obtained and whether it is reliable. How confident can we be in a particular insight, and why? How do we seek out those questions we don't know to ask of the data (the "stuff we don't know we don't know")?
From these business and technical questions arise intellectual property challenges: Are new approaches needed to identify and capture inventions around big data? How do we optimally describe big data inventions in patent applications? How should patent claims be framed to protect big data inventions in ways that will withstand the test of shifting legal standards? How do we avoid (or embrace, if necessary) multijurisdictional infringement issues, divided/joint infringement issues, induced infringement issues, and extra-territorial enforcement issues? Perhaps most importantly, how do we ensure others don't get patents that impede or prevent our own use of big data to benefit our customers and obtain maximum competitive advantage?
In addition to the first- and second-order questions listed above, there are even larger, strategic questions that will drive the development of a tailored approach to intellectual property management and development. These questions relate to how big data fits within an organization's business, how it will be used, and who within the organization possesses the skills required to harness the promise of big data while recognizing and avoiding traps inherent in the reliance on it. At the source of these strategic questions are the promises and pitfalls of two approaches: an approach based primarily on correlation versus an approach based primarily on causation.
Correlation and Causation
Some leading thinkers posit that the value of big data lies in the promise of uncovering previously unknown correlations. Such discovery, the argument holds, allows businesses to profit from theretofore unseen connections. The guiding principle for these thinkers is the more data the better—at a certain point, the numbers speak for themselves. This line of thinking demonstrates a focus on the "what" while minimizing emphasis on the "why" of big data analysis.
In the socks example posited above, according to this line of thinking, knowing that 23- to 29-year-old working professionals were more likely to buy cotton-wool blend socks online on Tuesdays than on Thursdays would be valuable in and of itself—even if there were no articulable theory of why Tuesday was a better day to buy socks.
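To make the "what" concrete, the following is a minimal sketch of the kind of tally that would surface the Tuesday pattern. The purchase records, the function name sales_by_weekday, and the product label are all invented for illustration; a real analysis would run over far larger and messier data.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical records: (purchase date, buyer age, product).
# These values are made up purely to illustrate the tally.
purchases = [
    (date(2024, 1, 2), 25, "cotton-wool socks"),   # Tuesday
    (date(2024, 1, 2), 27, "cotton-wool socks"),   # Tuesday
    (date(2024, 1, 4), 24, "cotton-wool socks"),   # Thursday
    (date(2024, 1, 9), 29, "cotton-wool socks"),   # Tuesday
    (date(2024, 1, 9), 41, "cotton-wool socks"),   # Tuesday, outside the age band
    (date(2024, 1, 11), 26, "silk tie"),           # Thursday, different product
]

def sales_by_weekday(records, product, min_age, max_age):
    """Count purchases of `product` per weekday for buyers in the age band."""
    counts = Counter()
    for when, age, item in records:
        if item == product and min_age <= age <= max_age:
            # date.weekday() returns 0 for Monday through 6 for Sunday
            counts[WEEKDAYS[when.weekday()]] += 1
    return counts

counts = sales_by_weekday(purchases, "cotton-wool socks", 23, 29)
print(counts.most_common())
```

The tally reports only that one day outsells another for the chosen demographic; nothing in it offers a reason why.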
Other leading thinkers concede that analysis of big data might reveal previously unknown correlations, but urge cautious interpretation of these newly revealed correlations. The guiding principle for these thinkers is that mere correlation, without meaningful exploration of causation, fails as an effective strategic guidepost. Numbers do not speak for themselves but rather are given a voice by those who gather and interpret them.
In words often attributed to Albert Einstein (as quoted by Steve Lohr, a New York Times reporter and blogger on big data issues): "Not everything that counts can be counted, and not everything that can be counted counts." Adherents to this line of thinking warn against overreliance on discovered correlations (the "what") with only limited investigation into the causation underlying those correlations (the "why"). Thus, they might discourage a clothing retailer from adopting a new sales strategy that relies solely on the Tuesday correlation to online sock purchases without first exploring why sales to a particular demographic tend to be higher on that day.
Put differently, there are two approaches to managing the correlation/causality interface. The first, the "read the gauges" approach, demands that the organization give primacy to the correlations discovered and, for the most part, suspend the search for causation. If massive amounts of reliable data reveal that socks sell better on Tuesdays than on Thursdays, then the correlation itself justifies a strategic approach that takes that correlation into account.