In May of 2013 I was asked to contribute an article to ProVISION, an IBM magazine published in Japan for current and potential customers and clients. That piece was translated into Japanese and published at the end of July. Below is the original article I wrote in English describing some of the work done in my group by the scientists, software engineers, postdocs and other staff in IBM’s Research labs around the world. It was not meant to be inclusive of all analytics work done in Research but was instead intended to give an idea of what a modern commercial research organization that focuses on analytics and optimization innovations does for the company and its customers.
When you look in the business media, the word analytics appears repeatedly. It is used in a vaguely mathematical, vaguely statistical, vaguely data intensive sense, but often is not precisely defined. This causes difficulties when you are trying to identify the problems to which analytics can be applied and even more when you are trying to decide if you have the appropriate people to work on a problem.
I’m going to look at analytics and the related areas to help you understand where the state-of-the-art research is and what kind of person you need to work with you. I’ll start by separating out big data from analytics, look at some of the areas of research within my own department in IBM Research, and then focus on some new applications to human resources, an area we call Smarter Workforce.
“Big data” is the most inclusive term in use today but, because of that generality, it can be the most confusing. Let’s start by saying that there is a large amount of data out there, there are many different kinds of data, it is being created quickly, and it can be hard to separate the important and correct information from that which you can ignore. That is, there are big data considerations for Volume, Variety, Velocity, and Veracity, the so-called “Four Vs.”
I think these four dimensions of big data give you a good idea of what you need to consider when you process the information. The data may come from databases that you currently own or may be able to access, perhaps for a price. In large organizations, information is often distributed across different departments and while, in theory, it all belongs to you, you may not have permission to use it or know how to do so. It may not be possible to easily integrate these different sources of information to use them optimally.
For example, is the “Robert Sutor who has an insurance policy for his automobile” the same person as the “Robert Sutor who has a mortgage for his home” in your financial services company? If not, you are probably not getting the maximum advantage from the data you have and, in this case, not delivering the best customer service.
Data is also being created by social networks and increasingly from devices. You would expect that your smart phone or tablet is generating information about where you are and what you are doing, but so too are larger devices like cars and trucks. Cameras are creating data for safety and security but also for medical and retail applications.
What do you do with all this information? Do you look at it in real time or do you save it for processing later? How much of it do you save? If you delete some of it now, how do you know those parts won’t be valuable when we have better technologies in a year or two?
When you have so much data, how do you process it quickly enough to make sense of it? Technologists have created various schemes to solve this. Hadoop and related software divide up the processing into smaller pieces, distribute them across multiple machines, and then recombine the individual results into a whole.
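To make the divide, distribute, and recombine idea concrete, here is a minimal sketch in plain Python. It is not Hadoop and does not use Hadoop’s API: everything runs on one machine, and the function names (split_into_chunks, map_chunk, merge_counts, word_count) are hypothetical names chosen for this illustration. Counting words is used only because it is the simplest task that shows the pattern.

```python
from collections import defaultdict
from functools import reduce


def split_into_chunks(records, n_chunks):
    """Divide the records into pieces, one per (simulated) machine."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk):
    """Process one piece on its own: count the words it contains."""
    counts = defaultdict(int)
    for line in chunk:
        for word in line.lower().split():
            counts[word] += 1
    return dict(counts)


def merge_counts(left, right):
    """Recombine two partial results into one."""
    merged = dict(left)
    for word, n in right.items():
        merged[word] = merged.get(word, 0) + n
    return merged


def word_count(records, n_chunks=3):
    """Split the work, process each piece independently, then merge."""
    partials = [map_chunk(chunk) for chunk in split_into_chunks(records, n_chunks)]
    return reduce(merge_counts, partials, {})


if __name__ == "__main__":
    print(word_count(["big data", "big analytics", "data data"]))
```

Because each piece is processed without reference to the others, the middle step could in principle run on separate machines; only the final merge needs to see all the partial results.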
Streams processing looks at data as it is created or received, decides what is important or not, and then takes action on the significant information. It may do this by combining it with existing static data such as a customer’s purchase history or dynamic data like Twitter comments.
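As a rough sketch of that idea, the Python generator below looks at each event as it arrives, combines it with static data (a customer’s usual purchase amount), and passes along only the events that look significant. The field names, the threshold, and the function name significant_events are assumptions made for this example and do not come from any particular streams product.

```python
def significant_events(events, purchase_history, threshold=100.0):
    """Yield an alert for each event that exceeds the customer's usual
    purchase amount by at least `threshold`.

    `events` may be any iterable, including one that never ends, because
    each event is examined once, as it arrives, and then discarded.
    """
    for event in events:
        # Static data lookup; unknown customers are assumed to have no history.
        usual = purchase_history.get(event["customer"], 0.0)
        if event["amount"] - usual >= threshold:
            yield {
                "customer": event["customer"],
                "amount": event["amount"],
                "usual": usual,
            }


if __name__ == "__main__":
    history = {"alice": 40.0, "bob": 500.0}
    incoming = [
        {"customer": "alice", "amount": 45.0},
        {"customer": "bob", "amount": 650.0},
        {"customer": "carol", "amount": 120.0},
    ]
    for alert in significant_events(incoming, history):
        print(alert)
```

The point of the design is that nothing is stored for later: the decision about what is important is made on each event immediately.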
So far I’ve said that there is a large amount of data currently stored and being newly created, and there are some sophisticated techniques for processing it. I’ve said almost nothing about what you are doing with that data.
In the popular media, big data is everything: the information and all potential uses. Among technologists we often restrict “big data” to just what I’ve said above: the information and the basic processing techniques. Analytics is a layer on top of that to make sense of what the information is telling you.
The Wikipedia entry for analytics currently says:
Analytics is the discovery and communication of meaningful patterns in data. Especially valuable in areas rich with recorded information, analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance. Analytics often favors data visualization to communicate insight.
Let’s look at what this is saying.
The first sentence has the key words and phrases “discovery,” “communication,” and “meaningful patterns.” If I were to give you several gigabytes, terabytes, or petabytes of data, what would you do with it to understand what it could tell you?
Suppose this data is all the surveys about how happy your customers are with your new help desk service. Could you automatically find the topics about which your customers are the most and least happy? Could you connect those to specific retailers from which your products were bought?
At what times of the day are customers most or least satisfied with your help desk and what are the characteristics of your best help desk workers? How could this information be expressed in written or visual forms so that you could understand what it is saying, how strongly the data suggests those results, and what actions you should take, if any?
Once you have the data you want to use, you filter out the unnecessary parts and clean up what remains. For example, you may not care whether I am male or female and so can delete that information, but you probably want to have a single spelling for my surname across all your records. This kind of work can be very time consuming and can be largely automated, but usually does not require an advanced degree in mathematics, statistics, or computer science.
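Here is a small sketch of that filtering and cleanup step in Python. The record layout, the choice of “gender” as the field to drop, and the function names are assumptions for illustration only. Real data cleaning involves many more rules than this.

```python
import unicodedata


def surname_key(raw):
    """Return one canonical form of a surname for matching records:
    accents removed, extra whitespace collapsed, case ignored."""
    decomposed = unicodedata.normalize("NFKD", raw)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.split()).casefold()


def clean_record(record, drop_fields=("gender",)):
    """Drop the fields we do not need and add a canonical surname key."""
    cleaned = {k: v for k, v in record.items() if k not in drop_fields}
    cleaned["surname_key"] = surname_key(record["surname"])
    return cleaned


if __name__ == "__main__":
    auto_policy = {"surname": "  SUTOR ", "gender": "M", "product": "auto insurance"}
    mortgage = {"surname": "Sutor", "gender": "M", "product": "mortgage"}
    a, b = clean_record(auto_policy), clean_record(mortgage)
    # The two records now share a key, so they can be considered for linking.
    print(a["surname_key"] == b["surname_key"])
```

A shared surname key alone does not prove two records describe the same person. It only makes the records comparable so that further matching can be done.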
Once you have good data, you need to produce a mathematical model of it. With this you can understand what you have, predict what might happen in the future if current trends continue or some changes are made, and optimize your results for what you hope to achieve. For our help desk example, a straightforward optimization might suggest you need to add 20% more workers at particular skill levels to deliver a 95% customer satisfaction rate. You might also insist that helpful responses be given to customers in 10 minutes or less at least 90% of the time.
A more sophisticated optimization might look at how you can improve the channels through which your products are purchased, eliminating those that cause the most customer problems and deliver the least profit to you.
For the basic modeling, prediction, and optimization work, one or more people with undergraduate or master’s degrees in statistics, operations research, data mining/machine learning, or applied mathematics may be able to do the work for you if it is fairly standard and based on common models.
For more sophisticated work involving new algorithms, models, techniques, or probabilistic or statistical methods, someone with a Ph.D. is most likely needed. This is especially true if multiple data sources are combined and analyzed using multiple models and techniques. Analytics teams usually have several people with different areas of expertise. It is not uncommon to see one third of a team with doctorates and the rest with undergraduate or master’s degrees.
Our work at IBM Research
I lead the largest worldwide commercial mathematics department in the industry, with researchers and software engineers spread out over twelve labs in Japan, the United States, China, India, Australia, Brazil, Israel, Ireland, and Switzerland.
While the department is called “Business Analytics and Mathematical Sciences,” we are not the only ones in IBM Research who do either analytics or mathematics. We are the largest concentration of scientists working on the core mathematical disciplines, which we then apply to problems in many industries, often in partnership with our Research colleagues and those in IBM’s services business divisions.
We divide our work into what we call strategic initiatives, several of which I’ll describe here. In each of these areas we write papers, deliver talks at academic and industry conferences, get patents, help IBM’s internal operations, augment our products, and deliver value to clients directly through services engagements.
One of the topics in IBM’s 2013 edition of the Global Technology Outlook is Visual Analytics. This differs from visualization in that it provides an interactive way to see, understand, and manipulate the underlying model and data sources in an analytics application. Visual analytics often compresses several dimensions of geographic, operational, financial, and statistical data into an easy-to-use form on a laptop or a tablet.
Visual Analytics combines a rich interactive visual experience with sophisticated analytics on the data, and I describe many of the analytics areas in which we work in the sections below. Our research involves visual representations and integration with the underlying data and model, client-server architectures for storing and processing information efficiently on the backend or a mobile device, and enhancing user experiences for spatio-temporal analytics, described next.
Spatio-temporal analytics is a particularly sophisticated name for looking at data that changes over time and is associated with particular areas or locations. It is especially important now because of mobile devices. Information about what someone is doing, when they are doing it, and where they are doing it may be available for analysis. Other example applications include the spread of diseases; impact of pollution; weather; geographic aspects of social network effects on purchases; and sales pipeline, closings, and forecasting.
The space considered may be either two- or three-dimensional, with the latter becoming more important in, for example, analysis of sales over time in a department store with multiple floors. Research includes how to better model the data, make accurate predictions using it, and use new visual analytics techniques to understand, explore, and communicate insights gained from it.
Event Monitoring, Detection, and Control
In this area, many events are happening quickly and you need to make sense of what is normal and what is anomalous behavior. For example, given a sequence of many financial transactions, can you detect when fraud is occurring?
Similarly, video cameras in train stations produce data from many different locations. This data can be interpreted to understand what the normal passenger and staff actions are at different times of the day and on different days, and what actions may indicate theft, violent behavior, or even more serious activities.
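For the financial transaction case, one very simple way to separate normal from anomalous behavior is to compare each new amount with a customer’s past amounts. The Python sketch below flags anything more than k standard deviations from the historical mean. This is a textbook rule chosen for illustration, not a description of the methods we use in Research. The function name flag_anomalies and the default k = 3 are assumptions.

```python
from statistics import mean, stdev


def flag_anomalies(history, new_amounts, k=3.0):
    """Return the new amounts that lie more than k standard deviations
    from the mean of the historical amounts.

    `history` needs at least two values so that a standard deviation
    can be computed.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Every past amount was identical, so any different amount stands out.
        return [a for a in new_amounts if a != mu]
    return [a for a in new_amounts if abs(a - mu) > k * sigma]


if __name__ == "__main__":
    past = [20, 25, 22, 30, 24, 26, 21, 28]
    print(flag_anomalies(past, [23, 27, 400, 31]))
```

Real fraud detection has to cope with behavior that changes over time and with fraudsters who deliberately stay close to normal patterns, which is where more sophisticated models come in.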
Analysis of Interacting Complex Systems
The world is a complicated place, as is a city or even the transportation network within that city. While you may be able to create partial models for a city’s power grid, the various kinds of transportation, water usage, emergency healthcare and personnel, it is extremely difficult to mathematically model all these together. They are each complicated and changes in one can result in changes in another that are very hard to predict. There are many other examples of complex systems with parts that interact.
Simulation is a common technique to optimize such systems. The methods of machine learning can help determine how to realistically simulate the components of the system. Mathematical techniques to work backwards from the observed data to the models can help increase the prediction accuracy of the analytics.
Decision Making under Uncertainty
Very little in real life is done in conditions of absolute certainty. If you run a power plant, do you know exactly how much energy should be produced to meet an uncertain demand? How will weather in a growing season affect the yield from agriculture? How will your product fare in the marketplace if your competitor introduces something similar? How will that vary by when the competitive product is introduced?
If you factor in uncertainty from the beginning, you can better hedge your options to maximize metrics such as profit and efficiency. There are many ways to quantify uncertainty and incorporate it into analytical models for optimization. The exact techniques used will depend on the number and complexity of the events that represent the uncertainty.
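One of the simplest ways to factor in uncertainty is to list a few demand scenarios with probabilities and choose the production quantity with the highest expected profit. The Python sketch below does this for a single product. The scenario numbers, the price and cost figures, and the function names are all made up for this example, and it assumes that unsold production has no value.

```python
def expected_profit(quantity, scenarios, price, unit_cost):
    """Expected profit of producing `quantity` units.

    `scenarios` is a list of (probability, demand) pairs whose
    probabilities sum to 1. Units beyond demand are assumed to be lost.
    """
    return sum(
        p * (price * min(quantity, demand) - unit_cost * quantity)
        for p, demand in scenarios
    )


def best_quantity(scenarios, price, unit_cost):
    """Choose the quantity with the highest expected profit.

    With a finite set of demand scenarios, an optimal quantity can always
    be found among the scenario demand levels, so only those are tried.
    """
    candidates = sorted({demand for _, demand in scenarios})
    return max(
        candidates,
        key=lambda q: expected_profit(q, scenarios, price, unit_cost),
    )


if __name__ == "__main__":
    demand_scenarios = [(0.3, 80), (0.5, 100), (0.2, 130)]
    q = best_quantity(demand_scenarios, price=10, unit_cost=6)
    print(q, expected_profit(q, demand_scenarios, price=10, unit_cost=6))
```

With more events and more complex dependencies between them, enumerating scenarios like this stops being practical, and that is where the choice of technique starts to matter.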
Revenue and Price Optimization
The decisions you make around the price you charge for your products or services, and therefore the hoped-for revenue, are increasingly being affected by what happens on the demand side of your business. For example, comments spread through social media can significantly increase or decrease the demand for your products. Aggressive low pricing given to social media influencers can increase the “buzz” around your product in the community, thereby increasing the number of units sold. If you can give personalized pricing to consumers that is influenced by their past purchase behavior, you can affect how likely they are to buy from you again.
Demand shaping can help match what you have in inventory to what you can convince people to buy. This focus area therefore affects inventory and manufacturing, and so the entire supply chain.
Condition Based Predictive Asset Management
When will a machine in your factory fail, a part in your truck break, or a water pipe in your city spring a leak? If we can predict these events, we can better schedule maintenance before the breakages occur, keeping your business up and running.
We can get the parts we need and the people who will do the work lined up and doing the work in a timely way. Since there are multiple assets that may fail, we can help prioritize which work should be done earlier to keep the whole system operating, even given the process dependencies.
Integrated Enterprise Operations
You can think of this focus area as a specific application of the Analysis of Interacting Complex Systems work described above to the process areas within an organization or company. For example, a steel company receives orders from many customers requesting different products made from several quality grades. These must be manufactured and delivered in a way that optimizes use of the stock material available, configures and schedules tooling machines, minimizes energy usage, and maintains the necessary quality levels.
While each component process can be optimized, the research element of this concerns how to do the best possible job for all the related tasks together.
I think of Smarter Finance as the set of analytical tools necessary for the Chief Financial Officer of the future. It integrates both operational and financial data to optimize the overall financial posture of an organization, including risk and compliance activities.
Another element of Smarter Finance includes applications to banking. These include optimization of branch locations and optimal use of third party agencies for credit default collections.
A particularly large strategic focus area in which we are working is the application of analytics to human resources, which we call Smarter Workforce. We’ve been involved with this internally with IBM’s own employees for almost ten years, and we recently announced that we would make two aspects of our work, Retention Analytics and Survey Analytics, available to customers.
Retention analytics provides answers to the following questions: Which of my employees are most likely to leave? What characterizes those employees in terms of role, geography, and recent evaluations and promotions? What will it cost to replace an employee who leaves? How should I best distribute salary increases to the employees I most want to stay in order to minimize overall attrition?
Beyond this, we are doing related research to link the workforce analytics to the operational and financial analytics for the rest of an organization. For example, what will be the effect on this quarter’s revenue if 10% of my sales force in Tokyo leaves in the next two weeks?
Survey analytics measures the positive and negative sentiment within an organization. While analytics will not replace a manager knowing and understanding his or her employees, survey analytics takes employee input and uncovers information that might otherwise be hard to see. Earlier I discussed a customer help desk. What if that help desk was for your employees? How could you best understand what features your employees most liked or disliked, and their suggestions for improvement?
This is one example of using social data to augment traditional analytics on your organization’s information. Many of our focus areas now incorporate social media data analytics, and that itself is a rich area of research to understand how to do it correctly to get useful results and insight.
Analytics has very broad applications and is based on decades of work in statistics, applied mathematics, operations research, and computer science. It complements the information management aspects of big data. As the availability of more and different kinds of data increases, we in IBM Research continue to work at the leading edge to create new algorithms, models and techniques to make sense of that data. This will lead, we believe, to more efficient operations and better financial performance for our customers and ourselves.