For the Love of Big Data: What is Big Data?

Several weeks ago I was on the panel “Privacy and Innovation in the Age of Big Data” at the International Association of Privacy Professionals in Washington, DC USA. My role was to present the attraction and value of data but not to constantly interrupt myself with “but, but, but” for policy and privacy issues. That is, I was the setup for IBM’s Chief Privacy Officer Christina Peters and Marty Abrams, Executive Director and Chief Strategist of the Information Accountability Foundation, to talk policy. The audience was mainly privacy experts and attorneys.

I presented four slides and I previously posted those via SlideShare. Here and in three other posts I go through the bullets, providing more detail and points for discussion.

What is Big Data?

For the Love of Big Data: What is Big Data?

Big data is being generated by everything around us

I think many people are aware of the data that is available every time you do a transaction on the web or buy something in a store. In the latter case, even if you do not use a credit card, the purchase data can be used for restocking inventory, determining how well something is selling, and finding what items are often bought together. This could then be used in marketing and coupon campaigns.

Online, even more information is kept about what you did. Not only does a given vendor know what you bought, they know everything you ever bought from them. They may then guess what you will buy next. They possibly know how you rate an item and can offer you future deals based on your habits. They may also have some sense of your buying network, or “friends,” and can use this data to drive sales by giving extra incentives to those in the network who are the most influential.

Social data such as that in Twitter, Facebook, LinkedIn, and Pinterest is also used, though this is often highly unstructured. That is, it may be free text that must be interpreted. This is not always the case, however, because if you choose to specify the schools you went to from a given list, this data now has exact structure which can be mined.

Perhaps more interesting is the sensor data that is being created by the devices all around you. These include your phones, car, home, and appliances, plus wind turbines, factory machines, and many previously mechanical things that have become more electronic and increasingly connected into the Internet of Things.

Every digital process and social media exchange produces it

If a process is digital, that means data is involved. How much of that is saved and can be used for later analysis?

When you take part in a social network someone knows what you are saying, when you said it in the context of your other updates, if it was part of a conversation, possibly what you were discussing (“rotfl mebbe not”), and the influence structure of your extended network. That is, what you say is just the very beginning of a very long chain of direct and inferred collection of data.

Much of this data is actually metadata. When I do a status update on Twitter, my text is the data, but the time I tweeted and where I was when I did it are both examples of metadata.
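
To make the distinction concrete, here is a minimal sketch of how a single, hypothetical status update might be split into the data itself and the metadata describing it; the field names and values are made up for illustration.

```python
# Hypothetical status update: the text is the data; everything describing
# when, where, and how it was posted is metadata.
update = {
    "data": {"text": "Heading to the airport"},
    "metadata": {
        "timestamp": "2014-03-06T08:15:00Z",  # when it was posted
        "location": (38.9072, -77.0369),      # where it was posted from
        "client": "mobile app",               # how it was posted
    },
}

print(update["metadata"])
```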

When you use a mobile app, a lot of metadata is available too. It’s not just what you did, it’s the sequence in which you did things and with whom. This information can be used to improve the app for you, or allow the app provider to make its services, possibly paid, more attractive to you.

Systems, sensors and mobile devices transmit it

If something is connected to the Internet, it is possible for data to be transmitted and received. This might be via Wi-Fi or a cellular connection, although technologies like Bluetooth may be used for local data collection that is then later transmitted at higher bandwidth.

Not everything has to be connected all the time. Some remote machines like tractors allow farmers to employ USB sticks to periodically collect performance and diagnostic data. This may then be used for predictive asset maintenance: let me fix something as late as is reasonable but before it breaks down and causes expensive delays. In this case, the data from the USB stick might be analyzed too late, and a direct network connection would be better.

Big data is arriving from multiple sources at amazing velocities, volumes and varieties

So data is coming from everywhere, and I’ve seen estimates that the amount of metadata is at least ten times greater in size than the original information. So the data is coming in fast (velocity), there is a lot of it (volume), and it is very heterogeneous or even unstructured (variety).

As you start making connections among all the data, such as linking “Bob Sutor” coming from one place with “R. S. Sutor” coming from another, the size can increase by another order of magnitude.
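
Here is a minimal sketch of that kind of record linkage, using only the Python standard library. The two name strings are hypothetical, and a real entity-resolution system would also compare addresses, birth dates, and other attributes rather than rely on a single string score.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").split())

def string_similarity(a: str, b: str) -> float:
    """Crude similarity score between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

record_a = "Bob Sutor"    # name as it appears in one source
record_b = "R. S. Sutor"  # name as it appears in another

same_surname = normalize(record_a).split()[-1] == normalize(record_b).split()[-1]
score = string_similarity(record_a, record_b)

# A simple decision rule; the 0.6 threshold is purely illustrative.
print("possible match" if same_surname and score > 0.6 else "no match", score)
```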

To extract meaningful value from big data, you need optimal processing power, storage, analytics capabilities, and skills

So with all this bigness, you have a lot of information and you need to process it quickly, possibly in real time. This may require high performance computing, divide-and-conquer techniques using Hadoop or commercial MapReduce products, or streams. If you are saving data, you need a lot of storage. People are increasingly using the cloud for this data storage and scalable processing.
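
To show the divide-and-conquer idea in miniature, here is a single-machine sketch of the map, shuffle, and reduce phases written in plain Python. It is not the Hadoop API, and the sales records are made up, but the shape of the computation is the same.

```python
from collections import defaultdict

# Toy (product, quantity) sales records; a real job would read these from a
# distributed file system and run the phases on many machines in parallel.
records = [("widget", 3), ("gadget", 5), ("widget", 2), ("gizmo", 7), ("gadget", 1)]

def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    product, quantity = record
    yield product, quantity

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine the grouped values for one key into a single result."""
    return key, sum(values)

pairs = (pair for record in records for pair in map_phase(record))
totals = [reduce_phase(key, values) for key, values in shuffle(pairs).items()]
print(totals)  # [('widget', 5), ('gadget', 6), ('gizmo', 7)]
```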

Now that you have the information, what are you going to do with it? Will you just try to understand what is happening, as in descriptive analytics? How about predictive analytics to figure out what will happen if trends continue or if you modify conditions? Can you optimize the situation to get the result you want? (You might want to see my short “Simple introduction to analytics” blog entry for more detail.)

Technologists are trying to get us closer to the “plug in random data and get exactly the insights you want with amazing visualizations” dream, though in practice such tools may only be enough to get you started in your explorations. You need solid analytics to do valuable things with the data, and people with the skills to build new and accurate models that can then drive insights you can use.

IBM Watson Analytics is doing some interesting work in this space.

Next: Why do data scientists want more data, rather than less?


For the Love of Big Data: Why do data scientists want more data, rather than less?

Several weeks ago I was on the panel “Privacy and Innovation in the Age of Big Data” at the International Association of Privacy Professionals in Washington, DC USA. My role was to present the attraction and value of data but not to constantly interrupt myself with “but, but, but” for policy and privacy issues. That is, I was the setup for IBM’s Chief Privacy Officer Christina Peters and Marty Abrams, Executive Director and Chief Strategist of the Information Accountability Foundation, to talk policy. The audience was mainly privacy experts and attorneys.

I presented four slides and I previously posted those via SlideShare. Here and in three other posts I go through the bullets, providing more detail and points for discussion.

Why do data scientists want more data, rather than less?

For the Love of Big Data: Why do data scientists want more data, rather than less?

It is there

This may seem simplistic, but if you are a statistician or someone looking for patterns via machine learning or data mining, you love data. A lot of data. More is better.

Data is the basis of the models we create to explain, predict, and affect behavior

Just like physics uses mathematical tools like partial differential equations to model the universe both large and small, data analytics uses other parts of mathematics like statistics, probability, and linear algebra to model what is happening by looking for patterns and behavior in the data we can discern.

Both kinds of models can vary from the very accurate to the really imprecise, depending on our skills, understanding, and information available. New techniques merge physical and data approaches to increase accuracy and decrease the computational load.

With more data, our models become more sophisticated and, we hope, more accurate

The more you know, the more you can fit together the pieces into a coherent whole. Modeling a city, for example, needs a lot of different kinds of data, and this information is often interdependent and redundant. It is non-trivial, however, to put models and simulations based on different techniques together and this often requires advanced, Ph.D.-level knowledge.

Do you know what the relationships are? That is data unto itself and improves the model. Separating the signal from the noise will help you pull out the important parts for analysis, but knowing why the noise is there will teach you about the accuracy of what you are measuring as well as how it is being measured.

How much data is too much data?

When Twitter was first available, people scoffed at the silliness of people talking about what they had for lunch, where, and with whom. Now this is called “social marketing data.”

We just don’t know what techniques we will develop tomorrow that can make better sense of the data we collect today. New work on statistical summarization can help reduce what we save, but it is best to err on the side of saving more data if you can. You might really need those detailed sales records and social media discussions going back five years to see patterns that can help your business grow tomorrow.

Previous: What is Big Data?
Next: What issues can analytics present?


For the Love of Big Data: What issues can analytics present?

Several weeks ago I was on the panel “Privacy and Innovation in the Age of Big Data” at the International Association of Privacy Professionals in Washington, DC USA. My role was to present the attraction and value of data but not to constantly interrupt myself with “but, but, but” for policy and privacy issues. That is, I was the setup for IBM’s Chief Privacy Officer Christina Peters and Marty Abrams, Executive Director and Chief Strategist of the Information Accountability Foundation, to talk policy. The audience was mainly privacy experts and attorneys.

I presented four slides and I previously posted those via SlideShare. Here and in three other posts I go through the bullets, providing more detail and points for discussion.

What issues can analytics present?

For the Love of Big Data: What issues can analytics present?

Are all aspects of privacy, anonymization, and liability understood by the practitioners?

Absolutely not, and that’s why you may need to get guidance from an attorney or a privacy professional. Look for precedent on your intended use just as much as you look for someone having done something technically similar before.

If you are a data scientist and not a privacy expert, don’t pretend to be one. It could be a stupid and costly mistake.

Be especially careful when working with Personally Identifiable Information – data about people, often coming from HR systems within organizations.

If I tell you that you cannot look at some data but you can infer the information (e.g., gender) anyway, is that all right?

Generally, no.

This can be tricky: if you get a positive answer to questions about the number of weeks taken for maternity or paternity leave, you might infer the gender. Certain types of color blindness are 16 times more likely in men than women, so that and other hints might lead you to believe with high probability that someone is male.
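
To see how quickly such hints add up, here is a minimal worked example of the color blindness case using Bayes’ rule. The prevalence figures and the 50/50 prior are illustrative assumptions; the point is how strongly one seemingly harmless attribute shifts the inference, not the exact numbers.

```python
# Illustrative assumptions, not exact figures: red-green color blindness is
# far more common in men than in women (roughly a 16:1 ratio).
p_cb_given_male = 0.08     # assumed prevalence among men
p_cb_given_female = 0.005  # assumed prevalence among women
p_male = 0.5               # assumed prior before seeing any hint

# Bayes' rule: P(male | color blind)
p_cb = p_cb_given_male * p_male + p_cb_given_female * (1 - p_male)
p_male_given_cb = p_cb_given_male * p_male / p_cb

print(f"P(male | color blind) = {p_male_given_cb:.2f}")  # about 0.94
```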

Check local laws and policies before you engage in any sort of this “go around the rule” cleverness. It’s not just about gender. You must know privacy restrictions before you start playing with data.

It’s also not just about personal data. If I’m a farmer and you have data about what is growing in my fields, that is my data. I very possibly care what you do with it. This is not just an ownership question: does the data tell you more about me and how I operate than I want to share with you?

What are the rules for working with metadata and summarized data?

If I tweet something, the metadata about that includes when I tweeted it and maybe where I was (location data is not always available). Summarized data might be a statistical snapshot that gives you information about means, standard deviations, counts, and so on. Other technology can start with a very big data set and produce a smaller one that has many of the same statistical properties you care about.
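
As a tiny illustration of summarization, here is a sketch showing that even a simple random sample can reproduce basic statistics of a much larger data set. The numbers are synthetic, and real summarization technology uses far more careful methods than plain sampling.

```python
import numpy as np

rng = np.random.default_rng(42)
full = rng.normal(loc=100.0, scale=15.0, size=1_000_000)  # the "very big" data set

# A 1% simple random sample as a crude summary of the original.
sample = rng.choice(full, size=10_000, replace=False)

print(f"full:   mean={full.mean():.2f}  std={full.std():.2f}")
print(f"sample: mean={sample.mean():.2f}  std={sample.std():.2f}")
```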

Assume, to start, that data, metadata, and summarized data all have the same privacy restrictions. If I cannot move data out of an organization or a country, assume the same about the other two. From there, look to see if laws or contractual agreements soften this at all.

You may need to know where the data resides. If you put the data in a cloud, can you guarantee that it never leaves a country’s boundaries if that is a requirement?

How do we process static, collected data together with more real-time, rapidly changing information such as location?

This question is really asking about how we combine traditional structured information in databases—Systems of Record—with data and metadata coming from social, mobile, and Internet of Things interactions—Systems of Engagement.

While some analytics can be done in batch mode that is relatively non-urgent, some must be done in close to real time. If you are looking to make a stock transaction based on fluctuations in the market, you can’t wait 12 hours after you’ve crunched the numbers all night. If you need to adjust a patient’s treatment based on information streaming in from 5 different kinds of sensors, acting sooner rather than later is far more likely to produce a better result.

However, if you are analyzing the sales results for the last year to better understand why revenue was up in some geographies yet down in others, that can be done in batch. This may mean you use something like Hadoop to break the problem into smaller pieces and then recombine the results at the end, but this is not the only technique. It really depends on the amount and type of data. Similarly, if you are doing healthcare analytics based on longitudinal studies of patients’ responses to treatment regimens, it does not have to be done in real time.

Analytics today often use a combination of static collected data with information that is coming in quickly. Techniques like map/reduce are combined with streams. Data may be put in traditional relational databases, social and network graph databases, other NoSQL databases, or just discarded after it is seen.
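
Here is a minimal sketch of the streaming half of that combination: events arrive one at a time, are enriched with static customer history, and only the interesting ones trigger action. The customer table, the events, and the 5x threshold are all made up for illustration.

```python
# Static "system of record" data, e.g. a customer's historical purchase profile.
customer_history = {
    "c1": {"avg_purchase": 40.0},
    "c2": {"avg_purchase": 250.0},
}

# Events arriving one at a time from a stream.
event_stream = [
    {"customer": "c1", "amount": 45.0},
    {"customer": "c1", "amount": 900.0},
    {"customer": "c2", "amount": 260.0},
]

for event in event_stream:
    profile = customer_history.get(event["customer"], {"avg_purchase": 0.0})
    # Act only on significant events: purchases far above the usual amount.
    if event["amount"] > 5 * profile["avg_purchase"]:
        print("flag for review:", event)
    # A real system might also update the stored history, write the event to
    # a NoSQL store, or simply discard it after it has been seen.
```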

The trick is to combine the right underlying big data information management infrastructure with the right analytics and mathematical techniques to achieve the result you need.

Previous: Why do data scientists want more data, rather than less?
Next: Approach to policy can determine outcomes


For the Love of Big Data: Approach to policy can determine outcomes

Several weeks ago I was on the panel “Privacy and Innovation in the Age of Big Data” at the International Association of Privacy Professionals in Washington, DC USA. My role was to present the attraction and value of data but not to constantly interrupt myself with “but, but, but” for policy and privacy issues. That is, I was the setup for IBM’s Chief Privacy Officer Christina Peters and Marty Abrams, Executive Director and Chief Strategist of the Information Accountability Foundation, to talk policy. The audience was mainly privacy experts and attorneys.

I presented four slides and I previously posted those via SlideShare. Here and in three other posts I go through the bullets, providing more detail and points for discussion.

Approach to policy can determine outcomes

For the Love of Big Data: Approach to policy can determine outcomes

Reductions in the amount and kinds of data can produce diminished or inaccurate results

I think this is obvious: if crucial data that could help define the context is missing, then the model will probably be incomplete. Therefore, it will be hard for you to produce results that have a high confidence level, or you will be missing factors that are needed for better prediction or optimization.

Think of all the data that is necessary to predict how a particular farm field of tomatoes will do during this next summer season. Geography with elevation, expected sun and rain amounts, wind, soil conditions, planned fertilization, and natural or irrigated water supply all have an effect on what will actually happen. If your model is missing these, your prediction or your ability to tweak the factors to maximize yield will be compromised.
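
A small synthetic example makes the point: below, crop yield is generated from two factors, and a model that is missing one of them fits noticeably worse. The data, units, and coefficients are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic fields: yield depends on both rainfall and fertilizer (made-up units).
rain = rng.uniform(10, 40, n)
fert = rng.uniform(0, 5, n)
crop_yield = 2.0 * rain + 8.0 * fert + rng.normal(0, 3, n)

def rms_error(features):
    """Least-squares fit on the given features and the resulting error."""
    X = np.column_stack(features + [np.ones(n)])
    coef, *_ = np.linalg.lstsq(X, crop_yield, rcond=None)
    return float(np.sqrt(np.mean((X @ coef - crop_yield) ** 2)))

print("rain and fertilizer:", rms_error([rain, fert]))  # close to the noise level
print("rain only:          ", rms_error([rain]))        # noticeably worse
```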

Policy must take into account the value received by individuals for the use of their personal data

You as a consumer may happily give a coffee shop information about your location in order to get a discount on your next beverage. This is not exactly being “off the grid.”

Increasingly people are willing to give up personal information in exchange for discounts, higher quality, and position within social networks. Formal policy must take into account that privacy is not a fixed idea and will vary over time from person to person as more lucrative items can be traded for it.

Enforced data localization may decrease analytical completeness unless we can move intermediate results or the site of computation

If I am forced to keep all my data, metadata, and derived results within some geographical boundaries, my ability to combine these with information elsewhere will be curtailed. Can we move those other results to the constrained site for more extensive computation? Indeed, can we move the computation to the data instead of the other way around?
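
Here is a minimal sketch of that second idea, moving the computation to the data: each region computes a small intermediate summary locally, and only those summaries cross the boundary to be combined. The regions and values are invented, and whether even the summaries may legally move is exactly the policy question.

```python
# Raw records stay inside each region; only small intermediate results move.
regional_data = {
    "region_a": [102.0, 98.5, 110.2],
    "region_b": [87.0, 93.4],
}

def local_summary(values):
    """Runs where the data lives; no individual record leaves the region."""
    return {"count": len(values), "total": sum(values)}

summaries = {region: local_summary(values) for region, values in regional_data.items()}

# Combine only the summaries to get a global result.
total = sum(s["total"] for s in summaries.values())
count = sum(s["count"] for s in summaries.values())
print("global mean:", total / count)
```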

These are issues of both policy and cloud technology. What we do and where we do it is limited by both, but as technology often seems to advance far faster than policy, there will be a mismatch between what we can (and should) do and what we are allowed to do.

Big Data is not just about the data. It is about the intended use and the location of that use. Privacy, security, and data ownership are all important considerations that must be factored in before we do the “science” with the data.

Previous: What issues can analytics present?


Presentation: For the Love of Big Data

This is a presentation I gave yesterday at the International Association of Privacy Professionals in Washington, DC USA, on March 6, 2014. This short presentation was meant to stimulate ideas that would then be complemented by discussions about privacy policies as they relate to Big Data, and in that sense it is not complete regarding all aspects of privacy that come from the issues discussed.

Students ask: “Where do analytics get used?”

In the last few months I have given several talks to students getting graduate degrees in fields that involve analytics. For many of these students, their first question is “How do I get a job?”. Once we move beyond that, I talk about where analytics is used in companies. An exhaustive list would be exhausting, so let me give you some ideas and how you should think about them.

Let’s first look at this in one dimension. On the left we have some of the academic or technological disciplines that make up the broad field of what we call analytics.

analytics spectrum

You could study these and through examples learn some of the applications. With this approach you might say “I am an Operations Research expert. What are the various fields in which I could work that use analytics?”.

On the right, I list some of those fields of application. If you start on the right, you might ask “I am a supply chain expert. What disciplines within analytics should I learn?”. Those are on the left, though you may not need all of them.

A better way of viewing this is in two dimensions.

analytics space

Here you get a better idea that the disciplines can be used in different application areas and the application areas need various technical expertise. You might go on from here and weigh the intersection points to understand if, say, Machine Learning is more important for Pricing or Supply Chain.

Here’s my advice:

  • Among the disciplines, decide which you love the most and for which you have the best aptitude. Go deep on those, but learn enough about the others so you know when a given solution will require them.
  • If you are working on a team, seek out others who have skills that complement your own (this is good advice in general).
  • If you are working in an application area, understand that broadly and know how the disciplines are used in each. Become expert in one or two of the disciplines but over the course of your career, learn more and more about the adjacent fields and pick up those skills.
  • It is likely that if you start in one application area, you will be employed in another within 3 to 5 years.
  • Shift jobs within your organization or between organizations to learn more disciplines and application areas. Beware becoming a mile wide and an inch deep: truly become an expert in some of the areas of technology and use.


My scale of cognitive computing

Last week I hosted a strategy session for my group at IBM Research and I used the following as a scale of how cognitive certain types of computing are:

Bob’s Scale of Cognitive Computing

  • Sentient “we can do without the humans” systems
  • Learning, reasoning, inferencing systems
  • Cognitive-enough systems
  • Most analytics today
  • “Cognitive because marketing says so” systems
  • Sorry, no way is this cognitive

At the top we have the systems of science fiction, be it HAL from 2001: A Space Odyssey, SkyNet from the Terminator series, or perhaps Isaac Asimov’s robots. Don’t expect these soon, or ever, and that might be a very good thing.

Next we have the learning, reasoning, and inferencing systems that absorb massive amounts of document data, textual data, structured data, and real time information, and either passively or actively augment human thinking and information retrieval. I think IBM’s Watson is the closest system out there to this.

Next are the “good enough” systems. If a user thinks something is cognitive, as above, is that sufficient? Here we have almost all the systems out there today which look at your calendar, weather reports, flight schedules, and so on to help you be where you need to be, getting there as efficiently as possible, with the right information to do whatever you intend to do.

This and the last category are what people today really think of as being sufficiently cognitive. Ten or fifteen years ago much of what these systems can do would have seemed miraculous, but improvements in algorithms, network bandwidth, mobile devices, social media, and general information storage and retrieval are driving the progress we’ve seen.

Next we have analytics and optimization today, which are quite sophisticated but not necessarily cognitive. Doing machine learning alone does not make you cognitive. In my opinion, you need strong real-time processing of data to help push you over that line, for example.

Finally, we have the two questionable categories. Just because the marketing department says something is cognitive, that doesn’t make it so, and this has been true for thousands of other technologies and claims before. So beware of false statements and promises that never materialize.

Lastly, there are some things that no one with a sense of shame would dare say were cognitive (“It must be cognitive, I’m using a spreadsheet.”).

While you don’t necessarily have to be a purist, e.g., being cognitive enough may be ok, this is an important transition for the IT industry and it will be seen by users in their cars, homes, and on their mobile devices. Don’t be a cognitive wannabe, do the R&D work that makes it real.

Rules for fun and profit

Yesterday the New York Times published an article called “Apps That Know What You Want, Before You Do.” In it they described the category of so-called “predictive search” apps that behave like digital personal assistants, to a point.

The idea with such apps is that they look around at some of your data for certain types of tasks, e.g., traveling, and then collect other data to augment and improve the core experience. So the app will look ahead a few days, note that you plan to be in a location a significant distance from where you are now, and give you a weather report. It might also remind you to pack your galoshes.

In a similar way, it could start trolling traffic reports a few hours before you are supposed to drive to the airport and give you some good routes to avoid congestion.

This is all good and useful stuff. It is the kind of activity that frequent travelers do all the time. If you have an assistant, that person might be charged with doing such tasks on your behalf. It is also very rules-based: look for X and if Y then Z.

Look ahead at my calendar three days. If I am going to be at least 100 miles away from where I am now, show me the weather report for that location.
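
As a sketch of how little machinery such a rule needs, here is the rule above written out in Python. The calendar entries, distance table, and weather lookup are stubs standing in for real calendar, mapping, and weather services; they are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Event:
    when: date
    location: str

# Stub data standing in for real calendar, mapping, and weather services.
CALENDAR = [Event(date.today() + timedelta(days=2), "Washington, DC")]
DISTANCES = {("Yorktown Heights, NY", "Washington, DC"): 280}

def distance_miles(origin: str, destination: str) -> float:
    return DISTANCES.get((origin, destination), 0.0)

def weather_report(location: str, when: date) -> str:
    return f"Forecast for {location} on {when} (from a weather service)"

def check_travel_weather(today: date, here: str) -> None:
    """Look ahead three days; if an event is 100+ miles away, show its weather."""
    for event in CALENDAR:
        if today <= event.when <= today + timedelta(days=3):
            if distance_miles(here, event.location) >= 100:
                print(weather_report(event.location, event.when))

check_travel_weather(date.today(), "Yorktown Heights, NY")
```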

The hardest part is often getting access to the core data itself. If there happens to be a company that owns your email, your calendar, your social network, and most of your web searches, getting that data might not be so hard for the company, subject to privacy policies.

Systems that are based on rules, even in their new mobile app clothes, are, I repeat, handy, and I am not diminishing their value. It’s just that in the long run we’re going to be able to do much better.

More sophisticated systems use machine learning to look for general patterns and then help guide you based on your experiences and the experiences of others. Some of the systems described in the Times article may do some of this. I hope so, because large, purely rules-based systems can be very fragile, prone to breakage, and self-contradictory, and they get very unwieldy when you have a lot of rules.

Cognitive systems will do more than just access data and process rules. They will have characteristics more similar to the human brain than a processor of many if-then conditions. They’ll have notions of understanding the data, they will learn new things, and they will have some reasoning capabilities. More on this next time.


What We Do When We Do Analytics Research

In May of 2013 I was asked to contribute an article to ProVISION, an IBM magazine published in Japan for current and potential customers and clients. That piece was translated into Japanese and published at the end of July. Below is the original article I wrote in English describing some of the work done in my group by the scientists, software engineers, postdocs and other staff in IBM’s Research labs around the world. It was not meant to be inclusive of all analytics work done in Research but was instead intended to give an idea of what a modern commercial research organization that focuses on analytics and optimization innovations does for the company and its customers.

When you look in the business media, the word analytics appears repeatedly. It is used in a vaguely mathematical, vaguely statistical, vaguely data intensive sense, but often is not precisely defined. This causes difficulties when you are trying to identify the problems to which analytics can be applied and even more when you are trying to decide if you have the appropriate people to work on a problem.

I’m going to look at analytics and the related areas to help you understand where the state of the art research is and what kind of person you need to work with you. I’ll start by separating out big data from analytics, look at some of the areas of research within my own area in IBM Research, and then focus on some new applications to human resources, an area we call Smarter Workforce.

Big Data

This is the most inclusive term in use today but, because of that generality, it can be the most confusing. Let’s start by saying that there is a large amount of data out there, there are many different kinds of data, it is being created quickly, and it can be hard to separate the important and correct information from that which you can ignore. That is, there are big data considerations for Volume, Variety, Velocity, and Veracity, the so-called “Four Vs.”

I think these four dimensions of big data give you a good idea of what you need to consider when you process the information. The data may come from databases that you currently own or may be able to access, perhaps for a price. In large organizations, information is often distributed across different departments and while, in theory, it all belongs to you, you may not have permission to use it or know how to do so. It may not be possible to easily integrate these different sources of information to use them optimally.

For example, is the “Robert Sutor who has an insurance policy for his automobile” the same person as the “Robert Sutor who has a mortgage for his home” in your financial services company? If not, you are probably not getting the maximum advantage from the data you have and, in this case, not delivering the best customer service.

Data is also being created by social networks and increasingly from devices. You would expect that your smart phone or tablet is generating information about where you are and what you are doing, but so too are larger devices like cars and trucks. Cameras are creating data for safety and security but also for medical and retail applications.

What do you do with all this information? Do you look at it in real time or do you save it for processing later? How much of it do you save? If you delete some of it now, how do you know those parts won’t be valuable when we have better technologies in a year or two?

When you have so much data, how do you process it quickly enough to make sense of it? Technologists have created various schemes to solve this. Hadoop and related software divides up the processing into smaller pieces, distributes them across multiple machines, and then recombines the individual results into a whole.

Streams processing looks at data as it is created or received, decides what is important or not, and then takes action on the significant information. It may do this by combining it with existing static data such as a customer’s purchase history or dynamic data like Twitter comments.

So far I’ve said that there is a large amount of data currently stored and being newly created, and there are some sophisticated techniques for processing it. I’ve said almost nothing about what you are doing with that data.

In the popular media, big data is everything: the information and all potential uses. Among technologists we often restrict “big data” to just what I’ve said above: the information and the basic processing techniques. Analytics is a layer on top of that to make sense of what the information is telling you.

Analytics

The Wikipedia entry for analytics currently says:

Analytics is the discovery and communication of meaningful patterns in data. Especially valuable in areas rich with recorded information, analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance. Analytics often favors data visualization to communicate insight.

Let’s look at what this is saying.

The first sentence has the key words and phrases “discovery,” “communication,” and “meaningful patterns.” If I were to give you several gigabytes, terabytes, or petabytes of data, what would you do with it to understand what it could tell you?

Suppose this data is all the surveys about how happy your customers are with your new help desk service. Could you automatically find the topics about which your customers are the most and least happy? Could you connect those to specific retailers from which your products were bought?

At what times of the day are customers most or least satisfied with your help desk and what are the characteristics of your best help desk workers? How could this information be expressed in written or visual forms so that you could understand what it is saying, how strongly the data suggests those results, and what actions you should take, if any?

Once you have the data you want to use, you filter out the unnecessary parts and clean up what remains. For example, you may not care whether I am male or female and so can delete that information, but you probably want to have a single spelling for my surname across all your records. This kind of work can be very time consuming and can be largely automated, but usually does not require an advanced degree in mathematics, statistics, or computer science.
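
A tiny sketch of that kind of cleanup, using pandas on made-up survey records: drop a field the analysis does not need and normalize the surname spelling. Real cleansing pipelines handle far more cases, but the flavor is the same.

```python
import pandas as pd

# Made-up survey records with the kinds of problems described above.
df = pd.DataFrame({
    "surname": ["Sutor", "SUTOR", " sutor "],
    "gender":  ["M", "M", "M"],
    "score":   [4, 4, 2],
})

df = df.drop(columns=["gender"])                       # not needed for this analysis
df["surname"] = df["surname"].str.strip().str.title()  # one consistent spelling
print(df)
```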

Once you have good data, you need to produce a mathematical model of it. With this you can understand what you have, predict what might happen in the future if current trends continue or some changes are made, and optimize your results for what you hope to achieve. For our help desk example, a straightforward optimization might suggest you need to add 20% more workers at particular skill levels to deliver a 95% customer satisfaction rate. You might also insist that helpful responses be given to customers in 10 minutes or less at least 90% of the time.
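
To show what even a very simple optimization of the help desk example might look like, here is a toy linear program using SciPy. Every number in it, the costs, the satisfaction lift per worker, and the gap to close, is an illustrative assumption, and it treats workers as continuous quantities rather than whole people.

```python
from scipy.optimize import linprog

cost = [400, 650]   # assumed weekly cost per added worker, by skill level
lift = [0.4, 0.9]   # assumed satisfaction points gained per added worker
gap = 5.0           # points needed to reach the 95% satisfaction target

# linprog minimizes cost @ x subject to A_ub @ x <= b_ub, so the requirement
# lift @ x >= gap is written with its signs flipped.
result = linprog(c=cost, A_ub=[[-lift[0], -lift[1]]], b_ub=[-gap],
                 bounds=[(0, None), (0, None)])

print("workers to add per skill level:", result.x)
print("total weekly cost:", result.fun)
```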

A more sophisticated optimization might look at how you can improve the channels through which your products are purchased, eliminating those that cause the most customer problems and deliver the least profit to you.

For the basic modeling, prediction, and optimization work, one or more people with undergraduate or master’s degrees in statistics, operations research, data mining/machine learning, or applied mathematics may be able to do the work for you if it is fairly standard and based on common models.

For more sophisticated work involving new algorithms, models, techniques, or probabilistic or statistical methods, someone with a Ph.D. is most likely needed. This is especially true if multiple data sources are combined and analyzed using multiple models and techniques. Analytics teams usually have several people with different areas of expertise. It is not uncommon to see one third of a team with doctorates and the rest with undergraduate or masters degrees.

Our work at IBM Research

I lead the largest worldwide commercial mathematics department in the industry, with researchers and software engineers spread out over twelve labs in Japan, the United States, China, India, Australia, Brazil, Israel, Ireland, and Switzerland.

While the department is called “Business Analytics and Mathematical Sciences,” we are not the only ones in IBM Research who do either analytics or mathematics. We are the largest concentration of scientists working on the core mathematical disciplines, which we then apply to problems in many industries, often in partnership with our Research colleagues and those in IBM’s services business divisions.

We divide our work into what we call strategic initiatives, several of which I’ll describe here. In each of these areas we write papers, deliver talks at academic and industry conferences, get patents, help IBM’s internal operations, augment our products, and deliver value to clients directly through services engagements.

Visual Analytics

One of the topics in IBM’s 2013 edition of the Global Technology Outlook is Visual Analytics. This differs from visualization in that it provides an interactive way to see, understand, and manipulate the underlying model and data sources in an analytics application. Visual analytics often compresses several dimensions of geographic, operational, financial, and statistical data into an easy to use form on a laptop or a tablet.

Visual Analytics combines a rich interactive visual experience with sophisticated analytics on the data, and I describe many of the analytics areas in which we work in the sections below. Our research involves visual representations and integration with the underlying data and model, client-server architectures for storing and processing information efficiently on the backend or a mobile device, and enhancing user experiences for spatio-temporal analytics, described next.

Spatio-temporal Analytics

This is a particularly sophisticated name for looking at data that changes over time and is associated with particular areas or locations. This is especially important now because of mobile devices. Information about what someone is doing, when they are doing it, and where they are doing it may be available for analysis. Other example applications include the spread of diseases; impact of pollution; weather; geographic aspects of social network effects on purchases; and sales pipeline, closings, and forecasting.

The space considered may be either two- or three-dimensional, with the latter becoming more important in, for example, analysis of sales over time in a department store with multiple floors. Research includes how to better model the data, make accurate predictions using it, and use new visual analytics techniques to understand, explore, and communicate insights gained from it.

Event Monitoring, Detection, and Control

In this area, many events are happening quickly and you need to make sense of what is normal and what is anomalous behavior. For example, given a sequence of many financial transactions, can you detect when fraud is occurring?

Similarly, video cameras in train stations produce data from many different locations. This data can be interpreted to understand what are the normal passenger and staff actions at different times of the day and on different days, and what actions may indicate theft, violent behavior, or even more serious activities.
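
For the transaction case, here is a deliberately crude sketch of anomaly detection: flag amounts that sit far outside the normal range. The data is synthetic and a z-score threshold is nowhere near a real fraud model, but it shows the normal-versus-anomalous framing.

```python
import numpy as np

rng = np.random.default_rng(7)

# Mostly ordinary transaction amounts, with two injected outliers.
amounts = np.concatenate([rng.normal(60, 15, 500), [900.0, 1200.0]])

# Flag anything more than four standard deviations from the mean.
z_scores = (amounts - amounts.mean()) / amounts.std()
print("flagged amounts:", amounts[np.abs(z_scores) > 4])
```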

Analysis of Interacting Complex Systems

The world is a complicated place, as is a city or even the transportation network within that city. While you may be able to create partial models for a city’s power grid, the various kinds of transportation, water usage, emergency healthcare and personnel, it is extremely difficult to mathematically model all these together. They are each complicated and changes in one can result in changes in another that are very hard to predict. There are many other examples of complex systems with parts that interact.

Simulation is a common technique to optimize such systems. The methods of machine learning can help determine how to realistically simulate the components of the system. Mathematical techniques to work backwards from the observed data to the models can help increase the prediction accuracy of the analytics.

This focus area provides the mathematical underpinnings for what we in IBM do in our Smarter Cities and Smarter Planet work.

Decision Making under Uncertainty

Very little in real life is done in conditions of absolute certainty. If you run a power plant, do you know exactly how much energy should be produced to meet an uncertain demand? How will weather in a growing season affect the yield from agriculture? How will your product fare in the marketplace if your competitor introduces something similar? How will that vary by when the competitive product is introduced?

If you factor in uncertainty from the beginning, you can better hedge your options to maximize metrics such as profit and efficiency. There are many ways to quantify uncertainty and incorporate it into analytical models for optimization. The exact techniques used will depend on the number and complexity of the events that represent the uncertainty.
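
A classic toy example of factoring uncertainty in from the beginning is choosing a production quantity against uncertain demand. The sketch below simulates many demand scenarios and picks the quantity with the best expected profit; the price, cost, and demand distribution are all invented.

```python
import numpy as np

rng = np.random.default_rng(1)

price, unit_cost = 10.0, 4.0
demand = rng.normal(1000, 250, 20_000).clip(min=0)   # simulated uncertain demand

def expected_profit(quantity):
    sold = np.minimum(quantity, demand)               # cannot sell more than demand
    return float((price * sold - unit_cost * quantity).mean())

candidates = np.arange(500, 1600, 25)
profits = [expected_profit(q) for q in candidates]
best = candidates[int(np.argmax(profits))]

print("best quantity:", best)                         # above mean demand here,
print("expected profit:", round(max(profits), 2))     # because margins are healthy
```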

Revenue and Price Optimization

The decisions you make around the price you charge for your products or services, and therefore the hoped-for revenue, are increasingly being affected by what happens on the demand side of your business. For example, comments spread through social media can significantly increase or decrease the demand for your products. Aggressive low pricing given to social media influencers can increase the “buzz” around your product in the community, thereby increasing the number of units sold. If you can give personalized pricing to consumers that is influenced by their past purchase behavior, you can affect how likely they are to buy from you again.
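
Here is a toy sketch of demand-side price optimization: a linear demand curve, a revenue calculation, and a search for the best price, with and without a social “buzz” boost. All of the coefficients are illustrative assumptions.

```python
import numpy as np

def units_sold(price, buzz=0.0):
    """Assumed linear demand curve: higher prices sell fewer units, buzz sells more."""
    return np.maximum(0.0, 500 - 12 * price + 80 * buzz)

prices = np.linspace(5, 40, 200)
revenue = prices * units_sold(prices)
revenue_with_buzz = prices * units_sold(prices, buzz=1.0)

print("best price, no buzz:  ", round(prices[np.argmax(revenue)], 2))
print("best price, with buzz:", round(prices[np.argmax(revenue_with_buzz)], 2))
```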

Demand shaping can help match what you have in inventory to what you can convince people to buy. This focus area therefore affects inventory and manufacturing, and so the entire supply chain.

Condition Based Predictive Asset Management

When will a machine in your factory fail, a part in your truck break, or a water pipe in your city spring a leak? If we can predict these events, we can better schedule maintenance before the breakages occur, keeping your business up and running.

We can get the parts we need and line up the people who will do the work in a timely way. Since there are multiple assets that may fail, we can help prioritize which work should be done earlier to keep the whole system operating, even given the process dependencies.
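
One very small version of the “fix it as late as is reasonable” idea is to fit a trend to a machine’s sensor readings and project when it will cross a failure threshold. The readings and threshold below are invented; real condition-based maintenance models are far richer.

```python
import numpy as np

# Hypothetical weekly vibration readings from one machine; wear shows up as a trend.
weeks = np.arange(12)
vibration = np.array([2.1, 2.2, 2.2, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 3.5, 3.6])
FAILURE_THRESHOLD = 5.0                                 # assumed failure level

slope, intercept = np.polyfit(weeks, vibration, 1)      # simple linear trend
weeks_to_failure = (FAILURE_THRESHOLD - vibration[-1]) / slope

# Schedule the repair with a safety margin before the projected failure point.
print(f"projected failure in about {weeks_to_failure:.1f} weeks;")
print(f"schedule maintenance within {max(0, int(weeks_to_failure) - 2)} weeks")
```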

Integrated Enterprise Operations

You can think of this focus area as a specific application of the Analysis of Interacting Complex Systems work described above to the process areas within an organization or company. For example, a steel company receives orders from many customers requesting different products made from several quality grades. These must be manufactured and delivered in a way that optimizes use of the stock material available, configures and schedules tooling machines, minimizes energy usage, and maintains the necessary quality levels.

While each component process can be optimized, the research element of this concerns how to do the best possible job for all the related tasks together.

Smarter Finance

I think of this as the analytical tools necessary for the Chief Financial Officer of the future. It integrates both operational and financial data to optimize the overall financial posture of an organization, including risk and compliance activities.

Another element of Smarter Finance includes applications to banking. These include optimization of branch locations and optimal use of third party agencies for credit default collections.

Smarter Workforce

A particularly large strategic focus area in which we are working is the application of analytics to human resources, which we call Smarter Workforce. We’ve been involved with this internally with IBM’s own employees for almost ten years, and we recently announced that we would make two aspects of our work, Retention Analytics and Survey Analytics, available to customers.

Retention analytics provides answers to the following questions: Which of my employees are most likely to leave? What characterizes those employees in terms of role, geography, and recent evaluations and promotions? What will it cost to replace an employee who leaves? How should I best distribute salary increases to the employees I most want to stay in order to minimize overall attrition?
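
To give a flavor of the first of those questions, here is a toy retention model on synthetic data: a logistic regression that scores how likely an employee is to leave from two made-up features. It is a sketch of the general technique, not IBM’s Retention Analytics offering.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 500

# Synthetic features and a made-up attrition mechanism, purely for illustration.
years_since_promotion = rng.uniform(0, 6, n)
engagement_score = rng.integers(1, 6, n).astype(float)
p_leave = 1 / (1 + np.exp(-(0.8 * years_since_promotion - 1.2 * engagement_score)))
left = (rng.uniform(size=n) < p_leave).astype(int)

X = np.column_stack([years_since_promotion, engagement_score])
model = LogisticRegression().fit(X, left)

# Probability of leaving for two hypothetical employees: long overdue for
# promotion and disengaged, versus recently promoted and engaged.
print(model.predict_proba([[5.0, 2.0], [0.5, 5.0]])[:, 1])
```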

Beyond this, we are doing related research to link the workforce analytics to the operational and financial analytics for the rest of an organization. For example, what will be the effect on this quarter’s revenue if 10% of my sales force in Tokyo leaves in the next two weeks?

Survey analytics measures the positive and negative sentiment within an organization. While analytics will not replace a manager knowing and understanding his or her employees, survey analytics takes employee input and uncovers information that might otherwise be hard to see. Earlier I discussed a customer help desk. What if that help desk was for your employees? How could you best understand what features your employees most liked or disliked, and their suggestions for improvement?

This is one example of using social data to augment traditional analytics on your organization’s information. Many of our focus areas now incorporate social media data analytics, and that itself is a rich area of research to understand how to do it correctly to get useful results and insight.

In Conclusion

Analytics has very broad applications and is based on decades of work in statistics, applied mathematics, operations research, and computer science. It complements the information management aspects of big data. As the availability of more and different kinds of data increases, we in IBM Research continue to work at the leading edge to create new algorithms, models and techniques to make sense of that data. This will lead, we believe, to more efficient operations and financial performance for our customers and ourselves.

Think top-down about Big Data

Don’t think piecemeal about all the data that is becoming available for analysis. Think comprehensively about the right and wrong uses of the data to avoid surprises.

“Big Data” is a term broadly used in the IT industry about all the data that is being produced by consumers on the web, from sensors, from applications running, and from transactions, to name a few sources. It’s not really just about the data itself, but what you do with it, how you gain insight from it, how you correlate different data sets and then somehow make sense of it all.

It’s no longer a terribly precise term since it seems to have swallowed up all of analytics which in turn swallowed up all of optimization.

There’s a lot of data that we produce that we’re happy about when it is processed, such as giving us better product recommendations when we shop online, improved courses of medical treatments or more interesting choices for streaming movies. Then there are other situations such as the recent NSA news about metadata from cell phone usage that made many people unhappy.

All of this has raised significant issues about privacy, who owns the data, and who gets to process it or the derived metadata. Just what data are we talking about anyway? To answer that question: there’s more of it than you think. Reading the fine print about the use of surveillance cameras while you are shopping to determine your buying behavior and preferences is more important than ever.

This may be no more offensive than the online shopping data collection, but it certainly spooked some people because they didn’t know anyone might want to do that. Ultimately it may just be considered a normal part of an enhanced customer experience, but if it surprises you, it may then shock and disturb you.

Here’s what I think is a big problem: people often only think about data that is being collected and analyzed in a bottom-up way. That is, they think of visiting that one web site, or being on that video camera across the street, or doing that particular banking transaction. Proper use of this data with permission that respects privacy can lead to some real improvements in how we get personalized benefits. Knowledge of your calendar, your travel schedule, and your food preferences can lead to a better structured, more enjoyable, and more efficient business trip, for example.

It’s when someone or some organization steps over some line, a line we may not have noticed before, that things get creepy.

Here’s a better way to think about this, I believe. Assume all data about everything will be available.

Driving down a highway? Someone will know where you are (as well as everyone else on the road), how fast you are traveling, what rest areas you go to, how much gas you are using, what radio stations you listen to, the efficiency of your air conditioner, and how much air you have in your tires.

I now get monthly reports about the state of my car’s mechanics and electrical systems that are uploaded wirelessly. It’s rather convenient, actually, since one of my tires seems to be chronically underinflated.

In fact, for automobiles and travel it is hard to come up with examples of data that is not being collected already. It might not be processed together yet for combined insights (three out of four dentists who listen to Outlaw Country on SiriusXM satellite radio prefer rest stops with Burger Kings – I made that up), but the data is there.

Given the assumption that all data is available, let’s have the discussion about what we can legitimately do with it now and in the future and where those invisible lines are that should not be crossed. Don’t be surprised when you hear about some new data that is available for analysis. Start with assuming it is there, use the good stuff, set up the correct policies for that usage, and move on.

As you walk around, look at every device, every appliance, every computer-connected activity, every stream and lake and roadway, and imagine what data could come from each of them. Don’t be naive and assume the data won’t ever be available because much of it will. If we do it right, we’ll be more sustainable, more efficient, less polluting, and more effective in our various individual and community activities. If we do it wrong, we’ll get back to creepy, which is a bad thing.

P.S. Although I’m encouraging you to think about all possible data ultimately being available, that doesn’t mean it will be easy or cheap to get. You’re doing it wrong if you think “why would anyone ever want that?”. Someone will. That can be a good thing, but think it through.

Collaborate but be decisive

Many people in organizations are lucky to have so many colleagues with wonderful skills and experiences that can be brought to bear on any particular problem. The trick is to bring together a team of the right size and act with speed. Don’t bring in more people than you need.

Do your research, but only enough to get the right confidence level that you know how to start to proceed on a path that is likely to deliver exactly what a project or client needs. Learn to recognize when you have reached the point of diminishing returns. Know when it is time to stop strategizing and start coding, selling, or implementing.

In my opinion it is best to make learned decisions fast and then adjust as you go. You’ll never have all the answers at the beginning of the process. If you wait too long, you’ll miss the opportunity with your client and possibly the key time when the market is hot for what you can offer.

Workforce analytics announcement

Today my team at IBM Research, in collaboration with our software colleagues, announced that we’ll be offering to customers new analytics capabilities in the areas of employee retention and workforce sentiment. The main details are in the press release and an entry I published over on the IBM Smarter Planet Blog.

Comments on the Smarter Planet Blog, please.

Some questions to ask yourself if you want to be a data scientist

Last October, the Harvard Business Review published an article called “Data Scientist: The Sexiest Job of the 21st Century.” I could virtually hear the rejoicing in the work hallways of analysts, mathematicians, statisticians, and computer scientists everywhere. At last, recognition!

While it is debatable in this case if the job really is sexy or even hip and will make you either, there’s really no question that the rise of analytics and big data is making these skills increasingly in demand. Is this the right job for you?

A good place to start to understand what is needed to be a data scientist is at the INFORMS Analytics Certification website. It costs money to get this, but the program information gives you an idea of the kinds of questions on the test, the sorts of case studies with which you should be comfortable, and the books and websites you can use for further learning.

In a more informal way, let me here ask some questions you should answer about yourself and your knowledge to see if this is a career or job you might consider. I’ve included a few technical questions to encourage you to learn more about some of the disciplines involved.

  • Do you suffer from math anxiety? Does solving equations, working with matrices, or making sense of tables or graphs scare you? If so, this probably is not the field for you.
  • Are you comfortable with statistics? Could you in your spare time over the next month do what would equate to a first, solid, mathematically sound statistics course? Would you get an A for your efforts? You’ll need statistics to understand the data and to give yourself sanity checks about the conclusions you are drawing.
  • Is Microsoft Excel or OpenOffice Calc your favorite tool in your office productivity suite? Doing real analytics and big data often goes well beyond what you can do in a spreadsheet, but if this kind of software terrifies you, data science might not be a good match.
  • Do you know how services like Netflix choose what movies you would like to watch? Make an educated guess and then go learn some of the techniques. I won’t give you a reference, go explore what you find on the net.
  • Do you understand the differences between descriptive, diagnostic, predictive, and prescriptive analytics? Where does optimization fall among these?
  • Do you like saying the word “stochastic”? Do you know what it means?
  • Who was Andrey Markov and what was his obsession with chains?
  • What criteria and analysis would you use to predict who will win the next World Series, Super Bowl, or World Cup?
  • Under what situations would you use Hadoop, Hive, HBase, Pig, SPSS, R, or CPLEX?
  • How would you go about constructing your personal profile from all the public data about you on the web? This could be from yourself (e.g., your Twitter feed) or produced by others. Include your gender, your approximate age and income, the town in which you live, the high school to which you went, your hobbies, the name of your significant other, the number of children you have, your favorite color, your favorite sport, your best friend’s name, and the color of your hair. Does this scare you?
  • When can Twitter add to your insight about marketing campaigns and when does it just add unnecessary noise?

This is by no means an exhaustive list, but if these topics intrigue you, you have or are willing to get the technical background, and you know who Nate Silver is, you just might have a career in data science.


Math and Analytics at IBM Research: 50+ Years

Soon after I arrived back in IBM Research last July after 13 years away in the Software Group and Corporate, I was shown a 2003 edition of the IBM Journal of Research and Development that was dedicated to the Mathematical Sciences group at 40. From that, I and others assumed that this year, 2013, was the 50th anniversary of the department.

Herman Goldstine at IBM Research

I set about lining up volunteers to organize the anniversary events for the year and sent an email to our 300 worldwide members of what is now called the Business Analytics and Mathematical Sciences strategy area. Not long afterwards, I received a note from Alan Hoffman, a former director of the department, saying that he was pretty sure that the department had been around since 1958 or 59. So our 50th Anniversary became the 50+ Anniversary. Evidently mathematicians know the theory of arithmetic but don’t always practice it correctly.

The first director of the department was Herman Goldstine who joined after working on the ENIAC computer and a stint at the Institute for Advanced Study in Princeton. Goldstine is pictured in the first photo on the right at a reception at the T.J. Watson Research Center in the early 1960s. Goldstine died in 2004, but all other directors of the department are still alive.

Directors of the Mathematical Sciences Department at IBM Research

We decided that the first event of the year celebrating the (more than) half century of the department would be a reunion of the directors for a morning of panel discussions. This took place this last Wednesday, May 1, 2013.

Reunion of the directors of the Math Sciences Department at IBM Research

Photo credit: Mary Beth Miller

I started the day by giving a glimpse of what the department looks like today: the above-mentioned 300 Ph.D.s, software engineers, postdocs, and other staff distributed over the areas of optimization, analytics, visual analytics, and social business in 10 of IBM’s 12 global labs.

I then introduced our panel pictured in the photo above. From left to right we have me, Brenda Dietrich, Bill Pulleyblank, Shmuel Winograd, Roy Adler (a mathematician who was in the department during the tenures of all the other directors except me), Alan Hoffman, Dick Toupin, Hirsh Cohen, and Ralph Gomory.

Ralph Gomory, Benoit Mandelbrot, and other IBM researchers pondering a math problem

My goal for the discussion was to go back and look at some of the history and culture of the math department over the last five decades. I was hoping we would hear anecdotes and stories of what life was like, the challenges they faced, and the major successes and disappointments.

Other than a few questions I had prepared, I wasn’t sure where our conversation would go. The many researchers who joined us in the auditorium at the T. J. Watson Research Center in Yorktown Heights, NY, or via the video feed going out to the other worldwide labs would have a chance to ask questions near the end of the morning.

I’m not going to go over every question and answer but rather give you the gist of what we spoke about.

  • Ralph Gomory reminded us that the department was started in a much different time, during the Cold War. The problems they were trying to solve with the hardware and software of the day were often highly confidential. However, every era of the department has had its own focus, burning problems to be solved, and operational environment.
  • Hirsh Cohen got his inspiration for the mathematics he did by solving practical problems such as those related to the large mainframe-connected printers. Many people feel that mathematics shouldn’t stray too far from the concrete, but it is not that simple. This isn’t just applied mathematics; it is a way of looking for inspiration that may express itself in more theoretical ways. The panelists mentioned more than once that the original posers of business or engineering problems might not recognize the mathematics that was developed in response. (I think there is nothing wrong with theoretical mathematics with no direct connection to the physical world, but there are some areas of mathematical pursuit that I think are just silly and of marginal pure or applied interest.)
  • In response to my question about balancing business needs with the desire to advance basic science, Shmuel Winograd told me I had asked the wrong question: it was about the integration of business with basic science, not a partitioning of time or resources between them. This very much sets the tone of how you manage such a science organization in a commercial company. The successful integration of these concerns may also be why IBM Research is pretty much the sole survivor of the industrial research labs from the 1950s and 1960s.
  • There was general consensus that it is difficult to get a researcher to do science in an area in which he or she fundamentally does not want to work. This was redirected to the audience members, who were reminded to understand what they loved to do and then find a way to do it. (This sounded like a bit of a management challenge to me, and I suspect I’ll hear about it again.)
  • Time gives a great perspective on the quality and significance of scientific work that is just not obvious while you are in the middle of it. This is one of the reasons why retrospectives such as this can be so satisfying.

Discussing the future of BAMS

Photo credit: Mary Beth Miller

After the first panel and a coffee break, we came back and turned from the department’s history to its future. We have an internal department community in IBM Connections, and I started by summarizing some of the suggestions people had made about what we’ll be doing in the department in five, ten, and twenty years.

Sustainability, robotic applications of cognitive computing, and mathematical algorithms for quantum computing were all suggested. Note that this was all fun speculation, not strategy development!

Eleni Pratsini, Director of Optimization Research, and Chid Apte, Director of Analytics Research, then each discussed technical topics that could be future areas for scientific research as well as having significant business use.

After the final Q&A session, we got everyone on stage for a group photo.

BAMS group photo

Photo credit: Steve Hamm

One thing that struck me when we were doing the research through the archives was how much more of a record we have of the first decade of the department than we do of the 40+ years afterwards. In those early days, each department did a typed report of its activities which was then sent to management and archived.

With the increasing use of email and, much later, digital photos, we just don’t have easy access, if any, to what happened month by month. As part of this 50+ Anniversary, I’m going to organize an effort to do a better job of finding and cataloging the documents, photos, and video of the department.

This should make it easier for future celebrations of the department’s history. I suspect I’m not going to make it to the 100th anniversary, but I just might get to the 75th. For the record for those who come after me, that will be in 2034.

Simple introduction to analytics

This came up in a discussion today, so I thought it would be useful to clarify some terminology related to analytics. Reporting and dashboards are not predictive analytics.

More generally,

  • Descriptive analytics is where you describe what is or has already happened, usually by processing a lot of data (which could be big). To show this, you use reporting or a dashboard that gives you tables, summaries, and graphics that make it easy to understand the situation and offer some insights that might not be obvious. The data might be last quarter’s sales information, movie theatre attendance, or e-book purchase preferences broken down by demographics, for example.
  • Predictive analytics uses techniques like simulation, statistics, and machine learning to extrapolate from past data or behavior to predict what might happen. Variations might be introduced so that you can get an idea of future results if you increase your sales force by 10%, decrease your price by 5%, or increase your manufacturing capacity, for example.
  • Prescriptive analytics often uses serious mathematical optimization techniques, simulation, and algorithms to help you understand how you should reach your goals. How many planes with given capacities, with crews based in specific locations, should be used on a given set of routes to accommodate so many passengers, all while minimizing fuel use and maximizing profits, for example. (A toy sketch contrasting the three follows this list.)
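
To make the distinction concrete, here is a toy Python sketch that runs through all three flavors on a dozen made-up monthly sales figures. The numbers, the linear trend, and the staffing model are all invented for this illustration; it is not a fragment of anything we actually build.

```python
# Toy illustration of descriptive, predictive, and prescriptive analytics.
# All numbers and models below are invented for this example.
import math
import statistics

# Hypothetical monthly unit sales for the past year
sales = [120, 132, 128, 141, 150, 149, 158, 163, 170, 168, 181, 190]

# Descriptive: summarize what has already happened.
print("total units sold:", sum(sales))
print("average per month:", round(statistics.mean(sales), 1))
print("best month:", max(sales))

# Predictive: fit a simple linear trend (a stand-in for statistics or
# machine learning) and extrapolate three months ahead.
months = list(range(len(sales)))
fit = statistics.linear_regression(months, sales)  # Python 3.10+
forecast = [fit.intercept + fit.slope * m
            for m in range(len(sales), len(sales) + 3)]
print("next three months (forecast):", [round(f) for f in forecast])

# Prescriptive: decide what to do.  Suppose extra salespeople add units
# with diminishing returns but each costs a fixed amount; search for the
# head count that maximizes net gain (a trivial stand-in for real
# mathematical optimization).
def net_gain(extra_staff):
    return 40 * math.sqrt(extra_staff) - 12 * extra_staff

best = max(range(0, 11), key=net_gain)
print("recommended extra hires:", best)
```

The point is only the progression: summarize the past, extrapolate it forward, and then recommend an action.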

By the way, be careful what you call analytics. If you compute the average sales attained by your staff last month, that is probably just statistics and not deserving of the title “analytics.” I might be even less charitable and say it is just arithmetic.

Also see: “What We Do When We Do Analytics Research”

New year, newish job

As of the end of December, I’ve been with IBM for 30 years. That seems like a lot, but I think you just sort of wake up one day and discover that time has gone by, you’ve been working, your career has been advancing, your kids are no longer little, and so forth. It’s not dissimilar to that strange feeling when you graduate from college and you realize that the experience you prepared for over 18 years is now over. Life is like that.

At the end of last July, I moved to my current position in IBM Research as head of the Business Analytics and Mathematical Sciences group. That’s the “newish” job in the title; I have not moved to a completely new position in the last few days. Nevertheless, I’m not quite six months into this role, so I still feel somewhat new, though less so every day. After six months, the honeymoon (if there really was one) will be completely over. I’m very happy to be back. I’m tempted to say that this is the best job I’ve ever had.

It was a bit strange coming back to Research. I was here from 1984 to 1999 with three years out in the middle to finish up my Ph.D. program, though I was still an IBM employee during that period. I’m based at the Watson Research Center in Yorktown Heights, NY, though the Research division is really global, with 12 labs around the world. There are folks under my aegis in Switzerland, Israel, Japan, China, India, Singapore, and other countries.

The strangeness I initially felt was the immediate familiarity and perhaps déjà vu: I know the layout of the building, I know where the cafeteria is, and I know that the restrooms are in the stairwells toward the front of the building. I still get confused at times between the top and middle floors since they look quite a bit alike. I do manage to always make it back to my office, and I just pretend that I meant to walk down aisle 31 on the wrong floor.

Even with that acceptance of what has remained the same, much else has changed. When I left in 1999, Research had nowhere near the global presence I described above. The department I lead, now called Business Analytics and Mathematical Sciences, was then just Mathematical Sciences or, more simply, Math Sciences. Analytics, though an overused and hence fuzzy term, was simply not as well known thirteen years ago.

The work I did with others on symbolic computation and semantically rich scientific publishing is no longer done in the department. We still do a tremendous amount of work on optimization and algorithms, though there is much more applied probability and machine learning than there was then. We do more work directly with clients now, so we can balance theory with solving real problems on real datasets for customers in almost every industry you can think of.

These are some of my initial impressions of what it means to me to be back in Research for Round 2. I’ll share more as I get further into this role.

Analytics: Interacting systems make things tricky

Every year around holiday times, millions of people take to the roads, the skies, and the rails to visit friends and families. Sometimes everything works perfectly: the traffic isn’t too heavy, there are no accidents, schedules are met, lines are short, and the weather cooperates. In my experience, that tends to be more the exception than the rule.

If you think about our transportation systems, there are plenty of places where things can go wrong. One traffic accident on the main road to the airport can disrupt the lives of hundreds, if not thousands, of people. Why is this? What can we do about it?

When I was a teenager growing up north of New York City, I took a summer course at the Polytechnic Institute of New York in Brooklyn. This was a special class for high school science geeks (a label I wear proudly now, though maybe not so much then) that looked at the simulation of traffic. For example, suppose you have a traffic light at the intersection of two streets. Your task is to determine how long the light should be red or green in each direction so that traffic moves most smoothly. That is, you would like to maintain a reasonable volume of traffic flowing each way, while allowing it to move quickly and safely enough. How complicated could that be?

There are several unknowns and decisions to be made before you can start to get to an answer. How many cars approach the intersection from each direction during some fixed period of time, say, per minute? How does this vary during the day? For how long are drivers willing to wait patiently for the light to go from red to green? How many cars would you like to move through the intersection during each green light?

I didn’t say much about the intersection or the traffic light. What if one road is a one way street and the other is not? Do you want to have one or more right turn arrows? How long should those be green? If this is an existing intersection, do you have the history of accidents (quantitative) and traveler complaints (qualitative)?

By the way, will people be crossing the roads? How many do you need to account for and how long will the “walk lights” permit the crossings?

This is just one intersection! During my summer class all those years ago, we did not look at so many variations. With actual cars and drivers, you must.

With some patience, some research, and some measurement of existing conditions, you can use any one of several simulation software programs to help you model the situation, adjust the various parameters, and come up with a reasonable set of timings.
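
For the curious, here is a minimal sketch of that kind of model: a single intersection with one light, arrivals approximated by a coin flip each second, and a fixed number of cars clearing per second of green. Every rate and timing in it is made up, and a real study would use proper simulation software, but it is enough to see how the green-time split trades one road’s queue against the other’s.

```python
# Minimal single-intersection simulation.  All arrival rates, cycle
# lengths, and service rates are invented for illustration.
import random

def simulate(green_ns, green_ew, rate_ns, rate_ew,
             cycles=1000, cars_per_green_sec=0.5):
    """Average leftover queue on each road at the end of a light cycle."""
    queue_ns = queue_ew = 0.0
    total_ns = total_ew = 0.0
    cycle_len = green_ns + green_ew
    for _ in range(cycles):
        # Cars arriving over one full cycle (crude Bernoulli-per-second
        # approximation of Poisson arrivals).
        queue_ns += sum(random.random() < rate_ns for _ in range(cycle_len))
        queue_ew += sum(random.random() < rate_ew for _ in range(cycle_len))
        # Cars that clear the intersection while their light is green.
        queue_ns = max(0.0, queue_ns - cars_per_green_sec * green_ns)
        queue_ew = max(0.0, queue_ew - cars_per_green_sec * green_ew)
        total_ns += queue_ns
        total_ew += queue_ew
    return total_ns / cycles, total_ew / cycles

# A busier north-south road (0.3 cars/sec) crossing a quieter east-west
# road (0.1 cars/sec), with a fixed 60-second cycle.
random.seed(42)
for g_ns in (30, 40, 50):
    ns, ew = simulate(g_ns, 60 - g_ns, 0.3, 0.1)
    print(f"green NS={g_ns}s, EW={60 - g_ns}s -> "
          f"avg leftover queue NS={ns:.1f}, EW={ew:.1f}")
```

Even in this toy version, only one of the three splits keeps both queues under control, which is the whole game: change an arrival rate or add a second light downstream and the “right” timing moves again.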

Unless you live in a town with one traffic light, things get more complicated quickly. Imagine a sequence of traffic lights as you travel down a road. I suspect you’ve heard or even said the phrase “I hit every red light on my way to work today.” That happens when the traffic is not in sync with the lights because of a change in volume (more cars), speed (perhaps an accident), or bad design (light timings that work at noon may not be optimal for 8 AM).

If you have a light at the bottom of an exit ramp from a highway, traffic can back up unsafely onto the highway if cars cannot get off the ramp and onto the local street fast enough. What we have here is the interaction of complex systems, and these can be very hard to model and optimize. That said, we can’t just ignore hard problems, because real people have to move through real cities every day and every hour.

Part of the work we do in the Business Analytics and Mathematical Sciences group at IBM Research tackles just such big, complex, intertwined, and sometimes just downright messy problems. We use analytics and statistics to describe what is currently going on and predict what will happen if changes are made. We use the mathematics and algorithms of optimization together with simulation to produce more optimal configurations and test our hypotheses.

If I think again about that simple intersection above, I need to consider carefully what I’m really trying to optimize. It might be volume and speed, as I said above, or it could be reducing pollution or gasoline usage. Ask the right questions and carefully determine what you are really looking to improve first, then let modern analytics and optimization techniques help you understand what steps you need to take to get to that more optimal state.

Where I’ve been

It’s been quite some time since I’ve posted an entry here. It’s been a very busy summer in both my personal life and my business one. I changed jobs within IBM effective August 1: I went from the IBM Software Group, where I co-led the Mobile Enterprise strategy and led Product Management for the WebSphere Application Server, over to IBM’s Research Division. Here in Research I’m the VP for Business Analytics and Mathematical Sciences (BAMS).

This is actually a return to Research for me. I spent 1984 to 1999 in the Mathematical Sciences Department, as it was called then, including three years away at Princeton finishing my Ph.D. in theoretical mathematics. In the time since I left Research, I held various jobs in IBM Corporate and in the Software Group, working on and leading efforts in web services, standards, open source, Linux, WebSphere, and Mobile.

I am now responsible for a worldwide community of several hundred researchers focusing on basic and applied science in analytics and optimization. I’ve spent a lot of time over the last few weeks meeting my team members, coming up to speed on the work of BAMS as well as the Research Division and, well, doing the job.

It’s very different from what I’ve been doing over the last few years. When I can discuss it, I’ll talk about the work, what it means, why it is important to the industry, and how it will affect us all. In that last sense, I’ll talk about analytics and optimization in general, and not just about what we are doing here.

There’s a lot of confusion about analytics, and my sense is that the term is applied much too widely. That said, there are many more areas of applicability than I think many people realize. So it’s really a question of sharpening the definitions and terms used, and then employing them correctly.

I also plan to get back to some of the things in my personal life that I have not written about recently. For example: yes, the sailboat is in the water, but not Lake Ontario.