Wednesday, March 30, 2011

Patterns of Success - Sam Adams

I first met Sam Adams back in 1992. I was an independent consultant giving advice on object technology and Sam was working at Knowledge Systems Corporation, helping customers learn how to develop applications using Smalltalk.
He had this kind of magic trick where he would sit in front of the computer and ask somebody to describe a business problem and as the person was talking he would be building the application in front of your eyes. Every 5-10 minutes he would present the latest iteration and ask if this was the solution he/she was talking about. Very Agile development before its time. Sam and I both moved on to IBM where we were part of IBM's first Object Technology Practice. In 1996, Sam was named one of IBM's first Distinguished Engineers and has spent the past 10 years in IBM Research.

John - Thanks for joining me on the Patterns of Success interview series. What kind of projects have you been working on recently?

Sam - Last year I worked on IBM's Global Technology Outlook (GTO). Every year IBM Research goes through an extensive investigation of major trends and potential disruptions across all technologies that are relevant to IBM's business. My GTO topic area was peta-scale analytics and ecosystems. This topic emerged from our thinking about commercialization of our current BlueGene high performance computing technology as we push higher toward exascale computing. Another major influence was the coming disruptions in systems architecture anticipated when very large Storage Class Memories (SCM) become affordable over the next 5 years.

John - Let me calibrate this another way. When you talk about BlueGene and peta-scale, how does that compare to the recently popular Watson computer that won the Jeopardy! match?

Sam - In terms of raw computing power, Watson is about an order of magnitude less powerful than a BlueGene/P, which can provide sustained calculations at 1 petaflop.

John - That helps.

Sam - Another trend that we considered, and an area I have been working on for the last three years, is the single-core to multi-core to many-core transition. How are we going to program these things? How are we going to move everybody to a massively parallel computing model? One problem we are working on is that CPU availability is no longer the limiting factor in our architectures. The most critical factor these days is I/O bandwidth and latency. As we move to a petaflop of computing power we need to be able to feed all those cores as well as empty them of results very, very quickly. One of the things we realized is that this scale of compute power will need a new model of storage, something beyond our current spinning disk dominated approach. Most current storage hierarchies are architected on the assumption that CPU utilization is the most important factor. In the systems we envision, that is no longer the case. Current deep storage hierarchies (L1 - L2 - DRAM - Fast Disk - Slow Disk - Tape) have lots of different latencies and buffering built in to deal with the speed of each successive layer. Petascale systems such as those we envision will need a very flat storage hierarchy with extremely low latency, much closer to DRAM latency than that of disks.

John - It seems to me that one of the more significant successes in this area has been the map/reduce, Hadoop movement used by Google for their search engine. How does the research you are working on compare/contrast to this approach?

Sam - We see two converging trends, the supercomputing trend with massively parallel computing being applied to commercial problems, and a trend of big data / big analytics which is where Hadoop is being used. The growth of data on the internet is phenomenal, something like 10-fold growth every five years. The business challenge is how do you gain insight from all this data and avoid drowning in the flood. Companies like Google and Amazon are using Hadoop architectures to achieve amazing results with massive data sets that are largely static or at least "at rest". In the Big Data space, we talk about both data-at-rest and data-in-motion. The storage problem and map/reduce analytics are largely focused on massive amounts of data at rest. But with data-in-motion you have extreme volumes of fast moving data with very little time to react. For instance, imagine dealing with a stream of data like all the transactions from a stock exchange being analyzed in real-time for trends. IBM has a product called InfoSphere Streams that is optimized for such data-in-motion applications.
So the combination of many-core supercomputers, data-at-rest analytics, and data-in-motion analytics at the peta-scale is where the leading edge is at today.

John - So with data-in-motion stream analytics, is one not limited by the performance of the front-end dispatcher, which looks at each event in the stream and then decides where to pass it? If the stream keeps doubling, will that component not eventually choke?

Sam - Everything is bound by the ingestion rate. However, the data is not always coming in on the same pipe. Here you are getting into one of the key architectural issues... the system interconnect. Most data centers today use a 1GbE or 10GbE interconnect. This becomes a bottleneck, especially when you are trying to move hundreds of terabytes of data all around the data center.
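
To put rough numbers on that bottleneck, here is a back-of-the-envelope sketch. It assumes a single link running at full line rate with no protocol overhead, and decimal units (1 TB = 10^12 bytes); the function name is hypothetical and real transfers would be slower.

```python
def transfer_hours(terabytes: float, link_gbps: float) -> float:
    """Hours needed to move `terabytes` of data over one link of `link_gbps`.

    Assumption: ideal line rate, no protocol overhead, decimal units.
    """
    bits = terabytes * 1e12 * 8            # bytes -> bits
    seconds = bits / (link_gbps * 1e9)     # gigabits per second -> bits per second
    return seconds / 3600


# Moving 100 TB over a 1 Gb/s link versus a 10 Gb/s link.
for gbps in (1, 10):
    print(f"{gbps:>2} Gb/s: {transfer_hours(100, gbps):.1f} hours")
```

Under these assumptions, 100 TB takes roughly a day on a 10 Gb/s link and more than nine days on a 1 Gb/s link, which is why the interconnect dominates at this scale.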

John - So as much as we hold Google up as a massive computing system with its exabytes of storage and its zillions of processors, it is dealing with a very parallel problem, with all the search queries coming in over different communications infrastructure to different data centers dealing with random data sets. Compare this to a weather forecasting application that can reduce the problem to separate cells for parallel operation but must assemble all these results to produce the forecast.

Sam - The most difficult parallel computing problems are the ones that require frequent synchronization of the data and application state. This puts a severe strain on I/O, shared resources, locking of data, etc. 
At the end of the day the last bastion we have for performance improvements is in reducing the latency in the system. And to reduce end-to-end latency, we must increase the density of the system. Traditional chip density has just about reached its limit because of thermal issues (There has been some work at IBM Zurich that could shrink a supercomputer to the size of a sugar cube). Beyond increasing chip density there has been a growth in the number of cores, then the number of blades in a rack, and the number of racks in a data center, and the number of data centers that can be shared for a common problem. While each tier of computing increases the computing power enormously, the trade-off is that the interconnect latency increases significantly and eventually halts further improvement in overall system performance.
One big area for innovation in the next 5-10 years will be how do we increase this system density, primarily by reducing the interconnect latency for each computing tier. The ultimate goal would be for any core to access any memory element at almost the same speed.

John - So in your area of research on high performance computing, particularly working with customers who have tried to adopt some of these emerging ideas, what have been the successful outcomes, and did customers do anything special to be successful? I guess because you are in IBM Research, even the work with a customer is considered an experiment with a high risk of failure.

Sam - If you look at the whole shift towards massive parallelism, the successes have, unfortunately, all been in niches. I say unfortunately because we would love to have some general solution that applies to all computing problems. Take the example we spoke of earlier, with Google using massive parallel computing to solve its search problem. They have optimized their solution stack from the hardware up through the OS to their application architecture. It solves their problem but it is a niche solution.
The functional programming folks have introduced solutions like Haskell that support concurrency and parallelism. The problem with functional programming is that the programming model provided in the various languages is not intuitive enough and is difficult for the large majority of programmers to grasp. Contrast this with the success of the object oriented movement. The programming model mapped cleanly to the real world and still allowed the programmer to manage the organizational complexity.

John - And in the OO programming model each object is separated from other objects by a defined set of sending and receiving communications. So, in theory, these objects could be distributed and run concurrently.

Sam - We need something like that to be successful with high performance parallel computing... a programming model that allows someone to develop in the abstract without explicitly thinking about the issues involved with the underlying system implementation, and then a very clever virtual machine that can map the code to the chips / cores / blades / servers / data centers so that the best performance is achieved.

John - It seems like some of the successes have been because the nature of the problem happened to fit the ability of the technology at that time. 

Sam - To a point. For example in the Google search problem it is often quite challenging for the programmer to figure out the map and reduce details so that it works efficiently. So successes have been niche areas where the application was exploited to successfully use parallelism.
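
For readers unfamiliar with the model, the map and reduce steps can be sketched on a toy word-count problem. This is only an illustration running in a single process; it is not Google's or Hadoop's implementation, and all function names here are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values emitted for one key."""
    return key, sum(values)


def word_count(documents):
    # Each map_phase call is independent, so a real system could run
    # them on different machines; here they simply run one after another.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data at rest", "big data in motion"])
print(counts)
```

Word counting splits cleanly into independent map calls and a simple sum. The hard part Sam describes is finding such a split for problems that do not decompose so neatly.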

John - Like with weather forecasting. Because the forecast is based on the combination of many cells, with each cell representing the physical conditions within a given space, then calculations for each cell are the same with the results varying depending on the initial conditions. To increase the accuracy of the forecast, increase the number of cells in the model. The algorithm stays the same. You just need more resources. 

Sam - If you increase the number of cells (for example going from a 10km resolution to a 1km resolution) you also have to increase the frequency of the calculation because the physical conditions change more rapidly for any one cell at that resolution. This requires a lot more resources. But the algorithm does stay basically the same. An excellent example of a niche solution. IBM Research actually did this with a project called Deep Thunder.

John - Now tell me about some failures to launch. Examples of where the technology just did not work out as expected, and some of the reasons why.
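
The resource growth can be estimated with simple arithmetic. The sketch below assumes the cell count grows with the refinement factor raised to the number of refined dimensions, and that the time step must shrink in proportion to the cell size; both are illustrative assumptions rather than details of any particular forecasting model, and the function name is hypothetical.

```python
def relative_cost(old_km: float, new_km: float, dims: int = 2) -> float:
    """Relative compute cost when refining grid spacing from old_km to new_km.

    Assumptions: cell count scales as refinement**dims, and the number of
    time steps scales linearly with the refinement factor.
    """
    refinement = old_km / new_km
    cells = refinement ** dims   # more cells to cover the same area
    steps = refinement           # more frequent calculation per cell
    return cells * steps


print(relative_cost(10, 1))          # horizontal grid only: 1000.0
print(relative_cost(10, 1, dims=3))  # vertical refined as well: 10000.0
```

Under these assumptions, a tenfold finer resolution costs on the order of a thousand times more computation, even though the per-cell algorithm is unchanged.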

Sam - Rarely do I see the issue being the emerging technology. More often, it is the surrounding ecosystem of people, business models, and other systems not willing to adapt to the disruption the emerging technology introduces. Could we have built an iPhone thirty years ago? Well maybe. But it would not have mattered. The ecosystem was not in place: a wireless internet, an app store business model, third party developers building apps like Angry Birds or Twitter, a generation of consumers familiar with carrying cell phones. All these elements needed to be in place. Somebody has to come up with a compelling application of the emerging technology that demonstrates real value in order to move people over Moore's chasm.

John - So bringing us back to the area of emerging high performance computing... Is this a reason why IBM develops computers like Watson? To demonstrate a compelling application of the technology?

Sam - We tackle these grand challenge problems for a couple of reasons. One of them is to actually push technology to new levels. But the other is to educate people on what might be possible. After developing Watson to solve a problem on the scale of Jeopardy! we will see pilots using data in fields like medicine and energy and finance. Domains that have enormous amounts of unstructured data. 

John - Final topic is THE NEXT BIG THING. In the area of high performance computing what do you think we will see in about three years that will be a disruptive innovation?

Sam - I think there will be a widespread adoption of storage class memory. This means hundreds of gigabytes to petabytes (on high end systems) of phase change memory or memristor-based memory. Flash memory will be used early on but it has some issues that will not let it scale to the higher end of what I envision. What you are going to see is a movement away from disk based systems. Even though disks will continue to decrease in cost, you reach a tipping point where the cost of the storage class memory is cheap enough when you consider the 10,000 times lower latency.
The other significant change will be the many-core processors available for servers. By many-core, I mean at least 100 cores. This will dramatically increase the capacity for parallel processing on typical servers and open up fresh territory for innovation.
Taken together, these two trends will produce systems that are very different architecturally from those we see today. For example, we will see the emergence of operating systems based on byte addressable persistent memory instead of the classic file metaphor. Content-addressable memories will also become more common, which will support more biomorphic styles of computing.

John - So if this three year projection of many-core processors and storage class memory comes to pass, how will our day-to-day lives be different?

Sam - I think you will see a lot more mass customization of information. Custom analytics, tuned to your needs at that time, will produce predictions of what you might be interested in at that very moment. Aside from the obvious retail applications, like the shopping scene in "Minority Report", think how this could impact healthcare, government, engineering and science. Consider how these timely yet deep insights could affect our creativity.

John - Thanks for sharing your insights with us Sam.
