Month: December 2016
Looking Deeper into Nokia’s Deepfield Deal
Like most players in the network space, Nokia is eyeing SDN and NFV with a mixture of hope and fear. I’d guess for Nokia it may be a bit more of the latter, because major changes in the market could upset the merger of Nokia and Alcatel-Lucent, the latter being an example of almost perpetual merger trauma itself. Now, Nokia has announced…drum roll…a new acquisition, Deepfield, to improve their network and service automation. The obvious question is whether this makes any sense in the SDN/NFV space. A less obvious question is whether it makes sense without SDN or NFV.
Virtualization creates a natural disruption in network and service management because the actual resources being used don’t look like the virtual elements that the resources are assigned to support. A virtual router is router software running in a data center, on servers, using hypervisors, vSwitches, real data center switches, and a bunch of other stuff that no router user would expect to see. Because of this disconnect, there’s a real debate going on over just how you manage virtual-based networks. The disagreement lies in just how the virtual and real get coordinated.
If you looked at a hypothetical configuration of a totally virtualized (using NFV and SDN) IP VPN service, you’d see so much IT that it would look like application hosting. Imagine what happens, then, when you have a “server fail” event. Do you tell the router management system the user has connected to that you have a server failure? Hardly. Broadly, your realistic options are to try to relate a “real” resource failure to a virtual condition, or to just fix everything underneath the virtual covers and never report a failure unless it’s so dire that causal factors are moot.
To put the latter option more elegantly, one approach to virtualization management is to manage the virtual elements as intent models with SLAs. You simply make things happen inside the black box to meet the SLAs, and everyone is happy. However, managing this way has its own either/or choice—do you manage explicitly or probabilistically?
Explicit management means knowing what the “virtual to real” bindings are, and relating the specific resource conditions associated with a given service to the state of the service. You can do this for very high-value stuff, but it’s clearly difficult and expensive. The alternative is to play the numbers, and that (in case you were wondering if I’d gotten totally off-point) is where big data, analytics, and Deepfield come in.
Probabilistic network management is based on the idea that if you have a capacity plan that defines a grade of service, and if you populate your network with resources to meet the goals of that plan, then any operation that stays within your plan’s expected boundaries meets the SLAs with an acceptable probability. Somewhere, out there, you say, are the resources you need, and you know that because you planned for the need.
This only works, of course, if you didn’t mess up the plan, and if your resources don’t get messed up themselves. Since it’s complex to determine whether a massive, adaptive, multi-tenant, multi-application network or service is running right, and just how it’s wrong if it isn’t, you need to look at a bunch of metrics and do some heavy analytics. The more you can see and analyze, the more likely you are to obtain a correct current state of the network and services. If you have a decent baseline of normal or acceptable states, that gets you a much higher probability that your wing-and-a-prayer SLA is actually being met when you think it is.
Many people in the industry, and particularly in the telco space, think explicit management is the right answer. That’s why we hear so much about “five-nines” today. The fact is that almost none of the broadband service we consume today can be assured at that level, and almost none of it is explicitly managed. Routers and Ethernet switches don’t deliver by-the-second SLAs, and in fact the move to SDN was largely justified by the desire to impose some traffic management (besides what MPLS offers) on the picture. In consumption terms, consumer broadband is swamping everything else already, it’s only going to get worse, and consumers will trade service quality for price to a great degree. Thus, it’s my view that probabilistic management is going to win.
That doesn’t mean that all you need to manage networks is big data, though. While probabilistic management based on resource state is the basis for a reasonable management strategy for both SDN and NFV, there’s still a potential gap that could kill you, which I’ll call the “anti-butterfly-wings” gap.
You know the old saw that if a butterfly’s wings flap in Japan, it can create a cascade impact that alters weather in New York. That might be true, but we also could say that a typhoon in Japan might cause no weather change at all in nearby Korea. The point is that a network resource pool is vast, and if something is buggered in a given area of the pool there’s a good chance that nothing much is impacted. You can’t cry “Wolf!” in a service management sense just because something somewhere broke.
That’s where Deepfield might help. Their approach adds endpoint, application, or service awareness to the mass of resource data that you’d have with any big-data network statistics app. That means that faults can be contextualized by user, service/application, etc. The result isn’t as precise as explicit management, but it’s certainly enough to drive an SLA as good or better than what’s currently available in IP or Ethernet.
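To make that concrete, here’s a minimal sketch in Python of what that contextualization might look like; the flow index, service names, and priorities are all hypothetical, and a real system would build the index from flow telemetry rather than a hard-coded table.

```python
# Minimal sketch: contextualizing a resource fault by service and customer.
# All names and data are hypothetical; a real system would derive the index
# from flow/endpoint telemetry, not a literal dict.

# Which services and customers currently ride on each resource.
FLOW_INDEX = {
    "server-cluster-7": [
        {"service": "vpn-acme", "customer": "Acme Corp", "priority": 1},
        {"service": "cdn-cache-east", "customer": "shared", "priority": 3},
    ],
    "leaf-switch-12": [
        {"service": "vpn-acme", "customer": "Acme Corp", "priority": 1},
    ],
}

def contextualize_fault(resource_id):
    """Turn a raw resource fault into a priority-ordered impact list."""
    impacted = FLOW_INDEX.get(resource_id, [])
    # Highest-priority (lowest number) services get notified/remediated first.
    return sorted(impacted, key=lambda entry: entry["priority"])

if __name__ == "__main__":
    for hit in contextualize_fault("server-cluster-7"):
        print(f"possible impact: {hit['service']} ({hit['customer']})")
```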
The interesting thing about this approach, which Nokia’s move might popularize, is the notion of a kind of “resource-push” management model. Instead of having the service layer keep track of SLAs and draw on resource state to get the necessary data, the resource layer could push management state to the services based on context. At the least, it could let service-layer processes know that something might be amiss.
At the most, it opens a model of management where you can prioritize the remedial processes you’re invoking based on the likely impact on customers, services, or applications. That would be enormously helpful in preventing unnecessary cascades of management events arising from a common condition; you could notify services in priority order to sync their own responses. More important, you could initiate remedies immediately at the resource level, and perhaps not report a service fault at all.
That’s the key point. Successful service management in SDN and NFV is successful not because it correctly, or at least logically, reflects the fault to the user. It’s successful because no faults are reported because no SLA violations occur. It will be interesting to see how Nokia plays this, though. Most M&A done these days fails because the buyer under-plays its new asset.
Federation, Virtual Network Operators, 5G Slicing, and Their Relationship to SDN/NFV
Every network operator I’ve surveyed has some sort of wholesale/retail relationship with other operators. Most fit into two categories—a relationship that extends geographic scope or one that incorporates one operator’s service inside another (like backhaul or MVNO). Given this, it is natural to assume that services built on SDN and/or NFV would have to be covered by these same sorts of deals. The question is how, and how the need to support these relationships could impact the basic architecture of SDN or NFV. It’s important because the 5G specifications are going to make “slicing” into a new standard mechanism for virtual networking.
To avoid listing a host of relationships to describe what we’re going to talk about here, I’m going to adopt a term that has been used often (but not exclusively) in the market—federation. For purposes of this blog, federation is a relationship of “incorporation”, meaning that one operator incorporates services or service elements from another in its own offerings. We should note, though, that operators are a kind of special case of “administrative domains”, and that federation or sharing-and-incorporation capabilities could also be valuable or essential across business units of the same operator, across different management domains, etc.
We used to have federation all the time, based on intercarrier gateways. Telcos intercall with others, and early data standards included specific gateway protocols—the venerable and now-hardly-used packet standard X.25 used the X.75 gateway standard for federation. All of this could be called “service-level” federation, where a common service was concatenated across domains. Federation today happens at different levels, and creates different issues at each.
There is one common, and giant, issue in federation today, though, and it’s visibility. If a “service” spans multiple domains, then how does anyone know what the end-to-end state of the service is? The logical answer is that they have a management console that lets them look, but for that to work all the federated operators have to provide visibility into their infrastructure, to the extent needed to get the state data. Management visibility is like taxation: the power to invoke it can be the power to destroy. No operator wants others to see how their networks work, and it’s worse if you expect partners to remediate problems, because a partner actually exercising management control could lead to destabilizing events.
The presumption we could make to resolve this issue is that service modeling, done right, would create a path to solution. A service model that’s made up of elements, each being an intent model that asserts a service-level agreement, would let an operator share the model based only on the agreed/exposed SLA. If we presumed that the “parameters” of that element were exposed at the top and derived compatibly by whatever was inside, then we could say that what you’d see or be able to do with the deployed implementation of any service element would be fixed by the exposure. In this approach, federation would be the sharing of intent-modeled elements.
This is a big step to solving our problems, but not a complete solution. Most operators would want to have different visibility for their own network operations and those of a partner. If I wholesale Service X to you, then your network ops people see the parameters I’ve exposed in the relationship, but I’d like my own to see more. How would that work?
One possibility is that of a management viewer. Every intent model, in my thinking, would expose a management port or API, and there’s no reason why it couldn’t expose multiple ones. So, a given element intent model would have one set of general SLA and parametric variables, but you’d get them through a viewer API, based on your credentials. Now partners would see only a subset of the full list, and you as the owner of the element could define what was exposed.
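Here’s a minimal sketch of that viewer idea, with hypothetical roles and parameter names: the element keeps its full parameter set, and each credential class gets only the agreed subset.

```python
# Minimal sketch of a credential-filtered management viewer on an intent model.
# Parameter names and roles are hypothetical.

class IntentModelElement:
    def __init__(self, full_mib, views):
        self._full_mib = full_mib      # everything the owner can see
        self._views = views            # role -> list of exposed parameter names

    def management_view(self, role):
        """Return only the parameters exposed to this credential class."""
        exposed = self._views.get(role, [])
        return {name: self._full_mib[name] for name in exposed if name in self._full_mib}

service_x = IntentModelElement(
    full_mib={"availability": 99.95, "latency_ms": 22, "hosting_site": "metro-3-dc-1"},
    views={
        "owner_ops": ["availability", "latency_ms", "hosting_site"],
        "federation_partner": ["availability", "latency_ms"],  # no internal detail
    },
)

print(service_x.management_view("federation_partner"))
print(service_x.management_view("owner_ops"))
```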
Another possibility is an alias element. We have a “real” service element, which decomposes however it has to into a stream of stuff toward the resources. We have another element of the same kind, which is a fork upward from this real element. Internal services compose the real element, but you expose the alias element in federation, and this element contains all the stuff that creates and thus limits the management visibility and span of control.
The issues of visibility can be addressed, but there remain two other federation curve balls to catch. One is “nesting” and the other is “foundation services”.
Nesting is the creation of a service framework within which other services are built. A simple example is the provisioning of trunks or virtual wires, to be used by higher-layer Ethernet or IP networks. You might think this is a non-issue, and in some ways it might be, but the problem that can arise goes back to management and control. Virtual resources that create an underlayment have to be made visible in the higher layer, but more importantly the higher layer has to be constrained to use those resources.
Suppose we spawn a virtual wire, and we expect that wire to be used exclusively for a given service built at L2 or L3. The wire is not a general resource, so we can’t add it to a pool of resources. The implications are that a “layer” or “slice” creates a private resource pool, but for that to be true we either have to run resource allocation processes on that pool that are pool-specific (they don’t know about the rest of the world of resources, nor does the rest of the world see them) or we have to define resource classes and selection policies that guarantee exclusivity. Since the latter would mix management jurisdictions, the former approach is best, and it’s clearly federation. We’re going to need something like this for 5G slicing. The “slice-domain” would define a series of private pools, and each of the “slice-inhabitants” would then be able to run processes to utilize them.
The key point in any layered service is address space management. Any service that’s being deployed knows its resources because that’s what it’s being deployed on. However, that simple truth isn’t always explicit; you almost never hear about address spaces in NFV, for example. Address spaces, in short, are resources as much as wires are. We have to be explicit in address management at every layer, so that we don’t partition resources in a way that creates collisions if two layers eventually have to harmonize on a common address space, like the Internet. We can assign RFC 1918 addresses, for example, to subnets without regard for duplication across federated domains because they were designed to be used that way, with NAT applied to link them to a universal address space like the Internet. We can’t assign Internet-public addresses that way unless we’re willing to say that our parallel IP layers or domains never connect with each other or with another domain—we’d risk collision in assignment.
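A small illustration of the point, using example prefixes only: two federated domains reuse the same RFC 1918 subnet internally, and NAT maps each onto a distinct prefix in the shared address space so nothing collides.

```python
# Illustration: overlapping private (RFC 1918) subnets in two federated domains,
# made non-colliding by NAT onto distinct shared-space prefixes.
# All prefixes here are examples only.
import ipaddress

domain_a_private = ipaddress.ip_network("10.1.0.0/24")   # reused in both domains
domain_b_private = ipaddress.ip_network("10.1.0.0/24")

# NAT pools in the common (inter-domain or Internet-facing) address space.
nat_pool = {
    "domain_a": ipaddress.ip_network("203.0.113.0/25"),
    "domain_b": ipaddress.ip_network("203.0.113.128/25"),
}

def nat_map(domain, private_net, private_ip):
    """Map a private host address to its domain's NAT pool by host offset."""
    offset = int(ipaddress.ip_address(private_ip)) - int(private_net.network_address)
    return nat_pool[domain].network_address + offset

print(nat_map("domain_a", domain_a_private, "10.1.0.5"))  # 203.0.113.5
print(nat_map("domain_b", domain_b_private, "10.1.0.5"))  # 203.0.113.133
```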
The other issue I noted was what I called “foundation services”. We have tended to think of NFV in particular as being made up of per-customer-and-service-instance hosted virtual functions. Some functions are unlikely to be economical or even logical in that form. IMS, meaning cellphone registration and billing management, is probably a service shared across all virtual network operators on a given infrastructure. As IoT develops, there will be many services designed to provide information about the environment, drawn from sensors or analysis based on sensor data. To make this information available in multiple address spaces means we’d need a kind of “reverse NAT”, where some outside addresses are gated into a subnet for use by a service instance. How does that get done?
How do we do any of this in SDN, in NFV? Good question, but one we can’t really answer confidently today. As we evolve to real deployments and in particular start dealing (as we must) with federation and slicing, we’re going to have to have the answers.
Looking Deeper into Cisco’s Decision to Drop Intercloud
Cisco probably didn’t surprise too many people with its announcement that it was shutting down its Intercloud public cloud business. It’s been a stretch from the first, an attempt by Cisco to get parity with other IT players like IBM and HPE, both of whom have diddled with or exited their public cloud positions. You have to wonder, though, why Cisco thought this was a good idea to start with, and also whether it should be teaching us something about the public cloud.
Vendors who sell gear into data centers have always had a love/hate relationship with the public cloud. The value proposition of cloud computing as it was first presented is simple; economy of scale would let cloud providers eat most or all of IT spending. It follows, if this is true, that the sum of sales of data center gear would have to be lower in a cloud-based IT future, so vendors would sell less. However, in the near term, cloud provider sales represent an incremental pop. In today’s current-quarter-driven financial market, short term trumps strategic every time.
Cisco also had a competitive concern, as I’ve noted. It was becoming clear to everyone that data center deployment was the strategic priority for both server and network vendors. Even WAN equipment sales tend to be pulled along by strategic leadership in the data center, leadership that Cisco comfortably held up to the cloud age. Now, with cloud servers more strategic than networks, IBM and HPE were threatening that dominance, and Cisco had to jump into the server space and then the cloud space to defend.
You could argue that these issues fully explain Cisco’s decision, but there’s more going on and it’s also possible Cisco factored in a set of more strategic points. The company is trying to realign itself, and that makes companies more forward-thinking than Wall Street pressures would normally allow.
The first of the “mores” is that it’s very clear that IaaS sucks as a business. The margins are low, differentiation is minimal, and it’s really hard to work up a model for “selling” IaaS in a direct sense because of all of this. You “market” the cloud, not “sell” it, and Cisco is still a sales-driven operation. This is exacerbated by the fact that Cisco, having gotten into the cloud for very tactical reasons, really never had a strategic vision for it, and that’s what you can market.
The second point is related; it seems pretty clear that the whole cloud market is going to fall to Amazon and Microsoft. Credit Suisse issued a note on the Cisco decision that said just that, and if there was any method behind Cisco’s initial cloud madness, it was that Cisco would be able to use Intercloud to boost carrier cloud sales (that, after all, is what the name implies, not just “public cloud” services). If carrier cloud isn’t going to do what Cisco hoped, then Cisco’s public cloud mission is compromised out of the box.
The third point is that even the current public cloud market is moving very quickly beyond IaaS, which is about the only kind of cloud service a company like Cisco would be able to field credibly. Microsoft’s Azure is PaaS out of the box, and is being enhanced regularly with cloud-hosted features. Amazon’s cloud is now as much or more a kind of dynamic PaaS as IaaS, with two dozen or more web services to add middleware-like features. Cisco would never be able to meet table stakes in the real cloud game that’s already developing, and the future will bring more of the same.
What can we learn from all of this, in a general cloud market sense? I think a lot, but I also think that the signals for cloud market change run back into 2015, and that the market has successfully ignored them up to now. We may see more head-in-the-sand behavior before we get to enlightenment.
The easy and obvious lesson is that IaaS isn’t the driver of cloud growth any longer. If all you had to do to compete was to push out some servers and connect them with cloud software, IT giants who make the servers (like Dell, HPE, and IBM) would be winning the public cloud game now. If lowest cost is the right approach, which it is with IaaS, then people who can get their servers from their own production lines would be hard to beat.
As I’ve noted in other blogs, what is winning is a new model of the cloud as a distributed, feature-rich, virtual operating system and middleware. This is a battle in which Google, not a powerhouse in IaaS and not even named a cloud winner by Credit Suisse, is also going to be a winner. It’s a battle that Amazon has just shown us (with Greengrass) will be fought not only inside the cloud but at the edge of the cloud in private IT resources.
The cloud of the future is a new model of development first, and a public hosting opportunity second. Those who can promote application development with critical tools are the contenders. Cisco is not such a company, nor could they become one easily. But neither are most other companies. Rackspace knew that and has moved to become a cloud integrator.
At the same time, this shifts cloud momentum back to the IT players. You need to be strongly software-driven to be a contender in the future battle for cloud supremacy, in no small part because the battle won’t be about the cloud at all, but about how we build distributable, efficient applications. The cloud opportunity will be geographically distributed hosting of that new development framework.
Which, ironically, could favor carrier cloud and empower an enlightened Cisco Intercloud model. In fact, Cisco could have led this initiative, an initiative that would also have subsumed SDN and NFV activity. That would have made Cisco the powerhouse in the space where the largest number of future data centers could deploy—the carrier cloud. Next year, my model says that carrier cloud could generate 350 new data centers, no tiny opportunity. In 2018 the number rises to 1,600, then to 3,300, then to 6,700 in 2020, and the growth rate doesn’t plateau till 2022. By 2030 we could add over a hundred thousand carrier cloud data centers.
“Could”, of course, is the operative word. A lot of good things have to happen to create this outcome, and cloud computing services by operators aren’t much of a factor. Even in the peak growth period from 2020 to 2023, public cloud services account for less than a quarter of the drivers for carrier cloud.
This, I think, is the big lesson. Cisco is right in dumping Intercloud because the service it proposed to promote, carrier-offered cloud computing, is not going to be a major contributor to carrier cloud at any point. They’re wrong if they don’t start thinking about their strategy for those drivers that will deploy all those data centers. So are their competitors.
There’s other news that relates to the broader question of carrier cloud. DT is getting out of the web-hosting business, a move that suggests that just being an OTT player isn’t enough to make a network operator a success in the Internet age. That makes credible drivers of carrier-cloud opportunity even more important because it proves that you can’t simply follow the trail of the OTTs and hope to overtake (and overrun) them.
The cloud is going to transform by 2021, but what we’re seeing from Cisco and others is probably more a reaction to what’s driving the change than to a recognition of what the change is, and means. IaaS was never an end-game, it was a very limited on-ramp. We can start to see the end-game emerging, or at least we can if we don’t have blinders on. It’s easy to realize the old approach isn’t working, but much harder to see, and lead, the new one that will work. Will Cisco pass that test? We’ll probably see in 2017.
Getting NFV Orchestration Up to Speed with the Cloud
Whatever else you have, or think you have, the NFV business case will depend on software automation of the service lifecycle. VNFs matter functionally, but only if you can operationalize them efficiently and only if they combine to drive carrier-cloud adoption. The core of any justification for NFV deployment in the near term is operational efficiency. Software automation generates that, if you have it at all. The core for the future is service agility and low cost, and software automation generates that too. Get it right, and you’re everywhere. Get it wrong and you’re nowhere.
Getting software automation right is largely a matter of modeling, but not “modeling” in the sense of somehow picking an optimum modeling language or tool. You can actually do a good job of NFV service modeling with nothing more than XML. What you can’t do without is the right approach, meaning you can’t do without knowing what you’re modeling in the first place. You are not modeling network topology, or even service topology. You’re modeling multi-domain deployment.
A service, from the top, is a retail contract with an SLA. You can express this as an intent model that defines first the service points (ports) where users connect, and then the characteristics the service presents at those points, which is the SLA. In the most general case, therefore, the high-level service looks like a list of port-classes representing one or more types of user connection, and a list of ports within each class. It also represents a list of service-classes available at these ports, again as SLAs.
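Here’s a minimal sketch of that top-level structure as plain data; the names and values are hypothetical, and the same structure could just as easily be serialized as XML or a TOSCA template.

```python
# Minimal sketch of a top-level service as an intent model: port-classes,
# ports, and service-classes expressed as SLAs. Names and values are examples.
ip_vpn_service = {
    "service": "IP-VPN-Gold",
    "port_classes": {
        "branch-ethernet": {
            "ports": ["site-nyc-01", "site-chi-02", "site-lax-03"],
            "access_sla": {"bandwidth_mbps": 100, "availability_pct": 99.9},
        },
        "hq-fiber": {
            "ports": ["site-hq-01"],
            "access_sla": {"bandwidth_mbps": 1000, "availability_pct": 99.95},
        },
    },
    "service_classes": {
        "gold": {"latency_ms": 40, "packet_loss_pct": 0.1, "availability_pct": 99.9},
        "best-effort": {"latency_ms": None, "packet_loss_pct": None},
    },
}
```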
Most people agree with this at this high level. The problem comes when you try to go downward, to deploy this on resources. There we have a whole series of approaches that essentially come down to two—the model decomposition approach and the model deployment approach.
Model deployment is what a lot of vendors have thought of. In model deployment, a service model is linked to a series of scripts that provision resources, manipulate management interfaces, invoke controllers, and so forth. Even if model-deployment implementations of NFV allow for the “nesting” of models to, for example, split an IP VPN service into the VPN Core and Access components, the bulk of the provisioning decisions are made in one giant step.
Model decomposition is different. Here the goal is to define a service from top to bottom in abstract. A Service consists of a Core element and Access. Each of these can then be divided—Access into point-of-attachment features (firewall, etc.) and access connectivity. Those can then be divided into specifics—the specific edge features, the specific connectivity type. You go as far as you can go in decomposition of a service before you start thinking about implementation.
The easiest way to assess these differences is to look at the two logical ends of a service deployment, the service-level and the resource management. What the two approaches do at these extremes will likely determine how broadly useful, or risky, they could be. That’s particularly true when you assume that networks are already populated with different technologies, and multiple vendors within each. We’re now introducing SDN and NFV, and they won’t deploy uniformly or instantly. The network we’re trying to software-control is in a state of confusion and flux technically.
In the model-deployment approach, the process of building an access connection across multiple technology and vendor choices has to be built into the logic of the model that represents the VPN. If we imagine this as a script, the script has to account for any variation in the way something gets deployed because of the technology available there, the vendor and API standards, etc. As a result, there is a risk that you might end up with many, many, versions of “IP VPN”, with each version differing in what technology mix it’s designed to deploy on. Changes in the network because of orderly evolution would break the scripts that depended on the old configuration, and if you wanted to change a service based on a customer order, the change might be radical enough to force you to break down one deployment model and initiate another. Even a failure, one that for example shifts a piece of a service from a VNF to a legacy device, might not be handled except by kill-and-start-over.
At the other end of the service, the top end, having a lot of implementation details puts pressure on the service order and management portals, because if you make the wrong selection of a model element to represent the order, the order might not even be compatible with your choice. You also have to resolve the management portal relationship with all the resource details, and if every model deploys differently on different equipment, the harmonization of the management view would have to be explicitly part of the deployment process, and it would be as brittle as the models were. You could change how a service was deployed and it would then change what the user saw on a management console, or could do in response to what they saw.
You probably think that the model-decomposition approach solves all of this, but that’s not totally true. Model decomposition would typically be based on the notion of “successive abstractions”. A “service” would decompose into “sub-services” like Access and VPN Core. Each of the sub-services would decompose into “feature elements”, and each feature element into a “virtual device” set. Virtual devices could even decompose into “device-class variants” before you actually started putting service pieces onto network or hosting resources. This structure, in and of itself, can contain the impact of changes in infrastructure or service, and it can also make it easier to substitute one “tree” of decomposition (targeting, for example, real Cisco routers) with another (targeting virtual router instances hosted as VNFs). It doesn’t necessarily make things dynamic, and it doesn’t solve the management problem.
What you really need to have to do both these things is an extension of the basic notion of model-decomposition, which is this: since every model element is an intent model known to the outside world only by its properties, which include its SLA, you can manage every model element, and should. If you have an element called “Access”, it has an SLA. Its sub-elements, whether they are divided by administrative domain, geography, or technology, also have SLAs. The SLA of any “superior” object drives the SLAs of the subordinate objects, which at the bottom then drive the network behavior you’re parameterizing, configuring, and deploying. You can provide a user or customer service rep with SLA status at any intent model, and they see the SLA there, which is what they’d expect. Only at the bottom do you have a complicated task of building an SLA from deployed behaviors.
Speaking of the bottom, it’s helpful here to think about a companion activity to the service modeling we’ve been talking about. Suppose that every administrative domain, meaning area where common management span of control exists, has its own “services”, presented at the bottom and utilized in combination to create the lowest-level elements inside the service models? In my ExperiaSphere project I called these resource-side offerings Behaviors to distinguish them from retail elements. They’d roughly correspond to the resource-facing services (RFS) of the TMF, and so my “service models” would roughly map to the TMF customer-facing services (CFS).
Now we have an interesting notion to consider. Suppose that every metro area (as an administrative domain) advertises the same set of behaviors, regardless of their technology mix and the state of their SDN/NFV evolution? You could now build service models referencing these behaviors that, when decomposed by geography to serve the specified endpoints, would bind to the proper behaviors in each of the metro areas. Service composition now looks the same no matter where the customers are.
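A small sketch of that binding step, with hypothetical metro and behavior names: every metro advertises the same behavior set, so decomposition only has to map each endpoint’s metro to the locally published implementation.

```python
# Sketch: binding a service's endpoints to identically named "Behaviors"
# advertised by each metro administrative domain. All names are hypothetical.

# Every metro advertises the same behavior set, whatever its technology mix.
METRO_BEHAVIORS = {
    "metro-nyc": {"AccessConnect": "nyc-sdn-ctrl", "VPNCore": "nyc-mpls-mgr"},
    "metro-chi": {"AccessConnect": "chi-legacy-nms", "VPNCore": "chi-mpls-mgr"},
}

ENDPOINT_METRO = {"site-nyc-01": "metro-nyc", "site-chi-02": "metro-chi"}

def bind_endpoints(endpoints, behavior):
    """Resolve one abstract behavior to its per-metro implementation handle."""
    return {ep: METRO_BEHAVIORS[ENDPOINT_METRO[ep]][behavior] for ep in endpoints}

print(bind_endpoints(["site-nyc-01", "site-chi-02"], "AccessConnect"))
```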
I’ve described this as a service-domain/resource-domain split, and in my own model the split occurs where technical capabilities are linked to functional requirements—“Behaviors” are linked to “service elements”. Above the split, the model is primarily functional and logical, though in all cases a model element that has for example five underlayment elements to integrate would have to control the gateway processes that connect them. Below the split, the goal is primarily technical. You could justify a difference in modeling approach above and below, and even multiple modeling strategies in the resource domain, as long as the same behavior set was presented in the same way to the service domain.
This approach, which combines service-modeling using intent-based abstractions with resource modeling that takes place underneath a single set of “behaviors” abstractions that are presented upward, seems to me to offer the right framework for software automation of the service lifecycle processes. Each of the model abstractions is a state/event machine whose states are self-determined by the architect who builds them, and whose events are generated by the objects immediately above or within. The states and events create, for each model element we have, a table of processes to be used to handle the event in each possible operating state. That’s how service lifecycle management has to work.
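Here’s a minimal sketch of one such state/event machine for a model element; the states, events, and process names are hypothetical.

```python
# Sketch: a model element as a state/event machine with a process table.
# States, events, and handler names are hypothetical.

def start_deploy(elem):   print(f"{elem}: decomposing and deploying subordinates")
def report_active(elem):  print(f"{elem}: SLA met, reporting active to superior")
def remediate(elem):      print(f"{elem}: SLA violated, invoking remediation")

# (current_state, event) -> (process_to_run, next_state)
PROCESS_TABLE = {
    ("ordered",     "activate"):     (start_deploy,  "deploying"),
    ("deploying",   "child_active"): (report_active, "active"),
    ("active",      "child_fault"):  (remediate,     "remediating"),
    ("remediating", "child_active"): (report_active, "active"),
}

def handle_event(element_name, state, event):
    process, next_state = PROCESS_TABLE[(state, event)]
    process(element_name)
    return next_state

state = "ordered"
for event in ["activate", "child_active", "child_fault", "child_active"]:
    state = handle_event("Access:NYC", state, event)
```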
I like TOSCA as a modeling approach for this, but as I said what’s critically important is that the modeling technology support the structure I’m describing. It’s a major plus if there are tools to manage the models available in the open-source community. TOSCA is already gaining acceptance as part of vendor NFV strategies. It’s supported by Cloudify as a specific NFV solution (through Aria, a general modeling tool), Alien4Cloud, and Ubicity. OpenStack’s HEAT is based on it, and so is Amazon’s CloudFormation. OpenTOSCA, an open-source implementation of the model, was used by Jorge Cardoso in a proof-of-concept university project I’ve often cited. There are also some NFV ISG TOSCA documents available. I think TOSCA is the smart choice in the service domain, and also at the top of the resource domain.
What happens in that resource area, below where TOSCA is a good choice? That’s by definition inside an opaque intent model, so in truth it doesn’t matter. Any administrative domain could use any modeling and deployment tools they liked as long as they were published as something like my Behaviors to be used in services. TOSCA would work there overall, and work best IMHO where cloud elements (like VNFs) deploy, but the key to NFV success is to embrace whatever equipment is out there, and demanding a reformulation of the management of that equipment to support a new model wouldn’t make sense.
I think that this software automation approach handles orchestration and full lifecycle management, can easily support wholesale/retail partners, 5G slicing and operating within those slices as a VNO, and any combination of legacy vendors and technologies, as well as SDN and NFV. There may be other ways too, and I’m not trying to say this is the only solution. It is, however, a good solution and one I’m absolutely confident could work as well as any in making SDN and NFV transformation successful. It’s fully described in my ExperiaSphere material (http://www.experiasphere.com/page16.html) and please note that I’ve made all this available and open without restrictions, even attribution, as long as you don’t use the trademarked name. Take a look and see if it works for you, and let me know if you find something better.
Why We Need to Totally Rethink our VNF Strategy to Make NFV Succeed
If we need to apply advanced cloud principles to virtual network functions (VNFs), what exactly would that mean for VNF design, onboarding, and management? This is an important question because the current VNF onboarding process is broken from the operators’ almost universal perspective. Management, in my own view, is at least as broken. Fixing these issues isn’t trivial, and the effort could be wasted if we get behind the market with respect to supporting cloud-think.
A VNF is a hosted version of a network feature/function, something that in most cases would otherwise have been performed by a device. The VNF, like the device, would have some number of port/trunk data-plane connections, and in most cases, would also have one or more management connections. Usually the latter are port addresses associated with the address of the device itself. Thus, the device (and the VNF that represents it in NFV) is a part of the user’s address space.
One thing that the cloud says is absolutely not to be tolerated is that the control elements of the cloud software (OpenStack, etc.) have management and control interfaces that are exposed in the users’ address space. This would allow for hacking of shared infrastructure, which can never be allowed to happen, and a common element in two different address spaces can easily become a router, passing real traffic between what are supposed to be isolated networks.
When you deploy a VNF, then, you are putting the logic in some address space. You have two options; you can put the VNF in the user address space where the port/trunk connections can be directly exposed, or you can put it in a private space and use a NAT facility to expose the port/trunk interfaces. The latter would be the best practice, obviously, since it prevents the NFV hosting and control framework from appearing in the user address space. Practically every home network uses a private IP address (192.168.x.x) and the second option follows that same model for VNFs.
In this approach, the management interfaces present an issue since they are now also in a private address space. If you NAT them to expose them to the user, as would normally be the case, then you cede device management to the user, which makes it hard to provide automated service lifecycle management, even just to parameterize the VNF on deployment. If you share the port somehow by NATing the NFV management system into the user space, you still have the problem of collision of commands/changes. So logically you’d need to have a kind of two-step process. First, you connect the NFV management software to the VNF management interfaces within the private address space. Second, you create a proxy management port within the NFV software, which at the minimum accepts user management commands and serializes them with commands/requests generated by the NFV software.
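A minimal sketch of that two-step proxy idea, with a dict standing in for the VNF’s private MIB: commands from the user side and from the NFV software land in one queue, and a single worker applies them in order so nothing collides.

```python
# Sketch: a proxy management port that serializes user commands and
# NFV-software commands onto a VNF's private management interface.
# The VNF "MIB" here is just a dict; all names are hypothetical.
import queue

class ProxyManagementPort:
    def __init__(self, vnf_mib):
        self._vnf_mib = vnf_mib                # reached via the private address space
        self._commands = queue.Queue()         # single ordered stream of changes

    def submit(self, source, name, value):
        """Accept a SET from either the user side (NATted) or NFV management."""
        self._commands.put((source, name, value))

    def drain(self):
        """Apply queued commands one at a time; no collisions, a clear audit trail."""
        while not self._commands.empty():
            source, name, value = self._commands.get()
            self._vnf_mib[name] = value
            print(f"applied {name}={value} (from {source})")

port = ProxyManagementPort(vnf_mib={"fw_rule_count": 12, "log_level": "warn"})
port.submit("user", "log_level", "info")
port.submit("nfv-mgmt", "fw_rule_count", 13)
port.drain()
```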
In the cloud, whether it’s VMs or containers, everyone involved in deployment is intimately aware of address spaces, what goes into them, and how things are exposed from them. Fundamental to cloud deployment is the establishing of addressability for the components of an application, among themselves and with respect to their users. This is where VNF onboarding should start, and yet we hear little about it.
Suppose we resolve addressing. That’s all great, but it still doesn’t let you onboard VNFs easily. The problem is that each VNF has different management requirements, even across different implementations of the same function. That makes it hard to slot a given VNF in; there’s not likely to be a compatible “socket” for it. Thus, the next step is to use that proxy management port process you’ve created to format the VNF’s specific management interface to a generic interface. Think of this as being a kind of device-class MIB. All firewalls have the same device-class MIB, and if either a real device or a VNF is managed by NFV, its real MIB is proxied by your management port process into (and, for writes, from) the device-class MIB. Thus, the proxy management port includes a kind of plugin (of the kind we see in OpenStack Neutron) that adapts a generic interface to a specific one.
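Here’s a sketch of that plugin, assuming a hypothetical generic firewall device-class MIB and an equally hypothetical vendor MIB layout:

```python
# Sketch: a proxy-port plugin that adapts a vendor-specific firewall MIB
# to a generic device-class MIB. Both MIB layouts here are hypothetical.

GENERIC_TO_VENDOR = {          # device-class name -> vendor's name
    "rule_count": "acl_entries",
    "drops_per_sec": "pkt_discard_rate",
    "admin_state": "oper_mode",
}

class FirewallPlugin:
    def __init__(self, vendor_mib):
        self._vendor_mib = vendor_mib

    def get(self, generic_name):
        """Read a device-class variable by translating it to the vendor MIB."""
        return self._vendor_mib[GENERIC_TO_VENDOR[generic_name]]

    def set(self, generic_name, value):
        """Write a device-class variable back through the vendor MIB."""
        self._vendor_mib[GENERIC_TO_VENDOR[generic_name]] = value

plugin = FirewallPlugin({"acl_entries": 42, "pkt_discard_rate": 7, "oper_mode": "up"})
print(plugin.get("rule_count"))   # 42, regardless of which vendor's VNF is underneath
plugin.set("admin_state", "down")
```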
A VNF that’s ready for onboarding would then consist of the VNF functional element(s) and the proxy management port “plugin” that adapts its MIB to the standard structure of the device class. You could, of course, write a VNF to use the device-class standard, in which case all you’d have to do is NAT the management port into the NFV control/management address space.
So, we’re done, right? Sadly we’re not, as the cloud has already shown. We have DevOps tools to deploy cloud elements, and while these tools go a long way toward standardizing the basic deployment task, you still have lifecycle management issues. What do you do if you get an abnormal condition? If you presume that, for each device class, there is a standard MIB, then it follows that for each standard device class you’d have a standard lifecycle response structure. That means you’d have standard “events” generated by MIB conditions, and these would result in standard responses and state changes. The responses might then result in a SET for a MIB variable. If this is the case, then the stub element in the proxy management port would translate the change as it’s inbound to the VNF (or device).
Even this isn’t quite enough, because some lifecycle conditions aren’t internal to the VNF. An example is the horizontal scaling capability. Horizontal scaling means instantiating multiple parallel instances of a function to improve load-handling. At the minimum, horizontal scaling requires load balancing, but simple load balancing only works if you have truly stateless behavior, and parallel routers (for example) are not truly stateless because packets could pass each other on the parallel paths created, and that generates out-of-order arrivals that might or might not be tolerated by the application. A good general model of scaling is more complicated.
Assume we have a general model of a scalable device as being a striper function that divides a single input flow into multiple parallel flows and schedules each flow to a process instance. We then have a series of these process instances, and then a de-striper that combines the parallel flows into a single flow again, doing whatever is required to preserve packet ordering if that’s a requirement. If you don’t have out-of-order problems, then the striper function is a simple algorithmic load-balancer and there’s no de-striper required except for the collecting of the process flows into a single output port.
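A minimal sketch of that striper/de-striper model, using sequence numbers to restore ordering; the parallel process instances are just placeholders here.

```python
# Sketch: striper -> N parallel process instances -> de-striper that restores order.
# The "instances" are trivial pass-throughs; real ones would be scaled-out VNFs.
import heapq
from itertools import count

def striper(packets, n_instances):
    """Tag each packet with a sequence number and schedule it to an instance."""
    seq = count()
    flows = [[] for _ in range(n_instances)]
    for pkt in packets:
        s = next(seq)
        flows[s % n_instances].append((s, pkt))   # simple round-robin scheduling
    return flows

def de_striper(flows):
    """Merge the parallel flows back into one stream, preserving packet order."""
    merged = list(heapq.merge(*flows))            # each flow is in sequence order
    return [pkt for _, pkt in merged]

packets = [f"pkt-{i}" for i in range(8)]
flows = striper(packets, n_instances=3)
assert de_striper(flows) == packets               # order preserved end to end
print(de_striper(flows))
```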
The point here is that we see all of this in the cloud even now, and we see in most of the service-chained VNFs an example of what might be called a pipelined microservice, particularly if we assume that operators would induce vendors to decompose device software into elements that would then be assembled. Sure, a VNF Component might be simply bound with others into a single image, but the cloud teaches us not to let that special case masquerade as the general one. In the general case, the components are likely to live in a pipeline relationship if the VNF replaces a typical port/trunk device. In the general case, we should be able to distribute these along a data flow.
The cloud is not hosted virtualization, or if it is then it would have limited impact on IT over the long term. NFV is not just a matter of sticking static functions in static boxes. It needs the same level of dynamism as the cloud needs, but the cloud is already moving to support that kind of dynamism, and it’s already accepted the most basic truth, which is that static applications build static solutions no matter what you lay underneath them. VNFs will have to be designed to optimize NFV just as applications have to be designed to optimize the cloud, and it’s way past time for us to accept this and act on the knowledge.
A Flow-and-Lambda Vision of NFV Execution
I hope that convincing you that NFV should evolve in sync with the leading edge of the cloud hasn’t proven too difficult. If it has, then I hope the rest of this series will do the job. If you’re on board, then I hope that the rest gives you a few action items to study. The next step in the series is going to be a philosophical challenge for some. I want you to stop thinking about services as we’re accustomed to, to stop thinking about NFV as “service chains”. Instead I want you to think about a service as being a series of workflows, where functions are not sites through which we chain services, but steps we put in the path of them. That applies both to the data plane and the management and control plane.
All services that matter today have an intrinsic topology, and there are essentially three such topologies recognized today: the LINE or point-to-point, the LAN or multipoint, and the TREE or point-to-multipoint. The workflows, or connection paths, that form these topologies are impressed on infrastructure based on the combination of what the flow needs and where those needs can be met. One of the fundamental challenges that NFV faces is that unless you presume a fairly rich deployment of cloud data centers, particularly close to the service edge, you find yourself looking for a Point B that you can reach from a given service-flow Point A without unnecessary diversion. Longer paths consume more resources in flight, generate more risk of failure because they traverse more things that can fail, and generate more delay and packet loss.
Ideally, you’d like to see an optimized path created, a path that transits the smallest number of resources and takes the most direct route. This is true with legacy technology, virtual functions, or anything between. Where NFV complicates the picture is in the limitations created by where you can host something, relative to the structure of the network. This is no different from the cloud, where in most cases you have a small number of cloud data centers that you have to connect with to host functions there. When the number of data center hosting points is limited relative to the service geography, the optimum routes are distorted relative to “pure” network routes because you have to detour to get to the data centers, then return to the best path. You could lay such a path onto a network map without knowing anything about hosting or virtual functions if you could presume a fairly dense distribution of hosting points. Yet nobody talks about NFV as starting with the formulation of a set of paths to define the optimum route.
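Here’s a small sketch of that path-first view, on an invented topology with invented link costs: compute the natural A-to-B cost, then rank hosting points by how much detour each one adds.

```python
# Sketch: pick the hosting site that least distorts the natural A-to-B path.
# The topology and link costs are invented for illustration.
import heapq

LINKS = {  # undirected graph: node -> {neighbor: cost}
    "edge-A": {"agg-1": 2, "agg-2": 5},
    "agg-1": {"edge-A": 2, "core-1": 3, "dc-east": 1},
    "agg-2": {"edge-A": 5, "core-1": 2},
    "core-1": {"agg-1": 3, "agg-2": 2, "edge-B": 4, "dc-central": 2},
    "dc-east": {"agg-1": 1},
    "dc-central": {"core-1": 2},
    "edge-B": {"core-1": 4},
}

def shortest(src, dst):
    """Plain Dijkstra over the small graph above."""
    dist, seen = {src: 0}, set()
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node in seen:
            continue
        seen.add(node)
        if node == dst:
            return d
        for nbr, cost in LINKS[node].items():
            if d + cost < dist.get(nbr, float("inf")):
                dist[nbr] = d + cost
                heapq.heappush(heap, (dist[nbr], nbr))
    return float("inf")

direct = shortest("edge-A", "edge-B")
for dc in ("dc-east", "dc-central"):
    detour = shortest("edge-A", dc) + shortest(dc, "edge-B") - direct
    print(f"{dc}: adds {detour} to the natural path")
```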
This is the first of many places where we nail NFV to the traditional ground of legacy technology. We are trying to define network functions that are as near to location-independent as cloud density permits, and then we start by defining where things go based on abstract policies. We should be going about this in a totally different way. The only fixed points in an NFV hosting plan are the service edge points, the places where users connect. What that means is that we would normally want to assume that primary hosting responsibility lies right at those edges. You put as much there as you can, hosted in CPE or in the edge office. You then work inward from those edge points to site additional elements, always aware that you want stuff to be where the natural service flows pass, and you want technology associated with specific features to be hosted where those features appear in service logic.
A VPN service is access devices linked to access pipes, linked to VPN onramps or gateways. You can see that as we start to move inward from the edge we find places where geography and topology create concentrations of access pipes, which means that we could add some onramp or aggregation features there. We could also, if we had issues with features hosted closer to the edge, pull back some of those features along the path of the flow to the next natural hosting point, that point of aggregation or gateway. This approach presumes what I think a lot of operators have deduced for other reasons, which is that services will tend to be better if we can push features close to the user.
If something breaks, and if you have to redeploy to get around the failure, the best strategy will be one that has the least impact on the commitments already made, which in most cases will be one where the substitute resource used is proximate to the original one. A minor break that impacts only one service flow (from edge to gateway or aggregator) won’t change service topology much at all, which means that you don’t have to recommission a bunch of things and worry about in-flight data and delays. A major fault that breaks a bunch of paths would probably mean you’d have to examine failure-mode topologies to find the choices that would result in the smallest number of impacted flows.
If you have to add something to a service, the right place to put the addition would be based on the same service flow analysis. A single site, or a bunch of sites, wants a new feature? First goal is to edge-host it. Second, pull it back to the primary aggregation level behind the user connection points, the place where natural traffic concentration gets multiple flows together.
To make this work, you’d have to assume (as I have already noted) a fairly rich deployment of cloud data centers. In fact, it would be best to have one in every facility that represented a concentration of physical media. Where fiber goes, so goes service flows, and at the junction points are where you’d find ideal interior points of hosting. You’d also have to assume that you had fairly uniform capabilities in each hosting point so you didn’t have a risk of needing a specialized resource that wasn’t available there. You’d also probably want to presume SDN deployment so you could steer paths explicitly, though if you follow a gateway-to-gateway hopping model across LAN and VPN infrastructure you can still place elements at the gateway points and the edge.
The special case of all of this comes back to that functional programming (Lambda). If we viewed all “VNFs” as being pipelined Lambda processes, then we’d simply push them into a convenient hosting point along the paths and they’d run there. Unlike something that we had to put into place and connect, a pipelined function is simply added to the service flow in any convenient data center it transits. You don’t really have to manage it much because it can be redeployed at a whim, but if you want to control its operation you could presume that each Lambda/VNF had a management pipeline and a data pipeline and that it passed its control messages along, or that every hosting location had a management bus to which each could be connected.
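A tiny sketch of the pipelined-function idea, with hypothetical stages and packet format: each stage is stateless, takes a packet in and hands a packet on, so any stage could be dropped into any hosting point the flow already transits.

```python
# Sketch: VNFs as stateless, pipelined functions placed along a data flow.
# The functions and "packet" format are hypothetical.
from functools import reduce

def classify(pkt):   return {**pkt, "class": "video" if pkt["port"] == 443 else "data"}
def police(pkt):     return {**pkt, "conforming": pkt["length"] <= 1500}
def mark(pkt):       return {**pkt, "dscp": 34 if pkt["class"] == "video" else 0}

# The pipeline is just an ordered list; each stage could be hosted in a
# different data center the flow transits, since no stage keeps state.
PIPELINE = [classify, police, mark]

def run_pipeline(pkt):
    return reduce(lambda p, stage: stage(p), PIPELINE, pkt)

print(run_pipeline({"port": 443, "length": 1200}))
```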
The management bus comment is relevant because if we should be aware of a flow model of connectivity and function placement, we should be even more mindful of these concepts when we look at the distribution of the control software used in SDN and NFV. I remarked in my last blog that there might be benefits to using a robust container approach to hosting VNFs because containers seemed to lend themselves more to distributability of the basic control functions—the lifecycle management. Perhaps we can go even further.
A modeled service is a hierarchy of objects, starting with the “service” at the retail level at the top, and then decomposing to an upside-down tree where the tendrils of the branches touch the resources. In a model like this, only the lowest layer has to be made aware of resource state. The higher-level objects, if we follow the abstraction principles to their logical conclusion, would not see resources because they’re inside the black box of their subordinate objects and so are invisible. What these higher-level objects know about is the state of those subordinates. This implies that every object is a finite-state machine that is responding to events generated within the model tree.
If every object is a self-contained state-event process, we could in theory distribute the objects to places determined by the service topology. Objects close to the bottom might live closer to resources, and those at the top closer to services. In effect, all these objects could be serviced by interpretive Lambdas, pushed in a stateless way to wherever we need them and operating off the specific model element for its data and state. This model is a logical extension of how the cloud is evolving, and we need to look at it for NFV, lest we fall into a trap of trying to support dynamic virtualization with static software elements for our management and orchestration.
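Here’s a minimal sketch of that last point, assuming a hypothetical shared model store: the handler keeps nothing between invocations, so it can be spun up wherever it’s needed, and the model element record carries the state.

```python
# Sketch: a stateless, relocatable handler for a model element's events.
# State lives in the model element's record, not in the handler process.
# The store and transition table are hypothetical.

MODEL_STORE = {  # would be a shared repository in practice
    "service-123/access-nyc": {"state": "active", "sla": {"availability": 99.9}},
}

TRANSITIONS = {
    ("active", "child_fault"): "remediating",
    ("remediating", "child_active"): "active",
}

def handle(element_id, event):
    """Load state, apply the transition, write state back, keep nothing locally."""
    record = MODEL_STORE[element_id]
    new_state = TRANSITIONS.get((record["state"], event), record["state"])
    record["state"] = new_state
    MODEL_STORE[element_id] = record
    return new_state

print(handle("service-123/access-nyc", "child_fault"))     # -> remediating
print(handle("service-123/access-nyc", "child_active"))    # -> active
```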
Nothing here is “new”; we already see these trends in the cloud. Remember that my thesis here is that from the first NFV was expected to exploit the cloud, and that means understanding the best of cloud evolutionary trends. We need the best possible implementations to make the best business case, and drive optimum deployment.
Keeping Up with the Cloud: The Developments that MUST Guide SDN and NFV
When we design a transformation strategy for operators today we’re really designing something to be deployed at scale in perhaps 2020 or 2021. The telco world has the longest capital cycle of any technology sector, with elements that are expected to live for ten years in many cases—sometimes even more. It’s critical for virtualization in any form to satisfy current requirements, but it’s just as critical that it support the longer-term trends. Otherwise, transformation capital and effort is vulnerable to being fork-lifted out just as it’s getting started. Are our “virtualization revolution” strategies, like SDN and NFV, looking forward? I don’t think so, at least not far enough.
While the visionary ten operators who launched NFV back in 2012 didn’t envision it in these terms, what they were doing was applying virtualization principles of the time to problems of the time. We do have a fair measure of how those problems are evolving, and so we can presume that the requirements at the business level are known quantities. The virtualization technology is another matter.
The very first NFV paper stated that “Network Functions Virtualization will leverage modern technologies such as those developed for cloud computing.” At the time (October 2012) that meant leveraging IaaS hosting. We are on the cusp of a cloud revolution that’s being created by going beyond IaaS in a decisive way. Doesn’t that mission statement back in 2012 mean that NFV should leverage the technology elements of this revolution too? Surely, given how long it would take to transform networks with NFV, the state of the cloud will have moved forward by the time it’s being broadly deployed. Surely, NFV should then be based on the leading-edge cloud stuff that would prevail at that point.
The evolution of cloud computing, at a high level, is an evolution that takes it from being a passive outsourced-hosting framework to an active fully distributed development platform for applications. We’ve had platform-as-a-service clouds almost from the first (Microsoft Azure is the best-known), but what’s now happening is that IaaS is transforming to a kind of PaaS model that I’ve called “features-as-a-service” or FaaS to distinguish it from the past approach. Both Amazon and Microsoft have added about two-dozen features that in the old days we’d call “middleware” or “services”. These, accessed by APIs, let developers build applications that are specific to the cloud. Some facilitate agile, distributable, cloud development, and others perform tasks that should be done differently (and often can be done better) in the cloud.
This new vision of the cloud is already clearly visible in the announcements of the cloud providers (Amazon’s Greengrass and Snowball most recently). My modeling says that the new cloud vision will transform cloud usage decisively right around the 2020 milestone that’s also when we could expect to see a business-justified NFV model deploying. The signs of the cloud transformation will be even more clear in 2017 than they are today, and they’ll be inescapable by 2018. Thus, IMHO, a failure to consider the impact of this transformation on carrier virtualization in all its guises could stall progress while carrier visions of virtualization catch up with the public cloud.
What would that catch-up involve? What specific major advances in cloud computing should or must be incorporated in the SDN and NFV vision? We can’t answer a question like that in detail in a single blog, but I can present the questions and issues here and develop them in subsequent blogs. So, let’s get started.
The primary attribute of the cloud of the future is that it expects to host fully distributable and scalable elements. IaaS is nothing but hosted server consolidation. Future cloud computing applications will be written to be placed and replicated dynamically. Even today’s NFV specifications envision this kind of model, but they don’t define the basic design features that either virtual network functions (VNFs) or the NFV control software itself would have to adopt to enable all that dynamism.
For the VNFs, the problem is pretty clear. The current approach is almost identical to the way that applications would be run on IaaS. Every application/VNF is different, and that means that no standard mechanism will deploy VNFs or connect them to management systems. There’s a goal of making the VNFs horizontally scalable, but nothing is said about how exactly that could be done. In the emerging FaaS space we have distributed load balancing and, most important, we have state control practices and tools to ensure that the applications that are expected to scale in and out don’t rely on data stored in each instance.
To return to the Amazon announcement, Greengrass software includes what Amazon calls “Lambda”, which comes from “lambda expressions” used to create what’s also known as “functional programming”. Lambda expressions are software designed to operate on inputs and create outputs without storing anything inside. You can string them together (“pipeline”) to create a complex result, but the code is always stateless and simple. It’s ideal for performing basic data-flow operations in a distributed environment because you can send such an expression anywhere to be hosted, and replicate it as needed. It’s waiting, it’s used, it’s gone. If the cloud is moving to this, shouldn’t NFV also be supporting it?
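To see why that statelessness matters, here’s a trivial illustration (not Lambda or Greengrass code, just the principle): a function that keeps a running total inside itself gives different answers depending on which replica you hit, while a stateless version can be replicated freely.

```python
# Illustration of the statelessness point (not AWS Lambda/Greengrass code).

class StatefulCounter:
    """Keeps state inside the instance: replicas drift apart under load balancing."""
    def __init__(self):
        self.total = 0
    def handle(self, value):
        self.total += value
        return self.total

def stateless_handle(value, running_total):
    """State is passed in and returned; any replica gives the same answer."""
    return running_total + value

replica_1, replica_2 = StatefulCounter(), StatefulCounter()
print(replica_1.handle(5), replica_2.handle(7))         # 5 7 : two different "truths"
print(stateless_handle(7, stateless_handle(5, 0)))      # 12, wherever it runs
```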
For the NFV control software, we have a more explicit problem. The framework itself is described as a set of monolithic components. There’s no indication that the NFV software itself can scale, or fail over, or be distributed. What happens if a VNF fails? The NFV people would tell you that it’s replaced dynamically. What happens if the NFV MANO process fails? Who replaces it, given that MANO is the element that’s supposed to do the replacing?
A lower-level issue just as difficult to rationalize is that of VMs versus containers. From the first, the presumption was that VNFs would be hosted in virtual machines, largely because that was all that was available at the time. Containers are now being considered, but they’re fitting into a model that was developed with the limitations of VM deployment in mind. Is that correct?
Container deployment and orchestration are different from VM deployment and orchestration. OpenStack operates on hosts and VMs, but container technologies like Docker and Kubernetes are deployed per host and then separately coordinated across host boundaries. OpenStack is essentially single-threaded—it does one thing at a time. Docker/Kubernetes lends itself to a distributed model; in fact, you have to explicitly pull the separate container hosts into a “swarm” or “cluster”.
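For anyone who hasn’t seen that clustering step, here’s a rough sketch using the Docker SDK for Python; the addresses are placeholders, and in real life the second block would run on a different host. The point is simply that the cluster is something you assemble explicitly, host by host.

    import docker  # Docker SDK for Python

    # On the first host: create the swarm and retrieve a worker join token.
    manager = docker.from_env()
    manager.swarm.init(advertise_addr="10.0.0.1")             # placeholder address
    worker_token = manager.swarm.attrs["JoinTokens"]["Worker"]

    # On each additional host (shown here only for illustration): join the swarm.
    worker = docker.from_env()
    worker.swarm.join(remote_addrs=["10.0.0.1:2377"], join_token=worker_token)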
I’m not suggesting that you could do fully distributed Docker/Kubernetes control processes today, but you could do more there than with VMs and OpenStack, and you could almost certainly transform the container environment into something fully distributable and scalable with less effort than it would take to transform VMs and OpenStack. VMs are, of course, more tenant-isolated than containers, but if VNFs are carrier-onboarded and certified, do you need that isolation? We need to decide, and then either weigh that loss against the gains of distributability or fix the isolation issues.
The final point is networking. You could make a strong argument that both Amazon and Google built their clouds around their networks. Google offers the most public model of what a cloud network should be, and it explicitly deals with address spaces, NAT, boundary gateway functions, and so forth. We are building services with NFV, services that will include both virtual and non-virtual elements, services that need control-level and management-level isolation. Google can do all of this; why are we not talking about it for NFV? I don’t think it’s possible to build an agile public cloud without these features, and so we should be exploiting standard implementations of the features in SDN and NFV. We are not.
Are the management ports of a VNF in the address space of the user service? If not, how does the user change parameters? If so, are the connections between that VNF and the NFV control element called the VNF Manager also in the service address space? If not, how can the VNF be in two spaces at once? If so, doesn’t that mean that user-network elements can hack the VNFM?
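One conventional answer, and I stress this is my illustration rather than anything the specs define, is to give the VNF a leg in two networks and bind the management listener only to the management-side address, so nothing in the user’s service space can reach it. A bare-bones Python sketch, with placeholder addresses:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    MGMT_ADDR = "192.168.100.5"   # placeholder address on the management network
    # The service data plane would use a separate address (e.g., 10.1.1.5) that
    # never exposes this listener to user-network elements.

    class MgmtHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Reachable only via the management-network binding above.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'{"status": "ok"}')

    if __name__ == "__main__":
        HTTPServer((MGMT_ADDR, 8080), MgmtHandler).serve_forever()

Even that simple pattern leaves the questions above open: the VNFM has to be reachable on the management side, and the user still needs some mediated path to change parameters.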
We already have an integration challenge with NFV, one widely recognized by network operators. What we do not address in the specifications for SDN or (more importantly) NFV is going to end up as a bigger integration problem in the near term and an obsolescence problem in the long term. The cloud is already moving forward. SDN’s greatest successes have come in cloud deployments. NFV’s success will depend on effectively piggybacking on cloud evolution. We have a clear indication of what the cloud is capable of, and where those capabilities are heading. We need to align with those capabilities now, before the cost of alignment becomes too high or the lack of it threatens deployment.
Why are Operators Souring on NFV Progress (and Can it Be Fixed?)
An article in Light Reading yesterday caught my eye, because it said that a survey of operators showed a lower level of confidence in meeting their virtualization goals. This isn’t a surprise to me, given that the operators I talk with have expressed little confidence all along. It does offer some insights into what’s going on with NFV and with the broader and more important topic of “transformation”.
One obvious question raised by what I just said is why I claim operators never showed much confidence in NFV when LR is reporting a decline from something higher. It’s probably a matter of who you survey and how. NFV has been driven largely by the CTO organization (what was once called “Science and Technology”), since it’s this body that contributes people to the standards process that’s been the most visible NFV initiative. This group is notoriously fixated on “does it work?” rather than on “does it help?”. Most surveys grab the standards people as subjects, so it’s no surprise that they’d have been looking at NFV through rose-colored glasses.
Nothing succeeds just because you can get it into a lab (unless it’s lab equipment). As NFV or anything else progresses, it has to pass from the technical domain to the financial domain. That transition exposes a bunch of issues that were accidentally left out, deliberately declared out of scope, or were never part of the CTO mission to start off with. In particular, with NFV, the maturation of the testing process has made it clear that there weren’t enough software types involved in the ETSI ISG, and that as a result critical software issues were never addressed.
The biggest problem in that sense is management, which is flatly unworkable and always has been. This manifests itself in the connection between VNFs and the rest of the NFV world. That connection is the place where “integration” at the software level is required; the data-plane connectivity of a VNF is managed largely the way it would be for any piece of software. What was always needed for NFV was a set of firm APIs that provided the linkage, and a management-network model that described how you connected with those APIs while staying secure. Neither was provided.
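To be clear about what I mean by “firm APIs”, here’s a purely hypothetical sketch of the kind of contract a VNF could expose to a generic manager; none of these names exist in the ETSI material, they’re illustrative only.

    from abc import ABC, abstractmethod

    class VnfManagementApi(ABC):
        """Hypothetical 'firm API' a VNF would expose to NFV management software.
        Nothing like this is mandated today; the names are illustrative only."""

        @abstractmethod
        def get_status(self) -> dict:
            """Return health and load indicators in a standard schema."""

        @abstractmethod
        def apply_config(self, parameters: dict) -> None:
            """Push service parameters without exposing the VNF's internals."""

        @abstractmethod
        def scale(self, instance_count: int) -> None:
            """Request horizontal scale-out or scale-in to the given count."""

Any compliant VNF exposing something like this could be dropped into any compliant management framework, which is exactly the plug-and-play outcome discussed below.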
There were also financial issues, but they related to one critical point that I tried to make myself in the summer of 2013. It should have been clear from the first that VNF providers would try to protect and expand their own revenue streams in the transition to virtual functions. The only reasonable way to reduce this opportunism was to focus on open-source VNFs, which, because they were open-source, could have been easily modified to support those standard APIs.
Open-source orchestration to support proprietary VNFs isn’t the most logical approach in the first place, and priority should have been given to open VNFs. I think the ISG was trying too hard to get broad engagement from potential VNF sources, in part because they’d perhaps outnumber network vendors or operators. It’s possible this point is also why the process ended up with such a kludgy management concept; the VNF vendors don’t want to have to create something that conforms to a standard API because it would make plug-and-play substitution too easy.
That’s also why there’s such a blowback against VNF licensing terms. If operators wanted them to be reasonable, they had to be able to use competitive pressure against the VNF vendors. Absent an open framework, and open-source VNF competition, how likely was that? So where we are on the technical points has weakened operator leverage on the commercial terms.
The thing is, reasonable commercial terms are what operators want. They also want open-source software and plug-and-play, and yet it’s clear that the steps needed to realize these goals and maximize their benefits weren’t taken. I think the big eye-opener for NFV survey targets has been coming up against these points as the process of trials advances. It’s not that these problems haven’t been there from the first (they have), but that people successfully ignored them. Now the time when that was possible has passed, so everyone is in agonizing-reappraisal mode.
The CFOs that I’ve talked with, and even the COOs and CIOs, have never been all that convinced that the NFV pie (or the SDN pie) was fully baked. Remember that almost none of the CFOs believed that the current tests would advance the business case, and they didn’t see any initiatives underway that would. But just as the CTO people are facing the NFV march to the sea, the CFOs are starting to see progress. Over the last two months, they’ve been warming to the idea that AT&T’s ECOMP project just might expose all the critical issues in making a business case, and move to address them.
The question is whether we’re asking too much from ECOMP at this point. Yes, AT&T has done a much better job of framing everything needed for transformation of the carrier business model than ETSI did. But as I pointed out in my exploration of ECOMP, there are still issues in management orchestration and connection that aren’t fully described in the ECOMP documents. The code isn’t available yet for review, so I can’t be sure that AT&T has gone far enough.
Then there’s the future. What we’re trying to do with SDN and NFV and carrier cloud at this point is play catch-up. We don’t have an architecture that covers current requirements well enough to make a convincing business case, and that’s the first lesson of the slip in operator confidence that LR is reporting. But the second lesson is that you can’t ignore what’s happening if you want to address it, and there’s more happening in the cloud today than there was in 2012 when the Call for Action paper launched NFV.
The cloud is a new platform for software development, a platform that includes distributability, elasticity, agility, and more. There are common web services already offered by competing cloud providers, services that should be incorporated in any NFV or SDN architecture, and yet they have not been incorporated or (as far as I know) even considered. Even today, we could be shooting way behind the duck, especially considering how long it would likely take to deploy something that met even minimal requirements. We should be looking at the cloud in 2021 as a lesson for NFV development in 2017. I propose to do that, and I hope we can have some interesting discussions on the points!
What’s Really Behind Amazon’s New “Premises-Cloud” Push?
Amazon has been working hard to make the cloud more than just outsourced server consolidation, and its latest move might be its most significant. They’ve announced a distributed platform (hardware and software) that can extend some important AWS API services to nearly anywhere—not only the data center but potentially anywhere you can run Ubuntu (or Amazon) Linux. It’s not exactly how it’s being described in many online stories, but in some ways, it’s more. To get to reality we need to look at what Amazon announced, and what’s happening in the cloud.
The basics of Amazon’s announcement are simple. The software component is called “Greengrass” and it’s designed to provide an edge hosting point for Amazon’s IoT and Lambda (functional programming) tools, to permit forward placement of logic to reduce the impact of propagation delay on the control loops used in a number of process automation and communications applications. The hardware is called “Snowball Edge”, and it’s a secure, high-performance edge appliance for the Snowball high-speed data service Amazon has offered for some time. Snowball Edge offers corporate users the ability to stage large databases securely in the cloud. Snowball Edge appliances can also run Greengrass, which makes the combination a nice choice for edge event management and collection.
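For a sense of what “forward placement of logic” looks like in practice, here’s a sketch of a Lambda-style handler of the kind Greengrass hosts at the edge; I’m assuming the Greengrass Core SDK for Python, and the topic name and threshold are invented for illustration.

    import json
    import greengrasssdk  # Greengrass Core SDK, assumed present on the edge device

    # Publishes through the local Greengrass core rather than hairpinning to the cloud.
    iot = greengrasssdk.client("iot-data")

    def function_handler(event, context):
        """Runs on the edge appliance: filter locally, forward only what matters."""
        if event.get("temperature", 0) > 85:                  # threshold is illustrative
            iot.publish(topic="plant/alerts", payload=json.dumps(event))
        return {"processed": True}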
All of this is logical given that cloud computing is now clearly evolving. In the current thinking, we are not going to “move” applications to the cloud, we’re going to write applications for the cloud. Since the general trend in applications has been toward the worker, new cloud applications would probably be designed to push IT directly into workers’ hands, at the point of activity. That means applications would have to be more real-time and more distributed to be responsive to worker needs. In short, we’re actually moving toward enterprise software that looks a bit like a multi-level hierarchy of processes, with simple ones at the edge and complex ones deep inside.
For Amazon, in a technical sense, the goal here is simple; if you have a cloud data center that’s located hundreds of miles from the user (or even more) then you have the risk of creating enough of a lag between event reception and processing that some real-time tasks are not going to be efficient, or work at all. Amazon’s two services (IoT and Lambda) are both designed to support that very kind of real-time application.
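The arithmetic is simple enough to sketch. Assuming light in fiber covers roughly 200 km per millisecond, a data center several hundred miles away adds a round-trip propagation delay on the order of ten milliseconds before any queuing or processing, which is already significant for tight control loops.

    # Back-of-the-envelope round-trip propagation delay to a distant cloud region.
    distance_km = 1200               # roughly "hundreds of miles" away
    fiber_speed_km_per_ms = 200      # approximate speed of light in fiber
    round_trip_ms = 2 * distance_km / fiber_speed_km_per_ms
    print(f"~{round_trip_ms:.0f} ms round trip before queuing or processing")   # ~12 ms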
The details of the target features are important, IMHO. Most of the IoT features Amazon offers relate to device registration and control, which logically are edge functions; millions of devices (hypothetically) could swamp centralized facilities. Lambda services are really part of a software evolution toward pushing logic to the point of need. A Lambda function is a nubbin of code that is fully, truly stateless in its behavior and can be deployed in any number of instances anywhere it’s useful. You have simple features assigned to simple tasks. There’s no fixed resource assignment, either; Lambdas float around in the cloud, and from a charging standpoint you pay only for what you use. They’re also simple; there’s evidence that you could make Lambda-style functional programming accessible enough that end users with some tech savvy could do it themselves. All these factors make it ideal for distribution to the edge.
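The “no fixed resource assignment” point is visible from the caller’s side too: you invoke a function by name, the cloud decides where it runs, and you’re billed per invocation. A minimal sketch using boto3; the function name and payload are invented for illustration.

    import json
    import boto3

    lam = boto3.client("lambda")

    # The caller neither knows nor cares where this runs; billing is per invocation.
    resp = lam.invoke(
        FunctionName="resizeThumbnail",                     # illustrative name
        Payload=json.dumps({"bucket": "photos", "key": "cat.jpg"}),
    )
    print(json.loads(resp["Payload"].read()))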
Amazon has no real edge presence, of course. They aren’t an IT company, so they don’t sell computers or application software. They have no real mobile phone footprint (their phone wasn’t a success). Could they push functionality to a Fire tablet? Sure, but not unless they developed an approach to function distribution that’s general, simple, and applicable to near-term needs. That sounds like IoT and real-time, like Lambdas and data caches.
The key point here is that this initiative is very focused. Amazon is not trying to get into IBM’s or Microsoft’s computing business or trying to replicate their cloud computing model. Such a move would be a giant step and probably a giant cost, risk, and error for Amazon (which has enough on its plate as it is). I think the Greengrass/Snowball combination is aimed specifically at IoT and the real-time cloud, and I think that if there’s a competitive thrust to it, the thrust is against Microsoft first and Google second.
Microsoft Azure is a platform-as-a-service cloud, and as such it naturally extends itself onto the premises, which means that “Azure components” are also Windows Server components. That makes it easy for Microsoft to build real-time applications that employ “shallow” and “deep” processes. If you look at Google’s Cloud Platform material on IoT, you see the same basic features that Microsoft and Amazon have, but you also see emphasis on Google’s fiber network, which touches most of the major ISPs in peering for content delivery. That gives Google a low-latency model.
IoT, if anyone ever gets a useful handle on it, would be the largest cloud application driver in the market. No cloud provider could hope to survive without a strong position in it, and that includes Amazon. Thus, having a platform to push to the cloud edge and aligning it explicitly with IoT and real-time applications is essential if Amazon is to hold on to its lead in the cloud. Remember that Microsoft already does what Amazon just did; the only thing that’s prevented IoT and real-time from shaking up the leadership race in the cloud has been a lack of market opportunity.
I’ve said in a number of forums that we are starting to see the critical shift in the cloud, the shift from moving to it to writing for it. It’s that shift that creates the opportunity and risk, and since that shift is just starting to happen Amazon wants to nip its negative impacts in the bud and ride its positive waves. That means that this cloud market shift is important for everyone.
First, forget any notion of “everything in the cloud”, which was silly to start with. What the future holds is that model of process caching that I blogged about before. You push processes forward if they support real-time applications or if they need to be replicated on a large scale to support workload. You pull them back when none of that is true, and it’s all dynamic. We have a totally different development model here.
Second, functional or Lambda programming isn’t just some new software gizmo to titillate the geek kingdom. We’re moving from a model where we bring data to applications, toward one where we bring applications (or their components) to data. The logical framework for that is the highly scalable, transportable, Lambda model of software. Lambdas, though, are more likely to be tied to real-time push-it-forward components than to the whole of an application, which says that components will become increasingly language-and-architecture-specialized over time, with shallow and deep parts of software done differently.
Third, while Amazon doesn’t want to compete with IBM and Microsoft in their basic data center business, they probably do cast covetous eyes at the IT business overall in the long term. What Greengrass/Snowball shows is that if the mission is process caching, then a cloud-centric approach is probably the right one. Amazon won’t capture the premises software and hardware opportunity by competing in those spaces, but by making those spaces obsolete, subsuming them into the cloud and thus trivializing them as just another place to host cloud stuff. Marginalize what you cannot win.
What this shows most of all is that we have totally misread the cloud revolution. The original view of just moving stuff to the cloud was never smart, but even the realization that software written for the cloud would be the source of cloud growth undershoots reality. The cloud is going to transform software, and that means everything in IT and consumer tech will be impacted. It’s going to ruffle a lot of feathers, but if you like excitement, be prepared.