Clusters, Service Models, and Carrier Cloud

If we want to apply cluster techniques to carrier cloud services, we need to first catalog just what kinds of services we're talking about.  Otherwise we can't assess what the feature-hosting mission of carrier cloud technology would be, and without that, we can't assign infrastructure requirements.  You'd think all this would have been done long ago, but as you'll see, the primary focus of function hosting so far has been a single service with limited scope.

Referring back to (and summarizing) my last couple of blogs, a "cluster" is a homogeneous set of resources that acts as a single unit in virtualization and is scheduled/managed by an organizing entity.  Some clusters are "cloud-like" because they support large numbers of tenants as discrete services, and some are more like grid computing, where applications draw on elastic resources, perhaps for multi-tenant missions.

"Services" can be defined in a lot of ways, most of which deal with things like their features.  The features of a service determine the functions the service exploits, which for hosted services means the functions that are deployed.  A software function is what it's written to be, and as with an application, what it does doesn't necessarily have much impact on how it's hosted.  The more relevant thing to look at for hosting or zero-touch automation is the service model.

NFV has focused on a single-tenant provisioned service, and more specifically on edge features for business VPN services that have traditionally been delivered in custom appliances.  Virtual CPE (vCPE) is the most commonly referenced NFV application.  Single-tenant provisioned services are "ordered," "modified," and "sustained" as specific services over their lifespan, meaning they are paid for discretely and have a service-level agreement (SLA) or contract associated with them.
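To make the lifecycle point concrete, here's a minimal sketch of a single-tenant provisioned service as a data structure.  The class, the state names, and the SLA fields are my own illustration, not anything drawn from the NFV specifications:

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class LifecycleState(Enum):
    ORDERED = auto()      # contracted, not yet deployed
    ACTIVE = auto()       # deployed and running under its SLA
    MODIFYING = auto()    # an in-service change is being applied
    TERMINATED = auto()   # contract ended, resources released

@dataclass
class SingleTenantService:
    """One provisioned service instance, e.g. vCPE for one business site."""
    service_id: str
    tenant: str
    sla: dict             # e.g. {"availability": 0.999, "latency_ms": 20}
    state: LifecycleState = LifecycleState.ORDERED
    committed_resources: list = field(default_factory=list)

    def deploy(self, resources):
        # Resources are explicitly committed to this one tenant.
        self.committed_resources = list(resources)
        self.state = LifecycleState.ACTIVE

    def modify(self, sla_changes: dict):
        # A "modified" service stays under contract while the change applies.
        self.state = LifecycleState.MODIFYING
        self.sla.update(sla_changes)
        self.state = LifecycleState.ACTIVE
```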

The most common "service model" in terms of total number of users and total traffic is the multi-tenant shared service, where a service provides mutual connectivity to a large population of users, each of which has its own communications mission.  The Internet is obviously such a service, as are the old public switched telephone network (PSTN) and our mobile cellular services.

A third model of service that has emerged is the foundation or platform service, which is a multi-tenant service that is used not directly but rather as a feature within another service.  A good example of a platform service is the IP Multimedia Subsystem and Evolved Packet Core (IMS and EPC) of mobile networks.  Every customer "uses" IMS and EPC, but an instance of these two isn't set up for each call.

From this, you can probably see that a "service" is something you have to set up and sustain.  Most of our service use today is based on a multi-tenant or foundation model, where the service is used in support of many tenants.  Services have users/tenants, SLAs, lifecycles to manage, and so forth.  The way that tenants map to services is really a function, a piece of service logic, and so it has only an indirect impact on the way we need to think about service lifecycle management, including the deployment of hosted functions.
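Here's one way the three models might be captured in code, with tenant mapping treated as a piece of service logic.  The enum values and the function are illustrative assumptions on my part, not a standard taxonomy:

```python
from enum import Enum
from typing import Optional

class ServiceModel(Enum):
    SINGLE_TENANT = "provisioned per customer with an explicit SLA (vCPE)"
    MULTI_TENANT = "shared and capacity-planned (Internet, PSTN, cellular)"
    FOUNDATION = "consumed by other services, not directly by users (IMS/EPC)"

def effective_tenant(model: ServiceModel, end_user: str,
                     host_service: Optional[str] = None) -> str:
    """Tenant mapping is service logic: for a foundation service, the
    'tenant' is arguably the host service that embeds it, not the end
    users of that host service."""
    if model is ServiceModel.FOUNDATION and host_service:
        return host_service
    return end_user

# A subscriber's call touches IMS, but the tenant of IMS is the mobile
# service itself:
print(effective_tenant(ServiceModel.FOUNDATION, "subscriber-123", "mobile-voice"))
```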

Let's look at the Internet in this light.  The Internet is made up of three basic foundation services.  One is IP networking, one is the Domain Name System (DNS), and one is address assignment (DHCP).  To "get on" the Internet, you get an address (DHCP), open your browser, and click a URL, and DNS translates its hostname to an IP address that the IP network service then uses.  The Internet hosts other "web services" that are accessible once this has been done.
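You can trace the last two steps of that chain with nothing but Python's standard library.  DHCP is assumed to have already happened at the OS level before this code runs, and example.com stands in for whatever URL you clicked:

```python
import socket

hostname = "example.com"                      # the name in the URL you clicked
ip_address = socket.gethostbyname(hostname)   # DNS: name -> IP address
print(f"{hostname} resolves to {ip_address}")

# The IP network foundation service does the rest: connect and exchange traffic.
with socket.create_connection((ip_address, 80), timeout=5) as conn:
    conn.sendall(f"HEAD / HTTP/1.1\r\nHost: {hostname}\r\nConnection: close\r\n\r\n".encode())
    print(conn.recv(200).decode(errors="replace"))
```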

These three service types frame out the cluster/hosting implications fairly nicely, because they speak to an important point about all services, which is the nature of the binding between the logical/functional dimension and the physical/resource dimension.  Think of a cluster or resource pool as a literal pool from which stuff is withdrawn.  Some service models pull resources out and take them up toward, or into, the service layer, and others leave them below in the pool.

The service types frame the management model, by framing the requirements.  The paramount question in service management, and thus in zero-touch automation, is the relationship between the resources used by the service and the service-level agreement (SLA).  It’s this relationship that we have to map to cluster behavior.

Single-tenant services like vCPE are provisioned and managed per user, which means that resource behavior is tightly bound to each service; problems have to be remediated at the resource level in conformance with each service's SLA.  A resource assigned to such a service is actually assigned; there may be a sharing model beneath it (virtual machines sharing a server), but the thing that's assigned is committed.  Put another way, tenant awareness extends downward into the resource pool.

The fact that these services use committed (if often virtual) resources means that per-service remediation is likely the basis for sustaining operations within the SLA.  If something breaks, you have to either replace it 1:1 or redeploy some or all of the service to restore correct operation.  Some autonomous behavior at the resource level might help, as would be the case with adaptive routed networks, but that behavior is actually a multi-tenant service behavior, as we'll see.
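A minimal sketch of that remediation logic might look like the following; the classes and names are hypothetical, and a real orchestrator would check far more than latency:

```python
class Resource:
    """A pool resource with the one SLA-relevant attribute we check here."""
    def __init__(self, kind: str, latency_ms: float):
        self.kind, self.latency_ms = kind, latency_ms

    def meets(self, sla: dict) -> bool:
        return self.latency_ms <= sla["latency_ms"]

def remediate(service: dict, failed: Resource, pool: list) -> None:
    """If a committed resource breaks, first look for a 1:1 substitute
    that still satisfies this service's SLA; failing that, redeploy."""
    service["resources"].remove(failed)
    substitute = next((r for r in pool
                       if r.kind == failed.kind and r.meets(service["sla"])), None)
    if substitute is not None:
        pool.remove(substitute)                # commit it: no longer shared
        service["resources"].append(substitute)
    else:
        redeploy(service, pool)                # restore correct operation wholesale

def redeploy(service: dict, pool: list) -> None:
    # Stand-in for re-running full deployment orchestration for this service.
    ...
```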

Because there are obviously going to be a lot of instances of any single-tenant service, and because of the explicit service/resource association, these services require considerable orchestration, and they're subject to the problem of cascading faults.  A broken trunk connection might impact a thousand services, and while fault correlation might be expected to address that problem at one level, a substitute trunk could well not provide the SLA needed by some or all of the services impacted.  At the least, there might be a better solution for some than simple rerouting.
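The fault-correlation step amounts to fanning one resource fault out to every service that rode on it and sorting by SLA impact.  Here's an illustrative sketch, assuming hypothetical dict structures for trunks and services:

```python
def triage_trunk_failure(failed_trunk: str, substitute: dict,
                         services_by_trunk: dict) -> tuple:
    """Fan one resource fault out to every service riding on it, then
    split the impacted services by whether the substitute trunk can
    actually honor their SLAs.  Simple rerouting serves the first group;
    the second needs per-service remediation."""
    impacted = services_by_trunk.get(failed_trunk, [])
    reroutable = [s for s in impacted
                  if substitute["latency_ms"] <= s["sla"]["latency_ms"]]
    needs_remediation = [s for s in impacted if s not in reroutable]
    return reroutable, needs_remediation
```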

The cluster implications for this model should be obvious.  You can't do simple 1:1 resource substitution; you have to rely on more complex service lifecycle management.  A cluster tool would need to be cloud-like to start with, since hosted components are single-tenant (like cloud applications), and you'd need service automation throughout.

Multi-tenant services are capacity-planned, meaning that the resource pool is sized to support the performance management guidelines set for the service, and there may also be admission control exercised to prevent the number of tenants drawing on the service from overloading the resource pool.  The resource pool could also adapt to changes in overall tenant or traffic load and replace “broken” components by redeploying.

In these multi-tenant capacity-planned services, the SLA is probabilistic rather than deterministic.  You set a probability level for the risk of failure and build to keep things operating below that risk point.  As long as you can do that, you're "successful" in meeting your SLA, even if once in a while some users get unlucky.  The SLA itself can be tight or loose (the extreme of the latter being best-efforts), but whatever it is, it's the average that counts.
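The classic version of this arithmetic is the Erlang-B recursion that was used to size PSTN trunk groups, and it's a decent sketch of any probabilistic-SLA capacity plan; the 100-erlang, 1% figures below are just illustrative:

```python
def erlang_b(offered_load_erlangs: float, servers: int) -> float:
    """Blocking probability for a capacity-planned pool, via the
    standard Erlang-B recursion."""
    b = 1.0
    for k in range(1, servers + 1):
        b = (offered_load_erlangs * b) / (k + offered_load_erlangs * b)
    return b

def servers_needed(offered_load_erlangs: float, max_blocking: float) -> int:
    """Capacity planning: the smallest pool that keeps the risk of
    failure below the SLA's probability target."""
    n = 1
    while erlang_b(offered_load_erlangs, n) > max_blocking:
        n += 1
    return n

# 100 erlangs of demand with a 1% blocking target needs roughly 117 trunks;
# individual callers still occasionally get unlucky, but the SLA holds on average.
print(servers_needed(100.0, 0.01))
```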

Because you’re playing the law of averages, reconfiguring to meet a capacity plan is a very different thing from single-tenant reconfiguration.  The service layer doesn’t have any explicit resource commitments, even to virtual resources.  What it has is a capacity-planned SLA, and so you could easily define resource pool failure modes and simply reconfigure into one of them when you have a problem.  This means that cluster software could easily handle most of the problem management and service lifecycle tasks.
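As a sketch of what defining failure modes and reconfiguring into them might look like, here's an illustrative table-driven approach; the mode names and configuration fields are assumptions, not drawn from any particular cluster scheduler:

```python
# Hypothetical predefined failure-mode configurations for a
# capacity-planned pool, each verified in advance against the SLA.
FAILURE_MODES = {
    "normal":       {"replicas": 10, "regions": ["east", "west"]},
    "region_loss":  {"replicas": 10, "regions": ["east"]},           # run hot in one region
    "partial_pool": {"replicas": 6,  "regions": ["east", "west"]},   # degrade within the SLA
}

def reconfigure(problem: str) -> dict:
    """No per-tenant orchestration needed: just drop the pool into the
    predefined mode that matches the problem."""
    return FAILURE_MODES.get(problem, FAILURE_MODES["normal"])
```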

The in-between or foundation services are the final consideration.  Like multi-tenant services in general, foundation services will generally have a capacity-planned SLA, but often there will be more specific stipulations of things like response time, because the “host” service that references the platform service likely has some broad timing constraints.  An example is easily found in mobile services, where calls tend to be abandoned if the user doesn’t get some signal of progress fairly promptly.

Could a foundation service slip across the boundary into requiring a more explicit SLA?  I think it could, which would create a multi-tenant class with more specific virtual-resource commitments and more explicit orchestration of scaling and redeployment.  There's no reason why, in this case, we couldn't regard the "user" of the platform service as being the host service into which it's integrated, rather than the users of that host service.  Everyone with a phone isn't a user of IMS; they're a user of the system that includes it.

The future of services is already visible in the Internet—a collection of foundation services bound into a larger service model, supporting many users concurrently.  That mission is certainly not the focus of NFV today, and it’s not even a perfect match to current thinking on cloud computing.  It may be something that tips more to the traditional grid missions, but that’s a question we probably can’t address until we get more enlightened thinking on the role of foundation services in the future of networking.