What Would the “Right” Model for SDN, NFV, and Virtualization Look Like?

There are sometimes concrete answers to abstract questions.  In some cases, the reason why those answers don’t seem to gain traction or acceptance is that they come from a different sector.  So it is today with network transformation or “next-gen” networks.  We have spent half-a-decade framing principles that were already framed better elsewhere, and with every day that passes, those “elsewheres” are moving forward toward even-more-useful developments, while networking seems stuck in neutral.

It didn’t have to be this way.  Back in 2013, when I worked with a group of vendors to launch the CloudNFV initiative, we submitted a proof-of-concept proposal to ETSI, which was the first to be formally approved.  In 2014, when my commitment to offer free services to the activity expired, the PoC was modified considerably, but I’ve posted the original PoC document HERE for review by those who missed it.  I want to call your attention in particular to Section 1.2, PoC Goals, and to the eight goals defined there.  What you will see is that every single goal that’s emerged from years of subsequent work was explicitly stated in that section.  Interoperability?  It’s in there.  Conformance to current management practices?  There too.  Infrastructure requirements?  There.  Onboarding?  There.  Portability?  There too.

The PoC defined a software architecture built around a data model (“Active Contract”) that conformed to the principles of the TMF’s NGOSS Contract.  The architecture was mapped to the ETSI E2E model, and the early PoC phases were intended to demonstrate that the architecture as defined would conform to that model’s requirements and the underlying ETSI work, and would also scale.  The first phase of the PoC was demonstrated to the two sponsor operators (Sprint and Telefonica) before the PoC goals and structure were changed.

We picked the name “CloudNFV” for a reason; we believed that NFV was first and foremost a cloud application.  You can see in the implementation the same principles that are emerging in the cloud today, particularly in Amazon’s AWS Serverless Platform.  We have controllable state and small processes that draw their state information from a data model, making them scalable and distributable.  In short, the functionality of NFV and the scalability of an implementation were designed into the model using the same principles we’ve since evolved for the cloud, which is what most cloud technologists would have recommended from the first.
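
To make that principle concrete, here’s a minimal sketch of what “drawing state from a data model” means in practice.  Everything in it is illustrative; the class and function names (DataModel, scale_out_handler) are my own, not part of any real NFV or AWS API.  The point is simply that the worker process keeps nothing between invocations, so any copy of it, running anywhere, behaves the same way.

```python
# Minimal sketch: a stateless worker that draws all of its state from a
# shared data model, in the spirit of serverless/lambda functions.
# All names here (DataModel, scale_out_handler) are illustrative, not
# part of any real NFV or AWS API.

class DataModel:
    """A stand-in for the shared service data model (e.g. a document store)."""
    def __init__(self):
        self._store = {}

    def read(self, key, default=None):
        return self._store.get(key, default)

    def write(self, key, value):
        self._store[key] = value


def scale_out_handler(model: DataModel, event: dict) -> None:
    """Stateless handler: all context comes from the model, none from memory."""
    instances = model.read("firewall/instances", default=1)
    if event.get("type") == "load-high":
        model.write("firewall/instances", instances + 1)   # decision recorded in the model
    elif event.get("type") == "load-low" and instances > 1:
        model.write("firewall/instances", instances - 1)


if __name__ == "__main__":
    model = DataModel()
    # Any copy of the handler, running anywhere, produces the same result,
    # because the state lives in the model rather than in the process.
    scale_out_handler(model, {"type": "load-high"})
    scale_out_handler(model, {"type": "load-high"})
    print(model.read("firewall/instances"))   # -> 3
```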

I’m opening with this explanation because I want to demonstrate that it is not only possible to take a top-down view of the evolution to virtual networking, it’s also possible to harmonize it with the functional concepts of ETSI and the technology evolution of the cloud.  Having set that framework, I now want to talk about some specific technology rules we could apply to our next-gen evolution.

The clearest truth is that we have to start thinking of networks as being multiplanar.  At the top there’s a critical abstraction, a model of a service, created from a combination of TMF NGOSS Contract principles and modern intent modeling.  Below that, traffic and events live in effectively different worlds.  To the extent that network traffic is either aggregated or represents operator services aimed at linking specific business sites, the traffic patterns of the network are fairly static.  We can visualize future networks as we do present ones, as connections and trunks that link nodal locations that, under normal conditions, stay the same over long periods.
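
To give that abstraction a shape, here’s a minimal sketch of a service modeled as a tree of intent-style elements.  The field names (element_type, sla, children) are hypothetical, and the structure is deliberately trivial; the only point is that each element describes what is wanted, not how or where it will be implemented.

```python
# Illustrative sketch only: a service modeled as a tree of intent-style
# elements.  Field names (element_type, sla, children) are hypothetical.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ServiceElement:
    """One node in the abstract service model: what is wanted, not how."""
    name: str
    element_type: str                    # e.g. "vpn", "firewall", "access"
    sla: Dict[str, float] = field(default_factory=dict)   # intent: latency/availability targets
    children: List["ServiceElement"] = field(default_factory=list)


# A business VPN with a virtual firewall at each of two sites, expressed
# purely as intent; nothing here says where or on what it will be hosted.
service = ServiceElement(
    name="acme-vpn", element_type="vpn", sla={"availability": 0.9999},
    children=[
        ServiceElement("site-a-fw", "firewall", {"latency_ms": 5.0}),
        ServiceElement("site-b-fw", "firewall", {"latency_ms": 5.0}),
    ],
)
```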

However, “long” doesn’t mean forever, and in virtualization it doesn’t mean as long as it does in traditional networks.  Networks today are created by devices with specialized roles.  Those devices, being physical entities, are placed somewhere and form the nexus of traffic flows.  In the virtual world, we have a series of generalized resources that can take on the role of any and all of those physical devices.  You can thus create topologies for traffic flows based on any of a large set of possible factors, and put stuff where it’s convenient.

It’s all this agility that creates the complexity at the event level.  You have a lot of generalized resources and specific feature software that have to be combined and shaped into cohesive behaviors.  Not only does that shaping take signaling coordination, so do the ongoing life-sustaining activities associated with each of the elements being shaped.  This is all complicated by the fact that, since all the resources are inherently multi-tenant, you can’t expose the signaling/event connections to general access or attack.

In the world of virtualized-and-software-defined networks, you have a traditional-looking “structure” that exists as abstract flows between abstract feature points, defined by the service model and its decomposition.  This virtual layer is mapped downward onto a “traffic plane” of real resources.  The “bindings” (to use the term I’ve used from the first) between the two are, or should be, the real focus of management in the future.  They represent the relationships between services (which means informal goals and expectations, ranging up to formal SLAs) and resources (which are what break and have to be fixed or replaced).  Explicit recognition of the critical role of bindings is essential in structuring software and understanding events.
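
A binding doesn’t have to be anything exotic; a sketch like the one below, with invented field names, captures the idea.  Each record ties one abstract element of the service model to one real resource in one domain, and management becomes largely a matter of being able to look those records up in either direction.

```python
# Sketch of a binding: the link between an abstract element of the service
# model and the real resource committed to it.  All names are illustrative.
from dataclasses import dataclass


@dataclass
class Binding:
    service_element: str     # path into the service model, e.g. "acme-vpn/site-a-fw"
    resource_id: str         # the real resource, e.g. "dc-east/server-42/vm-7"
    domain: str              # the location domain that owns the resource
    state: str = "active"    # lifecycle state of the relationship


# Management in this picture is largely a matter of keeping an index of
# bindings in both directions: service to resources, and resource to services.
bindings = [
    Binding("acme-vpn/site-a-fw", "dc-east/server-42/vm-7", "dc-east"),
    Binding("acme-vpn/site-b-fw", "dc-west/server-09/vm-2", "dc-west"),
]

by_resource = {}
for b in bindings:
    by_resource.setdefault(b.resource_id, []).append(b)
```

The reverse index, from resource to services, is what makes the fault handling described below tractable.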

When a service is created, the process starts with the abstract flow-and-feature structure at the top, and is pressed downward by creating bindings through a series of event exchanges.  With what?  On one end, obviously, is the resource being bound.  At the other end, staying with our abstraction, is the model.  Obviously abstract things can communicate only abstractly, so we need to harden the notion of the model, which is the goal of the software architecture.

Logically speaking, a local binding could be created by spawning a management process in the locality and giving it the abstract model as a template of what needs to be done.  We don’t need to have some central thing that’s doing all of this, and in fact such a thing is an impediment to scalability and resiliency.  All we need is a place to host that local process, and the data-model instruction set.  The data model itself can hold the necessary information about the process to permit its selection and launching; then the process takes over.  NFV, then, is a series of distributed processes that take their instructions from a detailed model, but are coordinated by the fact that all those detailed local models have been decomposed from a higher-level model.
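
Here’s a sketch of that “the model names the process” idea, again with invented names (PROCESS_REGISTRY, deploy_firewall).  A generic dispatcher reads the process reference out of the model slice it’s handed and launches it; nothing central needs to know what the process does.

```python
# Sketch: the data model names the management process to run for each
# element, so dispatching is just a lookup-and-launch.  Names are illustrative.
from typing import Callable, Dict

# Registry of available management processes (in practice these would be
# deployable functions or microservices, not in-process callables).
PROCESS_REGISTRY: Dict[str, Callable[[dict], None]] = {}


def register(name: str):
    def wrap(fn):
        PROCESS_REGISTRY[name] = fn
        return fn
    return wrap


@register("deploy-firewall")
def deploy_firewall(model_slice: dict) -> None:
    print(f"deploying firewall in {model_slice['domain']} per {model_slice['sla']}")


def dispatch(model_slice: dict) -> None:
    """The model slice carries the process name; the dispatcher just launches it."""
    PROCESS_REGISTRY[model_slice["process"]](model_slice)


dispatch({"process": "deploy-firewall", "domain": "dc-east", "sla": {"latency_ms": 5.0}})
```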

This is what frames the hierarchy that should be the foundation of the next-gen network software.  We need to “spawn a local management process”, which means that we must first use policies to decompose our global service abstraction into something that looks like a set of cooperative locations, meaning domains.  How big?  There’s no fixed size, but we’d likely say that it couldn’t be bigger than a data center or data center complex that had interconnecting pipes fast enough to make all the resources within look “equivalent” in terms of performance and connectivity.  High-level model decomposition, then, picks locations.  The locations are then given a management process and a piece of the model to further decompose into resource commitments, via those critical bindings.
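
And a sketch of the decomposition step itself, with a deliberately simple placement policy (group each requested feature by the data-center domain nearest its site); a real policy would obviously weigh latency, capacity, and cost.  Every name here is illustrative.

```python
# Sketch: decomposing a global service request into per-domain pieces,
# using a deliberately simple placement policy.  All names are illustrative.
from collections import defaultdict

# Requested features, each tagged with the site it serves.
features = [
    {"name": "site-a-fw", "site": "boston"},
    {"name": "site-b-fw", "site": "dallas"},
    {"name": "site-c-fw", "site": "austin"},
]

# Policy input: which location domain is "local" to each site.
SITE_TO_DOMAIN = {"boston": "dc-east", "dallas": "dc-south", "austin": "dc-south"}


def decompose(features):
    """Group the global model into per-domain slices for local management."""
    slices = defaultdict(list)
    for f in features:
        slices[SITE_TO_DOMAIN[f["site"]]].append(f)
    return dict(slices)


for domain, piece in decompose(features).items():
    # Each domain now gets its own management process plus this piece of the model.
    print(domain, [f["name"] for f in piece])
```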

The bindings are also critical in defining the relationship between locations, which remains important as we move into the in-service phase of operation.  A “primary event” is generated when a condition occurs in a real resource, a condition that has to be handled.  The big management question in virtual networking is what happens next, and there are two broad paths: remediation at the resource level, or remediation at a higher level.

Resource-level remediation means fixing the thing that’s broken without regard for the role(s) it plays.  If a server fails, you substitute another one.  This kind of remediation depends on being able to act within the location domain where the original resource lived.  If I can replace a “server” in the same location domain, that’s fine.  The local management process can be spun up again (there’s no reason for it to live between uses), access the data model, and repeat the assignment process for each of the impacted “services”.  And we know what those are because the local data models, taken together, contain the bindings to that resource.
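
A sketch of that local loop, reusing the hypothetical binding records from earlier: a resource failure looks up every service bound to the failed resource, grabs a spare from the same domain, and re-runs the assignment for each.  If there’s no local spare, the only honest option is to escalate, which is the next topic.

```python
# Sketch: resource-level remediation driven by the binding index.
# A failed resource is replaced, and every service bound to it is re-pointed.
# All names and structures are illustrative.

bindings = [
    {"service_element": "acme-vpn/site-a-fw", "resource_id": "dc-east/server-42", "domain": "dc-east"},
    {"service_element": "beta-vpn/hq-fw",     "resource_id": "dc-east/server-42", "domain": "dc-east"},
]

spare_servers = ["dc-east/server-77", "dc-east/server-78"]


def on_resource_failure(failed_id: str) -> None:
    impacted = [b for b in bindings if b["resource_id"] == failed_id]
    if not spare_servers:
        raise RuntimeError("no local capacity; escalate to the higher-level process")
    replacement = spare_servers.pop(0)
    for b in impacted:
        b["resource_id"] = replacement      # repeat the assignment for each impacted service
        print(f"rebound {b['service_element']} to {replacement}")


on_resource_failure("dc-east/server-42")
```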

Higher-level remediation is needed when replacing the resource locally isn’t possible, or when the problem we’re having doesn’t correlate explicitly to a single resource.  It’s easy to imagine what causes the first of these conditions—we run out of servers, for example.  For the second, the easy examples are an end-to-end event generated at the service level, or a service change request.

So, if the resource remediation process runs with the resources, in the local domain, where does the higher-level process run?  The answer is in the location that’s chosen in the modeling, which is where the resource domains are logically connected.  Data-center domains might logically link to a metro domain, so it’s there that you host the next-level process.  And if whatever happens has to be kicked upstairs, you’d kick it to the next-level domain based on the same modeling policy, which is the logical inverse of the process of decomposing the model.

At any level in the remediation or event-analysis process, the current process might need to generate an upstream event.  That event is then a request for the next-level management process to run, because in a logical, non-fragile model implementation you can only jump to an adjacent level (up or down) when generating events.  A single resource event might follow a dozen service bindings back to a dozen higher-level processes, each of which could also generate events.  This is why event and process management is important.
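
The escalation rule is easy to sketch: each domain knows only its parent in the modeled hierarchy, so an event that can’t be handled locally is handed exactly one level up, never broadcast.  The hierarchy and the stand-in remediation test below are, of course, invented for illustration.

```python
# Sketch: escalation strictly to the adjacent (parent) domain in the
# modeled hierarchy.  The hierarchy itself is illustrative.

PARENT = {
    "dc-east": "metro-boston",
    "dc-west": "metro-seattle",
    "metro-boston": "region-northeast",
    "metro-seattle": "region-northwest",
}


def handle(domain: str, event: dict) -> None:
    if try_local_remediation(domain, event):
        return
    parent = PARENT.get(domain)
    if parent is None:
        print(f"{domain}: top of hierarchy reached, raise a service-level alarm")
        return
    # Only the adjacent level is ever invoked; it applies the same logic in turn.
    print(f"{domain}: escalating {event['type']} to {parent}")
    handle(parent, event)


def try_local_remediation(domain: str, event: dict) -> bool:
    # Placeholder policy: pretend data-center domains are out of capacity.
    return not domain.startswith("dc-")


handle("dc-east", {"type": "capacity-exhausted"})
```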

And the service model controls it all.  It’s the model that holds the specifics, but even the model is distributed.  A given layer in the model has to know/describe how it’s implemented in a local resource domain, and how its adjacent domains (upward toward the services, downward toward the resources) are bound in.  That’s all it needs.  Each piece of functionality runs in a domain and is controlled by the local piece of that global, distributed model.
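
One last sketch, of the slice of the model that a single domain would hold.  The fields are invented, but they’re the minimum the text above implies: how the local piece is implemented, plus references to the adjacent domains above and below.

```python
# Sketch: the slice of the global model that one domain holds.  It knows
# how its piece is implemented locally and which adjacent domains it binds
# to, upward and downward, and nothing else.  Names are illustrative.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModelSlice:
    domain: str                          # the domain this slice lives in
    local_implementation: dict           # how the features are realized here
    upward_domain: Optional[str] = None  # toward the service layer
    downward_domains: List[str] = field(default_factory=list)  # toward resources


slice_for_metro = ModelSlice(
    domain="metro-boston",
    local_implementation={"aggregation": "evpn", "controller": "local-sdn"},
    upward_domain="region-northeast",
    downward_domains=["dc-east", "dc-boston-2"],
)
```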

But it’s scalable.  There is no “NFV Manager” or “MANO” or even Virtual Infrastructure Manager in a central sense; you spin up management functions where and when you need them.  As many as you need, in fact.  They would logically be local to the resources, but they could be in adjacent domains too.  All of these processes could be started anywhere and run as needed because they would be designed to be stateless, as lambda functions or microservices.  Amazon, Google, and Microsoft have already demonstrated that this can work.

This is how the cloud would solve next-gen networking challenges.  It’s how SDN and NFV and IoT and 5G and everything else in networking that depends to any degree on virtual resources and functions should work.  It’s what cloud providers are already offering, and what network vendors and operators are largely ignoring.

All you have to do in order to validate this approach is look at how Amazon, Google, and Microsoft are evolving their public cloud services.  All of this has been visible all along, and even today it wouldn’t take an inordinate amount of time to create a complete implementation based on this approach.  I think that’s what has to be done, if we’re really going to see network transformation.  The future of networking is the cloud, and it’s time everyone faced that.