What Role Can AI Play in Service Lifecycle Automation?

I hate to open another hype can of worms, but this is a question that has to be asked.  Is there a role for artificial intelligence (AI) in service lifecycle automation, virtualization, SDN, and NFV?  The notion raises the specter of a glittering robot sitting at a console in a network operations center, and surely, we’re going to be seeing media stories along these lines, because excitement and not realism is the goal.  Underneath the overstatement and underthinking, though, there may be some very interesting truths.

I asked an old friend who runs a big NOC my opening question, and the first response was a wry “I’m wondering if there’s a role for natural intelligence there!”  On further consideration, I got what I think is the starting point for the whole discussion, “It depends on what you mean by AI.”

What Wikipedia says about AI is this: “In computer science, the field of AI research defines itself as the study of ‘intelligent agents’: any device that perceives its environment and takes actions that maximize its chance of success at some goal.  Colloquially, the term ‘artificial intelligence’ is applied when a machine mimics ‘cognitive’ functions that humans associate with other human minds, such as ‘learning’ and ‘problem solving’.”

If we take the first, formal definition, it’s pretty clear that service lifecycle automation would fall within the scope of AI.  If we take the second, it’s a bit fuzzier, and to decode the fuzz we have to look deeper at the mission itself.

Service lifecycle automation is based on the notion that a given network service has a desired state of behavior, one that was sold to the service user and designed for by the network engineers who put the whole thing together.  The purpose of service lifecycle automation is to get a newly ordered service into that preferred state and keep it there for the lifetime of the service.  When the service is terminated, any capacity or resources would be returned and the service would no longer be available.

Not even a human operator would be able to perform this task without knowing what the preferred state of the service was, and also the current state of the service.  Generally, NOC personnel managing service lifecycles manually would respond to a condition that changed service state from the goal state, and that response would be expected to get things right again.  This process has become increasingly complicated as services and infrastructure become more complicated, and as a result there’s been growing focus on automating it.

DevOps tools are an example of automation of software deployment tasks, and much of network service lifecycle automation is based on similar concepts.  DevOps supports either the recording of “scripts”, meaning series of steps that can be invoked manually or in response to an event, or the definition of a set of possible states and of the processes that, in each state, would move the service back toward its ideal state.
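
To make the distinction concrete, here is a minimal sketch (in Python, with purely hypothetical state, event, and process names) of the two styles: a recorded script of steps versus a state/event-to-process mapping.

```python
# Hypothetical sketch: two ways DevOps-style tools capture operations knowledge.

# Style 1: a recorded "script", an ordered series of steps invoked manually
# or when a triggering event arrives.
RESTORE_VPN_SCRIPT = [
    "verify_access_circuit",
    "redeploy_edge_instance",
    "reapply_qos_policy",
    "notify_noc",
]

# Style 2: a state/event definition; for each (state, event) pair, the
# process that should move the service back toward its goal state.
LIFECYCLE_MAP = {
    ("active", "link_down"):       "reroute_traffic",
    ("active", "instance_fail"):   "redeploy_edge_instance",
    ("degraded", "link_restored"): "revert_to_primary_path",
}

def handle(state, event):
    """Dispatch an event against the current state; None means 'no rule exists'."""
    return LIFECYCLE_MAP.get((state, event))

print(handle("active", "link_down"))   # reroute_traffic
```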

In AI terms, the person who puts together these scripts would be called a “subject matter expert”, and would be expected to sit down with a “knowledge engineer” to transfer expertise into machine-applicable form.  I would argue that this is what happens when you do DevOps, or when an expert defines the state/event and event/process relationships associated with service lifecycle management.  This is why I think that the first definition of AI is met by the kind of service lifecycle automation I’ve been blogging about for years.

The real AI question, then, is that second part of the definition, coupled with the pithy comment of the NOC manager.  Would there be a value to a “cognitive” component to service lifecycle automation, one that perhaps involved “learning” and “problem-solving?”  If so, what might this component look like, how would it help, and how could it be implemented?

Most NOC people and operations specialists I’ve talked with say that they would not want a service lifecycle automation system to simply run off and do stuff in defiance of specific rules to the contrary, any more than they’d want ops personnel to do that.  That means that if we have a specific condition and specific instructions to do something when that condition arises, the AI system should do it.  No cognition there.

However, every NOC/operations type knows that there comes a time in service lifecycle management known as the “Oh **** moment”.  Everyone in technology operations has experienced one of these.  They usually happen for one of two reasons.  First, a bunch of bad things happen all at the same time.  In state/event terms, this means that you either have a flood of events or you have a combination of events that you never thought about, and didn’t create a state for.  The second reason is that you take what is supposed to be a restorative action and things get worse rather than better.

I saw the latter personally in the healthcare industry.  A seemingly minor parameter error was made in setting up a bunch of routers.  The result was that on the next full working day, connections between workers and critical applications began to fail.  The normal remedial process, which was to simply reset and restore the connections, made things worse.  The final step was to assume that the host software had gone bad and reload/restart.  That killed everything.

You can make a darn convincing argument that machine cognition could have been valuable at this point.  The same can be said in any situation where there are either a bunch of bad things (which implies common cause) or a remediation failure (which implies unexpected consequences).  However, it may well be that these issues are beyond the range of reasonable AI response, at least in the near term.

In my healthcare example, diagnosis of the problem required a combination of experience that was apparently uncommon enough to not occur either in the healthcare provider or the router vendor organization.  Might the necessary skills have been “taught” to AI?  Perhaps, if somebody were willing to fund an enormous collection of subject-matter experts and the even-more-expensive dumping of their brains into a database.

A real advance in AI application to service lifecycle management would have to come, IMHO, from a combination of two factors.  First, we’d need to be able to substitute massive analytics for the subject matter expert.  Collecting data from the history of a given network, or in fact from all networks, might allow us to create the inputs about correct and incorrect operation, and the impacts of various restorative steps, that a knowledge engineer would need.  Second, we need an on-ramp to this more AI-centric state that’s not overly threatening.

What would be useful and incremental, perhaps, is what could be called the “Oh ****” state and an event of the same name.  The reception of the so-named event in any operating state causes a transition to the you-know-what state, where a process set designed to bring order out of presumptive chaos is launched.  That implies a designed-in capability of restoring the state of everything, perhaps by some form of by-domain redeployment.
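
As a rough illustration of the idea, here is a minimal state/event dispatcher sketch (hypothetical states, events, and process names) in which any combination that has no explicit rule falls through to the chaos-recovery state rather than being ignored.

```python
# Illustrative only: a state/event dispatcher with an explicit "chaos" backstop.

RECOVERY_STATE = "oh_no"                 # the "you-know-what" state
RECOVERY_PROCESS = "redeploy_by_domain"  # bring order out of presumptive chaos

RULES = {
    ("active", "link_down"):       ("reroute", "degraded"),
    ("degraded", "link_restored"): ("restore_primary", "active"),
    # Any (state, event) pair not listed here is, by definition, unanticipated.
}

def dispatch(state, event):
    """Return (process_to_run, next_state).  Unknown combinations fall through
    to the recovery state rather than being silently dropped."""
    if (state, event) in RULES:
        return RULES[(state, event)]
    # The "Oh ****" moment: no policy exists, so force the recovery behavior.
    return (RECOVERY_PROCESS, RECOVERY_STATE)

print(dispatch("active", "event_flood_and_remediation_failure"))
```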

There is an AI opportunity here, because it would be difficult for human operators to catalog the states of chaotic network infrastructure.  Analytics and AI principles could be used to match behavior patterns observed in the past with the way the situation developed and how it progressed.  This could then be used to decide what action to take.  In effect, AI becomes a backstop for policies, where policies simply haven’t, or can’t, be developed.

From there, of course, you could expand the scope.  Over time, NOC personnel and operations executives might become more comfortable with the idea of having their own rules of network state and lifecycle management supplanted by machine decisions.  If the automated processes can prove themselves where humans can’t go, it’s much easier to commission them to perform routine tasks.  People, having experience with AI operating where policies can’t easily be formulated, will eventually allow AI to formulate policies in an ever-broader way.

This isn’t going to be an overnight process, though.  AI still has to work from some base of knowledge, and just gathering data isn’t going to deliver everything it needs.  Even a long historical data timeline isn’t the full answer; networks rarely operate with unchanged infrastructure for even a year.  We’ll need to examine the question of how to gather and analyze information on network conditions to get the most out of AI, and we’ll probably need humans to set policies for a long time to come.

Lessons the Fiber Market Can Teach Networking Overall

We already know that fiber technology can be divided into families based on features—long-haul versus access, passive optical versus point-to-point or active reconfigurable.  We might be seeing signs that it is dividing at the business level, into what might be called “tactical fiber” and “strategic fiber”.  If so, that could have major implications for the networking market in general, and for L2/L3 vendors in particular.  Finally, if all this is true, then the quarters reported by fiber firms Ciena and Infinera may demonstrate the state of market recognition of that tactical/strategic difference.

There has always been a tension between the strategic and the tactical in technology.  Do you see the future as something that develops slowly from the present, or something that responds to radical new technologies and opportunities in a totally new way?  Sales people generally don’t like strategy—it takes too long.  Over the last decade, strategy has fallen a bit out of favor even with senior executives and the Street, because the financial markets live quarter by quarter.  However, there are times when tactics won’t get you to where you need to be.

Look at Infinera, whose future according to the Street is subsea transport and data center interconnect (DCI).  OK, we have oceans to cross.  We have data centers to connect.  However, neither application is the operator goal; it’s not like subsea fiber is a migration path for data or that data centers demand interconnection.  There’s a business model lurking here somewhere, or there should be.  There are tactical fiber roles within that model, but if you’re an operator with business problems to solve, do you look for those tactics, or for the strategies that address your problem?

The bellwether quote from the Infinera earnings call (CEO Fallon) is “Looking to the future, we see opportunities in the horizon stemming from architectural evolutions that are at the beginning of their planning phase, particularly around fiber densification in the metro for cable operators and 5G for mobile service providers. Additionally, the enormous growth in cloud services is likely to persist, which will increasingly require the most scalable and cost-efficient transport networks.

During this period of architectural evolution, our technology approach will allow us to deliver the most reliable, high-capacity, power-optimized solutions in form factors that our customers want to deploy, both as integrated solutions and purpose-built platforms.”

The statement says that the future is a product of strategic initiatives by operators, which will develop into tactical opportunities for Infinera.  They’ll supply the elements needed by the buyer to fulfill buyer needs, and that’s a tactical mission.

Ciena serves the same overall markets, and they sell products tactically, but they position themselves much more as a strategic player.  If you look at the transcripts for the current-quarter earnings calls of the two companies, you find a stark difference.

The focus quote from Ciena’s earnings call (CEO Smith) is “…we also introduced Liquid Spectrum, our approach to redefining really how optical networks are fundamentally built. This software driven platform combines our Blue Planet software, WaveLogic Ai chipset and a reconfigurable Photonic Layer. This is a truly unique offering in the market. It is the first dynamic capacity on demand solutions that does not require pre-deploying hardware for peak periods of traffic. And I want to be very clear, our success to-date in the market does not take into account the technology advantages yet to come with WaveLogic Ai, which when combined with our global scale and deep customer relationships positions us extremely well to capture additional market share globally.”

This says that Ciena has a strategic vision for fiber networking, has built symbiotic product elements to support that vision, and is reaping the benefit of the strategic vision in deep customer relationships.  A “unique offering” is something that requires context to plan, context to sell.  It’s not a tactical vehicle at all.

The day after its quarterly announcement, which was a miss on revenue, Infinera’s stock dropped almost 10%.  The day after Ciena’s quarterly announcement, which was a beat on revenue and backed up with strong guidance, its stock gained about the same amount.  So, the obvious question is whether the difference in outcomes was created in large part by the difference between tactical market behavior and strategic behavior.

Do you think there are any major network operators anywhere in the world who believe they can continue with infrastructure plans as they are today?  Are there any who see the future as just a somewhat refined version of the present?  I don’t talk to any with those views, and if that’s a reliable indicator then I think you have to presume that operators are looking for dialogs with vendors who can tell them what the infrastructure plans for the future should be.  Yes, we’ll have submarine cables, metro fiber, data center interconnect.  They’ll fit into a different network and service model, though.

So, we are back to context.  The fiber tactic of the future has to fit the context of the future, not the one of the present.  Vendors who position their wares in such a way as to demonstrate understanding of the future are more credible in positioning for that future context.  Even in an earnings call, and even at the CEO level, Ciena wins in that regard.  That’s especially true given that I think Ciena could have done even better in strategic positioning, and gained even more.

Is the context lesson one that should be learned by other classes of vendor?  Certainly, server vendors or network software vendors should take the issue seriously.  I’ve seen literally half-a-hundred major pitches on “transformation” from vendors that define the term as “transforming your purchase of a competitor’s gear to your purchase of mine.”  Just because the network of the future might need “billing” doesn’t mean that you start your transformation off by buying billing software, or that just buying billing software transforms you.  Operators tell me they value a vendor’s vision of next-generation infrastructure as much or more than they value individual products.  Another vote for strategy.

How about those L2/L3 vendors I mentioned at the opening of this blog?  At one level, understanding the truth is always helpful, so they could benefit in that sense by understanding the strategic context of their buyers.  At another level, facing it often isn’t helpful at all.  Fiber is the strategic lingua franca of the future network, because you can’t have a network without it.  Fiber vendors have a lovely positional advantage because they have a position.  Same for server/software vendors; if you want to have virtualization in the future, then servers and software are inescapable.  For the rest of the network, the other layers, there is no such certainty of mission.  A switch/router vendor might be arguing for a strategic future designed to work without switches or routers.

If we could create a full-optical mesh between users at an affordable price and using low-touch technology, nobody would want an electrical layer at all.  What the electrical layer does is aggregate and distribute traffic to build a service footprint larger by far than a practical fiber footprint.  If you make the optical layer cheaper (in particular, reduce the marginal cost of fiber capacity so fatter trunks aren’t disproportionately expensive) and improve optical-layer connectivity, you reduce the need for the electrical layer.  If you add SDN virtual-path grooming to optics, you reduce things further, and still further if you add hosted instances of switching/routing and SD-WAN technology.  That’s what makes strategic positioning of L2/L3 products difficult; you may be positioning for absence as much as for change.

However, the story of the two fiber families can have a happy ending.  Suppose, just suppose, that it indicates that buyers are starting to look past the ends of their noses, and that some sellers, at least, are seeing the change in focus.  We then have the means and the opportunity to think about that strategic future, and shape it.


What Would the “Right” Model for SDN, NFV, and Virtualization Look Like?

There are sometimes concrete answers to abstract questions.  In some cases, the reason why those answers don’t seem to gain traction or acceptance is that they come from a different sector.  So it is today with network transformation or “next-gen” networks.  We have spent half-a-decade framing principles that were already framed better elsewhere, and with every day that passes, those “elsewheres” are moving forward toward even-more-useful developments, while networking seems stuck in neutral.

It didn’t have to be this way.  Back in 2013 when I worked with a group of vendors to launch the CloudNFV initiative, we submitted a proof-of-concept proposal to ETSI, which was the first to be formally approved.  In 2014, when my commitment to offer free services to the activity expired, the PoC was modified considerably, but I’ve posted the original PoC HERE for review by those who missed the original document.  I want to call your attention in particular to Section 1.2 PoC Goals, and to the eight goals defined there.  What you will see is that every single goal that’s emerged from years of subsequent work was explicitly stated in that section.  Interoperability?  It’s in there.  Conformance to current management practices?  There too.  Infrastructure requirements?  There.  Onboarding is there.  Portability is there too.

The PoC defined a software architecture built around a data model (“Active Contract”) that conformed to the principles of the TMF’s NGOSS contract.  The architecture was mapped to the ETSI E2E model, and the early PoC phases were intended to demonstrate that the architecture as defined would conform to that model’s requirements and the underlying ETSI work, and was also scalable.  The first phase of the PoC was demonstrated to the two sponsor operators (Sprint and Telefonica) before the PoC goals and structure were changed.

We picked the name “CloudNFV” for a reason; we believed that NFV was first and foremost a cloud application.  You can see in the implementation the same principles that are emerging in the cloud today, particularly in Amazon’s AWS Serverless Platform.  We have controllable state and small scalable processes that draw their state information from a data model, making them scalable and distributable.  In short, the functionality of NFV and the scalability of an implementation were designed into the model using the same principles as we’ve evolved for the cloud, which is what most cloud technologists would have recommended from the first.

I’m opening with this explanation because I want to demonstrate that it is not only possible to take a top-down view of the evolution to virtual networking, it’s also possible to harmonize it with the functional concepts of ETSI and the technology evolution of the cloud.  Having set that framework, I now want to talk about some specific technology rules we could apply to our next-gen evolution.

The clearest truth is that we have to start thinking of networks as being multiplanar.  There’s a critical abstraction or model of a service, created from a combination of TMF NGOSS Contract principles and modern intent modeling.  Then we have traffic and events living in literally different worlds.  To the extent that network traffic is either aggregated or representing operator services aimed at linking specific business sites, the traffic patterns of the network are fairly static.  We can visualize future networks as we do present ones, as connections and trunks that link nodal locations that, under normal conditions, stay the same over long periods.

However, “long” doesn’t mean forever, and in virtualization it doesn’t mean as long as it means in traditional networks.  Networks are created today by devices with specialized roles.  Those devices, being physical entities, are placed somewhere and form the nexus of traffic flows.  In the virtual world, we have a series of generalized resources that can take on the role of any and all of those physical devices.  You can thus create topologies for traffic flows based on any of a large set of possible factors, and put stuff where it’s convenient.

It’s all this agility that creates the complexity at the event level.  You have a lot of generalized resources and specific feature software that has to be combined and shaped into cohesive behaviors.  Not only does that shaping take signaling coordination, so do the ongoing life-sustaining activities associated with each of the elements being shaped.  This is all complicated by the fact that since all the resources are inherently multi-tenant, you can’t expose the signaling/event connections for general access or attack.

In the world of virtualized-and-software-defined networks, you have a traditional-looking “structure” that exists as abstract flows between abstract feature points, defined by the service model and its decomposition.  This virtual layer is mapped downward onto a “traffic plane” of real resources.  The “bindings” (to use the term I’ve used from the first) between the two are, or should be, the real focus of management in the future.  They represent the relationships between services (which means informal goals and expectations ranging onward to formal SLAs) and resources (which are what break and have to be fixed or replaced).  Explicit recognition of the critical role of bindings is essential in structuring software and understanding events.

When a service is created, the process starts with the abstract flow-and-feature structure at the top, and is pressed downward by creating bindings through a series of event exchanges.  With what?  On one end, obviously, is the resource being bound.  At the other is, staying with our abstraction, the model.  Obviously abstract things can communicate only abstractly, so we need to harden the notion of the model, which is the goal of the software architecture.

Logically speaking, a local binding could be created by spawning a management process in the locality and giving it the abstract model as a template of what needs to be done.  We don’t need to have some central thing that’s doing all of this, and in fact such a thing is an impediment to scalability and resiliency.  All we need is a place to host that local process, and the data-model instruction set.  The data model itself can hold the necessary information about the process to permit its selection and launching, then the process takes over.  NFV, then, is a series of distributed processes that take their instructions from a detailed model, but is coordinated by the fact that all those detailed local models have been decomposed from a higher-level model.
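
Here is a minimal sketch of that idea, purely my own framing rather than anything from the ETSI work or a real implementation: a stateless local management process that is launched on demand, receives only its piece of the data model, and writes its results (the bindings) back into that model.

```python
# Sketch (my framing, not an ETSI or product artifact): a stateless local
# management process spawned on demand.  It receives only the model fragment
# for its domain; all service state lives in that model, not in the process.

AVAILABLE = {"serverA": 8, "serverB": 4}     # hypothetical free capacity per host

def bind_domain(model_fragment):
    """Decompose one piece of the service model into resource bindings and
    write them back into the model, which is the only persistent state."""
    bindings = []
    for feature in model_fragment["features"]:
        host = next(h for h, cap in AVAILABLE.items() if cap >= feature["cpus"])
        AVAILABLE[host] -= feature["cpus"]
        bindings.append({"feature": feature["name"], "host": host})
    model_fragment["bindings"] = bindings
    return model_fragment

# The fragment names its own manager process, so any hosting point could run it.
fragment = {
    "domain": "metro-east-dc1",
    "manager_process": "bind_domain",
    "features": [{"name": "vFirewall", "cpus": 2}, {"name": "vRouter", "cpus": 4}],
}
print(bind_domain(fragment)["bindings"])
```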

This is what frames the hierarchy that should be the foundation of the next-gen network software.  We need to “spawn a local management process”, which means that we must first use policies to decompose our global service abstraction into something that looks like a set of cooperative locations, meaning domains.  How big?  There’s no fixed size, but we’d likely say that it couldn’t be bigger than a data center or data center complex that had interconnecting pipes fast enough to make all the resources within look “equivalent” in terms of performance and connectivity.  High-level model decomposition, then, picks locations.  The locations are then given a management process and a piece of the model to further decompose into resource commitments, via those critical bindings.

The bindings are also critical in defining the relationship between locations, which remains important as we move into the in-service phase of operation.  A “primary event” is generated when a condition in a real resource occurs, a condition that has to be handled.  The big management question in virtual networking is what happens next, and there are two broad paths—remediation at the resource level, or at a higher level.

Resource-level remediation means fixing the thing that’s broken without regard for the role(s) it plays.  If a server fails, you substitute another one.  This kind of remediation depends on being able to act within the location domain where the original resource lived.  If I can replace a “server” in the same location domain, that’s fine.  The local management process can be spun up again (there’s no reason for it to live between uses), access the data model, and repeat the assignment process for each of the “services” impacted.  And we know what those are because the sum of the local data models contains the bindings to that resource.

Higher-level remediation is needed when replacing the resource locally isn’t possible, or when the problem we’re having doesn’t correlate explicitly to a single resource.  It’s easy to imagine what causes the first of these conditions—we run out of servers, for example.  For the second, the easy examples are an end-to-end event generated at the service level, or a service change request.

So, if the resource remediation process runs with the resources, in the local domain, where does the higher-level process run?  Answer, in the location that’s chosen in the modeling, which is where the resource domains are logically connected.  Data-center domains might logically link to a metro domain, so it’s there that you host the next-level process.  And if whatever happens has to be kicked upstairs, you’d kick it to the next-level domain based on the same modeling policy, which is the logical opposite of the process of decomposing the model.

At any level in the remediation or event analysis process, the current process might need to generate an upstream event.  That event is then a request for the next-level management process to run, because you can only jump to an adjacent level (up or down) in event generation in a logical non-fragile model implementation.  A single resource event might follow a dozen service bindings back to a dozen higher-level processes, each of which could also generate events.  This is why event and process management is important.
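
A small sketch of the adjacent-level rule, with hypothetical element names and a deliberately simplified parent map: a resource event is handled where it occurs, and anything unresolved is re-issued as a single derived event to the parent element, never passed up several levels at once.

```python
# Illustrative sketch of the "adjacent levels only" rule for event generation.

MODEL_PARENT = {                     # hypothetical parent links between model elements
    "serverA": "metro-east-dc1",
    "metro-east-dc1": "metro-east",
    "metro-east": "service-core",
    "service-core": None,
}

def handle_event(element, event, resolved_locally):
    """Handle an event at one model element.  If local remediation fails, the
    element emits exactly one derived event to its adjacent (parent) element;
    it never skips levels."""
    print(f"{element}: received '{event}'")
    if resolved_locally:
        return
    parent = MODEL_PARENT[element]
    if parent is not None:
        # The parent gets its own event and makes its own decision; here we
        # assume it can resolve the issue, ending the escalation.
        handle_event(parent, f"unresolved:{event}", resolved_locally=True)

# A primary resource event that the local domain cannot fix on its own:
handle_event("serverA", "overheat", resolved_locally=False)
```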

And the service model controls it all.  It’s the model that’s specific, but even the model is distributed.  A given layer in the model has to know and describe how it’s implemented in a local resource domain, and how its adjacent domains (upward toward the services, downward toward the resources) are bound in.  That’s all it needs.  Each piece of functionality runs in a domain and is controlled by the local piece of that global distributed model.

But it’s scalable.  There is no “NFV Manager” or “MANO” or even Virtual Infrastructure Manager, in a central sense; you spin up management functions where and when you need them.  As many as you need, in fact.  They would logically be local to the resources, but they could be in adjacent domains too.  All of these processes could be started anywhere and run as needed because they would be designed to be stateless, as lambda functions or microservices.  Amazon, Google, and Microsoft have already demonstrated this can work.

This is how the cloud would solve next-gen networking challenges.  It’s how SDN and NFV and IoT and 5G and everything else in networking that depends to any degree on virtual resources and functions should work.  It’s what cloud providers are already providing, and what network vendors and operators are largely ignoring.

All you have to do in order to validate this approach is look at how Amazon, Google, and Microsoft are evolving their public cloud services.  All of this has been visible all along, and even today it wouldn’t take an inordinate amount of time to create a complete implementation based on this approach.  I think that’s what has to be done, if we’re really going to see network transformation.  The future of networking is the cloud, and it’s time everyone faced that.

The Technical Pieces of a Successful NGN

What do we need, in a technical sense, to advance to next-generation networking?  Forget trite suggestions like “carriers need to change their culture” or “we need to focus on customer experience.”  When has any of that been in doubt, and how long has it been said?  If there are problems that need to be solved, what are they?  Three, in my view.  We need a good service modeling architecture, we need a framework for automating the service lifecycle, and we need to have a strong and scalable management architecture to bind services and resources.

To my mind, defining a good service modeling architecture is the primary problem.  We need one that starts with what customers buy, dives all the way to resource commitments, covers every stage of the service lifecycle, and embraces the present and the future.  Forget technology in this effort; we should be able to do this in the abstract…because service models are supposed to be abstract.  The abstraction should cover four key points.

Point number one is hierarchical structure.  An object in the modeling architecture has to be able to represent a structure of objects that successively decompose from those above.   “Service” might decompose into “Core” and “Access”, and each of the latter might decompose based on technology and/or geography.

Point number two is intent-based elements.  An object in the architecture should have properties that are based on what it does, not how it does it.  Otherwise the object is linked to a particular implementation, which then limits your ability to support evolving infrastructure, multiple vendors, etc.

The third point is per-element state/event-based process mapping.  Each object needs to have a state/event table that defines the operating states it can be in, the conditions it expects to handle, and the processes associated with the state/event combinations.  “If State A and Event B, then Run Process C and Set State X” might be a form of expression.

Point four is that the parameters input into and expressed by elements must be related to the parameters received from the next-higher element and the next-lower elements.  If “Access” and “Core” report normal states then “Service” does likewise.  Any numerical or Boolean properties of an object would, if set, result in something being sent downward to subordinates, and anything below has to be transformed to a common form published by the layer above.
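
To pull the four points together, here is a sketch of what a single object in such a model might carry; the field names are mine, not drawn from any standard.

```python
# Hypothetical model object illustrating the four points above.

service_model = {
    "name": "Service",
    "intent": {"availability": "99.99%", "latency_ms": 40},       # point two: what, not how
    "state": "active",
    "state_events": {                                              # point three
        ("active", "child_fault"): ("remediate", "degraded"),
        ("degraded", "child_ok"):  ("clear_alarm", "active"),
    },
    "children": [                                                  # point one: hierarchy
        {"name": "Access", "state": "active", "children": []},
        {"name": "Core",   "state": "active", "children": []},
    ],
}

def rollup_state(element):
    """Point four: an element's reported state derives from its subordinates;
    if 'Access' and 'Core' are normal, 'Service' reports normal too."""
    child_states = [rollup_state(c) for c in element["children"]]
    if child_states and any(s != "active" for s in child_states):
        element["state"] = "degraded"
    return element["state"]

print(rollup_state(service_model))   # active
```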

The single biggest failing in our efforts to transform services and networks is the fact that we have not done the modeling work.  Let me make this point clearly: without a strong service model, there is no chance of transformation success.  And, folks, we still do not have one.

Service automation is the second problem we have to resolve.  If you have a good service model, then you can define good software processes to decompose the model and link lifecycle processes to states/events.  We’ve only recently started accepting the importance of service automation, but let me make it clear in financial terms.

This year, “process operations” costs, meaning costs directly attributable to service and network operations, will account for 28 cents of each revenue dollar.  If we were to adopt SDN and NFV technology in its limited, pure, standards-based form, at the largest plausible scale, we could reduce 2018 costs by a bit less than two and a half cents.  If we were to adopt service automation principles absent any technology shifts whatsoever, we could reduce the 2018 costs by three cents more per revenue dollar—more than double the savings.  Furthermore, the investment needed to secure the 2.4 cents of SDN/NFV savings would be thirty-eight times the investment needed to secure the 5.4 cents of operations savings.
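
For those who like the arithmetic spelled out, here is the comparison restated, using the figures above in cents per revenue dollar.

```python
# Back-of-envelope restatement of the figures in the paragraph above.
process_opex = 28.0                          # current process operations cost
sdn_nfv_savings = 2.4                        # "a bit less than two and a half cents"
automation_savings = sdn_nfv_savings + 3.0   # "three cents more" = 5.4 cents

print(automation_savings / sdn_nfv_savings)  # ~2.25x: "more than double the savings"
# And per the text, the investment needed to capture the 2.4-cent SDN/NFV savings
# would be roughly 38 times the investment needed to capture the 5.4 cents.
```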

Perhaps one reason it’s complicated is that “service automation” is really a combination of two problems.  First is the already-mentioned lack of a good service modeling architecture.  The second is a scalable software architecture with which to process the model.  It does little good to have one without the other, and so far, we have neither.

I’ve been involved in three international standards activities and two software projects, aimed in part at service automation.  From the latter, in particular, I learned that a good model for scalable service management is the “service factory” notion, which says that a given service is associated with an “order” that starts as a template and is then solidified when executed.  I’ve tried out two different approaches to combining software and models, and both seem to work.

One approach is to use a programming language like Java to author services by assembling Java Classes into the service models.  This creates a combination of an order template that can be filled in, and a “factory” that when supplied with a filled-in template (an “instance”), will deploy and manage the associated service.  You can deploy as many factories as you like, and since all the service data (including state/event data) lives in the instance, any factory can process any event for any service it handles.

The second approach is to have generalized software process a service data model and execute processes based on states and events.  To make this work, you need to make all the service lifecycle steps into state/event behaviors, so things might start with an “Offer” state to indicate a service can be ordered, and progress from there.
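
A minimal sketch of that second approach, with illustrative states, events, and process names rather than anything from a real product: a generalized engine that knows nothing about any specific service and simply runs the model’s state/event rules.

```python
# Sketch of generalized model-driven lifecycle processing.  The "offer" state
# and all process names are illustrative.

LIFECYCLE = {
    ("offer",     "order"):     ("validate_order",    "ordering"),
    ("ordering",  "validated"): ("deploy_service",    "deploying"),
    ("deploying", "deployed"):  ("start_billing",     "active"),
    ("active",    "fault"):     ("remediate",         "degraded"),
    ("degraded",  "restored"):  ("clear_alarm",       "active"),
    ("active",    "cancel"):    ("release_resources", "terminated"),
}

def process_event(service_instance, event):
    """Any copy of this function can process any event for any service,
    because all the context lives in the service instance, not here."""
    key = (service_instance["state"], event)
    if key not in LIFECYCLE:
        return None
    process, next_state = LIFECYCLE[key]
    service_instance["state"] = next_state
    return process     # in a real system this would be dispatched for execution

order = {"id": "svc-001", "state": "offer"}
print(process_event(order, "order"), order["state"])   # validate_order ordering
```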

My personal preference is for the second of the two approaches, but you can see that neither one bears any resemblance to the ETSI End-to-End structure, though I’ve been told by many that the model was never intended to be taken as a literal software architecture.  I think you can fit either approach to the “spirit of ETSI”, meaning that the functional progression can be made to align.

The final technical problem to be resolved in getting to next-gen networking is a management model.  Think for a moment about a global infrastructure, with tens of thousands of network devices, millions of miles of fiber, tens of thousands of data centers holding millions of servers, each running a dozen or so virtual machines that in turn run virtual functions.  All of this stuff needs to be managed, which means it needs attention when something happens.  How?

The first part of the “how?” question is less about method detail than about overall policy.  There are really two distinct management models, the “service management” and the “resource management” models.  Service management says that management policies are set where they’re offered to the customer in terms of SLAs.  You thus report conditions that violate or threaten SLAs, and you use service policies to decide what to do.  Resource management says that resources themselves assert “SLAs” that define their design range of behavior.  You manage the resources against those SLAs, and if you’ve assigned services to resources correctly, you’ll handle services along the way.

We’ve rather casually and seemingly accidentally moved away from formal service management over time, largely because it’s difficult in adaptive multi-tenant services to know what’s happening to a specific service in case of a fault.  Or, perhaps, it would be more accurate to say that it’s expensive to know that.  The problem is that when you start to shift from dedicated network devices to hosted software feature instances, you end up with a “service” problem whether you want one or not.

The goal of management is remediation.  The scope of management control, and the range of conditions that it responds to, has to fit that goal.  We’re not going to be focusing on rerouting traffic if a virtual function goes awry; we’re going to redeploy it.  The conditions that could force us to do that are broad—the software could break, the server, some of the internal service-chain connections, etc.  The considerations relating to where to redeploy are equally diverse.  So in effect, virtualization tends to move us back at least a bit toward the notion of service management.  It surely means that we have to look at event-handling differently.

Real-device management is straightforward; there’s a set of devices that are normally controlled by a management stack (“element, network, service”).  Conditions at the device or trunk level are reported and handled through that stack, and if those conditions are considered “events” then what we have are events that seek processes.  Those processes are large monolithic management systems.

In a virtual world, management software processes drift through a vast cloud of events.  Some events are generated by low-level elements or their management systems, others through analytics, and others are “derived” events that link state/event processes in service models of lifecycle management.  In the cloud world, the major public providers see this event model totally differently.  Processes, which are now microcomponents, are more likely to be thrown out in front of events, and the processes themselves may create fork points where one event spawns others.

The most important events in a next-generation management system aren’t the primary ones generated by the resources.  These are “contextless” and can’t be linked to SLAs or services.  Instead, low-level model elements in modern systems will absorb primary events and generate their own, but this time they’ll be directing them upward to the higher-level model elements that represent the composition of resource behaviors and services.  Where we run the event processes for a given model element determines the source of the secondary events, and influences where the secondary processes are run.

“Functional programming” or “lambda processing” (and sometimes even “microservices”, in Google’s case) is the software term used to describe the style of development that supports these microcomponent, relocatable, serverless, event-driven systems.  We don’t hear about this in next-gen networking, and yet it’s pretty obvious that the global infrastructure of a major operator would be a bigger event generator than enterprises are likely to be.

The event management part of next-generation networks is absolutely critical, and so it’s critical that the industry takes note of the functional programming trends in the cloud industry.  If there’s anything that truly makes the difference between current- and next-generation networks, it’s “cloudification”.  Why then are we ignoring the revolutionary developments in cloud software?

That should be the theme of next-gen networking, and the foundation on which we build the solutions to all three of the problems I’ve cited here.  You cannot test software, establish functional validity of concepts, or prove interoperability in a different software framework than you plan to use, and need, for deployment.  The only way we’re going to get all this right is by accepting the principles evolving in cloud computing, because we’re building the future of networking on cloud principles.  Look, network people, this isn’t that hard.  We have the luxury of an industry out there running interference for us in the right direction.  Why not try following it instead of reinventing stuff?  I’ll talk more about the specifics of a cloud-centric view in my blog tomorrow.

Can We Answer the Two Top Operator Questions on Service Lifecycle Automation?

Operators tell me that they are still struggling with the details of service lifecycle automation, even though they’re broadly convinced that it’s a critical step in improving profit per bit.  There are a lot of questions, but two seem to be rising to the top of everyone’s list, and so exploring both the questions and possible answers could be valuable for operators, and so for the industry.

Service lifecycle automation means creating software processes to respond to service and network conditions, rather than relying on a mixture of network operations center reactions and both contemporaneous and delayed responses in operations systems.  “Something happens, and the software reacts…completely, across all the impacted systems and processes, dealing both with service restoration and SLA escalation and remediation,” is how one operator put it.

That statement raises the first of the operator questions, which is how you can get service lifecycle management to cover the whole waterfront, so to speak.  There’s a lot of concern that we’ve been addressing “automation” in a series of very small and disconnected silos.  We look at “new” issues in orchestration and management, like those created by NFV hosting, but we don’t address the broader service/network relationships that still depend on legacy elements.  They point out that when you have silos, no matter where they are or why/how they were created, you have issues of efficiency and accuracy.

The silo problem has deep roots, unfortunately.  As networking evolved from circuit (TDM) to packet (IP), there was an immediate bifurcation of “operations” and “network management”.  Most of this can be traced to the fact that traditional OSS/BSS systems weren’t really designed to react to dynamic topology changes, congestion, and other real-time network events.  The old saw that OSS/BSS systems need to be “more event-driven” harkens from this period.

The separation of operations and management has tended to make OSS/BSS more “BSS-focused”, meaning more focused on the billing, order management, and other customer-facing activities.  This polarization is accentuated by the focus of operators on “portals”, front-end elements that provide customers access to service and billing data.  Since you can’t have customers diddling with network resources in a packet world, the portalization of planning forces delineation of business operations and network management boundaries.

One way to address this problem is to follow the old TMF model of the NGOSS Contract, which has more recently morphed into TMF053, the NGOSS Technology Neutral Architecture (TNA).  With this model, operations systems are implicitly divided into processes that are linked with events via the contract data model.  Thus, in theory, you could orchestrate operations processes through event-to-process modeling.  That same approach would work for service lifecycle automation, which would provide a unified solution, and you could in theory combine both in a single service model.  Operators like a single modeling language but are leery about having one model define both management and OSS/BSS lifecycles.
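
As a sketch of what that event-to-process steering might look like (my own interpretation of the NGOSS Contract idea, not TMF code), the contract data model, rather than the software, records which process handles which event in which state:

```python
# Hypothetical contract record steering events to processes.

contract = {
    "service_id": "vpn-1234",
    "state": "active",
    "event_map": {
        ("active", "sla_violation"):  "escalate_and_credit",   # OSS/BSS-side process
        ("active", "resource_fault"): "redeploy_component",    # network-side process
    },
}

def steer(contract_record, event, process_registry):
    """Look up the process in the contract's own event map and run it."""
    name = contract_record["event_map"].get((contract_record["state"], event))
    if name:
        return process_registry[name](contract_record)

registry = {
    "escalate_and_credit": lambda c: f"credited customer on {c['service_id']}",
    "redeploy_component":  lambda c: f"redeployed component for {c['service_id']}",
}
print(steer(contract, "sla_violation", registry))
```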

That raises the operators’ second question, which is about events.  Lifecycle management is all about event-handling, and just having a nice data-model-driven approach to steering events to processes doesn’t help if you can’t generate the events or agree on what they mean.  This is particularly important when you are looking at multi-tenant infrastructure where the relationship between a service and infrastructure conditions may be difficult to obtain, and where correlation costs are such that the measures would be impossible to financially justify.

An “event” is a condition that requires handling, and it’s obvious that there are questions on what conditions should create events and how the events should be coded.  Ideally, you’d like to see service lifecycle events standardized at all levels, including how a pair of models—one managing network resources for services and the other operations processes for customers—would use events to synchronize their behavior.  Operators have no idea how that’s going to happen.

Events are critical in service automation, because they’re the triggers.  You can’t automate the handling of something you don’t know about, and “know about” here means having an unambiguous, actionable, signal.  If something that’s dedicated to and an explicit part of a given service breaks, it’s easy to envision that such a signal could be produced, though differences in vendor implementations might require standardization of error conditions.  Where shared or virtualized resources fail it’s another matter.

One problem is that there are a lot of different sources of “events”, and many sources are from different pieces of infrastructure with different status conditions.  A server might report one set of events and a fiber trunk another.  How do you correlate the two, or even understand what they mean?  For example, a server might report an overheat warning.  Does that mean anything in terms of capability?  If a server reports overheating and a fiber trunk overloading, do the two have a common cause?

Another problem is that a condition in infrastructure doesn’t always impact all the services, so you have to understand what the scope of impact is.  A fiber failure surely impacts services that happen to use the fiber, but what services are those?  In order to find out for an IP service, you’d have to understand the addresses involved in the service and the way those addresses were routed.  And it’s more than just the destination address.  Two different users accessing the same server might find that one is impacted by a fiber failure and the other is not.

“Analytics” is the most problematic source of events.  Analytics has a presumptive multiplicity of dimensions that separates it from simple status reporting.  Those added dimensions make it harder to say what a given analytics “prediction” or “event” would mean to services.  Last week, analytics might say, these conditions led to this result.  We already know that the result might be difficult to associate with the state of a given service, but we now add the question of whether the conditions last week had any specific relevance to that service last week.  If not, is there a reason to think we should care now?  Do we then do service-specific correlations for analytics?

Event correlation is critical to the notion of service automation, because it’s critical to establishing service conditions.  You can do infrastructure or resource management a lot easier because the event relationship with infrastructure is clear; events are generated there.  This means that it would probably make sense to separate services and infrastructure so that the services of infrastructure (what I’ve always called the “resource layer”) are presented as services to the service layer.  Then you need only determine if a resource service (“behavior” is the term I use) is meeting the SLA, and you can readily generate standard events relating to SLA conformance.

This leaves us with what are effectively three layers of modeling and orchestration—the OSS/BSS layer, the network services layer, and the resource/behavior layer.  This multiplication seems to make operators uneasy at one level, and comforts them at another.  Lots of layers seems to indicate unneeded complexity (likely true), but the layers better reflect current structures and dodge some important near-term issues.

We don’t have nearly enough dialog on these topics, according to operators.  Even those involved in SDN and NFV trials say that their activity is focused at a lower level, rather than addressing the broad operations relationships that we really need to see addressed.  I’m hoping that the expanding work of operators like AT&T and Telefonica will at least expose all the issues, and perhaps also offer some preliminary solutions.  Eventually, I think that operator initiatives will be the drivers to getting real discussions going here; vendors don’t seem willing to take the lead.

Why Not Have REAL Virtualization?

Does a network that presents an IP interface to the user constitute an “IP network?”  Is that condition enough to define a network, or are there other properties?  These aren’t questions that we’re used to asking, but in the era of virtualization and intent modeling, there are powerful reasons to ask whether a “network” is defined by its services alone.  One powerful reason is that sometimes we could create user services in a better way than we traditionally do.

SDN is about central software control over forwarding.  In the most-often-cited “stimulus” model of SDN, you send a packet to a switch and if it doesn’t have a route for it, it requests one from the central controller.  There is no discovery, no adaptive routing, and in fact unless there’s some provision made to handle control packets (ICMP, etc.) there isn’t any response to them or provision for them.  But if you squirt in an IP packet, it would get to the destination if the controller is working.  So, this is an IP network from a user perspective.

If this sounds like a useless exercise, reflect on the fact that a lot of our presumptions about network virtualization and infrastructure modernization rely implicitly on the “black box” principle; a black box is known only by the relationship between its inputs and outputs.  We can diddle around inside any way that we like, and if that relationship is preserved (at least as much of it as the user exercises) we have equivalence.  SDN works, in short.

Over the long haul, this truth means that we could build IP services without IP devices in the network.  Take some optical paths, supplement them with electrical tunnels of some sort, stick some boxes in place that do SDN-like forwarding, and you have a VPN.  It’s this longer-term truth that SD-WANs are probably building toward.  If the interface defines the service, present it on top of the cheapest, simplest, infrastructure possible.

In the near term, of course, we’re not likely to see something this radical.  Anything that touches the consumer has to be diddled with great care, because users don’t like stuff that works differently.  However, there are places where black-box networks don’t touch users in a traditional sense, and here we might well find a solid reason for a black-box implementation.  The best example of one is the evolved packet core (EPC) and its virtual (vEPC) equivalent.

Is a “virtual EPC” an EPC whose functionality has been pulled out of appliances and hosted in servers?  Is that all there is to it?  It’s obvious that the virtual device model of an EPC would fit the black-box property set, because its inputs and outputs would be processed and generated by the same logic.  One must ask, though, whether this is the best use of virtualization.

The function of the EPC in a mobile network is to accommodate a roaming mobile user who has a fixed address with which the user relates to other users and to other network (and Internet) resources.  In a simple description of the EPC, a roaming user has a tunnel that connects to their cell site on one end and to the packet network (the Internet, for example) gateway on the other.  As the user moves from cell to cell, the tunnel gets moved to ensure packets are delivered correctly.  You can read the EPC documentation and find citations for the “PGW”, the “SGW”, the “MME” and so forth, and we can certainly turn all or any of these into hosted software instances.

However…if the SDN forwarding process builds what is in effect a tunnel by creating cooperating forwarding table entries under central control, could we not do something that looked like an EPC behavior without all the acronyms and their associated entities?  If the central SDN controller knows from cell site registration that User A has moved from Cell 1 to Cell 5, could the SDN controller not simply redirect the same user address to a different cell?  Remember, the world of SDN doesn’t really know what an IP address is, it only knows what it has forwarding rules for, and what those rules direct the switch to do.
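
Here is a conceptual sketch of that behavior, not a real SDN controller API: on a cell-site registration event, the controller simply rewrites the forwarding rule for the user’s address so traffic exits at the new cell, with no tunnel gateways involved.

```python
# Conceptual sketch only (hypothetical names, not a real controller API).

forwarding_rules = {
    # user IP address -> switch and port toward the serving cell
    "10.0.42.7": {"switch": "metro-sw-3", "port": "cell-1"},
}

def handle_cell_registration(user_ip, new_switch, new_cell_port):
    """Invoked when cell-site registration reports the user has moved."""
    forwarding_rules[user_ip] = {"switch": new_switch, "port": new_cell_port}
    # In a real controller, this would push updated flow entries to the switches;
    # there is no PGW/SGW tunnel to relocate.

handle_cell_registration("10.0.42.7", "metro-sw-7", "cell-5")
print(forwarding_rules["10.0.42.7"])
```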

You could also apply this to content delivery.  A given user who wants a given content element could, instead of being directed at the URL level to a cache, simply be connected to the right one based on forwarding rules.  Content in mobile networks could have two degrees of freedom, so to speak, with both ends of a connection linked to the correct cell or cache depending on content, location, and other variables.

I’m not trying to design all the features of the network of the future here, just illustrate the critical point that virtualization shouldn’t impose legacy methodologies on virtual infrastructure.  Let what happens inside the black box be based on optimality of implementation, taking full advantage of all the advanced features that we’re evolving.  Don’t make a 747 look like a horse and buggy.

We’re coming to accept the notion that the management of functionally equivalent elements should be based on intent-model principles, which means that from a management interface perspective both legacy and next-gen virtualization technology should look the same.  Clearly, they have to look the same from the perspective of the user connection to the data plane, or nothing currently out there could connect.  I think this proves that we should be taking the black-box approach seriously.

Part of this is the usual buzz-washing problem.  Vendors can claim that anything is “virtual”; after all, what is less real than hype?  You get a story for saying you have “virtual” EPC and nothing much if you simply have EPC or a hosted version.  Real virtualization has to toss off the limitations of the underlying technology it’s replacing, not reproduce those limits.  vEPC would be a good place to start.

Is There Really a Problem With OpenStack in NFV?

Telefonica has long been a leader in virtualization, and there’s a new Analysys Mason report on their UNICA model.  There’s also been increased notice taken of Telefonica’s issues with OpenStack, and I think it’s worth looking at the report on UNICA and the OpenStack issues to see where the problems might lie.  Is OpenStack a problem, is the application of OpenStack the issue, or perhaps is the ETSI end-to-end model for NFV at fault?  Or all of the above?

In ETSI NFV’s E2E model, the management and orchestration element interfaces to infrastructure via a Virtual Infrastructure Manager.  I’ve had issues with that from the first, because in my view we shouldn’t presume that all infrastructure is virtual, so an “Infrastructure Manager” would be more appropriate.  It also showcases a fundamental issue with the VIM concept, one that UNICA might not fully address.

We have lots of different infrastructure, both in terms of its technology and in terms of geography.  Logically, I think we should presume that there are many different “infrastructure managers”, supplied by vendors, by open-source projects, or even developed by network operators.  Each of these would control a “domain”.  It’s hard to read the story from the report, but I’ve heard stories that Telefonica has had issues with the performance of OpenStack while deploying multiple VNFs, and in particular issues with performance when requirements to deploy or redeploy collide.

The solution to the issue in the near term is what Telefonica calls “Vertical Virtualization”, which really means vertical, service-specific silo NFV.  For vEPC, for example, they’d rely on Huawei.  This contrasts with the “horizontal” approach of UNICA, where (to quote the Analysys Mason paper) “Ericsson supplies the NFVI and related support for UNICA Infrastructure, which is the only infrastructure solution globally that will support VNFs.”

So here is where I think the issue may lie.  NFVI, in the ETSI document, is monolithic.  There is therefore a risk that a “domain” under NFVI control might be large enough to create hundreds, thousands, or even more service events per minute.  There is a known issue with the way OpenStack handles requests; they are serialized (queued and processed one at a time), because it’s very difficult to manage multiple requests for the same resource from different sources in any other way.  The use of parallel NFV implementations bypasses this, of course, but there are better ways.

Parallel implementations (“vertical virtualization”) create siloed resources, so the solution has only limited utility.  What would be better is a “VIM” structure that allows for the separation of domains, so that different vendors and technologies are kept apart.  Multiple VIMs can resolve this.  But you also need a way of partitioning rather than just separating.  If OpenStack has a limit on the scale of domain it can control, you first work to expand that limit, and then you create domains that fit within it.

The biggest problem with OpenStack scaling in NFV is the networking piece, Neutron.  Operators report that Neutron can tap out at fewer than 200 requests.  It’s possible to substitute more effective network plugins, and here my own experiences with operators suggest that Nokia/Nuage has the best answer (not that Ericsson is likely to pick it!).

If you can’t expand the limits, then size within them.  An OpenStack domain doesn’t have to wrap around the globe.  Every data center could have a thousand racks, and you can easily define groups of racks as being a domain, with the size of the group designed to ensure that you don’t overload OpenStack.  However, you now have a resource pool of, say, 200 racks.  How do you make that work?

Answer: by a hierarchy.  You have a “virtual pool” VIM, and this VIM does gross-level resource assignment, not to a server but to a domain.  You pick a server farm at a high level, then a bank/domain, and finally a server.  Only the last of these requires OpenStack for hosting.  Networking is a bit more complicated, but only if you don’t structure your switching and connectivity in a hierarchical way.  In short, it’s possible to use decomposition policies to decompose a generalized resource pool into smaller domains that could be easily handled.
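
Here is a small sketch of that hierarchy (the names, sizes, and structure are hypothetical): a “virtual pool” VIM picks a data center and then a rack-group domain sized to stay within OpenStack’s comfort zone, and only that domain’s OpenStack instance is asked to do the actual hosting.

```python
# Hypothetical resource tree for gross-level, hierarchical assignment.

RESOURCE_TREE = {
    "dc-east": {
        "domain-1": {"racks": 200, "free_slots": 140, "openstack": "os-east-1"},
        "domain-2": {"racks": 200, "free_slots": 12,  "openstack": "os-east-2"},
    },
    "dc-west": {
        "domain-1": {"racks": 150, "free_slots": 90,  "openstack": "os-west-1"},
    },
}

def pick_domain(slots_needed, preferred_dc=None):
    """Gross-level assignment: choose a domain, not a server.  The final,
    server-level placement is then handed to that domain's OpenStack."""
    centers = [preferred_dc] if preferred_dc else list(RESOURCE_TREE)
    for dc in centers:
        for name, dom in RESOURCE_TREE[dc].items():
            if dom["free_slots"] >= slots_needed:
                return dc, name, dom["openstack"]
    return None

print(pick_domain(slots_needed=20, preferred_dc="dc-east"))
```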

It’s also possible, if you use a good modeling strategy, to describe your service decomposition in such a way as to make a different VIM selection depending on the service.  Thus, you can do service modeling that does a higher-level resource selection.  Then you could use the same modeling strategy to decompose the resources.  If you’re interested in this, take a look at the annotated ExperiaSphere Deployment Phase slides HERE.

The point here is that the fault isn’t totally with OpenStack.  You can’t assign resources for a dozen activities in parallel, drawing from the same pool.  Thus, you have to divide the pool or nothing works.  You can make the pool bigger by having more efficient code, but in the end, you’re disposing of finite resources here, and you come to a point where you have to check status and commit one.  That’s a serial process.

This is a problem that’s been around in NFV for years, and many (including me, multiple times) have called it out.  I don’t think it’s a difficult problem to solve, but every problem you decline to face is insurmountable.  It’s not like the issues haven’t been recognized; the TMF SID (Shared Information and Data Model) has separated service and resource domains for literally several decades.  I don’t think they envisioned this particular application of their model, and I like other modern model approaches better, but SID would work.

No matter how you hammer vendors (or open-source groups, or both) to “fix” problems, the process will fail if you don’t identify the problem correctly.  Networks built up from virtual functions connected to each other in chains are going to generate a lot more provisioning than cloud applications generate.  If there was no way to scale OpenStack deployments properly without changing it, then I think Telefonica and others could make a case for demanding faster responses from the OpenStack community.  But there is a way, and a better way.

Telefonica has a lot of very smart people, including some who I really respect.  I think they’re just stuck in the momentum of an NFV vision that didn’t get off to a good start.  The irony to me is that there’s nothing in the E2E model that forecloses the kind of thing I’ve talked about here.  It’s just that a literal interpretation of the model encourages a rigid, limited, structure that pushes too much downward onto open tools (like OpenStack) that were never intended to solve global-scale NFV problems.  I’d encourage the ISG to promote the “loose construction” view of the specs, and operators to push for that.  Otherwise we have a long road ahead.