What Should We Expect from Controllers and Infrastructure Managers?

One of the key pieces of network functions virtualization (NFV) is the “virtual infrastructure manager” or VIM.  In the E2E ETSI model for NFV, the VIM takes instructions from the management and orchestration element (MANO) and translates them to infrastructure management and control processes.  One of the challenges for NFV implementation is just what shape these instructions take and just how much “orchestration” is actually done in MANO versus in the VIM.  To understand the challenges, we have to look at the broader issue of how services as abstractions are translated to infrastructure.

A service, in a lifecycle sense, is a cooperative behavior set impressed on infrastructure through some management interface or interfaces.  Thus, a service is itself an abstraction, but the tendency for decades has been to view services as a layer of abstractions, the higher being more general than the lower.  Almost everything we see today in service lifecycle management or service automation is based on an abstraction model.

The original concept probably came from the OSI management standards, which established a hierarchy of element, network, and service management.  It’s pretty clear that the structure was intended to abstract the notion of “service” and define it as being a set of behaviors that were first decomposed to network/administrative subsets, and finally down to devices.  This was the approach used by almost all router and Ethernet vendors from the ‘90s onward.

If we presume that there’s a service like “VPN,” it’s not hard to see how that service could be first decomposed by the administrative (management) domains that were needed to cover the scope of the service, and then down to the elements/devices involved.  Thus, we could even say that “decomposition” is an old concept (even if it was largely forgotten along the way to newer developments).

The TeleManagement Forum (TMF) largely followed this model, which became known as “MTNM,” for Multi-Technology Network Management.  An implicit assumption in both the old service/network/element hierarchy and the MTNM concept was that the service was a native behavior of the underlying networks/devices.  You just had to coerce cooperation.  What changed the game considerably was the almost-parallel developments of SDN and NFV.

SDN networks don’t really have an intrinsic service behavior that can be amalgamated upward to create retail offerings.  A white-box switch without forwarding policy control sits there and eats packets.  NFV networks require that features be created by deploying and connecting software pieces.  Thus, the “service behaviors” needed can’t be coerced from devices, they have to be explicitly created/deployed.  This is the step that leads to the abstract concept of an “infrastructure manager”.

Which is what we should really call an NFV VIM.  Not all infrastructure is virtual; obviously, today most of it is legacy devices that could still be managed and service-coordinated the old way.  Even in the future it’s likely that a big piece of networks will have inherent behavior that’s managed by the old models.  So an “IM” is a VIM that doesn’t expect everything to be virtual, meaning that on activation it might either simply command something through a legacy management interface or deploy and connect a bunch of features.  In SDN, an IM is the OpenFlow controller, and in particular those infamous northbound interfaces (NBIs).
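
To make the distinction concrete, here’s a minimal sketch (in Python, with names I’ve invented for illustration; this isn’t any standard API) of an IM abstraction that presents one activation interface but fulfills it either by commanding a legacy management interface or by deploying and connecting hosted features.

```python
from abc import ABC, abstractmethod

class InfrastructureManager(ABC):
    """Abstract IM: callers ask for a behavior; how it is realized is hidden."""

    @abstractmethod
    def activate(self, behavior: str, params: dict) -> str:
        """Realize the requested behavior and return a handle for lifecycle use."""

class LegacyIM(InfrastructureManager):
    """Realizes behaviors by coercing devices through an existing management interface."""

    def activate(self, behavior: str, params: dict) -> str:
        # In a real system this would issue CLI/SNMP/NETCONF commands to devices.
        print(f"Legacy IM: configuring devices for '{behavior}' with {params}")
        return f"legacy::{behavior}"

class VirtualIM(InfrastructureManager):
    """Realizes behaviors by deploying and connecting hosted feature instances."""

    def activate(self, behavior: str, params: dict) -> str:
        # In a real system this would deploy VNFs/containers and stitch them together.
        print(f"Virtual IM: deploying features for '{behavior}' with {params}")
        return f"virtual::{behavior}"

# The caller doesn't care which kind of IM it is talking to.
for im in (LegacyIM(), VirtualIM()):
    handle = im.activate("vpn-edge", {"site": "Chicago", "bandwidth_mbps": 100})
    print("activated:", handle)
```

The point of the abstraction is that the service layer above sees one “activate a behavior” contract whether the realization is coerced or deployed.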

It’s comforting, perhaps, to be able to place the pieces of modern network deployment and management into a model that can also be reconciled to the past.  However, we can’t get cocky here.  We still have issues.

I can abstract a single management interface, at a low level.  I can abstract a high-level interface.  The difference is that if I do abstraction at a low level, then I have to be able to compose the service myself, and issue low-level commands as needed to fulfill what I’ve composed.  If I can abstract at a high level, I have the classic “do-job” command—I can simply tell a complex system to do what I want.  In that case, though, I leave the complexity of composition, optimization, orchestration, or whatever you’d like to call it, to that system.
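
Here’s a rough sketch of the difference, again with invented names: the low-level interface leaves composition to the caller, while the “do-job” interface accepts the intent whole and hides the composition.

```python
class LowLevelController:
    """Accepts individual, device-level commands; knows nothing about 'services'."""
    def issue(self, command: str) -> None:
        print("low-level:", command)

class DoJobOrchestrator:
    """Accepts a whole job and performs composition/optimization internally."""
    def do_job(self, service: str, intent: dict) -> None:
        print(f"do-job: composing '{service}' from intent {intent}")

sites = ["nyc", "chi", "sfo"]

# Low-level abstraction: the caller composes the service and issues each command.
ctl = LowLevelController()
for site in sites:
    ctl.issue(f"create-port {site}")
for a, b in zip(sites, sites[1:]):
    ctl.issue(f"create-tunnel {a} {b}")

# High-level abstraction: the caller states what it wants and delegates the rest.
DoJobOrchestrator().do_job("vpn", {"sites": sites, "sla": "gold"})
```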

This is a natural approach to take in the relationship between modern services and OSS/BSS systems.  Generally, service billing and operations management at the CRM level depend on functional elements, meaning services and meaningful, billable components.  Since billable elements are also a convenient administrative breakdown, this approach maps to the legacy model of network management fairly well.  However, as noted, it supposes that there’s a sophisticated service modeling and lifecycle management process that lives below the OSS/BSS.

That’s not necessarily a bad thing, because we’ve had a pretty hard separation between network management and operations and service management and operations for decades.  However, having two ships-in-the-night operations processes running in parallel can create major issues of coordination in a highly agile environment.  I’m not saying that the approach can’t work, because I think it can.  I am saying that you have to co-evolve OSS/BSS and NMS to make it work through a virtualization transition in infrastructure.

The thing that seems essential is the notion of a service plane separate from the resource plane.  This separation acknowledges the way operators have organized themselves (CIOs run the former, and COOs the latter), and it also acknowledges the fact that services are compositions built from resource behaviors.  The infrastructure has a set of domains, a geographic distribution, and a set of technical capabilities.  These are framed into resource-level offerings (which I’ve called “behaviors” to separate them from the “service” elements), and the behaviors are composed in an upward hierarchy to the services that are sold.

Infrastructure managers, then, should be exporters of those “behaviors”.  You should, in your approach to service modeling, be able to ask an IM for a behavior, and have it divide the request across multiple management domains.  You should also be able to call for a specific management domain in the request.  In short, we need to generalize the IM concept even more than we’re working to generalize it today, to allow for everything from “do-job” requests for global services to “do-this-specifically” requests for an abstract feature from a single domain.
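
A minimal sketch of that generalized request interface might look like the following (the class and domain names are mine, purely illustrative): a behavior request either fans out across whatever domains are needed to cover its scope, or is pinned to one named domain.

```python
class Domain:
    def __init__(self, name: str, coverage: set):
        self.name = name
        self.coverage = coverage  # e.g., the sites/metros this domain can serve

    def realize(self, behavior: str, sites: set) -> str:
        return f"{self.name} realizes '{behavior}' for {sorted(sites)}"

class GeneralizedIM:
    """Accepts 'do-job' behavior requests and divides them across domains,
    or targets one named domain when the caller asks for that specifically."""

    def __init__(self, domains):
        self.domains = {d.name: d for d in domains}

    def request(self, behavior: str, sites: set, domain: str = None):
        if domain:  # "do-this-specifically": one named domain
            return [self.domains[domain].realize(behavior, sites)]
        results = []  # "do-job": split the scope across the covering domains
        for d in self.domains.values():
            covered = sites & d.coverage
            if covered:
                results.append(d.realize(behavior, covered))
        return results

im = GeneralizedIM([Domain("us-east", {"nyc", "bos"}), Domain("us-west", {"sfo"})])
print(im.request("vpn-access", {"nyc", "sfo", "bos"}))       # fan-out across domains
print(im.request("firewall", {"nyc"}, domain="us-east"))     # specific domain
```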

But we can’t dive below that.  The basic notion of intent modeling demands that we always keep a functional face on our service components.  Behaviors are functional.  Service components are functional.  In the resource domain, they are decomposed into implementations.

I do think that the modeling approach to both service and resource domains should be the same.  Everything should be event-driven because that is clearly where the cloud is going, and if service providers are going to build services based on compute-hosted features, they’re darn sure not going to invent their own architecture to do the hosting and succeed.  The cloud revolution is happening and operators first and foremost need to tap it.  Infrastructure management and controller concepts have to be part of that tapping.

How an Event-Centric Cloud Model Might Influence the Edge Devices

If we assume that the notion of an event-driven cloud is correct, we have to ask ourselves what that cloud model would do to the way edge devices get information and content.  If the cloud is a new computing paradigm, does the paradigm extend to the edge?  How does it then impact the way we build software or deliver things?  The answers are highly speculative at this point, but interesting.

Right now, consumers and workers both tend to interact with information/content resources through a URL click.  This invokes a “connection”, a session, between the user and the resource, and that is sustained through the relationship.  In an event model, things would have to work differently, but to see why (and how they would then have to work) we’ll have to use an example.

Let’s say we have a smartphone user walking down a city street.  In a traditional model of service, the user would “pull” information from the phone, looking for a location or perhaps a retail store.  In an event-driven model the user might have information pushed to the device instead, perhaps based on “shopping habits” or “recent searches”.  Indeed, this sort of push relationship is the most plausible driver for wearables, since hauling the phone out to look at messages would be intrusive to many.

Making this sort of thing work, then, is at least a reasonable goal.  Let’s start with the notion of “push”, which would mean having events cast to the user, representing things that might warrant attention.  It’s easy to envision a stream of events going to the user’s phone, but is that really logical, optimal?  Probably not.

A city street might represent a source for hundreds or thousands of IoT “events” per block.  Retail stores might generate more than that, and then we have notifications from other users that they’re in the area, alerts on traffic or security ahead, and so forth.  Imagining tens of thousands of events in a single walk is hardly out of line, but it’s probably out of the question in terms of smartphone processing.  At the least, looking all that stuff up just to decide if it’s important would take considerable phone power.  Then you have the problem of the network traffic that sending those events to every user nearby would create.

Logically speaking, it would seem that event-based applications would accelerate the trend toward a personal agent resident in the cloud, a trend that’s already in play with voice agents like Apple’s Siri or Amazon’s Alexa or “Hey, Google”.  It’s not a major step from today’s capabilities to imagine a partner process to such an agent in the cloud, or even cloud-hosting of the entire agent process.  You tell your agent what you want and the agent does the work.  That’s the framework we’d probably end up with even without events.

What events do is create a value for in-cloud correlation.  If there’s a cloud agent representing the user then there’s a way of correlating events to create useful context, not just flood users with information like an out-of-control visual experience.  We can do, in the cloud, what is impractical in the smartphone.  Best of all, we can do it in a pan-user way, a way that recognizes that “context” isn’t totally unique to users.

Say our smartphone user is at a concert.  There’s little doubt that the thing that defines the user’s focus and context at that moment is the concert, and that’s just what is defining those things for every user who attends.  News stories also create context; everyone who’s viewing an Amber Alert or watching breaking news is first and foremost a part of the story those channels convey.

If there are “group contexts” then it makes sense to think of context and event management as a series of processes linked in a hierarchy.  For example, you might have “concert” as a collective context, and then perhaps divide the attendees by where they are in the venue, by age, etc.  In our walk-on-the-street example, you might have a “city” context, a “neighborhood” and a “block”.  These contexts would be fed into a user-specific personal-agent process.

I say “hierarchy” here not just to describe the way that contexts are physically related.  It would make sense for a city context to be passed to neighborhood contexts, and then on down.  The purpose of this is to ensure that we don’t overload personal-agent processes with stuff that’s not helpful or necessary.
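
A toy sketch of that hierarchy, with invented names and events, might look like this; the point is simply that each context process forwards only what its children care about, so a city-level event flood never reaches a personal agent unfiltered.

```python
class ContextNode:
    """A context process that forwards only relevant events to its children."""
    def __init__(self, name: str, interest):
        self.name = name
        self.interest = interest      # predicate deciding what this context cares about
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child

    def publish(self, event: dict):
        if not self.interest(event):
            return                    # dropped here; children never see it
        for child in self.children:
            child.publish(event)

class PersonalAgent(ContextNode):
    def publish(self, event: dict):
        if self.interest(event):
            print(f"{self.name} notified: {event['what']}")

city = ContextNode("city", lambda e: True)
block = city.add(ContextNode("block-5th-ave", lambda e: e.get("block") == "5th-ave"))
agent = block.add(PersonalAgent("alice-agent", lambda e: e.get("topic") in {"retail", "transit"}))

city.publish({"block": "5th-ave", "topic": "retail", "what": "sale at a shoe store"})   # reaches Alice
city.publish({"block": "main-st", "topic": "retail", "what": "other block"})            # dropped at block
city.publish({"block": "5th-ave", "topic": "weather", "what": "drizzle"})               # dropped at agent
```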

In this sort of world, a smartphone or PC user doesn’t need to access “the web” nearly as much; they are interacting with personal agent and context agents, which are cloud processes.  It’s pretty easy to provide a completely secure link to a single cloud process.  It’s pretty easy to secure cloud processes’ connections with each other, and to authenticate services these processes offer to other processes (if you’re interested in a “consensus” model HERE is how the original ExperiaSphere project approached it back in 2007).  Thus, a lot of the security issues that arise with the Internet today can’t really happen; all the identities and relationships are secured by the architecture.

This approach doesn’t define an architecture for context creation or personal agency, or the specific method for interconnection; those issues can be addressed when someone wants to implement the architecture.  The approach does define, in effect, the relationship between personal agent and user appliance.  It’s what the name suggests: agency.  In some cases, the agent might provide a voice or visual response, and in others it might do something specific.  Whatever happens, though, the agent is acting for the user.  We see that now with Amazon’s Alexa in particular; some people tell me that they talk to it almost as they would a person.

Which I think is obviously where we’re headed with all of this.  The more sophisticated our processing and information resources are, and the more tightly they’re bound to our lives, the harder it is to surmount artificial barriers created by explicit man-machine interactions like clicking a URL.  We want our little elves to be anthropomorphic, and our devices likewise.

The biggest trend driving networking today is the personalization of our interaction with our devices and the information resources those devices link us with.  The second-biggest trend is the growth in contextual information that could be used to support that personalization, in the form of events representing conditions or changes in conditions.  The biggest trend in the cloud is the shift in focus of cloud tools toward processing and exploiting these contextual and event resources.  The second trend, clearly, is driven by the first.

As contextual interpretation of events becomes more valuable, it follows that devices will become more contextual/event-aware themselves.  The goal won’t be to displace the role of the cloud agent but to supplement that agent when it’s available and substitute for it when the user is out of contact.  The supplementation will obviously be the most significant driver because most people will never be out of touch.

Devices are valuable sources of context for three reasons.  First, since they’re with the user they can be made aware of location and local conditions.  Second, because the device can be the focus of several parallel but independent connections, the device may be the best/only place where events from all of them can be captured.  Texting, calling, and social-media connections all necessarily involve the device itself.  Third, the device may be providing the user a service that doesn’t involve communications per se.  Taking a picture is an example for smartphones, or perhaps movement of the device or changes in its orientation.  An example for laptops is running a local application, including writing an email.

The clearest impact of event-centric cloud processing is event-centric thinking in the smartphone.  Everything a user does is a potential event, something to be contextualized in the handset or in the cloud, or both.  Since I think that contextualization is hierarchical, as I’ve noted above, handset events would likely be correlated there.  The easy example is a regular change in GPS position coupled with the orientation shifts associated with walking or driving.  This combination of things lets the device “know” the user is on foot or in a vehicle.  You could correlate the position with the location of public transport vehicles to see if it’s a car or not.  You can learn a lot, and that learning means you can provide the user with more relevant information, which increases your value as a service provider.
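
Here’s a crude sketch of that kind of on-device correlation; the thresholds and inputs are invented for illustration and don’t come from any real sensor API.

```python
def infer_motion(gps_speeds_mps, orientation_variance):
    """Guess the user's motion state from GPS speed samples and how much the
    device orientation is changing. Thresholds are illustrative only."""
    avg_speed = sum(gps_speeds_mps) / len(gps_speeds_mps)
    if avg_speed < 0.3:
        return "stationary"
    if avg_speed < 3.0 and orientation_variance > 0.5:
        return "walking"         # slow, with the rhythmic orientation shifts of a gait
    if avg_speed >= 3.0 and orientation_variance < 0.2:
        return "in a vehicle"    # fast, with the device held relatively steady
    return "unknown"

print(infer_motion([1.2, 1.4, 1.3], orientation_variance=0.8))       # walking
print(infer_motion([14.0, 15.5, 13.8], orientation_variance=0.05))   # in a vehicle
```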

The net of this is that devices, particularly smartphones, are going to transform to exploit cloud agency and contextual processing of events.  But even laptops will be impacted, becoming more event-centric with respect to application context and social awareness.  We can already see this in search engines, and every step it expands offers users and workers and businesses more value from IT.  It’s this value increase that will drive any increases in spending, so it’s important for us all.

Cisco’s Earnings and the Need for Aggressive Thinking on the Cloud

Cisco reported roughly in-line numbers for the quarter but the stock was down over 2% because it is still reporting a sequential decline in revenues.  Guidance was also at the low end of Street expectations, which further suggests a “no-improvement” scenario, and there was special Street concern for the fact that security products didn’t do as well as expected.  I’d guess you’d not be surprised if I said this wasn’t a good sign for networking.

Why does equipment revenue decline?  Because operator and enterprise capex is at least not growing, and overall is declining.  Why is that happening?  Because it takes new benefits to justify new spending, and buyers are focused on reducing their own costs.  Cisco and other vendors are cutting costs so their profits aren’t dropping with revenues.  Buyers are supposed to do something different?  Get real.

If you’re not surprised that I don’t think this is good for the industry, you won’t be surprised if I say that it’s hardly news.  Operators have told the Street that they’re cutting capex.  Enterprises don’t have that kind of long-term capital-project planning, but CIOs are telling me that every round of new network spending is focused on lowering costs overall, and if that can’t be done then at least raising them as little as possible.  All of this because ROI for network projects isn’t meeting internal guidelines for approval unless the cost is lower, not higher.  Lower cost, lower spending, lower vendor revenue.  QED (which for those not blessed as I with two years of high-school Latin, means “quod erat demonstrandum”, or in English “Thus it has been demonstrated.”  Or, if you like “res ipsa loquitur” or “the thing speaks for itself”.  Don’t you love a practical education?)

I really feel like “Groundhog Day” here, as long as we’re doing quotes.  Yes, I have been saying that absent new benefits there cannot be new spending.  No, people don’t seem to be paying attention.  So I’m saying it again now and perhaps offering some comments based on Cisco’s earnings call comments.

Vendors love to explain shortfalls as being due to market conditions and not to their own missteps.  Well, gosh, is it a surprise to them that they are in the market?  What conditions did they expect, and did they have anything other than blind hope that they’d come along?  If you dig past the usual trash in vendor comments, they all are saying in effect that they thought that more traffic would drive more earnings in the operator space, and also in the enterprise space.  More bits means more bucks, and that’s true for vendors.  Not so for buyers.

The future is never a linear extension of the past.  Any technology, any business idea, has a logical lifespan beyond which its benefits no longer grow, and so no longer justify increased investment.  We are today depending on a notion of networking that goes back about 40 years.  What in tech has survived that long?  In 1974, the year TCP/IP arguably was born, a small computer was one that would fit in a 19-inch rack and cost thousands of dollars.  Today a computer a hundred times more powerful can be carried in your pocket.  So how could computing change so much and networking change so little?  It’s not logical.

To be fair, though, computing’s change was more quantitative than qualitative in networking terms.  Today’s systems are a lot faster, but they are still discrete devices that have fairly static relationships with networks.  A network that connects a multi-core smartphone and a network that connected a DEC PDP-8 still address endpoints the same, and expect connectivity between those endpoints to be the essence of any service.  Cisco and others, perhaps, might be forgiven if they fall today into the same service-mission trap that ensnared them four decades ago.

Can ignoring four years of static or declining revenue be forgiven, though?  Certainly it shouldn’t be ignored.  I saw the classic “profit-per-bit” compression and crossover slides in 2013, and so did a lot of other people.  Cisco now, perhaps more than other network equipment vendors, is ready to face the truth and push more for “software and subscriptions” as a revenue source.  The question is whether this shift can really accomplish what Cisco needs.

All my modeling and all the logic in the industry suggests that networking and computing had a kind of push-pull relationship initially.  Computing creates an information/content pool that, for a time, was inhibited by network infrastructure designed for low-speed voice and terminal traffic.  Cisco took advantage of the sudden excess of content to be delivered and the insufficiency of delivery options.  Now the problem is that we need more from the compute side—processing resources to enhance the value of delivery again.  And it’s not obvious how that happens.

Corporate IT needs to reframe its network to support point-of-activity worker empowerment, creating what is effectively an event relationship with workers.  Consumer services need to be able to contextualize every consumer interaction to make them more valuable.  I know that Cisco knew there were at least some who said this a decade ago, because I told them.  They, like most of the industry, elected to stay the course of traditional networking.

Security is another example of short-thinking.  Do we really think that network operators and enterprises will pay nearly as much to secure networks as they paid to build them?  Is the fact that, as Cisco says on their earnings call, IoT device security attacks are up 90% an indication that we need to spend gazillions of dollars on IoT security?  We need to be spending more on making networking intrinsically secure, not gluing remedy onto imperfection.

Cisco’s call says in essence that data center is growing and WAN is flat, and they correctly name the cloud as the reason.  However, where in the call does Cisco say they know why the cloud is growing in importance?  It’s not because it’s cheaper for current applications, but because it’s the right platform for future applications.  There are a lot of computing changes between us and where we need to be, in order to support those future applications at the server/software level.  That’s where Cisco, and other network equipment vendors, need to be.  Don’t expect the consequences of cloud expansion to win the game for you; expect to win it by driving that expansion directly.

Why a Model for Network-Computing Fusion is Important

After my blog on a model-driven service lifecycle management technique, I got a bunch of emails from operator and vendor contacts (and from some who’d never contacted me).  Part of the interest was driven by a Light Reading article that noted my skepticism about the way that ETSI NFV was being implemented.   Most was driven by interest in the specifics of the model that I think would work, and it’s that group that I’m addressing here.

My ExperiaSphere approach to NFV has been extensively documented HERE in a series of tutorials, and I’ve now added another tutorial to the list, this one cutting horizontally across the service-lifecycle-stage approach taken by the earlier tutorials.  I’ll be adding it to the ExperiaSphere tutorial list, but in the meantime the new presentation can be found HERE.

My goal in this extra tutorial is to link the event-centric evolution in public cloud services I’ve blogged about, with the needs of network service lifecycle automation.  This was a principle of ExperiaSphere from the very first (which some of you may remember was in 2007), but the specific features of Amazon, Microsoft, and Google can now be used to explain things in a modern-relevant way.

The key point here is the same one I’ve made in the earlier blog I referenced, and that I made to Carol Wilson in the interview for the Light Reading piece.  We have to do NFV using the most modern technology available, which means “event-centric” cloud and data-model-driven portability of features.  We know that now because of the direction that all the major cloud providers are taking.

We’ve built our concept of networking on the notion that a network connects addresses, which represent real and at least somewhat persistent things.  We’re entering an age where the concept of addresses, persistence, or even “things” is restrictive.  In the cloud, there’s no reason why features can’t migrate around to find work, rather than the other way around.  There’s no value to a specific resource, only to resources as a collection.  Users are transient things, and so are the services they consume.  This is the future, both of the cloud and of networking.

All of this was pretty clear a decade ago, and so were the challenges to promoting the vision.  I was a part of a group called the “IPsphere Forum” or IPSF, and it was probably the first initiative that saw services as things you composed on demand.  The founding force behind it was a vendor, which prompted the vendor’s arch-rival to try to torpedo the whole notion.  Operators jumped on it and worked hard to bring it to fruition, but they were defeated in part by regulations that forbade their “collusion” in operator-dominated standards work.  And above all of this was the fact that at the time socializing a concept this broad and different was difficult because most in the industry had never even thought about it.

They think about it now.  We’re now seeing the future of networking differently, and in no small (and sad) part because we’re really seeing it in the cloud and not in the network.  In networking, everyone has focused on protecting their turf as market changes and competition threaten their profits.  There was no massive upside to connection services, no big benefit cloud to fight for.  Cloud provider competition, rather than trying to protect the status quo, is trying to advance the cloud faster into new areas.  Regulations don’t impair cloud providers.  Most of all, for the cloud we’re sitting on a trillion-dollar upside.

That upside could have gone to network providers, both services and equipment.  Networking was, and still is, the broadest of all tech industries in geography and “touch”.  It can tolerate low ROIs, and has plenty of cash to invest.  As an industry, networking has squandered a host of advantages it had in defining the fusion of network and computing that we call “the cloud”.  As an industry, networking has even largely squandered itself, because its future is now out of its own hands.

ExperiaSphere is my attempt to frame the future in at least a straightforward way.  Maybe it’s not something everyone understands, but I think every network and IT and cloud professional would understand it.  I want to emphasize here that I’m not “selling” ExperiaSphere, or in fact selling anything related to it.  The material on the website is all open and public, available to anyone without attribution or fees.  I’m selling an idea, a vision.  Better yet for the bargain-conscious, I’m giving it away in these tutorials and my blogs.

This blog is posted on LinkedIn, and anyone who has a view on the issues I’ve raised can comment and question there.  There’s an ExperiaSphere Google+ community too and I’d be happy to take part in discussions there as well.  Do you share the vision I’ve cited, or do you have reasoned objections?  Let’s hear your views either way.

How the Cloud That’s Emerging Will Shape the Network of the Future

Have you noticed that in the last six months, we’ve been having more stories about “cloud” networking and fewer about SDN or NFV?  Sure, it’s easy to say (and also true, as it happens) that the media jumps off a technology once it becomes too complicated to cover or is discredited in terms of impact versus hype.  In the case of the cloud, which is older conceptually than both SDN and NFV, that can’t explain the shift.  What’s going on here?

One fairly obvious truth is that a lot of what has been said about the impact of SDN and NFV is really about the impact of the cloud.  SDN is highly valuable in cloud data centers, and SDN software is therefore a critical adjunct to cloud computing, but it’s the cloud computing part that’s pulling SDN through.  Without the cloud, we’d be having relatively little SDN success.  NFV, somewhat in contrast, has been assigned a bunch of missions that were in truth never particularly “NFV” at all.  Many were cloud missions, and that’s now becoming clear.

A truth less obvious is that underneath its own formidable burden of hype, the cloud is maturing.  There was never any future in the notion that cloud services would be driven by the movement of legacy apps from data center to cloud, but it wasn’t clear what would be driving them.  Now we know that the cloud is really about event-handling, and that most of the applications that will deliver cloud revenues to providers in the future aren’t even written yet, or are just now being started.

All of this demands that we rethink what “the cloud” is.  It’s not a pool of resources designed to deliver superior capital economies of scale.  It’s a pool of resources that are widely distributed, pushed out to within ten miles or less of almost every financially credible user and within forty miles of well over 99% of all users.  It’s about features, not applications, being hosted.  It’s about things that are cheaper because they’re rarely done and widely distributed when they are, not about centralized traditional OLTP.

SDN and NFV are consequences of the “true cloud”, applications of it, and elephant-behind-the-curtain glimpses of the final truths of the cloud.  If we have what is almost a continuous global grid of computing power, we obviously need to think about connecting stuff differently, and similarly have to start thinking about what we could do to utilize that grid to simplify other distributed applications, which networking clearly is.  But if both SDN and NFV only glimpsed the truth of the cloud future, what would give us a better look?

Let’s start with SDN.  The notion behind SDN is that adaptive networks whose intrinsic protocols and service protocols are the same are restrictive.  Sure, they work if your goal is to provide connection services alone, but if we have this enormous fabric of computing out there, most of our connectivity is within the fabric and not between users.  SDN’s most popular mission today is that of creating extemporaneous private LANs and WANs for cloud hosting.  But SDN still focuses on connections—it just makes them less “adaptive” and more centrally controlled.  Is that the real solution?

Mobile networks kind of prove it isn’t.  We have this smartphone that’s hauled about on errands and business trips, and we have to adapt the networks mightily (via the Evolved Packet Core or EPC) to let users sustain services while roaming around.  Even recent work on what we could call “location-independent routing” falls short of what’s needed.  Most cloud networking will depend on what we could call “functional routing” where the packet doesn’t specify a destination in an address sense at all, but rather asks for some form of service or service feature.

A generated event may have a destination, but that’s more an artifact of how we’ve built event processing than of the needs of the event.  Current trends toward serverless (meaning functional, lambda, or microservice) computing demonstrate that we don’t have fixed hosting points for things, in which case we really don’t have fixed addresses for them either.  That’s what we need to be looking at for the cloud-centric future.
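
To make “functional routing” a bit more concrete, here’s a minimal sketch (invented names, not a real protocol): the request names a function rather than an address, and a resolver picks whichever instance is currently available to serve it.

```python
import random

class FunctionalRouter:
    """Routes requests by function name rather than by destination address."""
    def __init__(self):
        self.instances = {}   # function name -> list of currently available instances

    def register(self, function: str, instance: str):
        self.instances.setdefault(function, []).append(instance)

    def route(self, function: str, payload: dict) -> str:
        candidates = self.instances.get(function, [])
        if not candidates:
            raise LookupError(f"no instance currently offers '{function}'")
        chosen = random.choice(candidates)   # stand-in for proximity/load-aware selection
        return f"'{function}' handled by {chosen} with {payload}"

router = FunctionalRouter()
router.register("image-classify", "edge-pod-chicago-7")
router.register("image-classify", "edge-pod-newark-2")
print(router.route("image-classify", {"frame": "street-cam-41"}))
```

Nothing in the request pins the work to a host; the binding to a real instance is made at the moment of use, which is the property serverless computing is already demonstrating.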

Then we have NFV.  We build networks by connecting trunks through nodes.  Nodes are traditionally purpose-built devices like routers, and NFV was aimed at making a node into a software instance that could be hosted somewhere.  Where?  Today, the notion would be in general-purpose virtual CPE boxes on premises, or in a fairly limited number of operator data centers.  But if we have a global compute fabric in the cloud, does that make sense?

A network built from hosted software instances of routing functionality doesn’t differ all that much from one built using appliances.  Same trunks, same locations, since most operators would host their virtual functions in the same places they now house network devices.  The specific target for NFV was the non-connective appliances like firewalls and encryption elements, or embedded functions like IMS and EPC.  These features would almost surely be radically changed if we shifted from a user-connecting to a cloud-connecting mission.  Many of the things these appliances do wouldn’t be as relevant, or perhaps wouldn’t be relevant at all, because the focus would have shifted away from traditional “connection” services.

We are not, or should not be, trying to build today’s networks in a somewhat different way.  The cloud is already demonstrating that we’ll be composing services more than delivering them, and that the process of composition will render the communications needs of the future in a totally different way.  I’d bet you that engineers at Google have already started to work on the models of addressing and networking that the future will require.  I think it’s likely that Amazon and Microsoft are doing the same.  I’d bet that most network operators have done nothing in that space, and few network equipment vendors have either.

SDN and NFV were never transformative technologies, because technologies are really not transformers as much as they are enablers of transformation.  The cloud is much more fundamental, in no small part because software that we ran decades ago would still run and be useful today.  The model of computing has not changed, and that may be a big piece of why the model of networking has also been static.  Computing is now changing, and changing radically, and those changes are already unlocking new service models, because software processes are what the cloud changes fundamentally and software processes are what create the services of the future.

What Role Can AI Play in Service Lifecycle Automation?

I hate to open another hype can of worms, but this is a question that has to be asked.  Is there a role for artificial intelligence (AI) in service lifecycle automation, virtualization, SDN, and NFV?  The notion raises the specter of a glittering robot sitting at a console in a network operations center, and surely, we’re going to be seeing media stories along these lines, because excitement and not realism is the goal.  Underneath the overstatement and underthinking, though, there may be some very interesting truths.

I asked an old friend who runs a big NOC my opening question, and the first response was a wry “I’m wondering if there’s a role for natural intelligence there!”  On further consideration, I got what I think is the starting point for the whole discussion, “It depends on what you mean by AI.”

What Wikipedia says about AI is: “In computer science, the field of AI research defines itself as the study of ‘intelligent agents’: any device that perceives its environment and takes actions that maximize its chance of success at some goal.  Colloquially, the term ‘artificial intelligence’ is applied when a machine mimics ‘cognitive’ functions that humans associate with other human minds, such as ‘learning’ and ‘problem solving’.”

If we take the formal, first, definition, it’s pretty clear that service lifecycle automation would fall within the scope of AI.  If we take the second, it’s a bit fuzzier, and to decode the fuzz we have to look deeper at the mission itself.

Service lifecycle automation is based on the notion that a given network service has a desired state of behavior, one that was sold to the service user and designed for by the network engineers who put the whole thing together.  The purpose of service lifecycle automation is to get a newly ordered service into that preferred state and keep it there for the lifetime of the service.  At the end of that lifetime, any capacity or resources would be returned and the service would no longer be available.

Not even a human operator would be able to perform this task without knowing what the preferred state of the service was, and also the current state of the service.  Generally, NOC personnel managing service lifecycles manually would respond to a condition that changed service state from the goal state, and that response would be expected to get things right again.  This process has become increasingly complicated as services and infrastructure become more complicated, and as a result there’s been growing focus on automating it.

DevOps tools are an example of automation of software deployment tasks, and much of network service lifecycle automation is based on similar concepts.  DevOps supports either the recording of “scripts”—series of steps that can be invoked manually or in response to an event—or the definition of a set of possible states and, for each state, the processes that move the service back toward that ideal state.
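
A minimal sketch of the state/event style of lifecycle definition might look like this; the states, events, and process names are invented for illustration.

```python
# state/event table: (current state, event) -> (process to run, next state)
LIFECYCLE = {
    ("ordered",    "activate"):     ("deploy_resources",  "activating"),
    ("activating", "deploy_ok"):    ("start_monitoring",  "in_service"),
    ("activating", "deploy_fail"):  ("roll_back",         "ordered"),
    ("in_service", "fault"):        ("remediate",         "degraded"),
    ("degraded",   "remediate_ok"): ("start_monitoring",  "in_service"),
    ("in_service", "terminate"):    ("release_resources", "decommissioned"),
}

def handle(state: str, event: str) -> str:
    process, next_state = LIFECYCLE[(state, event)]
    print(f"[{state}] {event} -> run {process}(), move to [{next_state}]")
    return next_state

state = "ordered"
for event in ("activate", "deploy_ok", "fault", "remediate_ok", "terminate"):
    state = handle(state, event)
```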

In AI terms, the person who puts together these scripts would be called a “subject matter expert”, and would be expected to sit down with a “knowledge engineer” to transfer expertise into machine-applicable form.  I would argue that this is what happens when you do DevOps, or when an expert defines the state/event and event/process relationships associated with service lifecycle management.  This is why I think that the first definition of AI is met by the kind of service lifecycle automation I’ve been blogging about for years.

The real AI question, then, is that second part of the definition, coupled with the pithy comment of the NOC manager.  Would there be a value to a “cognitive” component to service lifecycle automation, one that perhaps involved “learning” and “problem-solving?”  If so, what might this component look like, how would it help, and how could it be implemented?

Most NOC people and operations specialists I’ve talked with say that they would not want a service lifecycle automation system to simply run off and do stuff in defiance of specific rules to the contrary, any more than they’d want ops personnel to do that.  That means that if we have a specific condition and specific instructions to do something when that condition arises, the AI system should do it.  No cognition there.

However, every NOC/operations type knows that there comes a time in service lifecycle management known as the “Oh **** moment”.  Everyone in technology operations has experienced one of these.  They usually happen for one of two reasons.  First, a bunch of bad things happen all at the same time.  In state/event terms, this means that you either have a flood of events or you have a combination of events that you never thought about, and didn’t create a state for.  The second reason is that you take what is supposed to be a restorative action and things get worse rather than better.

I saw the latter personally in the healthcare industry.  A seemingly minor parameter error was made in setting up a bunch of routers.  The result was that on the next full working day, connections between workers and critical applications began to fail.  The normal remedial process, which was to simply reset and restore the connections, made things worse.  The final step was to assume that the host software had gone bad and reload/restart.  That killed everything.

You can make a darn convincing argument that machine cognition could have been valuable at this point.  The same can be said in any situation where there are either a bunch of bad things (which implies common cause) or a remediation failure (which implies unexpected consequences).  However, it may well be that these issues are beyond the range of reasonable AI response, at least in the near term.

In my healthcare example, diagnosis of the problem required a combination of experience that was apparently uncommon enough to not occur either in the healthcare provider or the router vendor organization.  Might the necessary skills have been “taught” to AI?  Perhaps, if somebody were willing to fund an enormous collection of subject-matter experts and the even-more-expensive dumping of their brains into a database.

A real advance in AI application to service lifecycle management would have to come, IMHO, from a combination of two factors.  First, we’d need to be able to substitute massive analytics for the subject matter expert.  Collecting data from the history of a given network, or in fact from all networks, might allow us to create the inputs about correct and incorrect operation, and the impacts of various restorative steps, that a knowledge engineer would need.  Second, we need an on-ramp to this more AI-centric state that’s not overly threatening.

What would be useful and incremental, perhaps, is what could be called the “Oh ****” state and an event of the same name.  The reception of the so-named event in any operating state causes a transition to the you-know-what state, where a process set designed to bring order out of presumptive chaos is launched.  That implies a designed-in capability of restoring the state of everything, perhaps by some form of by-domain redeployment.
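
Building on that same state/event style, the idea is a catch-all transition: the panic event is legal from any operating state, and any unanticipated state/event combination is treated the same way (again, the names are illustrative).

```python
PANIC_EVENT = "oh_no"      # the politely renamed "Oh ****" event
PANIC_STATE = "recovery"

TABLE = {("in_service", "fault"): ("remediate", "degraded")}

def handle_with_panic(state: str, event: str) -> str:
    if event == PANIC_EVENT or (state, event) not in TABLE:
        # the panic event, or any state/event combination we never planned for,
        # drops the service into a recovery state with a restore-to-known-good process
        print(f"[{state}] '{event}' -> run restore_known_good_state(), move to [{PANIC_STATE}]")
        return PANIC_STATE
    process, next_state = TABLE[(state, event)]
    print(f"[{state}] '{event}' -> run {process}(), move to [{next_state}]")
    return next_state

print(handle_with_panic("in_service", "fault"))               # normal, planned handling
print(handle_with_panic("in_service", "oh_no"))               # explicit panic
print(handle_with_panic("degraded", "power_and_fiber_cut"))   # unplanned combination
```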

There is an AI opportunity here, because it would be difficult for human operators to catalog the states of chaotic network infrastructure.  Analytics and AI principles could be used to match behavior patterns observed in the past with the way the situation developed and how it progressed.  This could then be used to decide what action to take.  In effect, AI becomes a backstop for policies, where policies simply haven’t, or can’t, be developed.
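
One way to picture that backstop is a toy nearest-pattern lookup; I’m not claiming this is how a production AI system would work, only illustrating the matching idea.

```python
# past incidents: symptom pattern (a set of observations) -> action that worked
HISTORY = [
    ({"bgp_flap", "high_cpu", "config_push"}, "roll back last configuration change"),
    ({"link_down", "optical_alarm"},          "reroute and dispatch fiber repair"),
    ({"session_failures", "dns_timeouts"},    "restart resolver pool"),
]

def suggest_action(observed: set) -> str:
    """Pick the historical pattern with the greatest overlap with what we see now."""
    def overlap(item):
        pattern, _ = item
        return len(observed & pattern) / len(observed | pattern)
    pattern, action = max(HISTORY, key=overlap)
    return action

print(suggest_action({"high_cpu", "config_push", "session_failures"}))
```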

From there, of course, you could expand the scope.  Over time, NOC personnel and operations executives might become more comfortable with the idea of having their own rules of network state and lifecycle management supplanted by machine decisions.  If the automated processes can prove themselves where humans can’t go, it’s much easier to commission them to perform routine tasks.  People, having experience with AI operating where policies can’t easily be formulated, will eventually allow AI to formulate policies in an ever-broader way.

This isn’t going to be an overnight process, though.  AI still has to work from some base of knowledge, and just gathering data isn’t going to deliver everything it needs.  Even a long historical data timeline isn’t the full answer; networks rarely operate with unchanged infrastructure for even a year.  We’ll need to examine the question of how to gather and analyze information on network conditions to get the most out of AI, and we’ll probably need humans to set policies for a long time to come.

Lessons the Fiber Market Can Teach Networking Overall

We already know that fiber technology can be divided into families based on features—long-haul versus access, passive optical versus point-to-point or active reconfigurable.  We might be seeing signs that it is dividing at the business level, into what might be called “tactical fiber” and “strategic fiber”.  If so, that could have major implications on the networking market in general, and for L2/L3 vendors in particular.  Finally, if all this is true, then the quarters reported by fiber firms Ciena and Infinera may demonstrate the state of market recognition of that tactical/strategic difference.

There has always been a tension between the strategic and the tactical in technology.  Do you see the future as something that develops slowly from the present, or something that responds to radical new technologies and opportunities in a totally new way?  Sales people generally don’t like strategy—it takes too long.  Over the last decade, strategy has fallen a bit out of favor even with senior executives and the Street, because the financial markets live quarter by quarter.  However, there are times when tactics won’t get you to where you need to be.

Look at Infinera, whose future according to the Street is subsea transport and data center interconnect (DCI).  OK, we have oceans to cross.  We have data centers to connect.  However, neither application is the operator goal; it’s not like subsea fiber is a migration path for data or that data centers demand interconnection.  There’s a business model lurking here somewhere, or there should be.  There are tactical fiber roles within that model, but if you’re an operator with business problems to solve, do you look for those tactics, or for the strategies that address your problem?

The bellwether quote from the Infinera earnings call (CEO Fallon) is “Looking to the future, we see opportunities in the horizon stemming from architectural evolutions that are at the beginning of their planning phase, particularly around fiber densification in the metro for cable operators and 5G for mobile service providers. Additionally, the enormous growth in cloud services is likely to persist, which will increasingly require the most scalable and cost-efficient transport networks.

During this period of architectural evolution, our technology approach will allow us to deliver the most reliable, high-capacity, power-optimized solutions in form factors that our customers want to deploy, both as integrated solutions and purpose-built platforms.”

The statement says that the future is a product of strategic initiatives by operators, which will develop into tactical opportunities for Infinera.  They’ll supply the elements needed by the buyer to fulfill buyer needs, and that’s a tactical mission.

Ciena serves the same overall markets, and they sell products tactically, but they position themselves much more as a strategic player.  If you look at the transcripts for the current-quarter earnings calls of the two companies, you find a stark difference.

The focus quote from Ciena’s earnings call (CEO Smith) is “…we also introduced Liquid Spectrum, our approach to redefining really how optical networks are fundamentally built. This software driven platform combines our Blue Planet software, WaveLogic Ai chipset and a reconfigurable Photonic Layer. This is a truly unique offering in the market. It is the first dynamic capacity on demand solutions that does not require pre-deploying hardware for peak periods of traffic. And I want to be very clear, our success to-date in the market does not take into account the technology advantages yet to come with WaveLogic Ai, which when combined with our global scale and deep customer relationships positions us extremely well to capture additional market share globally.”

This says that Ciena has a strategic vision for fiber networking, has built symbiotic product elements to support that vision, and is reaping the benefit of the strategic vision in deep customer relationships.  A “unique offering” is something that requires context to plan, context to sell.  It’s not a tactical vehicle at all.

The day after its quarterly announcement, which was a miss on revenue, Infinera’s stock dropped almost 10%.  The day after Ciena’s quarterly announcement, which was a beat on revenue and backed up with strong guidance, its stock gained about the same amount.  So, the obvious question is whether the difference in outcomes was created in large part by the difference between tactical market behavior and strategic behavior.

Do you think there are any major network operators anywhere in the world who believe they can continue with infrastructure plans as they are today?  Are there any who see the future as just a somewhat refined version of the present?  I don’t talk to any with those views, and if that’s a reliable indicator then I think you have to presume that operators are looking for dialogs with vendors who can tell them what the infrastructure plans for the future should be.  Yes, we’ll have submarine cables, metro fiber, data center interconnect.  They’ll fit into a different network and service model, though.

So, we are back to context.  The fiber tactic of the future has to fit the context of the future, not the one of the present.  Vendors who position their wares in such a way as to demonstrate understanding of the future are more credible positioning for that future context.  Even in an earnings call, and even at the CEO level, Ciena wins in that regard.  That’s especially true given that I think Ciena could have done even better in strategic positioning, and gained even more.

Is the context lesson one that should be learned by other classes of vendor?  Certainly, server vendors or network software vendors should take the issue seriously.  I’ve seen literally half-a-hundred major pitches on “transformation” from vendors that define the term as “transforming your purchase of a competitor’s gear to your purchase of mine.”  Just because the network of the future might need “billing” doesn’t mean that you start your transformation off by buying billing software, or that just buying billing software transforms you.  Operators tell me they value a vendor’s vision of next-generation infrastructure as much as or more than they value individual products.  Another vote for strategy.

How about those L2/L3 vendors I mentioned at the opening of this blog?  At one level, understanding the truth is always helpful, so they could benefit in that sense by understanding the strategic context of their buyers.  On another level, facing it often isn’t helpful at all.  Fiber is the strategic lingua franca of the future network, because you can’t have a network without it.  Fiber vendors have a lovely positional advantage because they have a position.  Same for server/software vendors; if you want to have virtualization in the future, then servers and software are inescapable.  For the rest of the network, the other layers, there is no such certainty of mission.  A switch/router vendor might be arguing for a strategic future designed to work without switches or routers.

If we could create a full-optical mesh between users at an affordable price and using low-touch technology, nobody would want an electrical layer at all.  What the electrical layer does is aggregate and distribute traffic to build a service footprint larger by far than a practical fiber footprint.  If you make the optical layer cheaper (in particular, reduce the marginal cost of fiber capacity so fatter trunks aren’t disproportionately expensive) and improve optical-layer connectivity, you reduce the need for the electrical layer.  If you add SDN virtual-path grooming to optics, you reduce things further, and still further if you add hosted instances of switching/routing and SD-WAN technology.  That’s what makes strategic positioning of L2/L3 products difficult; you may be positioning for absence as much as for change.

However, the story of the two fiber families can have a happy ending.  Suppose, just suppose, that it indicates that buyers are starting to look past the ends of their noses, and that some sellers, at least, are seeing the change in focus.  We then have the means and the opportunity to think about that strategic future, and shape it.


What Would the “Right” Model for SDN, NFV, and Virtualization Look Like?

There are sometimes concrete answers to abstract questions.  In some cases, the reason why those answers don’t seem to gain traction or acceptance is that they come from a different sector.  So it is today with network transformation or “next-gen” networks.  We have spent half-a-decade framing principles that were already framed better elsewhere, and with every day that passes, those “elsewheres” are moving forward toward even-more-useful developments, while networking seems stuck in neutral.

It didn’t have to be this way.  Back in 2013 when I worked with a group of vendors to launch the CloudNFV initiative, we submitted a proof-of-concept proposal to ETSI, which was the first to be formally approved.  In 2014, when my commitment to offer free services to the activity expired, the PoC was modified considerably, but I’ve posted the original PoC HERE for review by those who missed the original document.  I want to call your attention in particular to Section 1.2 PoC Goals, and to the eight goals defined there.  What you will see is that every single goal that’s emerged from years of subsequent work was explicitly stated in that section.  Interoperability?  It’s in there.  Conformance to current management practices?  There too.  Infrastructure requirements?  There.  Onboarding is there.  Portability is there too.

The PoC defined a software architecture built around a data model (“Active Contract”) that conformed to the principles of the TMF’s NGOSS Contract.  The architecture was mapped to the ETSI E2E model, and the early PoC phases were intended to demonstrate that the architecture as defined would conform to that model’s requirements and the underlying ETSI work, and was also scalable.  The first phase of the PoC was demonstrated to the two sponsor operators (Sprint and Telefonica) before the PoC goals and structure were changed.

We picked the name “CloudNFV” for a reason; we believed that NFV was first and foremost a cloud application.  You can see in the implementation the same principles that are emerging in the cloud today, particularly in Amazon’s AWS Serverless Platform.  We have controllable state and small scalable processes that draw their state information from a data model, making them scalable and distributable.  In short, the functionality of NFV and the scalability of an implementation were designed into the model using the same principles as we’ve evolved for the cloud, which is what most cloud technologists would have recommended from the first.

I’m opening with this explanation because I want to demonstrate that it is not only possible to take a top-down view of the evolution to virtual networking, it’s also possible to harmonize it with the functional concepts of ETSI and the technology evolution of the cloud.  Having set that framework, I now want to talk about some specific technology rules we could apply to our next-gen evolution.

The clearest truth is that we have to start thinking of networks as being multiplanar.  There’s a critical abstraction or model of a service, created from a combination of TMF NGOSS Contract principles and modern intent modeling.  Then we have traffic and events living in literally different worlds.  To the extent that network traffic is either aggregated or representing operator services aimed at linking specific business sites, the traffic patterns of the network are fairly static.  We can visualize future networks as we do present ones, as connections and trunks that link nodal locations that, under normal conditions, stay the same over long periods.

However, “long” doesn’t mean forever, and in virtualization it doesn’t mean as long as it means in traditional networks.  Networks are created today by devices with specialized roles.  Those devices, being physical entities, are placed somewhere and form the nexus of traffic flows.  In the virtual world, we have a series of generalized resources that can take on the role of any and all of those physical devices.  You can thus create topologies for traffic flows based on any of a large set of possible factors, and put stuff where it’s convenient.

It’s all this agility that creates the complexity at the event level.  You have a lot of generalized resources and specific feature software that has to be combined and shaped into cohesive behaviors.  Not only does that shaping take signaling coordination, so do the ongoing life-sustaining activities associated with each of the elements being shaped.  This is all complicated by the fact that since all the resources are inherently multi-tenant, you can’t expose the signaling/event connections for general access or attack.

In the world of virtualized-and-software-defined networks, you have a traditional-looking “structure” that exists as abstract flows between abstract feature points, defined by the service model and its decomposition.  This virtual layer is mapped downward onto a “traffic plane” of real resources.  The “bindings” (to use the term I’ve used from the first) between the two are, or should be, the real focus of management in the future.  They represent the relationships between services (which means informal goals and expectations ranging onward to formal SLAs) and resources (which are what break and have to be fixed or replaced).  Explicit recognition of the critical role of bindings is essential in structuring software and understanding events.
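
A sketch of what a binding might look like as data (the field names are mine, for illustration) shows why it’s the natural pivot for management: you can navigate from a failed resource up to the services it touches, or from a service down to the resources that carry it.

```python
from dataclasses import dataclass

@dataclass
class Binding:
    """Ties an abstract service-model element to the resource realizing it."""
    service_id: str        # which sold service this belongs to
    model_element: str     # functional element in the service model (e.g. "firewall")
    resource_id: str       # the real resource currently providing it
    domain: str            # the location/management domain where it lives

BINDINGS = [
    Binding("svc-001", "vpn-core", "router-cluster-12", "metro-chicago"),
    Binding("svc-001", "firewall", "server-774/vm-3",   "dc-chicago-2"),
    Binding("svc-002", "firewall", "server-774/vm-9",   "dc-chicago-2"),
]

# Navigation works both ways: from a broken resource to the services it affects...
impacted = {b.service_id for b in BINDINGS if b.resource_id.startswith("server-774")}
print("services touching server-774:", impacted)
# ...and from a service SLA question down to the resources that carry it.
print("resources behind svc-001:", [b.resource_id for b in BINDINGS if b.service_id == "svc-001"])
```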

When a service is created, the process starts with the abstract flow-and-feature structure at the top, and is pressed downward by creating bindings through a series of event exchanges.  With what?  On one end, obviously, is the resource being bound.  At the other end, staying with our abstraction, is the model.  Obviously abstract things can communicate only abstractly, so we need to harden the notion of the model, which is the goal of the software architecture.

Logically speaking, a local binding could be created by spawning a management process in the locality and giving it the abstract model as a template of what needs to be done.  We don’t need some central thing that’s doing all of this, and in fact such a thing is an impediment to scalability and resiliency.  All we need is a place to host that local process, and the data-model instruction set.  The data model itself can hold the necessary information about the process to permit its selection and launching, and then the process takes over.  NFV, then, is a series of distributed processes that take their instructions from detailed local models, coordinated by the fact that all those local models have been decomposed from a higher-level model.
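
Here is a minimal sketch of that idea, assuming (purely for illustration) that each model fragment names the management process to run; the process names, fields, and catalog are all hypothetical:

```python
# Model-driven process launching: the model fragment names the management
# process, and a generic, stateless dispatcher looks it up and invokes it
# with the fragment as its instructions.  No central orchestrator is needed.

def deploy_firewall(fragment):
    print(f"deploying firewall in {fragment['domain']} per {fragment['params']}")

def configure_legacy_vpn(fragment):
    print(f"configuring legacy VPN in {fragment['domain']} per {fragment['params']}")

PROCESS_CATALOG = {
    "deploy_firewall": deploy_firewall,
    "configure_legacy_vpn": configure_legacy_vpn,
}

def launch_local_process(model_fragment):
    """Spawn the management process the model fragment asks for.

    The dispatcher is stateless; everything it needs is in the fragment,
    so it can run anywhere a host is available in the target domain.
    """
    process = PROCESS_CATALOG[model_fragment["process"]]
    process(model_fragment)

launch_local_process({
    "process": "deploy_firewall",
    "domain": "metro-east/dc-03",
    "params": {"throughput_gbps": 10},
})
```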

This is what frames the hierarchy that should be the foundation of the next-gen network software.  We need to “spawn a local management process”, which means that we must first use policies to decompose our global service abstraction into something that looks like a set of cooperative locations, meaning domains.  How big?  There’s no fixed size, but we’d likely say that it couldn’t be bigger than a data center or data center complex that had interconnecting pipes fast enough to make all the resources within look “equivalent” in terms of performance and connectivity.  High-level model decomposition, then, picks locations.  The locations are then given a management process and a piece of the model to further decompose into resource commitments, via those critical bindings.
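
A minimal sketch of that decomposition step follows; the decomposition policy used here (match customer sites to the domains that serve them) and all the names are assumptions made for illustration:

```python
# Decomposition: a global service abstraction is split, by policy, into
# per-location model fragments, and each fragment would then be handed to a
# management process hosted in that location domain.

SERVICE = {
    "name": "AcmeVPN",
    "sites": ["boston-hq", "nyc-branch", "dc-branch"],
}

# Which location domain serves which site (normally derived from inventory).
SITE_TO_DOMAIN = {
    "boston-hq": "metro-boston/dc-01",
    "nyc-branch": "metro-nyc/dc-02",
    "dc-branch": "metro-dc/dc-03",
}

def decompose(service):
    """Split the service into per-domain model fragments."""
    fragments = {}
    for site in service["sites"]:
        domain = SITE_TO_DOMAIN[site]
        fragments.setdefault(domain, {"service": service["name"],
                                      "domain": domain, "sites": []})
        fragments[domain]["sites"].append(site)
    return fragments

for domain, fragment in decompose(SERVICE).items():
    # A real system would spawn a local management process in 'domain';
    # here we just show the piece of the model each domain would receive.
    print(domain, "->", fragment)
```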

The bindings are also critical in defining the relationship between locations, which remains important as we move into the in-service phase of operation.  A “primary event” is generated when a condition in a real resource occurs, a condition that has to be handled.  The big management question in virtual networking is what happens next, and there are two broad paths—remediation at the resource level, or at a higher level.

Resource-level remediation means fixing the thing that’s broken without regard for the role(s) it plays.  If a server fails, you substitute another one.  This kind of remediation depends on being able to act within the location domain where the original resource lived.  If I can replace a “server” in the same location domain, that’s fine.  The local management process can be spun up again (there’s no reason for it to live between uses), access the data model, and repeat the assignment process for each of the “services” impacted.  And we know what those are, because the local data models collectively contain the bindings to that resource.
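
As a sketch of that flow, under the assumption (hypothetical) that bindings and spare inventory are available as simple local tables:

```python
# Resource-level remediation: on a resource failure, the local management
# process finds every binding to it, picks a replacement in the same location
# domain, and repeats the assignment for each impacted service.

BINDINGS = [  # (service_element, resource_id)
    ("AcmeVPN/Access/Firewall", "dc-03/server-7"),
    ("BetaSD-WAN/Edge/Router", "dc-03/server-7"),
    ("AcmeVPN/Core/Router", "dc-03/server-9"),
]

SPARE_SERVERS = ["dc-03/server-11", "dc-03/server-12"]

def remediate_resource_failure(failed_resource):
    """Re-run the assignment for every service bound to the failed resource."""
    impacted = [svc for svc, res in BINDINGS if res == failed_resource]
    if not SPARE_SERVERS:
        # No local replacement available: escalate to the next-level domain.
        return ("escalate", impacted)
    replacement = SPARE_SERVERS.pop(0)
    for svc in impacted:
        print(f"rebinding {svc}: {failed_resource} -> {replacement}")
    return ("remediated", impacted)

print(remediate_resource_failure("dc-03/server-7"))
```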

Higher-level remediation is needed when replacing the resource locally isn’t possible, or when the problem we’re having doesn’t correlate explicitly to a single resource.  It’s easy to imagine what causes the first of these conditions—we run out of servers, for example.  For the second, the easy examples are an end-to-end event generated at the service level, or a service change request.

So, if the resource remediation process runs with the resources, in the local domain, where does the higher-level process run?  Answer: in the location that’s chosen in the modeling, which is where the resource domains are logically connected.  Data-center domains might logically link to a metro domain, so it’s there that you host the next-level process.  And if whatever happens has to be kicked upstairs, you’d kick it to the next-level domain based on the same modeling policy, which is the logical inverse of the process of decomposing the model.

At any level in the remediation or event analysis process, the current process might need to generate an upstream event.  That event is then a request for the next-level management process to run, because in a logical, non-fragile model implementation, event generation can only jump to an adjacent level (up or down).  A single resource event might follow a dozen service bindings back to a dozen higher-level processes, each of which could also generate events.  This is why event and process management is important.

And the service model controls it all.  It’s the model that’s specific, but even the model is distributed.  A given layer in the model has to know/describe how it’s implemented in a local resource domain, and how it binds to its adjacent domains (upward toward the services, downward toward the resources).  That’s all it needs.  Each piece of functionality runs in a domain and is controlled by the local piece of that global distributed model.

But it’s scalable.  There is no “NFV Manager” or “MANO” or even Virtual Infrastructure Manager, in a central sense; you spin up management functions where and when you need them.  As many as you need, in fact.  They would logically be local to the resources, but they could be in adjacent domains too.  All of these processes could be started anywhere and run as needed because they would be designed to be stateless, as lambda functions or microservices.  Amazon, Google, and Microsoft have already demonstrated this can work.

This is how the cloud would solve next-gen networking challenges.  It’s how SDN and NFV and IoT and 5G and everything else in networking that depends to any degree on virtual resources and functions should work.  It’s what cloud providers are already providing, and what network vendors and operators are largely ignoring.

All you have to do in order to validate this approach is look at how Amazon, Google, and Microsoft are evolving their public cloud services.  All of this has been visible all along, and even today it wouldn’t take an inordinate amount of time to create a complete implementation based on this approach.  I think that’s what has to be done, if we’re really going to see network transformation.  The future of networking is the cloud, and it’s time everyone faced that.

The Technical Pieces of a Successful NGN

What do we need, in a technical sense, to advance to next-generation networking?  Forget trite suggestions like “carriers need to change their culture” or “we need to focus on customer experience.”  When has any of that been in doubt, and how long has it been said?  If there are problems that need to be solved, what are they?  Three, in my view.  We need a good service modeling architecture, we need a framework for automating the service lifecycle, and we need to have a strong and scalable management architecture to bind services and resources.

To my mind, defining a good service modeling architecture is the primary problem.  We need one that starts with what customers buy, dives all the way to resource commitments, covers every stage of the service lifecycle, and embraces both the present and the future.  Forget technology in this effort; we should be able to do this in the abstract…because service models are supposed to be abstract.  The abstraction should cover four key points.

Point number one is hierarchical structure.  An object in the modeling architecture has to be able to represent a structure of objects that successively decompose from those above.   “Service” might decompose into “Core” and “Access”, and each of the latter might decompose based on technology and/or geography.

Point number two is intent-based elements.  An object in the architecture should have properties that are based on what it does, not how it does it.  Otherwise the object is linked to a particular implementation, which then limits your ability to support evolving infrastructure, multiple vendors, etc.

The third point is per-element state/event-based process mapping.  Each object needs to have a state/event table that defines the operating states it can be in, the conditions it expects to handle, and the processes associated with the state/event combinations.  “If State A and Event B, then Run Process C and Set State X” might be a form of expression.

Point four is parameter correlation: the parameters input into and expressed by an element must be related to the parameters received from the next-higher and next-lower elements.  If “Access” and “Core” report normal states, then “Service” does likewise.  Any numerical or Boolean property of an object would, if set, result in something being sent downward to its subordinates, and anything reported from below has to be transformed into the common form published by the layer above.
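
A minimal sketch that combines the four points (hierarchy, intent-style properties, a per-element state/event table, and parameter roll-up) might look like the following; the states, events, process names, and parameters are all assumptions made for illustration:

```python
# A model element that is hierarchical, describes intent rather than
# implementation, carries its own state/event-to-process table, and derives
# its reported state from its children.

class ModelElement:
    def __init__(self, name, intent=None):
        self.name = name
        self.intent = intent or {}          # what it does, not how
        self.children = []                  # hierarchical decomposition
        self.state = "Offer"
        # (state, event) -> (process, next_state)
        self.state_event_table = {
            ("Offer", "Order"): ("activate", "Activating"),
            ("Activating", "ActiveOK"): ("notify_parent", "Active"),
            ("Active", "Fault"): ("remediate", "Degraded"),
            ("Degraded", "Repaired"): ("notify_parent", "Active"),
        }

    def add(self, child):
        self.children.append(child)
        return child

    def handle(self, event):
        # A real implementation would dispatch the named process; here we log it.
        process, next_state = self.state_event_table[(self.state, event)]
        print(f"{self.name}: {self.state} + {event} -> run {process}, set {next_state}")
        self.state = next_state

    def reported_state(self):
        # Point four: the parent's reported state is derived from its children.
        if not self.children:
            return self.state
        return "Active" if all(c.reported_state() == "Active"
                               for c in self.children) else "Degraded"

service = ModelElement("Service", intent={"availability": 0.9999})
access = service.add(ModelElement("Access", intent={"access_bw_mbps": 100}))
core = service.add(ModelElement("Core", intent={"latency_ms": 20}))

for element in (access, core):
    element.handle("Order")
    element.handle("ActiveOK")
print("Service reports:", service.reported_state())
```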

The single biggest failing in our efforts to transform services and networks is the fact that we have not done the modeling work.  Let me make this point clearly: without a strong service model, there is no chance of transformation success.  And, folks, we still do not have one.

Service automation is the second problem we have to resolve.  If you have a good service model, then you can define good software processes to decompose the model and link lifecycle processes to states/events.  We’ve only recently started accepting the importance of service automation, but let me make it clear in financial terms.

This year, “process operations” costs, meaning costs directly attributable to service and network operations, will account for 28 cents of each revenue dollar.  If we were to adopt SDN and NFV technology in its limited, pure, standards-based form, at the largest plausible scale, we could reduce 2018 costs by a bit less than two and a half cents per revenue dollar.  If we were to adopt service automation principles absent any technology shifts whatsoever, we could reduce those 2018 costs by about 5.4 cents per revenue dollar, three cents more than the SDN/NFV figure and better than double the savings.  Furthermore, the investment needed to secure the 2.4 cents of SDN/NFV savings would be thirty-eight times the investment needed to secure the 5.4 cents of operations savings.

Perhaps one reason it’s complicated is that “service automation” is really a combination of two problems.  First is the already-mentioned lack of a good service modeling architecture.  The second is a scalable software architecture with which to process the model.  It does little good to have one without the other, and so far, we have neither.

I’ve been involved in three international standards activities and two software projects, aimed in part at service automation.  From the latter, in particular, I learned that a good model for scalable service management is the “service factory” notion, which says that a given service is associated with an “order” that starts as a template and is then solidified when executed.  I’ve tried out two different approaches to combining software and models, and both seem to work.

One approach is to use a programming language like Java to author services by assembling Java Classes into the service models.  This creates a combination of an order template that can be filled in, and a “factory” that, when supplied with a filled-in template (an “instance”), will deploy and manage the associated service.  You can deploy as many factories as you like, and since all the service data (including state/event data) lives in the instance, any factory can process any event for any service it handles.
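
The pattern is easier to see in code.  This is a sketch in Python rather than the Java the approach actually used, and every class, field, and event name here is hypothetical; the point it illustrates is that all service data, including state, lives in the instance, so any running copy of the factory can handle any event for any service built from its template:

```python
import copy

VPN_TEMPLATE = {
    "type": "VPN",
    "sites": None,           # filled in at order time
    "bandwidth_mbps": None,  # filled in at order time
    "state": "Offer",
}

class VpnServiceFactory:
    """Stateless factory: everything it needs arrives with the instance."""

    def instantiate(self, template, **order_details):
        instance = copy.deepcopy(template)
        instance.update(order_details)
        return instance

    def process_event(self, instance, event):
        if instance["state"] == "Offer" and event == "Order":
            print(f"deploying VPN for {instance['sites']}")
            instance["state"] = "Active"
        elif instance["state"] == "Active" and event == "Fault":
            print("redeploying the affected feature")
        return instance

# Any factory copy can handle any instance, because the instance is the state.
factory_a, factory_b = VpnServiceFactory(), VpnServiceFactory()
order = factory_a.instantiate(VPN_TEMPLATE, sites=["hq", "branch-1"], bandwidth_mbps=50)
order = factory_a.process_event(order, "Order")
order = factory_b.process_event(order, "Fault")
```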

The second approach is to have generalized software process a service data model and execute processes based on states and events.  To make this work, you need to make all the service lifecycle steps into state/event behaviors, so things might start with an “Offer” state to indicate a service can be ordered, and progress from there.
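
A minimal sketch of this second approach follows; the lifecycle states, events, and process names are illustrative assumptions, and the distinguishing feature is that the model is pure data driven by one generic engine:

```python
# Generalized software processing a service data model: the engine looks up
# (state, event) entries in the model and dispatches the named process.
# The lifecycle starts in an "Offer" state, as described.

def order_process(model):
    print(f"taking order for {model['name']}")

def deploy_process(model):
    print(f"deploying {model['name']}")

def remediate_process(model):
    print(f"remediating {model['name']}")

PROCESSES = {"order": order_process, "deploy": deploy_process,
             "remediate": remediate_process}

SERVICE_MODEL = {
    "name": "AcmeVPN",
    "state": "Offer",
    "lifecycle": {   # (state, event) -> (process, next_state)
        ("Offer", "Order"):      ("order", "Ordered"),
        ("Ordered", "Activate"): ("deploy", "Active"),
        ("Active", "Fault"):     ("remediate", "Active"),
    },
}

def handle_event(model, event):
    """Generic, stateless engine step: all state lives in the model itself."""
    process_name, next_state = model["lifecycle"][(model["state"], event)]
    PROCESSES[process_name](model)
    model["state"] = next_state

for event in ("Order", "Activate", "Fault"):
    handle_event(SERVICE_MODEL, event)
```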

My personal preference is for the second of the two approaches, but you can see that neither one bears any resemblance to the ETSI End-to-End structure, though I’ve been told by many that the model was never intended to be taken as a literal software architecture.  I think you can fit either approach to the “spirit of ETSI”, meaning that the functional progression can be made to align.

The final technical problem to be resolved in getting to next-gen networking is a management model.  Think for a moment about a global infrastructure, with tens of thousands of network devices, millions of miles of fiber, tens of thousands of data centers holding millions of servers, each running a dozen or so virtual machines that in turn run virtual functions.  All of this stuff needs to be managed, which means it needs attention when something happens.  How?

The first part of the “how?” question is less about method detail than about overall policy.  There are really two distinct management models, the “service management” and the “resource management” models.  Service management says that management policies are set at the point where services are offered to the customer, in terms of SLAs.  You thus report conditions that violate or threaten SLAs, and you use service policies to decide what to do.  Resource management says that resources themselves assert “SLAs” that define their design range of behavior.  You manage the resources against those SLAs, and if you’ve assigned services to resources correctly, you’ll handle services along the way.

We’ve rather casually and seemingly accidentally moved away from formal service management over time, largely because it’s difficult in adaptive multi-tenant services to know what’s happening to a specific service in case of a fault.  Or, perhaps, it would be more accurate to say that it’s expensive to know that.  The problem is that when you start to shift from dedicated network devices to hosted software feature instances, you end up with a “service” problem whether you want one or not.

The goal of management is remediation.  The scope of management control, and the range of conditions that it responds to, has to fit that goal.  We’re not going to be focusing on rerouting traffic if a virtual function goes awry; we’re going to redeploy it.  The conditions that could force us to do that are broad: the software could break, or the server, or some of the internal service-chain connections.  The considerations relating to where to redeploy are equally diverse.  So in effect, virtualization tends to move us back at least a bit toward the notion of service management.  It surely means that we have to look at event-handling differently.

Real-device management is straightforward; there’s a set of devices that are normally controlled by a management stack (“element, network, service”).  Conditions at the device or trunk level are reported and handled through that stack, and if those conditions are considered “events” then what we have are events that seek processes.  Those processes are large monolithic management systems.

In a virtual world, management software processes drift through a vast cloud of events.  Some events are generated by low-level elements or their management systems, others through analytics, and others are “derived” events that link state/event processes in the service models used for lifecycle management.  In the cloud world, the major public providers see this event model totally differently.  Processes, which are now microcomponents, are more likely to be thrown out in front of events, and the processes themselves may create fork points where one event spawns others.

The most important events in a next-generation management system aren’t the primary ones generated by the resources.  These are “contextless” and can’t be linked to SLAs or services.  Instead, low-level model elements in modern systems will absorb primary events and generate their own, but this time they’ll be directing them upward to the higher-level model elements that represent the composition of resource behaviors and services.  Where we run the event processes for a given model element determines the source of the secondary events, and influences where the secondary processes are run.
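
A sketch of that primary-to-secondary translation follows, using the same hypothetical binding and element names as earlier sketches; nothing here comes from a standard:

```python
# A contextless resource event is absorbed by the lowest-level model elements
# bound to that resource, and each emits a "derived" event upward, one level
# only, now carrying service context.

BINDINGS = {  # resource -> low-level model elements bound to it
    "dc-03/server-7": ["AcmeVPN/Access/Firewall", "BetaSD-WAN/Edge/Router"],
}

PARENT = {  # each model element knows only its adjacent (parent) element
    "AcmeVPN/Access/Firewall": "AcmeVPN/Access",
    "BetaSD-WAN/Edge/Router": "BetaSD-WAN/Edge",
}

def handle_primary_event(event):
    """Absorb a contextless resource event and emit derived events upward."""
    derived = []
    for element in BINDINGS.get(event["resource"], []):
        derived.append({
            "type": "DerivedFault",
            "element": PARENT[element],     # adjacent level only
            "source_element": element,
            "cause": event["condition"],
        })
    return derived

primary = {"type": "ResourceFault", "resource": "dc-03/server-7",
           "condition": "host-down"}
for secondary in handle_primary_event(primary):
    print(secondary)
```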

“Functional programming” or “lambda processing” (and sometimes even “microservices”, in Google’s case) are the software terms used to describe the style of development that supports these microcomponent, relocatable, serverless, event-driven systems.  We don’t hear about this in next-gen networking, and yet it’s pretty obvious that the global infrastructure of a major operator would be a bigger event generator than enterprises are likely to be.

The event management part of next-generation networks is absolutely critical, and so it’s critical that the industry takes note of the functional programming trends in the cloud industry.  If there’s anything that truly makes the difference between current- and next-generation networks, it’s “cloudification”.  Why then are we ignoring the revolutionary developments in cloud software?

That should be the theme of next-gen networking, and the foundation on which we build the solutions to all three of the problems I’ve cited here.  You cannot test software, establish functional validity of concepts, or prove interoperability in a different software framework than you plan to use, and need, for deployment.  The only way we’re going to get all this right is by accepting the principles evolving in cloud computing, because we’re building the future of networking on cloud principles.  Look, network people, this isn’t that hard.  We have the luxury of an industry out there running interference for us in the right direction.  Why not try following it instead of reinventing stuff?  I’ll talk more about the specifics of a cloud-centric view in my blog tomorrow.

Can We Answer the Two Top Operator Questions on Service Lifecycle Automation?

Operators tell me that they are still struggling with the details of service lifecycle automation, even though they’re broadly convinced that it’s a critical step in improving profit per bit.  There are a lot of questions, but two seem to be rising to the top of everyone’s list, so exploring both the questions and possible answers could be valuable for operators, and thus for the industry.

Service lifecycle automation means creating software processes to respond to service and network conditions, rather than relying on a mixture of network operations center reactions and both contemporaneous and delayed responses in operations systems.  “Something happens, and the software reacts…completely, across all the impacted systems and processes, dealing both with service restoration and SLA escalation and remediation,” is how one operator put it.

That statement raises the first of the operator questions, which is how you can get service lifecycle management to cover the whole waterfront, so to speak.  There’s a lot of concern that we’ve been addressing “automation” in a series of very small and disconnected silos.  We look at “new” issues in orchestration and management, like those created by NFV hosting, but we don’t address the broader service/network relationships that still depend on legacy elements.  They point out that when you have silos, no matter where they are or why/how they were created, you have issues of efficiency and accuracy.

The silo problem has deep roots, unfortunately.  As networking evolved from circuit (TDM) to packet (IP), there was an immediate bifurcation of “operations” and “network management”.  Most of this can be traced to the fact that traditional OSS/BSS systems weren’t really designed to react to dynamic topology changes, congestion, and other real-time network events.  The old saw that OSS/BSS systems need to be “more event-driven” dates from this period.

The separation of operations and management has tended to make OSS/BSS more “BSS-focused”, meaning more focused on billing, order management, and other customer-facing activities.  This polarization is accentuated by the focus of operators on “portals”, front-end elements that provide customers access to service and billing data.  Since you can’t have customers diddling with network resources in a packet world, the move to portals forces a delineation of the boundary between business operations and network management.

One way to address this problem is to follow the old TMF model of the NGOSS Contract, which has since morphed into TMF053, the NGOSS Technology Neutral Architecture (TNA).  With this model, operations systems are implicitly divided into processes that are linked with events via the contract data model.  Thus, in theory, you could orchestrate operations processes through event-to-process modeling.  That same approach would work for service lifecycle automation, which would provide a unified solution, and you could in theory combine both in a single service model.  Operators like a single modeling language but are leery about having one model define both management and OSS/BSS lifecycles.
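
To make the combined-model idea more tangible, here is a purely hypothetical sketch (not drawn from TMF053 or any product) of one contract-style data model carrying two parallel event-to-process tables, one for network lifecycle processes and one for OSS/BSS processes:

```python
# One contract data model steering a single event to both the management
# process set and the operations process set.

CONTRACT = {
    "service": "AcmeVPN",
    "state": "Active",
    "network_events": {   # event -> network/management process
        "Fault": "redeploy_feature",
        "Congestion": "rebalance_capacity",
    },
    "oss_bss_events": {   # event -> operations/business process
        "Fault": "open_trouble_ticket",
        "SLAViolation": "apply_billing_credit",
    },
}

def steer(contract, event):
    """Route one event to both the management and the operations process sets."""
    targets = []
    for table in ("network_events", "oss_bss_events"):
        process = contract[table].get(event)
        if process:
            targets.append((table, process))
    return targets

print(steer(CONTRACT, "Fault"))
# -> [('network_events', 'redeploy_feature'), ('oss_bss_events', 'open_trouble_ticket')]
```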

That raises the operators’ second question, which is about events.  Lifecycle management is all about event-handling, and just having a nice data-model-driven approach to steering events to processes doesn’t help if you can’t generate the events or agree on what they mean.  This is particularly important when you are looking at multi-tenant infrastructure, where the relationship between a service and infrastructure conditions may be difficult to establish, and where correlation costs are such that the measures would be impossible to justify financially.

An “event” is a condition that requires handling, and it’s obvious that there are questions on what conditions should create events and how the events should be coded.  Ideally, you’d like to see service lifecycle events standardized at all levels, including how a pair of models—one managing network resources for services and the other operations processes for customers—would use events to synchronize their behavior.  Operators have no idea how that’s going to happen.

Events are critical in service automation, because they’re the triggers.  You can’t automate the handling of something you don’t know about, and “know about” here means having an unambiguous, actionable signal.  If something that’s dedicated to, and an explicit part of, a given service breaks, it’s easy to envision that such a signal could be produced, though differences in vendor implementations might require standardization of error conditions.  Where shared or virtualized resources fail, it’s another matter.

One problem is that there are a lot of different sources of “events”, and many of them are different pieces of infrastructure with different status conditions.  A server might report one set of events and a fiber trunk another.  How do you correlate the two, or even understand what they mean?  For example, a server might report an overheat warning.  Does that mean anything in terms of capability?  If a server reports overheating and a fiber trunk reports overloading, do the two have a common cause?

Another problem is that a condition in infrastructure doesn’t always impact all the services, so you have to understand what the scope of impact is.  A fiber failure surely impacts services that happen to use the fiber, but what services are those?  In order to find out for an IP service, you’d have to understand the addresses involved in the service and the way those addresses were routed.  And it’s more than just destination address.  Two different users accessing the same server might find that one is impacted by a fiber failure and the other is not.

“Analytics” is the most problematic source of events.  Analytics has a presumptive multiplicity of dimensions that separates it from simple status reporting.  Those added dimensions make it harder to say what a given analytics “prediction” or “event” would mean to services.  Last week, analytics might say, these conditions led to this result.  We already know that the result might be difficult to associate with the state of a given service, but we now add the question of whether the conditions last week had any specific relevance to that service last week.  If not, is there a reason to think we should care now?  Do we then do service-specific correlations for analytics?

Event correlation is critical to the notion of service automation, because it’s critical to establishing service conditions.  You can do infrastructure or resource management a lot more easily, because the event relationship with infrastructure is clear; events are generated there.  This means that it would probably make sense to separate services and infrastructure, so that the services of infrastructure (what I’ve always called the “resource layer”) are presented as services to the service layer.  Then you need only determine if a resource service (“behavior” is the term I use) is meeting its SLA, and you can readily generate standard events relating to SLA conformance.
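
A minimal sketch of that behavior-centered approach follows; the metrics, thresholds, and event names are assumptions made for illustration only:

```python
# A resource-layer "behavior" owns its own SLA, evaluates raw infrastructure
# telemetry against it, and publishes only standardized SLA-conformance events
# upward, so the service layer never has to correlate raw infrastructure
# conditions itself.

BEHAVIOR = {
    "name": "metro-east/low-latency-transport",
    "sla": {"latency_ms": 20, "packet_loss_pct": 0.1},
}

def evaluate(behavior, telemetry):
    """Turn raw telemetry into a standard SLA event: Conformant, AtRisk, or Violated."""
    violations = [k for k, limit in behavior["sla"].items()
                  if telemetry.get(k, 0) > limit]
    at_risk = [k for k, limit in behavior["sla"].items()
               if telemetry.get(k, 0) > 0.8 * limit]
    if violations:
        status = "SLAViolated"
    elif at_risk:
        status = "SLAAtRisk"
    else:
        status = "SLAConformant"
    return {"behavior": behavior["name"], "event": status,
            "details": violations or at_risk}

# The service layer consumes only these standardized events.
print(evaluate(BEHAVIOR, {"latency_ms": 12, "packet_loss_pct": 0.02}))
print(evaluate(BEHAVIOR, {"latency_ms": 25, "packet_loss_pct": 0.02}))
```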

This leaves us with what are effectively three layers of modeling and orchestration: the OSS/BSS layer, the network services layer, and the resource/behavior layer.  This multiplication seems to make operators uneasy at one level, and comforts them at another.  Lots of layers seem to indicate unneeded complexity (likely true), but the layers better reflect current structures and dodge some important near-term issues.

We don’t have nearly enough dialog on these topics, according to operators.  Even those involved in SDN and NFV trials say that their activity is focused at a lower level, rather than addressing the broad operations relationships that we really need to see addressed.  I’m hoping that the expanding work of operators like AT&T and Telefonica will at least expose all the issues, and perhaps also offer some preliminary solutions.  Eventually, I think that operator initiatives will be the drivers to getting real discussions going here; vendors don’t seem willing to take the lead.