Is ONAP Advancing or Digging a Deeper Hole?

The announcement of ONAP’s Frankfurt Release last month raised a lot of questions from my contacts and clients.  There is no question that the release improves ONAP overall, but it still doesn’t change the underlying architecture of the platform.  I’ve described ONAP as a “monolithic” model of zero-touch operations automation, and said that model is simply not the right approach.  In a discussion with an EU operator, I got some insights into how to explain the difference between ONAP and the (far superior, in my view) TMF NGOSS Contract model.

We think of networks as vast interconnected collections of devices, which is true at the physical level.  At the operations level, though, a network is a vast, swirling, cloud of events.  An event is an asynchronous signal of a condition or condition change, most often a change that represents a significant shift in state/status.  In a pure manual operations world, human operators in a network operations center (NOC) would respond to these events by making changes in the configuration or parameterization of the devices in the network.

An automated lifecycle management system, like those humans in the NOC, has to deal with events, and as usual there’s more than one way to do that.  The obvious solution is to create what could be called an “automated NOC”, a place where events are collected as always, and where some automated processes then do what the human operators would do.  I’ll call this the “AutoNOC” approach.

The problem with AutoNOC is that by inheriting the human/NOC model, it inherited most of the problems that model created.  Two examples will illustrate the overall issue set.

Example One:  A major trunk fails.  This creates an interruption of connectivity that impacts many different services, users, and devices.  All the higher-layer elements that depend on the trunk will generate events to signal the problem, and these events will flood the NOC to the point where there’s a good chance that the staff, or the AutoNOC process, will simply be swamped.

Example Two:  An outside condition like a brownout or power failure occurs as the result of a storm, and there are intermittent failures over a wide area as a result.  The events signaling the first of these problems are being processed when events signaling later failures occur, and the recovery processes initiated then collide with each other.

What we really need to fix this problem is to rethink our notion of AutoNOC operation.  The problem is that we have a central resource set that has to see and handle all our stuff.  Wouldn’t it be nice if we had a bunch of eager-beaver ops types spread about, and when a problem occurred, one could be committed to solve the problem?  Each of our ops types would have a communicator to synchronize their efforts, and to ensure that we didn’t have a collision of recovery steps.  That, as it happens, is the TMF NGOSS Contract approach.

The insight that the NGOSS Contract brought to the table was how to deploy and coordinate all those “virtual ops” beavers we discussed.  With this approach, every event is associated with the service contract (hence the name “Next-Generation OSS Contract”), and within the service contract there is an element associated with the particular thing that generated the event.  The association consists of a list of events, states, and processes.  When an event comes along, the NGOSS Contract identifies the operations process to run, based on the event and the current state.  That process, presumed to be stateless and operating only on the contract data, can be spun up anywhere.  It’s a microservice (though that concept didn’t really exist when the idea was first advanced).
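To make the mechanism concrete, here’s a minimal Python sketch of the state/event idea.  The table contents, states, and handler names are purely illustrative; this is not the TMF specification, just the dispatch pattern it implies.

```python
# Minimal sketch of the NGOSS-Contract-style state/event idea (illustrative
# names only; this is not the TMF specification itself).

# Each element in the service contract carries its own state and a table
# mapping (state, event) pairs to a named operations process.
STATE_EVENT_TABLE = {
    ("active", "trunk-fail"):    "reroute_traffic",
    ("active", "degraded"):      "raise_alert",
    ("rerouting", "trunk-fail"): "escalate",      # same event, different state
    ("rerouting", "restore"):    "revert_routes",
}

def handle_event(contract_element, event):
    """Look up the process for this element's current state and the event,
    then run it as a stateless worker that operates only on contract data."""
    process_name = STATE_EVENT_TABLE.get((contract_element["state"], event))
    if process_name is None:
        return  # event not meaningful in this state
    # In a real system this would be spun up anywhere (a microservice);
    # here we just call a local function of the same name.
    globals()[process_name](contract_element)

def reroute_traffic(element):
    element["state"] = "rerouting"   # the state change steers later events

def raise_alert(element):
    pass

def escalate(element):
    pass

def revert_routes(element):
    element["state"] = "active"

# Two trunk-fail events on the same element are handled differently,
# because the first one changes the element's state.
element = {"name": "trunk-A", "state": "active"}
handle_event(element, "trunk-fail")   # runs reroute_traffic
handle_event(element, "trunk-fail")   # runs escalate
```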

The beauty of this should be obvious.  First, everything is infinitely scalable and resilient, since any instance of a state/event process can handle the event.  If you have two events of the same type, you could spin up two process instances.  However, if the processing of an event launches steps that change the state of the network, you could have the first event change the state so that a second event of the same kind would be handled differently.  The data model synchronizes all our ops beavers, and the state/event distribution lets us spin up beavers as needed.

What do these processes do?  Anything.  The obvious thing is that they handle the specific event in a way appropriate to the current state of the element.  That could involve sending commands to network elements, sending alerts to higher levels, or both.  In either case, the commands/alerts could be seen as events of their own.  The model structure defines the place where repair is local, where it’s global, and where it’s not possible and some form of higher-layer intervention is required.

I’ve blogged extensively on service models, and my ExperiaSphere project has a bunch of tutorials on how they work, in detail, so I won’t repeat that piece here.  Suffice it to say that if you define a proper service model and a proper state/event structure, you can create a completely cloud-native, completely elastic, completely composable framework for service automation.

Now contrast this with AutoNOC.  A classic implementation of this approach would mean that we had an event queue that received events from the wide world and then pushed them through a serial process to handle them.  The immediate problem with this is that the process isn’t scalable, so a bunch of events are likely to pile up in the queue, which creates two problems: the obvious one of delay in handling, and a less obvious one in event correlation.

What happens if you’re processing Item X on the queue, building up a new topology to respond to some failure, and Item X+1 happens to reflect a failure in the thing that you’re now planning to use?  Or it reflects that the impacted element has restored its operation?  The point is that events delayed are events whose context is potentially lost, which means that if you are doing something stateful in processing an event, you may have to look ahead in the queue to see if there’s another event impacting what you’re trying to do.  That way, obviously, is madness in terms of processing.
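For contrast, here’s a small, equally hypothetical sketch of the serial-queue model.  The event names and the trivially naive planner are invented; the point is only to show how the context behind a queued event can go stale before it’s processed.

```python
from collections import deque

# Hypothetical serial AutoNOC-style queue: events are handled one at a time,
# so the plan built for one event can be invalidated by an event still
# waiting behind it in the queue.
event_queue = deque([
    {"kind": "fail", "element": "trunk-1"},   # Item X: plan to reroute via trunk-2
    {"kind": "fail", "element": "trunk-2"},   # Item X+1: the planned alternate fails
])

def plan_reroute(failed_element):
    # Naive planner: propose the "next" trunk as the alternate path.
    return "trunk-3" if failed_element == "trunk-2" else "trunk-2"

while event_queue:
    event = event_queue.popleft()
    if event["kind"] == "fail":
        alternate = plan_reroute(event["element"])
        # By the time this plan executes, a queued event may already have
        # reported the alternate as failed -- the context has gone stale.
        stale = any(e["element"] == alternate for e in event_queue)
        print(f"reroute {event['element']} via {alternate}"
              f"{' (plan already stale!)' if stale else ''}")
```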

My view is that ONAP is an AutoNOC process.  Yes, they are providing integration points to new services, features, and issues, but if the NGOSS Contract model was used, all that would be needed is a few new microservices to process the new state/events, or perhaps a “talker” that generates a signal to an OSS/BSS process at the appropriate point in service event processing.  Even customization of a tool for service lifecycle automation would be easier.

My concern here is simple.  Is ONAP, trying to advance further along functionally, simply digging a deeper hole?  The wrong architecture is the wrong architecture.  If you don’t fix that problem immediately (and they had it from the first) then you risk throwing away all enhancement work, because the bigger a monolith gets, the uglier its behavior becomes.  Somebody needs to do this right.

An Approach to Cloud-Native 5G

Could we frame out a true cloud-native 5G Core?  What would happen if we set aside as many of the implementation presumptions as we could, and focused on trying to come up with the best technology approach?  It’s an interesting question, and it might turn out to be a highly relevant one now that so many cloud providers and software vendors are aiming at carrier 5G Core deployments.  Along the way to answering this question, we might also glimpse a solution to the problems of getting standards bodies aligned with the world of the cloud.

5G Core is arguably different in four ways.  First, it proposes to replace network elements (devices) with network functions (software instances).  Second, it proposes to create virtual partitions within the 5G network overall, called “network slicing”, somewhat analogous to virtual network overlays on a real network.  Third, it proposes to abstract the user plane and control plane and separate them fully, and finally, it proposes to fully abstract network transport.  All of this has to be done in a way that defines specific interfaces that promote interoperability.  So how might that happen?

The mandate of 5G is for “virtualization”, both of connection services and network functions.  The standards, and the trend in the industry, are to equate this with the adoption of “SDN” and “NFV”, but if you look at the implementations of 5G, it’s clear that vendors are taking a broader swipe at virtualization than the specifics of those two technologies.  We can therefore assume that a true 5G Core would rely on virtualization principles and perhaps blow a kiss or two at SDN and NFV as formal components.

The architecture for 5G (a revised chart from the 5GPP organization is my reference for this first figure) shows a data plane (the “User Plane”) and a Control Plane, hereafter referenced as the UP and CP.  The functional elements of each plane are shown as dark blue boxes in my figure, and the interfaces between them (the “N” interfaces) represent the standardized relationships mandated in 5G specs.

The problem with this approach is that it tends to lead vendors into a “virtual box” model, meaning that what they do is implement each of the boxes in the figure as though they were virtual devices.  That happened with NFV, and was IMHO the biggest single reason why the initiative hasn’t met expectations.  The problem is that once you define interfaces, you’re defining things that connect, which leads to defining the things themselves.  What we have to do is stop thinking of these boxes as virtual devices, and think of them in some more generic way.  Since that defies the specs, to a degree, we have to work our way into this.

The place to start in refining this starting point is the separation of control and data planes, our third 5G difference.  In the control plane, 5G functions can be viewed as microservices, having no internal state and being infinitely scalable.  Latency management in the control plane isn’t as critical, so it would be possible to adopt a service-mesh framework or something similar to provide for the necessary combination of discovery, load balancing, and scaling and resilience.  Our cloud-native 5G Core should have its control-plane elements based on this framework, since this is where “cloud-native” is clearly going at the application level.

Within the CP piece of my figure, we could then presume that each of the blue boxes is a service-meshed microservice with full resilience and load-balancing, or perhaps in some cases a serverless function loaded on demand.  The difference would be the rate of activity for a given function.  Since the current leading serverless tool is Knative, which works with Istio for service mesh and on top of Kubernetes for deployment, we have the potential to identify a single container, Kubernetes, Istio, and Knative model for our cloud-native control plane.
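As a rough illustration (not any 3GPP-defined service, and with invented message fields), a control-plane function in this framework would look like an ordinary stateless handler: nothing is held between requests, so any number of copies can run behind the mesh or be spun up by a serverless platform.

```python
# Minimal sketch of a stateless control-plane function. Because it keeps no
# local state, any number of instances can sit behind a service mesh or run
# as an on-demand serverless workload and scale freely.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_session_request(request: dict) -> dict:
    # All context arrives in the request (or would be fetched from a shared
    # back-end store); nothing is remembered between calls.
    return {"subscriber": request.get("subscriber"), "action": "session-accepted"}

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        reply = json.dumps(handle_session_request(body)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

if __name__ == "__main__":
    # Packaged in a container, this is the kind of thing the orchestration
    # stack described above would deploy and scale.
    HTTPServer(("", 8080), Handler).serve_forever()
```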

On the data plane, things are a bit more complex.  A router function is not stateless; routing tables are stored locally and it’s these tables that determine how packets are handled.  Router functions are also more difficult to scale or replace without impact on traffic, because packets in flight and currently being handled could be lost or delivered out of order.  Ordering is a feature of some data-plane protocols (like TCP) but it always creates a risk of a sharp increase in latency.

We have two options available to us in creating the data plane.  The first is to fall back on our first “difference”, saying that physical devices are replaced by virtual functions.  If we presume that replacement, with respect to packet forwarding at least, is 1:1, then we could say that the data plane will be implemented with stateful virtual devices.  The second is to presume OpenFlow-style SDN control, with central route control, and then assume that each instance of the router function would retrieve its routing table centrally, which would mean that any new instance created for scaling or availability would be equipped to take up its role.

The second model is the one I believe would be most “cloud-native”.  Back-end state control is an acceptable way to introduce stateful information into a microservice, and we would have the option to either deliver updates to the routing tables (from the central control point) or to reload or refresh the table entirely when a change is made.

This second model would be compatible with formal OpenFlow-style SDN, but it could also be applied to any architecture that can build an overall network topology and derive per-device forwarding from it.  In short, we could move away from the router-adaptive model of routing now in use, build the topology map in the control plane, and deliver per-device forwarding to the function instances as needed.
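Here’s a minimal sketch, with hypothetical class and role names, of what that second model implies: a forwarding instance that holds no authoritative state of its own, but loads its table from the central control point, so a scale-out or replacement copy comes up ready to work.

```python
# Illustrative sketch of the "second model": central route control, with each
# forwarding instance retrieving its table rather than computing it adaptively.
import ipaddress

class CentralRouteController:
    """Stand-in for the SDN-style central control point."""
    def __init__(self):
        self.topology_version = 1
        self.tables = {"edge-router-7": [("10.1.0.0/16", "port-1"),
                                         ("0.0.0.0/0",  "port-2")]}

    def get_table(self, instance_role):
        return self.topology_version, self.tables[instance_role]

class ForwardingInstance:
    def __init__(self, role, controller):
        self.role = role
        self.controller = controller
        self.version, self.table = controller.get_table(role)   # full refresh

    def refresh_if_stale(self):
        version, table = self.controller.get_table(self.role)
        if version != self.version:            # could also apply deltas
            self.version, self.table = version, table

    def forward(self, dest_ip):
        # Longest-prefix match against the centrally supplied table.
        addr = ipaddress.ip_address(dest_ip)
        matches = [(ipaddress.ip_network(p), port) for p, port in self.table
                   if addr in ipaddress.ip_network(p)]
        return max(matches, key=lambda m: m[0].prefixlen)[1]

controller = CentralRouteController()
instance = ForwardingInstance("edge-router-7", controller)   # a scale-out copy
print(instance.forward("10.1.2.3"))   # -> port-1
```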

I think that this model fits the general thrust of the 5G specs, but it would tend to define what I’ll call a “sub-plane” structure to the control plane.  5G has specific elements it expects to be in the control plane, and what this approach does is generate what’s perhaps a “mapping sub-plane” that maps the 5G control model and data model to the specific virtual network and virtual function requirements of the data plane.  The figure shows the interfaces as vertical lines connecting CP and UP elements.

My presumption is that in a cloud-native model, these N interfaces would terminate not in the UP elements but in intent-model abstractions of each element’s functionality.  These function-based intent models would assert as their external properties the various N interfaces the functions are required to supply by the 5G specs.  Each function’s implementation would involve a set of microservices that represent the features of the mapping sub-plane, which are largely the control and management planes of IP.  These would interface with the real IP data-plane elements, which might be white boxes, hosted instances of forwarding devices, P4-equipped virtual switches, or whatever.

In effect, this mapping sub-plane then would look like an overlap between the CP and UP, as my second figure shows.  From a functional perspective, it would be part of the UP because it would terminate 5G CP N interfaces to the UP.  From an implementation perspective, it would be based on the same microservices/serverless framework as the CP.  Within the CP mesh portion, the control functions of the mapping layer would live, both as a global service (the green box) and as a set of elements distributed within the abstract models of each UP element.

The horizontal N interfaces between UP elements would also be interfaces to these intent-model abstractions, which means that a “virtual UP element” would consist of interfaces to the mapping sub-plane from the CP, interfaces between the UP elements within the UP, and logic to supply the mapping-plane implementation of the IP control and management planes.
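A rough sketch of that idea, with purely illustrative class names and an invented stand-in for the session-management interface, might look like this: the intent model asserts the externally visible behavior, and the data-plane implementation behind it is interchangeable.

```python
# Illustrative sketch only: an intent-model abstraction of a UP element. The
# interface handling terminates here, and the real data plane behind it
# (white box, P4 switch, hosted forwarder) is hidden as an implementation.
from abc import ABC, abstractmethod

class DataPlaneBackend(ABC):
    """Whatever actually forwards packets; invisible outside the model."""
    @abstractmethod
    def install_forwarding_rule(self, flow_id, treatment): ...

class WhiteBoxBackend(DataPlaneBackend):
    def install_forwarding_rule(self, flow_id, treatment):
        print(f"white box: program flow {flow_id} -> {treatment}")

class P4SwitchBackend(DataPlaneBackend):
    def install_forwarding_rule(self, flow_id, treatment):
        print(f"P4 target: compile rule for flow {flow_id} -> {treatment}")

class UPElementIntentModel:
    """Asserts the externally visible properties (here, a stand-in for a
    session-management request from the CP); anything that satisfies them
    is interchangeable."""
    def __init__(self, backend: DataPlaneBackend):
        self._backend = backend          # mapping sub-plane detail, not exposed

    def establish_session(self, session_id, qos_profile):
        # Map the abstract CP request onto whatever the backend needs.
        self._backend.install_forwarding_rule(session_id, qos_profile)

# Either backend satisfies the same intent model, so the CP never changes.
UPElementIntentModel(WhiteBoxBackend()).establish_session("s-1", "low-latency")
UPElementIntentModel(P4SwitchBackend()).establish_session("s-2", "best-effort")
```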

You may wonder why all this mapping is necessary, and the reason is that 5G, despite its desire to push telcos toward the cloud, still has very much a device-centric bent.  There is more than one way to deploy in the cloud, more than one way to orchestrate.  Implementations of a given CP or UP function could be conformant to 5G goals and yet not be identical, to the point where the optimum means of interfacing wouldn’t be simply (for example) an “N” interface to an instance of a UP function.  We have to allow some flexibility because our vision of how the cloud works is evolving as our experience with the cloud expands.

The presumption of this model is that all the “functions” within the CP and all the mapping-layer functions are true microservices, deployed within a service mesh and potentially in serverless, event-driven, form.  The UP elements that make up the “real” data plane are placed and orchestrated rather than meshed, using Kubernetes.

This isn’t the only way to do a cloud-native 5G, but it’s a way that seems to conserve the value of ongoing cloud development, and at the same time harness the benefits of intent modeling.  It’s the combination of these things that creates what can, I think, justifiably be called a true cloud-native approach to 5G architecture.

Open Space for Open Networking?

The best place for open networking is where it was never closed.  Any network initiative that requires a write-down of existing technology, retraining of staff, and acceptance of new risks, faces considerable executive skepticism.  That’s why we may be seeing the dawn of real open networking in two different developments, one relating to O-RAN and the other to open IoT.

O-RAN stands for “open RAN”, of course, and it reflects the common operator fear that vendors will create closed ecosystems to support new services, locking them in for the long term.  The solution to that is to mandate technology that’s componentized and connected via standardized APIs and interfaces, which is what O-RAN offers.

From a vendor perspective, of course, it’s hard to beat a good, closed, ecosystem—unless your competitor is the one who has it.  It’s therefore not surprising that the first major 5G vendor to offer an O-RAN product suite is Nokia.

Nokia today is the end result of the consolidation of three companies, Nokia, Lucent, and Alcatel.  Anyone who’s ever studied large-company M&A, particularly in a highly conservative technology market, knows that the restructuring of organizations and personnel takes years, and that many of the best and most innovative people are lost, having decided to seek their fortunes elsewhere.  In fact, one of my long-standing friends, still at Nokia, says that “consolidation made three conservative organizations into one more conservative organization, when it was conservatism that drove the need for consolidation in the first place.”

The point is that Nokia isn’t the leader in 5G and isn’t going to overtake the firms ahead of it (Ericsson and Huawei) unless it does something radical.  The obvious radical thing would be to bust open the 5G ecosystems of competitors by promoting an open model.  It’s hard for operators to argue for proprietary in the face of openness, and even if Nokia’s competitors argue they use/have open technology, it’s hard to argue that in the face of a specific O-RAN commitment.

Which Nokia has now made.  I don’t agree with the point the story makes about operators and governments “demanding” O-RAN, but there’s certainly that long-standing fear of lock-in, and certainly a value in exploiting it.  Since there is absolutely no reason why an O-RAN network wouldn’t be as functional as one built on another model, openness is a benefit that brings no cost, providing that there are alternative vendors for the open framework.  There are plenty of vendors in the O-RAN alliance, including all the major network vendors (outside the mobile space!) and the major IT players, but that doesn’t guarantee the availability of alternative vendors.  There are three problems that concern operators.

The first problem is that an “open RAN” isn’t the same as “open 5G”.  O-RAN can break a proprietary RAN logjam for operators, but there’s still the question of the rest of the 5G infrastructure, and in particular, 5G Core.  Nevertheless, it’s clear to operators that there will be many sources for 5G Core implementation, and so they think it’s the RAN that keeps them from creating a fully open framework.

The second problem is implementation within O-RAN.  Operators tell me that fewer than a third of the vendors in the alliance actually offer an O-RAN-compatible element.  Many of the vendors in O-RAN are there to promote broader competition and participation in 5G, supporting the initiative financially and with technical resources, but not actually committing to implement an element of the architecture.  It’s therefore difficult to say, at the moment, just how many O-RAN elements will have multiple sources.  Operators hope this will change over time.

The final issue is integration.  All open strategies tend to founder, at least to a degree, on the rocks of integration of the open elements into a cohesive system.  “Someone to watch over me” should be the official theme song of the operator community; a single vendor is a single point of responsibility, but a community of vendors is an invitation to a circular firing squad.  Better get an integrator, but who should that be?

Nokia likely sees itself as a candidate.  Right now, its O-RAN strategy is to build a Nokia 5G RAN on O-RAN interfaces, not to include any of the small number of potential partners it could draw from the alliance.  In the long term, that’s not going to fly, and Nokia knows it.  What it’s working to do, so I hear, is to sign up partners and build use cases that will both validate the open claims and protect Nokia’s own products.  The future of O-RAN may depend on how Nokia does at this complex juggling.

The other open initiative comes from the operator side, from Deutsche Telekom, and it claims to be the world’s first open IoT network.  The central concepts are the IoT Hub, which is just what the name suggests, and the Cloud of Things, a platform for the execution of IoT applications.

The hub provides a central integrating piece that provides for workflow through the various and diverse elements of an IoT ecosystem.  The goal is to make access to IoT information as device-independent as possible, and to allow easy integration of data processing and application components to the device framework.

The Cloud of Things is the potential revolution here, but just how revolutionary it might be is hard to assess at this point.  My problem with the way it seems to be aimed is that where IoT should be based on an “information-as-a-service” concept, the Cloud of Things seems to focus instead on the “things”, making it device-centric.  Even if you presume that the devices are “virtualized” and mapped (via the IoT Hub) to a standard information exchange framework, you’re still diddling with devices.

There are certainly some IoT applications that require specific device awareness.  If you want to open a gate when an access sensor is triggered, you obviously have to know which sensor is associated with which gate.  In general, I think, “private” IoT applications can almost always be made to work effectively with at least a modified form of a device-centric approach.

The problem comes when you want to collectivize.  The broader the scope of distribution of sensor data, the less likely it is that the exchange should be made directly between sensor and user.  There’s a problem of overloading sensors with access requests, a problem with security, and a problem with duplication of interpretive logic.  The latter means that if there are five applications that need sensor information, it’s likely all five would perform similar analysis on the data before actually doing something with the results.

IoT information-as-a-service (IoTaaS) would create a kind of middleware layer between sensors and applications, exposing not the direct sensor information but relevant IoT information that might be collected from multiple sensors, interpreted in various contexts (correlation of different sensor trends, historical analysis of a single sensor trendline, etc.).  Applications would generally consume the IoTaaS services, enhancing sensor isolation and security.
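To illustrate the idea (the class names, sensor identifiers, and data here are all hypothetical), an IoTaaS layer might look something like this: applications consume interpreted information from the middleware rather than touching sensors directly.

```python
# Hypothetical sketch of "IoT information-as-a-service": a middleware layer
# collects raw readings, interprets them (trends, correlation), and exposes
# the result as a service that applications consume.
from statistics import mean

class SensorGateway:
    """Owns direct sensor access, so applications can't overload the devices."""
    def __init__(self):
        self._readings = {"gate-sensor-12": [3.1, 3.4, 4.0, 4.8]}  # sample data

    def raw(self, sensor_id):
        return self._readings[sensor_id]

class IoTInformationService:
    """The IoTaaS layer: shared interpretive logic, written once and used by
    every consuming application instead of being duplicated five times over."""
    def __init__(self, gateway):
        self._gateway = gateway

    def trend(self, sensor_id):
        values = self._gateway.raw(sensor_id)
        return "rising" if values[-1] > mean(values) else "steady-or-falling"

    def correlated_estimate(self, sensor_ids):
        # Correlate several sensors into one derived fact.
        return mean(self._gateway.raw(s)[-1] for s in sensor_ids)

# Applications consume the service, not the sensors.
service = IoTInformationService(SensorGateway())
print(service.trend("gate-sensor-12"))
```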

I think the Cloud of Things could build this kind of framework, but there doesn’t seem to be any expressed intent to do that.  The questions, then, would be whether a service-centric vision is needed, and whether it could be added later on.

I’ve said many times that I do not believe IoT can succeed if it doesn’t convert sensor and controller behaviors into services, abstracting the information and capabilities from the actual devices.  The problems with IoT security and governance are almost insurmountable where direct sensor/controller access is provided, particularly where sensors are shared among a community of users.  I also think that since IoT is in its infancy, it’s hard to predict how far it might evolve, and how fast.  Abstraction could protect application components from changes in the hardware.

Without service-centricity, the Cloud of Things isn’t really much different from the Internet of Things, as far as the “things” are concerned.  It would then be left to DT to somehow validate their platform to developers based on a much narrower value proposition, and a greater risk of regulatory intervention based on privacy concerns or sensor manipulation.

The question of whether Cloud of Things could become “Service of Things” is more complicated, because it seems likely that an application ecosystem or framework like that would require broad industry acceptance.  At the high level, then, the question is whether DT would have the influence to drive the market toward any sort of standard.  The next question would be whether a Service of Things standard driven by DT (an operator) would end up in the same place that things like NFV have.  A five-year slog to indifferent acceptance isn’t going to look attractive to software companies.

It may be that the best outcome for the Deutsche Telekom initiative is that it drives competitive efforts, particularly from the open-source community.  A Dell/VMware or IBM/Red Hat could breathe some fresh air into the concept, and as a bonus would surely align it with broader cloud initiatives.  A Cloud of Things has to be a cloud before the “things” will matter much.

I think these open initiatives demonstrate two things.  First, there is an inherent appeal to open approaches to new technologies and opportunities.  People want to believe that they’re not being locked in or gouged for excess profits.  Second, wishing won’t make it so.  If broad appeal is the driver for openness, then blowing kisses and erecting attractive billboards will do wonders for those who want to claim the benefits.

An open field is probably essential for the growth of open networking, but it’s a necessary condition, not a sufficient one.  Effective exploitation is still needed.  The real opportunities in these spaces exist, and the real values are clear.  The real winners may still be waiting in the wings.

What’s the REAL State of the Cloud?

Do you believe that within a year, a quarter of enterprises will have moved entirely to the cloud?  Do you believe that almost one in five are already there?  I don’t, not for a minute, and because I’m going to dispute these numbers, I’m not going to cite the source.  I think it’s important to know what’s going on in the cloud, and so I’ll offer my own insights, based on my own surveys of nearly 300 enterprises, maintained with good continuity over four decades.

Before I start, I want to note that surveying is a lot harder than it seems.  The problem is that even if survey questions are carefully designed, about a third of people who take surveys will give answers that are totally incorrect, meaning that their company doesn’t do, or have, what they say.  When questions are not properly designed, ignorance can result in a finding that’s so totally wrong it’s laughable.

Example: Back in the early days of ATM (remember that?), a big proponent of ATM commissioned a study on ATM adoption.  I was asked to review the questions, and I pointed out that the question “Does your company use ATM?” wasn’t going to yield useful results based on my experience.  They then proposed to spell out the acronym (asynchronous transfer mode), and I was still not a fan.  I asked them to add the question “If yes, at what speed is ATM delivered?”.  The results came in, and over two-thirds of responses said they used ATM, the kind that ran at 9600 bps!  Obviously, they mistook async communications, and modems, for ATM.

A second problem is that you have to ask people who actually would likely know the answer.  I’ve found that you can ask a CIO, a CFO, the head of data center operations, and a director of software development a question about IT spending and directions, and hope to get an accurate response.  Any other source is problematic, and of course most surveys ask people much lower down the organization.  Often, they don’t know, but who’s going to tell a nice survey person they have no idea what their company’s cloud usage is?  They’ll make something up.

OK, let’s go to the reality of the cloud, as far as my latest contacts go.  Of the 277 companies in the survey, half of which were enterprises and half mid-sized businesses, three had totally moved to the cloud, and those three were mid-sized businesses.  No enterprise had moved, and no enterprise said they were contemplating such a move.  On the other hand, all of the enterprises said that they used public cloud services, and all of them said that they expected to increase their public cloud use in 2020 and again in 2021.

You might wonder whether the issue here is that “moving everything to the cloud” might mean fully adopting private cloud or public cloud, but that’s not what companies say.  If “private cloud” means deploying cloud technology on the premises, there are a dozen companies who propose that, and admittedly about half are interested in the concept.  However, if “moving everything to the cloud” includes hybrid cloud, meaning a partnership between cloud and data center, then the survey data is perhaps even pessimistic.  All of the enterprises I talked with already use hybrid cloud.  About half of SMBs do as well.  Many of the SMBs mix data center and cloud usage, using SaaS public cloud services for some applications and their own hosting for others, so they’re not strictly hybrid cloud users.

What enterprises want from “the cloud” boils down to two things.  Architecturally, they want a model of “elastic computing” that’s applicable to all their hosting, whether in the data center or in public clouds.  This can be achieved most easily, say the enterprises, through adoption of container orchestration.  Financially, they want to be able to shift work across all their hosting options to reflect the best price/feature balance available.  Dynamic shifting would be nice, but it’s a requirement primarily for the front-end piece.  Planned migration is accepted for the back-end part, and very few enterprises plan to migrate that piece in the near term.

The near-term focus of cloud adoption is the creation of mobile and web front-ends to traditional applications, which enable easier consumer, partner, and employee access to applications using traditional phone/browser technology.  This sometimes involves tweaks to the data center stuff, but the development work is totally cloud-focused.

That might be the reason the survey results overstate cloud penetration.  The squeaky wheel gets the oil; the majority of IT people will be focused on the work they’re doing.  If it’s all cloud work, they may not see the residual pieces of the application, running as usual in the usual places.  And make no mistake, the great majority of current development work is focused on that front-end piece, a piece that’s almost a given for cloud hosting.

Would it be possible that enterprises would shift all their work to the cloud?  If that’s without constraining the time period, most enterprises say it would be possible.  If asked whether they expect that to happen within five years, the majority will say they do not; about 10% think it might be possible in three years, and only 20% in five years.

I’m not dissing the cloud here; most of you know I’m a long-standing cloud fan.  I am concerned about information that would possibly contaminate application and technology planning.  We need to know the truth about the cloud, just as we need to know the truth about any technology concept, product, or service.  A project cannot succeed, no matter how well it’s executed, if it’s aiming at an unattainable goal.  If most enterprises are not abandoning their data centers, then you should assume that the collective reasoning of professionals in those enterprises is that it would be a bad idea.

We need to transform applications to an architectural model that can exploit the cloud optimally.  We need to be able to organize the deployment and lifecycle management of these applications across all the hosting options that such a model would facilitate.  These goals would acknowledge that most enterprises won’t fully shift to cloud hosting, which is true, but would also acknowledge that all enterprises will want to exploit the cloud’s unique properties.  That’s true too, and true already.

Do We Have a Shared-Edge Opportunity Emerging?

Could the future of metro networks lie somewhere other than the interconnection of central offices or 4G/5G infrastructure?  Could shared hosting facilities play a role in the future of the cloud, as shared towers have in mobile networking?  All this is possible, if there’s going to be a lot of edge data centers in our future.  Traffic, and revenue opportunity, set the focus of any infrastructure deployment, and those factors may be shifting in a new direction.

If we are to explore parallel-to-the-Internet sources of network resources, we have to consider parallel-to-the-cloud hosting.  Wall Street, in fact, is seeing this space as being one of the main drivers of incremental IT growth for both 2020 and 2021, and yet we don’t really see much discussion of the topic.  What’s going on here could be important, not only for the cloud but for the specific growth opportunities in the server and platform software spaces.

We think of public clouds as massive data centers where shared resources live.  More careful assessment shows that public clouds are almost always made up of multiple data centers, with each one serving an associated market area.  Clouds have spread from their first form of concentrated resources to something more distributed, but how far can that spread continue if we rely on per-provider hosting facilities?

We think of edge computing as a collection of shared resources close to the point of user attachment.  The problem with this vision is that users are everywhere, and so a presumption of edge opportunity presumes that somehow we’d be able to extend the edge close to a large percentage of the user population.  Can that happen if we demand that every edge provider deploys its own edge data centers?

Look up at a cell tower the next time you pass one.  Do you see one big antenna on top?  There are normally quite a few antennas on a cell tower, because many (most, in many areas) towers are owned by third parties who lease tower space to providers.  That eliminates the need for every provider to build unique facilities, something called “competitive overbuild”, and that reduces the cost of providing services, thus helping to contain pricing.

Obviously, the reason I mention this is that the same sort of competitive overbuild risk exists for the cloud, and in particular for edge computing.  If that’s true, then the same remedies are also likely to be needed, meaning that we should expect to see third-party data centers offering hosting to the cloud and edge providers in areas where dedicated per-provider facilities can’t be justified.  The Street is already seeing growth in what they would classify as “second-tier” data center providers, driven primarily by growing interest in cloud services (particularly hybrid cloud) from enterprises headquartered in those second-tier markets.

Another possible growth opportunity for these shared data centers is the 5G and “carrier cloud” activity.  5G and other carrier-cloud services are likely to deploy first where there’s significant demand, which means where there’s a large population of users.  These areas can probably justify per-provider facilities, but eventually successful services have to expand to a wider base.  Since signaling applications like 5G are probably better examples of latency-sensitive applications than most cloud applications would be, there’s a good chance that edge facilities in an expanding 5G/services market will need shared data centers.

We know from the experience of Equinix (who happens to be one of the leaders in the shared-data-center space) that one of the biggest requirements for this sort of hosting is access to a bunch of ISPs, easiest to achieve if you’re an interconnect yourself (as Equinix is) or if you’re near one.  You need access to high-speed fiber paths or the value of these shared resources is limited.  That means that it’s not unlikely that some of the shared-data-center players will create at least metro networks using fiber transport.  This will not only drive up the parallel-to-the-Internet network capacity available, it will also shift the focus of metro transport toward data-center interconnect (DCI).

One of the potential impacts of a DCI shift would be an increased role for SDN, which has been deployed more in the data center than in the wider area.  That may not be the most important impact; it may in fact be a symptom of a broader shift.  We might be seeing so much capacity built into DCI connections to support positive revenue opportunities at the edge, that non-DCI traffic could end up piggybacking on those connections.  The future of metro might be the data center interconnect space, not traditional connection services.

This doesn’t mean that your Internet traffic is going to pass through an edge host, but that the data center networks and their associated switching would make up part of the path for access (including wireline and wireless) and metro traffic.  It could shift the focus from building a multi-service metro fiber network (like the old SONET/SDH rings) to building metro DCI and then exploiting it.

There are good reasons for this, beyond just taking advantage of in-place DCI capacity.  If we assume (again) that things like 5G and IoT will actually justify edge computing, then some of the services we’re connecting will actually require some feature support from locally hosted virtual functions.  Where better to extract and reinsert this sort of stuff than in a data center that traffic is passing through anyway?

This broad truth is what makes SDN a more likely play for DCI, and even for metro.  Most hyperscale data center operators accept the idea that you really need “composed connectivity” rather than “Ethernet switching” in complex data centers, and the presence of what’s basically transit traffic within a data center network is an almost sure driver of SDN.  So is the need for multi-tenancy within a data center, which of course is the driver of growth in these shared data centers in the first place.  The earliest “commercial” SDN offerings were aimed explicitly at multi-tenant management, and if the tenants are competing cloud providers, the requirement for isolation of tenants (major tenants like cloud providers and “sub-tenants” like their customers) is even more acute.

What excites Wall Street is that all this might create a kind of land-rush mindset among the larger cloud providers or edge-computing aspirants.  There are only a limited number of companies with any real market mass in the shared data center space; somewhere between 10 and 20, depending on how stringent your criteria for selection happen to be.  About half of these are public companies, so we’ve got a mixture of potential for IPOs and potential for stock appreciation.

What excites vendors about this is that whether we call this part of the edge computing trend or not, it’s a new place to sell servers and platform software, and the land-rush mindset I’ve noted would likely fund at least some anticipatory deployment of infrastructure.  In a market that’s looking for some demand strength, that’s a powerful situation.

Finding a Model for Open-Model Networks

If operators want open-model networking, it’s inescapable that they need an open model.  We know from past experience that simply insisting on white boxes and open-source software doesn’t secure the benefits operators expect; there’s too much cutting and fitting required in building a cooperative system.  That raises two questions: what exactly is in an “open model”, and who is likely to provide it?  The first question is fairly easy to frame out in detail, but more complicated when you try to answer it.  The second is a “Who knows?”, because candidates are just starting to emerge.

Virtualization of networks involves dealing with two dimensions of relationships, as I’ve noted in prior blogs.  One dimension is the “horizontal” relationship between functional elements, the relationship that corresponds to the interfaces we have and the connections we make in current device-centric networks.  The other dimension is the “vertical” relationship between functional elements and the resources on which they’re hosted—the infrastructure.  An open model, to be successful, has to define both these relationships tightly enough to make integration practical.

Because most open-model networks will evolve rather than fall from the sky in full working order, they’ll have to interact with traditional network components when they come into being.  That means that there has to be a boundary, at least, where the open-model stuff looks like a collection of traditional network technology.  That might come about because open-model components are virtual devices or white-box devices that conform to traditional device interfaces (router instances or white-box routers), or because a boundary layer makes an open-model domain look like a router domain (Google’s SDN-inside-a-BGP-wrapper).

The common thread here is that an open-model network has to conform to one of several convenient levels of abstraction of traditional networks.  It can look like a whole network, an autonomous system or administrative domain, a collection of devices, or a single device.  It’s important that any open model define the range of abstractions it can support, and the more the better.  If it does, have we satisfied the open-model requirements?  Only “perhaps”.

The challenge here is that while any coherent abstraction approach is workable, not all of them are equally open.  If I can look like a whole ISP, presenting standard technology at my edge and implementing it in some proprietary way within, then I’ve created a network model more likely to entrap users than the current device-network proprietary model.  Thus, we would have to say that to retain “openness”, our model has to either define functional, horizontal, relationships just as current networks do, or rely on accepted standards to define relationships within its lowest level.  If you can look like a collection of devices, then it’s fine if your implementation defines how that happens, based on accepted technical standards and implementations.

If that resolves our horizontal relationship requirements in an open model, we can move on to what’s surely the most complex requirement set of all, the vertical relationships between functional elements and function hosts.  The reason this vertical relationship set is complex is that there are multiple approaches to defining it, and each has its own set of plusses and minuses.

If we presume that the goal is to create open-model devices, the simplest way of linking function to hosting, then what we need is either a standard hardware platform (analogous to the old IBM PC hardware reference) or a set of standard APIs to which the device software will be written.  In the former case, all hardware is commoditized and common, and in the latter, hardware is free to innovate as long as it preserves those APIs.  Think Linux, or OpenStack, or P4.  Either of these approaches is workable, but obviously the second allows for more innovation, and for open-model devices, it seems to be the way the market is going.

The problem with the open-model device approach is that it doesn’t deal with “virtual” devices, meaning collections of functionality that deliver device-like behavior but are implemented to add scalability, resilience, or some other valuable property.  It also doesn’t work for “network elements” that aren’t really devices at all, like OSS, BSS, or NMS.  For these situations, we need something more like an open-function model to allow us to compose behaviors.  Even here, though, we face decisions that relate back to our initial question of just what an open-model really models.

If we could agree on a standard set of functions from which network services and features are composed, we could define them and their horizontal and vertical relationships just as we did with devices.  Something like this could likely be done for traditional network services, the “virtual network functions” of NFV, but it would require creating the kind of “class hierarchy” I’ve talked about in prior blogs, to ensure that we got all our functions lined up, and that functions that were variations on common themes were properly related to ensure optimality in service-building and management.
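A toy sketch of that class-hierarchy idea, using invented function classes, might look like this: variations on a common theme inherit from the same base, so composition and lifecycle logic is written once against the class, not against any vendor’s implementation.

```python
# Hypothetical sketch of a function class hierarchy for service composition.
class NetworkFunction:
    def deploy(self): ...
    def health(self) -> str:
        return "ok"

class FirewallFunction(NetworkFunction):
    def add_rule(self, rule: str): ...

class StatefulFirewall(FirewallFunction):
    """A variant on the firewall theme; still usable wherever a
    FirewallFunction is expected."""
    def track_session(self, flow_id: str): ...

class ProxyFirewall(FirewallFunction):
    def terminate_and_inspect(self, flow_id: str): ...

def compose_secure_access_service(fw: FirewallFunction):
    # Composition and management logic depends only on the class contract.
    fw.deploy()
    fw.add_rule("deny inbound telnet")
    return fw.health()

print(compose_secure_access_service(StatefulFirewall()))
print(compose_secure_access_service(ProxyFirewall()))
```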

This would be a lot harder in the network software space, because as anyone who’s ever written application software knows, there are a lot of ways of providing a given set of features, and the structure of the software depends as much on the path you choose as on the features at the end of the road.  For this particular situation, we have to look at two other options, the decomposable service option, and the universal hosting and lifecycle option.

Decomposable service approaches to openness say that network software divides into a specific set of “services” whose identity and relationships are fixed by the business processes or feature goals the software is intended to support.  Payroll applications, for example, have to maintain an employee list that includes payment rate, maintain a work record, print paychecks, and print required government documentation.  In effect, these decomposable services are the “virtual devices” of software elements.

The presumption for this option is that it’s the workflow among the decomposable services that has to be standardized.  The location of the services, what they’re hosted on, and how the services are implemented are of less concern.  This is a workable approach where overall services can be defined and where the assumption that each would come from a single source, in its entirety, is suitable for user needs.

The universal hosting and lifecycle option says that the goal is to build network software up from small discrete components in whatever way suits user needs best.  The goal is to unify how this combination of discrete components is hosted and managed.  This option requires that we standardize infrastructure, meaning hardware and associated platform software, and that we unite our function-handling through deployment tools that are versatile enough to handle a wide variety of software and service models.

I think this approach mandates a specific state-based vision, meaning that there’s a service model that can apply lifecycle process reactions to events to achieve the goal-state operation the service contract (implicit or explicit) mandates.  This is the most generalized approach, requiring less tweaking of existing network functions or existing applications, but it’s also highly dependent on that broad lifecycle management capability that we’re still unable to deliver.
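A minimal sketch of that state-based vision, with invented service and element names, might compare the goal state recorded in the service model against what’s actually observed and derive the lifecycle actions needed to converge; this is an assumption-laden illustration, not anyone’s shipping implementation.

```python
# Hypothetical sketch: the service model carries a goal state, and a lifecycle
# process repeatedly compares it with what's observed and reacts until the
# two converge.
def reconcile(service_model, observed):
    """Return the lifecycle actions needed to reach the contract's goal state."""
    actions = []
    for element, goal in service_model["elements"].items():
        actual = observed.get(element, "missing")
        if actual == goal:
            continue
        if actual == "missing":
            actions.append(("deploy", element))
        elif actual == "failed":
            actions.append(("redeploy", element))
        else:
            actions.append(("adjust", element))
    return actions

service_model = {"name": "vpn-east",
                 "elements": {"edge-a": "active", "edge-b": "active", "core": "active"}}
observed = {"edge-a": "active", "edge-b": "failed"}   # core hasn't reported yet

print(reconcile(service_model, observed))
# -> [('redeploy', 'edge-b'), ('deploy', 'core')]
```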

What’s the best approach?  My view is that the decomposable service concept would offer the industry the most in the long run, but that it faces a pretty significant challenge in adapting to the current set of network functions.  That challenge might be minimized if we continue to struggle to frame a realistic way of exploiting those functions, which is the benefit of the universal hosting and management approach.  We’ll probably see this issue decided by the market dynamic over the next year or so.

Do Broadband Networks Have to be Profitable?

Do network operators have to earn a profit on their networks?  That seems a simple question, but if we consider that operators have reported a decade of decline in profit per bit consumed, we also have to consider whether at some point they’ll be unwilling or unable to continue to expand network reach and capacity.  What happens then?  Australia faced this issue, and their solution was NBN, the “National Broadband Network”, the product of a government-owned corporation (NBN Corp.), and the largest infrastructure project in Australia’s history.

But NBN is considered by many (well, perhaps by most) to have failed in its goals, even after it’s redefined many of them to make them easier to achieve.  See THIS article by Light Reading as an example.  NBN raises a number of questions, three of which I’ll consider here.

The first question is why did Australia take this particular path to providing quality, ubiquitous, broadband, when other countries seem to have done fine without creating what’s almost a return to the concept of a carrier as part of the government?  Question number two is why didn’t the NBN approach work as it was supposed to?  Finally, for number three, is there a realistic alternative to a government-supported broadband future?  We’ll take these one at a time.

Broadband profitability is determined by the relationship between the price of service to the user, and the cost of providing that service.  For virtually all network services, the costliest piece is the “access network”, the part that connects customers to the network and through which their service traffic passes.  Access infrastructure costs per customer are directly dependent on how dense the customer base is.  About 15 years ago, I developed a complex metric called “demand density” to measure this.

Demand density is a combination of the GDP per unit area of service geography, and the density of the right-of-way available to distribute services.  To simplify the measurements, I normalized the formula I’d developed to a value where the US was equal to 1.0.  Countries whose demand density was significantly higher than that of the US would have less trouble creating profitable broadband, and those whose density was significantly less would have more trouble.  On that scale, Australia’s demand density is 0.19, while that of Japan and Korea (both cited as examples of good broadband service) is 12.8 and 9.52, respectively.
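The precise formula isn’t reproduced in this post, but the normalization idea is simple arithmetic.  This toy sketch uses rough public GDP and land-area figures with a made-up right-of-way weighting, so its output will not match the values quoted above; it only shows the mechanics of normalizing to a US value of 1.0.

```python
# Toy illustration only: compute some raw density metric per country, then
# divide by the US value so the US comes out at exactly 1.0.
def raw_demand_density(gdp_usd, area_sq_km, right_of_way_factor):
    # GDP per unit area, weighted by how dense usable rights-of-way are.
    return (gdp_usd / area_sq_km) * right_of_way_factor

# Approximate GDP/area figures; the right-of-way weighting is invented.
us = raw_demand_density(gdp_usd=21e12, area_sq_km=9.8e6, right_of_way_factor=1.0)
au = raw_demand_density(gdp_usd=1.4e12, area_sq_km=7.7e6, right_of_way_factor=1.0)

print(round(us / us, 2))   # 1.0 by definition
print(round(au / us, 2))   # low, reflecting sparse demand over a large area
```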

Low demand density means you have to string a lot of access to cover your population, which means that your return on infrastructure is low.  Australia’s is the lowest of the industrialized world, which is why they ended up with the problem, and why they sought a form of government ownership and subsidization as the solution.

But as the Light Reading piece suggests, the approach hasn’t worked, and NBN is probably at least as unprofitable as Australia’s national carrier (Telstra) would have been, had it been retained.  What happened?

The most fundamental truth is that changing ownership doesn’t change economic law.  A given level of broadband support will demand a given level of access deployment.  Regardless of who sees the bottom line on their ledger, the way it’s derived from revenue and cost remains.  Government support would have to be financial support, and as the article suggests, nobody is anxious to go to the taxpayer well to draw more dollars.  In fact, it’s my view that the base cost assumptions for NBN were always way too optimistic, made so to get support for the concept.

There’s another fundamental truth, though, and that is that it’s rare that a government corporation is going to create an optimally efficient solution.  Competition among commercial entities is more likely to do that, and NBN effectively killed that.  In fact, the political dimension of NBN made it more difficult for NBN to adapt to new technology approaches than Telstra and smaller competitors would have been.

The third fundamental truth is that it’s hard to hit a moving target, particularly one that’s moving away.  What would be considered “good” broadband today is at least twice as fast as what was considered good when NBN was launched.  The explosion in streaming video has magnified the traffic load by at least a factor of three, perhaps as much as five.  We have more devices per connection, too.  All this adds up to a moving of the goalposts in a game NBN was already losing—was in fact destined to lose.

Which raises the final question: is there another way, another approach?  That turns out to be the hardest question of all to answer.  There’s good news and bad news on that front.

The good news is that we have some technology changes in play today that NBN didn’t have when it came along.  We also have a financial change, at least a change in financial thinking.  We now realize that broadband in a low-demand-density world has to somehow extend traffic aggregation principles outward closer to the service consumer, and those technology changes offer us options.

The worst of all possible broadband worlds is “home-run” connections between consumers and their serving offices.  One customer has to pay for that access network connection, which means that in low-demand-density areas, costs will be prohibitive.  Things like remote concentrators in copper-loop technology (DSLAMs) came along to provide some local concentration of traffic to eliminate parallel, expensive, links.  We now have fiber-to-the-node or to the curb, with things like 5G as a tail connection to customers.  The cable industry’s CATV coaxial cable has always offered good access concentration, and successive versions of cable standards (DOCSIS) have enhanced CATV carriage of broadband.

But will this solve the problem?  US experience says that it won’t solve it universally.  The US has to provide rural subsidies for broadband services because those demand density issues that bite countries like Australia will also bite regions/states.  In fact, the inevitability of differences in broadband quality by geography, created by differences in demand density, can probably be eradicated only by some form of subsidization.  How much will depend on demand density.

My models say that where demand densities are less than about 3.0, there will be network service geographies that will require some subsidization.  When densities fall to below 2.0, the number of such areas will likely increase significantly, to the point where some government intervention to secure quality services will be needed.  In contrast, where demand densities are about 5.0, there are not likely to be many places where subsidies are required, and operators with a fairly large geographic scope will simply normalize services within their entire geography.

The same modeling says that rather than creating NBN Corporations of their own, other countries should adopt some form of incentive payments to support low-density areas, and should encourage consolidation of operators to raise the footprint of their major players.  The smaller the service area, the harder it will be to normalize services across it without subsidies.  That’s particularly true if the reason the service area is small is that some giant telco sold off unprofitable lines to a smaller company, who then expects to profit from government subsidies.

I’ve not mentioned two other options here: lowering costs via virtualization and lifecycle automation, and raising revenues through higher-layer services.  I’ve blogged on both these points before, and in any event, it seems clear that most operators are going very slowly on either of these choices.  I also think that while operators can benefit from lifecycle automation, it really only lowers the demand-density thresholds for profitable operation a bit, and in the long run it won’t arrest the declines in profit per bit.  On the revenue side, the sky’s the limit, but operators seem to fear heights.  They may eventually have to learn to face their fears.

Facilitating the Integration of Open Elements in Networks

You can’t build a network with a single element, or at least not a useful one.  As soon as you have multiple elements, though, you have the problem of getting them all to work together.  If those multiple elements are themselves made up of a set of layered software components, then is the problem of coordination impossible?  It better not be, or we’ll end up in a mess as we evolve to open networks and white boxes.  Will it demand so much integration that cost advantages of an open model are compromised?  That better not be either, but all of this could be a challenge to prevent.

A network is a cooperative system, and that means more than we might think.  When you build a network from devices, the interfaces the devices present and the protocol specifications that the devices support define the cooperation.  While there can still be a bit of tweaking, you can generally put compatible devices in the right places and build a network.

Things are more complicated in the virtual world.  Getting virtual devices to cooperate is harder than getting real devices to cooperate, and if you expand the notion of virtualization to the function/feature level, it’s more complex still.  The problem is that you’re assembling more things, and there are more ways they can go together.  Most, of course, will be wrong.  Given that making all your network clocks chime at the same time is the biggest reason users give for sticking with a single vendor, you can see how open-model virtual networking could be the start of an integration nightmare for users.

Open-source software has always had integration issues.  You have an open-source operating system, like Linux, which comes in different flavors (distributions or “distros”).  Then there are middleware tools, things like your orchestration package (Kubernetes), monitoring, and so forth.  When you build something from source code, you’ll need a language, a language processor, libraries, and so forth.  All of these have versions, and since features are introduced in new versions, there’s always the risk that one component needs a new version of a lower-level element for its features while another component needs the old version.

Companies like Red Hat stepped in to address this by synchronizing a set of tools so that everything was based on common elements and could be made to work together.  That doesn’t resolve all the “cooperation” problems of the virtual world, but it does at least get you sort-of-running.  We’re starting to see articles about the issues of integration as they relate to the use of open source in network elements.  Can the problems be resolved by anything short of an army of integrators whose cost would wipe out the advantages of open-model networks?  I think it might be possible, through modeling techniques I’ve advocated for other reasons.

Intent modeling is another name for “black box” functionality.  A black box is something whose properties cannot be viewed directly, but must be inferred by examining the relationship between its inputs and outputs.  Put another way, the functionality of a black box or an intent model is described by its external properties, and anything that delivers those properties is equivalent to anything else that does the same.

Much of the value of intent modeling comes in service or application lifecycle automation.  The classic problem with software implementations of features or functions is that how they’re implemented often impacts how they’re used.  Function A and Function B might do the same thing, but if they do it based on a different set of parameters and APIs, one can’t be substituted for the other.  That creates a burden in adopting the functions, and uncertainty in managing them.  I’ve noted the value of intent modeling in service lifecycle automation in many past blogs, but it has value beyond that, in facilitating integration.

One reason that can happen is that intent modeling can guide the implementation of what’s inside those black boxes.  You could wrap almost anything in an intent model, but there’s an implicit presumption that what’s inside will either support those external properties directly or be adapted to them (we’ll get more into this below).  The easier it is to accomplish this, the cheaper it is to use intent models to operationalize applications or services.

This adaptation process has to support what we’d normally think of as the “real” external properties of the modeled elements, the visible functional interfaces, but that’s not the end of the story.  In fact, it may be less than half the important stuff when we consider open substitution of modeled elements.

If our Function A and Function B were designed as “black boxes” or intent models, we could expect the external connections to these functions to be the same, if they were implementations of the same model.  That would mean they could be substituted for each other freely, and everything would work the same whichever one was selected.  It would also mean that when assembling complex software structures, we wouldn’t need to know which implementation was in place, because the implementations are inside those opaque black boxes.
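
To make the black-box idea concrete, here’s a minimal Java sketch (the names are purely illustrative, not drawn from any standard).  Callers depend only on the intent model’s external properties, so either implementation can be selected at composition time without the rest of the service knowing or caring which one it got.

// Hypothetical intent model expressed as a Java interface.  Only the
// external properties appear here; whatever sits behind them is opaque.
interface FirewallIntent {
    void applyPolicy(String policyDocument);   // functional behavior
    boolean isPolicyActive(String policyId);   // observable external state
}

// Two independent implementations of the same intent model.  Because they
// present identical external properties, they are freely interchangeable.
class VendorFirewall implements FirewallIntent {
    public void applyPolicy(String policyDocument) { /* vendor-specific logic */ }
    public boolean isPolicyActive(String policyId) { return true; }
}

class OpenSourceFirewall implements FirewallIntent {
    public void applyPolicy(String policyDocument) { /* open-source logic */ }
    public boolean isPolicyActive(String policyId) { return true; }
}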

In order for this to work, though, we have to look at intent-modeled functions at a bit of a deeper level.  The traditional vision focuses on what I’ll call the functional interfaces, the interfaces that bind the elements together in a cooperative system.  Those interfaces are critical for composition and management of complex systems of software components, but they’re the “horizontal” direction of a bidirectional challenge.  We also have to look at the “vertical” direction.

Consider this:  Functions A and B are both implementations of the same intent model, with the same external interfaces and properties.  Can we substitute one for the other in a hosting environment?  We actually don’t know, because the implementation is hidden inside.  We could presume compatibility with a given infrastructure set if we could assume one of three conditions were true.

The first is that, by implementation mandate, all of our intent models would have to be designed for a specific hosting framework.  That would be practical in many cases, because standardizing operating system and middleware features is a common practice in data center and cloud implementations.

The second is that each of the modeled implementations contains the logic needed to adapt it to a range of hosting environments that includes all our desired options.  Perhaps the deployment process “picks” a software element based on what it has to run on.  This is practical as long as the orchestration tools provide the ability to select software based on where and how it’s being deployed.

The final option is that there is a specific “vertical” or “mapping” API set, a second set of interfaces to our black box that is responsible for linking abstract deployment options to real infrastructure.  This would correspond to the P4 APIs or to the plugins of OpenStack.  This would appear to be the most practical approach, provided we could define these vertical APIs in a standard way.  Letting every modeled element decide its own approach would eliminate any real interchangeability of elements.
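
As a hedged illustration of what that second, “vertical” interface might look like, here’s a small Java sketch.  Everything here is hypothetical; the point is simply that the hosting/mapping API is separate from the functional interfaces, and standardized so that every modeled element describes its needs the same way.

// Hypothetical "vertical" API: how a modeled element binds its abstract
// deployment needs to real infrastructure.  Kept separate from the
// functional (horizontal) interfaces the element exposes to the service.
interface HostingBinding {
    HostingRequirements describeRequirements();   // abstract statement of needs
    void bindTo(InfrastructureTarget target);     // called by the deployment layer
}

// Placeholder types; a real VFPaaS would standardize these so that any
// implementation of a given model class describes itself identically.
class HostingRequirements {
    int vcpus;
    int memoryMb;
    boolean needsDataPlaneAcceleration;
}

class InfrastructureTarget {
    String hostId;
    String driverName;   // analogous in spirit to an OpenStack plugin selection
}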

That standardization of the vertical API relationships is a challenge, but it’s similar to the challenge of establishing a “class hierarchy” for virtual elements and defining the horizontal interfaces.  I hinted in a blog earlier this week that we likely need both an infrastructure API set and a functional API set for a good VFPaaS, and I think these points illustrate why.  Without them, the hosting requirements of network functions/features can be resolved only by applying policies that constrain the selection of objects to those that fit hosting needs.  If hosting needs can’t be standardized (as would likely be the case for core-to-edge-to-premises migration of functions), then standardization of the infrastructure APIs is the only sensible option.

Is the Cloud Creating its Own Security and Availability Risks?

IBM just had a major cloud outage, and it certainly won’t help the company’s efforts to become the “cloud provider to the corporate giants” or maximize the value of its Red Hat acquisition.  It also raises questions about whether there are complexity issues in cloud-building, and in cloud networking, that we may have glossed over.

The problem, according to a number of sources (like THIS one) was caused by an external service provider (an ISP) who flooded IBM’s routers with an incorrect BGP update.  The source isn’t identified, nor has it been revealed whether there’s a suspicion of misbehavior or just sloppy management.  The result was not only a widespread failure of IBM cloud services, but also the failure of many other sites and services that required something hosted on IBM’s cloud.

BGP’s vulnerability to hijacking is legendary, with its roots in the foundation of the Internet.  From the first, Internet protocol designers presumed a “trust” relationship among the various operators involved.  There’s minimal security built in, reflecting the fact that the early Internet was a network of universities and research labs that could all be trusted.  We don’t have that situation today, and that makes the Internet vulnerable, not only at the level we often talk about—where users connect—but in the internals of the Internet.

The specific problem with routing protocols is that when another router advertises a route, there’s a presumption that it really has that route and is presenting it correctly.  There’s also a presumption that it will advertise routing changes only when routes are really changing.  If a router either advertises a “false” route or issues a flood of updates that don’t reflect real changes in the network, the result can be congestion in route processing, false paths, and lost packets.  In the extreme case, you can get a complete collapse that could take hours (or even days) to sort out.

That we should be trying to eliminate a problem with literally decades of history behind it goes without saying, but since it has been a “legendary” problem, we’ve clearly let it go too long.  These days, there are plenty of bad actors out there, and BGP vulnerabilities are demonstrably a good way for one of them to attack Internet and cloud infrastructure.  There’s little point in my adding my voice to the chorus of the unheard on the topic, so let’s move on instead to the cloud itself.

Cloud security is typically taken to mean the security of applications and data in the cloud, and not the security of the cloud itself.  There’s a profound difference, and as it happens the difference is the most critical in the area of “carrier cloud”, and in particular the area of carrier-cloud hosting of virtual network functions.

A “network” is a tri-plane structure—data, control, and management.  In traditional IP, all three of these planes exist within the same address space, meaning that there’s a common IP network carrying all these traffic types.  When we transport ourselves from a device-centric model of networking to a virtual-function model, we add in a fourth layer, which I’ll call the “mapping” layer.  Functions have to be mapped to hosting resources, and the three layers of protocol that normally pass between devices must now be mapped to real connectivity.  There’s a tendency to think of this mapping process as an extension of the normal management layer, and that can be the start of something ugly.

If the mapping layer of a virtual network (in NFV terms, things like the Virtual Infrastructure Manager, Management and Orchestration, and the VNF Manager) is an extension of the management domain of the services, then all these elements are addressable from the common IP address space.  I have a device management port (say, SNMP) on a “virtual router”, and I likewise have a VIM port that lets me commit infrastructure.  I can send an SNMP packet to the former, and presumably (if I have the address of the VIM) I can commit resources.  Except that unless I’m the network operator, I shouldn’t be able to address the VIM ports at all.  If I can, I can attack them from the services themselves.

Securing mapping-layer resources can’t be treated as just a matter of access control or encryption (like HTTPS), either.  Encryption can prevent me from actually controlling mapping-layer features, but I can still pound them with connection requests, the classic denial-of-service attack.  What I need to do is completely isolate the mapping layer from the service layer.  They shouldn’t share an address space at all.
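
A minimal sketch of that principle in Java, assuming a hypothetical management-only network (the address is invented for illustration): the control listener binds only to an address on the isolated management network, so it never even appears in the service-facing address space.

import java.net.InetAddress;
import java.net.ServerSocket;

// Illustration only: bind a mapping-layer control port to an address on an
// isolated management network.  The real protection is that this network
// is a separate address space, unreachable from any customer service.
public class MappingControlListener {
    public static void main(String[] args) throws Exception {
        InetAddress mgmtOnly = InetAddress.getByName("10.255.0.10");  // hypothetical management-net address
        ServerSocket control = new ServerSocket(8443, 50, mgmtOnly);
        System.out.println("Control port bound to " + control.getLocalSocketAddress());
        // ... accept() loop for authorized orchestration traffic only ...
    }
}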

The biggest unanswered question in cloud computing today is whether we have adequate separation between the service layer of customers, and the “mapping layer” where cloud administration lives.  Without it, we have the risk of an attack on cloud administration, which obviously puts everything in the cloud at risk.

Even private cloud users, including container/Kubernetes users, should be thinking about the security of their mapping processes, meaning in this case Kubernetes’ own control ports.  Generally, Kubernetes assumes that all its pods, nodes, etc. are within a single address space, and that this address space is shared by everything, so every pod can network with every other pod.  Control ports are thus visible in that address space.  This might be a good time to think about having separate address spaces for the control ports.

Another thing we need to think about is the risk associated with cloud-hosted components of applications or services.  There’s a general view among cloud users that dividing work among multiple cloud providers improves reliability, but unless you can redeploy across them, the opposite is true.  At the fundamental level, an application that’s spread across three different hosting points (say, two public clouds and a data center) needs all three to be working, which means its reliability is lower than it would be if everything were hosted in one place.  For our trio to be more reliable, I need to be able to redeploy between the hosting points to respond to the failure of one.  If you hosted a part of your application in the IBM cloud and another part in your data center, absent this redeployment, you would have a total application failure if the cloud failed (and also if your data center failed).
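
A rough back-of-the-envelope version of that arithmetic, assuming independent failures and an illustrative 99.9% availability for each hosting point:

% All three hosting points are needed, so availabilities multiply:
A_{\text{series}} = A_1 \times A_2 \times A_3 = 0.999^3 \approx 0.997
% If two hosting points can substitute for each other via redeployment,
% the pair behaves more like a parallel (redundant) system:
A_{\text{pair}} = 1 - (1 - 0.999)^2 = 0.999999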

You also need to be thinking about how you control your multi-cloud or hybrid cloud.  If you’re not careful in how your orchestration/management tools are addressed, you may end up putting your control ports on your overall VPN, which makes them even more exposed.  Container/Kubernetes systems use Linux namespaces, so the applications have some addressing limitations built in.  Mix in non-container, non-Linux stuff and you may have no control over that critical mapping layer at all.

Virtualization of any kind increases the “threat surface” of infrastructure; the more layers you have, the more there is to attack.  These threats are exacerbated if care isn’t taken to separate the control network components from user services, or from anything that’s shared among multiple parties.  Finally, if you can’t certify mapping-layer components and even virtual functions for security, you’ll never secure anything.  If we’re not careful, we could be creating a major problem down the line.

Could We Define a “Virtual Function Platform as a Service?”

What exactly is a PaaS in function hosting?  I mentioned the term and concept in my Thursday blog (https://blog.cimicorp.com/?p=4162) and I got some questions from my contacts and clients.  What I’d like to do here is summarize the notion of PaaS (Platform-as-a-Service) in function hosting, in part to prepare for what I hope will be more detailed exploration of the CableLabs/Altran Adrenaline project.

In cloud computing, a “PaaS” is a cloud service that’s based on a platform that includes not only an operating system but also middleware, so that the service looks like an application platform of the sort you’d write code for on a PC or server.  In function-hosting terms, that would mean that a function host would assert a set of APIs that would expose useful function-hosting and service-creation features.

The classic vision of function hosting, from the ETSI NFV project, is that you have a set of external interfaces that define the information flow among discrete components of a hosted function or a function orchestration and management system.  The software you use in either case is pretty much free to work as it likes as long as it connects through these interfaces.

This has proved somewhat-to-largely ineffective in the real world.  First, the virtual functions have too much freedom to define how they work, how they’re managed, how they’re parameterized, and how they deploy.  Since VNFs were presumed to be sort-of-transported physical network functions, meaning extracted device logic, that flexibility assisted in making VNFs available, but it made it impossible to frame a standard means of deploying and managing them.  The “onboarding” process, the conversion of function software to running VNFs, was almost a per-VNF systems and software integration task.

Too many unconstrained variables create operations chaos, so the obvious answer is to start constraining them.  One step to do that is to mandate container use, something that’s already sweeping cloud computing and even data center applications.  Containers are a kind of canned deployment framework, one that presumes a specific way of deploying and connecting related elements.  This can then be combined with an orchestration tool (and Kubernetes is the de facto industry standard) to simplify operations.

The problem with containers and Kubernetes for virtual functions is that virtual functions introduce a requirement that traditional applications don’t have.  We don’t expect every implementation of CRM or ERP or payroll to work exactly the same way, but every implementation of a given class of virtual function (like “firewall”) should in fact work the same way, so that the deployment and operations practices are standardized for the class.  This is where the notion of a PaaS comes in.

All network devices could be said to deliver functionality at three levels.  There’s the data-plane stuff where bits are pushed here and there.  There’s the control-plane stuff that supports cooperative behavior of systems of functional elements by providing a means for them to communicate.  Then there’s the management-plane stuff that allows for control of devices, networks/systems, and services.  If we could define some specific APIs that would define all these layers of behavior, we could then take a powerful step toward operationalizing complex combinations of devices and hosted functions/features.  That would define a PaaS, or more explicitly, a “VNFPaaS” or (more generally) a “VFPaaS”, since some of the functions are likely not to be specifically network functions in the NFV sense.
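
To show what such plane-by-plane APIs might look like, here’s a purely hypothetical Java sketch; none of these names come from any standard, and a real VFPaaS would define them far more completely.

// Hypothetical VFPaaS API surfaces, one per plane (illustrative only).
interface DataPlaneApi {
    void forward(byte[] packet, int egressPort);                  // push bits
}

interface ControlPlaneApi {
    void exchangeControlMessage(String peerId, String message);   // cooperative behavior
}

interface ManagementPlaneApi {
    void setParameter(String name, String value);                 // device/function control
    String getStatus();                                           // state for lifecycle automation
}

// A hosted virtual function implements whichever planes it participates in.
interface VirtualFunction extends DataPlaneApi, ControlPlaneApi, ManagementPlaneApi { }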

Having this kind of VFPaaS would also facilitate portability across devices and migration or scaling of functions between the user edge, carrier edge, metro, and even deeper.  If virtual functions talk to their hosting hardware via a set of APIs, anything that can assert these APIs looks like any other function host.  This is what I believe Adrenaline does, at least to a degree.  The obvious question is how one defines a VFPaaS in the first place, meaning how the APIs are selected and then detailed for use.  One way is to follow a convention that’s not only popular in programming, it’s built into application-modeling tools like OASIS TOSCA.  That’s called “inheritance”.

In Java, you can define a construct called an “interface”.  This definition describes the inputs and outputs, meaning the API, of the object and at least references its function or purpose.  We could define, in our VFPaaS, an “interface” called “virtual-CPE”, for example, to describe a general device that provides some inline network functionality at the user end of the service demarcation.  But that’s not all.  We can (maintaining the Java reference) define another interface that extends our first one.  Say now that “firewall” is a subclass of “virtual-CPE”.  The properties of our new object are additive to those of the object it extends, so those things that are common to all virtual-CPE are provided in firewall.
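
Rendered directly in Java, that example looks something like this (method names are illustrative, not part of any defined VFPaaS):

// "firewall" extends "virtual-CPE": everything common to virtual CPE is
// inherited, and firewall adds only its own additive properties.
interface VirtualCpe {
    void attachToDemarcation(String serviceId);   // common to all virtual CPE
    String reportStatus();
}

interface Firewall extends VirtualCpe {
    void applyRuleSet(String rules);              // firewall-specific behavior
}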

The value of this in defining a VFPaaS is that if you start “extending” your basic elements, you could learn a lot about what your VFPaaS should look like.  One thing we might learn is that our “virtual-CPE” isn’t high-level enough.  There’s really another level above it, one we could call “virtual-data-plane-element”, that represents anything that serves as part of a data plane of a network.  This is a useful piece of portability insight.

We can also learn that there may be two (or more) “inheritance trees”, and my virtual-CPE example demonstrates this potential.  We have our virtual-data-plane-element, which we now see could be “hosted” on a virtual-CPE element.  That means we should have a dual set of base objects, one representing the “function” and the other the “platform” or “hosting”.  A given virtual function always exposes its own APIs, but what it’s hosted on also exposes a set of APIs that permit that particular thing to be managed.  A CPE “platform object” would present a customer-facing management interface, where one embedded in the network likely would not.
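
A sketch of that dual-tree idea, again with invented names: one tree is rooted in the function (virtual-data-plane-element), the other in the platform that hosts it, and the two meet only where a platform deploys a function.

// Function tree: anything that forwards traffic as part of a data plane.
interface VirtualDataPlaneElement {
    void forwardTraffic();
}

// Platform/hosting tree: anything a function can be deployed onto, with its
// own management APIs independent of the functions it hosts.
interface HostingPlatform {
    void deploy(VirtualDataPlaneElement element);
    String platformStatus();
}

// A CPE platform adds a customer-facing management interface; a platform
// embedded deep in the network likely would not expose one.
interface CpePlatform extends HostingPlatform {
    String customerFacingStatus();
}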

The APIs needed for VFPaaS can be determined by cataloging the features of the “platform” that the full set of service/function objects consume.  This lets us build an API set that fully supports the virtual functions, standardizing how they interact with their environment, and also how they interact with any general provisioning or management features.

Why “redefine” things at all?  If everything that’s a virtual-data-plane-element has (for example) the same basic API, differing only in how it’s parameterized, why define a “firewall” or a “switch” or a “router” as subclasses?  The answer is that we could presume all “firewalls” should be freely substitutable for each other in a service, but a “firewall” and a “router” should not be.  We may then want to use modeling or composition-tool features to make these illogical substitutions impossible.  But we also want to avoid, in defining our VFPaaS APIs, needless explosions in the number of objects we use, or any hiding of what should be common relationships.  Otherwise we end up with VFPaaS APIs that do almost the same thing as other APIs, complicating implementation of the platform as well as its use.
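
Here’s a small illustration, again with purely hypothetical names, of how the class hierarchy itself can enforce like-for-like substitution: a composer that asks for a firewall will accept any firewall implementation, but handing it a router is rejected outright.

// Both are data-plane elements, but they are distinct subclasses.
interface DataPlaneElement { void forward(); }
interface FirewallElement extends DataPlaneElement { void applyRuleSet(String rules); }
interface RouterElement   extends DataPlaneElement { void advertiseRoute(String route); }

class ServiceComposer {
    // Like-for-like substitution: any FirewallElement can replace another.
    // Passing a RouterElement here is a compile-time error, which is exactly
    // the constraint we want the model/composition tooling to impose.
    void substituteFirewall(FirewallElement current, FirewallElement replacement) {
        // ... rewire the service graph slot to use "replacement" ...
    }
}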

Things obviously get more complex when you leave virtual network functions for the more general virtual-function level.  While network services can be broken down fairly easily, you may have to get more creative if you decide you want to define a “content-delivery-network”, and surely more creative still if you want to describe a “personalized-ad-service”.  The need for “creativity” here means that you can’t set a fixed structure of element/object classes; you need a technique you can extend as required.

Programming languages work like that, and so does TOSCA, but you have to be careful in how you do all of this if you’re talking about a VFPaaS.  For each defined object, you need to have associated APIs, and if you keep defining new stuff you risk creating too many APIs.  It should be a goal to define the minimum class hierarchy that covers the functional and hosting requirements.  Minimal class hierarchies define minimal API sets.

Modeling isn’t necessary to exploit a VFPaaS, but it’s helpful in managing it.  The model, if it reflects the same class hierarchy, synchronizes operations processes that are model-driven with the platform relationships that are essential to deployment and operations.  We should therefore think of the class hierarchy as having two outputs, the model and the API set.

TOSCA isn’t the only modeling approach that offers class inheritance, nor is Java the only programming language with the concept.  Given the almost-universal support for the approach in modern software thinking, it’s amazing to me that we’re not thinking more about it in the broader context of function hosting, where it offers considerable benefits in operations simplification and onboarding of functions.  I’m hoping that we’re going to see more attention paid to the concept as things like the cable-network initiative, Adrenaline, mature.