All That’s Tech Glitter Isn’t Gold

Like most years past, we’re entering 2021 with a healthy population of exaggerations regarding tech issues.  Since, as I’ve said many times, “news” means “novelty”, not “truth”, I guess that’s inevitable.  The problem is that 2021 is going to be a pivotal year for tech overall.  We’re recovering from a pandemic-generated economic slump, we’re facing major shifts in work behavior that have grown out of work-from-home, and many vendors and users have told me they had major tech plans on hold until things stabilized.  There are a lot of things coming home to roost, in short, and we need to face them squarely.

One of the biggest areas at risk of over-hype is 5G.  Mobile infrastructure upgrades to support 5G are one of the few budgeted areas in the service provider space, which means vendors are descending on it like vultures in the Masai Mara descend on a kill.  We’ve tied so many things to 5G that it’s creaking under the load.

One major problem is that users are expecting a radical change in their mobile experience with 5G, and both operators and vendors tell me privately that the average smartphone user will likely have to check the service indicator on their display even to know they’re on 5G.  The fact is that 4G is fast enough to deliver streaming experiences to the limit of devices and eyesight.  Yes, you could download faster, but most users have little need to download a lot of stuff.

5G is essential for mobile operators, since it provides for higher subscriber density and per-cell bandwidth, but that makes a story that’s too dull for the average reader, so we’re jazzing it up with implications that 5G deployments will revolutionize smartphone use.  For operators, the complication is that without 5G handsets there’s no value to the service at all, which means many smartphone users would have to be persuaded to replace their handsets.

But what happens when users get the handsets, and the service?  What happens to users’ attitudes toward 5G when the truth comes out, as we know it will?  The PR cycle for any new technology shifts quickly from presenting it as the last savior of civilization to the seed of its imminent downfall, and we’re getting close to the inflection point, since people are finding out that what they expected isn’t going to happen.  That will make it harder to promote something, even something that could be revolutionary, later.

That potentially revolutionary thing may already be here.  Operators facing 5G’s moment of truth are also facing the truth that the place where 5G is most likely to impact user experience is in the area of fixed broadband.  The hybridization of 5G mm-wave technology with fiber to the node (FTTN) creates a pathway to residential and small-business-site broadband that could revolutionize the cost side for areas where demand density is average or below.  It’s hard to get detailed numbers, but my model says that only about 25% of the US population could be served by FTTH, while over 80% could get 50 Mbps or better using 5G mm-wave.

It’s interesting that in the US, Verizon has been promoting 5G mm-wave more than AT&T, when Verizon needs it less for residential broadband because of its higher demand density.  But maybe that’s the reason; if there’s going to be a large-scale 5G mm-wave deployment for residential broadband, it would create the potential for a large population of cell sites that could also serve smartphones.  This could allow a smart operator to deal with mobile users in high-density areas using the same technology that’s providing home and small-site access.  That could be a big cost savings, and also the basis for a broad-scale modernization of network infrastructure.

Which brings up our second point.  If you read the tech rags, you’d come away with the impression that there are at least half-a-dozen technologies that would single-handedly reverse the problem network operators have with profit-per-bit squeeze.  There are indeed that many profit-sustaining technologies; the fallacy is that any of them would fix the problem by itself.  We’ve created a massive information and content ecosystem that provides us with almost all communications and entertainment.  Its sheer financial inertia is dazzling, and it’s supported by people whose jobs depend on specialized roles in supporting that current ecosystem.  It’s not going to change because somebody waves a little technology magic-wand at it.  We need a massive shift.

Massive shifts are highly unpopular with the people who fund technology news and events, the vendors.  We have as much invested in service provider, content provider, cloud provider, and end-user infrastructure and practices, dependent on the current technology model, as the global GDP of 50 years ago.  Imagine a salesperson with a quarterly quota facing that sunk cost as a barrier to a new technology.  They’d do one sales call and start looking for another job.  Instead, we pretend that we can stick another band-aid on the creaking model, because we can sell that notion.

It’s not working.  We’re adding layers of technology, layers of capital cost, and layers of operations complexity.  We’re creating an ecosystem so complicated that we can’t possibly secure it, because it has too big and fuzzy an attack surface.  Just when we think we’ve gotten a handle on things, we find that there’s a whole new approach to attack that turns our own remedies into vulnerabilities, which is what the SolarWinds hack did.

The vulnerability of networks to malware introduced as a trusted component of infrastructure was raised at least as far back as 2013, because I raised it within the context of NFV.  As far as I know, as of today, it’s not been resolved there, nor has the problem of “creeping insecurity” created by introduced malware been fully addressed in the cloud.  Could the reason be that there’s no quick fix, nothing easy to sell that promises to make you immune?

So, did the operators mess us up with their hype, or the vendors?  Did publications and analysts feed us garbage and delude us?  Yeah, perhaps, but we have to understand that hype is a buyer-driven process.  Nobody tells an unattractive, unpopular lie, because there’s no value to it.  We have created our own problem by being willing consumers of what we find interesting, exciting, and easy.  We could change this industry, all of tech, in a heartbeat if we simply accepted reality and dealt with it.

I almost got out of the tech business 20 years ago, during the peak of the tech bubble.  I’d get four or five calls a week from startups wanting “consulting”, but really meaning “promotion”, on a concept that had zero chance of meeting its stated goals.  I’ve always refused to work with companies whose story couldn’t pass muster with me, and so I was turning almost everything away.  I stuck it out, and there was a market crash that eventually weeded out the total nonsense, and things got a bit better.

We may be backsliding now.  Do we need that tech bubble, the hype bubble, to happen again?  I hope not, because while sand castles are really attractive, it’s probably not wise to plan to live in one.  I’m still applying my standards for working with companies, which means I will never say what I don’t believe, and never work with someone who I don’t think is delivering a tangible value proposition.  I’m still getting pushback, pressure to just sell a vendor’s story as the truth when it isn’t.

Let me gently suggest that you have to apply a similar standard if we’re going to change things.  Next time you pass over an article that describes the complexity of this or that new concept, in favor of reading something on self-driving bicycles or AI doorknobs, remember that you’re contributing to the pushback against reality.  The tech world isn’t easy for me, but that doesn’t mean I’m not responsible for doing my best.  So are we all.  That’s the best lesson we could learn in the new year.

To Control Complexity, Abstract It!

What is the biggest challenge in tech?  This happens to be a question that unites the service providers and enterprises, the network and the data center and cloud.  According to the people I’ve chatted with over the last quarter of 2020, the answer is complexity.  Both groups say that their need to address complexity grows from both existing and emerging technologies, and both believe that the emerging technologies are increasing complexity rather than reducing it.

Complexity issues manifest themselves in a number of ways, some obvious and some less so.  The obvious, and number-one, impact of complexity is increased operations costs.  Companies say that both their staff size and the qualifications their workers need have increased, and both raise operations costs.  Ranking second is the transition to a new technology, the process of adopting and operationalizing something different.  A less-obvious problem is assessing new technologies properly in the first place.  Companies say that they often stay with existing stuff simply because they can’t gauge the value, or impact, of something new.

If the challenges of complexity are fairly easy to identify, the path that leads to it and the remedies for it aren’t.  A full third of enterprise network and IT professionals think that vendors deliberately introduce complexity to limit buyer mobility.  My view is that this is less a factor than many believe, and that we have to understand how we got complex if we want to get out of the box it creates.

Modern network and IT complexity arises from the fundamental goal of optimization of resources.  By “resources” here, I mean capital equipment.  Both enterprises and service providers spend more on operations than on the technology elements themselves, but neither group has paid nearly the attention to operations cost impacts that they’ve paid to capex.  CIOs, for example, have consistently told me that their capital budgets get more scrutiny than their operating budgets, and that the justification needed for capital projects focuses on the capital side, treating operations impacts as a secondary issue.

In networking, the capacity of optical trunks is managed by the packet layer, which also provides connectivity by threading routes through nodal points that can cross-connect traffic.  Today, more is spent at that packet layer optimizing fiber capacity than is spent creating the capacity itself.  In computing, virtualization allows more effective use of server capacity, but adds layers of software features that have to be managed.  We’re not at the point where server management costs more than servers in a capital sense, but many enterprises say that platform software and the operations costs related to it are already approaching server costs.

Why does this happen?  There are three reasons, according to both service providers and enterprises.  The biggest problem is that new capabilities are introduced through layering a new technology rather than enhancing an existing one.  Often, a new layer means new devices (security is a good example), and everyone knows that just having more things involved makes everything more complex.  Next in line is that functionality drives adoption, which induces vendors to push things that open new business cases.  This means that there’s less attention given to how functionality and capabilities are evolving, and less opportunity to optimize that evolution.  The third reason is that vendors and buyers both gravitate to projects with limited scope, to limit cost and business impact.  That creates the classic tunnel vision problem and reduces incentives and pressures to think about cost and complexity overall.

The facile answer to the “How do we fix this?” question is to do better architectural planning.  If careful design creates a solution that can evolve as problems and technology evolve, that solution can then optimize the way technology elements are harnessed for business purposes.  Unfortunately, neither buyer nor seller has been able to adopt this approach, even though it’s shown up as a theoretical path to success for three or four decades.  With software and hardware and practices in place to support decades of growing disorder, we can’t expect to toss everything and start over.  Whether it’s the best approach or not, we have to fall back on something that can be adopted now and have some short-term positive impact.

That “something”, I think, is a form of virtualization, a form of abstraction that’s commonly called “intent modeling”.  A virtual machine is an abstraction that behaves like a bare-metal server, but is really implemented as a tenant on a real server.  A virtual network looks like a private network but is a tenant on a public network.  Virtualization has been a major source of complexity, because it introduces new elements to be acquired, integrated, and managed.  Properly applied, it could be an element in the solution to its own problem.  How that happens is through intent modeling.

An intent model is a functional abstraction representing an opaque and unspecified implementation of capabilities.  Think of it as a robot, something with human form and behavior but implemented using hidden technical pieces.  An autonomous vacuum, for example, cleans the floor, which is a simple external property.  The average user of one would have no clue as to how to get one to work, but they don’t have to, because the “how” is hidden inside the abstraction.
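To make the abstraction concrete, here’s a minimal sketch in Python of how an intent model might expose only its external properties, its intent and its SLA, while keeping the implementation entirely hidden.  All the names here are hypothetical, invented for illustration, not any particular product or standard.

```python
from abc import ABC, abstractmethod

class IntentModel(ABC):
    """An opaque abstraction: callers see intent and SLA, never the implementation."""

    @abstractmethod
    def deploy(self, intent: dict) -> None:
        """Realize the stated intent (e.g. 'clean the floor', 'connect sites A and B')."""

    @abstractmethod
    def sla_met(self) -> bool:
        """Report whether the externally visible SLA is currently being met."""

class VacuumRobot(IntentModel):
    """Hypothetical implementation; the caller never sees motors, sensors, or firmware."""

    def deploy(self, intent: dict) -> None:
        # Internally this might plan a route, manage the battery, avoid the stairs...
        self._target_area = intent.get("area", "whole_floor")

    def sla_met(self) -> bool:
        # Any internal remediation happens here; only pass/fail escapes the abstraction.
        return True

# The consumer expresses intent; the complexity stays inside the model.
robot = VacuumRobot()
robot.deploy({"area": "kitchen"})
print(robot.sla_met())
```

The point of the sketch is the boundary, not the vacuum: whatever sits behind deploy() and sla_met() can be as complicated as it needs to be without that complexity leaking out to the consumer.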

I’ve blogged often about how intent models can be applied to operations automation, but they have a direct impact on all the aspects of complexity, and to understand why, we need only consider those vacuuming robots.

How many robot vacuums could we expect to sell if we offered only directions on how to build one, or even a kit from which such a thing could be built?  What people want is a clean floor, not an education in applied electrical engineering.  The first secret in managing complexity is to subduct it into intent models, where it’s safely separated from functionality.  We see enhanced robots (like the dancing ones shown on a popular Internet video) as enhanced features, not as additional complexity, because most people don’t consider or even care about how the features are delivered.  A tomato is something you slice for your salad, and the complexities of growing it aren’t in your wheelhouse.

The next obvious question is why, if intent models and virtualization are such obvious winners, they’ve not exploded into prominence in the market.  There are several answers to this, but the most relevant are the problem of the economics of presentation and the granularity of adoption.

If you’re a vendor, you’re part of a polar ecosystem of buyer/seller.  You’ve adapted your product strategy, from design through sales and support, to the mindset of the prospective buyer.  Here, those three reasons for complexity growth that I introduced earlier apply.  Simply put, you’ve sold your stuff based on what your buyer wants to hear, and your buyer is used to complexity.  They might even owe their jobs to it.  Selling “intent-modeled Kubernetes” to a certified Kubernetes expert is going to be harder than selling it to the CIO, but the Kubernetes expert is the one doing the RFP.

In any event, the value of intent modeling and virtualization as a means of reducing complexity is as limited as the scope over which you apply the techniques.  It would be practical (if not easy) to adopt intent modeling with a new project, but most new projects’ scope is limited, and so are the benefits.  Having one assembled robotic vacuum among a host of kits to build others wouldn’t make the combination particularly easy to sell, or reduce the skills needed to assemble all the others.  But “scope creep” is a great way to lose project approvals, as anyone who’s tried to push a project through the cycle well knows.

The fact is that buyers aren’t likely to be the drivers behind intent-model/virtualization adoption; it has to be the sellers.  It would be unlikely, at this stage in the evolution of intent models, to have a general solution applicable to all possible technology elements.  It would be possible to have a vendor introduce an intent-modeling approach linked to a specific area (like networking), to a specific technology (like AI/ML), or both.  It would be optimal if a leading vendor in a given space were to “intent-ify” its own solutions.

The big virtualization vendors, meaning VMware, Red Hat, and HPE, could well field an offering in this space, either broadly across one of their existing technologies or in a narrow area (like 5G Open RAN).  Network vendors could do the same for the data center, 5G, or broadly.  Juniper, which recently acquired Apstra, a data-center intent-based-networking player, might be a promising source of something here.

A final possibility is that a standard like OASIS TOSCA could be the basis for a solution.  TOSCA has the ability to define/architect intent models, but it would be a bit of work to do even one such model, and it would likely require extension/expansion of TOSCA itself.  Thus, I think this avenue of progress will likely have to wait for a vendor-driven approach to promote intent-modeling interest.

Something is eventually going to be done here, and I think intent modeling and virtualization will be a piece of it.  I’d like to predict that it will happen in 2021, and I think there’s a good chance that the first vendor positioning will indeed come this year.  Where one vendor steps, others will likely want to mark territory themselves, and that might finally move us toward addressing complexity.

What the Heck Does “Disaggregated” Mean?

Is there one good definition for disaggregation, in the context of networking?  I’ve said in a number of my blogs that “commercial disaggregation”, the separation of software and hardware so you can charge an annual license fee for the software, isn’t disaggregation.  I’ve also suggested that just having software and hardware separated so you can replace a monolithic proprietary router with a monolithic open router isn’t really disaggregation either, though it may be useful (as we’ll get to below).

What can we use as a definition?  Maybe we have to look at the challenge in a different way, and maybe that will let us draw a picture of what disaggregation should be rather than what it’s claimed to be.

The value of “disaggregating” traditional network devices has to come not from how we take things apart, but from how we put them back together.  Properly done, disaggregation builds a bridge between “hosted” and “white-box” technology, between cloud-native and monolithic.  It could even bridge “connection services” like VPNs and so forth, and OTT services like content.  In short, it could move us forward significantly on the path to a new network model.

The most concrete articulation we have of a new network model based on disaggregation is SDN.  SDN’s founding principle is to separate the control plane and forwarding plane, with the latter implemented in simple commodity white boxes and the former centralized in a master control point, likely redundant.  The process of figuring out routes is largely left to implementation; the controller decides what forwarding entries to send to the white boxes, and so decides routes.  Likely, the controller would base the decision on information gathered from the boxes themselves, information reflecting trunk state and load.
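As a rough illustration of that division of labor, here’s a hedged Python sketch of a centralized controller that computes routes from reported link state and pushes forwarding entries down to the white boxes.  The topology, the link-state format, and the push_forwarding_entries call are all invented for the example; a real controller would use a southbound protocol such as OpenFlow or P4Runtime.

```python
import heapq

# Link state reported by the white boxes: (node_a, node_b, cost)
link_state = [("A", "B", 1), ("B", "C", 1), ("A", "C", 5), ("C", "D", 1)]

def build_graph(links):
    graph = {}
    for a, b, cost in links:
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph

def next_hops(graph, source):
    """Dijkstra from 'source'; returns {destination: next_hop} for forwarding entries."""
    dist, first_hop, heap = {source: 0}, {}, [(0, source, None)]
    while heap:
        d, node, hop = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                      # stale heap entry
        if hop is not None:
            first_hop.setdefault(node, hop)
        for nbr, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr, hop if hop else nbr))
    return first_hop

def push_forwarding_entries(device, entries):
    # Hypothetical southbound call; here we just show what would be sent.
    print(f"{device}: {entries}")

graph = build_graph(link_state)
for device in graph:
    push_forwarding_entries(device, next_hops(graph, device))
```

The white boxes never run a routing protocol themselves; they just forward according to whatever entries the controller computed from the state they reported.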

SDN has gained favor within the data center, but not so much in the WAN.  Part of the problem is concern over controller faults, and part over the question of how you’d regain control of a device, or an entire network, if the controller lost connection to it.  A maverick route could cut off pieces of the network.  In any event, there’s not been a significant amount of work done on large-scale WAN SDN, but SDN offers an intriguing step that warrants more consideration.

If you separate the control plane, as SDN does, would you have to centralize route management?  It seems like this is at one extreme of a wide range of options, the other being a fully distributed control plane with one element per white-box, but running in the cloud rather than on the devices.  It also seems like the IP control plane shouldn’t be considered the only option to control forwarding, and in fact that multiple options might share even a single white box.

White boxes could also be re-aggregated.  NFV says that a virtual device could be created by chaining hosted VNFs.  Why wouldn’t it be possible to create virtual devices by aggregating white boxes?  A virtual device is handy as a means of simplifying operations.  Think about being able to manage all the ports and trunks at a particular site as a single router.  Why not, given that the stuff all terminates in the same place?  You could even virtualize multiple sites and treat them as a single device.  Why?  Think about the relationship between OSS/BSS and NMS.  Many operators have a means of reflecting the network back into the OSS, but why reflect a bunch of separate routers when the service treats the network as a whole as an asset?
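A minimal sketch of what re-aggregation might look like from the management side, with purely hypothetical names: several physical boxes at a site are presented to the OSS/NMS as one virtual router whose ports are the union of the ports of its members.

```python
class WhiteBox:
    """A physical box with its own ports and per-port status."""
    def __init__(self, name, ports):
        self.name = name
        self.ports = {p: "up" for p in ports}

class VirtualDevice:
    """Presents a collection of white boxes as a single manageable router."""
    def __init__(self, name, members):
        self.name = name
        self.members = members

    def ports(self):
        # The virtual device's ports are the union of its members' ports.
        return {f"{box.name}:{p}": s for box in self.members for p, s in box.ports.items()}

    def status(self):
        # A simple rollup: the virtual device is "up" only if every member port is up.
        return "up" if all(s == "up" for s in self.ports().values()) else "degraded"

site = VirtualDevice("site-router-1", [WhiteBox("wb-1", ["ge-0/0/0", "ge-0/0/1"]),
                                       WhiteBox("wb-2", ["ge-0/0/0"])])
print(site.ports())
print(site.status())
```

The OSS sees one device and one status; how many boxes sit behind it, or how the rollup is computed, is the virtual device’s own business.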

Of course, you could disaggregate and re-aggregate below the box level too.  Think of having multiple virtual routers sharing a single box, or sharing a collection of re-aggregated boxes being treated as a single virtual device.  We could combine this with the multiplicity of control planes and create something that was both an Ethernet switch and a router, constructed extemporaneously from whatever ports/trunks were appropriate.

You can also create virtual devices that embody features beyond simple routing, which is how you’d be able to exploit the ability of a truly disaggregated solution to rise above Level 3.  CDN features that allow the selection of the optimum cache point for a video URL, rather than resolving to a fixed location, are one example.  5G UPF implementations, with their tunnel management, are another.

The virtualization-centric notion of disaggregation and re-aggregation offers an opportunity to build network node functionality and network management visibility almost orthogonally.  What the network does and what it looks like don’t have to be congruent.  That’s a highly useful concept when it comes to both service features and service lifecycle automation.

In this model, any feature related to connectivity could be absorbed into an integrated control plane, and assigned on a per-service or even per-user basis.  VPNs could be maintained in total isolation or supported the way they are today, via a common virtual-router infrastructure.  Same with CDNs and 5G, including each network slice.  It’s these properties that make this networking model so well-adapted to the network-as-a-service approach.

There are a couple of obvious questions about this, of course.  The first one is whether we can know anything about the specific architecture needed to support this model, and the second is whether anyone is doing much to support the architecture.  I’ve already noted the vendor progress toward the goal of network-as-a-service, the best current indicator for progress toward fully realizing disaggregation benefits, HERE.  Now, let’s look at the specific things that would realize disaggregation potential.

The first thing you need in an architecture for this networking model is a fully abstracted forwarding plane with partition/slicing capability.  You need a single language to control forwarding so that it’s easy to write forwarding-control applications for the separate control plane.  You also need the ability to make a static assignment of some ports and trunks to a given application, or allow applications to share control of ports/trunks via a “mediator” function that resolves any conflicts.  The “language” of forwarding at the top of this abstraction needs to be consistent, but it could resolve to different chip- or chassis-level languages below.
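A hedged sketch of the mediation idea: multiple control-plane applications submit forwarding rules for shared ports, and a mediator honors static assignments first and then applies a simple priority scheme to resolve conflicts.  The names, the rule format, and the priority policy are all assumptions for illustration, not anything defined in a standard.

```python
class ForwardingMediator:
    """Arbitrates forwarding-rule writes from multiple control-plane applications."""

    def __init__(self, static_assignments=None):
        # Ports statically assigned to a single application bypass arbitration.
        self.static = static_assignments or {}     # port -> owning app
        self.rules = {}                            # (port, match) -> (app, priority, action)

    def request(self, app, port, match, action, priority=0):
        owner = self.static.get(port)
        if owner and owner != app:
            return False                           # port is reserved for another app
        key = (port, match)
        current = self.rules.get(key)
        if current and current[1] >= priority and current[0] != app:
            return False                           # an equal- or higher-priority rule wins
        self.rules[key] = (app, priority, action)
        return True

mediator = ForwardingMediator(static_assignments={"port-7": "5g-upf"})
print(mediator.request("ip-routing", "port-1", "10.0.0.0/8", "fwd:port-2", priority=1))
print(mediator.request("cdn-steering", "port-1", "10.0.0.0/8", "fwd:port-3"))  # loses
print(mediator.request("ip-routing", "port-7", "0.0.0.0/0", "fwd:port-9"))     # port reserved
```

Whatever the real arbitration policy turns out to be, the point is that it lives in one place, above the abstracted forwarding plane, rather than being re-invented by every control-plane application.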

The second thing you need is a low-latency, high-availability “service mesh” to link the control-plane elements with each other, and with the forwarding plane.  Separating the control plane creates a general risk of loss of control if the connection with the forwarding plane is lost.  The problem of latency also creeps in, creating a risk of loss of context for control-plane decisions because of a lack of current state data from the forwarding plane.  There’s also an increased risk of colliding instructions when there’s latency between control and forwarding.  All these risks increase if the control plane is extended either geographically or functionally, meaning wider deployment or a larger number of higher-layer service features.

I’m putting the term “service mesh” in quotes here because it’s not clear just how much of a full-feature implementation of a service mesh would be necessary for this mission, and what the tradeoffs would be relative to latency.  Traditional service meshes add considerable latency, and that might make them unsuitable for connecting control-plane elements with the forwarding plane, or each other.

The third thing needed is an authoritative, singular, model of the network or network domain.  You can’t have multiple control planes creating their own images of the state of the network and have any hope of either efficient operation or control over collisions in commands.  Think of this as “state-as-a-service”, something that any control-plane element can access to determine conditions.  It’s likely that the same singular model would be used to mediate access to forwarding elements where shared access is provided.
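One way to picture “state-as-a-service”, again as a sketch with hypothetical interfaces: a single versioned store that every control-plane element reports into and reads from, so that no element works from its own private image of the network.

```python
import threading, time

class NetworkStateService:
    """Authoritative, versioned view of network/domain state shared by all control planes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._state = {}        # e.g. {"trunk-A-B": {"status": "up", "load": 0.4}}

    def report(self, element, observation):
        # Forwarding elements (or their agents) report conditions here.
        with self._lock:
            self._state[element] = {**observation, "ts": time.time()}
            self._version += 1
            return self._version

    def snapshot(self):
        # Control-plane elements read a consistent (version, state) pair, never raw devices.
        with self._lock:
            return self._version, dict(self._state)

store = NetworkStateService()
store.report("trunk-A-B", {"status": "up", "load": 0.4})
version, state = store.snapshot()
print(version, state)
```

The versioning matters: two control-plane elements can at least tell whether they were looking at the same picture of the network when their instructions collide.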

The final thing we need is an intent-model-based framework for managing our virtual views of features and services.  If you’re going to have a network whose virtual devices are set by aggregating any number of real elements and features, you need to have an elastic management vision.  The key to that is to provide for a hierarchy of virtual devices, each of which composes the higher-level management view, and each of which conforms to its implicit or explicit SLA through internal management.  This same capability would allow one management jurisdiction to do the “inside” management and another the “outside”.
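A tiny sketch of that hierarchy idea, with every name invented: each intent-modeled element either meets its SLA through internal remediation or reports a fault upward, so the “outside” manager only ever sees the level of the hierarchy it owns.

```python
class IntentNode:
    """An intent-modeled element: children are opaque; only SLA state is visible upward."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.local_fault = False

    def remediate(self):
        # "Inside" management: try to fix the problem without telling anyone above.
        self.local_fault = False
        return True

    def sla_met(self):
        # "Outside" view: a node meets its SLA if it, and everything it hides, does.
        if self.local_fault and not self.remediate():
            return False
        return all(child.sla_met() for child in self.children)

service = IntentNode("vpn-service", [
    IntentNode("site-router-1", [IntentNode("wb-1"), IntentNode("wb-2")]),
    IntentNode("core-transport"),
])
service.children[0].children[1].local_fault = True   # a hidden fault, remediated internally
print(service.sla_met())                              # the service-level view stays green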

NaaS figures into this because it’s a service-level representation of capabilities that could be derived from exploiting disaggregation.  While I think all truly disaggregated networks would be NaaS-capable, not all NaaS solutions would derive from disaggregation.  I think monolithic router vendors could deliver NaaS without disaggregating, for example.  What this means is that disaggregation may have to be justified at each intermediate step, because it’s not the only way to a truly revolutionary service model future.  The most interesting question for 2021 will be who might find a way to do that.

Less-Obvious Lessons from the Massive Hack

The recent massive US government hack has raised a lot of security concerns, as it should have.  The details of just what was done and how remain a bit murky, but I think that one thing that’s clear is that the hack attacked applications and data through management and monitoring facilities, not directly.  That’s something that should have been expected, and protected against, but there’s a history of ignoring this dimension of security, and we need to rewrite it.

I’m not going to try to analyze the specifics of the recent hack, or assign blame.  Others with direct exposure to hacked sites are better equipped.  Instead, what I want to do is address an attitude that I think has contributed to hacks before, including this one, and somehow still seems to go on creating problems.  A big part of the problem we’re having today is virtualization, and that’s everywhere.

It’s traditional to think of an application or database as a set of exposed interfaces through which users gain access.  Security focuses on protecting this “attack surface” to prevent unauthorized access.  This is important, but when we talk about “defense in depth” and other modern security concepts, we tend to forget that hacking doesn’t always take the direct route.

Early hacks I was aware of didn’t involve applications at all, but rather focused on attacking the system management APIs or interfaces.  A hacker who could sign on to a server with administrative credentials could do anything and everything, and many organizations forgot (and still forget) to remove default credentials after they’ve installed an operating system.  As we’ve added in layers of middleware for orchestration, monitoring, and network management, we’ve added administrative functions that could be hacked.

All these management applications have their own user interfaces, and so it’s tempting to view them as extensions of the normal application paradigm.  The problem is that more and more of the new tools are really involved in the creation and sustaining of virtual resources.  There are layers of new things under the old things we’re used to securing, and it’s easy to forget they’re there.  When we do, we present whole new attack surfaces, and yet we’re surprised when they get attacked.

One of the most insidious, and serious, problems in the virtual world is the problem of address spaces.  Anything we host and run has to be part of an address space if we expect to connect to it, or connect its components.  When something is deployed, it gets an address from an address space.  Homes and small offices almost always use a single “real” or public IP address, and within the facility assign each user or addressable component an address from a “private” address space.  These addresses aren’t visible on the Internet, meaning that you can’t send things to them, only reply to something they’ve sent you.

The “security” of private address spaces is often taken for granted, but they aren’t fully secure.  You could, if you could form the proper packet headers, create a message addressed to a private address, and it would be delivered.  That’s not the main problem for our new middleware tools, though.  The problem is that anything that’s inside that private address space is able to address everything else there.

Address space management is a critical piece of network security policy.  It won’t prevent all security problems, but if you get it wrong it’s pretty likely to cause a bunch of them.  One of my criticisms of NFV is that it pays little attention to address space management, and the same could be said for some container and Kubernetes implementations.  Whatever can access an application or component can attack it, which means that address space control can limit the attack surface by keeping internal elements (those not designed to be accessed by users) inside a private space.
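As a small illustration of enforcing that policy, here’s a sketch using Python’s standard ipaddress module: components that aren’t meant to be public are checked to make sure they bind only to RFC 1918 private addresses, so they never show up on the public attack surface.  The component inventory is invented for the example; a real check would be fed from your actual deployment records.

```python
import ipaddress

# RFC 1918 private ranges; internal-only components must live inside these.
RFC1918 = [ipaddress.ip_network(n) for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def is_rfc1918(addr):
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in RFC1918)

# Hypothetical inventory: component -> (bound address, intended to be publicly reachable?)
components = {
    "api-gateway":      ("203.0.113.10", True),    # meant to face the outside world
    "orchestrator-api": ("10.0.12.5", False),      # internal management plane only
    "monitoring-agent": ("198.51.100.7", False),   # misconfigured: internal role, public address
}

def audit(components):
    for name, (addr, public_role) in components.items():
        if not public_role and not is_rfc1918(addr):
            print(f"VIOLATION: {name} is internal but bound to public address {addr}")
        else:
            print(f"ok: {name} ({addr})")

audit(components)
```

It’s a crude check, but it captures the principle: the attack surface is whatever can be addressed, so keeping internal elements inside private space is part of security policy, not just network hygiene.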

Another issue is created by management interfaces.  Things designed to work on IP networks presume universal connectivity and apply access control to protect them from hacking.  This presents major problems in practice, because ID/password discipline is inconvenient to apply, so people tend to pick easy (and hackable) passwords, write them on post-it notes, and so forth.  Another problem is the one we’ve retained from the early days: software is often provided with a default identity (“admin”) and password (blank), and users can forget to delete that identity, rendering the interface open to anyone who knows the software.

In modern networks and applications, this problem is exacerbated by “management creep”.  I’ve seen some examples of this in NFV, where VNF management is extended for convenience to relate to the management of the resources the VNF is hosted on, which then renders those resources hackable by those with access to VNF management.

Then there’s monitoring.  You can’t monitor something unless the monitoring tool is installed where it can access the target, which means that monitoring tools have an almost-automatic back door into many things a user would never willingly (or knowingly) expose.  As in my earlier address-space example, the monitoring element can be contaminated and influence not only the specific thing it’s supposed to probe, but perhaps other things within the same address space.  This kind of attack would defeat address-space-based partitioning to reduce the attack surface.

The net here, sadly, is that most organizations and most developers don’t think enough about security.  Network operations and IT operations have the same problem, but with the specific problem being “tunnel vision”, the focus on threats to a public API or interface rather than to internal interfaces between components that, while not intended to be public, may still be exposed.

The development people have their own tunnel-vision problem too, which is that while many would realize that introducing an infected component into an application during development is almost self-hacking in action, they don’t take steps to prevent it.  A number of enterprises told me that their first reaction to the recent massive hack was to review their own development pipeline to ensure that malware couldn’t be introduced.  Every one of them said that their initial review showed that it was easy for a developer to contaminate code, and “possible” for an outsider to access a repository.

Do you want a comforting solution to all of this, a tool or practice that will make you feel safe again?  It’s not here, nor in my view is it even possible.  There is no one answer.  Zero-trust security is a helpful step.  API security and step-by-step workflow authentication would help too.  Both will work until the next organized attempt to breach comes along…and then maybe they won’t.  The only answer is to take every possible precaution to defend, and every possible action to audit for attempts to do something.  Unusual access patterns, like failures to authenticate, or unusual traffic patterns, could be an indication of a breach in the making, and if recognized could prevent it from becoming a reality.
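In the spirit of auditing for unusual access patterns, here’s a toy sketch: count authentication failures per source over a sliding window and flag anything that crosses a threshold.  The threshold, window, and event format are all invented for illustration; a real deployment would feed this from its actual audit trail and alerting system.

```python
from collections import defaultdict, deque
import time

WINDOW_SECONDS = 300      # look at the last five minutes
FAILURE_THRESHOLD = 5     # more than this many failures from one source is suspicious

failures = defaultdict(deque)   # source -> timestamps of recent auth failures

def record_auth_event(source, success, now=None):
    """Feed each authentication attempt in; returns True if the source looks suspicious."""
    now = now if now is not None else time.time()
    if success:
        return False
    window = failures[source]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()                 # drop failures outside the window
    if len(window) > FAILURE_THRESHOLD:
        print(f"ALERT: {len(window)} auth failures from {source} in {WINDOW_SECONDS}s")
        return True
    return False

# Simulated burst of failures from one management-interface client
for i in range(7):
    record_auth_event("10.0.0.99", success=False, now=1000 + i)
```

It won’t catch a patient attacker, which is exactly the point of the paragraph above: detection like this improves the odds, it doesn’t make you immune.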

This is perhaps a legitimate place for AI to come in, if not “to the rescue” in an absolute sense, at least in the sense of mitigating risk to the point where you could play the odds with more confidence.  Because, make no mistake, security is a game of improving the odds more than one of absolute solutions.  Play smart, and you do better, but you can never relax.