Y2K

The major reason I use social media is the chance to learn new things from people whose messages don’t fit in the soundbites broadcast by the mainstream media. This is how I get to check my assumptions, maybe even reverse a bad take. This article from John Quiggin is a great example of what I want to see. Until now I’ve assumed that large language models will indeed lead to a big increase in electricity usage, potentially reversing the gains from decarbonisation of the grid. But this assumption lacks hard facts to back it up. It’s a misconception ripe for correction.

So it’s a shame when I drain my glass and find this lurking at the bottom:

The Y2K panic, supposedly based on the shortening of digits in dates used in computers, was obviously false (if it had been true, we would have seen widespread failures well before 1 January 2000).

But the appeal of the story was irresistible, at least in the English-speaking world, and billions of dollars were spent on problems that could have been dealt with using a “fix on failure” approach.

Woah. Dude.

I get this a fair bit from outlets like the BBC. I nod along vaguely assenting to their assertions and assumptions, since I’m no expert. Then they publish a piece about something I do know about, and it’s transparently outrageous bollocks. And then I wonder how much bullshit I’ve swallowed from these people. I’m not so used to seeing it from bloggers, who mostly stick to their own field. So it’s a bit of a kick in the teeth here.

Fine. Let’s talk about Y2K.

Baked Beans

Let’s get the first thing out of the way quickly. The people who spent New Year’s Eve 2000 cowering in bunkers were idiots.

I saw in the new year at the Harbour Bridge with half a bottle of cheap bubbly inside me, and kinda forgot that it was Y2K until the next day. I can’t remember thinking about it much in the months beforehand, but I suppose I was pretty confident that the vast majority of important systems had been checked, and the problems would be minor. I guess that’s mostly from my own experience of actually working on the problem. Of course I was aware that a minor cottage industry had grown up deliberately conflating Christian millennialism with Y2K, but that was a joke and mostly happening in America.

It is possible that most of the “Y2K was a hoax” stuff, including the example that set this off, is directed at those people. It’s also possible that this attitude was more pervasive and important than I realised at the time, and it really is imperative to keep those ideas squashed forever. OK, I guess?

That doesn’t go far towards making the quote at the top of the page defensible.

Grift

Was the Y2K industry infested with grifters leveraging fear and ignorance for fat IT contracts to do nothing at all, because they knew nothing was going to happen?

Hell yes. Obviously.

Here in Germany, chickens are coming home to roost. When the Covid crisis hit, the government of the day was caught napping. The country needed personal protective equipment, especially face masks. And there were none. Turns out they’d been quietly outsourcing that to China for decades, and now the government wasn’t in control of its own health system. So the health minister panicked, and started handing out contracts for face masks at absurd prices to anyone who showed up. Once the acute phase of the crisis was over, they couldn’t stomach flushing that money down the drain, so they refused to pay. Now the suppliers are suing.

This is what happens. If you don’t plan ahead, you panic. Whatever you do, it’ll be more expensive than it would have been if you had planned carefully. People will take unfair advantage.

But it’s a bit of a leap to go from there to saying that masks don’t work. That’s a leap many people can comfortably make, but here in the real world, I hope we can agree that indeed the world suddenly needed lots of masks, and should have been willing to pay a premium for them given the circumstances. How much? EUR 5.40 does seem a bit steep. You can buy a box of 20 for that on Amazon. But at the time, obviously you could not. A friend in New Zealand who had two spare offered in all seriousness to post them to me (I declined). Jens Spahn does seem to have messed that one up, but he wasn’t off by an order of magnitude.

Y2K was not nearly such a serious crisis, and we had far more warning. On the other hand, the issues were far more difficult to understand. How much grift? Hard to say. Maybe someone can come up with a figure. My guess though: not more than half, and not a shocking proportion.

Systems

But all of the above is just throat-clearing. Let’s get to the point.

Y2K was an issue with systemic coordinated risk. The threat was that we could end up with a complex emergent failure. Or, to use the industry term, a clusterfuck.

You have a chain of 20 systems working together in elegant harmony. Suddenly, one fails. Oh no! The whole chain collapses and everyone’s angry and upset. It’s your job to fix the failing system. So you twiddle some settings, fix a bug. The light switches back on, and everything’s fine. Hooray! Have a beer.

Until one day, two fail at the exact same time. So you pick one, twiddle some settings, fix a bug. Did it work? Who knows? The light’s still off, but it would be, wouldn’t it? The second system is still broken. OK, maybe there are signs of life now, but for all you know you could have made it worse. You’ve got to rip the pieces out, isolate them, bring in the test equipment to confirm that each one individually behaves according to spec. You need different tools. You likely have to build those tools. So you get down to work, and while that’s happening, a third system’s input buffer is filling up. What do you suppose happens when it gets full?
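Here’s a toy sketch of that chain (all of it invented, obviously): the only observable is the end-to-end light, and the moment there’s a second failure, that light stops telling you anything.

```python
# Toy model: a chain of 20 systems. The only signal you get is the
# end-to-end "light": did anything make it through the whole chain?
chain = {i: True for i in range(20)}  # True = healthy

def light_is_on(chain):
    return all(chain.values())  # the chain works only if every link works

# One failure: fix it, the light comes back on, have a beer.
chain[7] = False
assert not light_is_on(chain)
chain[7] = True
assert light_is_on(chain)

# Two simultaneous failures: make a correct fix, and learn nothing.
chain[7] = False
chain[13] = False
chain[7] = True             # this fix is right...
print(light_is_on(chain))   # ...but the light is still off: False
```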

Two weeks ago, Crowdstrike wiped out what turned out to be a critical fraction of Windows computers. They aren’t giving much away as to what happened, but the picture is fairly clear anyway. The system that built content updates did the wrong thing, for whatever reason (for example, building for a new format that wasn’t actually deployed yet). And the automated test system, designed to catch mistakes like that, was also broken. And that’s the story. Two failures. Either one of those on its own is a minor irritation for one engineer. Both together is billions of dollars of damage.

It’s now more than two weeks later and they have proudly announced that 99% of systems are back online. Wait. 99%? You have to boot into safe mode and delete one file. That information was all over the world within hours. It’s hard to imagine a much simpler fix. How does that take two weeks? Well, because the computers are in a dusty cupboard and the person with the key is on holiday, and their flight home was mysteriously cancelled. Or whatever. It always seems to turn out that simple common sense can’t, in fact, see you through. For reasons.

But that’s the easy case. What happens when it’s not easy? Here’s an incident report from AWS. It’s a classic cascade. It started with an engineer(ing team) dialing up a setting, as planned, as designed. But AWS doesn’t control what its clients do. The clients detected something was different, and started behaving differently. That caused a cascade of inexplicable load in systems exquisitely designed to minimise costs while maintaining capacity for predictable contingencies. Oops, sometimes things are unpredictable. Then, AWS’s monitoring systems fell over. Engineers had to make do with what they had, and that led them to misdiagnose the problem. They then wasted time fixing the wrong thing.

Timing

That’s two clusterfuck case studies. But ultimately, neither of them was that big a deal. Crowdstrike struck the whole world, but only one (systemically important, sure) operating system, one specific way, with a single simple fix. The AWS outage struck one data centre at one (systemically important, sure) company. Crowdstrike was the biggest outage in history, but the highest estimate I’ve seen of the cost is 15 billion. Meh.

That’s where Y2K was different. The bug was based on the date. That meant that when things happened, they would happen at the exact same time, all over the world. All systems, from all manufacturers, were all at risk. It’s the perfect setup for a clusterfuck. A cascading sequence of outages, none of them on their own anything that couldn’t be fixed by jiggling a cable and rebooting, but taken together unfixable for months. You don’t have to imagine it. We get tastes of this all the time. But the scale here is a whole different category.

We often do get simultaneous failures from disconnected machines, but normally only where they share a common component, such as GPS. Date bugs are different: they hit independent implementations. You do get chains of dependent computers all based on the same platform, sure. But doesn’t it make more sense to specialise? 20 interdependent specialised components, from a variety of different suppliers. Normally that limits the potential for clusterfucks. But when all of them fail because of the date, they don’t need to have anything in common.
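To put rough, invented numbers on that: say each of the 20 systems has a one-in-a-thousand chance of failing on any given day.

```python
# All numbers invented, purely illustrative.
p, n = 1e-3, 20   # per-system daily failure probability, chain length

# Independent failures: chance of two or more on the SAME day.
p_none = (1 - p) ** n
p_one = n * p * (1 - p) ** (n - 1)
print(f"{1 - p_none - p_one:.1e}")   # ~1.9e-04: rare enough to fix on failure

# A date bug is a shared trigger: if k of the 20 carry one, then at
# midnight exactly those k fail together, with probability 1. The
# independence that normally protects the chain simply vanishes.
```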

Disease

The right way to think about these failures is epidemiological. An isolated failure here and there is no big deal. Local solutions to local problems. An outbreak burns itself out. But as population size increases, and contacts between members of that population become more frequent, the basic reproduction number rises. At some point, a mild concern becomes a full-blown crisis.
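If you want the back-of-the-envelope version (every number below is made up), the standard decomposition of the basic reproduction number shows how contact density alone tips a nuisance into a crisis:

```python
# R0 = transmission probability per contact x contacts per day x days
# spent "infectious". Every number below is made up.
def r0(p_transmit, contacts_per_day, days_infectious):
    return p_transmit * contacts_per_day * days_infectious

for contacts in (2, 5, 20):
    print(contacts, round(r0(0.1, contacts, 3), 2))
# 2  -> 0.6 : below 1, outbreaks burn themselves out
# 5  -> 1.5 : above 1, sustained spread
# 20 -> 6.0 : full-blown crisis
```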

The information technology world has taken the idea of division of labour to extremes even experts struggle to comprehend. The industry never met a class of independent systems it couldn’t consolidate under one roof, and then split that into a pipeline of interdependent systems to shave a fraction off the hosting bill. Everything depends on everything else. You probably didn’t think you were using these 11 lines of code, and you probably didn’t notice when they went away, but that’s only because of the diving saves from hero sysadmins around the world.

Every system whose failure can negatively affect another system is a potential transmission vector for disease. And after decades of consolidation due to the “consumer welfare” test the whole world is basically packed shoulder-to-shoulder in a single elevator.

Jabs

But that analogy also gives us the good news. You can vaccinate against disease. And the great thing about vaccination is, you can get close to eliminating 100% of the problem while falling far short of vaccinating 100% of the population. It does work best if you can target the “superspreaders”, but really, just getting whoever you can reach will do the job.
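The textbook version of that claim is the herd immunity threshold, 1 - 1/R0. Continuing with the same made-up numbers:

```python
# Herd immunity threshold: immunise a fraction 1 - 1/R0 of the
# population and the effective reproduction number drops below 1.
def herd_immunity_threshold(r0):
    return 1 - 1 / r0

for r0 in (1.5, 3.0, 6.0):
    print(r0, f"{herd_immunity_threshold(r0):.0%}")
# 1.5 -> 33%, 3.0 -> 67%, 6.0 -> 83%. Even in the worst case you stop
# the epidemic while leaving one person in six unvaccinated.
```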

That’s what the Y2K industry set about doing. Systematically going through systems one by one to check if they would work. Not everything, of course. No-one checked the cash register at your local fish and chip shop. That doesn’t matter. The top priority was the programming languages and libraries that underpin everything. Then the operating systems. Systemically important applications such as databases. That sort of thing. No-one sane rolls their own database. You buy it from a gigantic vendor. Fix the problem at the source, and cut out a huge source of risk at a stroke.

Most of the actual work that got done was practically invisible. It was done by huge corporations, internally or under non-disclosure agreements.

That’s most of the story of why Y2K was a damp squib. It was fixed, not by every mom and pop software firm, but by the big players, and almost all of those in the USA. Ultimately, an enormous amount of work was done, the general public didn’t see any of it happening, and it was enough to give the whole world something analogous to herd immunity.

Risk

I’ve never yet seen a vaccination campaign where the organisers said “welp, that’s 60%, time to take the win everyone”. You vaccinate as much as you can. And Y2K didn’t stop at the big players. Everyone was told to do their bit. Was this dumb?

Maybe. This is the part where you get to argue that Russia did nothing and was fine. Who buys their software from Russia? No-one systemically important to the global economy, that’s who. If the whole of Russia doesn’t have to drop everything to fix Y2K, why should my local primary school?

In retrospect, it definitely should not. But in prospect? John Quiggin suggested two articles from the time where he “got it right” (one here paywalled, not convinced I have the right to link to the dropbox versions). Both of them are exemplars of linear extrapolation. And that is definitely not the right mindset for analysing global information technology outages.

I was pretty young at the time, a freshly minted CompSci graduate. But my impression is that this was the early days of analysing systemic risk in IT. The Internet explosion had happened only a few years earlier. The mania for hooking everything up to centralised services was, partly due to the language barrier, mostly confined to the Anglosphere. The idea of a website crashing and bringing down the economy was still novel. These days we could probably do a better job of analysing the risk. At the time? Blind leading the blind. The best we could do was the best we could do.

If you want to convince me that you really did see how it would play out in advance, and knew exactly what needed to be done, you’d better do an absolutely stellar job of showing off your in-depth technical know-how when making your case. We shall return to this topic.

First-Hand

I graduated in April 1999 and immediately flew to Sydney to start my first job. Wouldn’t you know it, I was working on a Y2K project. What a coincidence.

As a right-up-against-the-coalface frontline worker, hands deep in the guts of the Y2K crisis, only me standing between the innocent unknowing public and the collapse of civilisation, what were my impressions?

I actually did see a Y2K bug once! I can’t remember much about it. Just that my reaction was “haha, check it out, the Y2K bug is real!” And also, it could not possibly have mattered.

The job I was working on had been thrashed out between a major company which we shall call by the cryptic pseudonym “Telstra” and my own company which was, uh, not major, many months before I joined. My company had earlier supplied a thing you’ve never heard of that sat in a rack minding its own business, and someone’s checklist said it had to be Y2K compliant. Everything was decades obsolete and unfixable. And anyway, the software we’d supplied back then was unfit for purpose, it needed to be reworked. So, rip and replace, and write some code at the same time. Nice juicy contract.

This article, which is not really that bad, makes this point. Most of the spending on Y2K didn’t truly represent an opportunity cost. It was stuff that needed to be done anyway. Perhaps the price was a little inflated, perhaps we could have squeezed a few more years out of the old system, but this is working at the margins. The article acknowledges this, because this was widely understood at all levels. The old system used coaxial network cable, not modern cat5 cable. The data centre had to have both, and it was a rat’s nest. Cleaning that up would have put a smile on some network engineer’s face that just can’t be priced. That was the “why?” The “why now?” was Y2K, and that’s how it was billed. The true spending on “Y2K” as such was a fraction of the headline figure.

It was never my job to actually fix the Y2K bug. And yet I saw it. “Supposedly based” my arse. That’s the point I want to make here.

Prospect

Time to grapple with the text:

if it had been true, we would have seen widespread failures well before 1 January 2000

The simple answer is yes, and we did. Maybe you didn’t see any failures, mate. I also didn’t inform you about the critical performance regression I introduced into our flagship product and then delivered to our most important customer last week. Instead I apologised to exactly one person and fixed it. We shall never speak of this again. A handful of failures inevitably did leak into the outside world. Wikipedia helpfully breaks these down separately.

Ah, but then, how many failures did we see? Not a lot, right? Not “widespread”? None of this systemic coordinated clusterfuck you claim exists?

Take my epidemiological model, but pretend linear extrapolation means something somehow. We’re using the rate of failures detected prior to 2000-01-01 as a metric to predict how many failures we’ll have on 2000-01-01. Imagine we’ll later feed that prediction into our non-linear clusterfuck detector, but I’ll cheerfully ignore that here.

To model that, we need something else. Some code deals only with dates in the past and present. Logging, for example. Other code deals with dates in the future, or both the past and the future. Assume both kinds of code are vulnerable. We need the ratio between the two to have any hope of using one to predict the other. For each line of code that deals with the past, how many lines of code deal with the future?

Answer: I HAVE NO FUCKING IDEA. And you certainly don’t either.

I know that the former is bigger than the latter. That’s obvious if you’ve spent any time dealing with these systems. Is it just one order of magnitude, or three? Could not tell you.
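Just to show what that uncertainty does to the prediction, here’s the linear extrapolation spelled out, with every figure invented, because invented figures are the only kind anyone had:

```python
# Failures seen BEFORE the date can only come from future-facing code
# (expiry dates, projections). Past-and-present-facing code (logging,
# timestamps) only fails on the day itself. So the prediction hinges
# on a ratio nobody had. Every figure below is invented.
observed_in_1999 = 50   # hypothetical quietly-fixed pre-2000 failures

for ratio in (10, 100, 1000):   # one order of magnitude, or three?
    print(f"ratio {ratio:>4}: ~{observed_in_1999 * ratio:,} failures on the day")
# ratio   10: ~500
# ratio  100: ~5,000
# ratio 1000: ~50,000
# The prediction spans three orders of magnitude before it ever
# reaches the non-linear clusterfuck detector.
```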

Bear in mind that not all code is created equal. The worst vulnerabilities are in the obscure embedded systems, halfway up a mobile phone mast or down in an oil well. Are those systems calculating compound interest over the next financial year? They are not. Many of them, however, are logging.

But we did see errors. They were quietly fixed. It should be obvious that the further ahead of the date they happened, the less urgent it was to fix them. Right? And that means the earlier they surfaced, the less likely we all were to hear about them. And that means, there was predictably not enough publicly-available information to predict the risk until it was too late to do anything about it. ANZ shuts down and no-one has any money, it’s a front-page story. ANZ quietly fixes a bug that no-one noticed, and, uh, no-one notices. You see the problem here?

The only thing I do know for sure, is that I definitely saw the bug, even though I wasn’t looking for it.

Cosmetic

But I mentioned the Wikipedia page. I’d be the first to admit that those are some slim pickings for my case. It was only two nuclear reactors, and neither went into meltdown. These things happen. Most of the problems were cosmetic.

About that.

Remember that AWS outage? Actually, said the engineer, all the systems were behaving exactly as designed. (HR tells me I’m not allowed to slap people who say that any more. Some kind of quota.) The one exception? The performance dashboard they were using to monitor everything. Not mission critical. Cosmetic.

Except, the effect of that was to send the engineers down a blind alley when they tried to fix the real problem, wasting time, and multiplying the damage by an order of magnitude.

Remember my strawman chain of 20 systems with 2 failures? You can tell that you fixed one, because the monitoring dashboard shows it’s producing good data, even if no-one’s consuming that data.

Oh.

Above I discussed the difference between code that sometimes deals with the future, and code that exclusively deals with the past. Guess which kind we’re discussing here.

Nope, you can’t dismiss failures as “just cosmetic”. Failures are failures. Small or large, there’s a place for everyone in the clusterfuck.

Fix On Failure

That job I had in 1999, I was given a pager. I was conflicted. On the one hand, it was kinda cool. Like, I’m important. People might need me at short notice. Doctor quick! We found a compatible liver! That’s me.

On the other hand, people could page me.

On the other other hand, you could page me all you like on 2000-01-01, it wouldn’t’a done you any good. That day was a write-off, at least judged by engineering standards.

The idea that we should YOLO our way through Y2K and then the engineers would just “fix on failure”? And you would have taken Tuesday as a bonus day because the computers were down? Allow me to personally tell you to fuck all the way off. No.

And the reason that doesn’t work, is not that I wanted to nurse my hangover in peace. It’s that I wasn’t really in a position to get you back to work on the Wednesday either. And in fact, with the best will in the world, cascading failures with all our dashboards out of action, we’re talking months of “fixing”, the vast majority of which would be spent only on getting in each other’s way.

Two weeks, and we’re still not over Crowdstrike. Go ahead. Linearly extrapolate, if that’s how you roll.

Litigation

We now come to the second, and far worse, article. The core insight of “Y2K’s nasty legal side effects” is that the real problem here is Americans and their litigious culture.

Reasonable estimates of the cost of the Crowdstrike disaster, which, let’s be clear here, is undeniably 100% Crowdstrike’s fault, run to between 5 and 15 billion dollars. Delta Air Lines is claiming half a billion in damages. Those Americans! They’ll sue because the coffee is too hot!

No, the problem here is that large corporations employ tricksy lawyers to evade responsibility for the externalities they inflict on the rest of us. How can anyone reasonably expect Crowdstrike to invest appropriate resources in avoiding disasters if they never have to pay full price when it happens?

But that’s Crowdstrike. It’s rare that you can attribute 100% of the blame to a single agent. How do you allocate blame for a clusterfuck, where each act of deliberate negligence on its own would have caused only trivial disruption? Because that’s the more common case. It is, and I want to stress this, not fair to leave that up to the victims, and it’s doubly unfair to leave it to me. It doesn’t have to be that way, and it’s not entirely that way. As Schneier says:

Courts can adjudicate these complex liability issues, and have figured this thing out in other areas. Automobile accidents involve multiple drivers, multiple cars, road design, weather conditions, and so on. Accidental restaurant poisonings involve suppliers, cooks, refrigeration, sanitary conditions, and so on. We don’t let the fact that no restaurant can possibly fix all of the food-safety vulnerabilities lead us to the conclusion that restaurants shouldn’t be responsible for any food-safety vulnerabilities, yet I hear that line of reasoning regarding software vulnerabilities all of the time.

Certification

Tying this together, the key to a good vaccination program is that people are either vaccinated or they aren’t. A dirty secret of vaccination programs is that a good percentage of the jabs given to people don’t actually work. But that’s not the point. The point is, you’ve got a list, and you can cross people off the list when they’ve been stuck. You don’t have to get everyone. That’s why the programs work.

Y2K was one of the first big attempts to certify the properties of software. (The other is the ISO 9000 family, which I can proudly cross my heart and swear I know nothing about, and fully intend to maintain that purity thanks.) In many ways, this was stupid. There wasn’t any actual big certification organisation to hand out fancy stamps. People just sorta claimed to be Y2K compliant. Nevertheless, this was a sea change.

Today, every line of code I write is reviewed by two other engineers before it goes into main. My submissions are cryptographically signed with a private key stored only on my own laptop. And these things aren’t just policies: they’re inscribed in the contracts with our customers. In 2000 I changed jobs to a proper company, where code reviews were already standard. But that was an email and “LGTM”. The change in the last 25 years, and the speed with which we all got used to it, is breathtaking.

Back in 1999, as we prised open ancient cupboards to see what kind of device lurked within, we were so, so far away from the modern world. You were grateful if there was even version control (CVS!). Dragging that into the 21st century was always going to be an expensive undertaking. But I cannot stress enough how much of the modern world is built on top of the foundations we laid in 1999. Feel free to take that how you will.

Systemic

If we’re talking about tail risk, the word “systemic” is a hint. The global financial crisis is a terrific example of correlated systemic risk. Why should “subprime loans”, a thing of which I had never heard, affect me? I never took out a loan in my life. Could we not have simply fixed that on failure?

The reason why not is a nexus of technology, timing, and psychology. And Y2K mirrors all of those factors. The difference is, we got our shit together in plenty of time and the sky didn’t fall. The industry full of economists, not so much.

What could have saved us from the Global Financial Crisis, is regulation. Which is to say, a variant of certification. Can you imagine how that would have gone? Systematically going through every subprime loan one by one, checking whether it really was a problem, unwinding it and rehousing people if so. The expense! The opportunity cost!

Bet you’re glad we didn’t repeat the Y2K mistake on that one.

Infinite Loop

I promised I’d come back to this. It’s a cheap shot, but:

Why, then, are we spending a fortune on eradicating the millennium bug, and doing nothing systematic about the infinite loop?

Part of this is that John Quiggin, or at least 1998 John Quiggin, doesn’t know anything about computers and can hardly be expected to know anything about computers. But, still. I know infinite loops. Here’s one. Thing about infinite loops is, you know they’re happening because your laptop fan turns on. That’s because they are busy doing nothing in a tight loop. That thing where the beachball just spins and nothing happens? That’s deadlock. I see real infinite loops all the time, because I write them. Those rarely go into production, because they’re so literally screamingly obvious. (Last week’s performance regression was not infinite, and also we agreed never to talk about that again.)
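For the record, here’s the difference in miniature, with a timeout bolted onto the deadlock so the demo actually terminates:

```python
import threading, time

# The infinite loop: busy doing nothing, 100% CPU, fan comes on.
#     while True: pass        # screamingly obvious; don't run this

# The deadlock: two threads each holding one lock, politely waiting
# forever for the other. Zero CPU. Spinning beachball.
a, b = threading.Lock(), threading.Lock()

def worker(name, first, second):
    with first:
        time.sleep(0.1)                # let the other thread grab its lock
        if second.acquire(timeout=1):  # real deadlocks have no timeout
            second.release()
        else:
            print(f"{name}: deadlocked, gave up waiting")

t1 = threading.Thread(target=worker, args=("t1", a, b))
t2 = threading.Thread(target=worker, args=("t2", b, a))
t1.start(); t2.start(); t1.join(); t2.join()
```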

This is a lesson.

We are systematically doing something about the infinite loop bug. We introduced automated testing and continuous integration. Performance tests act as gateways that trigger alerts before something goes into production. (My bug was a custom customer release, not in production, and how are you still talking about that?) Unit tests cover behaviour not just in normal cases, but far beyond the usual range of operation. Static analysis spots bugs before they happen, not to mention commit messages that start with non-imperative verbs I just don’t even. These systems are, oh my god believe me when I tell you this, not cheap. They are spotting bugs 24/7 and ensuring they get fixed before they hit the New York Times. No economist need ever know. This is as it should be.
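For a flavour of what those gateways look like (the function and the budget here are invented; the shape is the point), a performance test is often nothing fancier than this:

```python
import time

# A crude performance gate of the kind CI runs on every change.
def compute():
    return sum(i * i for i in range(100_000))

def test_compute_stays_within_budget():
    start = time.perf_counter()
    compute()
    elapsed = time.perf_counter() - start
    # Blow the budget and the build fails before anything ships.
    assert elapsed < 0.05, f"performance regression: {elapsed:.3f}s"

test_compute_stays_within_budget()   # CI would run this via a test runner
```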

2038

Even at the time, we enjoyed pointing out to each other that we’d have to do it all again in 2038.

But the truth is, we won’t. We’ve learned the lessons. You can make snide remarks about easy contracts all you like; in practice, none of us who were there ever want to waste our time on anything like that again. And that alone should give you some sense of how much of that money was truly “wasted”.

In the 70s, storing years as two digits instead of four would get you a pat on the back for improving product quality. We don’t do that any more. More than a decade out and time_t has been 64 bits for years.
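Both bugs fit in a few lines. A sketch of each:

```python
from datetime import datetime, timezone
import struct

# The Y2K bug in miniature: with two-digit years, 2000 sorts before 1999.
print(sorted(["99", "00", "38"]))   # ['00', '38', '99']

# The 2038 flavour: one second past 03:14:07 UTC on 19 January 2038 no
# longer fits in a signed 32-bit integer, which is what time_t used to be.
t = int(datetime(2038, 1, 19, 3, 14, 8, tzinfo=timezone.utc).timestamp())
print(t)                            # 2147483648 == 2**31: one too many
try:
    struct.pack("<i", t)            # "<i" = little-endian signed 32-bit
except struct.error as err:
    print(err)                      # value doesn't fit in 32 bits
```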

But more than this, the vast infrastructure of code quality tools makes it practically impossible for us to release software with bugs like that, let alone code that still stores time in a 32-bit signed integer.

Obviously

The Y2K panic, supposedly based on the shortening of digits in dates used in computers, was obviously false

I’ll concede that the Wikipedia page has decided to teach the controversy. That’s prudent. But still. For something that’s “obviously false” to an economist, the consensus sure did decide that the “true” side should be presented first.

It’s not fair to get too heavy-handed here. But I will anyway.

I have not once seen any single shred of evidence before my eyes that “global warming” is real. Sometimes it’s hot, sometimes it’s cold. Sometimes it snows, mostly it doesn’t. I hear about wildfires and floods now and then, far away from me, but that’s been true my whole life.

I’ll tell you what evidence I have seen: my energy bills have definitely gone up.

On the common-sense evidence of my own eyes, global warming is a hoax by people who want to rip me off. Yes, I know this or that self-proclaimed “expert” has some very clever and impenetrable maths that supposedly proves me wrong. But find me a single one who isn’t also getting paid for this “research”. You can’t, right?

Obviously.

It’s Fine

I’m actually on a big optimism kick right now, for two reasons. First, Harris can win. Second, population predictions have been revised sharply downwards. These are minor but significant reasons to be optimistic about our prospects for the climate crisis. Renewables are much cheaper than fossil fuels and the rollout is happening stunningly fast. The nightly news is trying to scare me with footage of thousands of EVs rolling onto giant ships at Chinese ports. Of a wide range of potential outcomes, “survivable” is the most likely bucket.

But that means 30 years from now, it’ll be “all that fuss and nothing happened!” At some point, even baked beans are easier to swallow.
