The Cache God, pt 1.

This is some short fiction that was bumping around in my head recently. I think I know where it's going. It is still somewhat raw - I need to do some editing and stuff, but for the time being I just want to get some content out there.

"The problem is, I really do not know how it works." Paul took a long slug of beer and grimaced. "I mean, it does work, which is great. Even better, it seems to work without flaw. It is just that I have no idea why. By rights it should either just lock up or throw null pointer exceptions half the time."

We're drinking. There's this bar at the end of the pier. Its real name is unimportant, because a long time ago (that is, six months, which seems like forever), Steve-0 christened it the Floating Point. Without meaning to, Steve-0 paved the way here, because with a nickname like that, it is easy enough for us to work references in where the boss can't quite grok. Paul says something about fixed precision numerical representations in the morning stand-up, and I know where we'll be having our late, liquid, lunch.

Anyway, the moral of the story here is that semiotics is, (wait, are?) awesome. I can say something out loud and two different people will glean two different meanings. Of course, calling it 'semiotics' right there is another marker, right? You know more about what to expect from me and my story this way. I could have avoided this paragraph entirely - some might suggest that I should. This story is different for you, though.

So, there I am. The time is four in the afternoon, and happy hour just started. It is okay, we have been here for an hour already and we should be happy. I, at least, am. Or was. It is Friday at the end of a long week, and I have gotten through another day on the team without moving our entire product below the MVP line and shouting, "None of this is viable, motherfuckers!" The people who write user stories for us think we can engineer up magic. This week, I had managed to convince the designers that while voice input is interesting, we can't base core functionality on use of voice commands quite yet. Can you imagine a subway car full of people shouting "approve transaction!" at their phone's wallet application when the gate debit pings their cards?

Paul, on the other hand, does not seem all too happy. I gesture at the bartender, signaling with an efficient nod and a circle drawn in the air with my forefinger that we want the same again.

while (!user.drunk) {  

"I don't get it Paul. You had a stellar week. On Monday, I though you were doomed - you got stuck with optimizing data throughput on the mobile stack, which is probably as much fun and as productive as eating glue."

"Paste is much better. Did you know that they stopped selling school paste? You can't buy it anymore, just school glue or glue sticks. Not the same. Anyway, people are swapping recipes online, making DIY paste to eat and remember their childhood."

"Spare me the nostalgia, Paul. As I was saying, you hit our entire network use with some kind of hammer. It is down like forty percent in beta. Do you know what our bandwidth bills are? I mean, this is just mobile, and beta, but if it scales ..."

"It will scale."

"There you go. When it scales, you are going to be famous. You have singlehandedly cut our operating costs by a large amount and improved our responsiveness online. In places like India, where people have to work a day at minimum wage to cover a megabyte of 3G bandwidth, the new system may end up making a noticeable improvement in the standard of living."

"Do not get carried away. I know, we are making the world a better place through more efficient network methodologies. I get it. It will help a lot with desktop as well, for all people use that anymore, and the server infrastructure guys were looking at it too, it may end up help with data replication to our edge caches. This is not just for high-error-rate rural wireless. This shit will push corporate over the moon, once everybody realizes what I have build. I am just glad I work where I do; if a patent comes out of it I will get a fair share."

Paul tosses off the rest of his beer and stares morosely at his empty glass. Burps. We're approaching the end of that loop, and side effects are accumulating. Life is not functional in that way.

"The problem is pretty simple. I am not sure you can patent God."

"What? How many beers have you had? You know about the no-bullshit rule."

"Okay, okay. Look at it this way then. I do not feel optimistic about the code from a patent perspective, because while you can copyright specific code you cannot patent it. You can only patent the idea and its implementation, and I have no idea. No idea whatsoever how it works."

"How do you mean? You wrote it, didn't you? You didn't outsource it or something?" We are always joking about taking the money we make and hiring people in another city or country to do our work at half our pay, and pocket the difference. Privately, I believe that people do that all the time, and corporate notices. These are the people who get promoted to manager. Paul and I love our work, no matter how loudly our complaints at the Point get.

"No, I wrote it all. It is just, well. XKCD-323." Paul trails off, looking at his empty beer. He raises his eyes and locks gaze with the bartender. Brings both hands together, cupped. Paul is hungry and just ordered nachos.

A fresh prisoner arrives in jail. At lunch, he is sitting at a table with other people in prison uniform. One stands up and says, "Number fifteen." He sits down to laughter and applause. A bit later, someone else just shouts, "Number thirty-seven!" to great laughter. The prisoner asks the person he is sitting next to, "What is going on?"

The other prisoner explains that they have been locked up together for so long, they don't bother with jokes anymore, since they've all been told hundreds of times. Instead, they are numbered. So you can tell a joke by picking the number.

Interested in making new friends at the table, the new prisoner stands up and says, "Number eight!" Nobody laughs. He sits.

His neighbor tells him his delivery needs work.

We are not prisoners, but we do share the context. There is some argument about what constitutes Peak Munroe in XKCD, mainly centering on whether you think his later large-form storytelling was art or wankery. It is generally accepted that 303-398 are the central canon that all self-respecting nerds, geeks, scientists, engineers, and mathematicians should have a command of if they want to communicate efficiently. 323 just is not one I remember. Perhaps it is the proximity to 327: the one where we met young Robert'); drop table students; -- aka Bobby Tables. 305 is where we learned about Rule 34, which you might have heard of even if you have not heard of XKCD-305, and the double indirection appeals to me a lot.

The bar back arrives with a big shallow bowl filled with potato chips that have been topped with a spicy meat and cheese mixture. We had originally assumed the meat was beef, but it is in fact lamb. This is an upscale bar, after all.

Paul grabs a chip and eats it with relish. Enjoyment, that is.

"323 is the Ballmer Peak. I read some actual peer-reviewed research that indicated this might be true. It feels intuitively true as well: the right amount of alcohol can improve creativity. I was stuck in callback hell already, so I made a new branch in git, got a sixer, and worked all night. The sixer ran out by eight pm, but GrubHub delivers beer now."

"So you wake up, keyboard impression on your cheek, and find a mess in your editor?"

"Actually no. Well, it is not a mess in the traditional sense. My variables are named, indentation is two spaces as is natural law, and I followed coding conventions all throughout. On the surface, the code looks fine, but it really just does not makes sense when you look at it. Some scans like the work of an outsourcer, you know, half copypasta from Stack Overflow. The other half, I don't know. It runs. It is just a black box. There is no test code, which is really the heart of the problem. I ended up slapping some integration level tests on it, input-output stuff, but have no clue how to do unit tests on it. It could fail every single unit test, but the errors all end up stepping on each other and the net result is success. Like I said, it should not run, but God help me, it runs."

Paul is starting to sober up. The lamb has that effect. "But what does it do," I ask, "if it seems that the code is all shite?" I love that word. Shite shite shite. So much more refined than shit.

"Man, I love these chips. It is hard to say if they really are nachos are not. I mean, do nachos require nacho chips? You would think so. But here is some phenomenon, some amazing new thing. It clearly draws on poutine as well, with the thick gravy-like meat sauce and cheese curds instead of melted jack cheese, or worse, velveeta cheese food spread. But the gravy is not beef-based. It is lamb, and has some great, what, Moroccan spices? So here we have a bowl filled with potato chips, with a northern African spice take on a Canadian / Mexican fusion recipe. Only here, in the great melting pot of the world."

"It's a salad now."

"What? No, these are chips."

"No, they don't teach melting pot in schools anymore. Now it's more like a salad. Different people, different groups, but we all come together harmoniously. I think the salad dressing is supposed to be America now."

"Probably ranch dressing."

"Yep. Ranch, the racist-est of dressings."

"Anyway. The problem I was trying to solve was tricky. The user has a mobile device, and that device has storage space. It can remember things, and we ask it to do that. We store stuff that the system needs to run, like the user's authentication tokens, but we also store data in the cache so that the application does not need to re-fetch some image every time it needs to get displayed."

"I know how caches work, Paul."

"Okay, well, anyway, you know then that cache invalidation is probably one of the hardest practical problems you face. If the user is always connected and online, no big deal. But if the user exits, how does the device know what resources have updated by the time they log in again? Furthermore, and this is something that front-end people tend to not worry about, no offense, but even if the user is always online we cannot really trust the network."

"None taken, but why not? We have TCP, we get some guarantees about packet delivery." I'm showing off a bit here. I really am pretty much only front end. Never let anyone tell you otherwise, there is a ton of important and specialized knowledge in building the front end well and I doubt Paul could build an interface as buttery-smooth as I, but by the same token, I'm on thin ice when thinking farther into the network stack than, 'I ask for something, and I get a promise that resolves in its value or an error.'

"Well, you get some best-effort promises from TCP, but if the server tries to send an update packet and there's a failure of any sort, a hiccup in the network, some kind of wumpus just eats the packet, whatever, the server gets a nicely-formatted error about how it did not go through. We do not really know why the network failed, just that it failed. Even worse, sometimes it is not the initial send that fails, but instead the acknowledgement that fails. So after a fail, if we want to retry we have to reconstruct the initial data we want to send and we had better hope that update is idempotent, because we might have accidentally sent it a second time. We cannot reasonably retry indefinitely, because there are over a million connected devices at any time, and they get serviced by different application servers depending on load, so passing the 'this is what I was trying to do' state around for fails is really expensive. So if there is a packet drop, we just drop that error."

"The result of that is the client and the server never really know each others' states, unless we push the whole state, or some digest of it over the wire relatively often. The front end keeps saying, 'hey, my most recent update was at this timestamp on this resource, is that right?' and the back end has to say, 'okay, here is the delta between then and now.' This ends up either being really computationally expensive or we just push a lot of data over the network that the client already has. My code was supposed to address the delta calculations; was it possible to do a better job figuring out what the client needed in a computationally-efficient manner? Tricky, tricky, tricky."

"Okay, so that sounds simple enough. I can see why it would be hard to test, though. You need to construct state on both sides of a mock network connection, simulate some outages, and make sure the two sides become consistent."

"Right. There's a claim in databases called the CAP theorem, that you want distributed systems that are consistent with each other, available, and partition-resistant. The gist of CAP is that you probably cannot get all three, and most of our work has to do with getting systems that are eventually consistent and can respond to network partitions. We sort of let availability drop - if we know we have a network partition we can just refuse to service certain kinds of requests until the partition ends. In distributed databases on our scale, network partitions are not frequent, but nor are they uncommon. They happen, and we need to deal with them. Working with a client - server connection, partitions are a lot more frequent. But the problem is that my cache calculator seems to come very close to fully satisfying CAP."

We switch to water. On the one hand, it is bad form to sit at happy hour and not order drinks, but on the other hand, we had ordered a lot of drinks earlier as well as food, so we probably have a half-hour before our welcome expires. There is a decay function here.

"What do you mean by 'fully satisfying'?"

"Well, the client does not, of course, know the new information that the server has when it emerges from a partition. But both the client and the server seem to be able to calculate precisely what deltas need to be sent in both directions - what are the new actions the user has taken, and how has the world changed, since the partition. There is a handshake message in both directions, they both say, 'my last message from you was at this timestamp', and eventually get those deltas. This is good."

"Sounds good."

"The problem is that the handshake packets error out and get null routed. They are never actually sent over the wire, there is a typo in the code and the handshake is not sent. The deltas get sent regardless. The deltas are perfect, as well. Again, there are self-correcting errors here: my algorithm should not work at all, because it is firing off of the 'hey, I could not send the handshake' error. But somehow, that error gets shoved into the delta calculator, and the delta is computed imperfectly, but all the extraneous fetches generate some kind of race condition error from the database and fail. Because I am a terrible programmer, I forgot to eat the errors at the bottom of that particular promise stack, so they actually fail quietly. The incomplete packet, incomplete from the perspective of the delta calculator, is sent. It is an enormous mess of callbacks and continuation passing, so I am not entirely even sure how it works, like I said. But it works. Somehow, the server, without being asked, sends exactly what the client needs to know and not one byte more down the wire. The client does the same back to the server, too. You can basically keep using the application offline."

Paul shakes his head. "I am afraid to test the application during a partition where I try to do something that would be rejected based on data developed after the partition started, like debiting an account that gets a huge debit and lock during the partition. I am half-convinced that would somehow develop an extraneous error, like the click handler accidentally throwing."

"If I could figure out what is going on, I would be a rich man. As it is, I am worried about keeping my job. I mean, who turns in untestable, unreadable code? I am surprised it passed code review well enough to get on to beta."

Pushing back from the bar, I say, "Who did that code review anyway? I am going to apologize for invoking the no-bullshit rule. I can see why you might want to refer to this as God. I mean, if what you say is right, the library you wrote knows what its children need and sends them just that. Manna from heaven."

Paul laughs sardonically. "Yeah. Now you know where I got my library codename."