So, I can finally tell you all what it is I’ve been working on in such secrecy. For the past few months, I’ve had the honor of being chief architect for Google’s social systems, and today we launched the Google Plus Project at last. This isn’t a final thing — as its name implies, it’s going to be evolving and improving very rapidly. As the news stories say, the purpose is to make sharing more social; to make it match the way we actually relate to our friends in the world. And it has some amazing other features, like Hangouts, Huddles and Sparks, which just make it lots of fun to work with.
I just finished re-reading Jared Diamond’s magnum opus, Guns, Germs and Steel. It’s amazing how well the text holds up on re-evaluation; the analysis is deep, and the range of cases he covers is wide enough to convince me that there is real meat to his argument. It did, however, get me thinking about some interesting ways to extend his work. He proposes several in the epilogue, including all of the obvious further data searches and analyses which would need to be run to confirm or refute the hypothesis, and these are surely in the hands of people far more qualified to think about them than I am. But he raised one point obliquely which got me thinking about the one thing I have most trouble with in the argument, and it gave me a thought for how to answer the question.
Just a friendly reminder for y’all: The Internet is basically a giant machine for making sure publicly available data is easily searched and found.
There are crawlers continuously scanning the Net for each of the major search engines; for each of the minor search engines; for ad companies; for other net companies which want the data; for academia; for government agencies. Heck, writing and running a crawler is a fairly standard class project for advanced undergrads. The time between a document becoming visible and at least one crawler grabbing a snapshot of it is getting smaller and smaller.
Which means that if you make something publicly visible on the Internet, even for a few minutes, it’s pretty much impossible to undo that. The internet archives and replicates everything.
So always make sure to check security settings three times before posting anything that isn’t meant to be a general-public, on-the-record broadcast.
ETA: Not only is this post number 666 on this blog, but WordPress has assigned an internal ID number of 1337 to it. Apparently the tendency of the Internet to preserve public information as public is both l33t and evil.
This is another post about design. It’s about a principle which can apply fairly broadly; it could equally be about how to structure an API in a software system, or about how to handle a requirement in a business. Here it is in two flavors:
The software version
If your system’s dependency on another system cannot be expressed through a narrow, stable API, don’t depend on that external system — instead, reimplement it yourself.
The business version
If your business depends on some core function, and you care about the details of how the job is done, rather than just whether it’s done to some simple standard, don’t outsource that function. (e.g., FedEx needs to fly its own planes)
At first glance this may sound extreme; “reimplement it yourself” / “do it in-house” is a tall order for many things you may rely on. But in practice, this sort of decision can be life or death for your system. The reason is that, if you care about how a job is done in detail, you’re going to want to probe into it in depth; you’re going to want detailed controls over the individual steps of the task; you’re going to want to be involved in the day-to-day of the operations to make sure it’s done to your particular need. In terms of software, this means that you won’t be communicating with this system just via a narrow API like “write this data to a file;” you’ll be using complex API’s, bypassing API’s altogether to get internal state, and so on.
As this progresses, you gradually move from using the system to being intimately involved in it, debugging it, and ultimately needing to modify it to your particular needs. But crucially, if you don’t control that system, you can’t do that.
Now, this doesn’t mean that you shouldn’t consider outsourcing the job at first, and moving to in-house when your need to mess with the details grows. But if you’re going to do that, you need to recognize that the design constraints of working with this external system are going to shape your own design from the get-go, and even once you go in-house, the legacy of those decisions will be with you forever. If you are confident enough in the API that you believe that these design choices will be correct even afterwards, and that the changes you’ll make as you go in-house will simply be extensions to that initial relationship, great; but if you suspect that your needs are going to end up being fundamentally different from the external system, you may want to bite the bullet and do it yourself from the get-go.
There’s an obvious risk in doing this, of course; it’s more expensive, takes more time and money, and doesn’t give you an immediate advantage over a competitor who outsources. But this risk can pay off if you know that you’re going to hit that transition point reasonably soon — that way, a competitor who built around the wrong outsourcing is suddenly going to find themselves in need of a massive redesign, while you’re revealing wonderful new features to the world.
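To make the software version concrete, here’s a minimal sketch (the names are mine, not from any real system): the rest of the codebase talks only to a narrow, stable interface, so an outsourced backend can later be swapped for an in-house one without reshaping any of the callers. The moment callers start reaching past this interface, the post’s warning applies.

```python
from abc import ABC, abstractmethod


class BlobStore(ABC):
    """The narrow, stable API: the only surface the rest of the system sees."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class InMemoryBlobStore(BlobStore):
    """An in-house implementation; an outsourced one could sit behind
    the very same API, and callers would never know the difference."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]
```

The design choice is the point: as long as dependents see only `put` and `get`, replacing the implementation is a local change rather than a system-wide redesign.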
So a few days ago, I got an amusing idea for an interview question, which I realized was totally pointless as an interview question because it has no practical value whatsoever. So instead, I’m going to post it on my blog, as a way to help waste the time of all my CS friends. There is no prize whatsoever for a correct answer, except for the satisfaction of having ~~avoided work for a while~~ solved an amusing problem.
Here are two really bad ways to sort an array:
- Random sort: Repeatedly select a random permutation and apply it to the array. Stop when the array becomes sorted.
- Brute-force sort: Iterate over the set of all permutations of N elements. Apply each in turn. If the array is now sorted, stop.
The question is: which of the two is less efficient, and (the trickier part) by how much?
(Clarification: for the latter, measure “how much” as the average [mean] time to sort, averaged also over a large number of possible inputs.)
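For anyone who’d rather waste time empirically, here’s a minimal Python sketch of both sorts. It reads “apply each in turn” as applying each permutation to the original input; whether that or the cumulative reading is intended is part of what makes the question tricky, so take this as one interpretation, not the answer key.

```python
import itertools
import random


def random_sort(a):
    """Repeatedly apply a uniformly random permutation until sorted
    (better known as bogosort)."""
    a = list(a)
    target = sorted(a)
    while a != target:
        random.shuffle(a)  # shuffle == pick a random permutation and apply it
    return a


def brute_force_sort(a):
    """Walk the permutations of the positions in a fixed (lexicographic)
    order, applying each one to the original array until one sorts it."""
    a = list(a)
    target = sorted(a)
    for perm in itertools.permutations(range(len(a))):
        candidate = [a[i] for i in perm]
        if candidate == target:
            return candidate
```

Both are safe to try only on very small inputs: with N elements there are N! permutations, so expect run times to explode past N of about 10.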
While reviewing some code today, a principle of software design somehow distilled itself to clarity in my head.
When designing your system, think of every major system¹ upon which your own system directly depends² as a bug.
By “think of it as a bug,” I mean that sooner or later, you are going to come to truly hate this dependency. It won’t do what you want, or it will turn old and crufty, or it will get outdated, or your system will outgrow it. Perhaps it already stinks. And therefore, think about what you are going to have to do to take it out and replace it with something better, possibly something with a quite different API.
Yes, you should have your code sufficiently factored and modular that such a replacement will be minimally invasive. But more importantly: if that replacement requires any change in the API’s³ by which the outside world is using your system, then there is something wrong with your design. Stop and fix that immediately.
¹ Both external dependencies and major subsystems of your own code. Both will suck in time, I promise you.
² If the systems upon which you directly depend have done this properly, you don’t need to worry about your indirect dependencies. If they haven’t, then you should consider replacing them now, because you are obviously dealing with the work of madmen.
³ Or UI’s, if your software is at the top of its software stack. UI’s are just API’s for communicating efficiently with humans. (Or perhaps API’s are just UI’s for communicating with computers?)
(FYI: The following entry is going to be much more technical than most of what I post. Anyone who doesn’t care about code or data serialization can pretty much hit “next” right now.)
A few days ago, Google open sourced one of its key data serialization formats, protocol buffers. There’s already been some chat about how they’re similar to or different from other wire formats, but I thought it would be worth posting some tips I’ve come across over the years about how to make them do useful things.
Don’t expect any deep insights into computer science here, just a few notes about working with these libraries.
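For readers who haven’t seen the format before, here’s what a message definition looks like; the message and field names here are purely illustrative, not from any real schema.

```protobuf
// A hypothetical message definition, just to show the shape of the format.
message LogEntry {
  required string user_id = 1;  // field numbers, not names, identify fields on the wire
  optional int64 timestamp = 2; // optional fields may be absent from a message
  repeated string tags = 3;     // repeated fields act like lists
}
```

A compiler (`protoc`) turns definitions like this into generated classes in your language of choice, with accessors and fast binary serialization.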
My former boss just published a great little article about things you need to know when building a search engine. It’s chock-full of some really excellent advice for anyone building any large-scale computer system. For example:
Ah, but SCSIs are hot-swappable, you say. Get over it. Remember, no colo. You cannot afford it and you don’t want it. So if you’re worried about disk failures since you picked your disks out of a Dumpster, then my advice is don’t screw the covers onto your machines and don’t use four screws per disk. This makes IDEs pretty easy to repair, but certainly not hot-swappable.
I do sometimes miss working with Anna.
Media companies are getting antsy about Web companies, as you’ve probably heard. At a recent conference, various representatives of the media talked about this. The quote that caught my eye was:
“The Googles of the world, they are the Custer of the modern world. We are the Sioux nation,” Time Warner Inc. Chief Executive Richard Parsons said, referring to the Civil War American general George Custer who was defeated by Native Americans in a battle dubbed “Custer’s Last Stand”.
“They will lose this war if they go to war,” Parsons added, “The notion that the new kids on the block have taken over is a false notion.”
I wonder if Parsons is aware of how that war ultimately turned out for the Sioux?