Beware keeping data in binary format
Dear KV,
Where I work we are very serious about storing all of our data, not just our source code, in our
source-code control system. When we started the company we made the decision to store as much
as possible in one place. The problem is that over time we have moved from a pure programming
environment to one where there are other people—the kind of people who send e-mails using
Outlook and who keep their data in binary and proprietary formats.
At first some of us dealt with the horrifically colorful e-mails by making our mail server convert
all e-mail to plain text before forwarding it, but that’s not much help when people tell you they
absolutely must use Excel, and then store all of their data in it. The biggest problem is that these files
take up a huge amount of space in our source-code control system, but we still don’t want to store
important information outside of it. Many of us are about ready to give up and just stop worrying
about these types of files, and allow the company’s data to be balkanized, but this doesn’t seem like
the right answer to me.
Binning Binary Files
Dear Binning,
While the size argument used to be a compelling one—perhaps even as recently as five years ago—
we all know that terabyte disks are now cheap, and I would be quite surprised if you told me that
your company didn’t have a reasonably large, centralized filestore for your source-code control
system. I think the best arguments against storing important company data in a proprietary or a
binary format—and yes, there are open binary formats—are about control and versioning.
The versioning argument goes something like this. Let’s say, for example, that the people who
control your data center store their rack diagrams, which show where all your servers and network
gear are located, as well as all the connections between that equipment, in a binary format. Even if
the program they use to set up the files has some sort of “track changes” feature, you will have no
way of comparing two versions of your rack layouts. Any company that maintains a data center is
continually changing the rack layout, whether by adding or moving equipment or by changing or adding
network connections. If a problem occurs days or weeks after a change, how are you going to
compare the current version of the layout to a version from days or weeks in the past, which may be
several versions back? The answer, generally, is that you cannot. Of course, these kinds of situations
never come up, right? Right.
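To make the versioning point concrete, here is a minimal sketch of what becomes possible once the rack layout lives in plain text. The one-line-per-device format, the device names, and the `changed_lines` helper are all hypothetical, invented for illustration; the only real API used is Python's standard `difflib` module. In practice your source-code control system's own diff command would do this job for you.

```python
import difflib

# Hypothetical plain-text rack layout: rack, unit, host, network connection.
# This format is an assumption made for illustration, not a standard.
OLD_LAYOUT = """\
rack01 u10 web-01 eth0->sw01:12
rack01 u11 web-02 eth0->sw01:13
rack01 u12 db-01 eth0->sw01:14
"""

NEW_LAYOUT = """\
rack01 u10 web-01 eth0->sw01:12
rack01 u11 web-02 eth0->sw02:13
rack01 u12 db-01 eth0->sw01:14
rack01 u13 cache-01 eth0->sw01:15
"""


def changed_lines(old, new):
    """Return (removed, added) lines between two text versions of a layout."""
    removed, added = [], []
    # ndiff prefixes each line with "- ", "+ ", "  ", or "? " (hint lines).
    for line in difflib.ndiff(old.splitlines(), new.splitlines()):
        if line.startswith("- "):
            removed.append(line[2:])
        elif line.startswith("+ "):
            added.append(line[2:])
    return removed, added


removed, added = changed_lines(OLD_LAYOUT, NEW_LAYOUT)
for line in removed:
    print("removed:", line)
for line in added:
    print("added:  ", line)
```

With the layout stored this way, answering "what changed between the version from three weeks ago and today" is a matter of checking out both revisions and comparing them, which is exactly what a binary file will not let you do.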
The second, and I think stronger, argument has to do with control of your data. If a piece of data
is critical to the running of your business, do you really want to store it in a way that some other
company can, via an upgrade or a bug, prevent you from using that data? At this point, if you’re
trusting, you could just store most of your data in the cloud, in something like Google Apps. KV
would never do this because KV has severe trust issues. I actually believe more people ought to think
clearly about where they store their data and what the worst-case scenario for that data is.