In this new entry of the blog we will explain how a distributed cache system works. Before I started working on this post, my original idea was to explain only one caching algorithm (the idea appeared while I was reading the groupcache source code to understand its hashing algorithm. By the way, it is an amazing piece of software, you should check it out). So I started implementing it, and when I was done I realized it would be nice to test it in some way. So why not implement a simple and basic distributed cache system in a way that is easy to explain? This series will be split into independent entries:
Introduction to a cache system
Understanding the cache/container
Understanding the hashing
Design the protocol [Not ready]
Understanding the discovery system [Not ready]
Here we go ;)
DISCLAIMER
I’m not an expert on caching systems
I’m a Python fanatic
The final cache system is not production ready nor finished :)
What is a Cache system?
At a glance, a cache system is a basic piece of software. It is not rocket science; after all, it is as simple as having a piece of memory for storing data identified by some key.
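To make that concrete, here is a minimal sketch of "a piece of memory storing data identified by some key". The class name `SimpleCache` and its methods are hypothetical names chosen for illustration; this is not the container built later in the series, just a thin wrapper around a Python dict.

```python
class SimpleCache:
    """A tiny in-memory key/value cache (illustrative only)."""

    def __init__(self):
        # The "piece of memory" that holds our entries
        self._data = {}

    def set(self, key, value):
        # Store (or overwrite) the value under the given key
        self._data[key] = value

    def get(self, key, default=None):
        # A cache miss returns `default` instead of raising an error
        return self._data.get(key, default)

    def delete(self, key):
        # Deleting a missing key is a no-op
        self._data.pop(key, None)


cache = SimpleCache()
cache.set("user:42", {"name": "Ada"})
print(cache.get("user:42"))  # {'name': 'Ada'}
print(cache.get("missing"))  # None
```

Note that nothing here limits how much memory is used or decides what to throw away when it fills up. That is exactly the kind of feature that starts adding complexity.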
Well… it sounds simple, but if we start adding features it becomes more complex, so we need to know when to stop adding fancy stuff. Our system will have at least these requirements:
Modular: Yeah, we need the ability to change the algorithms and stuff as we want without too much trouble
Cache: We want to store things… (Thanks captain obvious :P)
Distributed: We need to add and remove nodes as the data grows…
Minimal data loss: We should lose as little data as possible when removing nodes from the system
Dynamic: We need to add and remove nodes without having to restart all the nodes (the cluster)
Service discovery: New nodes are discovered automatically when they are added
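The "minimal data loss" requirement is the trickiest one, so here is a rough sketch of one common way to get it: consistent hashing on a ring. This is an assumption on my side for illustration and may differ from what the hashing entry ends up using; `HashRing`, `node_for`, the `replicas` value and the node names are all hypothetical.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # Stable hash: Python's built-in hash() for str changes between processes
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual points per node (illustrative only)."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual points placed on the ring per node
        self._points = []         # sorted hash positions on the ring
        self._owners = {}         # hash position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            self._owners[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            del self._owners[point]
            self._points.remove(point)

    def node_for(self, key):
        # First point clockwise from the key's hash, wrapping around the ring
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owners[self._points[idx]]


keys = [f"key-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.remove("node-c")
after = {k: ring.node_for(k) for k in keys}

# Only the keys that lived on the removed node change owner
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
```

The point of the demo is the last few lines: after removing `node-c`, every key that was on `node-a` or `node-b` stays where it was, so only the data held by the removed node is lost. With a naive `hash(key) % number_of_nodes` scheme, most keys would be remapped instead.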
Our design
The system will be the group of these pieces:
A cache algorithm to store the data
A hashing algorithm to distribute the data across the cluster
A communication protocol to talk to the nodes and between the nodes
A service discovery system so the nodes can talk to each other
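To see how these pieces could plug together, here is a toy, single-process sketch. Everything in it is hypothetical: `LocalCluster` and `modulo_picker` are made-up names, the "protocol" is just a direct method call, "discovery" is a fixed list of node names, and the picker is a deliberately naive placeholder that can be swapped for a smarter hashing algorithm.

```python
import zlib


def modulo_picker(key, nodes):
    # Placeholder hashing piece: CRC32 of the key modulo the node count
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]


class LocalCluster:
    """Routes each key to one node's store, all inside one process."""

    def __init__(self, nodes, picker=modulo_picker, store_factory=dict):
        self.nodes = list(nodes)   # stands in for service discovery
        self.picker = picker       # pluggable hashing algorithm
        # One store per node; store_factory is the pluggable cache piece
        self.stores = {name: store_factory() for name in self.nodes}

    def set(self, key, value):
        # A real system would send this over the network protocol
        self.stores[self.picker(key, self.nodes)][key] = value

    def get(self, key):
        return self.stores[self.picker(key, self.nodes)].get(key)


cluster = LocalCluster(["node-a", "node-b", "node-c"])
for i in range(10):
    cluster.set(f"key-{i}", i)

print(cluster.get("key-3"))  # 3
print({name: len(store) for name, store in cluster.stores.items()})
```

Because the picker and the store are passed in as arguments, either one can be replaced without touching the rest, which is the "modular" requirement in miniature.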
As I said before, this cache system is not meant for production use. It has been designed only with educational purposes in mind, omitting error handling, data replication, optimization… There are a lot of amazing cache systems out there, like Memcached or groupcache (both by the same person, bradfitz).
This will be the schema of the cache system when finished. Looks exciting, doesn’t it?
Cache schema
Let’s start messing around with the cache algorithm in part 2!