Distribution assurance
This stage has three components: load balancing, moving SIPs to their primary destination, and unpacking them.
Load balancing
Having ingested the volume directory metadata, the system is now primed to expect the SIPs of data that make up that file system. The first task of the load balancer is to select the primary storage for the data. It allocates a storage server to hold the data held within the SIP and records this allocation in the FCluster inodes table. Allocation is based on the available capacity of the host, its processing power and its estimated time to finish its current task list.
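As a minimal sketch of how such an allocation might be scored, the following Python fragment ranks candidate storage hosts on the three criteria named above. The node fields, weights and function names are illustrative assumptions, not the FCluster implementation.

# Sketch only: score candidate storage nodes on capacity, processing power
# and estimated backlog. All field names here are hypothetical.
from dataclasses import dataclass

@dataclass
class StorageNode:
    host: str
    free_bytes: int           # available capacity
    cpu_score: float          # relative processing power (higher is better)
    est_backlog_seconds: int  # estimated time to finish current task list

def choose_primary(nodes, sip_size_bytes):
    """Pick a node with enough free space and the best power/backlog trade-off."""
    candidates = [n for n in nodes if n.free_bytes >= sip_size_bytes]
    if not candidates:
        raise RuntimeError("no storage node has capacity for this SIP")
    # Prefer powerful hosts with short backlogs; the weighting is illustrative.
    return max(candidates, key=lambda n: n.cpu_score / (1 + n.est_backlog_seconds))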
The move file daemon
The move file daemon also uses "checklist" type assurance by constantly scanning the inodes table of FClusterfs for any SIP that has been allocated a data node, has not been marked as being 'in place', and whose evidence SIP is staged in a local directory. If these conditions are met, the SIP is transferred to the storage data node allocated by the load balancer. If, and only if, the transfer is successful does move data update the inodes table with 'primary storage in place' set to true. Move data is the only mechanism whereby actual data can be moved around the system, and it can only operate when all the preconditions from Ingestion Assurance are met. It does not simply scan an evidence folder and move whatever SIPs are present; it moves only expected SIPs, as recorded in the FCluster inodes table.
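The checklist logic could be sketched as below, assuming an SQL-style inodes table and an FTP transfer to the allocated data node; the table, column and function names are hypothetical and chosen only to mirror the preconditions described above.

# Sketch only: move a SIP solely when the inodes record says it is expected,
# and flip the flag solely after a successful transfer.
import os
import sqlite3
from ftplib import FTP

def move_pending_sips(db_path, staging_dir):
    db = sqlite3.connect(db_path)
    rows = db.execute(
        """SELECT sip_name, data_node FROM inodes
           WHERE data_node IS NOT NULL
             AND primary_storage_in_place = 0"""
    ).fetchall()
    for sip_name, data_node in rows:
        local_path = os.path.join(staging_dir, sip_name)
        if not os.path.exists(local_path):
            continue  # evidence SIP not staged locally yet: precondition not met
        try:
            with FTP(data_node) as ftp, open(local_path, "rb") as f:
                ftp.login()  # anonymous login, for the sketch only
                ftp.storbinary(f"STOR {sip_name}", f)
        except Exception:
            continue  # transfer failed: leave the inodes entry untouched
        # Only a successful transfer marks primary storage as in place.
        db.execute(
            "UPDATE inodes SET primary_storage_in_place = 1 WHERE sip_name = ?",
            (sip_name,))
        db.commit()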
The unpack daemon
The unpack daemon constantly scans the inodes table to see if there are any SIPs that are on their local server but not yet unpacked. It takes the entry from the database and checks whether the files are on its ftp host, as they should be according to the entries in inodes, not the other way round. A file that simply arrives on the server without an entry in inodes is ignored. When a suitable SIP is identified it is split into header and data sections. The header, containing the metadata, is inserted into the 'metadata' table and the header file erased. The data section is uudecoded and the data decrypted with a key stored in the Volume Listing table. This is the key first created and issued by the FCluster and used to encrypt the data in the SIP at acquisition time. If the key does not work, the file cannot be decrypted and unpacking fails. Only if the file decrypts and the resulting file has an SHA1 checksum that matches both the name of the file itself and the SHA1 recorded in the inodes table is the data file finally accepted.
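The final acceptance test could be expressed as in the sketch below, which assumes the data section has already been uudecoded and decrypted with the volume key; the function and parameter names are hypothetical.

# Sketch only: accept the decrypted file only if its SHA1 matches both its own
# file name and the checksum recorded in the inodes table.
import hashlib
import os

def accept_data_file(decrypted_path, expected_sha1_from_inodes):
    sha1 = hashlib.sha1()
    with open(decrypted_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha1.update(chunk)
    digest = sha1.hexdigest()
    name_matches = os.path.basename(decrypted_path).lower().startswith(digest)
    record_matches = (digest == expected_sha1_from_inodes.lower())
    return name_matches and record_matches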
Processing assurance
The task daemon scans the tasks table to see if any job is required for a file that it holds locally. Because all file access must take place through the enhanced FClusterfs filesystem, the file must be the correct file and must have the original content that was collected at imaging time. FClusterfs also gives us fine-grained access control to the files within a file system. We could, if we wished, control which users can process specific data with specific programs.
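A minimal sketch of that scan, assuming an SQL-style tasks table keyed by the host that holds the file, might look as follows; the table, column and function names are again hypothetical.

# Sketch only: run outstanding tasks for files held on this host, accessing
# them through the FClusterfs mount so the verified original content is used.
import socket
import sqlite3
import subprocess

def run_local_tasks(db_path, mount_point):
    db = sqlite3.connect(db_path)
    host = socket.gethostname()
    rows = db.execute(
        "SELECT task_id, program, file_path FROM tasks "
        "WHERE data_node = ? AND completed = 0", (host,)
    ).fetchall()
    for task_id, program, file_path in rows:
        subprocess.run([program, mount_point + file_path], check=True)
        db.execute("UPDATE tasks SET completed = 1 WHERE task_id = ?", (task_id,))
        db.commit()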
Conclusions
We have demonstrated that by enforcing a rigorous protocol when importing SIPs into a distributed cluster we can provide a level of assurance in data transfer and storage. Additionally, by adopting the same approach as Hadoop we have created a prototype of a middleware specifically designed to address the assurance requirements of the legal process while providing effective distributed processing. As to whether this achieves an acceptable level, we offer this design for further debate. It should be clear that this design draws upon knowledge from many domains and so there is no single set of criteria that can be applied.
Speed concerns
A primary concern with FClusterfs is speed, but in practice this has not proven to be a significant problem. Firstly, file access in existing systems is often across a network connection via SMB and NFS shares; FCluster does this in the same way but using the ftp protocol, which is roughly equal or perhaps slightly slower. Secondly, as we have made clear, FCluster is read-only and so has no record or file locking code. As a result, even when FCluster draws from a remote ftp server, data is cached locally in RAM and never needs to refer back to the source for updates or changes. Thirdly, the system is designed so that each storage host processes its own local data, so the network issue disappears entirely.
All distributed systems suffer from a management overhead. This management issue exists in single-host solutions but is exacerbated when management data has to be passed in messages across relatively slow network connections rather than held in local memory. This limits scalability, but in our initial tests the effectiveness of clusters of about 50 hosts on a local Gigabit network does not degrade significantly.
As of Spring 2014, the FCluster prototype is almost complete and we are starting full assessment. We intend to make it available when complete via www.fcluster.org.uk.