Conclusion¶

This report discusses the design of a distributed system for running bioinformatics software. In this report we focus on running MAFFT in a distributed manner. The system has three different nodes: manager nodes, data nodes and computational worker nodes, which means our system has both a simple distributed file system and a computational grid.

Our system is fault tolerant: the system is guaranteed to function correctly when at most one node of each kind is down. Running an experiment with a real world dataset on Google Compute Engine showed a significant decrease in runtime with more nodes, but more experiments are needed to properly see the overhead effects of the data transfer between worker, data and manager nodes.

The system also easily scales, and it is almost no effort to add another worker or data node. One concern could be the managers: they need to replicate and share the metadata stored in a PostgreSQL databases and the queues in the RabbitMQ server. We think this will not become an issue quickly because the write operations are small and quick, and more than three managers is not likely to happen. The slowdown due to replication will therefore probably be small.

All our must-have requirements have therefore been met.

Conclusion¶

Related Topics

This Page