Requirements

In this section we discuss the two use cases we have in mind for our distributed system. In the first, the "life science" group in your organisation has performed a sequencing experiment and wants to store the resulting data reliably; in the second, the bioinformatics group wants to perform analyses on that data.

Use Case 1: store the data of a sequencing experiment

The biological lab work is often done by a different group than the group performing the actual analysis on the data. Although the costs of sequencing have dropped dramatically in recent years, it is still relatively expensive (roughly $1,000 per experiment). The data coming from such an experiment should be stored for later analysis.

This brings the following challenges:

  • The huge amount of data: a human genome sequenced at 60x read coverage depth can easily occupy 200 GB in compressed FASTQ format (a back-of-envelope estimate follows this list).
  • The data needs to be stored persistently and reliably.
  • The data needs to be accessible to other teams.
  • Analysis and other actions need to be performed on this dataset, and the results should be stored too.
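
To put the first of these challenges in perspective, here is a rough back-of-envelope estimate in Python; the bytes-per-base and compression figures are assumptions chosen for illustration, not measurements.

    GENOME_SIZE_BP = 3.1e9   # approximate haploid human genome length
    COVERAGE = 60            # read coverage depth
    BYTES_PER_BASE = 2.5     # base + quality + record headers (assumed)
    GZIP_RATIO = 0.3         # assumed FASTQ compression ratio

    total_bases = GENOME_SIZE_BP * COVERAGE
    raw_gb = total_bases * BYTES_PER_BASE / 1e9
    compressed_gb = raw_gb * GZIP_RATIO

    print(f"raw FASTQ:        ~{raw_gb:.0f} GB")         # ~465 GB
    print(f"compressed FASTQ: ~{compressed_gb:.0f} GB")  # ~140 GB

Under these assumptions a single experiment lands in the low hundreds of gigabytes compressed, the same order of magnitude as the figure above.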

Use Case 2: perform analyses on the data

When the data is safely stored in the database, an organisation will likely want to analyse it: for example, map individual reads to a reference genome or locally align them, assemble a new genome from the individual reads, or check whether the newly sequenced genome has any variant genes compared to the reference.
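
To give an impression of the per-read work such an analysis involves, below is a minimal, score-only Smith-Waterman local alignment sketch (scoring parameters chosen arbitrarily for illustration); a production pipeline would use an optimised aligner such as BWA rather than hand-rolled code.

    def smith_waterman_score(read, ref, match=2, mismatch=-1, gap=-2):
        # Best local alignment score of `read` against `ref`
        # (Smith-Waterman, score only, O(len(read) * len(ref))).
        cols = len(ref) + 1
        prev = [0] * cols
        best = 0
        for i in range(1, len(read) + 1):
            curr = [0] * cols
            for j in range(1, cols):
                diag = prev[j - 1] + (match if read[i - 1] == ref[j - 1] else mismatch)
                curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
                best = max(best, curr[j])
            prev = curr
        return best

    print(smith_waterman_score("ACGTAC", "TTACGTACTT"))  # 12: exact 6-base hit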

Most of these operations take a long time, but because a sequencing experiment yields a large number of independent reads, many steps can be performed simultaneously on different chunks of the data. Building a scalable distributed system for these kinds of pipelines could therefore reduce the computation time significantly.
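
A minimal sketch of that scatter-gather pattern, shown here with Python's multiprocessing on a single machine purely for illustration; in the actual system the chunks would be distributed across nodes, and align_chunk is a stand-in for real per-read work.

    from multiprocessing import Pool

    def align_chunk(chunk):
        # Stand-in for per-chunk work (e.g. aligning a batch of reads);
        # reads are independent, so chunks can be processed concurrently.
        return [read.upper() for read in chunk]

    if __name__ == "__main__":
        reads = ["acgt", "ttag", "ggca", "catg", "tgca", "aacc"]
        chunks = [reads[i:i + 2] for i in range(0, len(reads), 2)]
        with Pool() as pool:
            results = pool.map(align_chunk, chunks)  # scatter chunks, gather results
        print([r for part in results for r in part])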

Requirements Prioritisation

Must Have

  • Build a distributed system which implements a subset of the steps of an NGS pipeline: Burrows-Wheeler alignment and local alignment on independent reads, keeping known best practices in mind [auwera2013fastq].
  • The data must be stored safely and reliably.
  • Fault tolerance: when one of the nodes crashes, this should not affect the final results (a simple retry sketch follows this list).
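
Because the reads are independent, one straightforward way to meet the fault-tolerance requirement is to treat each chunk as an idempotent task that is simply re-dispatched when a worker fails. A hypothetical sketch of that retry loop, where run_on_some_node stands in for the cluster's actual dispatch call:

    import random

    def run_on_some_node(task):
        # Hypothetical dispatch call: run `task` on an available worker.
        # Here it randomly "crashes" to simulate a node failure.
        if random.random() < 0.3:
            raise RuntimeError("node crashed")
        return f"result-of-{task}"

    def run_with_retries(task, max_attempts=5):
        # Re-dispatch an idempotent task until it succeeds, so a single
        # node crash does not change the final result.
        for _ in range(max_attempts):
            try:
                return run_on_some_node(task)
            except RuntimeError:
                continue  # pick another node and try again
        raise RuntimeError(f"{task} failed after {max_attempts} attempts")

    print([run_with_retries(f"chunk-{i}") for i in range(4)])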

Should Have

  • Different scheduling policies for different workloads.
  • Multi-tenancy: let multiple teams perform different actions simultaneously.

Could Have

  • Data ownership: control over who can see which datasets.

[auwera2013fastq] Van der Auwera, G. A., Carneiro, M. O., Hartl, C., Poplin, R., del Angel, G., Levy-Moonshine, A., ... & Banks, E. (2013). From FastQ data to high-confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Current Protocols in Bioinformatics, 11-10.