Distributed Storage for Stateful Serverless Computing

Postdoc offer

 

Background  The recently funded H2020 CloudButton project [1] aims to democratize big data by greatly simplifying its programming model with the help of serverless technologies. The core idea is to tap into stateless functions to enable radically simpler, more user-friendly data processing systems. Average cloud users do not want to spend hours understanding complex analytics stacks (e.g., Spark [2], YARN [3], or Ignite [4]), nor to struggle with the choice of instance types, cluster sizes, etc. What they want is a simple interface to execute their optimized, single-machine code in parallel. CloudButton is the technological response to this emerging need. To demonstrate impact, the project targets two strategic settings with large data volumes and diverse analytics requirements: bioinformatics (genomics, metabolomics) and geospatial data (LiDAR, satellite imagery).

 

Objectives The main objective of this position is to specify and implement the storage layer of the CloudButton stack. Serverless computing infrastructures deliver massively parallel, short-lived functions whose computation can quickly scale up and down. To cope with this transient nature of computation, the storage layer needs to be auto-scalable, and it must support ephemeral data that lasts only for the duration of the serverless function calls [6]. One possible approach is to co-locate data with computation and thus operate the storage layer for only a short period of time. Another key challenge is that storage should ease the transition from single-machine code to the serverless infrastructure. This requires offering not a simple binary store, but a full library of complex objects, as commonly found in modern programming languages. To improve code modularity, objects need to be composable. For performance, the storage layer may also split them transparently to the serverless functions. A third challenge is that data should be shared among the serverless functions to support stateful computation. Storage should thus include appropriate concurrency control mechanisms to manage concurrent accesses and guarantee data consistency.
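As an illustration of the third challenge, here is a minimal sketch of a shared object protected by optimistic concurrency control. It is not the Infinispan or Crucial API: `VersionedStore`, `SharedCounter`, `serverless_function` and `run_demo` are hypothetical names, the store is a plain in-memory dictionary, and threads stand in for concurrent serverless function invocations.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class VersionedStore:
    """Hypothetical in-memory stand-in for the storage layer (assumption)."""

    def __init__(self):
        self._data = {}  # key -> (version, value)
        self._lock = threading.Lock()

    def read(self, key, default=0):
        # Return the current (version, value) pair; version 0 means "never written".
        with self._lock:
            return self._data.get(key, (0, default))

    def compare_and_set(self, key, expected_version, new_value):
        # Write only if no other writer has updated the key since expected_version.
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                return False
            self._data[key] = (current_version + 1, new_value)
            return True


class SharedCounter:
    """A simple shared object built on top of the store (hypothetical)."""

    def __init__(self, store, key):
        self._store = store
        self._key = key

    def add(self, delta):
        # Optimistic retry loop: re-read and retry whenever a concurrent write wins.
        while True:
            version, value = self._store.read(self._key)
            if self._store.compare_and_set(self._key, version, value + delta):
                return value + delta

    def get(self):
        return self._store.read(self._key)[1]


def serverless_function(counter, items):
    # Stands in for one short-lived function call that updates shared state.
    counter.add(sum(items))


def run_demo(workers=8, items_per_worker=100):
    store = VersionedStore()
    counter = SharedCounter(store, "total")
    chunks = [
        range(i * items_per_worker, (i + 1) * items_per_worker)
        for i in range(workers)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda chunk: serverless_function(counter, chunk), chunks))
    return counter.get()
```

Calling `run_demo()` sums disjoint chunks of integers from concurrent workers into one shared counter. A write based on a stale version is rejected and retried, so no update is lost. A real storage layer would additionally have to handle distribution, failures and object composition, which this sketch leaves out.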

 

Work Plan Initially, the storage layer will be built atop the Infinispan data grid [7] developed at Red Hat, a CloudButton partner, using the contributions made to the Crucial framework [8]. To demonstrate applicability, the postdoc will port an existing machine learning library to the CloudButton stack and evaluate it in practice using standard data analytics workloads.

 

Start date As soon as possible, for a duration of 12 to 24 months. Applications are being accepted now, and the position will remain open until filled.

 

To Apply Required skills and background:

  • PhD in Computer Science
  • Excellent academic record
  • Strong background in distributed systems, databases, or cloud computing
  • Good development and experimentation skills

Please provide:

  • a full curriculum vitæ
  • a cover letter stating your motivation and fit for this position

 

Contact Pierre Sutra

 

References

  • [1] http://cloudbutton.eu, retrieved June 2019
  • [2] Matei Zaharia et al., Apache Spark: A Unified Engine for Big Data Processing, CACM, Nov. 2016
  • [3] Vinod Kumar Vavilapalli et al., Apache Hadoop YARN: Yet Another Resource Negotiator, SoCC, Oct. 2013
  • [4] Shamim Bhuiyan et al., High Performance In-Memory Computing with Apache Ignite, 2017
  • [5] Raul Castro Fernandez et al., Java2SDG: Stateful Big Data Processing for the Masses, ICDE, May 2016
  • [6] Ana Klimovic et al., Pocket: Elastic Ephemeral Storage for Serverless Analytics, OSDI, Oct. 2018
  • [7] http://infinispan.org, retrieved Feb. 2019
  • [8] Daniel Barcelona-Pons et al., On the FaaS Track: Building Stateful Distributed Applications with Serverless Architecture, Middleware, Dec. 2019