Background
Big data stores form the backbone of modern computing infrastructures. They support large data sets and enable processing frameworks that mine information from this data. They are designed for quick response and parallel computing at unprecedented scale. Although processing occurs in main memory, big data stores maintain the authoritative version of data on secondary storage (typically, SSDs and disks).
The existence of two versions of the data, one on-disk and another in-memory, poses several problems. First, the slow access time of secondary storage hinders performance, and data are often persisted asynchronously at the expense of durability. Moreover, starting and warming up a big data store, for instance after a failure, takes a very long time. A striking example is the September 2010 Facebook outage, in which the whole system was unavailable for 2.5 hours due to recovery. Finally, the two representations of data have to be mutually consistent. This requires complex mechanisms that translate into a high overhead. For instance, recent investigations show that Apache Spark is up to 6× faster without data persistence.
A key technology that has the potential to remove the dual representation and greatly improve performance is non-volatile byte-addressable memory (NVRAM). NVRAM combines the best features of traditional RAM and persistent storage. It retains data upon power loss and provides fast, fine-grained access to data, with latency and bandwidth close to those of RAM.
Objectives and challenges
The main purpose of this position is to unleash the full capabilities of NVRAM for big data stores. This requires addressing several challenges rooted in multiple areas of computer science, including programming languages, concurrent computing and storage systems.
The central challenge is that algorithms managing the in-memory representation of data are not ready for persistent memory. A basic example is an atomic map that serves key-value store operations, backed by several log-structured merge-trees for indexing, as found in systems such as HBase and Cassandra. In a nutshell, these algorithms are not designed to recover a consistent state after a failure. Upon recovery, threads may face a mix of state: some of it had reached persistent memory, while the rest was still held in volatile components (e.g., processor caches, memory controller) and was lost. Therefore, the state of the system may differ before and after the failure.
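The recovery problem described above can be illustrated with a toy model. The sketch below is a simulation only, not a real NVRAM API: `SimulatedNVRAM`, `insert_unsafe` and `insert_safe` are hypothetical names, and the model assumes that a store stays in a volatile cache until it is explicitly flushed. Real hardware may also evict cache lines on its own and in any order, which this model does not capture.

```python
class SimulatedNVRAM:
    """Toy model (assumption): stores land in a volatile cache and reach
    the persistent medium only when explicitly flushed."""

    def __init__(self):
        self.persistent = {}  # survives a crash
        self.cache = {}       # lost on a crash

    def store(self, addr, value):
        self.cache[addr] = value

    def load(self, addr):
        # The cache holds the most recent value, if any.
        return self.cache.get(addr, self.persistent.get(addr))

    def flush(self, addr):
        # Write one cached entry back to the persistent medium.
        if addr in self.cache:
            self.persistent[addr] = self.cache.pop(addr)

    def crash(self):
        # Power loss: everything that was not flushed disappears.
        self.cache.clear()


def insert_unsafe(mem, key, value):
    """Flushes only the counter: after a crash, the counter may claim
    an entry that never reached the persistent medium."""
    mem.store(("data", key), value)
    mem.store("count", (mem.load("count") or 0) + 1)
    mem.flush("count")


def insert_safe(mem, key, value):
    """Persists the entry first, then the counter that refers to it."""
    mem.store(("data", key), value)
    mem.flush(("data", key))
    mem.store("count", (mem.load("count") or 0) + 1)
    mem.flush("count")
```

After `insert_unsafe` followed by `crash()`, the counter reports one entry while the entry itself is gone, which is the kind of inconsistent post-failure state the paragraph refers to. Ordering the flushes, as in `insert_safe`, avoids it in this model.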
A second key challenge is that accesses to persistent storage are often done in a sequential manner to maximize the performance of the medium (typically, with disks), or to account for the memory hierarchy (as in Apache Lucene). However, offering persistence should not come at the price of performance. It is necessary to maintain parallelism where possible, as big data stores typically support concurrent operations when accessing volatile memory.
Work plan
As a first step, the notion of persistent data type (PDT) is studied in depth. A PDT is a common data type, in the sense that it satisfies a sequential specification. Additionally, every instance of a PDT is thread-safe and persistent. Internally, its implementation is designed to provide data consistency, recoverability and high performance, by taking full advantage of NVRAM in a multi-core setting.
In a second step, a construction to transform a sequential data type into a PDT is proposed. As pointed out above, a major hurdle is to ensure the integrity of data structures in the face of concurrency and failures. Techniques from non-blocking concurrency control, transactional programming, and commutative replicated data types provide a solid basis for leveraging NVRAM, and they will form the initial course of action.
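To give a feel for what such a construction might look like, here is a deliberately simple sketch. It is not the construction the project will propose: it swaps in a single lock and an operation log, where the text mentions non-blocking and transactional techniques. `PersistentLog` and `PDT` are hypothetical names. The log is a plain in-process list standing in for an append-only region in NVRAM, and the sketch assumes that every logged operation is deterministic.

```python
import threading


class PersistentLog:
    """Stand-in for an append-only log in NVRAM (assumption: an append
    is atomic and durable once it returns)."""

    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)


class PDT:
    """Wraps a sequential data type: a lock provides thread safety and a
    log of update operations provides recoverability."""

    def __init__(self, factory, log):
        self._log = log
        self._lock = threading.Lock()
        self._state = factory()
        # Recovery: replay every logged update on a fresh instance.
        for name, args in log.entries:
            getattr(self._state, name)(*args)

    def invoke(self, name, *args):
        """Update operation: logged first, then applied."""
        with self._lock:
            self._log.append((name, args))
            return getattr(self._state, name)(*args)

    def read(self, name, *args):
        """Read-only operation: applied without logging."""
        with self._lock:
            return getattr(self._state, name)(*args)


# Usage: wrap a plain dict, then rebuild it from the log after a "crash".
log = PersistentLog()
store = PDT(dict, log)
store.invoke("__setitem__", "a", 1)
store.invoke("__setitem__", "b", 2)
store.invoke("pop", "a")
recovered = PDT(dict, log)  # same log, fresh volatile state
```

The single lock serializes all operations, which is exactly the loss of parallelism that the second challenge warns against. The sketch illustrates only the sequential-specification and recovery aspects of a PDT.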
In a third step, the postdoc demonstrates the applicability of the proposed solution as a whole. To avoid difficulties related to managed languages, this implementation targets data stores in C/C++ (e.g., MongoDB and RocksDB). The solution is compared to RAM disks (à la HDFS), RAM file systems (such as tmpfs), and the recent efforts on porting storage systems to NVRAM. The comparison relies on the workloads available internally at Scality SA, and in particular the ones related to the distributed file system.
Start date March 2020, for a duration of 18 months.
Required skills and background:
- Ph.D. in Computer Science.
- Excellent academic record.
- Strong background in distributed systems / databases / algorithms.
- Good developer and experimenter.
- A full curriculum vitæ.
- A cover letter stating your motivation and fit for this position.
Contact Pierre Sutra
Posted 3 December 2019