Samsara: Declarative Machine Learning on Distributed Dataflow Systems


We present Samsara, a domain-specific language for declarative machine learning in cluster environments. Samsara allows its users to specify programs using a set of common matrix abstractions and linear algebraic operations, similar to R or MATLAB. Samsara then compiles, optimizes and executes these programs on distributed dataflow systems. The aim of Samsara is to allow mathematicians and data scientists to leverage the scalability of distributed dataflow systems via common declarative abstractions, while drastically reducing the need for detailed knowledge of the programming model and execution scheme of the underlying systems. Samsara is part of the Apache Mahout library and supports backends such as Apache Spark and Apache Flink. In this paper, we introduce the concepts of Samsara, showcase its compilation and optimization techniques using a simple example, and experimentally evaluate some of the benefits of these optimizations.

Machine Learning Systems Workshop at the Conference on Neural Information Processing Systems (NIPS)