The rising need for custom machine learning algorithms that scale to large data sizes on distributed runtime platforms poses significant productivity challenges for data scientists. To address this, I will present the philosophy and architecture behind Apache SystemML. It enables declarative machine learning by letting data scientists specify their analytics using higher-level linear algebraic and mathematical constructs. These analytics scripts are transparently compiled into efficient execution plans that run on modern distributed data-parallel frameworks such as Apache Spark, Apache Hadoop, and Apache Flink.
- Domain-specific custom analytics: through an ML language rather than a library of pre-canned implementations
- High productivity for data scientists: through specification of what needs to be done instead of how it needs to be done
- Write-once, run-everywhere: with support for single-node, in-memory, and distributed computation platforms
- Automatic optimization and parallelization: through cost-based execution plans based on data, computation, and system characteristics
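
To make the last two points concrete, here is a toy sketch of a cost-based backend choice: the same script-level operation is routed to a single-node or a distributed plan depending on the estimated size of its inputs. This is not SystemML's optimizer or API. The names `MatrixStats`, `estimate_bytes`, and `choose_backend`, the 8-bytes-per-non-zero size model, and the 2 GiB memory budget are all assumptions made for illustration.

```python
from dataclasses import dataclass

BYTES_PER_DOUBLE = 8  # assumption: values are stored as 64-bit floats


@dataclass(frozen=True)
class MatrixStats:
    """Hypothetical metadata a compiler might know about an input matrix."""
    rows: int
    cols: int
    sparsity: float  # fraction of non-zero cells, in (0, 1]


def estimate_bytes(m: MatrixStats) -> int:
    """Crude size estimate: one double per non-zero cell.

    Ignores index overhead of sparse formats; a real cost model is richer.
    """
    return int(m.rows * m.cols * m.sparsity * BYTES_PER_DOUBLE)


def choose_backend(inputs: list[MatrixStats], memory_budget_bytes: int) -> str:
    """Pick an execution backend from data and system characteristics.

    If all inputs are estimated to fit in the local memory budget, run on a
    single node; otherwise fall back to a distributed plan.
    """
    total = sum(estimate_bytes(m) for m in inputs)
    return "single-node" if total <= memory_budget_bytes else "distributed"


# Hypothetical system characteristic: 2 GiB of usable local memory.
BUDGET = 2 * 1024**3

small_dense = MatrixStats(rows=10_000, cols=100, sparsity=1.0)
large_dense = MatrixStats(rows=1_000_000_000, cols=100, sparsity=1.0)
large_sparse = MatrixStats(rows=1_000_000_000, cols=100, sparsity=0.001)

for name, stats in [("small_dense", small_dense),
                    ("large_dense", large_dense),
                    ("large_sparse", large_sparse)]:
    print(name, "->", choose_backend([stats], BUDGET))
```

The script author never names a backend. The decision follows from the data statistics (dimensions, sparsity) and the system characteristic (memory budget), which is the sense in which one specification can run on different platforms.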