Apache Flink: Announcing Flink 0.9.0-milestone1 preview release

13 Apr 2015

The Apache Flink community is pleased to announce the availability of the 0.9.0-milestone-1 release. The release is a preview of the upcoming 0.9.0 release. It contains many new features which will be available in the upcoming 0.9 release. Interested users are encouraged to try it out and give feedback. As the version number indicates, this release is a preview release that contains known issues.

You can download the release here and check out the latest documentation here. Feedback through the Flink mailing lists is, as always, very welcome!

New Features

Table API

Flink’s new Table API offers a higher-level abstraction for interacting with structured data sources. The Table API allows users to execute logical, SQL-like queries on distributed data sets while allowing them to freely mix declarative queries with regular Flink operators. Here is an example that groups and joins two tables:

val clickCounts = clicks
  .groupBy('user).select('userId, 'url.count as 'count)

val activeUsers = users.join(clickCounts)
  .where('id === 'userId && 'count > 10).select('username, 'count, ...)

Tables consist of logical attributes that can be selected by name rather than physical Java and Scala data types. This alleviates a lot of boilerplate code for common ETL tasks and raises the abstraction for Flink programs. Tables are available for both static and streaming data sources (DataSet and DataStream APIs).

Check out the Table guide for Java and Scala here.

Gelly Graph Processing API

Gelly is a Java Graph API for Flink. It contains a set of utilities for graph analysis, support for iterative graph processing and a library of graph algorithms. Gelly exposes a Graph data structure that wraps DataSets for vertices and edges, as well as methods for creating graphs from DataSets, graph transformations and utilities (e.g., in- and out- degrees of vertices), neighborhood aggregations, iterative vertex-centric graph processing, as well as a library of common graph algorithms, including PageRank, SSSP, label propagation, and community detection.

Gelly internally builds on top of Flink’s delta iterations. Iterative graph algorithms are executed leveraging mutable state, achieving similar performance with specialized graph processing systems.

Gelly will eventually subsume Spargel, Flink’s Pregel-like API. Check out the Gelly guide here.

Flink Machine Learning Library

This release includes the first version of Flink’s Machine Learning library. The library’s pipeline approach, which has been strongly inspired by scikit-learn’s abstraction of transformers and estimators, makes it easy to quickly set up a data processing pipeline and to get your job done.

Flink distinguishes between transformers and learners. Transformers are components which transform your input data into a new format allowing you to extract features, cleanse your data or to sample from it. Learners on the other hand constitute the components which take your input data and train a model on it. The model you obtain from the learner can then be evaluated and used to make predictions on unseen data.

Currently, the machine learning library contains transformers and learners to do multiple tasks. The library supports multiple linear regression using a stochastic gradient implementation to scale to large data sizes. Furthermore, it includes an alternating least squares (ALS) implementation to factorizes large matrices. The matrix factorization can be used to do collaborative filtering. An implementation of the communication efficient distributed dual coordinate ascent (CoCoA) algorithm is the latest addition to the library. The CoCoA algorithm can be used to train distributed soft-margin SVMs.

Flink on YARN leveraging Apache Tez

We are introducing a new execution mode for Flink to be able to run restricted Flink programs on top of Apache Tez. This mode retains Flink’s APIs, optimizer, as well as Flink’s runtime operators, but instead of wrapping those in Flink tasks that are executed by Flink TaskManagers, it wraps them in Tez runtime tasks and builds a Tez DAG that represents the program.

By using Flink on Tez, users have an additional choice for an execution platform for Flink programs. While Flink’s distributed runtime favors low latency, streaming shuffles, and iterative algorithms, Tez focuses on scalability and elastic resource usage in shared YARN clusters.

Get started with Flink on Tez here.

Reworked Distributed Runtime on Akka

Flink’s RPC system has been replaced by the widely adopted Akka framework. Akka’s concurrency model offers the right abstraction to develop a fast as well as robust distributed system. By using Akka’s own failure detection mechanism the stability of Flink’s runtime is significantly improved, because the system can now react in proper form to node outages. Furthermore, Akka improves Flink’s scalability by introducing asynchronous messages to the system. These asynchronous messages allow Flink to be run on many more nodes than before.

Exactly-once processing on Kafka Streaming Sources

This release introduces stream processing with exacly-once delivery guarantees for Flink streaming programs that analyze streaming sources that are persisted by Apache Kafka. The system is internally tracking the Kafka offsets to ensure that Flink can pick up data from Kafka where it left off in case of an failure.

Read here on how to use the persistent Kafka source.

Improved YARN support

Flink’s YARN client contains several improvements, such as a detached mode for starting a YARN session in the background, the ability to submit a single Flink job to a YARN cluster without starting a session, including a “fire and forget” mode. Flink is now also able to reallocate failed YARN containers to maintain the size of the requested cluster. This feature allows to implement fault-tolerant setups on top of YARN. There is also an internal Java API to deploy and control a running YARN cluster. This is being used by system integrators to easily control Flink on YARN within their Hadoop 2 cluster.

See the YARN docs here.

More Improvements and Fixes

FLINK-1605: Flink is not exposing its Guava and ASM dependencies to Maven projects depending on Flink. We use the maven-shade-plugin to relocate these dependencies into our own namespace. This allows users to use any Guava or ASM version.
FLINK-1417: Automatic recognition and registration of Java Types at Kryo and the internal serializers: Flink has its own type handling and serialization framework falling back to Kryo for types that it cannot handle. To get the best performance Flink is automatically registering all types a user is using in their program with Kryo.Flink also registers serializers for Protocol Buffers, Thrift, Avro and YodaTime automatically. Users can also manually register serializers to Kryo (https://issues.apache.org/jira/browse/FLINK-1399)
FLINK-1296: Add support for sorting very large records
FLINK-1679: “degreeOfParallelism” methods renamed to “parallelism”
FLINK-1501: Add metrics library for monitoring TaskManagers
FLINK-1760: Add support for building Flink with Scala 2.11
FLINK-1648: Add a mode where the system automatically sets the parallelism to the available task slots
FLINK-1622: Add groupCombine operator
FLINK-1589: Add option to pass Configuration to LocalExecutor
FLINK-1504: Add support for accessing secured HDFS clusters in standalone mode
FLINK-1478: Add strictly local input split assignment
FLINK-1512: Add CsvReader for reading into POJOs.
FLINK-1461: Add sortPartition operator
FLINK-1450: Add Fold operator to the Streaming api
FLINK-1389: Allow setting custom file extensions for files created by the FileOutputFormat
FLINK-1236: Add support for localization of Hadoop Input Splits
FLINK-1179: Add button to JobManager web interface to request stack trace of a TaskManager
FLINK-1105: Add support for locally sorted output
FLINK-1688: Add socket sink
FLINK-1436: Improve usability of command line interface