-
Notifications
You must be signed in to change notification settings - Fork 29
Benchmarks: Dsl.scala vs Monix vs Cats Effect vs Scalaz Concurrent vs Scala Async vs Scala Continuation
We created some benchmarks to evaluate the computational performance of code generated by our compiler plug-in, especially, we are interesting how our !-notation and other direct style DSL affect the performance in an effect system that support both asynchronous and synchronous effects.
In spite of keywords of adapters to monads or other effect systems (see domain.cats and domain.scalaz), the preferred effect system for Dsl.scala is Task, the type alias of vanilla continuation-passing style function, defined as:
type Continuation[Domain, Value] = (Value => Domain) => Domain
type !![Domain, Value] = Continuation[Domain, Value]
type TaskDomain = TailRec[Unit] !! Throwable
// Use this `TaskDomain` as the answer type of `Task`,
// for the ability of tail call optimization and exception handling
type Task[Value] = TaskDomain !! Value
Our benchmarks measured the performance of Task
in Dsl.scala, along with other combination of effect system with direct style DSL, listed in the following table:
Effect System | direct style DSL |
---|---|
Dsl.scala Task, an alias of vanilla continuation-passing style functions | !-notation provided by Dsl.scala |
Scala Future | Scala Async |
Scala Continuation library | Scala Continuation compiler plug-in |
Monix tasks |
for comprehension |
Cats effects |
for comprehension |
Scalaz Concurrent |
for comprehension |
The purpose of the first benchmark is to determine the performance of recursive functions in various effect system, especially when a direct style DSL is used.
In order to measure the performance impact due to direct style DSLs, we have to measure the performance baseline of different effect systems at first. We created some benchmarks for the most efficient implementation of a sum function in each effect system. These benchmarks perform the following computation:
- Creating a
List[X[Int]]
of 1000 tasks, whereX
is the data type of task in the effect system. - Performing recursive right-associated "binds" on each element to add
the
Int
to an accumulator, and finally produce aX[Int]
as a task of the sum result. - Running the task and blocking awaiting the result.
Note that the tasks in the list is executed in the current thread or in a thread pool. We keep each task returning a simple pure value, because we want to measure the overhead of effect systems, not the task itself.
The “bind” operation means the primitive operation of each effect
system. For Monix tasks, Cats effects, Scalaz Concurrent and Scala
Continuations library, the “bind” operation is flatMap
; for
Dsl.scala, the “bind” operation is cpsApply
, which may or may not be
equivalent to flatMap
according to the type of the current domain.
We use the !-notation to perform the cpsApply
in Dsl.scala. The
!-notation results the exact same Java bytecode to manually passing a
callback function to cpsApply
:
// The most efficient implementation of sum in Dsl.scala
def loop(tasks: List[Task[Int]], accumulator: Int = 0)(callback: Int => TaskDomain): TaskDomain = {
tasks match {
case head :: tail =>
// The same performance as: `head.cpsApply(i => loop(tail, i + accumulator)(callback))`
loop(tail, !head + accumulator)(callback)
case Nil =>
callback(accumulator)
}
}
However, direct style DSLs for other effect systems are not used in
favor of raw flatMap
calls, in case of decay of the performance.
The following code block shows the benchmark code for
Scala Futures. The code for all the other effect systems are similar to
it.
// The most efficient implementation of sum in Scala Futures
def loop(tasks: List[Future[Int]], accumulator: Int = 0): Future[Int] = {
tasks match {
case head :: tail =>
head.flatMap { i =>
loop(tail, i + accumulator)
}
case Nil =>
Future.successful(accumulator)
}
}
The benchmark result is shown below (larger score is better):
executedIn |
size |
Score, ops/s |
|
---|---|---|---|
RawSum.cats |
thread-pool |
1000 |
799.072 ±3.094 |
RawSum.cats |
current-thread |
1000 |
26932.907 ±845.715 |
RawSum.dsl |
thread-pool |
1000 |
729.947 ±4.359 |
RawSum.dsl |
current-thread |
1000 |
31161.171 ±589.935 |
RawSum.future |
thread-pool |
1000 |
575.403 ±3.567 |
RawSum.future |
current-thread |
1000 |
876.377 ±8.525 |
RawSum.monix |
thread-pool |
1000 |
743.340 ±11.314 |
RawSum.monix |
current-thread |
1000 |
55421.452 ±251.530 |
RawSum.scalaContinuation |
thread-pool |
1000 |
808.671 ±3.917 |
RawSum.scalaContinuation |
current-thread |
1000 |
17391.684 ±385.138 |
RawSum.scalaz |
thread-pool |
1000 |
722.743 ±11.234 |
RawSum.scalaz |
current-thread |
1000 |
15895.606 ±235.992 |
The benchmark result of sum for performance baseline
The Task
alias of continuation-passing style function used with
Dsl.scala is quite fast. Dsl.scala, Monix and Cats Effects score on
top 3 positions for either tasks running in the current thread or in a
thread pool.
In this section, we will present the performance impact when different
syntax notations are introduced. For vanilla CPS functions, we added one
more !-notation to avoid manually passing the callback
in the
previous benchmark. For other effect
systems, we refactored the previous sum benchmarks to use Scala Async,
Scala Continuation’s @cps
annotations, and for
comprehension,
respectively:
// Left associated sum in Dsl.scala
def loop(tasks: List[Task[Int]]): Task[Int] = _ {
tasks match {
case head :: tail =>
!head + !loop(tail)
case Nil =>
0
}
}
// Right associated sum in Dsl.scala
def loop(tasks: List[Task[Int]], accumulator: Int = 0): Task[Int] = _ {
tasks match {
case head :: tail =>
!loop(tail, !head + accumulator)
case Nil =>
accumulator
}
}
// Left associated sum in Scala Future
def loop(tasks: List[Future[Int]]): Future[Int] = async {
tasks match {
case head :: tail =>
await(head) + await(loop(tail))
case Nil =>
0
}
}
// Right associated sum in Scala Future
def loop(tasks: List[Future[Int]], accumulator: Int = 0): Future[Int] = async {
tasks match {
case head :: tail =>
await(loop(tail, await(head) + accumulator))
case Nil =>
accumulator
}
}
// Left associated sum in Scala Continuation
def loop(tasks: List[() => Int @suspendable]): Int @suspendable = {
tasks match {
case head :: tail =>
head() + loop(tail)
case Nil =>
0
}
}
// Right associated sum in Scala Continuation
def loop(tasks: List[() => Int @suspendable], accumulator: Int = 0): Int @suspendable = {
tasks match {
case head :: tail =>
loop(tail, head() + accumulator)
case Nil =>
accumulator
}
}
// Left associated sum in Scalaz Concurrent
def loop(tasks: List[Task[Int]]): Task[Int] = {
tasks match {
case head :: tail =>
for {
i <- head
accumulator <- loop(tail)
} yield i + accumulator
case Nil =>
Task(0)
}
}
// Right associated sum in Scalaz Concurrent
def loop(tasks: List[Task[Int]], accumulator: Int = 0): Task[Int] = {
tasks match {
case head :: tail =>
for {
i <- head
r <- loop(tail, i + accumulator)
} yield r
case Nil =>
Task.now(accumulator)
}
}
Note that reduced sum can be implemented in either left-associated recursion or right-associated recursion. The above code contains benchmark for both cases. The benchmark result is shown below:
executedIn |
size |
Score, ops/s |
|
---|---|---|---|
LeftAssociatedSum.cats |
thread-pool |
1000 |
707.940 ±10.497 |
LeftAssociatedSum.cats |
current-thread |
1000 |
16165.442 ±298.072 |
LeftAssociatedSum.dsl |
thread-pool |
1000 |
729.122 ±7.492 |
LeftAssociatedSum.dsl |
current-thread |
1000 |
19856.493 ±386.225 |
LeftAssociatedSum.future |
thread-pool |
1000 |
339.415 ±1.486 |
LeftAssociatedSum.future |
current-thread |
1000 |
410.785 ±1.535 |
LeftAssociatedSum.monix |
thread-pool |
1000 |
742.836 ±9.904 |
LeftAssociatedSum.monix |
current-thread |
1000 |
19976.847 ±84.222 |
LeftAssociatedSum.scalaContinuation |
thread-pool |
1000 |
657.721 ±9.453 |
LeftAssociatedSum.scalaContinuation |
current-thread |
1000 |
15103.883 ±255.780 |
LeftAssociatedSum.scalaz |
thread-pool |
1000 |
670.725 ±8.957 |
LeftAssociatedSum.scalaz |
current-thread |
1000 |
5113.980 ±110.272 |
The benchmark result of left-associated sum in direct style DSLs
executedIn |
size |
Score, ops/s |
|
---|---|---|---|
RightAssociatedSum.cats |
thread-pool |
1000 |
708.441 ±9.201 |
RightAssociatedSum.cats |
current-thread |
1000 |
15971.331 ±315.063 |
RightAssociatedSum.dsl |
thread-pool |
1000 |
758.152 ±4.600 |
RightAssociatedSum.dsl |
current-thread |
1000 |
22393.280 ±677.752 |
RightAssociatedSum.future |
thread-pool |
1000 |
338.471 ±2.188 |
RightAssociatedSum.future |
current-thread |
1000 |
405.866 ±2.843 |
RightAssociatedSum.monix |
thread-pool |
1000 |
736.533 ±10.856 |
RightAssociatedSum.monix |
current-thread |
1000 |
21687.351 ±107.249 |
RightAssociatedSum.scalaContinuation |
thread-pool |
1000 |
654.749 ±7.983 |
RightAssociatedSum.scalaContinuation |
current-thread |
1000 |
12080.619 ±274.878 |
RightAssociatedSum.scalaz |
thread-pool |
1000 |
676.180 ±7.705 |
RightAssociatedSum.scalaz |
current-thread |
1000 |
7911.779 ±79.296 |
The benchmark result of right-associated sum in direct style DSLs
The result demonstrates that the !-notation provided by Dsl.scala is
faster than all other direct style DSLs in the right-associated sum
benchmark. The Dsl.scala version sum consumes a constant number of
memory during the loop, because we implemented a tail-call detection in
our CPS-transform compiler plug-in, and the Dsl
interpreter for Task
use a trampoline technique . On the other hand, the benchmark result of
Monix Tasks, Cats Effects and Scalaz Concurrent posed a significant
performance decay, because they costs O(n) memory due to the map
call
generated by for
comprehension, although those effect systems also
built in trampolines. In general, the performance of recursive monadic
binds in a for
comprehension is always underoptimized due to the
inefficient map
.
The previous sum benchmarks measured the performance of manually written loops, but usually we may want to use higher-ordered functions to manipulate collections. We want to know how those higher-ordered functions can be expressed in direct style DSLs, and how would the performance be affected by direct style DSLs.
In this section, we will present the benchmark result for computing the Cartesian product of lists.
As we did in sum benchmarks, we created some benchmarks to maximize the
performance for Cartesian product. Our benchmarks create the Cartesian
product from traverseM
for Scala Future, Cats Effect, Scalaz
Concurrent and Monix Tasks. The following code block shows the
benchmark code for Scala Future.
// Cartesian product in Scala Future, separated to two functions
import scala.concurrent.Future
import scalaz.std.list._
import scalaz.std.scalaFuture._
import scalaz.syntax.all._
def cellTask(taskX: Future[Int], taskY: Future[Int]): Future[List[Int]] = async {
List(await(taskX), await(taskY))
}
def listTask(rows: List[Future[Int]], columns: List[Future[Int]]): Future[List[Int]] = {
rows.traverseM { taskX =>
columns.traverseM { taskY =>
cellTask(taskX, taskY)
}
}
}
Scala Async or for
comprehension is used in element-wise task
cellTask
, but the collection manipulation listTask
is kept as
manually written higher order function calls, because neither Scala
Async nor for
comprehension supports traverseM
.
The benchmark for Dsl.scala is entirely written in !-notation:
// Cartesian product in Dsl.scala, separated to two functions
def cellTask(taskX: Task[Int], taskY: Task[Int]): Task[List[Int]] = _ {
List(!taskX, !taskY)
}
def listTask(rows: List[Task[Int]], columns: List[Task[Int]]): Task[List[Int]] = {
cellTask(!Each(rows), !Each(columns))
}
The Each
keyword is available here because it is adaptive. Each
keyword can
be used in not only List[_]
domain, but also (_ !! Coll[_])
domain
as long as Coll
is a Scala collection type that supports
CanBuildFrom
type class.
We didn’t benchmark Scala Continuation here because all higher ordered
functions for List
do not work with Scala Continuation.
The benchmark result is shown below:
executedIn |
size |
Score, ops/s |
|
---|---|---|---|
RawCartesianProduct.cats |
thread-pool |
50 |
136.415 ±1.939 |
RawCartesianProduct.cats |
current-thread |
50 |
1346.874 ±7.475 |
RawCartesianProduct.dsl |
thread-pool |
50 |
140.098 ±2.062 |
RawCartesianProduct.dsl |
current-thread |
50 |
1580.876 ±27.513 |
RawCartesianProduct.future |
thread-pool |
50 |
100.340 ±1.894 |
RawCartesianProduct.future |
current-thread |
50 |
93.678 ±1.829 |
RawCartesianProduct.monix |
thread-pool |
50 |
142.071 ±1.299 |
RawCartesianProduct.monix |
current-thread |
50 |
1750.869 ±18.365 |
RawCartesianProduct.scalaz |
thread-pool |
50 |
78.588 ±0.623 |
RawCartesianProduct.scalaz |
current-thread |
50 |
357.357 ±2.102 |
The benchmark result of Cartesian product for performance baseline
Monix tasks, Cats Effects and vanilla CPS functions created from Dsl.scala are still the top 3 scored effect systems.
We then refactored the benchmarks to direct style DSLs. The following code blocks are DSLs for
Scala Future, written in ListT
monad transformer provided by Scalaz.
The benchmarks for Monix tasks, Scalaz Concurrent are also rewritten in
the similar style.
import _root_.scalaz.syntax.all._
import _root_.scalaz.ListT
import _root_.scalaz.std.scalaFuture._
// Cartesian product for based on `ListT` transformer, in one function
def listTask(rows: List[Future[Int]], columns: List[Future[Int]]): Future[List[Int]] = {
for {
taskX <- ListT(Future.successful(rows))
taskY <- ListT(Future.successful(columns))
x <- taskX.liftM[ListT]
y <- taskY.liftM[ListT]
r <- ListT(Future.successful(List(x, y)))
} yield r
}.run
With the help of ListT
monad transformer, we are able to merge
cellTask
and listTask
into one function in a direct style for
comprehension, avoiding any manual written callback functions.
We also merged cellTask
and listTask
in the Dsl.scala version of
benchmark:
// Cartesian product in Dsl.scala, in one function
def listTask: Task[List[Int]] = reset {
List(!(!Each(inputDslTasks)), !(!Each(inputDslTasks)))
}
This time, Cats Effects are not benchmarked due to lack of ListT
in
Cats. The benchmark result are shown in Table.
executedIn |
size |
Score, ops/s |
|
---|---|---|---|
CartesianProduct.dsl |
thread-pool |
50 |
283.450 ±3.042 |
CartesianProduct.dsl |
current-thread |
50 |
1884.514 ±47.792 |
CartesianProduct.future |
thread-pool |
50 |
91.233 ±1.333 |
CartesianProduct.future |
current-thread |
50 |
150.234 ±20.396 |
CartesianProduct.monix |
thread-pool |
50 |
28.597 ±0.265 |
CartesianProduct.monix |
current-thread |
50 |
120.068 ±17.676 |
CartesianProduct.scalaz |
thread-pool |
50 |
31.110 ±0.662 |
CartesianProduct.scalaz |
current-thread |
50 |
87.404 ±1.734 |
The benchmark result of Cartesian product in direct style DSLs
Despite the trivial manual lift calls in for
comprehension, the monad
transformer approach causes terrible computational performance in
comparison to manually called traverseM
. In contrast, the performance
of Dsl.scala even got improved when cellTask
is inlined into
listTask
.
Task
in Dsl.scala is a simple vanilla continuation-passing style function, whose performance is comparable to Monix and Cats Effect.
When using a direct style DSL, our !-bang notation is the fastest implementation among for
-comprehension, Scala Async, and Scala Continuation. Especially, when performing a complex task to manipulate collections, our !-notation can be 12.5 times faster than for
comprehension when running in current thread, and more than 3.1 times faster than for
comprehension when running in a thread pool, despite conciser syntax that we provided.
In brief, Dsl.scala provides a concise, efficient and extensible approach to perform different types of effects at once, as a replacement of monad transformers, Eff monad and AutoLifts.
(The source code of these benchmarks can be found at https://github.com/ThoughtWorksInc/Dsl.scala/blob/8e7ab03/benchmarks/src/main/scala/com/thoughtworks/dsl/benchmarks.scala)
Recently, I reran those benchmarks on GraalVM. And this is the result (larger score is better):
Benchmark (executedIn) (size) Mode Cnt Score Error Units
CartesianProduct.dsl thread-pool 50 thrpt 25 381.073 ? 2.799 ops/s
CartesianProduct.dsl current-thread 50 thrpt 25 3804.440 ? 85.645 ops/s
CartesianProduct.future thread-pool 50 thrpt 25 429.282 ? 3.807 ops/s
CartesianProduct.future current-thread 50 thrpt 25 485.513 ? 10.014 ops/s
CartesianProduct.monix thread-pool 50 thrpt 25 88.741 ? 2.505 ops/s
CartesianProduct.monix current-thread 50 thrpt 25 485.425 ? 10.327 ops/s
CartesianProduct.scalaz thread-pool 50 thrpt 25 82.574 ? 2.669 ops/s
CartesianProduct.scalaz current-thread 50 thrpt 25 405.065 ? 14.329 ops/s
LeftAssociatedSum.cats thread-pool 1000 thrpt 25 1202.644 ? 23.662 ops/s
LeftAssociatedSum.cats current-thread 1000 thrpt 25 27250.554 ? 948.168 ops/s
LeftAssociatedSum.dsl thread-pool 1000 thrpt 25 1547.312 ? 15.828 ops/s
LeftAssociatedSum.dsl current-thread 1000 thrpt 25 42525.837 ? 1739.524 ops/s
LeftAssociatedSum.future thread-pool 1000 thrpt 25 642.385 ? 11.138 ops/s
LeftAssociatedSum.future current-thread 1000 thrpt 25 766.983 ? 11.002 ops/s
LeftAssociatedSum.monix thread-pool 1000 thrpt 25 883.240 ? 15.565 ops/s
LeftAssociatedSum.monix current-thread 1000 thrpt 25 21703.792 ? 451.721 ops/s
LeftAssociatedSum.scalaContinuation thread-pool 1000 thrpt 25 1338.400 ? 7.799 ops/s
LeftAssociatedSum.scalaContinuation current-thread 1000 thrpt 25 53880.878 ? 1116.539 ops/s
LeftAssociatedSum.scalaz thread-pool 1000 thrpt 25 847.057 ? 10.291 ops/s
LeftAssociatedSum.scalaz current-thread 1000 thrpt 25 6222.248 ? 625.228 ops/s
RawCartesianProduct.cats thread-pool 50 thrpt 25 210.906 ? 2.668 ops/s
RawCartesianProduct.cats current-thread 50 thrpt 25 2326.546 ? 32.516 ops/s
RawCartesianProduct.dsl thread-pool 50 thrpt 25 249.796 ? 6.692 ops/s
RawCartesianProduct.dsl current-thread 50 thrpt 25 4505.705 ? 71.748 ops/s
RawCartesianProduct.future thread-pool 50 thrpt 25 73.815 ? 2.765 ops/s
RawCartesianProduct.future current-thread 50 thrpt 25 84.541 ? 2.742 ops/s
RawCartesianProduct.monix thread-pool 50 thrpt 25 159.687 ? 3.617 ops/s
RawCartesianProduct.monix current-thread 50 thrpt 25 2834.124 ? 64.134 ops/s
RawCartesianProduct.scalaz thread-pool 50 thrpt 25 111.042 ? 3.088 ops/s
RawCartesianProduct.scalaz current-thread 50 thrpt 25 686.909 ? 18.852 ops/s
RawSum.cats thread-pool 1000 thrpt 25 1425.520 ? 24.974 ops/s
RawSum.cats current-thread 1000 thrpt 25 62939.245 ? 896.369 ops/s
RawSum.dsl thread-pool 1000 thrpt 25 1728.368 ? 15.419 ops/s
RawSum.dsl current-thread 1000 thrpt 25 90190.665 ? 1347.352 ops/s
RawSum.future thread-pool 1000 thrpt 25 1105.385 ? 10.313 ops/s
RawSum.future current-thread 1000 thrpt 25 1665.301 ? 20.220 ops/s
RawSum.monix thread-pool 1000 thrpt 25 1054.701 ? 12.500 ops/s
RawSum.monix current-thread 1000 thrpt 25 64029.761 ? 1010.878 ops/s
RawSum.scalaContinuation thread-pool 1000 thrpt 25 1603.568 ? 25.196 ops/s
RawSum.scalaContinuation current-thread 1000 thrpt 25 118392.309 ? 2401.650 ops/s
RawSum.scalaz thread-pool 1000 thrpt 25 1162.216 ? 24.588 ops/s
RawSum.scalaz current-thread 1000 thrpt 25 36355.420 ? 967.373 ops/s
RightAssociatedSum.cats thread-pool 1000 thrpt 25 1222.784 ? 7.624 ops/s
RightAssociatedSum.cats current-thread 1000 thrpt 25 28149.466 ? 421.281 ops/s
RightAssociatedSum.dsl thread-pool 1000 thrpt 25 1581.496 ? 10.579 ops/s
RightAssociatedSum.dsl current-thread 1000 thrpt 25 52651.555 ? 2381.584 ops/s
RightAssociatedSum.future thread-pool 1000 thrpt 25 641.212 ? 4.255 ops/s
RightAssociatedSum.future current-thread 1000 thrpt 25 786.968 ? 18.144 ops/s
RightAssociatedSum.monix thread-pool 1000 thrpt 25 915.910 ? 13.983 ops/s
RightAssociatedSum.monix current-thread 1000 thrpt 25 25382.029 ? 455.675 ops/s
RightAssociatedSum.scalaContinuation thread-pool 1000 thrpt 25 1317.374 ? 10.588 ops/s
RightAssociatedSum.scalaContinuation current-thread 1000 thrpt 25 53417.944 ? 592.979 ops/s
RightAssociatedSum.scalaz thread-pool 1000 thrpt 25 871.240 ? 10.063 ops/s
RightAssociatedSum.scalaz current-thread 1000 thrpt 25 11928.049 ? 281.798 ops/s
All benchmarks got improved on GraalVM in comparison to OpenJDK. However, for those case class based effect systems, including Monix, Cats Effects and Scalaz Concurrent, the performance improvement is relatively small. While, for those CPS function based effect systems, including Scala Continuations and Dsl.scala, the performance improvement is huge.
The difference between Scala Continuations and Dsl.scala is tail call optimization. Dsl.scala internally uses the trampoline technology on !-notation; Scala Continuations performs stack-unsafe calls.
Those case classes in Monix/Cats Effects/Scalaz can improve performance for OpenJDK by avoiding megamorphic calls. However, this benchmark suggests that any case classes other than Async
are unnecessary in the future, since the pattern matching on case classes actually prevents GraalVM from further optimization.