Finch Performance Lessons (Vladimir Kostyukov, 2017-03-01)
<p>With respect to <a href="http://kostyukov.net/posts/finagle-101/">a good tradition</a>, I’m publishing a blog post based on my <a href="http://kostyukov.net/slides/finch-tokyo/">ScalaMatsuri talk</a> covering a few performance lessons I learned working on <a href="https://github.com/finagle/finch">Finch</a>. Even though these lessons come from a foundation library, I tend to believe they can be projected onto any Scala (or even Java) codebase, no matter if it’s an HTTP library or a user-facing application.</p>
<h3 id="10">10%</h3>
<p>When it comes to throughput, <a href="http://kostyukov.net/posts/how-fast-is-finch/">Finch has been doing quite well</a> compared to other Scala HTTP libraries. However, an absolute metric like QPS isn’t necessarily the most interesting one. Historically, Finch has been measuring its performance (the throughput) in terms of the overhead it introduces on top of Finagle, its IO layer.</p>
<p>We’ve been constantly measuring this kind of overhead locally, running a <code class="highlighter-rouge">wrk</code> load test against two different instances: Finch and <a href="https://github.com/circe/circe">Circe</a> vs. Finagle and Jackson. This way we’re not only measuring Finch’s overhead, but also comparing two ends of the spectrum: type-level vs. runtime-level solutions.</p>
<p>Even though we’ve been doing this comparison for a while, no third party has ever tried to reproduce it until this <a href="https://www.techempower.com/benchmarks/#section=data-r13&hw=ph&test=json&l=4ftbsv">very recent round of the TechEmpower benchmark</a>.</p>
<p><img src="/images/finch-performance-lessons/framework-overhead.png" style="width: 800px;" /></p>
<p>According to the “Framework Overhead” comparison table (see above), Finch sustains about 90% of Finagle’s throughput, making its overhead only, or as much as, 10%. Obviously, this overhead will hardly be noticeable for most services, and yet it could easily be a deal breaker for under-provisioned applications. This is why it’s important to understand where these 10% are coming from and see if there is a way to reduce the gap.</p>
<h3 id="why-only-10">Why only 10%?</h3>
<p>In a typical Finch application lots of work happens at compile time (think of Circe’s generic derivation and Shapeless’ machinery empowering endpoints). It turns out it might be a good deal to trade off compilation time for a (usually) safer and cheaper runtime.</p>
<p>As far as the JSON encoding/decoding goes, instead of inspecting each object at runtime to figure out how to properly encode/decode it, Circe’s encoders/decoders are already materialized once the program compiles and hence are ready for use at runtime. This not only makes codecs safer (it’s known at compile time whether they were derived properly) but also cheaper to use.</p>
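<p>To illustrate the idea (a toy sketch, not Circe’s actual machinery; all names here are made up): the compiler resolves an <code class="highlighter-rouge">Encoder</code> instance for a type once, at compile time, so a missing codec is a compile error, and encoding at runtime is a plain method call with no reflection.</p>

```scala
object JsonSketch {
  // A toy typeclass standing in for a compile-time-derived codec.
  trait Encoder[A] {
    def encode(a: A): String
  }

  object Encoder {
    implicit val intEncoder: Encoder[Int] = new Encoder[Int] {
      def encode(i: Int): String = i.toString
    }

    implicit val stringEncoder: Encoder[String] = new Encoder[String] {
      def encode(s: String): String = "\"" + s + "\""
    }

    // A "derived" instance: built by the compiler out of the instances
    // for the pair's components, with no runtime inspection.
    implicit def pairEncoder[A, B](implicit ea: Encoder[A], eb: Encoder[B]): Encoder[(A, B)] =
      new Encoder[(A, B)] {
        def encode(p: (A, B)): String = "[" + ea.encode(p._1) + "," + eb.encode(p._2) + "]"
      }
  }

  // Encoding a value of a type with no Encoder instance fails to compile.
  def toJson[A](a: A)(implicit e: Encoder[A]): String = e.encode(a)
}
```

<p>For example, <code class="highlighter-rouge">JsonSketch.toJson(("foo", 42))</code> compiles because instances for <code class="highlighter-rouge">String</code>, <code class="highlighter-rouge">Int</code>, and pairs exist, while encoding an unsupported type is rejected by the compiler rather than failing at runtime.</p>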
<p>Besides taking advantage of compile-time derived codecs for JSON, Finch always tries to get the most out of the setup it’s currently running on. This includes using the most efficient decoding/encoding strategy depending on which JSON library is wired in and/or which Netty version Finagle is using at the moment.</p>
<h3 id="why-as-much-as-10">Why as much as 10%?</h3>
<p>It’s not a surprise that composition comes with a cost in the form of allocations, and it’s somewhat fundamental. In the object-oriented setting, a new behavior is usually implemented as a class extending some old behavior. This way it’s possible to access the newly introduced logic by just instantiating this new class, with a penalty of a single allocation.</p>
<p>In the functional (or even purely functional) setting, on the other hand, in order to introduce a new behavior to an existing entity, the latter has to be instantiated anyway before it gets composed with some other entity making the changes. This literally doubles the number of allocations in a program promoting composition over inheritance.</p>
<h3 id="allocations-matter">Allocations Matter</h3>
<p>A high allocation rate itself isn’t a big problem on modern JVMs and, in fact, is quite often mitigated by the JIT automatically moving some allocations onto the stack. However, this is only possible for “local” and short-lived allocations that are never tenured. While it’s nearly impossible to tell which allocations will be eliminated (stack-allocated) by just looking at the source code, a good rule of thumb is to always consider the worst outcome and pretend that all allocations will go on the heap and live long enough to be compacted (copied over to prevent heap fragmentation). In other words, it’s generally a good idea to avoid all sorts of allocations as long as it doesn’t regress the throughput. The bottom line is that a lower allocation rate will often pay back with less frequent and shorter GC pauses.</p>
<p>Allocation profiles are even more important for foundation libraries like Finch or Circe: libraries that always sit on an application’s hottest path, transforming a request into an appropriate response. Just a hundred bytes allocated on the request/response path will add up to several hundred megabytes of memory allocated and deallocated every second once a service’s QPS hits reasonably high numbers.</p>
<h3 id="allocating-less">Allocating Less</h3>
<p>Paying attention to the allocation profile is one of the essential skills needed for writing GC-friendly programs. Even though Scala code is quite hard to reason about performance-wise, it’s mostly easy to guesstimate how many bytes a given code structure is going to allocate. It just requires some practice made up of experiments with JMH’s <code class="highlighter-rouge">-prof gc</code> mode.</p>
<p>While this byte counting business may seem too low-level and possibly worthless, the idea of allocating less scales really well from the scope of a single function to the entire program. We’ll see later in this post how both local and global optimizations of the allocation profile, recently made in Finch, help to improve the throughput.</p>
<h3 id="composing-less">Composing Less</h3>
<p>Intuitively, one way of saving allocations is giving up on composition. This doesn’t necessarily mean dropping down to object-oriented concepts in the user-facing API, but rather revisiting the internal structures and making sure they don’t engage composition/allocations. Even though adopting ideas like inheritance, mutable arrays, and <code class="highlighter-rouge">while</code> loops is considered impure, it’s quite a popular trade-off to make in foundation libraries, including those promoting a purely-functional API.</p>
<p>When it comes to modeling an endpoint’s result, Finch has been using a type alias to an option indicating whether or not the endpoint matched a given input.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">type</span> <span class="kt">EndpointResult</span><span class="o">[</span><span class="kt">A</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Option</span><span class="o">[(</span><span class="kt">Input</span>, <span class="kt">Future</span><span class="o">[</span><span class="kt">Output</span><span class="o">[</span><span class="kt">A</span><span class="o">]])]</span></code></pre></figure>
<p>This worked perfectly well in the past given how idiomatic and easy to reason about it is. However, it introduces unnecessary overhead coming from an abstraction that’s way more powerful than needed. Like any other abstraction from the standard library, <code class="highlighter-rouge">Option</code> comes with a variety of combinators that promote an idiomatic usage pattern based on either for-comprehensions or <code class="highlighter-rouge">map</code> and <code class="highlighter-rouge">flatMap</code> variants. There is absolutely nothing wrong with those functions except for the cost they come with, which could be a deal breaker for performance-critical abstractions. Most of the time, mapping an option means allocating a closure that, depending on the number of arguments, may be quite expensive.</p>
<p>In addition to that, on the successful path (when an endpoint matches and a result is returned), it requires two allocations to get the result out the door: one for the inner <code class="highlighter-rouge">Tuple2</code> and one for the outer <code class="highlighter-rouge">Option</code>.</p>
<p>An alternative to the <code class="highlighter-rouge">Option</code> solution would be to hand-roll our own abstraction that basically acts as a flattened version of <code class="highlighter-rouge">Option[Tuple2[_, _]]</code> such that it only requires a single allocation to instantiate a successful result.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">sealed</span> <span class="k">abstract</span> <span class="k">class</span> <span class="nc">EndpointResult</span><span class="o">[</span><span class="kt">+A</span><span class="o">]</span>
<span class="k">case</span> <span class="k">object</span> <span class="nc">Skipped</span> <span class="k">extends</span> <span class="nc">EndpointResult</span><span class="o">[</span><span class="kt">Nothing</span><span class="o">]</span>
<span class="k">final</span> <span class="k">case</span> <span class="k">class</span> <span class="nc">Matched</span><span class="o">[</span><span class="kt">A</span><span class="o">](</span><span class="n">rem</span><span class="k">:</span> <span class="kt">Input</span><span class="o">,</span> <span class="n">out</span><span class="k">:</span> <span class="kt">Future</span><span class="o">[</span><span class="kt">Output</span><span class="o">[</span><span class="kt">A</span><span class="o">]])</span> <span class="k">extends</span> <span class="nc">EndpointResult</span><span class="o">[</span><span class="kt">A</span><span class="o">]</span></code></pre></figure>
<p>An important difference between these two approaches is in the signal their APIs send to users. An API that engages composition isn’t always appropriate as a means for an internal abstraction living on a hot execution path. A bare-bones ADT exposing nothing but pattern matching (which costs almost nothing in Scala) as its API can be way more suitable for this kind of business.</p>
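<p>A simplified sketch of how such a result is consumed (types are simplified here; the real <code class="highlighter-rouge">Input</code> and <code class="highlighter-rouge">Future[Output[A]]</code> are omitted): a bare pattern match with two branches, no closures allocated, and a single allocation on the matched path.</p>

```scala
object ResultSketch {
  sealed abstract class EndpointResult[+A]
  case object Skipped extends EndpointResult[Nothing]
  final case class Matched[A](rem: String, out: A) extends EndpointResult[A]

  // Consuming the result is a plain pattern match: no combinators,
  // no intermediate Option/Tuple2 allocations, just two branches
  // known at compile time.
  def handle[A](res: EndpointResult[A], default: A): A = res match {
    case Matched(_, out) => out
    case Skipped         => default
  }
}
```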
<p>As far as the benchmarking goes (see table below), saving around 56 bytes on a single <code class="highlighter-rouge">Endpoint.map</code> call makes for up to a 5% improvement in throughput (see <a href="https://github.com/finagle/finch/pull/707">#707</a>).</p>
<figure class="highlight"><pre><code class="language-raw" data-lang="raw"> ---------------------------------------------------------------------------------------------
TA: Type Alias | ADT: sealed abstract class | Running Time Mode
---------------------------------------------------------------------------------------------
MapBenchmark.mapAsyncTA avgt 429.113 ± 43.297 ns/op
MapBenchmark.mapAsyncADT avgt 407.126 ± 12.807 ns/op
MapBenchmark.mapOutputAsyncTA avgt 821.786 ± 52.045 ns/op
MapBenchmark.mapOutputAsyncADT avgt 777.654 ± 26.444 ns/op
MapBenchmark.mapAsyncTA:·gc.alloc.rate.norm avgt 776.000 ± 0.001 B/op
MapBenchmark.mapAsyncADT:·gc.alloc.rate.norm avgt 720.000 ± 0.001 B/op
MapBenchmark.mapOutputAsyncTA:·gc.alloc.rate.norm avgt 1376.001 ± 0.001 B/op
MapBenchmark.mapOutputAsyncADT:·gc.alloc.rate.norm avgt 1320.001 ± 0.001 B/op</code></pre></figure>
<p>Introducing a new abstraction/type into a domain is always a trade-off, and yet an easy one to make in this particular case. At the cost of a little maintenance burden, we get a less powerful and more performant abstraction that’s really hard to misuse.</p>
<h3 id="encodingdecoding-less">Encoding/Decoding Less</h3>
<p>Avoiding all sorts of allocations isn’t necessarily a local optimization; it can be applied globally to the scope of the entire program. Consider a typical HTTP application that serves JSON. Certainly, most of the allocations in such an application come from JSON decoding and encoding: instead of using the payload right away (in whatever form it is), we need to convert it into a JSON object first.</p>
<p>Presumably, JSON encoding and decoding aren’t something that can be easily avoided in an HTTP application exposing JSON APIs; this is a rightful workload for this kind of application. However, there are certain stages (involving allocations) within the data-transformation pipelines that might look mandatory and yet can be completely eliminated.</p>
<p>As far as the JSON decoding goes, there are at least two data-transformation stages involved. After getting the bytes off the wire, we typically convert them into a JSON string (a UTF-8 string) instead of shoving them right into a JSON parser. Whereas going from bytes to a string (i.e., <code class="highlighter-rouge">new String(bytes)</code>) may seem pretty cheap, it actually involves quite a lot of allocations along with the CPU time needed for a memory copy. Instead of wrapping a given byte array with a <code class="highlighter-rouge">String</code>, the JVM copies it over into a newly allocated <code class="highlighter-rouge">char</code> array of the same size, thereby doubling the allocations (a <code class="highlighter-rouge">char</code> takes 2 bytes on the JVM).</p>
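<p>The copying behavior is easy to observe: since <code class="highlighter-rouge">new String(bytes)</code> copies the source array rather than wrapping it, mutating the bytes afterwards doesn’t affect the string. A small sketch:</p>

```scala
object CopySketch {
  import java.nio.charset.StandardCharsets.UTF_8

  // new String(bytes) copies the array into a freshly allocated char[]:
  // mutating the source afterwards does not change the string, which
  // shows the string does not share storage with the byte array.
  def copiesOnConstruction(): (String, String) = {
    val bytes = "finch".getBytes(UTF_8)
    val s = new String(bytes, UTF_8)
    bytes(0) = 'm'.toByte          // mutate the source array after construction
    (s, new String(bytes, UTF_8))  // the first string is unaffected
  }
}
```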
<p>All of this sounds pretty frustrating given that, in most cases, the only reason we actually need a string (and not the bytes) is to satisfy the API of the JSON library used. The good news is that lots of modern JSON libraries allow skipping this unnecessary to-string conversion and parsing JSON objects directly from bytes (see below).</p>
<p><img src="/images/finch-performance-lessons/decoding.png" style="width: 800px;" /></p>
<p>As of Finch 0.11 (see <a href="https://github.com/finagle/finch/pull/671">#671</a>), for all the supported JSON libraries, the decoding of inbound payloads doesn’t involve any interim to-string conversions and is done in terms of bytes. In our end-to-end benchmark, this optimization alone accounts for a 13% improvement in throughput.</p>
<p>When it comes to micro-benchmarking, decoding from bytes instead of a string cuts both allocations and running time in half (see below) making it a pretty great deal given how small and simple the change is.</p>
<figure class="highlight"><pre><code class="language-raw" data-lang="raw"> ---------------------------------------------------------------------------------------------
S: parse string | BA: parse byte array | Running Time Mode
---------------------------------------------------------------------------------------------
JsonBenchmark.decodeS avgt 5950.402 ± 464.246 ns/op
JsonBenchmark.decodeBA avgt 3232.696 ± 171.160 ns/op
JsonBenchmark.decodeS:·gc.alloc.rate.norm avgt 7992.005 ± 12.749 B/op
JsonBenchmark.decodeBA:·gc.alloc.rate.norm avgt 4908.003 ± 6.374 B/op</code></pre></figure>
<p>A similar optimization is also possible on the outbound path. It might be worth trying to skip the unnecessary string representation and print directly into a byte array (see below). By analogy with converting a byte array into a string, converting a string into a byte array also involves a surprising amount of allocations. Because it’s not known beforehand how many bytes a given string is going to occupy, the JVM estimates it as <code class="highlighter-rouge">3 * string.length</code>, where 3 is <a href="https://docs.oracle.com/javase/7/docs/api/java/nio/charset/CharsetEncoder.html#maxBytesPerChar()">the maximum number of bytes needed for a single UTF-8 character</a>.</p>
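<p>The worst-case factor is reported by the JDK’s own encoder, and for ASCII-only payloads the actual encoded size is a third of that estimate, which is where the over-allocation comes from. A quick sketch:</p>

```scala
object EncoderSketch {
  import java.nio.charset.StandardCharsets.UTF_8

  // The worst-case estimate the JDK's UTF-8 encoder reports: up to
  // 3 bytes per char. Encoding may size its destination buffer as
  // 3 * string.length before trimming it to the actual length.
  def utf8MaxBytesPerChar: Float = UTF_8.newEncoder().maxBytesPerChar()

  // For an ASCII-only string, the actual encoded size is 1 byte per
  // char, a third of the worst-case buffer.
  def encodedLength(s: String): Int = s.getBytes(UTF_8).length
}
```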
<p><img src="/images/finch-performance-lessons/encoding.png" style="width: 800px;" /></p>
<p>Printing JSON directly into a byte array isn’t as common as parsing bytes, and only a couple of JSON libraries support it. As of Circe 0.7 (see <a href="https://github.com/circe/circe/pull/537">#537</a>) and Finch 0.12 (see <a href="https://github.com/finagle/finch/pull/717">#717</a>), it is the default JSON encoding mode for applications depending on <code class="highlighter-rouge">finch-circe</code> (including those using <a href="https://github.com/circe/circe-jackson">circe-jackson</a> for printing, see <a href="https://github.com/circe/circe-jackson/pull/11">#11</a>).</p>
<p>The encoding benchmark we run for JSON reports about a 30% drop in both allocations and running time when targeting byte arrays instead of strings (see below).</p>
<figure class="highlight"><pre><code class="language-raw" data-lang="raw"> ---------------------------------------------------------------------------------------------
S: print string | BA: print byte array | Running Time Mode
---------------------------------------------------------------------------------------------
JsonBenchmark.encodeS avgt 16400.327 ± 621.935 ns/op
JsonBenchmark.encodeBA avgt 12645.070 ± 391.591 ns/op
JsonBenchmark.encodeS:·gc.alloc.rate.norm avgt 46900.015 ± 19.123 B/op
JsonBenchmark.encodeBA:·gc.alloc.rate.norm avgt 30360.011 ± 0.001 B/op</code></pre></figure>
<p>The main point here is not that string-to-bytes and bytes-to-string conversions are expensive on the JVM (they are as cheap as they can be), but rather that they aren’t always necessary. The tricky part is that it often requires looking at the problem end-to-end to figure out which data transformations (as well as interim results) don’t add much value to the domain and can be eliminated.</p>
<h3 id="takeaways">Takeaways</h3>
<p>“This is slow” is one of the toughest problems to debug. Paying constant attention to the allocation profile, though, is quite a healthy habit that keeps the number of performance-related problems to a minimum. Despite all the great tools available for chasing down allocations (think of JMH’s <code class="highlighter-rouge">-prof gc</code>), none of them are going to tell us if there are any <a href="https://twitter.com/giltene/status/818258334327382017">shortcuts our application can take</a> to get the final result faster.</p>
Finagle 101 (Vladimir Kostyukov, 2016-05-12)
<p>This post is based on my talk <a href="http://event.scaladays.org/scaladays-nyc-2016#!#schedulePopupExtras-7565">“Finagle: Under the Hood”</a> that <a href="https://twitter.com/vkostyukov/status/730375788646772736">was presented</a>
at Scala Days NYC. I thought I’d publish this for those who prefer reading instead of watching (video
is not yet published anyway). The full <a href="http://vkostyukov.net/slides/finagle-101/">slide deck is available online</a> as well.</p>
<h3 id="what-finagle-is">What is Finagle?</h3>
<p><img src="/images/finagle-101/finagle-logo.png" style="width: 400px;" /></p>
<p>Finagle is an RPC system for JVM developed and used in production at Twitter. It’s written in Scala
but has a Java-compatible API for most of its components.</p>
<p>When it comes to describing what Finagle can do, I really like Alexey’s tweet from the last FinagleCon.</p>
<center>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">The key problem with <a href="https://twitter.com/hashtag/Finagle?src=hash">#Finagle</a> adoption that it solves tons of problems that you know nothing about until it's too late <a href="https://twitter.com/hashtag/FinagleCon?src=hash">#FinagleCon</a></p>— Alexey Kachayev (@kachayev) <a href="https://twitter.com/kachayev/status/631928029258772480">August 13, 2015</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
</center>
<p>The most important part here is: “Finagle solves tons of problems”. And it absolutely does.
There are many different things (that a user doesn’t know about) happening underneath a Finagle client
or server to make sure sessions are reliable enough. Finagle does a great job of tolerating all
kinds of session and transport failures, so its users usually notice neither the failures nor the
actions Finagle takes to tolerate them.</p>
<p>Finagle’s been around for quite a long time: <a href="https://twitter.com/finagle/status/567885703180259328">since 2010</a>. With 4.5k stars, it’s the 7th
most popular Scala project on GitHub. As of today, more than 15 protocols are implemented,
including our own multiplexing protocol <a href="http://twitter.github.io/finagle/guide/Protocols.html#mux">Mux</a>. Mux is a full-duplex, multiplexing protocol that
might be roughly viewed as a subset of HTTP/2, so it has low-level control messages for pings,
interrupts, and many more. We will see later how we utilize those signals within Finagle.</p>
<p>Finagle has quite a specific mission: to make RPC sessions fast, resilient, and easy to set up. Operating
on Layer 5 of the OSI model, Finagle knows almost nothing about the application or even the protocol
it’s used by. That’s why it doesn’t actually answer lots of application-specific questions like “How
do I do logging?”, “How do I do JSON?”, “Where do I define a REST controller?”. To fill that gap and
utilize Finagle’s outstanding scalability and performance, people started building opinionated libraries
and frameworks that are supposed to answer all those questions. Just to mention a couple of such libraries
specifically designed for HTTP: <a href="https://twitter.github.io/finatra/">Finatra</a> and <a href="https://github.com/finagle/finch">Finch</a>.</p>
<p>The <a href="https://twitter.com/vkostyukov/status/613840446725357569">Finagle team at Twitter</a> is called CSL (Core System Libraries). There are ~10 of us maintaining
libraries that power Twitter’s distributed infrastructure. We own <a href="http://twitter.github.io/finagle/">Finagle</a>, <a href="https://github.com/twitter/util">Util</a>,
<a href="https://github.com/twitter/scrooge">Scrooge</a>, <a href="https://twitter.github.io/twitter-server/">TwitterServer</a>, and <a href="https://twitter.github.io/finatra/">Finatra</a>. We’ve got an on-call rotation and an
internal <a href="https://groups.google.com/forum/#!forum/finaglers">finaglers</a> list we use to provide support for teams dealing with production issues
related to Finagle.</p>
<p>Internally, Finagle lives in a monorepo and services depend on its source code. Essentially, every time
we do an OSS release, we actually roll out code that has already been tested internally for months
(by thousands of services serving millions of RPCs every second), which sounds like a pretty decent
deal for external adopters.</p>
<h3 id="the-big-picture">The “Big Picture”</h3>
<p>Finagle is designed with a simple idea in mind: <a href="http://monkey.org/~marius/funsrv.pdf">your server is a function</a>. This means
you can talk to that server by calling this function, and you can implement that server by defining this
function.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">trait</span> <span class="nc">Service</span><span class="o">[</span><span class="kt">Req</span>, <span class="kt">Rep</span><span class="o">]</span> <span class="nc">extends</span> <span class="o">(</span><span class="nc">Req</span> <span class="k">=></span> <span class="nc">Future</span><span class="o">[</span><span class="kt">Rep</span><span class="o">])</span></code></pre></figure>
<p>The most exciting part here is those type params <code class="highlighter-rouge">Req</code> and <code class="highlighter-rouge">Rep</code>, meaning that your service (server or
client) is actually abstracted over the particular protocol (particular request/response types).
This gives us the freedom to build most of the generic features, like retries, in a protocol-agnostic
way, not to mention that types add some safety to your code. For example, it’s impossible to send
an HTTP request to a MySQL server: the compiler will catch that.</p>
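<p>To make the idea concrete, here is a minimal sketch of a service as a function, using <code class="highlighter-rouge">scala.concurrent.Future</code> as a stand-in for Twitter’s <code class="highlighter-rouge">com.twitter.util.Future</code> (the names below are illustrative, not Finagle’s API):</p>

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration

object ServiceSketch {
  // A service is literally a function from a request to a future response.
  trait Service[Req, Rep] extends (Req => Future[Rep])

  // Implementing a "server" means defining the function...
  val length: Service[String, Int] = new Service[String, Int] {
    def apply(req: String): Future[Int] = Future.successful(req.length)
  }

  // ...and talking to it means calling the function.
  def call(req: String): Int = Await.result(length(req), Duration.Inf)
}
```

<p>Because <code class="highlighter-rouge">Req</code> and <code class="highlighter-rouge">Rep</code> are type parameters, a <code class="highlighter-rouge">Service[String, Int]</code> simply cannot be fed anything but a <code class="highlighter-rouge">String</code>; the compiler enforces it.</p>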
<p>This is how Finagle is organized internally. There are three major components it’s built on:
the Finagle <strong>stack</strong> on the left, the <a href="http://netty.io">Netty</a> <strong>pipeline</strong> on the right, and the <strong>transport</strong> in between.</p>
<p><img src="/images/finagle-101/finagle-102.png" style="height: 400px;" /></p>
<p>This is a really interesting combination: we’ve got two completely different worlds here. The
service-oriented, type-safe Finagle world, powered by composition and functional abstractions, on the
left. The event-based, untyped, low-level Netty world on the right. And the transport glues them together.</p>
<p>We will cover the left part here in this post. The Finagle stack is our generic abstraction used to
<em>materialize</em> Finagle clients and servers out of a composition of ordered modules, which may be
anything we know how to compose. And we know how to compose services (since they are just
functions). Technically, we can put those services/functions into a stack and materialize it into a
client or server. That’s pretty much what we do in Finagle today. If you speak functional
programming, we <em>fold</em> a collection of modules into a client or server.</p>
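<p>The folding idea can be sketched in a few lines (this is a toy model, not Finagle’s actual <code class="highlighter-rouge">Stack</code> API): a module wraps a service with new behavior, and materializing a stack means folding the modules into a single service.</p>

```scala
object StackSketch {
  // Toy types: a service is a function, a module transforms a service.
  type Service = Int => Int
  type Module  = Service => Service

  // Two example modules, each wrapping the next service in the stack.
  val doubling: Module  = next => req => next(req * 2)
  val addingOne: Module = next => req => next(req) + 1

  // Folding the stack: foldRight keeps the head of the list as the
  // outermost module, terminating with the endpoint service.
  def materialize(stack: List[Module], endpoint: Service): Service =
    stack.foldRight(endpoint)((module, svc) => module(svc))
}
```

<p>For instance, <code class="highlighter-rouge">materialize(List(doubling, addingOne), identity)</code> yields a service that first doubles the request and then adds one to the response, i.e., <code class="highlighter-rouge">5</code> maps to <code class="highlighter-rouge">11</code>.</p>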
<p>Each module in the stack represents some standalone functionality like retries, load balancing,
circuit breaking, and so on. In fact, this is exactly what a Finagle server looks like: it’s simple
and flat. Clients, on the other hand, are quite tricky.</p>
<p><img src="/images/finagle-101/finagle-stacks1.png" style="height: 400px;" /></p>
<p>There is actually a <strong>tree of stacks</strong> in the client, with two branching points. First, a load balancer
distributes traffic across a number of nodes or endpoints. Second, a connection pool maintains a pool
of connection stacks, which terminate with a transport and a Netty pipeline.</p>
<h3 id="configuration">Configuration</h3>
<p>Before we go deeper into details about client/server modules, let’s discuss what they have in common:
<a href="https://finagle.github.io/blog/2016/02/05/release-notes-6-33">a configuration API</a>. To be fair, that’s one of my favorite topics because I think
it’s a really tricky problem to build a configuration API that’s both easy to use (common things should
be easy to do) and powerful enough (uncommon things should be possible to do). The current version of
the configuration API in Finagle is the third iteration of the idea that <em>configuration is code</em>.</p>
<p>Configuration is always code (not CLI flags, not config files) in Finagle, so it’s type-checked by your
compiler and auto-completed by your IDE. There is a convention on how to find an entry-point API
depending on the protocol you want to work with. Usually, you start by typing something like
<code class="highlighter-rouge">Http.client.with</code> and seeing what’s possible to configure/override on a given client. We separate
commonly-configured params from the ones we think might be dangerous to tweak today. We call those
expert-level and use a slightly different API to override them. The expert-level API is
usually not as friendly and discoverable as the <code class="highlighter-rouge">with</code>-API, which works perfectly as a red flag:
if you’re not having a good time writing configuration, you’re probably doing something wrong or
dangerous.</p>
<h3 id="servers">Servers</h3>
<p>Servers are quite simple in Finagle. They are optimized for high throughput by doing as little as possible
on top of just handling requests. At a minimum, a Finagle server does tracing and metrics, maintains the
request concurrency level, and enforces a very simple request handle-time timeout.</p>
<p>Here is an example of how you can configure the <strong>concurrency limit</strong> on your server. You can say how
many concurrent requests your server can handle at once and how many waiters are allowed. Everything on
top of that will be <em>rejected</em> by the server and hopefully retried by a client talking to it.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">import</span> <span class="nn">com.twitter.finagle.Http</span>
<span class="k">val</span> <span class="n">server</span> <span class="k">=</span> <span class="nc">Http</span><span class="o">.</span><span class="n">server</span>
<span class="o">.</span><span class="n">withAdmissionControl</span><span class="o">.</span><span class="n">concurrencyLimit</span><span class="o">(</span>
<span class="n">maxConcurrentRequests</span> <span class="k">=</span> <span class="mi">10</span><span class="o">,</span>
<span class="n">maxWaiters</span> <span class="k">=</span> <span class="mi">0</span>
<span class="o">)</span></code></pre></figure>
<p>Concurrency limit is one of the forms of <em>admission control</em> we have for servers. Admission control
is a technique that employs some kind of feedback from the underlying system to determine
whether it’s reasonable to handle (for servers) or send (for clients) a given request, or whether it’s
better to reject it. As a canonical example of server-side admission control, we might think of something
that prevents a server from being overwhelmed by rejecting some amount of requests. Essentially, instead of
slowing down 100% of requests, we reject, for example, 25% but keep operating normally (and maybe even
stay within the SLOs).</p>
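<p>A toy sketch of the concurrency-limit idea (this is not Finagle’s implementation; names and the <code class="highlighter-rouge">maxWaiters = 0</code> behavior are illustrative): once the number of in-flight requests reaches the limit, new requests are rejected outright instead of slowing everyone down.</p>

```scala
import java.util.concurrent.atomic.AtomicInteger

// A minimal concurrency limiter with no wait queue (maxWaiters = 0):
// a request either gets a slot immediately or is rejected.
final class ConcurrencyLimit(maxConcurrent: Int) {
  private val inFlight = new AtomicInteger(0)

  def apply[A](handle: => A): Either[String, A] =
    if (inFlight.incrementAndGet() > maxConcurrent) {
      inFlight.decrementAndGet()
      Left("rejected: at capacity") // hopefully retried by the client
    } else {
      try Right(handle)             // handle the request while holding a slot
      finally inFlight.decrementAndGet()
    }
}
```

<p>With a limit of 1, a request arriving while another is in flight gets the rejection rather than queueing behind it.</p>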
<p><strong>Request timeout</strong> is symmetric and might be configured on both servers and clients. It means
the same thing in both cases: time out requests for which responses weren’t sent (when configured on a server)
or received (when configured on a client) in a given amount of time.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">import</span> <span class="nn">com.twitter.conversions.time._</span>
<span class="k">import</span> <span class="nn">com.twitter.finagle.Http</span>
<span class="k">val</span> <span class="n">server</span> <span class="k">=</span> <span class="nc">Http</span><span class="o">.</span><span class="n">server</span>
<span class="o">.</span><span class="n">withRequestTimeout</span><span class="o">(</span><span class="mf">42.</span><span class="n">seconds</span><span class="o">)</span></code></pre></figure>
<p>There is no default value for the request timeout, and it’s disabled for the same reason the concurrency limit
is disabled: Finagle tries really hard not to speculate on any application-specific (or even
protocol-specific) params. You have to be explicit about those.</p>
<h3 id="clients">Clients</h3>
<p>Clients are where things get interesting. Unlike servers, which are optimized for high throughput,
clients maximize success rate and minimize latency by doing as much as possible to make sure
a request will succeed in the least possible time. This makes them much more complicated than servers.
The list of features clients implement is too long to be fully covered in this post, but we
can surely walk through them and see what kinds of problems they solve.</p>
<p>First of all, we need to be able to <strong>retry</strong> failed requests, thereby maximizing the success rate. Retrying
implies a number of quite difficult questions we need to find answers to. How can we tell if a request
failed? Is it safe to retry this request? If we already tried once and it didn’t help, should
we keep retrying or give up? Finagle takes good care of all of these, and we’ll see
later what kinds of abstractions it uses to achieve that.</p>
<p>Next, we need a way to help services locate each other, so there is built-in <strong>service discovery</strong>
support in every Finagle client. By default, it might use either DNS or <a href="https://zookeeper.apache.org/">ZooKeeper</a>, but it’s also
possible to plug in any other library by implementing a couple of simple interfaces. For example,
there is an OSS package that enables <a href="https://github.com/kachayev/finagle-consul">Consul support in Finagle</a>.</p>
<p>We also need tooling around <strong>timeouts</strong> so we can put reasonable bounds on components in our
distributed system. In addition to the request timeout that we’ve already discussed, there is a whole
range of timeouts you can override in Finagle, starting with low-level TCP connect timeouts and
finishing with session timeouts. None of the timeouts are bound by default, since those values are considered
specific to a given application. Use the following example to override timeouts.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">import</span> <span class="nn">com.twitter.conversions.time._</span>
<span class="k">import</span> <span class="nn">com.twitter.finagle.Http</span>
<span class="k">val</span> <span class="n">client</span> <span class="k">=</span> <span class="nc">Http</span><span class="o">.</span><span class="n">client</span>
<span class="o">.</span><span class="n">withTransport</span><span class="o">.</span><span class="n">connectTimeout</span><span class="o">(</span><span class="mf">1.</span><span class="n">second</span><span class="o">)</span> <span class="c1">// TCP connect
</span> <span class="o">.</span><span class="n">withSession</span><span class="o">.</span><span class="n">acquisitionTimeout</span><span class="o">(</span><span class="mf">42.</span><span class="n">seconds</span><span class="o">)</span>
<span class="o">.</span><span class="n">withSession</span><span class="o">.</span><span class="n">maxLifeTime</span><span class="o">(</span><span class="mf">20.</span><span class="n">seconds</span><span class="o">)</span> <span class="c1">// connection max life time
</span> <span class="o">.</span><span class="n">withSession</span><span class="o">.</span><span class="n">maxIdleTime</span><span class="o">(</span><span class="mf">10.</span><span class="n">seconds</span><span class="o">)</span> <span class="c1">// connection max idle time</span></code></pre></figure>
<p>Of course, we need to distribute traffic across a number of instances where our software is deployed.
Finagle comes with a very rich set of <strong>load balancers</strong>, and I honestly can’t name another system around
that provides such advanced load balancing strategies as Finagle does today. We’ll cover load balancing
in detail later in this post, since it plays a major role in the resiliency of Finagle clients.</p>
<p>Besides picking the right replica to send a request to, a Finagle client also takes care of managing
a <strong>connection pool</strong> as well as maintaining a stack of <strong>circuit breakers</strong> used to exclude unreliable
replicas/sessions from a request path.</p>
<p>Clients also come with <strong>interrupts</strong> support that is primarily used for <strong>request cancellation</strong>
and prevents both servers and clients from doing useless work. You might think, “Why would I cancel
the request? I sent it and I meant it!”. The tricky part here is that interrupts happen implicitly.
Consider the following example. Your client sets a timeout and sends a request to a server. After a given
amount of time the timeout expires and the client interrupts the future associated with that request.
What happens next really depends on the client’s protocol, but long story short, Finagle will do its
best to propagate that cancellation across service boundaries. In the worst case
(i.e., HTTP/1.1) it cuts the connection, but there is so much more we can do in Mux (and perhaps HTTP/2)
by sending a control message to a server saying something close to “Hey, I’m no longer interested in
the result of this request so you should feel free to drop it”.</p>
<p>Quite similarly to propagating interrupts, we might also want to do that for some <strong>request context</strong>
that might contain quite useful (for debugging and monitoring/tracing) information like the request id,
request deadline, upstream/downstream service name and so on. Clients support that today and
take care of serializing/deserializing contexts depending on the protocol used (e.g., request headers
are used in HTTP) and propagating them across service boundaries.</p>
<p>Both interrupts and contexts are used heavily at Twitter, and that’s one of the reasons why we still
like our futures better. Unlike Scala futures, Twitter futures propagate contexts and interrupts
through the chain of transformations, even when its parts are executed by different threads.</p>
<h4 id="response-classification">Response Classification</h4>
<p>As we discussed before, being on Layer 5 means knowing everything about transport and sessions, but
nothing (or almost nothing) about the protocol/application. This implies some unexpected behaviour:
HTTP 500 (Internal Server Error) actually looks like a successful response to a Finagle client.
The following poll proves that this is confusing at the minimum.</p>
<center>
<blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr">Quick poll by <a href="https://twitter.com/kevino">@kevino</a>: Does Finagle treat HTTP 500 response as a failure or success?</p>— Finagle (@finagle) <a href="https://twitter.com/finagle/status/697184041267662849">February 9, 2016</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</center>
<p>Why does that happen? Nothing magical. There is definitely nothing wrong with a correctly structured
protocol message (e.g., HTTP 500) from the client’s point of view, so it’s counted as a success. And that’s
quite a big deal. First, we’ve got our metrics messed up: success rate is 100%, but we’re serving
failures. Second, our load balancers go crazy: they think that if a given server responds fast and the
response is a success, it’s a great deal to send it more traffic, while in fact it was failing
fast.</p>
<p>Since 6.33, we’ve got <a href="https://finagle.github.io/blog/2016/02/09/response-classification">response classifiers</a> that you can plug into any client to teach
it how to treat responses. For example, here is how you can tell an HTTP client to see HTTP 503 as a
non-retryable failure so the circuit breaker will kick in on this response.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">import</span> <span class="nn">com.twitter.finagle.</span><span class="o">{</span><span class="nc">Http</span><span class="o">,</span> <span class="n">http</span><span class="o">}</span>
<span class="k">import</span> <span class="nn">com.twitter.finagle.service._</span>
<span class="k">import</span> <span class="nn">com.twitter.util._</span>
<span class="k">val</span> <span class="n">classifier</span><span class="k">:</span> <span class="kt">ResponseClassifier</span> <span class="o">=</span> <span class="o">{</span>
<span class="k">case</span> <span class="nc">ReqRep</span><span class="o">(</span><span class="k">_</span><span class="o">,</span> <span class="nc">Return</span><span class="o">(</span><span class="n">r</span><span class="k">:</span> <span class="kt">http.Response</span><span class="o">))</span> <span class="k">if</span> <span class="n">r</span><span class="o">.</span><span class="n">statusCode</span> <span class="o">==</span> <span class="mi">503</span> <span class="k">=></span>
<span class="nc">ResponseClass</span><span class="o">.</span><span class="nc">NonRetryableFailure</span>
<span class="o">}</span>
<span class="k">val</span> <span class="n">client</span> <span class="k">=</span> <span class="nc">Http</span><span class="o">.</span><span class="n">client</span>
<span class="o">.</span><span class="n">withResponseClassifier</span><span class="o">(</span><span class="n">classifier</span><span class="o">)</span></code></pre></figure>
<h4 id="retries">Retries</h4>
<p>The retries module is placed at the very top of the stack so it can retry failures from the underlying
modules (e.g., circuit breakers, timeouts, load balancers). Finagle will only retry when it’s absolutely
safe to do so: for example, when it’s known that a request wasn’t written to the wire yet, or when a
request was rejected by a server. And this makes a lot of sense: if a load balancer picked a replica
that rejected a request (e.g., due to admission control), it’s totally fine to retry it on a
different replica.</p>
<p>Retries are built on top of <a href="https://finagle.github.io/blog/2016/02/08/retry-budgets/">retry budgets</a>, which behave as leaky token buckets and tie the number
of retries to the total number of requests. Technically, <code class="highlighter-rouge">RetryBudget</code> is responsible for limiting the number
of retries and helps mitigate <strong>retry storms</strong> (i.e., retrying too much).</p>
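<p>To make the leaky-token-bucket intuition concrete, here is a toy sketch of the idea (not Finagle’s actual <code class="highlighter-rouge">RetryBudget</code> implementation; all names here are illustrative, and the per-second reserve is modeled as a one-time initial balance for simplicity): every regular request deposits a fraction of a token, and every retry withdraws a whole one, so retries stay proportional to traffic.</p>

```scala
// Toy retry budget: integer "milli-tokens" sidestep floating-point drift.
final class ToyRetryBudget(percentCanRetry: Double, minRetriesPerSec: Int) {
  private var balance: Long = minRetriesPerSec * 1000L     // fixed reserve
  private val depositAmount: Long = (percentCanRetry * 1000).toLong

  def deposit(): Unit = balance += depositAmount           // a regular request
  def tryWithdraw(): Boolean =                             // a retry attempt
    if (balance >= 1000L) { balance -= 1000L; true } else false
}

val budget = new ToyRetryBudget(percentCanRetry = 0.1, minRetriesPerSec = 5)
(1 to 100).foreach(_ => budget.deposit()) // 100 requests earn 10 retry tokens
val granted = (1 to 100).count(_ => budget.tryWithdraw())
// 5 reserved + 10 earned = 15 retries granted out of 100 attempted
```

<p>With a 10% budget, a hundred requests buy ten retries; the remaining eighty-five retry attempts are simply rejected, which is exactly how a retry storm gets damped.</p>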
<p>Once we’re given permission for a retry by a retry budget, we need to figure out how long to wait
(if at all) between retries. This technique is called <strong>backoff</strong> in Finagle (and almost any other
library) and is represented as <code class="highlighter-rouge">Stream[Duration]</code>, which means you can easily plug in your own thing.</p>
<p>Finagle provides an API for building popular backoffs, including <a href="http://www.awsarchitectureblog.com/2015/03/backoff.html">jittered</a> ones, which are super
useful in clusters built around optimistic locking, whose individual nodes might perform poorly under
high contention. Our goal is to make sure that clients started at the same time are not competing
with each other on retries to a single server. To do that, we add a randomized factor (or <strong>jitter</strong>)
to every duration from a backoff policy, thereby reducing the chances that several clients will be retrying
simultaneously.</p>
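<p>Since a backoff is just a <code class="highlighter-rouge">Stream[Duration]</code>, a fully jittered exponential policy fits in a few lines. This sketch uses <code class="highlighter-rouge">scala.concurrent.duration</code> instead of Twitter’s util-core durations, and the function name is illustrative rather than Finagle’s API: each delay is drawn uniformly from a window whose ceiling doubles until it hits the maximum.</p>

```scala
import scala.concurrent.duration._
import scala.util.Random

// "Full jitter": every delay is random in [0, ceiling), and the ceiling
// grows exponentially from `start` up to `maximum`.
def exponentialJittered(start: FiniteDuration,
                        maximum: FiniteDuration,
                        rng: Random = new Random): Stream[FiniteDuration] = {
  def loop(ceiling: FiniteDuration): Stream[FiniteDuration] = {
    val delay = (rng.nextDouble() * ceiling.toMillis).toLong.millis
    delay #:: loop((ceiling * 2).min(maximum))
  }
  loop(start)
}

val delays = exponentialJittered(2.seconds, 32.seconds).take(6).toList
// ceilings: 2s, 4s, 8s, 16s, 32s, 32s; actual delays are random below them
```

<p>Two clients running this policy will almost never sleep for the same amount of time, so their retries spread out instead of arriving at the struggling server in lockstep.</p>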
<p>For example, here is how we override the retry budget and retry backoff on an HTTP client. The budget allows 10%
of total requests to be requeued, on top of 5 retries per second to accommodate clients with low RPS.
The backoff uses a jittered, randomized function that grows exponentially
(e.g., <code class="highlighter-rouge">random(2s) :: random(4s) :: ... :: random(32s)</code>).</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">import</span> <span class="nn">com.twitter.finagle.Http</span>
<span class="k">import</span> <span class="nn">com.twitter.conversions.time._</span>
<span class="k">import</span> <span class="nn">com.twitter.finagle.service.</span><span class="o">{</span><span class="nc">RetryBudget</span><span class="o">,</span> <span class="nc">Backoff</span><span class="o">}</span>
<span class="k">val</span> <span class="n">budget</span> <span class="k">=</span> <span class="nc">RetryBudget</span><span class="o">(</span>
<span class="n">ttl</span> <span class="k">=</span> <span class="mf">10.</span><span class="n">seconds</span><span class="o">,</span>
<span class="n">minRetriesPerSec</span> <span class="k">=</span> <span class="mi">5</span><span class="o">,</span>
<span class="n">percentCanRetry</span> <span class="k">=</span> <span class="mf">0.1</span>
<span class="o">)</span>
<span class="k">val</span> <span class="n">client</span> <span class="k">=</span> <span class="nc">Http</span><span class="o">.</span><span class="n">client</span>
<span class="o">.</span><span class="n">withRetryBudget</span><span class="o">(</span><span class="n">budget</span><span class="o">)</span>
<span class="o">.</span><span class="n">withRetryBackoff</span><span class="o">(</span><span class="nc">Backoff</span><span class="o">.</span><span class="n">exponentialJittered</span><span class="o">(</span><span class="mf">2.</span><span class="n">seconds</span><span class="o">,</span> <span class="mf">32.</span><span class="n">seconds</span><span class="o">))</span></code></pre></figure>
<h4 id="load-balancers">Load Balancers</h4>
<p>There are pretty deep-seated assumptions inside of Finagle that service clusters are homogeneous,
that they are equivalent from the point of view of the application. Those are often referred to as a
<strong>replica set</strong>.</p>
<p>In Finagle, load balancers consist of two independent components: <strong>load distributor</strong> and
<strong>load metric</strong>. That said, the load balancing algorithm might be described in terms of those two: we
distribute load across some subset of nodes/replicas and pick the one for which a load metric is
minimal.</p>
<p>In order to better understand the variety of load balancing options available in Finagle today, let’s
have a look at their evolution so we can see why they were introduced in the first place and what
problems they solve.</p>
<p>In the very beginning, we had the <strong>heap</strong> balancer, built on top of a min-heap that maintains the number
of outstanding requests per node. That worked really well, and it was the default choice for a
long time.</p>
<p><img src="/images/finagle-101/heap-lb1.png" style="height: 300px;" /></p>
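<p>The core of the heap balancer can be sketched with the standard library’s priority queue (this illustrates the idea only, not Finagle’s actual <code class="highlighter-rouge">HeapBalancer</code>): the least loaded node is always at the top, but every load change costs a logarithmic re-heapify.</p>

```scala
import scala.collection.mutable

// A node with a mutable count of outstanding requests.
final case class Node(id: Int, var outstanding: Int)

// PriorityQueue is a max-heap, so negate the load to get min-heap behavior.
val heap =
  mutable.PriorityQueue.empty[Node](Ordering.by((n: Node) => -n.outstanding))
(0 until 3).foreach(i => heap.enqueue(Node(i, outstanding = 0)))

// Issuing a request: O(log n) dequeue + O(log n) enqueue.
val least = heap.dequeue() // the least loaded node
least.outstanding += 1     // send the request to it
heap.enqueue(least)        // put it back with its new load
```

<p>Peeking at the minimum is constant time, but the enqueue/dequeue pair on every single request is where the contention and the logarithmic cost come from.</p>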
<p>But at some point, we figured out a number of drawbacks this option had. First of all, the load balancer
state (i.e., the heap) is a highly contended resource (updated on each request), so it has to support
extremely fast updates. Needless to say, the heap is an amazing data structure with constant-time
access to its min element, but every other operation takes logarithmic time. That’s why it’s
tricky to implement a different and perhaps more sophisticated load metric on top of the heap without
sacrificing its performance.</p>
<p>The next step was the <strong>P2C</strong> (power of two choices) load balancer, which solved most of the heap balancer’s
problems. The algorithm employs quite a brilliant idea: take two random nodes from the server
set and pick the least loaded one. If we use that strategy repeatedly, we get a manageable
upper bound on the maximum load of each node.</p>
<p><img src="/images/finagle-101/p2c1.png" style="height: 200px;" /></p>
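<p>The following toy simulation (an illustration, not Finagle’s implementation) shows the effect: with nothing more than two random probes and a least-loaded comparison over a plain array, load stays remarkably balanced and every pick is constant time.</p>

```scala
import scala.util.Random

// P2C over an array of per-node outstanding-request counters: picking and
// updating are both O(1), unlike the heap balancer.
final class ToyP2C(nodes: Int, rng: Random) {
  val load: Array[Int] = new Array[Int](nodes)
  def send(): Unit = {
    val a = rng.nextInt(nodes)
    val b = rng.nextInt(nodes)
    val picked = if (load(a) <= load(b)) a else b // least loaded of the two
    load(picked) += 1
  }
}

val balancer = new ToyP2C(nodes = 10, new Random(7))
(1 to 10000).foreach(_ => balancer.send())
// all 10 counters end up very close to the 1000-per-node average
```

<p>The classic result behind this is that the most loaded node exceeds the average by only O(log log n), which is why two random probes are almost as good as scanning the whole server set.</p>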
<p>Given that we can now update the load balancer state in constant time (by just updating an array),
we can employ more sophisticated load metrics. The EWMA load metric was Finagle’s next attempt in that
direction.</p>
<p>For each node in the server set, <strong>EWMA</strong> (which stands for Exponentially Weighted Moving Average) keeps track
of round-trip latency weighted by the number of outstanding requests. And this is really smart, because
being on Layer 5, we can take advantage of both RPC latency and RPC queue depth. This makes EWMA
quite sensitive to latency spikes, so it reacts much faster to GC pauses and JVM warmups. For example,
if a load balancer happens to pick a replica that just went into a long GC pause, its EWMA metric will
reflect that immediately and its load will be adjusted accordingly.</p>
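<p>The reactivity is easy to see in a stripped-down version of the metric (latency-only; Finagle’s real metric also folds in the number of outstanding requests and a time-based decay):</p>

```scala
// Exponentially weighted moving average of observed latencies (millis).
// A higher alpha weights recent observations more, so spikes show up fast.
final class ToyEwma(alpha: Double) {
  private var estimate = 0.0
  private var seeded = false
  def observe(latencyMs: Double): Unit = {
    estimate = if (!seeded) { seeded = true; latencyMs }
               else alpha * latencyMs + (1 - alpha) * estimate
  }
  def value: Double = estimate
}

val ewma = new ToyEwma(alpha = 0.5)
Seq(10.0, 10.0, 10.0).foreach(ewma.observe) // a healthy replica around 10ms
ewma.observe(1000.0)                        // then a long GC pause hits
// a single spiky observation drags the estimate up to 505ms
```

<p>A single slow response pushes the estimate far above the other replicas’, so the balancer immediately steers traffic away, yet the estimate decays back as healthy samples return.</p>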
<p>There is an <a href="https://blog.buoyant.io/2016/03/16/beyond-round-robin-load-balancing-for-latency/">outstanding post</a> from <a href="https://twitter.com/stevej">@stevej</a> comparing three different load balancing
options in Finagle that quite explicitly shows how EWMA outperforms all other options. EWMA shows
the best result there in mitigating latency spikes caused by GC pauses or JVM warmups.</p>
<p>While EWMA looks very promising already, there is an even more advanced load balancer in Finagle today.
The <strong>aperture</strong> load balancer is designed to solve the problem of large server sets. Depending on the
scale, each Finagle client might be talking to several thousands of servers, which will likely
result in several thousands of open connections and quite low concurrency per node. Why does
the number of connections matter? It’s a waste of resources, and it comes with the cost of long tail latency
because of the high number of connection establishments. Why does concurrency matter? To take advantage
of any load metric we need some numbers to work with: when concurrency is low, our least-loaded metric
is zero for every server.</p>
<p>The aperture load balancer solves that by viewing the huge server set through a small window where
it can apply any existing load balancer.</p>
<p><img src="/images/finagle-101/aperture1.png" style="height: 300px;" /></p>
<p>The advantages of aperture are quite promising. Fewer connections for clients and servers means better
tail latency. Also, by employing a simple feedback controller, the load balancer adjusts the aperture
size to maintain the requested concurrency level thereby keeping replicas in a warm state.</p>
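<p>The window idea itself fits in a few lines. In this toy (an illustration only; the aperture’s position and size are fixed here rather than driven by a feedback controller), P2C is applied inside a small window over a large server set:</p>

```scala
import scala.util.Random

// P2C restricted to the first `aperture` nodes of a large server set.
final class ToyAperture(nodes: Int, aperture: Int, rng: Random) {
  val load: Array[Int] = new Array[Int](nodes)
  def send(): Unit = {
    val a = rng.nextInt(aperture) // probe only inside the window
    val b = rng.nextInt(aperture)
    val picked = if (load(a) <= load(b)) a else b
    load(picked) += 1
  }
}

val balancer = new ToyAperture(nodes = 1000, aperture = 10, new Random(1))
(1 to 5000).foreach(_ => balancer.send())
// only 10 of 1000 nodes receive traffic; each sees ~500 requests, enough
// concurrency for a least-loaded metric to be meaningful
```

<p>Concentrating traffic on ten nodes instead of a thousand is exactly what keeps connection counts low and replicas warm; the real balancer then grows or shrinks that window to hold per-node concurrency inside the configured load band.</p>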
<p>We will likely make aperture the default balancing option quite soon, but right now you need to enable
it manually. Here we build the aperture load balancer with an initial size of 10 and load bounds between
1 and 2. Basically, this means that we want to make sure all replicas in the aperture will constantly be
getting between 1 and 2 concurrent requests, and the aperture will be resized dynamically to satisfy
that requirement.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">import</span> <span class="nn">com.twitter.conversions.time._</span>
<span class="k">import</span> <span class="nn">com.twitter.finagle.Http</span>
<span class="k">import</span> <span class="nn">com.twitter.finagle.loadbalancer.Balancers</span>
<span class="k">val</span> <span class="n">balancer</span> <span class="k">=</span> <span class="nc">Balancers</span><span class="o">.</span><span class="n">aperture</span><span class="o">(</span>
<span class="n">lowLoad</span> <span class="k">=</span> <span class="mf">1.0</span><span class="o">,</span> <span class="n">highLoad</span> <span class="k">=</span> <span class="mf">2.0</span><span class="o">,</span> <span class="c1">// the load band adjusting an aperture
</span> <span class="n">minAperture</span> <span class="k">=</span> <span class="mi">10</span> <span class="c1">// min aperture size
</span><span class="o">)</span>
<span class="k">val</span> <span class="n">client</span> <span class="k">=</span> <span class="nc">Http</span><span class="o">.</span><span class="n">client</span><span class="o">.</span><span class="n">withLoadBalancer</span><span class="o">(</span><span class="n">balancer</span><span class="o">)</span></code></pre></figure>
<h4 id="circuit-breakers">Circuit Breakers</h4>
<p>Now that we know how load balancers distribute load, the question is: how do they avoid nodes that are likely
to fail or have already failed? This is done by a layer of circuit breakers placed under the load balancers, so
that when they mark a replica unavailable, it will be avoided by a load balancer.</p>
<p>As of today, there are three circuit breakers in Finagle.</p>
<ul>
<li><strong>Fail Fast</strong> - prematurely disables the session that failed TCP connect.</li>
<li><strong>Failure Accrual</strong> - performs liveness detection on a request basis.</li>
<li><strong>Threshold Failure Detector</strong> - a ping-based failure detector that periodically measures the RTT latency
of a ping-pong exchange between nodes; if the latency doesn’t look good, it excludes that session from a
request path. This is a pretty powerful tool, but it requires the underlying protocol to support liveness
detection control signals. For now, we have that implemented for Mux and will likely do it for HTTP/2
as well.</li>
</ul>
<p>Failure Accrual is our most advanced circuit breaker in that it supports a pluggable policy
used to determine when to mark a session unavailable. Today it’s possible to either configure it to
maintain a required success rate (if that goes below the requested value, a session is marked dead) or
to say after how many consecutive failures a session is considered unavailable.</p>
<p>By default, it marks a session dead after 5 failures in a row and goes into a jittered backoff before
re-enabling that session/replica again. It’s quite easy, though, to override that to be success-rate
based and disable the session once its success rate drops below 95% over the most recent hundred
requests.</p>
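<p>The default consecutive-failures policy is easy to sketch (this illustrates the behavior only, not Finagle’s <code class="highlighter-rouge">FailureAccrualFactory</code>; the real one revives sessions on a jittered, growing backoff rather than a fixed dead time):</p>

```scala
// Mark a session dead after `threshold` consecutive failures, and revive it
// after `markDeadForMs` milliseconds. Time is passed in explicitly.
final class ToyFailureAccrual(threshold: Int, markDeadForMs: Long) {
  private var failuresInARow = 0
  private var deadUntil = 0L

  def isAvailable(nowMs: Long): Boolean = nowMs >= deadUntil
  def recordSuccess(): Unit = failuresInARow = 0
  def recordFailure(nowMs: Long): Unit = {
    failuresInARow += 1
    if (failuresInARow >= threshold) {
      deadUntil = nowMs + markDeadForMs
      failuresInARow = 0
    }
  }
}

val accrual = new ToyFailureAccrual(threshold = 5, markDeadForMs = 10000L)
(1 to 5).foreach(_ => accrual.recordFailure(nowMs = 0L))
// dead for the next 10 seconds, then back in the balancer's rotation
```

<p>While <code class="highlighter-rouge">isAvailable</code> returns false, the load balancer above simply skips this session, which is the whole point of stacking circuit breakers under the balancer.</p>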
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">import</span> <span class="nn">com.twitter.conversions.time._</span>
<span class="k">import</span> <span class="nn">com.twitter.finagle.Http</span>
<span class="k">import</span> <span class="nn">com.twitter.finagle.service.Backoff</span>
<span class="k">import</span> <span class="nn">com.twitter.finagle.service.FailureAccrualFactory.Param</span>
<span class="k">import</span> <span class="nn">com.twitter.finagle.service.exp.FailureAccrualPolicy</span>
<span class="k">val</span> <span class="n">twitter</span> <span class="k">=</span> <span class="nc">Http</span><span class="o">.</span><span class="n">client</span>
<span class="o">.</span><span class="n">configured</span><span class="o">(</span><span class="nc">Param</span><span class="o">(()</span> <span class="k">=></span> <span class="nc">FailureAccrualPolicy</span><span class="o">.</span><span class="n">successRate</span><span class="o">(</span>
<span class="n">requiredSuccessRate</span> <span class="k">=</span> <span class="mf">0.95</span><span class="o">,</span>
<span class="n">window</span> <span class="k">=</span> <span class="mi">100</span><span class="o">,</span>
<span class="n">markDeadFor</span> <span class="k">=</span> <span class="nc">Backoff</span><span class="o">.</span><span class="n">const</span><span class="o">(</span><span class="mf">10.</span><span class="n">seconds</span><span class="o">)</span>
<span class="o">)))</span></code></pre></figure>
<h3 id="whats-next">What’s next?</h3>
<p>There are quite exciting times ahead. The Netty 4 migration is on track and should happen really soon. I
know we’ve been promising this Netty 4 utopia for about two years now, but we’re finally getting there.</p>
<p>What does Netty 4 mean for Finagle? First, we expect better performance and fewer allocations. Second, support
for new protocols (like HTTP/2). Third, simplification of Finagle internals due to the simpler and safer
threading model in Netty 4. Finally, it’s better to stay aligned with the state-of-the-art IO library for
the JVM to get the most out of it.</p>
<p>We’ll also continue working on resiliency and admission control in Finagle to make sure we’re doing as
much as possible to make your RPC sessions even more reliable and easy to configure.</p>

<p><em>How Fast is Finch? (2016-02-29) · <a href="http://kostyukov.net/posts/how-fast-is-finch">kostyukov.net/posts/how-fast-is-finch</a></em></p>

<p>Turns out I’ve never mentioned anything about <a href="https://github.com/finagle/finch">Finch</a>, a library I’m working on most of my
free time, in my personal blog. So I decided to finally fix that and write a small note on what I
think about Finch’s performance as an HTTP library/server. To be more precise, I want to comment on
<a href="https://www.techempower.com/benchmarks/#section=data-r12&hw=peak&test=json&l=6bk">the most recent results of the TechEmpower benchmark</a> and perhaps give some
insights on why Finch is ranked so high there.</p>
<p>A couple of days ago, results from the most recent run of the TechEmpower benchmark were published.
While I was expecting Finch to perform well there, I didn’t expect it to be the second fastest HTTP
library written in Scala.</p>
<blockquote align="center" class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Impressive results by <a href="https://twitter.com/hashtag/Finch?src=hash">#Finch</a> (now <a href="https://twitter.com/hashtag/Scala?src=hash">#Scala</a> 2nd fastest HTTP library) running <a href="https://twitter.com/techempower">@techempower</a> benchmark (430k QPS peak): <a href="https://t.co/YeBMnJeQ5W">pic.twitter.com/YeBMnJeQ5W</a></p>— Vladimir Kostyukov (@vkostyukov) <a href="https://twitter.com/vkostyukov/status/703374308056309760">February 27, 2016</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>With that said, I’ll go ahead and answer my own question in the title of this post, “How Fast is
Finch?”. Looking at the chart, it’s obvious that <strong>Finch is fast enough</strong> to perform
really well on 99.99% of your business problems. At least, in comparison with other Scala libraries.</p>
<p>The most interesting part of this discussion is trying to understand why it performs so well. Why
doesn’t the insane level of indirection Finch involves on top of Finagle services add much
overhead? The quick answer would be: Finch owes most of its high performance to the <em>fast</em> and
<em>battle-proven</em> libraries it depends on (Finagle, Circe, Cats and Shapeless). The secret recipe is
quite simple: take <strong>fast components</strong> (no matter whether functional or imperative) and glue
them together using <strong>rock-solid (pure) functional abstractions</strong> that are easy to
test and reason about.</p>
<p>Thus, Finch is fast because …</p>
<h4 id="finagle-is-fast">Finagle is fast</h4>
<p>Finch was designed with one goal in mind: provide an easy-to-use API (i.e., a combinators API) on top
of one that’s easy to implement (Finagle services). Obviously, it should involve some overhead on top of
bare-metal Finagle. And it does: by our latest measurements it adds 10% of allocations and 5% of
running time on top of Finagle, which is not so dramatic and, I’d say, pretty good for a pre-1.0
library.</p>
<p>When it comes to <a href="https://github.com/twitter/finagle">Finagle</a>, there is no doubt about its performance. The Finagle team
at Twitter (which I’m luckily a part of) puts a lot of effort into making sure that Finagle’s
performance is constantly improving. There is a number of micro-benchmarks we run on each commit to
critical components. There are integration tests we write using the internal framework called Integ
to load test different Finagle topologies. Finally, there is <a href="https://twitter.com">https://twitter.com</a> that stress
tests Finagle 24/7, doing millions of queries per second (DC-wise).</p>
<p>Finagle itself is built on the same principles as Finch. It reuses the industry’s best practices and
runs its IO layer on <a href="http://netty.io/">Netty</a>, which is well known as the best thing that happened to the JVM in
years. <a href="http://netty.io/wiki/adopters.html">Netty is everywhere</a>: I’d be surprised if Netty code doesn’t handle at least 10%
of your everyday traffic. You send a tweet, Finagle and Netty take care of it. You upload your
photos to iCloud or talk to Siri, <a href="https://speakerdeck.com/normanmaurer/connectivity">Netty handles it</a>. This list is almost limitless;
I’m not sure I know any JVM shop around that doesn’t use Netty in some way.</p>
<h4 id="circe-is-fast">Circe is fast</h4>
<p>While Finch is designed to be agnostic to a concrete serialization library, there is one that plays
really nicely with Finch. <a href="https://github.com/travisbrown/circe">Circe</a> is a relatively new JSON library that started as a fork of
<a href="http://argonaut.io/">Argonaut</a> but ended up as a completely standalone and mature project. It promotes
type-full programming and provides compile-time mechanisms for deriving JSON codecs for sealed
traits and case classes.</p>
<p>Even though Circe is young, it’s already one of <a href="https://github.com/travisbrown/circe#performance">the fastest Scala JSON libraries</a>
around, which is quite mind-blowing given how nice, thoughtful and boilerplate-less its API is. Part
of Circe’s great performance comes from the library it uses to parse JSON strings into JSON
ASTs. This library is called <a href="https://github.com/non/jawn">Jawn</a> and it’s one of the fastest (if not the fastest) ways to
parse JSON on the JVM.</p>
<h4 id="shapeless-is-fast">Shapeless is fast</h4>
<p><a href="https://github.com/milessabin/shapeless">Shapeless</a> is a generic programming library used by many Scala projects (including
<a href="https://github.com/travisbrown/circe">Circe</a>, <a href="http://spray.io/">Spray</a>, <a href="https://github.com/scodec/scodec">scodec</a>, etc) to implement generic API (e.g., abstract over tuple arity) in a boilerplate-less manner.</p>
<p>While it might seem like Shapeless does a lot of work and adds a lot of overhead, that’s not
really the case. Most of the Shapeless-related work happens at compile time and does not affect
program running time. It shouldn’t be a surprise that Shapeless-powered code does increase compilation
time, but it almost never increases running time. Finch benchmarks of derived vs. custom-written
endpoints only confirm that: the performance is literally the same.</p>
<h4 id="finch-is-fast">Finch is fast</h4>
<p>The bottom line is that Finch is in good company. There is a team or a person behind every single
library Finch uses, dedicated to its future and performance. I’m confident in those people and I’m
confident in the libraries they maintain. Finch takes a lot from the OSS community and tries to pay it
back with good performance.</p>
<p>The fact that Finch performs so well gives me hope that the abstractions we chose in the
beginning are not completely broken performance-wise. And this makes me confident in Finch’s future
performance. We haven’t stopped yet. In fact, we haven’t even started: the actual performance work
is only planned as a post 1.0 activity.</p>

<p><em>Designing a Purely Functional Data Structure (2015-04-04) · <a href="http://kostyukov.net/posts/designing-a-pfds">kostyukov.net/posts/designing-a-pfds</a></em></p>

<p>Functional programming nicely leverages constraints on <em>how</em> programs are written, thereby promoting a clean and easy-to-reason-about coding style. <em>Purely functional data structures</em> are (surprisingly) built out of those constraints. They are <strong>persistent</strong> (FP implies that both the <em>old</em> and <em>new</em> versions of an updated object are available) and backed by <strong>immutable</strong> objects (FP doesn’t support <em>destructive updates</em>). Needless to say, it’s a challenge to design a purely functional data structure that meets the performance requirements of its imperative sibling. Fortunately, it’s quite possible in most cases, even for those data structures whose reference implementations are backed by mutable arrays. This post describes the process of designing a purely functional implementation technique for <a href="http://en.wikipedia.org/wiki/Binary_heap">Standard Binary Heaps</a>, with the same asymptotic bounds as in an imperative setting.</p>
<h4 id="immutability-and-persistence">Immutability and Persistence</h4>
<p><em>Immutability</em> and <em>persistence</em> are quite similar terms, which often substitute for each other. We say <a href="http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.Vector">immutable vector</a> (in Scala) but mean <a href="http://clojuredocs.org/clojure_core/clojure.core/vector">persistent vector</a> (in Clojure): both implementations are based on the same abstract data structure, the <a href="http://en.wikipedia.org/wiki/Hash_array_mapped_trie">Bit-Mapped Vector Trie</a>, but are named differently. Still, there is a slight difference between immutability and persistence as they apply to data structures.</p>
<ul>
<li>Persistent data structures support <strong>multiple versions</strong></li>
<li>Immutable data structures <strong>aren’t changeable</strong></li>
</ul>
<p>The difference between immutable and persistent data structures lies in how they handle updates. A persistent data structure handles updates in a <em>smart</em> and memory-efficient way in order to keep its previous version unchanged, while an immutable data structure simply <em>doesn’t care</em> about updates at all (for example, Guava’s <a href="https://github.com/google/guava/blob/master/guava/src/com/google/common/collect/ImmutableList.java">ImmutableList</a> doesn’t even support updates), since any “update” would mean building a whole new copy.</p>
<p>The following example demonstrates the difference between Guava’s <code class="highlighter-rouge">ImmutableList</code> and Scala’s persistent <code class="highlighter-rouge">List</code> in terms of memory footprint (smart updates vs. dumb updates).</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="c1">// xs takes O(n) memory
</span><span class="k">val</span> <span class="n">xs</span> <span class="k">=</span> <span class="nc">ImmutableList</span><span class="o">.</span><span class="n">of</span><span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="mi">2</span><span class="o">,</span> <span class="mi">3</span><span class="o">)</span>
<span class="c1">// yx takes O(n) memory
</span><span class="k">val</span> <span class="n">ys</span> <span class="k">=</span> <span class="mi">1</span> <span class="o">::</span> <span class="mi">2</span> <span class="o">::</span> <span class="mi">3</span> <span class="o">::</span> <span class="nc">Nil</span>
<span class="c1">// dumb update: xxs takes O(n) memory (full copying)
</span><span class="k">val</span> <span class="n">xxs</span> <span class="k">=</span> <span class="nc">ImmutableList</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">add</span><span class="o">(</span><span class="mi">0</span><span class="o">).</span><span class="n">addAll</span><span class="o">(</span><span class="n">xs</span><span class="o">).</span><span class="n">build</span><span class="o">()</span>
<span class="c1">// smart update: yys takes O(1) memory (structural sharing)
</span><span class="k">val</span> <span class="n">yys</span> <span class="k">=</span> <span class="mi">0</span> <span class="o">::</span> <span class="n">ys</span></code></pre></figure>
<h4 id="purely-functional-data-structures">Purely Functional Data Structures</h4>
<p>Purely functional data structures are <strong>always persistent</strong>, which means they handle updates in a memory-efficient way. This is achieved by an implementation technique called <em>structural sharing</em>. A persistent data structure <em>shares</em> its internal <em>structure</em> between its versions, which is completely safe to do, since none of the versions can ever be changed or destroyed.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">val</span> <span class="n">xs</span> <span class="k">=</span> <span class="mi">1</span> <span class="o">::</span> <span class="mi">2</span> <span class="o">::</span> <span class="mi">3</span> <span class="o">::</span> <span class="nc">Nil</span>
<span class="k">val</span> <span class="n">xxs</span> <span class="k">=</span> <span class="mi">0</span> <span class="o">::</span> <span class="n">xs</span> <span class="o">//</span> <span class="n">shares</span> <span class="o">(</span><span class="n">not</span> <span class="n">copies</span><span class="o">)</span> <span class="n">the</span> <span class="n">tail</span> <span class="k">with</span> <span class="n">xs</span></code></pre></figure>
<p>Another heavily used implementation technique is <em>path copying</em>. Modifying a persistent data structure (i.e., inserting, deleting or updating an element) often requires making <em>deep</em> changes. To do so, we simply <em>copy</em> its nested structures (persistent data structures are often backed by <a href="http://en.wikipedia.org/wiki/Algebraic_data_type">ADTs</a>) along the <em>path</em> to the element being modified. Both path copying and structural sharing aim to minimize the cost of modifying a persistent data structure: everything that <em>can’t be shared</em> (via structural sharing) <em>is copied</em> (via path copying).</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">def</span> <span class="n">concat</span><span class="o">[</span><span class="kt">A</span><span class="o">](</span><span class="n">xs</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">A</span><span class="o">],</span> <span class="n">ys</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">A</span><span class="o">])</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">A</span><span class="o">]</span> <span class="k">=</span>
<span class="k">if</span> <span class="o">(</span><span class="n">xs</span><span class="o">.</span><span class="n">isEmpty</span><span class="o">)</span> <span class="n">ys</span>
<span class="k">else</span> <span class="n">xs</span><span class="o">.</span><span class="n">head</span> <span class="o">::</span> <span class="n">concat</span><span class="o">(</span><span class="n">xs</span><span class="o">.</span><span class="n">tail</span><span class="o">,</span> <span class="n">ys</span><span class="o">)</span> <span class="o">//</span> <span class="n">copies</span> <span class="n">the</span> <span class="n">path</span> <span class="n">to</span> <span class="n">ys</span></code></pre></figure>
<p>Path copying is a quite lightweight operation that usually takes less than <code class="highlighter-rouge">O(n)</code> time to perform. There are, however, plenty of <em>specialized</em> data structures highly optimized for a concrete operation, performing it in amortized constant time (with no path copying). For example, <a href="http://ittc.ku.edu/~andygill/papers/IntMap98.pdf">Fast Mergeable Integer Maps</a> and <a href="http://www.math.tau.ac.il/~haimk/adv-ds-2000/okasaki-kaplan-tarjan-sicomp.ps">Persistent Catenable Lists</a> support constant-time <code class="highlighter-rouge">merge</code> and <code class="highlighter-rouge">concat</code> operations respectively.</p>
<h4 id="purely-functional-heaps">Purely Functional Heaps</h4>
<p>Tree-based data structures (i.e., trees, heaps and tries) are considered low-hanging fruit in a functional setting, since they map directly to <a href="http://en.wikipedia.org/wiki/Algebraic_data_type">Algebraic Data Types</a>. To a first approximation, a typical functional implementation of a persistent tree looks as follows.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">sealed</span> <span class="k">trait</span> <span class="nc">Tree</span><span class="o">[</span><span class="kt">+A</span><span class="o">]</span> <span class="o">{</span> <span class="k">def</span> <span class="n">value</span><span class="k">:</span> <span class="kt">A</span> <span class="o">}</span>
<span class="k">case</span> <span class="k">class</span> <span class="nc">Branch</span><span class="o">[</span><span class="kt">+A</span><span class="o">](</span><span class="n">value</span><span class="k">:</span> <span class="kt">A</span><span class="o">,</span> <span class="n">left</span><span class="k">:</span> <span class="kt">Tree</span><span class="o">[</span><span class="kt">A</span><span class="o">],</span> <span class="n">right</span><span class="k">:</span> <span class="kt">Tree</span><span class="o">[</span><span class="kt">A</span><span class="o">])</span> <span class="k">extends</span> <span class="nc">Tree</span><span class="o">[</span><span class="kt">A</span><span class="o">]</span>
<span class="k">case</span> <span class="k">object</span> <span class="nc">Leaf</span> <span class="k">extends</span> <span class="nc">Tree</span><span class="o">[</span><span class="kt">Nothing</span><span class="o">]</span></code></pre></figure>
<p>There are several purely functional implementations of heaps such as <a href="https://github.com/vkostyukov/scalacaster/blob/master/src/heap/LeftistHeap.scala">Leftist Heap</a>, <a href="https://github.com/vkostyukov/scalacaster/blob/master/src/heap/SkewHeap.scala">Skew Heap</a> and <a href="https://github.com/vkostyukov/scalacaster/blob/master/src/heap/PairingHeap.scala">Pairing Heap</a> with good asymptotic bounds. However, there are other heaps without proper functional implementations. The simplest of them are <a href="http://en.wikipedia.org/wiki/Binary_heap">Standard Binary Heaps</a>, which do not fit well into a functional environment since their reference implementation is backed by mutable arrays. Luckily, it’s quite possible to bring them into a purely functional world.</p>
<h4 id="standard-binary-heap">Standard Binary Heap</h4>
<p>A <em>binary heap</em> (Williams, 1964) is a data structure that implements a priority queue interface and guarantees logarithmic running time for the <code class="highlighter-rouge">insert</code> and <code class="highlighter-rouge">delete</code> operations and constant-time access to the <code class="highlighter-rouge">minimum</code>/<code class="highlighter-rouge">maximum</code> element. Binary heaps are commonly viewed as binary trees which satisfy two invariants:</p>
<ol>
<li>The <em>shape</em> invariant: the tree is a complete binary tree.</li>
<li>The <em>min-heap</em> invariant: each node is less than or equal to each of its
children.</li>
</ol>
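<p>Both invariants are easy to state as executable predicates. The sketch below is not from the post itself; it re-derives the two checks over a bare binary-tree ADT, using the “a perfect tree of height <code class="highlighter-rouge">h</code> has <code class="highlighter-rouge">2^h − 1</code> nodes” characterization that the insertion section relies on later.</p>

```scala
// A bare binary-tree ADT, just enough to express the two heap invariants.
sealed trait Tree[+A]
case object Leaf extends Tree[Nothing]
case class Branch[+A](value: A, left: Tree[A] = Leaf, right: Tree[A] = Leaf) extends Tree[A]

def size[A](t: Tree[A]): Int = t match {
  case Leaf            => 0
  case Branch(_, l, r) => size(l) + size(r) + 1
}

def height[A](t: Tree[A]): Int = t match {
  case Leaf            => 0
  case Branch(_, l, r) => math.max(height(l), height(r)) + 1
}

// A perfect tree of height h contains exactly 2^h - 1 nodes.
def isPerfect[A](t: Tree[A]): Boolean = size(t) == (1 << height(t)) - 1

// Shape invariant: "complete" means filled level by level, left to right.
def isComplete[A](t: Tree[A]): Boolean = t match {
  case Leaf => true
  case Branch(_, l, r) if height(l) == height(r)     => isPerfect(l) && isComplete(r)
  case Branch(_, l, r) if height(l) == height(r) + 1 => isComplete(l) && isPerfect(r)
  case _ => false
}

// Min-heap invariant: each node is less than or equal to each of its children.
def isMinHeap[A](t: Tree[A])(implicit ord: Ordering[A]): Boolean = t match {
  case Leaf => true
  case Branch(v, l, r) =>
    def rootOk(c: Tree[A]): Boolean = c match {
      case Leaf             => true
      case Branch(cv, _, _) => ord.lteq(v, cv)
    }
    rootOk(l) && rootOk(r) && isMinHeap(l) && isMinHeap(r)
}
```

<p>These predicates are only meant for property checks in tests; the heap implementation itself never needs to scan the whole tree, thanks to cached <code class="highlighter-rouge">size</code> and <code class="highlighter-rouge">height</code> fields.</p>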
<p>In Scala, a binary min-heap might be represented as an abstract <code class="highlighter-rouge">Heap</code> type with two variants: <code class="highlighter-rouge">Branch</code> and <code class="highlighter-rouge">Leaf</code>.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">sealed</span> <span class="k">trait</span> <span class="nc">Heap</span><span class="o">[</span><span class="kt">+A</span><span class="o">]</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">min</span><span class="k">:</span> <span class="kt">A</span>
<span class="k">def</span> <span class="n">left</span><span class="k">:</span> <span class="kt">Heap</span><span class="o">[</span><span class="kt">A</span><span class="o">]</span>
<span class="k">def</span> <span class="n">right</span><span class="k">:</span> <span class="kt">Heap</span><span class="o">[</span><span class="kt">A</span><span class="o">]</span>
<span class="k">def</span> <span class="n">isEmpty</span><span class="k">:</span> <span class="kt">Boolean</span>
<span class="c1">// Both 'size' and 'height' are stored in each node.
</span> <span class="k">val</span> <span class="n">size</span><span class="k">:</span> <span class="kt">Int</span>
<span class="k">val</span> <span class="n">height</span><span class="k">:</span> <span class="kt">Int</span>
<span class="o">}</span>
<span class="k">case</span> <span class="k">object</span> <span class="nc">Leaf</span> <span class="k">extends</span> <span class="nc">Heap</span><span class="o">[</span><span class="kt">Nothing</span><span class="o">]</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">min</span><span class="k">:</span> <span class="kt">Nothing</span> <span class="o">=</span> <span class="k">throw</span> <span class="k">new</span> <span class="nc">NoSuchElementException</span><span class="o">(</span><span class="s">"An empty heap."</span><span class="o">)</span>
<span class="k">def</span> <span class="n">left</span><span class="k">:</span> <span class="kt">Heap</span><span class="o">[</span><span class="kt">Nothing</span><span class="o">]</span> <span class="o">=</span> <span class="k">throw</span> <span class="k">new</span> <span class="nc">NoSuchElementException</span><span class="o">(</span><span class="s">"An empty heap."</span><span class="o">)</span>
<span class="k">def</span> <span class="n">right</span><span class="k">:</span> <span class="kt">Heap</span><span class="o">[</span><span class="kt">Nothing</span><span class="o">]</span> <span class="o">=</span> <span class="k">throw</span> <span class="k">new</span> <span class="nc">NoSuchElementException</span><span class="o">(</span><span class="s">"An empty heap."</span><span class="o">)</span>
<span class="k">val</span> <span class="n">size</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">val</span> <span class="n">height</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">def</span> <span class="n">isEmpty</span><span class="k">:</span> <span class="kt">Boolean</span> <span class="o">=</span> <span class="kc">true</span>
<span class="o">}</span>
<span class="k">case</span> <span class="k">class</span> <span class="nc">Branch</span><span class="o">[</span><span class="kt">+A</span><span class="o">](</span><span class="n">min</span><span class="k">:</span> <span class="kt">A</span><span class="o">,</span> <span class="n">left</span><span class="k">:</span> <span class="kt">Heap</span><span class="o">[</span><span class="kt">A</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Leaf</span><span class="o">,</span> <span class="n">right</span><span class="k">:</span> <span class="kt">Heap</span><span class="o">[</span><span class="kt">A</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Leaf</span><span class="o">)</span> <span class="k">extends</span> <span class="nc">Heap</span><span class="o">[</span><span class="kt">A</span><span class="o">]</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">isEmpty</span><span class="k">:</span> <span class="kt">Boolean</span> <span class="o">=</span> <span class="kc">false</span>
<span class="k">val</span> <span class="n">size</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="n">left</span><span class="o">.</span><span class="n">size</span> <span class="o">+</span> <span class="n">right</span><span class="o">.</span><span class="n">size</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">val</span> <span class="n">height</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="n">math</span><span class="o">.</span><span class="n">max</span><span class="o">(</span><span class="n">left</span><span class="o">.</span><span class="n">height</span><span class="o">,</span> <span class="n">right</span><span class="o">.</span><span class="n">height</span><span class="o">)</span> <span class="o">+</span> <span class="mi">1</span>
<span class="o">}</span></code></pre></figure>
<p>Note that the height of a heap is defined as the max height of its children plus one, while the size of a heap is defined as the sum of its children’s sizes plus one; both are calculated only once, in the heap’s constructor. Also, to simplify calculations, suppose that a singleton heap’s height is <code class="highlighter-rouge">1</code>.</p>
<p>Except for <code class="highlighter-rouge">height</code> and <code class="highlighter-rouge">size</code> operations, this signature looks like a classic functional implementation of a <a href="http://www.amazon.com/Purely-Functional-Structures-Chris-Okasaki/dp/0521663504">Binary Search Tree</a>. The two new operations are actually accessors to new fields in a heap - its height and size. These additional data should be accessible in constant time to define an efficient and simple <em>search criterion</em> for <code class="highlighter-rouge">insert</code> and <code class="highlighter-rouge">remove</code> operations.</p>
<h4 id="insertion-in-olog-n">Insertion in O(log n)</h4>
<p>Insertion into a functional binary heap must not violate either of its invariants - neither the shape invariant nor the min-heap invariant. For this purpose two problems should be solved. First, to maintain the shape invariant a new node should be inserted in the first empty spot at the last level of the heap. Second, to maintain the min-heap invariant the inserted node should be <em>bubbled up</em> to the heap root until it becomes greater than its parent.</p>
<p><img src="/images/designing-a-pfds/figure-1.png" alt="Figure 1" /></p>
<center><small>Figure 1: Eliminating min-heap invariant violations.</small></center>
<p>Bubbling up is quite a simple transformation that can be done at each level in constant time. There are two cases depending on whether the violation is at the left or right child (see “Figure 1” above). In either case the violation should be fixed by <em>swapping</em> two nodes - the root node and the child that violates the min-heap invariant. There is also a third case, in which nothing is violated. In that case, the heap should simply be rebuilt with the given parameters. In other words, all affected nodes should be copied in order to maintain data structure persistence. More precisely, <code class="highlighter-rouge">bubbleUp</code> and <code class="highlighter-rouge">insert</code> operations might be defined as follows.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">def</span> <span class="n">bubbleUp</span><span class="o">[</span><span class="kt">B</span> <span class="kt">:</span> <span class="kt">Ordering</span><span class="o">](</span><span class="n">x</span><span class="k">:</span> <span class="kt">B</span><span class="o">,</span> <span class="n">l</span><span class="k">:</span> <span class="kt">Heap</span><span class="o">[</span><span class="kt">B</span><span class="o">],</span> <span class="n">r</span><span class="k">:</span> <span class="kt">Heap</span><span class="o">[</span><span class="kt">B</span><span class="o">])</span><span class="k">:</span> <span class="kt">Heap</span><span class="o">[</span><span class="kt">B</span><span class="o">]</span> <span class="k">=</span> <span class="o">{</span>
<span class="k">val</span> <span class="n">ordering</span> <span class="k">=</span> <span class="nc">Ordering</span><span class="o">[</span><span class="kt">B</span><span class="o">];</span> <span class="k">import</span> <span class="nn">ordering._</span>
<span class="o">(</span><span class="n">l</span><span class="o">,</span> <span class="n">r</span><span class="o">)</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="o">(</span><span class="nc">Branch</span><span class="o">(</span><span class="n">y</span><span class="o">,</span> <span class="n">lt</span><span class="o">,</span> <span class="n">rt</span><span class="o">),</span> <span class="k">_</span><span class="o">)</span> <span class="k">if</span> <span class="o">(</span><span class="n">x</span> <span class="o">></span> <span class="n">y</span><span class="o">)</span> <span class="k">=></span>
<span class="nc">Branch</span><span class="o">(</span><span class="n">y</span><span class="o">,</span> <span class="nc">Branch</span><span class="o">(</span><span class="n">x</span><span class="o">,</span> <span class="n">lt</span><span class="o">,</span> <span class="n">rt</span><span class="o">),</span> <span class="n">r</span><span class="o">)</span>
<span class="k">case</span> <span class="o">(</span><span class="k">_</span><span class="o">,</span> <span class="nc">Branch</span><span class="o">(</span><span class="n">z</span><span class="o">,</span> <span class="n">lt</span><span class="o">,</span> <span class="n">rt</span><span class="o">))</span> <span class="k">if</span> <span class="o">(</span><span class="n">x</span> <span class="o">></span> <span class="n">z</span><span class="o">)</span> <span class="k">=></span>
<span class="nc">Branch</span><span class="o">(</span><span class="n">z</span><span class="o">,</span> <span class="n">l</span><span class="o">,</span> <span class="nc">Branch</span><span class="o">(</span><span class="n">x</span><span class="o">,</span> <span class="n">lt</span><span class="o">,</span> <span class="n">rt</span><span class="o">))</span>
<span class="k">case</span> <span class="o">(</span><span class="k">_</span><span class="o">,</span> <span class="k">_</span><span class="o">)</span> <span class="k">=></span>
<span class="nc">Branch</span><span class="o">(</span><span class="n">x</span><span class="o">,</span> <span class="n">l</span><span class="o">,</span> <span class="n">r</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="k">def</span> <span class="n">insert</span><span class="o">[</span><span class="kt">B</span> <span class="k">>:</span> <span class="kt">A</span> <span class="kt">:</span> <span class="kt">Ordering</span><span class="o">](</span><span class="n">x</span><span class="k">:</span> <span class="kt">B</span><span class="o">)</span><span class="k">:</span> <span class="kt">Heap</span><span class="o">[</span><span class="kt">B</span><span class="o">]</span> <span class="k">=</span>
<span class="k">if</span> <span class="o">(</span><span class="n">isEmpty</span><span class="o">)</span> <span class="nc">Branch</span><span class="o">(</span><span class="n">x</span><span class="o">)</span>
<span class="k">else</span> <span class="k">if</span> <span class="o">(???)</span> <span class="n">bubbleUp</span><span class="o">(</span><span class="n">min</span><span class="o">,</span> <span class="n">left</span><span class="o">,</span> <span class="n">right</span><span class="o">.</span><span class="n">insert</span><span class="o">(</span><span class="n">x</span><span class="o">))</span>
<span class="k">else</span> <span class="n">bubbleUp</span><span class="o">(</span><span class="n">min</span><span class="o">,</span> <span class="n">left</span><span class="o">.</span><span class="n">insert</span><span class="o">(</span><span class="n">x</span><span class="o">),</span> <span class="n">right</span><span class="o">)</span></code></pre></figure>
<p>The last thing to discuss is how to find a proper spot for a new node. The algorithm is based on the simple idea that a binary heap will always be a <em>complete</em> tree if it tends to be a <em>perfect</em> tree each time it’s modified. There are two definitions of perfect trees: <em>mathematical</em> and <em>recursive</em>. Mathematical definition: a perfect binary tree contains <code class="highlighter-rouge">2^h − 1</code> nodes, where <code class="highlighter-rouge">h</code> is the height of the tree (recall that a singleton heap’s height is <code class="highlighter-rouge">1</code> here). Recursive definition: a tree is perfect if its children are perfect trees of the same height. Combining these facts together, one can define search criteria which allow filling a heap level by level from left to right, thereby maintaining the shape invariant. In other words, new nodes should be inserted in such a way as to make the heap a perfect tree. This can be achieved by following the requirements of the recursive definition, using the mathematical definition as an efficient test of tree perfectness. Thus, the search criteria for insertion consist of four cases depending on whether the children are perfect trees and whether their heights are equal.</p>
<p><img src="/images/designing-a-pfds/figure-2.png" alt="Figure 2" /></p>
<center><small>Figure 2: Searching for the first empty spot in a heap.</small></center>
<p>The straightforward implementation of this idea (see “Figure 2” above) with four cases looks as follows.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">def</span> <span class="n">insert</span><span class="o">[</span><span class="kt">B</span> <span class="k">>:</span> <span class="kt">A</span> <span class="kt">:</span> <span class="kt">Ordering</span><span class="o">](</span><span class="n">x</span><span class="k">:</span> <span class="kt">B</span><span class="o">)</span><span class="k">:</span> <span class="kt">Heap</span><span class="o">[</span><span class="kt">B</span><span class="o">]</span> <span class="k">=</span>
<span class="k">if</span> <span class="o">(</span><span class="n">isEmpty</span><span class="o">)</span> <span class="nc">Branch</span><span class="o">(</span><span class="n">x</span><span class="o">)</span>
<span class="k">else</span> <span class="k">if</span> <span class="o">(</span><span class="n">left</span><span class="o">.</span><span class="n">size</span> <span class="o"><</span> <span class="n">math</span><span class="o">.</span><span class="n">pow</span><span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="n">left</span><span class="o">.</span><span class="n">height</span><span class="o">)</span> <span class="o">-</span> <span class="mi">1</span><span class="o">)</span>
<span class="n">bubbleUp</span><span class="o">(</span><span class="n">min</span><span class="o">,</span> <span class="n">left</span><span class="o">.</span><span class="n">insert</span><span class="o">(</span><span class="n">x</span><span class="o">),</span> <span class="n">right</span><span class="o">)</span>
<span class="k">else</span> <span class="k">if</span> <span class="o">(</span><span class="n">right</span><span class="o">.</span><span class="n">size</span> <span class="o"><</span> <span class="n">math</span><span class="o">.</span><span class="n">pow</span><span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="n">right</span><span class="o">.</span><span class="n">height</span><span class="o">)</span> <span class="o">-</span> <span class="mi">1</span><span class="o">)</span>
<span class="n">bubbleUp</span><span class="o">(</span><span class="n">min</span><span class="o">,</span> <span class="n">left</span><span class="o">,</span> <span class="n">right</span><span class="o">.</span><span class="n">insert</span><span class="o">(</span><span class="n">x</span><span class="o">))</span>
<span class="k">else</span> <span class="k">if</span> <span class="o">(</span><span class="n">right</span><span class="o">.</span><span class="n">height</span> <span class="o"><</span> <span class="n">left</span><span class="o">.</span><span class="n">height</span><span class="o">)</span>
<span class="n">bubbleUp</span><span class="o">(</span><span class="n">min</span><span class="o">,</span> <span class="n">left</span><span class="o">,</span> <span class="n">right</span><span class="o">.</span><span class="n">insert</span><span class="o">(</span><span class="n">x</span><span class="o">))</span>
<span class="k">else</span> <span class="n">bubbleUp</span><span class="o">(</span><span class="n">min</span><span class="o">,</span> <span class="n">left</span><span class="o">.</span><span class="n">insert</span><span class="o">(</span><span class="n">x</span><span class="o">),</span> <span class="n">right</span><span class="o">)</span></code></pre></figure>
<p>The <code class="highlighter-rouge">insert</code> operation performs <em>two</em> traversals along the search path of a heap. First, in a top-down manner it searches for the first empty spot in a heap thereby maintaining the shape invariant. Second, it <em>rebuilds</em> the affected nodes of a heap in a bottom-up manner thereby maintaining the min-heap invariant. Both traversals take no more than <code class="highlighter-rouge">O(log n)</code>, since the longest possible path in a complete tree is <code class="highlighter-rouge">log n</code>. Thus, the time complexity of insertion is <code class="highlighter-rouge">O(log n)</code>.</p>
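<p>To see insertion working end to end, here is a condensed, runnable assembly of the pieces above. The error-throwing accessors on <code class="highlighter-rouge">Leaf</code> and the placement of <code class="highlighter-rouge">bubbleUp</code> as a private method are implementation choices of this sketch, not spelled out in the post.</p>

```scala
sealed trait Heap[+A] {
  def min: A
  def left: Heap[A]
  def right: Heap[A]
  def isEmpty: Boolean
  val size: Int
  val height: Int

  // The search criterion: keep each subtree as close to a perfect tree as possible.
  def insert[B >: A : Ordering](x: B): Heap[B] =
    if (isEmpty) Branch(x)
    else if (left.size < math.pow(2, left.height) - 1)
      bubbleUp(min, left.insert(x), right)
    else if (right.size < math.pow(2, right.height) - 1)
      bubbleUp(min, left, right.insert(x))
    else if (right.height < left.height)
      bubbleUp(min, left, right.insert(x))
    else bubbleUp(min, left.insert(x), right)

  // Restores the min-heap invariant by swapping a node with an offending child.
  private def bubbleUp[B : Ordering](x: B, l: Heap[B], r: Heap[B]): Heap[B] = {
    val ordering = Ordering[B]; import ordering._
    (l, r) match {
      case (Branch(y, lt, rt), _) if x > y => Branch(y, Branch(x, lt, rt), r)
      case (_, Branch(z, lt, rt)) if x > z => Branch(z, l, Branch(x, lt, rt))
      case _                               => Branch(x, l, r)
    }
  }
}

case object Leaf extends Heap[Nothing] {
  val size: Int = 0
  val height: Int = 0
  def isEmpty: Boolean = true
  def min: Nothing = throw new NoSuchElementException("An empty heap.")
  def left: Heap[Nothing] = throw new NoSuchElementException("An empty heap.")
  def right: Heap[Nothing] = throw new NoSuchElementException("An empty heap.")
}

case class Branch[+A](min: A, left: Heap[A] = Leaf, right: Heap[A] = Leaf) extends Heap[A] {
  def isEmpty: Boolean = false
  val size: Int = left.size + right.size + 1
  val height: Int = math.max(left.height, right.height) + 1
}

// Folding a list into a heap performs n logarithmic insertions.
val h = List(5, 3, 8, 1, 9, 2).foldLeft(Leaf: Heap[Int])(_ insert _)
// h.min == 1, h.size == 6, h.height == 3
```

<p>Folding a list of <code class="highlighter-rouge">n</code> elements through <code class="highlighter-rouge">insert</code> builds a complete min-heap in <code class="highlighter-rouge">O(n log n)</code> time, and every intermediate heap remains a valid, fully usable version.</p>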
<h4 id="conclusion">Conclusion</h4>
<p>The most exciting thing about purely functional data structures is that there is always room for new ideas and techniques. Even today, this direction still attracts researchers and enthusiasts of functional programming. It’s been 15 years since <a href="http://www.amazon.com/Purely-Functional-Structures-Chris-Okasaki/dp/0521663504">Okasaki’s book</a>, and the field is <a href="http://cstheory.stackexchange.com/questions/1539/whats-new-in-purely-functional-data-structures-since-okasaki">still developing</a>: modern languages like Scala require modern and efficient data structures with optimal purely functional implementations.</p>
<p>The heap implementation in this post is based on the paper <a href="http://arxiv.org/pdf/1312.4666v1.pdf">A Functional Approach for Standard Binary Heaps, 2013</a>. The full source code (including the <code class="highlighter-rouge">remove</code> and <code class="highlighter-rouge">heapify</code> operations) is available <a href="https://github.com/vkostyukov/scalacaster/blob/master/src/heap/StandardHeap.scala">on GitHub</a>.</p>Vladimir KostyukovFunctional programming nicely leverages constraints on how programs are written thereby promoting a clean and easy to reason about coding style. Purely functional data structures are (surprisingly) built out of those constraints. They are persistent (FP implies that both old and new versions of an updated object are available) and backed by immutable objects (FP doesn’t support destructive updates). Needless to say, it’s a challenge to design a purely functional data structure that meets performance requirements of its imperative sibling. Fortunately, it’s quite possible in most of the cases, even for those data structures whose reference implementations are backed by mutable arrays. This post precisely describes a process of designing a purely functional implementation technique for Standard Binary Heaps, with the same asymptotic bounds as in an imperative setting.Combinatorial Algorithms in Scala2014-04-01T10:00:00+00:002014-04-01T10:00:00+00:00http://kostyukov.net/posts/combinatorial-algorithms-in-scala<p><a href="http://en.wikipedia.org/wiki/Combinatorics">Combinatorics</a> is a branch of mathematics that mostly focuses on problems of counting structures of a given size and kind. The most famous examples of such problems are often asked as job interview questions. 
This blog post presents four generation problems (<a href="http://en.wikipedia.org/wiki/Combination">combinations</a>, <a href="http://en.wikipedia.org/wiki/Subset">subsets</a>, <a href="http://en.wikipedia.org/wiki/Permutations">permutations</a> and <a href="http://en.wikipedia.org/wiki/Combination">variations</a>) along with their <em>purely functional</em> implementations in Scala.</p>
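<p>Before implementing anything, it helps to pin down what each operation should produce. Scala’s standard library already ships <code class="highlighter-rouge">combinations</code> and <code class="highlighter-rouge">permutations</code> on sequences, so they can serve as a reference point (subsets and variations are derivable from them); the <code class="highlighter-rouge">x</code>-prefixed methods developed in this post reimplement these from scratch.</p>

```scala
val xs = List(1, 2, 3)

// Combinations: order-insensitive selections of a given size.
val combs = xs.combinations(2).toList
// List(List(1, 2), List(1, 3), List(2, 3))

// Subsets: combinations of every size, from 0 to n.
val subs = (0 to xs.length).flatMap(xs.combinations).toList // 2^3 = 8 subsets

// Permutations: all orderings of the whole list (3! = 6 of them).
val perms = xs.permutations.toList

// Variations: order-sensitive selections of a given size,
// i.e. the permutations of every combination (3 * 2 = 6 of them).
val vars = xs.combinations(2).flatMap(_.permutations).toList
```

<p>Keeping these small examples around makes it easy to sanity-check each hand-rolled implementation against the standard library’s output.</p>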
<h4 id="implicit-classes">Implicit Classes</h4>
<p>Scala’s <a href="http://docs.scala-lang.org/sips/completed/implicit-classes.html"><em>implicit classes</em></a> provide a simple and composable way of extending the API of third-party classes. For example, the following implicit class extends the default <code class="highlighter-rouge">Int</code> class with a new method <code class="highlighter-rouge">times(fn: => Unit): Unit</code> that executes a given block <code class="highlighter-rouge">fn</code> n times.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">object</span> <span class="nc">IntOps</span> <span class="o">{</span>
<span class="k">implicit</span> <span class="k">class</span> <span class="nc">ExtendedInt</span><span class="o">(</span><span class="n">n</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">times</span><span class="o">(</span><span class="n">fn</span><span class="k">:</span> <span class="o">=></span> <span class="kt">Unit</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span>
<span class="o">(</span><span class="mi">0</span> <span class="n">until</span> <span class="n">n</span><span class="o">).</span><span class="n">foreach</span><span class="o">(</span><span class="k">_</span> <span class="o">=></span> <span class="n">fn</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span></code></pre></figure>
<p>This gives us a very neat usage pattern. All one needs to do is import the implicit class into the current namespace and let the magic happen.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">import</span> <span class="nn">IntOps._</span>
<span class="mf">5.</span><span class="n">times</span> <span class="o">{</span>
<span class="n">println</span><span class="o">(</span><span class="s">"Hello, World!"</span><span class="o">)</span>
<span class="o">}</span></code></pre></figure>
<p>We’ll use this approach in order to extend Scala’s <code class="highlighter-rouge">List</code> with four new methods that implement our combinatorial algorithms. The only restriction we have to satisfy here is that the new functions’ names shouldn’t conflict with the existing API. Thus, we’ll use the prefix <code class="highlighter-rouge">x</code> (from <em>eXtended</em>) for the new functions. The following listing represents a skeleton of the class we’re going to implement.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">object</span> <span class="nc">CombinatorialOps</span> <span class="o">{</span>
<span class="k">implicit</span> <span class="k">class</span> <span class="nc">CombinatorialList</span><span class="o">[</span><span class="kt">A</span><span class="o">](</span><span class="n">l</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">A</span><span class="o">])</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">xcombinations</span><span class="o">(</span><span class="n">n</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">A</span><span class="o">]]</span> <span class="k">=</span> <span class="o">???</span>
<span class="k">def</span> <span class="n">xsubsets</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">A</span><span class="o">]]</span> <span class="k">=</span> <span class="o">???</span>
<span class="k">def</span> <span class="n">xvariations</span><span class="o">(</span><span class="n">n</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">A</span><span class="o">]]</span> <span class="k">=</span> <span class="o">???</span>
<span class="k">def</span> <span class="n">xpermutations</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">A</span><span class="o">]]</span> <span class="k">=</span> <span class="o">???</span>
<span class="o">}</span>
<span class="o">}</span></code></pre></figure>
<p>This tiny class might be used as follows (exactly as <code class="highlighter-rouge">IntOps</code> was used above).</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">import</span> <span class="nn">CombinatorialOps._</span>
<span class="k">val</span> <span class="n">c</span> <span class="k">=</span> <span class="nc">List</span><span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="mi">2</span><span class="o">,</span> <span class="mi">3</span><span class="o">).</span><span class="n">xcombinations</span><span class="o">(</span><span class="mi">2</span><span class="o">)</span></code></pre></figure>
<h4 id="optimistic-programming">Optimistic Programming</h4>
<p><em>Optimistic Programming</em> is an implementation technique for recursive programs in which it is assumed that a recursive function works correctly on a smaller input (a sub-problem), so that its result may be used to solve the full-size problem. In other words, the body of a recursive function may be implemented in terms of the following ideas: (a) when called recursively, it gives the right answer for any sub-problem, and (b) some additional work is done to merge these sub-problem solutions into a single solution to the entire problem. Doesn’t that sound <em>optimistic</em>? The recursive function is presumed to be correctly implemented <em>before</em> its body is actually written.</p>
<p>Optimistic programming lies between the <a href="http://en.wikipedia.org/wiki/Divide_and_conquer_algorithms">Divide and Conquer</a> and <a href="http://en.wikipedia.org/wiki/Dynamic_programming">Dynamic Programming</a> techniques. Rather than focusing on how the sub-problems are split (whether or not they overlap), optimistic programming focuses on the nature of recursive programs and provides a simple tool that makes programming complex problems much easier.</p>
<p>We’ll use optimistic programming to solve combinatorial problems in a functional setting.</p>
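<p>As a toy illustration of this style (not from the original post), consider finding the maximum of a non-empty array: we simply believe that the recursive call returns the correct maximum of the tail, and only write the merge step.</p>

```java
// A hypothetical illustration of optimistic programming: the maximum of a
// non-empty array. We assume the recursive call is already correct for the
// smaller input (the tail) and only write the code that merges its result
// with the head element.
class OptimisticMax {
    static int max(int[] a, int from) {
        if (from == a.length - 1) return a[from];  // base case: one element
        int maxOfTail = max(a, from + 1);          // "believed" to be correct
        return Math.max(a[from], maxOfTail);       // merge step
    }
}
```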
<h4 id="combinations">Combinations</h4>
<p>Imagine you’re given a standard deck of fifty-two cards and asked to select any two of them. The pair of cards you select is called a <em>combination</em> (i.e., a <em>2-combination</em>). There are 1326 such 2-card combinations that can be selected from a standard deck. More formally, a <a href="http://en.wikipedia.org/wiki/Binomial_coefficient">binomial coefficient</a> defines the number of <em>k-combinations</em> of a set of <code class="highlighter-rouge">n</code> distinct elements.</p>
<p>The order of a combination’s elements <em>doesn’t</em> matter. So, <code class="highlighter-rouge">[a, b]</code> and <code class="highlighter-rouge">[b, a]</code> are the same combination.</p>
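<p>The count of 1326 above can be sanity-checked with a small binomial coefficient routine (a hypothetical helper, not part of the post’s code):</p>

```java
// Computes the binomial coefficient C(n, k) = n! / (k! * (n - k)!)
// iteratively; the running product stays integral at every step.
class Binomial {
    static long choose(int n, int k) {
        long result = 1;
        for (int i = 1; i <= k; i++) {
            result = result * (n - k + i) / i;  // result == C(n - k + i, i)
        }
        return result;
    }
}
```

For instance, <code class="highlighter-rouge">Binomial.choose(52, 2)</code> yields 1326, the number of 2-card hands in a standard deck.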
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="n">scala</span><span class="o">></span> <span class="nc">List</span><span class="o">(</span><span class="s">"a"</span><span class="o">,</span> <span class="s">"b"</span><span class="o">,</span> <span class="s">"c"</span><span class="o">).</span><span class="n">xcombinations</span><span class="o">(</span><span class="mi">2</span><span class="o">)</span>
<span class="n">res1</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">String</span><span class="o">]]</span> <span class="k">=</span> <span class="nc">List</span><span class="o">(</span><span class="nc">List</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">b</span><span class="o">),</span>
<span class="nc">List</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">c</span><span class="o">),</span>
<span class="nc">List</span><span class="o">(</span><span class="n">b</span><span class="o">,</span> <span class="n">c</span><span class="o">))</span></code></pre></figure>
<p>It’s time to use the power of optimistic programming to solve the problem of generating <em>k-combinations</em>. Optimistic programming guarantees that a recursive function called on a <em>sub-problem</em> produces a correct answer. A sub-problem of generating k-combinations is generating (k-1)-combinations. Only one question is left: how do we then solve the entire problem? This is where things become interesting. Obviously, there should be <em>an extra element</em> in the set which, added to a (k-1)-combination, upgrades it to a <em>full-size</em> k-combination. A set’s <em>extra</em> element is no different from a <em>regular</em> element. And the set <code class="highlighter-rouge">S</code> itself is a <em>recursive object</em>: without one element it is still a set <code class="highlighter-rouge">S'</code> and may be processed recursively. Thus, the final solution contains both <code class="highlighter-rouge">S'</code>’s k-combinations and <code class="highlighter-rouge">S'</code>’s (k-1)-combinations with the extra element appended.</p>
<p>There are also two corner cases that we have to handle separately. There is nothing to do when <code class="highlighter-rouge">k > n</code> (the combination’s size is greater than the entire set’s size). And no further grouping is required when generating 1-combinations.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="cm">/**
* Generates the combinations of this list with given length 'n'. The order
* doesn't matter.
*
* The total number of k-combinations on n-length set might be calculated
* as follows:
*
* C_k,n = n!/k!(n - k)!
*
* Time - O(C_k,n)
* Space - O(C_k,n)
*/</span>
<span class="k">def</span> <span class="n">xcombinations</span><span class="o">(</span><span class="n">n</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">A</span><span class="o">]]</span> <span class="k">=</span>
<span class="k">if</span> <span class="o">(</span><span class="n">n</span> <span class="o">></span> <span class="n">xsize</span><span class="o">)</span> <span class="nc">Nil</span>
<span class="k">else</span> <span class="n">l</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="k">_</span> <span class="o">::</span> <span class="k">_</span> <span class="k">if</span> <span class="n">n</span> <span class="o">==</span> <span class="mi">1</span> <span class="k">=></span>
<span class="n">l</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="nc">List</span><span class="o">(</span><span class="k">_</span><span class="o">))</span>
<span class="k">case</span> <span class="n">hd</span> <span class="o">::</span> <span class="n">tl</span> <span class="k">=></span>
<span class="n">tl</span><span class="o">.</span><span class="n">xcombinations</span><span class="o">(</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="n">hd</span> <span class="o">::</span> <span class="k">_</span><span class="o">)</span> <span class="o">:::</span> <span class="n">tl</span><span class="o">.</span><span class="n">xcombinations</span><span class="o">(</span><span class="n">n</span><span class="o">)</span>
<span class="k">case</span> <span class="k">_</span> <span class="k">=></span> <span class="nc">Nil</span>
<span class="o">}</span></code></pre></figure>
<h4 id="subsets">Subsets</h4>
<p>A set’s k-combination may also be referenced as a <em>subset</em>. The other combinatorial problem is generating all the subsets (all k-combinations, where <code class="highlighter-rouge">k = 1..n</code>) of a given set.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="n">scala</span><span class="o">></span> <span class="nc">List</span><span class="o">(</span><span class="s">"a"</span><span class="o">,</span> <span class="s">"b"</span><span class="o">,</span> <span class="s">"c"</span><span class="o">).</span><span class="n">xsubsets</span>
<span class="n">res1</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">String</span><span class="o">]]</span> <span class="k">=</span> <span class="nc">List</span><span class="o">(</span><span class="nc">List</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">b</span><span class="o">,</span> <span class="n">c</span><span class="o">),</span>
<span class="nc">List</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">b</span><span class="o">),</span>
<span class="nc">List</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">c</span><span class="o">),</span>
<span class="nc">List</span><span class="o">(</span><span class="n">b</span><span class="o">,</span> <span class="n">c</span><span class="o">),</span>
<span class="nc">List</span><span class="o">(</span><span class="n">a</span><span class="o">),</span>
<span class="nc">List</span><span class="o">(</span><span class="n">b</span><span class="o">),</span>
<span class="nc">List</span><span class="o">(</span><span class="n">c</span><span class="o">))</span></code></pre></figure>
<p>The implementation is straightforward: combinations of all possible sizes should be merged together, which may be done with List’s <code class="highlighter-rouge">foldLeft</code> operation.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="cm">/**
* Generates all the subsets of this list. The order doesn't matter.
*
* The total number of subsets might be obtained from variations formula:
*
 * S_n = sum(i=1..n) {C_i,n} = 2 ** n - 1
*
* Time - O(S_n)
* Space - O(S_n)
*/</span>
<span class="k">def</span> <span class="n">xsubsets</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">A</span><span class="o">]]</span> <span class="k">=</span>
<span class="o">(</span><span class="mi">2</span> <span class="n">to</span> <span class="n">xsize</span><span class="o">).</span><span class="n">foldLeft</span><span class="o">(</span><span class="n">l</span><span class="o">.</span><span class="n">xcombinations</span><span class="o">(</span><span class="mi">1</span><span class="o">))((</span><span class="n">a</span><span class="o">,</span> <span class="n">i</span><span class="o">)</span> <span class="k">=></span> <span class="n">l</span><span class="o">.</span><span class="n">xcombinations</span><span class="o">(</span><span class="n">i</span><span class="o">)</span> <span class="o">:::</span> <span class="n">a</span><span class="o">)</span></code></pre></figure>
<p>There are <code class="highlighter-rouge">2^n</code> subsets of an n-element set (<code class="highlighter-rouge">2^n - 1</code> if the empty subset is excluded, as it is here). It’s a choice of two: every element is either included in a particular subset or not.</p>
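<p>The two counts agree: summing binomial coefficients over the sizes 1..n, as <code class="highlighter-rouge">xsubsets</code> does, gives one less than <code class="highlighter-rouge">2^n</code>, since the empty subset is skipped. A small sanity check (hypothetical helper names):</p>

```java
// Counts the non-empty subsets of an n-element set by summing the
// binomial coefficients: C(n, 1) + ... + C(n, n) = 2^n - 1.
class SubsetCount {
    static long choose(int n, int k) {
        long result = 1;
        for (int i = 1; i <= k; i++) result = result * (n - k + i) / i;
        return result;
    }

    static long countNonEmptySubsets(int n) {
        long total = 0;
        for (int k = 1; k <= n; k++) total += choose(n, k);
        return total;
    }
}
```

<code class="highlighter-rouge">countNonEmptySubsets(3)</code> gives 7, matching the seven subsets in the listing above.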
<h4 id="variations">Variations</h4>
<p>Unlike combinations, the order of elements inside a variation <em>does</em> matter. Thus, the tuples <code class="highlighter-rouge">[a, b]</code> and <code class="highlighter-rouge">[b, a]</code> are different <em>variations</em> (i.e., <em>2-variations</em>). In general, variations are also known as <em>partial permutations</em> or <em>k-permutations</em>, where <code class="highlighter-rouge">0 < k <= n</code>.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="n">scala</span><span class="o">></span> <span class="nc">List</span><span class="o">(</span><span class="s">"a"</span><span class="o">,</span> <span class="s">"b"</span><span class="o">,</span> <span class="s">"c"</span><span class="o">).</span><span class="n">xvariations</span><span class="o">(</span><span class="mi">2</span><span class="o">)</span>
<span class="n">res1</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">String</span><span class="o">]]</span> <span class="k">=</span> <span class="nc">List</span><span class="o">(</span><span class="nc">List</span><span class="o">(</span><span class="n">b</span><span class="o">,</span> <span class="n">a</span><span class="o">),</span>
<span class="nc">List</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">b</span><span class="o">),</span>
<span class="nc">List</span><span class="o">(</span><span class="n">c</span><span class="o">,</span> <span class="n">a</span><span class="o">),</span>
<span class="nc">List</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">c</span><span class="o">),</span>
<span class="nc">List</span><span class="o">(</span><span class="n">c</span><span class="o">,</span> <span class="n">b</span><span class="o">),</span>
<span class="nc">List</span><span class="o">(</span><span class="n">b</span><span class="o">,</span> <span class="n">c</span><span class="o">))</span></code></pre></figure>
<p>The number of <em>k-permutations</em> of <code class="highlighter-rouge">n</code> is the following product: <code class="highlighter-rouge">n * (n-1) * ... * (n-k+1)</code>. That’s a bit different from a <em>binomial coefficient</em> (the number of <em>k-combinations</em> of <code class="highlighter-rouge">n</code>): there is no <code class="highlighter-rouge">k!</code> in the denominator, since we count <em>all</em> the possible k-permutations rather than treating them as equal.</p>
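<p>This product is easy to compute directly (a hypothetical helper, not part of <code class="highlighter-rouge">CombinatorialOps</code>):</p>

```java
// Counts the k-permutations (variations) of an n-element set:
// V(k, n) = n * (n - 1) * ... * (n - k + 1) = n! / (n - k)!
class Variations {
    static long count(int k, int n) {
        long result = 1;
        for (int i = 0; i < k; i++) result *= (n - i);
        return result;
    }
}
```

<code class="highlighter-rouge">Variations.count(2, 3)</code> gives 6, matching the six 2-variations listed above.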
<p>The same ideas of <em>optimistic programming</em> may be used to generate the variations (k-permutations) of a given set. The corner cases are the same: there’s nothing to do when <code class="highlighter-rouge">k > n</code> or <code class="highlighter-rouge">k = 1</code>. Just like in combinations, these two cases should be handled separately. More interesting is the regular case: upgrading a recursively generated (k-1)-permutation to a <em>full-size</em> one. Getting <em>an extra element</em> from the set is no longer the problem; rather, the <em>upgrade</em> itself is no longer a simple merge.</p>
<p>Since the order does matter, the extra element should be <em>inserted</em> into every possible position of a permutation rather than just appended to it. So, instead of the one-to-one mapping between unfinished and finished combinations, we get a one-to-k mapping for permutations: there are <code class="highlighter-rouge">k</code> positions in a (k-1)-permutation where the extra element may be inserted.</p>
<p>Ultimately, by analogy with k-combinations, the k-permutations of <code class="highlighter-rouge">S</code> also contain all the k-permutations of <code class="highlighter-rouge">S'</code>, where <code class="highlighter-rouge">S'</code> is <code class="highlighter-rouge">S</code> without one element.</p>
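<p>The insertion step can be sketched on its own: inserting an extra element into every possible position of a (k-1)-permutation yields exactly k new k-permutations. A hypothetical Java rendering of this one-to-k mapping:</p>

```java
import java.util.ArrayList;
import java.util.List;

// Inserts x into every possible position of perm: a (k-1)-element list
// yields k new k-element lists (the one-to-k mapping described above).
class InsertEverywhere {
    static List<List<String>> insertEverywhere(String x, List<String> perm) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i <= perm.size(); i++) {
            List<String> upgraded = new ArrayList<>(perm.subList(0, i));
            upgraded.add(x);                               // x at position i
            upgraded.addAll(perm.subList(i, perm.size())); // rest of perm
            result.add(upgraded);
        }
        return result;
    }
}
```

For example, inserting <code class="highlighter-rouge">"c"</code> into <code class="highlighter-rouge">["a", "b"]</code> yields the three lists <code class="highlighter-rouge">[c, a, b]</code>, <code class="highlighter-rouge">[a, c, b]</code> and <code class="highlighter-rouge">[a, b, c]</code>.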
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="cm">/**
* Generates the variations of this list with given length 'n'. The order
* does matter.
*
* The total number of variations might be calculated as follows:
*
* V_k,n = n!/(n - k)!
*
* Time - O(V_k,n)
* Space - O(V_k,n)
*/</span>
<span class="k">def</span> <span class="n">xvariations</span><span class="o">(</span><span class="n">n</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">A</span><span class="o">]]</span> <span class="k">=</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">mixmany</span><span class="o">(</span><span class="n">x</span><span class="k">:</span> <span class="kt">A</span><span class="o">,</span> <span class="n">ll</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">A</span><span class="o">]])</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">A</span><span class="o">]]</span> <span class="k">=</span> <span class="n">ll</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="n">hd</span> <span class="o">::</span> <span class="n">tl</span> <span class="k">=></span> <span class="n">foldone</span><span class="o">(</span><span class="n">x</span><span class="o">,</span> <span class="n">hd</span><span class="o">)</span> <span class="o">:::</span> <span class="n">mixmany</span><span class="o">(</span><span class="n">x</span><span class="o">,</span> <span class="n">tl</span><span class="o">)</span>
<span class="k">case</span> <span class="k">_</span> <span class="k">=></span> <span class="nc">Nil</span>
<span class="o">}</span>
<span class="k">def</span> <span class="n">foldone</span><span class="o">(</span><span class="n">x</span><span class="k">:</span> <span class="kt">A</span><span class="o">,</span> <span class="n">ll</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">A</span><span class="o">])</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">A</span><span class="o">]]</span> <span class="k">=</span>
<span class="o">(</span><span class="mi">1</span> <span class="n">to</span> <span class="n">ll</span><span class="o">.</span><span class="n">length</span><span class="o">).</span><span class="n">foldLeft</span><span class="o">(</span><span class="nc">List</span><span class="o">(</span><span class="n">x</span> <span class="o">::</span> <span class="n">ll</span><span class="o">))((</span><span class="n">a</span><span class="o">,</span> <span class="n">i</span><span class="o">)</span> <span class="k">=></span> <span class="o">(</span><span class="n">mixone</span><span class="o">(</span><span class="n">i</span><span class="o">,</span> <span class="n">x</span><span class="o">,</span> <span class="n">ll</span><span class="o">))</span> <span class="o">::</span> <span class="n">a</span><span class="o">)</span>
<span class="k">def</span> <span class="n">mixone</span><span class="o">(</span><span class="n">i</span><span class="k">:</span> <span class="kt">Int</span><span class="o">,</span> <span class="n">x</span><span class="k">:</span> <span class="kt">A</span><span class="o">,</span> <span class="n">ll</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">A</span><span class="o">])</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">A</span><span class="o">]</span> <span class="k">=</span>
<span class="n">ll</span><span class="o">.</span><span class="n">slice</span><span class="o">(</span><span class="mi">0</span><span class="o">,</span> <span class="n">i</span><span class="o">)</span> <span class="o">:::</span> <span class="o">(</span><span class="n">x</span> <span class="o">::</span> <span class="n">ll</span><span class="o">.</span><span class="n">slice</span><span class="o">(</span><span class="n">i</span><span class="o">,</span> <span class="n">ll</span><span class="o">.</span><span class="n">length</span><span class="o">))</span>
<span class="k">if</span> <span class="o">(</span><span class="n">n</span> <span class="o">></span> <span class="n">xsize</span><span class="o">)</span> <span class="nc">Nil</span>
<span class="k">else</span> <span class="n">l</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="k">_</span> <span class="o">::</span> <span class="k">_</span> <span class="k">if</span> <span class="n">n</span> <span class="o">==</span> <span class="mi">1</span> <span class="k">=></span> <span class="n">l</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="nc">List</span><span class="o">(</span><span class="k">_</span><span class="o">))</span>
<span class="k">case</span> <span class="n">hd</span> <span class="o">::</span> <span class="n">tl</span> <span class="k">=></span> <span class="n">mixmany</span><span class="o">(</span><span class="n">hd</span><span class="o">,</span> <span class="n">tl</span><span class="o">.</span><span class="n">xvariations</span><span class="o">(</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="o">))</span> <span class="o">:::</span> <span class="n">tl</span><span class="o">.</span><span class="n">xvariations</span><span class="o">(</span><span class="n">n</span><span class="o">)</span>
<span class="k">case</span> <span class="k">_</span> <span class="k">=></span> <span class="nc">Nil</span>
<span class="o">}</span>
<span class="o">}</span></code></pre></figure>
<h4 id="permutations">Permutations</h4>
<p><em>Permutations</em> are just full-size variations, i.e., k-permutations with <code class="highlighter-rouge">k = n</code>. A permutation may also be viewed as the result of a <em>shuffle</em> operation on a set. In other words, every shuffle of a deck of cards gives a new permutation. Permutations are counted by the product <code class="highlighter-rouge">n * (n-1) * ... * 1</code>, which is <code class="highlighter-rouge">n!</code>.</p>
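<p>Again, the count is easy to check numerically (hypothetical helper):</p>

```java
// Counts the permutations of an n-element set: P(n) = n!
class Permutations {
    static long factorial(int n) {
        long result = 1;
        for (int i = 2; i <= n; i++) result *= i;
        return result;
    }
}
```

<code class="highlighter-rouge">Permutations.factorial(3)</code> gives 6, matching the six permutations in the listing below.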
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="n">scala</span><span class="o">></span> <span class="nc">List</span><span class="o">(</span><span class="s">"a"</span><span class="o">,</span> <span class="s">"b"</span><span class="o">,</span> <span class="s">"c"</span><span class="o">).</span><span class="n">xpermutations</span>
<span class="n">res1</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">String</span><span class="o">]]</span> <span class="k">=</span> <span class="nc">List</span><span class="o">(</span><span class="nc">List</span><span class="o">(</span><span class="n">c</span><span class="o">,</span> <span class="n">b</span><span class="o">,</span> <span class="n">a</span><span class="o">),</span>
<span class="nc">List</span><span class="o">(</span><span class="n">c</span><span class="o">,</span> <span class="n">a</span><span class="o">,</span> <span class="n">b</span><span class="o">),</span>
<span class="nc">List</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">c</span><span class="o">,</span> <span class="n">b</span><span class="o">),</span>
<span class="nc">List</span><span class="o">(</span><span class="n">b</span><span class="o">,</span> <span class="n">c</span><span class="o">,</span> <span class="n">a</span><span class="o">),</span>
<span class="nc">List</span><span class="o">(</span><span class="n">b</span><span class="o">,</span> <span class="n">a</span><span class="o">,</span> <span class="n">c</span><span class="o">),</span>
<span class="nc">List</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">b</span><span class="o">,</span> <span class="n">c</span><span class="o">))</span></code></pre></figure>
<p>A purely functional algorithm for generating permutations is quite simple to implement in terms of variations.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="cm">/**
* Generates all permutations of this list. The order does matter.
*
* The total number of permutations might be calculated as follows:
*
* P_n = V_n,n = n!
*
* Time - O(n!)
* Space - O(n!)
*/</span>
<span class="k">def</span> <span class="n">xpermutations</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">A</span><span class="o">]]</span> <span class="k">=</span> <span class="n">xvariations</span><span class="o">(</span><span class="n">xsize</span><span class="o">)</span></code></pre></figure>
<h4 id="further-improvements">Further Improvements</h4>
<p>The full version of the <code class="highlighter-rouge">CombinatorialOps</code> class might be found <a href="https://gist.github.com/vkostyukov/9015987">at GitHub</a>. In order to reduce the memory footprint, a bit of laziness may be introduced by (a) replacing the output data type <code class="highlighter-rouge">List[List[A]]</code> with <code class="highlighter-rouge">Iterable[List[A]]</code> and (b) generating each piece of data <em>on demand</em>.</p>Vladimir KostyukovCombinatorics is a branch of mathematics that mostly focuses on problems of counting structures of a given size and kind. The most famous and well-known examples of such problems are often asked as job interview questions. This blog post presents four generation problems (combinations, subsets, permutations and variations) along with their purely functional implementations in Scala.Dual-Pivot Binary Search2014-02-06T10:00:00+00:002014-02-06T10:00:00+00:00http://kostyukov.net/posts/dual-pivot-binary-search<p>In 2009, Vladimir Yaroslavski introduced the <a href="http://iaroslavski.narod.ru/quicksort/DualPivotQuicksort.pdf">Dual-Pivot QuickSort</a> algorithm, which is currently the default sorting algorithm for primitive types in Java 8. The idea behind this algorithm is both simple and awesome. Instead of using a single pivot element, it uses two pivots that divide an input array into three intervals (against two intervals in the original <a href="http://en.wikipedia.org/wiki/Quicksort">QuickSort</a>). This allowed the height of the <a href="http://www.cs.cornell.edu/courses/cs3110/2012sp/lectures/lec20-master/lec20.html">recursion tree</a> to be decreased, as well as reducing the number of comparisons. This post describes a similar dual-pivot approach, but for the <a href="http://en.wikipedia.org/wiki/Binary_search_algorithm">Binary Search</a> algorithm; thus, our modified binary search algorithm gets the <em>Dual-Pivot</em> prefix.</p>
<p>First of all, consider a standard variation of a <em>binary search</em> algorithm.</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kt">int</span> <span class="nf">binarysearch</span><span class="o">(</span><span class="kt">int</span> <span class="n">a</span><span class="o">[],</span> <span class="kt">int</span> <span class="n">k</span><span class="o">,</span> <span class="kt">int</span> <span class="n">lo</span><span class="o">,</span> <span class="kt">int</span> <span class="n">hi</span><span class="o">)</span> <span class="o">{</span>
<span class="k">if</span> <span class="o">(</span><span class="n">lo</span> <span class="o">==</span> <span class="n">hi</span><span class="o">)</span> <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="o">;</span>
<span class="kt">int</span> <span class="n">p</span> <span class="o">=</span> <span class="n">lo</span> <span class="o">+</span> <span class="o">(</span><span class="n">hi</span> <span class="o">-</span> <span class="n">lo</span><span class="o">)</span> <span class="o">/</span> <span class="mi">2</span><span class="o">;</span>
<span class="k">if</span> <span class="o">(</span><span class="n">k</span> <span class="o"><</span> <span class="n">a</span><span class="o">[</span><span class="n">p</span><span class="o">])</span> <span class="o">{</span>
<span class="k">return</span> <span class="nf">binarysearch</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">k</span><span class="o">,</span> <span class="n">lo</span><span class="o">,</span> <span class="n">p</span><span class="o">);</span>
<span class="o">}</span> <span class="k">else</span> <span class="k">if</span> <span class="o">(</span><span class="n">k</span> <span class="o">></span> <span class="n">a</span><span class="o">[</span><span class="n">p</span><span class="o">])</span> <span class="o">{</span>
<span class="k">return</span> <span class="nf">binarysearch</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">k</span><span class="o">,</span> <span class="n">p</span> <span class="o">+</span> <span class="mi">1</span><span class="o">,</span> <span class="n">hi</span><span class="o">);</span>
<span class="o">}</span>
<span class="k">return</span> <span class="n">p</span><span class="o">;</span>
<span class="o">}</span></code></pre></figure>
<p>We’ll use the <a href="http://en.wikipedia.org/wiki/Master_theorem">Master Method</a> in order to understand its time complexity in terms of <a href="http://en.wikipedia.org/wiki/Big_O_notation">Big-Oh</a> notation. The idea behind the master method is to express the algorithm’s running time in terms of the following recurrence relation.</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="n">T</span><span class="o">(</span><span class="n">n</span><span class="o">)</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="n">T</span><span class="o">(</span><span class="n">n</span><span class="o">/</span><span class="n">b</span><span class="o">)</span> <span class="o">+</span> <span class="n">O</span><span class="o">(</span><span class="n">n</span><span class="o">^</span><span class="n">c</span><span class="o">)</span></code></pre></figure>
<p>The exact meaning of this relation is the following: the running time <code class="highlighter-rouge">T(n)</code> of the algorithm on input <code class="highlighter-rouge">n</code> is equal to the sum of the running times of each recursive call <code class="highlighter-rouge">T(n/b)</code> plus some extra work <code class="highlighter-rouge">O(n^c)</code> at each level of recursion. Note that the master method only works for <a href="http://en.wikipedia.org/wiki/Divide_and_conquer_algorithm">Divide and Conquer</a> algorithms.</p>
<p>The binary search algorithm does</p>
<ul>
<li>split the input into two equal intervals: <code class="highlighter-rouge">b = 2</code></li>
<li>perform only one recursive call, depending on whether the key is less than or greater than the pivot element: <code class="highlighter-rouge">a = 1</code></li>
<li>compare the key with the pivot element at each level of recursion, which takes a constant time: <code class="highlighter-rouge">c = 0</code></li>
</ul>
<p>Thus, the following recurrence relation describes the standard binary search algorithm, where <code class="highlighter-rouge">a = 1</code> (the number of recursive calls), <code class="highlighter-rouge">b = 2</code> (how many pieces we split the data into at each level of recursion) and <code class="highlighter-rouge">c = 0</code> (we also do some constant work at each recursive call).</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="n">T</span><span class="o">(</span><span class="n">n</span><span class="o">)</span> <span class="o">=</span> <span class="n">T</span><span class="o">(</span><span class="n">n</span><span class="o">/</span><span class="mi">2</span><span class="o">)</span> <span class="o">+</span> <span class="n">O</span><span class="o">(</span><span class="mi">1</span><span class="o">)</span></code></pre></figure>
<p>The relation <code class="highlighter-rouge">a == b ^ c</code> or <code class="highlighter-rouge">1 == 2 ^ 0</code> gives us the first case in a master method, which results in the running time <code class="highlighter-rouge">O(n^c * log_b n)</code> or <code class="highlighter-rouge">O(log_2 n)</code> in particular.</p>
<p>It’s time to use a dual-pivot element instead of a single-pivot one. This gives us three intervals and a couple of additional comparisons.</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="kt">int</span> <span class="nf">dualPivotBinarysearch</span><span class="o">(</span><span class="kt">int</span> <span class="n">a</span><span class="o">[],</span> <span class="kt">int</span> <span class="n">k</span><span class="o">,</span> <span class="kt">int</span> <span class="n">lo</span><span class="o">,</span> <span class="kt">int</span> <span class="n">hi</span><span class="o">)</span> <span class="o">{</span>
<span class="k">if</span> <span class="o">(</span><span class="n">lo</span> <span class="o">==</span> <span class="n">hi</span><span class="o">)</span> <span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="o">;</span>
<span class="kt">int</span> <span class="n">p</span> <span class="o">=</span> <span class="n">lo</span> <span class="o">+</span> <span class="o">(</span><span class="n">hi</span> <span class="o">-</span> <span class="n">lo</span><span class="o">)</span> <span class="o">/</span> <span class="mi">3</span><span class="o">;</span>
<span class="kt">int</span> <span class="n">q</span> <span class="o">=</span> <span class="n">lo</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="o">(</span><span class="n">hi</span> <span class="o">-</span> <span class="n">lo</span><span class="o">)</span> <span class="o">/</span> <span class="mi">3</span><span class="o">;</span>
<span class="k">if</span> <span class="o">(</span><span class="n">k</span> <span class="o"><</span> <span class="n">a</span><span class="o">[</span><span class="n">p</span><span class="o">])</span> <span class="o">{</span>
<span class="k">return</span> <span class="nf">dualPivotBinarysearch</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">k</span><span class="o">,</span> <span class="n">lo</span><span class="o">,</span> <span class="n">p</span><span class="o">);</span>
<span class="o">}</span> <span class="k">else</span> <span class="k">if</span> <span class="o">(</span><span class="n">k</span> <span class="o">></span> <span class="n">a</span><span class="o">[</span><span class="n">p</span><span class="o">]</span> <span class="o">&&</span> <span class="n">k</span> <span class="o"><</span> <span class="n">a</span><span class="o">[</span><span class="n">q</span><span class="o">])</span> <span class="o">{</span>
<span class="k">return</span> <span class="nf">dualPivotBinarysearch</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">k</span><span class="o">,</span> <span class="n">p</span> <span class="o">+</span> <span class="mi">1</span><span class="o">,</span> <span class="n">q</span><span class="o">);</span>
<span class="o">}</span> <span class="k">else</span> <span class="k">if</span> <span class="o">(</span><span class="n">k</span> <span class="o">></span> <span class="n">a</span><span class="o">[</span><span class="n">q</span><span class="o">])</span> <span class="o">{</span>
<span class="k">return</span> <span class="nf">dualPivotBinarysearch</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">k</span><span class="o">,</span> <span class="n">q</span> <span class="o">+</span> <span class="mi">1</span><span class="o">,</span> <span class="n">hi</span><span class="o">);</span>
<span class="o">}</span>
<span class="k">return</span> <span class="o">(</span><span class="n">k</span> <span class="o">==</span> <span class="n">a</span><span class="o">[</span><span class="n">p</span><span class="o">])</span> <span class="o">?</span> <span class="n">p</span> <span class="o">:</span> <span class="n">q</span><span class="o">;</span>
<span class="o">}</span></code></pre></figure>
<p>It should be clear now that the recurrence relation for a <em>dual-pivot</em> binary search looks as follows.</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><span class="n">T</span><span class="o">(</span><span class="n">n</span><span class="o">)</span> <span class="o">=</span> <span class="n">T</span><span class="o">(</span><span class="n">n</span><span class="o">/</span><span class="mi">3</span><span class="o">)</span> <span class="o">+</span> <span class="n">O</span><span class="o">(</span><span class="mi">1</span><span class="o">)</span></code></pre></figure>
<p>The only difference is the data <em>split factor</em>, which is <code class="highlighter-rouge">3</code> against <code class="highlighter-rouge">2</code> in the original relation. Thus, three intervals give us a new time complexity: <code class="highlighter-rouge">O(log_3 n)</code>. The careful reader may notice that a logarithm’s base is a constant factor, which is redundant and might be eliminated according to the Big-Oh definition. So, both algorithms have the same time bound: <code class="highlighter-rouge">O(log n)</code>, and it doesn’t really matter what the base of the <code class="highlighter-rouge">log</code> is.</p>
<p>That is only partially true. We usually don’t care about constant factors in asymptotic bounds, since they don’t affect the algorithm’s scalability. The only thing we care about is whether the algorithm is able to process a bigger (much bigger) input in a reasonable time or not. But when it comes to a deeper analysis of a particular algorithm implementation, it may be useful to keep the constant factors in mind as well.</p>
<p>The new time complexity gives us a shorter recursive tree: <code class="highlighter-rouge">log_3 n</code> against <code class="highlighter-rouge">log_2 n</code> levels. In other words, it gives us a shorter stack trace as well as a smaller memory footprint (due to the reduced number of allocated stack frames). For example, for n = 2 147 483 647 (the maximum length of an array that can be allocated on the JVM) we’ll have a roughly 40% shorter recursive tree. Isn’t it awesome? Not really. To be honest, that 40% is the difference between 31 and 19 levels (base <code class="highlighter-rouge">2</code> against base <code class="highlighter-rouge">3</code> in a logarithmic function). A logarithm is an awesome function: it takes a number and makes it small. I wish all algorithms had a logarithm in their asymptotic bounds.</p>
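A short sketch (hypothetical helper, not from the post) can check those tree heights: rounding the logarithms up gives 31 levels for the base-2 tree against about 20 for the base-3 one on the largest JVM array.

```java
public class TreeHeight {
    // Worst-case number of levels in a recursion tree that splits
    // an n-element interval into `base` parts per call: ceil(log_base n).
    static int levels(long n, int base) {
        return (int) Math.ceil(Math.log(n) / Math.log(base));
    }

    public static void main(String[] args) {
        long n = Integer.MAX_VALUE;          // 2 147 483 647, the largest JVM array
        int h2 = levels(n, 2);               // single-pivot tree height
        int h3 = levels(n, 3);               // dual-pivot tree height
        System.out.println(h2 + " vs " + h3); // 31 vs 20
    }
}
```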
<p>Well, what does it cost to make 31 recursive calls on a modern JVM (and a modern CPU)? I bet <em>nothing</em>. And this might be a strong reason why we didn’t study a dual-pivot binary search algorithm in a university course. Another reason is optimizing compilers (e.g., the <a href="http://en.wikipedia.org/wiki/HotSpot">HotSpot JIT compiler</a>) that can easily eliminate a <a href="http://en.wikipedia.org/wiki/Tail_call">tail call</a> by replacing it with a simple iterative loop. Therefore, all the hypothetical benefits of using a dual-pivot binary search might be completely lost.</p>
<p>Anyway, there is still an interesting part of the dual-pivot approach that we haven’t discussed yet: the <em>number of comparisons</em>. Using a dual-pivot element introduces a different number of comparisons per recursive call: <code class="highlighter-rouge">4</code> (against <code class="highlighter-rouge">2</code> in the classic scheme), which doesn’t sound promising but should still be investigated. And the easiest way to check whether it’s worth using a dual-pivot scheme is to look at a graphical representation of both functions: <code class="highlighter-rouge">2 log_2 n</code> and <code class="highlighter-rouge">4 log_3 n</code>.</p>
<p><img src="http://kostyukov.net/assets/images/chart.png" alt="A chart" /></p>
<p>The chart above shows that the dual-pivot scheme uses a bit more comparisons than the single-pivot one on the same input. More precisely, it uses about 26% more comparisons than the original algorithm (<code class="highlighter-rouge">4 / log_2 3 ≈ 2.52</code> against <code class="highlighter-rouge">2</code> comparisons per level of the base-2 tree). On the one hand, the dual-pivot approach gives a shorter recursive tree (fewer recursive calls), but on the other hand, a higher number of comparisons.</p>
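The gap between the two curves is easy to check numerically; a small sketch (hypothetical class name) shows the ratio <code class="highlighter-rouge">4 log_3 n / (2 log_2 n)</code> is a constant, about 1.26, regardless of the input size.

```java
public class ComparisonCount {
    // Total comparisons: 2 per level of a base-2 tree vs 4 per level of a base-3 tree.
    static double singlePivot(double n) { return 2 * Math.log(n) / Math.log(2); }
    static double dualPivot(double n)   { return 4 * Math.log(n) / Math.log(3); }

    public static void main(String[] args) {
        double n = Integer.MAX_VALUE;
        System.out.printf("%.1f vs %.1f comparisons%n", singlePivot(n), dualPivot(n));
        // The n-dependent factor cancels: the ratio is 2*ln(2)/ln(3) ~ 1.26.
        System.out.printf("ratio = %.2f%n", dualPivot(n) / singlePivot(n));
    }
}
```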
<p>Let me do some math in order to find a reasonable answer as to whether (and when) it’s worth using a dual-pivot binary search algorithm. A couple of new variables should be introduced: <code class="highlighter-rouge">p</code> - the latency of a recursive (or plain) call, and <code class="highlighter-rouge">q</code> - the latency of a compare operation (i.e., an integer compare). Now we can define the <em>total</em> running time of a binary search algorithm as follows.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">t<span class="o">(</span>binary search<span class="o">)</span> <span class="o">=</span> p <span class="k">*</span> log_2 n + 2q <span class="k">*</span> log_2 n </code></pre></figure>
<p>It’s straightforward: we spend <code class="highlighter-rouge">p * log_2 n</code> time doing recursive calls plus <code class="highlighter-rouge">2q * log_2 n</code> doing comparisons. A similar formula might be defined for a dual-pivot binary search algorithm.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">t<span class="o">(</span>dual-pivot binary search<span class="o">)</span> <span class="o">=</span> p <span class="k">*</span> log_3 n + 4q <span class="k">*</span> log_3 n </code></pre></figure>
<p>And we want to find a relation between <code class="highlighter-rouge">q</code> and <code class="highlighter-rouge">p</code> for which the following holds (we’re looking for the constraints under which the dual-pivot scheme takes less time).</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">t<span class="o">(</span>binary search<span class="o">)</span> <span class="o">></span> t<span class="o">(</span>dual-pivot binary search<span class="o">)</span>
.. or ..
<span class="o">(</span>p <span class="k">*</span> log_2 n<span class="o">)</span> + <span class="o">(</span>2q <span class="k">*</span> log_2 n<span class="o">)</span> <span class="o">></span> <span class="o">(</span>p <span class="k">*</span> log_3 n<span class="o">)</span> + <span class="o">(</span>4q <span class="k">*</span> log_3 n<span class="o">)</span></code></pre></figure>
<p>Solving this inequality gives us a concrete answer: <code class="highlighter-rouge">p > ~1.42 q</code> (roughly <code class="highlighter-rouge">1.5 q</code>). In other words, it makes sense to use the dual-pivot approach on a platform where making a function call costs at least about 1.5x as much as a compare operation.</p>
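Setting the two running times equal and solving for <code class="highlighter-rouge">p/q</code> makes the break-even point explicit; the sketch below (hypothetical class name) just evaluates the closed form that falls out of the algebra, since the <code class="highlighter-rouge">log n</code> factor cancels on both sides.

```java
public class BreakEven {
    // Solve p*log_2(n) + 2q*log_2(n) = p*log_3(n) + 4q*log_3(n) for p/q.
    // Dividing by ln(n) and rearranging gives:
    //   p * (ln3 - ln2) = q * (4*ln2 - 2*ln3)
    // so the break-even ratio is a pure constant, independent of n.
    static double breakEvenRatio() {
        double ln2 = Math.log(2), ln3 = Math.log(3);
        return (4 * ln2 - 2 * ln3) / (ln3 - ln2);
    }

    public static void main(String[] args) {
        // ~1.42: dual-pivot wins once a call is ~1.4x pricier than a compare.
        System.out.printf("dual-pivot wins when p > %.2f q%n", breakEvenRatio());
    }
}
```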
<p>That’s nice to know, but can we find a more concrete answer? Well, it’s not that easy. It really depends on the hardware platform (ISA, micro-architecture) as well as on the software platform (compiler, runtime). Suppose we use a compiler without tail-call optimization on a modern Intel CPU (like Haswell). Agner Fog’s <a href="http://www.agner.org/optimize/instruction_tables.pdf">optimization manual</a> says that it takes 1 or 2 clock ticks for both the <code class="highlighter-rouge">CMP</code> and <code class="highlighter-rouge">FUCOMI</code>/<code class="highlighter-rouge">FUCOMIP</code> instructions. Needless to say, <code class="highlighter-rouge">JMP</code>, <code class="highlighter-rouge">SUB</code> and <code class="highlighter-rouge">MOV</code> cost almost nothing: ~3-4 clock ticks in total. Why do we need these three instructions? Well, a usual <em>calling convention</em> does</p>
<ul>
<li>perform a jump to a function - <code class="highlighter-rouge">JMP</code></li>
<li>save the current stack pointer - <code class="highlighter-rouge">MOV</code></li>
<li>reserve a stack for locals - <code class="highlighter-rouge">SUB</code></li>
</ul>
<p>So, it roughly takes 3-4 clock ticks to make a function call on a modern x86 chip. And this is almost what we were looking for. We can say that it might be a good idea to use a dual-pivot binary search instead of the classic one. But the benefits we get in this case are so imperceptible that we won’t even see the difference. Only micro-benchmarking will help us find the truth.</p>
<p>The full source code of a <a href="http://openjdk.java.net/projects/code-tools/jmh/">JMH</a>-based benchmark is available <a href="https://gist.github.com/vkostyukov/6201007">at GitHub</a>. The results look as follows.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">Benchmark Mode Samples Mean Mean error Units
d.DPBS.benchmarkBS avgt 5 81.665 8.000 ns/op
d.DPBS.benchmarkDPBS avgt 5 69.563 8.410 ns/op</code></pre></figure>
<p>These performance results (70 nanoseconds vs. 80 nanoseconds on the largest array I managed to allocate on my MacBook Pro) sum up to a very robust conclusion: the classic binary search algorithm is fast as hell. Seriously, it’s one of the fastest algorithms around. Just think about it: we spent 80 nanoseconds (read it again - <em>nanoseconds</em>) to search through a 2G-element array. That’s crazy fast, and the 10 ns difference (read it again - <em>nanoseconds</em>) is just a sort of quantum side effect. So, if you have a bunch of ordered numbers and you want to perform a search on them - relax and use <code class="highlighter-rouge">Arrays.binarySearch()</code> or even <a href="http://reprog.wordpress.com/2010/04/19/are-you-one-of-the-10-percent/">write your own implementation</a> for a <a href="https://github.com/vkostyukov/la4j/blob/master/src/main/java/org/la4j/matrix/sparse/CRSMatrix.java#L407">particular case</a>.</p>
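Since the standard library route is the recommended one, a short usage sketch may help: <code class="highlighter-rouge">Arrays.binarySearch</code> returns the key’s index on a hit, and the encoded insertion point <code class="highlighter-rouge">-(insertionPoint) - 1</code> on a miss.

```java
import java.util.Arrays;

public class BuiltInSearch {
    public static void main(String[] args) {
        int[] a = {2, 3, 5, 8, 13, 21, 34};      // input must be sorted
        int hit  = Arrays.binarySearch(a, 13);   // index of the key
        int miss = Arrays.binarySearch(a, 7);    // -(insertion point) - 1
        System.out.println(hit);                 // 4
        System.out.println(miss);                // -4: 7 would be inserted at index 3
    }
}
```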
<p>The point is that we don’t need a dual-pivot approach, since it gives you almost nothing on modern platforms. The aim of this post was to find a reasonable answer to the question of why there’s still no dual-pivot binary search around. I didn’t want to get a <em>faster</em> version of the original binary search, which surely can be done by rewriting the tail recursion as iteration (but it’s not even necessary - just think of 31 recursive calls in the worst case). I just wanted to show how to use complexity analysis along with math and knowledge about your platform in order to dig into an interesting question and have fun.</p>Vladimir KostyukovIn 2009, Vladimir Yaroslavski introduced the Dual-Pivot QuickSort algorithm, which is currently the default sorting algorithm for primitive types in Java 8. The idea behind this algorithm is both simple and awesome. Instead of using a single pivot element, it uses two pivots that divide an input array into three intervals (against two intervals in the original QuickSort). This allows us to decrease the height of the recursion tree as well as reduce the number of comparisons. This post describes a similar dual-pivot approach, but for a binary search algorithm. Thus, our modified binary search algorithm has the prefix Dual-Pivot.Finagle Your Fibonacci Calculation2014-02-01T10:00:00+00:002014-02-01T10:00:00+00:00http://kostyukov.net/posts/finagle-your-fibonacci-calculation<p><a href="http://twitter.github.io/finagle/">Finagle</a> is an RPC library for the JVM that allows you to develop service-based applications in a protocol-agnostic way. Formally, the Finagle library provides both an asynchronous runtime via <a href="http://twitter.github.io/finagle/guide/Futures.html">futures</a> and protocol independence via <a href="http://twitter.github.io/finagle/guide/ServerAnatomy.html">codecs</a>. 
In this post I will try to build a Finagle-powered distributed <a href="http://en.wikipedia.org/wiki/Fibonacci_number">Fibonacci Numbers</a> calculator that scales up to thousands of nodes.</p>
<h3 id="topology-design">Topology Design</h3>
<p>Let’s start with the requirements. At first glance, we might want our system to be both <a href="http://en.wikipedia.org/wiki/Fault_tolerance">fault-tolerant</a> and <a href="http://en.wikipedia.org/wiki/Scalability">scalable</a>. These are typical requirements for any kind of distributed system. And the good news is that Finagle provides a corresponding set of building blocks and mechanisms (such as load balancing, retrying, monitoring, etc.) that allows the developer to easily write <em>reusable</em>, scalable, and fault-tolerant code without particular knowledge of concrete protocols.</p>
<p>Anyway, things like scalability should be addressed at a different level than the framework or library level. Systems should be scalable by design, not because of some fancy tool. Thus, we must keep this in mind at every stage of the application’s life cycle.</p>
<p>In order to design a scalable system, we have to understand the problem we’re trying to solve. The classic Fibonacci calculation algorithm builds a recursive tree with height <code class="highlighter-rouge">O(n)</code> and branching factor <code class="highlighter-rouge">2</code>. Thus, the most natural and suitable <a href="http://www.openp2p.com/pub/a/p2p/2002/01/08/p2p_topologies_pt2.html">service topology</a> here is a hierarchical one. A hierarchical or tree-based topology satisfies both the scalability and fault-tolerance requirements. So, the distributed Fibonacci calculator might be viewed as follows.</p>
<p><img src="http://kostyukov.net/assets/images/fibonacci-design.png" alt="High-Level Design" /></p>
<p>In other words, we simply map every node of the recursive tree (the algorithm’s abstraction) to a physical/distributed node. The proposed topology tree has two kinds of nodes: <em>leaf</em> nodes labeled <code class="highlighter-rouge">W</code> (workers) and <em>branch</em> nodes labeled <code class="highlighter-rouge">F</code> (<a href="http://en.wikipedia.org/wiki/Fan-out">fanouts</a>). The worker node is our workhorse that does all the magic, while the fanout node doesn’t really perform the calculation but implements a <em>map-reduce</em> approach by delegating the sub-problems to its child nodes. The number of nodes in such a tree is unlimited, but it doesn’t really make sense to have more workers than the number of logical cores in your CPU. For example, a suitable configuration for a typical <a href="http://en.wikipedia.org/wiki/Haswell_(microarchitecture)">Haswell</a> laptop with four logical cores looks exactly like the picture above.</p>
<h3 id="finagle-power">Finagle Power</h3>
<p>Finagle’s API provides three robust building blocks: <a href="http://twitter.github.io/finagle/guide/Futures.html">futures</a>, and <a href="http://twitter.github.io/finagle/guide/ServicesAndFilters.html">filters and services</a>. All the building blocks are designed to be composable in a very neat way. Thus, keeping in mind that futures are single-element <em>immutable containers</em> while services and filters are <em>just functions</em>, it’s really simple to reason about Finagle-powered code.</p>
<p>Finagle is a <em>service-oriented</em> platform, so all the interactions between servers and clients are built around services. Servers implement their behavior via services, while clients interact with servers via services. Finally, a service is just a function that takes a value of type <code class="highlighter-rouge">A</code> and returns a future of type <code class="highlighter-rouge">B</code>.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">trait</span> <span class="nc">Service</span><span class="o">[</span><span class="kt">A</span>, <span class="kt">B</span><span class="o">]</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">apply</span><span class="o">(</span><span class="n">a</span><span class="k">:</span> <span class="kt">A</span><span class="o">)</span><span class="k">:</span> <span class="kt">Future</span><span class="o">[</span><span class="kt">B</span><span class="o">]</span>
<span class="o">}</span></code></pre></figure>
<p>The <code class="highlighter-rouge">Future</code> type represents a placeholder for a response being sent from a server. Programming with futures is an asynchronous programming discipline that relies on transforming values rather than reasoning about a sequence of events and callbacks.</p>
<p>The last but not least thing to discuss is Finagle’s filters, which are essentially <a href="http://en.wikipedia.org/wiki/Decorator_pattern"><em>decorators</em></a> for services. Filters allow us to change the behavior of services at runtime, as well as to change their types and get some help from Scala’s type checker at compile time.</p>
<h3 id="abstractions">Abstractions</h3>
<p>Let’s start with the cornerstone abstraction: a Fibonacci calculator that takes the index of a Fibonacci number as a <code class="highlighter-rouge">BigInt</code> and returns a future of its value. It’s also a good idea to predefine some useful <code class="highlighter-rouge">BigInt</code> values in the same trait.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">trait</span> <span class="nc">FibonacciCalculator</span> <span class="o">{</span>
<span class="k">val</span> <span class="nc">Zero</span> <span class="k">=</span> <span class="nc">BigInt</span><span class="o">(</span><span class="mi">0</span><span class="o">)</span>
<span class="k">val</span> <span class="nc">One</span> <span class="k">=</span> <span class="nc">BigInt</span><span class="o">(</span><span class="mi">1</span><span class="o">)</span>
<span class="k">val</span> <span class="nc">Two</span> <span class="k">=</span> <span class="nc">BigInt</span><span class="o">(</span><span class="mi">2</span><span class="o">)</span>
<span class="k">def</span> <span class="n">calculate</span><span class="o">(</span><span class="n">n</span><span class="k">:</span> <span class="kt">BigInt</span><span class="o">)</span><span class="k">:</span> <span class="kt">Future</span><span class="o">[</span><span class="kt">BigInt</span><span class="o">]</span>
<span class="o">}</span></code></pre></figure>
<p>Now we can define a worker node implementation that uses a <a href="http://stackoverflow.com/questions/19045936/scalas-for-comprehension-with-futures">for-comprehension</a> for future pipelining (<em>sequential composition</em>). The straightforward implementation looks exactly like the classic recursive algorithm.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">object</span> <span class="nc">LocalFibonacciCalculator</span> <span class="k">extends</span> <span class="nc">FibonacciCalculator</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">calculate</span><span class="o">(</span><span class="n">n</span><span class="k">:</span> <span class="kt">BigInt</span><span class="o">)</span><span class="k">:</span> <span class="kt">Future</span><span class="o">[</span><span class="kt">BigInt</span><span class="o">]</span> <span class="k">=</span>
<span class="k">if</span> <span class="o">(</span><span class="n">n</span><span class="o">.</span><span class="n">equals</span><span class="o">(</span><span class="nc">Zero</span><span class="o">)</span> <span class="o">||</span> <span class="n">n</span><span class="o">.</span><span class="n">equals</span><span class="o">(</span><span class="nc">One</span><span class="o">))</span> <span class="nc">Future</span><span class="o">.</span><span class="n">value</span><span class="o">(</span><span class="n">n</span><span class="o">)</span>
<span class="k">else</span> <span class="k">for</span> <span class="o">{</span> <span class="n">a</span> <span class="k"><-</span> <span class="n">calculate</span><span class="o">(</span><span class="n">n</span> <span class="o">-</span> <span class="nc">One</span><span class="o">)</span>
<span class="n">b</span> <span class="k"><-</span> <span class="n">calculate</span><span class="o">(</span><span class="n">n</span> <span class="o">-</span> <span class="nc">Two</span><span class="o">)</span> <span class="o">}</span> <span class="k">yield</span> <span class="o">(</span><span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="o">)</span>
<span class="o">}</span></code></pre></figure>
<p>Thus, the fanout node implementation might be defined as follows.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">class</span> <span class="nc">FanoutFibonacciCalculator</span><span class="o">(</span>
<span class="n">left</span><span class="k">:</span> <span class="kt">FibonacciCalculator</span><span class="o">,</span>
<span class="n">right</span><span class="k">:</span> <span class="kt">FibonacciCalculator</span><span class="o">)</span> <span class="k">extends</span> <span class="nc">FibonacciCalculator</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">calculate</span><span class="o">(</span><span class="n">n</span><span class="k">:</span> <span class="kt">BigInt</span><span class="o">)</span><span class="k">:</span> <span class="kt">Future</span><span class="o">[</span><span class="kt">BigInt</span><span class="o">]</span> <span class="k">=</span>
<span class="k">if</span> <span class="o">(</span><span class="n">n</span><span class="o">.</span><span class="n">equals</span><span class="o">(</span><span class="nc">Zero</span><span class="o">)</span> <span class="o">||</span> <span class="n">n</span><span class="o">.</span><span class="n">equals</span><span class="o">(</span><span class="nc">One</span><span class="o">))</span> <span class="nc">Future</span><span class="o">.</span><span class="n">value</span><span class="o">(</span><span class="n">n</span><span class="o">)</span>
<span class="k">else</span> <span class="o">{</span>
<span class="k">val</span> <span class="n">seq</span> <span class="k">=</span> <span class="nc">Seq</span><span class="o">(</span><span class="n">left</span><span class="o">.</span><span class="n">calculate</span><span class="o">(</span><span class="n">n</span> <span class="o">-</span> <span class="nc">One</span><span class="o">),</span> <span class="n">right</span><span class="o">.</span><span class="n">calculate</span><span class="o">(</span><span class="n">n</span> <span class="o">-</span> <span class="nc">Two</span><span class="o">))</span>
<span class="nc">Future</span><span class="o">.</span><span class="n">collect</span><span class="o">(</span><span class="n">seq</span><span class="o">)</span> <span class="n">map</span> <span class="o">{</span> <span class="k">_</span><span class="o">.</span><span class="n">sum</span> <span class="o">}</span>
<span class="o">}</span>
<span class="o">}</span></code></pre></figure>
<p>The fanout calculator uses the <em>concurrent compositor</em> <code class="highlighter-rouge">Future.collect()</code> (which takes a sequence of futures and returns a future of a sequence) in order to process the left and right sub-trees in parallel. The last future transformation performed by the fanout calculator is summing up the sequence.</p>
<p>In our system, we will use the String-based transport layer provided by Finagle’s <a href="http://twitter.github.io/finagle/guide/ServerAnatomy.html">Echo Server</a> example, which means we need to provide a suitable <em>adapter</em> implementation that adapts the String-based service <code class="highlighter-rouge">Service[String, String]</code> to the <code class="highlighter-rouge">FibonacciCalculator</code> interface. This will allow us to use remote workers as a fanout node’s children.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">class</span> <span class="nc">RemoteFibonacciCalculator</span><span class="o">(</span><span class="n">remote</span><span class="k">:</span> <span class="kt">Service</span><span class="o">[</span><span class="kt">String</span>, <span class="kt">String</span><span class="o">])</span>
<span class="k">extends</span> <span class="nc">FibonacciCalculator</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">calculate</span><span class="o">(</span><span class="n">n</span><span class="k">:</span> <span class="kt">BigInt</span><span class="o">)</span><span class="k">:</span> <span class="kt">Future</span><span class="o">[</span><span class="kt">BigInt</span><span class="o">]</span> <span class="k">=</span>
<span class="n">remote</span><span class="o">(</span><span class="n">n</span><span class="o">.</span><span class="n">toString</span><span class="o">)</span> <span class="n">map</span> <span class="o">{</span> <span class="nc">BigInt</span><span class="o">(</span><span class="k">_</span><span class="o">)</span> <span class="o">}</span>
<span class="o">}</span></code></pre></figure>
<p>The good news here is that <code class="highlighter-rouge">BigInt</code> can be converted to a <code class="highlighter-rouge">String</code> (and vice versa) out of the box, so we can easily perform the conversion in one line.</p>
<p>Now we’re ready to set up our service, which takes a Fibonacci calculator and delegates the clients’ requests to it. A bit of type conversion should also be done here. The <code class="highlighter-rouge">FibonacciService</code> can be treated as an <em>adapter</em> from the <code class="highlighter-rouge">FibonacciCalculator</code> to the <code class="highlighter-rouge">Service</code> interface.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">class</span> <span class="nc">FibonacciService</span><span class="o">(</span><span class="n">calculator</span><span class="k">:</span> <span class="kt">FibonacciCalculator</span><span class="o">)</span>
<span class="k">extends</span> <span class="nc">Service</span><span class="o">[</span><span class="kt">String</span>, <span class="kt">String</span><span class="o">]</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">apply</span><span class="o">(</span><span class="n">req</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span><span class="k">:</span> <span class="kt">Future</span><span class="o">[</span><span class="kt">String</span><span class="o">]</span> <span class="k">=</span>
<span class="n">calculator</span><span class="o">.</span><span class="n">calculate</span><span class="o">(</span><span class="nc">BigInt</span><span class="o">(</span><span class="n">req</span><span class="o">))</span> <span class="n">map</span> <span class="o">{</span> <span class="k">_</span><span class="o">.</span><span class="n">toString</span> <span class="o">}</span>
<span class="o">}</span></code></pre></figure>
<h3 id="server-and-client-configurations">Server and Client Configurations</h3>
<p>Finally, we can define a server that serves our Fibonacci service. The launcher should allow the user to run either a worker (leaf) node or a fanout node by specifying the corresponding command-line options. The complete implementation looks as follows.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">object</span> <span class="nc">FibonacciServerLauncher</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">main</span><span class="o">(</span><span class="n">args</span><span class="k">:</span> <span class="kt">Array</span><span class="o">[</span><span class="kt">String</span><span class="o">])</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="n">main</span><span class="o">(</span><span class="n">args</span><span class="o">.</span><span class="n">toSeq</span><span class="o">)</span>
<span class="k">def</span> <span class="n">main</span><span class="o">(</span><span class="n">args</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">String</span><span class="o">])</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="n">args</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="nc">Seq</span><span class="o">(</span><span class="s">"leaf"</span><span class="o">,</span> <span class="n">port</span><span class="o">)</span> <span class="k">=></span>
<span class="k">val</span> <span class="n">service</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">FibonacciService</span><span class="o">(</span><span class="nc">LocalFibonacciCalculator</span><span class="o">)</span>
<span class="nc">Await</span><span class="o">.</span><span class="n">ready</span><span class="o">(</span><span class="nc">FibonacciServer</span><span class="o">.</span><span class="n">serve</span><span class="o">(</span><span class="s">":"</span> <span class="o">+</span> <span class="n">port</span><span class="o">,</span> <span class="n">service</span><span class="o">))</span>
<span class="k">case</span> <span class="nc">Seq</span><span class="o">(</span><span class="s">"node"</span><span class="o">,</span> <span class="n">port</span><span class="o">,</span> <span class="n">left</span><span class="o">,</span> <span class="n">right</span><span class="o">)</span> <span class="k">=></span>
<span class="c1">// remote services
</span> <span class="k">val</span> <span class="n">ls</span> <span class="k">=</span> <span class="nc">FibonacciClient</span><span class="o">.</span><span class="n">newService</span><span class="o">(</span><span class="s">"localhost:"</span> <span class="o">+</span> <span class="n">left</span><span class="o">)</span>
<span class="k">val</span> <span class="n">rs</span> <span class="k">=</span> <span class="nc">FibonacciClient</span><span class="o">.</span><span class="n">newService</span><span class="o">(</span><span class="s">"localhost:"</span> <span class="o">+</span> <span class="n">right</span><span class="o">)</span>
<span class="c1">// remote calculators
</span> <span class="k">val</span> <span class="n">lc</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">RemoteFibonacciCalculator</span><span class="o">(</span><span class="n">ls</span><span class="o">)</span>
<span class="k">val</span> <span class="n">rc</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">RemoteFibonacciCalculator</span><span class="o">(</span><span class="n">rs</span><span class="o">)</span>
<span class="c1">// a fanout
</span> <span class="k">val</span> <span class="n">service</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">FibonacciService</span><span class="o">(</span><span class="k">new</span> <span class="nc">FanoutFibonacciCalculator</span><span class="o">(</span><span class="n">lc</span><span class="o">,</span> <span class="n">rc</span><span class="o">))</span>
<span class="nc">Await</span><span class="o">.</span><span class="n">ready</span><span class="o">(</span><span class="nc">FibonacciServer</span><span class="o">.</span><span class="n">serve</span><span class="o">(</span><span class="s">":"</span> <span class="o">+</span> <span class="n">port</span><span class="o">,</span> <span class="n">service</span><span class="o">))</span>
<span class="k">case</span> <span class="k">_</span> <span class="k">=></span> <span class="n">println</span><span class="o">(</span><span class="s">"Bad arguments!"</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span></code></pre></figure>
<p>The client launcher, on the other hand, looks much simpler.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">object</span> <span class="nc">FibonacciClientLauncher</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">main</span><span class="o">(</span><span class="n">args</span><span class="k">:</span> <span class="kt">Array</span><span class="o">[</span><span class="kt">String</span><span class="o">])</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="n">main</span><span class="o">(</span><span class="n">args</span><span class="o">.</span><span class="n">toSeq</span><span class="o">)</span>
<span class="k">def</span> <span class="n">main</span><span class="o">(</span><span class="n">args</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">String</span><span class="o">])</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="n">args</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="nc">Seq</span><span class="o">(</span><span class="n">port</span><span class="o">,</span> <span class="n">req</span><span class="o">)</span> <span class="k">=></span>
<span class="k">val</span> <span class="n">client</span> <span class="k">=</span> <span class="nc">FibonacciClient</span><span class="o">.</span><span class="n">newService</span><span class="o">(</span><span class="s">"localhost:"</span> <span class="o">+</span> <span class="n">port</span><span class="o">)</span>
<span class="k">val</span> <span class="n">rep</span> <span class="k">=</span> <span class="nc">Await</span><span class="o">.</span><span class="n">result</span><span class="o">(</span><span class="n">client</span><span class="o">(</span><span class="n">req</span><span class="o">))</span>
<span class="n">printf</span><span class="o">(</span><span class="s">"Fibonacci(%s) is %s\n"</span><span class="o">,</span> <span class="n">req</span><span class="o">,</span> <span class="n">rep</span><span class="o">)</span>
<span class="k">case</span> <span class="k">_</span> <span class="k">=></span> <span class="n">println</span><span class="o">(</span><span class="s">"Bad arguments!"</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span></code></pre></figure>
<p>The complete source code of both the client and the server is available <a href="https://github.com/vkostyukov/finagle-fibonacci">on GitHub</a>.</p>
<p>Now it’s time to build the topology from the first picture (the binary tree with seven nodes). The following script builds the tree in a <em>bottom-up</em> manner by launching seven instances on the same machine.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>sbt <span class="s2">"run-main FibonacciServerLauncher leaf 2001"</span> <span class="o">&</span> <span class="se">\</span>
sbt <span class="s2">"run-main FibonacciServerLauncher leaf 2002"</span> <span class="o">&</span> <span class="se">\</span>
sbt <span class="s2">"run-main FibonacciServerLauncher node 2003 2002 2001"</span> <span class="o">&</span> <span class="se">\</span>
sbt <span class="s2">"run-main FibonacciServerLauncher leaf 2004"</span> <span class="o">&</span> <span class="se">\</span>
sbt <span class="s2">"run-main FibonacciServerLauncher leaf 2005"</span> <span class="o">&</span> <span class="se">\</span>
sbt <span class="s2">"run-main FibonacciServerLauncher node 2006 2005 2004"</span> <span class="o">&</span> <span class="se">\</span>
sbt <span class="s2">"run-main FibonacciServerLauncher node 2007 2006 2003"</span></code></pre></figure>
<p>From the client side, using the system is pretty simple: the client interacts with the root node of the topology tree, in our case the instance on port <code class="highlighter-rouge">2007</code>.</p>
<p><img src="http://kostyukov.net/assets/images/fibonacci-usage.png" alt="System Usage" /></p>
<h3 id="filters-as-services-decorators">Filters as Services’ Decorators</h3>
<p>Filters provide a natural and clean way of changing a service’s behavior by chaining its requests through a stack of nested filters. Since filters are <em>protocol-independent</em>, the same filters can be used on both the server and the client side.</p>
<p>Let’s consider an example: a filter that simply logs a service’s requests to the console.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">object</span> <span class="nc">LogStringFilter</span> <span class="k">extends</span> <span class="nc">Filter</span><span class="o">[</span><span class="kt">String</span>, <span class="kt">String</span>, <span class="kt">String</span>, <span class="kt">String</span><span class="o">]</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">apply</span><span class="o">(</span><span class="n">req</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">srv</span><span class="k">:</span> <span class="kt">Service</span><span class="o">[</span><span class="kt">String</span>, <span class="kt">String</span><span class="o">])</span><span class="k">:</span> <span class="kt">Future</span><span class="o">[</span><span class="kt">String</span><span class="o">]</span> <span class="k">=</span> <span class="o">{</span>
<span class="n">println</span><span class="o">(</span><span class="s">"Got a request: "</span> <span class="o">+</span> <span class="n">req</span><span class="o">)</span>
<span class="n">srv</span><span class="o">(</span><span class="n">req</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span></code></pre></figure>
<p>A filter can be applied to a service with the <code class="highlighter-rouge">andThen</code> operator. To make workers log their requests, we can change the launcher configuration as follows.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">val</span> <span class="n">service</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">FibonacciService</span><span class="o">(</span><span class="nc">LocalFibonacciCalculator</span><span class="o">)</span>
<span class="nc">Await</span><span class="o">.</span><span class="n">ready</span><span class="o">(</span><span class="nc">FibonacciServer</span><span class="o">.</span><span class="n">serve</span><span class="o">(</span><span class="s">":"</span> <span class="o">+</span> <span class="n">port</span><span class="o">,</span> <span class="nc">LogStringFilter</span> <span class="n">andThen</span> <span class="n">service</span><span class="o">))</span></code></pre></figure>
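<p>To see why filters compose so naturally, here is a self-contained sketch that models services and filters as plain synchronous functions (purely illustrative; real Finagle filters are <code class="highlighter-rouge">Future</code>-based, and <code class="highlighter-rouge">andThen</code> performs the composition shown here):</p>

```scala
// A toy model: a Service is Req => Rep, a Filter maps a Service to a Service.
type Svc = String => String
type Flt = Svc => Svc

// Analogous to LogStringFilter above: log the request, then delegate.
val log: Flt = svc => req => {
  println("Got a request: " + req)
  svc(req)
}

// A second filter, just to show that filters stack.
val upper: Flt = svc => req => svc(req.toUpperCase)

val echo: Svc = req => "echo: " + req

// Plain function nesting plays the role of andThen:
// log is outermost, then upper, then the service itself.
val decorated: Svc = log(upper(echo))
println(decorated("hello")) // logs the request, then prints "echo: HELLO"
```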
<h3 id="is-it-scalable-and-fault-tolerant">Is it Scalable and Fault-Tolerant?</h3>
<p>The suggested tree-based topology can be scaled in a <em>bottom-up</em> manner by adding new levels of fanout nodes. However, it’s not easy to configure such a system with only the shell commands described above; a specialized coordination tool (like <a href="http://zookeeper.apache.org">ZooKeeper</a>, which Finagle supports) should be used instead.</p>
<p>To make the system fault-tolerant, we can use Finagle’s built-in <em>load balancers</em> as well as custom filters that implement <a href="https://github.com/twitter/finagle/blob/master/finagle-core/src/main/scala/com/twitter/finagle/service/RetryingFilter.scala">retries</a> and <a href="https://github.com/twitter/finagle/blob/master/finagle-core/src/main/scala/com/twitter/finagle/service/TimeoutFilter.scala">timeouts</a>.</p>
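<p>As a rough, self-contained sketch of the retry idea (not Finagle’s actual API: the real <code class="highlighter-rouge">RetryingFilter</code> works on <code class="highlighter-rouge">Future</code>-based services and is driven by a <code class="highlighter-rouge">RetryPolicy</code>), a retrying decorator over a plain synchronous service might look like this:</p>

```scala
import scala.util.{Failure, Success, Try}

// Retry a synchronous service up to `times` attempts, rethrowing the
// last failure if every attempt fails.
def retry[Req, Rep](times: Int)(service: Req => Rep): Req => Rep = req => {
  def attempt(left: Int): Rep = Try(service(req)) match {
    case Success(rep)           => rep
    case Failure(_) if left > 1 => attempt(left - 1)
    case Failure(e)             => throw e
  }
  attempt(times)
}

// A flaky service that fails on its first two calls.
var calls = 0
val flaky: String => String = req => {
  calls += 1
  if (calls < 3) throw new RuntimeException("boom") else "ok: " + req
}

println(retry(5)(flaky)("ping")) // prints "ok: ping" on the third attempt
```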
<p>For example, the following client will balance its requests between the two nodes <code class="highlighter-rouge">localhost:2001</code> and <code class="highlighter-rouge">localhost:2002</code>:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">val</span> <span class="n">client</span> <span class="k">=</span> <span class="nc">FibonacciClient</span><span class="o">.</span><span class="n">newService</span><span class="o">(</span><span class="s">"localhost:2001,localhost:2002"</span><span class="o">)</span></code></pre></figure>
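<p>Under the hood, the balancer spreads requests across the resolved hosts. A toy round-robin version over plain functions gives the flavor (purely illustrative; Finagle’s actual default balancer is load-aware rather than round-robin):</p>

```scala
// A toy round-robin "balancer": cycles through a list of backends.
def roundRobin[Req, Rep](backends: Seq[Req => Rep]): Req => Rep = {
  var i = -1
  req => {
    i = (i + 1) % backends.size
    backends(i)(req)
  }
}

// Two stand-in backends that tag their replies.
val nodeA: String => String = req => "a:" + req
val nodeB: String => String = req => "b:" + req

val balanced = roundRobin(Seq(nodeA, nodeB))
println(balanced("x")) // prints "a:x"
println(balanced("x")) // prints "b:x"
```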
<h3 id="further-improvements">Further Improvements</h3>
<p>It might be a good idea to replace the String-based transport layer with a BigInt-based one. A suitable example of the corresponding pipeline configuration, with BigInt decoders and encoders, can be found <a href="https://github.com/netty/netty/tree/master/example/src/main/java/io/netty/example/factorial">in Netty’s examples directory</a>.</p>Vladimir KostyukovFinagle is an RPC library for the JVM that allows you to develop service-based applications in a protocol-agnostic way. Formally, the Finagle library provides both an asynchronous runtime via futures and protocol independence via codecs. In this post I will try to build a Finagle-powered distributed Fibonacci numbers calculator that scales up to thousands of nodes.In The Beginning2014-01-19T10:00:00+00:002014-01-19T10:00:00+00:00http://kostyukov.net/posts/in-the-beginning<p>This is my first post on this blog. Needless to say, I’m really excited about this. This is my first attempt at writing blog posts in English. I was previously posting in Russian at <a href="http://vkostyukov.livejournal.com">LJ</a> and <a href="http://vkostyukov.tumblr.com">Tumblr</a>, but it wasn’t about technical things. Here, I will try to post only about CS. I have huge plans for posting about Scala and its application to research on purely functional data structures. I should probably post about the algorithms and data structures that I’ve already implemented in <a href="https://github.com/vkostyukov/scalacaster">Scalacaster</a>. There are loads of awesome pieces of Scala code that I want to write about. One of my favorites is <a href="https://github.com/vkostyukov/scalacaster/blob/master/src/search/SelectionSearch.scala">QuickSelect</a> in a purely functional setting.</p>
<p>Anyway, I’m really looking forward to writing here, and I’m currently in the middle of writing my first useful post. I’m going to describe a new, purely functional implementation technique for the <a href="http://en.wikipedia.org/wiki/Disjoint-set_data_structure">Union-Find</a> data structure. I’ve almost committed the implementation, but it still requires a bit of improvement.</p>
<p>To make sure you don’t miss updates to this blog, I’d recommend following me on Twitter: <a href="https://twitter.com/vkostyukov">@vkostyukov</a>. Announcements will be posted there.</p>Vladimir KostyukovThis is my first post on this blog. Needless to say, I’m really excited about this. This is my first attempt at writing blog posts in English. I was previously posting in Russian at LJ and Tumblr, but it wasn’t about technical things. Here, I will try to post only about CS. I have huge plans for posting about Scala and its application to research on purely functional data structures. I should probably post about the algorithms and data structures that I’ve already implemented in Scalacaster. There are loads of awesome pieces of Scala code that I want to write about. One of my favorites is QuickSelect in a purely functional setting.