Introduction to Apache Mesos

This article will discuss the concepts of distributed systems, resources sharing and scheduling. It will review what an Apache Mesos framework is and what are its components. This post will also review why the concepts of resources sharing and scheduling are important for the new generation of application architecture called microservices and what types of applications are good candidates for running on Apache Mesos.

We will cover the following topics:

  • The new type of distributed systems
  • What is Apache Mesos
  • Mesos architecture overview
  • Resources scheduling
  • Resources sharing and isolation
  • What is a Mesos framework

The New Type of Distributed Systems

There are applications that cannot get by with a single computer, for either capacity reasons or resilience. In order to run these applications, sometimes keeping busy tens, sometimes tens of thousands of computers, we introduce distributed systems, where computers may not be aware of the state of the whole cluster, yet they work together in order to achieve a common goal.

In the traditional approach, a distributed application occupies an entire cluster or its fixed partition and is tightly coupled with the underlying infrastructure – the network of virtual or physical servers. This approach often leads to inefficient use of resources, because it is rarely the case that an application can fully utilize the provided resources and not require more over time. In reality the amount of resources required by an application changes continuously, sometimes in predictable patterns (regular batch processing), but mostly it is a function of unpredictable demand for applications and data. The resources requirements of an application may also change with new application versions, where performance issues can be fixed or introduced, or when new features are added. Basically, applications are like living organisms, far from being static and often changing their expectations from the environment.

Apache Mesos and its applications (called frameworks) represent an improvement on the matter of efficient resources utilization. Regardless of whether a cluster is spread over multiple geographical locations or sits within a single rack in a classroom, Apache Mesos allows to share the resources (processors, memory, storage) of the entire cluster by many applications at the same time and rapidly react to changing requirements of these application. Apache Mesos enables that by using an innovative approach called the two-level resource scheduling.

There are several advantages that come with the concept of two-level resource scheduling. The number one advantage is the much better utilization of available resources. In the traditional approach described above, it’s quite likely that most of the time some of the computers in a cluster are sitting and doing nothing, while others are too busy and hungry for resources.

Apache Mesos balances the utilization of the underlying infrastructure by allocating resources to applications dynamically based on the actual requirements and letting the applications run their tasks on computers, that are more suitable for those tasks at the given moment. This behavior has massive impact on the number of computers actually needed to be attached to the cluster (as shown in Figure 1), which in turn might have tremendous impact on the system maintainability, the environment and business.

Mesos resource sharing increases throughput and utilization Figure 1

Resiliency, an increased tolerance to failures and easier applications scalability are other frequently mentioned features of clusters, whose resources are managed by Apache Mesos.

What is Apache Mesos

Apache Mesos, here on referred to as Mesos, is a cluster resource manager, a platform for sharing commodity clusters between multiple diverse cluster computing frameworks [1].

Mesos is an open source software, that has been developed under Apache License since 2009. The project was started by a group of students as an academic research project at the University of California in Berkeley. Some of the contributors later continued the development while working at Twitter, AirBnb or other companies.

In 2016, the community actively works on adding new features, with new versions released roughly every two months, rapidly approaching release of version 1.0. The current stable version of Mesos at the time of writing is 0.27. Mesos is written in the C++ language.

Mesos provides fine grained and efficient resource isolation and sharing for distributed applications and is designed to be a small and flexible core, a layer between the operating system and distributed applications. Mesos is sometimes also called the datacenter kernel. That’s because it is built using the same principles as the Linux kernel, only at a different level of abstraction. Similarly to Linux kernel Mesos negotiates fair access to system resources for multiple competing applications.

Mesos can introduce previously non-scalable applications, originally bound to running on a single computer, to cluster environments. Virtually any application that can run on a Linux can run on Mesos.

Like majority of distributed systems, Mesos clusters rely on message passing as the primary means of interprocess communication, with HTTP, RPC-like connectors and message queues being the typical representatives of technologies used for this purpose.

What Mesos is very good at in terms of architectural patterns support, is microservices. Although microservices is a relatively new pattern, it’s becoming popular quickly, partly because of the onset of containerization in the last few years.

Mesos can scale up to thousands of host computers, as has been successfully proven by massive deployments at Twitter.

Mesos Architecture Overview

Mesos cluster consists of two types of nodes. Their goal is to run tasks, typically binaries or containers, things that run on Linux.

The first type of Mesos node is called Mesos Master. Nodes of this type are responsible for management of the cluster itself, for resource scheduling and exposing Mesos API.

Mesos Agents, the second type of Mesos node, that used to be called Mesos Slaves in the past (and still are in majority of the documentation) are there to execute application tasks with a provided amount of resources, to keep tasks isolated and communicate with Mesos Master the state of the nodes.

Mesos Architecture Diagram Figure 2

There can be virtually any number of Mesos Agents in a Mesos cluster, each of them typically running on a single physical or virtual computer. Adding and removing Agents to and from a cluster is a natural part of its lifecycle. Mesos Agents are generally considered to be replaceable units, because the Mesos failure tolerance model is ready to deal with Agents being lost by recreating that Agent’s tasks elsewhere in the cluster. This, together with rolling updates functionality, creates environments very suitable for highly available and resilient systems.

In each Mesos cluster, there is always only one Master node managing the whole cluster, regardless of the number of Agent nodes. But computers fail and there’s no doubt this single Master node, being the system’s single point of failure, would inevitably die one day as well, letting the entire cluster down. Mesos introduces shadow masters to address this potential issue. The shadow or sometimes also called standby Master nodes are replicated Mesos Master nodes, running on separate virtual or physical machines, literally waiting for the leading Master to fail. Once it does fail, one of the shadow Masters jumps in and takes over the leading role. It does so by connecting to all the Mesos agents and continuing with Resource scheduling and other Mesos Master activities. We can say that Mesos has the fault tolerance at Master node level covered.

The recommended number of Master nodes (including standbys) in a production cluster is 3 or 5, assuming the broken nodes are replaced soon after a failure.

There is one service that is almost always running nearby a Mesos cluster. This service is aptly named the ZooKeeper.

Zookeeper is a separate open source project, also maintained by Apache Software Foundation and licensed under the Apache License. Apache Zookeeper is an effort to develop and maintain an open source server which enables highly reliable distributed coordination [2].

In a Mesos cluster, Zookeeper’s role is to elect the leading Master, either during Mesos start up phase, or later when a leading Master is for some reason gone, and to provide the cluster components with information about the address of the the leading Master.

A Zookeeper installation is itself represented as a quorum of 3 or 5 nodes for high reliability. The size of the quorum is an odd number, because it allows for democratic election of the leading Master node.

Resources Scheduling

Mesos delegates both the scheduling functionality and tasks execution to frameworks by collecting information about available resources from Mesos Agents and offering them to these frameworks. The reason for not implementing a common scheduler, that would work for all possible frameworks, current and future, is that Mesos is focused on heterogenous environments, and the scheduling logic and tasks execution logic of various applications reflect the applications’ nature and features more than any common scheme. A scheduler of batch job processing framework Spark has certainly very different logic to the one of say Elasticsearch Mesos framework. However, they typically share some bits of functionality. The framework schedulers and executors implement a common programming interface provided by Mesos in several programming languages.

Mesos introduces a two-level scheduling mechanism and a new abstraction called Resource Offers, which represents an encapsulation of a bundle of resources, that frameworks can allocate on a node to run their tasks. Understanding of these two concepts is crucial for successful work with Mesos API and programming Mesos frameworks.

Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which tasks to run using those resources. Mesos’s decision of how many resources to offer to each framework is determined by the active allocation module, which is one of the components of Mesos Master. By default, Mesos includes a strict priority resource allocation module and a fair sharing resource allocation module.

The strict priority resource allocation module simply always tries to give a framework enough resources to run its tasks. That is suitable for mission critical applications, but not very good for democratic and dynamic resource sharing amongst frameworks competing for resources. Once a framework asks for too many resources, other frameworks starve and Mesos is fine with that, when using the strict priority resource allocation module.

Most environments thus prefer the fair sharing resource allocation module, which balances the offered resources amongst heterogenous load in the arguably most democratic way. The implementation of fair sharing algorithm used by Mesos is called Dominant Resource Fairness algorithm.

Custom allocation modules can be written to satisfy specific requirements.

Mesos two-level scheduling mechanism Figure 3

Figure 3 shows an example of how the two-level scheduling mechanism works.

In step 1, Mesos Agent 1 reports to Mesos Master it has 4 CPU cores and 4 gigabytes of memory available. With this information Mesos Master calls its internal allocation module, which decides Framework 1 should be offered all these free resources.

In step 2, Mesos Master send a Resource Offer to the Scheduler of framework 1, saying it has been assigned 4 CPU cores and 4 gigabytes of memory for its tasks.

In step 3, the Scheduler accepts the offer and decides to use some of these resources and run two tasks, task 1 using 2 CPU cores and 1 gigabyte of RAM, task 2 using 1 CPU core and 2 gigabytes of memory.

In step 4, Mesos Master sends the two tasks to Slave 1, that allocates appropriate resources and executes the tasks.
Because there are still some free resources on Slave 1 (1 CPU core, 1 gigabyte RAM), these keep being offered back to Mesos Master. Once any of the two tasks scheduled in the previous example terminates, the freed resources are immediately offered to frameworks again.

This workflow allows Mesos to support wide range of different types of applications and still keep its own implementation minimalistic and performant. Accordingly to the Mesos whitepaper [1] Mesos can make tens of thousands resource offers per seconds.

You may be asking how is it possible that a task is assigned to run on a given number of CPU cores or a given amount of memory. Isn’t this something operating systems deal with? Yes it is. Moreover, Linux operating system has the concepts of cgroups and namespaces, that allow to create spaces of isolated CPU cores, memory and even network and filesystem mount points, where processes can run undisturbedly with the provided portion of system resources.

Resources Sharing and Isolation

In order to enable different tasks running on a same host safely, it is necessary to isolate task runtime environments. Mesos uses what it calls resource isolation modules for this purpose, and since all currently supported isolators use containers, the resource isolation modules are normally called containerizers. The two containerizers available out of the box are the more lightweight Mesos containerizer, that uses cgroups and namespaces directly to isolate the executor processes on the host, and the Docker containerizer, that allows running Docker containers as tasks or even executors.

These two containerizers both work similarly regarding the low-level isolation mechanisms, as Docker containerizer also leverages cgroups and Linux namespaces in its core. They do offer different features and ease of use, mostly favoring the Docker containerizer.

How a containerizer works in practice? It creates a Linux container for a given task and isolates the task process from the rest of the host system, only giving it explicit access to selected resources. After the task finishes its execution or dies unexpectedly, the container is removed and the freed resources are further offered to Mesos frameworks by Mesos Master, based on the active allocation policy.

Some resources can be naturally shared by tasks running on the same host. This is particularly useful in case of filesystem, because applications may want to access the same data. For example one application might be pulling data from a remote source, while another might want to perform statistical computations on top of the very same data in real time. In a traditional cluster, this would be a hard problem, because we would ended up synchronizing chunks of data between cluster partitions, as each of these partitions are the territory of one application only. Sometimes this kind of data replication would be impossible due to the size of the data.

We have mentioned that Mesos offers resources to frameworks and they choose which resources to use for their tasks. Mesos frameworks also have the right to reject resource offers. Often the reason for rejection resources by a framework is the requirement of running tasks on a specific host (or hosts), in order to work with data data present of these hosts only.

In theory, a virtual machine could also be an implementation of a containerizer, but for obvious reasons containers is the technology of choice nowadays.

What is a Mesos framework

Mesos framework is a software system that manages and executes one or more jobs on a cluster [1]. While the key feature of Mesos is efficient use of hardware resources, for a framework it is to provide the application enhanced resilience against failure.

A framework is essentially a computer program composed of two parts, a scheduler and an executor. A scheduler, as has been already shown above in the Resource Scheduling section, works together with Mesos Master to negotiate resources for running tasks. Once resources are assigned to the framework for running tasks, it gets executors (typically on different machines) to run these tasks.

Executor are deployed on Mesos Agents. There could be a lot of them (up to thousands) and they are responsible for running properly isolated tasks, watching their lifecycle and reporting back to Mesos Master with any relevant information. This information, evaluated by the scheduler, can for example trigger automatic scaling and expand or shrink the application appropriately. Other parts of application logic, like backups for example, may negotiate with the scheduler when it’s an appropriate time for them to run (certainly not during high load with scaling being in progress). And so on.

Mesos frameworks can be split into two groups. One group represents the general-purpose frameworks (Marathon and Chronos to name the most often used). Those frameworks can be used to deploy virtually any containers onto a Mesos cluster, monitor their state and provide lifecycle management and resiliency by automatic redeployment of containers lost due to infrastructure failures.

Marathon is an aptly named general-purpose Mesos framework for long-running applications, sometimes compared to the init process in Linux distributions. It’s written in Scala and developed by Mesosphere company. Coming with a nice UI and exposing well documented REST API, Marathon can be used for starting and stopping containers, scaling by running multiple instances of a container, performing basic health checks and retrieving system metrics. Marathon is so called meta-framework, which means it can run other Mesos frameworks as its tasks.

Chronos, another example of a general-purpose Mesos framework, is meant to be the replacement for regular Linux cron in distributed environments. Chronos allows transfering files to remote machines, executing them via shell (Mesos containerizer) or as Docker containers, and notifying Chronos of job completion or failures. An interesting feature of Chronos is its ability to not only run repeating jobs based on the ISO8601 notation, but also run jobs after other jobs are completed, creating dependency chains.

There are many use cases for general-purpose Mesos frameworks. Especially applications having less complex requirements regarding scalability, data locality and lifecycle management (web servers being a typical example) may flourish on general-purpose Mesos frameworks. Because these frameworks are battle tested in a wide range of various deployments, they are relatively reliable and easy to set up and use.

Once it comes to applications with a custom scaling logic (sometimes related to a preexisting scaling features of such an application), or generally with any special requirements related to running the application on a multi-host environments, not supported by general-purpose frameworks, an application-bound Mesos framework must be developed.

These specialized, application-bound frameworks then address specific scalability and resiliency issues and extend Mesos features by custom scheduler and executor implementations, leveraging Mesos API and the information it provides. A custom Mesos framework can extend the application too and essentially become a Mesos native application (hence using the terms Mesos framework and Mesos application interchangeably).

Summary

In this post we looked at the new type of distributed system, that introduces efficient resources utilization, improved resiliency and an increased tolerance to failures, and we compared it with the traditional approach to distributed computing.

We have described origins of the Apache Mesos project, what types of applications it’s suitable to be used with, and what are the main Mesos cluster components. The key concepts of resources scheduling, sharing and isolation were introduced.

Resources