Elastic Build Service

Filed under: Cloud computing, cloudbursting, build system, software build, OBS, Open Build Service, computing cluster, distributed system, virtualization

Authors: Ville Seppänen

Category: industry article

Keywords: Cloud computing, cloudbursting, build system, software build, OBS, Open Build Service, computing cluster, distributed system, virtualization

Abstract: Linux-based operating systems such as MeeGo consist of thousands of modular software packages. Compiling and packaging source code is an automated but computationally heavy task. As the load on a build farm can vary greatly, a local infrastructure is difficult to provision efficiently. In this paper we present the elastic acquisition of cloud resources as a means to ensure sufficient computing capacity for a software build system. This system is Open Build Service, a centrally managed distributed build system capable of building packages for several distributions and architectures. The main concerns were the technical feasibility, security and cost-efficiency of the proposed solution. A script was implemented to autonomously manage the elastic cloudbursting, monitoring resource usage and demand and deciding whether new machines should be requested or idle ones terminated. The latencies incurred by the physical distance to the cloud were not insurmountable and the system scaled up in a matter of minutes. The main advantage achieved with cloud usage in this work was on-demand access to a seemingly infinite number of resources, ideal for handling sudden bursts of packages that can be built in parallel.

Permanent link to this page: http://urn.fi/URN:NBN:fi-fe201109275589

Initial submission
Elastic Build Service: Small updates to diagrams, clarified text, fixed typos.
Elastic Build Service: Revised based on peer-review.

associateEditors589 says:

Nov 28, 2011 04:55 PM

This paper presents a fine example of what you can do with the cloud when load can spill over to cloud resources and the application can be nicely automated. The presented test system is able to push compilation load from a private cloud to Amazon EC2 instances.

Somehow this sounds familiar to me. It seems to resemble the Electric Commander tool from a company called Electric Cloud (http://www.electric-cloud.com/products/electriccommander.php), which provides services to distribute build load to the cloud. However, this paper rather nicely provides the details of how the parts work, although it does not provide all the functionality yet.
As this is a research paper, the main question to me is: what is novel in this paper compared to these existing solutions? Some statements about this are needed.

Also some reflection against the literature describing the state of the research is necessary. From http://scholar.google.com/scholar?q=cloudbursting I was able to find some articles using cloudbursting in other domains. However, in a quick search I was not able to find scientific articles on build/make automation using on-demand resources from the public cloud (closest being Armbrust et al, Above the clouds, http://x-integrate.de/x-in-cms.nsf/id/DE_Von_Regenmachern_und_Wolkenbruechen_-_Impact_2009_Nachlese/$file/abovetheclouds.pdf)

In addition to those points above, I have some minor comments regarding the clarity of the presentation:
- Does OBS Server in Figure 3 refer to Back-end in Figure 2?
- In Fig 3 "Workers connect to server to build packages" leaves it unclear if the OBS Server or the Workers make the build (as in Fig 2). I assume the Workers do that (?).
- Where are the Local Build Hosts in Fig 3?
- The role of boxes in Fig 3 is unclear, please describe (assuming functional modules in a single HW?)
- Please, elaborate "boto" (in text above Fig 4).
- "When bad and unnecessary build hosts have been terminated, the service manager determines whether it should request new build hosts." is there no means to assign new tasks for the workers of a free (i.e., unnecessary) CBH?
- Altogether, the management of workers vs. Cloud Build Hosts is a bit unclear (as there often exists only one worker per CBH, although they have separate roles).
- In Chapter 5, please give some data of the LBHs (number of cores etc.) and use of CBHs.
- Chapter 5 entitled "Evaluation and future work" could preferably be titled simply "Evaluation".
- The figures given and the simple cost analysis in Ch 5 were useful.
- Ch 6: "It uses virtualization to inhibit attacks from inside of the build environments to the build host and the whole build system." What does this mean? Please, rephrase or elaborate.

Ville Seppänen says:

Dec 13, 2011 08:18 PM

Novelty: I'm not sure whether this should be counted more as an industry paper, as this is based on my Master's thesis work on an industry-based research problem. Regardless, the novelty here is the elastic use of public clouds to serve as a backup if MeeGo development were to grow suddenly.

State of research: I have added some comments on literature, mainly "Above the Clouds: A Berkeley View of Cloud Computing" and "Elastic Management of Cluster-based Services in the Cloud".

Other points:
- My OBS Server refers to the combination of Back-end, Front-end and Storage (i.e. everything except workers). These software processes can be run on a single server in smaller deployments. Workers should run on separate servers.
- Workers build packages from sources, the OBS Server stores these sources and packages and dispatches jobs to the workers.
- LBHs are not visible in Fig 3, which represents the proposed architecture. In this architecture, all build hosts are treated the same way no matter where they exist. They are managed using a Virtual Infrastructure Manager or VIM (e.g. OpenStack, Eucalyptus). I have added a figure (Fig 4) to show the simpler, implemented architecture. LBHs are manually managed local build hosts that are started and then left running. This was done because of time constraints, to avoid the need for a VIM.
- The boxes are indeed functional modules of software running on the management server.
- Boto is a set of Python modules for interfacing with Amazon Web Services from Python code.
- Unnecessary is not the same as a free CBH. With unnecessary I meant build hosts that have all of their workers idle because there is no more work to be done, AND in addition they are nearing the time when they would be billed again. You could say expiring/deletable/retiring instead.
- A build host is a machine (virtual or physical) that is running one or more worker processes on it. Each worker can build one package at a time. The reason for this separation is that the build jobs are tied to specific workers, while EC2 (or a VIM) is only aware of virtual machines (the build hosts). To destroy unnecessary resources, one has to shut down the build host of a specific worker, not just the worker.
- There is a single LBH, a virtual machine running on a quad-core commodity desktop, with 3 dedicated cores and thus 3 workers running on it. Mainly two types of EC2 machines were used (Small Instance "m1.small" and High-CPU Medium Instance "c1.medium" http://aws.amazon.com/ec2/#instance )
- Title is now simply Evaluation
- Attacks: Build hosts have sandboxed build environments for each worker. For this sandboxing, OBS traditionally uses Xen or KVM. These are possibly troublesome (nested virtualization) to use on a build host that is already a Xen virtual machine in EC2. If no sandboxing is used, the build environment is a simple chroot jail, which can be easily escaped from within. This means that if someone built malicious (open-)source code on the build system, they could get root-level access on the whole build host, and alter future builds or the build system itself. The solution is to use lightweight, OS-level virtualization such as Linux Containers for the sandboxing.
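
As an illustration of the "unnecessary build host" rule and the worker/build host separation described in the points above, here is a minimal Python sketch. It is not the script from the paper: the class names, the function `is_unnecessary`, the hourly billing period and the five-minute retirement window are all assumptions made for the example, and no cloud API is called.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: instances are billed per started hour.
BILLING_PERIOD_S = 3600
# Assumption: a host counts as "nearing" its next charge in the last 5 minutes.
RETIRE_WINDOW_S = 300


@dataclass
class Worker:
    """A worker process; it builds at most one package at a time."""
    name: str
    busy: bool = False


@dataclass
class BuildHost:
    """A (virtual) machine running one or more workers."""
    instance_id: str          # hypothetical identifier of the machine
    launched_at: float        # launch time, seconds since the epoch
    workers: List[Worker] = field(default_factory=list)

    def is_idle(self) -> bool:
        """True when no worker on this host is building a package."""
        return not any(w.busy for w in self.workers)

    def seconds_until_next_charge(self, now: float) -> float:
        """Time left in the billing period that has already been paid for."""
        elapsed = now - self.launched_at
        return BILLING_PERIOD_S - (elapsed % BILLING_PERIOD_S)


def is_unnecessary(host: BuildHost, queued_jobs: int, now: float) -> bool:
    """A host is retired only if there is no work left, all of its workers
    are idle, AND it is about to be billed again."""
    return (
        queued_jobs == 0
        and host.is_idle()
        and host.seconds_until_next_charge(now) <= RETIRE_WINDOW_S
    )
```

Because the cloud provider only knows about machines, the decision is taken per build host even though jobs are assigned per worker: a single busy worker keeps the whole host alive.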

These issues have been addressed and clarified in the revised paper.

reviewer589-4 says:

Nov 28, 2011 07:53 PM

The paper describes an innovative application of cloud bursting - i.e. dynamic deployment of computing instances in a cloud to serve a spike in demand - for effectively compiling/building large-scale software systems, in particular, the software packages of contemporary open-source operating systems, such as MeeGo and openSUSE.

Overall, I find the paper quite interesting, insightful, concise, and coherent. I would still suggest explicitly listing (early in the paper) the additional contributions made by the authors (proof of concept implementation, elaboration of service manager, empirical evaluation, etc.)

Some further comments:
- The paper presents a distributed cloud-based build system dedicated specifically to building Linux distributions. What is the reason for focusing especially on Linux OS packages? (Other types of software may also require a complex and time-consuming build process.)
- Why was the decision made to use the aggressive elasticity policy? Is it a requirement of the open-source community to have the building phase completed asap, or was it used for simplicity only?
- The evaluation was carried out by compiling a set of packages in batch mode, which does not reflect the bursty nature of demand. Would it be possible to make the evaluation using both the average and peak load? This would enable the benefits of the system to be manifested: in particular, little or no over-provisioning, and likely lower overall costs (since the number of local hosts could be reduced). How about comparing, for example, the following in terms of build time, server utilization and costs:
1) the baseline (only local hosts), with no payments to Amazon, but notable overprovisioning (i.e. acquired and largely underutilized servers)
2) the proposed system, with extra payments to Amazon, but little or no overprovisioning and fast build time
- The lack of a trusted environment indeed undermines the security (confidentiality) when dealing with proprietary software. But is it an equally serious problem in the context considered in the paper - namely, the case of open source packages?

A couple of questions regarding terminology:
- What is Local/Cloud Build Host? - In particular, is it the same as "Worker"?
- What is OBS server? - In particular, is it the same as "Back-end" server?

Ville Seppänen says:

Dec 13, 2011 08:42 PM

Contributions are now mentioned more clearly in the beginning.

Further points:
- This research addressed a specific case: how to use the cloud to ensure build capacity if MeeGo development grows rapidly. Ever since Intel's Moblin, the MeeGo project has been using OBS to build packages. All points in the work are also valid for building other Linux packages on OBS. Building OS packages is a somewhat special case compared to other distributed computation in that:
1) Handling a job (i.e. building a package) may take little time or a very long time (from minutes to hours or days).
2) Jobs may or may not depend on each other and thus may have to wait for other jobs. Packages can form long dependency trees, and usually many packages depend on large packages (making the bottleneck even worse).
3) A worker that is building a package is completely reserved. If all 100 workers are building 100 packages, then the rest of the jobs have to wait in a queue until there is a free worker. A web server cluster, for example, has a much more granular workload.
Other than this, the reason for this focus is to narrow down the subject.
- The aggressive elasticity policy is for simplicity only. With more advanced policies it is possible to cut down costs without affecting total build times. Jobs in the queue could be left waiting if the system knows that some jobs are about to finish. OBS does not however predict build times.
- The batch is somewhat small and I would see it as a single burst of jobs. With a traditional OBS, the workers are easily saturated to the point that there are many jobs in the queue. With cloud usage, the main benefit comes from having a seemingly unlimited number of workers. Unfortunately I am not able to make further measurements.
- Indeed, the compromise of the build system is a far greater security issue than the confidentiality of the (open) source code. However, as MeeGo is not an end-user OS but rather a platform for device vendors to create their products on, the vendors may have their proprietary packages they wish to build on the system.
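
To make the aggressive elasticity policy mentioned above concrete, here is a minimal Python sketch under one possible reading of "aggressive": request enough build hosts immediately so that no queued job has to wait for a running job to finish. The function name `hosts_to_request` and all of its parameters, including the `max_new_hosts` cap, are assumptions for illustration and are not taken from the paper's script.

```python
import math


def hosts_to_request(queued_jobs: int, idle_workers: int,
                     workers_per_host: int, max_new_hosts: int) -> int:
    """Number of new build hosts to request right now.

    Aggressive reading (assumption): cover every queued job that no idle
    worker can take, without waiting for running builds to finish.
    """
    if workers_per_host < 1:
        raise ValueError("a build host must run at least one worker")
    # Jobs that cannot be picked up by any currently idle worker.
    uncovered = max(0, queued_jobs - idle_workers)
    # Each worker builds one package at a time, so round up to whole hosts.
    needed = math.ceil(uncovered / workers_per_host)
    # Cap the request (hypothetical safety limit on cost).
    return min(needed, max_new_hosts)
```

For instance, with 10 queued jobs, 3 idle workers and 2 workers per host, `hosts_to_request(10, 3, 2, 20)` asks for 4 hosts. A more conservative policy would subtract jobs expected to finish soon, which would need build time predictions that OBS does not provide.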

These issues have been addressed and clarified in the revised paper.

Ville Seppänen says:

Dec 13, 2011 08:47 PM

Also, as stated in the reply to the other reviewer:
- Worker is a program running on a computer (build host). A build host is running one or more workers. Jobs are dispatched to specific workers, while build hosts can be started and shut down for elasticity.
- OBS Server is running all the components that can also be distributed into Front-end, Back-end and Storage servers. The OBS Server stores sources and packages and makes decisions on which workers should build what.

Pasi Tyrväinen says:

Nov 29, 2011 11:21 AM

Editor Decision

Your manuscript has been reviewed and reviewers have suggested revising it prior to publication.

There is still a possibility to revise this in due time to get it accepted for publication in the first peer-reviewed issue of the Communications of Cloud Software journal. To achieve this you need to read carefully the comments of the reviewers and update your manuscript accordingly in two weeks (by December 13th). As commented by the reviewers, the paper needs to be connected better with prior research described in related scientific publications. Please, check also the information for authors section providing useful guidelines for revising the paper.

In case you are not able to revise the manuscript by that date, a later revision will be reviewed for the second issue.

Looking forward to the updated version by 13.12.

Pasi Tyrväinen says:

Dec 22, 2011 10:43 AM

Editor Decision

Thank you for revising the paper, which has improved much from the previous version. The reviewer comments imply that related work has probably been conducted, while in the current version the exact relation between the presented results and other related research has not been elaborated to the full extent. However, so far previous similar publications have not been identified by the reviewers or the audience, implying that this paper still contains relatively novel ideas.

The editorial approach is to encourage fast dissemination of ideas with high potential if the research community considers them to be novel. Based on this, the decision is to publish this paper in the first issue of Communications of Cloud Software. Congratulations!

We also encourage readers who may be aware of similar or closely related ideas published elsewhere to comment on the paper, to share their knowledge and to facilitate discussion of novel ideas at the forefront of the research.