Utilising the OpenStack Placement service to schedule GPU and NVMe workloads alongside general purpose instances
We are going through a period of growth and transforming the way that we build and deploy our platforms at G-Research. A big part of this involves the creation of a heterogeneous OpenStack cloud, which focuses on security, high-performance compute (HPC) and providing users with the ability to self-serve infrastructure on demand.
The Challenge
Whilst it can be quite straightforward to deploy an OpenStack environment to support traditional virtual machines (instances), it becomes a lot more complicated when trying to schedule instances that have different hardware requirements.
We have lots of different teams with lots of diverse use cases here at G-Research. Whilst some teams may require a “generic” instance, others may require a fleet of bare metal machines to run their more specialist workloads. Recently, we have also seen an increase in the demand for instances which provide access to NVMe disk drives, and GPUs. We were challenged to run these different workloads from within a single OpenStack deployment.
GPU vs vGPU
PCI-E passthrough is a technology that allows PCI-Express devices to be handed to an instance with no extra virtualisation layer. This gives the guest complete control of the device and its capabilities. vGPU (virtual GPU), as the name suggests, adds a layer of virtualisation around the GPU, allowing it to be shared between multiple guests. Both GPU [1] and vGPU [2] configuration are generally well documented and fairly straightforward to deploy within an OpenStack cloud.
We tried both GPU passthrough and vGPU to compare their capabilities, advantages and disadvantages, and to establish whether vGPU would give us any additional benefit over passing through a whole device. This let us move forward with confidence that we had made the right decision.
We initially thought that the ability to slice up a GPU into vGPUs would offer additional flexibility. However, as we looked into this more closely, we came across some limitations of vGPU for our use cases. For example, at the time of our investigation, Nova was only able to schedule one vGPU per instance, which would not fit many of our use cases at G-Research. We also expect our users’ workloads to be able to utilise an entire GPU (or several), so we would not make use of the additional flexibility offered by vGPU. We also took advice from our vendors, along with general feedback from the community, mailing lists and forums, and came to the conclusion that vGPU appeared to be more suitable for VDI-type workloads. We decided that the extra complexity did not offer any real gains for our particular use cases at this time.
Traits
After making the decision to go ahead with passing through an entire GPU device, we set out to find a way to schedule different kinds of workloads side by side under one OpenStack deployment. This would allow users to deploy applications without having to know the intricacies of our data centres. At present we are unlikely to have a use case where general purpose instances are scheduled on the same physical hypervisors that host our GPU-backed instances, but we may want to do this in the future to make the most of our available resources. We needed a pragmatic and adaptable solution.
Scheduling different types of workloads in Nova would have traditionally been solved using a combination of host aggregates and additional flavour metadata. Since the Placement service was split from the Nova project and introduced as its own independent service, we now have the ability to schedule workloads with more granularity, and to identify what type of workload can be scheduled on a particular hypervisor by making use of ‘resource providers’ and ‘traits’. For example, an OpenStack hypervisor that is intended for a general purpose instance can be given the trait “CUSTOM_GENERAL_COMPUTE”. We can then add this trait to a flavour’s metadata to make the link between what an end-user asked for, and where it can be appropriately placed. Initially we had the requirement for three different types of hypervisor – ‘General Compute’, ‘GPU’, and ‘NVMe’.
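The matching rule described above can be pictured with a small shell sketch. It is a simplified model for illustration only, not how the Placement service is implemented: the function name `host_satisfies`, the example hosts and the trait `CUSTOM_GPU` are assumptions made for this example. A host is a candidate only when every trait the flavour marks as required appears in that host's trait list.

```shell
#!/usr/bin/env bash
# Simplified model of required-trait matching (illustrative only; the real
# check is performed by the Placement service, not by a script like this).

# host_satisfies "<host traits>" "<required traits>"
# Both arguments are space-separated lists. Returns 0 when every required
# trait is present on the host, 1 otherwise.
host_satisfies() {
    local host_traits=" $1 "
    local required
    for required in $2; do
        case "$host_traits" in
            *" $required "*) ;;   # trait present, keep checking
            *) return 1 ;;        # a required trait is missing
        esac
    done
    return 0
}

# Hypothetical hosts: one general purpose, one GPU.
general_host="CUSTOM_GENERAL_COMPUTE HW_CPU_X86_AVX2"
gpu_host="CUSTOM_GPU HW_CPU_X86_AVX2"

# A flavour carrying trait:CUSTOM_GENERAL_COMPUTE=required
flavour_required="CUSTOM_GENERAL_COMPUTE"

host_satisfies "$general_host" "$flavour_required" && echo "general host: candidate"
host_satisfies "$gpu_host" "$flavour_required" || echo "gpu host: filtered out"
```

The same rule is what keeps a general compute flavour off a GPU hypervisor: the GPU host simply does not carry the trait the flavour requires.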
To achieve this at scale, reduce operational toil and avoid human error, it made sense to wrap this up in automation, e.g. an Ansible playbook. The Kayobe and Kolla OpenStack projects allow for the deployment of a containerised OpenStack onto bare metal nodes, and at G-Research we use them to deploy and maintain our OpenStack infrastructure. An interesting feature of Kayobe is the ability to run a custom Ansible playbook [3]. This provided a scalable way to apply and maintain the traits of our hypervisors, without needing an OpenStack administrator to apply configuration manually. An example of the code needed to create this playbook can be found on GitHub [4]. It is worth noting that all custom traits must begin with “CUSTOM_”. When setting traits, you must specify the full list of traits for a resource provider. This is done by passing the flag “--trait” multiple times, as the CLI does not currently offer a way to append or remove individual traits. An example of this can be seen below:
$ openstack resource provider trait set <resource-provider-id> --trait CUSTOM_FOO --trait CUSTOM_BAR
Nova also sets some traits of its own, but you do not have to worry about overwriting these as Nova will repopulate them for you. To see a list of all the traits for a particular resource provider, use the following command:
$ openstack resource provider trait list <resource-provider-id>
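Because `trait set` replaces the whole list, adding a trait means reading the current list first and passing it back together with the new entries. A minimal sketch of that pattern follows. The function names `merge_trait_args` and `add_traits` are hypothetical, and the sketch assumes the osc-placement plugin is installed, that Placement microversion 1.6 or later is available, and that the trait column in the list output is called `name`.

```shell
#!/usr/bin/env bash
# Sketch: add traits to a resource provider without dropping existing ones.

# merge_trait_args "<existing traits>" "<new traits>"
# Both arguments are whitespace-separated lists. Prints one line of
# de-duplicated "--trait NAME" arguments, existing traits first.
merge_trait_args() {
    local trait seen=" " args=""
    for trait in $1 $2; do
        case "$seen" in
            *" $trait "*) continue ;;   # already listed, skip the duplicate
        esac
        seen="$seen$trait "
        args="$args--trait $trait "
    done
    printf '%s\n' "${args% }"
}

# add_traits <resource-provider-id> TRAIT [TRAIT...]
# Needs a live cloud, admin credentials and the osc-placement plugin.
add_traits() {
    local rp_id="$1"
    shift
    local existing
    # Assumption: the list command exposes trait names in a column "name".
    existing=$(openstack --os-placement-api-version 1.6 \
        resource provider trait list "$rp_id" -f value -c name)
    # Word splitting of the merged arguments is intentional here.
    # shellcheck disable=SC2046
    openstack --os-placement-api-version 1.6 \
        resource provider trait set "$rp_id" \
        $(merge_trait_args "$existing" "$*")
}
```

For example, `add_traits <resource-provider-id> CUSTOM_NVME` (with `CUSTOM_NVME` as an illustrative trait name) would keep any traits already present and add the new one.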
To maintain the resources that make up our OpenStack configuration (such as flavours, networks, etc.) we use Terraform [5]. As mentioned earlier, we add the required trait to each flavour's metadata so that the flavour is matched to the correct type of hypervisor. This can be seen in the example below.
resource "openstack_compute_flavor_v2" "flavor_c1_m1" {
  name      = "flavor_c1_m1"
  ram       = 1024
  vcpus     = 1
  disk      = 20
  is_public = true

  extra_specs = {
    "trait:CUSTOM_GENERAL_COMPUTE" = "required"
  }
}
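To check which hypervisors Placement would offer for a flavour like the one above, one option is to query allocation candidates with the same resources and required trait. The helper below is a sketch only: the function name `candidate_args` is hypothetical, it assumes the osc-placement plugin is installed, and the `--required` option needs Placement microversion 1.17 or later.

```shell
#!/usr/bin/env bash
# Sketch: build the arguments for an allocation candidate query that mirrors
# a flavour's resources and required traits.

# candidate_args <vcpus> <ram_mb> <disk_gb> [REQUIRED_TRAIT...]
# Prints the --resource and --required arguments on one line.
candidate_args() {
    local vcpus="$1" ram_mb="$2" disk_gb="$3"
    shift 3
    local args="--resource VCPU=$vcpus --resource MEMORY_MB=$ram_mb --resource DISK_GB=$disk_gb"
    local trait
    for trait in "$@"; do
        args="$args --required $trait"
    done
    printf '%s\n' "$args"
}

# Example (needs a live cloud and admin credentials):
#   openstack --os-placement-api-version 1.17 allocation candidate list \
#       $(candidate_args 1 1024 20 CUSTOM_GENERAL_COMPUTE)
```

If the trait has been applied as intended, only resource providers carrying `CUSTOM_GENERAL_COMPUTE` should appear in the result.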
Next steps
By using the process explained in this blog, we are now able to ensure that instances are only scheduled on hypervisors that meet certain hardware requirements. Another advantage is the ability to reject certain workloads, for example preventing a general compute workload from being scheduled on a GPU hypervisor. By using Ansible group variables within Kayobe, we can manage the list of traits associated with each type of hypervisor, which gives us a dynamic way to modify how the Placement service schedules instances.
The next stage for us at G-Research is to extend this approach and introduce Ironic for bare metal compute. Now that we are more familiar with how the Placement service works, we are able to make use of features such as custom resource classes when deploying Ironic. We also look forward to quotas [6] being introduced to the Ironic service to improve the overall experience for our end users and bring this in line with the controls that we have in place for other services such as Nova.
Authors:
- Scott Solkhon – Engineer
- Ross Martyn – Engineer
Thanks to John Garbutt from StackHPC for his help on this project.
References
[1] https://docs.openstack.org/nova/latest/admin/pci-passthrough.html
[2] https://docs.openstack.org/nova/latest/admin/virtual-gpu.html
[3] https://docs.openstack.org/kayobe/latest/custom-ansible-playbooks.html
[4] https://gist.github.com/ssolkhon/0f39ee11130ed4f322bedb3bd3cb0462
[5] https://www.terraform.io/docs/providers/openstack/index.html
[6] https://specs.openstack.org/openstack/nova-specs/specs/ussuri/approved/unified-limits-nova.html