During the Ironic sessions at the recent OpenStack Dublin PTG in Spring 2018, there were discussions on adding a further burn-in step to the OpenStack Bare Metal project (Ironic) state machine. The notes summarising the sessions were reported to the openstack-dev list. This blog covers the CERN burn-in process for the systems delivered to the data centres, as one example of how OpenStack Ironic users could benefit from a set of open source tools to burn in newly delivered servers as a stage within the Ironic workflow.

CERN hardware procurement follows a formal process compliant with public procurement rules. Following a market survey to identify potential companies in CERN's member states, a tender specification is sent to these companies asking for offers based on technical requirements.
Server burn in goals
Following the public procurement processes at CERN, large hardware deliveries occur once or twice a year and smaller deliveries multiple times per year. The overall resource management at CERN was covered in a previous blog. Part of the steps before production involves burn in of new servers. The goals are
  • Ensure that the hardware delivered complies with CERN Technical Specifications
  • Find systematic issues with all machines in a delivery such as bad firmware
  • Identify failed components in single machines
  • Provoke early failure in failing components due to high load during stress testing
Depending on the hardware configuration, the burn-in tests take on average around two weeks but do vary significantly (e.g. for systems with large memory amounts, the memory tests alone can take up to two weeks). This has been found to be a reasonable balance between achieving the goals above compared to delaying the production use of the machines with further testing which may not find more errors.

Successful execution of the CERN burn in processes is required in the tender documents prior to completion of the invoicing.
Workflow
The CERN hardware follows a lifecycle from procurement to retirement as outlined below. The parts marked in red are the ones currently being implemented as part of the CERN Bare Metal deployment.

As part of the evaluation, test systems are requested from the vendor and these are used to validate compliance with the specifications. The results are also retained to ensure that the bulk equipment deliveries correspond to the initial test system configurations and performance.
Preliminary Checks
CERN requires that the Purchase Order ID and a unique System Serial Number are set in the NVRAM of the Baseboard Management Controller (BMC), in the Field Replaceable Unit (FRU) fields Product Asset Tag (PAT) and Product Serial (PS) respectively:

# ipmitool fru print 0 | tail -2
 Product Serial        : 245410-1
 Product Asset Tag     : CD5792984

The Product Asset Tag is set to the CERN delivery number and the Product Serial is set to the unique serial number for the system unit.

Likewise, certain BIOS fields have to be set correctly such as booting from network before disk to ensure the systems can be easily commissioned.

Once these basic checks have been done, the burn-in process can start. A configuration file, containing the burn-in tests to be run, is created according to the information stored in the PAT and PS FRU fields. Based on the content of the configuration file, the enabled tests will automatically start.
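As an illustration, the core of such a preliminary check could be scripted as below. This is a minimal sketch: the expected asset tag value is an assumption standing in for the delivery number recorded in the purchase order paperwork.

# Minimal sketch of a preliminary check: verify that the BMC FRU fields
# carry the CERN delivery number and a unique system serial number.
EXPECTED_ASSET_TAG="CD5792984"   # assumed value, taken from the delivery paperwork

asset_tag=$(ipmitool fru print 0 | awk -F: '/Product Asset Tag/ {gsub(/ /,"",$2); print $2}')
serial=$(ipmitool fru print 0 | awk -F: '/Product Serial/ {gsub(/ /,"",$2); print $2}')

if [ "$asset_tag" != "$EXPECTED_ASSET_TAG" ]; then
    echo "ERROR: Product Asset Tag is '$asset_tag', expected '$EXPECTED_ASSET_TAG'" >&2
    exit 1
fi
if [ -z "$serial" ]; then
    echo "ERROR: Product Serial is not set" >&2
    exit 1
fi
echo "FRU fields OK: asset tag $asset_tag, serial $serial"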
Burn in
The burn in process itself is highlighted in red in the workflow above, consisting of the following steps
  • Memory
  • CPU
  • Storage
  • Benchmarking
  • Network
Memory
The memtest stress tester is used for validation of the RAM in the system. Details of the tool are available at http://www.memtest.org/
CPU
Testing the CPU is performed using a set of burn tools, such as burnK7, burnP6 and burnMMX. These tools not only test the CPU itself but are also useful for finding cooling issues such as broken fans, since the power load is significant while the processors run these tests.
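A minimal sketch of how such a CPU stress run could be driven is shown below; the choice of burnP6, the 12 hour duration and the log location are assumptions for illustration, with the IPMI sensors being watched for cooling problems while the load is applied.

# Sketch: run one CPU burn process per core for a fixed period and record
# the IPMI fan and temperature sensors to spot cooling problems.
DURATION=$((12 * 3600))     # assumed test length of 12 hours
NCORES=$(nproc)

for i in $(seq 1 "$NCORES"); do
    burnP6 &                # burn tool from the cpuburn package
done

end=$(( $(date +%s) + DURATION ))
while [ "$(date +%s)" -lt "$end" ]; do
    ipmitool sdr type Fan
    ipmitool sdr type Temperature
    sleep 600
done >> /var/log/burnin-cpu-sensors.log

pkill burnP6                # stop the burn processes at the end of the window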
Disk
Disk burn in is intended to create the conditions for early drive failure: the aim is to push drives through the early failure ('infant mortality') part of the bathtub curve so that weak drives fail before they enter production.



With this aim, we use the badblocks code to repeatedly read and write the disks. SMART counters are then checked to see whether there are significant numbers of reallocated bad blocks, and the CERN tenders require disk replacement if the error rate is high.
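A per-drive pass along these lines is a reasonable sketch of the procedure; the device name and the acceptance threshold are assumptions, and the write-mode badblocks test is destructive, so it is only run on drives that are not yet in production.

# Sketch: destructive read/write pass over one drive with badblocks,
# followed by a check of the SMART reallocated sector counter.
DEV=/dev/sdb                # assumed device name
MAX_REALLOCATED=0           # assumed acceptance threshold

badblocks -wsv "$DEV"       # -w: destructive write test, -s: progress, -v: verbose

# Read the raw value of SMART attribute 5 (Reallocated_Sector_Ct)
reallocated=$(smartctl -A "$DEV" | awk '/Reallocated_Sector_Ct/ {print $10}')

if [ "${reallocated:-0}" -gt "$MAX_REALLOCATED" ]; then
    echo "FAIL: $DEV reports $reallocated reallocated sectors" >&2
    exit 1
fi
echo "PASS: $DEV"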

We still use this process although the primary disk storage for the operating system has now changed to SSDs. There may be a case for minimising the writes on an SSD to maximise the lifetime of the units.
Benchmarking
Many of the CERN hardware procurements are based on price for the total compute capacity needed. Given the nature of most of the physics processing, the total throughput of the compute farm is more important than the individual processor performance. Thus, the highest total performance may be achieved by choosing processors which are slightly slower but less expensive.

CERN currently measures the CPU performance using a set of benchmarks based on a subset of the SPEC 2006 suite. The subset, called HEPSpec06, is run in parallel on each of the cores in the server to determine the total throughput from the system. Details are available at the HEPiX Benchmarking Working Group web site.

Since the offers include the expected benchmark performance, the results of the benchmarking process are used to validate the technical questionnaire submitted by the vendors. All machines in the same delivery would be expected to produce similar results so variations between different machines in the same batch are investigated.

CPU benchmarking can also be used to find problems where there is significant difference across a batch, such as incorrect BIOS settings on a particular system.

Disk performance is checked using a reference fio access suite. A minimum performance level in I/O is also required in the tender documents.
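A reference run could be sketched as follows; the block size, queue depth, runtime and IOPS floor are assumptions for illustration, as the real reference suite and thresholds are defined in the tender documents.

# Sketch of a reference fio run with a minimum-performance check.
fio --name=burnin-randread --filename=/dev/sdb --direct=1 \
    --rw=randread --bs=4k --iodepth=32 --ioengine=libaio \
    --runtime=600 --time_based --group_reporting \
    --output-format=json --output=/tmp/fio-randread.json

# Extract the measured IOPS and compare against the assumed required minimum
iops=$(jq '.jobs[0].read.iops | floor' /tmp/fio-randread.json)
[ "$iops" -ge 2000 ] || echo "FAIL: only $iops read IOPS" >&2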

Networking
Networking interfaces are difficult to burn in compared to disks or CPUs. To do a reasonable validation, at least two machines are needed; with batches of hundreds of servers, a simple test against a single end point would produce unpredictable results.

Using a network broadcast, the test finds other machines running the stress test; the machines pair up and run a number of tests, as sketched after the list below.

  • iperf3 is used for bandwidth, reversed bandwidth, UDP and reversed UDP tests
  • iperf for full duplex testing (currently missing from iperf3)
  • ping is used for congestion testing
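Once two hosts have paired up, the tests themselves reduce to standard invocations along these lines; the peer discovery is omitted and the PEER address, durations and rates are assumptions (the partner machine runs the corresponding iperf3 and iperf servers).

# Sketch of the per-pair network tests once two burn-in hosts have found each other.
PEER=192.0.2.42             # assumed address of the paired machine

iperf3 -c "$PEER" -t 60             # bandwidth
iperf3 -c "$PEER" -t 60 -R          # reversed bandwidth
iperf3 -c "$PEER" -t 60 -u -b 0     # UDP
iperf3 -c "$PEER" -t 60 -u -b 0 -R  # reversed UDP

iperf -c "$PEER" -t 60 -d           # full duplex test with the older iperf

ping -c 600 -i 0.1 "$PEER"          # simple congestion/latency check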
Looking forward
CERN is currently deploying Ironic into production for bare metal management of machines. Integrating the burn in and retirement stages into the bare metal management states would bring easy visibility of the current state as the deliveries are processed.

The retirement stage is also of interest to ensure that there is no CERN configuration left in the servers (such as Ironic BMC credentials or IP addresses). CERN has often donated retired servers to other high energy physics sites, such as SESAME in Jordan and sites in Morocco, which requires a full factory reset of the servers before dismounting. This retirement step would be a more extreme cleaning followed by complete removal from the cloud.

Discussions with other scientific laboratories, such as SKA, through the OpenStack Scientific special interest group have shown interest in extending Ironic to automate the server on-boarding and retirement processes, as described in the session at the OpenStack Sydney summit. We'll be following up on these discussions at Vancouver.
Acknowledgements
  • CERN IT department - http://cern.ch/it
  • CERN Ironic and Rework Contributors 
    • Alexandru Grigore
    • Daniel Abad
    • Mateusz Kowalski

The CERN cloud resources are used for a variety of purposes from running compute intensive workloads to long running services. The cloud also provides personal projects for each user who is registered to the service. This allows a small quota (5 VMs, 10 cores) where the user can have resources dedicated for their use such as boxes for testing. A typical case would be for the CERN IT Tools training where personal projects are used as sandboxes for trying out tools such as Puppet.

Personal projects have a number of differences compared to other projects in the cloud:
  • No non-standard flavors
  • No additional quota can be requested
  • Should not be used for production services
  • VMs are deleted automatically when the person stops being a CERN user
With the number of cloud users increasing to over 3,000, there is a corresponding growth in the number of cores used by personal projects, which grew by 1,200 cores in the past year. For cases like training, it often happens that VMs are created and the user then forgets to delete the resources, so they consume cores which could otherwise be used as compute capacity to analyse the data from the LHC.



One possible approach would be to reduce the quota further. However, tests such as setting up a Kubernetes cluster with OpenStack Magnum often need several VMs to perform the different roles, so this would limit the usefulness of personal projects. Full use of the quota is also rare.



VM Expiration
Based on a previous service which offered resources on demand (called CVI, based on Microsoft SCVMM), the approach taken was to expire personal virtual machines.
  • Users can create virtual machines up to the limit of their quota
  • Personal VMs are marked with an expiry date
  • Prior to expiry, the user is sent several mails informing them that their VM will expire soon and explaining how to extend it if it is still needed.
  • On expiry, the virtual machine is locked and shut down. This helps to catch cases where people have forgotten to prolong their VMs.
  • One week later, the virtual machine is deleted, freeing up the resources.
Implementation
We use Mistral to automate several OpenStack tasks in the cloud (such as regular snapshots and project creation/deletion). This has the benefit of a clean audit log to show what steps worked/failed along with clear input/output states supporting retries and an authenticated cloud cron for scheduling.

Our OpenStack projects have some properties set when they are created. These are used to indicate additional information, such as the accounting codes to be charged for the usage. There are properties indicating the type of project (such as personal) and whether the expiration workflow should apply. Mistral YAQL code can then select the resources to which expiration applies.

task(retrieve_all_projects).result.select(dict(id => $.id, name => $.name, enabled => $.enabled, type => $.get('type','none'),expire => $.get('expire','off'))).where($.type='personal').where($.enabled).where($.expire='on')

The expire_at parameter is stored as a VM property. This makes it visible to automation and to users, for example through the openstack server show CLI.
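For illustration, the property can be inspected and set with the standard client as shown below; the instance name and the date format are assumptions, with only the property name expire_at coming from the description above.

# Inspect the expiry property on an instance (instance name assumed)
openstack server show my-personal-vm -c name -c properties

# Low-level equivalent of what the expiration workflow does when it
# (re)sets the property (date format assumed for illustration)
openstack server set --property expire_at=2018-06-30 my-personal-vm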

There are several parts to the process:
  • A cron-triggered workflow which (a simplified sketch of this pass is shown after the list):
    • ignores machines in error state or currently building
    • sets the expiration date, according to the grace period, on newly created machines which do not have one
    • checks whether any machines are getting close to their expiry time and sends a mail to the owner
    • checks for invalid settings of the expire_at property (such as values set far in the future or the property being deleted) and restores a reasonable value if this is detected
    • locks and shuts down machines which have reached their expiry date
    • deletes machines which have passed their expiry date by more than the grace period
  • A workflow, launched from Horizon or the CLI, which:
    • retrieves the expire_at value and extends it by the prolongation period
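A much simplified, imperative sketch of the cron-triggered pass is shown below. The real implementation is a Mistral workflow; the property name and the lock/stop/delete steps follow the description above, while the concrete periods, the project filtering and the mail notifications are assumptions or omitted.

# Simplified sketch of the expiration pass (the production version is a
# Mistral workflow; this only illustrates the per-VM decision logic).
INITIAL_GRACE_DAYS=180   # assumed initial expiry period for new machines
DELETE_GRACE_DAYS=7      # one week between shutdown and deletion
NOW=$(date +%s)

# Restricting the list to personal projects is omitted in this sketch.
for vm in $(openstack server list --all-projects -f value -c ID); do
    status=$(openstack server show "$vm" -f value -c status)
    [ "$status" = "ERROR" ] && continue    # ignore machines in error state
    [ "$status" = "BUILD" ] && continue    # ignore machines still building

    expire=$(openstack server show "$vm" -f value -c properties \
             | grep -o "expire_at='[^']*'" | cut -d"'" -f2)

    if [ -z "$expire" ]; then
        # Newly created machine without an expiry date: set one
        openstack server set \
            --property expire_at="$(date -d "+$INITIAL_GRACE_DAYS days" +%Y-%m-%d)" "$vm"
        continue
    fi

    expire_ts=$(date -d "$expire" +%s)
    if [ "$NOW" -ge $(( expire_ts + DELETE_GRACE_DAYS * 86400 )) ]; then
        openstack server delete "$vm"      # past expiry plus grace period: delete
    elif [ "$NOW" -ge "$expire_ts" ]; then
        openstack server lock "$vm"        # reached expiry: lock and shut down
        openstack server stop "$vm"
    fi
    # Mail notifications ahead of expiry are omitted in this sketch.
done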
The user notification is done using a set of mail templates and a dedicated workflow (https://gitlab.cern.ch/cloud-infrastructure/mistral-workflows/blob/master/workflows/send_mail_template.yaml). This allows templates, such as instance reminders, to include details about the resources concerned, as in this extract from the mail template.

    The Virtual Machine {instance} from the project {project_name} in the Cloud Infrastructure Service will expire on {expire_date}.

    A couple of changes to Mistral will be submitted upstream
    • Support for HTML mail bodies which allows us to have a nicer looking e-mail for notification with links included
    • Support for BCC/CC on the mail so that the OpenStack cloud administrator e-mail can also be kept on copy when there are notifications
    A few minor changes to Horizon were also done (currently local patches)
    • Display expire_at value on the instance details page
• Add a 'prolong' action so that instances can be prolonged via the web interface, using the properties editor to set the new expiry date (defaulting to the current date with the expiry time). This launches the workflow for prolonging the instance.
    Author
    Jose Castro Leon from the CERN cloud team
    Motivation
The CERN cloud consists of around 8,500 hypervisors providing over 36,000 virtual machines. These provide the compute resources both for the laboratory's physics program and for the organisation's administrative operations, such as paying bills and reserving rooms at the hostel.

The resources themselves are generally ordered once or twice a year, with servers being kept for around 5 years. Within the CERN budget, the resource planning team looks at:
  • The resources required to run the computing services for the CERN laboratory. These are projected using capacity planning trend data and upcoming projects such as video conferencing.
With the installation and commissioning of thousands of servers concurrently (along with their associated decommissioning 5 years later), there are scenarios to exploit underutilised servers. Programs such as LHC@Home are used, but we have also been interested in expanding the cloud to provide virtual machine instances which can be rapidly terminated in the event of:
  • Resources being required for IT services as they scale out for events, such as a large scale web cast on a popular topic, or to provision instances for a new version of an application.
  • Partially full hypervisors where the last remaining cores are not being requested (the Tetris problem).
  • Compute servers at the end of their lifetime which are used to the full before being removed from the computer centre to make room for new deliveries which are more efficient and in warranty.
The key characteristic of this workload is that it should be possible to stop an instance within a short time (a few minutes), in contrast to a traditional physics job.
Resource Management in OpenStack
Operators use project quotas to ensure fair sharing of their infrastructure. The problem is that quotas act as hard limits. This leads to dedicating resources to workloads even if they are not used all the time, or to situations where resources are not available even though there is still quota to use.

At the same time, the demand for cloud resources is increasing rapidly. Since there is no cloud with infinite capacity, operators need a way to optimise resource utilisation before proceeding to an expansion of their infrastructure.

Resources can end up in an idle state, leading to lower cloud utilisation than the full potential of the acquired equipment, while the users' requirements keep growing.

The concept of preemptible instances can be the solution to this problem. This type of server can be spawned on top of the project's quota, making use of the otherwise underutilised capacity. When the resources are requested by tasks with higher priority (such as approved quota), the preemptible instances are terminated to make space for the new VM.
Preemptible Instances with OpenStack
Supporting preemptible instances would mirror the AWS Spot Market and Google Preemptible Instances. There are multiple things to be addressed as part of an implementation with OpenStack, but the most important can be reduced to these:
    1. Tagging Servers as Preemptible
In order to be able to distinguish between preemptible and non-preemptible servers, the instances need to be tagged at creation time. This property should be immutable for the lifetime of the server (an illustration follows the list below).
2. Who gets to use preemptible instances
There is also the need to limit which users/projects are allowed to use preemptible instances. An operator should be able to choose which users are allowed to spawn this type of VM.
3. Selecting servers to be terminated
Considering that the preemptible instances can be scattered across different cells, availability zones and aggregates, there has to be “someone” able to find the existing instances, decide how to free up the requested resources according to the operator’s needs and, finally, terminate the appropriate VMs.
4. Quota on top of the project’s quota
In order to avoid possible misuse, there needs to be a way to control the amount of preemptible resources that each user/project can use. This means that, apart from the quota for the standard resource classes, there should be a way to enforce quotas on the preemptible resources too.
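None of this exists in stock Nova today; purely to illustrate the tagging point above, an immutable marker could be carried as an instance property set at creation time. The property name, image, flavor and server name below are all assumptions, and nothing in current Nova interprets or enforces them.

# Illustration only: mark an instance as preemptible at creation time via a
# metadata property. The property name is an assumption and is not enforced
# by stock Nova; immutability would need the changes discussed in this post.
openstack server create \
    --image cc7-base --flavor m2.small \
    --property preemptible=true \
    preemptible-worker-01

# A reaper-style service could later inspect candidates for this marker:
openstack server show preemptible-worker-01 -f value -c properties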
    OPIE : IFCA and Indigo Dataclouds
In 2014, there were the first investigations into approaches by Alvaro Lopez from IFCA (https://blueprints.launchpad.net/nova/+spec/preemptible-instances). As part of the EU Indigo Datacloud project, this led to the development of the OpenStack Pre-Emptible Instances package (https://github.com/indigo-dc/opie). This was written up in a paper in the Journal of Physics: Conference Series (http://iopscience.iop.org/article/10.1088/1742-6596/898/9/092010/pdf) and presented at the OpenStack summit (https://www.youtube.com/watch?v=eo5tQ1s9ZxM).
    Prototype Reaper Service
    At the OpenStack Forum during a recent OpenStack summit, a detailed discussion took place on how spot instances could be implemented without significant changes to Nova. The ideas were then followed up with the OpenStack Scientific Special Interest Group.

Trying to address the different aspects of the problem, we are currently prototyping a “Reaper” service. This service acts as an orchestrator for preemptible instances. Its sole purpose is to decide how to free up the preemptible resources when they are requested for another task.

The reason for implementing this prototype is mainly to help us identify the changes needed in the Nova codebase to support preemptible instances.

    More on this WIP can be found here: 
Summary
The concept of preemptible instances gives operators the ability to provide a more "elastic" capacity. At the same time, it enables the handling of increased demand for resources, with the same infrastructure, by maximising the cloud utilisation.

This type of server is perfect for tasks and applications that can be terminated at any time, enabling users to take advantage of extra CPU power on demand without the fixed limits that quotas enforce.

Finally, here at CERN, there is an ongoing effort to provide a prototype orchestrator for preemptible servers with OpenStack, in order to pinpoint the changes needed in Nova to support this feature optimally. This could also be made available in future for other OpenStack clouds in use by CERN, such as the T-Systems Open Telekom Cloud through the Helix Nebula Open Science Cloud project.
    Contributors
    • Theodoros Tsioutsias (CERN openlab fellow working on Huawei collaboration)
    • Spyridon Trigazis (CERN)
    • Belmiro Moreira (CERN)
At CERN, we have around 8,500 hypervisors running 36,000 guest virtual machines. These provide the compute resources both for the laboratory's physics program and for the organisation's administrative operations, such as paying bills and reserving rooms at the hostel. These resources are spread over many different server configurations, some of them over 5 years old.

With the accelerator stopped over the CERN annual closure until mid-March, this is a good period to plan reconfigurations of compute resources, such as the migration of our central batch system, which schedules the jobs across the central compute resources, to a new system based on HTCondor. The compute resources are heavily used, but there is more flexibility to drain some parts in the quieter periods of the year, when there is not 10PB/month coming from the detectors. However, this year we had an unexpected additional task: deploying the fixes for the Meltdown and Spectre exploits across the centre.

The CERN environment is based on Scientific Linux CERN 6 and CentOS 7. The hypervisors are now entirely CentOS 7 based, with guests running a variety of operating systems including Windows flavors and CERNVM. The campaign to upgrade involved a number of steps:
    • Assess the security risk
    • Evaluate the performance impact
    • Test the upgrade procedure and stability
    • Plan the upgrade campaign
    • Communicate with the users
    • Execute the campaign
    Security Risk
    The CERN environment consists of a mixture of different services, with thousands of projects on the cloud, distributed across two data centres in Geneva and Budapest. 

    Two major risks were identified
    • Services which provided the ability for end users to run their own programs along with others sharing the same kernel. Examples of this are the public login services and batch farms. Public login services provide an interactive Linux environment for physicists to log into from around the world, prepare papers, develop and debug applications and submit jobs to the central batch farms. The batch farms themselves provide 1000s of worker nodes processing the data from CERN experiments by farming event after event to free compute resources. Both of these environments are multi-user and allow end users to compile their own programs and thus were rated as high risk for the Meltdown exploit.
    • The hypervisors provide support for a variety of different types of virtual machines. Different areas of the cloud provide access to different network domains or to compute optimised configurations. Many of these hypervisors will have VMs owned by different end users and therefore can be exposed to the Spectre exploits, even if the performance is such that exploiting the problem would take significant computing time.
    The remaining VMs are for dedicated services without access for end user applications or dedicated bare metal servers for I/O intensive applications such as databases and disk or tape servers.

There are a variety of different hypervisor configurations, which we split by processor type (in view of the Spectre microcode patches). Each of these needs independent performance and stability checks. The microcode status was assessed per processor signature as follows (assessment, number of hypervisors, processor names):


  • 06-3f-02 (covered, 3332 hypervisors): E5-2630 v3 @ 2.40GHz, E5-2640 v3 @ 2.60GHz
  • 06-4f-01 (covered, 2460 hypervisors): E5-2630 v4 @ 2.20GHz, E5-2650 v4 @ 2.20GHz
  • 06-3e-04 (hopefully, 1706 hypervisors): E5-2650 v2 @ 2.60GHz
  • ?? (unclear, 427 hypervisors): AMD Opteron(TM) Processor 6276 (CPU family 21, model 1, stepping 2)
  • 06-2d-07 (unclear, 333 hypervisors): E5-2630L 0 @ 2.00GHz, E5-2650 0 @ 2.00GHz
  • 06-2c-02 (unlikely, 168 hypervisors): E5645 @ 2.40GHz, L5640 @ 2.27GHz, X5660 @ 2.80GHz

    These risks were explained by the CERN security team to the end users in their regular blogs.
    Evaluating the performance impact
The High Energy Physics community uses a suite called HEPSPEC06 to benchmark compute resources. These are synthetic programs based on the C++ components of SPEC CPU2006 which match the instruction mix of typical physics programs. With this benchmark, we have started to re-benchmark (the majority of) the CPU models we have in the data centres, both on the physical hosts and on the guests. The measured performance loss across all architectures tested so far is about 2.5% in HEPSPEC06 (a number also confirmed by one of the LHC experiments using their real workloads), with a few cases approaching 7%. So for our physics codes, the effect of patching seems measurable, but much smaller than many expected.
    Test the upgrade procedure and stability
    With our environment based on CentOS and Scientific Linux, the deployment of the updates for Meltdown and Spectre were dependent on the upstream availability of the patches. These could be broken down into several parts
  • Firmware for the processors: the microcode_ctl packages provide additional patches to protect against some parts of Spectre. This package proved very dynamic, as new processor firmware was being added on a regular basis, and it was not always clear when it needed to be applied: the package version would increase, but it did not always include an update for the particular hardware type in use. Following the Intel release notes, there were combinations such as "HSX C0(06-3f-02:6f) 3a->3b", which explains that the processor described by 06-3f-02:6f is upgraded from release 0x3a to 0x3b. The fields are the CPU family, model and stepping from /proc/cpuinfo, and the currently loaded firmware level can be found at /sys/devices/system/cpu/cpu0/microcode/version (a minimal sketch of this check is shown after the list). A simple script (spectre-cpu-microcode-checker.sh) was made available to the end users so they could check their systems, and it was also used by the administrators to validate the central IT services.
  • For the operating system, we used a second script (spectre-meltdown-checker.sh), derived from the upstream GitHub code at https://github.com/speed47/spectre-meltdown-checker. The team maintaining this package was very responsive in incorporating our patches so that other sites could benefit from the combined analysis.
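At its core, the microcode part of such a check boils down to deriving the processor signature and reading the loaded firmware revision; the sketch below shows only that part, while the per-model table of expected revisions (taken from the Intel release notes) is omitted.

# Sketch: derive the 06-3f-02 style signature used above and show the
# microcode revision currently loaded on CPU 0.
family=$(grep -m1 'cpu family' /proc/cpuinfo | awk '{print $NF}')
model=$(grep -m1 -w 'model' /proc/cpuinfo | awk '{print $NF}')
stepping=$(grep -m1 'stepping' /proc/cpuinfo | awk '{print $NF}')

printf 'CPU signature : %02x-%02x-%02x\n' "$family" "$model" "$stepping"
echo "Microcode rev : $(cat /sys/devices/system/cpu/cpu0/microcode/version)"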
    Communication with the users
    For the cloud, there are several resource consumers.
    • IT service administrators who provide higher level functions on top of the CERN cloud. Examples include file transfer services, information systems, web frameworks and experiment workload management systems. While some are in the IT department, others are representatives of their experiments or supporters for online control systems such as those used to manage the accelerator infrastructure.
    • End users consume cloud resources by asking for virtual machines and using them as personal working environments. Typical cases would be a MacOS user who needs a Windows desktop where they would create a Windows VM and use protocols such as RDP to access it when required.
    The communication approach was as follows:
    • A meeting was held to discuss the risks of exploits, the status of the operating systems and the plan for deployment across the production facilities. With a Q&A session, the major concerns raised were around potential impact on performance and tuning options. 
    • An e-mail was sent to all owners of virtual machine resources informing them of the upcoming interventions.
    • CERN management was informed of the risks and the plan for deployment.
    CERN uses ServiceNow to provide a service desk for tickets and a status board of interventions and incidents. A single entry was used to communicate the current plans and status so that all cloud consumers could go to a single place for the latest information.
    Execute the campaign
With the accelerator starting up again in March and the risk of the exploits, the approach taken was to complete the upgrades to the infrastructure in January, leaving February to find and resolve any residual problems. As the handling of the compute/batch part of the infrastructure was relatively straightforward (with only one service on top), we will focus in the following on the more delicate part: the hypervisors running services which support several thousand users in their daily work.

    The layout of our infrastructure with its availability zones (AVZs) determined the overall structure and timeline of the upgrade. With effectively four AVZs in our data centre in Geneva and two AVZs for our remote resources in Budapest, we scheduled the upgrade for the services part of the resources over four days.


The main zones in Geneva were done one per day, with a break after the first one (GVA-A) in case there were unexpected difficulties to handle on the infrastructure or application side. The remaining zones were scheduled on consecutive days (GVA-B and GVA-C), and the smaller ones (critical, WIG-A, WIG-B) in sequential order on the last day. This way we upgraded around 400 hosts with 4,000 guests per day.

Within each zone, hypervisors were divided into 'reboot groups' which were restarted and checked before the next group was handled. These groups were determined by the OpenStack cells underlying the corresponding AVZs. Since some services required the window of service downtime to be limited, their hosting servers were moved to the special Group 1, the only one for which we could give a precise start time.

For each group several steps were performed (a condensed sketch of the per-host part follows the list):
    • install all relevant packages
    • check the next kernel is the desired one
    • reset the BMC (needed for some specific hardware to prevent boot problems)
    • log the nova and ping state of all guests
    • stop all alarming 
    • stop nova
    • shut down all instances via virsh
    • reboot the hosts
    • ... wait ... then fix hosts which did not come back
    • check running kernel and vulnerability status on the rebooted hosts
    • check and fix potential issues with the guests
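A condensed sketch of the per-host part of such a reboot group is shown below; alarming is assumed to have been silenced already, the service name matches the CentOS 7 hypervisors, and the log location is an assumption.

# Condensed sketch of the per-host steps in a reboot group.
LOG=/var/log/spectre-reboot-$(hostname -s).log

# Record the state of all guests before touching anything
virsh list --all >> "$LOG"

# Stop nova-compute so nova stays consistent while we bypass the API
systemctl stop openstack-nova-compute

# Shut down all running guests directly via libvirt (faster than the API)
for dom in $(virsh list --name --state-running); do
    virsh shutdown "$dom"
done

# Give the guests time to stop cleanly, then reboot the hypervisor
sleep 300
systemctl reboot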
Shutting down virtual machines via 'virsh', rather than the OpenStack APIs, was chosen to speed up the overall process, even if this required switching off nova-compute on the hosts as well (to keep nova in a consistent state). An alternative to issuing 'virsh' commands directly would be to configure 'libvirt-guests', especially in the context of the question of whether guests should be shut down and rebooted (which we did during this campaign) or paused and resumed. This is an option we'll have a look at when preparing for similar campaigns in the future.

As some of the hypervisors in the cloud had very long uptimes, and this was the first time we systematically rebooted the whole infrastructure since the service went into full production about five years ago, we were not quite sure what kind of issues to expect, and in particular at which scale. To our relief, the problems encountered on the hosts hit less than 1% of the servers and included (in descending order of appearance):
    • hosts stuck in shutdown (solved by IPMI reset)
    • libvirtd stuck after reboot (solved by another reboot)
    • hosts without network connectivity (solved by another reboot)
    • hosts stuck in grub during boot (solved by reinstalling grub) 
On the guest side, virtual machines were mostly OK when the underlying hypervisor was OK as well.
A few additional cases included:
• incomplete kernel upgrades, so the root partition could not be found (solved by booting back into an older kernel and reinstalling the desired kernel)
    • file system issues (solved by running file system repairs)
    So, despite initial worries, we hit no major issues when rebooting the whole CERN cloud infrastructure!
    Conclusions
While this kind of security issue does not arise very often, the key parts of the campaign follow standard steps, namely assessing the risk, planning the update, communicating with the user community, executing the campaign and handling incomplete updates.

Using cloud availability zones to schedule the deployment allowed users to easily understand when there would be an impact on their virtual machines, and it encourages the good practice of load balancing resources across zones.
    Authors
    • Arne Wiebalck
    • Jan Van Eldik
    • Tim Bell

    While most of the machines on the CERN cloud are configured using Puppet with state stored in external databases or file stores, there are a few machines where this has been difficult, especially for legacy applications.

    Doing a regular snapshot of these machines would be a way of protecting against failure scenarios such as hypervisor failure or disk corruptions.

    This could always be scripted by the project administrator using the standard functions in the openstack client but this would also involve setting up the schedules and the credentials externally to the cloud along with appropriate skills for the project administrators. Since it is a common request, the CERN cloud investigated how this could be done as part of the standard cloud offering.

    The approach that we have taken uses the Mistral project to execute the appropriate workflows at a scheduled time. The CERN cloud is running a mixture of OpenStack Newton and Ocata but we used the Mistral Pike release in order to have the latest set of fixes such as in the cron triggers. With the RDO packages coming out in the same week as the upstream release, this avoided doing an upgrade later.

    Mistral has a set of terms which explain the different parts of a workflow (https://docs.openstack.org/mistral/latest/terminology).

    The approach needed several steps
    • Mistral tasks to define the steps
    • Mistral workflows to provide the order to perform the steps in
    • Mistral cron triggers to execute the steps on schedule

    Mistral Workflows

    The Mistral workflows consist of a set of tasks and a process which decides which task to execute next based on different branch criteria such as success of a previous task or the value of some cloud properties.

    Workflows can be private to the project, shared or public. By making these scheduled snapshot workflows public, the cloud administrators can improve the tasks incrementally and the cloud projects will receive the latest version of the workflow next time they execute them. With the CERN gitlab based continuous integration environment, the workflows are centrally maintained and then pushed to the cloud when the test suites have completed successfully.

    The following Mistral workflows were defined

    instance_snapshot

Virtual machines can be snapshotted so that a copy of the virtual machine is saved and can be used for recovery or cloning in the future. The instance_snapshot workflow performs this operation both for virtual machines which have been booted from volume and for those booted locally.

The workflow takes the following parameters:
  • instance (mandatory): the name of the instance to be snapshotted.
  • pattern (default: {0}_snapshot_{1}): the name of the snapshot to store; {0} is replaced by the instance name and {1} by the date in the format YYYYMMDDHHMM.
  • max_snapshots (default: 0, i.e. keep all): the number of snapshots to keep; older snapshots are cleaned from the store when new ones are created.
  • wait (default: false): only complete the workflow when the steps have been completed and the snapshot is stored in the image storage.
  • instance_stop (default: false, i.e. do not stop the instance): shut the instance down before snapshotting and boot it up afterwards.
  • to_addr_success (default: null, i.e. no mail sent): e-mail address to send the report to if the workflow is successful.
  • to_addr_error (default: null, i.e. no mail sent): e-mail address to send the report to if the workflow failed.

    The steps for this workflow are described in the detail in the YAML/YAQL files at https://gitlab.cern.ch/cloud-infrastructure/mistral-workflows.

The operation is very fast for Ceph-based boot-from-volume instances, since the snapshot is done within Ceph. It can however take up to a minute for locally booted VMs, while the hypervisor ensures the complete disk contents are available. The VM is then resumed and the locally booted snapshot is sent to Glance in the background.

    The high level steps are

  • Identify the server
  • Stop the instance if requested by instance_stop
  • If the VM is locally booted:
    • Snapshot the instance
    • Clean up the oldest image snapshot if over max_snapshots
  • If the VM is booted from volume:
    • Snapshot the volume
    • Clean up the oldest volume snapshot if over max_snapshots
  • Start the instance if requested by instance_stop
  • If there is an error and to_addr_error is set:
    • Send an e-mail to to_addr_error
  • If there is no error and to_addr_success is set:
    • Send an e-mail to to_addr_success
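For reference, the underlying operations the workflow wraps correspond roughly to the following client calls; the snapshot name follows the default pattern above, and the volume lookup for boot-from-volume instances is simplified.

# Roughly what instance_snapshot does for the two cases, as plain client calls.
INSTANCE=timbfvlinux143
SNAP="${INSTANCE}_snapshot_$(date +%Y%m%d%H%M)"

# Locally booted VM: create a Glance image snapshot of the instance
openstack server image create --name "$SNAP" "$INSTANCE"

# Boot-from-volume VM: snapshot the root volume instead (the volume lookup
# is simplified here; the workflow resolves it from the instance attachment)
VOLUME=$(openstack server show "$INSTANCE" -f value -c volumes_attached | cut -d"'" -f2)
openstack volume snapshot create --volume "$VOLUME" --force "$SNAP"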

    restore_clone_snapshot
For applications which are not highly available, a common configuration is to use a LanDB alias pointing to a particular VM. In the event of a failure, the VM can be cloned from a snapshot and the LanDB alias updated to reflect the new endpoint location for the service. The workflow, called restore_clone_snapshot, will also create a new volume if the source instance is booted from volume.

The source instance needs to still be defined, since information such as the properties, flavor and availability zone is not included in the snapshot; these are propagated from the source by default.

The workflow takes the following parameters:
  • instance (mandatory): the name of the instance from which the snapshot will be cloned.
  • date (mandatory): the date of the snapshot to clone (either YYYYMMDD or YYYYMMDDHHMM).
  • pattern (default: {0}_snapshot_{1}): the name of the snapshot to clone; {0} is replaced by the instance name and {1} by the date.
  • clone_name (mandatory): the name of the new instance to be created.
  • avz_name (default: same as the source instance): the availability zone to create the clone in.
  • flavor (default: same as the source instance): the flavour for the cloned instance.
  • meta (default: all properties are copied from the source[1]): the properties to copy to the new instance.
  • wait (default: false): only complete the workflow when the steps have been completed and the cloned VM is running.
  • to_addr_success (default: null, i.e. no mail sent): e-mail address to send the report to if the workflow is successful.
  • to_addr_error (default: null, i.e. no mail sent): e-mail address to send the report to if the workflow failed.

    Thus, cloning the machine timbfvlinux143 to timbfvclone143 requires running the workflow with the parameters

{"instance": "timbfvlinux143", "clone_name": "timbfvclone143", "date": "20170830"}

    This results in

  • A new volume created from the snapshot timbfvlinux143_snapshot_20170830
  • A new VM, called timbfvclone143, booted from the new volume
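From the command line, running this public workflow amounts to passing those parameters as the workflow input; the client syntax below is from the Mistral OpenStack client plugin and may vary slightly between releases.

# Execute the restore_clone_snapshot workflow with the parameters above
openstack workflow execution create restore_clone_snapshot \
    '{"instance": "timbfvlinux143", "clone_name": "timbfvclone143", "date": "20170830"}'

# Follow the execution and inspect the individual task results
openstack workflow execution list
openstack task execution list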

An instance clone can be run for VMs which are booted from volume even when the hypervisor is not running. A machine can then be recovered from its current state using the following procedure:

  • Instance snapshot of the original machine
  • Instance clone from that snapshot (using today's date)
  • If DNS aliases are used, the alias can then be updated to point to the new instance name

    For Linux guests, the rename of the hostname to the clone name occurs as the machine is booted. In the CERN environment, this took a few minutes to create the new virtual machine and then up to 10 minutes to wait for the DNS refresh.

    For Windows guests, it may be necessary to refresh the Active Directory information given the change of hostname.
    restore_inplace_snapshot

    In the event of an issue such as a bad upgrade, the administrator may wish to roll back to the last snapshot. This can be done using the restore_inplace_snapshot workflow.

This operation works for locally booted machines and maintains the IP and MAC address, but it cannot be used if the hypervisor is down. It does not currently work for boot-from-volume instances until the revert-to-snapshot feature (available in Pike, see https://specs.openstack.org/openstack/cinder-specs/specs/pike/cinder-volume-revert-by-snapshot.html) is in production.

The workflow takes the following parameters:
  • instance (mandatory): the name of the instance to be restored from a snapshot.
  • date (mandatory): the date of the snapshot to restore from (either YYYYMMDD or YYYYMMDDHHMM).
  • pattern (default: {0}_snapshot_{1}): the name of the snapshot to restore from; {0} is replaced by the instance name and {1} by the date.
  • wait (default: false): only complete the workflow when the steps have been completed and the restored VM is running.
  • to_addr_success (default: null, i.e. no mail sent): e-mail address to send the report to if the workflow is successful.
  • to_addr_error (default: null, i.e. no mail sent): e-mail address to send the report to if the workflow failed.

    Mistral Cron Triggers
Mistral has another nice feature: it is able to run a workflow at regular intervals. Compared to standard Unix cron, Mistral cron triggers use Keystone trusts to save the user token when the trigger is created. Thus, the execution is able to run without needing credentials such as a password or a valid Kerberos token.
A cron trigger can be created via Horizon or the CLI with the following parameters:
  • Name: the name of the cron trigger, e.g. Nightly Snapshot
  • Workflow ID: the name or UUID of the workflow, e.g. instance_snapshot
  • Params: a JSON dictionary of the workflow parameters, e.g. {"instance": "timbfvlinux143", "max_snapshots": 5, "to_addr_error": "theadmin@cern.ch"}
  • Pattern: a cron schedule pattern (see http://en.wikipedia.org/wiki/Cron), e.g. 0 5 * * * to run daily at 5 a.m.

This will then execute the instance snapshot daily at 5 a.m., sending a mail to theadmin@cern.ch in the event of a failure of the snapshot. The 5 most recent snapshots will be kept.
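Created from the CLI instead of Horizon, the same trigger would look roughly as follows; the flag names come from the Mistral client plugin and may differ slightly between releases.

# Create the nightly snapshot trigger from the CLI; the pattern 0 5 * * *
# fires once per day at 05:00 and the parameters match the example above.
openstack cron trigger create nightly_snapshot instance_snapshot \
    '{"instance": "timbfvlinux143", "max_snapshots": 5, "to_addr_error": "theadmin@cern.ch"}' \
    --pattern "0 5 * * *"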

    Mistral Executions
When Mistral runs a workflow, it provides details of the steps executed and the timestamps for start and end, along with the results. Each step can be inspected individually as part of debugging and root cause analysis in the event of failures.
The Horizon interface makes it easy to select the failing tasks. Some tasks may be reported as 'error', but these steps can have subsequent actions which succeed, so an error step may be a normal part of a successful task execution, such as falling back to a default when no value can be found.


    Credits
    • Jose Castro Leon from the CERN IT cloud team did the implementation of the Mistral project and the workflows described.



    [1] Except for a CERN specific one called landb-alias for a DNS alias
    At the Boston Forum, there were many interesting discussions on models which could be used for nested quota management (https://etherpad.openstack.org/p/BOS-forum-quotas).

    Some of the background for the use has been explained previously in the blog (http://openstack-in-production.blogspot.fr/2016/04/resource-management-at-cern.html), but the subsequent discussions have also led to further review.

    With the agreement to store the quota limits in Keystone
    (https://specs.openstack.org/openstack/keystone-specs/specs/keystone/ongoing/unified-limits.html), the investigations are now focussing on the exact semantics of nested project quotas. This becomes especially true as the nesting levels go beyond 2.

    There are a variety of different perspectives on the complex problem such that there is not yet a consensus on the right model. The policy itself should be replaceable so that different use cases can implement alternative algorithms according to their needs.

    The question we are faced with is the default policy engine to implement. This is a description of some scenarios considered at CERN.

The following use cases arise in the context of the CERN cloud:
1. An LHC experiment, such as ATLAS, is given a pledge of resources of X vCPUs. The experiment should decide the priority of allocation of resources across its working groups (e.g. ATLAS Higgs studies) and applications. These allocations between different ATLAS teams should not be arbitrated by the CERN cloud provider, but rather within the envelope of the allocation for the experiment and the ratio of capacity decided by the ATLAS experiment. This would produce typically around 50 child projects for each parent and a nesting level of 2 to 3.
2. A new user of the CERN cloud is allocated a small resource limit (typically up to 10 cores) for testing and prototyping. This means a new user can experiment with how to create VMs, use containers through Magnum, and use Ceph shares and block storage. This can lead to under-utilised resources (such as tests where the resources have not been deleted afterwards) or inappropriate usage such as 'crowd-sourcing' resources for production. With thousands of users of the CERN cloud, this can become a significant share of the cloud. Using nested projects, these users could be associated with an experiment and a total cap placed on their usage. The experiment would then arbitrate between different users. This would give up to 400 child projects per parent and a nesting level of 2.
    We would not expect nesting levels to exceed 3 based on the current scenarios.

    Currently, we have around 3,400 projects in the CERN cloud.

A similar use case is the channel partner model for some SMEs using the cloud, where the SME owns the parent project with the cloud provider and then allocates resources out to customers of the SME's services (such as a useful dataset), charging the customers an uplift on the cloud provider's standard pricing to cover their development and operations costs.

    Looking through the review of the different options (https://review.openstack.org/#/c/441203/), from the CERN perspective,
• CERN child projects would initially be created with a limit of 0. Having more than 0 would potentially mean that projects could not be created if the parent project's quota were already exceeded.
• We would like to have resources allocated at any level of the tree. If I have a project which I want to split into two sub-projects, I would need to create the two sub-projects and then arrange to move or re-create the VMs. Requiring the resources to be only in the leaves would make it difficult to perform this operation while ensuring application availability (thus I agree with Chet in the review).
• Overbooking should not be permitted in the default policy engine. Thus, the limits on the child projects should sum up to less than the limit on the parent. This was a topic of much debate in the CERN team, but it was felt that permitting overbooking would require a full traversal of the tree for each new resource creation, which would be very expensive in cases like the personal tenants. It also makes the limits on a project visible to the user of that project, rather than them seeing an out-of-quota error because a project higher up in the tree has a restriction.
• The limits for a project should be set, at minimum, by the parent project administrator. It is not clear to CERN that there would be a use case where, in a tree of 3 or more levels, administrators higher up the tree than the parent project would need to be able to change a lower project's limits. A policy.json for setting the limit would allow a mixture of implementations if needed.
• It should be possible for an administrator to lower the limit on a child project below the current usage. This allows a resource co-ordinator to make the decision regarding resource allocation and inform the child project administrators to proceed with the implementation. Any project with usage over its limit would not be able to create new resources. This would also give the natural semantics in the unified limits proposal, where the limits move to Keystone, and it avoids having callbacks to the relevant project when changes are made to the limits.
    • Each project would have one parent in a tree like structure
• There is no need for per-user (user_id) limits, so the only unit to consider is the project. The one use case where we had been considering using this is now replaced by the second example given above.
    • Given the implementation constraints, these can be parts of a new API but
      • Limits would be stored in Keystone and thus any call back to Nova, Cinder, Swift, … would be discouraged
      • A chatty protocol on resource creation which required multiple iterations with the service is non-ideal.

Within the options described in the review, this comes closest to the Strict Hierarchy Default Closed model.

    For consistency, I'll define the following terms:
    • Limit is the maximum amount of a particular resource which can be consumed by a project. This is assumed to be stored in Keystone as an attribute of the project.
    • Used is the resources actually used by the project itself.
    • Hierarchical Usage is the usage of all of the child projects below the project in the tree.

To give an example, the ATLAS project has a limit (L) of 100 cores but no resources used (U) inside that project itself. However, it has child projects, Physics and Operations, whose resources sum into the hierarchical usage (HU). These each have two child projects with resources allocated and limits.

    The following rules would be applied

  • The quota administrator would not be able to increase the limit of a child project such that the sum of the limits of the child projects exceeds the limit of the parent.
    • The ATLAS administrator could not increase the limit for either Physics or Operations
    • The Physics administrator could increase the limit for either the Higgs or Simulation project by 10
    • The Operations administrator could not increase the limit for either Workflow or Web
  • The quota administrator cannot set the limit on a project such that the limits on its children exceed the parent limit
    • The ATLAS administrator could not set the Operations limit to be 50, as Limit(Workflow)+Limit(Web)>Limit(Operations)
  • The quota administrator can set the limit below the usage
    • The Physics administrator could lower the Simulation limit to 5 even though the used resources are at 10
  • Creating a new resource requires that the usage in the project is less than or equal to the limit at all levels of the tree
    • No additional resources could be created in Simulation since Used(Simulation)>=Limit(Simulation)
    • No additional resources could be created in Higgs as HierarchicalUsage(Physics)>=Limit(Physics). The error message for this case would need to indicate that the quota limit is in Physics.
    • Up to 25 new resources could be created in Web since Usage(Web)+25<=Limit(Web), HierarchicalUsage(Operations)+25<=Limit(Operations) and HierarchicalUsage(ATLAS)+25<=Limit(ATLAS). After this operation:
      • Used(Web)=30
      • HierarchicalUsage(Operations)=60
      • HierarchicalUsage(ATLAS)=80
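Putting these rules together, and following the semantics above, the admission check for creating r new units of a resource in project P can be written as:

U(P) + r \le L(P) \quad\text{and}\quad HU(A) + r \le L(A) \;\;\text{for every ancestor } A \text{ of } P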


Based on past experience with Nova quotas, the aim would be to calculate all the usages (both Used and HierarchicalUsage) dynamically at resource creation time. The calculation of the hierarchical usage could be expensive, however, since it requires navigating the whole tree for each resource creation. Some additional Keystone calls to get the entire project tree would also be needed. This may limit the use in large scale environments, but the performance would only be affected where the functionality was used.
The CERN OpenStack cloud service has been providing block storage via Cinder since the Havana days in early 2014. Users can choose from seven different volume types, which offer different physical locations, different power feeds and different performance characteristics. All volumes are backed by Ceph, deployed in three separate clusters across two data centres.

    Due to its flexibility, the volume concept has become very popular with users and the service has hence grown during the past years to over 1PB of allocated quota, hosted in more than 4'000 volumes. In this post, we'd like to share some of the features we use and point out some of the pitfalls we've run into when running (a very stable and easy to maintain) Cinder service in production.

    Avoiding sadness: Understanding the 'host' parameter

With the intent of increasing resiliency, we configured the service from the start to run on multiple hosts. The three controller nodes were set up identically, so all of them ran the API ('c-api'), scheduler ('c-sched') and volume ('c-vol') services.

With the first upgrades, however, we realised that there was a coupling between a volume and the 'c-vol' service that had created it: each volume is associated with its creation host which, by default, is identified by the hostname of the controller. So, when the first controller needed to be replaced, the 'c-sched' wasn't able to find the original 'c-vol' service which would be able to execute volume operations. At the time, we fixed this by changing the corresponding volume entries in the Cinder database to point to the new host that was added.
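As a side note, later Cinder releases ship a cinder-manage command for exactly this kind of re-pointing, which avoids editing the database by hand; the host strings below are placeholders and the command's availability depends on the release in use.

# Re-point volumes from a retired controller to its replacement without
# editing the database directly (host@backend strings are placeholders).
cinder-manage volume update_host \
    --currenthost old-controller@ceph --newhost new-controller@ceph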

As the Cinder configuration allows the 'host' to be set directly in 'cinder.conf', we set this parameter to the same value on all controllers, with the idea of removing the coupling between a volume and the specific 'c-vol' service which was used to create it. We ran like this for quite a while and, although we never saw direct issues related to this setting, in hindsight it may explain some of the issues we had with volumes getting stuck in transitional states. The main problem here is the clean-up being done as the daemons start up: as they assume exclusive access to 'their' volumes, volumes in transient states will be "cleaned up", e.g. their state reset, when a daemon starts. In a setup with identical 'host' settings, this may cause undesired interference.

Taking this into account, our setup has been changed to keep the 'c-api' service on all three controllers, but to run the 'c-vol' and 'c-sched' services on one host only. Closely following the recent work of the Cinder team to improve the locking and allow for Active/Active HA, we're looking forward to having Active/Active HA 'c-vol' services fully available again.

    Using multiple volume types: QoS and quota classes
    The scarce resource on our Ceph backend is not space, but IOPS, and after we handed out the first volumes to users, we quickly realised that some resource management was needed. We achieved this by creating a QoS spec and associating it with the one volume type we had at the time:

    # cinder qos-create std-iops write_iops_sec=100 read_iops_sec=100
    # cinder qos-associate <std_iops_qos_id> <std_volume_type_id>

This setting not only allows you to limit the amount of IOPS used on this volume type, but also to define different service levels. For instance, for more demanding use cases we added a high-IOPS volume type to which access is granted on a per-request basis:

    # cinder type-create high-iops
    # cinder qos-create high-iops write_iops_sec=500 read_iops_sec=500
    # cinder qos-associate <high_iops_qos_id> <high_iops_volume_type_id>

Note that both types are provided by the same backend and physical hardware (which also allows for a conversion without data movement between these types using 'cinder retype')! Note also that for attached volumes a detach/re-attach cycle is needed for QoS changes to take effect.

In order to manage the initial default quotas for these two (and the other five volume types the service offers), we use Cinder's support for quota classes. As all volume types apart from std-iops are only available on request, their initial quota is usually set to '0'. So, in order to create the default quotas for a new type, we update the default quota class by running commands like:

    # cinder type-create new-type
    # cinder quota-class-update --volume-type new-type --volumes 0 --snapshots 0 --gigabytes 0 default

    Of course, this method can also be used to define different initial quotas for new volume types, but in any case it avoids having to set the initial quotas explicitly after each project creation.
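    When access to one of the restricted types is then granted to a project, its quota is raised explicitly; a sketch with placeholder values:

    # cinder quota-update --volume-type high-iops --volumes 10 --gigabytes 1000 <project_id>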
    Fixing a long-standing issue: Request timeouts and DB deadlocks
    For quite some time, our Cinder deployment had suffered from request timeouts leading to volumes left in error states when doing parallel deletions. Though easily reproducible, this was infrequent (and hence received the corresponding attention ...). Recently, however, this became a much more severe issue with the increased use of Magnum and Kubernetes clusters, which use volumes and hence launch parallel volume deletions at a larger scale when being removed. This affected the overall service availability (and, subsequently, received the corresponding attention here as well ...).

    In these situations, the 'c-vol' logs showed lines like

    "Deadlock detected when running 'reservation_commit': Retrying ..."

    and hence indicated a locking problem. We were not able to pinpoint in the code how a deadlock would occur, though. A first change that mitigated the situation was to reduce 'innodb_lock_wait_timeout' from its default value of 50 seconds to 1 second: the client was less patient and exercised the retry logic that decorates the database interactions much earlier. Clearly, this did not address the underlying problem, but it at least allowed the service to handle these parallel deletions in a much better way.
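    For reference, this is a server-side database setting; a minimal sketch of the change in the MySQL/MariaDB configuration (the exact file layout may differ):

    [mysqld]
    # default is 50 seconds; failing fast lets the retry logic kick in earlier
    innodb_lock_wait_timeout = 1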

    The real fix, suggested by a community member, was to change a setting we had carried forward since the initial setup of the service: the connection string in 'cinder.conf' did not specify a driver and hence used the C-based 'mysql' Python wrapper (rather than the recommended pure Python 'pymysql' implementation). After changing our connection from

    connection = mysql://cinder:<pw>@<host>:<port>/cinder

    to

    connection = mysql+pymysql://cinder:<pw>@<host>:<port>/cinder

    the problem basically disappeared!

    So the underlying reason was the handling of green thread parallelism in the wrapper vs. the pure Python implementation: while the former enforces serialisation of the database calls (and hence eventually deadlocks in SQLAlchemy), the latter allows for proper parallel execution of the requests to the database. The OpenStack oslo team is now looking into issuing a warning when this obsolete setting is detected.

    As using the 'pymysql' driver is generally recommended and is, for instance, the default in devstack deployments, the volunteers helping with this issue had a really hard time reproducing the problems we experienced ... another lesson learnt when keeping services running for a longer period :)

    At the recent summit in Boston, Doug Hellmann and I were discussing research around OpenStack, both on the software itself and on how it is used by applications. Many papers are published in conference proceedings and PhD theses, but finding out about them can be difficult. While these papers may not necessarily lead to open source code contributions, the results of this research are a valuable resource for the community.

    Increasingly, publications are made under Open Access conditions, free of all restrictions on access. For example, all projects receiving European Union Horizon 2020 funding are required to make sure that any peer-reviewed journal article they publish is openly accessible, free of charge. In a review with the OpenStack scientific working group, open access was also felt to be consistent with OpenStack's principles of Open Source, Open Design, Open Development and Open Community.

    There are a number of repositories where publications such as these can be made available. The OpenStack scientific working group is evaluating potential approaches, and Zenodo looks like a good candidate: it is already widely used in the research community, it is open source on GitHub, and the application itself runs on OpenStack in the CERN Data Centre. Preservation of data is one of CERN's key missions and is included in the service delivery for Zenodo.

    The name Zenodo is derived from Zenodotus, the first librarian of the Ancient Library of Alexandria and father of the first recorded use of metadata, a landmark in library history.
    Accessing the Repository
    The list of papers can be seen at https://zenodo.org/communities/openstack-papers. Along with keywords, a dedicated search facility is available within the community so that relevant papers can be found quickly.
    Submitting New Papers
    Zenodo allows new papers to be submitted for inclusion in the OpenStack Papers repository. There are a number of steps to be performed. If a paper has been published elsewhere, please ensure that it is available under open access conditions before submitting it to the repository. Alternatively, papers can be published in Zenodo for the first time and receive a DOI directly.
    1. Log in to Zenodo. This can be done using your github account if you have one or by registering a new account via the 'Sign Up' button.
    2. Once logged in, you can go to the openstack repository at https://zenodo.org/communities/openstack-papers and upload a new paper.
    3. The submission will then be verified before publishing.
    To submit for this repository, you need to provide
    • Title of the paper
    • Author list
    • Description (the Abstract is often a good content)
    • Date of publication
    If you know it, please also provide the following information
    • DOI (Digital Object Identifier), used to uniquely identify the object. In general, one will already have been allocated to the paper by the original publisher. If none is specified, Zenodo will create one, which is good for new publications but bad practice for already published works, as it generates duplicate DOIs. So please try to find the original, which also helps with future cross-referencing.
    • There are optional fields at upload time for adding more metadata (to make it machine readable), such as “Journal” and “Conference”. Adding journal information improves the searching and collating of documents in the future, so if this information is known, it is good to enter it.
    Zenodo provides synchronisation facilities for repositories to exchange information (OAI-PMH 2.0). Building Planet OpenStack feeds on top of this, or adding RSS support to Zenodo, would be welcome contributions.
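    As an illustration of how such a feed could be built, the community's records can be harvested over OAI-PMH; a minimal sketch, assuming Zenodo's standard endpoint and the usual 'user-<community>' set naming:

    $ curl "https://zenodo.org/oai2d?verb=ListRecords&metadataPrefix=oai_dc&set=user-openstack-papers"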




    For several OpenStack releases, the Identity service (Keystone) has offered an additional token format besides UUID, called Fernet. This token format has a series of advantages over UUID; the most prominent for us is that Fernet tokens do not need to be persisted. We were also interested in a speed-up of token creation and validation.

    At CERN, we have been using the UUID token format since the beginning of the cloud deployment in 2013. Normally, the keystone database holds around 300,000 tokens with an expiration of 1 day. In order to keep the database size under control, we run the token_flush procedure every hour.
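    For reference, a minimal sketch of such an hourly clean-up as a cron entry (the user, paths and log destination are illustrative):

    # /etc/cron.d/keystone-token-flush
    0 * * * * keystone /usr/bin/keystone-manage token_flush >> /var/log/keystone/token-flush.log 2>&1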

    In the Mitaka release, all remaining bugs were sorted out, and since the Ocata release Fernet is the default token format. Right now we are running keystone on Mitaka, and we decided to migrate to the Fernet token format before upgrading to Ocata.

    Before reviewing the upgrade from UUID to Fernet, let's have a brief look at the architecture of the identity service. The service alias resolves to a set of load balancers which redirect the requests to a set of backends running keystone under Apache. This allows us to replace or add backends transparently.

    The very first question is how many keys we would need. If we take the formula from [1]:

    fernet_keys = validity / rotation_time + 2

    If we have a validity of 24 hours and a rotation every 6 hours, we need 24/6 + 2 = 6 keys.

    As we have several backends, the first task is to distribute the keys among them; for that we use Puppet, which provides the secrets in the /etc/keystone/fernet-keys folder. This ensures that a newly introduced backend will always have the latest set of keys available.
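    The corresponding keystone configuration is then only a few lines; a minimal sketch, assuming the key folder above and the six keys derived earlier:

    [token]
    provider = fernet

    [fernet_tokens]
    key_repository = /etc/keystone/fernet-keys/
    max_active_keys = 6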

    The second task is the rotation itself. In our case, a cron job in our Rundeck installation rotates the secrets and introduces a new one; this job does exactly the same as the keystone-manage fernet_rotate command. One important aspect is that on each rotation you need to reload or restart the Apache daemon so that it picks up the new keys from disk.
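    As an illustration, a rotation step using the standard tooling could look like this (user/group names and the service reload are illustrative of a typical Apache-based setup):

    # keystone-manage fernet_rotate --keystone-user keystone --keystone-group keystone
    # systemctl reload httpd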

    We prepared all these changes in the production setup quite some time ago and verified that the keys were correctly updated and distributed. On April 5th, we decided to go ahead. Below is a picture of the API messages during the intervention.


    There are two distinct steps in the API message rates. The first one is a peak of invalid tokens, triggered by end users trying to validate UUID tokens after the change. The second peak is related to OpenStack services that use trusts, like Magnum and Heat. From our past experience, these services can be affected by a massive invalidation of tokens: the trust credentials are cached, so both services need to be restarted in order to obtain their Fernet tokens.

    Below is a picture of the size of the token table in the keystone database; as we are now on Fernet, it slowly goes down to zero thanks to the hourly token_flush job mentioned earlier.


    The last picture shows the response time of the identity service during the change. As you can see, the response time is better than with UUID, as stated in [2].


    In the Ocata release, further improvements to the response time are on the way, and we are working to update the identity service in the near future.

    References:
    1. Fernet token FAQ at https://docs.openstack.org/admin-guide/identity-fernet-token-faq.html
    2. Fernet token performance at http://blog.dolphm.com/openstack-keystone-fernet-tokens/
    3. Payloads for Fernet token format at http://blog.dolphm.com/inside-openstack-keystone-fernet-token-payloads/
    4. Deep dive into Fernet at https://developer.ibm.com/opentech/2015/11/11/deep-dive-keystone-fernet-tokens/