Opportunity

All cloud providers maintain significant spare capacity to handle peak traffic and unexpected spikes. When this capacity is idle, it is sold at heavily discounted prices as “Spot Instances” (Azure and Google offer equivalents under different names).

The problem with Spot Instances is that the provider can “interrupt” them at any time to reclaim the capacity. This makes Spot Instances difficult to use, especially for stateful or high-availability services. The problem is exacerbated by the fact that similar instances are often interrupted at the same time.

Solution

To overcome this problem, PIQO combines client workloads and deploys them on a diverse mix of Spot and Reserved/On-Demand instances. Workloads are distributed to minimise cost while maintaining SLA-backed high availability. When PIQO anticipates an interruption, it migrates workloads to alternative Spot, Reserved, or On-Demand instances as required.
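As an illustration only (not PIQO's actual algorithm, which is proprietary), a placement of this kind can be sketched as a greedy allocation: keep a minimum fraction of replicas on On-Demand capacity, then spread the remainder across the cheapest Spot pools, capping each pool so a correlated interruption cannot remove most of the service at once. All pool names and prices below are hypothetical.

```python
import math

def place_replicas(replicas, spot_pools, min_on_demand_frac=0.3):
    """Return a placement {pool_name: count} that favours cheap Spot pools
    while keeping at least `min_on_demand_frac` of replicas On-Demand.
    `spot_pools` maps pool name -> hourly Spot price."""
    on_demand = math.ceil(replicas * min_on_demand_frac)
    remaining = replicas - on_demand
    placement = {"on-demand": on_demand}
    # Spread the rest over the cheapest Spot pools, capping each pool so
    # that no single pool holds more than half of the Spot replicas.
    cap = max(1, math.ceil(remaining / 2))
    for name, price in sorted(spot_pools.items(), key=lambda kv: kv[1]):
        if remaining == 0:
            break
        take = min(cap, remaining)
        placement[name] = take
        remaining -= take
    return placement

pools = {"spot-a": 0.031, "spot-b": 0.029, "spot-c": 0.045}
print(place_replicas(10, pools))
# → {'on-demand': 3, 'spot-b': 4, 'spot-a': 3}
```

A real scheduler would also weigh interruption risk, availability-zone spread, and SLA targets; this sketch only shows the shape of the cost/availability trade-off.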

To maximise performance, PIQO maintains a proprietary database of historic Spot availability at every AWS, Azure, Google, and Alibaba datacenter worldwide. This allows us to optimise instance selection and to anticipate interruptions.
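One way such historic data can drive instance selection (a hypothetical sketch, not PIQO's model) is to score each Spot pool by an "effective cost": the quoted price inflated by the pool's historic interruption rate, so that a cheap but frequently reclaimed pool loses to a slightly pricier, stable one. The rates, prices, and penalty factor below are invented for the example.

```python
def effective_cost(price, interruption_rate, migration_penalty=0.5):
    """Hourly price adjusted by the expected cost of interruptions.
    `interruption_rate` is the historic fraction of hours in which the
    pool was reclaimed; `migration_penalty` is the assumed cost of one
    migration, expressed in the same units as the price."""
    return price + interruption_rate * migration_penalty

def pick_pool(pools):
    """Choose the pool with the lowest effective cost.
    `pools` maps name -> (spot price, historic interruption rate)."""
    return min(pools, key=lambda name: effective_cost(*pools[name]))

pools = {
    "us-east-1a/m5.large": (0.035, 0.20),  # cheap but frequently reclaimed
    "us-east-1b/m5.large": (0.041, 0.02),  # slightly pricier, very stable
}
print(pick_pool(pools))
# → us-east-1b/m5.large  (the stable pool wins once risk is priced in)
```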

Stateful Services

Maintaining high availability for stateful services (e.g. a database cluster) can be challenging. In particular, migrations must proceed gradually, and a sufficient number of cluster members must remain active at all times.

PIQO meets these requirements in four ways:

  • Permanent nodes: PIQO deploys a proportion (30% by default) of stateful containers on On-Demand / Reserved instances, which rarely fail.
  • Persistent volumes: PIQO attaches persistent network volumes (e.g. EBS on AWS), which allow stateful services to migrate quickly without copying data.
  • Anti-affinity: PIQO ensures that stateful containers are deployed across multiple instance types and availability zones. This reduces the risk that a large proportion will be interrupted at the same time.
  • Serial migrations: PIQO uses a proprietary model to anticipate interruptions up to 20 minutes in advance. This allows nodes running stateful containers to be drained gradually, one at a time.
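The serial-migration constraint above can be sketched as follows (an assumed illustration, not PIQO's implementation): when an interruption is anticipated, at-risk nodes are drained one at a time, and only while the cluster retains enough active members for quorum.

```python
def serial_drain(active_nodes, at_risk, quorum):
    """Drain at-risk nodes one by one, returning the nodes drained.
    Stops if draining another node would drop the cluster below `quorum`,
    so a sufficient number of members stay active at all times."""
    drained = []
    for node in at_risk:
        if len(active_nodes) - 1 < quorum:
            break  # keep enough members active; wait for replacements
        active_nodes.remove(node)
        drained.append(node)  # in practice: cordon, migrate data, then stop
    return drained

cluster = ["n1", "n2", "n3", "n4", "n5"]
print(serial_drain(cluster, at_risk=["n2", "n4", "n5"], quorum=3))
# → ['n2', 'n4']  (n5 is kept until a replacement restores quorum headroom)
```

In a real system, each drain step would wait for data to re-replicate onto a replacement node before the next node is touched; the quorum check here stands in for that safety condition.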