Platform Infrastructure Design for Security, Availability, and Scalability

This page covers the design of the interworks.cloud Platform infrastructure with regard to security, availability, and scalability.

 


Deployment


The platform is deployed according to the distributed deployment model. This model spreads platform roles across separate physical or virtual resources and deploys multiple resources for each role, to avoid single points of failure among the deployed components. Each platform component is installed on multiple, separate Azure resources, and components of the same role share a dedicated availability set. All web-based components are configured in highly available web farms served by Azure load balancers, with the option to activate on-demand scaling to distribute load and handle increasing traffic. The platform relational databases reside in highly available, geo-replicated Azure SQL managed instances. All component resources can be scaled up to support increased traffic and storage demand, and additional resources can be provisioned on demand to handle the additional load.

The following is a summary of the high availability features of the interworks.cloud Platform:

  • Use of Azure availability sets and zones for VMs and AKS clusters

  • Separate server farms for front-end (Storefront) and back-office (BSS) web applications

  • Multi-node cluster for message broker (RabbitMQ) implementation

  • Azure SQL Managed Instance built-in high availability solution deeply integrated with the Azure platform

  • Use of Azure load balancer and application gateway

  • Use of horizontal/vertical Pod Autoscaler for AKS clusters
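
As a rough illustration of how the Horizontal Pod Autoscaler listed above sizes a workload, the sketch below reproduces the replica calculation described in the Kubernetes documentation: desired = ceil(current * currentMetric / targetMetric), with no change while the ratio stays within a tolerance (0.1 by default in Kubernetes). The function name, parameter names, and the minimum/maximum bounds are illustrative assumptions, not values taken from the platform's configuration.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Approximate the Horizontal Pod Autoscaler replica calculation.

    All parameter names and default bounds are illustrative assumptions.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Within the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp the result to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))
```

For example, 4 replicas running at 90% average CPU against a 60% target give ceil(4 * 1.5) = 6 replicas, while a reading of 63% stays inside the tolerance band and leaves the count at 4.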

Deployment Environment

The interworks.cloud Platform is mainly offered as SaaS. However, it can also be hosted on premises or in the cloud, either by the client or by interworks.cloud on the client’s behalf. The deployment environment must be selected with several factors in mind: scalability, flexibility, security, and general maintenance of the infrastructure. Cloud-based platforms provide a range of resources (virtual machines, storage, networking, load balancers) and deployment options (managed or unmanaged) that cover most business requirements.
Microsoft Azure is one of the fastest-growing cloud infrastructure platforms. It provides flexibility, on-demand scalability, and secure, stable, reliable, high-performance services and resources. Additionally, Azure is generally available in 36 regions around the world and is supported by a growing network of Microsoft-managed datacenters. On-premises platforms may offer more control and customization, but they may also require more maintenance and resources. If you already have a mature IT network with datacenters in appropriate locations, this may be a suitable option.

Security


Access to cloud resources and the management portal is granted only to a specific group of authorized personnel. Two-factor authentication (2FA) is configured to further strengthen security and protect against unauthorized access. Remote access to platform VMs is also restricted to specific users, and conditional access policies allow access only from specific locations. Additional two-factor authentication is employed at the operating system level for all RDP/SSH connections to the platform VMs. All critical data are stored in secure locations that provide encryption at rest and safeguard the data in order to meet organizational security and compliance commitments. Data exchanged locally between the database and the platform components may not be encrypted, with the exception of specific sensitive data (such as user passwords). All other communications (between client and server, or between server and third-party endpoints) are executed through secure channels that enforce SSL encryption.
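
To illustrate what enforcing encrypted channels can look like from the calling side, here is a minimal client sketch using Python's standard ssl module. It is not the platform's actual configuration: the function name is hypothetical, and the TLS 1.2 minimum is an assumption, since the text above does not state which protocol versions are accepted.

```python
import ssl


def strict_client_context() -> ssl.SSLContext:
    """Build a client context that verifies certificates and hostnames."""
    # create_default_context() enables certificate and hostname verification.
    ctx = ssl.create_default_context()
    # Assumed minimum protocol version; not stated in the source text.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```

Such a context can be passed to standard-library clients, for example `urllib.request.urlopen(url, context=strict_client_context())`, so that a connection to an endpoint with an invalid certificate or an older protocol version fails instead of silently downgrading.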

Please note the following features pertaining to database security:

  • Latest Enterprise Edition SQL Server

  • Automatic patching and version updates, automated backups, built-in high availability

  • Quick service / resource scaling

  • Isolated environment

  • Advanced threat protection

  • Transparent data encryption (TDE)

For more information about platform security, please see the following page: https://interworkscloud.atlassian.net/wiki/spaces/ICPD/pages/955121667/Platform+Security+Privacy#2.-Cloud-Security

For certificates and policy documents pertaining to platform and information security, please see the following page: https://interworkscloud.atlassian.net/wiki/spaces/ICPD/pages/955121667/Platform+Security+Privacy#5.5-Security-Audits

System Monitoring


Any architecture element of an interworks.cloud Platform deployment can fail or suffer performance degradation. It is crucial to establish monitoring processes for all applications and services in order to maximize availability, performance, and reliability, and to keep track of resource consumption. A wide range of monitoring features is actively used for the interworks.cloud Platform online deployment:

  • Azure Monitor

    • Application Insights - Monitors the availability, performance, and usage of the platform web applications

    • VM insights - Monitors and analyzes the performance and health of platform servers

    • Container Insights - Monitors the health and performance of managed Kubernetes clusters hosted on AKS

  • Additional open-source monitoring solution deployed in an on-premises datacenter - Monitors the availability and performance of platform infrastructure servers

  • External monitoring service - Monitors uptime and availability of all platform public endpoints (including custom URLs used by resellers/end-customers)

  • Monitoring automations - Alert on unhandled application exceptions and expiring SSL certificates, and automatically initialize web applications to optimize startup response times

  • Publicly accessible status page for all platform services and regions - Displays current service status, information on potential incidents, maintenance operations in progress or scheduled, and historical uptime and SLA metrics

  • Integration with a third-party service for real-time alerts and escalations to on-call teams 24x7
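
One of the monitoring automations above alerts on expiring SSL certificates. The sketch below shows one way such a check could be written with Python's standard library; it is an illustration, not the platform's actual automation. The function names and the 30-day threshold are assumptions.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` (timezone-aware) and a certificate's notAfter value."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def should_alert(not_after: str, now: datetime, threshold_days: int = 30) -> bool:
    """True when the certificate expires within the (assumed) alert threshold."""
    return days_until_expiry(not_after, now) <= threshold_days


def fetch_not_after(host: str, port: int = 443) -> str:
    """Read the notAfter field of the certificate presented by a live endpoint."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `fetch_not_after` for each public endpoint and raise an alert whenever `should_alert(..., datetime.now(timezone.utc))` returns true.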

Maintenance and Upgrades


Maintenance actions are required to keep all resources healthy and to preserve the long-term stability and functionality of the software system. More specifically:

  • Servers are kept up to date with the latest OS and security updates

  • Servers may be replaced when new major OS versions or new-generation VMs become available

  • Appropriate retention policies are in place for database, file and event viewer logs to ensure that historical data are kept for an adequate period

  • Systematic scheduled maintenance of databases is performed to keep them operating at peak performance

  • A new version of the interworks.cloud Platform is released weekly, with new features, improvements and bug fixes that are in most cases critical for your business evolution

Disaster Recovery


To meet typical disaster recovery requirements, any deployment should consider geo-replication technologies. All critical data, including the platform SQL databases and other plain storage data, should be backed up or replicated between separate regions with hot disaster recovery support. In this way, in the unlikely event of a disaster, the platform can be reactivated in another region and resume normal operations. Geo-replication of all virtual machines of the infrastructure (BSS, Storefront, Administration, RabbitMQ, PostgreSQL) is also necessary to restore business continuity as quickly as possible. Fully automated creation of a new Kubernetes cluster in any region should be ready to start whenever a failover is required; creating the cluster only on demand keeps the cost of the related resources as low as possible.

Please note the following Disaster Recovery features of the interworks.cloud Platform online deployment:

  • Azure Site Recovery implemented on all platform servers

  • Azure SQL Managed Instance configured for geo-replication and failover groups

  • All platform related data residing in storage accounts are replicated to secondary storage accounts in another (paired) Azure region

  • Recovery plan implemented with appropriate automations to facilitate a complete failover to another (paired) Azure region within minutes in case of a disaster
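
The recovery plan above can be pictured as an ordered list of automated steps that stops as soon as one step fails. The following is a minimal sketch of that idea only. The `Step` type, the function name, and any step names or ordering used with it are hypothetical; the real plan relies on Azure automations that are not described on this page.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    name: str
    # Hypothetical action: receives the target region, returns True on success.
    action: Callable[[str], bool]


def run_recovery_plan(steps: List[Step], target_region: str) -> Tuple[bool, List[str]]:
    """Run steps in order; stop at the first failure and report what completed."""
    completed: List[str] = []
    for step in steps:
        if not step.action(target_region):
            return False, completed
        completed.append(step.name)
    return True, completed
```

In such a sketch, the steps might be assembled from the components named in this section (for example: fail over the SQL failover group, switch to the replicated storage accounts, fail over the VMs, create the Kubernetes cluster), with each action wrapping the corresponding automation. Returning the list of completed steps makes it clear where a partial failover stopped.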