Posted: 15 Jan. 2021

Self-healing platforms

Imperative of the digital age

Digital platforms are increasingly becoming the core strategy of enterprises that want to integrate customers, partners and suppliers in a marketplace and scale demand and supply. In addition, internal IT ecosystems are digitizing the core with platform- and microservices-based approaches for agility.

Platforms are therefore becoming a significant representation of corporate brands, and user experience will be a key success factor. This raises the need for consistently high performance and stability.

Downtime, poor responsiveness and inconsistent data already cause, and will increasingly cause, losses to the tune of hundreds of millions in the new ‘platform-itized’ world. An inability to make real-time decisions and respond effectively to events will be disruptive to business.

Given this criticality, AI-enabled operations (AIOps) is a trend that enterprises need to adopt with considerable urgency. Platforms that monitor, manage and heal themselves will prove to be a strategic differentiator, and AIOps today is capable of delivering this effectively. Self-healing needs to become a core part of every platform architecture and design.


Self-healing demystified

'Self-healing', as the term indicates, is the intelligence of a platform to detect anomalies before or as they occur and to self-initiate corrective actions. Self-healing solutions need to instrument monitoring mechanisms that detect deviations from defined KPI ranges, degrading performance, and infrastructure, platform and application failures. Data from this instrumentation is then monitored for trends and signals that lead to anomalies. Self-healing frameworks use these signals to determine next best actions, which can be based on rules derived from prior experience and/or machine-learned from incident and platform logs. The ability to correctly determine the next best action is the core intelligence in self-healing platforms. Based on it, an appropriate response is initiated, which includes but is not limited to notifications, retries, capacity expansion, invoking exception responses and application of patches. The diagram below depicts this solution framework.


Typical construct of self-healing solutions
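
The detect, decide and respond flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the KPI names, ranges, rule table and action names (`expand_capacity`, `retry`, `notify`) are all hypothetical, and a real platform would replace the rule lookup with rules or models derived from its own incident and platform logs.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class KpiRange:
    """Acceptable range for one monitored KPI (hypothetical structure)."""
    low: float
    high: float


def detect_anomalies(metrics: Dict[str, float],
                     ranges: Dict[str, KpiRange]) -> List[str]:
    """Return the names of KPIs whose current value lies outside its range."""
    return [
        name for name, value in metrics.items()
        if name in ranges
        and not (ranges[name].low <= value <= ranges[name].high)
    ]


# Illustrative rule table: anomaly name -> next best action.
RULES: Dict[str, str] = {
    "queue_depth": "expand_capacity",
    "error_rate": "retry",
}


def next_best_action(anomaly: str) -> str:
    """Look up a rule; fall back to notifying a human when none exists."""
    return RULES.get(anomaly, "notify")


def heal(metrics: Dict[str, float],
         ranges: Dict[str, KpiRange]) -> List[Tuple[str, str]]:
    """Pair every detected anomaly with the action chosen for it."""
    return [(a, next_best_action(a)) for a in detect_anomalies(metrics, ranges)]


# Example with made-up readings and ranges.
ranges = {
    "queue_depth": KpiRange(0, 1000),
    "error_rate": KpiRange(0.0, 0.05),
    "disk_used_pct": KpiRange(0, 90),
}
metrics = {"queue_depth": 1200, "error_rate": 0.01, "disk_used_pct": 97}
print(heal(metrics, ranges))
# [('queue_depth', 'expand_capacity'), ('disk_used_pct', 'notify')]
```

Falling back to `notify` when no rule matches keeps a human in the loop for anomalies the platform has not seen before.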


Typical responses of self-healing solutions

Self-healing solutions can respond to anomalies in various ways, depending on the circumstances and on the confidence in the anomaly based on prior events. Some indicative responses are as follows:

  • Send omni-channel proactive alerts to relevant stakeholders.
  • Sense failures based on deteriorating performance, data quality, etc. and initiate corrective actions based on pre-determined rules.
  • Recognize patterns of correlated events and chronic issues, and take preemptive actions.
  • Determine next best actions based on predictive intelligence, such as time to failure and probability of failure.
  • Initiate routine maintenance, such as file compression, archiving and memory re-allocation, to avoid future degradation.
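
One way to picture how confidence and time to failure could steer the choice among these responses is a small decision function. The sketch below is an assumption-laden illustration: the `Prediction` fields, the threshold values and the response names are hypothetical, and the predictions themselves are presumed to come from whatever model the platform uses.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Output of some predictive model (hypothetical fields)."""
    failure_probability: float  # 0.0 to 1.0
    minutes_to_failure: float   # estimated time until the failure occurs


def choose_response(p: Prediction,
                    act_threshold: float = 0.8,
                    alert_threshold: float = 0.5,
                    urgent_minutes: float = 30.0) -> str:
    """Escalate the response as confidence and urgency increase."""
    if p.failure_probability >= act_threshold:
        # High confidence: act now if failure is near, otherwise plan ahead.
        if p.minutes_to_failure <= urgent_minutes:
            return "corrective_action"
        return "schedule_maintenance"
    if p.failure_probability >= alert_threshold:
        # Medium confidence: let people decide.
        return "alert_stakeholders"
    # Low confidence: no action beyond continued observation.
    return "keep_monitoring"


print(choose_response(Prediction(0.9, 10)))   # corrective_action
print(choose_response(Prediction(0.6, 10)))   # alert_stakeholders
```

The thresholds are arbitrary defaults; in practice they would be tuned against prior events so that automatic actions are reserved for anomalies the platform has high confidence in.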


Self-healing opportunities for data platforms

Big data platforms, data lakes and massive data warehouses play a crucial role in the digital journey of enterprises. Availability of data is a precursor to modern ways of working. Below are some of the unique opportunities for data platforms to respond to anomalies and heal themselves:

  • Manage source data quality:  Be able to respond to challenges of incoming source data quality and avoid impact on the integrity of the platform.
  • Adapt to schema evolution:  This is a common failure scenario in data platforms, and modern platforms need to be able to auto-adjust to incoming schema changes.
  • Manage long running jobs:  Big data jobs can quickly go out of range in terms of execution time. Platforms need to be able to detect the KPI overshoot and respond with necessary corrective actions based on learnings from prior job run data.
  • Early detection of platform health issues:  Ability to detect early trends of health issues on infrastructure, partner platforms and data apps, and initiate proactive response to either auto-correct and/or notify relevant stakeholders.
  • Automated platform maintenance:  Detect need and initiate standard maintenance routines to manage services, archive, compact and compress files, based on deterministic rules.
  • Chronic failures:  Recognize patterns of frequent failures, such as network connection failures to processing nodes at certain times or job kickoff issues, and establish a framework to address these automatically.
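
As an illustration of the long running jobs item above, the following sketch flags a job whose elapsed time overshoots a limit learned from prior runs. The rule used here (mean plus three sample standard deviations) is only one simple assumption among many possible ones, and the function names and sample durations are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def duration_limit(prior_minutes: Sequence[float], sigmas: float = 3.0) -> float:
    """Upper bound on acceptable run time: mean + sigmas * sample std dev."""
    if len(prior_minutes) < 2:
        raise ValueError("need at least two prior runs to estimate spread")
    return mean(prior_minutes) + sigmas * stdev(prior_minutes)


def is_overshooting(elapsed_minutes: float,
                    prior_minutes: Sequence[float],
                    sigmas: float = 3.0) -> bool:
    """True when the running job has exceeded the learned duration limit."""
    return elapsed_minutes > duration_limit(prior_minutes, sigmas)


# Made-up durations (in minutes) of earlier runs of the same job.
prior_runs = [40, 42, 38, 41, 39]

print(is_overshooting(44, prior_runs))  # False
print(is_overshooting(60, prior_runs))  # True
```

A positive result would then feed the platform's corrective actions, for example notifying the job owner or re-running with more capacity, depending on the rules in place.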


In summary

Self-healing strategies will be imperative for the next generation of platform maturity. AI will play an increasing role in such platform engineering, removing the need for expert engineers to spend time on routine effort to keep the lights on. This yields crucial dollar savings that can be leveraged to deepen digital transformations.

Key Contacts

Arunabha Mookerjea

Specialist Leader

Arunabha Mookerjea is a specialist leader and distinguished cloud architect in the Strategy and Analytics practice at Deloitte Consulting India Private Limited. He specializes in technology advisory, solution architectures and directing large scale delivery in next generation areas of cloud and big data platforms, IoT, micro-services and digital core solutions. Arunabha is a member of the Next Gen Architecture Program.