Cloud systems are perhaps the most complicated computing systems developed so far, yet our daily lives depend heavily on their continuous and reliable operation. To achieve systematic cloud reliability, we propose an intelligent data-driven paradigm based on AIOps (artificial intelligence for IT operations). We collect heterogeneous data, such as traces, logs, key performance indicators (KPIs), topologies, and incidents, from multiple sources in cloud systems, and perform various data-driven operations within the paradigm, including anomaly detection, failure diagnosis, and fault localization, to improve cloud resilience. We conduct experiments in industrial settings, demonstrating the applicability of the proposed paradigm and its effectiveness in performing AIOps tasks reliably in cloud computing environments.
Learn more about the 2021 Microsoft Research Summit: https://Aka.ms/researchsummit