Materials about Artificial Intelligence for IT Operations (AIOps).
- Researchers
- Industrial Materials
- Academic Materials
- Papers
- Competitions
- Datasets
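- CUHK (The Chinese University of Hong Kong)
- Microsoft
- Tsinghua: CCF International AIOps Challenge
- Google
- Backblaze (cloud backup and storage provider)
- Alibaba: AIOps Competition
- AIOps 2022 Communication Network AIOps Competition Dataset
- GAIA-DataSet
- Huawei Network Operations Dataset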
- Tools, Models and Systems
- Community
- Internet articles
- Others
- Reference
China (& HK SAR) | | | |
---|---|---|---|
Michael R. Lyu, CUHK | Dongmei Zhang, Microsoft | Pengfei Chen, SYSU | Dan Pei, Tsinghua |
Xin Peng, Fudan | Qingwei Lin, Microsoft | | |
USA | | | |
Ryan Huang, JHU | Yingnong Dang, Microsoft | Christina Delimitrou, MIT EECS | |
Europe | | | |
Odej Kao, TU Berlin | | | |
Australia | | | |
Hongyu Zhang, UON | | | |
- [VMware] Proactive Incident and Problem Management
- [GREATOPS (Efficient Operations Community)] White paper: "Enterprise AIOps Implementation Recommendations"
- [Awesome Open Source] AIOps Handbook
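- [Tsinghua University] Dan Pei (Tsinghua): 15 Principles for Putting AIOps into Practice
- [Tsinghua University] Dan Pei (Tsinghua): The Last Mile of Delivering AIOps Results
- [Alibaba Cloud] Big-Data-Based Intelligent Network Analysis (Qitian)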
- [Moogsoft] What is AIOps?
- [Microsoft] Advancing Azure service quality with artificial intelligence: AIOps
- Datadog: A monitoring and security platform for cloud applications
- BizSeer (必示)
- TINGYUN (听云): An end-to-end, full-platform application performance management system
- Loom Systems
- ICSE21 Workshop on Cloud Intelligence
- AAAI-20 Workshop on Cloud Intelligence
- AIOPS 2020 (International Workshop on Artificial Intelligence for IT Operations)
- [arXiv '21] Experience Report: Deep Learning-based System Log Analysis for Anomaly Detection
- [CSUR '21] A Survey on Automated Log Analysis for Reliability Engineering
- [ESEC/FSE '20] Towards intelligent incident management: why we need it and how we make it
- [arXiv '20] A Systematic Mapping Study in AIOps
- [ICSE '19] AIOps: Real-World Challenges and Research Innovations
- [ISSRE '16] Experience Report: System Log Analysis for Anomaly Detection
- [ASE '13] Software analytics for incident management of online services: An experience report
- [arXiv '22] Constructing Large-Scale Real-World Benchmark Datasets for AIOps
- [ASPLOS '19] An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud and Edge Systems
- [ICSE-SEIP '22] Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps
- [ICSE-SEIP '21] Neural knowledge extraction from cloud service incidents
- [arXiv '21] SoftNER: Mining Knowledge Graphs From Cloud Incidents
- [APPLSCI '20] A Causality Mining and Knowledge Graph Based Method of Root Cause Diagnosis for Performance Anomaly in Cloud Applications
- [ASPLOS '21] Sage: Practical & Scalable ML-Driven Performance Debugging in Microservices
- [ICDCS '21] Defuse: A Dependency-Guided Function Scheduler to Mitigate Cold Starts on FaaS Platforms
- [FSE '20] Graph-based trace analysis for microservice architecture understanding and problem diagnosis
- [OSDI '20] FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-Oriented Microservices
- [ESEC/FSE '19] Latent Error Prediction and Fault Localization for Microservice Applications by Learning from System Trace Logs
- [TSE '18] Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study
- [ASE '21] AID: Efficient Prediction of Aggregated Intensity of Dependency in Large-scale Cloud Systems [code]
- [NSDI '07] X-Trace: A Pervasive Network Tracing Framework
- [HotNets '06] Discovering Dependencies for Network Management
- [ICSE '22] Adaptive Performance Anomaly Detection for Online Service Systems via Pattern Sketching [code]
- [KDD '19] Time-Series Anomaly Detection Service at Microsoft
- [OSDI '18] Capturing and Enhancing In Situ System Observability for Failure Detection
- [ESEC/FSE '18] Identifying Impactful Service System Problems via Log Analysis
- [CCS '17] DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning
- [DSN '22] Characterizing and Mitigating Anti-patterns of Alerts in Industrial Cloud Systems
- [USENIX ATC '21] Fighting the Fog of War: Automated Incident Detection for Cloud Systems
- [ASE '21] Graph-based Incident Aggregation for Large-Scale Online Service Systems
- [ASE '21] Groot: An Event-graph-based Approach for Root Cause Analysis in Industrial Settings
- [SIGCOMM '20] Scouts: Improving the Diagnosis Process Through Domain-customized Incident Routing
- [ASE '20] How Incidental are the Incidents?: Characterizing and Prioritizing Incidents for Large-Scale Online Service Systems
- [ESEC/FSE '20] Identifying linked incidents in large-scale online service systems
- [ESEC/FSE '20] Efficient incident identification from multi-dimensional issue reports via meta-heuristic search
- [ESEC/FSE '20] Real-time incident prediction for online service systems
- [ESEC/FSE '20] How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems
- [ICSE '20] Understanding and Handling Alert Storm for Online Service Systems
- [HotOS '19] What bugs cause production cloud incidents?
- [ASE '19] Continuous Incident Triage for Large-Scale Online Service Systems
- [ICSE '19] An empirical investigation of incident triage for online service systems
- [WWW '19] Outage Prediction and Diagnosis for Cloud Service Systems
- [KDD '14] Correlating Events with Time Series for Incident Diagnosis
- [DSN '21] General Feature Selection for Failure Prediction in Large-scale SSD Deployment
- [TOSEM '20] Predicting Node Failures in an Ultra-Large-Scale Cloud Computing Platform: An AIOps Solution
- [ICDCS '20] Toward Adaptive Disk Failure Prediction via Stream Mining
- [VLDB '20] Diagnosing root causes of intermittent slow queries in cloud databases
- [USENIX ATC '19] IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services
- [NSDI '18] Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure
- [ESEC/FSE '18] Predicting Node Failure in Cloud Service Systems
- [USENIX ATC '18] Improving Service Availability of Cloud Systems by Predicting Disk Error
- [NSDI '22] CloudCluster: Unearthing the Functional Structure of a Cloud Service
- [OSDI '20] Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions
- [SOSP '21] Understanding and Detecting Software Upgrade Failures in Distributed Systems
- [NSDI '20] Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Large-Scale Cloud Infrastructure
- [AIOps Challenge] A series of AIOps competitions hosted by Tsinghua University
- [CUHK] Loghub
- [Microsoft Azure] Azure Public Dataset
- [Tsinghua] 2018 AIOps Challenge Dataset
- [Tsinghua] 2019 AIOps Challenge Dataset
- [Tsinghua] 2020 AIOps Challenge Dataset
- [Google] Cluster Traces
- [Backblaze] Hard Drive Dataset
- [Alibaba] SMART Dataset of PAKDD CUP 2020
- [Alibaba] SSD SMART logs and failure data
- [Ceph Drive] kaggle
- [Log Analytics] LogPAI
- [AI for Cloud Operation] OpsPAI
- [Outlier Detection] PyOD
- [Anomaly Detection] ADTK
- [Anomaly Detection] PySAD
- [Online Machine Learning] River
- [Online Machine Learning] scikit-multiflow
- [Fault Injection] Chaos Mesh
- [Fault Injection] ChaosBlade
- [Container Monitoring] cAdvisor
- [Performance Monitoring] Netdata
- [Anomaly Detection Labeling Tool] Microsoft TagAnomaly
- [Serverless App Dev. Framework] AWS Serverless Application Model (AWS SAM)
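- [Fudan] Train Ticket (A Benchmark Microservice System): a train ticket booking system built on a microservice architecture, comprising 41 microservices.
- [Weaveworks] Sock Shop (A Microservices Demo Application): simulates the user-facing part of an e-commerce website that sells socks. It is intended to help demonstrate and test microservice and cloud-native technologies.
- [Cloudwise (云智慧)] CloudWise-OpenSource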
- [Coursera] Cloud-Based Network Design & Management Techniques
- [Tsinghua] AIOps Course of Tsinghua
- [OpsPAI] https://github.com/WeibinMeng/log-anomaly-detection
- [WeibinMeng] https://github.com/WeibinMeng/log-anomaly-detection
- [awesome-AIOps] https://github.com/linjinjin123/awesome-AIOps