Add AWS Backup detectors (#457)
* feat(aws_backup): Init detectors module

Supersedes #338

* feat(aws_backup): Add recovery points detectors

* refactor(aws_backup): improvements according to comments

* style(aws_backup): lowercase name

Co-authored-by: Patrick Decat <patrick.decat@fr.clara.net>

---------

Co-authored-by: Nicolas VION <nicolas.vion@icloud.com>
Co-authored-by: Patrick Decat <patrick.decat@fr.clara.net>
3 people authored Mar 24, 2023
1 parent 1ad986a commit 3f85d43
Showing 18 changed files with 816 additions and 0 deletions.
13 changes: 13 additions & 0 deletions docs/severity.md
@@ -9,6 +9,7 @@
- [fame_azure-vpn](#fame_azure-vpn)
- [integration_aws-alb](#integration_aws-alb)
- [integration_aws-apigateway](#integration_aws-apigateway)
- [integration_aws-backup](#integration_aws-backup)
- [integration_aws-beanstalk](#integration_aws-beanstalk)
- [integration_aws-ecs-cluster](#integration_aws-ecs-cluster)
- [integration_aws-ecs-service](#integration_aws-ecs-service)
@@ -158,6 +159,18 @@
|AWS APIGateway http 4xx error rate|X|X|X|-|-|


## integration_aws-backup

|Detector|Critical|Major|Minor|Warning|Info|
|---|---|---|---|---|---|
|AWS Backup failed|X|-|-|-|-|
|AWS Backup job expired|X|-|-|-|-|
|AWS Backup copy jobs failed|X|-|-|-|-|
|AWS Backup check jobs completed successfully|X|-|-|-|-|
|AWS Backup recovery point partial|-|-|X|-|-|
|AWS Backup recovery point expired|-|X|-|-|-|


## integration_aws-beanstalk

|Detector|Critical|Major|Minor|Warning|Info|
117 changes: 117 additions & 0 deletions modules/integration_aws-backup/README.md
@@ -0,0 +1,117 @@
# AWS-BACKUP SignalFx detectors

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
:link: **Contents**

- [How to use this module?](#how-to-use-this-module)
- [What are the available detectors in this module?](#what-are-the-available-detectors-in-this-module)
- [How to collect required metrics?](#how-to-collect-required-metrics)
- [Metrics](#metrics)
- [Related documentation](#related-documentation)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

## How to use this module?

This directory defines a [Terraform](https://www.terraform.io/)
[module](https://www.terraform.io/language/modules/syntax) you can use in your
existing [stack](https://github.com/claranet/terraform-signalfx-detectors/wiki/Getting-started#stack) by adding a
`module` configuration and setting its `source` parameter to the URL of this folder:

```hcl
module "signalfx-detectors-integration-aws-backup" {
  source = "github.com/claranet/terraform-signalfx-detectors.git//modules/integration_aws-backup?ref={revision}"

  environment   = var.environment
  notifications = local.notifications
}
```

Note the following parameters:

* `source`: Use this parameter to specify the URL of the module. The double slash (`//`) is intentional and required.
Terraform uses it to specify subfolders within a Git repo (see [module
sources](https://www.terraform.io/language/modules/sources)). The `ref` parameter specifies a specific Git tag in
this repository. It is recommended to use the latest "pinned" version in place of `{revision}`. Avoid using a branch
like `master` except for testing purposes. Note that all modules in this repository are available on the Terraform
[registry](https://registry.terraform.io/modules/claranet/detectors/signalfx) and we recommend using it as source
instead of `git`, which is more flexible but less future-proof.

* `environment`: Use this parameter to specify the
[environment](https://github.com/claranet/terraform-signalfx-detectors/wiki/Getting-started#environment) used by this
instance of the module.
Its value will be added to the `prefixes` list at the start of the [detector
name](https://github.com/claranet/terraform-signalfx-detectors/wiki/Templating#example).
In general, it will also be used in the `filtering` internal sub-module to [apply
filters](https://github.com/claranet/terraform-signalfx-detectors/wiki/Guidance#filtering) based on our default
[tagging convention](https://github.com/claranet/terraform-signalfx-detectors/wiki/Tagging-convention).

* `notifications`: Use this parameter to define where alerts should be sent depending on their severity. It consists
of a Terraform [object](https://www.terraform.io/language/expressions/type-constraints#object) where each key represents an available
[detector rule severity](https://docs.splunk.com/observability/alerts-detectors-notifications/create-detectors-for-alerts.html#severity)
and its value is a list of recipients. Every recipient must respect the [detector notification
format](https://registry.terraform.io/providers/splunk-terraform/signalfx/latest/docs/resources/detector#notification-format).
Check the [notification binding](https://github.com/claranet/terraform-signalfx-detectors/wiki/Notifications-binding)
documentation to understand the recommended role of each severity.
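
As a sketch of the parameters described above, the module can be sourced from the Terraform registry with a pinned `version`, and `local.notifications` can be defined as an object keyed by severity. The registry source string, the `x.y.z` placeholder, the recipient values, and the exact set of five severity keys are assumptions for illustration; check the notification binding documentation for the recommended layout.

```hcl
# Hypothetical recipients: replace the e-mail address and the PagerDuty
# credential ID with values from your own organization.
locals {
  notifications = {
    critical = ["PagerDuty,exampleCredentialId"]
    major    = ["Email,oncall@example.com"]
    minor    = ["Email,oncall@example.com"]
    warning  = []
    info     = []
  }
}

module "signalfx-detectors-integration-aws-backup" {
  # Registry source (assumed form); pin `version` to a released tag.
  source  = "claranet/detectors/signalfx//modules/integration_aws-backup"
  version = "x.y.z" # placeholder, not a real release

  environment   = var.environment
  notifications = local.notifications
}
```

Leaving a severity list empty means no recipient is notified for rules of that severity.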

These 3 parameters, along with all variables defined in [common-variables.tf](common-variables.tf), are common to all
[modules](../) in this repository. Other variables, specific to this module, are available in
[variables-gen.tf](variables-gen.tf).
In general, the default configuration "works" but all of these Terraform
[variables](https://www.terraform.io/language/values/variables) make it possible to
customize the detectors' behavior to better fit your needs.

Most of them represent usual tips and rules detailed in the
[guidance](https://github.com/claranet/terraform-signalfx-detectors/wiki/Guidance) documentation and listed in the
common [variables](https://github.com/claranet/terraform-signalfx-detectors/wiki/Variables) dedicated documentation.

Feel free to explore the [wiki](https://github.com/claranet/terraform-signalfx-detectors/wiki) for more information about
general usage of this repository.

## What are the available detectors in this module?

This module creates the following SignalFx detectors, each of which can contain one or more alerting rules:

|Detector|Critical|Major|Minor|Warning|Info|
|---|---|---|---|---|---|
|AWS Backup failed|X|-|-|-|-|
|AWS Backup job expired|X|-|-|-|-|
|AWS Backup copy jobs failed|X|-|-|-|-|
|AWS Backup check jobs completed successfully|X|-|-|-|-|
|AWS Backup recovery point partial|-|-|X|-|-|
|AWS Backup recovery point expired|-|X|-|-|-|

## How to collect required metrics?

This module deploys detectors using metrics reported by the
[AWS integration](https://docs.splunk.com/Observability/gdi/get-data-in/connect/aws/aws.html) configurable
with [this Terraform module](https://github.com/claranet/terraform-signalfx-integrations/tree/master/cloud/aws).


Check the [Related documentation](#related-documentation) section for more detailed and specific information about this module's dependencies.



### Metrics


Here is the list of required metrics for detectors in this module.

* `NumberOfBackupJobsCompleted`
* `NumberOfBackupJobsCreated`
* `NumberOfBackupJobsExpired`
* `NumberOfBackupJobsFailed`
* `NumberOfCopyJobsFailed`
* `NumberOfRecoveryPointsExpired`
* `NumberOfRecoveryPointsPartial`




## Related documentation

* [Terraform SignalFx provider](https://registry.terraform.io/providers/splunk-terraform/signalfx/latest/docs)
* [Terraform SignalFx detector](https://registry.terraform.io/providers/splunk-terraform/signalfx/latest/docs/resources/detector)
* [Splunk Observability integrations](https://docs.splunk.com/Observability/gdi/get-data-in/integrations.html)
* [CloudWatch metrics](https://docs.aws.amazon.com/aws-backup/latest/devguide/cloudwatch.html)
3 changes: 3 additions & 0 deletions modules/integration_aws-backup/common-filters.tf
@@ -0,0 +1,3 @@
locals {
  filters = "filter('ResourceType', '*')"
}
1 change: 1 addition & 0 deletions modules/integration_aws-backup/common-locals.tf
1 change: 1 addition & 0 deletions modules/integration_aws-backup/common-modules.tf
1 change: 1 addition & 0 deletions modules/integration_aws-backup/common-variables.tf
1 change: 1 addition & 0 deletions modules/integration_aws-backup/common-versions.tf
16 changes: 16 additions & 0 deletions modules/integration_aws-backup/conf/00-aws-backup-failed.yaml
@@ -0,0 +1,16 @@
module: AWS Backup
name: failed
id: backup_failed

transformation: ".max(over='1d').fill(0)"
aggregation: true
filtering: "filter('namespace', 'AWS/Backup') and filter('stat', 'sum')"

signals:
  signal:
    metric: NumberOfBackupJobsFailed
rules:
  critical:
    threshold: 0
    comparator: ">"
    lasting_duration: '1h'
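
To make the mapping concrete, here is a hand-written sketch of a `signalfx_detector` resource roughly equivalent to the configuration above. It is not the code generated by the repository (which is not shown in this diff): the resource name, detector name, and rule description are illustrative, the threshold and duration are hard-coded instead of being exposed as variables, and neither `aggregation: true` nor the module-level `filters` local is represented.

```hcl
# Illustrative only: a simplified, hard-coded equivalent of 00-aws-backup-failed.yaml.
resource "signalfx_detector" "backup_failed_sketch" {
  name = "AWS Backup failed"

  program_text = <<-EOF
    base_filtering = filter('namespace', 'AWS/Backup') and filter('stat', 'sum')
    signal = data('NumberOfBackupJobsFailed', filter=base_filtering).max(over='1d').fill(0).publish('signal')
    detect(when(signal > 0, lasting='1h')).publish('CRIT')
  EOF

  rule {
    detect_label = "CRIT" # must match the label published by detect()
    severity     = "Critical"
    description  = "failed backup jobs > 0"
  }
}
```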
17 changes: 17 additions & 0 deletions modules/integration_aws-backup/conf/01-aws-backup-job-expired.yaml
@@ -0,0 +1,17 @@
module: AWS Backup
name: job expired
id: backup_job_expired

transformation: ".max(over='1d').fill(0)"
aggregation: true
filtering: "filter('namespace', 'AWS/Backup') and filter('stat', 'sum')"

signals:
  signal:
    metric: NumberOfBackupJobsExpired
    extrapolation: zero
rules:
  critical:
    threshold: 0
    comparator: ">"
    lasting_duration: '1h'
@@ -0,0 +1,16 @@
module: AWS Backup
name: copy jobs failed
id: backup_copy_jobs_failed

transformation: ".max(over='1d').fill(0)"
aggregation: true
filtering: "filter('namespace', 'AWS/Backup') and filter('stat', 'sum')"

signals:
  signal:
    metric: NumberOfCopyJobsFailed
rules:
  critical:
    threshold: 0
    comparator: ">"
    lasting_duration: '1h'
23 changes: 23 additions & 0 deletions modules/integration_aws-backup/conf/03-aws-backup-check.yaml
@@ -0,0 +1,23 @@
module: AWS Backup
name: check jobs completed successfully
id: backup_successful

transformation: ".min(over='23h')"
aggregation: true
filtering: "filter('namespace', 'AWS/Backup') and filter('stat', 'sum')"

signals:
  created:
    metric: NumberOfBackupJobsCreated
    extrapolation: zero
  completed:
    metric: NumberOfBackupJobsCompleted
    extrapolation: zero
  signal:
    formula: (created-completed)
rules:
  critical:
    threshold: 0
    comparator: ">"
    lasting_duration: 1d
    lasting_at_least: 0.9
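
The two-signal formula above can be read as the following SignalFlow program, wrapped here in a hand-written `signalfx_detector` sketch rather than the repository's generated code. It assumes the shared `transformation` is applied to each input signal before the subtraction, and that `lasting_at_least` maps to the `at_least` argument of `when()`. Both are assumptions about the generator, and the resource name and rule description are illustrative.

```hcl
# Illustrative only: a simplified, hard-coded reading of 03-aws-backup-check.yaml.
resource "signalfx_detector" "backup_successful_sketch" {
  name = "AWS Backup check jobs completed successfully"

  program_text = <<-EOF
    base_filtering = filter('namespace', 'AWS/Backup') and filter('stat', 'sum')
    created = data('NumberOfBackupJobsCreated', filter=base_filtering, extrapolation='zero').min(over='23h')
    completed = data('NumberOfBackupJobsCompleted', filter=base_filtering, extrapolation='zero').min(over='23h')
    signal = (created-completed).publish('signal')
    detect(when(signal > 0, lasting='1d', at_least=0.9)).publish('CRIT')
  EOF

  rule {
    detect_label = "CRIT" # must match the label published by detect()
    severity     = "Critical"
    description  = "created backup jobs exceed completed backup jobs"
  }
}
```

With `at_least=0.9`, the condition only has to hold for 90% of the one-day window, which tolerates brief gaps between job creation and completion.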
16 changes: 16 additions & 0 deletions modules/integration_aws-backup/conf/04-aws-backup-rp-partial.yaml
@@ -0,0 +1,16 @@
module: AWS Backup
name: recovery point partial
id: backup_rp_partial

transformation: ".max(over='1d').fill(0)"
aggregation: true
filtering: "filter('namespace', 'AWS/Backup') and filter('stat', 'sum')"

signals:
  signal:
    metric: NumberOfRecoveryPointsPartial
rules:
  minor:
    threshold: 0
    comparator: ">"
    lasting_duration: '1h'
16 changes: 16 additions & 0 deletions modules/integration_aws-backup/conf/05-aws-backup-rp-expired.yaml
@@ -0,0 +1,16 @@
module: AWS Backup
name: recovery point expired
id: backup_rp_expired

transformation: ".max(over='1d').fill(0)"
aggregation: true
filtering: "filter('namespace', 'AWS/Backup') and filter('stat', 'sum')"

signals:
  signal:
    metric: NumberOfRecoveryPointsExpired
rules:
  major:
    threshold: 0
    comparator: ">"
    lasting_duration: '1h'
5 changes: 5 additions & 0 deletions modules/integration_aws-backup/conf/readme.yaml
@@ -0,0 +1,5 @@
documentations:
  - name: CloudWatch metrics
    url: 'https://docs.aws.amazon.com/aws-backup/latest/devguide/cloudwatch.html'

source_doc: