Data Loss: Changing Schema Propagation Settings Causes Silent Dropping of CDC Tables #50874

Open
rtol5 opened this issue Jan 3, 2025 · 3 comments

Comments

@rtol5

rtol5 commented Jan 3, 2025

Topic

Schema propagation

Relevant information

Description

We experienced complete data loss in our CDC changelog tables when changing the Schema Propagation setting from "Propagate all field and stream changes" to "Propagate field changes only". Upon this configuration change, Airbyte silently dropped and recreated all changelog tables on the next sync, effectively erasing our historical changelog data without any warning or confirmation.

Environment

  • Airbyte Version: 1.3.1
  • Source: source-mysql 3.9.4
  • Destination: destination-snowflake 3.15.2
  • Deployment: OSS
  • Connection Method: CDC/Binlog
  • Initial Schema Propagation Setting: "Propagate all field and stream changes"
  • Changed Schema Propagation To: "Propagate field changes only"

The Problem

  1. Our CDC pipeline kept failing when new tables were added to MySQL
  2. With a 72-hour binlog retention limit, this gave us very little time to respond to failures
  3. To mitigate this, we decided to change from auto-detecting new tables to manual table enabling
  4. Upon changing this setting, Airbyte:
    • Detected this as a "schema change"
    • Silently dropped all existing changelog tables
    • Recreated them as empty tables
    • Did not provide any warning about potential data loss
    • Did not require confirmation for this destructive action

Impact

  • Complete loss of historical changelog data
  • Required extensive manual work to attempt to restore from backups
  • Loss of data continuity in our changelog history
  • Unexpected behavior for what should be a non-destructive configuration change

Evidence

The Snowflake query history shows the sequence of DROP TABLE commands executed by Airbyte:
[Screenshot: Snowflake query history]
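
For anyone who wants to check their own account for the same pattern, the query history can be searched for DROP TABLE statements. The following is only a sketch under stated assumptions, not the query used for this report: it assumes your role can read Snowflake's `SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY` view, and the helper names `build_drop_audit_sql` and `find_drop_statements` are hypothetical.

```python
import re
from datetime import datetime

# Matches statements that begin with DROP TABLE (case-insensitive).
DROP_TABLE_RE = re.compile(r"^\s*DROP\s+TABLE\b", re.IGNORECASE)


def build_drop_audit_sql(start: datetime, end: datetime) -> str:
    """Build a query over Snowflake's ACCOUNT_USAGE.QUERY_HISTORY view.

    Assumption: the role running it has access to the SNOWFLAKE database.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    return (
        "SELECT start_time, user_name, query_text\n"
        "FROM snowflake.account_usage.query_history\n"
        f"WHERE start_time BETWEEN '{start.strftime(fmt)}' "
        f"AND '{end.strftime(fmt)}'\n"
        "  AND query_text ILIKE 'DROP TABLE%'\n"
        "ORDER BY start_time"
    )


def find_drop_statements(rows):
    """Keep only (start_time, user_name, query_text) rows that are DROP TABLE."""
    return [row for row in rows if DROP_TABLE_RE.match(row[2])]
```

The generated SQL can be pasted into a worksheet, and `find_drop_statements` can be applied to rows exported from the query history page. Narrowing the time window to the period around the "Schema updated" event keeps the result set small.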

Expected Behavior

When changing schema propagation settings:

  1. Airbyte should not treat this as a schema change requiring table recreation
  2. If table recreation is necessary:
    • Show a clear warning about potential data loss
    • Require explicit confirmation before proceeding with destructive operations
    • Provide an option to back up existing data
  3. Maintain existing data and table structures unless explicitly requested otherwise
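
To make the confirmation requirement concrete, here is a minimal sketch of a guard that refuses to run destructive statements unless the caller has explicitly confirmed them. It is purely illustrative and does not reflect how Airbyte is implemented: `apply_plan`, `ConfirmationRequired`, and the list of destructive prefixes are all hypothetical.

```python
from typing import Callable, List, Sequence

# Statement prefixes treated as destructive in this sketch (an assumption).
DESTRUCTIVE_PREFIXES = ("DROP TABLE", "TRUNCATE TABLE", "CREATE OR REPLACE TABLE")


class ConfirmationRequired(Exception):
    """Raised when destructive statements are planned without explicit consent."""


def is_destructive(statement: str) -> bool:
    # Collapse whitespace and ignore case before comparing prefixes.
    normalized = " ".join(statement.upper().split())
    return normalized.startswith(DESTRUCTIVE_PREFIXES)


def apply_plan(
    statements: Sequence[str],
    execute: Callable[[str], None],
    confirmed: bool = False,
) -> List[str]:
    """Run every statement, but only if destructive ones were confirmed.

    Nothing is executed when confirmation is missing, so a rejected plan
    leaves the destination untouched.
    """
    destructive = [s for s in statements if is_destructive(s)]
    if destructive and not confirmed:
        raise ConfirmationRequired(
            f"{len(destructive)} destructive statement(s) need confirmation: "
            f"{destructive}"
        )
    executed = []
    for statement in statements:
        execute(statement)
        executed.append(statement)
    return executed
```

The check happens before any statement runs, so a plan that mixes harmless and destructive statements is rejected as a whole rather than partially applied.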

Additional Context

This issue is particularly severe because:

  1. The change was made specifically to prevent data loss scenarios
  2. The resulting behavior caused the very thing we were trying to prevent
  3. There was no warning or indication that this setting change would be destructive
  4. CDC data is historical by nature and often irreplaceable once lost

Suggested Solutions

  1. Add clear warnings when configuration changes might result in data loss
  2. Implement a confirmation step for destructive operations
  3. Provide an option to preserve existing tables when changing propagation settings
  4. Consider adding a "dry run" option to show what changes would be made
  5. Add documentation clearly stating which configuration changes might trigger table recreations
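
As a rough illustration of the "dry run" idea in item 4, the sketch below compares the tables that exist in the destination with the tables a sync intends to manage and reports the planned actions without executing anything. All names here (`dry_run`, `has_data_loss`, the example table names) are hypothetical and are not part of Airbyte.

```python
from typing import Dict, FrozenSet, List, Set


def dry_run(
    existing: Set[str],
    desired: Set[str],
    recreate: FrozenSet[str] = frozenset(),
) -> Dict[str, List[str]]:
    """Report what a sync would do to destination tables, without doing it.

    `recreate` lists tables the sync intends to drop and rebuild; only
    tables that both exist and are still desired can be recreated.
    """
    recreated = set(recreate) & existing & desired
    return {
        "keep": sorted((existing & desired) - recreated),
        "create": sorted(desired - existing),
        "recreate": sorted(recreated),  # drop + create: existing rows are lost
        "drop": sorted(existing - desired),
    }


def has_data_loss(plan: Dict[str, List[str]]) -> bool:
    """True when any existing table would be dropped or recreated."""
    return bool(plan["recreate"] or plan["drop"])


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    tables = {"orders_changelog", "users_changelog"}
    plan = dry_run(tables, tables, recreate=frozenset({"orders_changelog"}))
    for action, names in plan.items():
        print(f"{action}: {names}")
    print("data loss expected:", has_data_loss(plan))
```

A report like this, shown before a settings change is saved, would have surfaced the pending table recreation in the situation described above.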

Screenshot of timeline

Note how the DROP statements above started after the "Schema updated" event, but before the next sync started.
[Screenshot: connection timeline]

@marcosmarxm
Member

Slack discussion

@davinchia
Contributor

"Upon this configuration change, Airbyte silently dropped and recreated all changelog tables on the next sync"

Thanks for reporting this, @rtol5! Trying to get the facts straight: was the next sync triggered immediately after the settings were changed, or did this behaviour happen on the next scheduled sync?

@rtol5
Author

rtol5 commented Jan 7, 2025

Hi @davinchia – I didn't notice the schema change until a day after it happened, but based on the timeline in the screenshot (schema propagation setting changed on 12/28 at 10am; schema change applied on 12/29 at 11am, and a new scheduled sync right after the schema change), I'm fairly certain it's the latter – the schema change that dropped our tables happened as Step 0 of the next scheduled sync.
