Migrating with Sync: Keeping MySQL and PostgreSQL Consistent During Transition
Migrating a production system from MySQL to PostgreSQL (or running both together during a phased migration) is a common but delicate task. The primary challenge is keeping data consistent between the source (MySQL) and the target (PostgreSQL) while minimizing downtime and ensuring application correctness. This article walks through planning, architectures, tools, conflict handling, verification, and operational tips to achieve a safe, efficient migration with continuous synchronization.
Why sync during migration?
- Minimize downtime: continuous, near-real-time replication shrinks the window during which writes made to MySQL are not yet visible in PostgreSQL.
- Reduce risk: Allow read traffic and testing on PostgreSQL before full cutover.
- Enable phased migration: Migrate services incrementally, switching components to PostgreSQL while others still use MySQL.
- Support hybrid workloads: Some teams permanently keep both databases for legacy apps, analytics, or multi-DB strategies.
Pre-migration planning
Inventory and assessment
- Catalog schemas: tables, types, constraints, indexes, foreign keys, triggers, stored procedures.
- Identify incompatible features: MySQL-specific SQL, ENUM differences, unsigned integers, zero-dates, full-text differences, JSON handling differences, stored procedures and functions.
- Data volume and change rate: estimate initial dataset size and write throughput to plan sync strategy and hardware/network capacity.
- Transaction semantics: InnoDB locking behavior and its default REPEATABLE READ isolation vs PostgreSQL MVCC with default READ COMMITTED; confirm which isolation guarantees the application actually relies on.
Decide migration approach
Common approaches:
- Big-bang export/import (fast for small datasets; high downtime).
- Logical replication + sync (best for minimal downtime, schema adaptation).
- Dual-write at application level (risky: potential for divergence).
- CDC (Change Data Capture) based replication (recommended for production migrations).
Architecture patterns for sync
One-way replication (MySQL → PostgreSQL)
- Use when MySQL remains primary and PostgreSQL is target for reads or eventual cutover.
- Keeps single source of truth; simplifies conflict handling.
Bi-directional replication (MySQL ⇄ PostgreSQL)
- Useful for running both as active-active during transition, but complexity and conflicts increase significantly.
- Requires deduplication, conflict resolution rules, and strict idempotency.
Staging + Cutover
- Continuous replication into a staging PostgreSQL cluster.
- Run the application read-only against it, or mirror (shadow) writes to it, to validate behavior before the final switch.
- Final cutover involves brief downtime to ensure the last binlog changes have been applied and to redirect the application.
Tools and techniques
Logical dump and load
- mysqldump — workable for simple schemas, but the output generally still needs conversion to PostgreSQL-compatible SQL.
- pt-table-sync / pt-table-checksum (Percona Toolkit) — useful for verifying and fixing differences when both servers are active.
- Limitations: slow for large datasets, needs careful handling of schema differences.
Change Data Capture (CDC)
- Debezium: reads MySQL binlog and streams change events to Kafka or directly to consumers. Often paired with Kafka Connect and sink connectors for PostgreSQL (e.g., JDBC sink).
- Maxwell’s Daemon / Canal: lighter-weight binlog readers for streaming changes.
- pg_chameleon: another CDC/replication option (covered in more detail below). Bottled Water captures changes from PostgreSQL rather than MySQL, so it does not apply in this direction.
Advantages of CDC:
- Low-latency propagation of changes.
- Fine-grained events (INSERT/UPDATE/DELETE).
- Can transform events to match PostgreSQL schema.
Tradeoffs:
- Operational complexity: Kafka, connectors, or other middleware.
- Schema evolution handling needs automation.
Direct replication tools for MySQL → PostgreSQL
- pg_chameleon: Python-based, uses MySQL binlog to replicate to PostgreSQL; handles initial snapshot and ongoing binlog replication.
- mysql2pgsql (and variants): converts dumps to PostgreSQL-compatible SQL.
- SymmetricDS: supports cross-DB replication with conflict resolution and transformation rules (heavy but feature-rich).
Using logical replication and foreign data wrappers
- Foreign data wrappers (FDW): mysql_fdw lets PostgreSQL query MySQL tables directly as foreign tables (postgres_fdw is the analogous wrapper for other PostgreSQL servers); this can help during testing and phased migrations. See the setup sketch after this list.
- FDWs are not a replacement for full replication—performance and transactional guarantees differ.
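The sketch below shows a minimal mysql_fdw setup for querying MySQL from PostgreSQL during testing; the host, credentials, and schema names are placeholders, and the extension must already be installed on the PostgreSQL server.
```sql
-- Minimal mysql_fdw setup (placeholder host, credentials, and schema names)
CREATE EXTENSION IF NOT EXISTS mysql_fdw;

CREATE SERVER mysql_source
  FOREIGN DATA WRAPPER mysql_fdw
  OPTIONS (host 'mysql.internal', port '3306');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER mysql_source
  OPTIONS (username 'migration_ro', password 'change_me');

-- Expose the MySQL schema as foreign tables in a dedicated PostgreSQL schema
CREATE SCHEMA IF NOT EXISTS mysql_shadow;
IMPORT FOREIGN SCHEMA appdb FROM SERVER mysql_source INTO mysql_shadow;

-- Ad-hoc comparison against live MySQL data
SELECT COUNT(*) FROM mysql_shadow.orders;
```
Queries against foreign tables hit MySQL on every execution, so treat this as a validation aid rather than a replication channel.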
Schema migration and compatibility
Data type mapping
- Integers: map unsigned MySQL ints to larger signed PostgreSQL types where necessary (e.g., unsigned INT → BIGINT).
- AUTO_INCREMENT → SERIAL or IDENTITY: prefer IDENTITY in modern PostgreSQL.
- ENUM → TEXT or PostgreSQL ENUM: PostgreSQL enums are stricter and require migration steps for changes.
- DATETIME/TIMESTAMP: handle MySQL zero-dates and timezone differences; normalize to UTC where possible.
- JSON: MySQL JSON vs PostgreSQL jsonb — PostgreSQL jsonb is richer; convert carefully. A mapping sketch follows this list.
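To make the mapping concrete, here is a sketch of how a hypothetical MySQL orders table might be declared on the PostgreSQL side; the column names are invented, and choices such as TEXT plus a CHECK versus a native PostgreSQL ENUM depend on your application.
```sql
-- MySQL source (for reference):
--   id BIGINT UNSIGNED AUTO_INCREMENT, status ENUM('new','paid','shipped'),
--   qty INT UNSIGNED, created_at DATETIME, attrs JSON
CREATE TABLE orders (
    id         BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, -- AUTO_INCREMENT; BY DEFAULT accepts replicated explicit ids
    status     TEXT NOT NULL
               CHECK (status IN ('new', 'paid', 'shipped')),        -- ENUM -> TEXT + CHECK (or a PostgreSQL ENUM)
    qty        BIGINT NOT NULL CHECK (qty >= 0),                    -- INT UNSIGNED -> BIGINT + CHECK
    created_at timestamptz,                                         -- DATETIME, normalized to UTC
    attrs      jsonb                                                -- MySQL JSON -> jsonb
);
```
Note that the upper half of the BIGINT UNSIGNED range exceeds PostgreSQL's signed BIGINT and would need NUMERIC(20,0), and that after the initial load the identity sequence must be advanced past the highest replicated id (for example with setval).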
Constraints, indexes, and keys
- Recreate primary and unique keys; watch for differences in index behavior (functional indexes, collations).
- Foreign keys: ensure referential integrity ordering during the initial load (load parent tables first or defer constraints; see the sketch after this list).
- Collation and character sets: convert MySQL character sets (e.g., latin1) to UTF-8 in PostgreSQL. Collation differences can change sort/order results—decide expected behavior and test.
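One way to relax load ordering, sketched here with assumed table names: declare foreign keys as deferrable and defer the checks for the duration of the load transaction.
```sql
-- Declare the FK as deferrable so the bulk load need not be strictly parent-first
ALTER TABLE order_items
  ADD CONSTRAINT order_items_order_fk
  FOREIGN KEY (order_id) REFERENCES orders (id)
  DEFERRABLE INITIALLY IMMEDIATE;

BEGIN;
SET CONSTRAINTS ALL DEFERRED;   -- FK checks run at COMMIT instead of per row
-- ... bulk INSERT/COPY of orders and order_items in any order ...
COMMIT;                         -- referential integrity is verified here
```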
Stored procedures, triggers, and views
- Reimplement business logic in PostgreSQL (PL/pgSQL) or move logic to application layer.
- Triggers and virtual columns need manual conversion.
Initial load strategies
Snapshot + apply binlog
- Take a consistent snapshot of MySQL (FLUSH TABLES WITH READ LOCK, or Percona XtraBackup for an InnoDB hot backup).
- Dump snapshot into PostgreSQL-compatible format and load into PostgreSQL.
- Capture binlog position at snapshot time.
- Start CDC tool from that binlog position to apply subsequent changes.
This approach minimizes downtime and avoids missing changes between the snapshot and the start of replication; a simplified MySQL-side sketch follows.
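The commands below illustrate capturing that starting point by hand; backup tools such as XtraBackup, and mysqldump's dump options, record the same coordinates for you.
```sql
-- Run on MySQL around the snapshot
FLUSH TABLES WITH READ LOCK;   -- briefly block writes to get a consistent point
SHOW MASTER STATUS;            -- record File and Position as the CDC start point
                               -- (SHOW BINARY LOG STATUS on recent MySQL releases)
-- take the dump/backup now, or open a transaction with
-- START TRANSACTION WITH CONSISTENT SNAPSHOT before releasing the lock
UNLOCK TABLES;
```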
Parallelized bulk load
- Use parallel workers to load multiple tables concurrently.
- For huge tables, consider chunked export/import (e.g., by primary key range) to speed up load.
- Load into tables without secondary indexes and recreate the indexes after the load completes to speed up writes; see the sketch after this list.
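For example, with assumed table and file names, each exported chunk can be loaded with COPY into a table that has no secondary indexes yet:
```sql
-- Staging table with column definitions only; indexes come later
CREATE TABLE orders_load (LIKE orders INCLUDING DEFAULTS);

-- Load chunks (exported by primary-key range) from parallel sessions
COPY orders_load FROM '/import/orders_00001.csv' WITH (FORMAT csv);
COPY orders_load FROM '/import/orders_00002.csv' WITH (FORMAT csv);
-- (from a client, psql's \copy or COPY FROM STDIN avoids server-side file access)

-- Recreate indexes once all chunks are in
CREATE INDEX orders_load_customer_idx ON orders_load (customer_id);
```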
Handling ongoing changes and conflicts
Idempotency and ordering
- Ensure events applied to PostgreSQL are idempotent (use UPSERTs with deterministic ordering); an example follows this list.
- Preserve ordering per primary key or transactional ordering where necessary to avoid out-of-order updates causing data loss.
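A minimal sketch of an idempotent apply step, assuming a hypothetical orders table keyed by id: re-delivering the same change event leaves the row unchanged.
```sql
-- Apply an INSERT or UPDATE event as an idempotent UPSERT
INSERT INTO orders (id, status, total_amount, updated_at)
VALUES (42, 'paid', 99.50, '2024-05-01T12:00:00Z')
ON CONFLICT (id) DO UPDATE
SET status       = EXCLUDED.status,
    total_amount = EXCLUDED.total_amount,
    updated_at   = EXCLUDED.updated_at;
```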
Conflict detection & resolution
- One-way replication: conflicts rarely occur if MySQL is single writer.
- Bi-directional replication: define rules (last-writer-wins, source priority, timestamps) and use unique change identifiers to deduplicate; see the sketch after this list.
- Consider tombstone markers for deletes and soft-delete strategies to avoid accidental resurrects.
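One common convention (assumed here) is to carry a source timestamp or version on every row and apply a change only if it is not older than what is already stored, extending the earlier UPSERT:
```sql
-- Last-writer-wins variant: stale events are silently ignored
INSERT INTO orders (id, status, total_amount, updated_at)
VALUES (42, 'shipped', 99.50, '2024-05-01T12:05:00Z')
ON CONFLICT (id) DO UPDATE
SET status       = EXCLUDED.status,
    total_amount = EXCLUDED.total_amount,
    updated_at   = EXCLUDED.updated_at
WHERE orders.updated_at <= EXCLUDED.updated_at;

-- Record deletes as tombstones (hypothetical deleted_at column) instead of hard DELETEs
UPDATE orders SET deleted_at = '2024-05-01T12:06:00Z' WHERE id = 42;
```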
Backpressure and retries
- CDC consumers must handle transient failures, replay, and backpressure (e.g., Kafka retention settings).
- Monitor lag between the latest MySQL binlog position and the position applied to PostgreSQL; set alerts.
Verification and testing
Checksums and consistency checks
- Use pt-table-checksum to detect row-level divergence when MySQL is still writable.
- Use checksums in batches to avoid heavy locking.
- For PostgreSQL, build equivalent checksumming scripts or export sample rows to compare; see the sketch after this list.
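A hedged example of a batched checksum on the PostgreSQL side; the MySQL equivalent would use GROUP_CONCAT in place of string_agg, and the comparison only works if both sides render the chosen columns identically (table and columns are illustrative).
```sql
-- Per-bucket checksum over a canonical projection of columns
SELECT id / 100000 AS bucket,
       COUNT(*)    AS rows_in_bucket,
       md5(string_agg(concat_ws('|', id, status, total_amount), ',' ORDER BY id)) AS bucket_checksum
FROM orders
GROUP BY 1
ORDER BY 1;
```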
Row counts and distribution checks
- Compare row counts per table, per partition.
- Compare aggregates (SUM of important numeric fields, MAX timestamps) to spot differences; see the example after this list.
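Cheap spot checks like the following run unchanged on both engines and catch gross divergence quickly (table and column names are illustrative):
```sql
-- Run on both MySQL and PostgreSQL and diff the output
SELECT COUNT(*)                       AS row_count,
       COALESCE(SUM(total_amount), 0) AS amount_sum,
       MAX(updated_at)                AS latest_change
FROM orders;
```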
Schema and query tests
- Run application test suites against PostgreSQL with production-like data.
- Run comparative queries (sample queries executed on both DBs) and diff results.
Canary and shadow testing
- Route a percentage of read traffic to PostgreSQL.
- Use shadow writes (write to both DBs) for non-critical workloads to test parity.
Cutover steps (example minimal-downtime plan)
- Prepare target PostgreSQL with all schema, indexes, and preloaded data.
- Start CDC replication from captured binlog position and let it catch up until lag is minimal.
- Put application into read-only or maintenance mode briefly, or temporarily quiesce writes.
- Ensure CDC has applied all changes up to the latest binlog position (see the MySQL-side sketch after this list).
- Switch writes to PostgreSQL (update application config/load balancers).
- Monitor application and replication metrics closely for errors or divergence.
- Keep MySQL in read-only mode for a verification period, then decommission or repurpose.
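A minimal sketch of the quiesce step on the MySQL side (exact commands vary with MySQL version and topology):
```sql
-- On MySQL, at cutover time: stop accepting new writes
SET GLOBAL read_only = ON;
SET GLOBAL super_read_only = ON;   -- also blocks writes from privileged accounts

-- Record the final binlog coordinates, wait until the CDC pipeline reports it has
-- applied past this position, then point the application at PostgreSQL
SHOW MASTER STATUS;                -- SHOW BINARY LOG STATUS on recent MySQL releases
```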
A truly zero-downtime cutover requires more complex synchronization and dual-write validation mechanisms.
Observability and monitoring
Key metrics to monitor:
- Replication lag (latest MySQL binlog position vs the position applied to PostgreSQL).
- Error rates in CDC pipeline and sink connectors.
- Throughput: rows/sec and transactions/sec.
- Disk and I/O metrics on both databases (to avoid stalls).
- Application-level latency and error rates after cutover.
Set alerts on replication lag thresholds, connector failures, and data skew detection.
Operational best practices
- Test the full migration process in a staging environment with a production-sized sample dataset.
- Automate schema conversion and migration steps with scripts/CI pipelines to avoid human error.
- Keep strict backups and point-in-time recovery plans for both MySQL and PostgreSQL.
- Version-control schema migration scripts (use tools like Flyway or Liquibase adapted for PostgreSQL).
- Maintain idempotent migration scripts to allow re-runs safely.
- Have a rollback plan and criteria for aborting cutover.
Common pitfalls and how to avoid them
- Ignoring type mismatches (causes subtle bugs): map types explicitly and test queries that depend on them.
- Overlooking character set/collation issues: normalize encodings and re-test ORDER BY results.
- Underestimating write throughput during replication: benchmark and size infrastructure accordingly.
- Relying on dual-write without strong guarantees: prefer single-writer plus CDC or transactional middleware.
- Not validating after cutover: schedule and run automated consistency checks immediately after transition.
Example toolchain for a production migration
- Initial snapshot: Percona XtraBackup or mysqldump (with careful options).
- CDC: Debezium reading MySQL binlog → Kafka.
- Transform & sink: Kafka Connect with a JDBC sink or custom consumer that converts events to idempotent UPSERTs into PostgreSQL.
- Verification: pt-table-checksum (MySQL side) + custom checksumming scripts against PostgreSQL.
- Or use pg_chameleon for a simpler, integrated approach when appropriate.
Final notes
Migrating with sync between MySQL and PostgreSQL is achievable with careful planning, the right tooling, and thorough testing. CDC-based approaches (Debezium, pg_chameleon, SymmetricDS) give the best mix of low downtime and reliability for production migrations. Focus on correct schema mapping, ordered/idempotent event application, robust observability, and a well-rehearsed cutover plan to avoid surprises.