Incremental Load: avoiding data loss

Written by mannem on . Posted in Data Pipelines, Redshift

While copying data from RDS to Redshift..

To avoid data loss, start the ‘Incremental copy template’ before the ‘Full copy’

A sample implementation can be,

————————————————-
Incremental copy scheduled start time – 1:50 PM

Full copy start time – 2:00 PM
A DB Insert – 2:10 PM
Full copy End Time – 4:00 PM

A DB Insert – 4:05 PM

Incremental copy First run – 4:10 PM
————————————————-

> In the above example, the contents of first DB Insert at 2:10 may or may not be included in FULL copy.
> Contents of the second insert will not be included in Full copy.

How to ensure that these new inserts will show up in Redshift database ?

> As the ‘Incremental copy template’ uses TIME SERIES scheduling, the actual ‘Incremental copy activity’ run wont start at scheduled start time(1:50), rather it will start and the end of scheduled start time(4:10). All the DB changes between ‘scheduled start date/time’ and ‘first run of the actual copy activity’ will be copied to redshift.
> So, the first incremental copy run will copy all new DB inserts between 1:50 PM and 4:10 PM to redshift. This includes the contents of two DB inserts which are happening during/after FULL copy activity.

Trackback from your site.

Comments (1)

  • mannem

    |

    timeseries huh ?

    Reply

Leave a comment

  • cloudformation

    cloudformation

    pipeline

    Data-pipelines

    directoryservice

    directoryservicez

    cloudtrail

    cloudtrail

    config

    config

    trustedadvisor

    Trustedadvisor

  • snap

    Snapshot

    glacier

    Glacie

    storagegw

    Storage Gatewa

    s3

    S3

    cloudFront

    Cloud Front

  • r53

    Route 53

    lambda

    lambd

    directConnect

    DirectConnect

    vpc

    VPC

    kinesis

    Kinesis

    emr

    Emr

  • sns

    SNS

    transcoder

    Transcoder

    sqs

    SQS

    cloudsearch

    Cloud Search

    appstream

    App Stream

    ses

    SES

  • opsworks

    opsworks

    cloudwatch

    Cloud Watch

    beanstalk

    Elastic Beanstalk

    codedeploy

    Code Deploy

    IAM

    IAM

  • dynamodb

    dynamodb

    rds

    RDS

    elasticache

    ElastiCache

    redshift

    Redshift

    simpledb

    simpledb