Incremental Load: avoiding data loss

Written by mannem. Posted in Data Pipelines, Redshift

When copying data from RDS to Redshift, start the ‘Incremental copy template’ before the ‘Full copy’ to avoid data loss.

A sample timeline:

————————————————-
Incremental copy scheduled start time – 1:50 PM

Full copy start time – 2:00 PM
A DB Insert – 2:10 PM
Full copy End Time – 4:00 PM

A DB Insert – 4:05 PM

Incremental copy First run – 4:10 PM
————————————————-

> In the example above, the first DB insert (2:10 PM) may or may not be included in the full copy, because it happens while the full copy is still running.
> The second insert (4:05 PM) will not be included in the full copy, because the full copy ended at 4:00 PM.

How can you ensure that these new inserts show up in the Redshift database?

> Because the ‘Incremental copy template’ uses TIME SERIES scheduling, the first ‘Incremental copy activity’ run does not start at the scheduled start time (1:50 PM). It starts at the end of the first schedule interval (4:10 PM in this example). All DB changes made between the scheduled start date/time and the first actual run of the copy activity are copied to Redshift.
> So the first incremental copy run copies all new DB inserts made between 1:50 PM and 4:10 PM to Redshift. This includes both DB inserts, which happened during and after the full copy.
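
The timing argument can be checked with a small sketch. This is only an illustration of the reasoning, not Data Pipeline code: the function names are hypothetical, the date is arbitrary, and the 2 hour 20 minute period is an assumption inferred from the example (scheduled start 1:50 PM, first run 4:10 PM).

```python
from datetime import datetime, timedelta

def first_incremental_window(scheduled_start, period):
    """Model of time series scheduling as described above: the first run
    fires at the END of the first interval and covers changes made in
    [scheduled_start, scheduled_start + period)."""
    return scheduled_start, scheduled_start + period

def covered_by_window(event_time, window):
    """True if a DB change at event_time falls inside the window."""
    start, end = window
    return start <= event_time < end

DAY = datetime(2024, 1, 1)  # arbitrary date, for illustration only

def at(hour, minute):
    return DAY.replace(hour=hour, minute=minute)

# Assumed values taken from the sample timeline.
scheduled_start = at(13, 50)                # incremental copy scheduled start
period = timedelta(hours=2, minutes=20)     # assumption: 1:50 PM -> 4:10 PM
window = first_incremental_window(scheduled_start, period)
first_run_time = window[1]

inserts = {
    "insert during full copy (2:10 PM)": at(14, 10),
    "insert after full copy (4:05 PM)": at(16, 5),
}
results = {name: covered_by_window(t, window) for name, t in inserts.items()}

for name, covered in results.items():
    print(f"{name}: picked up by first incremental run = {covered}")
```

Under these assumptions both inserts fall inside the first incremental window. If the incremental schedule started after the full copy began (say 2:30 PM), the 2:10 PM insert would fall outside the window and could be lost if the full copy also missed it.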

Comments (1)

  • mannem

    timeseries huh ?
