Posts Tagged ‘targetSize’

EMR s3DistCp “–groupBy” Regex examples

Written by mannem on . Posted in EMR || Elastic Map Reduce

Sometimes the regex could be confusing on S3DistCp groupBy. I usually use some online regex tools like https://regex101.com/ to better work with string matching and grouping.

Here are some examples that I explored so far :

Example 1 :

Example 2 :

s3-dist-cp –src s3://support.elasticmapreduce/training/datasets/gdelt/ –dest hdfs:///gdeltWrongOutput1/ –groupBy ‘.*(\d{6}).*’

This command would not merge any files but copy all files with 6 numbers like 20130401.export.CSV.gz to destination.

Example 3 :

http://stackoverflow.com/questions/38374107/how-to-emr-s3distcp-groupby-properly

Example 4 :

If you want to concatenate matching files in the root directory and and all matching files inside a ‘sample_directory’ into a single file and compress that in gzip format. on http://regexr.com/3ftn9 will concatenate all matched file contents and creates one .gz file

  • cloudformation

    cloudformation

    pipeline

    Data-pipelines

    directoryservice

    directoryservicez

    cloudtrail

    cloudtrail

    config

    config

    trustedadvisor

    Trustedadvisor

  • snap

    Snapshot

    glacier

    Glacie

    storagegw

    Storage Gatewa

    s3

    S3

    cloudFront

    Cloud Front

  • r53

    Route 53

    lambda

    lambd

    directConnect

    DirectConnect

    vpc

    VPC

    kinesis

    Kinesis

    emr

    Emr

  • sns

    SNS

    transcoder

    Transcoder

    sqs

    SQS

    cloudsearch

    Cloud Search

    appstream

    App Stream

    ses

    SES

  • opsworks

    opsworks

    cloudwatch

    Cloud Watch

    beanstalk

    Elastic Beanstalk

    codedeploy

    Code Deploy

    IAM

    IAM

  • dynamodb

    dynamodb

    rds

    RDS

    elasticache

    ElastiCache

    redshift

    Redshift

    simpledb

    simpledb