EMR s3DistCp "--groupBy" regex examples
The regex for the S3DistCp groupBy option can sometimes be confusing. I usually use an online regex tool like https://regex101.com/ to work out the string matching and grouping.
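Besides the online tools, it can help to simulate the grouping locally before running a job. Below is a minimal Python sketch, not s3DistCp itself: it assumes that files whose concatenated regex capture groups are equal get merged into one output file named after that captured text, which is how the EMR docs describe groupBy, and that Python's re engine behaves like Java's for patterns this simple. The two file names are taken from Example 1 below.

import re
from collections import defaultdict

def simulate_group_by(pattern, names):
    """Bucket names by the concatenation of their regex capture groups."""
    groups = defaultdict(list)
    regex = re.compile(pattern)
    for name in names:
        m = regex.match(name)
        if m:  # files that do not match the pattern are not copied at all
            groups[''.join(m.groups())].append(name)
    return dict(groups)

names = ['20130401.export.CSV.gz', '20130402.export.CSV.gz']
print(simulate_group_by(r'.*(2013).*[0-9]+[0-9]+[0-9]+[0-9]+.*', names))
# {'2013': ['20130401.export.CSV.gz', '20130402.export.CSV.gz']}
# -> a single group key, so both files would be merged into one '2013' file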
Here are some examples that I have explored so far:
Example 1:
aws s3 ls s3://support.elasticmapreduce/training/datasets/gdelt/
..
..
2015-01-16 17:12:37    1729157 20130401.export.CSV.gz
2015-01-16 17:12:11    2194092 20130402.export.CSV.gz
2015-01-16 17:13:42    2520540 20130403.export.CSV.gz
2015-01-16 17:13:16    2438567 20130404.export.CSV.gz
2015-01-16 17:12:17    2288289 20130405.export.CSV.gz
2015-01-16 17:13:21    1324743 20130406.export.CSV.gz
2015-01-16 17:12:11    1532151 20130407.export.CSV.gz
2015-01-16 17:12:52    2377993 20130408.export.CSV.gz
2015-01-16 17:12:59    2599953 20130409.export.CSV.gz
..
..

// Suppose you want to merge (groupBy) all files beginning with 2013,
// whose next 4 characters are digits 0-9, with any extension after that,
// and create 1 GB files out of them. You can use a command like:

s3-dist-cp --src=s3://support.elasticmapreduce/training/datasets/gdelt/ --dest=/gdelt/ --groupBy='.*(2013).*[0-9]+[0-9]+[0-9]+[0-9]+.*' --targetSize=1024

// Note the parentheses around 2013 in the regex: all files containing that
// captured string get merged together, while the rest of the regex is used
// only for matching.

// The output would be:

hdfs dfs -ls -h /gdelt/
-rw-r--r--   1 hadoop hadoop     1024 M 2016-09-15 18:51 /gdelt/2013.gz
-rw-r--r--   1 hadoop hadoop     1024 M 2016-09-15 18:51 /gdelt/20131.gz
-rw-r--r--   1 hadoop hadoop     1024 M 2016-09-15 18:51 /gdelt/20132.gz
..
..
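As a side note, the same command can also be submitted as a step on a running EMR cluster, for example with boto3. This is a sketch: the cluster id is a placeholder, and command-runner.jar is the standard step runner on EMR 4.x and later.

import boto3

emr = boto3.client('emr', region_name='us-east-1')

# Submit the Example 1 command as an EMR step;
# 'j-XXXXXXXXXXXXX' is a placeholder for a real cluster id.
emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',
    Steps=[{
        'Name': 's3DistCp groupBy example',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                's3-dist-cp',
                '--src=s3://support.elasticmapreduce/training/datasets/gdelt/',
                '--dest=/gdelt/',
                '--groupBy=.*(2013).*[0-9]+[0-9]+[0-9]+[0-9]+.*',
                '--targetSize=1024',
            ],
        },
    }],
)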
Example 2:
s3-dist-cp --src s3://support.elasticmapreduce/training/datasets/gdelt/ --dest hdfs:///gdeltWrongOutput1/ --groupBy '.*(\d{6}).*'
This command would not merge any files: the capture group (\d{6}) matches a different 6-digit string for each file (e.g. 130401, 130402), so every file like 20130401.export.CSV.gz lands in its own group and is simply copied to the destination.
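A quick check in Python shows why (Python's greedy matching mirrors Java's here): the engine backtracks only as far as it must, so the capture group grabs the last 6 digits of each name, and every file ends up with its own group key.

import re

pattern = re.compile(r'.*(\d{6}).*')
for name in ['20130401.export.CSV.gz', '20130402.export.CSV.gz']:
    # a different captured key per file -> nothing gets merged
    print(name, '->', pattern.match(name).group(1))
# 20130401.export.CSV.gz -> 130401
# 20130402.export.CSV.gz -> 130402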
Example 3:

See this Stack Overflow discussion on using the s3DistCp groupBy option properly: http://stackoverflow.com/questions/38374107/how-to-emr-s3distcp-groupby-properly
Example 4:

Suppose you want to concatenate the matching files in the root directory and all matching files inside the 'noname' directory into a single file, compressed in gzip format. The groupBy regex below (tested on http://regexr.com/3ftn9) concatenates all the matched file contents and creates one .gz file; a sanity check on the group keys follows the listing.
[hadoop@ip-172-31-31-253 ~]$ hadoop fs -ls -R s3://MyBucket/DailyUpdatedStocksCSV/
-rw-rw-rw-   1 hadoop hadoop      11144 2017-05-08 17:33 s3://MyBucket/DailyUpdatedStocksCSV/abc_output.txt
-rw-rw-rw-   1 hadoop hadoop         72 2017-05-08 17:35 s3://MyBucket/DailyUpdatedStocksCSV/file1.csv
-rw-rw-rw-   1 hadoop hadoop         72 2017-05-08 17:35 s3://MyBucket/DailyUpdatedStocksCSV/file2.csv
-rw-rw-rw-   1 hadoop hadoop        102 2017-05-08 17:35 s3://MyBucket/DailyUpdatedStocksCSV/file3.csv
drwxrwxrwx   - hadoop hadoop          0 1970-01-01 00:00 s3://MyBucket/DailyUpdatedStocksCSV/noname
-rw-rw-rw-   1 hadoop hadoop         72 2017-05-08 17:35 s3://MyBucket/DailyUpdatedStocksCSV/noname/file1.csv
-rw-rw-rw-   1 hadoop hadoop         72 2017-05-08 17:35 s3://MyBucket/DailyUpdatedStocksCSV/noname/file2.csv
-rw-rw-rw-   1 hadoop hadoop        102 2017-05-08 17:35 s3://MyBucket/DailyUpdatedStocksCSV/noname/file3.csv
drwxrwxrwx   - hadoop hadoop          0 1970-01-01 00:00 s3://MyBucket/DailyUpdatedStocksCSV/sample_folder
-rw-rw-rw-   1 hadoop hadoop      11144 2017-05-08 17:32 s3://MyBucket/DailyUpdatedStocksCSV/sample_folder/2015-05-19
-rw-rw-rw-   1 hadoop hadoop      11362 2017-05-08 17:32 s3://MyBucket/DailyUpdatedStocksCSV/sample_folder/2015-05-20
-rw-rw-rw-   1 hadoop hadoop      11362 2017-05-08 17:33 s3://MyBucket/DailyUpdatedStocksCSV/sample_output.txt

s3-dist-cp --src s3://MyBucket/DailyUpdatedStocksCSV/ --dest /test3/ --groupBy='.*(file|noname).*[0-9].*' --outputCodec=gz

[hadoop@ip-172-31-31-253 ~]$ hadoop fs -ls -R /test3/
drwxr-xr-x   - hadoop hadoop          0 2017-05-08 18:28 /test3/noname
-rw-r--r--   1 hadoop hadoop         90 2017-05-08 18:28 /test3/noname/file.gz
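To see why everything collapses into a single file.gz, here is a sketch of the group key each object maps to; matching against the path is my assumption about how s3DistCp applies the pattern, and Python's re mirrors Java's greedy matching here. The rightmost alternative that still leaves a digit to match is the 'file' in 'fileN.csv', so all six CSV files share the key 'file', while the sample_folder objects match neither alternative and are not copied.

import re

pattern = re.compile(r'.*(file|noname).*[0-9].*')
paths = [
    'DailyUpdatedStocksCSV/file1.csv',
    'DailyUpdatedStocksCSV/noname/file2.csv',
    'DailyUpdatedStocksCSV/sample_folder/2015-05-19',
]
for path in paths:
    m = pattern.match(path)
    print(path, '->', m.group(1) if m else 'no match (not copied)')
# DailyUpdatedStocksCSV/file1.csv -> file
# DailyUpdatedStocksCSV/noname/file2.csv -> file
# DailyUpdatedStocksCSV/sample_folder/2015-05-19 -> no match (not copied)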