Querying DynamoDB export data using Athena (with LIST attribute)
Presto supports multiple Array and JSON functions using which you can write queries to get the required results. There is no single way to define a CREATE TABLE and later your QUERIES. You can have CREATE TABLE predefine the schema structure and later your queries can refer the elements you need in the schema. Or you can define your schema as a string and later use functions on your queries to parse this string to get the required results.
https://prestodb.io/docs/current/functions/array.html
https://prestodb.io/docs/current/functions/json.html
In addition, Athena has some examples to use most of the presto functions :
http://docs.aws.amazon.com/athena/latest/ug/querying-arrays.html
http://docs.aws.amazon.com/athena/latest/ug/rows-and-structs.html
http://docs.aws.amazon.com/athena/latest/ug/querying-JSON.html
Now, I was able to use the following CREATE TABLE syntax on the DynamoDB items having List of strings. I was able to flatten out the list of strings using some of the functions like CAST. Please note that this is not the only way to define table and query. My query might be over complexing what your are trying to get. There might be a simpler way as well. So, its really important that you understand the data types that you define and what each query returns and use the correct functions as each function takes a datatype and returns another datatype.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 |
------------------------------------------------- Data on S3(Exported from DynamoDB table) : {"color":{"s":"yellow"},"damageShapes":{"l":[{"s":"991"},{"s":"vvv"},{"s":"qqq"},{"s":"mmmfgh"}]},"time":{"n":"77"},"age":{"s":"22"}} {"color":{"s":"green"},"damageShapes":{"l":[{"s":"8091"},{"s":"2728"},{"s":"ddf"},{"s":"dfdf"}]},"time":{"n":"34"},"age":{"s":"22"}} Same data with JSON Pretty print on one Item : { "age": { "S": "22" }, "color": { "S": "yellow" }, "damageShapes": { "L": [ { "S": "991" }, { "S": "vvv" }, { "S": "qqq" }, { "S": "mmmfgh" } ] }, "time": { "N": "77" } } ------------------------------------------------- # Create table with required data types as per the structure of data. # Supported data types of Athena : http://docs.aws.amazon.com/athena/latest/ug/ddl/create-table.html # damageShapes can take different structures in schema ex : damageShapes struct<L:array<struct<S:string>>> or damageShapes map<string,array<map<string,string>>> or struct<L:array<map<string,string>>> . # Or if may be a simple string so that you need to rely on queries to parse this string. CREATE EXTERNAL TABLE IF NOT EXISTS ddb ( color struct<s:string>, time struct<n:string>, age struct<s:string>, damageShapes struct<L:array<struct<S:string>>> ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true') LOCATION 's3://bucket.ne/json-ddb/' ------------------------------------------------- select * from ddb; color time age damageshapes {s=yellow} {n=77} {s=22} {l=[{s=991}, {s=vvv}, {s=qqq}, {s=mmmfgh}]} {s=green} {n=34} {s=22} {l=[{s=8091}, {s=2728}, {s=ddf}, {s=dfdf}]} # Get the list of strings of damageshapes[list] DynamoDB attribute. # From presto perspective damageshapes.l is array<row<K, V>>) data type. i.e array(row(s varchar)) select damageshapes.l from ddb; [{s=991}, {s=vvv}, {s=qqq}, {s=mmmfgh}] [{s=8091}, {s=2728}, {s=ddf}, {s=dfdf}] # Get first row of a damageshapes.l array select damageshapes.l[1] from ddb; {s=991} {s=8091} # Get value(with key s) of first row of a damageshapes.l array. select damageshapes.l[1].s from ddb; 991 8091 # Since damageshapes.l is array(row(s varchar)) , you can find any presto functions which can flatten this to your required format which is array<varchar>. If there is not direct function, you might need to do 2 conversions. # casting the damageshapes.l [array(row(s varchar))] to JSON automatically strips all keys(S) select cast(damageshapes.l as json) as values_of_damageshapes , color.s as color from ddb; values_of_damageshapes color [["991"],["vvv"],["qqq"],["mmmfgh"]] yellow [["8091"],["2728"],["ddf"],["dfdf"]] green # values_of_damageshapes is in JSON format and you can use JSON functions to get the necessary output. # https://prestodb.io/docs/current/functions/json.html ------------------------------------------------- # Here I am converting values_of_damageshapes(json) to ARRAY(ARRAY(varchar)))) so that I use an array function called flatten which Flattens an array(array(T)) to an array(T) by concatenating the contained arrays. # https://prestodb.io/docs/current/functions/array.html select flatten(cast(values_of_damageshapes as ARRAY(ARRAY(varchar)))) , Actualtime from (select cast(damageshapes.l as json) as values_of_damageshapes , color.s as Actualcolor , time.n as Actualtime from ddb); _col0 Actualtime [991, vvv, qqq, mmmfgh] 77 [8091, 2728, ddf, dfdf] 34 ------------------------------------------------- |