Posts Tagged ‘installation’

spark-redshift library from databricks – Installation on EMR and using InstanceProfile

Written by mannem on . Posted in EMR || Elastic Map Reduce

ear earearspark-redshift is a library to load data into Spark SQL DataFrames from Amazon Redshift, and write them back to Redshift tables. Amazon S3 is used to efficiently transfer data in and out of Redshift, and JDBC is used to automatically trigger the appropriate COPY and UNLOAD commands on Redshift.This library is more suited to ETL than interactive queries, since large amounts of data could be extracted to S3 for each query execution

spark-redshift installation instructions on EMR:
Steps 1-8 shows how to compile your own spark-redshift package(JAR). You can directly skip to step 9 if you wish to use pre-compiled JAR’s(V 0.6.1). Later we use spark-shell to invoke these JAR’s and run scala code to query Redshift table and put contents into a dataframe. This guide assumes you had followed github page https://github.com/databricks/spark-redshift and its tutorial.

Download pre-req’s to compile the your own JAR’s

————————————————-

If you wanna skip above steps , you can download the pre-compiled JAR’s using –

wget https://s3.amazonaws.com/emrsupport/spark/spark-redshift-databricks/minimal-json-0.9.5-SNAPSHOT.jar

wget https://s3.amazonaws.com/emrsupport/spark/spark-redshift-databricks/spark-redshift_2.10-0.6.1-SNAPSHOT.jar

With these JAR’s you can skip all above options (1-8)

————————————————-

10. In SCALA shell, you can run the following commands to init SQLContext. Note that the below code automatically uses IAM role’s (Instance profile ) cred’s to authenticate against S3

Sample Scala code uses Instance profile

————————————————-


Sample spark-sql invocation

spark-sql --jars RedshiftJDBC41-1.1.13.1013.jar,spark-redshift_2.10-0.6.1-SNAPSHOT.jar,minimal-json-0.9.5-SNAPSHOT.jar


Note that while running the above commands , the spark-redshift executes a COPY/Unload with a command like below:

These temporary credentials are from IAM role’s (EMR_EC2_DefaultRole) and this role should have policy that allows S3 access to atleast temp bucket mentioned in the command.


Search DynamoDB tables using Elasticsearch/Kibana via Logstash plugin

Written by mannem on . Posted in Dynamo DB, Elasticsearch


The Logstash plugin for Amazon DynamoDB gives you a nearly real-time view of the data in your DynamoDB table. The Logstash plugin for DynamoDB uses DynamoDB Streams to parse and output data as it is added to a DynamoDB table. After you install and activate the Logstash plugin for DynamoDB, it scans the data in the specified table, and then it starts consuming your updates using Streams and then outputs them to Elasticsearch, or a Logstash output of your choice.

Logstash is a data pipeline service that processes data, parses data, and then outputs it to a selected location in a selected format. Elasticsearch is a distributed, full-text search server. For more information about Logstash and Elasticsearch, go to https://www.elastic.co/products/elasticsearch.

Amazon Elasticsearch Service is a managed service that makes it easy to deploy, operate, and scale Elasticsearch in the AWS Cloud. aws.amazon.com/elasticsearch-service/


This article includes an installation guide that is tested on EC2 instance where all the per-requsites are installed and Logstash is configured so that it connects to Amazon ElasticSearch using the input/Output plugins to start indexing records from DynamoDB. Click here to get all the instructions :
https://github.com/mannem/logstash-input-dynamodb


Logstash configuration:

After running a similar command on the shell, Logstash should successfully start and begin indexing the records from your DynamoDB table.


Throughput considerations:


Kibana:


References:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Tools.DynamoDBLogstash.html

https://aws.amazon.com/blogs/aws/new-logstash-plugin-search-dynamodb-content-using-elasticsearch/

https://github.com/awslabs/logstash-input-dynamodb


Similar plugins:

https://github.com/kzwang/elasticsearch-river-dynamodb