I aggregate click events into .csv files, upload them to S3, and then run a COPY to load them into the clicks table. The alternative is to skip S3 and insert the rows directly from my application code (a SQL sketch of both strategies follows below).

I've done some tests (on a clicks table that already had 2 million rows) comparing the two strategies:

| multi-row insert strategy | S3 COPY strategy |
| --- | --- |
| insert query | upload to S3 + COPY query |

As you can see, in terms of performance, it looks like I gain nothing by first copying the data to S3: the upload + COPY time is equal to the insert time.

It looks like the preferred way is COPYing from S3 with unique object keys (each .csv file getting its own unique name):

- PROS: looks like the good practice from the docs.
- CONS: more work (I have to manage buckets and manifests and a cron that triggers the COPY commands).

Alternatively, I can call an insert query from my application code:

- CONS: doesn't look like a standard way of importing data.

In the documentation about consistency, there is no mention of loading the data via multi-row inserts.

What are the advantages and drawbacks of each approach? What is the best practice? Did I miss anything?

And a side question: is it possible for Redshift to COPY the data automatically from S3 via a manifest, i.e. to COPY the data as soon as new .csv files are added to S3? Doc here and here. Or do I have to create a background worker myself to trigger the COPY commands?
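To make the comparison concrete, here is roughly what the two strategies look like. This is a minimal sketch: the table, columns, bucket name, and credentials are all hypothetical.

```sql
-- Strategy 1: multi-row INSERT issued straight from application code.
INSERT INTO clicks (user_id, page_url, clicked_at) VALUES
    (101, '/home',    '2014-06-01 10:00:00'),
    (102, '/pricing', '2014-06-01 10:00:01'),
    (103, '/signup',  '2014-06-01 10:00:02');

-- Strategy 2: upload a .csv to S3, then load it with COPY.
COPY clicks
FROM 's3://my-bucket/clicks/2014-06-01-1000.csv'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
CSV;
```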
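As for the side question: a manifest is just a JSON file in S3 listing the objects a single COPY should load; Redshift does not watch the bucket on its own, so something (a cron job or background worker, as the question suspects) still has to issue the COPY. A sketch, with hypothetical bucket and file names:

```sql
-- Hypothetical manifest at s3://my-bucket/clicks.manifest:
-- {"entries": [
--   {"url": "s3://my-bucket/clicks/part-001.csv", "mandatory": true},
--   {"url": "s3://my-bucket/clicks/part-002.csv", "mandatory": true}
-- ]}
COPY clicks
FROM 's3://my-bucket/clicks.manifest'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
MANIFEST
CSV;
```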
Redshift is an analytical DB, and it is optimized to let you query millions and billions of records. It is also optimized to let you ingest these records very quickly into Redshift, using the COPY command. COPY can load data from several sources:

- an Amazon S3 data lake that stores cold legacy data in structured and unstructured formats;
- a DynamoDB table, Amazon's NoSQL database of key-value and document data structures;
- an EMR cluster (Amazon Elastic MapReduce), a big data platform that uses open-source frameworks like Spark, Hive, and Presto to process data;
- a remote host through secure shell (SSH) connectivity.

The COPY command is designed to load multiple files in parallel into the multiple nodes of the cluster. For example, if you have a 5 small node (dw2.xl) cluster, you can copy data 10 times faster if your data is split into multiple files (20, for example). There is a balance between the number of files and the number of records in each file, as each file has some small overhead. This should lead you to a balance between the frequency of the COPY (for example, every 5 or 15 minutes rather than every 30 seconds) and the size and number of the event files. A split-file COPY is sketched below.

Another point to consider is the two types of Redshift nodes: the SSD ones (dw2.xl and dw2.8xl) and the magnetic ones (dw1.xl and dw1.8xl). The SSD ones are faster in terms of ingestion as well. Since you are looking for very fresh data, you probably want to run on the SSD ones, which are usually also lower cost for less than 500GB of compressed data. If over time you accumulate more than 500GB of compressed data, you can consider running two different clusters: one for "hot" data on SSD, holding the data of the last week or month, and one for "cold" data on magnetic disks, holding all your historical data.

Lastly, you don't really need to upload the data into S3 at all, and that upload is the major part of your ingestion timing. You can copy the data directly from your servers using the SSH COPY option, sketched at the end. If you are able to split your Redis queues across multiple servers, or at least into multiple queues with different log files, you can probably reach a very good records-per-second ingestion rate.

Another pattern you may want to consider for near-real-time analytics is Amazon Kinesis, the streaming service. It lets you run analytics on the data with a delay of seconds, and at the same time prepare the data for copying into Redshift in a more optimized way.
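Here is a sketch of the split-file pattern described above: when COPY is given a key prefix rather than a single object, it loads every matching file in parallel across the nodes of the cluster. The bucket, prefix, and credentials are hypothetical.

```sql
-- Upload each interval's events as multiple compressed parts, e.g.
--   s3://my-bucket/clicks/2014-06-01-1000/part-00 ... part-19,
-- then load them all with a single COPY on the shared prefix.
COPY clicks
FROM 's3://my-bucket/clicks/2014-06-01-1000/part-'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
GZIP
CSV;
```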
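And a sketch of the SSH COPY option mentioned above: a small SSH manifest stored in S3 tells Redshift which hosts to connect to and which command to run on each, and the output of that command becomes the data to load. The hosts, commands, and file names here are hypothetical.

```sql
-- Hypothetical SSH manifest at s3://my-bucket/ssh_manifest:
-- {"entries": [
--   {"endpoint": "queue1.example.com",
--    "command": "cat /var/log/clicks/latest.csv",
--    "mandatory": true,
--    "username": "redshift"},
--   {"endpoint": "queue2.example.com",
--    "command": "cat /var/log/clicks/latest.csv",
--    "mandatory": true,
--    "username": "redshift"}
-- ]}
COPY clicks
FROM 's3://my-bucket/ssh_manifest'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
SSH
DELIMITER ',';
```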