Amazon Kinesis & S3 & Redshift Backend
This deployment type uses Amazon's big data services, which are easy to scale and manage. We deploy our services using Beanstalk and coordinate between the S3, Kinesis, Data Pipeline and Redshift products using Rakam. As a result, we get a fault-tolerant, scalable and cost-effective analytics service.
First, we deploy the Rakam Collection API services to Amazon EC2 instances using Beanstalk; these services serialize data and send it to Amazon Kinesis. Similar to Apache Kafka in the Apache Kafka & PrestoDB deployment type, Kinesis is a real-time stream processing engine that acts as the commit log in our architecture. In the Collection APIs, events are converted to a compact data structure using Apache Avro by Rakam and then passed to the EventStore implementation, which serializes that data structure and sends the byte array to Kinesis partitions based on their event collections.
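As a rough sketch of this flow (not Rakam's actual EventStore implementation), the example below serializes a single event with Apache Avro and puts the resulting byte array into a Kinesis stream, using the collection name as the partition key. The stream name, schema and field names are assumptions made for the example.

```java
import com.amazonaws.services.kinesis.AmazonKinesisClient;
import com.amazonaws.services.kinesis.model.PutRecordRequest;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class KinesisEventSink {
    private final AmazonKinesisClient kinesis = new AmazonKinesisClient();

    // Hypothetical schema for a "pageview" collection; real schemas come from project metadata.
    private static final Schema SCHEMA = SchemaBuilder.record("pageview").fields()
            .requiredString("user_id")
            .requiredLong("time")
            .endRecord();

    public void store(String collection, String userId, long time) throws IOException {
        GenericRecord event = new GenericData.Record(SCHEMA);
        event.put("user_id", userId);
        event.put("time", time);

        // Serialize the event into a compact Avro byte array.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(event, encoder);
        encoder.flush();

        // The collection name is used as the partition key so events of the same
        // collection end up in the same Kinesis partition.
        kinesis.putRecord(new PutRecordRequest()
                .withStreamName("rakam-events")   // stream name is an assumption
                .withPartitionKey(collection)
                .withData(ByteBuffer.wrap(out.toByteArray())));
    }
}
```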
The other part of the application, which also runs on Beanstalk, consumes the data in Kinesis in parallel and stores it in S3 in batches using a buffering technique. The data stored in S3 is in uncompressed CSV format, so it is not efficient for querying. We also use an Amazon Data Pipeline job that copies data from S3 to Amazon Redshift, a distributed columnar database that stores data in a format efficient for complex aggregation queries. It periodically (usually every few hours) feeds Redshift with the new event dataset and allows us to execute SQL queries on that dataset efficiently.
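A minimal sketch of the buffering idea on the consumer side, assuming an in-memory buffer that is flushed to S3 as a single object once it grows past a threshold; the bucket name, key layout and threshold are made up for the example.

```java
import com.amazonaws.services.kinesis.model.Record;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ObjectMetadata;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.UUID;

public class S3BufferedWriter {
    private static final int FLUSH_THRESHOLD_BYTES = 5 * 1024 * 1024; // threshold is an assumption

    private final AmazonS3Client s3 = new AmazonS3Client();
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    // Called with each batch of records fetched from a Kinesis shard.
    public void process(List<Record> records) throws IOException {
        for (Record record : records) {
            buffer.write(record.getData().array());
        }
        if (buffer.size() >= FLUSH_THRESHOLD_BYTES) {
            flush();
        }
    }

    // Writes the buffered batch as a single S3 object; the Data Pipeline job later
    // copies these objects from S3 into Redshift.
    private void flush() {
        byte[] payload = buffer.toByteArray();
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(payload.length);
        s3.putObject("rakam-event-batches",                 // bucket name is an assumption
                "events/" + UUID.randomUUID() + ".csv",
                new ByteArrayInputStream(payload), metadata);
        buffer.reset();
    }
}
```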
Amazon Redshift is used internally by the Analysis API to execute SQL queries over the event dataset. However, we re-write queries before executing them on the Redshift query engine in order to prevent users from executing queries on projects they are not authorized to access. We use the PrestoDB parser written in ANTLR and rewrite queries using the AST it generates. Similar to the other deployment types, continuous query tables use the continuous prefix and materialized query tables use the materialized prefix. Once the query is re-written, Rakam sends it directly to Redshift.
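The sketch below illustrates the general idea with the PrestoDB parser: parse the query into an AST and walk it, rejecting (or mapping) table references that fall outside the caller's project. The project-schema naming convention is an assumption for the example, the real rewriter in Rakam is more involved, and the parser API differs slightly between Presto versions.

```java
import com.facebook.presto.sql.parser.SqlParser;
import com.facebook.presto.sql.tree.DefaultTraversalVisitor;
import com.facebook.presto.sql.tree.Statement;
import com.facebook.presto.sql.tree.Table;

public class ProjectQueryValidator {
    private final SqlParser parser = new SqlParser();

    // Throws if the query references a table outside the given project's schema.
    public Statement parseAndCheck(String project, String sql) {
        Statement statement = parser.createStatement(sql);

        new DefaultTraversalVisitor<Void, Void>() {
            @Override
            protected Void visitTable(Table node, Void context) {
                String name = node.getName().toString();
                // Convention assumed for this example: a project's tables live in a
                // schema named after the project, e.g. "myproject.pageview".
                if (name.contains(".") && !name.startsWith(project + ".")) {
                    throw new IllegalArgumentException("Access denied to table: " + name);
                }
                return null;
            }
        }.process(statement, null);

        return statement;
    }
}
```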
Unfortunately, unlike PostgreSQL, Redshift does not support materialized views, so we use CREATE TABLE AS to materialize query results. Materialized query tables use a special prefix and are stored in the same schema as the event collections. The query parser rewrites the table names to the internal names of the materialized query tables. When a materialized query table is refreshed, we delete the whole table and run the same CREATE TABLE AS query to materialize the latest snapshot.
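A minimal sketch of that refresh cycle over JDBC (Redshift speaks the PostgreSQL wire protocol, so the standard driver works); the naming convention and connection handling here are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class MaterializedTableRefresher {
    // Re-materializes a query result under the "materialized" prefix.
    // The naming convention and connection URL are assumptions for this example.
    public void refresh(String jdbcUrl, String project, String tableName, String query)
            throws SQLException {
        String target = project + ".materialized_" + tableName;

        try (Connection connection = DriverManager.getConnection(jdbcUrl);
             Statement statement = connection.createStatement()) {
            // Redshift has no materialized views, so remove the previous snapshot and
            // re-run the same CREATE TABLE AS query to materialize the latest one.
            statement.execute("DROP TABLE IF EXISTS " + target);
            statement.execute("CREATE TABLE " + target + " AS " + query);
        }
    }
}
```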
Currently, this deployment type does not support continuous query tables. We're in the process of evaluating possible solutions. Processing events in Kinesis partitions and feeding Amazon DynamoDB with the pre-aggregated results of SQL reports is the only way AWS offers right now, but it's not that flexible: to make this feature interactive we would have to implement a SQL layer on top of Amazon DynamoDB, which requires too much effort. The other alternative we specifically like is Samza, since it plans to support a high-level language similar to SQL, but it's still experimental. The related issue is #33; you're welcome to share your thoughts or contribute to the project.
Customer Analytics Module api-doc
Currently, you need to store user data in Redshift in order to be able to JOIN the event dataset with user data. You can use the Redshift backend for the user database since it's actually based on PostgreSQL 8.0.2, but currently we don't know how to efficiently join the event dataset in Redshift with an external user database. We could either extract user identifier keys from the event dataset or the user database and do the work in application code, or use Hadoop on Amazon EMR to join the data in Amazon S3 with the external user storage, but neither of these seems like a good solution. The related issue is #34; you're welcome to share your thoughts or contribute to the project.
CRM module (Customer Mailbox) api-doc
This deployment type does not offer any solution for the CRM module. We suggest PostgreSQL for this purpose: you can keep Redshift as the event database and use PostgreSQL for the CRM module. You just need to configure the settings in the config.properties file.
Real-time Analytics Module api-doc
There's nothing fancy here; it uses continuous query tables under the hood. Since continuous query tables are not yet supported in this deployment type, this module is not supported either.
Event Stream Module api-doc
When a user subscribes to an event stream, we automatically subscribe to the Kinesis stream for the subscribed event collections.
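A bare-bones illustration of tailing a Kinesis shard for newly appended records; in Rakam the decoded events are additionally filtered by the subscribed collections and pushed to the API client. The polling interval and record limit are assumptions.

```java
import com.amazonaws.services.kinesis.AmazonKinesisClient;
import com.amazonaws.services.kinesis.model.GetRecordsRequest;
import com.amazonaws.services.kinesis.model.GetRecordsResult;
import com.amazonaws.services.kinesis.model.GetShardIteratorRequest;
import com.amazonaws.services.kinesis.model.Record;
import com.amazonaws.services.kinesis.model.ShardIteratorType;

public class KinesisTail {
    private final AmazonKinesisClient kinesis = new AmazonKinesisClient();

    // Reads records appended to a shard after the subscription was created.
    public void tail(String streamName, String shardId) throws InterruptedException {
        String iterator = kinesis.getShardIterator(new GetShardIteratorRequest()
                .withStreamName(streamName)
                .withShardId(shardId)
                .withShardIteratorType(ShardIteratorType.LATEST)).getShardIterator();

        while (iterator != null) {
            GetRecordsResult result = kinesis.getRecords(new GetRecordsRequest()
                    .withShardIterator(iterator)
                    .withLimit(100));
            for (Record record : result.getRecords()) {
                // In Rakam, the Avro payload would be decoded here and forwarded to
                // subscribers of the matching event collection.
                System.out.println("received " + record.getData().remaining() + " bytes");
            }
            iterator = result.getNextShardIterator();
            Thread.sleep(1000); // avoid hammering the shard
        }
    }
}
```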