Amazon Kinesis & S3 & Redshift Backend
This deployment type uses Amazon's big data services, which are easy to scale and manage. We deploy our services using Beanstalk and coordinate between the S3, Kinesis, Data Pipeline and Redshift products using Rakam. As a result, we get a fault-tolerant, scalable and cost-effective analytics service.
First, we deploy the Rakam Collection API services to Amazon EC2 instances using Beanstalk; these services serialize data and send it to Amazon Kinesis. Similar to Apache Kafka in the Apache Kafka & PrestoDB deployment type, Kinesis is a real-time stream processing engine that acts as the commit log in our architecture. In the Collection APIs, events are converted to a compact data structure using Apache Avro by Rakam and then passed to the EventStore implementation, which serializes that data structure and sends the byte array to Kinesis partitions based on their event collections.
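As a rough sketch of this flow (not Rakam's actual EventStore implementation), the example below serializes a single event with Apache Avro and puts the resulting byte array into a Kinesis stream, using the collection name as the partition key. The stream name, schema and field names are assumptions made for the example.

```java
import com.amazonaws.services.kinesis.AmazonKinesisClient;
import com.amazonaws.services.kinesis.model.PutRecordRequest;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class KinesisEventSink {
    private final AmazonKinesisClient kinesis = new AmazonKinesisClient();

    // Hypothetical schema for a "pageview" collection; real schemas come from project metadata.
    private static final Schema SCHEMA = SchemaBuilder.record("pageview").fields()
            .requiredString("user_id")
            .requiredLong("time")
            .endRecord();

    public void store(String collection, String userId, long time) throws IOException {
        GenericRecord event = new GenericData.Record(SCHEMA);
        event.put("user_id", userId);
        event.put("time", time);

        // Serialize the event into a compact Avro byte array.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(event, encoder);
        encoder.flush();

        // The collection name is used as the partition key so events of the same
        // collection end up in the same Kinesis partition.
        kinesis.putRecord(new PutRecordRequest()
                .withStreamName("rakam-events")   // stream name is an assumption
                .withPartitionKey(collection)
                .withData(ByteBuffer.wrap(out.toByteArray())));
    }
}
```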
The other part of the application, which also runs on Beanstalk, consumes the data in Kinesis in parallel and stores it in S3 in batches using a buffering technique. The data stored in S3 is in uncompressed CSV format, so it is not efficient for querying. We also use an Amazon Data Pipeline job that copies data from S3 to Amazon Redshift, a distributed columnar database that stores data in a format efficient for complex aggregation queries. It periodically (usually every few hours) feeds Redshift with the new event dataset and allows us to execute SQL queries on that dataset efficiently.
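A minimal sketch of the buffering idea on the consumer side, assuming an in-memory buffer that is flushed to S3 as a single object once it grows past a threshold; the bucket name, key layout and threshold are made up for the example.

```java
import com.amazonaws.services.kinesis.model.Record;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ObjectMetadata;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.UUID;

public class S3BufferedWriter {
    private static final int FLUSH_THRESHOLD_BYTES = 5 * 1024 * 1024; // threshold is an assumption

    private final AmazonS3Client s3 = new AmazonS3Client();
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    // Called with each batch of records fetched from a Kinesis shard.
    public void process(List<Record> records) throws IOException {
        for (Record record : records) {
            buffer.write(record.getData().array());
        }
        if (buffer.size() >= FLUSH_THRESHOLD_BYTES) {
            flush();
        }
    }

    // Writes the buffered batch as a single S3 object; the Data Pipeline job later
    // copies these objects from S3 into Redshift.
    private void flush() {
        byte[] payload = buffer.toByteArray();
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(payload.length);
        s3.putObject("rakam-event-batches",                 // bucket name is an assumption
                "events/" + UUID.randomUUID() + ".csv",
                new ByteArrayInputStream(payload), metadata);
        buffer.reset();
    }
}
```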
Amazon Redshift is used internally by the Analysis API to execute SQL queries over the event dataset. However, we re-write queries before executing them on the Redshift query engine in order to prevent users from executing queries on projects they are not authorized to access. We use the PrestoDB parser written in ANTLR and rewrite queries using the AST it generates. Similar to the other deployment types, continuous query tables use the continuous prefix and materialized query tables use the materialized prefix. Once the query is re-written, Rakam sends it directly to Redshift.
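The sketch below illustrates the general idea with the PrestoDB parser: parse the query into an AST and walk it, rejecting (or mapping) table references that fall outside the caller's project. The project-schema naming convention is an assumption for the example, the real rewriter in Rakam is more involved, and the parser API differs slightly between Presto versions.

```java
import com.facebook.presto.sql.parser.SqlParser;
import com.facebook.presto.sql.tree.DefaultTraversalVisitor;
import com.facebook.presto.sql.tree.Statement;
import com.facebook.presto.sql.tree.Table;

public class ProjectQueryValidator {
    private final SqlParser parser = new SqlParser();

    // Throws if the query references a table outside the given project's schema.
    public Statement parseAndCheck(String project, String sql) {
        Statement statement = parser.createStatement(sql);

        new DefaultTraversalVisitor<Void, Void>() {
            @Override
            protected Void visitTable(Table node, Void context) {
                String name = node.getName().toString();
                // Convention assumed for this example: a project's tables live in a
                // schema named after the project, e.g. "myproject.pageview".
                if (name.contains(".") && !name.startsWith(project + ".")) {
                    throw new IllegalArgumentException("Access denied to table: " + name);
                }
                return null;
            }
        }.process(statement, null);

        return statement;
    }
}
```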
Unfortunately, unlike PostgreSQL, Redshift does not support materialized views, so we use CREATE TABLE AS to materialize query results. Materialized query tables use a special prefix and are stored in the same schema as the event collections. The query parser rewrites the table names to the internal names of the materialized query tables. When a materialized query table is refreshed, we delete the whole table and run the same CREATE TABLE AS query to materialize the latest snapshot.
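A minimal sketch of that refresh cycle over JDBC (Redshift speaks the PostgreSQL wire protocol, so the standard driver works); the naming convention and connection handling here are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class MaterializedTableRefresher {
    // Re-materializes a query result under the "materialized" prefix.
    // The naming convention and connection URL are assumptions for this example.
    public void refresh(String jdbcUrl, String project, String tableName, String query)
            throws SQLException {
        String target = project + ".materialized_" + tableName;

        try (Connection connection = DriverManager.getConnection(jdbcUrl);
             Statement statement = connection.createStatement()) {
            // Redshift has no materialized views, so remove the previous snapshot and
            // re-run the same CREATE TABLE AS query to materialize the latest one.
            statement.execute("DROP TABLE IF EXISTS " + target);
            statement.execute("CREATE TABLE " + target + " AS " + query);
        }
    }
}
```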
Currently, this deployment type does not support continuous query tables. We're in the process of evaluating possible solutions. Processing events in Kinesis partitions and feeding Amazon DynamoDB with the pre-aggregated results of SQL reports is the only way AWS offers right now, but it's not that flexible: to make this feature interactive we would have to implement a SQL layer on top of Amazon DynamoDB, which requires too much effort. The other alternative we specifically like is Samza, since it plans to support a high-level language similar to SQL, but it's still experimental. The related issue is #33; you're welcome to share your thoughts or contribute to the project.
Customer Analytics Module api-doc
Currently, you need to store user data in Redshift in order to be able to JOIN the event dataset with user data. You can use the Redshift backend for the user database since it's actually based on PostgreSQL 8.0.2, but currently we don't know how to efficiently join the event dataset in Redshift with an external user database. We could either extract user identifier keys from the event dataset or the user database and do the work in application code, or use Hadoop on Amazon EMR to join the data in Amazon S3 with the external user storage, but neither of these seems like a good solution. The related issue is #34; you're welcome to share your thoughts or contribute to the project.
CRM module (Customer Mailbox) api-doc
This deployment type does not offer any solution for the CRM module. We suggest PostgreSQL for this purpose: you can keep Redshift as the event database and use PostgreSQL for the CRM module. You just need to configure the settings in the config.properties file.
Real-time Analytics Module api-doc
There's nothing fancy here; it uses continuous query tables under the hood. Since continuous query tables are not yet supported in this deployment type, this module is not supported either.
Event Stream Module api-doc
When a user subscribes to an event stream, we automatically subscribe to the Kinesis stream for the subscribed event collections.
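A bare-bones illustration of tailing a Kinesis shard for newly appended records; in Rakam the decoded events are additionally filtered by the subscribed collections and pushed to the API client. The polling interval and record limit are assumptions.

```java
import com.amazonaws.services.kinesis.AmazonKinesisClient;
import com.amazonaws.services.kinesis.model.GetRecordsRequest;
import com.amazonaws.services.kinesis.model.GetRecordsResult;
import com.amazonaws.services.kinesis.model.GetShardIteratorRequest;
import com.amazonaws.services.kinesis.model.Record;
import com.amazonaws.services.kinesis.model.ShardIteratorType;

public class KinesisTail {
    private final AmazonKinesisClient kinesis = new AmazonKinesisClient();

    // Reads records appended to a shard after the subscription was created.
    public void tail(String streamName, String shardId) throws InterruptedException {
        String iterator = kinesis.getShardIterator(new GetShardIteratorRequest()
                .withStreamName(streamName)
                .withShardId(shardId)
                .withShardIteratorType(ShardIteratorType.LATEST)).getShardIterator();

        while (iterator != null) {
            GetRecordsResult result = kinesis.getRecords(new GetRecordsRequest()
                    .withShardIterator(iterator)
                    .withLimit(100));
            for (Record record : result.getRecords()) {
                // In Rakam, the Avro payload would be decoded here and forwarded to
                // subscribers of the matching event collection.
                System.out.println("received " + record.getData().remaining() + " bytes");
            }
            iterator = result.getNextShardIterator();
            Thread.sleep(1000); // avoid hammering the shard
        }
    }
}
```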