diff --git a/docs/api-reference/failed-events/index.md b/docs/api-reference/failed-events/index.md index 1ee4d25802..a8f959e0d7 100644 --- a/docs/api-reference/failed-events/index.md +++ b/docs/api-reference/failed-events/index.md @@ -3,7 +3,9 @@ title: "Failed event types" sidebar_position: 15 --- -## Where do Failed Events originate? +This page lists all the possible types of [failed events](/docs/fundamentals/failed-events/index.md). + +## Where do failed events originate? While an event is being processed by the pipeline it is checked to ensure it meets the specific formatting or configuration expectations; these include checks like: does it match the schema it is associated with, were Enrichments successfully applied and was the payload sent by the tracker acceptable. @@ -17,7 +19,7 @@ Once the Collector payload successfully reaches the validation and enrichment st ::: -## Schema Violation +## Schema violation This failure type is produced during the process of [validation and enrichment](/docs/pipeline/enrichments/what-is-enrichment/index.md). It concerns the [self-describing events](/docs/fundamentals/events/index.md#self-describing-events) and [entities](/docs/fundamentals/entities/index.md) which can be attached to your snowplow event. @@ -25,14 +27,14 @@ This failure type is produced during the process of [validation and enrichment]( In order for an event to be processed successfully: -1. There must be a schema in an [iglu repository](/docs/api-reference/iglu/iglu-repositories/index.md) corresponding to each self-describing event or entity. The enrichment app must be able to look up the schema in order to validate the event. +1. There must be a schema in an [Iglu repository](/docs/api-reference/iglu/iglu-repositories/index.md) corresponding to each self-describing event or entity. The enrichment app must be able to look up the schema in order to validate the event. 2. Each self-describing event or entity must conform to the structure described in the schema. For example, all required fields must be present, and all fields must be of the expected type. -If your pipeline is generating schema violations, it might mean there is a problem with your tracking, or a problem with your [iglu resolver](/docs/api-reference/iglu/iglu-resolver/index.md) which lists where schemas should be found. The error details in the schema violation JSON object should give you a hint about what the problem might be. +If your pipeline is generating schema violations, it might mean there is a problem with your tracking, or a problem with your [Iglu resolver](/docs/api-reference/iglu/iglu-resolver/index.md) which lists where schemas should be found. The error details in the schema violation JSON object should give you a hint about what the problem might be. -Snowplow BDP customers should check in the Snowplow BDP Console that all data structures are correct and have been [promoted to production](/docs/data-product-studio/data-structures/manage/ui/index.md). Snowplow Community Edition users should check that the Enrichment app is configured with an [iglu resolver file](/docs/api-reference/iglu/iglu-resolver/index.md) that points to a repository containing the schemas. +Snowplow BDP customers should check in the Snowplow BDP Console that all data structures are correct and have been [promoted to production](/docs/data-product-studio/data-structures/manage/ui/index.md). 
Snowplow Community Edition users should check that the Enrichment app is configured with an [Iglu resolver file](/docs/api-reference/iglu/iglu-resolver/index.md) that points to a repository containing the schemas. -Next, check the tracking code in your custom application, and make sure the entities you are sending conform the schema definition. +Next, check the tracking code in your custom application, and make sure the entities you are sending conform to the schema definition. Once you have fixed your tracking, you might want to also [recover the failed events](/docs/data-product-studio/data-quality/failed-events/recovering-failed-events/index.md), to avoid any data loss. @@ -54,7 +56,7 @@ There are many reasons why an enrichment will fail, but here are some examples: - You are using the [IP lookup enrichment](/docs/pipeline/enrichments/available-enrichments/ip-lookup-enrichment/index.md) but have mis-configured the location of the MaxMind database. - You are using the [custom API request enrichment](/docs/pipeline/enrichments/available-enrichments/custom-api-request-enrichment/index.md) but the API server is not responding. - The raw event contained an unstructured event field or a context field which was not valid JSON. -- An iglu server responded with an unexpected error response, so the event schema could not be resolved. +- An Iglu server responded with an unexpected error response, so the event schema could not be resolved. If your pipeline is generating enrichment failures, it might mean there is a problem with your enrichment configuration. The error details in the enrichment failure JSON object should give you a hint about what the problem might be. @@ -66,7 +68,7 @@ Enrichment failure schema can be found [here](https://github.com/snowplow/iglu-c -## Collector Payload Format Violation +## Collector payload format violation This failure type is produced by the [enrichment](/docs/pipeline/enrichments/what-is-enrichment/index.md) application, when Collector payloads from the raw stream are deserialized from thrift format. @@ -87,7 +89,7 @@ Collector payload format violation schema can be found [here](https://github.com -## Adaptor Failure +## Adaptor failure This failure type is produced by the [enrichment](/docs/pipeline/enrichments/what-is-enrichment/index.md) application, when it tries to interpret a Collector payload from the raw stream as a http request from a [3rd party webhook](/docs/sources/webhooks/index.md). @@ -112,7 +114,7 @@ Adapter failure schema can be found [here](https://github.com/snowplow/iglu-cent -## Tracker Protocol Violation +## Tracker protocol violation This failure type is produced by the [enrichment](/docs/pipeline/enrichments/what-is-enrichment/index.md) application, when a http request does not conform to our [Snowplow Tracker Protocol](/docs/sources/trackers/snowplow-tracker-protocol/index.md). @@ -130,7 +132,7 @@ Tracker protocol violation schema can be found [here](https://github.com/snowplo -## Size Violation +## Size violation This failure type can be produced either by the [Collector](/docs/api-reference/stream-collector/index.md) or by the [enrichment](/docs/pipeline/enrichments/what-is-enrichment/index.md) application. It happens when the size of the raw event or enriched event is too big for the output message queue. In this case it will be truncated and wrapped in a size violation failed event instead. 
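+If you have set up the Athena tables described in [Accessing failed events in file storage](/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/file-storage/index.md), a query along these lines shows how far over the limit the truncated payloads were. This is a sketch: the table name `size_violation` and the byte-count fields are assumptions taken from the size violation schema linked below, so check the DDL in the badrows-tables repository for the exact names.
+
+```sql
+-- How big were the oversized events seen in the last 7 days?
+SELECT data.failure.maximumAllowedSizeBytes AS allowed_bytes,
+       MAX(data.failure.actualSizeBytes)    AS largest_event_bytes,
+       COUNT(*)                             AS failures
+FROM size_violation
+WHERE from_iso8601_timestamp(data.failure.timestamp) > DATE_ADD('day', -7, now())
+GROUP BY data.failure.maximumAllowedSizeBytes
+```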
@@ -138,23 +140,25 @@ This failure type can be produced either by the [Collector](/docs/api-reference/ Failures of this type cannot be [recovered](/docs/data-product-studio/data-quality/failed-events/recovering-failed-events/index.md). The best you can do is to fix any application that is sending over-sized events. -Size violation schema can be found [here](https://github.com/snowplow/iglu-central/blob/master/schemas/com.snowplowanalytics.snowplow.badrows/size_violation/jsonschema/1-0-0). +Because this failure is handled during collection or enrichment, events in the real time good stream are free of this violation type. + +The size violation schema can be found [here](https://github.com/snowplow/iglu-central/blob/master/schemas/com.snowplowanalytics.snowplow.badrows/size_violation/jsonschema/1-0-0). -## Loader Parsing Error +## Loader parsing error -This failure type can be produced by [any loader](/docs/api-reference/loaders-storage-targets/index.md), if the enriched event in the real time good stream cannot be parsed as a canonical TSV event format. For example, if line has not enough columns (not 131) or event_id is not UUID. This error type is uncommon and unexpected, because it can only be caused by an invalid message in the stream of validated enriched events. +This failure type can be produced by [any loader](/docs/api-reference/loaders-storage-targets/index.md), if the enriched event in the real time good stream cannot be parsed as a canonical TSV event format. For example, if the row does not have enough columns (131 are expected) or the `event_id` is not a UUID. This error type is uncommon and unexpected, because it can only be caused by an invalid message in the stream of validated enriched events.
This failure type cannot be [recovered](/docs/data-product-studio/data-quality/failed-events/recovering-failed-events/index.md). -Loader parsing error schema can be found [here](https://github.com/snowplow/iglu-central/blob/master/schemas/com.snowplowanalytics.snowplow.badrows/loader_parsing_error/jsonschema/2-0-0). +The loader parsing error schema can be found [here](https://github.com/snowplow/iglu-central/blob/master/schemas/com.snowplowanalytics.snowplow.badrows/loader_parsing_error/jsonschema/2-0-0).
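+If you have the Athena tables described in [Accessing failed events in file storage](/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/file-storage/index.md), a query like the one below shows which loader is emitting these errors. This is a sketch: the table name `loader_parsing_error` is an assumption based on the schema name, and the `processor` field names follow the loader parsing error schema linked above.
+
+```sql
+-- Which loader produced the unparseable rows, and how many?
+SELECT data.processor.artifact AS loader,
+       data.processor.version  AS version,
+       COUNT(*)                AS failures
+FROM loader_parsing_error
+GROUP BY data.processor.artifact, data.processor.version
+```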
-## Loader Iglu Error +## Loader Iglu error This failure type can be produced by [any loader](/docs/api-reference/loaders-storage-targets/index.md) and describes an error using the [Iglu](/docs/api-reference/iglu/index.md) subsystem. @@ -162,17 +166,17 @@ This failure type can be produced by [any loader](/docs/api-reference/loaders-st For example: -- A schema is not available in any of the repositories listed in the [iglu resolver](/docs/api-reference/iglu/iglu-resolver/index.md). -- Some loaders (e.g. [RDB loader](/docs/api-reference/loaders-storage-targets/snowplow-rdb-loader/index.md) and [Postgres loader](/docs/api-reference/loaders-storage-targets/snowplow-postgres-loader/index.md)) make use of the "schema list" api endpoints, which are only implemented for an [iglu-server](/docs/api-reference/iglu/iglu-repositories/iglu-server/index.md) repository. A loader iglu error will be generated if the schema is in a [static repo](/docs/api-reference/iglu/iglu-repositories/static-repo/index.md) or [embedded repo](/docs/api-reference/iglu/iglu-repositories/jvm-embedded-repo/index.md). +- A schema is not available in any of the repositories listed in the [Iglu resolver](/docs/api-reference/iglu/iglu-resolver/index.md). +- Some loaders (e.g. [RDB loader](/docs/api-reference/loaders-storage-targets/snowplow-rdb-loader/index.md) and [Postgres loader](/docs/api-reference/loaders-storage-targets/snowplow-postgres-loader/index.md)) make use of the "schema list" api endpoints, which are only implemented for an [Iglu server](/docs/api-reference/iglu/iglu-repositories/iglu-server/index.md) repository. A loader Iglu error will be generated if the schema is in a [static repo](/docs/api-reference/iglu/iglu-repositories/static-repo/index.md) or [embedded repo](/docs/api-reference/iglu/iglu-repositories/jvm-embedded-repo/index.md). - The loader cannot auto-migrate a database table. If a schema version is incremented from `1-0-0` to `1-0-1` then it is expected to be [a non-breaking change](/docs/api-reference/iglu/common-architecture/schemaver/index.md), and many loaders (e.g. RDB loader) attempt to execute a `ALTER TABLE` statement to facilitate the new schema in the warehouse. But if the schema change is breaking (e.g. string field changed to integer field) then the database migration is not possible. This failure type cannot be [recovered](/docs/data-product-studio/data-quality/failed-events/recovering-failed-events/index.md). -Loader iglu error schema can be found [here](https://github.com/snowplow/iglu-central/blob/master/schemas/com.snowplowanalytics.snowplow.badrows/loader_iglu_error/jsonschema/2-0-0). +Loader Iglu error schema can be found [here](https://github.com/snowplow/iglu-central/blob/master/schemas/com.snowplowanalytics.snowplow.badrows/loader_iglu_error/jsonschema/2-0-0). -## Loader Recovery Error +## Loader recovery error Currently only the [BigQuery repeater](/docs/api-reference/loaders-storage-targets/bigquery-loader/index.md#block-8db848d4-0265-4ffa-97db-0211f4e2293d) generates this error. We call it "loader recovery error" because the purpose of the repeater is to recover from previously failed inserts. It represents the case when the software could not re-insert the row into the database due to a runtime failure or invalid data in a source. 
@@ -184,7 +188,7 @@ Loader recovery error schema can be found [here](https://github.com/snowplow/igl -## Loader Runtime Error +## Loader runtime error This failure type can be produced by any loader and describes generally any runtime error that we did not catch. For example, a DynamoDB outage, or a null pointer exception. This error type is uncommon and unexpected, and it probably indicates a mistake in the configuration or a bug in the software. @@ -196,7 +200,7 @@ Loader runtime error schema can be found [here](https://github.com/snowplow/iglu -## Relay Failure +## Relay failure This failure type is only produced by relay jobs, which transfer Snowplow data into a 3rd party platform. This error type is uncommon and unexpected, and it probably indicates a mistake in the configuration or a bug in the software. @@ -208,7 +212,7 @@ Relay failure schema can be found [here](https://github.com/snowplow/iglu-centra -## Generic Error +## Generic error This is a failure type for anything that does not fit into the other categories, and is unlikely enough that we have not created a special category. The failure error messages should give you a hint about what has happened. diff --git a/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/querying/images/athena-count.png b/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/file-storage/images/athena-count.png similarity index 100% rename from docs/data-product-studio/data-quality/failed-events/exploring-failed-events/querying/images/athena-count.png rename to docs/data-product-studio/data-quality/failed-events/exploring-failed-events/file-storage/images/athena-count.png diff --git a/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/querying/images/athena-create-table.png b/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/file-storage/images/athena-create-table.png similarity index 100% rename from docs/data-product-studio/data-quality/failed-events/exploring-failed-events/querying/images/athena-create-table.png rename to docs/data-product-studio/data-quality/failed-events/exploring-failed-events/file-storage/images/athena-create-table.png diff --git a/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/querying/images/bigquery-count.png b/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/file-storage/images/bigquery-count.png similarity index 100% rename from docs/data-product-studio/data-quality/failed-events/exploring-failed-events/querying/images/bigquery-count.png rename to docs/data-product-studio/data-quality/failed-events/exploring-failed-events/file-storage/images/bigquery-count.png diff --git a/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/file-storage/index.md b/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/file-storage/index.md index 18517d090c..2ae68943ee 100644 --- a/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/file-storage/index.md +++ b/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/file-storage/index.md @@ -1,10 +1,15 @@ --- title: "Accessing failed events in file storage" -sidebar_label: "Using S3 or GCS" -sidebar_position: 1 +sidebar_label: "In file storage" +sidebar_position: 2 --- -When failed events are generated on your pipeline the raw event payload along with details about the failure are saved into file storage (S3 on AWS, GCS on Google Cloud). 
+```mdx-code-block +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; +``` + +On AWS and GCP, when failed events are generated on your pipeline, the raw event payload along with details about the failure are saved into file storage (S3 on AWS, GCS on Google Cloud). :::info Community Edition quick start guide on GCP @@ -12,9 +17,12 @@ If you followed the [Community Edition quick start guide](/docs/get-started/snow ::: +## Retrieving raw data + You can directly access and download examples of events that are failing from file storage, this is useful for further investigation and also required to design a recovery operation. -## Retrieving raw data from S3 on AWS + + - Login to your AWS Console account and navigate to the sub-account that contains your Snowplow pipeline - Navigate to your S3 storage buckets @@ -38,7 +46,8 @@ Step 3 - select the relevant folder for your error type Step 4 - use the date and timestamps to find a batch of failed events that will contain an example of the event you wish to find -## Retrieving raw data from GCS on GCP + + - Login to your Google Cloud Platform account and navigate to the project that contains your Snowplow pipeline - Navigate to your Google Cloud Storage buckets @@ -65,3 +74,274 @@ Step 4 - drill down into the folder structure by year, month, day and time ![](images/failed-evs-gcs-7.jpg) Step 5 - once you find the raw files you can download them and view them in a text editor + + + + +## Using Athena or BigQuery + +[Athena](https://aws.amazon.com/athena/) on AWS and [BigQuery](https://cloud.google.com/bigquery) on GCP are tools that let you query your failed events, using the cloud storage files as a back-end data source. + +```sql +SELECT data.failure.messages FROM adapter_failures +WHERE from_iso8601_timestamp(data.failure.timestamp) > timestamp '2020-04-01' +``` + +This approach is great for debugging your pipeline without the need to load your failed events into a separate database. + +Before you can query this data, you need to create corresponding tables in Athena or BigQuery as we explain below. Each different failed event type (e.g. schema violations, adapter failures) has a different schema, so you will need one table per event type. + +## Creating the tables + + + + +Go to [the Athena dashboard](https://eu-central-1.console.aws.amazon.com/athena/home) and use the query editor. Start by creating a database (replace `{{ DATABASE }}` with the name of your pipeline, e.g. `prod1` or `qa1`): + +```sql +CREATE DATABASE IF NOT EXISTS {{ DATABASE }} +``` + +Then run each sql statement provided in the [badrows-tables repository](https://github.com/snowplow-incubator/snowplow-badrows-tables/tree/master/athena) by copying them into the Athena query editor. We recommend creating all tables, although you can skip the ones you are not interested in. + +:::info Placeholders + +Note that the sql statements contain a few placeholders which you will need to edit before you can create the tables: + +* `{{ DATABASE }}` — as above, change this to the name of your pipeline, e.g. `prod1` or `qa1`. +* `s3://{{ BUCKET }}/{{ PIPELINE }}` — this should point to the directory in S3 where your bad rows files are stored. 
+ +::: + +![Creating a table in Athena](images/athena-create-table.png) + + + + +:::info Community Edition quick start guide on GCP + +If you followed the [Community Edition quick start guide](/docs/get-started/snowplow-community-edition/quick-start/index.md), you will need to manually deploy the [GCS Loader](/docs/api-reference/loaders-storage-targets/google-cloud-storage-loader/index.md) to save failed events into GCS, as it’s currently not included in the Terraform scripts. + +::: + +:::note + +These instructions make use of the [bq command-line tool](https://cloud.google.com/bigquery/docs/bq-command-line-tool) which is packaged with the [google cloud sdk](https://cloud.google.com/sdk/docs). Follow the sdk instructions for how to [initialize and authenticate the sdk](https://cloud.google.com/sdk/docs/initializing). Also take a look at the [BigQuery dashboard](https://console.cloud.google.com/bigquery) as you run these commands, so you can see your tables as you create them. + +::: + +Create a dataset to contain your failed event tables: + +```bash +bq mk --data_location=EU bad_rows_prod1 +# Dataset 'my-snowplow-project:bad_rows_prod1' successfully created. +``` + +The `--data-location` should match the location of your bad rows bucket. Also replace `prod1` with the name of your pipeline. + +Next, download the table definitions provided in the [badrows-tables repository](https://github.com/snowplow-incubator/snowplow-badrows-tables/tree/master/bigquery) in JSON format. + +:::info Placeholders + +Each table definition contains a `{{ BUCKET }}` placeholder which needs to be changed to the GCS bucket where your bad rows files are stored (e.g. `sp-storage-loader-bad-prod1-com_acme`). + +::: + +Now run `bq mk` for each table definition in turn. Use the `--external_table_definition` parameter so that BigQuery uses the bucket as the back-end data source. Here is how to run the command for the first three tables (note that you should change the dataset name `bad_rows_prod1` to match the dataset you just created): + +```bash +bq mk \ + --display_name="Adapter failures" \ + --external_table_definition=./adapter_failures.json \ + bad_rows_prod1.adapter_failures + +# Table 'my-snowplow-project:bad_rows_prod1.adapter_failures' successfully created. + +bq mk \ + --display_name "Schema violations" \ + --external_table_definition=./schema_violations.json \ + bad_rows_prod1.schema_violations + +# Table 'my-snowplow-project:bad_rows_prod1.schema_violations' successfully created. + +bq mk \ + --display_name "Tracker protocol violations" \ + --external_table_definition=./tracker_protocol_violations.json \ + bad_rows_prod1.tracker_protocol_violations + +# Table 'my-snowplow-project:bad_rows_prod1.tracker_protocol_violations' successfully created. +``` + +Run the corresponding commands for the remaining table definitions. We recommend creating all tables, although you can skip the ones you are not interested in. + +:::tip Why not just auto-detect the schemas? + +BigQuery has an “Auto-detect” feature to automatically generate the table definition for you by inspecting the file contents. So you might wonder why it is necessary to provide explicit schema definitions for your tables. + +There are two potential pitfalls when using the autogenerated schema with the Snowplow bad rows files: + +- _Optional fields_. BigQuery might not “notice” that a field exists, depending on the sample of data used to detect the schema. +- _Polymorphic fields_, e.g. `error` that can be either a string or an object. 
BigQuery will throw an exception if it sees an unexpected value for a field. Our table definitions use the `JSON` data type for these fields.
+
+:::
+
+
+
+
+## Querying the data
+
+
+
+
+As an example of using your Athena tables, you might start by getting counts of each failed event type from the last week. Repeat this query for each table you have created:
+
+```sql
+SELECT COUNT(*) FROM schema_violations
+WHERE from_iso8601_timestamp(data.failure.timestamp) > DATE_ADD('day', -7, now())
+```
+
+![Athena query](images/athena-count.png)
+
+If you have schema violations, you might want to find which tracker sent the event:
+
+```sql
+SELECT data.payload.enriched.app_id, COUNT(*) FROM schema_violations
+WHERE from_iso8601_timestamp(data.failure.timestamp) > DATE_ADD('day', -7, now())
+GROUP BY data.payload.enriched.app_id
+```
+
+You can do a deeper dive into the error messages to get an explanation of the last 10 failures:
+
+```sql
+SELECT message.field AS field,
+       message.value AS value,
+       message.error AS error,
+       message.json AS json,
+       message.schemaKey AS schemaKey,
+       message.schemaCriterion AS schemaCriterion
+FROM schema_violations
+CROSS JOIN UNNEST(data.failure.messages) AS t(message)
+ORDER BY data.failure.timestamp DESC
+LIMIT 10
+```
+
+
+
+
+You can query your tables from the query editor in the [BigQuery console](https://console.cloud.google.com/bigquery). You might want to start by getting counts of each failed event type from the last week. This query will work, but it is relatively expensive because it will scan all files in the `schema_violations` directory:
+
+```sql
+SELECT COUNT(*) FROM bad_rows_prod1.schema_violations
+WHERE data.failure.timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY);
+```
+
+You can construct a more economical query by using the `_FILE_NAME` pseudo column to restrict the scan to files from the last week:
+
+```sql
+SELECT COUNT(*) FROM bad_rows_prod1.schema_violations
+WHERE DATE(PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%S', LTRIM(REGEXP_EXTRACT(_FILE_NAME, 'output-[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+'), 'output-'))) >= DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY);
+```
+
+You can repeat that query for each table you created in your bad rows dataset.
+
+![BigQuery query](images/bigquery-count.png)
+
+If you have schema violations, you might want to find which tracker sent the event:
+
+```sql
+SELECT data.payload.enriched.app_id, COUNT(*) FROM bad_rows_prod1.schema_violations
+WHERE DATE(PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%S', LTRIM(REGEXP_EXTRACT(_FILE_NAME, 'output-[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+'), 'output-'))) >= DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
+GROUP BY data.payload.enriched.app_id;
+```
+
+If you have tracker protocol failures, you can do a deeper dive into the error messages to get an explanation of the last 10 failures:
+
+```sql
+SELECT message.field AS field,
+       message.value AS value,
+       message.error AS error,
+       message.expectation AS expectation,
+       message.schemaKey AS schemaKey,
+       message.schemaCriterion AS schemaCriterion
+FROM bad_rows_prod1.tracker_protocol_violations,
+UNNEST(data.failure.messages) AS message
+WHERE DATE(PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%S', LTRIM(REGEXP_EXTRACT(_FILE_NAME, 'output-[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+'), 'output-'))) >= DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
+ORDER BY data.failure.timestamp DESC
+LIMIT 10;
+```
+
+
+ Digging deeper + +You might notice that the `error` field in the result of the query above has the `JSON` type. +This is because depending on the variety of the failed event, the `error` might be a simple string or a complex object with additional detail. + +For example, the “invalid JSON” message might have this `error`: + +```json +"invalid json: expected false got 'foo' (line 1, column 1)" +``` + +In contrast, in case of a failure to resolve Iglu server, the value in the `error` field would look like this, with “sub-errors” inside: + +```json +{ + "error": "ResolutionError", + "lookupHistory": [ + { + "attempts": 1, + "errors": [ + { + "error": "RepoFailure", + "message": "Unexpected exception fetching: org.http4s.client.UnexpectedStatus: unexpected HTTP status: 404 Not Found" + } + ], + "lastAttempt": "2021-10-16T17:20:52.626Z", + "repository": "Iglu Central" + }, + ... + ] +} +``` + +You can figure out what to expect from such a field by looking at the JSON schema for the respective type of failed events, in this case the [tracker protocol violations schema](https://github.com/snowplow/iglu-central/blob/master/schemas/com.snowplowanalytics.snowplow.badrows/tracker_protocol_violations/jsonschema/1-0-0). The mapping between the various failed event tables and the corresponding JSON schemas is [here](https://github.com/snowplow-incubator/snowplow-badrows-tables/tree/master/bigquery). + +BigQuery has a variety of JSON functions that allow you to extract data from within complex objects. For instance, if you are interested in Iglu repositories that failed to resolve, you can use something like this: + +```sql +SELECT DISTINCT(JSON_VALUE(message.error.lookupHistory[0].repository)) +FROM ... +WHERE ... +AND message.error.lookupHistory IS NOT NULL +``` + +It’s also possible, although unwieldy, to reduce all `error`s to a single string: + +```sql +-- Unnest individual messages for each failed event +WITH unnested_messages AS ( + SELECT message, CASE + -- resolution errors + WHEN message.error.lookupHistory IS NOT NULL THEN JSON_QUERY_ARRAY(message.error.lookupHistory[0].errors) + -- event validation errors + WHEN message.error.dataReports IS NOT NULL THEN JSON_QUERY_ARRAY(message.error.dataReports) + -- schema validation errors + WHEN message.error.schemaIssues IS NOT NULL THEN JSON_QUERY_ARRAY(message.error.schemaIssues) + -- other errors + ELSE [TO_JSON(STRUCT(message.error as message))] + END AS errors +FROM bad_rows_prod1.tracker_protocol_violations, +UNNEST(data.failure.messages) AS message +WHERE ...) + +SELECT JSON_VALUE(error.message) AS error +FROM unnested_messages, +UNNEST(errors) AS error +``` + +In the future, we plan to simplify the schemas of failed events so that they are more uniform and straightforward to query. + +
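+
+To see which schemas are behind most of the violations, you can aggregate on `schemaKey` across the unnested messages. This is a sketch that reuses the `_FILE_NAME` filter from above; adjust the dataset name and the time window to suit your pipeline:
+
+```sql
+-- Which schemas fail validation most often?
+SELECT message.schemaKey AS schema_key, COUNT(*) AS failures
+FROM bad_rows_prod1.schema_violations,
+UNNEST(data.failure.messages) AS message
+WHERE DATE(PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%S', LTRIM(REGEXP_EXTRACT(_FILE_NAME, 'output-[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+'), 'output-'))) >= DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY)
+GROUP BY schema_key
+ORDER BY failures DESC;
+```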
+ +
+
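+
+Once you know which schema is affected, you can pull a few concrete examples to inform a [recovery](/docs/data-product-studio/data-quality/failed-events/recovering-failed-events/index.md). Here is a sketch in Athena syntax (the BigQuery version follows the same shape, using `UNNEST` as shown above); `com.acme/checkout_started` is a placeholder for whichever schema you are investigating:
+
+```sql
+-- Sample recent violations for one specific schema
+SELECT data.payload.enriched.app_id AS app_id,
+       message.error                AS error
+FROM schema_violations
+CROSS JOIN UNNEST(data.failure.messages) AS t(message)
+WHERE message.schemaKey LIKE '%com.acme/checkout_started%'
+LIMIT 10
+```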
diff --git a/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/querying/index.md b/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/querying/index.md deleted file mode 100644 index 48c6aa04cd..0000000000 --- a/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/querying/index.md +++ /dev/null @@ -1,276 +0,0 @@ ---- -title: "Querying failed events in Athena or BigQuery" -sidebar_label: "Using Athena or BigQuery" -sidebar_position: 2 ---- - -```mdx-code-block -import Tabs from '@theme/Tabs'; -import TabItem from '@theme/TabItem'; -``` - -[Athena](https://aws.amazon.com/athena/) on AWS and [BigQuery](https://cloud.google.com/bigquery) on GCP are tools that let you query your failed events, using the cloud storage files as a back-end data source. - -```sql -SELECT data.failure.messages FROM adapter_failures -WHERE from_iso8601_timestamp(data.failure.timestamp) > timestamp '2020-04-01' -``` - -This approach is great for debugging your pipeline without the need to load your failed events into a separate database. - -Before you can query this data, you need to create corresponding tables in Athena or BigQuery as we explain below. Each different failed event type (e.g. schema violations, adapter failures) has a different schema, so you will need one table per event type. - -## Creating the tables - - - - -Go to [the Athena dashboard](https://eu-central-1.console.aws.amazon.com/athena/home) and use the query editor. Start by creating a database (replace `{{ DATABASE }}` with the name of your pipeline, e.g. `prod1` or `qa1`): - -```sql -CREATE DATABASE IF NOT EXISTS {{ DATABASE }} -``` - -Then run each sql statement provided in the [badrows-tables repository](https://github.com/snowplow-incubator/snowplow-badrows-tables/tree/master/athena) by copying them into the Athena query editor. We recommend creating all tables, although you can skip the ones you are not interested in. - -:::info Placeholders - -Note that the sql statements contain a few placeholders which you will need to edit before you can create the tables: - -* `{{ DATABASE }}` — as above, change this to the name of your pipeline, e.g. `prod1` or `qa1`. -* `s3://{{ BUCKET }}/{{ PIPELINE }}` — this should point to the directory in S3 where your bad rows files are stored. - -::: - -![Creating a table in Athena](images/athena-create-table.png) - - - - -:::info Community Edition quick start guide on GCP - -If you followed the [Community Edition quick start guide](/docs/get-started/snowplow-community-edition/quick-start/index.md), you will need to manually deploy the [GCS Loader](/docs/api-reference/loaders-storage-targets/google-cloud-storage-loader/index.md) to save failed events into GCS, as it’s currently not included in the Terraform scripts. - -::: - -:::note - -These instructions make use of the [bq command-line tool](https://cloud.google.com/bigquery/docs/bq-command-line-tool) which is packaged with the [google cloud sdk](https://cloud.google.com/sdk/docs). Follow the sdk instructions for how to [initialize and authenticate the sdk](https://cloud.google.com/sdk/docs/initializing). Also take a look at the [BigQuery dashboard](https://console.cloud.google.com/bigquery) as you run these commands, so you can see your tables as you create them. - -::: - -Create a dataset to contain your failed event tables: - -```bash -bq mk --data_location=EU bad_rows_prod1 -# Dataset 'my-snowplow-project:bad_rows_prod1' successfully created. 
-``` - -The `--data-location` should match the location of your bad rows bucket. Also replace `prod1` with the name of your pipeline. - -Next, download the table definitions provided in the [badrows-tables repository](https://github.com/snowplow-incubator/snowplow-badrows-tables/tree/master/bigquery) in JSON format. - -:::info Placeholders - -Each table definition contains a `{{ BUCKET }}` placeholder which needs to be changed to the GCS bucket where your bad rows files are stored (e.g. `sp-storage-loader-bad-prod1-com_acme`). - -::: - -Now run `bq mk` for each table definition in turn. Use the `--external_table_definition` parameter so that BigQuery uses the bucket as the back-end data source. Here is how to run the command for the first three tables (note that you should change the dataset name `bad_rows_prod1` to match the dataset you just created): - -```bash -bq mk \ - --display_name="Adapter failures" \ - --external_table_definition=./adapter_failures.json \ - bad_rows_prod1.adapter_failures - -# Table 'my-snowplow-project:bad_rows_prod1.adapter_failures' successfully created. - -bq mk \ - --display_name "Schema violations" \ - --external_table_definition=./schema_violations.json \ - bad_rows_prod1.schema_violations - -# Table 'my-snowplow-project:bad_rows_prod1.schema_violations' successfully created. - -bq mk \ - --display_name "Tracker protocol violations" \ - --external_table_definition=./tracker_protocol_violations.json \ - bad_rows_prod1.tracker_protocol_violations - -# Table 'my-snowplow-project:bad_rows_prod1.tracker_protocol_violations' successfully created. -``` - -Run the corresponding commands for the remaining table definitions. We recommend creating all tables, although you can skip the ones you are not interested in. - -:::tip Why not just auto-detect the schemas? - -BigQuery has an “Auto-detect” feature to automatically generate the table definition for you by inspecting the file contents. So you might wonder why it is necessary to provide explicit schema definitions for your tables. - -There are two potential pitfalls when using the autogenerated schema with the Snowplow bad rows files: - -- _Optional fields_. BigQuery might not “notice” that a field exists, depending on the sample of data used to detect the schema. -- _Polymorphic fields_, e.g. `error` that can be either a string or an object. BigQuery will throw an exception if it sees an unexpected value for a field. Our table definitions use the `JSON` data type for these fields. - -::: - - - - -## Querying the data - - - - -As example of using your Athena tables, you might start by getting counts of each failed event type from the last week. 
Repeat this query for each table you have created: - -```sql -SELECT COUNT(*) FROM schema_violations -WHERE from_iso8601_timestamp(data.failure.timestamp) > DATE_ADD('day', -7, now()) -``` - -![Athena query](images/athena-count.png) - -If you have schema violations, you might want to find which tracker sent the event: - -```sql -SELECT data.payload.enriched.app_id, COUNT(*) FROM schema_violations -WHERE from_iso8601_timestamp(data.failure.timestamp) > DATE_ADD('day', -7, now()) -GROUP BY data.payload.enriched.app_id -``` - -You can do a deeper dive into the error messages to get a explanation of the last 10 failures: - -```sql -SELECT message.field AS field, - message.value AS value, - message.error AS error, - message.json AS json, - message.schemaKey AS schemaKey, - message.schemaCriterion AS schemaCriterion -FROM schema_violations -CROSS JOIN UNNEST(data.failure.messages) AS t(message) -ORDER BY data.failure.timestamp DESC -LIMIT 10 -``` - - - - -You can query your tables from the query editor in the [BigQuery console](https://console.cloud.google.com/bigquery). You might want to start by getting counts of each failed event type from the last week. This query will work, but it is relatively expensive because it will scan all files in the `schema_violations` directory: - -```sql -SELECT COUNT(*) FROM bad_rows_prod1.schema_violations -WHERE data.failure.timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY); -``` - -You can construct a more economical query by using the `_FILE_NAME` pseudo column to restrict the scan to files from the last week: - -```sql -SELECT COUNT(*) FROM bad_rows_prod1.schema_violations -WHERE DATE(PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%S', LTRIM(REGEXP_EXTRACT(_FILE_NAME, 'output-[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+'), 'output-'))) >= DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY); -``` - -You can repeat that query for each table you created in your bad rows dataset. - -![BigQuery query](images/bigquery-count.png) - -If you have schema violations, you might want to find which tracker sent the event: - -```sql -SELECT data.payload.enriched.app_id, COUNT(*) FROM bad_rows_prod1.schema_violations -WHERE DATE(PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%S', LTRIM(REGEXP_EXTRACT(_FILE_NAME, 'output-[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+'), 'output-'))) >= DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY) -GROUP BY data.payload.enriched.app_id; -``` - -If you have tracker protocol failures, you can do a deeper dive into the error messages to get a explanation of the last 10 failures: - -```sql -SELECT message.field AS field, - message.value AS value, - message.error AS error, - message.expectation AS expectation, - message.schemaKey AS schemaKey, - message.schemaCriterion AS schemaCriterion -FROM bad_rows_prod1.tracker_protocol_violations, -UNNEST(data.failure.messages) AS message -WHERE DATE(PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%S', LTRIM(REGEXP_EXTRACT(_FILE_NAME, 'output-[0-9]+-[0-9]+-[0-9]+T[0-9]+:[0-9]+:[0-9]+'), 'output-'))) >= DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY) -ORDER BY data.failure.timestamp DESC -LIMIT 10; -``` - -
- Digging deeper - -You might notice that the `error` field in the result of the query above has the `JSON` type. -This is because depending on the variety of the failed event, the `error` might be a simple string or a complex object with additional detail. - -For example, the “invalid JSON” message might have this `error`: - -```json -"invalid json: expected false got 'foo' (line 1, column 1)" -``` - -In contrast, in case of a failure to resolve Iglu server, the value in the `error` field would look like this, with “sub-errors” inside: - -```json -{ - "error": "ResolutionError", - "lookupHistory": [ - { - "attempts": 1, - "errors": [ - { - "error": "RepoFailure", - "message": "Unexpected exception fetching: org.http4s.client.UnexpectedStatus: unexpected HTTP status: 404 Not Found" - } - ], - "lastAttempt": "2021-10-16T17:20:52.626Z", - "repository": "Iglu Central" - }, - ... - ] -} -``` - -You can figure out what to expect from such a field by looking at the JSON schema for the respective type of failed events, in this case the [tracker protocol violations schema](https://github.com/snowplow/iglu-central/blob/master/schemas/com.snowplowanalytics.snowplow.badrows/tracker_protocol_violations/jsonschema/1-0-0). The mapping between the various failed event tables and the corresponding JSON schemas is [here](https://github.com/snowplow-incubator/snowplow-badrows-tables/tree/master/bigquery). - -BigQuery has a variety of JSON functions that allow you to extract data from within complex objects. For instance, if you are interested in Iglu repositories that failed to resolve, you can use something like this: - -```sql -SELECT DISTINCT(JSON_VALUE(message.error.lookupHistory[0].repository)) -FROM ... -WHERE ... -AND message.error.lookupHistory IS NOT NULL -``` - -It’s also possible, although unwieldy, to reduce all `error`s to a single string: - -```sql --- Unnest individual messages for each failed event -WITH unnested_messages AS ( - SELECT message, CASE - -- resolution errors - WHEN message.error.lookupHistory IS NOT NULL THEN JSON_QUERY_ARRAY(message.error.lookupHistory[0].errors) - -- event validation errors - WHEN message.error.dataReports IS NOT NULL THEN JSON_QUERY_ARRAY(message.error.dataReports) - -- schema validation errors - WHEN message.error.schemaIssues IS NOT NULL THEN JSON_QUERY_ARRAY(message.error.schemaIssues) - -- other errors - ELSE [TO_JSON(STRUCT(message.error as message))] - END AS errors -FROM bad_rows_prod1.tracker_protocol_violations, -UNNEST(data.failure.messages) AS message -WHERE ...) - -SELECT JSON_VALUE(error.message) AS error -FROM unnested_messages, -UNNEST(errors) AS error -``` - -In the future, we plan to simplify the schemas of failed events so that they are more uniform and straightforward to query. - -
- -
-
diff --git a/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/warehouse-lake/images/enable-stream.png b/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/warehouse-lake/images/enable-stream.png new file mode 100644 index 0000000000..89cb111cf6 Binary files /dev/null and b/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/warehouse-lake/images/enable-stream.png differ diff --git a/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/warehouse-lake/images/loader-type.png b/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/warehouse-lake/images/loader-type.png new file mode 100644 index 0000000000..72212e1906 Binary files /dev/null and b/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/warehouse-lake/images/loader-type.png differ diff --git a/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/warehouse-lake/index.md b/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/warehouse-lake/index.md new file mode 100644 index 0000000000..7d2c872789 --- /dev/null +++ b/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/warehouse-lake/index.md @@ -0,0 +1,131 @@ +--- +title: "Exploring failed events in the warehouse or data lake" +description: "Load common types of failed events to a separate table in your warehouse or lake to analyze them easily." +sidebar_label: "In warehouse or lake" +sidebar_position: 1 +--- + +:::note Compatibility + +This feature is available since Enrich 5.0.0 and works with Snowflake and Lake loaders. + +::: + +## Introduction + +[Failed events](/docs/fundamentals/failed-events/index.md) are events the pipeline had some problem processing (for example, events that did not pass validation). + +For the common failures (validation and enrichment), you can configure continuous loading of any offending events into _a separate table_ in your warehouse or lake. This way, you can easily inspect them and decide how they might be patched up (e.g. with SQL) and merged with the rest of your data. + +:::note + +This feature is not retroactive, i.e. only failed events that occur _after it’s enabled_ will be loaded into your desired destination. + +::: + +## Format + +The format of the failed events loaded into your warehouse or lake is [the same as for your atomic events](/docs/fundamentals/canonical-event/index.md). All the standard fields are present (unless themselves invalid — see below). This allows you to query and aggregate this data easily, e.g. if you want to see the number of failures per `app_id`. + +There are two differences compared to regular events. + +**Invalid data is removed from the event.** This principle applies to all columns: +* Any invalid standard column (e.g. `geo_country`) will be set to `null`. +* Likewise, any column containing the JSON for a self-describing event (`unstruct_...`) will be set to `null` if that JSON fails validation. +* Finally, for entity columns (`contexts_`), if one entity is invalid, it will be removed from the array of entities. If all entities are invalid, the whole column will be set to `null`. + +For more information about the different columns in Snowplow data, see [how Snowplow data is stored in the warehouse](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md). + +**There is an extra column with failure details.** The column is named `contexts_com_snowplowanalytics_snowplow_failure_1`. 
In most cases, it will also contain the invalid data in some form. See the [next section](#example-failed-event) for an example. + +## Example failed event + +Here is an example of what the `contexts_com_snowplowanalytics_snowplow_failure_1` column might contain. Note that a single failed event might have more than one error. In this case, there is a required field missing, and also an unrecognized field present. + +```js +[ + { + // timestamp of the failure + "timestamp": "2025-01-15T14:12:50.498148Z", + + // failure type and failure schema version + "failureType": "ValidationError", + "_schema_version": "1-0-0", + + // the component where the failure happened + "componentName": "snowplow-enrich-kafka", + "componentVersion": "5.1.2", + + // the schema of the offending event + "schema": "iglu:com.snowplowanalytics.snowplow/link_click/jsonschema/1-0-1", + + // any properties which were invalid + "data": { + "invalidProperty": "This schema doesn't have this property" + }, + + // there can be multiple errors per event + "errors": [ + { + "keyword": "required", + "message": "$.targetUrl: is missing but it is required", + "path": "$", + "source": "unstruct", + "targets": [ + "targetUrl" + ] + }, + { + "keyword": "additionalProperties", + "message": "$.invalidProperty: is not defined in the schema and the schema does not allow additional properties", + "path": "$", + "source": "unstruct", + "targets": [ + "invalidProperty" + ] + } + ] + } +] +``` + +## Setup + +To use this feature, you will first need to enable the stream that contains failed events in the [Snowplow TSV format](/docs/fundamentals/canonical-event/understanding-the-enriched-tsv-format/index.md) suitable for loading into your warehouse or lake. + +The instructions below are for Snowplow BDP users. For Community Edition, you will need to configure this manually via Terraform. + +:::note Infrastructure costs + +An additional stream (Kinesis, Pub/Sub or Event Hubs on AWS, GCP and Azure respectively) will be reflected in your cloud infrastructure costs (unless you are using BDP Cloud). That said, failed events are usually a tiny fraction of all events, so this stream will be minimally sized. + +::: + +Open the _“Pipeline configuration”_ section for the desired pipeline and select _“Failed events stream”_. + +![enable failed events stream](images/enable-stream.png) + +Click _“Enable”_ and wait for the changes to take effect. + +Now you are ready to add a loader. Click _“Add failed events loader”_, which will take you to the destinations catalog. + +You can use the following loaders with the failed events stream: + +* Snowflake Streaming Loader +* Lake Loader + +Pick your desired destination and follow the steps in the UI, selecting _“failed events”_ as the type of events. + +![loader type selection](images/loader-type.png) + +Note that as with any other loader, you will first need to create a connection to your warehouse or lake, and then the loader itself. + +:::warning PII in failed events + +Some of the problems that cause failed events could lead them to contain personally identifiable information (PII). For example, a validation error could stem from PII placed in the wrong field, and that field might not be pseudonymized, leaving PII exposed in the error message. Or the [PII enrichment](/docs/pipeline/enrichments/available-enrichments/pii-pseudonymization-enrichment/index.md) itself might have failed. 
+ +For this reason, we strongly recommend loading failed events into a separate schema (in case of a warehouse) or storage location (in case of a data lake) compared to your atomic events. This allows you to restrict access to failed events. + +Keep this in mind when creating the connection. + +::: diff --git a/docs/data-product-studio/data-quality/failed-events/recovering-failed-events/manual/getting-started/index.md b/docs/data-product-studio/data-quality/failed-events/recovering-failed-events/manual/getting-started/index.md index 77c1b010d3..50b5a155ab 100644 --- a/docs/data-product-studio/data-quality/failed-events/recovering-failed-events/manual/getting-started/index.md +++ b/docs/data-product-studio/data-quality/failed-events/recovering-failed-events/manual/getting-started/index.md @@ -6,7 +6,7 @@ sidebar_position: 0 Event recovery at its core, is the ability to fix events that have failed and replay them through your pipeline. -After inspecting failed events either in the [Snowplow BDP Console](/docs/data-product-studio/data-quality/failed-events/monitoring-failed-events/ui/index.md), or in the [partitioned failure buckets](/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/querying/index.md), you can determine which events are possible to recover based on what the fix entails. +After inspecting failed events either in the [Snowplow BDP Console](/docs/data-product-studio/data-quality/failed-events/monitoring-failed-events/ui/index.md), or in the [partitioned failure buckets](/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/file-storage/index.md), you can determine which events are possible to recover based on what the fix entails. With recovery it is possible to: @@ -16,9 +16,9 @@ With recovery it is possible to: If your failed events would not be fixed by applying the above, they currently would be considered unrecoverable. Due to the fact that there might be a mix of recoverable and unrecoverable data in your storage, event recovery uses configuration in order to process only a subset of the failed events. -### What you'll need to get started +### What you'll need to get started -The typical flow for recovery and some prerequisites to consider would be: +The typical flow for recovery and some prerequisites to consider would be: **Understanding the failure issue** - Familiarity with the [failed event types](/docs/fundamentals/failed-events/index.md) diff --git a/docs/data-product-studio/data-quality/snowplow-inspector/overview/index.md b/docs/data-product-studio/data-quality/snowplow-inspector/overview/index.md index f33cf938fe..2a8280d99f 100644 --- a/docs/data-product-studio/data-quality/snowplow-inspector/overview/index.md +++ b/docs/data-product-studio/data-quality/snowplow-inspector/overview/index.md @@ -32,6 +32,6 @@ This makes the tool a good first port of call when trying to answer questions su Additionally, you can configure the extension to show whether or not an event has passed validation according to any event validation rules codified in the corresponding [schema](/docs/fundamentals/schemas/index.md). -For events that failed validation in production historically that you are unable to replicate in your own browser, see our guides on [how to query failed events](/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/querying/index.md) from their respective destinations. 
+For events that failed validation in production historically that you are unable to replicate in your own browser, see our guides on [how to query failed events](/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/file-storage/index.md) from their respective destinations. These failed events have a [specific format](/docs/fundamentals/failed-events/index.md) that includes an array of helpful, detailed error messages that explain the exact reasons why the event failed validation. These events can also [be imported](/docs/data-product-studio/data-quality/snowplow-inspector/importing-events/index.md#importing-failed-events) into the extension to view as if your browser had generated them itself. diff --git a/docs/fundamentals/failed-events/index.md b/docs/fundamentals/failed-events/index.md index 43c5eda159..e71e5d06c7 100644 --- a/docs/fundamentals/failed-events/index.md +++ b/docs/fundamentals/failed-events/index.md @@ -55,23 +55,21 @@ Snowplow BDP provides a dashboard and alerts for failed events. See [Monitoring --- -For the common failures (validation and enrichment), it’s possible to load the offending events into _a separate table_ in your warehouse or lake. This way, you can easily inspect them and decide how they might be patched up (e.g. with SQL) and merged with the rest of your data. These events will include a special column with the details of the failure, and any invalid columns will be set to `null`. Otherwise, the format is [the same as for your atomic events](/docs/fundamentals/canonical-event/index.md). +For the common failures (validation and enrichment), you can configure continuous loading of any offending events into _a separate table_ in your warehouse or lake. This way, you can easily inspect them and decide how they might be patched up (e.g. with SQL) and merged with the rest of your data. -:::note Compatibility +:::note -This feature is available since Enrich 5.0.0 and works with Snowflake, BigQuery and Lake loaders. +This feature is not retroactive, i.e. only failed events that occur _after it’s enabled_ will be loaded into your desired destination. ::: ---- - -Finally, all failed events are backed up in object storage (S3 on AWS or GCS on GCP). Sometimes, but not in all cases (e.g. not if the original events exceeded size limits), it’s possible to recover them by replaying them through the pipeline. This is a complicated process mainly reserved for internal failures and outages. Refer to [Recovering failed events](/docs/data-product-studio/data-quality/failed-events/recovering-failed-events/index.md). +The events will include a special column with the details of the failure, and any invalid columns will be set to `null`. Otherwise, the format is [the same as for your atomic events](/docs/fundamentals/canonical-event/index.md). -:::note Azure +See [Exploring failed events](/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/warehouse-lake/index.md) for more details and setup instructions. -Writing failed events to object storage is supported on AWS and GCP, but not on Azure. +--- -::: +Finally, on AWS and GCP all failed events are backed up in object storage (S3 and GCS respectively). Sometimes, but not in all cases (e.g. not if the original events exceeded size limits), it’s possible to recover them by replaying them through the pipeline. This is a complicated process mainly reserved for internal failures and outages. 
Refer to [Recovering failed events](/docs/data-product-studio/data-quality/failed-events/recovering-failed-events/index.md). --- diff --git a/docs/get-started/snowplow-community-edition/what-is-deployed/index.md b/docs/get-started/snowplow-community-edition/what-is-deployed/index.md index 83acc72ce8..c16669982c 100644 --- a/docs/get-started/snowplow-community-edition/what-is-deployed/index.md +++ b/docs/get-started/snowplow-community-edition/what-is-deployed/index.md @@ -239,7 +239,7 @@ See the [S3 Loader](https://registry.terraform.io/modules/snowplow-devops/s3-loa The following loaders and folders are available: * Raw loader, `raw/`: events that come straight out of the Collector and have not yet been validated or enriched by the Enrich application. They are Thrift records and are therefore a little tricky to decode. There are not many reasons to use this data, but backing this data up gives you the flexibility to replay this data should something go wrong further downstream in the pipeline. * Enriched loader, `enriched/`: enriched events, in GZipped blobs of [enriched TSV](/docs/fundamentals/canonical-event/understanding-the-enriched-tsv-format/index.md). Historically, this has been used as the staging ground for loading into data warehouses via the [Batch transformer](/docs/api-reference/loaders-storage-targets/snowplow-rdb-loader/transforming-enriched-data/spark-transformer/index.md) application. However, it’s no longer used in the quick start examples. -* Bad loader, `bad/`: [failed events](/docs/fundamentals/failed-events/index.md). You can [query them using Athena](/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/querying/index.md). +* Bad loader, `bad/`: [failed events](/docs/fundamentals/failed-events/index.md). You can [query them using Athena](/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/file-storage/index.md). Also, if you choose Postgres as your destination, the Postgres loader will load all failed events into Postgres. diff --git a/src/remark/abbreviations.js b/src/remark/abbreviations.js index ae682045bb..14d4b9b812 100644 --- a/src/remark/abbreviations.js +++ b/src/remark/abbreviations.js @@ -24,7 +24,7 @@ const plugin = () => { KCL: 'Kinesis Client Library', OSS: 'Open Source Software', QA: 'Quality Assurance', - PII: 'Personal Identifiable Information', + PII: 'Personally Identifiable Information', RDS: 'Amazon Relational Database Service', S3: 'Amazon Cloud Object Storage', SS: 'Server Side', diff --git a/static/_redirects b/static/_redirects index ef4ac4497e..d1e7537b76 100644 --- a/static/_redirects +++ b/static/_redirects @@ -328,3 +328,6 @@ docs/understanding-tracking-design/managing-data-structures-with-data-structures /docs/storing-querying/* /docs/destinations/warehouses-lakes/:splat 301 /docs/recipes/* /docs/resources/recipes-tutorials/:splat 301 /docs/using-the-snowplow-console/* /docs/account-management/:splat 301 + +# Failed events reshuffle +/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/querying/ /docs/data-product-studio/data-quality/failed-events/exploring-failed-events/file-storage/ 301