API Data Sanitization

Psoxy lets you specify rule sets that sanitize data from an API. To configure one, encode the rule set in YAML and set a parameter in your instance's configuration. See an example of rules for Zoom: zoom.yaml.

If such a parameter is not set, a proxy instance selects default rules based on source kind, from the corresponding supported source.

You can configure custom rule sets for a given instance via Terraform, by adding an entry to the custom_api_connector_rules map in your terraform.tfvars file.

eg,

custom_api_connector_rules = {
    gmail: "custom-gmail.yaml"
}

API Connector Rules Syntax

<ruleset> ::= "endpoints:" <endpoint-list>

<endpoint-list> ::= <endpoint> | <endpoint> <endpoint-list>

A ruleset is a list of API endpoints that are permitted to be invoked through the proxy. Requests that do not match an endpoint in this list will be rejected with a 403 response.

Endpoint Specification

<endpoint> ::= <path-template> <allowed-methods> <path-parameter-schemas> <query-parameter-schemas> <response-schema> <transforms>

<path-template> ::= "- pathTemplate: " <string>

Each endpoint is specified by a path template, based on OpenAPI Spec v3.0.0 Path Template syntax. Variable path segments are enclosed in curly braces ({}) and are matched by any value that does not contain a / character.

See: https://swagger.io/docs/specification/paths-and-operations/
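As a rough illustration of the syntax so far, a minimal rule set with one fixed path and one templated path might look like the following. The paths are hypothetical and are not taken from any shipped rule set.

```yaml
endpoints:
  # fixed path: matches only this exact path
  - pathTemplate: "/v1/users"
  # {userId} matches any single path segment (a value without '/'),
  # so /v1/users/123 matches but /v1/users/123/events does not
  - pathTemplate: "/v1/users/{userId}"
```

Under these rules, a request to any other path would be rejected with a 403 response.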

Allowed Methods

<allowed-methods> ::= "- allowedMethods: " <method-list>

<method-list> ::= <method> | <method> <method-list>

<method> ::= "GET" | "POST" | "PUT" | "PATCH" | "DELETE" | "HEAD"

If provided, only HTTP methods included in this list will be permitted for the endpoint. Given the semantics of RESTful APIs, this provides an additional point at which to enforce "read-only" data access, in addition to OAuth scopes, etc.

NOTE: for AWS-hosted deployments using API Gateway, IAM policies and routes may also be used to restrict HTTP methods. See aws/guides/api-gateway.md for more details.
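A sketch of a read-only endpoint, assuming allowedMethods takes a YAML list nested under the endpoint entry (the path is hypothetical):

```yaml
endpoints:
  - pathTemplate: "/v1/users/{userId}"
    # only the listed methods are permitted for this endpoint
    allowedMethods:
      - "GET"
      - "HEAD"
```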

Path / Query Parameter Schemas

<path-parameter-schemas> ::= "- pathParameterSchemas: " <parameter-schema>

<query-parameter-schemas> ::= "- queryParameterSchemas: " <parameter-schema>

alpha - a parameter schema used to validate path/query parameter values; if validation fails, the proxy will return a 403 Forbidden response. Given the use case of validating URL / query parameters, only a small subset of JSON Schema is supported.

As of 0.4.38, this is considered an alpha feature which may change in backwards-incompatible ways.

Currently, the supported JSON Schema features are:

  • type :
    • string value must be a JSON string
    • integer value must be a JSON integer
    • number value must be a JSON number
  • format :
    • reversible-pseudonym : value MUST be a reversible pseudonym generated by the proxy
  • pattern : a regex pattern to match against the value
  • enum : a list of values to match against the value

null/empty is valid for all types; you can use a pattern to restrict this further.
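A sketch of how these features might be combined, assuming each schema block is a map from parameter name to schema. That layout, the path, and the parameter names are assumptions for illustration; check the default rules for your source for the exact form.

```yaml
endpoints:
  - pathTemplate: "/v1/users/{userId}/events"
    pathParameterSchemas:
      userId:
        type: "string"
        # value must be a reversible pseudonym generated by the proxy
        format: "reversible-pseudonym"
    queryParameterSchemas:
      limit:
        type: "integer"
      status:
        type: "string"
        # value must be one of the listed options
        enum:
          - "active"
          - "archived"
      since:
        type: "string"
        # date-like values only, eg 2024-01-31
        pattern: "^\\d{4}-\\d{2}-\\d{2}$"
```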

Response Schema

<response-schema> ::= "responseSchema: " <json-schema-filter>

See: Response Schema Specification below.

<transforms> ::= "transforms:" <transform-list>

<transform-list> ::= <transform> | <transform> <transform-list>

For each Endpoint, rules specify a list of transforms to apply to the response content.

Transform Specification

<transform> ::= "- " <transform-type> <json-paths> [<encoding>]

Each transform is specified by a transform type and a list of JSON paths. The transform is applied to all portions of the response content that match any of the JSON paths.

Supported Transform Types:

<transform-type> ::= "!<pseudonymizeEmailHeader>" | "!<pseudonymize>" | "!<redact>" | "!<redactRegexMatches>" | "!<tokenize>" | "!<filterTokenByRegex>" | "!<redactExceptSubstringsMatchingRegexes>"

NOTE: these are implementations of com.avaulta.gateway.rules.transforms.Transform class in the psoxy codebase.

Pseudonymize

!<pseudonymize> - transforms matching values by normalizing them (trimming whitespace; if they appear to be emails, treating them as case-insensitive; etc) and computing a SHA-256 hash of the normalized value. Relies on the SALT value configured in your proxy environment to ensure the SHA-256 is deterministic across time and between sources. In the case of emails, the domain portion is preserved, although the hash is still based on the entire normalized value (this avoids the hash of an address at one domain matching the hash of the same local part at a different domain).

Options:

  • includeReversible (default: false): If true, an encrypted form of the original value will be included in the result. This value, if passed back to the proxy in a URL, will be decrypted back to the original value before the request is forwarded to the data source. This is useful for identifying values that are needed as parameters for subsequent API requests. This relies on symmetric encryption using the ENCRYPTION_KEY secret stored in the proxy; if ENCRYPTION_KEY is rotated, any 'reversible' value previously generated can no longer be decrypted by the proxy.
  • encoding (default: JSON): The encoding to use when serializing the pseudonym to a string.
    • JSON - a JSON object structure, with explicit fields
    • URL_SAFE_TOKEN - a string format that aims to be concise, URL-safe, and format-preserving for email case.
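A sketch of a pseudonymize transform that sets both options, assuming options are written as sibling keys of jsonPaths. The path and JSON paths are hypothetical.

```yaml
endpoints:
  - pathTemplate: "/v1/users"
    transforms:
      - !<pseudonymize>
        jsonPaths:
          - "$.users[*].email"
          - "$.users[*].id"
        # include an encrypted form of the original value, so it can be
        # passed back to the proxy in the URL of a later request
        includeReversible: true
        # serialize the pseudonym as a concise, URL-safe string
        # rather than a JSON object
        encoding: "URL_SAFE_TOKEN"
```

Keep in mind that rotating ENCRYPTION_KEY invalidates reversible values produced by a rule like this one.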

Pseudonymize Email Header

!<pseudonymizeEmailHeader> - transforms matching values by parsing the value as an email header, in accordance with RFC 2822 and some typical conventions, and generating a pseudonym based only on the normalized email address itself (ignoring the name, etc that may appear). In particular:

  • deals with CSV lists (multiple emails in a single header)
  • handles the name <email> format, in effect redacting the name and replacing with a pseudonym based only on normalized email
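For instance, where a message payload stores headers as name/value pairs, a rule might select the values of the address-bearing headers. The JSON path here is an assumption, modeled on the redact example later in this document.

```yaml
- !<pseudonymizeEmailHeader>
  jsonPaths:
    # select the value of each header whose name is From, To, Cc or Bcc
    - "$.payload.headers[?(@.name =~ /^(From|To|Cc|Bcc)$/i)].value"
```

With such a rule, a header value in the name <email> format would have its name dropped and its address replaced by a pseudonym, per the behavior described above.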

Redact

!<redact> - removes the matching values from the response.

Some extensions of redaction are also supported:

  • !<redactExceptSubstringsMatchingRegexes> - removes the matching values from the response, except for substrings that match one of the specified regex options. (Use case: preserving portions of event titles if they match variants of 'Focus Time', 'No Meetings', etc)
  • !<redactRegexMatches> - redacts content IF it matches one of the regexes included as an option.

By using a negation in the JSON Path for the transformation, !<redact> can be used to implement default-deny style rules, where all fields are redacted except those explicitly listed in the JSON Path expression. This can also redact object-valued fields, conditionally based on object properties as shown below.

Eg, the following redacts all headers that have a name value other than those explicitly listed below:

- !<redact>
  jsonPaths:
    - "$.messages.payload.headers[?(!(@.name =~ /^From|To|Cc|Bcc|X-Original-Sender|Delivered-To|Sender|Message-ID|Date|In-Reply-To|Original-Message-ID|References$/i))]"

Tokenize

!<tokenize> - replaces matching values with a reversible token, which the proxy can reverse to the original value in subsequent requests, using the ENCRYPTION_KEY secret stored in the proxy.

The use case is values that may be sensitive but are opaque. For example, page tokens in Microsoft Graph API do not have a defined structure, but in practice contain PII.

Options:

  • regex - a capturing regex used to extract the portion of the value that needs to be tokenized.
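A sketch for the paging case mentioned above, assuming the response carries its next-page URL in an @odata.nextLink field. The JSON path and the regex are illustrative assumptions, not copied from a shipped rule set.

```yaml
- !<tokenize>
  jsonPaths:
    - "$['@odata.nextLink']"
  # the capturing group marks the portion of the value to tokenize:
  # here, everything after the host
  regex: "^https://graph\\.microsoft\\.com/(.*)$"
```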

Filter Tokens by Regex

!<filterTokenByRegex> - splits matching string values into tokens by a delimiter, if provided, and matches each token against a list of filters, removing any content that doesn't match at least one of the filters. (Use case: preserving Zoom URLs in meeting descriptions, while removing the rest of the description)

Options:

  • delimiter - used to split the value into tokens; if not provided, the entire value is treated as a single token.
  • filters - in effect, combined via OR; tokens matching ANY of the filters are preserved in the value.
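A sketch of the Zoom URL use case, assuming a hypothetical response shape in which each item has a free-text description field. The JSON path, delimiter, and filter pattern are all illustrative.

```yaml
- !<filterTokenByRegex>
  jsonPaths:
    - "$.items[*].description"
  # split the description into tokens on spaces
  delimiter: " "
  # keep only tokens that look like Zoom URLs; all other tokens are removed
  filters:
    - "https://[^/]+\\.zoom\\.us/.*"
```

Additional entries under filters would widen what is preserved, since a token only needs to match one of them.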

Response Schema Specification

A "response schema" is a "JSON Schema Filter" structure, specifying how a response (which must be JSON) should be filtered. Using this, you can implement a "default deny" approach to sanitizing API fields, in a manner that may be more convenient than using JSON paths with conditional negations (a redact transform with a JSON path that matches all but an explicit list of named fields is the other approach to implementing 'default deny' style rules).

Our "JSON Schema Filter" implementation attempts to align to the JSON Schema specification, with some variation as it is intended for filtering rather than validation. But generally speaking, you should be able to copy the JSON Schema for an API endpoint from its OpenAPI specification as a starting point for the responseSchema value in your rule set. Similarly, there are tools that can generate JSON Schema from example JSON content, as well as from data models in various languages, that may be useful.

See: https://json-schema.org/implementations.html#schema-generators

If a responseSchema attribute is specified for an endpoint, the response content will be filtered (rather than validated) against that schema. That is, fields NOT specified in the schema, or not of the expected type, will be removed from the response.

  • type - one of :
    • object a JSON object
    • array a JSON array
    • string a JSON string
    • number a JSON number, either integer or decimal.
    • integer a JSON integer (not a decimal)
    • boolean a JSON boolean
  • properties - for type == object, a map of field names to schema to filter field's value against (eg, another JsonSchemaFilter for the field itself)
  • items - for type == array, a schema to filter each item in the array against (again, a JsonSchemaFilter)
  • format - for type == string, a format to expect for the string value. As of v0.4.38, this is not enforced by the proxy.
  • $ref - a reference to a schema specified in the definitions property of the root schema.
  • definitions - a map of schema names to schemas of type JsonSchemaFilter; only supported at root schema of endpoint.

Example:

The following is for a User from the GitHub API, via GraphQL. See: https://docs.github.com/en/graphql/guides/forming-calls-with-graphql#the-graphql-endpoint

- pathTemplate: "/graphql"
  responseSchema:
    type: "object"
    properties:
      data:
        type: "object"
        properties:
          organization:
            type: "object"
            properties:
              samlIdentityProvider:
                type: "object"
                properties:
                  externalIdentities:
                    type: "object"
                    properties:
                      pageInfo:
                        type: "object"
                        properties:
                          hasNextPage:
                            type: "boolean"
                          endCursor:
                            type: "string"
                      edges:
                        type: "array"
                        items:
                          type: "object"
                          properties:
                            node:
                              type: "object"
                              properties:
                                guid:
                                  type: "string"
                                samlIdentity:
                                  type: "object"
                                  properties:
                                    nameId:
                                      type: "string"
                                user:
                                  type: "object"
                                  properties:
                                    login:
                                      type: "string"
              membersWithRole:
                type: "object"
                properties:
                  pageInfo:
                    type: "object"
                    properties:
                      hasNextPage:
                        type: "boolean"
                      endCursor:
                        type: "string"
                  edges:
                    type: "array"
                    items:
                      type: "object"
                      properties:
                        node:
                          type: "object"
                          properties:
                            email:
                              type: "string"
                            login:
                              type: "string"
                            id:
                              type: "string"
                            isSiteAdmin:
                              type: "boolean"
                            organizationVerifiedDomainEmails:
                              type: "array"
                              items:
                                type: "string"
      errors:
        type: "array"
        items:
          type: "object"
          properties:
            type:
              type: "string"
            path:
              type: "array"
              items:
                type: "string"
            locations:
              type: "array"
              items:
                type: "object"
                properties:
                  line:
                    type: "integer"
                  column:
                    type: "integer"
            message:
              type: "string"
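The example above does not exercise $ref or definitions. A minimal sketch of how they might be used follows, assuming references take the "#/definitions/<name>" form that is conventional in JSON Schema. The path and field names are hypothetical.

```yaml
- pathTemplate: "/v1/teams"
  responseSchema:
    type: "object"
    properties:
      lead:
        # reuse the User schema defined at the root
        $ref: "#/definitions/User"
      members:
        type: "array"
        items:
          $ref: "#/definitions/User"
    # definitions are only supported at the root schema of the endpoint
    definitions:
      User:
        type: "object"
        properties:
          id:
            type: "string"
          login:
            type: "string"
```

Any field of a user object other than id and login would be removed from the response under this schema.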