azure-apim-management-multi-tenancy-monitoring

Requirements

Support Application Insights for 500 different products. Each product has a varying number of APIs. An application team can manage multiple products as a portfolio.
Application teams regularly work within their own products but need the ability to run cross-product queries.
Consolidate Log Analytics workspaces and use them as the backing store for Application Insights.
Consolidate Application Insights resources where appropriate.
Support cross-workspace and cross-Application Insights queries for reporting functions, such as API usage metrics and guidance that helps app teams improve API quality.
Reduce business risk for high-business-impact applications such as HR or Finance.
Charge back to product teams.
RBAC assignments are made at the resource group level.
Integrate with API Management as the source of metrics for Application Insights.
No plans to use AMPLS.

How to Deploy Locally

This demo project is built with Terraform and contains four configurations. It assumes that you already have an APIM instance in your tenant.

AAD Entities: Deploys Azure Active Directory entities for tenants, for example service principals and AAD groups.
APIM-Monitoring: Deploys global alerts and an action group for the entire APIM instance. It also creates a storage account for API reference data.
Tenants: Deploys tenants (tenant-a and tenant-b).
APIS: Deploys APIs under tenants and products.

Follow these links for instructions:

AAD Entities for teams
APIM Global Monitoring Resources
APIM Teams
APIs

Tenancy Model

A single APIM instance is used across the enterprise and hosts many APIs from different teams. The implementation becomes straightforward if we treat API developers as tenants on the APIM platform. A tenant is a group of users that shares one or more APIs without any access boundary. A tenant consists of:

APIM Product
APIM Group
APIs
APIM Policies
Azure Resource Group
AAD RBAC
- AAD Group
- AAD Service Principal Accounts
Azure Monitor
- Application Insights
- Alerts
- Action Group
- Alerts Processing Rules

Two types of Alerts

To minimize the number of alerts, we use multi-resource and multi-series alerts, which can target many resources and dimensions at scale. These alerts are owned by the platform team and handled by a central action group that plays a broadcast role.

Tenant alerts:

Tenant alert behaviour:

- Targets one or several APIs belonging to the same tenant.
- Requires a tenant action group.
- The number of alerts grows as the number of APIs increases.
- Can be tagged and billed to a tenant team easily.

Platform alerts

Platform alert behaviour:

- Targets the logs of all APIs.
- Uses the platform action group.
- Requires an action that can filter and route alerts to the appropriate tenants.
- The number of alerts stays the same as the number of APIs increases.
- Requires additional logic for tenant chargeback.
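
The filter-and-route action that platform alerts need can be sketched as follows. This is a minimal illustration, not the repository's implementation: the `API_OWNERS` mapping, the API names, and the shape of the alert input are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from API name to owning tenant; not taken from the repository.
API_OWNERS = {
    "orders-api": "tenant-a",
    "invoices-api": "tenant-a",
    "payroll-api": "tenant-b",
}


def route_platform_alert(fired_apis, api_owners=API_OWNERS, fallback="platform"):
    """Group the APIs named in one platform alert by the tenant that owns them.

    fired_apis: API names read from the alert's dimensions (assumed input shape).
    Returns {owner: [api, ...]}; APIs without a known owner go to the fallback.
    """
    routed = defaultdict(list)
    for api in fired_apis:
        routed[api_owners.get(api, fallback)].append(api)
    return dict(routed)


notifications = route_platform_alert(["orders-api", "payroll-api", "unknown-api"])
```

The same owner lookup could also feed the extra chargeback logic, since it attributes each fired alert to a tenant.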

Tenant RBAC Configuration

Each tenant needs three types of access:

AAD tenant group: Used to access Azure Monitor logs in the Azure portal. No individual user is given direct access to a tenant RG; users must be added to the group.
AAD Service Principal Accounts: Used to access Azure Monitor logs programmatically from external systems such as Grafana. To query tenant logs programmatically, a tenant service principal needs:
- Reader on the Log Analytics Workspace. The Reader role does not allow log data access; it is needed only to list the LAW.
- Log Analytics Reader on the tenant Application Insights resource or on the entire tenant RG.
AAD Managed Identities: Used when logs need to be read from within Azure. Managed identities are easier to use than service principals.
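
As a rough sketch, the two role assignments for a tenant service principal can be expressed as data before they are handed to whatever tool applies them. The function name, parameters, and dictionary shape are assumptions for illustration. The scope strings follow the standard Azure resource ID layout.

```python
def tenant_sp_role_assignments(subscription_id, tenant_rg, law_rg, law_name, principal_id):
    """Return the two role assignments described above as plain dictionaries."""
    law_scope = (
        f"/subscriptions/{subscription_id}/resourceGroups/{law_rg}"
        f"/providers/Microsoft.OperationalInsights/workspaces/{law_name}"
    )
    tenant_scope = f"/subscriptions/{subscription_id}/resourceGroups/{tenant_rg}"
    return [
        # Reader is only there so the principal can list the workspace.
        {"principal_id": principal_id, "role": "Reader", "scope": law_scope},
        # Log Analytics Reader on the tenant RG covers the tenant's Application Insights.
        {"principal_id": principal_id, "role": "Log Analytics Reader", "scope": tenant_scope},
    ]


# Placeholder values, not real identifiers.
assignments = tenant_sp_role_assignments(
    "00000000-0000-0000-0000-000000000000", "rg-tenant-a", "rg-monitoring", "law-shared", "sp-tenant-a"
)
```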

Other Recommendations

#	Category	Recommendation	Benefits
1	Scale	Use one Application Insights resource per product. Each product has multiple APIs, and each API can be tagged with a Cloud Role to differentiate its metrics.	Improves the ability to correlate API integrations for troubleshooting. Reduces the number of Application Insights resources each product team has to manage. Product teams manage their own alerts, alert processing rules, and ServiceNow routing. New APIs are faster to build and integrate into existing monitoring and troubleshooting runbooks. Tiered discounts apply based on data ingestion.
2	Scale	Start with default metric alerts that form a baseline for all APIs in a product. Teams build their own per-API alerts when required.	Fewer alerts and alert processing rules. Flexibility to consolidate alerts across a product's APIs. For example, set the response time threshold for all APIs in a product to 20 seconds; a per-API alert is not needed for that.
3	Security	Control RBAC for each Application Insights resource through security groups. Users can be managed through AD and Azure AD, following the customer's existing IdP approach.	Improved security, access management, and audit. Members are enrolled and de-enrolled through existing IdP processes. Each product can be classified based on risk (e.g., Finance or HR), and access can be restricted based on scope.
4	Security	Use resource-context permissions for the Log Analytics Workspace.	Users view logs only for the resources they have access to, across all tables. Queries in this mode are scoped to the data associated with that resource.
5	Billing	Tags based on product team cost center.	Charge back to product teams based on their usage.
6	Billing	Set workspace-level and/or table-specific data retention to control retention, cost, and availability.	Reduces cost by setting appropriate data retention requirements. Data retention can be customized per table when required. Interactive retention period of up to 2 years. Total retention period of up to 7 years.
7	Reporting	Cross-resource queries for reporting.	Cross-workspace query across up to 100 Log Analytics workspaces and Application Insights.
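
For the cross-resource reporting in recommendation 7, a query can be assembled with the KQL `app()` expression. The sketch below builds the query text only. The function name, the Application Insights resource names, and the choice of the `requests` table with a per-role count are illustrative assumptions, and resource names are assumed not to contain quotes.

```python
MAX_RESOURCES = 100  # cross-resource limit quoted in the table above


def cross_app_requests_query(app_names, timespan="1d"):
    """Build a KQL query that unions the requests table of several Application Insights resources."""
    if not app_names:
        raise ValueError("at least one Application Insights resource is required")
    if len(app_names) > MAX_RESOURCES:
        raise ValueError(f"a cross-resource query can span at most {MAX_RESOURCES} resources")
    sources = ", ".join(f"app('{name}').requests" for name in app_names)
    return (
        f"union {sources}\n"
        f"| where timestamp > ago({timespan})\n"
        "| summarize RequestCount = count() by cloud_RoleName"
    )


# Hypothetical resource names for two tenants.
query = cross_app_requests_query(["appi-tenant-a", "appi-tenant-b"])
```

Rejecting oversized lists up front keeps a portfolio-wide report from failing at query time once a team owns more than 100 products.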

Appendix

Appendix A: Application Insights Design Considerations

One vs. Many Application Insights

Use a single Application Insights resource in the following cases:
For application components that are deployed together. These applications are usually developed by a single team and managed by the same set of DevOps/ITOps users.
If it makes sense to aggregate key performance indicators, such as response durations or failure rates in a dashboard, across all of them by default. You can choose to segment by role name in the metrics explorer.
If there's no need to manage Azure role-based access control differently between the application components.
If you don't need metrics alert criteria that are different between the components.
If you don't need to manage continuous exports differently between the components.
If you don't need to manage billing/quotas differently between the components.
If it's okay for an API key to have the same access to data from all components, and 10 API keys are sufficient for the needs across all of them.
If it's okay to have the same smart detection and work item integration settings across all roles.

Per-API metrics & logs in shared Application Insights

You might need to add custom code to ensure that meaningful values are set in the cloud_RoleName attribute. Without meaningful values set for this attribute, none of the portal experiences will work.
- On the Application Map, individual components of the application are determined by their "roleName" or "name" property in recorded telemetry. These components are represented as circles on the map and are referred to as "nodes." HTTP calls between nodes are represented as arrows connecting these nodes, referred to as "connectors" or "edges." The node that makes the call is the "source" of the call, and the receiving node is the "target" of the call.
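
To illustrate why the role name matters in a shared Application Insights resource, the sketch below segments request telemetry by `cloud_RoleName`. The record shape (`success`, `duration_ms`) and the sample values are hypothetical. Records without a role name fall into a single "(unset)" bucket, which is what makes per-API views unusable when the attribute is left empty.

```python
from collections import defaultdict


def summarize_by_role(requests):
    """Aggregate request telemetry per cloud role name (one role per API)."""
    totals = defaultdict(lambda: {"count": 0, "failed": 0, "duration_ms": 0.0})
    for item in requests:
        role = item.get("cloud_RoleName") or "(unset)"
        bucket = totals[role]
        bucket["count"] += 1
        bucket["failed"] += 0 if item["success"] else 1
        bucket["duration_ms"] += item["duration_ms"]
    return {
        role: {
            "count": b["count"],
            "failure_rate": b["failed"] / b["count"],
            "avg_duration_ms": b["duration_ms"] / b["count"],
        }
        for role, b in totals.items()
    }


# Hypothetical telemetry records.
sample = [
    {"cloud_RoleName": "orders-api", "success": True, "duration_ms": 120.0},
    {"cloud_RoleName": "orders-api", "success": False, "duration_ms": 480.0},
    {"cloud_RoleName": "", "success": True, "duration_ms": 50.0},
]
summary = summarize_by_role(sample)
```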

Appendix B: Log Analytics Workspace Design Considerations

One vs. Many Workspaces

Many customers create separate workspaces for operational and security data, both for data ownership reasons and because of the extra cost from Microsoft Sentinel. In some cases, you might be able to save costs by consolidating into a single workspace to qualify for a commitment tier.
Each workspace resides in a particular Azure region. You might have regulatory or compliance requirements to store data in specific locations.
You can set different retention settings for each table in a workspace. You need a separate workspace if you require different retention settings for different resources that send data to the same tables.

Cost

When ingesting 500 GB or more per day across all resources, use a dedicated cluster and set a commitment tier.
When ingesting 100 GB or more per day across all resources, consider combining into one workspace and set a commitment tier.
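
The two thresholds above can be written as a small decision helper. This is only a sketch of the rule of thumb; the function name and return labels are made up, and the result for volumes below 100 GB per day is an assumption because the guidance above does not cover that case.

```python
def ingestion_recommendation(total_gb_per_day):
    """Map total daily ingestion across all resources to the cost guidance above."""
    if total_gb_per_day < 0:
        raise ValueError("ingestion volume cannot be negative")
    if total_gb_per_day >= 500:
        return "dedicated cluster with commitment tier"
    if total_gb_per_day >= 100:
        return "single workspace with commitment tier"
    # Assumption: below 100 GB/day the guidance gives no recommendation.
    return "no recommendation"


recommendation = ingestion_recommendation(650)
```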

RBAC

Resource-context RBAC - if a user has read access to an Azure resource, they inherit permissions to any of that resource's monitoring data sent to the workspace. This level of access allows users to access information about resources they manage without being granted explicit access to the workspace.
Table level RBAC - grant or deny access to specific tables in the workspace. In this way, you can implement granular permissions required for specific situations in your environment.

Dedicated Cluster

Customer managed keys, double encryption
Cross-workspace queries run faster
Availability Zones (East US 2, West US 2)
Cost optimization through commitment tier (500, 1000, 2000 or 5000 GB/day)
Link up to 1000 workspaces per cluster.
Migrate existing Log Analytics Workspaces to a dedicated cluster: When a Log Analytics workspace is linked to a dedicated cluster, new data ingested to the workspace is routed to the new cluster while existing data remains on the existing cluster. If the dedicated cluster is encrypted using customer-managed keys (CMK), only new data is encrypted with the key. The system abstracts this difference, so you can query the workspace as usual while the system performs cross-cluster queries in the background.

Log Ingestion

Typical latency to ingest log data is between 20 seconds and 3 minutes. Three timestamps mark the stages of ingestion:
- TimeGenerated (record created at data source)
- _TimeReceived (record received by Azure Monitor ingestion endpoint)
- ingestion_time() (record stored in workspace and available for queries)
Reference: Log data ingestion time in Azure Monitor
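
Given the three timestamps for one record, the latency of each stage is a plain subtraction. The sketch below shows the arithmetic. The function name, the stage labels, and the sample timestamps are illustrative, not measured values.

```python
from datetime import datetime, timezone


def ingestion_latency(time_generated, time_received, ingestion_time):
    """Split end-to-end ingestion latency into its two stages, in seconds."""
    return {
        # Time between creation at the source and arrival at the ingestion endpoint.
        "source_to_endpoint": (time_received - time_generated).total_seconds(),
        # Time between arrival at the endpoint and availability for queries.
        "endpoint_to_workspace": (ingestion_time - time_received).total_seconds(),
        "end_to_end": (ingestion_time - time_generated).total_seconds(),
    }


# Illustrative timestamps for a single record.
latency = ingestion_latency(
    datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 12, 0, 15, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 12, 1, 5, tzinfo=timezone.utc),
)
```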

Data Retention / Archive

Archiving lets you keep older, less used data in your workspace at a reduced cost.
Archived data stays in the same table, alongside the data that's available for interactive queries. When you set a total retention period that's longer than the interactive retention period, Log Analytics automatically archives the relevant data at the end of the interactive retention period.
If you change the archive settings on a table with existing data, the relevant data in the table is also affected immediately.
Interactive Retention period – up to 2 years
Total retention period – up to 7 years
Archive period – (total retention period minus interactive retention period)
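
The archive period formula can be checked with a short helper. The day counts used for the two limits (730 days for 2 years, 2,556 days for 7 years) are assumptions of this sketch, as is the function name.

```python
MAX_INTERACTIVE_DAYS = 730  # "up to 2 years", expressed in days (assumption)
MAX_TOTAL_DAYS = 2556       # "up to 7 years", expressed in days (assumption)


def archive_days(interactive_days, total_days):
    """Archive period = total retention period minus interactive retention period."""
    if not 0 < interactive_days <= MAX_INTERACTIVE_DAYS:
        raise ValueError("interactive retention is outside the supported range")
    if not interactive_days <= total_days <= MAX_TOTAL_DAYS:
        raise ValueError("total retention must be between the interactive period and the maximum")
    return total_days - interactive_days


# 90 days interactive and 365 days total leave 275 days in the archive.
archived = archive_days(90, 365)
```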

Restore Logs

The restore operation creates the restore table and allocates additional compute resources for querying the restored data using high-performance queries that support full KQL.
Restored logs do not have an explicit retention policy. They must be explicitly dismissed through the REST API or CLI.
Limits:
- Restore data for a minimum of two days.
- Restore up to 60 TB.
- Perform up to four restores per workspace per week.
- Run up to two restore processes in a workspace concurrently.
- Run only one active restore on a specific table at a given time. Executing a second restore on a table that already has an active restore will fail.
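
These limits lend themselves to a pre-flight check before a restore is requested. The sketch below is one way to encode them. The `RestoreRequest` type, its fields, and the way current workspace state is passed in are assumptions, not part of any Azure SDK.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RestoreRequest:
    table: str
    start: datetime
    end: datetime
    size_tb: float


def restore_violations(request, active_tables, restores_this_week):
    """Return the limits a restore request would break (empty list if none).

    active_tables: tables in the workspace that currently have an active restore.
    restores_this_week: restores already performed in the workspace this week.
    """
    problems = []
    if request.end - request.start < timedelta(days=2):
        problems.append("time range is shorter than two days")
    if request.size_tb > 60:
        problems.append("more than 60 TB requested")
    if restores_this_week >= 4:
        problems.append("weekly limit of four restores reached")
    if len(active_tables) >= 2:
        problems.append("two restores are already running in the workspace")
    if request.table in active_tables:
        problems.append("table already has an active restore")
    return problems


# A one-day restore on a table that is already being restored breaks two limits.
request = RestoreRequest("AppRequests", datetime(2024, 1, 1), datetime(2024, 1, 2), 1.5)
problems = restore_violations(request, active_tables={"AppRequests"}, restores_this_week=1)
```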

Appendix C: AMPLS Design Considerations

Limits

A virtual network can only connect to one AMPLS object. That means the AMPLS object must provide access to all the Azure Monitor resources the virtual network should have access to.
An AMPLS object can connect to 300 Log Analytics workspaces and 1000 Application Insights components at most.
An Azure Monitor resource (Workspace, Application Insights component, or Data Collection Endpoint) can connect to five AMPLS objects at most.

Network Design

AMPLS requires at least 11 IP addresses. The smallest supported IPv4 subnet is /27, which provides 27 allocatable IPs.
Regional endpoints will require additional IP addresses. For example, Application Insights uses regional endpoints (e.g., eastus-8.in.applicationinsights.azure.com and japanwest-0.in.ai.monitor.azure.com). Each endpoint would be mapped to a private IP address. Careful planning on regional expansion is required to ensure enough space in the subnet.
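
The subnet arithmetic can be reproduced with the standard `ipaddress` module: a /27 holds 32 addresses, Azure reserves five in every subnet, and 27 remain. The sketch below also subtracts the AMPLS minimum and one address per regional endpoint. Treating regional endpoints as extra to the 11-address minimum is an assumption of this sketch, and the address range is a placeholder.

```python
import ipaddress

AZURE_RESERVED_IPS = 5  # addresses Azure reserves in every subnet
AMPLS_MINIMUM_IPS = 11  # minimum quoted above


def subnet_headroom(cidr, regional_endpoints=0):
    """Return (allocatable IPs, IPs left after AMPLS and regional endpoints)."""
    network = ipaddress.ip_network(cidr)
    allocatable = network.num_addresses - AZURE_RESERVED_IPS
    # Assumption: each regional endpoint needs one private IP on top of the minimum.
    required = AMPLS_MINIMUM_IPS + regional_endpoints
    return allocatable, allocatable - required


allocatable, spare = subnet_headroom("10.0.0.0/27", regional_endpoints=2)
```

A negative second value would indicate that the subnet is too small for the planned regional expansion.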

DNS

Use a single AMPLS for all networks that share the same DNS. Because endpoints are shared, a Private Link setup affects not only the network connected to the Private Endpoint but also all other networks sharing the same DNS. When multiple AMPLS objects are added to virtual networks that use the same DNS, the last update takes precedence.
Creating a Private Link affects traffic to all monitoring resources, not only the resources in your AMPLS. Effectively, it will cause all query requests as well as ingestion to Application Insights components to go through private IPs. However, it does not mean the Private Link validation applies to all these requests. Resources not added to the AMPLS can only be reached if the AMPLS access mode is 'Open' and the target resource accepts traffic from public networks.

Diagnostic Settings

Logs and metrics uploaded to a workspace via Diagnostic Settings go over a secure private Microsoft channel and are not controlled by AMPLS settings.

Network Isolation

Private Only - allows the virtual network to reach only Private Link resources (resources in the AMPLS). That's the most secure mode of work, preventing data exfiltration. To achieve that, traffic to Azure Monitor resources out of the AMPLS is blocked.
Open - allows the virtual network to reach both Private Link resources and resources not in the AMPLS (if they accept traffic from public networks). While the Open access mode doesn't prevent data exfiltration, it still offers the other benefits of Private Links - traffic to Private Link resources is sent through private endpoints, validated, and sent over the Microsoft backbone. The Open mode is useful for a mixed mode of work (accessing some resources publicly and others over a Private Link), or during a gradual onboarding process.
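
The two access modes reduce to a short reachability rule, sketched below. The function name and the mode labels "PrivateOnly" and "Open" are used here only as illustrative identifiers for the two modes described above.

```python
def can_reach(resource_in_ampls, access_mode, accepts_public_traffic):
    """Decide whether a virtual network connected to an AMPLS can reach a resource."""
    if access_mode not in ("PrivateOnly", "Open"):
        raise ValueError(f"unknown access mode: {access_mode}")
    if resource_in_ampls:
        # Private Link resources are reachable in both modes.
        return True
    # Resources outside the AMPLS need Open mode and must accept public traffic.
    return access_mode == "Open" and accepts_public_traffic


# A workspace outside the AMPLS is blocked in Private Only mode.
blocked = not can_reach(False, "PrivateOnly", accepts_public_traffic=True)
```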

Design with Hub + Spoke

Azure Resource Manager

Configuration changes, including turning AMPLS access settings on or off, are managed by Azure Resource Manager. To control these settings, you should restrict access to resources using the appropriate roles, permissions, network controls, and auditing.
Queries sent through the Azure Resource Manager (ARM) API can't use Azure Monitor Private Links. These queries can only go through if the target resource allows queries from public networks.
