Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid queue telemetry using hostname as telemetry service name #14075

Closed
wants to merge 1 commit into from

Conversation

txomon
Copy link

@txomon txomon commented Jun 7, 2023

Proposed Changes

  • Currently traces coming from the queue component will be labeled as a service with the hostname as name, which means that every time a pod is recycled, monitoring tools will show a new host. This PR makes sure to give an unique consistent name to all the queue components of a knative service.
  • Traces coming from queue will come as namespace-service-queue

Release Note

* Have a common opencensus tracing service name for all queue components of a service

@knative-prow
Copy link

knative-prow bot commented Jun 7, 2023

Welcome @txomon! It looks like this is your first PR to knative/serving 🎉

@knative-prow knative-prow bot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 7, 2023
@knative-prow
Copy link

knative-prow bot commented Jun 7, 2023

Hi @txomon. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow knative-prow bot requested a review from krsna-m June 7, 2023 08:41
@knative-prow
Copy link

knative-prow bot commented Jun 7, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: txomon
Once this PR has been reviewed and has the lgtm label, please assign kvmware for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot requested a review from ReToCode June 7, 2023 08:41
@skonto
Copy link
Contributor

skonto commented Jun 14, 2023

Hi @txomon thanks for contributing to the project! I think that this could break some tooling outhere so I am wondering if we could make it configurable. Another thought is if it makes sense to group per revision instead as we do with metrics.

@txomon
Copy link
Author

txomon commented Jun 14, 2023

Hello @skonto I agree with the assessment that it could break some toolkit, however I understand people are dynamically having to match against a specific pattern service-name-* and running into issues when there are services with the same name in different namespaces within the same cluster.

This is adding uniformity to the same way the activator service is behaving. We are currently running in production with this and works like a charm, as we were able to avoid going dark during deployments (the lag between when a service is being updated and the update to our monitoring systems).

Do you think it would make sense to maybe start thinking on having a breaking change further down the line (maybe for the next stable version) and maintain it with a configuration override until then?

@@ -189,7 +189,8 @@ func Main(opts ...Option) error {
d.Transport = buildTransport(env)

if env.TracingConfigBackend != tracingconfig.None {
oct := tracing.NewOpenCensusTracer(tracing.WithExporterFull(env.ServingPod, env.ServingPodIP, logger))
tracingName := fmt.Sprintf("%s-%s-queue", env.ServingNamespace, env.ServingService)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not all revisions have a parent Knative Service.

Users can create configurations and then point routes to them.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, what do you think should be used in this case? Should we default to the default string if any of these two fields are not set?

@dprotaso
Copy link
Member

dprotaso commented Jul 7, 2023

/hold

I think we should hold this for now due to the upcoming release coming.

I agree with @skonto about introducing this as a config option and then having users opt-in and then over a few releases switch the default.

Though this change might be more timely to do when we do a cut over to using OpenTelemetry Go SDK (almost GA - https://github.com/orgs/open-telemetry/projects/34) and other metrics related name changes (knative/pkg#2174)

@knative-prow knative-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 7, 2023
@txomon
Copy link
Author

txomon commented Jul 7, 2023

Though this change might be more timely to do when we do a cut over to using OpenTelemetry Go SDK (almost GA - https://github.com/orgs/open-telemetry/projects/34) and other metrics related name changes (knative/pkg#2174)

Using otel would probably fix our problems, as we are able to override certain parameters in the otel collector. Just in case, the focus of this change would be to give a name to this part of the request path so that it can be accounted for. Overriding the whole service name in a per service release does seem appropriate to me, however I'm sure I'm missing out on some use cases.

How do you think we could deal with the problem?

@dprotaso
Copy link
Member

Honestly I don't know - I think it's probably best we capture your requirement when we migrate to OTel - I did that here

I feel like the PR would break existing integrations so I'm going to close this out

As a workaround you could make the changes in your fork and then change the sidecar image that is used via this knob -

queue-sidecar-image: ko://knative.dev/serving/cmd/queue

@dprotaso dprotaso closed this Sep 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/autoscale area/networking do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants