
Support for OpenTelemetry #32866

Open
TheFox0x7 opened this issue Dec 16, 2024 · 24 comments
Labels
type/proposal The new feature has not been accepted yet but needs to be discussed first.

Comments

@TheFox0x7
Contributor

Feature Description

OpenTelemetry is an open standard designed to unify the signals coming from an application. It allows tracing requests and operations, exporting application logs, and gathering and exporting Prometheus-style metrics.

Potentially it would allow replacing Prometheus in favor of OpenTelemetry (the Go library does have a setting for a Prometheus-style endpoint, though I've yet to test it) and give us the option to find long-running operations more quickly with tracing.

I'd like to work on this (though it might take a while), and I have a "demo" of such an integration here, with only tracing support.

Screenshots

No response

@TheFox0x7 TheFox0x7 added the type/proposal The new feature has not been accepted yet but needs to be discussed first. label Dec 16, 2024
@lunny lunny added the proposal/accepted We have reviewed the proposal and agree that it should be implemented like that/at all. label Dec 16, 2024
@lunny
Member

lunny commented Dec 16, 2024

Please have more discussion here before proposing a pull request.

Traditionally, I would prefer to collect logs from the log file on disk rather than sending logs to a remote endpoint directly. I haven't read all of the OpenTelemetry documentation yet, but I want to know how it works for logs.

@wxiaoguang wxiaoguang removed the proposal/accepted We have reviewed the proposal and agree that it should be implemented like that/at all. label Dec 17, 2024
@wxiaoguang
Contributor

wxiaoguang commented Dec 17, 2024

I think it needs more discussion.

If there is a chance, I'd like to fully decouple this third-party dependency to avoid bloating the Gitea binary (it is large enough already, and it is executed by every git operation ...).

There are various "tracing" and similar tools, and there will be more. If we add this one, should we add the others as well?

So before starting the work, I'd like to clarify:

  1. Does it really help to trace the useful metrics and resolve real-world problems? Maybe we should have a roadmap first.
    • meter: use it to replace the existing modules/metrics/collector.go (Prometheus)? I don't see why we should keep different, duplicate collectors side by side.
    • tracer: I agree that Gitea needs better performance tracing, but the question is how to introduce it. Gitea has always been an all-in-one app; in most cases we need end users to report their performance problems directly, not ask them to set up an OpenTelemetry stack to collect a performance profile ....
    • logger: well, maybe Gitea needs to refactor its logger system first? Or just output JSON logs?
  2. Is it possible to decouple the dependency, for example by making Gitea just output JSON logs and using a log collector to send them to other tools?

@TheFox0x7
Contributor Author

For starters, I hoped for some discussion before continuing forward. I have a few outstanding issues (mainly how to handle SSH connections and how to register the exporters with the manager, because there's probably a much better solution for that) which I want to discuss.


First, keep in mind that all three components are optional by design: you can have no-op versions of all three with zero code changes.
As for your questions:

Traditionally, I would prefer to collect logs from the log file on disk rather than sending logs to a remote endpoint directly. I haven't read all of the OpenTelemetry documentation yet, but I want to know how it works for logs.

I haven't connected logs to it yet in any app besides Python, but in general it doesn't stop you from also writing logs to disk, to stdout/stderr, or to syslog. What it provides is a vendor-independent way to export logs (which it likewise gathers on the fly) to systems supporting the OpenTelemetry Protocol (OTLP), via gRPC or HTTP.
Logging in the Go SDK is still at an experimental stage.
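To give a rough idea only: this is a sketch based on the experimental packages, not something I have running, and the otlploghttp/otelslog package names and options may shift while the API is unstable. Hooking slog into OTLP export looks roughly like this:

package main

import (
	"context"
	"log/slog"

	"go.opentelemetry.io/contrib/bridges/otelslog"
	"go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploghttp"
	sdklog "go.opentelemetry.io/otel/sdk/log"
)

func main() {
	ctx := context.Background()

	// Export log records over OTLP/HTTP (localhost:4318 unless configured otherwise).
	exp, err := otlploghttp.New(ctx)
	if err != nil {
		panic(err)
	}

	provider := sdklog.NewLoggerProvider(
		sdklog.WithProcessor(sdklog.NewBatchProcessor(exp)),
	)
	defer provider.Shutdown(ctx)

	// Bridge slog onto the provider; existing file/stdout logging stays untouched.
	logger := slog.New(otelslog.NewHandler("gitea", otelslog.WithLoggerProvider(provider)))
	logger.Info("hello from the OTLP log pipeline")
}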

There are various "tracing" and similar tools, and there will be more. If we add this one, should we add the others as well?

The idea behind OpenTelemetry is to provide a single standard for signals, so adding more shouldn't be necessary.

Does it really help to trace the useful metrics and resolve real-world problems? Maybe we should have a roadmap first.

A roadmap would probably be useful, yes. As for the usefulness, it may vary from case to case, but I'll refer you to Grafana's blog post about it, which I think shows pretty well why it can be useful.

meter: use it to replace the existing modules/metrics/collector.go (Prometheus)? I don't see why we should keep different, duplicate collectors side by side.

That's the idea, and it shouldn't be a breaking change, since the Go SDK supports Prometheus-style export in addition to OTLP.
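For illustration only, a minimal sketch of that wiring (not taken from my branch; the registry and port are placeholders): the SDK's Prometheus exporter plugs in as a reader for the meter provider, so the familiar scrape endpoint keeps working:

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel"
	otelprom "go.opentelemetry.io/otel/exporters/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// A dedicated registry keeps the OTel-generated metrics separate.
	registry := prometheus.NewRegistry()

	// The exporter implements the SDK's Reader and writes into the Prometheus registry.
	exporter, err := otelprom.New(otelprom.WithRegisterer(registry))
	if err != nil {
		panic(err)
	}

	provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter))
	otel.SetMeterProvider(provider)

	// The usual scrape endpoint keeps working; no OTLP push is required.
	http.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
	_ = http.ListenAndServe(":3000", nil)
}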

tracer: I agree that Gitea needs better performance tracing, but the question is how to introduce it. Gitea has always been an all-in-one app; in most cases we need end users to report their performance problems directly, not ask them to set up an OpenTelemetry stack to collect a performance profile ....

Tracing is mainly for operations (cross-app or in-app). For example, if you select the branch view and it takes a while to load, you might be interested in why: was it the template taking unusually long, git being slow for some reason, nothing being cached, or the database taking a while to respond?
With traces you can instrument all of those parts and get a timeline with metadata, similar to a Gantt chart, to see where the issue was.
I don't mean to introduce it as a mandatory step for reporting a performance issue. I just think that if someone has such a setup, it would be easier for them to look at the trace durations and point at, say, a DB query as the thing that took the longest (say 2 seconds) despite normally being quick (0.2 seconds).
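As a purely illustrative sketch (the handler and span names here are made up, not Gitea's real functions), the nesting that produces that chart is just child spans started from the request context:

package web

import (
	"context"

	"go.opentelemetry.io/otel"
)

var tracer = otel.Tracer("gitea")

func BranchesView(ctx context.Context) error {
	ctx, span := tracer.Start(ctx, "web.BranchesView") // parent span: the whole page load
	defer span.End()

	if err := loadBranchesFromGit(ctx); err != nil { // child span: git work
		return err
	}
	return renderTemplate(ctx) // child span: template rendering
}

func loadBranchesFromGit(ctx context.Context) error {
	_, span := tracer.Start(ctx, "git.LoadBranches")
	defer span.End()
	// ... run the git command; its duration becomes one bar in the timeline
	return nil
}

func renderTemplate(ctx context.Context) error {
	_, span := tracer.Start(ctx, "templates.Render")
	defer span.End()
	// ... execute the template
	return nil
}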

logger: well, maybe Gitea needs to refactor its logger system first? Or just output JSON logs?

Logging, as mentioned before, is experimental in the Go SDK, and I'm still somewhat lost as to how it actually works (mainly because I've never tried it). Migrating to log/slog would be nice, but that's not something I can do without help, and I'd have to refresh my memory on how logging in the SDK is configured before saying anything for sure.

Is it possible to decouple the dependency, for example by making Gitea just output JSON logs and using a log collector to send them to other tools?

That is basically how the current logs can already be collected (except for the JSON part) if one wants to. So yes, it is possible, with a system very similar to Prometheus. JSON logs for that purpose would be something of interest to me.

@wxiaoguang
Contributor

wxiaoguang commented Dec 17, 2024

Thank you for the details. I also have some experience with app performance tracing (I used to build a web app serving millions of users per day, with QPS in the thousands). I have used some tracing tools and built some tracing frameworks, so I fully agree with the ideas behind OpenTelemetry.

I also used GitLab for a long time. GitLab packs a whole Grafana into their deployment, and they have their own diagnostic tools.

So if I were to design a roadmap for Gitea, I think the key question is how to benefit the most end users if we introduce a new mechanism.

  • Gitea indeed needs some mechanism to help diagnose performance problems
    • I can see there are still many performance-related problems
    • At the moment there is no easy way to figure out the root causes
    • Performance tracing needs a lot of work, for example adding the necessary tracer calls correctly.
  • At the moment, Gitea is not like GitLab
    • GitLab always requires users to set up a lot of services, so it's not surprising that they can pack Grafana into their package.
    • But Gitea is designed to be a lightweight system (almost zero dependencies), so if we introduce the OpenTelemetry dependency but end users can't use it out of the box, I think it departs from Gitea's goals.
    • I do not think it is feasible to tell end users "please install other software to report the performance problem".

Then the question is: how to satisfy these requirements? Gitea needs tracing tools, and end users need to be able to report performance problems conveniently. In my mind, the roadmap could look like this:

  • Refactor Gitea's code first to give it basic "signals" support, and make it possible for end users to report problems without external dependencies (unless we change our minds now and require end users to always deploy Gitea as a complex docker container that contains everything)
  • Add shim interfaces to the "signals" support
  • If some advanced users would like to use OpenTelemetry or something similar, they could use the interfaces to build their own instance from source code.
  • If one day Gitea is no longer lightweight and must be deployed like GitLab, then we can integrate everything into one container.

@TheFox0x7
Contributor Author

I agree it's not feasible to expect someone who wanted a simple git server to set up Prometheus (metrics), Loki/Logstash (logs), and Tempo/Jaeger (traces), plus possibly Pyroscope (pprof). This feature mostly targets people who want an observable application when running multiple components, or who are looking to diagnose such issues.
Collecting traces is fairly simple, but I know that's not helping much.

But I think you're somewhat misunderstanding the idea. The entire thing is optional by design, so you COULD use it, but it's not mandatory. If you want a system that works without any collectors and so on, that could be done by adding an exporter (either a dedicated one or the debug console one).
It's very similar to how metrics work: you can turn it on and scrape it, but you can also leave it off and there's no loss.
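Concretely, the API defaults to a no-op: if no tracer provider is ever registered, instrumented code still runs and simply records nothing, e.g.:

package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel"
)

func main() {
	// No TracerProvider is registered anywhere, so this is the global no-op tracer.
	tracer := otel.Tracer("gitea")

	_, span := tracer.Start(context.Background(), "some-operation")
	defer span.End() // all span methods are safe to call; they just do nothing

	fmt.Println("recording:", span.IsRecording()) // false: nothing is collected or exported
}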

As for the GitLab comparison: the minimal possible deployment won't change. The protocol is already a "shim" of sorts, as you can add custom exporters to it or just ingest the data in your own way.
I don't want to introduce the entire Grafana stack into Gitea or go the GitLab way of trying to add dashboards and metric collection into it.
I'm not sure I understand your concerns properly: are you worried that it's a smaller feature which won't be used by a lot of people and will be hard to take advantage of?


Here's a quick demo of how it works, in case someone wants to see it.
It's just tracing, copy-pasted from the getting-started page and trimmed down a bit. On the first run you should get a number from the dice roll, then a trace JSON should be printed. A second thing you can try is commenting out every part of the setup except the roll, and then seeing that it still works but the entire trace is a no-op.
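For reference, the trimmed-down shape of that demo is roughly the following (a sketch along the lines of the upstream getting-started example, using the stdout exporter; not my exact code):

package main

import (
	"context"
	"fmt"
	"math/rand"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Comment out this setup block and rollDice still works;
	// the span just becomes a no-op instead of being printed as JSON.
	exp, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		panic(err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)

	rollDice(ctx)
}

func rollDice(ctx context.Context) {
	_, span := otel.Tracer("dice").Start(ctx, "roll")
	defer span.End()
	fmt.Println("rolled:", rand.Intn(6)+1)
}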

@wxiaoguang
Contributor

But I think you're somewhat misunderstanding the idea. The entire thing is optional by design, so you COULD use it, but it's not mandatory. If you want a system that works without any collectors and so on, that could be done by adding an exporter (either a dedicated one or the debug console one). It's very similar to how metrics work: you can turn it on and scrape it, but you can also leave it off and there's no loss.

To prevent misunderstanding, let's take some examples.

  • Your proposal: suppose we have OpenTelemetry in Gitea:

    • If an end user encounters performance problems, what should we tell them to do to provide the details?
  • My proposal: do not introduce OpenTelemetry in Gitea at the moment, but only implement the necessary tracing features in Gitea and provide some shim interfaces:

    • If an end user encounters performance problems, they could provide the details out of the box (just one click in the admin panel)
    • If some advanced users need to add OpenTelemetry support, they could use the shim interfaces and build their own instance from source code.

@TheFox0x7
Contributor Author

Okay, I get it now, thanks a lot. I hadn't considered an embedded use case.

In my case it would be slightly more complicated:

  1. start a Jaeger container (podman run --rm -p 16686:16686 -p 4317:4317 docker.io/jaegertracing/all-in-one:latest would cover the minimal version)
  2. configure Gitea to push to it (about one config change, but it depends on how the config will be handled)
  3. probably do a full restart
  4. send the relevant findings
    which is far from the ease of having it available from the admin panel.

On the other hand, I have no clue what that would look like. It's not like one could (to my knowledge) enable it from the admin panel and then get one particular endpoint traced end to end without the propagation that the standard does provide. Also, I think that making an interface for it would be harder to maintain and integrate with later. So I still think it would be better to have native support and figure out a way to make it simple to configure and gather as intended, or to write a custom exporter that covers the use case you're talking about.

Regardless, it is an interesting use case, I'll admit.
I thought about having an OTel collector embedded in Gitea to use as reporting for Actions, since CI monitoring is a really nice use case for traces, but I scrapped the idea as it feels more like bloat, and there's the question of storage to figure out. Maybe it would work for this? Throwing it out there in case you find it interesting.


Full disclosure: I'm obviously biased towards having it natively implemented instead of having to shim support onto the base.

@wxiaoguang
Contributor

After 1.23 gets a stable release (maybe in one or two weeks), I think we can start some initial work on the performance tracing. Then we can try to get OpenTelemetry support, either source-code-level optional (built from source code / a build tag), config-level optional (enabled by a config option), or mandatory.

@TheFox0x7
Contributor Author

I would suggest making it a configuration feature. It would always be "on" but default to a no-op.
Making it mandatory is already out, as it needs to be configured; the default localhost:4317 address is pretty useless on its own.

I took a really quick glance at the std runtime/trace version. I think that would indeed cover your use case better, as it lets the user write the trace to a file, and I think it works globally, so you could just add a trigger for it in the admin panel, maybe even an on/off switch.
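A minimal sketch of that std-library flow (the file name is arbitrary); the admin panel switch would essentially wrap trace.Start/trace.Stop around a file:

package main

import (
	"os"
	"runtime/trace"
	"time"
)

func main() {
	// The admin-panel "on" switch would do roughly this ...
	f, err := os.Create("gitea-runtime.trace")
	if err != nil {
		panic(err)
	}
	if err := trace.Start(f); err != nil {
		panic(err)
	}

	// ... the instance keeps serving requests while tracing is active ...
	time.Sleep(2 * time.Second)

	// ... and the "off" switch stops tracing and closes the file, which can then be
	// attached to a report and inspected with `go tool trace`.
	trace.Stop()
	f.Close()
}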

@wxiaoguang
Contributor

wxiaoguang commented Dec 22, 2024

I started some initial work here: https://github.com/wxiaoguang/gitea/commits/support-trace/

Quite simple: it introduces our shim layer and a builtin tracer. The interface is designed to be OpenTelemetry-like.

I will try to make some improvements and get it really working before proposing a PR: the real challenge is how to correctly trace the performance problems and show useful results to end users (especially in the db and git packages). It will maybe take about one week.


Update: PR: Support performance trace #32973

@TheFox0x7
Contributor Author

It looks like it would work with the standard. As long as the semantic conventions are followed for metadata keys/values, and I'm able to replace the tracer with the otel one without breaking things, I don't think I'll have objections.

The DB would probably be best handled in xorm via otel as well (seeing that Grafana uses xorm, they might consider helping there), but semconv exists for the client side, and it might be doable with either session or struct events; I haven't looked that deep.
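As a sketch of the session-events idea (assuming xorm's contexts.Hook carries the context returned from BeforeProcess through to AfterProcess, which is what Gitea's existing slow-query hook relies on; the span name and attribute key are just illustrative):

package db

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
	"xorm.io/xorm/contexts"
)

var tracer = otel.Tracer("gitea/db")

// SpanHook opens one client span per executed statement.
type SpanHook struct{}

func (SpanHook) BeforeProcess(c *contexts.ContextHook) (context.Context, error) {
	ctx, _ := tracer.Start(c.Ctx, "db.query",
		trace.WithSpanKind(trace.SpanKindClient),
		trace.WithAttributes(attribute.String("db.statement", c.SQL)),
	)
	return ctx, nil // xorm stores this context back, so AfterProcess sees the span
}

func (SpanHook) AfterProcess(c *contexts.ContextHook) error {
	span := trace.SpanFromContext(c.Ctx)
	if c.Err != nil {
		span.RecordError(c.Err)
	}
	span.End()
	return nil
}

// Registered once at engine setup time: engine.AddHook(SpanHook{})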

For git I modified the run command, which worked surprisingly well as a quick catch-all option, with the downside that spans lacked good context (it required diving into the span and seeing what command was called), which I started to solve by having the calling function be traced too:

// Caller side: a named parent span gives the generic git span below some context.
func GetReverseRawDiff(ctx context.Context, repoPath, commitID string, writer io.Writer) error {
	ctx, span := tracer.Start(ctx, "GetReverseRawDiff")
	defer span.End()
	stderr := new(bytes.Buffer)
	cmd := NewCommand(ctx, "show", "--pretty=format:revert %H%n", "-R").AddDynamicArguments(commitID)
	// rest of the code
}

// Catch-all: every git invocation gets a span named after its subcommand,
// with the full command line attached via semconv attributes.
func (c *Command) Run(opts *RunOpts) error {
	_, span := tracer.Start(c.parentContext, c.args[c.globalArgsLength], trace.WithAttributes(
		semconv.ProcessCommand(c.prog), semconv.ProcessCommandArgs(c.args...),
	), trace.WithSpanKind(trace.SpanKindInternal))
	defer span.End()
	// rest of the function plus marking errored spans
}

Side note so I won't forget about it: calls to the /internal endpoint should carry span headers, since they aren't standalone operations but part of a larger one. The caller of that endpoint is the parent.
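A sketch of what that would mean in practice (the helper names are hypothetical; the propagation package is the standard one): the caller injects its span context into the request headers, and the /internal handler extracts it, so the server-side spans attach to the caller's trace:

package private

import (
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// injectSpanHeaders would run on the caller side (hook/serv) before hitting /internal,
// so the server-side spans become children of the caller's span.
// Requires otel.SetTextMapPropagator(propagation.TraceContext{}) at startup.
func injectSpanHeaders(req *http.Request) {
	otel.GetTextMapPropagator().Inject(req.Context(), propagation.HeaderCarrier(req.Header))
}

// extractSpanContext would run in the /internal middleware on the web side.
func extractSpanContext(r *http.Request) *http.Request {
	ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
	return r.WithContext(ctx)
}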

@TheFox0x7
Contributor Author

FYI: I've published what I've made so far on my fork.

To turn it on:

[opentelemetry]
ENABLED=true

and docker/podman run --rm -p 16686:16686 -p 4318:4318 -d --name jaeger docker.io/jaegertracing/all-in-one:latest
should suffice. The container has to listen on localhost:4318, as that's the default endpoint.
As for the actual demo: visit any page in Gitea, view a file, etc., then go to localhost:16686 (the Jaeger UI) and pick gitea from the service menu. At the very least you should have a git trace similar to this one:
[screenshot: example git trace in Jaeger]

@wxiaoguang
Contributor

wxiaoguang commented Dec 22, 2024

Hmm ... it seems somewhat cumbersome to add a lot of ctx, span := tracer.Start(repo.Ctx, "CommitTree") calls. I think I have some ideas about how to improve it. At least, with my recent "Refactor pprof labels and process desc #32909", git.Command.Run is able to know its caller.

ps: I am still working on some refactoring PRs. I think we should get enough preparation done first, then the following work will be easier.


Update: in Refactor request context #32956, the legacy hacky code has been removed and now we have a clear RequestContext; this context is worth tracing.

@techknowlogick
Member

Thank you for this issue. As for the related code you linked to, due to its licensing I have a personal policy of not looking at that repo, to ensure that I don't violate any of the licenses.
It's funny that you opened this issue: as part of some performance work, and after some chance encounters with folks from Honeycomb, I have been investigating this for some time, and I have some (less than stellar) code that I put together as a proof of concept to show that it could be done. I've pushed my code so that others can look at it too, and perhaps get this in.
I share the same feelings as @wxiaoguang about the cumbersome boilerplate of wrapping many things with tracer.Start etc., but sadly, other than some eBPF auto-instrumentation tools that exist for OTel, that seems to be the approach for adding OTel to Go programs. Thankfully the two libraries I've chosen as middleware for the DB & HTTP alleviate that. Sadly, for our git module we will likely have to do the wrapping ourselves.

@wxiaoguang
Contributor

I am still working on some preparations, so please review and approve my recent refactoring PRs; then we will have the builtin performance tracer soon. At that point, at the very least, advanced users could introduce their own tracers at the source code level.

@TheFox0x7
Contributor Author

As for the related code you linked to, due to its licensing I have a personal policy of not looking at that repo, to ensure that I don't violate any of the licenses.

@techknowlogick You can ignore it. I'm the author of the entire thing, and I did pick the EUPL (since I wanted to ensure the telemetry part stays public), but seeing as I want to add it to Gitea and work on this here as well, I've dropped that requirement.
If you want, I can swap the license header so it's on paper, without having to rely on my word.

It's also less than stellar: the main things which work well are the settings and basic tracing; the actual service start works, but it's probably misusing the graceful service system somewhat.

@wxiaoguang
Contributor

wxiaoguang commented Dec 24, 2024

Thankfully the two libraries I've chosen as middleware for the DB & HTTP alleviate that.

I am sure we need to do it ourselves: #32866 (comment)

Sadly, for our git module we will likely have to do the wrapping ourselves.

If you read my recent PRs, you'll see there's no need to be sad or to make it complicated. I have a complete design to make the system right and lightweight, but I need to do some refactoring first.

@wxiaoguang
Contributor

I have pushed my work on the builtin tracer: WIP: Support performance trace #32973.

It has initial HTTP/DB/git support, and a simple list in the admin UI.

@techknowlogick
Member

@TheFox0x7 thanks! I'll check out your code in the link above. I'm sure your statement here is sufficient, and enough people get emails from this tracker that there is a record of your statement should something catastrophic happen to GitHub (knock on wood it doesn't).

@wxiaoguang thanks! I hadn't checked out your code yet. The code in my PR had been stashed in my local repo for a while, and I figured it would be better to push it up so it can be added to this discussion.

@TheFox0x7
Contributor Author

I've looked through the builtin tracer PR and I have a few issues I think are worth discussing. I'm posting them here since I don't know the conventions too well (some people don't like draft reviews), plus they aren't that specific to it.

First thing, circling back a bit: why not implement the SpanExporter interface as the internal backend?
I see the following upsides:

  • No custom trace wrapper, so the entire behavior is available and there is no need to re-implement spans.
  • It's fairly simple to swap the internal exporter for a non-internal one.

I've tried to do it, and the biggest issue I've had so far is connecting spans into one trace: spans only know their parent, so I can't easily build a string the way the builtin tracer does.
I did, however, manage to rewrite parts of it so it works on my PR (without the parent-child relation):
[screenshot: spans rendered without the parent-child relation]

If we manage to solve that without re-implementing half of Jaeger, I think we'd have it covered quite easily between the three solutions.

@wxiaoguang
Contributor

wxiaoguang commented Dec 25, 2024

First thing, circling back a bit: why not implement the SpanExporter interface as the internal backend?

Sorry, but I didn't get the point ... could you elaborate?

I think our shim layer won't implement any "real" OpenTelemetry interfaces, because OpenTelemetry's interfaces have quite heavy and complex dependencies. So by the current design, users could implement the "traceStarter" and register it to operate on our spans directly.

If you think "SpanExporter" is useful, could you provide some pseudocode to describe how to use it?

@wxiaoguang
Contributor

wxiaoguang commented Dec 25, 2024

In my mind the otel integration could be like this:

// Pseudocode: TraceSpan, traceSpanInternal and globalTraceStarters are the builtin
// shim types; only the otel calls come from the OpenTelemetry API packages.
type traceOtelStarter struct{}

type traceOtelSpan struct {
	otelSpan trace.Span // go.opentelemetry.io/otel/trace
}

func (t *traceOtelSpan) end() {
	t.otelSpan.End()
}

func (t *traceOtelStarter) start(ctx context.Context, traceSpan *TraceSpan, internalSpanIdx int) (context.Context, traceSpanInternal) {
	ctx, span := tracer.Start(ctx, traceSpan.name) // tracer: e.g. otel.Tracer("gitea")
	return ctx, &traceOtelSpan{otelSpan: span}
}

func init() {
	globalTraceStarters = append(globalTraceStarters, &traceOtelStarter{})
}

@TheFox0x7
Contributor Author

TheFox0x7 commented Dec 25, 2024

Putting it into words turned out to be way harder than I thought, but:

The way I see it right now is that we could make taillog into an exporter, making the integration more natural IMO, with the yet-to-be-solved part being assembling a trace from spans. I might be a bit stuck on this idea though, as mentioned earlier.
Unfortunately, I agree that the SDK isn't the lightest or simplest, and compared to the main branch the resulting binary is ~3MB larger, which is quite a lot for a feature that might not even be used.
I think the list of concerns I have is as follows:

  • Accidentally deviating from the standard
  • Having to recreate trace.Span
  • Over-complicating the system with wrappers around something which is already kind of a wrapper

As a side note: the caller context in the git command is really useful, so thanks for that :)
[screenshot: git command span showing the caller context]


The SpanExporter interface consists of two methods (quoted below):

  • ExportSpans(context.Context, []trace.ReadOnlySpan) error, which is called by the relevant OTel component (a SpanProcessor, usually a batching one, so the application isn't exporting one span at a time). Its job is to export spans (ones that have ended) to wherever you want.
  • Shutdown(context.Context) error, which is the cleanup method, after which the exporter should not process anything.
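For reference, the interface as defined in go.opentelemetry.io/otel/sdk/trace:

type SpanExporter interface {
	// ExportSpans serializes and delivers a batch of finished spans.
	ExportSpans(ctx context.Context, spans []ReadOnlySpan) error
	// Shutdown flushes anything pending and releases resources; nothing may be exported afterwards.
	Shutdown(ctx context.Context) error
}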

So I was thinking that we could have an internal exporter which exports spans into memory when the feature is on, similar to what you proposed in the draft, but using the regular library.
Putting it into a flow: span -> SpanProcessor -> Exporter -> memory -> /perftrace page

I hope I made myself clearer ... even if only a little bit.


My hacky plug
func (m *memoryMsgRecorder) ExportSpans(_ context.Context, spans []trace.ReadOnlySpan) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, t := range spans {
		sb := strings.Builder{}
		SpanToString(t, &sb, 2)
		m.msgs = append(m.msgs, &MsgRecord{
			Time:    time.Now(),
			Content: sb.String(),
		})
		if len(m.msgs) > m.limit {
			m.msgs = m.msgs[len(m.msgs)-m.limit:]
		}
	}
	return nil
}

func SpanToString(t trace.ReadOnlySpan, out *strings.Builder, indent int) {
	out.WriteString(strings.Repeat(" ", indent))
	out.WriteString(t.Name())
	if t.EndTime().IsZero() {
		out.WriteString(" duration: (not ended)")
	} else {
		out.WriteString(fmt.Sprintf(" duration: %.4fs", t.EndTime().Sub(t.StartTime()).Seconds()))
	}
	out.WriteString("\n")
	for _, a := range t.Attributes() {
		out.WriteString(strings.Repeat(" ", indent+2))
		out.WriteString(string(a.Key))
		out.WriteString(": ")
		out.WriteString(a.Value.AsString())
		out.WriteString("\n")
	}
	// Unsolved: spans only know their parent, so the child spans can't be walked here.
	// for _, c := range t.Parent()..children {
	// 	span := c.internalSpans[t.internalSpanIdx].(*traceBuiltinSpan)
	// 	span.toString(out, indent+2)
	// }
}

func (m *memoryMsgRecorder) Shutdown(_ context.Context) error {
	log.Warn("Shutdown has been called!")
	return nil
}

// Plus a small hack in my branch
func setupTraceProvider(ctx context.Context, r *resource.Resource) (func(context.Context) error, error) {
	var shutdown func(context.Context) error
	switch setting.OpenTelemetry.Traces {
	case "otlp":
		traceProvider := sdktrace.NewTracerProvider(
			sdktrace.WithSampler(setting.OpenTelemetry.Sampler),
			// Here: the in-memory recorder is registered as a regular span exporter
			sdktrace.WithBatcher(tailmsg.GetManager().GetTraceRecorder()),
			sdktrace.WithResource(r),
		)
		otel.SetTracerProvider(traceProvider)
		shutdown = traceProvider.Shutdown
	default:
		shutdown = func(ctx context.Context) error { return nil }
	}
	return shutdown, nil
}

Full code on: https://github.com/TheFox0x7/gitea/tree/otel-mods

@TheFox0x7
Contributor Author

I added the Prometheus exporter with otelchi-generated metrics as a porting test, and as fun as otelchi is for a quick start, I don't think this:
gitea_response_size_bytes_count{http_method="GET",http_scheme="http",net_host_name="Gitea",net_host_port="3000",net_protocol_version="1.1",net_sock_peer_addr="127.0.0.1",net_sock_peer_port="57834",otel_scope_name="github.com/riandyrn/otelchi/metric",otel_scope_version="0.1.0",user_agent_original="Mozilla/5.0 (X11; Linux x86_64; rv:133.0) Gecko/20100101 Firefox/133.0"} 13
is something we're interested in (the user-agent addition is ... interesting).
But it is doable, and it would be fairly simple to port the current metrics over.

Unfortunately this doesn't remove the prometheus package like I thought it would, so there goes my "removing a dependency" argument.
