Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate OT libraries #251

Closed
pauldraper opened this issue Aug 20, 2018 · 20 comments
Closed

Separate OT libraries #251

pauldraper opened this issue Aug 20, 2018 · 20 comments
Labels
community core question Further information is requested

Comments

@pauldraper
Copy link

pauldraper commented Aug 20, 2018

This is awesome work!

Before I stumbled on this, I was working on

  1. adding ScopeManager interface to opentracing-javascript
  2. implement ScopeManager with async_hook and async-hook-jl
  3. implement ScopeManager with Zone.js
  4. implementing tracing for http.Server
  5. implementing tracing for http client
  6. implementing tracing for pg
  7. implementing tracing for aws-sdk

It looks like (2), (5), and (6) have already been worked on by this project.

I'd like to continue with the great work being done for OT JS.

I think this library should be split up into:

  • async-hook-opentracing
    • scope
  • dd-trace-js
    • tracer
    • everything else
  • nodejs-opentracing
    • plugins/http
  • mysql-opentracing
    • plugins/mysql
    • plugins/mysql2
  • pg-opentracing
    • plugins/pg

Thoughts?

@rochdev
Copy link
Member

rochdev commented Aug 20, 2018

I had similar thoughts on how to split this.

For async-hook-opentracing I was actually thinking of making this 3 projects basically:

  • One project to hold the actual asynchronous hooks logic, which could either wrap something like async_hooks or be a 100% JavaScript implementation. I tend to prefer the latter since it would work on any platform (i.e. Node, browsers, Vert.x). In any case, the goal of this library would be to hide this behind a common interface. This project could also contain the monkey-patching for libraries like bluebird, or these could be externalized to their own sub-projects as well.
  • One project for context storage. This would basically be equivalent to continuation-local-storage, but without the need for callbacks.
  • One project for the scope manager. Still not sure if it would be beneficial to have this in its own project or if it should exist directly in opentracing-javascript.

The idea is that each component can be used on its own, and the first 2 are not related to OpenTracing at all. For example they could be used in OpenCensus as well or for any context propagation use case in general.

For the various integrations that you mentioned (i.e. mysql-opentracing) I think this is a great idea. However we would have to be very careful how this is implemented, as each APM vendor has their own sets of tags that need to be included. For example in our case we have a specific requirement for service.name and resource.name. There needs to be a way to hook into the lifecycle of the instrumentation to add metadata. Did you already start working on this? It would be interesting to try and converge our efforts.

@rochdev rochdev added question Further information is requested core community labels Aug 20, 2018
@pauldraper
Copy link
Author

pauldraper commented Aug 20, 2018

One project to hold the actual asynchronous hooks logic

This project could also contain the monkey-patching for libraries like bluebird, or these could be externalized to their own sub-projects as well.

One thing to check out is Zone.js which has a lot of those and supports a variety of environments. It does not, however, support native syntax async/await. (But AFAIK there is no way to support async/await in some runtimes, like browsers.)

There's even a TC39 proposal for it, but it's stalled. https://github.com/domenic/zones

From process.addEventListener, AsyncWrap, async_hooks, Zones, ... I don't know what the right answer is. This is something that ideally would be in the ES standard.

One project for context storage. This would basically be equivalent to continuation-local-storage, but without the need for callbacks.

With callbacks, CLS is super easy. I'll leave that to opentracing/specification#126 discussion.

For the various integrations that you mentioned (i.e. mysql-opentracing) I think this is a great idea.

This is in fact the intent of the OpenTracing design. So that each APM vendor doesn't need to instrument every library from scratch.

There are a lot of libraries out there and it doesn't make sense for users to hope their APM supports the one they want.

Did you already start working on this?

A first draft of http client and server instrumentation: https://gist.github.com/pauldraper/5f7055b2d24d042c58625bfebfd2331f

each APM vendor has their own sets of tags that need to be included.

Ideally, the standard OT tags cover a lot of ground. For example, does Datadog's service.name differ from the OT standard peer.service? It's possible there are some really basic tags that should be added to the spec.

And then tracer can add more data onto all traces (e.g. name or IP address of the current machine).

There needs to be a way to hook into the lifecycle of the instrumentation to add metadata.

Agreed. It doesn't have to be a standardized way. But each library instrumentation should give you some options. Especially for the ability to add tags.

If there isn't a way to write vendor-agnostic instrumentation for a library, OpenTracing is a bad idea and its design has fundamentally failed.

@rochdev
Copy link
Member

rochdev commented Aug 20, 2018

One thing to check out is Zone.js which has a lot of those and supports a variety of environments.

Does Zone.js support Node properly? I have yet to see a single project that uses it on the server.

It does not, however, support native syntax async/await.

I think this is an acceptable limitation in the browser since in most cases it's transpiled anyway. In Node however async/await is very common and must be supported.

But AFAIK there is no way to support async/await in some runtimes, like browsers.

From my understanding, async/await is implemented in most major browsers: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/await

However, as mentioned above, I don't think it's widely used yet.

With callbacks, CLS is super easy.

We had cases where we had to manually enter/exit the current continuation in a different asynchronous context, which cannot be done with callbacks. This could have been a limitation with continuation-local-storage specifically however as the APIs exist but are not exposed.

This is also a major departure from the current ScopeManager specification that is already implemented in other languages. I think it's worth trying to stick to the current specification if possible, and then iterate later on with an easier API on top. Otherwise the ScopeManager specification needs to be deprecated which could take a lot of time.

One thing to note also is that the main reason for the complexity of our current implementation is trying to automatically close the scope at the end of the entire asynchronous context tree. The same complexity would exist if trying to have this functionality with a CPS. Other than that the implementation is actually very simple.

Ideally, the standard OT tags cover a lot of ground.

Definitely agree. I wonder if it would be useful to have also some specification for integrations. For example which tags are useful for mysql so that the same set of tags can be added for every language. Of course differences in implementations can make this impossible in some cases, but at least to have guidelines of what to try to implement. Maybe some required and optional tags?

@pauldraper
Copy link
Author

pauldraper commented Aug 20, 2018

Does Zone.js support Node properly? I have yet to see a single project that uses it on the server.

Yes. It supports domains, fs, EventEmitters, etc. opentracing/opentracing-javascript#113 has the CPS-style ScopeManager and rivethealth/opentracing-javascript@13fc56c has the imperative ScopeManager. Both have functioning Node.js tests.

Since its core doesn't rely runtime-specific features and therefore cannot work with native async/await, it's not surprising that Node.js projects stick to the Node.js-specific implementations which can support native async/await.

There has been work to drive it with async_hooks but there are edge cases which have not been buttoned-up.

We had cases where we had to manually enter/exit the current continuation in a different asynchronous context, which cannot be done with callbacks.

Do you remember what those are? I've found that patching EventEmitter gets me 95% of the way there in Node.js.

For what it's worth, Node.js is choosing CPS for async_hooks/AsyncResource.runInAsyncScope. (They have emitBefore/emitAfter but have chosen to deprecate.) Whether that means hooks are easy is another question, but AsyncResource uses CPS.

One thing to note also is that the main reason for the complexity of our current implementation is trying to automatically close the scope at the end of the entire asynchronous context tree. The same complexity would exist if trying to have this functionality with a CPS.

True. Honestly I don't understand why this is even a thing.

Every span I've seen has well-defined start and stop. It starts with input, it ends with output. If I add code to fire-and-forget some analytics/diagnostics logic, why would I now want to include that as part of my HTTP request/response span?

A trace already has the property that it "starts" at the first component and "stops" at the last component, including ancillary non-critical path items. A span...I just don't see the point of finishOnClose (particularly if closes are automatic).

Glad to hear this is the source of all the complexity, because that makes a lot of sense.

@rochdev
Copy link
Member

rochdev commented Aug 20, 2018

There has been work to drive it with async_hooks but there are edge cases which have not been buttoned-up.

That would mean that the first implementation couldn't use Zone.js however, since all the recent projects I've seen use async/await.

Do you remember what those are? I've found that patching EventEmitter gets me 95% of the way there in Node.js.

I don't remember exactly, but it was a case where the way the library was built, it was not possible to handle a specific case. I think it was something along these lines: we needed to activate the span in the current context so we can access it later, but there were some events or promises triggered that would result in the user's code being called, but at that point the span should no longer be available since at this point the span is already finished. But again, I think this was mostly because of the way continuation-local-storage specifically handles events and promises, not necessarily an issue with CPS. For example, they chose to bind promises on resolve() while Node domains bind on .then().

For what it's worth, Node.js is choosing CPS for async_hooks/AsyncResource.runInAsyncScope.

I'm not saying I'm against CPS or that it's not a superior solution, but my worry is that having an entirely new specification approved that basically does the same thing as the existing scope manager could take a very long time. Any work on a CPS implementation until then is based on the assumption that it will be approved, which it might not.

True. Honestly I don't understand why this is even a thing.

This was mostly done to support finishOnClose. Since everything is asynchronous in JavaScript, it didn't really make sense to simply close at the end of the current event loop iteration. But as you said, use cases for finishOnClose are very limited anyway, so maybe it does make sense to simply close the scope early.

Would you be opened to use a scope manager implementation while this is discussed with the OTSC? This way it will be possible to work on something that is already approved and thus OpenTracing compliant with the current specification. Switching later on shouldn't be very difficult either. For reference, when I switched all our integrations from CLS to the scope manager, it only took a few hours.

@pauldraper
Copy link
Author

Give that imperative ScopeManager is the current standard, it does make sense to keep that; sorry, I didn't mean for that discussion to leak in here.

I think finishOnClose is extraordinary complexity for a use case I cannot think of, but again, that has to do with RFC itself and is discussed better elsewhere.


Main point of this issue here is splitting up the libraries.

  1. Async hooks/CLS stuff
  2. Datadog tracer implementation
  3. Library instumentations

You have some opinions about splitting up (1) even further; I'm not sure which way is up on that. As long as something works on Node.js, I'm happy.

(2) should stay here and depend on (1) and (3).

(3) should be separate projects, if OT has any value (and I think it does). They should be customizable, and it sounds like there are some Datadog specific bits.

@rochdev
Copy link
Member

rochdev commented Aug 20, 2018

Main point of this issue here is splitting up the libraries.

Totally agree.

You have some opinions about splitting up (1) even further

This can be iterated on. The most important part is move the entire functionality out. Afterwards it can be split further if necessary. I just think in the long run it would be interesting to have something that can be used outside of OpenTracing. This is definitely not a short-term concern.

They should be customizable, and it sounds like there are some Datadog specific bits.

The most important ones for us that are common across all integrations are the following:

  • service.name: this is the service name of the current service, which is different from service and peer.service which are for the remote service.
  • resource.name: a unique identifier of the resource accessed by the operation. This must be low cardinality as we use this for grouping operations in our UI. An example would be GET /user/:id
  • span.type this has a special meaning for how the span will be processed in the agent and the backend.
  • out.host: Used in distributed tracing. This could be replaced by peer.hostname
  • out.port: Used in distributed tracing. This could be replaced by peer.port

As long as the integrations are flexible enough to be able to add this metadata, it would handle most of our use cases.

  1. Async hooks/CLS stuff

Should this simply go directly in opentracing-javascript at first?

@pauldraper
Copy link
Author

pauldraper commented Aug 20, 2018

service.name: this is the service name of the current service, which is different from service and peer.service which are for the remote service.

That can easily be added at the Tracer level.

resource.name: a unique identifier of the resource accessed by the operation. This must be low cardinality as we use this for grouping operations in our UI. An example would be GET /user/:id

Personally, I think this is common enough be added to the OT standard, as resource, resource.name, or peer.resource.

span.type this has a special meaning for how the span will be processed in the agent and the backend.

There is component which is similar. The tracer could interpret that.

out.host: Used in distributed tracing. This could be replaced by peer.hostname
out.port: Used in distributed tracing. This could be replaced by peer.port

Excellent. And the tracer can always do the conversion.

Should this simply go directly in opentracing-javascript at first?

Yes. opentracing/opentracing-javascript#113 should be closed in favor of it. It may be helpful for reference, for tests, etc.

It will have to be in TypeScript though. I'm happy to port it, or you can.

@rochdev
Copy link
Member

rochdev commented Aug 21, 2018

That can easily be added at the Tracer level.

It already is, but we prefix the remote service basically. I think it's possible to achieve that using the existing tags, I would have to do some testing.

Personally, I think this is common enough be added to the OT standard, as resource, resource.name, or peer.resource.

The only issue is that different vendors expect different formats in most cases. While it's not ideal, it would probably take a lot of effort to achieve a common format for every integration.

There is component which is similar. The tracer could interpret that.

That could work. At worst a conversion would be needed before sending the span.

So, in summary, this is the steps I think that should be taken here:

  1. I'll open an RFC later today to discuss how to handle the different aspects of the scope manager implementation (i.e. how to handle close(), promises, etc). I can also provide real world usage feedback since we use this in our tracer.
  2. Let's review the RFC together before involving more people so we agree on the expected outcome.
  3. I'll start adapting our scope manager in a fork of our repo according to the result of the RFC. It will be greatly simplified if we don't track descendant contexts for auto-closing the scope.
  4. If you can help porting to TypeScript that would be great. I avoided accessing private properties to make porting easier.
  5. Once all is good we can submit a PR to opentracing-javascript

What do you think?

@pauldraper
Copy link
Author

Sounds good. Are you thinking the RFC will or will not include tracking descendents for finishOnClose?

Java tracks it.

I might be missing something about cases when that is helpful.

@rochdev
Copy link
Member

rochdev commented Aug 21, 2018

When you say Java tracks it, this means that the current thread plus every single thread that is created from that point on need to be terminated before close() is automatically called? This would be the equivalent behavior of our JavaScript scope manager.

In any case, the RFC will include all options. The benefit of tracking descendants is to support a CHILD_OF relationship when no callback is available so that the parent will include the duration of the entire asynchronous context tree. This is to support providers that don't have FOLLOWS_FROM. It does come however with an increase in complexity. If other tracers are consistent in how this is handled, we should aim to make JavaScript consistent as well. I doubt this is the case however since the spec is very vague.

I think the most important part is how close() behaves when called manually which you mentioned in opentracing/specification#126. This is mostly important because the outcome will either make a CPS wrapper possible or not.

@pauldraper
Copy link
Author

pauldraper commented Aug 21, 2018 via email

@rochdev
Copy link
Member

rochdev commented Aug 21, 2018

Do you see any issue with doing that in JS too other than the added complexity? (which is also already there)

@pauldraper
Copy link
Author

I have bad experiences with tracking complex leaking trace implementations, so I prefer to make it as simple as possible. (But obviously no simpler, if there is a needed use.)

@rochdev
Copy link
Member

rochdev commented Aug 23, 2018

Sorry for not posting the RFC yet. I'm currently experimenting with a few different approaches first to get a better idea of what to recommend. It's proving very difficult to keep track of when to automatically close the scope efficiently.

@stephenh
Copy link
Contributor

FWIW if you're discussing shared lib instrumentation (+1), I stumbled across Google's strackdriver, which looks like it is a ~fairly mature?/complete? implementation of these (I've not used it, just found it in github yesterday while googling for async hook misc things): https://github.com/GoogleCloudPlatform/cloud-trace-nodejs/tree/master/src/plugins. Disclaimer it's very likely you've already seen it/etc. but just thought it wouldn't hurt to mention.

@pauldraper
Copy link
Author

pauldraper commented Sep 20, 2019

@rochdev any more thoughts on this?

As awesome as DataDog is, the point of OpenTracing was to decouple the instrumentation from the APM provider.

Most of the work in the repo is theoretically just generic OpenTracing instrumentation.

@rochdev
Copy link
Member

rochdev commented Sep 24, 2019

@pauldraper Most of the work mentioned in the OP has either already happened or should happen soon for dd-trace-js. Most of it should also land in OpenTelemetry if they decide to use our tracer as the base for the JavaScript tracer given our recent partnership.

I would highly recommend participating to the OpenTelemetry discussions as the project is still young and a lot of decisions are still being taken, so you could have a big impact on the design. It's currently very Java centric, so the more JS engineers onboard the better to try and change that.

@pauldraper
Copy link
Author

@rochdev thanks, will do !

@rochdev
Copy link
Member

rochdev commented May 18, 2021

Closing this as we're moving away from OpenTracing, focusing instead on reworking dd-trace into individual components and eventually individual modules. As far as open standards go, we're adding support for OpenTelemetry in dd-opentelemetry-exporter-js for now, and will likely also support the OTel API at some point to interact with dd-trace directly, as well as W3C propagation.

@rochdev rochdev closed this as completed May 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community core question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants