From bc78353512623522ace99ce10b01c48f8ccd949f Mon Sep 17 00:00:00 2001 From: John Bley Date: Mon, 27 Apr 2020 19:47:31 -0400 Subject: [PATCH 1/5] Sensitive Data Handling OTEP --- text/0000-sensitive-data-handling.md | 103 +++++++++++++++++++++++++++ 1 file changed, 103 insertions(+) create mode 100644 text/0000-sensitive-data-handling.md diff --git a/text/0000-sensitive-data-handling.md b/text/0000-sensitive-data-handling.md new file mode 100644 index 000000000..5af6961ce --- /dev/null +++ b/text/0000-sensitive-data-handling.md @@ -0,0 +1,103 @@ +# Sensitive Data Handling

By default, OpenTelemetry libraries never capture potentially-sensitive data

## Motivation

Many businesses are under regulatory or contractual constraints on how they process sensitive data (e.g., GDPR,
PCI, HIPAA). If we are "safe by default" and manage this issue well, it will help greatly with
OpenTelemetry adoption in many markets.

## Explanation

By default, our Java auto-instrumentation library does not capture your app's credit card numbers, taxpayer IDs,
personal medical data, users' passwords, etc., unless you put them in the URL. If you do put them in the URL, we provide a way to fix that.

How do we do that? For example, we don't know the semantics of the SQL your app sends to its database(s). Some of the data might be
sensitive, and some of it might not. What we do is remove all
*potentially* sensitive values. This means that SQL like `SELECT * FROM USERS WHERE ID=123456789` gets turned into
`SELECT * FROM USERS WHERE ID=?` when it is recorded in our `db.statement` attribute. Since our library doesn't know anything about the
meaning of `ID` values, we decided to be "safe by default" and drop all numeric and string values entirely from all SQL. So what gets
captured in our system is just table names, field names, clause structures, etc. (metadata about your data's structure and how you access it).

Yes, this also has the pleasant side effect of reducing the cardinality of these dimensions, making it easier to build metrics or indices on them.
If you want to reduce it further, our collector has features for that. Again, the goal of our auto-instrumentation is to protect you from
liability or exposure due to captured sensitive data.

We use similar techniques for all other protocols/query/command mechanisms. One exception is how we treat URLs. We will by
default capture the full URL (including the query and fragment) in `http.url`. Note that this will not include
the contents of form submissions or XHR posts. Since many products/features log the entire URL
(as in Apache's access log), developers already need to consider this. If you are putting sensitive data in the URL,
we offer configuration options to work around this.

If your app doesn't process any sensitive data at all and you want to see the raw SQL values, you can configure that. The documentation
will have some bold/red warnings telling you to be sure you know what you're doing.

I'd like to repeat here that I'm using SQL as an *example* of how we treat potentially-sensitive data in our ecosystem. We treat every other protocol/query format
with the same care.

## Internal details

I'm looking here for broad agreement of a design principle, rather than feedback on any specific low-level string manipulation technique to
achieve these goals.
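
To make that principle concrete, here is a deliberately naive sketch of the kind of literal scrubbing described above. It is illustrative
only (the `SqlScrubber` class and its regexes are invented for this example and are not the proposed implementation); the actual contribution
discussed below is lexer-based rather than regex-based:

```java
import java.util.regex.Pattern;

/**
 * Illustrative sketch only: replaces string and numeric literals in a SQL
 * statement with '?' so that only structural metadata is captured.
 */
final class SqlScrubber {

  // Single-quoted string literals, treating '' as an escaped quote.
  private static final Pattern STRING_LITERAL = Pattern.compile("'(?:[^']|'')*'");
  // Standalone integer or decimal literals (not digits inside identifiers).
  private static final Pattern NUMERIC_LITERAL = Pattern.compile("\\b\\d+(?:\\.\\d+)?\\b");

  static String normalize(String sql) {
    String scrubbed = STRING_LITERAL.matcher(sql).replaceAll("?");
    return NUMERIC_LITERAL.matcher(scrubbed).replaceAll("?");
  }

  public static void main(String[] args) {
    // Prints: SELECT * FROM USERS WHERE ID=?
    System.out.println(normalize("SELECT * FROM USERS WHERE ID=123456789"));
  }
}
```

A blunt regex pass like this is roughly the "safe fallback" mentioned later; a lexer-based implementation can be far more precise about
quoting rules and dialect differences while preserving the same guarantee.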

That said, I have worked on sql normalization at three prior APM companies and am working on contributing a simple first version of such a normalizer to the
opentelemetry-auto-instr-java repo. It uses a lexer to parse out SQL numeric and string literals and replaces them with a `?`, exactly
as described above and as done by many APM products on the market.

One interesting technical point is what to do when our normalization/scrubbing code knows it isn't working properly (for
example, encountering an ambiguous string it doesn't know how to parse). In some cases a safe fallback approach can be found (e.g., applying a
less-sophisticated regex that bluntly removes too much from the query), but in other cases we may need to, for example, throw an internal
exception and interpret that as a signal not to add the attribute in the first place. In pseudo-code:
```
try {
  span.setAttribute("sensitive.field", normalize(sensitiveValue));
} catch (CannotNormalizeException e) {
  // don't emit sensitive.field since we can't normalize it safely
}
```

This work will affect all auto-instrumenting libraries, as well as any official (or recommended) manually wrapped libraries that are
produced as part of the ecosystem. As appropriate to each such library, features designed to properly handle sensitive data should
be highlighted in the documentation. Sticking with the SQL example, for the Java auto library this could take the form of an additional column
with a hyperlink to the documentation describing exactly what we do:

| Library/Framework | Versions | Notes |
|------------------------------------------------------------------------------------------------------|--------------------------------|--------------------------------------------|
| [JDBC](https://docs.oracle.com/en/java/javase/11/docs/api/java.sql/java/sql/package-summary.html) | Java 7+ | (link to docs about sql normalization) |


## Trade-offs and mitigations

In some cases, proper "safe by default" behavior might involve more intensive processing than can reasonably be performed inline with the application. There have
already been some discussions about having span (and attribute) processing logic run in-process but off-thread in the application (i.e., by a configurable span
processing pipeline, which would use separate threads for its work). For sql normalization (my first/example concern), I don't believe this is necessary, but
further work on this issue may put pressure here.

One drawback of this "safe by default" approach is that it puts a configuration burden on some users who are not handling sensitive data but
want to have as much visibility as possible into their system. Staying with the sql normalization theme however, the vast majority of sql in the world doesn't
have different performance characteristics based on the exact values passed in, but on the number of them, or the complexity of the `WHERE` clauses,
the number of values in an `IN` list, presence of an index on fields, etc.

Some might consider doing most of this work at the collector, rather than in-process with the app. For users who are already exposed to this
issue, I don't think the statement "yes, by default we scrub the credit card number immediately after it is sent over the wire to our collector" is a very reassuring one.
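
Whichever layer ends up doing this work (inline, off-thread in the application, or in the collector), the shape of the processing is the same.
As a rough sketch of the in-process-but-off-thread pipeline idea mentioned above, with `SpanData` and `Exporter` invented for this example
rather than taken from any real SDK:

```java
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.UnaryOperator;

/**
 * Rough sketch of an in-process, off-thread scrubbing pipeline.
 * SpanData and Exporter are invented for this example, not SDK types.
 */
final class ScrubbingPipeline {

  interface SpanData {
    Map<String, String> attributes(); // mutable map, purely for the sake of the sketch
  }

  interface Exporter {
    void export(SpanData span);
  }

  private final ExecutorService worker = Executors.newSingleThreadExecutor();
  private final Exporter exporter;
  private final UnaryOperator<String> sqlNormalizer; // e.g. the lexer-based normalizer

  ScrubbingPipeline(Exporter exporter, UnaryOperator<String> sqlNormalizer) {
    this.exporter = exporter;
    this.sqlNormalizer = sqlNormalizer;
  }

  /** Called on the application thread when a span ends; returns immediately. */
  void onSpanEnded(SpanData span) {
    worker.submit(() -> {
      // The potentially expensive normalization runs on the worker thread,
      // keeping it off the application's request path.
      span.attributes().computeIfPresent("db.statement", (key, sql) -> sqlNormalizer.apply(sql));
      exporter.export(span);
    });
  }
}
```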
+ +## Prior art and alternatives + +Nearly every commercial APM product on the planet has features like the sql normalization discussed above (though some take a different approach in the name of reduced +cardinality). Most of them have landed on a similar answer for their treatment of URLs (though, again, some do different things for metrics features vs. tracing features). + +## Open questions + +I am not personally familiar with the OpenTelemetry stance on one value that is potentially-sensitive under some regulatory frameworks: client IP addresses +of users accessing a system. This should be discussed, perhaps separately. + +I am not an expert at all the protocols/query formats that our libraries support. Doing this well will require a thorough audit and cooperation/discussion across +the community (again, ideally in separate issues/PRs for each such feature). + +## Future possibilities + +We might want to migrate some of this functionality into external libraries such that manual instrumentation (and manually-constructed library wrappers) can take advantage of these capabilities. + +Additionally, work on more advanced configuration around this puts pressure on having our span processors be highly flexible, whether these are in-process or in the collector. From b95e94c70825e5acb3965928fdf81a780d80345d Mon Sep 17 00:00:00 2001 From: John Bley Date: Mon, 27 Apr 2020 19:57:49 -0400 Subject: [PATCH 2/5] Verbiage tweaks based on another read through. --- text/0000-sensitive-data-handling.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/text/0000-sensitive-data-handling.md b/text/0000-sensitive-data-handling.md index 5af6961ce..f2705a14a 100644 --- a/text/0000-sensitive-data-handling.md +++ b/text/0000-sensitive-data-handling.md @@ -1,6 +1,6 @@ # Sensitive Data Handling -By default, OpenTelemetry libraries never capture potentially-sensitive data +By default, OpenTelemetry libraries never capture potentially-sensitive data. ## Motivation @@ -21,7 +21,7 @@ meaning of `ID` values, we decided to be "safe by default" and drop all numeric captured in our system is just table names, field names, clause structures, etc. (metadata about your data's structure and how you access it). Yes, this also has the pleasant side effect of reducing the cardinality of these dimensions, making it easier to build metrics or indices on them. -If you want to reduce it further, our collector has features for that. Again, the goal of our auto-instrumentation is to protect you from +If you want to reduce it further, our collector has features for that. Again, our design goal is to protect you from liability or exposure due to captured sensitive data. We use similar techniques for all other protocols/query/command mechanisms. One exception is how we treat URLs. We will by @@ -61,7 +61,7 @@ attribute in the first place. In pseudo-code: This work will affect all auto-instrumenting libraries, as well as any official (or recommended) manually wrapped libraries that are produced as part of the ecosystem. As appropriate to each such library, features designed to properly handle sensitive data should be highlighted in the documentation. 
Sticking with the SQL example, for the Java auto library this could take the form of an additional column -with a hyperlink to the documentation describing exactly what we do: +on the "Supported Frameworks" table with a hyperlink to the documentation describing exactly what we do: | Library/Framework | Versions | Notes | |------------------------------------------------------------------------------------------------------|--------------------------------|--------------------------------------------| @@ -76,7 +76,7 @@ processing pipeline, which would use separate threads for its work). For sql no further work on this issue may put pressure here. One drawback of this "safe by default" approach is that it puts a configuration burden on some users who are not handling sensitive data but -want to have as much visibility as possible into their system. Staying with the sql normalization theme however, the vast majority of sql in the world doesn't +want to have as much visibility as possible into their system. Staying with the sql normalization example however, the vast majority of sql in the world doesn't have different performance characteristics based on the exact values passed in, but on the number of them, or the complexity of the `WHERE` clauses, the number of values in an `IN` list, presence of an index on fields, etc. From 509e83b81fe88afeb9de4cac7517d0a4c7f480fc Mon Sep 17 00:00:00 2001 From: John Bley Date: Mon, 27 Apr 2020 20:00:17 -0400 Subject: [PATCH 3/5] Rename OTEP per PR id (100). --- ...sensitive-data-handling.md => 0100-sensitive-data-handling.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename text/{0000-sensitive-data-handling.md => 0100-sensitive-data-handling.md} (100%) diff --git a/text/0000-sensitive-data-handling.md b/text/0100-sensitive-data-handling.md similarity index 100% rename from text/0000-sensitive-data-handling.md rename to text/0100-sensitive-data-handling.md From 7349386500de3594397ce94d7ee5cd5362da1908 Mon Sep 17 00:00:00 2001 From: John Bley Date: Tue, 28 Apr 2020 09:30:57 -0400 Subject: [PATCH 4/5] Explicitly call out URL capturing in the headline. --- text/0100-sensitive-data-handling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0100-sensitive-data-handling.md b/text/0100-sensitive-data-handling.md index f2705a14a..47601550a 100644 --- a/text/0100-sensitive-data-handling.md +++ b/text/0100-sensitive-data-handling.md @@ -1,6 +1,6 @@ # Sensitive Data Handling -By default, OpenTelemetry libraries never capture potentially-sensitive data. +By default, OpenTelemetry libraries never capture potentially-sensitive data, except for the full URL. ## Motivation From f142c93366a0bf851903ddfa326243a86f5c5566 Mon Sep 17 00:00:00 2001 From: John Bley Date: Tue, 5 May 2020 09:19:48 -0400 Subject: [PATCH 5/5] Alter section about the collector providing a default backstop for individual languages/libraries. --- text/0100-sensitive-data-handling.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/text/0100-sensitive-data-handling.md b/text/0100-sensitive-data-handling.md index 47601550a..565d4e65d 100644 --- a/text/0100-sensitive-data-handling.md +++ b/text/0100-sensitive-data-handling.md @@ -80,8 +80,11 @@ want to have as much visibility as possible into their system. Staying with the have different performance characteristics based on the exact values passed in, but on the number of them, or the complexity of the `WHERE` clauses, the number of values in an `IN` list, presence of an index on fields, etc. 
-Some might consider doing most of this work at the collector, rather than in-process with the app. For users who are already exposed to this -issue, I don't think the statement "yes, by default we scrub the credit card number immediately after it is sent over the wire to our collector" is a very reassuring one. +This "safe by default" behavior should be applied for a "default install", including the collector. While individual instrumentation +libraries can (and are encouraged to) apply appropriate scrubbing themselves, the collector will act as a cross-language backstop, with appropriate +semantic transformations based on received data (e.g., looking at the`db.type` field to determine whether to treat a string +as sql or a mongo query). This approach should reduce maintenance costs across the ecosystem as well as reduce pressure on needing +the in-app span processing pipeline mentioned above. ## Prior art and alternatives