chore: fix rst
Signed-off-by: ThibaultFy <[email protected]>
ThibaultFy committed Sep 1, 2023
1 parent 78f1ca7 commit 2356f3a
Showing 1 changed file with 37 additions and 27 deletions.
docs/source/additional/privacy-strategy.rst
@@ -17,7 +17,7 @@ Privacy Enhancing Technologies.
We touch on a **few** of the main technologies that are making collaborative data sharing possible today in ways that can be considered more secure.

**Federated Learning**:
Federated Learning allows machine learning models to be sent to the servers where the data resides, so that they can be trained and evaluated without the data ever leaving its original location. This idea is not restricted to machine learning; applied to more general computations it is referred to as Federated Analytics. Substra enables both Federated Learning and Federated Analytics.
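
As a rough illustration of the workflow (plain NumPy only, not Substra's actual API; the local datasets and training step below are hypothetical placeholders), a few rounds of federated averaging could look like this:

.. code-block:: python

    import numpy as np

    def local_update(weights, X, y, lr=0.1):
        # One local gradient step for a linear model: the data never leaves this function.
        grad = X.T @ (X @ weights - y) / len(y)
        return weights - lr * grad

    def federated_round(weights, datasets):
        # Each organization trains locally; only the updated weights are shared and averaged.
        return np.mean([local_update(weights, X, y) for X, y in datasets], axis=0)

    rng = np.random.default_rng(0)
    datasets = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(3)]  # three organizations
    weights = np.zeros(3)
    for _ in range(10):  # ten federated rounds
        weights = federated_round(weights, datasets)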

**Secure Enclaves**:
These are hardware-based features that provide an isolated environment to store and process data. A secure enclave is essentially a highly secure space within a larger secure system. Although enclaves are excellent for safely storing data, the privacy guarantee is hardware-dependent and places trust in a physical chip rather than in encryption.
@@ -37,36 +37,48 @@ When performing FL or using any privacy enhancing technologies, it's important t

Here are the assumptions we make in the rest of this document. If they do not match your environment, you might have to take additional measures to ensure full protection of your data.

#. We assume that the Substra network is composed of a relatively small number of organizations agreeing on a protocol. All organizations are honest and follow the agreed-upon protocol without actively trying to be malicious, as we are in a closed, high-trust FL environment rather than a wide-open FL network.
#. Some participants in the network might be honest but curious. This means that they follow the agreed protocol, but may try to infer as much information as possible from the artifacts shared during the federated experiments.
#. The external world (outside of the network) contains malicious actors. We make no assumptions about any external communication and we aim to limit our exposure to the outside world as much as possible.
#. Models are accessible by data scientists in the network (with the right permissions). The data scientist is responsible for making sure that the exported trained model does not contain sensitive information enabling, for example, membership attacks (explained below).
#. Every organization in the network is a responsible actor. Every organization hosts its own node of the Substra network and is responsible for ensuring a minimal level of security for its infrastructure. Regular security audits and/or certifications are recommended.
#. In this document the focus is on protecting data rather than models; thus we do not cover Byzantine attacks *[Fang, M., Cao, X., Jia, J., & Gong, N. (2020). Local model poisoning attacks to Byzantine-robust federated learning]* and backdoor attacks *[Bagdasaryan, E., Veit, A., Hua, Y., Estrin, D., & Shmatikov, V. (2020, June). How to backdoor federated learning]*, which belong to a category of attacks that affect the quality of the generated model rather than compromising the data.

.. note::

We are aware that our initial assumption may seem restrictive, but we make it because Substra does not provide protection against malicious actors within the network. Trust is ensured here through non-technical means: the organizations are honest due to liabilities, regulations and contracts. This excludes any wide-open federated network, where data is made available to any public researcher.

Following these assumptions, the privacy threats when performing Federated Learning can be classified into two categories.

**1. Generic cyber-security attacks:**

If a malicious actor gains access to the internal infrastructure, they can exfiltrate sensitive data (or cause other kinds of mayhem). This is not specific to FL settings, and the inherent decentralization of FL actually reduces the severity of such breaches, even though each communication channel with the external world is a potential attack surface and, by design, part of the code is executed on remote machines.

**2. Attacks specific to FL:**

These are attacks related to the information contained in the mathematical objects exchanged when training a model. In other words, the model updates and/or the final trained model parameters might encode sensitive information about the training dataset. These threats may be relevant for classical machine learning as well, but they are exacerbated in FL because the data is often seen by a model many times. Examples of such threats include:

**a. Membership attacks:**

When a final trained model is used to try to guess whether a specific data sample was used during training *[Membership Inference Attacks against Machine Learning Models, Shokri et al. 2016]*.

Membership attacks are not specific to FL, as they rely on the final trained model. They can be performed in the two following settings:

**- Black box attack:**

This is an attack made from the predictions of a trained model on a given set of samples. A black box attack requires the minimal amount of rights/permissions from the attacker.

For example, only an API to request model predictions is provided to the attacker (a minimal sketch of this setting is given below).

**- White box attack:**

An attack where the attacker needs access to the architecture and weights of the trained model.
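
To make the black box setting concrete, here is an illustrative sketch (independent of Substra) of a naive confidence-threshold membership inference attack; the deliberately overfitted model and the 0.9 threshold are arbitrary choices made for this example:

.. code-block:: python

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X_train, y_train = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)
    X_unseen = rng.normal(size=(200, 10))  # samples that were never used for training

    # A model that overfits its training data leaks membership through its confidence.
    model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

    def guess_is_member(model, X, threshold=0.9):
        # Black box: the attacker only sees predicted probabilities, never the weights.
        return model.predict_proba(X).max(axis=1) > threshold

    print("flagged as members (training set):", guess_is_member(model, X_train).mean())
    print("flagged as members (unseen set):  ", guess_is_member(model, X_unseen).mean())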

**b. Reconstruction attacks:**

When the batch gradients or the FL model updates are used to reconstruct from scratch a data sample used during training *[Inverting Gradients - How easy is it to break privacy in federated learning?, Geiping et al. 2020]*.
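
As a toy illustration of why shared gradients can leak training data (a drastic simplification of the gradient-inversion setting studied in the paper above: a single sample and a linear model with a bias term, both assumed here for clarity), the gradient alone is enough to recover the sample exactly:

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    w, b = rng.normal(size=5), 0.2   # current model weights and bias
    x, y = rng.normal(size=5), 1.3   # one private training sample

    # Gradients of the squared error 0.5 * (w @ x + b - y) ** 2, as shared during a federated round.
    residual = w @ x + b - y
    grad_w, grad_b = residual * x, residual

    # An honest-but-curious recipient of these gradients recovers the sample exactly.
    x_reconstructed = grad_w / grad_b
    print(np.allclose(x, x_reconstructed))  # True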

Other threats in this category also include Re-attribution attacks *[SRATTA: Sample Re-ATTribution Attack of Secure Aggregation in Federated Learning, Marchand et al. 2023]*.

Hence, there are a variety of ways data can become vulnerable. The first layer of protection in a project is always introduced through proper governance: clear agreements that define the responsibilities of those controlling and accessing the data are critical. Secondly, a thoroughly reviewed and tested infrastructure setup should be used, as this layer is the primary defense against any form of cyber attack. Privacy enhancing technologies such as Substra act as the third line of defense against the misuse of data, as they create protective barriers against data leakage.

@@ -81,7 +93,7 @@ To ensure that every participant in the network behaves honestly, Substra provid

As maintainers of Substra, we take cyber security risks very seriously. Substra development follows stringent processes to ensure high code quality (high test coverage, systematic code reviews, automated dependency upgrades, etc.) and the code base is audited regularly by external security experts.

At the infrastructure level, we limit our exposure (only one port is open for communication between the orchestrator and the backend) and enforce strict privilege control of the pods in our namespace. We also strive to follow security best practices such as encryption and access management. We welcome the responsible disclosure of any vulnerability found, which can be emailed to us directly at [email protected].

Some of the risks listed in the previous section are deferred to the user. In particular, each organization is responsible for setting the appropriate level of security in its deployment of Substra. The next section provides some general guidelines and best practices that have worked well in our experience.

@@ -101,8 +113,8 @@ For the GDPR, projects should responsibly complete a Data Processing Impact Asse

Projects should also clearly define responsibilities such as:

- Who are the data controllers.
- Who are the data processors.
- Precisely what actions will be performed on the data and by whom.

Security setup
@@ -118,11 +130,9 @@ When running Substra in production, please ensure that TLS and mTLS (:ref:`ops s

Several teams and personas have to be involved to ensure that a project handles data with maximum privacy and integrity and that these security protocols are upheld at all times.

- **Data scientists** bear a great ethical responsibility, as they could run code that allows for data leakage. Processes such as code review or auditing are highly recommended. It is crucial for them to follow best practices to the best of their ability (code is versioned; dependencies are limited to well-known libraries and kept up to date). A malicious actor here could still infer knowledge about the dataset.
- **Data engineers** must ensure that data is handled and uploaded according to agreed standards, while also ensuring that no additional copies exist and that data is not shared in any way other than on the secure server.
- **SRE / DevOps engineers** also need to follow best practices (encryption options are activated; production-grade passwords are used when relevant; secrets are not shared; 2FA is enabled). Their contributions protect against cyber attacks but cannot prevent data leakage through training.

Conclusion
----------
