Intel SGX Device Plugin returns error "permission denied" for OpenShift 4.13 #113

Closed
Tracked by #188
tsadowsk opened this issue Aug 25, 2023 · 30 comments
Labels: bug, gpu, qat, sgx

@tsadowsk

tsadowsk commented Aug 25, 2023

Summary

While installing the Intel SGX Device Plugin, the intel-sgx-plugin pod reports that it lacks permission to access the kubelet.sock socket. The error occurs on OpenShift 4.13 and was not present on OpenShift 4.12.

Detail

During installation of the Intel SGX Device Plugin, the following error occurs:

oc -n openshift-operators logs pod/intel-sgx-plugin-ng262 -c intel-sgx-plugin
...
E0823 14:04:16.980975       1 manager.go:146] Failed to serve sgx.intel.com/provision: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/device-plugins/kubelet.sock: connect: permission denied"

As a workaround, I granted privileged access to the DaemonSet's pods by editing it with the following command:

oc -n openshift-operators edit ds/intel-sgx-plugin

After replacing:

securityContext:
    allowPrivilegeEscalation: false

with:

securityContext:
    privileged: true

the plugin started working. Most likely full privileged mode is not needed, and the permissions can be limited to only those actually required.
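
A non-interactive form of the edit above could look like the following sketch. It assumes the plugin container is the first entry (index 0) in the DaemonSet's containers list, and the helper name privileged_patch is hypothetical. Since the operator manages this DaemonSet, it may revert manual changes, so this is best treated as a diagnostic step only.

```shell
# Build a JSON patch that swaps allowPrivilegeEscalation: false for privileged: true.
# Assumption: the plugin container sits at the given index (default 0).
privileged_patch() {
  idx="${1:-0}"
  base="/spec/template/spec/containers/${idx}/securityContext"
  printf '[{"op":"remove","path":"%s/allowPrivilegeEscalation"},{"op":"add","path":"%s/privileged","value":true}]\n' "$base" "$base"
}

# Usage against a live cluster (not run here):
# oc -n openshift-operators patch ds/intel-sgx-plugin --type=json -p "$(privileged_patch 0)"
```

The patch removes allowPrivilegeEscalation rather than leaving it in place, because Kubernetes rejects a container that sets it to false together with privileged: true.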

Resolving this issue would be very helpful because 4.13 is the current version of OpenShift, and the plugin works without any issues on the previous version, OpenShift 4.12.

Also, it would be great to make sure that this issue does not occur on the upcoming OpenShift 4.14. Many thanks in advance!

Update as of Dec 14 2023, from @mregmi's latest comment:
Still waiting on the fix to propagate to OCP 4.13 and 4.14 (https://issues.redhat.com/browse/OCPBUGS-20022)

  • Issue root cause: Kubelet is running with the wrong SELinux label on OCP 4.13 and higher

Workaround:

Since the kubelet is running with the wrong SELinux label on OCP 4.13 and beyond, SELinux needs to run in permissive mode as a workaround. To do this, run the following commands on all nodes.

  1. Find all nodes in the OCP cluster:
$ oc get nodes

Example output:

NAME         STATUS   ROLES    AGE   VERSION
icx-dgpu-1   Ready    worker   30d   v1.25.4+18eadca

  2. Navigate to the node terminal on the web console (Compute -> Nodes -> Select a node -> Terminal). Run the following commands in the terminal. Repeat step 2 for any other nodes in the cluster.
$ chroot /host
$ setenforce Permissive
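
For clusters with many nodes, the same workaround could be scripted from a workstation with oc debug instead of the web console. This is a rough sketch: the helper name permissive_cmds is hypothetical, and it only prints the commands so they can be reviewed before running. Note that setenforce changes the mode only until the next reboot.

```shell
# Read node names (as printed by `oc get nodes -o name`, e.g. node/worker0)
# on stdin and print one `oc debug` command per node.
permissive_cmds() {
  while IFS= read -r node; do
    [ -n "$node" ] || continue   # skip empty lines
    printf 'oc debug %s -- chroot /host setenforce Permissive\n' "$node"
  done
}

# Usage (review the output, then pipe it to `sh` to execute):
# oc get nodes -o name | permissive_cmds
```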
hershpa added the sgx label Aug 25, 2023
@mregmi
Member

mregmi commented Aug 25, 2023

Can you also upload /var/log/audit/audit.log?

hershpa added this to the v1.1.0 milestone Aug 25, 2023
@hershpa
Contributor

hershpa commented Aug 25, 2023

Hi @tsadowsk! Thanks for submitting the issue and sharing all the details.
We are aware of this regression on OpenShift 4.13 and it is currently in triage. The behavior and analysis you described above align with our observations. This issue was not present in OpenShift 4.12. We are working to resolve it as soon as possible and have identified at least one component (a missing SELinux patch) in OpenShift 4.13 that is responsible for this regression.

Currently, our operator officially supports OpenShift 4.12 (starting from 4.12.6). We are working on v1.1.0, which will add support for OpenShift 4.13 soon. Thanks for evaluating our operator on OpenShift 4.13 and sharing your experience.

hershpa added the bug label Aug 25, 2023
@tsadowsk
Author

@hershpa I sent /var/log/audit/audit.log directly to you as it might contain some sensitive information.

@mregmi
Member

mregmi commented Aug 30, 2023

It looks like the SELinux policies in container-selinux changed between 4.12 and 4.13. We will investigate.

@hershpa
Contributor

hershpa commented Aug 30, 2023

@mregmi Since this is a regression, can we create a RH ticket, given that these policies were missed in the OCP 4.13 integration? The policies were already part of the upstream container-selinux project and were backported to OCP 4.12.

@hershpa
Contributor

hershpa commented Sep 5, 2023

@tsadowsk, which OCP 4.13 z stream version are you using?

@tsadowsk
Author

tsadowsk commented Sep 6, 2023

@hershpa Below are more details about the version:

oc version
Client Version: 4.11.8
Kustomize Version: v4.5.4
Server Version: 4.13.5
Kubernetes Version: v1.26.6+f245ced

oc get clusterversions
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.5    True        False         34d     Cluster version is 4.13.5

[tsadowsk@igk-0389 templates]$ oc get no -owide
NAME      STATUS   ROLES                  AGE   VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
master0   Ready    control-plane,master   34d   v1.26.6+f245ced   10.10.10.11   <none>        Red Hat Enterprise Linux CoreOS 413.92.202307140015-0 (Plow)   5.14.0-284.23.1.el9_2.x86_64   cri-o://1.26.3-11.rhaos4.13.git78941bf.el9
master1   Ready    control-plane,master   34d   v1.26.6+f245ced   10.10.10.12   <none>        Red Hat Enterprise Linux CoreOS 413.92.202307140015-0 (Plow)   5.14.0-284.23.1.el9_2.x86_64   cri-o://1.26.3-11.rhaos4.13.git78941bf.el9
master2   Ready    control-plane,master   34d   v1.26.6+f245ced   10.10.10.13   <none>        Red Hat Enterprise Linux CoreOS 413.92.202307140015-0 (Plow)   5.14.0-284.23.1.el9_2.x86_64   cri-o://1.26.3-11.rhaos4.13.git78941bf.el9
worker0   Ready    worker                 34d   v1.26.6+f245ced   10.10.10.21   <none>        Red Hat Enterprise Linux CoreOS 413.92.202307140015-0 (Plow)   5.14.0-284.23.1.el9_2.x86_64   cri-o://1.26.3-11.rhaos4.13.git78941bf.el9
worker1   Ready    worker                 34d   v1.26.6+f245ced   10.10.10.22   <none>        Red Hat Enterprise Linux CoreOS 413.92.202307140015-0 (Plow)   5.14.0-284.23.1.el9_2.x86_64   cri-o://1.26.3-11.rhaos4.13.git78941bf.el9
worker2   Ready    worker                 34d   v1.26.6+f245ced   10.10.10.23   <none>        Red Hat Enterprise Linux CoreOS 413.92.202307140015-0 (Plow)   5.14.0-284.23.1.el9_2.x86_64   cri-o://1.26.3-11.rhaos4.13.git78941bf.el9
worker3   Ready    worker                 34d   v1.26.6+f245ced   10.10.10.24   <none>        Red Hat Enterprise Linux CoreOS 413.92.202307140015-0 (Plow)   5.14.0-284.23.1.el9_2.x86_64   cri-o://1.26.3-11.rhaos4.13.git78941bf.el9
worker4   Ready    worker                 34d   v1.26.6+f245ced   10.10.10.31   <none>        Red Hat Enterprise Linux CoreOS 413.92.202307140015-0 (Plow)   5.14.0-284.23.1.el9_2.x86_64   cri-o://1.26.3-11.rhaos4.13.git78941bf.el9
worker5   Ready    worker                 34d   v1.26.6+f245ced   10.10.10.32   <none>        Red Hat Enterprise Linux CoreOS 413.92.202307140015-0 (Plow)   5.14.0-284.23.1.el9_2.x86_64   cri-o://1.26.3-11.rhaos4.13.git78941bf.el9
worker6   Ready    worker                 34d   v1.26.6+f245ced   10.10.10.33   <none>        Red Hat Enterprise Linux CoreOS 413.92.202307140015-0 (Plow)   5.14.0-284.23.1.el9_2.x86_64   cri-o://1.26.3-11.rhaos4.13.git78941bf.el9

Please let me know if you need more info.

@hershpa
Contributor

hershpa commented Sep 6, 2023

Thanks @tsadowsk

@hershpa
Contributor

hershpa commented Sep 6, 2023

Can you try OCP 4.13.6? That has the expected version of container-selinux 2.215.0.

@uMartinXu
Contributor

Thanks @hershpa!
@tsadowsk After investigating and syncing with RH, it looks like container-selinux 2.215.0, which includes the SELinux policy needed by SGX provisioning, was not properly integrated into OCP 4.13.5. RH told us this regression should have been resolved in 4.13.6. Could you please upgrade OCP to this z release and give it a try? If you still have the issue, please let us know. Again, thank you very much for reporting this regression. :-)

@tsadowsk
Author

tsadowsk commented Sep 15, 2023

@uMartinXu @hershpa
I tried to run the Intel Device Plugin with OpenShift 4.13.10.
Unfortunately, I still get a permission denied error for /var/lib/kubelet/device-plugins/kubelet.sock:

oc -n openshift-operators logs pod/intel-sgx-plugin-qqrtx
...
E0915 12:49:36.740419       1 manager.go:146] Failed to serve sgx.intel.com/enclave: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/device-plugins/kubelet.sock: connect: permission denied"
...

oc version 
Client Version: 4.9.18
Server Version: 4.13.10
Kubernetes Version: v1.26.7+0ef5eae

oc get clusterversions
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.10   True        False         8d      Cluster version is 4.13.10

oc get no -owide
NAME      ROLES                  VERSION            OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
master1   control-plane,master   v1.26.7+0ef5eae    Red Hat Enterprise Linux CoreOS 413.92.202308210212-0 (Plow)   5.14.0-284.28.1.el9_2.x86_64   cri-o://1.26.4-3.rhaos4.13.git615a02c.el9
worker1   worker                 v1.26.7+0ef5eae    Red Hat Enterprise Linux CoreOS 413.92.202308210212-0 (Plow)   5.14.0-284.28.1.el9_2.x86_64   cri-o://1.26.4-3.rhaos4.13.git615a02c.el9

Below are the YAML files that I used for the Intel Device Plugin installation:

---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: intel-device-plugins-operator
  namespace: openshift-operators
spec:
  name: intel-device-plugins-operator
  source: certified-operators
  sourceNamespace: openshift-marketplace
  channel: alpha
  installPlanApproval: Automatic
  startingCSV: intel-device-plugins-operator.v0.26.1
---
kind: SgxDevicePlugin
apiVersion: deviceplugin.intel.com/v1
metadata:
  name: sgxdeviceplugin
spec:
  enclaveLimit: 110
  image: >-
    registry.connect.redhat.com/intel/intel-sgx-plugin@sha256:60a8cf855383bd149822c48b7369540c3e806b0b77efdf0f4aac7831ce1bb1b2
  initImage: >-
    registry.connect.redhat.com/intel/intel-sgx-initcontainer@sha256:18f0695fd5614c86e555423117c648be866f9b936fd1d2023c8949590fb549e3
  logLevel: 4
  nodeSelector:
    intel.feature.node.kubernetes.io/sgx: 'true'
  provisionLimit: 110

Could you please help?

Please let me know if you need more info.

@hershpa
Contributor

hershpa commented Sep 15, 2023

@tsadowsk thanks for the update. I am testing it on our end, let me see if I observe what you saw above.

@mythi

mythi commented Sep 15, 2023

FYI, @eadamsintel got everything deployed OK on 4.13.11.

@eadamsintel

I got it to work by turning off SELinux, which is not really a proper solution.

@Feelas

Feelas commented Sep 18, 2023

Could this issue be a generalized case of RHEL-3128? Before the JIRA migration it was tracked as Bug 2180456.
It looks very close to what we are seeing.

The similarities:

  1. Trying to access kubelet.sock from unprivileged container, which fails,
  2. AVC issues reported by SELinux,
  3. Kubelet is running with unconfined_service_t, which looks to be unexpected by Red Hat,
  4. In comment from March 27, "{ connectto }" denial is also reported as it is in our case.

The differences:

  1. That bug concerns the NUMA-aware scheduler specifically.

Please note that at least one of Red Hat's own operators identified the fix for Bug 2180456 as the long-term solution and applies short-term SELinux policy workarounds until it is fixed. The "{ connectto }" denial was also present there if you take a look at the comments.
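
To check for the same "{ connectto }" denial on a node, the audit log can be filtered for AVC denial records. A minimal sketch, assuming the standard audit log record format; the function name connectto_denials is hypothetical.

```shell
# Print AVC denial records for the connectto permission
# from audit log text supplied on stdin.
connectto_denials() {
  grep 'avc:' | grep 'denied' | grep 'connectto'
}

# Usage on a node (as root):
# connectto_denials < /var/log/audit/audit.log
```

In the matching records, the tcontext field shows the label of the process owning the socket, which is where an unexpected unconfined_service_t would show up.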

@mregmi @hershpa could you take a look at this?

@mregmi
Member

mregmi commented Sep 29, 2023

It looks like the kubelet is running with an incorrect SELinux label, which causes the SELinux access denial for the plugins.
The kubelet should not run as unconfined_service_t; its binary is labeled kubelet_exec_t, so it should transition to the corresponding kubelet domain (kubelet_t).

The Bugzilla above touches on this issue but does not seem to provide a solution or fix. We will investigate further why it is happening and check with Red Hat too.

sh-5.1# ps -AZ | grep unconfined
system_u:system_r:unconfined_service_t:s0 8719 ? 00:24:50 kubelet
sh-5.1# ls -Z /usr/bin/kubelet
system_u:object_r:kubelet_exec_t:s0 /usr/bin/kubelet
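
The check above can be reduced to a small helper that prints the SELinux type of the kubelet process. This is a sketch: the function name kubelet_selinux_type is hypothetical, and it assumes the ps -AZ column layout shown above (context first, command name last).

```shell
# Read `ps -AZ` output on stdin and print the SELinux type
# (third field of user:role:type:level) of the first kubelet process.
kubelet_selinux_type() {
  awk '$NF == "kubelet" { split($1, ctx, ":"); print ctx[3]; exit }'
}

# Usage on a node:
# ps -AZ | kubelet_selinux_type
```

On an affected node this prints unconfined_service_t, matching the output in this comment.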

hershpa added the qat and gpu labels Oct 3, 2023
@hershpa
Contributor

hershpa commented Oct 3, 2023

Same issue for all 3 device plugins (SGX, GPU, QAT). We need to work with RH to resolve this regression/bug.

@mregmi
Member

mregmi commented Oct 3, 2023

This is actively being looked at by Red Hat: https://issues.redhat.com/browse/OCPBUGS-20022

@hershpa
Contributor

hershpa commented Oct 3, 2023

Thanks @mregmi!

@mregmi
Member

mregmi commented Oct 9, 2023

@tsadowsk Which Intel Device Plugin Operator version are you using, and where did you get the images from? Did you build it from IDPO upstream?

@tsadowsk
Author

@mregmi We are using Intel Device Plugin 0.26.1 from the alpha channel provided by Operator Hub. We haven't built it ourselves, so this is the default Intel Device Plugin Operator without customizations.

@hershpa
Contributor

hershpa commented Nov 4, 2023

We are waiting for a container SELinux patch to show up in an OCP 4.13.z and 4.14.z release.
Refer to parent issue https://issues.redhat.com/browse/OCPBUGS-20022,
4.13: https://issues.redhat.com/browse/OCPBUGS-22272
4.14: https://issues.redhat.com/browse/OCPBUGS-22270.

@tsadowsk
Author

tsadowsk commented Nov 6, 2023

@hershpa I checked the issues, i.e. https://issues.redhat.com/browse/OCPBUGS-20022, and it looks like the blocker for this ticket, the podman change containers/container-selinux#277, has been merged. I noticed this in the Red Hat ticket.

@hershpa
Contributor

hershpa commented Nov 15, 2023

Waiting on Red Hat for visibility to target 4.13.z and 4.14.z release with the patch.

@brgavino

Given that there's no LZ for this fix in upstream OCP, what's the suggested workaround for 4.13.11 (latest supported z per README)?

@mregmi
Member

mregmi commented Dec 13, 2023

Still waiting on the fix to propagate to OCP 4.13 and 4.14 (https://issues.redhat.com/browse/OCPBUGS-20022)

Issue root cause: Kubelet is running with the wrong SELinux label in OCP 4.13 and higher

Workaround: Since the kubelet is running with the wrong SELinux label in OCP 4.13 and beyond, we need to run SELinux in permissive mode as a workaround. To do this, run the following command on all nodes:
# setenforce Permissive

@mregmi
Member

mregmi commented Jan 31, 2024

This is fixed in 4.14.10
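
A quick way to tell whether a cluster already includes the fix is to compare its version against 4.14.10. This is a sketch, assuming GNU sort with the -V option is available; version_ge is a hypothetical helper. This thread confirms only the 4.14.10 fix, so the threshold for other release streams would need to be looked up separately.

```shell
# Succeed (exit 0) if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort) putting the smaller version first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage against a live cluster (not run here):
# current="$(oc get clusterversion version -o jsonpath='{.status.desired.version}')"
# version_ge "$current" 4.14.10 && echo "fix included" || echo "workaround still needed"
```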

@brgavino

brgavino commented Feb 8, 2024

Just a note here: even with SELinux set to "permissive", pods accessing the GPU through /dev/dri/renderDXXX will always pick up the default device /dev/dri/renderD128, even when that is not the pod's assigned resource. This results in errors when trying to use the device for any work (e.g. ffmpeg). This may be related to the SELinux config and could be added to the regression tests as a sanity check.

This was probably resolved in this thread (intel/intel-device-plugins-for-kubernetes#1377), but the script used to check which device is usable (https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/cmd/gpu_plugin/render-device.sh) will always find the first device writable, so it's mostly useless for two GPUs on a single node.

@mythi

mythi commented Feb 8, 2024

/cc @tkatila for visibility

vbedida79 added a commit to vbedida79/intel-technology-enabling-for-openshift that referenced this issue Feb 15, 2024
uMartinXu added a commit that referenced this issue Feb 15, 2024
device_plugins: Remove workaround for fixed issue #113