Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
Dependencies (internal and external)
ccoctl should be able to delete AWS resources it created
ccoctl delete <infra-name>
The tool should be able to upload an OpenID Connect (OIDC) configuration to an S3 bucket, and create an AWS IAM Identity Provider that trusts identities from the OIDC provider. It should take infra name as input so that user can identify all the resources created in AWS. Make sure that resources created in AWS are tagged appropriately.
Sample command with existing key pair:
tool-name create identity-provider <infra-name> --public-key ./path/to/public/key
Ensure the Identity Provider includes audience config for both the in-cluster components ('openshift') and the pod-identity-webhook ('sts.amazonaws.com').
As an OpenShift administrator
I want the registry operator to use topology mode from Infrastructure (HighAvailable = 2 replicas, SingleReplica = 1 replica)
so that the operator does not spend resources on high availability when it is not needed.
See also:
https://github.com/openshift/enhancements/blob/master/enhancements/cluster-high-availability-mode-api.md
https://github.com/openshift/api/pull/827/files
Platform | SingleReplica | HighAvailable |
---|---|---|
AWS | 1 replica | 2 replicas |
Azure | 1 replica | 2 replicas |
GCP | 1 replica | 2 replicas |
OpenStack (swift) | 1 replica | 2 replicas |
OpenStack (cinder) | 1 replica | 1 replica (PVC) |
oVirt | 1 replica | 1 replica (PVC) |
bare metal | Removed | Removed |
vSphere | Removed | Removed |
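The table above can be read as a lookup from storage backend and topology mode to a replica count. The following is a minimal sketch of that lookup, not the registry operator's code; the helper `registry_replicas` and the backend keys are hypothetical names chosen for illustration.

```python
from typing import Optional

# Replica counts copied from the table above. The backend keys are
# illustrative names only, not identifiers used by the registry operator.
REPLICAS = {
    "aws": {"SingleReplica": 1, "HighAvailable": 2},
    "azure": {"SingleReplica": 1, "HighAvailable": 2},
    "gcp": {"SingleReplica": 1, "HighAvailable": 2},
    "openstack-swift": {"SingleReplica": 1, "HighAvailable": 2},
    "openstack-cinder": {"SingleReplica": 1, "HighAvailable": 1},  # PVC
    "ovirt": {"SingleReplica": 1, "HighAvailable": 1},  # PVC
}

# Platforms where the table lists the registry as Removed.
REMOVED = {"baremetal", "vsphere"}


def registry_replicas(backend: str, topology: str) -> Optional[int]:
    """Return the replica count, or None where the registry is Removed."""
    if backend in REMOVED:
        return None
    return REPLICAS[backend][topology]
```

For example, `registry_replicas("openstack-cinder", "HighAvailable")` yields 1, because a PVC-backed registry stays at a single replica in both modes.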
https://github.com/openshift/enhancements/pull/555
https://github.com/openshift/api/pull/827
The console operator will need to support single-node clusters.
We have a console deployment and a downloads deployment. Each will need to be updated so that there's only a single replica when high availability mode is disabled in the Infrastructure config. We should also remove the anti-affinity rule in the console deployment that tries to spread console pods across nodes.
The downloads deployment is currently a static manifest. That likely needs to be created by the console operator instead going forward.
Acceptance Criteria:
Bump github.com/openshift/api to pickup changes from openshift/api#827
Research if we can dynamically reserve memory and CPU for nodes.
OpenShift Sandboxed Containers provide the ability to add an additional layer of isolation through virtualization for many workloads. The main way to enable the use of katacontainers on an OpenShift cluster is by first installing the Operator (for more information about operator enablement check [1]).
Once the feature is enabled on the cluster, it is just a matter of a one-line YAML modification at the pod/deployment level to run the workload using katacontainers. That might sound easy for some, but others who don't care about YAML might want more abstractions for using katacontainers with their workloads.
This feature covers all the efforts required to integrate and present Kata in the OpenShift UI (console) to cater to all user personas.
To enable users to adopt Kata as a runtime, it is important to make it easy to use. Adding hook-points in the UI with ease of use as a goal is one way to bring in more users.
The main goal of this feature is to make sure that:
Questions to be addressed:
References
[1] https://issues.redhat.com/browse/KATA-429?jql=project %3D KATA AND issuetype %3D Feature
The grand goal is to improve the usability of Kata from Openshift UI. This EPIC aims to cover only a subset that would help:
To use a different runtime e.g., Kata, the "runtimeClassName" will be set to the desired low-level runtime. Also please see [1]:
"RuntimeClassName refers to a RuntimeClass object in the node.k8s.io group, which should be used to run this pod. If no RuntimeClass resource matches the named class, the pod will not be run. If unset or empty, the "legacy" RuntimeClass will be used, which is an implicit class with an empty definition that uses the default runtime handler. More info: https://git.k8s.io/enhancements/keps/sig-node/runtime-class.md This is a beta feature as of Kubernetes v1.14.."
apiVersion: v1
kind: Pod
metadata:
  name: nginx-runc
spec:
  runtimeClassName: runC
The value of the runtime class cannot be changed on the pod level, but it can be changed on the deployment level
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sandboxed-nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sandboxed-nginx
  template:
    metadata:
      labels:
        app: sandboxed-nginx
    spec:
      runtimeClassName: kata # ---> This can be changed
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
          protocol: TCP
[1] https://docs.openshift.com/container-platform/4.6/rest_api/workloads_apis/pod-core-v1.html
We should show the runtime class on workloads pages and add a badge to the heading in the case a workload uses Kata. A workload uses Kata if its pod template has `runtimeClassName` set to `kata`.
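The rule in the previous sentence can be sketched as a small predicate over a workload manifest. This is an illustrative sketch only, not the console's implementation; `uses_kata` is a hypothetical helper, and it only handles workloads that are either a Pod or carry a `spec.template` pod template (it does not unwrap nested templates such as a CronJob's).

```python
def uses_kata(workload: dict) -> bool:
    """Return True if the workload's pod spec sets runtimeClassName to 'kata'.

    For a Pod, the pod spec is `spec`; for workloads with a pod template
    (Deployment, StatefulSet, ...), it is `spec.template.spec`.
    """
    spec = workload.get("spec", {})
    pod_spec = spec.get("template", {}).get("spec", spec)
    return pod_spec.get("runtimeClassName") == "kata"
```

A badge or details-page field would then be shown only when this predicate holds for the resource being displayed.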
Acceptance Criteria:
Andrew Ronaldson indicated that adding a "kata" badge in the heading would be too much noise around other heading badges (ContainerCreating, Failed, etc).
When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release
We drive OpenShift cross-market customer success and new customer adoption with constant improvements and feature additions to the existing capabilities of our OpenShift Core Networking (SDN and Network Edge). This feature captures that natural progression of the product.
There are definitely grey areas, but in general:
Questions to be addressed:
Create a PR in openshift/cluster-ingress-operator to specify the random balancing algorithm if the feature gate is enabled, and to specify the leastconn balancing algorithm (the current default) otherwise.
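The selection described above amounts to a single conditional. A minimal sketch, assuming a hypothetical boolean `feature_gate_enabled` input (how the operator actually reads the feature gate is not shown in this card):

```python
def balance_algorithm(feature_gate_enabled: bool) -> str:
    """Pick the balancing algorithm: 'random' behind the gate, else 'leastconn'."""
    return "random" if feature_gate_enabled else "leastconn"
```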
Create a PR in openshift/cluster-ingress-operator to implement the PROXY protocol API.
CoreDNS v1.7 renamed some metrics that we use in our alerting rules. Make sure the alerting rules in https://github.com/openshift/cluster-dns-operator/blob/master/manifests/0000_90_dns-operator_03_prometheusrules.yaml are using the correct metrics names (and still work as intended).
This story is for actually updating the version of CoreDNS in github.com/openshift/coredns. Our fork will need to be rebased onto https://github.com/coredns/coredns/releases/tag/v1.8.1, which may involve some git fu. Refer to previous CoreDNS rebase PRs for any pointers there.
We need to verify that no new CoreDNS dual stack features require any configuration changes or feature flags.
(All dual stack changes should just work once we rebase to coredns v1.8.1).
See https://github.com/coredns/coredns/pull/4339 .
We also need to verify that cluster DNS works for both v4 and v6 for a dual stack cluster IP service (i.e. request via A and AAAA, and make sure you get the desired response for both, not just one or the other). A brief CI test on our dual stack metal CI might make the most sense here (KNI might have a job like this already; we need to investigate our options to add dual stack coverage to openshift/coredns).
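The A and AAAA check described above could be sketched as follows. This is an illustration of the idea, not an existing CI test; `resolves_dual_stack` is a hypothetical helper, and the resolver is injectable so the logic can be exercised without a dual stack cluster.

```python
import socket


def resolves_dual_stack(host: str, resolve=socket.getaddrinfo) -> bool:
    """Return True only if `host` resolves for both IPv4 (A) and IPv6 (AAAA)."""
    for family in (socket.AF_INET, socket.AF_INET6):
        try:
            if not resolve(host, None, family):
                return False  # lookup succeeded but returned no addresses
        except socket.gaierror:
            return False  # no records for this address family
    return True
```

Run from a pod in a dual stack cluster, the function would be called with the service's cluster DNS name; the check fails if either address family is missing.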
The multiple destinations provided as a part of the allowedDestinations field is not working as it used to on OCP4: https://github.com/openshift/images/blob/master/egress/router/egress-router.sh#L70-L109
We need to parse this from the NAD and modify the iptables here to support them:
https://github.com/openshift/egress-router-cni/blob/master/pkg/macvlan/macvlan.go#L272-L349
Testing:
1) Created NAD:
[dsal@bkr-hv02 surya_multiple_destinations]$ cat nad_multiple_destination.yaml
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: egress-router
spec:
  config: '{
    "cniVersion": "0.4.0",
    "type": "egress-router",
    "name": "egress-router",
    "ip": {
      "addresses": [
        "10.200.16.10/24"
      ],
      "destinations": [
        "80 tcp 10.100.3.200",
        "8080 tcp 203.0.113.26 80",
        "8443 tcp 203.0.113.26 443"
      ],
      "gateway": "10.200.16.1"
    }
  }'
2) Created pod:
[dsal@bkr-hv02 surya_multiple_destinations]$ cat egress-router-pod.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: egress-router-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: egress-router
spec:
  containers:
  - name: openshift-egress-router-pod
    command: ["/bin/bash", "-c", "sleep 999999999"]
    image: centos/tools
    securityContext:
      privileged: true
3) Checked IPtables:
[root@worker-1 core]# iptables-save -t nat
# Generated by iptables-save v1.8.4 on Mon Feb 1 12:08:05 2021
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A POSTROUTING -o net1 -j SNAT --to-source 10.200.16.10
COMMIT
# Completed on Mon Feb 1 12:08:05 2021
As we can see, only the SNAT rule is added. The DNAT doesn't get picked up because of the syntax difference.
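To make the missing piece concrete, the sketch below turns destination entries of the form used in the NAD above (`<port> <protocol> <ip>` with an optional remote port, or a bare fallback IP) into DNAT rule strings. It is an illustration under stated assumptions, not the egress-router-cni code: `dnat_rules` is a hypothetical helper, the bare fallback-IP form is assumed from the legacy script, and the inbound interface name `eth0` is assumed.

```python
from typing import List


def dnat_rules(destinations: List[str], iface: str = "eth0") -> List[str]:
    """Build iptables DNAT rule strings from egress-router destination entries."""
    rules = []
    for entry in destinations:
        fields = entry.split()
        if len(fields) == 1:
            # Bare IP: fallback destination for all remaining traffic.
            rules.append(
                f"-A PREROUTING -i {iface} -j DNAT --to-destination {fields[0]}"
            )
        elif len(fields) in (3, 4):
            port, proto, ip = fields[:3]
            # An optional fourth field remaps the destination port.
            target = ip if len(fields) == 3 else f"{ip}:{fields[3]}"
            rules.append(
                f"-A PREROUTING -i {iface} -p {proto} --dport {port} "
                f"-j DNAT --to-destination {target}"
            )
        else:
            raise ValueError(f"unrecognized destination entry: {entry!r}")
    return rules
```

With the three destinations from the NAD above, this produces one DNAT rule per entry, which is what the `iptables-save` output is missing.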
Plugin teams need a mechanism to extend the OCP console that is decoupled enough so they can deliver at the cadence of their projects and not be forced in to the OCP Console release timelines.
The OCP Console Dynamic Plugin Framework will enable all our plugin teams to do the following:
Requirement | Notes | isMvp? |
---|---|---|
UI to enable and disable plugins | | YES |
Dynamic Plugin Framework in place | | YES |
Testing Infra up and running | | YES |
Docs and read me for creating and testing Plugins | | YES |
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
Documentation Considerations
Questions to be addressed:
Related to CONSOLE-2380
We need a way for cluster admins to disable a console plugin when uninstalling an operator if it's enabled in the console operator config. Otherwise, the config will reference a plugin that no longer exists. This won't prevent console from loading, but it's something that we can clean up during uninstall.
The UI will always remove the console plugin when an operator is uninstalled. There will not be an option to keep the operator. We should have a sentence in the dialog letting the user know that the plugin will be disabled when the operator is uninstalled (but only if the CSV has the plugin annotation).
If the user doesn't have authority to patch the operator config, we should warn them that the operator config can't be updated to remove the plugin.
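The cleanup step could be sketched as computing a merge patch for the console operator config. This is a hedged illustration, not the console's code: `plugin_removal_patch` is a hypothetical helper, and it assumes the enabled plugins are listed by name under `spec.plugins`.

```python
from typing import List, Optional


def plugin_removal_patch(enabled_plugins: List[str], plugin: str) -> Optional[dict]:
    """Return a merge patch that drops `plugin`, or None if it is not enabled."""
    if plugin not in enabled_plugins:
        return None  # nothing to clean up
    remaining = [name for name in enabled_plugins if name != plugin]
    # Assumption: enabled plugins live under spec.plugins in the operator config.
    return {"spec": {"plugins": remaining}}
```

If the user lacks permission to apply such a patch, the dialog would show the warning described above instead of attempting the update.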
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
This would let us import YAML with multiple resources and add YAML templates that create related resources like image streams and build configs together.
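Importing YAML with multiple resources starts with splitting the input into its individual documents. The sketch below shows a deliberately naive, standard-library-only version of that step; `split_yaml_documents` is a hypothetical helper, and it assumes documents are separated by a line containing only `---` (a real implementation would use a YAML parser, which also handles separators inside block scalars).

```python
from typing import List


def split_yaml_documents(text: str) -> List[str]:
    """Split multi-document YAML on '---' separator lines, skipping empty docs."""
    documents, current = [], []
    for line in text.splitlines():
        if line.rstrip() == "---":
            if any(part.strip() for part in current):
                documents.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if any(part.strip() for part in current):
        documents.append("\n".join(current))
    return documents
```

Each returned document could then be parsed and created on its own, which is what allows a single template to carry, say, an image stream and a build config together.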
See CONSOLE-580
Acceptance criteria:
The work on this story is dependent on following changes:
The console already supports custom routes on the operator config. The newly proposed CustomDomains API introduces a unified way to install custom domains for routes for which customers want to customise both the names and the serving cert/keys. From the console perspective those are:
The setup should be done on the Ingress config. Two new fields are introduced there:
Console-operator will only consume the API and check for any changes. If a custom domain is set for either the `console` or `downloads` route in the `openshift-console` namespace, console-operator will read the setup and set a custom route accordingly. When a custom route is set up for any of the console's routes, the default route won't be deleted; instead it will be updated so that it redirects to the custom one. This is done for two reasons:
Console-operator will still need to support the CustomDomain API that is available on its config.
Acceptance criteria:
Questions:
Bump openshift/api godep to pick up the new CustomDomain API for the Ingress config.
Implement console-operator changes to consume new CustomDomains API, based on the story details.
When moving to OCP 4 we didn't port the metrics charts for Deployments, Deployment Configs, StatefulSets, DaemonSets, ReplicaSets, and ReplicationControllers. These should be the same charts that we show on the Pods page: Memory, CPU, Filesystem, Network In and Out.
This was only done for pods.
We need to decide if we want to use a multiline chart or some other representation.
Story:
As a user viewing the pod logs tab with a selected container, I want the ability to view past logs if they are available for the container.
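The Kubernetes pod log endpoint accepts a `previous` query parameter that returns logs from the previous instance of a container. A minimal sketch of building such a request path follows; `pod_log_path` is a hypothetical helper and says nothing about how the console itself issues the request.

```python
from urllib.parse import urlencode


def pod_log_path(namespace: str, pod: str, container: str, previous: bool = False) -> str:
    """Build the API path for a container's logs, optionally the previous instance."""
    query = {"container": container}
    if previous:
        query["previous"] = "true"
    return f"/api/v1/namespaces/{namespace}/pods/{pod}/log?{urlencode(query)}"
```

The request for previous logs fails when the container has no earlier instance, so a UI control built on this would need to handle that case.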
Acceptance Criteria:
Design doc: https://docs.google.com/document/d/1PB8_D5LTWhFPFp3Ovf85jJTc-zAxwgFR-sAOcjQCSBQ/edit#
Feature Overview
This will be phase 1 of Internationalization of the OpenShift Console.
Phase 1 will include the following:
Phase 1 will not include:
Initial List of Languages to Support
---------- 4.7* ----------
*This will be based on the ability to get all the strings externalized, there is a good chance this gets pushed to 4.8.
---------- Post 4.7 ----------
POC
Goals
Internationalization has become table stakes. OpenShift Console needs to support different languages in each of the major markets. This is key functionality that will help unlock sales in different regions.
Requirements
Requirement | Notes | isMvp? |
---|---|---|
Language Selector | | YES |
Localized Date + Time | | YES |
Externalization and translation of all client side strings | | YES |
Translation for Chinese and Japanese | | YES |
Process, infra, and testing capabilities put into place | | YES |
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Out of Scope
Assumptions
Customer Considerations
We are rolling out this feature in phases; based on customer feedback, there may be no phase 2.
Documentation Considerations
I believe documentation already supports a large language set.
We need to automate how we send and receive updated translations using Memsource for the Red Hat Globalization team. The Ansible Tower team already has automation in place that we might be able to reuse.
Acceptance Criteria:
We have too many namespaces if we're loading them upfront. We should consolidate some of the files.
Just do namespaces from A-D to reduce number of files being changed at once
Consolidate namespaces N-R to reduce change size
Consolidate namespaces E-I to reduce change size
Consolidate namespaces S-Z to reduce change size
Consolidate namespaces K-M to reduce change size
The OCP Console needs to detect if the ACM Operator has been installed; if it is detected, a new multi-cluster perspective option shows up in the perspective chooser.
As a user I need the ability to switch to the ACM UI from the OCP Console and vice versa without having to log in multiple times.
This option also needs to be hidden if the user doesn't have the correct RBAC.
The console should detect the presence of the ACM operator and add an Advanced Cluster Management item to the perspective switcher. We will need to work with the ACM team to understand how to detect the operator and how to discover the ACM URL.
Additionally, we will need to provide a query parameter or URL fragment to indicate which perspective to use. This will allow ACM to link back to a specific perspective since it will share the same perspective switcher in its UI. ACM will need to be able to discover the console URL.
This story does not include handling SSO, which will be tracked in a separate story.
We need to determine what RBAC checks to make before showing the ACM link.
Acceptance Criteria
1. Console shows a link to ACM in its perspective switcher
2. Console provides a way for ACM to link back to a specific perspective
3. The ACM option only appears when the ACM operator is installed
4. ACM should open in the same browser tab to give the appearance of it being one application
5. Only users with appropriate RBAC should see the link (access review TBD)
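Criterion 2 could be met with a query parameter on the console URL. The sketch below shows one way to construct such a link; the parameter name `perspective`, the helper `with_perspective`, and the example host are all assumptions for illustration, since the story leaves the mechanism (query parameter or URL fragment) open.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_perspective(console_url: str, perspective: str, param: str = "perspective") -> str:
    """Return console_url with the perspective query parameter set (or replaced)."""
    parts = urlsplit(console_url)
    # Keep every existing parameter except a previous perspective value.
    query = [(key, value) for key, value in parse_qsl(parts.query) if key != param]
    query.append((param, perspective))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

ACM would build its link back to the console this way once it has discovered the console URL.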
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.
Console operator should swap from using monis.app to openshift/operator-boilerplate-legacy. This will allow switching to klog/v2, which the shared libs (api,client-go,library-go) have already done.
Node is currently 10.x. Let's increase that to at least 14.x.
It will require some changes on the ART side as well as in OSBS builds
This is required to bump node to avoid https://github.com/webpack/webpack/issues/4629. We need to evaluate whether this has a domino effect on our webpack dependencies.
See https://github.com/openshift/console/pull/7306#issuecomment-755509361
This epic is mainly focused to track the dev console QE automation activities for 4.8 release
1. Identify the scenarios for automation
2. Segregate the test cases into smoke, Regression and user stories
3. Designing the gherkin scripts with below priority
- Update the Smoke test suite
- Update the Regression test suite
4. Create the automation scripts using cypress
5. Implement CI
This improves the quality of the product
This is not related to any UI features. It is mainly focused on UI automation
This story is mainly about pushing the pipelines code from dev console to the gitops plugin folder for extensibility purposes
As an operator QE, I should be able to execute them on my operator folder
1. All pipelines scripts should be able to execute in the gitops plugin folder
2. gitops operator installation needs to be done by the script
Consolidate cypress cucumber and cypress frameworks related to plugins/index.js files
CI implementation for pipelines, knative, devconsole
update package.json file
CI for pipelines:
Any update related to pipelines should execute pipelines smoke tests
on nightly builds, pipelines regression should be executed [TBD]
CI for devconsole:
Any update related to devconsole should execute devconsole smoke tests
on nightly builds, devconsole regression should be executed [TBD]
CI for knative:
Any update related to knative should execute knative smoke tests
on nightly builds, knative regression should be executed [TBD]
Setup the CI for all plugins smoke test scripts
References for CI implementation
Fix the feature file lint issues that occur when executing `yarn run test-cypress-devconsole-headless`, and move the topology features to the topology folder
Design the cypress scripts for the epic ODC-3991
Refer the Gherkin scripts https://issues.redhat.com/browse/ODC-5430
As a user,
All automation possible test scenarios related to EPIC ODC-3991 should be automated
Pipelines operator needs to be installed
This story is mainly about pushing the pipelines code from dev console to the pipelines plugin folder for extensibility purposes
Verify the pipelines regression test suite
As an operator QE, I should be able to execute them on my operator folder
1. All pipelines scripts should be able to execute in the pipelines plugin folder
2. Pipelines operator installation needs to be done by the script
updated all automation scripts and verify the execution on remote cluster
As a user,
Execute them on Chrome browser and 4.8 release cluster
Would like to include integration-tests for topology folder
Currently the PR looks too large. To reduce its size, these sub-tasks are being created
Updating the ReadMe documentation for knative plugin folder
This helps to automatically notify the web terminal team members on test scenario changes
Adding the owners file to service mesh helps us add automatic reviewers on updates to these gherkin scripts
Create GitHub templates with criteria for meeting the Gherkin script standards and automation script standards
Fixing all gherkin linter errors
As this .gherkin-lintrc is mainly used by the QE team, it does not need to be in the frontend folder, so I am moving it to the dev-console/integration-tests folder
Adding all necessary tags and modifying below rules due to recently observed scenarios
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled
Please read: migrating-protractor-tests-to-cypress
Protractor test to migrate: `frontend/integration-tests/tests/filter.scenario.ts`
4) Filtering
✔ filters Pod from object detail
✔ filters invalid Pod from object detail
✔ filters from Pods list
⚠ CONSOLE-1503 - searches for object by label
✔ searches for pod by label and filtering by name
✔ searches for object by label using by other kind of workload
Acceptance Criteria
Please read: migrating-protractor-tests-to-cypress
Protractor test to migrate: `frontend/integration-tests/tests/storage.scenario.ts`
Loops through 6 storage kinds:
15) Add storage is applicable for all workloads
16) replicationcontrollers
✔ create a replicationcontrollers resource
✔ add storage to replicationcontrollers
17) daemonsets
✔ create a daemonsets resource
✔ add storage to daemonsets
18) deployments
✔ create a deployments resource
✔ add storage to deployments
19) replicasets
✔ create a replicasets resource
✔ add storage to replicasets
20) statefulsets
✔ create a statefulsets resource
✔ add storage to statefulsets
21) deploymentconfigs
✔ create a deploymentconfigs resource
✔ add storage to deploymentconfigs
Acceptance Criteria
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled
discover-etcd-initial-cluster was written very early on in the cluster-etcd-operator life cycle. We have observed at least one bug in this code and in order to validate logical correctness it needs to be rewritten with unit tests.
This is a clone of issue OCPBUGS-249. The following is the description of the original issue:
—
+++ This bug was initially created as a clone of Bug #2070318 +++
Description of problem:
In OCP VRRP deployment (using OCP cluster networking), we have an additional data interface which is configured along with the regular management interface in each control node. In some deployments, the kubernetes address 172.30.0.1:443 is nat’ed to the data interface instead of the mgmt interface (10.40.1.4:6443 vs 10.30.1.4:6443 as we configure the bootstrap node) even though the default route is set to the 10.30.1.0 network. Because of that, all requests to 172.30.0.1:443 failed. After 10-15 minutes, OCP magically fixes it and nat’s correctly to 10.30.1.4:6443.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.Provision OCP cluster using cluster networking for DNS & Load Balancer instead of external DNS & Load Balancer. Provision the host with 1 management interface and an additional interface for data network. Along with OCP manifest, add manifest to create a pod which will trigger communication with kube-apiserver.
2.Start cluster installation.
3.Check on the custom pod log in the cluster when the first 2 master nodes were installing to see GET operation to kube-apiserver timed out. Check nft table and chase the ip chains to see that the data IP address was nat'ed to kubernetes service IP address instead of the management IP. This is not happening all the time, we have seen 50:50 chance.
Actual results:
After 10-15 minutes OCP will correct that by itself.
Expected results:
Wrong natting should not happen.
Additional info:
ClusterID: 24bbde0b-79b3-4ae6-afc5-cb694fa48895
ClusterVersion: Stable at "4.8.29"
ClusterOperators:
clusteroperator/authentication is not available (OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ocp-binhle-wqepch.contrail.juniper.net/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) because OAuthServerRouteEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.ocp-binhle-wqepch.contrail.juniper.net/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
clusteroperator/baremetal is degraded because metal3 deployment inaccessible
clusteroperator/console is not available (RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health): Get "https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) because RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health): Get "https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
clusteroperator/dns is progressing: DNS "default" reports Progressing=True: "Have 4 available DNS pods, want 5."
clusteroperator/ingress is degraded because The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
clusteroperator/insights is degraded because Unable to report: unable to build request to connect to Insights server: Post "https://cloud.redhat.com/api/ingress/v1/upload": dial tcp: lookup cloud.redhat.com on 172.30.0.10:53: read udp 10.128.0.26:53697->172.30.0.10:53: i/o timeout
clusteroperator/network is progressing: DaemonSet "openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
--- Additional comment from bnemec@redhat.com on 2022-03-30 20:00:25 UTC ---
This is not managed by runtimecfg, but in order to route the bug correctly I need to know which CNI plugin you're using - OpenShiftSDN or OVNKubernetes. Thanks.
--- Additional comment from lpbinh@gmail.com on 2022-03-31 08:09:11 UTC ---
Hi Ben,
We were deploying Contrail CNI with OCP. However, this issue happens at very early deployment time, right after the bootstrap node is started and there's no SDN/CNI there yet.
--- Additional comment from bnemec@redhat.com on 2022-03-31 15:26:23 UTC ---
Okay, I'm just going to send this to the SDN team then. They'll be able to provide more useful input than I can.
--- Additional comment from trozet@redhat.com on 2022-04-04 15:22:21 UTC ---
Can you please provide the iptables rules causing the DNAT as well as the routes on the host? Might be easiest to get a sosreport during initial bring up during that 10-15 min when the problem occurs.
--- Additional comment from lpbinh@gmail.com on 2022-04-05 16:45:13 UTC ---
All nodes have two interfaces:
eth0: 10.30.1.0/24
eth1: 10.40.1.0/24
machineNetwork is 10.30.1.0/24
default route points to 10.30.1.1
The kubeapi service ip is 172.30.0.1:443
all Kubernetes services are supposed to be reachable via machineNetwork (10.30.1.0/24)
To make the kubeapi service ip reachable in hostnetwork, something (openshift installer?) creates a set of nat rules which translates the service ip to the real ip of the nodes which have kubeapi active.
Initially kubeapi is only active on the bootstrap node so there should be a nat rule like
172.30.0.1:443 -> 10.30.1.10:6443 (assuming that 10.30.1.10 is the bootstrap nodes' ip address in the machine network)
However, what we see is
172.30.0.1:443 -> 10.40.1.10:6443 (which is the bootstrap nodes' eth1 ip address)
The rule is configured on the controller nodes and lead to asymmetrical routing as the controller sends a packet FROM machineNetwork (10.30.1.x) to 172.30.0.1 which is then translated and forwarded to 10.40.1.10 which then tries to reply back on the 10.40.1.0 network which fails as the request came from 10.30.1.0 network.
So, we want to understand why openshift installer picks the 10.40.1.x ip address rather than the 10.30.1.x ip for the nat rule. What's the mechanism for getting the ip in case the system has multiple interfaces with ips configured.
Note: after a while (10-20 minutes) the bootstrap process resets itself and then it picks the correct ip address from the machineNetwork and things start to work.
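The behaviour the reporter expects, choosing the bootstrap node's address that lies inside machineNetwork, can be illustrated with the addresses from this comment. This is only a sketch of the expected selection, not the mechanism OpenShift actually uses; `machine_network_ip` is a hypothetical helper.

```python
import ipaddress
from typing import List, Optional


def machine_network_ip(addresses: List[str], machine_network: str) -> Optional[str]:
    """Return the first address that falls inside the machine network, if any."""
    network = ipaddress.ip_network(machine_network)
    for address in addresses:
        if ipaddress.ip_address(address) in network:
            return address
    return None
```

For a node with `10.40.1.10` on eth1 and `10.30.1.10` on eth0, and machineNetwork `10.30.1.0/24`, the helper returns `10.30.1.10` regardless of interface order.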
--- Additional comment from smerrow@redhat.com on 2022-04-13 13:55:04 UTC ---
Note from Juniper regarding requested SOS report:
In reference to https://bugzilla.redhat.com/show_bug.cgi?id=2070318 that @Binh Le has been working on. The mustgather was too big to upload for this Bugzilla. Can you access this link?
https://junipernetworks-my.sharepoint.com/:u:/g/personal/sleigon_juniper_net/ETOrHMqao1tLm10Gmq9rzikB09H5OUwQWZRAuiOvx1nZpQ
--- Additional comment from smerrow@redhat.com on 2022-04-21 12:24:33 UTC ---
Can we please get an update on this BZ?
Do let us know if there is any other information needed.
--- Additional comment from trozet@redhat.com on 2022-04-21 14:06:00 UTC ---
Can you please provide another link to the sosreport? Looks like the link is dead.
--- Additional comment from smerrow@redhat.com on 2022-04-21 19:01:39 UTC ---
See mustgather here:
https://drive.google.com/file/d/16y9IfLAs7rtO-SMphbYBPgSbR4od5hcQ
--- Additional comment from trozet@redhat.com on 2022-04-21 20:57:24 UTC ---
Looking at the must-gather I think your iptables rules are most likely coming from the fact that kube-proxy is installed:
[trozet@fedora must-gather.local.288458111102725709]$ omg get pods -n openshift-kube-proxy
NAME                         READY   STATUS    RESTARTS   AGE
openshift-kube-proxy-kmm2p   2/2     Running   0          19h
openshift-kube-proxy-m2dz7   2/2     Running   0          16h
openshift-kube-proxy-s9p9g   2/2     Running   1          19h
openshift-kube-proxy-skrcv   2/2     Running   0          19h
openshift-kube-proxy-z4kjj   2/2     Running   0          19h
I'm not sure why this is installed. Is it intentional? I don't see the configuration in CNO to enable kube-proxy. Anyway the node IP detection is done via:
https://github.com/kubernetes/kubernetes/blob/f173d01c011c3574dea73a6fa3e20b0ab94531bb/cmd/kube-proxy/app/server.go#L844
Which just looks at the IP of the node. During bare metal install a VIP is chosen and used with keepalived for kubelet to have kapi access. I don't think there is any NAT rule for services until CNO comes up. So I suspect what really is happening is your node IP is changing during install, and kube-proxy is getting deployed (either intentionally or unintentionally) and that is causing the behavior you see. The node IP is chosen via the node ip configuration service:
https://github.com/openshift/machine-config-operator/blob/da6494c26c643826f44fbc005f26e0dfd10513ae/templates/common/_base/units/nodeip-configuration.service.yaml
This service will determine the node ip via which interfaces have a default route and which one has the lowest metric. With your 2 interfaces, do they both have default routes? If so, are they using dhcp and perhaps its random which route gets installed with a lower metric?
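The selection rule described in this comment (among interfaces with a default route, take the one with the lowest metric) can be sketched as below. This paraphrases the comment only and is not the nodeip-configuration service's code; the `Route` type and `default_route_interface` helper are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Route(NamedTuple):
    dst: str     # destination, "default" for a default route
    dev: str     # interface name
    metric: int  # route metric; lower is preferred


def default_route_interface(routes: List[Route]) -> Optional[str]:
    """Return the interface of the default route with the lowest metric."""
    defaults = [route for route in routes if route.dst == "default"]
    if not defaults:
        return None
    return min(defaults, key=lambda route: route.metric).dev
```

If both interfaces had default routes whose metrics were assigned in a varying order, the chosen interface would vary too, which is the scenario the question at the end of the comment is probing.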
--- Additional comment from trozet@redhat.com on 2022-04-21 21:13:15 UTC ---
Correction: looks like standalone kube-proxy is installed by default when the provider is not SDN, OVN, or kuryr so this looks like the correct default behavior for kube-proxy to be deployed.
--- Additional comment from lpbinh@gmail.com on 2022-04-25 04:05:14 UTC ---
Hi Tim,
You are right, kube-proxy is deployed by default and we don't change that behavior.
There is only 1 default route configured for the management interface (10.30.1.x) , we used to have a default route for the data/vrrp interface (10.40.1.x) with higher metric before. As said, we don't have the default route for the second interface any more but still encounter the issue pretty often.
--- Additional comment from trozet@redhat.com on 2022-04-25 14:24:05 UTC ---
Binh, can you please provide a sosreport for one of the nodes that shows this behavior? Then we can try to figure out what is going on with the interfaces and the node ip service. Thanks.
— Additional comment from
trozet@redhat.com
on 2022-04-25 16:12:04 UTC —
Actually Ben reminded me that the invalid endpoint is the bootstrap node itself:
172.30.0.1:443 -> 10.30.1.10:6443 (assuming that 10.30.1.10 is the bootstrap node's IP address in the machine network)
vs
172.30.0.1:443 -> 10.40.1.10:6443 (which is the bootstrap node's eth1 IP address)
So maybe a sosreport from that node is necessary? I'm not as familiar with the bare metal install process, so I'm moving this back to Ben.
— Additional comment from
lpbinh@gmail.com
on 2022-04-26 08:33:45 UTC —
Created attachment 1875023: sosreport
— Additional comment from
lpbinh@gmail.com
on 2022-04-26 08:34:59 UTC —
Created attachment 1875024: sosreport-part2
Hi Tim,
We observe this issue when deploying clusters using OpenStack instances as our infrastructure is based on OpenStack.
I followed the steps here to collect the sosreport:
https://docs.openshift.com/container-platform/4.8/support/gathering-cluster-data.html
The sosreport is 22MB, which exceeds the permitted size (19MB), so I split it into 2 files (xaa and xab). If you can't join them, we will need to put the collected sosreport on a shared drive like we did with the must-gather data.
Here are some notes about the cluster:
First two control nodes are below, ocp-binhle-8dvald-ctrl-3 is the bootstrap node.
[core@ocp-binhle-8dvald-ctrl-2 ~]$ oc get node
NAME STATUS ROLES AGE VERSION
ocp-binhle-8dvald-ctrl-1 Ready master 14m v1.21.8+ed4d8fd
ocp-binhle-8dvald-ctrl-2 Ready master 22m v1.21.8+ed4d8fd
We see that the wrong NAT rule was installed at the beginning and then corrected later:
sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 3 bytes 180 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
chain KUBE-SVC-NPX46M4PTMTKRN6Y
}
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4
}
sh-4.4#
sh-4.4#
<....after a while....>
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
chain KUBE-SVC-NPX46M4PTMTKRN6Y
}
sh-4.4# nft list chain ip nat KUBE-SEP-X33IBTDFOZRR6ONM
table ip nat {
chain KUBE-SEP-X33IBTDFOZRR6ONM
}
sh-4.4#
— Additional comment from
lpbinh@gmail.com
on 2022-05-12 17:46:51 UTC —
@
trozet@redhat.com
May we have an update on the fix, or the plan for the fix? Thank you.
— Additional comment from
lpbinh@gmail.com
on 2022-05-18 21:27:45 UTC —
Created support Case 03223143.
— Additional comment from
vkochuku@redhat.com
on 2022-05-31 16:09:47 UTC —
Hello Team,
Any update on this?
Thanks,
Vinu K
— Additional comment from
smerrow@redhat.com
on 2022-05-31 17:28:54 UTC —
This issue is causing delays in Juniper's CI/CD pipeline and makes for a less than ideal user experience for deployments.
I'm getting a lot of pressure from the partner on this for an update and progress. I've had them open a case [1] to help progress.
Please let us know if there is any other data needed by Juniper or if there is anything I can do to help move this forward.
[1]
https://access.redhat.com/support/cases/#/case/03223143
— Additional comment from
vpickard@redhat.com
on 2022-06-02 22:14:23 UTC —
@
bnemec@redhat.com
Tim mentioned in
https://bugzilla.redhat.com/show_bug.cgi?id=2070318#c14
that this issue appears to be at BM install time. Is this something you can help with, or do we need help from the BM install team?
— Additional comment from
bnemec@redhat.com
on 2022-06-03 18:15:17 UTC —
Sorry, I missed that this came back to me.
(In reply to Binh Le from comment #16)
> We observe this issue when deploying clusters using OpenStack instances as
> our infrastructure is based on OpenStack.
This does not match the configuration in the must-gathers provided so far, which are baremetal. Are we talking about the same environments?
I'm currently discussing this with some other internal teams because I'm unfamiliar with this type of bootstrap setup. I need to understand what the intended behavior is before we decide on a path forward.
— Additional comment from
rurena@redhat.com
on 2022-06-06 14:36:54 UTC —
(In reply to Ben Nemec from comment #22)
> Sorry, I missed that this came back to me.
>
> (In reply to Binh Le from comment #16)
> > We observe this issue when deploying clusters using OpenStack instances as
> > our infrastructure is based on OpenStack.
>
> This does not match the configuration in the must-gathers provided so far,
> which are baremetal. Are we talking about the same environments?
>
> I'm currently discussing this with some other internal teams because I'm
> unfamiliar with this type of bootstrap setup. I need to understand what the
> intended behavior is before we decide on a path forward.
I spoke to the customer, and they tell me that all work should be on baremetal. They were probably just testing on OSP and pointing out that they saw the same behavior.
— Additional comment from
bnemec@redhat.com
on 2022-06-06 16:19:37 UTC —
Okay, I see now that this is an assisted installer deployment. Can we get the cluster ID assigned by AI so we can take a look at the logs on our side? Thanks.
— Additional comment from
lpbinh@gmail.com
on 2022-06-06 16:38:56 UTC —
Here is the cluster ID, copied from the bug description:
ClusterID: 24bbde0b-79b3-4ae6-afc5-cb694fa48895
In regard to your earlier question about OpenStack & baremetal (2022-06-03 18:15:17 UTC):
We had an issue with platform validation in OpenStack earlier. Host validation was failing with the error message “Platform network settings: Platform OpenStack Compute is allowed only for Single Node OpenShift or user-managed networking.”
We found that there is no platform type "OpenStack" available in [
https://github.com/openshift/assisted-service/blob/master/models/platform_type.go#L29
], so we set "baremetal" as the platform type on our computes. That is why you are seeing baremetal as the platform type.
Thank you
— Additional comment from
ercohen@redhat.com
on 2022-06-08 08:00:18 UTC —
Hey, first, you are correct: when you set 10.30.1.0/24 as the machine network, the bootstrap process should use the IP on that subnet on the bootstrap node.
I'm trying to understand how exactly this cluster was installed.
You are using on-prem deployment of assisted-installer (podman/ACM)?
You are trying to form a cluster from OpenStack VMs?
You set the platform to Baremetal where?
Did you set user-managed-networking?
Some more info: when using the OpenStack platform, you should install the cluster with user-managed-networking.
That is what the failing validation is for.
— Additional comment from
bnemec@redhat.com
on 2022-06-08 14:56:53 UTC —
Moving to the assisted-installer component for further investigation.
— Additional comment from
lpbinh@gmail.com
on 2022-06-09 07:37:54 UTC —
@Eran Cohen:
Please see my response inline.
You are using on-prem deployment of assisted-installer (podman/ACM)?
--> Yes, we are using on-prem deployment of assisted-installer.
You are trying to form a cluster from OpenStack VMs?
--> Yes.
You set the platform to Baremetal where?
--> It was set in the Cluster object, in the Platform field, when we modeled the cluster.
Did you set user-managed-networking?
--> Yes, we set it to false for VRRP.
— Additional comment from
itsoiref@redhat.com
on 2022-06-09 08:17:23 UTC —
@
lpbinh@gmail.com
can you please share the assisted installer logs, which you can download once the cluster has failed or installed?
That will help us see the full picture.
— Additional comment from
ercohen@redhat.com
on 2022-06-09 08:23:18 UTC —
OK, as noted before, when using the OpenStack platform you should install the cluster with user-managed-networking (set to true).
Can you explain how you worked around this failing validation? “Platform network settings: Platform OpenStack Compute is allowed only for Single Node OpenShift or user-managed networking.”
What does this mean exactly? 'we set "baremetal" as the platform type on our computes'
To be honest, I'm surprised that the installation completed successfully.
@
oamizur@redhat.com
I thought installing on OpenStack VMs with the baremetal platform (user-managed-networking=false) would always fail?
— Additional comment from
lpbinh@gmail.com
on 2022-06-10 16:04:56 UTC —
@
itsoiref@redhat.com
: I will reproduce and collect the logs. Is that supposed to be included in the provided must-gather?
@
ercohen@redhat.com
:
— Additional comment from
itsoiref@redhat.com
on 2022-06-13 13:08:17 UTC —
@
lpbinh@gmail.com
you will have a download_logs link in the UI. Those logs are not part of the must-gather.
— Additional comment from
lpbinh@gmail.com
on 2022-06-14 18:52:02 UTC —
Created attachment 1889993: cluster log per need info request - Cluster ID caa475b0-df04-4c52-8ad9-abfed1509506
Attached is the cluster log per need info request.
Cluster ID: caa475b0-df04-4c52-8ad9-abfed1509506
In this reproduction, the issue was not resolved by OpenShift itself: the wrong NAT rule remained and the cluster deployment eventually failed.
sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 2 bytes 120 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
chain KUBE-SVC-NPX46M4PTMTKRN6Y
}
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4
}
Tue Jun 14 17:40:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4
}
Tue Jun 14 17:59:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4
}
Tue Jun 14 18:17:38 UTC 2022
sh-4.4#
sh-4.4#
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4
}
Tue Jun 14 18:49:28 UTC 2022
sh-4.4#
— Additional comment from
itsoiref@redhat.com
on 2022-06-15 15:59:22 UTC —
@
lpbinh@gmail.com
just for the record, we don't support baremetal OCP on OpenStack; that's why the validation is failing
— Additional comment from
lpbinh@gmail.com
on 2022-06-15 17:47:39 UTC —
@
itsoiref@redhat.com
as explained, it's just a workaround on our side to make OCP work in our lab. From my understanding, from the OCP perspective the deployment is seen as baremetal only and is unrelated to OpenStack (please correct me if I am wrong).
We have done thousands of OCP cluster deployments in our automation so far. If that were the cause, deployments should be failing every time. However, the issue only occurs occasionally, when nodes have 2 interfaces and use the OCP internal DNS and load balancer; sometimes it resolves by itself and sometimes it does not.
— Additional comment from
itsoiref@redhat.com
on 2022-06-19 17:00:01 UTC —
For now i can assume that this endpoint is causing the issue:
{
"apiVersion": "v1",
"kind": "Endpoints",
"metadata": {
"creationTimestamp": "2022-06-14T17:31:10Z",
"labels":
,
"name": "kubernetes",
"namespace": "default",
"resourceVersion": "265",
"uid": "d8f558be-bb68-44ac-b7c2-85ca7a0fdab3"
},
"subsets": [
{
"addresses": [
],
"ports": [
{
"name": "https",
"port": 6443,
"protocol": "TCP"
}
]
}
]
},
— Additional comment from
itsoiref@redhat.com
on 2022-06-21 17:03:51 UTC —
The issue is that the kube-api service advertises the wrong IP. It does so because kubelet chooses one arbitrarily, and we currently have no mechanism to set the kubelet IP, especially in the bootstrap flow.
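The expectation stated earlier in this thread, that the advertised API endpoint should sit on the machine network (10.30.1.0/24), can be expressed as a simple membership check. The sketch below uses Python's standard `ipaddress` module with the two addresses quoted in the thread. The helper name `on_machine_network` is hypothetical, and this is not how the installer itself validates endpoints.

```python
import ipaddress


def on_machine_network(ip: str, cidr: str) -> bool:
    """Return True when ip falls inside the given network (CIDR notation)."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)


# Machine network and candidate bootstrap addresses as quoted in the thread.
MACHINE_NETWORK = "10.30.1.0/24"

for candidate in ("10.30.1.10", "10.40.1.10"):
    if on_machine_network(candidate, MACHINE_NETWORK):
        status = "on machine network"
    else:
        status = "outside machine network"
    print(f"{candidate}: {status}")
```

A check like this could be run against the addresses listed in the `kubernetes` Endpoints object in the `default` namespace to spot an endpoint advertised from the wrong interface.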
— Additional comment from
lpbinh@gmail.com
on 2022-06-22 16:07:29 UTC —
@
itsoiref@redhat.com
how do you perform OCP deployments in setups that have multiple interfaces if kubelet is left to choose an interface arbitrarily, instead of being configured with a specific IP address to listen on? With what you describe above, the chance of deployment failure on systems with multiple interfaces would be high.
— Additional comment from
dhellard@redhat.com
on 2022-06-24 16:32:26 UTC —
I set the Customer Escalation flag = Yes, per ACE EN-52253.
The impact is noted by the RH Account team: "Juniper is pressing and this impacts the Unica Next Project at Telefónica Spain. Unica Next is a critical project for Red Hat. We go live the 1st of July and this issue could impact the go live dates. We need clear information about the status and its possible resolution."
— Additional comment from
itsoiref@redhat.com
on 2022-06-26 07:28:44 UTC —
I have sent an image with a possible fix to Juniper and am waiting for their feedback. Once they confirm it works for them, we will proceed with the PRs.
— Additional comment from
pratshar@redhat.com
on 2022-06-30 13:26:26 UTC —
=== In Red Hat Customer Portal Case 03223143 ===
— Comment by Prateeksha Sharma on 6/30/2022 6:56 PM —
//EMT note//
Update from our consultant Manuel Martinez Briceno -
====
On 28th June, 2022, the last feedback from the Juniper Project Manager and our Partner Manager was that they are testing the fix. They didn't give an estimated time to finish, but we will be tracking this closely and will let you know of any news.
====
Thanks & Regards,
Prateeksha Sharma
Escalation Manager | RHCSA
Global Support Services, Red Hat
P-06-TC01 | Text change is required |
P-06-TC04 | Text change is required |
P-06-TC13 | Text change is required |
P-03-TC03 also gets fixed with this bug
console-operator codebase contains a lot of inline manifests. Instead we should put those manifests into a `/bindata` folder, from which they will be read and then updated per purpose.
Some of the steps in test scenario [A-06-TC02] - script fix required
A-06-TC05 - script fix required
A-06-TC11 update required as per the latest UI
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Search link: https://search.ci.openshift.org/?search=Create+namespace+from+install+operators+creates+namespace+from+operator+install+page&maxAge=12h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
1) We want to fix the order of imports in the files.
2) We want vendor imports first, followed by console/package imports, with relative imports last.
This can be done manually or by introducing linter rules.
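To make the intended ordering concrete, here is a small Python sketch that sorts import specifiers into the three groups named above. It only illustrates the rule and is not a proposed linter. The `@console/` prefix used to recognize console packages is an assumption, and the function names are hypothetical.

```python
def import_group(module: str) -> int:
    """Classify an import specifier: 0 = vendor, 1 = console package, 2 = relative."""
    if module.startswith("."):
        return 2
    # Assumption of this sketch: console packages share the "@console/" prefix.
    if module.startswith("@console/"):
        return 1
    return 0


def sort_imports(modules):
    # sorted() is stable, so imports keep their original order within a group.
    return sorted(modules, key=import_group)


# Hypothetical import specifiers from a single file.
modules = ["./utils", "@console/shared", "react", "../models", "lodash"]
print(sort_imports(modules))
# prints ['react', 'lodash', '@console/shared', './utils', '../models']
```

In practice a linter rule would enforce the same grouping automatically, which avoids having to reorder imports by hand.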
P-01-TC03 | On Second run, script worked fine |
P-01-TC06 | Created separate functions for docker file page |
P-01-TC09 | Removing this test case, by updating P-04-TC04 test scenario Updating pipelines section title in side bar |
Pull in the latest openshift/library content into the samples operator
If image eco e2e's fail, work with upstream SCL to address
List of EOL images needs to be sent to the Docs team and added to the release notes.
In the topology view, if you select any grouping (Application, Helm Release, Operator Backed service, etc), an extraneous blue box is displayed
This is a regression.
Create an application in any way ... but this will do ...
This animated gif shows the issue:
The blue box shouldn't be shown
Always
Seen on 4/26/2021 4.8 daily, but this behavior was discussed in slack last week
This is a regression
Update the kafka test scenarios in eventing-kafka-event-source.feature file
During regression test execution, the test scenarios were updated
P-09-TC01, P-09-TC04, P-09-TC05, P-09-TC06, P-09-TC07, P-09-TC11 test scripts update required
Page objects updated for pipelines
This task adds support for setting the socket options SO_REUSEADDR and SO_REUSEPORT on etcd listeners via ListenConfig. These options give flexibility to cluster admins who want more explicit control of these features. We have found that during an etcd process restart there can be a considerable wait for the port to be released, because it is held in TIME_WAIT, which on many systems lasts 60s.
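For readers unfamiliar with these options, the following Python sketch shows what setting them on a listening socket looks like at the OS level. It is only an illustration of the socket options themselves, not etcd's Go implementation. The helper name `make_listener` is hypothetical, and SO_REUSEPORT is applied only where the platform defines it.

```python
import socket


def make_listener(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Create a TCP listener with address and port reuse enabled where available."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEADDR lets bind() succeed while old connections on the port
    # are still in TIME_WAIT.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # SO_REUSEPORT is not defined on every platform, so it is set conditionally.
    if hasattr(socket, "SO_REUSEPORT"):
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


# Bind to an ephemeral port, close, and immediately bind the same port again,
# loosely mimicking a process restart.
first = make_listener()
port = first.getsockname()[1]
first.close()

second = make_listener(port=port)
print("rebound port", port)
second.close()
```

The options must be set before `bind()`, which is why a hook that runs at listener creation time is the natural place to configure them.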
Create Namespaces script keeps failing due to a load issue
Unable to execute the create namespace script
Create Namespace script should work without any issue
P-02-TC02 | Script fix required - unable to identify locators |
P-02-TC03 | Script fix required - unable to identify locators |
P-02-TC06 | Script fix required - unable to identify locators |
Migrate the existing tests which are located here :
Helper functions/Views location: