Platform	SingleReplica	HighAvailable
AWS	1 replica	2 replicas
Azure	1 replica	2 replicas
GCP	1 replica	2 replicas
OpenStack (swift)	1 replica	2 replicas
OpenStack (cinder)	1 replica	1 replica (PVC)
oVirt	1 replica	1 replica (PVC)
bare metal	Removed	Removed
vSphere	Removed	Removed

Requirement	Notes	isMvp?
Language Selector		YES
Localized Date + Time		YES
Externalization and translation of all client side strings		YES
Translation for Chinese and Japanese		YES
Process, infra, and testing capabilities put into place		YES
CI - MUST be running successfully with test automation	This is a requirement for ALL features.	YES

Feature OCPPLAN-6007: OpenShift Core Networking Improvements

Feature Overview

We drive OpenShift cross-market customer success and new customer adoption with constant improvements and feature additions to the existing capabilities of our OpenShift Core Networking (SDN and Network Edge). This feature captures that natural progression of the product.

Goals

Feature enhancements (performance, scale, configuration, UX, ...)
Modernization (incorporation and productization of new technologies)

Requirements

Core Networking Stability
Core Networking Performance and Scale
Core Networking Extensibility (Multus CNIs)
Core Networking UX (Observability)
Core Networking Security and Compliance

In Scope

Network Edge (ingress, DNS, LB)
SDN (CNI plugins, openshift-sdn, OVN, network policy, egressIP, egress Router, ...)
Networking Observability

Out of Scope

There are definitely grey areas, but in general:

CNV
Service Mesh
CNF

Documentation Considerations

Questions to be addressed:

What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
Does this feature have doc impact?
New Content, Updates to existing content, Release Note, or No Doc Impact
If unsure and no Technical Writer is available, please contact Content Strategy.
What concepts do customers need to understand to be successful in [action]?
How do we expect customers will use the feature? For what purpose(s)?
What reference material might a customer want/need to complete [action]?
Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
What is the doc impact (New Content, Updates to existing content, or Release Note)?

Epic NE-354: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-ingress-operator/pull/576

Story NE-417: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/router/pull/193

Epic NE-393: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Task NE-551: Implement change in openshift/cluster-ingress-operator

Create a PR in openshift/cluster-ingress-operator to specify the random balancing algorithm if the feature gate is enabled, and to specify the leastconn balancing algorithm (the current default) otherwise.
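
The selection rule described above can be sketched as follows. This is only an illustration, assuming the feature gate is available as a boolean; the function names are hypothetical and this is not the operator's actual code (which is written in Go). `balance random` and `balance leastconn` are standard HAProxy directives.

```python
def select_balance_algorithm(feature_gate_enabled: bool) -> str:
    """Return the HAProxy balancing algorithm for the given feature-gate state."""
    # "leastconn" is the current default named in the task; "random" applies
    # only when the feature gate is enabled.
    return "random" if feature_gate_enabled else "leastconn"


def render_balance_directive(feature_gate_enabled: bool) -> str:
    """Render the corresponding HAProxy `balance` directive line."""
    return f"balance {select_balance_algorithm(feature_gate_enabled)}"
```

Keeping the decision in one small function makes it easy to unit test both gate states without rendering a full HAProxy configuration.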

https://github.com/openshift/cluster-ingress-operator/pull/589

Epic NE-302: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-ingress-operator/pull/580

Epic NE-377: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story NE-544: Verify CoreDNS v1.8.1 dual stack changes via brief CI test

We need to verify that no new CoreDNS dual stack features require any configuration changes or feature flags.
(All dual stack changes should just work once we rebase to coredns v1.8.1).

See https://github.com/coredns/coredns/pull/4339 .

We also need to verify that cluster DNS works for both v4 and v6 for a dual stack cluster IP service (i.e., request via A and AAAA, and make sure you get the desired response for each, not just one or the other). A brief CI test on our dual stack metal CI might make the most sense here (KNI might already have a job like this; we need to investigate our options for adding dual stack coverage to openshift/coredns).
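
A rough sketch of the A/AAAA check described above, using only the Python standard library. It is not the CI test from the linked PR; the function names and the injectable `resolve` parameter are assumptions made so the logic can be exercised without a cluster.

```python
import socket
from typing import Callable, List


def system_resolve(name: str, family: int) -> List[str]:
    """Resolve `name` for one address family using the system resolver."""
    try:
        infos = socket.getaddrinfo(name, None, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def check_dual_stack(
    name: str, resolve: Callable[[str, int], List[str]] = system_resolve
) -> dict:
    """Report whether both the A (IPv4) and AAAA (IPv6) lookups return addresses."""
    v4 = resolve(name, socket.AF_INET)
    v6 = resolve(name, socket.AF_INET6)
    return {"A": v4, "AAAA": v6, "dual_stack_ok": bool(v4) and bool(v6)}
```

Run from a pod inside a dual stack cluster, `check_dual_stack("<service>.<namespace>.svc.cluster.local")` would flag the case where only one of the two record types is answered.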

https://github.com/openshift/origin/pull/25954

Story NE-514: Rebase openshift/CoreDNS to CoreDNS v1.8.1

This story is for actually updating the version of CoreDNS in github.com/openshift/coredns. Our fork will need to be rebased onto https://github.com/coredns/coredns/releases/tag/v1.8.1, which may involve some git fu. Refer to previous CoreDNS rebase PRs for pointers.

Story NE-515: Update CoreDNS alerts for v1.8.1

CoreDNS v1.7 renamed some metrics that we use in our alerting rules. Make sure the alerting rules in https://github.com/openshift/cluster-dns-operator/blob/master/manifests/0000_90_dns-operator_03_prometheusrules.yaml use the correct metric names (and still work as intended).
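
One way to check the rules for stale names is a simple text scan, sketched below. The rename table is an assumption recalled from the CoreDNS 1.7.0 release notes and may be incomplete; verify it against the upstream notes. The function name is hypothetical.

```python
import re
from typing import Dict, List

# Assumed renames, recalled from the CoreDNS 1.7.0 release notes; verify
# against the upstream notes before relying on this list.
RENAMED_METRICS: Dict[str, str] = {
    "coredns_dns_request_count_total": "coredns_dns_requests_total",
    "coredns_dns_response_rcode_count_total": "coredns_dns_responses_total",
    "coredns_forward_request_count_total": "coredns_forward_requests_total",
}


def find_stale_metrics(
    rule_text: str, renames: Dict[str, str] = RENAMED_METRICS
) -> List[str]:
    """Return the old metric names that still appear in the rule text."""
    stale = []
    for old in renames:
        # Match whole metric names only, not prefixes of longer names.
        pattern = rf"(?<![A-Za-z0-9_:]){re.escape(old)}(?![A-Za-z0-9_:])"
        if re.search(pattern, rule_text):
            stale.append(old)
    return stale
```

Feeding the contents of the PrometheusRule manifest to `find_stale_metrics` returns the names that still need updating; an empty list means none of the listed old names remain.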

https://github.com/openshift/cluster-dns-operator/pull/239

Epic SDN-1113: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story SDN-1249: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-network-operator/pull/1037

Task SDN-1569: Egress Router redirect mode: multiple destinations

The multiple destinations provided as part of the allowedDestinations field are not working as they used to on OCP4: https://github.com/openshift/images/blob/master/egress/router/egress-router.sh#L70-L109

We need to parse this from the NAD and modify the iptables here to support them:

https://github.com/openshift/egress-router-cni/blob/master/pkg/macvlan/macvlan.go#L272-L349

Testing:

1) Created NAD:

[dsal@bkr-hv02 surya_multiple_destinations]$ cat nad_multiple_destination.yaml 
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
 name: egress-router
spec:
 config: '{
     "cniVersion": "0.4.0",
     "type": "egress-router",
     "name": "egress-router",
 "ip": {
     "addresses": [
         "10.200.16.10/24"
     ],
     "destinations": [
         "80 tcp 10.100.3.200",
         "8080 tcp 203.0.113.26 80",
         "8443 tcp 203.0.113.26 443"
     ],
     "gateway": "10.200.16.1"
  }
}'

2) Created pod:

[dsal@bkr-hv02  surya_multiple_destinations]$ cat egress-router-pod.yaml 
---
apiVersion: v1
kind: Pod
metadata:
  name: egress-router-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: egress-router
spec:
  containers:
    - name: openshift-egress-router-pod
      command: ["/bin/bash", "-c", "sleep 999999999"]
      image: centos/tools
      securityContext:
        privileged: true

3) Checked iptables:

[root@worker-1 core]# iptables-save -t nat 
# Generated by iptables-save v1.8.4 on Mon Feb 1 12:08:05 2021
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A POSTROUTING -o net1 -j SNAT --to-source 10.200.16.10
COMMIT
# Completed on Mon Feb 1 12:08:05 2021

As we can see, only the SNAT rule is added. The DNAT doesn't get picked up because of the syntax difference.
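
To illustrate the missing piece, the sketch below turns destination strings in the syntax used by the NAD above ("port protocol ip [remote-port]", or a bare fallback IP) into DNAT rules. It is not the egress-router-cni implementation (which is written in Go); the function name, the default interface name `eth0`, and the accepted protocol list are assumptions.

```python
from typing import List


def destinations_to_dnat_rules(
    destinations: List[str], in_iface: str = "eth0"
) -> List[str]:
    """Translate destination strings into iptables nat-table DNAT rules."""
    rules = []
    for entry in destinations:
        fields = entry.split()
        if len(fields) == 1:
            # "<ip>": fallback destination for any traffic not matched earlier.
            rules.append(
                f"-A PREROUTING -i {in_iface} -j DNAT --to-destination {fields[0]}"
            )
        elif len(fields) in (3, 4):
            port, proto, ip = fields[:3]
            if not port.isdigit() or proto not in ("tcp", "udp", "sctp"):
                raise ValueError(f"invalid destination: {entry!r}")
            # An optional fourth field remaps the port on the destination side.
            target = ip if len(fields) == 3 else f"{ip}:{fields[3]}"
            rules.append(
                f"-A PREROUTING -i {in_iface} -p {proto} --dport {port} "
                f"-j DNAT --to-destination {target}"
            )
        else:
            raise ValueError(f"invalid destination: {entry!r}")
    return rules
```

For the three destinations in the NAD above, this yields three PREROUTING rules, which is the part absent from the `iptables-save` output shown in step 3.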

https://github.com/openshift/egress-router-cni/pull/34

Epic NE-330: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story NE-553: Implement change in openshift/cluster-ingress-operator

Create a PR in openshift/cluster-ingress-operator to implement the PROXY protocol API.

https://github.com/openshift/cluster-ingress-operator/pull/581

Feature OCPPLAN-8029: Console: Dynamic Plugin Framework

Feature Overview

Plugin teams need a mechanism to extend the OCP console that is decoupled enough so they can deliver at the cadence of their projects and not be forced into the OCP Console release timelines.

The OCP Console Dynamic Plugin Framework will enable all our plugin teams to do the following:

Extend the Console
Deliver UI code with their Operator
Work in their own git Repo
Deliver at their own cadence

Goals

- Operators can deliver console plugins separate from the console image and update plugins when the operator updates.
- The dynamic plugin API is similar to the static plugin API to ease migration.
- Plugins can use shared console components such as list and details page components.
- Shared components from core will be part of a well-defined plugin API.
- Plugins can use PatternFly 4 components.
- Cluster admins control what plugins are enabled.
- Misbehaving plugins should not break console.
- Existing static plugins are not affected and will continue to work as expected.

Out of Scope

- Initially we don't plan to make this a public API. The target use is for Red Hat operators. We might reevaluate later when dynamic plugins are more mature.
- We can't avoid breaking changes in console dependencies such as PatternFly even if we don't break the console plugin API itself. We'll need a way for plugins to declare compatibility.
- Plugins won't be sandboxed. They will have full JavaScript access to the DOM and network. Plugins won't be enabled by default, however. A cluster admin will need to enable the plugin.
- This proposal does not cover allowing plugins to contribute backend console endpoints.

Requirements

Requirement	Notes	isMvp?
UI to enable and disable plugins		YES
Dynamic Plugin Framework in place		YES
Testing Infra up and running		YES
Docs and README for creating and testing plugins		YES
CI - MUST be running successfully with test automation	This is a requirement for ALL features.	YES
Release Technical Enablement	Provide necessary release enablement details and documents.	YES

Epic CONSOLE-2830: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story CONSOLE-2769: Remove console plugin from operator config on operator uninstall

Related to ~~CONSOLE-2380~~

We need a way for cluster admins to disable a console plugin when uninstalling an operator if it's enabled in the console operator config. Otherwise, the config will reference a plugin that no longer exists. This won't prevent console from loading, but it's something that we can clean up during uninstall.

The UI will always remove the console plugin when an operator is uninstalled. There will not be an option to keep the operator. We should have a sentence in the dialog letting the user know that the plugin will be disabled when the operator is uninstalled (but only if the CSV has the plugin annotation).

If the user doesn't have authority to patch the operator config, we should warn them that the operator config can't be updated to remove the plugin.
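
The cleanup step can be sketched as building a JSON Patch for the console operator config. This is an illustration only, assuming enabled plugins are listed by name under `spec.plugins`; the function name is hypothetical and the real console code is TypeScript.

```python
from typing import Any, Dict, List, Optional


def plugin_removal_patch(
    operator_config: Dict[str, Any], plugin_name: str
) -> Optional[List[Dict[str, Any]]]:
    """Build a JSON Patch that drops `plugin_name` from spec.plugins.

    Returns None when the plugin is not enabled, so no patch is needed.
    """
    plugins = operator_config.get("spec", {}).get("plugins") or []
    if plugin_name not in plugins:
        return None
    remaining = [p for p in plugins if p != plugin_name]
    return [{"op": "replace", "path": "/spec/plugins", "value": remaining}]
```

Returning `None` for the not-enabled case lets the caller skip both the patch request and the permission warning when there is nothing to clean up.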

https://github.com/openshift/console/pull/8895

Epic CONSOLE-2368: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story CONSOLE-2379: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/console/pull/8378

Story CONSOLE-2380: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/console/pull/8175

Feature OCPPLAN-8030: Console: Customer Happiness (RFEs) for 4.8-4.12

Feature Overview

This Section: High-level description of the feature, i.e., an executive summary

Note: A Feature is a capability or a well defined set of functionality that delivers business value. Features can include additions or changes to existing functionality. Features can easily span multiple teams, and multiple releases.

Goals

This Section: Provide a high-level goal statement, giving user context and expected user outcome(s) for this feature

Requirements

This Section: A list of specific needs or objectives that a Feature must deliver to be considered complete. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

Requirement	Notes	isMvp?
CI - MUST be running successfully with test automation	This is a requirement for ALL features.	YES
Release Technical Enablement	Provide necessary release enablement details and documents.	YES

(Optional) Use Cases

This Section:

Main success scenarios - high-level user stories

Alternate flow/scenarios - high-level user stories

Questions to answer…

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

Customer Considerations

Epic CONSOLE-2382: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Story CONSOLE-1338: Add support for multiple documents when creating from YAML

This would let us import YAML with multiple resources and add YAML templates that create related resources like image streams and build configs together.

See ~~CONSOLE-580~~

Acceptance criteria:

Users should be able to drag multiple files into the Import YAML editor.
- the YAML files should be displayed in the editor separated by "---"
After clicking Create:
- a dry run will be initiated and will report back any errors
- upon receiving no errors from the dry run, the resources will be created
- the results page will appear showing links for each resource
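
The flow in the acceptance criteria can be sketched as below: split the editor content on `---` separator lines, dry-run every document, and create resources only when no dry run reports an error. This is a simplified illustration, not the console implementation; `submit` is a hypothetical callback standing in for the API call, and the splitter ignores YAML edge cases such as a separator followed by inline content.

```python
from typing import Callable, Dict, List, Optional


def split_yaml_documents(text: str) -> List[str]:
    """Split a YAML stream on `---` separator lines, dropping empty documents."""
    docs: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.rstrip() == "---":
            docs.append("\n".join(current))
            current = []
        else:
            current.append(line)
    docs.append("\n".join(current))
    return [d for d in docs if d.strip()]


def create_all(
    docs: List[str], submit: Callable[[str, bool], Optional[str]]
) -> Dict[str, List[str]]:
    """Dry-run all documents, then create them only if every dry run passed.

    `submit(doc, dry_run)` returns an error message, or None on success.
    """
    errors = [err for err in (submit(doc, True) for doc in docs) if err]
    if errors:
        return {"created": [], "errors": errors}
    created: List[str] = []
    for doc in docs:
        err = submit(doc, False)
        if err:
            errors.append(err)
        else:
            created.append(doc)
    return {"created": created, "errors": errors}
```

Running all dry runs before the first real create means a typo in the last document does not leave the earlier resources half-created.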

https://github.com/openshift/console/pull/8865

Story CONSOLE-2223: Past logs for selected container on pod logs tab.

Story:
As a user viewing the pod logs tab with a selected container, I want the ability to view past logs if they are available for the container.

Acceptance Criteria:

Provide a mechanism to expose past logs, if they are available.
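
For reference, the Kubernetes pod log endpoint accepts a `previous=true` query parameter that returns logs from the prior terminated instance of a container. The helper below is a hypothetical sketch of building that request path, not code from the console.

```python
from urllib.parse import quote, urlencode


def pod_log_path(
    namespace: str, pod: str, container: str, previous: bool = False
) -> str:
    """Build the Kubernetes API path for a container's logs."""
    params = {"container": container}
    if previous:
        # `previous=true` requests logs from the prior terminated instance.
        params["previous"] = "true"
    return (
        f"/api/v1/namespaces/{quote(namespace)}/pods/{quote(pod)}/log"
        f"?{urlencode(params)}"
    )
```

Past logs only exist if the container has restarted, so a UI built on this would need to handle the case where the previous-logs request returns an error.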

Design doc: https://docs.google.com/document/d/1PB8_D5LTWhFPFp3Ovf85jJTc-zAxwgFR-sAOcjQCSBQ/edit#

https://github.com/openshift/console/pull/8896

Story CONSOLE-2384: Add metrics back to different workload resources

When moving to OCP 4 we didn't port the metrics charts for Deployments, Deployment Configs, StatefulSets, DaemonSets, ReplicaSets, and ReplicationControllers. These should be the same charts that we show on the Pods page: Memory, CPU, Filesystem, Network In and Out.

This was only done for pods.

We need to decide if we want to use a multiline chart or some other representation.

https://github.com/openshift/console/pull/8763

Story CONSOLE-2496: Update custom console routes to use new CustomDomains cluster API

The work on this story is dependent on following changes:

Enhancement doc - https://github.com/openshift/enhancements/pull/577
API change - https://github.com/openshift/api/pull/852
Ingress operator changes - https://github.com/openshift/cluster-ingress-operator/pull/552

The console already supports custom routes on the operator config. The newly proposed CustomDomains API introduces a unified way to install custom domains for routes for which customers want to customize both the names and the serving cert/keys. From the console perspective those are:

openshift-console / console
openshift-console / downloads (CLI downloads)

The setup should be done on the Ingress config. Two new fields are introduced there:

ComponentRouteSpec - contains the configuration for the custom domain (name, namespace, custom hostname, TLS secret reference)
ComponentRouteStatus - contains the status of the custom domain (condition, previous hostname, RBAC needed to read the TLS secret, ...)

Console-operator will only consume the API and check for any changes. If a custom domain is set for either the `console` or `downloads` route in the `openshift-console` namespace, console-operator will read the setup and set a custom route accordingly. When a custom route is set up for any of the console's routes, the default route won't be deleted; instead, it will be updated so that it redirects to the custom one. This is done for two reasons:

we want to prevent somebody from stealing the default hostname of both routes (console, downloads)
we want to prevent users from having unusable bookmarks that are pointing to the default hostname

Console-operator will still need to support the CustomDomain API that is available on its config.

Acceptance criteria:

Console supports the new CustomDomains API for configuring a custom domain for both `console` and `downloads` routes
Console falls back to the deprecated API in the console operator config if present
Console supports the original default domains and redirects to the new ones
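
The fallback behavior in the acceptance criteria could look roughly like the sketch below. The precedence order (Ingress config first, then the deprecated console-operator config, then the default hostname) is an assumption, since the story leaves precedence as an open question. The function name and the plain-dict representation of component routes are hypothetical.

```python
from typing import Dict, List, Optional


def resolve_route_hostname(
    route_name: str,
    ingress_component_routes: List[Dict[str, str]],
    operator_config_hostname: Optional[str],
    default_hostname: str,
    namespace: str = "openshift-console",
) -> str:
    """Pick the hostname a console route should serve on."""
    # 1. Assumed first choice: a matching entry in the Ingress config.
    for route in ingress_component_routes:
        if (
            route.get("namespace") == namespace
            and route.get("name") == route_name
            and route.get("hostname")
        ):
            return route["hostname"]
    # 2. Fall back to the deprecated console-operator config, if set.
    if operator_config_hostname:
        return operator_config_hostname
    # 3. Otherwise keep the default hostname.
    return default_hostname
```

When the result differs from `default_hostname`, the default route would be kept and updated to redirect to the custom one, as described above.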

Questions:

Which CustomDomain API takes precedence: the Ingress config or the console-operator config? Can an upgrade cause any issues?

Sub-task CONSOLE-2792: Bump openshift/api dependency in console-operator to get CustomDomain API for Ingress config

Bump openshift/api godep to pick up the new CustomDomain API for the Ingress config.

https://github.com/openshift/console-operator/pull/517

Sub-task CONSOLE-2793: Implement console-operator changes to consume new CustomDomains API

Implement console-operator changes to consume new CustomDomains API, based on the story details.

https://github.com/openshift/console-operator/pull/522

Feature OCPSTRAT-402: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic CONSOLE-2471: Phase 1 - Add ACM perspective to OCP Console

Epic Goal

The OCP Console needs to detect whether the ACM Operator has been installed. If it is detected, a new multi-cluster perspective option shows up in the perspective chooser.

As a user, I need the ability to switch to the ACM UI from the OCP Console and vice versa without having to log in multiple times.

This option also needs to be hidden if the user doesn't have the correct RBAC.

Marvel design mockup

Why is this important?

Multi-cluster functionality is very important to our users. We need to provide a seamless experience for users.

Scenarios

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

Previous Work (Optional):

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Story CONSOLE-2506: Add Advanced Cluster Management option to the perspective switcher

The console should detect the presence of the ACM operator and add an Advanced Cluster Management item to the perspective switcher. We will need to work with the ACM team to understand how to detect the operator and how to discover the ACM URL.

Additionally, we will need to provide a query parameter or URL fragment to indicate which perspective to use. This will allow ACM to link back to a specific perspective, since it will share the same perspective switcher in its UI. ACM will need to be able to discover the console URL.

This story does not include handling SSO, which will be tracked in a separate story.

We need to determine what RBAC checks to make before showing the ACM link.

Acceptance Criteria

1. Console shows a link to ACM in its perspective switcher
2. Console provides a way for ACM to link back to a specific perspective
3. The ACM option only appears when the ACM operator is installed
4. ACM should open in the same browser tab to give the appearance of it being one application
5. Only users with appropriate RBAC should see the link (access review TBD)
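
As an illustration of the link-back mechanism in criterion 2, the sketch below adds a query parameter to a console URL. The parameter name `perspective` and the function name are assumptions; the story leaves open whether a query parameter or a URL fragment is used.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_perspective(
    console_url: str, perspective: str, param: str = "perspective"
) -> str:
    """Return `console_url` with the perspective query parameter set."""
    parts = urlsplit(console_url)
    # Drop any existing value so the parameter appears exactly once.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, perspective))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

ACM could call something like this with the discovered console URL to build each entry in its copy of the perspective switcher.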

https://github.com/openshift/console/pull/8199

Feature OCPSTRAT-983: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic OCPNODE-464: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/machine-config-operator/pull/2324

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled.

Bug OCPBUGS-6174: Cannot extract names from response with content-type: [] (accept no content-type on 204 Swift responses)

This is a clone of issue ~~OCPBUGS-6086~~. The following is the description of the original issue:
—
We're still seeing errors with swift in the 4.8 branch:

level=error msg=Cannot extract names from response with content-type: []

See for instance this job: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cloud-provider-openstack/169/pull-ci-openshift-cloud-provider-openstack-release-4.8-e2e-openstack/1616019853557108736

https://github.com/openshift/installer/pull/6792

Task RHSTOR-1554: Migrate object service dashboard tests

Migrate the existing tests, which are located here:

https://github.com/openshift/console/blob/master/frontend/packages/noobaa-storage-plugin/integration-tests/tests/objectServiceDashboard.scenario.ts#L1

Helper functions/Views location:

https://github.com/openshift/console/blob/master/frontend/packages/noobaa-storage-plugin/integration-tests/views/obcPage.view.ts#L1

PR Link: https://github.com/openshift/console/pull/7815

https://github.com/openshift/console/pull/8270

Bug OCPBUGS-6545: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/csi-driver-nfs/pull/107

Bug ODC-5622: Create Namespaces script is failing due to switching perspective

Description of problem:

The Create Namespaces script keeps failing due to a load issue

Prerequisites (if any, like setup, operators/versions):

Steps to Reproduce

<steps>

Actual results:

Unable to execute the create namespace script

Expected results:

Create Namespace script should work without any issue

Reproducibility (Always/Intermittent/Only Once):

Intermittent

Build Details:

Additional info:

https://github.com/openshift/console/pull/8375

Bug ODC-5919: Fix Automation scripts for Pipelines- Actions

Description of problem:

P-06-TC01	Text change is required
P-06-TC04	Text change is required
P-06-TC13	Text change is required

P-04-TC02 is also fixed by this bug

P-03-TC03 is also fixed by this bug

Prerequisites (if any, like setup, operators/versions):

Steps to Reproduce

<steps>

Actual results:

Expected results:

Reproducibility (Always/Intermittent/Only Once):

Build Details:

Additional info:

https://github.com/openshift/console/pull/9108

Task ETCD-178: refactor discover-etcd-initial-cluster and add tests

discover-etcd-initial-cluster was written very early on in the cluster-etcd-operator life cycle. We have observed at least one bug in this code and in order to validate logical correctness it needs to be rewritten with unit tests.

PR: https://github.com/openshift/etcd/pull/73

https://github.com/openshift/etcd/pull/73

Bug OCPBUGS-1455: Detect unsupported amount of workloads before rendering a lazy or crashing topology

+++ This bug was initially created as a clone of Bug #2060329 +++

Description of problem:
As a user, I was stopped from using the developer perspective when switching into a namespace with a lot of workloads (Deployments, Pods, etc.)

This is a follow up on https://bugzilla.redhat.com/show_bug.cgi?id=2006395

We recommend the following safety precautions against a lazy or crashing topology, even as we continue to work on performance improvements that allow more workloads to be rendered.

At the moment we expect that a topology with about 100 nodes can be displayed. This may also depend on the node types, the browser used, the computing power of the PC, and how often the workload conditions change.

Recommended safety guard:
The topology graph (maybe the list as well) should check how many nodes are fetched and will be rendered.

1. We need to evaluate if we could make this decision based on the shown graph nodes and edges or the number of underlying resources.

For example, is it required to count each Pod in a Deployment or not?

2. Based on a threshold (~ 100?) the topology graph should skip the rendering.

3. We should show a 'warning page' instead, which explains that the topology could not handle this amount of X nodes at the moment.

4. This page could have an option to "Show topology anyway" so that users who don't have issues here can still use the topology.
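
The guard outlined in steps 2 to 4 can be sketched as a small decision function. The default threshold of 100 comes from the estimate above; what exactly is counted (graph nodes and edges, or underlying resources) is still an open question, and the function name is hypothetical.

```python
def topology_render_decision(
    node_count: int, threshold: int = 100, show_anyway: bool = False
) -> str:
    """Decide whether to render the topology or show the warning page.

    Returns "render" or "warn".
    """
    # The user's explicit "Show topology anyway" choice overrides the guard.
    if show_anyway or node_count <= threshold:
        return "render"
    return "warn"
```

Evaluating this before the graph is built keeps the expensive rendering path from running at all for oversized namespaces.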

— Additional comment from aos-team-art-private@redhat.com on 2022-05-09 19:42:54 UTC —

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.11 release.

— Additional comment from openshift-bugzilla-robot@redhat.com on 2022-05-10 04:50:55 UTC —

Bugfix included in accepted release 4.11.0-0.nightly-2022-05-09-224745
Bug will not be automatically moved to VERIFIED for the following reasons:

PR openshift/console#11334 not approved by QA contact
This bug must now be manually moved to VERIFIED by spathak@redhat.com

https://github.com/openshift/console/pull/12060

Bug OCPBUGS-3554: ovnkube-master crash with ovn service controller panic

Description of problem:

The ovnkube-master container crashes because of a service controller panic:

2022-11-07T04:45:39.923889951+11:00 E1106 17:45:39.923862       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
2022-11-07T04:45:39.923889951+11:00 goroutine 16732 [running]:
2022-11-07T04:45:39.923889951+11:00 k8s.io/apimachinery/pkg/util/runtime.logPanic(0x190bb00, 0x28ce990)
2022-11-07T04:45:39.923889951+11:00     /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95
2022-11-07T04:45:39.923889951+11:00 k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
2022-11-07T04:45:39.923889951+11:00     /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x86
2022-11-07T04:45:39.923889951+11:00 panic(0x190bb00, 0x28ce990)
2022-11-07T04:45:39.923889951+11:00     /usr/lib/golang/src/runtime/panic.go:965 +0x1b9
2022-11-07T04:45:39.923889951+11:00 k8s.io/api/core/v1.(*Service).GetObjectKind(0x0, 0x7f25ab6ead68, 0x0)
2022-11-07T04:45:39.923889951+11:00     <autogenerated>:1 +0x5
2022-11-07T04:45:39.923889951+11:00 k8s.io/client-go/tools/reference.GetReference(0xc00023b420, 0x1d5f6e0, 0x0, 0x7f257803baf8, 0x0, 0x0)
2022-11-07T04:45:39.923889951+11:00     /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/reference/ref.go:59 +0x14d
2022-11-07T04:45:39.923889951+11:00 k8s.io/client-go/tools/record.(*recorderImpl).generateEvent(0xc0078afd80, 0x1d5f6e0, 0x0, 0x0, 0xc0d21a90f70f1ab5, 0x8c5d1203c70a, 0x290c9a0, 0x1b168b6, 0x7, 0x1b36c3b, ...)
2022-11-07T04:45:39.923889951+11:00     /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/record/event.go:327 +0x5d
2022-11-07T04:45:39.923889951+11:00 k8s.io/client-go/tools/record.(*recorderImpl).Event(0xc0078afd80, 0x1d5f6e0, 0x0, 0x1b168b6, 0x7, 0x1b36c3b, 0x1d, 0xc0167a8340, 0x186)
2022-11-07T04:45:39.923889951+11:00     /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/record/event.go:349 +0xc5
2022-11-07T04:45:39.923889951+11:00 k8s.io/client-go/tools/record.(*recorderImpl).Eventf(0xc0078afd80, 0x1d5f6e0, 0x0, 0x1b168b6, 0x7, 0x1b36c3b, 0x1d, 0x1b7df1f, 0x41, 0xc00ef72060, ...)
2022-11-07T04:45:39.923889951+11:00     /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/record/event.go:353 +0xca
2022-11-07T04:45:39.923889951+11:00 github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*Controller).syncServices(0xc001e60c60, 0xc00f1de210, 0x2f, 0x0, 0x0)
2022-11-07T04:45:39.923889951+11:00     /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/controller/services/services_controller.go:246 +0x682
2022-11-07T04:45:39.923889951+11:00 github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*Controller).processNextWorkItem(0xc001e60c60, 0x203000)
2022-11-07T04:45:39.923889951+11:00     /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/controller/services/services_controller.go:184 +0xcd
2022-11-07T04:45:39.923889951+11:00 github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*Controller).worker(...)
2022-11-07T04:45:39.923889951+11:00     /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/controller/services/services_controller.go:173
2022-11-07T04:45:39.923889951+11:00 k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc0082cdda0)
2022-11-07T04:45:39.923889951+11:00     /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
2022-11-07T04:45:39.923889951+11:00 k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0082cdda0, 0x1d535c0, 0xc005ef6e70, 0xc0106e2701, 0xc0000db500)
2022-11-07T04:45:39.923889951+11:00     /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
2022-11-07T04:45:39.923889951+11:00 k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0082cdda0, 0x3b9aca00, 0x0, 0xc0106e2701, 0xc0000db500)
2022-11-07T04:45:39.923889951+11:00     /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
2022-11-07T04:45:39.923889951+11:00 k8s.io/apimachinery/pkg/util/wait.Until(0xc0082cdda0, 0x3b9aca00, 0xc0000db500)
2022-11-07T04:45:39.923889951+11:00     /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
2022-11-07T04:45:39.923889951+11:00 created by github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn/controller/services.(*Controller).Run
2022-11-07T04:45:39.923889951+11:00     /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/controller/services/services_controller.go:161 +0x3b1
2022-11-07T04:45:39.927176450+11:00 panic: runtime error: invalid memory address or nil pointer dereference [recovered]
2022-11-07T04:45:39.927176450+11:00     panic: runtime error: invalid memory address or nil pointer dereference
2022-11-07T04:45:39.927176450+11:00 [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xd3c385]
2022-11-07T04:45:39.927176450+11:00

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/ovn-kubernetes/pull/1390

Bug ODC-5922: Fix Automation scripts of Pipelines-workspaces

Description of problem:

Fix the P-10-TC01 test scenario

https://github.com/openshift/console/pull/9115

Bug OCPBUGS-16872: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-3760: [release-4.8] Update OWNERS file to reflect current team members

The OWNERS file in multiple branches of the openshift/jenkins repository needs to be updated to reflect the current team members for approvals.

https://github.com/openshift/jenkins/pull/1524

Bug OCPBUGS-3970: must-gather namespace should have “privileged” warn and audit pod security labels besides enforce

https://bugzilla.redhat.com/show_bug.cgi?id=2103126

https://github.com/openshift/oc/pull/1295

Bug OCPBUGS-1519: [OCP 4.8] Fix generate script in CBO

Plus smaller fixes.
This is a backport of https://github.com/openshift/cluster-baremetal-operator/pull/263

https://github.com/openshift/cluster-baremetal-operator/pull/293

Bug OCPBUGS-2591: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/console/pull/12190

Story OCPCLOUD-914: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/kubernetes/pull/447

Bug OCPBUGS-604: CI failing tests: Create namespace from install operators creates namespace from operator install page

Description of problem:

Version-Release number of selected component (if applicable):


How reproducible:

Search link:

https://search.ci.openshift.org/?search=Create+namespace+from+install+operators+creates+namespace+from+operator+install+page&maxAge=12h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Steps to Reproduce:

1.

2.

3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/console/pull/11990

Task MON-1208: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/prometheus/pull/69

Bug OCPBUGS-1307: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-etcd-operator/pull/928

Bug OCPBUGS-2205: Prefer local dns does not work expectedly on OCPv4.8

Description of problem:

When a DNS hostname is queried from a pod on a given node, the response comes from a random CoreDNS pod rather than the preferred local one. Is this the expected result?

# In OCP v4.8.13 case
// Ran the dig command on the node that is running the test-7cc4488d48-tqc4m pod shown below.
sh-4.4# while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done
:
07:16:33 :172.217.175.238
07:16:34 :172.217.175.238 <--- Refreshed the upstream result
07:16:36 :142.250.207.46
07:16:37 :142.250.207.46

// The dig results from the pod match those from its node, as shown above.
$ oc rsh  test-7cc4488d48-tqc4m bash -c 'while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done'
:
07:16:35 :172.217.175.238 
07:16:36 :172.217.175.238 <--- At the same time, the pod dig result is also refreshed.
07:16:37 :142.250.207.46
07:16:38 :142.250.207.46


In v4.10, by contrast, the DNS query results vary and are returned randomly, regardless of the local DNS results on the node, as follows.

# In OCP v4.10.23, the pod's responses from the DNS service are not consistent.
$ oc rsh test-848fcf8ddb-zrcbx  bash -c 'while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done'
07:23:00 :142.250.199.110
07:23:01 :142.250.207.46
07:23:02 :142.250.207.46
07:23:03 :142.250.199.110
07:23:04 :142.250.199.110
07:23:05 :172.217.161.78

# Even though the node that is running the pod keeps responding with the same IP...
sh-4.4# while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done
07:23:00 :172.217.161.78
07:23:01 :172.217.161.78
07:23:02 :172.217.161.78
07:23:03 :172.217.161.78
07:23:04 :172.217.161.78
07:23:05 :172.217.161.78

Version-Release number of selected component (if applicable):

v4.10.23 (ROSA)
SDN: OpenShiftSDN

How reproducible:

You can always reproduce this issue by running "dig google.com" from any pod and from the node the pod is running on, as detailed in the "Description" above.

Steps to Reproduce:

1. Run any ordinary pod and check which node it is running on.
2. Run dig google.com on the pod and the node.
3. Check whether the IP returned in the pod is consistent with the IP returned on its node.
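
The comparison in the steps above can be sketched in Python. This is a minimal illustration and not part of the original report: the helper names `parse_samples` and `mismatches` are hypothetical, and the sample lines are copied from the v4.10.23 output in the description. Differing answers for google.com can also come from upstream rotation, so this only quantifies the inconsistency and does not prove which CoreDNS pod answered.

```python
def parse_samples(output: str) -> dict:
    """Map each 'HH:MM:SS' timestamp to the IP printed by the dig loop."""
    samples = {}
    for line in output.splitlines():
        # Each sample line looks like '07:23:00 :142.250.199.110'.
        timestamp, sep, answer = line.partition(" :")
        if sep and answer.strip():
            # Keep only the IP; ignore trailing annotations on the line.
            samples[timestamp.strip()] = answer.split()[0]
    return samples


def mismatches(pod: dict, node: dict) -> list:
    """Timestamps sampled on both sides where the pod and node answers differ."""
    return sorted(t for t in pod.keys() & node.keys() if pod[t] != node[t])


# Sample output copied from the v4.10.23 case in the description.
POD_OUTPUT = """\
07:23:00 :142.250.199.110
07:23:01 :142.250.207.46
07:23:02 :142.250.207.46
07:23:03 :142.250.199.110
07:23:04 :142.250.199.110
07:23:05 :172.217.161.78
"""

NODE_OUTPUT = """\
07:23:00 :172.217.161.78
07:23:01 :172.217.161.78
07:23:02 :172.217.161.78
07:23:03 :172.217.161.78
07:23:04 :172.217.161.78
07:23:05 :172.217.161.78
"""

pod = parse_samples(POD_OUTPUT)
node = parse_samples(NODE_OUTPUT)
print("distinct pod answers:", len(set(pod.values())))
print("distinct node answers:", len(set(node.values())))
print("timestamps where pod and node differ:", mismatches(pod, node))
```

With the sample data, the pod sees three distinct answers while the node sees one, which matches the behavior described as the actual result.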

Actual results:

The response IPs are not consistent; a seemingly random IP is returned.

Expected results:

The response IP is reasonably consistent and reflects the prefer-local DNS behavior.

Additional info:

This issue affects EgressNetworkPolicy dnsName feature.

https://github.com/openshift/sdn/pull/469

Bug OCPBUGS-448: Incorrect NAT when using cluster networking in control-plane nodes to install a VRRP Cluster

This is a clone of issue ~~OCPBUGS-249~~. The following is the description of the original issue:
—
+++ This bug was initially created as a clone of Bug #2070318 +++

Description of problem:
In an OCP VRRP deployment (using OCP cluster networking), each control node has an additional data interface configured alongside the regular management interface. In some deployments, the kubernetes address 172.30.0.1:443 is NAT'ed to the data interface instead of the management interface (10.40.1.4:6443 vs 10.30.1.4:6443, as we configure the bootstrap node), even though the default route is set to the 10.30.1.0 network. Because of that, all requests to 172.30.0.1:443 failed. After 10-15 minutes, OCP corrects this on its own and NATs correctly to 10.30.1.4:6443.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1. Provision an OCP cluster using cluster networking for DNS and load balancer instead of an external DNS and load balancer. Provision the host with 1 management interface and an additional interface for the data network. Along with the OCP manifests, add a manifest that creates a pod which will trigger communication with kube-apiserver.

2. Start the cluster installation.

3. Check the custom pod log in the cluster while the first 2 master nodes are installing to see that GET operations to kube-apiserver timed out. Check the nft table and follow the IP chains to see that the kubernetes service IP address was NAT'ed to the data IP address instead of the management IP. This does not happen every time; we have seen roughly a 50:50 chance.

Actual results:
After 10-15 minutes OCP will correct that by itself.

Expected results:
Wrong natting should not happen.

Additional info:
ClusterID: 24bbde0b-79b3-4ae6-afc5-cb694fa48895
ClusterVersion: Stable at "4.8.29"
ClusterOperators:
clusteroperator/authentication is not available (OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ocp-binhle-wqepch.contrail.juniper.net/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) because OAuthServerRouteEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.ocp-binhle-wqepch.contrail.juniper.net/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
clusteroperator/baremetal is degraded because metal3 deployment inaccessible
clusteroperator/console is not available (RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health): Get "https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) because RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health): Get "https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
clusteroperator/dns is progressing: DNS "default" reports Progressing=True: "Have 4 available DNS pods, want 5."
clusteroperator/ingress is degraded because The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
clusteroperator/insights is degraded because Unable to report: unable to build request to connect to Insights server: Post "https://cloud.redhat.com/api/ingress/v1/upload": dial tcp: lookup cloud.redhat.com on 172.30.0.10:53: read udp 10.128.0.26:53697->172.30.0.10:53: i/o timeout
clusteroperator/network is progressing: DaemonSet "openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)

— Additional comment from
bnemec@redhat.com
on 2022-03-30 20:00:25 UTC —

This is not managed by runtimecfg, but in order to route the bug correctly I need to know which CNI plugin you're using - OpenShiftSDN or OVNKubernetes. Thanks.

— Additional comment from
lpbinh@gmail.com
on 2022-03-31 08:09:11 UTC —

Hi Ben,

We were deploying Contrail CNI with OCP. However, this issue happens at very early deployment time, right after the bootstrap node is started
and there's no SDN/CNI there yet.

— Additional comment from
bnemec@redhat.com
on 2022-03-31 15:26:23 UTC —

Okay, I'm just going to send this to the SDN team then. They'll be able to provide more useful input than I can.

— Additional comment from
trozet@redhat.com
on 2022-04-04 15:22:21 UTC —

Can you please provide the iptables rules causing the DNAT as well as the routes on the host? Might be easiest to get a sosreport during initial bring up during that 10-15 min when the problem occurs.

— Additional comment from
lpbinh@gmail.com
on 2022-04-05 16:45:13 UTC —

All nodes have two interfaces:

eth0: 10.30.1.0/24
eth1: 10.40.1.0/24

machineNetwork is 10.30.1.0/24
default route points to 10.30.1.1

The kubeapi service ip is 172.30.0.1:443

all Kubernetes services are supposed to be reachable via machineNetwork (10.30.1.0/24)

To make the kubeapi service ip reachable in hostnetwork, something (openshift installer?) creates a set of nat rules which translates the service ip to the real ip of the nodes which have kubeapi active.

Initially kubeapi is only active on the bootstrap node so there should be a nat rule like

172.30.0.1:443 -> 10.30.1.10:6443 (assuming that 10.30.1.10 is the bootstrap nodes' ip address in the machine network)

However, what we see is
172.30.0.1:443 -> 10.40.1.10:6443 (which is the bootstrap nodes' eth1 ip address)

The rule is configured on the controller nodes and leads to asymmetric routing: the controller sends a packet from machineNetwork (10.30.1.x) to 172.30.0.1, which is then translated and forwarded to 10.40.1.10. That node tries to reply on the 10.40.1.0 network, which fails because the request came from the 10.30.1.0 network.

So, we want to understand why the openshift installer picks the 10.40.1.x IP address rather than the 10.30.1.x IP for the NAT rule. What is the mechanism for choosing the IP when the system has multiple interfaces with IPs configured?

Note: after a while (10-20 minutes) the bootstrap process resets itself and then it picks the correct ip address from the machineNetwork and things start to work.
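
The mismatch described in this comment can be checked mechanically. The following Python sketch is an illustration only and not part of the original comment: the helper name `in_machine_network` is hypothetical, and the addresses are the ones quoted above.

```python
import ipaddress

# machineNetwork as stated in the comment above.
MACHINE_NETWORK = ipaddress.ip_network("10.30.1.0/24")


def in_machine_network(endpoint: str) -> bool:
    """Return True if an 'IP:port' NAT target lies inside machineNetwork."""
    host, _, _port = endpoint.rpartition(":")
    return ipaddress.ip_address(host) in MACHINE_NETWORK


# Expected rule target (bootstrap node address in the machine network).
print(in_machine_network("10.30.1.10:6443"))
# Observed rule target (bootstrap node eth1 address).
print(in_machine_network("10.40.1.10:6443"))
```

A target outside machineNetwork is the condition that produces the asymmetric routing described above.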

— Additional comment from
smerrow@redhat.com
on 2022-04-13 13:55:04 UTC —

Note from Juniper regarding requested SOS report:

In reference to https://bugzilla.redhat.com/show_bug.cgi?id=2070318 that @Binh Le has been working on. The mustgather was too big to upload for this Bugzilla. Can you access this link?
https://junipernetworks-my.sharepoint.com/:u:/g/personal/sleigon_juniper_net/ETOrHMqao1tLm10Gmq9rzikB09H5OUwQWZRAuiOvx1nZpQ

Making note private to hide partner link

— Additional comment from
smerrow@redhat.com
on 2022-04-21 12:24:33 UTC —

Can we please get an update on this BZ?

Do let us know if there is any other information needed.

— Additional comment from
trozet@redhat.com
on 2022-04-21 14:06:00 UTC —

Can you please provide another link to the sosreport? Looks like the link is dead.

— Additional comment from
smerrow@redhat.com
on 2022-04-21 19:01:39 UTC —

See mustgather here: https://drive.google.com/file/d/16y9IfLAs7rtO-SMphbYBPgSbR4od5hcQ

— Additional comment from
trozet@redhat.com
on 2022-04-21 20:57:24 UTC —

Looking at the must-gather I think your iptables rules are most likely coming from the fact that kube-proxy is installed:

[trozet@fedora must-gather.local.288458111102725709]$ omg get pods -n openshift-kube-proxy
NAME                         READY   STATUS    RESTARTS   AGE
openshift-kube-proxy-kmm2p   2/2     Running   0          19h
openshift-kube-proxy-m2dz7   2/2     Running   0          16h
openshift-kube-proxy-s9p9g   2/2     Running   1          19h
openshift-kube-proxy-skrcv   2/2     Running   0          19h
openshift-kube-proxy-z4kjj   2/2     Running   0          19h

I'm not sure why this is installed. Is it intentional? I don't see the configuration in CNO to enable kube-proxy. Anyway the node IP detection is done via:
https://github.com/kubernetes/kubernetes/blob/f173d01c011c3574dea73a6fa3e20b0ab94531bb/cmd/kube-proxy/app/server.go#L844
Which just looks at the IP of the node. During bare metal install a VIP is chosen and used with keepalived for kubelet to have kapi access. I don't think there is any NAT rule for services until CNO comes up. So I suspect what really is happening is your node IP is changing during install, and kube-proxy is getting deployed (either intentionally or unintentionally) and that is causing the behavior you see. The node IP is chosen via the node ip configuration service:
https://github.com/openshift/machine-config-operator/blob/da6494c26c643826f44fbc005f26e0dfd10513ae/templates/common/_base/units/nodeip-configuration.service.yaml
This service determines the node IP based on which interfaces have a default route and which of those routes has the lowest metric. With your 2 interfaces, do they both have default routes? If so, are they using DHCP, and is it perhaps random which route gets installed with the lower metric?
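
The selection rule described in this comment (prefer the interface whose default route has the lowest metric) can be illustrated with a short Python sketch. This is an assumption-laden illustration, not the actual nodeip-configuration implementation: the `DefaultRoute` type, the `pick_node_ip` helper, and the metric values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DefaultRoute:
    interface: str  # interface that carries a default route
    address: str    # IP address configured on that interface
    metric: int     # route metric; lower is preferred


def pick_node_ip(routes: List[DefaultRoute]) -> Optional[str]:
    """Pick the address of the interface whose default route has the lowest metric."""
    if not routes:
        return None
    return min(routes, key=lambda route: route.metric).address


# Hypothetical metrics: if DHCP hands them out in a different order,
# the selected node IP flips between the two networks.
stable = [DefaultRoute("eth0", "10.30.1.7", 100), DefaultRoute("eth1", "10.40.1.7", 101)]
swapped = [DefaultRoute("eth0", "10.30.1.7", 101), DefaultRoute("eth1", "10.40.1.7", 100)]
print(pick_node_ip(stable))
print(pick_node_ip(swapped))
```

This shows why two default routes with unstable metrics could yield a node IP on the data network in some installs and on the machine network in others.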

— Additional comment from
trozet@redhat.com
on 2022-04-21 21:13:15 UTC —

Correction: looks like standalone kube-proxy is installed by default when the provider is not SDN, OVN, or kuryr so this looks like the correct default behavior for kube-proxy to be deployed.

— Additional comment from
lpbinh@gmail.com
on 2022-04-25 04:05:14 UTC —

Hi Tim,

You are right, kube-proxy is deployed by default and we don't change that behavior.

There is only 1 default route configured, for the management interface (10.30.1.x). We used to have a default route for the data/vrrp interface (10.40.1.x) with a higher metric. As said, we no longer have the default route for the second interface but still encounter the issue pretty often.

— Additional comment from
trozet@redhat.com
on 2022-04-25 14:24:05 UTC —

Binh, can you please provide a sosreport for one of the nodes that shows this behavior? Then we can try to figure out what is going on with the interfaces and the node ip service. Thanks.

— Additional comment from
trozet@redhat.com
on 2022-04-25 16:12:04 UTC —

Actually, Ben reminded me that the invalid endpoint is the bootstrap node itself:
172.30.0.1:443 -> 10.30.1.10:6443 (assuming that 10.30.1.10 is the bootstrap nodes' ip address in the machine network)

vs
172.30.0.1:443 -> 10.40.1.10:6443 (which is the bootstrap nodes' eth1 ip address)

So maybe a sosreport off that node is necessary? I'm not as familiar with the bare metal install process, moving back to Ben.

— Additional comment from
lpbinh@gmail.com
on 2022-04-26 08:33:45 UTC —

Created attachment 1875023 [details] sosreport

— Additional comment from
lpbinh@gmail.com
on 2022-04-26 08:34:59 UTC —

Created attachment 1875024 [details] sosreport-part2

Hi Tim,

We observe this issue when deploying clusters using OpenStack instances as our infrastructure is based on OpenStack.

I followed the steps here to collect the sosreport:
https://docs.openshift.com/container-platform/4.8/support/gathering-cluster-data.html
The resulting sosreport is 22MB, which exceeds the permitted size (19MB), so I split it into 2 files (xaa and xab). If you can't join them, we will need to put the collected sosreport on a shared drive like we did with the must-gather data.
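
Joining the two parts amounts to concatenating them in name order. A minimal Python sketch, assuming the parts were produced by a plain byte-wise split; the helper name `join_parts` and the target file name are hypothetical:

```python
from pathlib import Path
from typing import List


def join_parts(parts: List[Path], target: Path) -> int:
    """Concatenate the parts in sorted name order and return the bytes written."""
    written = 0
    with target.open("wb") as out:
        # Sorting by name restores the original order (xaa before xab).
        for part in sorted(parts):
            written += out.write(part.read_bytes())
    return written


if __name__ == "__main__":
    # 'sosreport.tar.xz' is an assumed name for the reassembled archive.
    size = join_parts([Path("xab"), Path("xaa")], Path("sosreport.tar.xz"))
    print(f"wrote {size} bytes")
```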

Here are some notes about the cluster:

First two control nodes are below, ocp-binhle-8dvald-ctrl-3 is the bootstrap node.

[core@ocp-binhle-8dvald-ctrl-2 ~]$ oc get node
NAME                       STATUS   ROLES    AGE   VERSION
ocp-binhle-8dvald-ctrl-1   Ready    master   14m   v1.21.8+ed4d8fd
ocp-binhle-8dvald-ctrl-2   Ready    master   22m   v1.21.8+ed4d8fd

We see the behavior that wrong nat'ing was done at the beginning, then corrected later:

sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 3 bytes 180 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
chain KUBE-SVC-NPX46M4PTMTKRN6Y { counter packets 3 bytes 180 jump KUBE-SEP-VZ2X7DROOLWBXBJ4 }
}
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4 { ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 3 bytes 180 dnat to 10.40.1.7:6443 }
}
sh-4.4#
sh-4.4#
<....after a while....>
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
chain KUBE-SVC-NPX46M4PTMTKRN6Y { counter packets 0 bytes 0 jump KUBE-SEP-X33IBTDFOZRR6ONM }
}
sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 0 bytes 0 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
chain KUBE-SVC-NPX46M4PTMTKRN6Y { counter packets 0 bytes 0 jump KUBE-SEP-X33IBTDFOZRR6ONM }
}
sh-4.4# nft list chain ip nat KUBE-SEP-X33IBTDFOZRR6ONM
table ip nat {
chain KUBE-SEP-X33IBTDFOZRR6ONM { ip saddr 10.30.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 0 bytes 0 dnat to 10.30.1.7:6443 }
}
sh-4.4#
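
To track when the rule flips, the DNAT target can be pulled out of the nft output above with a small script. This Python sketch is illustrative only: the `dnat_targets` helper is hypothetical, and the two sample strings are rule text copied from the chains listed above.

```python
import re
from typing import List, Tuple

# Matches the 'dnat to <ipv4>:<port>' statement as printed by nft above.
DNAT_RE = re.compile(r"dnat to (\d+\.\d+\.\d+\.\d+):(\d+)")


def dnat_targets(nft_output: str) -> List[Tuple[str, int]]:
    """Return every (ip, port) DNAT target found in the given nft output."""
    return [(ip, int(port)) for ip, port in DNAT_RE.findall(nft_output)]


# Rule text copied from the KUBE-SEP chains before and after the change.
EARLY = "ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 3 bytes 180 dnat to 10.40.1.7:6443"
LATER = "ip saddr 10.30.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 0 bytes 0 dnat to 10.30.1.7:6443"

print("early target:", dnat_targets(EARLY))
print("later target:", dnat_targets(LATER))
```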

— Additional comment from
lpbinh@gmail.com
on 2022-05-12 17:46:51 UTC —

@trozet@redhat.com May we have an update on the fix, or the plan for the fix? Thank you.

— Additional comment from
lpbinh@gmail.com
on 2022-05-18 21:27:45 UTC —

Created support Case 03223143.

— Additional comment from
vkochuku@redhat.com
on 2022-05-31 16:09:47 UTC —

Hello Team,

Any update on this?

Thanks,
Vinu K

— Additional comment from
smerrow@redhat.com
on 2022-05-31 17:28:54 UTC —

This issue is causing delays in Juniper's CI/CD pipeline and makes for a less than ideal user experience for deployments.

I'm getting a lot of pressure from the partner on this for an update and progress. I've had them open a case [1] to help progress.

Please let us know if there is any other data needed by Juniper or if there is anything I can do to help move this forward.

[1] https://access.redhat.com/support/cases/#/case/03223143

— Additional comment from
vpickard@redhat.com
on 2022-06-02 22:14:23 UTC —

@bnemec@redhat.com Tim mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2070318#c14 that this issue appears to be at BM install time. Is this something you can help with, or do we need help from the BM install team?

— Additional comment from
bnemec@redhat.com
on 2022-06-03 18:15:17 UTC —

Sorry, I missed that this came back to me.

(In reply to Binh Le from comment #16)
> We observe this issue when deploying clusters using OpenStack instances as
> our infrastructure is based on OpenStack.

This does not match the configuration in the must-gathers provided so far, which are baremetal. Are we talking about the same environments?

I'm currently discussing this with some other internal teams because I'm unfamiliar with this type of bootstrap setup. I need to understand what the intended behavior is before we decide on a path forward.

— Additional comment from
rurena@redhat.com
on 2022-06-06 14:36:54 UTC —

(In reply to Ben Nemec from comment #22)
> Sorry, I missed that this came back to me.
>
> (In reply to Binh Le from comment #16)
> > We observe this issue when deploying clusters using OpenStack instances as
> > our infrastructure is based on OpenStack.
>
> This does not match the configuration in the must-gathers provided so far,
> which are baremetal. Are we talking about the same environments?
>
> I'm currently discussing this with some other internal teams because I'm
> unfamiliar with this type of bootstrap setup. I need to understand what the
> intended behavior is before we decide on a path forward.

I spoke to the CU, and they tell me that all work should be on baremetal. They were probably just testing on OSP and pointing out that they saw the same behavior.

— Additional comment from
bnemec@redhat.com
on 2022-06-06 16:19:37 UTC —

Okay, I see now that this is an assisted installer deployment. Can we get the cluster ID assigned by AI so we can take a look at the logs on our side? Thanks.

— Additional comment from
lpbinh@gmail.com
on 2022-06-06 16:38:56 UTC —

Here is the cluster ID, copied from the bug description:
ClusterID: 24bbde0b-79b3-4ae6-afc5-cb694fa48895

In regard to your earlier question about OpenStack & baremetal (2022-06-03 18:15:17 UTC):

We had an issue with platform validation in OpenStack earlier. Host validation was failing with the error message “Platform network settings: Platform OpenStack Compute is allowed only for Single Node OpenShift or user-managed networking.”

We found that there is no platform type "OpenStack" available in [https://github.com/openshift/assisted-service/blob/master/models/platform_type.go#L29], so we set "baremetal" as the platform type on our computes. That's the reason why you are seeing baremetal as the platform type.

Thank you

— Additional comment from
ercohen@redhat.com
on 2022-06-08 08:00:18 UTC —

Hey, first, you are correct: when you set 10.30.1.0/24 as the machine network, the bootstrap process should use the IP on that subnet in the bootstrap node.

I'm trying to understand how exactly this cluster was installed.
You are using on-prem deployment of assisted-installer (podman/ACM)?
You are trying to form a cluster from OpenStack Vms?
You set the platform to Baremetal where?
Did you set user-managed-networking?

Some more info: when using the OpenStack platform, you should install the cluster with user-managed-networking.
That's what the failing validation is for.

— Additional comment from
bnemec@redhat.com
on 2022-06-08 14:56:53 UTC —

Moving to the assisted-installer component for further investigation.

— Additional comment from
lpbinh@gmail.com
on 2022-06-09 07:37:54 UTC —

@Eran Cohen:

Please see my response inline.

You are using on-prem deployment of assisted-installer (podman/ACM)?
--> Yes, we are using on-prem deployment of assisted-installer.

You are trying to form a cluster from OpenStack Vms?
--> Yes.

You set the platform to Baremetal where?
--> It was set in the Cluster object, Platform field when we model the cluster.

Did you set user-managed-networking?
--> Yes, we set it to false for VRRP.

— Additional comment from
itsoiref@redhat.com
on 2022-06-09 08:17:23 UTC —

@lpbinh@gmail.com can you please share the assisted logs that you can download once the cluster has failed or installed?
That will help us see the full picture.

— Additional comment from
ercohen@redhat.com
on 2022-06-09 08:23:18 UTC —

OK, as noted before, when using the OpenStack platform you should install the cluster with user-managed-networking (set to true).
Can you explain how you worked around this failing validation? “Platform network settings: Platform OpenStack Compute is allowed only for Single Node OpenShift or user-managed networking.”
What does this mean exactly? 'we set "baremetal" as the platform type on our computes'

To be honest I'm surprised that the installation was completed successfully.

@oamizur@redhat.com I thought installing on OpenStack VMs with baremetal platform (user-managed-networking=false) will always fail?

— Additional comment from
lpbinh@gmail.com
on 2022-06-10 16:04:56 UTC —

@itsoiref@redhat.com: I will reproduce and collect the logs. Is that supposed to be included in the provided must-gather?
@ercohen@redhat.com:

user-managed-networking is set to true when we use an external load balancer and DNS server. For VRRP we use OpenShift's internal LB and DNS server, hence it's set to false, following the doc.
As explained, OpenShift returns the platform type as 'none' for OpenStack (https://github.com/openshift/assisted-service/blob/master/models/platform_type.go#L29), therefore we set the platform type as 'baremetal' in the cluster object for provisioning the cluster using OpenStack VMs.

— Additional comment from
itsoiref@redhat.com
on 2022-06-13 13:08:17 UTC —

@lpbinh@gmail.com you will have a download_logs link in the UI. Those logs are not part of the must-gather.

— Additional comment from
lpbinh@gmail.com
on 2022-06-14 18:52:02 UTC —

Created attachment 1889993 [details] cluster log per need info request - Cluster ID caa475b0-df04-4c52-8ad9-abfed1509506

Attached is the cluster log per need info request.
Cluster ID: caa475b0-df04-4c52-8ad9-abfed1509506
In this reproduction, the issue was not resolved by OpenShift itself; the wrong NAT remained and the cluster deployment eventually failed.

sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 2 bytes 120 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
chain KUBE-SVC-NPX46M4PTMTKRN6Y { counter packets 2 bytes 120 jump KUBE-SEP-VZ2X7DROOLWBXBJ4 }
}
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4 { ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 2 bytes 120 dnat to 10.40.1.7:6443 }
}
Tue Jun 14 17:40:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4 { ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 2 bytes 120 dnat to 10.40.1.7:6443 }
}
Tue Jun 14 17:59:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4 { ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 9 bytes 540 dnat to 10.40.1.7:6443 }
}
Tue Jun 14 18:17:38 UTC 2022
sh-4.4#
sh-4.4#
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4 { ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 7 bytes 420 dnat to 10.40.1.7:6443 }
}
Tue Jun 14 18:49:28 UTC 2022
sh-4.4#

— Additional comment from
lpbinh@gmail.com
on 2022-06-14 18:56:06 UTC —

Created attachment 1889994 [details] cluster log per need info request - Cluster ID caa475b0-df04-4c52-8ad9-abfed1509506

Please find the cluster-log attached per your request. In this deployment the wrong NAT was not automatically resolved by OpenShift hence the deployment failed eventually.

sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 2 bytes 120 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
chain KUBE-SVC-NPX46M4PTMTKRN6Y { counter packets 2 bytes 120 jump KUBE-SEP-VZ2X7DROOLWBXBJ4 }
}
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4 { ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 2 bytes 120 dnat to 10.40.1.7:6443 }
}
Tue Jun 14 17:40:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4 { ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 2 bytes 120 dnat to 10.40.1.7:6443 }
}
Tue Jun 14 17:59:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4 { ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 9 bytes 540 dnat to 10.40.1.7:6443 }
}
Tue Jun 14 18:17:38 UTC 2022
sh-4.4#
sh-4.4#
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4 { ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 7 bytes 420 dnat to 10.40.1.7:6443 }
}
Tue Jun 14 18:49:28 UTC 2022
sh-4.4#

— Additional comment from itsoiref@redhat.com on 2022-06-15 15:59:22 UTC —

@lpbinh@gmail.com just for the record, we don't support bare-metal OCP on OpenStack, which is why validation is failing.

— Additional comment from lpbinh@gmail.com on 2022-06-15 17:47:39 UTC —

@itsoiref@redhat.com as explained, this is just a workaround on our side to make OCP work in our lab. As I understand it, from OCP's perspective the deployment is on bare metal only and is unrelated to OpenStack (please correct me if I am wrong).

We have run thousands of OCP cluster deployments in our automation so far. If that were why validation is failing, it should fail every time. However, the failure occurs only occasionally, when nodes have two interfaces and use the OCP internal DNS and load balancer, and it sometimes resolves by itself and sometimes does not.

— Additional comment from itsoiref@redhat.com on 2022-06-19 17:00:01 UTC —

For now I can assume that this endpoint is causing the issue:
{
    "apiVersion": "v1",
    "kind": "Endpoints",
    "metadata": {
        "creationTimestamp": "2022-06-14T17:31:10Z",
        "labels": {
            "endpointslice.kubernetes.io/skip-mirror": "true"
        },
        "name": "kubernetes",
        "namespace": "default",
        "resourceVersion": "265",
        "uid": "d8f558be-bb68-44ac-b7c2-85ca7a0fdab3"
    },
    "subsets": [
        {
            "addresses": [
                {
                    "ip": "10.40.1.7"
                }
            ],
            "ports": [
                {
                    "name": "https",
                    "port": 6443,
                    "protocol": "TCP"
                }
            ]
        }
    ]
},

— Additional comment from itsoiref@redhat.com on 2022-06-21 17:03:51 UTC —

The issue is that the kube-api service advertises the wrong IP. It does so because kubelet chooses an IP arbitrarily, and we currently have no mechanism to set the kubelet IP, especially in the bootstrap flow.

— Additional comment from lpbinh@gmail.com on 2022-06-22 16:07:29 UTC —

@itsoiref@redhat.com how do you perform OCP deployments in setups that have multiple interfaces if kubelet is left to choose an interface arbitrarily instead of being configured with a specific IP address to listen on? With what you describe above, the chance of deployment failure on systems with multiple interfaces would be high.
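
As a rough illustration of the alternative raised in this comment (pinning the node address instead of letting it be picked arbitrarily), the Python sketch below selects the address that falls inside a configured machine network. It is a sketch under stated assumptions, not how kubelet or the installer behaves: `pick_node_ip` and the `192.0.2.0/24` network are hypothetical, and only `10.40.1.7` comes from the output above.

```python
import ipaddress
from typing import Iterable, Optional


def pick_node_ip(candidates: Iterable[str], machine_cidr: str) -> Optional[str]:
    """Return the lowest candidate address inside machine_cidr, or None.

    Hypothetical helper: it only shows that a deterministic choice is
    possible once the intended network is known.
    """
    network = ipaddress.ip_network(machine_cidr)
    addresses = (ipaddress.ip_address(c) for c in candidates)
    # Addresses of the other IP family are never "in" the network.
    matches = sorted(a for a in addresses if a in network)
    return str(matches[0]) if matches else None


# Two interfaces: one on an unrelated network, one on the (assumed) machine network.
node_addresses = ["10.40.1.7", "192.0.2.15"]
print(pick_node_ip(node_addresses, "192.0.2.0/24"))
```

The order in which the interfaces are listed does not affect the result, which is the property the arbitrary selection lacks.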

— Additional comment from dhellard@redhat.com on 2022-06-24 16:32:26 UTC —

I set the Customer Escalation flag = Yes, per ACE EN-52253.
The impact is noted by the RH Account team: "Juniper is pressing and this impacts the Unica Next Project at Telefónica Spain. Unica Next is a critical project for Red Hat. We go live the 1st of July and this issue could impact the go live dates. We need clear information about the status and its possible resolution."

— Additional comment from itsoiref@redhat.com on 2022-06-26 07:28:44 UTC —

I have sent an image with a possible fix to Juniper and am waiting for their feedback. Once they confirm it works for them, we will proceed with the PRs.

— Additional comment from pratshar@redhat.com on 2022-06-30 13:26:26 UTC —

=== In Red Hat Customer Portal Case 03223143 ===
— Comment by Prateeksha Sharma on 6/30/2022 6:56 PM —

//EMT note//

Update from our consultant Manuel Martinez Briceno -

====
On 28 June 2022, the latest feedback from the Juniper Project Manager and our Partner Manager was that they are testing the fix. They did not give an estimated time to finish, but we will track this closely and report any news.
====

Thanks & Regards,
Prateeksha Sharma
Escalation Manager | RHCSA
Global Support Services, Red Hat

https://github.com/openshift/installer/pull/6253

Bug OCPBUGS-1023: [4.8] Rebase openshift/etcd 4.8 onto 3.4.21

3.4.21 is about to go out soonish with this plan: https://github.com/etcd-io/etcd/issues/14438

Two important BZs from our side are pending this rebase:

https://github.com/openshift/etcd/pull/150

Bug OCPBUGS-3201: [4.8.z backport][4.8][OVN] RHEL 7.9 DHCP worker ovs-configuration fails

Description of problem:

A backport of ovs-configuration introduced the use of `ip -j addr show` to output JSON for easier parsing. Unfortunately, the iproute2 version on RHEL 7.9 is too old to support the `-j` (JSON) option:

configure-ovs.sh[1516]: + extra_if_brex_args=
configure-ovs.sh[1516]: ++ ip -j a show dev bond0
configure-ovs.sh[1516]: ++ jq '.[0].addr_info | map(. | select(.family == "inet")) | length'
configure-ovs.sh[1516]: Option "-j" is unknown, try "ip -help".
configure-ovs.sh[1516]: + num_ipv4_addrs=
configure-ovs.sh[1516]: + '[' '' -gt 0 ']'
configure-ovs.sh[1516]: /usr/local/bin/configure-ovs.sh: line 290: [: : integer expression expected
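
The log above shows `num_ipv4_addrs` ending up empty because `ip -j` failed, which then breaks the integer comparison. A minimal Python sketch of counting addresses from the plain-text `ip a show` output instead, so the result is always an integer. The sample text and the `count_addrs` helper are illustrative assumptions, not taken from configure-ovs.sh:

```python
import re

# Hypothetical sample of plain `ip a show dev bond0` output (illustrative only).
SAMPLE = """\
2: bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
    inet 192.0.2.10/24 brd 192.0.2.255 scope global dynamic bond0
       valid_lft 3000sec preferred_lft 3000sec
    inet6 2001:db8::10/64 scope global dynamic
       valid_lft 3000sec preferred_lft 3000sec
    inet6 fe80::5054:ff:fe12:3456/64 scope link
       valid_lft forever preferred_lft forever
"""


def count_addrs(ip_output: str, family: str = "inet",
                skip_link_local: bool = False) -> int:
    """Count address lines for one family ("inet" or "inet6")."""
    # The word boundary keeps "inet" from also matching "inet6" lines.
    pattern = re.compile(r"^[ \t]*" + re.escape(family) + r"\b")
    count = 0
    for line in ip_output.splitlines():
        if not pattern.match(line):
            continue
        if skip_link_local and re.search(r"\bscope link\b", line):
            continue
        count += 1
    return count


print(count_addrs(SAMPLE))                                 # IPv4 addresses
print(count_addrs(SAMPLE, "inet6", skip_link_local=True))  # non-link-local IPv6
```

Because the count is derived from text that every iproute2 version prints, an unsupported option cannot leave the value empty; empty input simply yields zero.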

Version-Release number of selected component (if applicable):

4.8.0-0.nightly-2022-11-02-105425

How reproducible:

Always

Steps to Reproduce:

1. Deploy OVN cluster
2. Add RHEL 7.9 DHCP workers
3. oc adm node-logs $node -u ovs-configuration

Actual results:

As above

Option "-j" is unknown, try "ip -help".

Expected results:

ovs-configuration succeeds

+ extra_if_brex_args=
++ ip a show dev bond0
++ grep -E '^[[:blank:]]*inet\b'
++ wc -l
+ num_ipv4_addrs=1
+ '[' 1 -gt 0 ']'
+ extra_if_brex_args+='ipv4.may-fail no '
++ ip a show dev bond0
++ grep -E '^[[:blank:]]*inet6\b'
++ grep -v '\bscope link\b'
++ wc -l
+ num_ip6_addrs=1
+ '[' 1 -gt 0 ']'
+ extra_if_brex_args+='ipv6.may-fail no '
++ nmcli --get-values ipv4.dhcp-client-id conn show a7cc816d-3dbd-34c5-9902-d6b2f2956d92
+ dhcp_client_id=

Additional info:

https://github.com/openshift/machine-config-operator/pull/3401

Bug OCPBUGS-4321: Bump Jenkins version to 2.361.1 [release-4.8]

This is a clone of issue ~~OCPBUGS-1942~~. The following is the description of the original issue:
—
Description of problem:

Bump the Jenkins version to 2.361.1 and test the built images by running the verify-jenkins.sh script, which verifies the Jenkins version and plugins in an image. The verify script is available at https://gist.githubusercontent.com/coreydaley/fbf11d3b1a7a567f8c494da6a07bad41/raw/80e569131479c212d5e023bc41ce26fb15a17752/verify-jenkins.sh

Version-Release number of selected component (if applicable):

2.361.1

Additional info:

Verify script is present at https://gist.githubusercontent.com/coreydaley/fbf11d3b1a7a567f8c494da6a07bad41/raw/80e569131479c212d5e023bc41ce26fb15a17752/verify-jenkins.sh

https://github.com/openshift/jenkins/pull/1531

Story CONSOLE-2768: console-operator should use bindata instead of inlining manifests

The console-operator codebase contains many inline manifests. Instead, we should move those manifests into a `/bindata` folder, from which they are read and then updated as needed.

https://github.com/openshift/console-operator/pull/543

Bug OCPBUGS-1314: Users can't silence alerts from the dev console

Description of problem:

When logged in as a non-admin user, I can't silence alerts from the Dev Console.

Version-Release number of selected component (if applicable):

4.10 but the same issue may exist for previous versions.

How reproducible:

Steps to Reproduce:

1. Login to the dev console as a non-admin user. 
2. Follow the OCP documentation to deploy the example application including the service monitor and the rule (the user needs to be granted the monitoring-edit role). See https://docs.openshift.com/container-platform/4.9/monitoring/enabling-monitoring-for-user-defined-projects.html for details. 
3. Go to the Observe > Alerts page and disable notifications for 30 minutes for the VersionAlert alert.

Actual results:

The alert notification seems to be disabled but on refresh, the notification is still enabled.

Expected results:

The notification is permanently disabled.

Additional info:

The console backend hits the wrong service, which results in a 403 response code; it should use the tenancy-aware service.

Copied from https://bugzilla.redhat.com/show_bug.cgi?id=2117608

https://github.com/openshift/console/pull/12036

Bug OCPBUGS-2878: [release-4.8] machineConfigPool without nodes assigned stuck at 0% progress

This is a clone of issue ~~OCPBUGS-2173~~. The following is the description of the original issue:
—
Description of problem:

When a custom machineConfigPool is created and no node is associated with it, the mcp remains at 0% progress.

Version-Release number of selected component (if applicable):

How reproducible:

100%

Steps to Reproduce:

1. Create a custom mcp:
~~~
cat << EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: custom
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,custom]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/custom: "" 
EOF
~~~

Actual results:

The mcp is visible from "Administrator view > Cluster Settings > Details" at 0% progress.

Expected results:

It shouldn't be stuck at 0%
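
To make the edge case concrete: a pool with no nodes has zero machines, so a naive "updated / total" percentage has nothing to divide by. The Python sketch below shows one possible way to handle it, assuming the value is derived from the pool's `machineCount` and `updatedMachineCount` status fields. Treating an empty pool as complete is an assumption for illustration and is not taken from the linked console PR.

```python
def pool_progress_percent(machine_count: int, updated_machine_count: int) -> int:
    """Return update progress for a pool as an integer percentage."""
    if machine_count == 0:
        # Assumption: a pool without nodes has nothing left to update,
        # so report it as complete instead of leaving it at 0%.
        return 100
    return round(100 * updated_machine_count / machine_count)


print(pool_progress_percent(0, 0))  # empty custom pool
print(pool_progress_percent(4, 1))  # one of four nodes updated
```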

Additional info:

https://github.com/openshift/console/pull/12216

Bug ODC-5916: Fix Automation scripts for Pipeline Triggers - Text changes

Description of problem:

The P-09-TC01, P-09-TC04, P-09-TC05, P-09-TC06, P-09-TC07 and P-09-TC11 test scripts require updates.

Page objects updated for pipelines.

Prerequisites (if any, like setup, operators/versions):

Steps to Reproduce

<steps>

Actual results:

Expected results:

Reproducibility (Always/Intermittent/Only Once):

Build Details:

Additional info:

https://github.com/openshift/console/pull/9106

Bug OCPBUGS-16682: [4.8] The cpb binary still was built dynamically

Description of problem:

Optional operators unpacking failure due to the `cpb` issue.

MacBook-Pro:~ jianzhang$ oc get pods
NAME                                                              READY   STATUS       RESTARTS   AGE
29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00d5pz2   0/1     Init:Error   0          4h24m
29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00ftpqw   0/1     Init:Error   0          4h25m
29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00ftts8   0/1     Init:Error   0          4h24m
29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00jbz8v   0/1     Init:Error   0          4h24m
certified-operators-xjh27                                         1/1     Running      0          5h25m

MacBook-Pro:~ jianzhang$ oc describe pods pods 29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00jbz8v
Name:             29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00jbz8v
Namespace:        openshift-marketplace
Priority:         0
...
...
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:90944becade86164c70b8dcd70415de83cab3951cb5dcda9e4fac6968c4f2492

[cloud-user@preserve-olm-env2 jian]$ sudo podman run --rm -ti quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:90944becade86164c70b8dcd70415de83cab3951cb5dcda9e4fac6968c4f2492
bash-4.4$ which cob
which: no cob in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)
bash-4.4$ which cpb
/usr/bin/cpb
bash-4.4$ ldd /usr/bin/cpb
	linux-vdso.so.1 (0x00007ffe341a2000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f51a0214000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f519fff4000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f519fc2f000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f51a0418000)
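
The `ldd` output above lists shared-library dependencies, which is what shows that `cpb` was built dynamically. A small Python sketch of the same check, assuming `ldd` output is available as text. `is_dynamically_linked` is a hypothetical helper, and the "not a dynamic executable" and "statically linked" strings are the messages glibc's `ldd` is expected to print for binaries without dynamic dependencies:

```python
def is_dynamically_linked(ldd_output: str) -> bool:
    """Return True if ldd output lists shared-library dependencies."""
    text = ldd_output.strip()
    # Messages printed for binaries without dynamic dependencies.
    if "not a dynamic executable" in text or "statically linked" in text:
        return False
    # Dependency lines look like "libc.so.6 => /lib64/libc.so.6 (0x...)".
    return any("=>" in line for line in text.splitlines())


# Output copied from the report above.
LDD_CPB = """\
\tlinux-vdso.so.1 (0x00007ffe341a2000)
\tlibdl.so.2 => /lib64/libdl.so.2 (0x00007f51a0214000)
\tlibpthread.so.0 => /lib64/libpthread.so.0 (0x00007f519fff4000)
\tlibc.so.6 => /lib64/libc.so.6 (0x00007f519fc2f000)
\t/lib64/ld-linux-x86-64.so.2 (0x00007f51a0418000)
"""

print(is_dynamically_linked(LDD_CPB))
```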

Version-Release number of selected component (if applicable):

Cluster version is 4.8.0-0.nightly-2023-07-21-001905

How reproducible:

always

Steps to Reproduce:

1. Install OCP 4.8
2. Subscribe to an operator.

Actual results:

Failed to install the operator. Unpack pod failed to init.

MacBook-Pro:~ jianzhang$ oc get pods
NAME                                                              READY   STATUS       RESTARTS   AGE
29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00d5pz2   0/1     Init:Error   0          4h24m
29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00ftpqw   0/1     Init:Error   0          4h25m
29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00ftts8   0/1     Init:Error   0          4h24m
29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00jbz8v   0/1     Init:Error   0          4h24m
certified-operators-xjh27                                         1/1     Running      0          5h25m
community-operators-wdzp9                                         1/1     Running      0          5h25m
marketplace-operator-6d8f7f6f89-bgfxb                             1/1     Running      0          5h36m
qe-app-registry-zc4pz                                             1/1     Running      0          5h4m
redhat-marketplace-bffjz                                          1/1     Running      0          5h25m
redhat-operators-hjjfc                                            1/1     Running      0          5h25m
MacBook-Pro:~ jianzhang$ oc logs 29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00jbz8v
Defaulted container "extract" out of: extract, util (init), pull (init)
Error from server (BadRequest): container "extract" in pod "29f049d4a61031babbbcd0e303a2787017361bdeb45286f923cb891a00jbz8v" is waiting to start: PodInitializing

Expected results:

The unpack pod runs successfully.

Additional info:

Bug OCPBUGS-3050: Network policies are not implemented or updated by OVN-Kubernetes

This bug is a backport clone of [Bugzilla Bug 2115926](https://bugzilla.redhat.com/show_bug.cgi?id=2115926). The following is the description of the original bug:
—
+++ This bug was initially created as a clone of Bug #2109442 +++

An important commit was missed during the downstream merge
Commit: https://github.com/openshift/ovn-kubernetes/pull/956/commits/96b2a2555a654d72a8546366032063a98a016f29
Initial downstream merge to master branch: https://github.com/openshift/ovn-kubernetes/pull/956
Downstream merge into the Release 4.10 branch: https://github.com/openshift/ovn-kubernetes/pull/971
Pull request to include the missing commit in release 4.10: https://github.com/openshift/ovn-kubernetes/pull/1195

+++ This bug was initially created as a clone of Bug #2048538 +++

Description of problem:

In one of our customer's clusters we see that new network policies are not created or updated by OVN-Kubernetes.
For one application this means it cannot reach the DNS service because the network policy that allows that is not being implemented.

In our own test on this cluster, pods in a namespace CAN reach each other despite this network policy:
~~~
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  creationTimestamp: "2022-01-27T14:41:05Z"
  generation: 2
  name: default-deny
  namespace: customer-debug
  resourceVersion: "311846645"
  uid: 87646222-c86d-4000-8997-7f0557ac34cf
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
~~~

In one of our dev clusters this network policy is enforced.

Version-Release number of selected component (if applicable):

OCP 4.8.25

How reproducible:

This happens randomly and is very difficult to predict.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

The case includes the must-gathers from the cluster.

— Additional comment from Tim Rozet on 2022-02-03 16:04:01 UTC —

Upon finishing my analysis of the logs, I found several bugs/errors happening here, all of which compound to either make network policies fail to be enforced properly or cause them to stay enforced when they shouldn't be:

1. policy.go:818] Failed to set ports in PortGroup for network policy ie-st-montun-filebeat/default-deny: Reconnecting...Transaction Failed due to an error: syntax error details: expected ["set", <array>] in {update Port_Group map[name:a11253394058733577533 ports:0xc001f1a1b0] [] [] [] 0 [[name == a11253394058733577533]] }

This is due to a bug in the go-ovn library that was fixed in 4.9. I'm going to backport the same fix to 4.8z.

2. policy.go:1166] no pod IPs found on pod redhat-marketplace-brhvf: could not find OVN pod annotation in map[openshift.io/scc:anyuid operatorframework.io/managed-by:marketplace-operator]

This error is spammed throughout the log, but is benign. On pod add we could fail to get the OVN annotation due to racing with pod handler. However, once the pod handler annotates the pod an update event will happen and this code will be executed again. I'm going to ignore printing this error on pod add.

3. policy.go:733] logical port cd-argocd-cdteam_testssl2 not found in cache

This is the same as https://bugzilla.redhat.com/show_bug.cgi?id=2037884. The bug references stateful sets, but this was really true about any pod being added. When the network policy is created or pods are added that belong to the network policy's namespace, we attempt to get the pod's information from an internal cache. This races with the pod being added to the cache by the pod handler. The fix makes the network policy handler wait until the pod is added to the cache. Otherwise the network policy is created and potentially skips being applied to some pods in the namespace. This is already fixed in 4.8.29

4. policy.go:1166] failed to add IPs ... set contains duplicate value

The duplicate value being added here is a VIP for a load balancer. In 4.9 and later there is a lower probability of this happening (because we no longer store an internal cache, so there shouldn't be duplicates); however, I'm still going to add checks to ensure we filter out any duplicate values before adding them to the cache or sending the RPC to OVN. I'm going to ensure a proper fix goes into master and then backport to 4.8z.

5. E0125 18:40:32.759129 1 policy.go:955] Failed to create port_group for network policy allow-prometheus in namespace ie-st-montun-filebeat

This is the most egregious bug. First, the log is not printing the actual error. Second, this failure causes the network policy to fail creation, and it is then not retried (unless the policy is updated). We need a retry mechanism to attempt to recreate the policy, just like we do with pods. This will require a heavier fix in master and then a backport down to 4.8z.

— Additional comment from Tim Rozet on 2022-02-03 21:59:27 UTC —

Fix for number 2: https://github.com/ovn-org/ovn-kubernetes/pull/2792

— Additional comment from Tim Rozet on 2022-02-03 22:41:21 UTC —

Fix for number 4: https://github.com/ovn-org/ovn-kubernetes/pull/2794
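
The check described for number 4 (filtering duplicate values before they are added to a set) can be sketched in Python as follows. This is only an illustration of the idea, not the code in the linked PR; `dedupe_ips` is a hypothetical helper and the addresses are example values.

```python
import ipaddress
from typing import Iterable, List


def dedupe_ips(ips: Iterable[str]) -> List[str]:
    """Drop duplicate addresses while keeping first-seen order.

    Addresses are normalized first, so two spellings of the same IPv6
    address count as one value.
    """
    seen = set()
    unique = []
    for raw in ips:
        normalized = str(ipaddress.ip_address(raw))
        if normalized not in seen:
            seen.add(normalized)
            unique.append(normalized)
    return unique


# A load-balancer VIP that shows up twice is passed on only once.
print(dedupe_ips(["172.30.0.10", "10.128.0.5", "172.30.0.10"]))
```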

— Additional comment from Tim Rozet on 2022-02-04 23:23:59 UTC —

Partial fix for number 5: https://github.com/ovn-org/ovn-kubernetes/pull/2797

Will need a follow up part 2 after this is reviewed + accepted.

— Additional comment from Tim Rozet on 2022-02-09 01:45:15 UTC —

Posted https://github.com/ovn-org/ovn-kubernetes/pull/2809 which will supersede PR 2797. That should be the complete fix for issue number 5.
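
The retry mechanism called for in issue number 5 can be pictured with the small Python sketch below: a failed creation is remembered and attempted again later, rather than being dropped until the policy is next updated. It is a simplified illustration under stated assumptions, not the design used in the PRs above; `RetryQueue`, `max_attempts` and the example key are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class RetryQueue:
    create: Callable[[str], None]      # raises an exception on failure
    max_attempts: int = 5
    pending: Dict[str, int] = field(default_factory=dict)  # key -> failed attempts

    def add(self, key: str) -> bool:
        """Try to create the object now; keep it for retry if that fails."""
        return self._attempt(key)

    def retry_pending(self) -> None:
        """Re-attempt everything that failed earlier (e.g. on a timer)."""
        for key in list(self.pending):
            self._attempt(key)

    def _attempt(self, key: str) -> bool:
        try:
            self.create(key)
        except Exception:
            attempts = self.pending.get(key, 0) + 1
            if attempts >= self.max_attempts:
                self.pending.pop(key, None)   # give up after too many failures
            else:
                self.pending[key] = attempts
            return False
        self.pending.pop(key, None)
        return True


# Example: creation fails twice, then succeeds on the third attempt.
calls = {"n": 0}

def flaky_create(key: str) -> None:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("failed to create port_group")

queue = RetryQueue(create=flaky_create)
queue.add("ie-st-montun-filebeat/allow-prometheus")
queue.retry_pending()
queue.retry_pending()
print(queue.pending)
```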

— Additional comment from Andy Bartlett on 2022-02-09 10:33:15 UTC —

@trozet@redhat.com Do you have a link for the BZ / PR for:

{update Port_Group map[name:a11253394058733577533 ports:0xc001f1a1b0] [] [] [] 0 [[name == a11253394058733577533]] }

This is due to a bug in the go-ovn library that was fixed in 4.9. I'm going to backport the same fix to 4.8z.

Many thanks,

Andy

— Additional comment from Tim Rozet on 2022-02-14 16:55:30 UTC —

Yeah the fix for number 1 is a one liner in the ebay/libovsdb library:

https://github.com/openshift/ovn-kubernetes/commit/35677418d2bbfddb6229e1d776bba2064dde646b#diff-88e093886eb91e9ca5f9234d74a5f756c0251d685c141c902a7833d95bec5345R27

@@ -24,7 +24,7 @@ func NewOvsSet(goSlice interface{}) (*OvsSet, error) {
 		return nil, errors.New("OvsSet supports only Go Slice types")
 	}

-	var ovsSet []interface{}
+	ovsSet := make([]interface{}, 0, v.Len())
 	for i := 0; i < v.Len(); i++ {
 		ovsSet = append(ovsSet, v.Index(i).Interface())
 	}
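
The `expected ["set", <array>]` error from number 1 is consistent with an empty set being serialized with a null payload instead of an empty array, which is what an unallocated Go slice becomes in JSON. The Python sketch below illustrates that wire-format difference using the `["set", [...]]` notation from the log. It is an analogy for the one-line Go change, not the library's code, and both helper names are hypothetical.

```python
import json
from typing import Iterable, Optional


def encode_ovs_set_naive(items: Optional[list]) -> str:
    """Serialize as-is: a missing list turns into a JSON null."""
    return json.dumps(["set", items])


def encode_ovs_set(items: Optional[Iterable]) -> str:
    """Always emit an array, even when there are no items."""
    payload = list(items) if items is not None else []
    return json.dumps(["set", payload])


print(encode_ovs_set_naive(None))  # payload is null, not an array
print(encode_ovs_set(None))        # payload is an empty array
```

Allocating the empty container up front, as the patched line does with `make`, means the "no ports" case still serializes as an array.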

— Additional comment from Tim Rozet on 2022-02-15 14:51:40 UTC —

Moving back to assigned, a small issue was found with the previous patch: https://github.com/ovn-org/ovn-kubernetes/pull/2823

— Additional comment from Tim Rozet on 2022-02-16 17:21:06 UTC —

Found another issue where a delete/recreate of a policy with the same name may not clean up the stale version. Pushed a fix here: https://github.com/ovn-org/ovn-kubernetes/pull/2826

— Additional comment from anusaxen@redhat.com on 2022-07-21 20:30:21 UTC —

Tested with cluster bot build referencing PR #1195

All networkpolicy regression and checks passed in QE env

— Additional comment from trozet@redhat.com on 2022-07-25 13:52:33 UTC —

The description states that the "found in" version is OCP 4.8, but the version on this bug is 4.10. It looks like the bug exists in 4.8 as well. Can we fix the version and make sure we backport to 4.8 and 4.9?

— Additional comment from aos-team-art-private@bot.bugzilla.redhat.com on 2022-08-05 03:32:07 UTC —

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.10 release.

https://github.com/openshift/ovn-kubernetes/pull/1345

Bug OCPBUGS-16208: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/ovn-kubernetes/pull/1776

Bug OCPBUGS-2347: [cluster-api-provider-baremetal] fix 4.8 build

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-api-provider-baremetal/pull/178

Bug ODC-5786: When selecting an APP, Helm Chart or Operator backed service in topology an extraneous blue box is displayed

Description of problem:

In the topology view, if you select any grouping (Application, Helm Release, Operator Backed service, etc), an extraneous blue box is displayed
This is a regression.

Prerequisites (if any, like setup, operators/versions):

Steps to Reproduce

Create an application in any way ... but this will do ...

Import from Git ... defaulting the application provided
Select the Application which is created & you'll see the blue bounding box

Actual results:

This animated gif shows the issue:

Expected results:

The blue box shouldn't be shown

Reproducibility (Always/Intermittent/Only Once):

Always

Build Details:

Seen on 4/26/2021 4.8 daily, but this behavior was discussed in slack last week

Additional info:

This is a regression

https://github.com/openshift/console/pull/8785

Bug OCPBUGS-2754: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-api-provider-azure/pull/265

Bug OCPBUGS-16253: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/sdn/pull/561

Bug OCPBUGS-7054: Sync jenkins-version.txt, base-plugins.txt and bundle-plugins.txt from master branch

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/jenkins/pull/1584

Bug OCPBUGS-4113: Various Jenkins CVEs for October 2022 [openshift-4.8.z]

CVE-2022-36881
CVE-2022-34177
CVE-2022-45047
CVE-2022-45379
CVE-2022-45380
CVE-2022-43403
CVE-2022-43409
CVE-2022-43408
CVE-2022-43407
CVE-2022-43404
CVE-2022-43401
CVE-2022-43402
CVE-2022-43405
CVE-2022-43406
CVE-2022-42889
CVE-2022-25857
CVE-2022-36882
CVE-2022-30948
CVE-2022-30945
CVE-2021-26291
CVE-2022-29047

https://github.com/openshift/jenkins/pull/1543

Bug OCPBUGS-4120: Dev Catalog taking too much time to load in a complete disconnected cluster

This is a clone of issue ~~OCPBUGS-1523~~. The following is the description of the original issue:
—
Description of problem:
In a completely disconnected cluster, the dev catalog takes too long to load.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. A complete disconnected cluster
2. In add page go to the All services page
3.

Actual results:
The catalog takes too long to load.

Expected results:
Time taken should be reduced

Additional info:
Attached a gif for reference

https://github.com/openshift/console/pull/12309

Bug OCPBUGS-475: Various Jenkins CVEs for August 2022 [openshift-4.8.z]

CVE-2022-36882 - https://issues.redhat.com/browse/OCPBUGS-1179
CVE-2022-29047 - https://issues.redhat.com/browse/OCPBUGS-120
CVE-2022-30945 - https://issues.redhat.com/browse/OCPBUGS-311
CVE-2022-30946
CVE-2022-30948 - https://issues.redhat.com/browse/OCPBUGS-319
CVE-2022-30952
CVE-2022-30953
CVE-2022-30954
CVE-2022-34174
CVE-2022-36883
CVE-2022-36884
CVE-2022-36885
CVE-2022-34177
CVE-2022-34176
CVE-2022-36881

https://github.com/openshift/jenkins/pull/1511

Bug ODC-5915: Fix Automation scripts for Add flow - Create from git feature

Description of problem:

Some of the steps in test scenarios [A-06-TC02]- Script fix required

A-06-TC05 - script fix required

A-06-TC11 update required as per the latest UI

Prerequisites (if any, like setup, operators/versions):

Steps to Reproduce

<steps>

Actual results:

Expected results:

Reproducibility (Always/Intermittent/Only Once):

Build Details:

Additional info:

https://github.com/openshift/console/pull/9165

Bug OCPBUGS-644: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Bug OCPBUGS-6924: [release-4.9] openshift4/ose-jenkins:v4.10.0 run script throws too many arguments error

This is a clone of issue ~~OCPBUGS-6881~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-5947~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-4833~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-4819~~. The following is the description of the original issue:
—
Description of problem:

s2i/run script has a bug - /usr/libexec/s2i/run: line 578: [: too many arguments

Version-Release number of selected component (if applicable):

v4.10

How reproducible:

Start jenkins from the container image using /usr/libexec/s2i/run while having a route that contains a certificate or key that includes special characters.

Steps to Reproduce:

1. create a route that contains a TLS certificate
2. start a pod using openshift4/ose-jenkins:v4.10.0
3. view the log

Actual results:

2022/12/12 17:30:33 [go-init] No pre-start command defined, skip
2022/12/12 17:30:33 [go-init] Main command launched : /usr/libexec/s2i/run
CONTAINER_MEMORY_IN_MB='12288', using /usr/lib/jvm/java-11-openjdk-11.0.16.0.8-1.el8_4.x86_64/bin/java and /usr/lib/jvm/java-11-openjdk-11.0.16.0.8-1.el8_4.x86_64/bin/javac
Administrative monitors that contact the update center will remain active
Migrating slave image configuration to current version tag ...
/usr/libexec/s2i/run: line 578: [: too many arguments
Using JENKINS_SERVICE_NAME=jenkins
Generating jenkins.model.JenkinsLocationConfiguration.xml using (/var/lib/jenkins/jenkins.model.JenkinsLocationConfiguration.xml.tpl) ...
Jenkins URL set to: https://bojenkinsdev.micron.com/ in file: /var/lib/jenkins/jenkins.model.JenkinsLocationConfiguration.xml
+ exec java -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:+ParallelRefProcEnabled -XX:+UseG1GC -XX:+UseStringDeduplication -XX:HeapDumpPath=/var/log/jenkins/ '-Xlog:gc*=debug:file=/var/log/jenkins-engserv/gc-%t.log:utctime:filecount=2,filesize=100m' -Xms2g -Xmx8g -Dfile.encoding=UTF8 -Djavamelody.displayed-counters=log,error -Djava.util.logging.config.file=/var/lib/jenkins/logging.properties -Djavax.net.ssl.trustStore=/var/lib/jenkins/ca-anchors-keystore -Dcom.redhat.fips=false -Djdk.http.auth.tunneling.disabledSchemes= -Djdk.http.auth.proxying.disabledSchemes= -Duser.home=/var/lib/jenkins -Djavamelody.application-name=jenkins -Dhudson.security.csrf.GlobalCrumbIssuerConfiguration.DISABLE_CSRF_PROTECTION=true -Djenkins.install.runSetupWizard=false -Dhudson.security.csrf.GlobalCrumbIssuerConfiguration.DISABLE_CSRF_PROTECTION=false -XX:+AlwaysPreTouch -XX:ErrorFile=/var/log/jenkins-engserv -Dhudson.model.ParametersAction.keepUndefinedParameters=false -jar /usr/lib/jenkins/jenkins.war --prefix=/je...
Picked up JAVA_TOOL_OPTIONS: -XX:+UnlockExperimentalVMOptions -Dsun.zip.disableMemoryMapping=true

Expected results:

The following error should not appear:
/usr/libexec/s2i/run: line 578: [: too many arguments

Additional info:

https://github.com/openshift/jenkins/pull/1570

Bug OCPBUGS-6934: hack/check-plugins-supply-chain-change.sh is not executable

Description of problem:

hack/check-plugins-supply-chain-change.sh script is not executable when running ci/prow/jenkins-check-plugins-supply-chain-change job

Version-Release number of selected component (if applicable):

4.8.z

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/jenkins/pull/1571

Bug OCPBUGS-2748: Make northd probe interval default to 10 seconds

This is a clone of issue ~~OCPBUGS-2594~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-1538~~. The following is the description of the original issue:
—
Description of problem:

Tracking this for backport of https://bugzilla.redhat.com/show_bug.cgi?id=2072710

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/cluster-network-operator/pull/1596

Bug OCPBUGS-2773: e2e tests: Installs Red Hat Integration - 3scale operator test is failing due to change of Operator name

This is a clone of issue OCPBUGS-2523, which is a clone of OCPBUGS-2451, which is in turn a clone of OCPBUGS-2181. The description of the original issue follows:
Description of problem:

The E2E test "Installs Red Hat Integration - 3scale operator" is failing because the Operator name changed.

CI Search: https://search.ci.openshift.org/?search=Installs+Red+Hat+Integration+-+3scale+operator+in+test+namespace+and+creates+3scale+Backend+Schema+operand+instance&maxAge=24h&context=1&type=bug%2Bissue%2Bjunit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

https://github.com/openshift/console/pull/12203

Bug OCPBUGS-6043: Sync stable branch for CPO release-1.21 into release-4.8

Description of problem:

The release-4.8 branch of openshift/cloud-provider-openstack is missing some commits that were backported to the upstream project's release-1.21 branch.
We should import them into our downstream fork.

https://github.com/openshift/cloud-provider-openstack/pull/169

Bug OCPBUGS-1098: Setting a telemeter proxy in the cluster-monitoring-config config map does not work as expected

Description of problem:

Setting a telemeter proxy in the cluster-monitoring-config config map does not work as expected

Steps to Reproduce:
The following KCS article details the steps to add a proxy.
The steps have been verified on 4.7 but do not work on 4.8, 4.9, or 4.10.

https://access.redhat.com/solutions/6172402

When testing on 4.8, 4.9, and 4.10, the proxy settings were also tried nested under `telemeterClient`, which triggered a telemeter restart, but the proxy settings were not set in the deployment as they are in 4.7.

Actual results:

On 4.8, 4.9, and 4.10, setting the proxy without nesting it under `telemeterClient` does not trigger a restart of the telemeter pod.

Expected results:

I think the proxy settings should be nested under `telemeterClient` and should set the environment variables in the deployment.

Additional info:

https://github.com/openshift/cluster-monitoring-operator/pull/1765

Bug ODC-5920: Fix Automation scripts for Pipelines- Runs

Description of problem:

Fixing P-07-TC20 test scenario

https://github.com/openshift/console/pull/9109

Bug OCPBUGS-5573: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-network-operator/pull/1684

Task RHSTOR-1643: Fix Import Order

1) We want to fix the order of imports in the files.

2) Vendor imports should come first, followed by console/package imports, with relative imports last.

This can be done manually or by introducing linter rules.
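
The ordering rule above can be sketched as a sort key. This is only an illustration in Python, not the linter configuration used in the linked PR; the `@console/` prefix for console packages and the function names are assumptions made for the example.

```python
from typing import List


def import_group(specifier: str) -> int:
    """Return 0 for vendor, 1 for console packages, 2 for relative imports."""
    if specifier.startswith("."):
        return 2
    # Assumption: console packages share the "@console/" prefix.
    if specifier.startswith("@console/"):
        return 1
    return 0


def sort_imports(specifiers: List[str]) -> List[str]:
    """Sort by group first, then alphabetically within each group."""
    return sorted(specifiers, key=lambda s: (import_group(s), s))


if __name__ == "__main__":
    example = ["./utils", "@console/shared", "react", "../types", "lodash"]
    print(sort_imports(example))
```

A linter rule would enforce the same grouping automatically, which avoids relying on manual review.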

https://github.com/openshift/console/pull/8131

Bug OCPBUGS-16685: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-network-operator/pull/1916

Bug OCPBUGS-4775: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-etcd-operator/pull/972

Bug OCPBUGS-5527: OLM generates invalid component selector labels

This is a clone of issue OCPBUGS-4948, which is itself a clone of OCPBUGS-4799. The description of the original issue follows:
[This issue is for a backport to 4.10.z for our CI. This issue was already addressed for 4.11+ in https://github.com/openshift/operator-framework-olm/pull/285]

Description of problem:

When installing an operator, OLM creates an "operator" Custom Resource whose status will be updated to contain a list of resources associated with the operator. This is done by labeling each resource associated with an operator with a label based off this code: https://github.com/operator-framework/operator-lifecycle-manager/blob/7eccf5342199b88f4657b6c996d4e66d9fa978fa/pkg/controller/operators/decorators/operator.go#L92-L105

Version-Release number of selected component (if applicable):

4.8

How reproducible:

Always

Steps to Reproduce:

1. Create a subscription named managed-node-metadata-operator in the openshift-managed-node-metadata-operator namespace. This causes the truncated label to end in `-`, which is not allowed as the final character of a label name.
2. Watch the OLM Operator logs.

Actual results:

The adoption controller within OLM continuously fails to adopt the subscription due to an illegal label value:

{"level":"error","ts":1670862754.2096953,"logger":"controllers.adoption","msg":"Error adopting Subscription","request":"openshift-managed-node-metadata-operator/managed-node-metadata-operator","error":"Subscription.operators.coreos.com \"managed-node-metadata-operator\" is invalid: metadata.labels: Invalid value: \"operators.coreos.com/managed-node-metadata-operator.openshift-managed-node-metadata-\": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}
{"level":"error","ts":1670862754.2097518,"logger":"controller.subscription","msg":"Reconciler error","reconciler group":"operators.coreos.com","reconciler kind":"Subscription","name":"managed-node-metadata-operator","namespace":"openshift-managed-node-metadata-operator","error":"Subscription.operators.coreos.com \"managed-node-metadata-operator\" is invalid: metadata.labels: Invalid value: \"operators.coreos.com/managed-node-metadata-operator.openshift-managed-node-metadata-\": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')","errorCauses":[{"error":"Subscription.operators.coreos.com \"managed-node-metadata-operator\" is invalid: metadata.labels: Invalid value: \"operators.coreos.com/managed-node-metadata-operator.openshift-managed-node-metadata-\": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')"}
],"stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func2.2\n\t/build/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}

Expected results:

The adoption controller creates a valid label that can be applied to the subscription so that it can be "adopted" by the controller.

Additional info:

This was originally fixed in 4.11 here: https://github.com/operator-framework/operator-lifecycle-manager/pull/2731
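
To make the failure mode concrete, the sketch below reproduces the truncation in Python. It assumes the name part of the label key is built as `<name>.<namespace>` and cut at 63 characters, which matches the value in the log above; the helper names are hypothetical, and the actual fix in the linked PR may differ from the simple trailing-character trim shown here.

```python
import re

# Validation regex quoted in the error message above.
LABEL_NAME_RE = re.compile(r"([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]")
MAX_NAME_LEN = 63  # Kubernetes limit for the name part of a label key


def naive_label_name(name, namespace):
    """Truncate '<name>.<namespace>' to 63 characters with no further cleanup."""
    return f"{name}.{namespace}"[:MAX_NAME_LEN]


def safe_label_name(name, namespace):
    """Truncate, then drop trailing characters that may not end a label name."""
    return naive_label_name(name, namespace).rstrip("-_.")


if __name__ == "__main__":
    name = "managed-node-metadata-operator"
    namespace = "openshift-managed-node-metadata-operator"
    for candidate in (naive_label_name(name, namespace), safe_label_name(name, namespace)):
        # Report whether each candidate passes the validation regex.
        print(candidate, bool(LABEL_NAME_RE.fullmatch(candidate)))
```

Trimming after truncation keeps the label within the length limit while ensuring it ends with an alphanumeric character.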

Story BUILD-249: 4.8 samples bumps

User Story

Pull in the latest openshift/library content into the samples operator
If the image ecosystem e2e tests fail, work with upstream SCL to address the failures

Acceptance Criteria

Samples operator installs current official content in openshift/library
List of removed/EOL images is prepared for docs update

Docs Impact

List of EOL images needs to be sent to the Docs team and added to the release notes.

https://github.com/openshift/cluster-samples-operator/pull/367

Bug OCPBUGS-895: Machine Controller stuck with Terminated Instances while Provisioning on AWS

This is a clone of issue OCPBUGS-572, which is a clone of Bug 2117557 to track the backport to 4.9.z. The description of the original issue follows:

+++ This bug was initially created as a clone of Bug #2108021 +++

+++ This bug was initially created as a clone of Bug #2106733 +++

Description of problem:
During a replacement of worker nodes, we noticed that the machine-controller container, which is deployed as part of the `openshift-machine-api` namespace, would panic when an OpenShift Machine was still in the "Provisioning" state but the corresponding AWS instance was already "Terminated".

```
I0628 10:09:02.518169 1 reconciler.go:123] my-super-worker-skghqwd23: deleting machine
I0628 10:09:03.090641 1 reconciler.go:464] my-super-worker-skghqwd23: Found instance by id: i-11111111111111
I0628 10:09:03.090662 1 reconciler.go:138] my-super-worker-skghqwd23: found 1 existing instances for machine
I0628 10:09:03.090669 1 utils.go:231] Cleaning up extraneous instance for machine: i-11111111111111, state: running, launchTime: 2022-06-28 08:56:52 +0000 UTC
I0628 10:09:03.090682 1 utils.go:235] Terminating i-05332b08d4cc3ab28 instance
panic: assignment to entry in nil map

goroutine 125 [running]:
sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine.(*Reconciler).delete(0xc0012df980, 0xc0004bd530, 0x234c4c0)
/go/src/sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine/reconciler.go:165 +0x95b
sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine.(*Actuator).Delete(0xc000a3a900, 0x25db9b8, 0xc0004bd530, 0xc000b9a000, 0x35e0100, 0x0)
/go/src/sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine/actuator.go:171 +0x365
github.com/openshift/machine-api-operator/pkg/controller/machine.(*ReconcileMachine).Reconcile(0xc0007bc960, 0x25db9b8, 0xc0004bd530, 0xc0007c5fc8, 0x15, 0xc0005e4a80, 0x2a, 0xc0004bd530, 0xc000032000, 0x206d640, ...)
/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/github.com/openshift/machine-api-operator/pkg/controller/machine/controller.go:231 +0x2352
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0003b20a0, 0x25db910, 0xc00087e040, 0x1feb8e0, 0xc00009f460)
/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0003b20a0, 0x25db910, 0xc00087e040, 0x0)
/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2(0xc000a38790, 0xc0003b20a0, 0x25db910, 0xc00087e040)
/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214 +0x6b
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:210 +0x425
```
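
The panic in the trace is Go's runtime error for writing to a map that was never initialized. Python has no nil maps, so the sketch below is only an analogue in which `None` stands in for the nil map; the `annotations` field and the function name are hypothetical, and this is not the change made in the linked PR.

```python
def set_annotation(machine, key, value):
    """Write key/value into machine['annotations'], creating the mapping if absent."""
    # Guard: initialize the mapping before the first write, which is the step
    # a nil-map panic indicates was skipped.
    if machine.get("annotations") is None:
        machine["annotations"] = {}
    machine["annotations"][key] = value
    return machine


if __name__ == "__main__":
    machine = {"name": "my-super-worker-skghqwd23", "annotations": None}
    print(set_annotation(machine, "example/state", "terminated"))
```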

What is the business impact? Please also provide timeframe information.
We failed to recover from a major outage due to this bug.

Where are you experiencing the behavior? What environment?
Production and all envs.

When does the behavior occur? Frequency? Repeatedly? At certain times?
It appeared only once so far, but can appear in larger scaling scenarios.

Version-Release number of selected component (if applicable):
4.8.39

Actual results:

With the machine-controller panicking, no new instances could be provisioned, resulting in an unscalable cluster. The workaround was to delete the offending Machines.

Expected results:

The cluster becomes scalable again without manually deleting Machines.

Additional info:

— Additional comment from gferrazs@redhat.com on 2022-07-13 13:34:08 UTC —

Probably the issue is here:

https://github.com/openshift/machine-api-provider-aws/blob/d701bcb720a12bd7d169d79699962c447a1f026d/pkg/actuators/machine/reconciler.go#L416-L426 (the fields referenced are in the file below; probably duplicate the lines or move them here).

https://github.com/openshift/machine-api-provider-aws/blob/d701bcb720a12bd7d169d79699962c447a1f026d/pkg/actuators/machine/reconciler.go#L165

— Additional comment from skumari@redhat.com on 2022-07-13 15:22:47 UTC —

Since the issue is in machine-api, moving it to the correct team.

— Additional comment from rmanak@redhat.com on 2022-07-14 08:20:00 UTC —

I am working on a fix for this.

— Additional comment from aos-team-art-private@redhat.com on 2022-07-14 19:10:14 UTC —

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.12 release.

— Additional comment from jspeed@redhat.com on 2022-07-18 15:18:16 UTC —

Waiting for the first 4.11.z stream before we merge

— Additional comment from jspeed@redhat.com on 2022-08-08 15:07:16 UTC —

Waiting on 4.11 GA to move ahead here

https://github.com/openshift/cluster-api-provider-aws/pull/447

Bug ODC-5917: Fix Automation scripts of Pipelines - Create from Add options

Description of problem:

P-01-TC03	On the second run, the script worked fine
P-01-TC06	Created separate functions for the Dockerfile page
P-01-TC09	Removing this test case by updating the P-04-TC04 test scenario; updating the Pipelines section title in the sidebar

The above test scenarios work fine on the second run.

https://github.com/openshift/console/pull/9103

Bug OCPBUGS-3105: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/etcd/pull/173

Bug OCPBUGS-4258: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/grafana/pull/95

Bug OCPBUGS-1461: Kubelet slowly leaking memory and pods eventually unable to start

+++ This bug was initially created as a clone of Bug #2106414 +++

+++ This bug was initially created as a clone of Bug #2065749 +++

Description of problem:

Over time the kubelet slowly consumes memory until, at some point, pods are no longer able to start on the node; coinciding with this are container runtime errors. It appears that even rebooting the node does not resolve the issue once it occurs - the node has to be completely rebuilt.

How reproducible: Consistently

Actual results: Pods are eventually unable to start on the node; rebuilding the node is the only workaround

Expected results: kubelet/crio would continue working as expected

— Additional comment from lstanton@redhat.com on 2022-03-18 17:12:39 UTC —

OVERVIEW
========

Ford is having an ongoing issue where they are seeing node memory usage slowly increase over time until pods are unable to start, and errors like the following start to show up:

"Error: relabel failed /var/lib/kubelet/pods/47eb3e14-a631-412b-b07a-b19499faddbb/volumes/kubernetes.io~csi/pvc-05a55ed9-d63b-4c68-b3af-644a4353d9be/mount: lsetxattr /var/lib/kubelet/pods/47eb3e14-a631-412b-b07a-b19499faddbb/volumes/kubernetes.io~csi/pvc-05a55ed9-d63b-4c68-b3af-644a4353d9be/mount: operation not permitted"

Once this happens the node is unusable and has to be rebuilt, as rebooting apparently does not solve the problem.

We originally thought that this was going to be resolved in BZ 2042175, which was eventually closed as a duplicate of BZ 2060494. Since BZ 2044438 represented the 4.9 fix (in 4.9.21, errata RHBA-2022:0488), we told the customer to go ahead and upgrade to 4.9.21, but unfortunately this did not resolve the issue for them, which is why we are opening this BZ.

Ford has a very specific use case for OCP: they are mostly using the cluster to run large numbers of Tekton pipelines and are using some custom SCCs to work around current limitations in OCP filesystem support.

— Additional comment from lstanton@redhat.com on 2022-03-18 17:20:39 UTC —

There is a lot of data available on case 2065749, though this is before they upgraded from 4.8 to 4.9. If there's any specific data that we haven't captured yet but would be helpful please let me know.

— Additional comment from lstanton@redhat.com on 2022-03-18 19:08:44 UTC —

The latest data from an occurrence before the customer upgraded to 4.9.21 can be found here (so this is from 4.8):

— sosreport —
https://attachments.access.redhat.com/hydra/rest/cases/03130785/attachments/057aead6-71c7-47ad-8faf-b0ba437625f3

— Other data (including pprof) —
https://attachments.access.redhat.com/hydra/rest/cases/03130785/attachments/b384715b-4189-4f34-bd85-02a718b1314a

Would it be possible to get your opinion on the above node data? Does anything look obviously out of line in terms of kubelet or crio behavior that might explain pods failing to start? I've requested newer data in the case.

— Additional comment from rphillips@redhat.com on 2022-03-28 14:47:34 UTC —

1. Kubelet pprof

go tool pprof pd103-7h7tj-worker-c-2qh7x-profile.pprof.gz
(pprof) top
560ms 17.78% 17.78% 570ms 18.10% syscall.Syscall
510ms 16.19% 33.97% 520ms 16.51% syscall.Syscall6
190ms 6.03% 40.00% 190ms 6.03% runtime.epollctl
130ms 4.13% 44.13% 270ms 8.57% runtime.scanobject
110ms 3.49% 47.62% 350ms 11.11% runtime.mallocgc
110ms 3.49% 51.11% 110ms 3.49% runtime.markBits.isMarked (inline)
80ms 2.54% 53.65% 80ms 2.54% aeshashbody
80ms 2.54% 56.19% 80ms 2.54% runtime.epollwait
70ms 2.22% 58.41% 70ms 2.22% runtime.futex
60ms 1.90% 60.32% 70ms 2.22% runtime.heapBitsSetType

This dump shows the kubelet is in a state of constant GC. scanobject taking 270ms is high. Syscall6 is LSTAT taking 520ms is

2. Kubelet

I noticed that pods are failing to be fully cleaned up and created within the kubelet: Failed to remove cgroup (will retry)

This means the pods are staying within the kubelet's memory and the kubelet is retrying the cleanup operation (effectively leaking memory).

Additionally, as noted in Comment #1, pods are failing during the start phase at the SELinux relabel. The failure in the start phase leads me to believe we are hitting [1] (not yet backported into 4.9). PR 107845 fixes an issue where pods on CRI error are arbitrarily marked as terminated when they should be marked as waiting.

[1] https://github.com/kubernetes/kubernetes/pull/107845

— Additional comment from rphillips@redhat.com on 2022-03-28 15:40:19 UTC —

What are the custom SCCs? Fixing the issue of the relabel is the first step in solving this.

— Additional comment from rphillips@redhat.com on 2022-03-29 14:31:26 UTC —

Upstream leak https://github.com/kubernetes/kubernetes/pull/109103. The pprof on this BZ shows the same issue.

I'll attach the screenshot.

— Additional comment from rphillips@redhat.com on 2022-03-29 14:32:06 UTC —

Created attachment 1869023
leak pprof

— Additional comment from lstanton@redhat.com on 2022-04-05 15:02:09 UTC —

@rphillips@redhat.com do you still need info from me?

— Additional comment from rphillips@redhat.com on 2022-04-13 14:08:15 UTC —

No thank you!

— Additional comment from hshukla@redhat.com on 2022-04-25 09:37:55 UTC —

Created attachment 1874784
pprof from 3 nodes for Swisscom case 03125278

— Additional comment from rphillips@redhat.com on 2022-04-25 13:50:14 UTC —

@Hradayesh: CU has the same leak as this BZ.

❯ go tool pprof w0-d-altais.corproot.net-heap.pprof
File: kubelet
Build ID: 6eb513a78ba65574e291855722d4efa0a3fc9c23
Type: inuse_space
Time: Apr 25, 2022 at 3:49am (CDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 54.37MB, 75.57% of 71.94MB total
Showing top 10 nodes out of 313
flat flat% sum% cum cum%
26.56MB 36.92% 36.92% 26.56MB 36.92% k8s.io/kubernetes/pkg/kubelet/cm/containermap.ContainerMap.Add
10.50MB 14.60% 51.52% 10.50MB 14.60% k8s.io/kubernetes/vendor/k8s.io/cri-api/pkg/apis/runtime/v1alpha2.(*CreateContainerResponse).Unmarshal
5.05MB 7.02% 58.53% 5.05MB 7.02% k8s.io/kubernetes/pkg/kubelet/volumemanager/populator.(*desiredStateOfWorldPopulator).markPodProcessed
3.50MB 4.87% 63.40% 4MB 5.56% k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/apis/meta/v1.(*ObjectMeta).Unmarshal
2.50MB 3.48% 66.88% 4.01MB 5.57% k8s.io/kubernetes/vendor/github.com/google/cadvisor/container/libcontainer.newContainerStats
1.50MB 2.09% 68.97% 2MB 2.78% k8s.io/kubernetes/vendor/k8s.io/api/core/v1.(*Pod).DeepCopy
1.50MB 2.09% 71.05% 1.50MB 2.09% k8s.io/kubernetes/vendor/k8s.io/api/core/v1.(*Container).Unmarshal
1.13MB 1.57% 72.62% 1.13MB 1.57% k8s.io/kubernetes/vendor/google.golang.org/protobuf/internal/strs.(*Builder).MakeString
1.12MB 1.55% 74.18% 1.62MB 2.25% k8s.io/kubernetes/vendor/github.com/golang/groupcache/lru.(*Cache).Add
1MB 1.39% 75.57% 1MB 1.39% k8s.io/kubernetes/vendor/github.com/aws/aws-sdk-go/aws/endpoints.init

— Additional comment from bsmitley@redhat.com on 2022-05-24 15:17:47 UTC —

Is there any ETA on getting a fix for this? This issue is happening a lot on Ford's Tekton pods.

— Additional comment from dahernan@redhat.com on 2022-06-02 14:35:33 UTC —

Once this is verified, is it viable to also backport it to 4.8, 4.9, or 4.10 (at least)? I do not see any cherry-pick or backport for other versions yet, but we have a customer (Swisscom) pushing for that, as this is impacting them.

— Additional comment from aos-team-art-private@redhat.com on 2022-06-09 06:10:29 UTC —

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.11 release.

— Additional comment from openshift-bugzilla-robot@redhat.com on 2022-06-13 13:13:11 UTC —

Bugfix included in accepted release 4.11.0-0.nightly-2022-06-11-054027
Bug will not be automatically moved to VERIFIED for the following reasons:

PR openshift/kubernetes#1229 not approved by QA contact

This bug must now be manually moved to VERIFIED by schoudha@redhat.com

— Additional comment from schoudha@redhat.com on 2022-06-15 15:21:28 UTC —

Checked on 4.11.0-0.nightly-2022-06-14-172335 by running pods over a day and don't see unexpectedly high memory usage by kubelet on node.

% oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-06-14-172335 True False 8h Cluster version is 4.11.0-0.nightly-2022-06-14-172335

— Additional comment from errata-xmlrpc@redhat.com on 2022-06-15 17:57:08 UTC —

This bug has been added to advisory RHEA-2022:5069 by OpenShift Release Team Bot (ocp-build/buildvm.openshift.eng.bos.redhat.com@REDHAT.COM)

— Additional comment from bsmitley@redhat.com on 2022-06-27 20:34:28 UTC —

Do we have a timeframe for getting this into 4.10?

This issue is impacting Ford.

Account name: Ford
Account Number: 5561914
TAM customer: yes
CSE customer: yes
Strategic: yes

— Additional comment from bsmitley@redhat.com on 2022-06-28 15:28:55 UTC —

Ford brought this up in my TAM meeting as a hot issue.

I need an ETA of when this will be in 4.10. This just needs to be a rough date. I know normally we don't share exact dates because timeframes could change.

Bug ODC-5910: Update the kafka related gherkin scripts

Description of problem:

Update the kafka test scenarios in eventing-kafka-event-source.feature file

Additional info:

The test scenarios were updated during regression test execution.

https://github.com/openshift/console/pull/9092

Bug ODC-5954: Clean up task for 4.8 release

Description of problem:

Remove the odc tags and update the cypress.json files

https://github.com/openshift/console/pull/9212

Bug OCPBUGS-2577: [4.8] ETCD Operator goes degraded when a second internal node ip is added

This is a clone of issue OCPBUGS-1758, which is itself a clone of OCPBUGS-1354. The description of the original issue follows:
This was originally reported in BZ as https://bugzilla.redhat.com/show_bug.cgi?id=2046335

—

Description of problem:

The issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=1954121 still occurs (tested on OCP 4.8.11; the customer also verified that the issue can happen with OpenShift 4.7.30, 4.8.17, and 4.9.11).

How reproducible:

Attaching a NIC to a master node triggers the issue.

Steps to Reproduce:
1. Deploy an OCP cluster (I've tested it IPI on AWS)
2. Attach a second NIC to a running master node (in my case "ip-10-0-178-163.eu-central-1.compute.internal")

Actual results:

~~~
$ oc get node ip-10-0-178-163.eu-central-1.compute.internal -o json | jq ".status.addresses"
[
  { "address": "10.0.178.163", "type": "InternalIP" },
  { "address": "10.0.187.247", "type": "InternalIP" },
  { "address": "ip-10-0-178-163.eu-central-1.compute.internal", "type": "Hostname" },
  { "address": "ip-10-0-178-163.eu-central-1.compute.internal", "type": "InternalDNS" }
]

$ oc get co etcd
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
etcd 4.8.11 True False True 31h

$ oc get co etcd -o json | jq ".status.conditions[0]"

{ "lastTransitionTime": "2022-01-26T15:47:42Z", "message": "EtcdCertSignerControllerDegraded: [x509: certificate is valid for 10.0.178.163, not 10.0.187.247, x509: certificate is valid for ::1, 10.0.178.163, 127.0.0.1, ::1, not 10.0.187.247]", "reason": "EtcdCertSignerController_Error", "status": "True", "type": "Degraded" }

~~~

Expected results:

To have the certificate valid also for the second IP (the newly created one "10.0.187.247")
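
The mismatch reported in the Degraded condition can be expressed as a small check: every `InternalIP` on the node should appear among the certificate's IP SANs. The Python sketch below uses the addresses and SAN list from the output above; the function name is hypothetical, and this is an illustration of the check, not the operator's actual logic.

```python
import ipaddress


def missing_cert_ips(node_addresses, cert_ips):
    """Return node InternalIP addresses that are not among the certificate's IP SANs."""
    covered = {ipaddress.ip_address(ip) for ip in cert_ips}
    return [
        entry["address"]
        for entry in node_addresses
        if entry["type"] == "InternalIP"
        and ipaddress.ip_address(entry["address"]) not in covered
    ]


if __name__ == "__main__":
    node_addresses = [
        {"address": "10.0.178.163", "type": "InternalIP"},
        {"address": "10.0.187.247", "type": "InternalIP"},
        {"address": "ip-10-0-178-163.eu-central-1.compute.internal", "type": "Hostname"},
        {"address": "ip-10-0-178-163.eu-central-1.compute.internal", "type": "InternalDNS"},
    ]
    cert_ips = ["::1", "10.0.178.163", "127.0.0.1"]
    print(missing_cert_ips(node_addresses, cert_ips))
```

A non-empty result means the certificate would need to be regenerated to cover the new address.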

Additional info:

Deleting the following secrets seems to solve the issue:
~~~
$ oc get secret -n openshift-etcd | grep kubernetes.io/tls | grep ^etcd-
etcd-client kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 60s
etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 60s
etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 58s
etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 58s

$ oc get secret -n openshift-etcd | grep kubernetes.io/tls | grep ^etcd- | awk '{print $1}' | xargs -I {} oc delete secret {} -n openshift-etcd
secret "etcd-client" deleted
secret "etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal" deleted

$ oc get co etcd -o json | jq ".status.conditions[0]"

{ "lastTransitionTime": "2022-01-26T15:52:21Z", "message": "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found", "reason": "AsExpected", "status": "False", "type": "Degraded" }

~~~

https://github.com/openshift/cluster-etcd-operator/pull/949

Task ETCD-180: reduce etcd MTTR on process restart

This task adds support for setting the socket options SO_REUSEADDR and SO_REUSEPORT on etcd listeners via ListenConfig. These options give flexibility to cluster admins who want more explicit control over these features. We have found that during an etcd process restart there can be a considerable wait for the port to be released, because it is held in TIME_WAIT, which on many systems lasts 60s.
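
etcd sets these options in Go through a ListenConfig; the Python sketch below only illustrates the same socket options on a plain TCP listener. The function name and the loopback address are assumptions for the example, and SO_REUSEPORT is applied only where the platform defines it.

```python
import socket


def reusable_listener(host="127.0.0.1", port=0):
    """Create a TCP listener with SO_REUSEADDR (and SO_REUSEPORT where available)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Allow binding while old connections on this port sit in TIME_WAIT.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        # SO_REUSEPORT is not defined on every platform, so guard the lookup.
        if hasattr(socket, "SO_REUSEPORT"):
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind((host, port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


if __name__ == "__main__":
    listener = reusable_listener()
    print("listening on", listener.getsockname())
    listener.close()
```

With SO_REUSEADDR set before `bind`, a restarted process does not have to wait for TIME_WAIT entries on the port to expire.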

https://github.com/openshift/etcd/pull/70

Bug OCPBUGS-3665: [Backport for 1986375] Avoid CMO being degraded when some nodes aren't available

Hello team,

Raising this bug to backport the fix for the bug (https://bugzilla.redhat.com/show_bug.cgi?id=1986375) in OCP version 4.8

https://github.com/openshift/cluster-monitoring-operator/pull/1813

Bug ODC-5918: Fix Automation scripts for Pipelines- Create from builder page

Description of problem:

P-02-TC02	Script fix required - unable to identify locators
P-02-TC03	Script fix required - unable to identify locators
P-02-TC06	Script fix required - unable to identify locators

https://github.com/openshift/console/pull/9211

Bug OCPBUGS-1230: [4.8] etcd should not rollout new revision when etcd Cluster is unhealthy/degraded

https://github.com/openshift/cluster-etcd-operator/pull/930

Bug OCPBUGS-1977: README file for helm charts coded in Chinese shows messy characters when viewing in developer perspective.

Description of problem:

Helm chart README file is coded in Chinese，the content turns into messy code in developer perspective while configuring the helm chart.

Version-Release number of selected component (if applicable):

OpenShift Container Platform : 4.8.20 and also found same behavior on : 4.10.16

How reproducible:

Steps to Reproduce:

1. Create a custom HelmChartRepository which consist a helm chart with a README.md file coded in Chinese  

2. Then check and try to install the helm chart from : Developer Catalog > Helm Charts , The README file contents will be showing messy.

Actual results:

Helm chart README file is coded in Chinese，the content turns into messy code in developer perspective while configuring the helm chart.

Expected results:

Chinese characters in the README file must display correctly.

Additional info:

https://github.com/openshift/console/pull/12132
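
The garbling described in this bug is typical of UTF-8 bytes being decoded with a single-byte charset. The following minimal Python sketch reproduces that effect. It is an illustration only: the bug report does not state the root cause, so the charset mismatch is an assumption, and the sample string and variable names are hypothetical.

```python
# Illustration only: the sample text is hypothetical, and a charset mismatch
# is an assumed cause, not something the bug report confirms.
readme_text = "安装说明"  # a README heading written in Chinese
raw_bytes = readme_text.encode("utf-8")

# Decoding UTF-8 bytes as Latin-1 never raises, but every byte becomes its own
# character, so the result is unreadable.
garbled = raw_bytes.decode("latin-1")

# Decoding the same bytes as UTF-8 restores the original text.
restored = raw_bytes.decode("utf-8")

print(repr(garbled))
print(restored)
```

Under this assumption, the README bytes themselves are intact and only the decoding step needs to treat them as UTF-8.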

Bug OCPBUGS-4131: Externalize Jenkins version

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/jenkins/pull/1526

Bug OCPBUGS-1057: [4.8] Remove `yq` curls from CI steps

This is a clone of issue ~~OCPBUGS-1010~~. The following is the description of the original issue:
—
Description of problem:

+++ This bug was initially created as a clone of https://issues.redhat.com//browse/OCPBUGS-784

Various CI steps use the upi-installer container for its access to the
aws cli tools, among other things. However, most of those steps also
curl yq directly from GitHub. We can save ourselves some headaches
when GitHub is down by embedding the binary in the image.

Whenever GitHub has issues or throttles us, the yq download fails with a hash mismatch error. The mismatch probably occurs because GitHub returns an error page instead of the binary, although our scripts hide that response.
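
To illustrate why a hidden error page shows up as a hash mismatch, here is a minimal Python sketch of a checksum check that also reports the start of the downloaded payload. It is not the CI step's actual script; the function name `verify_sha256`, the stand-in bytes, and the sample error page are all hypothetical.

```python
import hashlib


def verify_sha256(payload: bytes, expected_hex: str) -> None:
    """Raise ValueError if payload does not match the expected SHA-256 digest."""
    actual_hex = hashlib.sha256(payload).hexdigest()
    if actual_hex != expected_hex.lower():
        # Show the first bytes of the payload so an HTML error page is visible
        # instead of being hidden behind a bare "hash mismatch".
        preview = payload[:60].decode("utf-8", errors="replace")
        raise ValueError(
            f"sha256 mismatch: expected {expected_hex}, got {actual_hex}; "
            f"payload starts with: {preview!r}"
        )


# Stand-in bytes for the real binary (hypothetical content).
binary = b"\x7fELF stand-in for the yq binary"
expected = hashlib.sha256(binary).hexdigest()
verify_sha256(binary, expected)  # matches, so nothing is raised

# A throttled download might return an error page instead of the binary.
error_page = b"<html><body>429 Too Many Requests</body></html>"
try:
    verify_sha256(error_page, expected)
except ValueError as err:
    print(err)
```

Embedding the binary in the image, as proposed above, removes the download and therefore this failure mode entirely.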

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

https://github.com/openshift/installer/pull/6322

Bug OCPBUGS-2861: The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

https://github.com/openshift/cluster-api-provider-aws/pull/450

Bug OCPBUGS-5089: 4.8 node-exporter daemonset does not filter nodeReadyCount with kubernetes.io/os=linux nodeSelector

Description of problem:
This bug was found while debugging https://issues.redhat.com/browse/OCPQE-13200

Deploy 4.8.0-0.nightly-2022-11-30-073158 with the aos-4_8/ipi-on-aws/versioned-installer-ovn-winc-ci template. The template creates a cluster with 3 Linux masters, 3 Linux workers, and 2 Windows workers. In this bug, ip-10-0-149-219.us-east-2.compute.internal and ip-10-0-158-129.us-east-2.compute.internal are the Windows workers (they carry the kubernetes.io/os=windows label, not kubernetes.io/os=linux).

# oc get node --show-labels
NAME                                         STATUS   ROLES    AGE     VERSION                             LABELS
ip-10-0-139-166.us-east-2.compute.internal   Ready    worker   4h35m   v1.21.14+a17bdb3                    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-139-166,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.large,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-143-178.us-east-2.compute.internal   Ready    master   4h47m   v1.21.14+a17bdb3                    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-143-178,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=m5.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-149-219.us-east-2.compute.internal   Ready    worker   3h51m   v1.21.11-rc.0.1506+5cc9227e4695d1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5a.large,beta.kubernetes.io/os=windows,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ec2amaz-2hcbpla,kubernetes.io/os=windows,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5a.large,node.kubernetes.io/windows-build=10.0.17763,node.openshift.io/os_id=Windows,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-158-129.us-east-2.compute.internal   Ready    worker   3h45m   v1.21.11-rc.0.1506+5cc9227e4695d1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5a.large,beta.kubernetes.io/os=windows,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ec2amaz-golrucd,kubernetes.io/os=windows,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5a.large,node.kubernetes.io/windows-build=10.0.17763,node.openshift.io/os_id=Windows,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-175-105.us-east-2.compute.internal   Ready    worker   4h35m   v1.21.14+a17bdb3                    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-175-105,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.large,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2b,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2b
ip-10-0-188-67.us-east-2.compute.internal    Ready    master   4h43m   v1.21.14+a17bdb3                    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-188-67,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=m5.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2b,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2b
ip-10-0-192-42.us-east-2.compute.internal    Ready    worker   4h35m   v1.21.14+a17bdb3                    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-192-42,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.large,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c
ip-10-0-210-137.us-east-2.compute.internal   Ready    master   4h43m   v1.21.14+a17bdb3                    beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-210-137,kubernetes.io/os=linux,node-role.kubernetes.io/master=,node.kubernetes.io/instance-type=m5.xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c

# oc get node -l kubernetes.io/os=linux
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-139-166.us-east-2.compute.internal   Ready    worker   4h31m   v1.21.14+a17bdb3
ip-10-0-143-178.us-east-2.compute.internal   Ready    master   4h43m   v1.21.14+a17bdb3
ip-10-0-175-105.us-east-2.compute.internal   Ready    worker   4h31m   v1.21.14+a17bdb3
ip-10-0-188-67.us-east-2.compute.internal    Ready    master   4h39m   v1.21.14+a17bdb3
ip-10-0-192-42.us-east-2.compute.internal    Ready    worker   4h31m   v1.21.14+a17bdb3
ip-10-0-210-137.us-east-2.compute.internal   Ready    master   4h40m   v1.21.14+a17bdb3

# oc get node -l kubernetes.io/os=windows
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-149-219.us-east-2.compute.internal   Ready    worker   3h48m   v1.21.11-rc.0.1506+5cc9227e4695d1
ip-10-0-158-129.us-east-2.compute.internal   Ready    worker   3h41m   v1.21.11-rc.0.1506+5cc9227e4695d1

Monitoring is degraded with the message: expected 8 ready pods for "node-exporter" daemonset, got 6

# oc get co monitoring -oyaml
...
  - lastTransitionTime: "2022-12-21T03:08:47Z"
    message: 'Failed to rollout the stack. Error: running task Updating node-exporter
      failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object
      failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter:
      expected 8 ready pods for "node-exporter" daemonset, got 6 '
    reason: UpdatingnodeExporterFailed
    status: "True"
    type: Degraded
  extension: null

The same errors appear in the CMO logs:

# oc -n openshift-monitoring logs -c cluster-monitoring-operator cluster-monitoring-operator-7fd77f4b87-pnfm9 | grep "reconciling node-exporter DaemonSet failed" | tail
I1221 07:30:52.343230       1 operator.go:503] ClusterOperator reconciliation failed (attempt 55), retrying. Err: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6 
E1221 07:30:52.343253       1 operator.go:402] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6 
I1221 07:35:54.713045       1 operator.go:503] ClusterOperator reconciliation failed (attempt 56), retrying. Err: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6 
E1221 07:35:54.713064       1 operator.go:402] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6

The node-exporter pods run only on nodes labeled kubernetes.io/os=linux:

# oc -n openshift-monitoring get ds
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
node-exporter   6         6         6       6            6           kubernetes.io/os=linux   4h33m

# oc -n openshift-monitoring get pod -o wide | grep node-exporter
node-exporter-2tkxv                            2/2     Running   0          5h35m   10.0.188.67    ip-10-0-188-67.us-east-2.compute.internal    <none>           <none>
node-exporter-hbn65                            2/2     Running   0          5h31m   10.0.175.105   ip-10-0-175-105.us-east-2.compute.internal   <none>           <none>
node-exporter-prn9h                            2/2     Running   0          5h35m   10.0.143.178   ip-10-0-143-178.us-east-2.compute.internal   <none>           <none>
node-exporter-q4tsw                            2/2     Running   0          5h31m   10.0.192.42    ip-10-0-192-42.us-east-2.compute.internal    <none>           <none>
node-exporter-qx7dc                            2/2     Running   0          5h31m   10.0.139.166   ip-10-0-139-166.us-east-2.compute.internal   <none>           <none>
node-exporter-zrsnx                            2/2     Running   0          5h35m   10.0.210.137   ip-10-0-210-137.us-east-2.compute.internal   <none>           <none>

# oc -n openshift-monitoring get ds node-exporter -oyaml
...
status:
  currentNumberScheduled: 6
  desiredNumberScheduled: 6
  numberAvailable: 6
  numberMisscheduled: 0
  numberReady: 6
  observedGeneration: 1
  updatedNumberScheduled: 6

CMO reports monitoring as degraded because 4.8 counts every Ready node toward nodeReadyCount, regardless of whether the node has the kubernetes.io/os=linux label:

https://github.com/openshift/cluster-monitoring-operator/blob/release-4.8/pkg/client/client.go#L951-L959

The issue is fixed in 4.9 and later:

https://github.com/openshift/cluster-monitoring-operator/blob/release-4.9/pkg/client/client.go#L1052-L1061
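
The difference between the two counting behaviors can be sketched as follows. This is a simplified Python illustration with plain dictionaries, not the operator's Go code; the helper names (`is_ready`, `count_ready_nodes`, `make_node`) and the node data layout are hypothetical.

```python
def is_ready(node):
    """Return True if the node has a Ready condition with status "True"."""
    return any(
        cond["type"] == "Ready" and cond["status"] == "True"
        for cond in node["conditions"]
    )


def count_ready_nodes(nodes, node_selector=None):
    """Count Ready nodes whose labels match every key/value in node_selector."""
    selector = node_selector or {}
    return sum(
        1
        for node in nodes
        if is_ready(node)
        and all(node["labels"].get(key) == value for key, value in selector.items())
    )


def make_node(os_name):
    """Build a minimal Ready node with only the OS label set (hypothetical layout)."""
    return {
        "labels": {"kubernetes.io/os": os_name},
        "conditions": [{"type": "Ready", "status": "True"}],
    }


# Same shape as the cluster in this bug: 6 Linux nodes and 2 Windows nodes.
nodes = [make_node("linux") for _ in range(6)] + [make_node("windows") for _ in range(2)]

# Counting every Ready node expects 8 pods, which the DaemonSet can never reach.
print(count_ready_nodes(nodes))  # 8

# Applying the DaemonSet's nodeSelector gives the 6 pods that actually run.
print(count_ready_nodes(nodes, {"kubernetes.io/os": "linux"}))  # 6
```

With the selector applied, the expected count matches the DaemonSet's desiredNumberScheduled of 6 shown above.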

Version-Release number of selected component (if applicable):

4.8.0-0.nightly-2022-11-30-073158, deployed with the aos-4_8/ipi-on-aws/versioned-installer-ovn-winc-ci template, which creates a cluster with 3 Linux masters, 3 Linux workers, and 2 Windows workers.

How reproducible:

Deploy OCP 4.8 with both Linux and Windows workers.

Steps to Reproduce:

1. See the description.
2.
3.

Actual results:

Monitoring is degraded with the error:
waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 ready pods for "node-exporter" daemonset, got 6

Expected results:

Monitoring is not degraded.

Additional info:

If we don't want to fix the bug in 4.8, we can close this bug.

https://github.com/openshift/cluster-monitoring-operator/pull/1857

4.8.0-0.nightly-2023-07-31-231822

Changes from 4.7.60

Complete Features

Epic Goal

Why is this important?

Acceptance Criteria

Open questions:

Done Checklist

Epic Goal

Why is this important?

Scenarios

Acceptance Criteria

Dependencies (internal and external)

Previous Work (Optional):

Open questions:

Done Checklist

Number of replicas on different platforms

Epic Goal

Why is this important?

Scenarios

Acceptance Criteria

Dependencies (internal and external)

Previous Work (Optional):

Open questions:

Done Checklist

Epic Goal

Why is this important?

Acceptance Criteria

Previous Work (Optional):

Done Checklist

Feature Overview

Background, and strategic fit

Goal(s)

Documentation Considerations

Goal

User-stories

Requirements

References

Incomplete Features

Feature Overview

Goals

Requirements

In Scope

Out of Scope

Documentation Considerations

Feature Overview

Goals

Out of Scope

Requirements

Feature Overview

Goals

Requirements

(Optional) Use Cases

Questions to answer…

Out of Scope

Background, and strategic fit

Assumptions

Customer Considerations

Documentation Considerations

Epic Goal

Why is this important?

Scenarios

Acceptance Criteria

Dependencies (internal and external)

Previous Work (Optional):

Open questions:

Done Checklist

Complete Epics

Goal:

Why is it important?

Note:

Description

Acceptance Criteria

Additional Details:

Description

Acceptance Criteria

Additional Details:

Description

Acceptance Criteria

Additional Details: