Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
Today we expose two main APIs for HyperShift, namely `HostedCluster` and `NodePool`. We also have metrics to gauge adoption by reporting the number of hosted clusters and NodePools.
But we are still missing other metrics needed to draw correct inferences from what we see in the data.
Today, hypershift_hostedcluster_nodepools is exposed as a metric to report the number of NodePools used per hosted cluster.
Additional NodePool metrics such as hypershift_nodepools_size and hypershift_nodepools_available_replicas are available but not ingested in Telemetry.
In addition to knowing how many NodePools exist per hosted cluster, we would like to expose the NodePool size.
This will help inform our decision making and provide insight into how the product is being adopted and used.
The main goal of this epic is to show the following NodePools metrics on Telemeter, ideally as recording rules:
The implementation involves creating updates to the following GitHub repositories:
Similar PRs:
https://github.com/openshift/hypershift/pull/1544
https://github.com/openshift/cluster-monitoring-operator/pull/1710
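For illustration, recording rules of the kind mentioned above could be expressed as a PrometheusRule. The rule names, group name, and label sets below are assumptions (only the metric names come from the description above); the shipped rules live in the monitoring configuration referenced by the PRs:

  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    name: hypershift-nodepool-telemetry     # hypothetical name
    namespace: hypershift                   # hypothetical namespace
  spec:
    groups:
      - name: hypershift.nodepools.rules
        rules:
          # Aggregate NodePool size per hosted cluster before Telemetry ingestion
          - record: namespace_name:hypershift_nodepools_size:sum
            expr: sum by (namespace, name) (hypershift_nodepools_size)
          # Aggregate available replicas per hosted cluster
          - record: namespace_name:hypershift_nodepools_available_replicas:sum
            expr: sum by (namespace, name) (hypershift_nodepools_available_replicas)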
This feature is about providing workloads within an HCP KubeVirt cluster access to GPU devices. This is an important use case that expands usage of HCP KubeVirt to AI and ML workloads.
GOAL:
Support running workloads within HCP KubeVirt clusters which need access to GPUs.
Accomplishing this involves multiple efforts
Diagram of multiple nvidia operator layers
https://docs.google.com/document/d/1HwXVL_r9tUUwqDct8pl7Zz4bhSRBidwvWX54xqXaBwk/edit
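As a sketch of what exposing a GPU to an HCP KubeVirt node pool could look like, assuming the NodePool KubeVirt platform exposes a hostDevices list (the device name and sizing below are placeholders and require a matching permitted host device configuration on the infrastructure cluster):

  apiVersion: hypershift.openshift.io/v1beta1
  kind: NodePool
  metadata:
    name: gpu-nodepool              # hypothetical name
    namespace: clusters
  spec:
    clusterName: example
    replicas: 2
    management:
      upgradeType: Replace
    platform:
      type: KubeVirt
      kubevirt:
        compute:
          cores: 8
          memory: 32Gi
        hostDevices:                # assumed field for GPU passthrough
          - deviceName: nvidia.com/GV100GL_Tesla_V100   # placeholder resource name
            count: 1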
This will be covered by HCP doc team.
We start by contributing the documentation upstream to the hypershift repo, which is published at https://hypershift-docs.netlify.app/. Then we create a task for the ACM docs team to port those changes to the official documentation. They use our content as a seed for the official documentation. (Contact points are Laura Hinson, currently on parental leave, and Servesha Dudhgaonkar.)
Graduate the new PV access mode ReadWriteOncePod to GA.
Such a PV/PVC can be used only by a single pod on a single node, compared to the traditional ReadWriteOnce access mode, where a PV/PVC can be used on a single node by many pods.
The customers can start using the new ReadWriteOncePod access mode.
This new mode allows customers to provision and attach PV and get the guarantee that it cannot be attached to another local pod.
This new mode should support the same operations as regular ReadWriteOnce PVs therefore it should pass the regression tests. We should also ensure that this PV can't be accessed by another local-to-node pod.
As a user I want to attach a PV to a pod and ensure that it can't be accessed by another local pod.
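To illustrate the user story above, a claim requesting the new access mode looks like this (name, size, and storage class are placeholders):

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: single-writer-pvc        # hypothetical name
  spec:
    accessModes:
      - ReadWriteOncePod           # only one pod on one node may use the volume
    resources:
      requests:
        storage: 10Gi
    storageClassName: standard-csi # hypothetical storage class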
We are getting this feature from upstream as GA. We need to test it and fully support it.
Check that there are no limitations or regressions.
Remove tech preview warning. No additional change.
N/A
Support the upstream "new RWO access mode" feature in OCP as GA, i.e. test it and have docs for it.
This is continuation of STOR-1171 (Beta/Tech Preview in 4.14), now we just need to mark it as GA and remove all TechPreview notes from docs.
Currently the maximum number of snapshots per volume in vSphere CSI is set to 3 and cannot be configured. Customers find this default limit too low and are asking us to make this setting configurable.
The maximum number of snapshots is 32 per volume.
Customers can override the default (three) value and set it to a custom value.
Make sure we document (or link) the VMware recommendations in terms of performance.
https://kb.vmware.com/s/article/1025279
The setting can be easily configured by the OCP admin and the configuration is automatically updated. Test that the setting is indeed applied and that the maximum number of snapshots per volume is indeed changed.
No change in the default
As an OCP admin I would like to change the maximum number of snapshots per volume.
Anything outside of
The default value can't be overwritten; reconciliation prevents it.
Make sure the customers understand the impact of increasing the number of snapshots per volume.
https://kb.vmware.com/s/article/1025279
Document how to change the value as well as link to the best practices. Mention that there is a hard limit of 32. Document other limitations, if any.
N/A
Epic Goal*
The goal of this epic is to allow admins to configure the maximum number of snapshots per volume in vSphere CSI and to find a way to add such an extension to the OCP API.
Possible future candidates:
Why is this important? (mandatory)
Currently the maximum number of snapshots per volume in vSphere CSI is set to 3 and cannot be configured. Customers find this default limit too low and are asking us to make this setting configurable.
The maximum number of snapshots is 32 per volume.
https://kb.vmware.com/s/article/1025279
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
1) Write OpenShift enhancement (STOR-1759)
2) Extend ClusterCSIDriver API (TechPreview) (STOR-1803)
3) Update vSphere operator to use the new snapshot options (STOR-1804)
4) Promote feature from Tech Preview to Accessible-by-default (STOR-1839)
Contributing Teams (and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Configure the maximum number of snapshots to a higher value. Check that the config has been updated and verify that the maximum number of snapshots per volume maps to the new setting value.
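As a sketch of what the admin-facing configuration could look like, assuming the ClusterCSIDriver extension ends up exposing a field along these lines (the field name and value are assumptions until the API lands):

  apiVersion: operator.openshift.io/v1
  kind: ClusterCSIDriver
  metadata:
    name: csi.vsphere.vmware.com
  spec:
    driverConfig:
      driverType: vSphere
      vSphere:
        # Assumed field name for the new snapshot option; hard upper bound is 32
        globalMaxSnapshotsPerBlockVolume: 10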
Drawbacks or Risk (optional)
Setting this option to a high value can introduce performance issues. This needs to be documented.
https://kb.vmware.com/s/article/1025279
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Support the SMB CSI driver through an OLM operator as Tech Preview. The SMB CSI driver allows OCP to consume SMB/CIFS storage with a dynamic CSI driver. This enables customers to leverage their existing storage infrastructure with either Samba or Microsoft environments.
https://github.com/kubernetes-csi/csi-driver-smb
Customers can start testing connecting OCP to their backends exposing CIFS. This allows them to consume net-new volumes or existing data produced outside OCP.
The driver already exists and is under the storage SIG umbrella. We need to make sure the driver meets OCP quality requirements and, if so, develop an operator to deploy and maintain it.
Review and clearly define all driver limitations and corner cases.
Review the different authentication methods.
Windows containers support.
Only the storage class login/password authentication method. Other methods can be reviewed and considered for GA.
Customers are expecting to consume storage, and possibly existing data, via SMB/CIFS. As of today, vendors' driver support for CIFS is really limited, whereas this protocol is widely used on premises, especially by MS/AD customers.
Need to understand what customers expect in terms of authentication.
How to extend this feature to windows containers.
Document the operator and driver installation, usage capabilities and limitations.
Future: How to manage interoperability with windows containers (not for TP)
Graduate the SMB CSI driver and its operator to GA
The Goal is to write an operator to deploy and maintain the SMB CSI driver
https://github.com/kubernetes-csi/csi-driver-smb
Authentication will be limited to a secret in the storage class. NTLM-style authentication only; no Kerberos support until we have it officially supported and documented. This limits the CSI driver to non-FIPS environments.
We're also excluding support for DFS (Distributed File System) at GA; we will look at possible support in a future OCP release.
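For illustration, a StorageClass using the secret-based (NTLM) authentication described above could look like the following; the share path, secret names, and namespace are placeholders based on the upstream csi-driver-smb examples:

  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: smb-csi                               # hypothetical name
  provisioner: smb.csi.k8s.io
  parameters:
    source: //smb-server.example.com/share      # placeholder SMB share
    csi.storage.k8s.io/provisioner-secret-name: smb-creds
    csi.storage.k8s.io/provisioner-secret-namespace: openshift-cluster-csi-drivers
    csi.storage.k8s.io/node-stage-secret-name: smb-creds
    csi.storage.k8s.io/node-stage-secret-namespace: openshift-cluster-csi-drivers
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
  mountOptions:
    - dir_mode=0777
    - file_mode=0777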
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
The operator and driver meet the GA quality criteria. We have a good way to deploy a CIFS backend for CI/testing.
Identify all upstream capabilities and limitations. Define what we support at GA.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Self-managed |
Classic (standalone cluster) | Yes |
Hosted control planes | Should work |
Multi node, Compact (three node), or Single node (SNO), or all | all |
Connected / Restricted Network | yes |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_x64 |
Operator compatibility | OLM |
Backport needed (list applicable versions) | No |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | No |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
We have several customer requests to allow pods to access shared storage exposed as SMB/CIFS. This can be because of already existing data generated outside OCP, or because the customer's environment already integrates an AD/SMB NAS infrastructure. This is fairly common in on-prem environments.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
How do we automatically deploy an SMB server for automated testing?
What authentication method will we support? - NTLM style only
High-level list of items that are out of scope. Initial completion during Refinement status.
Support of SMB server
Authentication beyond the default method (which references secrets in the StorageClass) and static provisioning; NTLM style only.
No Kerberos support until we have it officially supported and documented. This limits the CSI driver to non-FIPS environments.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
The Windows containers team can't directly leverage this work at the moment because they can't ship CSI drivers for Windows.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Customers may want to run these on FIPS-enabled clusters, which requires Kerberos authentication, as NTLM is not FIPS compliant. Unfortunately there is no official OCP Kerberos support today. This will be reassessed when we have it.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Reuse the TP doc and remove the TP warning. Change any delta content between TP and GA. Be explicit on the supported authentication (NTLM, no FIPS) and the Samba / Windows versions supported.
We're also excluding support for DFS (Distributed File System) at GA; we will look at possible support in a future OCP release.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Customers using Windows containers may be interested in this feature.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
The Azure File CSI driver currently lacks cloning and snapshot restore features. The goal of this feature is to support the cloning feature as Technology Preview. This will help support snapshot restore in a future release.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
As a user I want to easily clone Azure File volume by creating a new PVC with spec.DataSource referencing origin volume.
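To illustrate the user story above, a clone is simply a new PVC whose dataSource references the origin volume (names, size, and storage class are placeholders):

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: cloned-pvc                 # hypothetical name
  spec:
    accessModes:
      - ReadWriteMany
    storageClassName: azurefile-csi
    resources:
      requests:
        storage: 100Gi               # must be >= the source PVC size
    dataSource:
      kind: PersistentVolumeClaim
      name: source-pvc               # the origin Azure File volume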
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
This feature only applies to OCP running on Azure / ARO and File CSI.
The usual CSI cloning CI must pass.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | yes |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | all although SNO is rare on Azure |
Connected / Restricted Network | both |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86 |
Operator compatibility | Azure File CSI operator |
Backport needed (list applicable versions) | No |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | No |
Other (please specify) | ship downstream images built from the forked azcopy |
High-level list of items that are out of scope. Initial completion during Refinement status.
Restoring snapshots is out of scope for now.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Update the CSI capability matrix and any language that mentions that Azure File CSI does not support cloning.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
No direct impact, but this benefits Azure / ARO customers.
Epic Goal*
Azure File added support for cloning volumes, which relies on the azcopy command upstream. We need to fork azcopy so we can build and ship downstream images from the forked azcopy. The AWS driver does the same with efs-utils.
Upstream repo: https://github.com/Azure/azure-storage-azcopy
NOTE: using snapshots as a source is currently not supported: https://github.com/kubernetes-sigs/azurefile-csi-driver/blob/7591a06f5f209e4ef780259c1631608b333f2c20/pkg/azurefile/controllerserver.go#L732
Why is this important? (mandatory)
This is required for adding Azure File cloning feature support.
Scenarios (mandatory)
1. As a user I want to easily clone Azure File volume by creating a new PVC with spec.DataSource referencing origin volume.
Dependencies (internal and external) (mandatory)
1) Write OpenShift enhancement (STOR-1757)
2) Fork upstream repo (STOR-1716)
3) Add ART definition for OCP Component (STOR-1755)
4) Use the new image as base image for Azure File driver (STOR-1794)
5) Ensure e2e cloning tests are in CI (STOR-1818)
Contributing Teams (and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Downstream Azure File driver image must include azcopy and cloning feature must be tested.
Drawbacks or Risk (optional)
No risks detected so far.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Once azure-file cloning is supported, we should add a clone test to our pre-submit/periodic CI.
The "pvcDataSource: true" should be added.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
In order to remove IPI/UPI support for Alibaba Cloud in OpenShift (currently Tech Preview, see also OCPSTRAT-1042), we need to provide an alternate method for Alibaba Cloud customers to spin up an OpenShift cluster. To that end, we want customers to use Assisted Installer with platform=none (and later platform=external) to bring up their OpenShift clusters.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Self-managed |
Classic (standalone cluster) | Classic |
Hosted control planes | N/A |
Multi node, Compact (three node), or Single node (SNO), or all | Multi-node |
Connected / Restricted Network | Connected for OCP 4.16 (Future: restricted) |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_x64 |
Operator compatibility | This should be the same for any operator on platform=none |
Backport needed (list applicable versions) | OpenShift 4.16 onwards |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | Hybrid Cloud Console changes needed |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
For OpenShift 4.16, we want to remove IPI support (currently Tech Preview) for Alibaba Cloud (OCPSTRAT-1042). Instead, we want customers to use the Assisted Installer (Tech Preview) with the agnostic platform for Alibaba Cloud in OpenShift 4.16 (OCPSTRAT-1149).
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Previous UPI-based installation doc: Alibaba Cloud Red Hat OpenShift Container Platform 4.6 Deployment Guide
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
As an Alibaba Cloud customer, I want to create an OpenShift cluster with the Assisted Installer using the agnostic platform (platform=none) for connected deployments.
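For reference, the agnostic platform corresponds to platform "none" in the installer configuration. The Assisted Installer drives this through its UI/API rather than a hand-written install-config, but a minimal sketch of the equivalent configuration (all values are placeholders) looks like:

  apiVersion: v1
  baseDomain: example.com            # placeholder domain
  metadata:
    name: alibaba-agnostic           # placeholder cluster name
  platform:
    none: {}                         # agnostic platform, no cloud integration
  controlPlane:
    name: master
    replicas: 3
  compute:
    - name: worker
      replicas: 2
  networking:
    networkType: OVNKubernetes
  pullSecret: '<pull-secret>'
  sshKey: '<ssh-public-key>'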
Enable Hosted Control Planes guest clusters to support up to 500 worker nodes. This enables customers to have clusters with a large number of worker nodes.
Max cluster size 250+ worker nodes (mainly about control plane). See XCMSTRAT-371 for additional information.
Service components should not be overwhelmed by additional customer workloads; they should use larger cloud instances when the worker node count exceeds the threshold, and smaller cloud instances when it is below the threshold.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Managed |
Classic (standalone cluster) | N/A |
Hosted control planes | Yes |
Multi node, Compact (three node), or Single node (SNO), or all | N/A |
Connected / Restricted Network | Connected |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_64 ARM |
Operator compatibility | N/A |
Backport needed (list applicable versions) | N/A |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | N/A |
Other (please specify) |
Check OCM and CAPI requirements to expose a larger worker node count.
As a service provider, I want to be able to:
so that I can achieve
Description of criteria:
This does not require a design proposal.
This does not require a feature gate.
crio-wipe is an existing feature in OpenShift. When a node reboots, crio-wipe clears the node of all images so that the node boots clean. When the node comes back up, it needs access to the image registry to pull all images again, which takes time. In telco and edge situations the node might not have access to the image registry, and therefore takes longer to come up.
The goal of this feature is to adjust crio-wipe to wipe only the images that were corrupted by the sudden reboot, not all images.
Phase 2 of the enclave support for oc-mirror with the following goals
For 4.17 timeframe
Adding nodes to on-prem clusters in OpenShift is, in general, a complex task. We have numerous methods, and the field keeps adding automation around these methods with a variety of solutions, sometimes unsupported (see "Why is this important" below). Making cluster expansions easier will let users add nodes often and fast, leading to a much improved UX.
This feature adds nodes to any on-prem cluster, regardless of its installation method (UPI, IPI, Assisted, Agent), by booting an ISO image that will add the node to the cluster specified by the user.
1. Create image:
$ export KUBECONFIG=kubeconfig-of-target-cluster
$ oc adm node-image -o agent.iso --network-data=worker-n.nmstate --role=worker
2. Boot image
3. Check progress
$ oc adm add-node
An important goal of this feature is to unify and eliminate some of the existing options to add nodes, aiming to provide a much simpler experience (see "Why is this important" below). We have official and field-documented ways to do this that could be removed once this feature is in place, simplifying the experience, our docs, and the maintenance of said official paths:
With this proposed workflow we eliminate the need to use the UPI method in the vast majority of cases. We also eliminate the field-documented methods that keep popping up trying to solve this in multiple formats, and the need to recommend using MCE to all on-prem users, and finally we add a simpler option for IPI-deployed clusters.
In addition, all the built-in validations in the assisted service would be run, improving the installation success rate and overall UX.
This work would have an initial impact on bare metal, vSphere, Nutanix and platform-agnostic clusters, regardless of how they were installed.
This feature is essential for several reasons. Firstly, it enables easy day2 installation without burdening the user with additional technical knowledge. This simplifies the process of scaling the cluster resources with new nodes, which today is overly complex and presents multiple options (https://docs.openshift.com/container-platform/4.13/post_installation_configuration/cluster-tasks.html#adding-worker-nodes_post-install-cluster-tasks).
Secondly, it establishes a unified experience for expanding clusters, regardless of their installation method. This streamlines the deployment process and enhances user convenience.
Another advantage is the elimination of the requirement to install the Multicluster Engine and Infrastructure Operator, which, besides demanding additional system resources, are overkill for use cases where the user simply wants to add nodes to their existing cluster but isn't managing multiple clusters yet. This results in a more efficient and lightweight cluster scaling experience.
Additionally, in the case of IPI-deployed bare metal clusters, this feature eradicates the need for nodes to have a Baseboard Management Controller (BMC) available, simplifying the expansion of bare metal clusters.
Lastly, this problem is often brought up in the field, where examples of different custom solutions have been put in place by redhatters working with customers trying to solve the problem with custom automations, adding to inconsistent processes to scale clusters.
This feature will solve the problem of cluster expansion for OCI. OCI doesn't have MAPI, and CAPI isn't in the mid-term plans. Mitsubishi shared feedback that makes solving the lack of cluster expansion a requirement for Red Hat and Oracle.
We already have the basic technologies to do this with the assisted-service and the agent-based installer, which already do this work for new clusters, and from which we expect to leverage the foundations for this feature.
Day 2 node addition with agent image.
Yet Another Day 2 Node Addition Commands Proposal
Enable day2 add node using agent-install: AGENT-682
Add an integration test to verify that the add-nodes command generates the ISO correctly.
Review the proper usage and download of the envtest-related binaries (api-server and etcd).
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Enable GCP Workload Identity Webhook
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Provide the GCP workload identity webhook so a customer can more easily configure their applications to use the service account tokens minted by clusters that use GCP Workload Identity.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Both, the scope of this is for self-managed |
Classic (standalone cluster) | Classic |
Hosted control planes | N/A |
Multi node, Compact (three node), or Single node (SNO), or all | All |
Connected / Restricted Network | All |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_x64 |
Operator compatibility | TBD |
Backport needed (list applicable versions) | N/A |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | TBD |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
Just like AWS STS and ARO Entra Workload ID, we want to provide the GCP workload identity webhook so a customer can more easily configure their applications to use the service account tokens minted by clusters that use GCP Workload Identity.
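Conceptually, the webhook automates the kind of bound service account token projection shown below; the audience, paths, and names here are illustrative assumptions, not the webhook's actual output:

  apiVersion: v1
  kind: Pod
  metadata:
    name: gcp-wif-example            # hypothetical name
  spec:
    serviceAccountName: my-app       # hypothetical service account
    containers:
      - name: app
        image: registry.example.com/my-app:latest   # placeholder image
        volumeMounts:
          - name: bound-sa-token
            mountPath: /var/run/secrets/workload-identity
            readOnly: true
    volumes:
      - name: bound-sa-token
        projected:
          sources:
            - serviceAccountToken:
                audience: openshift          # assumed audience value
                expirationSeconds: 3600
                path: token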
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Will require the following:
Background
Once we have forked the webhook, we need to configure the operator to deploy it, similar to how we do for the other platforms.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer - specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision GCP infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Description of problem:
installing into Shared VPC stuck in waiting for network infrastructure ready
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-10-225505
How reproducible:
Always
Steps to Reproduce:
1. "create install-config" and then insert Shared VPC settings (see [1]) 2. activate the service account which has the minimum permissions in the host project (see [2]) 3. "create cluster" FYI The GCP project "openshift-qe" is the service project, and the GCP project "openshift-qe-shared-vpc" is the host project.
Actual results:
1. Getting stuck waiting for network infrastructure to become ready, until Ctrl+C is pressed.
2. Two firewall-rules are created in the service project unexpectedly (see [3]).
Expected results:
The installation should succeed, and no firewall rules should be created in either the service project or the host project.
Additional info:
Description of problem:
After successful installation of an IPI or UPI cluster using minimum permissions, when destroying the cluster it keeps reporting the error "failed to list target tcp proxies: googleapi: Error 403: Required 'compute.regionTargetTcpProxies.list' permission" unexpectedly.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-01-175607
How reproducible:
Always
Steps to Reproduce:
1. Try IPI or UPI installation using minimum permissions, and make sure it succeeds
2. Destroy the cluster using the same GCP credentials
Actual results:
It keeps reporting the errors below until timeout.
08-27 14:51:40.508 level=debug msg=Target TCP Proxies: failed to list target tcp proxies: googleapi: Error 403: Required 'compute.regionTargetTcpProxies.list' permission for 'projects/openshift-qe', forbidden
...output omitted...
08-27 15:08:18.801 level=debug msg=Target TCP Proxies: failed to list target tcp proxies: googleapi: Error 403: Required 'compute.regionTargetTcpProxies.list' permission for 'projects/openshift-qe', forbidden
Expected results:
It should not try to list regional target TCP proxies, because the CAPI-based installation only creates a global target TCP proxy. Also, the service account given to the installer already has the required compute.targetTcpProxies permissions (see [1] and [2]).
Additional info:
FYI the latest IPI Prow CI test was about 19 days ago, with no such issue; see https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-gcp-ipi-mini-perm-custom-type-f28/1823483536926052352
Required GCP permissions for installer-provisioned infrastructure: https://docs.openshift.com/container-platform/4.16/installing/installing_gcp/installing-gcp-account.html#minimum-required-permissions-ipi-gcp_installing-gcp-account
Required GCP permissions for user-provisioned infrastructure: https://docs.openshift.com/container-platform/4.16/installing/installing_gcp/installing-gcp-user-infra.html#minimum-required-permissions-upi-gcp_installing-gcp-user-infra
Description of problem:
Shared VPC installation using a service account that has all required permissions failed because the ingress cluster operator degraded, reporting the error "error getting load balancer's firewall: googleapi: Error 403: Required 'compute.firewalls.get' permission for 'projects/openshift-qe-shared-vpc/global/firewalls/k8s-fw-a5b1f420669b3474d959cff80e8452dc'".
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-multi-2024-08-07-221959
How reproducible:
Always
Steps to Reproduce:
1. "create install-config", then insert the interested settings (see [1]) 2. "create cluster" (see [2])
Actual results:
Installation failed because the ingress cluster operator degraded (see [2] and [3]).
$ oc get co ingress
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress             False       True          True       113m    The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: error getting load balancer's firewall: googleapi: Error 403: Required 'compute.firewalls.get' permission for 'projects/openshift-qe-shared-vpc/global/firewalls/k8s-fw-a5b1f420669b3474d959cff80e8452dc', forbidden...
In fact, the mentioned k8s firewall-rule doesn't exist in the host project (see [4]), and the given service account does have enough permissions (see [6]).
Expected results:
Installation succeeds, and all cluster operators are healthy.
Additional info:
Document all implementation steps and requirements to configure RHOSO's telemetry-operator to scrape an arbitrary external endpoint (which would be, in our case, ACM's monitoring operator in OpenShift) to add metrics.
Identify the minimum required access level to add scraping endpoints and OpenShift UI dashboards.
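As a generic illustration of the end state (the telemetry-operator's own API for this is exactly what the documentation task above must capture), an extra Prometheus scrape job against an external ACM endpoint could look like the following; every endpoint, token path, and selector below is a placeholder:

  scrape_configs:
    - job_name: acm-observability            # hypothetical job name
      scheme: https
      metrics_path: /federate                # assumed federation endpoint
      params:
        'match[]':
          - '{__name__=~"acm_.*"}'           # placeholder metric selector
      bearer_token_file: /etc/prometheus/secrets/acm-token/token   # placeholder token mount
      tls_config:
        insecure_skip_verify: true           # placeholder; use a proper CA in practice
      static_configs:
        - targets:
            - observatorium-api.example.com:443   # placeholder ACM endpoint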
The objective is to create a comprehensive backup and restore mechanism for HCP OpenShift Virtualization Provider. This feature ensures both the HCP state and the worker node state are backed up and can be restored efficiently, addressing the unique requirements of KubeVirt environments.
The HCP team has delivered OADP backup and restore steps for the Agent and AWS providers here. We need to add the steps necessary to make this work for HCP KubeVirt clusters.
Document this process in the upstream hypershift documentation.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Allow customers to enable EFS CSI usage metrics.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
OCP already supports exposing CSI usage metrics; however, the EFS metrics are not enabled by default. The goal of this feature is to allow customers to optionally turn on EFS CSI usage metrics in order to see them in the OCP console.
The EFS metrics are not enabled by default for a good reason: they can potentially impact performance. They are disabled in OCP because the CSI driver would walk through the whole volume, and that can be very slow on large volumes. For this reason, the default will remain the same (no metrics); customers need to explicitly opt in.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Clear procedure on how to enable it as a day 2 operation. Default remains no metrics. Once enabled the metrics should be available for visualisation.
We should also have a way to disable metrics.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | yes |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | AWS only |
Connected / Restricted Network | both |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all AWS/EFS supported |
Operator compatibility | EFS CSI operator |
Backport needed (list applicable versions) | No |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | Should appear in OCP UI automatically |
Other (please specify) | OCP on AWS only |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
As an OCP user I want to be able to visualise the EFS CSI metrics.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
Additional metrics
Enabling metrics by default.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
Customer request as per
https://issues.redhat.com/browse/RFE-3290
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
We need to be extra clear on the potential performance impact
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Document how to enable CSI metrics + warning about the potential performance impact.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
It can benefit any cluster on AWS using EFS CSI including ROSA
Epic Goal*
The goal of this epic is to provide a way for admins to turn on EFS CSI usage metrics. Since this could impact performance (the CSI driver walks through the whole volume), this option will not be enabled by default; admins will need to explicitly opt in.
Why is this important? (mandatory)
Turning on EFS metrics allows users to monitor how much EFS space is being used by OCP.
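Once enabled, usage can be inspected with standard kubelet volume-stats queries such as the one below; the namespace selector is a placeholder, and whether EFS volumes report these stats at all depends on the opt-in described here:

  sum by (namespace, persistentvolumeclaim) (kubelet_volume_stats_used_bytes{namespace="my-app"})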
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
None
Contributing Teams (and contacts) (mandatory)
Acceptance Criteria (optional)
Enable CSI metrics via the operator - ensure the driver is started with the proper cmdline options. Verify that the metrics are sent and exposed to the users.
Drawbacks or Risk (optional)
Metrics are calculated by walking through the whole volume, which can impact performance. For this reason, enabling CSI metrics will need an explicit opt-in from the admin. This risk needs to be explicitly documented.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
As a product manager or business owner of OpenShift Lightspeed, I want to track who is using which feature of OLS and why. I also want to track the product adoption rate so that I can make decisions about the product (add/remove features, add new investment).
Enable monitoring of OLS by default when a user installs the OLS operator ---> check the box by default
Users will have the ability to disable the monitoring ----> by unchecking the box
Refer to this Slack conversation: https://redhat-internal.slack.com/archives/C068JAU4Y0P/p1723564267962489
AWS CAPI implementation supports "Tenancy" configuration option: https://pkg.go.dev/sigs.k8s.io/cluster-api-provider-aws@v1.5.0/api/v1beta1#AWSMachineSpec
This option corresponds to functionality OCP currently exposes through MAPI:
This option is currently in use by existing ROSA customers, and will need to be exposed in HyperShift NodePools
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This does not require a design proposal.
This requires a feature gate.
Wrap the NodePool tenancy API field in a struct, to group placement options and make it easy to add new ones to the API in the future.
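A sketch of the resulting API shape, assuming tenancy ends up grouped under a placement struct on the AWS NodePool platform (field names and values may differ in the final API):

  apiVersion: hypershift.openshift.io/v1beta1
  kind: NodePool
  metadata:
    name: example-nodepool          # hypothetical name
    namespace: clusters
  spec:
    clusterName: example
    replicas: 2
    management:
      upgradeType: Replace
    platform:
      type: AWS
      aws:
        instanceType: m5.xlarge
        placement:                  # assumed wrapper struct for placement options
          tenancy: dedicated        # e.g. default | dedicated | host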
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This does not require a design proposal.
This requires a feature gate.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Introduce snapshot support for Azure File as Tech Preview.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
After introducing cloning support in 4.17, the goal of this epic is to add the last remaining piece: snapshot support as Tech Preview.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Should pass all the regular CSI snapshot tests. All failing or known issues should be documented in the release notes. Since this feature is TP, we can still introduce it with known issues.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | yes |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | all with Azure |
Connected / Restricted Network | all |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all |
Operator compatibility | Azure File CSI |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | Already covered |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
As an OCP on Azure user I want to perform snapshots of my PVC and be able to restore them as a new PVC.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
Are there any known issues? If so, they should be documented.
High-level list of items that are out of scope. Initial completion during Refinement status.
N/A
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
We have support for CSI snapshots with other cloud providers; we need to align capabilities in Azure with their File CSI. Upstream support has lagged.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
User experience should be the same as other CSI drivers.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Add snapshot support in the CSI driver table, if there is any specific information to add, include it in the Azure File CSI driver doc. Any known issue should be documented in the RN.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Can be leveraged by ARO or OSD on Azure.
Epic Goal*
Add support for snapshots in Azure File.
Why is this important? (mandatory)
We should track upstream issues and ensure enablement in OpenShift. Snapshots are a standard feature of CSI and the reason we did not support it until now was lacking upstream support for snapshot restoration.
Snapshot restore feature was added recently in upstream driver 1.30.3 which we rebased to in 4.17 - https://github.com/kubernetes-sigs/azurefile-csi-driver/pull/1904
Furthermore, we already included the azcopy CLI, which is a dependency of cloning (and snapshots). Enabling snapshots in 4.17 is therefore just a matter of adding a sidecar, a VolumeSnapshotClass, and RBAC in csi-operator, which is cheap compared to the gain.
However, we've observed a few issues with cloning that might need further fixes to be able to graduate to GA, and we intend to release the cloning feature as Tech Preview in 4.17. Since snapshots are implemented with azcopy too, we expect similar issues and suggest releasing the snapshot feature also as Tech Preview first in 4.17.
Scenarios (mandatory)
Users should be able to create a snapshot and restore PVC from snapshots.
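For illustration, the snapshot-and-restore flow uses the standard CSI snapshot API; the class and resource names below are placeholders:

  apiVersion: snapshot.storage.k8s.io/v1
  kind: VolumeSnapshot
  metadata:
    name: azurefile-snap                         # hypothetical name
  spec:
    volumeSnapshotClassName: csi-azurefile-vsc   # placeholder snapshot class
    source:
      persistentVolumeClaimName: source-pvc
  ---
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: restored-pvc                           # new PVC restored from the snapshot
  spec:
    accessModes:
      - ReadWriteMany
    storageClassName: azurefile-csi
    resources:
      requests:
        storage: 100Gi
    dataSource:
      kind: VolumeSnapshot
      apiGroup: snapshot.storage.k8s.io
      name: azurefile-snap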
Dependencies (internal and external) (mandatory)
azcopy - already added in scope of cloning epic
upstream driver support for snapshot restore - already added via 4.17 rebase
Contributing Teams (and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
This feature only covers the downstream MAPI work to enable Capacity Blocks.
Capacity Blocks are needed in managed OpenShift (ROSA with Hosted Control Planes) via CAPI. Once the HCP feature and the OCM feature are completed, a Service Consumer can use upstream CAPI to set capacity reservations in a ROSA+HCP cluster.
Epic to track work done in https://github.com/openshift/machine-api-provider-aws/pull/110
Enable sharing ConfigMap and Secret across namespaces
Requirement | Notes | isMvp? |
---|---|---|
Secrets and ConfigMaps can get shared across namespaces | YES |
NA
NA
Consumption of RHEL entitlements has been a challenge on OCP 4 since it moved to a cluster-based entitlement model, compared to the node-based (RHEL subscription manager) entitlement model. In order to provide a sufficiently similar experience to OCP 3, the entitlement certificates that are made available on the cluster (OCPBU-93) should be shared across namespaces in order to prevent the need for the cluster admin to copy these entitlements into each namespace, which leads to additional operational challenges for updating and refreshing them.
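The sharing mechanism behind this is the Shared Resource CSI driver; a sketch of a SharedSecret pointing at the entitlement secret follows (resource names are placeholders, and the Tech Preview API group/version shown may change):

  apiVersion: sharedresource.openshift.io/v1alpha1
  kind: SharedSecret
  metadata:
    name: shared-entitlements             # hypothetical name
  spec:
    secretRef:
      name: etc-pki-entitlement           # placeholder entitlement secret
      namespace: openshift-config-managed # placeholder source namespace

A pod in another namespace then consumes it through a CSI volume using the csi.sharedresource.openshift.io driver with a volumeAttributes entry referencing the SharedSecret, subject to RBAC that permits using the shared resource.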
Questions to be addressed:
* What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
* Does this feature have doc impact?
* New Content, Updates to existing content, Release Note, or No Doc Impact
* If unsure and no Technical Writer is available, please contact Content Strategy.
* What concepts do customers need to understand to be successful in [action]?
* How do we expect customers will use the feature? For what purpose(s)?
* What reference material might a customer want/need to complete [action]?
* Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
* What is the doc impact (New Content, Updates to existing content, or Release Note)?
Epic Goal*
Remove the Shared Resource CSI Driver as a tech preview feature.
Why is this important? (mandatory)
Shared Resources was originally introduced as a tech preview feature in OpenShift Container Platform. After extensive review, we have decided to GA this component through the Builds for OpenShift layered product.
Expected GA will be alongside OpenShift 4.16. Therefore, it is safe to remove in OpenShift 4.17.
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Drawbacks or Risk (optional)
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Ensure CSI Stack for Azure is running on management clusters with hosted control planes, allowing customers to associate a cluster as "Infrastructure only" and move the following parts of the stack:
This feature enables customers to run their Azure infrastructure more efficiently and cost-effectively by using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. Additionally, customers care most about their workloads, not the management stack used to operate their clusters; this feature gets us closer to that goal.
Non-CSI Stack for Azure-related functionalities are out of scope for this feature.
Workload identity authentication is not covered by this feature - see STOR-1748
This feature is designed to enable customers to run their Azure infrastructure more efficiently and cost-effectively by using HyperShift control planes and supporting infrastructure without incurring additional charges from Red Hat.
Documentation for this feature should provide clear instructions on how to enable the CSI Stack for Azure on management clusters with hosted control planes and associate a cluster as "Infrastructure only." It should also include instructions on how to move the Azure Disk CSI driver, Azure File CSI driver, and Azure File CSI driver operator to the appropriate clusters.
This feature impacts the CSI Stack for Azure and any layered products that interact with it. Interoperability test scenarios should be factored by the layered products.
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Run Azure File CSI driver operator + Azure File CSI driver control-plane Pods in the management cluster, run the driver DaemonSet in the hosted cluster allowing customers to associate a cluster as "Infrastructure only".
Why is this important? (mandatory)
This allows customers to run their Azure infrastructure more efficiently and cost-effectively by using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. Additionally, customers care most about their workloads, not the management stack used to operate their clusters; this feature gets us closer to that goal.
Scenarios (mandatory)
When leveraging Hosted Control Planes, the Azure File CSI driver operator and the Azure File CSI driver control-plane Pods should run in the management cluster. The driver DaemonSet should run in the hosted cluster. This deployment model should provide the same feature set as a regular OCP deployment.
Dependencies (internal and external) (mandatory)
Hosted control plane on Azure.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
As part of this story, we will simply move the building and CI of the existing code to the combined csi-operator repository.
We need to modify csi-operator so that it can run as the azure-file operator on both HyperShift and standalone clusters.
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Run Azure Disk CSI driver operator + Azure Disk CSI driver control-plane Pods in the management cluster, run the driver DaemonSet in the hosted cluster allowing customers to associate a cluster as "Infrastructure only".
Why is this important? (mandatory)
This allows customers to run their Azure infrastructure more efficiently and cost-effectively by using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. Additionally, customers care most about their workloads, not the management stack used to operate their clusters; this feature gets us closer to that goal.
Scenarios (mandatory)
When leveraging Hosted Control Planes, the Azure Disk CSI driver operator and the Azure Disk CSI driver control-plane Pods should run in the management cluster. The driver DaemonSet should run in the hosted cluster. This deployment model should provide the same feature set as a regular OCP deployment.
Dependencies (internal and external) (mandatory)
Hosted control plane on Azure.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Done - Checklist (mandatory)
As part of this epic, engineers working on Azure HyperShift should be able to build and use Azure Disk storage on HyperShift guests via developer preview custom build images.
For this story, we are going to enable deployment of the Azure Disk CSI driver and its operator by default in the HyperShift environment.
Placeholder epic to capture all Azure tickets.
TODO: review.
As an end user of a hypershift cluster, I want to be able to:
so that I can achieve
From slack thread: https://redhat-external.slack.com/archives/C075PHEFZKQ/p1722615219974739
We need 4 different certs:
Unify and update hosted control planes storage operators so that they have similar code patterns and can run properly in both standalone OCP and HyperShift's control plane.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Epic Goal*
Our current design of the EBS driver operator to support HyperShift does not scale well to other drivers. The existing design will lead to more code duplication between driver operators and a greater possibility of errors.
Why is this important? (mandatory)
An improved design will allow more storage drivers and their operators to be added to hypershift without requiring significant changes in the code internals.
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Finally switch both CI and ART to the refactored aws-ebs-csi-driver-operator.
The functionality and behavior should be the same as the existing operator; however, the code is completely new. There could be some rough edges. See https://github.com/openshift/enhancements/blob/master/enhancements/storage/csi-driver-operator-merge.md
CI should catch the most obvious errors; however, we need to test features that we do not cover in CI, such as:
Our CSI driver YAML files are mostly copy-pasted from the initial CSI driver (AWS EBS?).
As an OCP engineer, I want the YAML files to be generated so that we can easily keep consistency among the CSI drivers and make them less error-prone.
It should have no visible impact on the resulting operator behavior.
Support deploying an OpenShift cluster across multiple vSphere clusters, i.e. configuring multiple vCenter servers in one OpenShift cluster.
Multiple vCenter support in the Cloud Provider Interface (CPI) and the Cloud Storage Interface (CSI).
Customers want to deploy OpenShift across multiple vSphere clusters (vCenters) primarily for high availability.
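For illustration only (the server names, datacenters, and paths below are made up), a multi-vCenter install-config would roughly follow the existing vSphere schema, with more than one entry under vcenters and failure domains referencing each server:
platform:
  vsphere:
    vcenters:
    - server: vcenter-1.example.com
      user: administrator@vsphere.local
      password: example-password
      datacenters:
      - dc-east
    - server: vcenter-2.example.com
      user: administrator@vsphere.local
      password: example-password
      datacenters:
      - dc-west
    failureDomains:
    - name: fd-east
      server: vcenter-1.example.com
      region: east
      zone: east-1a
      topology:
        datacenter: dc-east
        computeCluster: /dc-east/host/cluster-1
        datastore: /dc-east/datastore/ds-1
        networks:
        - vm-network-1
    - name: fd-west
      server: vcenter-2.example.com
      region: west
      zone: west-1a
      topology:
        datacenter: dc-west
        computeCluster: /dc-west/host/cluster-1
        datastore: /dc-west/datastore/ds-1
        networks:
        - vm-network-1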
This section contains all the test cases that we need to make sure work as part of the done^3 criteria.
This section contains all scenarios that are considered out of scope for this enhancement that will be done via a separate epic / feature / story.
For this task, we need to create a new periodic job that will test the multi-vCenter feature.
Add authentication to the internal components of the Agent Installer so that the cluster install is secure.
Requirements
Are there any requirements specific to the auth token?
Actors:
Do we need more than one auth scheme?
Agent-admin - agent-read-write
Agent-user - agent-read
Options for Implementation:
As a user, when creating node ISOs, I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.
This includes (but is not limited to):
Operators:
EOL, do not upgrade:
Create a GCP cloud-specific spec.resourceTags entry in the infrastructure CRD. This should create and update tags (labels in GCP) on any OpenShift cloud resource that we create and manage. The behaviour should also tag existing resources that do not yet have the tags, and once the tags in the infrastructure CRD are changed, all the resources should be updated accordingly.
Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.
Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").
Once we are confident that all components are updated, we should introduce an end-to-end test that makes sure we never create resources that are untagged.
Goals
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
List any affected packages or components.
The TechPreview featureSet check added in the installer for userLabels and userTags should be removed, and the TechPreview reference made in the install-config GCP schema should also be removed.
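For reference, a hedged sketch of what a GA install-config could then carry without any featureSet gating (field values are illustrative, and the exact schema should be confirmed against the install-config documentation):
platform:
  gcp:
    projectID: my-project
    region: us-central1
    userLabels:
    - key: team
      value: ocp-dev
    userTags:
    - parentID: "1234567890"   # organization or project ID that owns the tag key
      key: cost-center
      value: engineering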
Acceptance Criteria
The TechPreview featureSet check added in the machine-api-provider-gcp operator for userLabels and userTags should be removed. The new featureGate added in openshift/api should also be removed.
Acceptance Criteria
This Feature covers the effort (person-weeks of meetings in #wg-managed-ocp-versions) where OTA helped SD refine how the OCM work they plan to do would help, and what that OCM work might look like: https://issues.redhat.com/browse/OTA-996?focusedId=25608383&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-25608383.
Currently the ROSA/ARO versions are not managed by OTA team.
This Feature covers the engineering effort to transfer the responsibility for OCP version management in OSD, ROSA, and ARO from SRE-P to OTA.
Here is the design document for the effort: https://docs.google.com/document/d/1hgMiDYN9W60BEIzYCSiu09uV4CrD_cCCZ8As2m7Br1s/edit?skip_itp2_check=true&pli=1
Here are some objectives:
Presentation from Jeremy Eder:
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
This epic is to transfer the responsibility of OCP version management in OSD, ROSA and ARO from SRE-P to OTA.
Why is this important? (mandatory)
What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
To make it easier to debug when the OTA-1211 configuration causes issues with retrieving update recommendations.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision Azure infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Description of problem:
Failed to create second cluster in shared vnet, below error is thrown out during creating network infrastructure when creating 2nd cluster, installer timed out and exited. ============== 07-23 14:09:27.315 level=info msg=Waiting up to 15m0s (until 6:24AM UTC) for network infrastructure to become ready... ... 07-23 14:16:14.900 level=debug msg= failed to reconcile cluster services: failed to reconcile AzureCluster service loadbalancers: failed to create or update resource jima0723b-1-x6vpp-rg/jima0723b-1-x6vpp-internal (service: loadbalancers): PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-1-x6vpp-rg/providers/Microsoft.Network/loadBalancers/jima0723b-1-x6vpp-internal 07-23 14:16:14.900 level=debug msg= -------------------------------------------------------------------------------- 07-23 14:16:14.901 level=debug msg= RESPONSE 400: 400 Bad Request 07-23 14:16:14.901 level=debug msg= ERROR CODE: PrivateIPAddressIsAllocated 07-23 14:16:14.901 level=debug msg= -------------------------------------------------------------------------------- 07-23 14:16:14.901 level=debug msg= { 07-23 14:16:14.901 level=debug msg= "error": { 07-23 14:16:14.901 level=debug msg= "code": "PrivateIPAddressIsAllocated", 07-23 14:16:14.901 level=debug msg= "message": "IP configuration /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-1-x6vpp-rg/providers/Microsoft.Network/loadBalancers/jima0723b-1-x6vpp-internal/frontendIPConfigurations/jima0723b-1-x6vpp-internal-frontEnd is using the private IP address 10.0.0.100 which is already allocated to resource /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/frontendIPConfigurations/jima0723b-49hnw-internal-frontEnd.", 07-23 14:16:14.902 level=debug msg= "details": [] 07-23 14:16:14.902 level=debug msg= } 07-23 14:16:14.902 level=debug msg= } 07-23 14:16:14.902 level=debug msg= -------------------------------------------------------------------------------- Install-config for 1st cluster: ========= metadata: name: jima0723b platform: azure: region: eastus baseDomainResourceGroupName: os4-common networkResourceGroupName: jima0723b-rg virtualNetwork: jima0723b-vnet controlPlaneSubnet: jima0723b-master-subnet computeSubnet: jima0723b-worker-subnet publish: External Install-config for 2nd cluster: ======== metadata: name: jima0723b-1 platform: azure: region: eastus baseDomainResourceGroupName: os4-common networkResourceGroupName: jima0723b-rg virtualNetwork: jima0723b-vnet controlPlaneSubnet: jima0723b-master-subnet computeSubnet: jima0723b-worker-subnet publish: External shared master subnet/worker subnet: $ az network vnet subnet list -g jima0723b-rg --vnet-name jima0723b-vnet -otable AddressPrefix Name PrivateEndpointNetworkPolicies PrivateLinkServiceNetworkPolicies ProvisioningState ResourceGroup --------------- ----------------------- -------------------------------- ----------------------------------- ------------------- --------------- 10.0.0.0/24 jima0723b-master-subnet Disabled Enabled Succeeded jima0723b-rg 10.0.1.0/24 jima0723b-worker-subnet Disabled Enabled Succeeded jima0723b-rg internal lb frontedIPConfiguration on 1st cluster: $ az network lb show -n jima0723b-49hnw-internal -g jima0723b-49hnw-rg --query 'frontendIPConfigurations' [ { "etag": "W/\"7a7531ca-fb02-48d0-b9a6-d3fb49e1a416\"", "id": 
"/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/frontendIPConfigurations/jima0723b-49hnw-internal-frontEnd", "inboundNatRules": [ { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/inboundNatRules/jima0723b-49hnw-master-0", "resourceGroup": "jima0723b-49hnw-rg" }, { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/inboundNatRules/jima0723b-49hnw-master-1", "resourceGroup": "jima0723b-49hnw-rg" }, { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/inboundNatRules/jima0723b-49hnw-master-2", "resourceGroup": "jima0723b-49hnw-rg" } ], "loadBalancingRules": [ { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/loadBalancingRules/LBRuleHTTPS", "resourceGroup": "jima0723b-49hnw-rg" }, { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/loadBalancingRules/sint-v4", "resourceGroup": "jima0723b-49hnw-rg" } ], "name": "jima0723b-49hnw-internal-frontEnd", "privateIPAddress": "10.0.0.100", "privateIPAddressVersion": "IPv4", "privateIPAllocationMethod": "Static", "provisioningState": "Succeeded", "resourceGroup": "jima0723b-49hnw-rg", "subnet": { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-rg/providers/Microsoft.Network/virtualNetworks/jima0723b-vnet/subnets/jima0723b-master-subnet", "resourceGroup": "jima0723b-rg" }, "type": "Microsoft.Network/loadBalancers/frontendIPConfigurations" } ] From above output, privateIPAllocationMethod is static and always allocate privateIPAddress to 10.0.0.100, this might cause the 2nd cluster installation failure. Checked the same on cluster created by using terraform, privateIPAllocationMethod is dynamic. 
=============== $ az network lb show -n wxjaz723-pm99k-internal -g wxjaz723-pm99k-rg --query 'frontendIPConfigurations' [ { "etag": "W/\"e6bec037-843a-47ba-a725-3f322564be58\"", "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wxjaz723-pm99k-rg/providers/Microsoft.Network/loadBalancers/wxjaz723-pm99k-internal/frontendIPConfigurations/internal-lb-ip-v4", "loadBalancingRules": [ { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wxjaz723-pm99k-rg/providers/Microsoft.Network/loadBalancers/wxjaz723-pm99k-internal/loadBalancingRules/api-internal-v4", "resourceGroup": "wxjaz723-pm99k-rg" }, { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wxjaz723-pm99k-rg/providers/Microsoft.Network/loadBalancers/wxjaz723-pm99k-internal/loadBalancingRules/sint-v4", "resourceGroup": "wxjaz723-pm99k-rg" } ], "name": "internal-lb-ip-v4", "privateIPAddress": "10.0.0.4", "privateIPAddressVersion": "IPv4", "privateIPAllocationMethod": "Dynamic", "provisioningState": "Succeeded", "resourceGroup": "wxjaz723-pm99k-rg", "subnet": { "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wxjaz723-rg/providers/Microsoft.Network/virtualNetworks/wxjaz723-vnet/subnets/wxjaz723-master-subnet", "resourceGroup": "wxjaz723-rg" }, "type": "Microsoft.Network/loadBalancers/frontendIPConfigurations" }, ... ]
Version-Release number of selected component (if applicable):
4.17 nightly build
How reproducible:
Always
Steps to Reproduce:
1. Create shared vnet / master subnet / worker subnet
2. Create 1st cluster in shared vnet
3. Create 2nd cluster in shared vnet
Actual results:
2nd cluster installation failed
Expected results:
Both clusters are installed successfully.
Additional info:
Description of problem:
Install Azure fully private IPI cluster by using CAPI with payload built from cluster bot including openshift/installer#8727 and openshift/installer#8732.
install-config:
=================
platform:
  azure:
    region: eastus
    outboundType: UserDefinedRouting
    networkResourceGroupName: jima24b-rg
    virtualNetwork: jima24b-vnet
    controlPlaneSubnet: jima24b-master-subnet
    computeSubnet: jima24b-worker-subnet
publish: Internal
featureSet: TechPreviewNoUpgrade
Checked the storage account created by the installer; its property allowBlobPublicAccess is set to True.
$ az storage account list -g jima24b-fwkq8-rg --query "[].[name,allowBlobPublicAccess]" -o tsv
jima24bfwkq8sa  True
This is not consistent with the terraform code, https://github.com/openshift/installer/blob/master/data/data/azure/vnet/main.tf#L74
At least, the storage account should have no public access for a fully private cluster.
Version-Release number of selected component (if applicable):
4.17 nightly build
How reproducible:
Always
Steps to Reproduce:
1. Create fully private cluster
2. Check storage account created by installer
Actual results:
storage account have public access on fully private cluster.
Expected results:
storage account should have no public access on fully private cluster.
Additional info:
Description of problem:
In the install-config file, there is no zone/instance type setting under controlPlane or defaultMachinePlatform:
==========================
featureSet: CustomNoUpgrade
featureGates:
- ClusterAPIInstallAzure=true
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
After creating the cluster, master instances should be created in multiple zones, since the default instance type 'Standard_D8s_v3' has availability zones. Actually, the master instances are not created in any zone.
$ az vm list -g jima24a-f7hwg-rg -otable
Name                                        ResourceGroup     Location        Zones
------------------------------------------  ----------------  --------------  -------
jima24a-f7hwg-master-0                      jima24a-f7hwg-rg  southcentralus
jima24a-f7hwg-master-1                      jima24a-f7hwg-rg  southcentralus
jima24a-f7hwg-master-2                      jima24a-f7hwg-rg  southcentralus
jima24a-f7hwg-worker-southcentralus1-wxncv  jima24a-f7hwg-rg  southcentralus  1
jima24a-f7hwg-worker-southcentralus2-68nxv  jima24a-f7hwg-rg  southcentralus  2
jima24a-f7hwg-worker-southcentralus3-4vts4  jima24a-f7hwg-rg  southcentralus  3
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-23-145410
How reproducible:
Always
Steps to Reproduce:
1. CAPI-based install on Azure platform with default configuration
Actual results:
master instances are created but not in any zone.
Expected results:
master instances should be created per zone based on selected instance type, keep the same behavior as terraform based install.
Additional info:
When setting zones under controlPlane in install-config, master instances can be created per zone.
install-config:
===========================
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure:
      zones: ["1","3"]
$ az vm list -g jima24b-p76w4-rg -otable
Name                                        ResourceGroup     Location        Zones
------------------------------------------  ----------------  --------------  -------
jima24b-p76w4-master-0                      jima24b-p76w4-rg  southcentralus  1
jima24b-p76w4-master-1                      jima24b-p76w4-rg  southcentralus  3
jima24b-p76w4-master-2                      jima24b-p76w4-rg  southcentralus  1
jima24b-p76w4-worker-southcentralus1-bbcx8  jima24b-p76w4-rg  southcentralus  1
jima24b-p76w4-worker-southcentralus2-nmgfd  jima24b-p76w4-rg  southcentralus  2
jima24b-p76w4-worker-southcentralus3-x2p7g  jima24b-p76w4-rg  southcentralus  3
Description of problem:
Launch CAPI based installation on Azure Government Cloud, installer was timeout when waiting for network infrastructure to become ready. 06-26 09:08:41.153 level=info msg=Waiting up to 15m0s (until 9:23PM EDT) for network infrastructure to become ready... ... 06-26 09:09:33.455 level=debug msg=E0625 21:09:31.992170 22172 azurecluster_controller.go:231] "failed to reconcile AzureCluster" err=< 06-26 09:09:33.455 level=debug msg= failed to reconcile AzureCluster service group: reconcile error that cannot be recovered occurred: resource is not Ready: The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found.: PUT https://management.azure.com/subscriptions/8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7/resourceGroups/jima26mag-9bqkl-rg 06-26 09:09:33.456 level=debug msg= -------------------------------------------------------------------------------- 06-26 09:09:33.456 level=debug msg= RESPONSE 404: 404 Not Found 06-26 09:09:33.456 level=debug msg= ERROR CODE: SubscriptionNotFound 06-26 09:09:33.456 level=debug msg= -------------------------------------------------------------------------------- 06-26 09:09:33.456 level=debug msg= { 06-26 09:09:33.456 level=debug msg= "error": { 06-26 09:09:33.456 level=debug msg= "code": "SubscriptionNotFound", 06-26 09:09:33.456 level=debug msg= "message": "The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found." 06-26 09:09:33.456 level=debug msg= } 06-26 09:09:33.456 level=debug msg= } 06-26 09:09:33.456 level=debug msg= -------------------------------------------------------------------------------- 06-26 09:09:33.456 level=debug msg= . Object will not be requeued 06-26 09:09:33.456 level=debug msg= > logger="controllers.AzureClusterReconciler.reconcileNormal" controller="azurecluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster" AzureCluster="openshift-cluster-api-guests/jima26mag-9bqkl" namespace="openshift-cluster-api-guests" reconcileID="f2ff1040-dfdd-4702-ad4a-96f6367f8774" x-ms-correlation-request-id="d22976f0-e670-4627-b6f3-e308e7f79def" name="jima26mag-9bqkl" 06-26 09:09:33.457 level=debug msg=I0625 21:09:31.992215 22172 recorder.go:104] "failed to reconcile AzureCluster: failed to reconcile AzureCluster service group: reconcile error that cannot be recovered occurred: resource is not Ready: The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found.: PUT https://management.azure.com/subscriptions/8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7/resourceGroups/jima26mag-9bqkl-rg\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 Not Found\nERROR CODE: SubscriptionNotFound\n--------------------------------------------------------------------------------\n{\n \"error\": {\n \"code\": \"SubscriptionNotFound\",\n \"message\": \"The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found.\"\n }\n}\n--------------------------------------------------------------------------------\n. 
Object will not be requeued" logger="events" type="Warning" object={"kind":"AzureCluster","namespace":"openshift-cluster-api-guests","name":"jima26mag-9bqkl","uid":"20bc01ee-5fbe-4657-9d0b-7013bd55bf96","apiVersion":"infrastructure.cluster.x-k8s.io/v1beta1","resourceVersion":"1115"} reason="ReconcileError" 06-26 09:17:40.081 level=debug msg=I0625 21:17:36.066522 22172 helpers.go:516] "returning early from secret reconcile, no update needed" logger="controllers.reconcileAzureSecret" controller="ASOSecret" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster" AzureCluster="openshift-cluster-api-guests/jima26mag-9bqkl" namespace="openshift-cluster-api-guests" name="jima26mag-9bqkl" reconcileID="2df7c4ba-0450-42d2-901e-683de399f8d2" x-ms-correlation-request-id="b2bfcbbe-8044-472f-ad00-5c0786ebbe84" 06-26 09:23:46.611 level=debug msg=Collecting applied cluster api manifests... 06-26 09:23:46.611 level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: infrastructure is not ready: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline 06-26 09:23:46.611 level=info msg=Shutting down local Cluster API control plane... 06-26 09:23:46.612 level=info msg=Stopped controller: Cluster API 06-26 09:23:46.612 level=warning msg=process cluster-api-provider-azure exited with error: signal: killed 06-26 09:23:46.612 level=info msg=Stopped controller: azure infrastructure provider 06-26 09:23:46.612 level=warning msg=process cluster-api-provider-azureaso exited with error: signal: killed 06-26 09:23:46.612 level=info msg=Stopped controller: azureaso infrastructure provider 06-26 09:23:46.612 level=info msg=Local Cluster API system has completed operations 06-26 09:23:46.612 [[1;31mERROR[0;39m] Installation failed with error code '4'. Aborting execution. From above log, Azure Resource Management API endpoint is not correct, endpoint "management.azure.com" is for Azure Public cloud, the expected one for Azure Government should be "management.usgovcloudapi.net".
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-23-145410
How reproducible:
Always
Steps to Reproduce:
1. Install cluster on Azure Government Cloud, CAPI-based installation
Actual results:
Installation failed because of the wrong Azure Resource Management API endpoint used.
Expected results:
Installation succeeded.
Additional info:
Description of problem:
CAPZ creates an empty route table during installs
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Very
Steps to Reproduce:
1. Install IPI cluster using CAPZ
Actual results:
Empty route table created and attached to worker subnet
Expected results:
No route table created
Additional info:
Epic Goal*
There was an epic / enhancement to create a cluster-wide TLS config that applies to all OpenShift components:
https://issues.redhat.com/browse/OCPPLAN-4379
https://github.com/openshift/enhancements/blob/master/enhancements/kube-apiserver/tls-config.md
For example, this is how KCM sets --tls-cipher-suites and --tls-min-version based on the observed config:
https://issues.redhat.com/browse/WRKLDS-252
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/506/files
The cluster admin can change the config based on their risk profile, but if they don't change anything, there is a reasonable default.
We should update all CSI driver operators to use this config. Right now we have a hard-coded cipher list in library-go. See OCPBUGS-2083 and OCPBUGS-4347 for background context.
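For reference, a cluster admin sets the cluster-wide profile on the APIServer config resource, and the operators observe it. A minimal sketch (the cipher list shown is illustrative, not a recommendation):
apiVersion: config.openshift.io/v1
kind: APIServer
metadata:
  name: cluster
spec:
  tlsSecurityProfile:
    type: Custom
    custom:
      ciphers:
      - ECDHE-ECDSA-AES128-GCM-SHA256
      - ECDHE-RSA-AES128-GCM-SHA256
      - ECDHE-ECDSA-AES256-GCM-SHA384
      - ECDHE-RSA-AES256-GCM-SHA384
      minTLSVersion: VersionTLS12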
Why is this important? (mandatory)
This will keep the cipher list consistent across many OpenShift components. If the default list is changed, we get that change "for free".
It will reduce support calls from customers and backport requests when the recommended defaults change.
It will provide flexibility to the customer, since they can set their own TLS profile settings without requiring code change for each component.
Scenarios (mandatory)
As a cluster admin, I want to use TLSSecurityProfile to control the cipher list and minimum TLS version for all CSI driver operator sidecars, so that I can adjust the settings based on my own risk assessment.
Dependencies (internal and external) (mandatory)
None, the changes we depend on were already implemented.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
We want to stop building the kube-proxy image out of the openshift-sdn repo, and start building it out of the openshift/kubernetes repo along with the other kubernetes binaries.
Networking Definition of Planned
Epic Template descriptions and documentation
openshift-sdn is no longer part of OCP in 4.17, so remove references to it in the networking APIs.
Consider whether we can remove the entire network.openshift.io API, which will now be no-ops.
In places where both sdn and ovn-k are supported, remove references to sdn.
In some places (notably the migration API), we will probably leave an API in place that currently has no purpose.
Additional information on each of the above items can be found here: Networking Definition of Planned
Goal:
As an administrator, I would like to use my own managed DNS solution instead of only specific openshift-install supported DNS services (such as AWS Route53, Google Cloud DNS, etc...) for my OpenShift deployment.
Problem:
While cloud-based DNS services provide convenient hostname management, there are a number of regulatory (ITAR) and operational constraints customers face that prohibit the use of those DNS hosting services on public cloud providers.
Why is this important:
Dependencies (internal and external):
Prioritized epics + deliverables (in scope / not in scope):
Estimate (XS, S, M, L, XL, XXL):
Previous Work:
Open questions:
Link to Epic: https://docs.google.com/document/d/1OBrfC4x81PHhpPrC5SEjixzg4eBnnxCZDr-5h3yF2QI/edit?usp=sharing
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Currently, the `master.ign` contains the URL from which to download the actual Ignition. On cloud platforms, this value is:
"source":"https://api-int.<cluster domain>:22623/config/master"
Update this value with the API-Int LB IP when custom-dns is enabled on the GCP platform.
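For illustration (the real master.ign is JSON; it is rendered as YAML here for readability, and the version and IP are placeholders), the change amounts to swapping the api-int hostname in the merge source for the internal API load balancer IP:
ignition:
  version: 3.2.0
  config:
    merge:
    - source: https://10.0.0.5:22623/config/master   # instead of https://api-int.<cluster domain>:22623/config/master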
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
This feature is to track automation in ODC, related packages, upgrades, and some tech debt.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | No |
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
This won't impact documentation; this feature mostly enhances end-to-end tests and CI job runs.
Questions to be addressed:
When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release
Alternate scenario
#acm-1290-rename-local-cluster
Remove hard-coded local-cluster from the import local cluster feature and verify that we don't use it in the infrastructure operator
Testing the import local cluster and checking the behavior after the upgrade.
Yes.
No
No
Presently the name of the local-cluster is hardwired to "local-cluster" in the local cluster import tool.
It is possible to redefine the name of the "local-cluster" in ACM, in which case the correct local-cluster name needs to be picked up from the ManagedCluster and used.
Suggested approach
1: Obtain the correct "local-cluster" name from the ManagedCluster CR that has been labelled as "local-cluster"
2: Use this name to import the local cluster, annotate the created AgentServiceConfig, ClusterDeployment and InfraEnv as a "local cluster"
3: Handle any updates to ManagedCluster to keep the name in sync.
4: During deletion of local cluster CRs, this annotation may be used to identify CRs to be deleted.
This will leave an edge case: there will be an AgentServiceConfig, ClusterDeployment and InfraEnv "left behind" for any users who have renamed their ManagedCluster and then performed an upgrade to this new version. Those users will need to manually remove these CRs. (I will discuss further with ACM to determine a suitable course of action here.)
This makes the following assumptions, which should also be checked with the ACM team.
1: ACM users may rename their "local-cluster" in ACM (meaning that we should pick this change up)
2: ACM will use the label "local-cluster" in the ManagedCluster CR to signify a local cluster
3: There will only be one "local-cluster" in ACM (note that it's possible to add a label arbitrarily so this may not be properly enforceable.)
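A minimal sketch of assumptions 1 and 2 above: the hub's own ManagedCluster carries the "local-cluster" label, and the import flow reads its metadata.name instead of assuming "local-cluster" (the label value shown is an assumption):
apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
  name: my-renamed-hub          # the name to use when importing the local cluster
  labels:
    local-cluster: "true"       # label used to identify the hub's own ManagedCluster
spec:
  hubAcceptsClient: true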
Requirement description:
As a VM Admin, I want to improve overall density. In our traditional VM environments, we find that we are memory bound much more than CPU bound. Even with properly sized VMs, we see a lot of memory just sitting around allocated to the VM, but not actually used. Moreover, we always see people requesting VMs that are sized way too big for their workloads. It is better customer service to allow this to some degree and then recover the memory at the hypervisor level.
MVP:
Documents:
Prometheus query for UI:
sum by (instance)(((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) + (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes)) / node_memory_MemTotal_bytes) *100
In human words: this approximates how much over-commitment of memory is taking place. A value of 100 means RAM+SWAP usage is 100% of system RAM capacity; 105% means RAM+SWAP usage is 105% of system RAM capacity.
Threshold: Yellow 95%, Red 105%
Based on: https://docs.google.com/document/d/1AbR1LACNMRU2QMqFpe-Se2mCEFLMqW_M9OPKh2v3yYw,
https://docs.google.com/document/d/1E1joajwxQChQiDVTsr9Qk_iIhpQkSI-VQP-o_BMx8Aw
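A hedged sketch of how the query and thresholds above might be wired into alerting rules (the rule name, alert names, and namespace are assumptions, not an agreed design):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: memory-overcommit-alerts      # hypothetical name
  namespace: openshift-monitoring
spec:
  groups:
  - name: memory-overcommit
    rules:
    - alert: MemoryOvercommitWarning  # yellow threshold from above
      expr: sum by (instance)(((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) + (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes)) / node_memory_MemTotal_bytes) * 100 > 95
      for: 15m
      labels:
        severity: warning
    - alert: MemoryOvercommitCritical # red threshold from above
      expr: sum by (instance)(((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) + (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes)) / node_memory_MemTotal_bytes) * 100 > 105
      for: 15m
      labels:
        severity: critical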
Provide a simple way to get a VM-friendly networking setup, without having to configure the underlying physical network.
Provide a network solution working out of the box, meeting expectations of a typical VM workload.
Primary user-defined networks can be managed from the UI and the user flow is seamless.
The admin console's alert details page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.
The dev console page displays fewer dashboards than the admin version of the page, so that difference will need to be supported by monitoring-plugin.
The admin console's silences page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.
Ensure removal of deprecated patternfly components from kebab-dropdown.tsx and alerting.tsx once this story and OU-561 are completed.
Proposed title of this feature request
Fleet / Multicluster Alert Management User Interface
What is the nature and description of the request?
Large enterprises are drowning in cluster alerts.
side note: Just within my demo RHACM Hub environment, across 12 managed clusters (OCP, SNO, ARO, ROSA, self-managed HCP, xKS), I have 62 alerts being reported! And I have no idea what to do about them!
Customers need the ability to interact with alerts in a meaningful way, to leverage a user interface that can filter, display, multi-select, sort, etc. To multi-select and take actions, for example:
Why does the customer need this? (List the business requirements)
Platform engineering (sys admin; SRE etc) must maintain the health of the cluster and ensure that the business applications are running stable. There might indeed be another tool and another team which focuses on the Application health itself, but for sure the platform team is interested to ensure that the platform is running optimally and all critical alerts are responded to.
As of today, what the customer must do is perform alert management via the CLI. This is tedious, ad hoc, and error-prone (see the blog link).
The requirements are:
List any affected packages or components.
OCP console Observe dynamic plugin
ACM Multicluster observability (MCO operator)
"In order to provide ACM with the same monitoring capabilities OCP has, we as the Observability UI Team need to allow the monitoring plugin to be installed and work in ACM environments."
Product Requirements:
UX Requirements:
In order for ACM to reuse the monitoring plugin, the plugin needs to connect to a different Alertmanager. It also needs a new column in the alerts list to show the source cluster these alerts are generated from.
Check the ACM documentation around alerts for reference: https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.11/html/observability/observing-environments-intro#observability-arch
Placeholder feature for ccx-ocp-core maintenance tasks.
This epic tracks "business as usual" requirements / enhancements / bug fixing of the Insights Operator.
The Insights operator should replace the %s in https://console.redhat.com/api/gathering/v2/%s/gathering_rules error messages, like the failed-to-bootstrap example below:
$ jq -r .content osd-ccs-gcp-ad-install.log | sed 's/\\n/\n/g' | grep 'Cluster operator insights'
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights ClusterTransferAvailable is False with Unauthorized: failed to pull cluster transfer: OCM API https://api.openshift.com/api/accounts_mgmt/v1/cluster_transfers/?search=cluster_uuid+is+%REDACTED%27+and+status+is+%27accepted%27 returned HTTP 401: REDACTED"
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights Disabled is False with AsExpected: "
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights RemoteConfigurationAvailable is False with HttpStatus401: received HTTP 401 Unauthorized from https://console.redhat.com/api/gathering/v2/%s/gathering_rules"
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights RemoteConfigurationValid is Unknown with NoValidationYet: "
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights SCAAvailable is False with Unauthorized: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 401: REDACTED
level=info msg=Cluster operator insights ClusterTransferAvailable is False with Unauthorized: failed to pull cluster transfer: OCM API https://api.openshift.com/api/accounts_mgmt/v1/cluster_transfers/?search=cluster_uuid+is+%27REDACTED%27+and+status+is+%27accepted%27 returned HTTP 401: REDACTED
level=info msg=Cluster operator insights Disabled is False with AsExpected:
level=info msg=Cluster operator insights RemoteConfigurationAvailable is False with HttpStatus401: received HTTP 401 Unauthorized from https://console.redhat.com/api/gathering/v2/%s/gathering_rules
level=info msg=Cluster operator insights RemoteConfigurationValid is Unknown with NoValidationYet:
level=info msg=Cluster operator insights SCAAvailable is False with Unauthorized: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 401: REDACTED
level=info msg=Cluster operator insights UploadDegraded is True with NotAuthorized: Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired: {\"errors\":[{\"meta\":{\"response_by\":\"gateway\"},\"detail\":\"UHC services authentication failed\",\"status\":401}]}
Seen in 4.17 RCs. Also in this comment.
Unknown
Unknown.
ClusterOperator conditions talking about https://console.redhat.com/api/gathering/v2/%s/gathering_rules
URIs we expose in customer-oriented messaging should not have %s placeholders.
Seems like the template is coming in as conditionalGathererEndpoint here. Seems like insights-operator#964 introduced the %s, but I'm not finding the logic that's supposed to populate that placeholder.
Description of problem:
When the Insights Operator is disabled (as described in the docs here or here), the RemoteConfigurationAvailable and RemoteConfigurationValid clusteroperator conditions keep reporting the previous state (from before disabling the gathering), which might be Available=True and Valid=True.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Disable the data gathering in the Insights Operator following the docs links above
2. Watch the clusteroperator conditions with "oc get co insights -o json | jq .status.conditions"
Actual results:
Expected results:
Additional info:
The rapid recommendations enhancement defines this built-in configuration for the case when the operator cannot reach the remote endpoint.
The issue is that the built-in configuration (though currently empty) is not taken into account - i.e. the data requested in the built-in configuration is not gathered.
With the rapid recommendations feature (enhancement) one can request various messages from Pods matching various Pod name regular expressions
The problem is when there is a Pod (e.g. foo-1 from the example below) matching more than one requested Pod name regex:
{ 'namespace': 'test-namespace', 'pod_name_regex': 'foo-.*', 'messages': ['regex1', 'regex2'] },
{ 'namespace': 'test-namespace', 'pod_name_regex': 'foo-1', 'messages': ['regex3', 'regex4'] }
Assume Pods with the names foo-1 and foo-bar. Currently all the regexes (regex1, regex2, regex3, regex4) are applied to both Pods.
The desired behavior is that foo-1 is filtered with all the regexes, but foo-bar is filtered only with regex1 and regex2.
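To make the expected matching concrete, here is the desired pod-to-regex mapping for the example above, written out as data:
# Desired behavior for the example configuration above
expected_filtering:
  foo-1:     # matches both 'foo-.*' and 'foo-1'
  - regex1
  - regex2
  - regex3
  - regex4
  foo-bar:   # matches only 'foo-.*'
  - regex1
  - regex2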
Goal:
Track Insights Operator Data Enhancements epic in 2024
Description
We can remove all the hardcoded container log gatherers (except the conditional ones) in favor of the Rapid Recommendations approach. They can be removed in the 4.18 version.
Context:
As we discussed in INSIGHTOCP-1814, this is a good candidate that can help customers fix issues caused by too many unused MachineConfigs.
Required Data:
The total number of MachineConfigs in the cluster and the number of unused MachineConfigs in the cluster.
Backports:
To the OCP versions we support.
Proposed title of this feature request
The container scanner aims to gather the data necessary for business analytics of the usage of the RH Middleware portfolio in the live fleet.
The request includes assistance with onboarding the container scanner and help bringing it up to Insights Operator standards. GA quality requires performance and scalability QE on top of the functional testing alone.
Enhancement proposal tracked at: https://github.com/openshift/enhancements/pull/1584/files
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision IBM Cloud VPC infrastructure without the use of Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Provisioning bootstrap and control plane machines using CAPI.
RHCOS Image Preparation as Pre-Infrastructure Provisioning Task
InfraReady (post infrastructure) Provisioning
Hosted Control Planes and HyperShift provide consumers with a different architectural path to OpenShift that aligns best with their multi-cluster deployment needs. However, today's API surface area in HCP remains "like a box of chocolates; you never know what you're gonna get" (Forrest Gump), sometimes gated best-effort via the `hcp` cli, which is suboptimal.
The goal of this feature is to build a standard for communicating features that are GA/Preview. This would allow us:
This can be done following the guidelines in the FeatureGate FAQ. For example, by introducing a structured system of feature gates in our hosted control plane API, such that features are categorized into 'on-by-default', 'accessible-by-default', 'inaccessible-by-default or TechPreviewNoUpgrade', and 'Tech Preview', we would be ensuring clarity, compliance, and a smooth development and user experience.
There are other teams (e.g., the assisted installer team) following a structured pattern for gating features:
Currently there is no rigorous technical mechanism to feature gate functionality or APIs in HyperShift.
We defer to docs, which results in bad UX, consumer confusion, and a maintainability burden.
We should have a technical implementation that allows features and APIs to run only behind a flag.
As a cluster-admin, I want to run updates in discrete steps, updating the control plane and worker nodes independently.
I also want to back up and restore in case of a problematic upgrade.
Background:
This Feature is a continuation of https://issues.redhat.com/browse/OCPSTRAT-180.
Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This is a feature tracking the epics required to get that work done. Below is the list of done tasks.
These are alarming conditions which may frighten customers, and we don't want to see them in our own, controlled, repeatable update CI. This example job had logs like:
Feb 18 21:11:25.799 E clusteroperator/openshift-apiserver changed Degraded to True: APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
And the job failed, but none of the failures were "something made openshift-apiserver mad enough to go Degraded".
Definition of done:
Enable installation and lifecycle support of OpenShift 4 on Oracle Cloud Infrastructure (OCI) Bare metal
Use scenarios
Why is this important
Requirement | Notes |
---|---|
OCI Bare Metal Shapes must be certified with RHEL | They must also work with RHCOS (see iSCSI boot notes), as OCI BM standard shapes require RHCOS iSCSI boot. Certified shapes: https://catalog.redhat.com/cloud/detail/249287 |
Successfully passing the OpenShift Provider conformance testing – this should be fairly similar to the results from the OCI VM test results. | Oracle will do these tests. |
Updating Oracle Terraform files | |
Making the Assisted Installer modifications needed to address the CCM changes and surface the necessary configurations. | Support Oracle Cloud in Assisted-Installer CI: |
RFEs:
Any bare metal Shape to be supported with OCP has to be certified with RHEL.
From the certified Shapes, those that have local disks will be supported. This is due to the current lack of support in RHCOS for the iSCSI boot feature. OCPSTRAT-749 is tracking adding this support and removing this restriction in the future.
As of Aug 2023 this excludes at least all the Standard shapes, as well as BM.GPU2.2 and BM.GPU3.8, from the published list at: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#baremetalshapes
Please describe what this feature is going to do.
Please describe what conditions must be met in order to mark this feature as "done".
If the answer is "yes", please make sure to check the corresponding option.
To make iSCSI work, a secondary VNIC must be configured during discovery, and again when the machine reboots into CoreOS. The configuration is almost the same for discovery and CoreOS.
Currently, we have one script owned by Red Hat for discovery, and a custom manifest owned by Oracle for the CoreOS configuration.
I think this configuration should be owned by Oracle, because the network configuration depends on the OCI API. Also, we need this script to be the same in order to ensure that the configuration applied during discovery will be identical when the machine reboots into CoreOS. Finally, if a customer has a specific need, they won't be able to tailor the configuration to their needs easily, as they would have to use the REST API of the assisted service.
My suggestion is to ask Oracle to drop the configuration script into their metadata service using Oracle's Terraform template. On the Red Hat side, we would pull this script onto the node and execute it via a systemd unit. The same would be done from the custom manifest provided by Oracle.
During 4.15, the OCP team is working on allowing booting from iSCSI. Today that's disabled by the assisted installer. The goal is to enable it for OCP version >= 4.15 when using the OCI external platform.
iSCSI boot is enabled for OCP version >= 4.15 both in the UI and the backend.
When booting from iSCSI, we need to make sure to add the `rd.iscsi.firmware=1 ip=ibft` kargs during install to enable iSCSI booting.
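For illustration only, the same kernel arguments could be expressed declaratively, e.g. as a MachineConfig; this sketch just shows the kargs in context and is not the mechanism the assisted installer uses (it injects them at install time), and the resource name is hypothetical:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-iscsi-kargs          # hypothetical name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - rd.iscsi.firmware=1               # read the iSCSI target from the iBFT firmware table
    - ip=ibft                           # configure networking from iBFT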
yes
PR https://github.com/openshift/assisted-service/pull/6257 must be adapted to be used along with the external platform.
Since we ensure that the iSCSI network is not the default route, the PR above will automatically select the subnet used by the default route.
The secondary VNIC must be configured manually in OCI; a script must be injected into the discovery ISO to configure it.
We are planning to support 5-node control planes to cover a set of active-active failure domains for OpenShift control planes (see OCPSTRAT-1199).
The Agent-Based Installer is required to enable this setup on day-1.
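A minimal sketch of what the day-1 request could look like in an Agent-Based Installer install-config.yaml, assuming the new topology simply raises the accepted control plane replica count (cluster name and counts are illustrative):
apiVersion: v1
metadata:
  name: example-cluster
controlPlane:
  name: master
  replicas: 5              # 4/5-node control plane spread across active-active failure domains
compute:
  - name: worker
    replicas: 0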
For additional context of the 5-node and 2-node control plane model please read:
We are planning to support 4/5-node control planes to cover a set of active-active failure domains for OpenShift control planes (see OCPSTRAT-1199).
Assisted Installer must support this new topology too.
For additional context of the 5-node and 2-node control plane model please read:
Currently, in HA clusters, assisted-service enforces exactly 3 control plane nodes. This issue should change that behaviour to allow 3-5 control plane nodes instead. It was decided in https://redhat-internal.slack.com/archives/G01A5NB3S6M/p1728296942806519?thread_ts=1727250326.825979&cid=G01A5NB3S6M that there will be no fail mechanism to continue with the installation in case one of the control plane nodes fails to install. This issue should also align assisted-service behaviour with marking control plane nodes as schedulable if there are fewer than 2 workers in the cluster, and not otherwise. It should also align assisted-service behaviour with failing the installation if the user asked for at least 2 workers and got fewer.
As a user, I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This does not require a design proposal.
This does not require a feature gate.
Support network isolation and multiple primary networks (with the possibility of overlapping IP subnets) without having to use Kubernetes Network Policies.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
OVN-Kubernetes today allows multiple different types of networks per secondary network: layer 2, layer 3, or localnet. Pods can be connected to different networks without discretion. For the primary network, OVN-Kubernetes only supports all pods connecting to the same layer 3 virtual topology.
As users migrate from OpenStack to Kubernetes, there is a need to provide network parity for those users. In OpenStack, each tenant (analog to a Kubernetes namespace) by default has a layer 2 network, which is isolated from any other tenant. Connectivity to other networks must be specified explicitly as network configuration via a Neutron router. In Kubernetes the paradigm is the opposite; by default all pods can reach other pods, and security is provided by implementing Network Policy.
Network Policy has its issues:
With all these factors considered, there is a clear need to address network security in a native fashion, by using networks per user to isolate traffic instead of using Kubernetes Network Policy.
Therefore, the scope of this effort is to bring the same flexibility of the secondary network to the primary network and allow pods to connect to different types of networks that are independent of networks that other pods may connect to.
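A rough sketch of what a namespaced primary network could look like as a user-defined network resource; the API group, kind, and field names below are assumptions for illustration and may differ from the final design:
apiVersion: k8s.ovn.org/v1
kind: UserDefinedNetwork
metadata:
  name: tenant-net
  namespace: tenant-a
spec:
  topology: Layer2
  layer2:
    role: Primary              # pods in this namespace attach to this network as their primary network
    subnets:
      - 192.168.100.0/24       # may overlap with subnets used by other tenants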
Test scenarios:
crun has been GA as a non-default runtime since OCP 4.14. We want to make it the default in 4.18 while still supporting runc as a non-default option.
The benefits of crun are covered here: https://github.com/containers/crun
FAQ.: https://docs.google.com/document/d/1N7tik4HXTKsXS-tMhvnmagvw6TE44iNccQGfbL_-eXw/edit
Note: making crun the default does not mean we will remove support for runc, nor do we have any plans to do so in the foreseeable future.
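For clusters that want to stay on runc (or explicitly pin a runtime), a ContainerRuntimeConfig can select the runtime per pool. A minimal sketch, assuming the existing defaultRuntime field; the resource name is hypothetical:
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: keep-runc                 # hypothetical name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    defaultRuntime: runc          # override the new crun default back to runc for this pool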
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Check with ACS team; see if there are external repercussions.
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Per OCPSTRAT-1278, we want to support OCP on the C3 instance type (bare metal) in order to enable OCP Virt on GCP. The C3 instance type supports hyperdisk-balanced disks.
The goal is to validate that our GCP CSI operator can deploy the driver on C3 bare metal nodes and that it functions as expected.
As OCP Virt requires RWX to support VM live migration, we need to make sure the driver works with this access mode with volumeType block.
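A minimal sketch of the kind of volume the virt use case needs; the storage class name is an assumption for illustration:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-disk-rwx
spec:
  accessModes:
    - ReadWriteMany                      # required for VM live migration
  volumeMode: Block                      # raw block volume, as used by OCP Virt
  storageClassName: hyperdisk-balanced   # assumed storage class backed by the GCP PD CSI driver
  resources:
    requests:
      storage: 50Gi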
Why is this important? (mandatory)
Product-level priority to enable OCP Virt on GCP. Multiple customers are waiting for this solution. See OCPSTRAT-1278 for additional details.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
PD CSI driver to support baremetal / C3 instance type
PD CSI driver to support block RWX
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
GCP PD CSI on C3 nodes passes the regular CSI tests + RWX with volumeType block. Actual VM live migration tests will be done by the virt team.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
iSCSI boot is supported in RHEL and since the implementation of OCPSTRAT-749 it's also available in RHCOS.
Customers require using this feature in different bare metal environments on-prem and cloud-based.
Assisted Installer implements support for it in Oracle Cloud Infrastructure (MGMT-16167) to support their bare metal standard "shapes".
This feature extends this support to make it generic and supported in the Agent-Based Installer, the Assisted Installer and in ACM/MCE.
Support iSCSI boot in bare metal nodes, including platform baremetal and platform "none".
Assisted installer can boot and install OpenShift on nodes with iSCSI disks.
Agent-Based Installer can boot and install OpenShift on nodes with iSCSI disks.
MCE/ACM can boot and install OpenShift on nodes with iSCSI disks.
The installation can be done on clusters with platform baremetal and clusters with platform "none".
Support booting from iSCSI using ABI starting OCP 4.16.
The following PRs are the gaps between release-4.17 branch and master that are needed to make the integration work on 4.17.
https://github.com/openshift/assisted-service/pull/6665
https://github.com/openshift/assisted-service/pull/6603
https://github.com/openshift/assisted-service/pull/6661
The feature has to be backported to 4.16 as well. TBD - list all the PRs that have to be backported.
Instructions to test the AI feature with local env - https://docs.google.com/document/d/1RnRhJN-fgofnVSBTA6mIKcK2_UW7ihbZDLGAVHSdpzc/edit#heading=h.bf4zg53460gu
Add new systemd services ( already available in Assisted service) into ABI to enable iSCSI boot
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone OpenShift.
Prerequisite work: goals completed in OCPSTRAT-1122.
Complete the design of the Cluster API (CAPI) architecture and build the core operator logic needed for Phase 1, incorporating the assets from different repositories to simplify asset management.
Phases 1 & 2 cover implementing base functionality for CAPI.
There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.
As an OpenShift engineer I want the CAPI Providers repositories to use the new generator tool so that they can independently generate CAPI Provider transport ConfigMaps
Once the new CAPI manifests generator tool is ready, we want to make use of it directly from the CAPI Provider repositories so we can avoid storing the generated configuration centrally and can independently apply it based on the running platform.
As an OpenShift developer, I want to implement an MCO ctrcfg runtime controller that watches ImagePolicy resources. The controller will update the sigstore verification file that crio's --signature-policy-dir uses for namespaced policies.
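A rough sketch of a namespaced ImagePolicy the controller would watch; field names follow the TechPreview sigstore API as best understood and should be treated as approximate, and the scope and key are placeholders:
apiVersion: config.openshift.io/v1alpha1
kind: ImagePolicy
metadata:
  name: example-policy
  namespace: my-namespace
spec:
  scopes:
    - quay.io/example/app                      # images this policy applies to
  policy:
    rootOfTrust:
      policyType: PublicKey
      publicKey:
        keyData: <base64-encoded cosign public key>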
The goals of this feature are:
Given Microsoft's constraints on IPv4 usage, there is a pressing need to optimize IP allocation and management within Azure-hosted environments.
Interoperability Considerations
There are currently multiple ingress strategies we support for hosted cluster service endpoints (kas, nodePort, router...).
In a context of uncertainty about which use cases would be most critical to support, we initially exposed this in a flexible API that allows choosing potentially any combination of ingress strategies and endpoints.
ARO has internal restrictions on IPv4 usage. Because of this, to simplify the above, and to be more cost effective in terms of infra, we want to have a common shared ingress solution for the whole fleet of hosted clusters.
As a management cluster owner I want to make sure the shared ingress is resilient to cluster failures
Currently the SharedIngress controller waits for a HostedCluster to exist before creating the Service/LoadBalancer of the shared-ingress.
The controller should create the Service/LoadBalancer even
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Our goal is to be able to deploy baremetal clusters using Cluster API in Openshift.
Metal3, our upstream community, already provides a CAPI provider, and our aim is to bring it downstream.
We will collaborate with the Cluster Infrastructure team on points of integration as needed.
Scope questions
Firmware (BIOS) updates and attribute configuration from OpenShift are key in O-RAN clusters. While we can do this on day 1, customers also need to set firmware attributes on hosts that have already been deployed and are part of a cluster.
This feature adds the capability of updating firmware attributes and updating the firmware image for hosts in deployed clusters.
As part of demoing our integration with hardware vendors, we need to show the ability to reconfigure already provisioned hosts: modify their BIOS settings and, in the future, do firmware upgrades. The initial demo will be concentrated on BIOS settings. The demo is expected to be based on 4.15 and to use unmerged patches since 4.15 is closed for feature development. The path to productization will be determined as an outcome of the demo.
The assumed end result is an ability to run firmware upgrades and update BIOS settings for hosts that are already provisioned without fully deprovisioning them. The hosts will still be rebooted, so some external orchestrator (a human or ZTP) will need to drain the nodes first.
1. Pre-installation:
2. Installation:
3. Update:
4. Uninstallation/Deletion:
5. Disconnected Environments for High-Security Workloads:
6. [Tech Preview] Signature Validation for Secure Workflows:
All the expected user outcomes and the acceptance criteria in the engineering epics are covered.
OLM: Gateway to the OpenShift Ecosystem
Operator Lifecycle Manager (OLM) has been a game-changer for OpenShift Container Platform (OCP) 4. Since its launch in 2019, OLM has fostered a rich ecosystem, expanding from a curated set of 25 operators to over 100 officially supported Red Hat operators and hundreds more from certified ISVs and the community.
OLM empowers users to manage diverse technologies with ease, including ACM, ACS, Quay, GitOps, Pipelines, Service Mesh, Serverless, and Virtualization. It has also facilitated the introduction of groundbreaking operators for entirely new workloads, like Nvidia GPU, PTP, Windows Machine Config, SR-IOV networking, and more. Today, a staggering 91% of our connected customers leverage OLM's capabilities.
OLM v0: A Stepping Stone
While OLM v0 has been instrumental, it has limitations. The API design, not fully GitOps-friendly or entirely declarative, presents a steeper learning curve due to its complexity. Furthermore, OLM v0 was designed with the assumption of namespace-scoped CRDs (Custom Resource Definitions), allowing for independent operator installations and parallel versions within a single cluster. However, this functionality never materialized in core Kubernetes, and OLM v0's attempt to simulate it has introduced limitations and bugs.
The Operator Framework Team: Building the Future
The Operator Framework team is the cornerstone of the OpenShift ecosystem. They build and manage OLM, the Operator SDK, operator catalog formats, and tooling (opm, file-based catalogs). Their work directly impacts how operators are developed, packaged, delivered, and managed by users and SRE teams on OpenShift clusters.
A Streamlined Future with OLM v1
The Operator Framework team has undergone significant restructuring to focus on the next generation of OLM – OLM v1. This transition includes moving the Operator SDK to a feature-complete state with ongoing maintenance for compatibility with the latest Kubernetes and controller-runtime libraries. This strategic shift allows the team to dedicate resources to completely revamping OLM's API and management concepts for catalog content delivery.
Leveraging learnings and customer feedback since OCP 4's inception, OLM v1 is designed to be a major overhaul, and it will be shipped as a Generally Available (GA) feature in OpenShift 4.17.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
1. Pre-installation:
2. Installation:
3. Update:
4. Uninstallation/Deletion:
1. Pre-installation:
2. Installation:
3. Update:
4. Uninstallation/Deletion:
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Typically, any non-deployment resource managed by cluster-olm-operator would be handled by a StaticResourceController (usage ref). Unfortunately, the StaticResourceController only knows how to handle specific types, as seen by the usage of the ApplyDirectly function in the StaticResourceController.Sync method. Due to the ApplyDirectly function only handling a set of known resources, the ClusterCatalog resource would likely not be handled the same as other static manifests currently managed by cluster-olm-operator.
In order to enable cluster-olm-operator to properly manage ClusterCatalog resources, it is proposed that we implement a custom factory.Controller that knows how to appropriately apply and manage ClusterCatalog resources such that:
The openshift/library-go project has a lot of packages that will likely make this implementation pretty straightforward. The custom controller implementation will likely also require implementation of some pre-condition logic that ensures the ClusterCatalog API is available on the cluster before attempting to use it.
Downstream change to add a kustomize overlay for a hostPath volume mount of /etc/containers.
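A minimal sketch of what such an overlay patch could add to the deployment; the deployment and container names are assumed for illustration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-controller            # hypothetical deployment patched by the overlay
spec:
  template:
    spec:
      containers:
        - name: manager
          volumeMounts:
            - name: etc-containers
              mountPath: /etc/containers
              readOnly: true
      volumes:
        - name: etc-containers
          hostPath:
            path: /etc/containers      # host policy directory made visible to the container
            type: Directory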
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Create a test that makes sure that the pre-defined, default, cluster catalogs are defined and are in a good state.
Create a test that builds upon the catalogd happy path by creating a manifest image, then updating the ClusterCatalog to reference that image, and then creating a ClusterExtension to deploy the manifests.
The status of the ClusterExtension should then be checked.
The manifests do not need to create a deployment, in fact it would be better if the manifest included simpler resources such as a configmap or secret.
This will create the initial openshift/origin tests. These will consist of tests that ensure, while in tech preview, that the ClusterExtension and ClusterCatalog APIs are present. This includes creating an OWNERS file that will make approving/reviewing future PRs easier.
Test 1:
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  annotations:
    olm.properties: '[{"type": "olm.maxOpenShiftVersion", "value": "4.17"}]'
Note the value needs to be equal to the cluster version this is being tested on.
Test 2
Same as test 1 but with two bundles. Message should have names in alphabetical order.
Test 3
Apply a bundle without the annotation. Upgradeable should be True.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
https://docs.google.com/document/d/18m-OG0PN8-jjjgGT33WNujzmj_1B2Tqoqd-bVKX4CkE/edit?usp=sharing
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Refactor cluster-olm-operator to use v1 of the OLM openshift/api/operator API
A/C:
- cluster-olm-operator now uses OLM v1
- OLM resource manifest updated to use v1
- CI is green
OpenShift offers "capabilities" to allow users to select which components to include in the cluster at install time.
It was decided the capability name should be: OperatorLifecycleManagerV1 [ref
A/C:
- ClusterVersion resource updated with OLM v1 capability
- cluster-olm-operator manifests updated with the capability.openshift.io/name=OperatorLifecycleManagerV1 annotation (see the annotated manifest sketch below)
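For illustration, a manifest shipped by cluster-olm-operator would carry the capability annotation roughly like this; the resource kind and name are chosen arbitrarily for the sketch:
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-cluster-olm-operator
  annotations:
    capability.openshift.io/name: OperatorLifecycleManagerV1   # ties the manifest to the OLM v1 capability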
Promote OLM API in the OpenShift API from v1alpha1 to v1 (see https://github.com/openshift/api/blob/master/operator/v1alpha1/types_olm.go#L1)
A/C:
- openshift/api/operator/v1alpha1 OLM promoted to v1
- openshift/api/operator/v1alpha1 OLM removed
As someone troubleshooting an OLMv1 issue with a cluster, I'd like to be able to see the state of cluster-olm-operator and the OLM resource, so that I can have all the information I need to fix the issue.
A/C:
- must-gather contains cluster-olm-operator namespace and contained resources
- must-gather contains OLM cluster scoped resource
- if cluster-olm-operator fails before updating its ClusterOperator, I'd still want the cluster-olm-operator namespace, its resources, and the cluster-scoped OLM resource to be in the must-gather
Networking Definition of Planned
Epic Template descriptions and documentation
Additional information on each of the above items can be found here: Networking Definition of Planned
...
1.
...
1. …
1. …
Goal:
Update team owned repositories to Kubernetes v1.31
?? is the 1.31 freeze
?? is the 1.31 GA
Problem:<please update links for 1.31>
The following repository must be rebased onto the latest version of Kubernetes:
The following repositories should be rebased onto the latest version of Kubernetes:
Entirely remove dependencies on k/k repository inside oc.
Why is this important:
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository
Please review the following PR: https://github.com/openshift/machine-config-operator/pull/4561
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
As an MCO developer, I want to pick up the openshift/kubernetes updates for the 1.31 k8s rebase so that the MCO tracks the same k8s version as the rest of the OpenShift cluster.
As part of our continuous improvement efforts, we need to update our Dockerfile to utilize the new multi-base images provided in OpenShift 4.18. The current Dockerfile is based on RHEL 8 and RHEL 9 builder images from OpenShift 4.17, and we want to ensure our builds are aligned with the latest supported images, for multiple architectures.
Updating the RHEL 9 builder image to
registry.ci.openshift.org/ocp/builder:rhel-9-golang-1.22-builder-multi-openshift-4.18
Updating the RHEL 8 builder image to
registry.ci.openshift.org/ocp/builder:rhel-8-golang-1.22-builder-multi-openshift-4.18
Updating the base image to
registry.ci.openshift.org/ocp-multi/4.18-art-latest-multi:machine-config-operator
or specifying a different tag if we don't want to only do the MCO
Ensuring all references and dependencies in the Dockerfile are compatible with these new images.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
To align with the 4.18 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.18 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository
Epic Goal*
Drive the technical part of the Kubernetes 1.31 upgrade, including rebasing the openshift/kubernetes repository and coordinating across the OpenShift organization to get e2e tests green for the OCP release.
Why is this important? (mandatory)
OpenShift 4.18 cannot be released without Kubernetes 1.31
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
PRs:
Retro: Kube 1.31 Rebase Retrospective Timeline (OCP 4.18)
Retro recording: https://drive.google.com/file/d/1htU-AglTJjd-VgFfwE3z_dH5tKXT1Tes/view?usp=drive_web
Description of problem:
Given 2 images with different names, but same layers, "oc image mirror" will only mirror 1 of them. For example:
$ cat images.txt
quay.io/openshift/community-e2e-images:e2e-33-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS quay.io/bertinatto/test-images:e2e-33-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS
quay.io/openshift/community-e2e-images:e2e-31-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS quay.io/bertinatto/test-images:e2e-31-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS
$ oc image mirror -f images.txt
quay.io/
  bertinatto/test-images
    manifests:
      sha256:298dcd808e27fbf96614e4c6f06730f22964dce41dcdc7bf21096c42411ba773 -> e2e-33-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS
  stats: shared=0 unique=0 size=0B
phase 0:
  quay.io bertinatto/test-images blobs=0 mounts=0 manifests=1 shared=0
info: Planning completed in 2.6s
sha256:298dcd808e27fbf96614e4c6f06730f22964dce41dcdc7bf21096c42411ba773 quay.io/bertinatto/test-images:e2e-33-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS
info: Mirroring completed in 240ms (0B/s)
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Only one of the images was mirrored.
Expected results:
Both images should be mirrored.
Additional info:
This PR https://github.com/openshift/origin/pull/29141 loosens the check to ignore the warning message in the output in order to unblock https://github.com/openshift/oc/pull/1877. Once the required PRs are merged, we should revert back to `o.Equal` again. This issue is created to track this work.
Following the recent changes in the CRD schema validation (introduced in https://github.com/kubernetes-sigs/controller-tools/pull/944), our tooling has identified several CRD violations in our APIs:
TechPreview clusters are unable to bootstrap because kube-apiserver fails to start with the following error:
E0827 20:29:22.653501 1 run.go:72] "command failed" err="group version resource.k8s.io/v1alpha2 that has not been registered"
This happens because, in Kubernetes 1.31, the group version resource.k8s.io/v1alpha2 was removed and replaced with resource.k8s.io/v1alpha3. This is part of the DynamicResourceAllocation feature, which is currently TechPreview.
After discussing this with the team, we decided that the best approach is to modify the cluster-kube-apiserver-operator to start the kube-apiserver with the correct group version based on the Kubernetes version being used.
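Conceptually, the operator would render the kube-apiserver arguments with the group version that matches the embedded Kubernetes level; this fragment is an illustrative sketch of the rendered config shape, not the operator's exact output:
apiServerArguments:
  runtime-config:
    - resource.k8s.io/v1alpha3=true     # with Kubernetes 1.31; older levels would use v1alpha2
  feature-gates:
    - DynamicResourceAllocation=true    # only set on TechPreview clusters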
As a customer of self-managed OpenShift or an SRE managing a fleet of OpenShift clusters, I should be able to determine the progress and state of an OCP upgrade and only be alerted if the cluster is unable to progress. Support a CLI status command and a status API which can be used by cluster admins to monitor progress. The status command/API should also contain data to alert users about potential issues which can make updates problematic.
Here are common update improvements from customer interactions on Update experience
oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
Update docs for UX and CLI changes
Reference : https://docs.google.com/presentation/d/1cwxN30uno_9O8RayvcIAe8Owlds-5Wjr970sDxn7hPg/edit#slide=id.g2a2b8de8edb_0_22
Epic Goal*
Add a new command `oc adm upgrade status` command which is backed by an API. Please find the mock output of the command output attached in this card.
Why is this important? (mandatory)
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
We utilize MCO annotations to determine whether a node is degraded or unavailable, and we only source the Reason annotation to populate the insight. Many common cases are not covered by this, especially the unavailable ones: nodes can be cordoned, have a condition like DiskPressure, be in the process of termination, etc. We are not sure whether our code or something like the MCO should provide this, but we captured it as a card for now.
An update is in progress for 28m42s: Working towards 4.14.1: 700 of 859 done (81% complete), waiting on network
= Control Plane =
...
Completion: 91%
1. Inconsistent info: CVO message says "700 of 859 done (81% complete)" but control plane section says "Completion: 91%"
2. Unclear measure of completion: the CVO message counts manifests applied, while the control plane section says "Completion: 91%", which counts upgraded COs. Neither message states what it counts. Manifest count is an internal implementation detail which users likely do not understand. COs are less so, but we should be clearer about what the completion means.
3. We could take advantage of this line and communicate progress with more details
We'll only remove the CVO message once the rest of the output functionally covers it, so the inconsistency stays until OTA-1154. Otherwise:
= Control Plane =
...
Completion: 91% (30 operators upgraded, 1 upgrading, 2 waiting)
Upgraded operators are COs that have updated their version, no matter their conditions.
Upgrading operators are COs that have not updated their version and are Progressing=True.
Waiting operators are COs that have not updated their version and are Progressing=False.
During an upgrade, once control plane is successfully updated, status items related to that part of the upgrade cease to be relevant, and therefore we can either hide them entirely, or we can show a simplified version of them. The relevant sections are Control plane and Control plane nodes.
As an OTA engineer,
I would like to make sure the node in a single-node cluster is handled correctly in the upgrade-status command.
Context:
According to the discussion with the MCO team,
the node is in the master MCP but not in the worker MCP.
This card is to make sure that the node is displayed that way too. My feeling is that the current code probably does the job already. In that case, we should add test coverage for this case to avoid regression in the future.
AC:
Address performance and scale issues in Whereabouts IPAM CNI
Whereabouts is becoming increasingly popular for use on workloads that operate at scale. Whereabouts was originally built as a convenience function for a handful of IPs; however, more and more customers want to use Whereabouts in scale situations.
Notably, this applies to telco and AI/ML scenarios. Some AI/ML scenarios launch a large number of pods that need to use secondary networks for related traffic.
Upstream collaboration outline
This is Image mode on OpenShift. It uses the rpm-ostree native containers interface and not bootc but that is an implementation detail.
In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment or potentially an enterprising administrator with some knowledge of OCP Builds could set theirs up on-cluster.
The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.
On-cluster, automated RHCOS Layering builds are important for multiple reasons:
This work describes the tech preview state of On Cluster Builds. Major interfaces should be agreed upon at the end of this state.
As a cluster admin of user provided infrastructure,
when I apply the machine config that opts a pool into On Cluster Layering,
I want to also be able to remove that config and have the pool revert back to its non-layered state with the previously applied config.
As a cluster admin using on cluster layering,
when an image build has failed,
I want it to retry 3 times automatically without my intervention and show me where to find the log of the failure.
As a cluster admin,
when I enable On Cluster Layering,
I want to know that the builder image I am building with is stable and will not change unless I change it
so that I keep the same API promises as we do elsewhere in the platform.
To test:
As a cluster admin using on cluster layering,
when I try to upgrade my cluster and the Cluster Version Operator is not available,
I want the upgrade operation to be blocked.
As a cluster admin,
when I use a disconnected environment,
I want to still be able to use On Cluster Layering.
As a cluster admin using On Cluster layering,
When there has been config drift of any sort that degrades a node and I have resolved the issue,
I want it to resync without forcing a reboot.
As a cluster admin using on cluster layering,
when a pool is using on cluster layering and references an internal registry
I want that registry available on the host network so that the pool can successfully scale up
(MCO-770, MCO-578, MCO-574 )
As a cluster admin using on cluster layering,
when a pool is using on cluster layering and I want to scale up nodes,
the nodes should have the same config as the other nodes in the pool.
Maybe:
Entitlements: MCO-1097, MCO-1099
Not Likely:
As a cluster admin using on cluster layering,
when I try to upgrade my cluster,
I want the upgrade operation to succeed at the same rate as non-OCL upgrades do.
Currently, we are using bare Pod objects for our image builds. While this works, it makes adding retry logic and other features much more difficult, since we would have to implement that logic ourselves. Instead, we should use Kubernetes Job objects.
Jobs have built-in mechanisms for retrying, exponential backoff, concurrency controls, etc. This frees us from having to implement complicated retry logic for build failures beyond our control such as pod evictions, etc.
Done When:
The Insights Operator syncs the customer's Simple Content Access certificate to the etc-pki-entitlement secret in the openshift-config-managed namespace every 8 hours. Currently, the user is expected to clone this secret into the MCO namespace, prior to initiating a build if they require this cert during the build process. We'd like this step automated so that user does not have to do this manual step.
Whenever a must-gather is collected, it includes all of the objects at the time of the must-gather creation. Right now, must-gathers do not include MachineOSConfigs and MachineOSBuilds, which would be useful to have for support and debugging purposes.
Done When:
Currently, it is not possible for cluster admins to revert from a pool that is opted into on-cluster builds and layered MachineConfig updates. See https://issues.redhat.com/browse/OCPBUGS-16201 for details around what happens.
It is worth mentioning that this is mostly an issue for UPI (user provided infrastructure) / bare metal users of OpenShift. For IPI cases on AWS / GCP / Azure / et al., one can simply delete the node and the machine, which will cause the Machine API to provision a fresh node to replace it, e.g.:
#!/bin/bash
node_name="$1"
node_name="${node_name/node\//}"
machine_id="$(oc get "node/$node_name" -o jsonpath='{.metadata.annotations.machine\.openshift\.io/machine}')"
machine_id="${machine_id/openshift-machine-api\//}"
oc delete --wait=false "machine/$machine_id" -n openshift-machine-api
oc delete --wait=false "node/$node_name"
Done When
Description of problem:
When we create a MOSC to enable OCL in a pool, and then we delete the MOSC resource to revert it, the MOSB and CMs are garbage collected, but we need to wait a long and unpredictable time until the nodes are updated with the new config.
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                                                 AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-0.test-2024-10-15-080246-ci-ln-0gsqflb-latest   True        False         8h      Cluster version is 4.18.0-0.test-2024-10-15-080246-ci-ln-0gsqflb-latest
How reproducible:
Always
Steps to Reproduce:
1. Create a MOSC to enable OCL in the worker pool
2. Wait until the new OCL image is applied to all worker nodes
3. Remove the MOSC resource created in step 1
Actual results:
MOSB and CMs are cleaned up, but the nodes are not updated. After a random amount of time (somewhere around 10-20 minutes), the nodes are updated.
Expected results:
There should be no long pause between the deletion of the MOSC resource and the beginning of the nodes update process.
Additional info:
As a workaround, if we add any label to the worker pool to force a sync operation the worker nodes start updating immediately.
Description of problem:
When OCL is configured in a cluster using a proxy configuration, OCL is not using the proxy to build the image.
Version-Release number of selected component (if applicable):
oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-rc.8   True        False         5h14m   Cluster version is 4.16.0-rc.8
How reproducible:
Always
Steps to Reproduce:
1. Create a cluster that uses a proxy and cannot access the internet except through this proxy. We can do it by using this flexy-install template, for example: https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/5724d9c157d51f175069c5bf09be1872173d0167/functionality-testing/aos-4_16/ipi-on-aws/versioned-installer-customer_vpc-http_proxy-multiblockdevices-fips-ovn-ipsec-ci private-templates/functionality-testing/aos-4_16/ipi-on-aws/versioned-installer-customer_vpc-http_proxy-multiblockdevices-fips-ovn-ipsec-ci
2. Enable OCL in a machineconfigpool by creating a MOSC resource
Actual results:
The build pod will not use the proxy to build the image and it will fail with a log similar to this one:
time="2024-06-25T13:38:19Z" level=debug msg="GET https://quay.io/v1/_ping"
time="2024-06-25T13:38:49Z" level=debug msg="Ping https://quay.io/v1/_ping err Get \"https://quay.io/v1/_ping\": dial tcp 44.216.66.253:443: i/o timeout (&url.Error{Op:\"Get\", URL:\"https://quay.io/v1/_ping\", Err:(*net.OpError)(0xc000220d20)})"
time="2024-06-25T13:38:49Z" level=debug msg="Accessing \"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eaa7835f2ec7d2513a76e30a41c21ce62ec11313fab2f8f3f46dd4999957a883\" failed: pinging container registry quay.io: Get \"https://quay.io/v2/\": dial tcp 44.216.66.253:443: i/o timeout"
time="2024-06-25T13:38:49Z" level=debug msg="Error pulling candidate quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eaa7835f2ec7d2513a76e30a41c21ce62ec11313fab2f8f3f46dd4999957a883: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eaa7835f2ec7d2513a76e30a41c21ce62ec11313fab2f8f3f46dd4999957a883: pinging container registry quay.io: Get \"https://quay.io/v2/\": dial tcp 44.216.66.253:443: i/o timeout"
Error: creating build container: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eaa7835f2ec7d2513a76e30a41c21ce62ec11313fab2f8f3f46dd4999957a883: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp 44.216.66.253:443: i/o timeout
time="2024-06-25T13:38:49Z" level=debug msg="shutting down the store"
time="2024-06-25T13:38:49Z" level=debug msg="exit status 125"
Expected results:
The build should be able to access the necessary resources by using the configured proxy
Additional info:
When verifying this ticket, we need to pay special attention to https proxies using their own user-ca certificate. We can use this flexy-install template: https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/5724d9c157d51f175069c5bf09be1872173d0167/functionality-testing/aos-4_16/ipi-on-osp/versioned-installer-https_proxy private-templates/functionality-testing/aos-4_16/ipi-on-osp/versioned-installer-https_proxy
In this kind of cluster it is not enough to use the proxy to build the image; we also need to use the /etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt file to be able to reach the yum repositories, since rpm-ostree will complain about an intermediate certificate (the one of the https proxy) being self-signed.
To test it we can use a custom Containerfile including something similar to:
RUN cd /etc/yum.repos.d/ && curl -LO https://pkgs.tailscale.com/stable/fedora/tailscale.repo && \
    rpm-ostree install tailscale && rpm-ostree cleanup -m && \
    systemctl enable tailscaled && \
    ostree container commit
BuildController is responsible for a lot of things. Unfortunately, it is very difficult to determine where and how BuildController does its job, which makes it more difficult to extend and modify as well as test.
Instead, it may be more useful to think of BuildController as the thing that converts MachineOSBuilds into build pods, jobs, et. al. Similar to how we have a subcontroller for dealing with build pods, we should have another subcontroller whose job is to produce MachineOSBuilds.
Done When:
Description of problem:
When OCL is enabled and we configure several MOSC resources for several MCPs, the MCD pods are restarted every few seconds. They should only be restarted once per MOSC; instead, they are continuously restarted.
Version-Release number of selected component (if applicable):
IPI on AWS version 4.17.0-0.test-2024-10-02-080234-ci-ln-2c0xsqb-latest
How reproducible:
Always
Steps to Reproduce:
1. Enable techpreview 2. Create 5 custom MCPs 3. Create one MOSC resource for each new MCP
Actual results:
MCD pods will be restarted every few seconds
$ oc get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
kube-rbac-proxy-crio-ip-10-0-31-199.us-east-2.compute.internal    1/1     Running   4          4h51m
kube-rbac-proxy-crio-ip-10-0-31-37.us-east-2.compute.internal     1/1     Running   4          4h43m
kube-rbac-proxy-crio-ip-10-0-38-189.us-east-2.compute.internal    1/1     Running   4          4h51m
kube-rbac-proxy-crio-ip-10-0-54-127.us-east-2.compute.internal    1/1     Running   3          4h43m
kube-rbac-proxy-crio-ip-10-0-69-126.us-east-2.compute.internal    1/1     Running   4          4h51m
machine-config-controller-d6bdf7d85-2wb22                         2/2     Running   0          113m
machine-config-daemon-d7t4d                                       2/2     Running   0          6s
machine-config-daemon-f7vv2                                       2/2     Running   0          12s
machine-config-daemon-h8t8z                                       2/2     Running   0          8s
machine-config-daemon-q9fhr                                       2/2     Running   0          10s
machine-config-daemon-xvff2                                       2/2     Running   0          4s
machine-config-operator-56cdd7f8fd-wlsdd                          2/2     Running   0          105m
machine-config-server-klggk                                       1/1     Running   1          4h48m
machine-config-server-pmx2n                                       1/1     Running   1          4h48m
machine-config-server-vwxjx                                       1/1     Running   1          4h48m
machine-os-builder-7fb58586bc-sq9rj                               1/1     Running   0          50m
Expected results:
MCD pods should only be restarted once for every MOSC
Additional info:
As an OpenShift cluster admin, I would like to try out on-cluster layering (OCL) to better understand how it works, how to set it up, and how to use it. To that end, a quick-start guide for what I need to do to get started as well as a troubleshooting guide would be indispensable.
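As a starting point for such a quick-start, opting a pool into OCL is driven by a MachineOSConfig resource; the sketch below is approximate, field names may differ between API versions, and the push spec and secret name are placeholders:
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineOSConfig
metadata:
  name: worker-ocl
spec:
  machineConfigPool:
    name: worker                                                      # pool being opted into on-cluster layering
  buildInputs:
    renderedImagePushspec: registry.example.com/ocl/worker-os:latest  # assumed registry/push spec
    renderedImagePushSecret:
      name: push-secret                                               # assumed secret name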
Done When:
Within BuildController, there is a lot of code concerned with creating all of the ephemeral objects for performing a build, converting secrets from one form to another, cleaning up after the build is completed, etc. Unfortunately, because of how BuildController is currently written, this code has become a bit unwieldy and difficult to modify and test. In addition, it is very difficult to reason about what is actually happening. Therefore, it should be broken up and refactored into separate modules within pkg/controller/build.
By doing this, we can have very high test granularity as well as tighter assertions for the places where it is needed the most while simultaneously allowing looser and more flexible testing for BuildController itself.
Done When:
The etcd backup API was delivered behind a feature gate in 4.14. This feature is to complete the work that allows any OCP customer to benefit from the automatic etcd backup capability.
The feature introduces automated backups of the etcd database and cluster resources in OpenShift clusters, eliminating the need for user-supplied configuration. This feature ensures that backups are taken and stored on each master node from the day of cluster installation, enhancing disaster recovery capabilities.
The current method of backing up etcd and cluster resources relies on user-configured CronJobs, which can be cumbersome and prone to errors. This new feature addresses the following key issues:
Complete work to auto-provision internal PVCs when using the local PVC backup option (right now, the user needs to create the PVC before enabling the service).
Out of Scope
The feature does not include saving cluster backups to remote cloud storage (e.g., S3 Bucket), automating cluster restoration, or providing automated backups for non-self-hosted architectures like Hypershift. These could be future enhancements (see OCPSTRAT-464)
Epic Goal*
Provide automated backups of etcd saved locally on the cluster on Day 1 with no additional config from the user.
Why is this important? (mandatory)
The current etcd automated backups feature requires some configuration on the user's part to save backups to a user specified PersistentVolume.
See: https://github.com/openshift/api/blob/ba11c1587003dc84cb014fd8db3fa597a3faaa63/config/v1alpha1/types_backup.go#L46
Before the feature can be shipped as GA, we would require the capability to save backups automatically by default without any configuration. This would help all customers have an improved disaster recovery experience by always having a somewhat recent backup.
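For contrast, the current TechPreview flow needs a user-supplied config; a minimal sketch of the config.openshift.io Backup resource as it exists today (field names per the linked v1alpha1 API, values illustrative):
apiVersion: config.openshift.io/v1alpha1
kind: Backup
metadata:
  name: default
spec:
  etcd:
    schedule: "0 */2 * * *"          # every two hours
    timeZone: UTC
    retentionPolicy:
      retentionType: RetentionNumber
      retentionNumber:
        maxNumberOfBackups: 5
    pvcName: etcd-backup-pvc          # today the user must pre-create this PVC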
Scenarios (mandatory)
Implementation details:
One issue we need to figure out during the design of this feature is how the current API might change as it is inherently tied to the configuration of the PVC name.
See:
https://github.com/openshift/api/blob/ba11c1587003dc84cb014fd8db3fa597a3faaa63/config/v1alpha1/types_backup.go#L99
and
https://github.com/openshift/api/blob/ba11c1587003dc84cb014fd8db3fa597a3faaa63/operator/v1alpha1/types_etcdbackup.go#L44
Additionally we would need to figure out how the etcd-operator knows about the available space on local storage of the host so it can prune and spread backups accordingly.
Dependencies (internal and external) (mandatory)
Depends on changes to the etcd-operator and the tech preview APIs
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Upon installing a tech-preview cluster backups must be saved locally and their status and path must be visible to the user e.g on the operator.openshift.io/v1 Etcd cluster object.
An e2e test to verify that the backups are being saved locally with some default retention policy.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
As a developer, I want to add etcd-backup-server container within a separate deployment away from the etcd static pod.
As a developer, I want to add an e2e test for the etcd-backup-server sidecar container
As a developer, I want to add etcd backup pruning logic within the etcd-backup-server sidecar container
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
OpenShift is planning to ship all payload and layered product images signed consistently via cosign with OpenShift 4.17. oc-mirror should be able to leverage this to provide a seamless signature verification experience in an offline environment by automatically making all required signature artifacts available in the offline registry.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
<your text here>
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
<your text here>
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Overview
This task is really to ensure oc-mirror v2 has backward compatibility with what v1 was doing regarding signatures
Goal
Ensure the correct configmaps are generated and stored in a folder so that the user can deploy the related artifact/s to the cluster as in v1
As a user deploying OpenShift on bare metal I want the installer to use the NTP servers that I specify at install time.
When the Ironic pre-provisioning image containing IPA is running, there is no way to sync the clocks to a custom NTP server. This causes issues with certificates - IPA generates a certificate for itself to be valid starting 1 hour in the past (see OCPBUGSM-21571), so if the hardware clock is more than 1 hour ahead of the real time then the certificate will be rejected by Ironic.
A new field is required in install-config.yaml where the user can specify additional NTP servers that can then be used to set up a chrony config in the IPA ISO. (Potentially this could also be used to automatically generate the MachineConfig manifests to add the same config to the cluster.)
See initial discussion here: OCPBUGS-22957
When the Ironic pre-provisioning image containing IPA is running, there is no way to sync the clocks to a custom NTP server. This causes issues with certificates - IPA generates a certificate for itself to be valid starting 1 hour in the past (see OCPBUGSM-21571), so if the hardware clock is more than 1 hour ahead of the real time then the certificate will be rejected by Ironic.
A new field is required in install-config.yaml where the user can specify additional NTP servers that can then be used to set up a chrony config in the IPA ISO. (Potentially this could also be used to automatically generate the MachineConfig manifests to add the same config to the cluster.)
See initial discussion here: OCPBUGS-22957
Create an ICC patch that will read the new env variable for additional NTP servers and use it to create a chrony ignition file.
Create a CBO patch to add a field for additional NTP servers that will be passed to image customization.
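A minimal sketch of the ICC-side handling described above, assuming the additional NTP servers arrive as a comma-separated environment variable (the variable name ADDITIONAL_NTP_SERVERS and the iburst option are illustrative assumptions, not the actual implementation):

```go
// Minimal sketch: turn a comma-separated list of NTP servers from an
// environment variable into a chrony.conf snippet that could then be embedded
// into the IPA ignition. The variable name is hypothetical.
package main

import (
	"fmt"
	"os"
	"strings"
)

func chronyConf(servers []string) string {
	var b strings.Builder
	for _, s := range servers {
		fmt.Fprintf(&b, "server %s iburst\n", strings.TrimSpace(s))
	}
	return b.String()
}

func main() {
	raw := os.Getenv("ADDITIONAL_NTP_SERVERS") // hypothetical env variable
	if raw == "" {
		return // nothing to do, keep the default chrony config
	}
	conf := chronyConf(strings.Split(raw, ","))
	fmt.Print(conf) // in ICC this content would be written into an ignition storage file
}
```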
Feature description
oc-mirror v2 focuses on major enhancements that make oc-mirror faster and more robust, introduce caching, and address more complex air-gapped scenarios. oc-mirror v2 is a rewritten version with three goals:
Check if it is possible to delete operators using the delete command when the previous command was mirror to mirror. Probably it won't work because in mirror to mirror the cache is not updated.
It is necessary to find a solution for this scenario.
oc-mirror should account for users who are relying on oc-mirror v1 in production and accommodate an easy migration:
The way of tagging images for releases, operators and additional images is different between v1 and v2. So it is necessary to have some kind of migration feature in order to enable customers to migrate from one version to the other.
Use cases:
The solution is still to be discussed.
Customers who deploy a large number of OpenShift on OpenStack clusters want to minimise the resource requirements of their cluster control planes.
Customers deploying RHOSO (OpenShift services for OpenStack, i.e. OpenStack control plane on bare metal OpenShift) already have a bare metal management cluster capable of serving Hosted Control Planes.
We should enable self-hosted (i.e. on-prem) Hosted Control Planes to serve Hosted Control Planes to OpenShift on OpenStack clusters, with a specific focus of serving Hosted Control Planes from the RHOSO management cluster.
As an enterprise IT department and OpenStack customer, I want to provide self-managed OpenShift clusters to my internal customers with minimum cost to the business.
As an internal customer of said enterprise, I want to be able to provision an OpenShift cluster for myself using the business's existing OpenStack infrastructure.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
TBD
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
In OSASINFRA-3483, we modified openshift/cluster-storage-operator to integrate support for kustomize and provide the infrastructure to generate two sets of assets: one for standalone deployment, and one for hypershift deployment. In this story, we will track actually adding support for the latter.
In OSASINFRA-3610, we merged the openshift/csi-driver-manila-operator repository into openshift/csi-operator and modified it to take advantage of the new generator framework provided therein. Now, we want to build on this, adding Hypershift-specific assets and tweaking whatever else is needed.
In OSASINFRA-3608, we merged the openshift/openstack-cinder-csi-driver-operator repository into openshift/csi-operator and modified it to take advantage of the new generator framework provided therein. Now, we want to build on this, adding Hypershift-specific assets and tweaking whatever else is needed.
In OSASINFRA-3483, we modified openshift/cluster-storage-operator to integrate support for kustomize and provide the infrastructure to generate two sets of assets: one for standalone deployment, and one for hypershift deployment. In this story, we will track actually adding support for the latter.
We want to prepare cluster-storage-operator for eventual Hypershift integration. To this end, we need to migrate the assets and references to same to integrate kustomize. This will likely look similar to https://github.com/openshift/cluster-storage-operator/pull/318 once done (albeit, without the Hypershift work).
This is a container Epic for tasks which we know need to be done for Tech Preview but which we don't intend to do now. It needs to be groomed before it is useful for planning.
We don't need to create another service for Ingress, so we can save a FIP.
Matthew Booth is worried about the feature we added to pre-create a FIP and assign it to the Service object for router-default. This is indeed racy and could be problematic if another controller took over that field as well; it would create infinite loops and the result wouldn't be great for customers.
The idea is to remove that feature now and eventually add it back later when it's safer (e.g. feature added to the Ingress operator?). It's worth noting that core kubernetes has deprecated the loadBalancerIP field in the Service object, and it now works with annotations. Maybe we need to investigate that path.
Right now, our pods are SingleReplica because having multiple replicas requires more than one zone for nodes, which translates to availability zones (AZs) in OpenStack. We need to figure that out.
We should not have to explicitly configure the location of the clouds.yaml file, since there is a list of well-known places where these can be found. We should also be able to configure the cloud used from the chosen clouds.yaml.
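A minimal sketch of resolving clouds.yaml from well-known locations; the search order shown (working directory, ~/.config/openstack, /etc/openstack) is the conventional one, but the exact precedence we adopt is still to be confirmed:

```go
// Minimal sketch of resolving clouds.yaml from the conventional search path
// instead of requiring explicit configuration. The exact precedence used in
// the final implementation may differ; this only illustrates the idea.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func findCloudsYAML() (string, bool) {
	home, _ := os.UserHomeDir()
	candidates := []string{
		"clouds.yaml", // current working directory
		filepath.Join(home, ".config", "openstack", "clouds.yaml"),
		"/etc/openstack/clouds.yaml",
	}
	for _, c := range candidates {
		if _, err := os.Stat(c); err == nil {
			return c, true
		}
	}
	return "", false
}

func main() {
	if path, ok := findCloudsYAML(); ok {
		fmt.Println("using clouds.yaml at", path)
	} else {
		fmt.Println("no clouds.yaml found in well-known locations")
	}
}
```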
Being able to connect the node pools to additional networks, like we support already on standalone clusters.
This task will be necessary for some use cases, like using Manila CSI on a storage network, or running NFV workload on a SRIOV provider network or also running ipv6 dual stack workloads on a provider network.
I see at least 2 options:
One thing we need to solve as well is the fact that when a Node has > 1 port, kubelet won't necessarily listen on the primary interface. We need to address that too; and it seems CPO has an option to define the primary network name: https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/openstack-cloud-controller-manager/using-openstack-cloud-controller-manager.md#networking
If we don't solve that, the nodepool (worker) won't join the cluster since Kubelet might listen on the wrong interface.
When the management cluster runs on AWS, make sure we update the DNS record for *.apps, so ingress can work out of the box.
HyperShift should be able to deploy the minimum useful OpenShift cluster on OpenStack. This is the minimum requirement to be able to test it. It is not sufficient for GA.
Stop using the openshift/installer-aro repo during installation of ARO cluster. installer-aro is a fork of openshift/installer with carried patches. Currently it is vendored into openshift/installer-aro-wrapper in place of the upstream installer.
Maintaining this fork requires considerable resources from the ARO team, and results in delays of offering new OCP releases through ARO. Removing the fork will eliminate the work involved in keeping it up to date from this process.
https://docs.google.com/document/d/1xBdl2rrVv0EX5qwhYhEQiCLb86r5Df6q0AZT27fhlf8/edit?usp=sharing
It appears that the only work required to complete this is to move the additional assets that installer-aro adds for the purpose of adding data to the ignition files. These changes can be directly added to the ignition after it is generated by the wrapper. This is the same thing that would be accomplished by OCPSTRAT-732, but that ticket involves adding a Hive API to do this in a generic way.
The OCP Installer team will contribute code changes to installer-aro-wrapper necessary to eliminate the fork. The ARO team will review and test changes.
The fork repo is no longer vendored in installer-aro-wrapper.
Add results here once the Initiative is started. Recommend discussions & updates once per quarter in bullets.
Currently the Azure client can only be mocked in unit tests of the pkg/asset/installconfig/azure package. Using the mockable interface consistently and adding a public interface to set it up will allow other packages to write unit tests for code involving the Azure client.
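A minimal sketch of the pattern, with an illustrative (not the installer's actual) interface, showing how another package's unit test could inject a fake Azure client:

```go
// Minimal sketch: consume the Azure client through a small interface so that
// packages outside pkg/asset/installconfig/azure can substitute a fake in unit
// tests. The interface and method names here are illustrative assumptions,
// not the installer's real API.
package example

import (
	"context"
	"fmt"
)

// ResourceGroupsClient is a hypothetical narrow interface over the Azure SDK calls we need.
type ResourceGroupsClient interface {
	GroupExists(ctx context.Context, name string) (bool, error)
}

// Validator depends on the interface rather than the concrete SDK client.
type Validator struct {
	Client ResourceGroupsClient
}

func (v *Validator) ValidateGroup(ctx context.Context, name string) error {
	ok, err := v.Client.GroupExists(ctx, name)
	if err != nil {
		return err
	}
	if !ok {
		return fmt.Errorf("resource group %q not found", name)
	}
	return nil
}

// fakeClient is what another package's unit test would provide instead of the real SDK.
type fakeClient struct{ exists bool }

func (f *fakeClient) GroupExists(ctx context.Context, name string) (bool, error) {
	return f.exists, nil
}
```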
We deprecated "DeploymentConfig" in-favor of "Deployment" in OCP 4.14
Now in 4.18 we want to make "Deployment" the default out of the box, which means customers will get Deployment when they install OCP 4.18.
DeploymentConfig will still be available in 4.18 as a non-default option for users who still want to use it.
FYI: "DeploymentConfig" is a tier 1 API in OpenShift and cannot be removed from the 4.x product.
Please Review this FAQ : https://docs.google.com/document/d/1OnIrGReZKpc5kzdTgqJvZYWYha4orrGMVjfP1fUpljY/edit#heading=h.oranye5nwtsy
Epic Goal*
WRKLDS-695 was implemented to put DeploymentConfig (DC) behind a capability in 4.14. To prepare customers for migration to Deployments, the capability was enabled by default. After three releases we need to reconsider whether disabling the capability by default is feasible.
More about capabilities in https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md#capability-sets.
Why is this important? (mandatory)
Disabling a capability by default makes an OCP installation lighter. Fewer components running by default reduces the security risk/vulnerability surface.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
None
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Drawbacks or Risk (optional)
None. The DC capability can be enabled if needed.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Before the DCs can be disabled by default all the relevant e2e relying on DCs need to be migrated to Deployments to maintain the same testing coverage.
This feature enables users of Hosted Control Planes (HCP) on bare metal to provision spoke clusters from ACM at scale, supporting hundreds to low thousands of clusters per hub cluster. It will use ACM's multi-tenancy to prevent interference across clusters. The implementation assumes the presence of workers in hosted clusters (either bare metal or KubeVirt).
We have a customer requirement to allow for massive scale & TCO reduction via Multiple ACM Hubs on a single OCP Cluster - Kubevirt Version
When using OpenShift in a mixed, multi-architecture environment, some key details or checks are not always available. With this feature we will take a first pass at improving the UI/UX for customers as adoption of this configuration continues at pace.
The UI/UX experience should be improved when used in a mixed-architecture OCP cluster
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Y |
Classic (standalone cluster) | Y |
Hosted control planes | Y |
Multi node, Compact (three node), or Single node (SNO), or all | Y |
Connected / Restricted Network | Y |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All architectures |
Operator compatibility | n/a |
Backport needed (list applicable versions) | n/a |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | OpenShift Console |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Epic Goal
Why is this important?
Scenarios
1. …
Acceptance Criteria
Dependencies (internal and external)
1. …
Previous Work (Optional):
1. …
Open questions::
1. …
Done Checklist
Add support for the GCP N4 Machine Series to be used as Control Plane and Compute Nodes when deploying OpenShift on Google Cloud
As a user, I want to deploy OpenShift on Google Cloud using the N4 Machine Series for the Control Plane and Compute Nodes so I can take advantage of these new Machine types
OpenShift can be deployed in Google Cloud using the new N4 Machine Series for the Control Plane and Compute Nodes
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | all |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Google has made the N4 Machine Series available on their cloud offering. These Machine Series use "hyperdisk-balanced" disks for the boot device, which are not currently supported
The documentation will be updated adding the new disk type that needs to be supported as part of this enablement. Also the N4 Machine Series will be added as tested Machine types for Google Cloud when deploying OpenShift
As a oc-mirror user, I would like mirrored operator catalogs to reflect the mirrored operators only, so that, after I mirror my catalog I can check that it contains the filtered operators using:
$ oc-mirror list operators --catalog mirror.syangsao.net:8443/ocp4/redhat/redhat-operator-index:v4.12
In oc-mirror v2 (and in v1 after bug fix OCPBUGS-31536), oc-mirror doesn't rebuild catalogs.
As a oc-mirror user, I would like mirrored operator catalogs to reflect the mirrored operators only, so that, after I mirror my catalog I can check that it contains the filtered operators using:
oc-mirror list operators --catalog mirror.syangsao.net:8443/ocp4/redhat/redhat-operator-index:v4.12
In oc-mirror v2 (and in v1 after bug fix OCPBUGS-31536), oc-mirror doesn't rebuild catalogs.
This user story is to cover all the scenarios that were not covered by CLID-230
Currently buildah can cause problems due to the unshare call.
Some problems were found, for example:
image is not a manifest list
and the only way out was to rm -fr $HOME/.local/share/containers/storage
From this comment: https://github.com/openshift/oc-mirror/pull/937#discussion_r1812240892
In order to keep the single-responsibility principle, the rebuild of the catalog should happen outside the collector phase.
Each filtered catalog should have its own folder, named by the digest of its contents, and inside this folder the following items should be present:
From this comment: https://github.com/openshift/oc-mirror/pull/937#discussion_r1806084982
Since o.Opts is already passed to imagebuilder.NewBuilder(), passing o.Opts.SrcImage.TlsVerify and o.Opts.DestImage.TlsVerify is not needed as additional arguments.
From this comment: https://github.com/openshift/oc-mirror/pull/937#discussion_r1812461948
Ideally, ImageBuilderInterface would be the interface for building any kind of image. Since RebuildCatalogs is very specific to catalog images, it would be better to have a separate interface only for that, or to reuse BuildAndPush.
From this comment: https://github.com/openshift/oc-mirror/pull/937#discussion_r1806145892
Keeping the container file used to filter the catalog in the working-dir can help in troubleshooting.
Maybe look at adding spinners here, indicating which catalog is currently being processed...
This implies that we generate a new declarative config containing only a portion of the original declarative config.
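A minimal sketch of that kind of filtering, treating the declarative config as a stream of JSON blobs keyed by package name; the field handling is an assumption for illustration and is not the oc-mirror implementation:

```go
// Minimal sketch of producing a reduced declarative config that keeps only the
// selected packages. The field names ("package", "name") follow olm
// declarative config conventions, but this is an illustration only.
package main

import (
	"encoding/json"
	"io"
	"os"
)

func filterDC(in io.Reader, out io.Writer, keep map[string]bool) error {
	dec := json.NewDecoder(in)
	enc := json.NewEncoder(out)
	for {
		var blob map[string]interface{}
		if err := dec.Decode(&blob); err == io.EOF {
			return nil
		} else if err != nil {
			return err
		}
		pkg, _ := blob["package"].(string)
		if pkg == "" {
			pkg, _ = blob["name"].(string) // olm.package blobs carry the name directly
		}
		if keep[pkg] {
			if err := enc.Encode(blob); err != nil {
				return err
			}
		}
	}
}

func main() {
	// Example: keep only one operator package from the catalog on stdin.
	_ = filterDC(os.Stdin, os.Stdout, map[string]bool{"aws-load-balancer-operator": true})
}
```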
Acceptance criteria:
This story is about creating an image that contains opm, the declarative config (and optionally the cache)
Multiple solutions here:
Acceptance criteria:
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Merge the CSI driver operator into csi-operator repo and re-use asset generator and CSI operator code there.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Maintaining a separate CSI driver operator repo is hard, especially when dealing with CVEs and library bumps. In addition, we could share even more code by moving all CSI driver operators into a single repo. Having a common repo across drivers will ease the maintenance burden.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
As cluster admin, I upgrade my cluster to a version with this epic implemented and I do not see any change, the CSI driver works the same as before. (Some pods, their containers or services may get renamed during the upgrade process).
As OCP developer, I have 1 less repo to worry about when fixing a CVE / bumping library-go or Kubernetes libraries.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | yes |
Classic (standalone cluster) | yes |
Hosted control planes | all |
Multi node, Compact (three node), or Single node (SNO), or all | all |
Connected / Restricted Network | all |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all |
Operator compatibility | |
Backport needed (list applicable versions) | no |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | no |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
As cluster admin, I upgrade my cluster to a version with this epic implemented and I do not see any change, the CSI driver works the same as before. (Some pods, their containers or services may get renamed during the upgrade process).
As OCP developer, I have 1 less repo to worry about when fixing a CVE / bumping library-go or Kubernetes libraries.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
N/A includes all the CSI operators Red Hat manages as part of OCP
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
This effort started with the CSI operators that we included for HCP; we want to align all CSI operators to use the same approach in order to limit maintenance efforts.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Not customer facing, this should not introduce any regression.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
No doc needed
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
N/A, it's purely tech debt / internal
Epic Goal*
Merge the CSI driver operator into csi-operator repo and re-use asset generator and CSI operator code there.
Why is this important? (mandatory)
Maintaining separate CSI driver operator repo is hard, especially when dealing with CVEs and library bumps. In addition, we could share even more code when moving all CSI driver operators into a single repo.
Scenarios (mandatory)
As cluster admin, I upgrade my cluster to a version with this epic implemented and I do not see any change, the CSI driver works the same as before. (Some pods, their containers or services may get renamed during the upgrade process).
As OCP developer, I have 1 less repo to worry about when fixing a CVE / bumping library-go or Kubernetes libraries.
Dependencies (internal and external) (mandatory)
None, this can be done just by the storage team and independently on other operators / features.
Contributing Teams(and contacts) (mandatory)
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Implement the following step from the enhancement
Implement the following step of the enhancement
Implement one of the post migration steps
Epic Goal*
Merge the CSI driver operator into csi-operator repo and re-use asset generator and CSI operator code there.
Why is this important? (mandatory)
Maintaining separate CSI driver operator repo is hard, especially when dealing with CVEs and library bumps. In addition, we could share even more code when moving all CSI driver operators into a single repo.
Scenarios (mandatory)
As cluster admin, I upgrade my cluster to a version with this epic implemented and I do not see any change, the CSI driver works the same as before. (Some pods, their containers or services may get renamed during the upgrade process).
As OCP developer, I have 1 less repo to worry about when fixing a CVE / bumping library-go or Kubernetes libraries.
Note: we do not plan to do any changes for HyperShift. The EFS CSI driver will still fully run in the guest cluster, including its control plane.
Dependencies (internal and external) (mandatory)
None, this can be done just by the storage team and independently on other operators / features.
Contributing Teams(and contacts) (mandatory)
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Epic Goal*
Merge the CSI driver operator into csi-operator repo and re-use asset generator and CSI operator code there.
Why is this important? (mandatory)
Maintaining separate CSI driver operator repo is hard, especially when dealing with CVEs and library bumps. In addition, we could share even more code when moving all CSI driver operators into a single repo.
Scenarios (mandatory)
As cluster admin, I upgrade my cluster to a version with this epic implemented and I do not see any change, the CSI driver works the same as before. (Some pods, their containers or services may get renamed during the upgrade process).
As OCP developer, I have 1 less repo to worry about when fixing a CVE / bumping library-go or Kubernetes libraries.
Dependencies (internal and external) (mandatory)
None, this can be done just by the storage team and independently on other operators / features.
Contributing Teams(and contacts) (mandatory)
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Following the step described in the enhancement, we should do the following:
Once this is done, we can work towards rewriting the operator to take advantage of the new generator tooling used for existing migrated operators.
In OSASINFRA-3609 we moved the existing Cinder CSI Driver Operator from openshift/openstack-cinder-csi-driver-operator to openshift/csi-operator, adding the contents of the former in a legacy/openstack-cinder-csi-driver-operator directory in the latter. Now, we need to rework or adapt this migrated code to integrate it fully into csi-operator.
Following the step described in the enhancement, we should do the following:
Once this work is complete, we can investigate adding HyperShift support to this driver. That work will be tracked and addressed via a separate epic.
Intel VROC (Virtual RAID on CPU) is a nontraditional RAID option that can offer some management and potential performance improvements compared to traditional hardware RAID. RAID devices can be set up from firmware or via remote management tools and present as MD devices.
Initial support was delivered in OpenShift 4.16. This feature is to enhance that support by:
Any technologies not already supported by the RHEL kernel.
https://www.intel.com/content/www/us/en/software/virtual-raid-on-cpu-vroc.html
Interoperability Considerations
Allow users of Intel VROC hardware to deploy OpenShift to it via the Assisted Installer.
https://www.intel.com/content/www/us/en/software/virtual-raid-on-cpu-vroc.html
Currently the support only exists with UPI deployments. The Assisted Installer blocks it.
Assisted Installer can deploy to hardware using the Intel VROC.
Yes
Intel VROC support exists in OpenShift, just not in the Assisted Installer; this epic seeks to add it.
We support Intel VROC with OpenShift UPI but Assisted Installer blocks it. Please see https://issues.redhat.com/browse/SUPPORTEX-22763 for full details of testing and results.
Customers using Intel VROC with OpenShift will want to use Assisted Installer for their deployments. As do we.
TBC
Assisted installer is part of NPSS so this will benefit Telco customers using NPSS with Intel VROC.
Brings Assisted installer into alignment with the rest of the product.
Generally speaking, customers and partners should not be installing packages client-side, i.e. `rpm-ostree install $pkg` directly on the nodes. It's not officially supported outside of troubleshooting situations, but the documentation is not very explicit on this point and we have anecdotal data that customers and partners do in fact install packages directly on hosts.
Adding some telemetry to help understand how common this is among data-reporting clusters. Hopefully such data will help us understand how important it is to preserve this ability in the bootc-world. While it's not a pattern we want to encourage, we should be careful about dropping it without considering how to avoid breaking users' clusters in unexpected ways.
Understand what % of machines (or a proxy thereof) have locally layered packages which aren't CoreOS extensions.
This needs to be backported to 4.14 so we have a better sense of the fleet as it is.
4.12 might be useful as well, but is optional.
Why not simply block upgrades if there are locally layered packages?
That is indeed an option. This card is only about gathering data.
Some customers are known to layer packages locally but it's worse if the issue is a third party integration. In such a case, if the add-on breaks, the customer will call the 3rd party first because that's what appears to be broken. It may be a long, undelightful trip to get to a satisfying resolution. If they are blocked on upgrade due to that 3rd party integration they may not be able to upgrade the OCP y-version. That could be a lengthy delay.
Description copied from attached feature card: https://issues.redhat.com/browse/OCPSTRAT-1521
Generally speaking, customers and partners should not be installing packages client-side, i.e. `rpm-ostree install $pkg` directly on the nodes. It's not officially supported outside of troubleshooting situations, but the documentation is not very explicit on this point and we have anecdotal data that customers and partners do in fact install packages directly on hosts.
Adding some telemetry to help understand how common this is among data-reporting clusters. Hopefully such data will help us understand how important it is to preserve this ability in the bootc-world. While it's not a pattern we want to encourage, we should be careful about dropping it without considering how to avoid breaking users' clusters in unexpected ways.
Understand what % of machines (or a proxy thereof) have locally layered packages which aren't CoreOS extensions.
This needs to be backported to 4.14 so we have a better sense of the fleet as it is.
4.12 might be useful as well, but is optional.
Why not simply block upgrades if there are locally layered packages?
That is indeed an option. This card is only about gathering data.
Some customers are known to layer packages locally but it's worse if the issue is a third party integration. In such a case, if the add-on breaks, the customer will call the 3rd party first because that's what appears to be broken. It may be a long, undelightful trip to get to a satisfying resolution. If they are blocked on upgrade due to that 3rd party integration they may not be able to upgrade the OCP y-version. That could be a lengthy delay.
create an e2e test that confirms that metrics collection in the MCD works and that it collects unsupported package installations using rpm-ostree
Implement the logic in the MCO Daemon to collect the defined metrics and send them to Prometheus. For the Prometheus side of things, this will involve some manipulation in `metrics.go`.
Acceptance Criteria:
1. The MCO daemon should collect package installation data (defined from the spike MCO-1275) during its normal operation.
2. The daemon should report this data to Prometheus at a specified time interval (defined from spike MCO-1277).
3. Include error handling for scenarios where the rpm-ostree command fails or returns unexpected results.
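A minimal sketch of the kind of collection logic described above; the metric name is hypothetical and the rpm-ostree JSON field names are assumptions to verify against the version shipped in RHCOS:

```go
// Minimal sketch (not the MCO's actual metrics.go): count locally layered
// packages reported by `rpm-ostree status --json` and expose the count as a
// Prometheus gauge. The metric name is hypothetical, and the JSON field names
// ("deployments", "booted", "requested-packages") are assumptions to verify.
package metrics

import (
	"encoding/json"
	"os/exec"

	"github.com/prometheus/client_golang/prometheus"
)

var layeredPackages = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "mcd_local_layered_packages", // hypothetical metric name
	Help: "Locally layered rpm-ostree packages on the booted deployment.",
})

type rpmOstreeStatus struct {
	Deployments []struct {
		Booted            bool     `json:"booted"`
		RequestedPackages []string `json:"requested-packages"`
	} `json:"deployments"`
}

func updateLayeredPackageMetric() error {
	out, err := exec.Command("rpm-ostree", "status", "--json").Output()
	if err != nil {
		return err // acceptance criterion 3: surface rpm-ostree failures instead of guessing
	}
	var status rpmOstreeStatus
	if err := json.Unmarshal(out, &status); err != nil {
		return err
	}
	for _, d := range status.Deployments {
		if d.Booted {
			layeredPackages.Set(float64(len(d.RequestedPackages)))
		}
	}
	return nil
}

func init() {
	prometheus.MustRegister(layeredPackages)
}
```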
TBD
Implement authorization to secure API access for different user personas/actors in the agent-based installer.
User Personas:
This is
The agent-based installer APIs have implemented basic security measures through authentication, as covered in AGENT-145.
To further enhance security, it is crucial to implement user persona/actor-based authorization, allowing for differentiated access control, such as read-only or read-write permissions, based on the user's role.
The goal of this implementation is to provide a more robust and secure API framework, ensuring that users can only perform actions appropriate to their role.
As a developer working on the Assisted Service, I want to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a wait-for and monitor-add-nodes user, I want to be able to:
So that I can achieve:
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a user with userAuth, agentAuth, and watcherAuth persona (wait-for and monitor-add-nodes):
So that I can achieve:
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Support mapping OpenShift zones to vSphere host groups, in addition to vSphere clusters.
When defining zones for vSphere administrators can map regions to vSphere datacenters and zones to vSphere clusters.
There are use cases where vSphere clusters have only one cluster construct with all their ESXi hosts but the administrators want to divide the ESXi hosts into host groups. A common example is vSphere stretched clusters, where there is only one logical vSphere cluster but the ESXi nodes are distributed across two physical sites and grouped by site in vSphere host groups.
In order for OpenShift to be able to distribute its nodes on vSphere matching the physical grouping of hosts, OpenShift zones have to be able to map to vSphere host groups too.
Support mapping OpenShift zones to vSphere host groups, in addition to vSphere clusters.
When defining zones for vSphere administrators can map regions to vSphere datacenters and zones to vSphere clusters.
There are use cases where vSphere clusters have only one cluster construct with all their ESXi hosts but the administrators want to divide the ESXi hosts into host groups. A common example is vSphere stretched clusters, where there is only one logical vSphere cluster but the ESXi nodes are distributed across two physical sites and grouped by site in vSphere host groups.
In order for OpenShift to be able to distribute its nodes on vSphere matching the physical grouping of hosts, OpenShift zones have to be able to map to vSphere host groups too.
Support in the IPI installer for OpenShift on vSphere to create the OpenShift node VMs with multiple NICs and subnets.
This is necessary when users want to have dedicated network links in the node VMs for storage or database traffic, for example, in addition to the service network link that we create now
Requirements
Users can specify multiple NICs for the OpenShift VMs that will be created for the OpenShift cluster nodes with different subnets.
Support in the IPI installer for OpenShift on vSphere to create the OpenShift node VMs with multiple NICs and subnets.
This is necessary when users want to have dedicated network links in the node VMs for storage or database traffic, for example, in addition to the service network link that we create now
Requirements
Users can specify multiple NICs for the OpenShift VMs that will be created for the OpenShift cluster nodes with different subnets.
Description:
The machine config operator needs to be bumped to pick up the API change:
I0819 17:50:00.396986 1 machineconfig.go:87] ControllerConfig not found, creating new one
E0819 17:50:00.400599 1 machineconfig.go:90] Failed to create ControllerConfig: ControllerConfig.machineconfiguration.openshift.io "machine-config-controller" is invalid: [spec.infra.spec.platformSpec.vsphere.failureDomains[0].topology.networks: Too many: 2: must have at most 1 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
Acceptance Criteria:
Description:
The infrastructure spec validation needs to be updated to change the network count restriction to 10 (https://configmax.esp.vmware.com/guest?vmwareproduct=vSphere&release=vSphere%208.0&categories=1-0).
When multiple NICs are enabled (does the installer allow this?), bootstrapping fails with:
Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: [#1673] failed to create some manifests:
Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: "cluster-infrastructure-02-config.yml": failed to create infrastructures.v1.config.openshift.io/cluster -n : Infrastructure.config.openshift.io "cluster" is invalid: [spec.platformSpec.vsphere.failureDomains[0].topology.networks: Too many: 2: must have at most 1 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
Acceptance Criteria:
issue created by splat-bot
USER STORY:
As an OpenShift provisioner, I want to provision a cluster in which nodes have multiple network adapters so that I can implement the desired network topology.
DESCRIPTION:
Customers have a need to provision nodes with multiple adapters in day 0. capv supports the ability to specify multiple adapters in its clone spec. The installer should be augmented to support additional NICs.
Required:
Nice to have:
...
ACCEPTANCE CRITERIA:
ENGINEERING DETAILS:
The machine API is failing to render compute nodes when multiple NICs are configured:
Unable to apply 4.17.0-0.ci.test-2024-08-15-193100-ci-ln-igm0nhk-latest: ControllerConfig.machineconfiguration.openshift.io "machine-config-controller" is invalid: [spec.infra.spec.platformSpec.vsphere.failureDomains[0].topology.networks: Too many: 2: must have at most 1 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
Description:
Bump machine-api to pick up changes in openshift/api#2002.
Acceptance Criteria:
issue created by splat-bot
Improve the cluster expansion with the agent workflow added in OpenShift 4.16 (TP) and OpenShift 4.17 (GA) with:
Improve the user experience and functionality of the commands to add nodes to clusters using the image creation functionality.
Currently dev-scripts supports the add-nodes workflow only via the ISO. We should be able to select the mode used to add a node via an explicit config variable, so that the PXE approach can also be used.
Improve the output shown for the monitor command, especially in the case of multiple nodes, so that it is more readable.
Note
A possible approach could be to change the monitoring logic into a polling loop, where nodes are grouped by "stages". A stage represents how far a node has progressed through the add workflow (the stages have not yet been defined).
Run integration tests for presubmit jobs in the installer repo
This page https://github.com/openshift/installer/blob/master/docs/user/agent/agent-services.md needs to be updated, to reflect the new services available in case of add nodes workflow vs install workflow
The add-nodes-image command may also generate PXE artifacts (instead of the ISO). This will require an additional command flag (and review the command name)
(also evaluate the possibility of using a sub-command instead)
Currently the oc node-image create command looks for the kube-system/cluster-config-v1 resource to infer some of the required elements for generating the ISO.
The main issue is that the kube-system/cluster-config-v1 resource may be stale, since it contains information used when the cluster was installed, and that may have changed during the lifetime of the cluster.
tech note about the replacement
Field | Source |
---|---|
APIDNSName | oc get infrastructure cluster -o=jsonpath='{.status.apiServerURL}' |
ImageDigestSource | oc get imagedigestmirrorsets image-digest-mirror -o=jsonpath='{.spec.imageDigestMirrors}' |
ImageContentSources | oc get imagecontentsourcepolicy |
ClusterName | Derived from APIDNSName (api.<cluster name>.<base domain>) |
SSHKey | oc get machineconfig 99-worker-ssh -o jsonpath='{.spec.config.passwd.users[0].sshAuthorizedKeys}' |
FIPS | oc get machineconfig 99-worker-ssh -o jsonpath='{.spec.fips}' |
(see also Zane Bitter comment in https://issues.redhat.com/browse/OCPBUGS-38802)
Currently the oc node-image create command does not report any relevant information that could help the user understand where each element was retrieved from (for example, the SSH key), thus making it more difficult to troubleshoot an eventual issue.
For this reason, it could be useful for the node-joiner tool to produce a proper JSON file reporting all the details about the relevant resources fetched for generating the image. The oc command should be able to expose them when required (i.e. via a command flag).
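One possible shape for such a report file, sketched as Go types; the struct and field names are hypothetical and only illustrate the provenance data the node-joiner could emit for oc to surface:

```go
// Hypothetical report structure for the node-joiner provenance output.
// Field names are illustrative, not an agreed-upon schema.
package report

// ResolvedInput records one value used to build the image and where it came from.
type ResolvedInput struct {
	Name   string `json:"name"`   // e.g. "sshKey"
	Value  string `json:"value"`  // resolved value (possibly redacted)
	Source string `json:"source"` // e.g. "machineconfig/99-worker-ssh"
}

// Report is the JSON document the node-joiner would write and oc would read back.
type Report struct {
	ClusterID      string          `json:"clusterId,omitempty"`
	ResolvedInputs []ResolvedInput `json:"resolvedInputs"`
	Warnings       []string        `json:"warnings,omitempty"`
}
```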
Currently the error reporting of the oc node-image create command is pretty rough, as it prints in the console the log traces captured from the node-joiner pod standard output. Even though this could help the user understand the problem, many unnecessary technical details are exposed, making the overall experience cumbersome.
For this reason, the node-joiner tool should generate a proper JSON file with the outcome of the action, including any error messages encountered.
The oc command should fetch this JSON output and report it in the console, instead of showing the node-joiner pod logs output.
Also provide a flag to report the full pod logs, for troubleshooting purposes.
Manage backward compatibility with older versions of node-joiner that do not support the enhanced output.
Support adding nodes using PXE files instead of ISO.
Questions
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
A set of capabilities need to be added to the Hypershift Operator that will enable AWS Shared-VPC deployment for ROSA w/ HCP.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Build capabilities into HyperShift Operator to enable AWS Shared-VPC deployment for ROSA w/ HCP.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Antoni Segura Puimedon Please help with providing what Hypershift will need on the OCPSTRAT side.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | (perhaps) both |
Classic (standalone cluster) | |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_64 and Arm |
Operator compatibility | |
Backport needed (list applicable versions) | 4.14+ |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | no (this is an advanced feature not being exposed via web-UI elements) |
Other (please specify) | ROSA w/ HCP |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Currently the same SG is used for both workers and VPC endpoint. Create a separate SG for the VPC endpoint and only open the ports necessary on each.
"Shared VPCs" are a unique AWS infrastructure design: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-sharing.html
See prior work/explanations/etc here: https://issues.redhat.com/browse/SDE-1239
Summary is that in a Shared VPC environment, a VPC is created in Account A and shared to Account B. The owner of Account B wants to create a ROSA cluster, however Account B does not have permissions to create a private hosted zone in the Shared VPC. So they have to ask Account A to create the private hosted zone and link it to the Shared VPC. OpenShift then needs to be able to accept the ID of that private hosted zone for usage instead of creating the private hosted zone itself.
QE should have some environments or testing scripts available to test the Shared VPC scenario
The AWS endpoint controller in the CPO currently uses the control plane operator role to create the private link endpoint for the hosted cluster as well as the corresponding dns records in the hypershift.local hosted zone. If a role is created to allow it to create that vpc endpoint in the vpc owner's account, the controller would have to explicitly assume the role so it can create the vpc endpoint, and potentially a separate role for populating dns records in the hypershift.local zone.
The users would need to create a custom policy to enable this
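A minimal sketch of that explicit role assumption using aws-sdk-go v1; the role ARNs are placeholders and the real wiring lives in the CPO's AWS endpoint controller:

```go
// Minimal sketch: build AWS clients that assume dedicated roles so the
// controller can create the VPC endpoint in the VPC owner's account and
// manage records in the hypershift.local hosted zone. Role ARNs are placeholders.
package awsendpoint

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials/stscreds"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/route53"
)

func newSharedVPCClients(vpcEndpointRoleARN, dnsRoleARN string) (*ec2.EC2, *route53.Route53) {
	sess := session.Must(session.NewSession())

	// Assume the role that permits creating the VPC endpoint in the VPC owner's account.
	ec2Client := ec2.New(sess, &aws.Config{
		Credentials: stscreds.NewCredentials(sess, vpcEndpointRoleARN),
	})

	// Potentially a separate role for populating the hypershift.local DNS records.
	route53Client := route53.New(sess, &aws.Config{
		Credentials: stscreds.NewCredentials(sess, dnsRoleARN),
	})

	return ec2Client, route53Client
}
```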
Add the necessary API fields to support a Shared VPC infrastructure, and enable development/testing of Shared VPC support by adding the Shared VPC capability to the hypershift CLI.
The e2e tests that were introduced in U/S OVN-K repo should be ported and added to D/S.
Console enhancements based on customer RFEs that improve customer user experience.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
As a user, I want to access the Import from Git and Container image form from the admin perspective as well.
Provide Import from Git and Container image options that redirect the users to the respective forms.
Customers would like to be able to start individual CronJobs manually via a button in the OpenShift web console, without having to use the oc CLI.
To start a Job from a CronJob using CLI, following command is being used:
$ oc create job a-cronjob --from=cronjob/a-cronjob
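For reference, a sketch of what the console-side action could do with client-go, mirroring the CLI behaviour above; the job naming scheme and labels handling are illustrative assumptions:

```go
// Minimal sketch: create a Job from a CronJob's job template, which is what
// `oc create job --from=cronjob/...` does under the hood. Names are placeholders.
package cronjobs

import (
	"context"
	"fmt"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func triggerCronJob(ctx context.Context, c kubernetes.Interface, ns, name string) (*batchv1.Job, error) {
	cj, err := c.BatchV1().CronJobs(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			// Give the manually triggered Job a unique, recognizable name.
			Name:        fmt.Sprintf("%s-manual-%d", name, time.Now().Unix()),
			Namespace:   ns,
			Labels:      cj.Spec.JobTemplate.ObjectMeta.Labels,
			Annotations: cj.Spec.JobTemplate.ObjectMeta.Annotations,
		},
		Spec: cj.Spec.JobTemplate.Spec,
	}
	return c.BatchV1().Jobs(ns).Create(ctx, job, metav1.CreateOptions{})
}
```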
AC:
Created from https://issues.redhat.com/browse/RFE-6131
As a cluster admin I want to set a cluster wide setting for hiding the "Getting started resources" banner from Overview, for all the console users.
AC:
As a user who is visually impaired, or a user who is out in the sun, when I switch the theme in the console to Light mode, then try to edit text files (e.g., the YAML configuration for a pod) using the web console, I want the editor to be in light theme.
Allow users to create an RHCOS image to be used for bootstrapping new clusters.
The IPI installer is currently uploading the RHCOS image to all AOS Clusters. In environments where each cluster is on a different subnet this uses unnecessary bandwidth and takes a long time on low bandwidth networks.
The goal is to use a pre-existing VM images in Prism Central to bootstrap the cluster
Add support for the GCP C4/C4A Machine Series to be used for Control Plane and Compute Nodes when deploying OpenShift on Google Cloud
As a user, I want to deploy OpenShift on Google Cloud using C4/C4A Machine Series for the Control Plane and Compute Node so I can take advantage of these new Machine types
OpenShift can be deployed in Google Cloud using the new C4/C4A Machine Series for the Control Plane and Compute Nodes starting in OpenShift 4.17.z
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | all |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Google has made C4/C4A Machine Series available on their cloud offering.
The documentation will be updated to add the new disk type that needs to be supported as part of this enablement. The C4/C4A Machine Series will also be added to the list of tested machine types for Google Cloud when deploying OpenShift.
1. Add C4 and C4A instances to list of tested instances in docs.
2. Document that not all zones can be used for installation of these machine types. The installer has no way to know whether these instances can actually be created in a given zone, so to install successfully, specify supported zones in the control plane and compute machine pools in the install config (see the sketch below).
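For illustration, a hedged install-config.yaml fragment for the zone pinning described in point 2; the machine type names and zones are examples only:
controlPlane:
  platform:
    gcp:
      type: c4-standard-8
      zones:
      - us-central1-a
      - us-central1-b
compute:
- name: worker
  platform:
    gcp:
      type: c4-standard-4
      zones:
      - us-central1-a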
The transition from runc to crun is part of OpenShift’s broader strategy for improved performance and security. In OpenShift clusters with hosted control planes, retaining the original runtime during upgrades was considered complex and unnecessary, given the success of crun in tests and the lack of proof for significant risk. This decision aligns with OpenShift’s default container runtime upgrade and simplifies long-term support.
Deployment Configurations | Specific Needs |
---|---|
Self-managed, managed, or both | Both |
Classic (standalone cluster) | N/A |
Hosted control planes | Yes |
Multi-node, Compact (three-node), SNO | All |
Connected / Restricted Network | N/A |
Architectures (x86_64, ARM, IBM Power, IBM Z) | All |
Backport needed | None |
UI Needs | No additional UI needs. OCM may require an acknowledgment for runtime change. |
Scenario 1:
A user upgrading from OpenShift 4.17 to 4.18 in a HyperShift environment has NodePools running runc. After the upgrade, the NodePools automatically switch to crun without user intervention, providing consistency across all clusters.
Scenario 2:
A user concerned about performance with crun in 4.18 can create a new NodePool to test workloads with crun while keeping existing NodePools running runc. This allows for gradual migration, but default behavior aligns with the crun upgrade.
Scenario 2 needs to be well documented as a best practice.
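A rough sketch of Scenario 2 with the hypershift CLI; the flag names are assumptions and may differ by version, and the runtime used by the existing pools is left to the usual NodePool configuration mechanisms:
$ hypershift create nodepool aws \
    --cluster-name my-hosted-cluster \
    --name crun-canary \
    --node-count 2
Move test workloads onto the new pool (for example with node selectors), compare behavior under crun, then scale down or delete the canary pool once satisfied.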
Based on this conversation, we should make sure we document the following:
As a customer I would like to know how the runtime change from runc to crun could affect me, for that we will need to
Description of criteria:
As a customer I want to upgrade my HostedCluster from 4.17 to 4.18, so I can verify:
If any of the points above fails, we need to fill a bug in order to solve it and put it under same Epic as this user story.
Description of criteria:
We aim to continue establishing a comprehensive testing strategy for Hosted Control Planes (HCP) that aligns with Red Hat’s support requirements and ensures customer satisfaction. This involves testing across various permutations, including providers, lifecycle, upgrades, and version compatibility. The testing must span management clusters, hubs, MCE, control planes, and nodepools, while coordinating across multiple QE teams to avoid duplication and inefficiencies. We aim to sustain an evolving testing matrix to meet product demands, especially as new versions and extended OCP lifecycles are introduced.
See: https://docs.google.com/spreadsheets/d/1j8TjMfyCfEt8OzTgvrAG3tuC6WMweBh5ElzWu6oAvUw/edit?gid=0#gid=0
The HCP architecture introduces decoupled control planes and worker nodes, significantly increasing the number of testing permutations. Ensuring these scenarios are tested is crucial to maintaining product quality and customer satisfaction, and to staying compliant as an OpenShift form factor.
This was attempted once before
https://github.com/openshift/release/pull/47599
Then reverted
https://github.com/openshift/release/pull/48326
ROSA HCP production currently runs the HO from main against 4.14 and 4.15 HCs; however, we do not test these combinations together in presubmit testing, which increases the chance of an escape.
OCP 4 clusters still maintain pinned boot images. We have numerous clusters installed that have boot media pinned to first boot images as early as 4.1. In the future these boot images may not be certified by the OEM and may fail to boot on updated datacenter or cloud hardware platforms. These "pinned" boot images should be updateable so that customers can avoid this problem and better still scale out nodes with boot media that matches the running cluster version.
In phase 1 provided tech preview for GCP.
In phase 2, GCP support goes to GA and AWS goes to TP.
In phase 3, AWS support goes to GA and vsphere goes TP.
This epic will encompass work involved to GA the boot image update feature for the AWS platform.
This work will involve bumping the API in the MCO repo, capturing the new feature gate changes.
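For illustration, a sketch of the cluster-side opt-in based on the tech-preview GCP shape of the API; the exact field names and defaults for the AWS GA may differ once the feature gate changes land:
apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
spec:
  managedBootImages:
    machineManagers:
    - resource: machinesets
      apiGroup: machine.openshift.io
      selection:
        mode: All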
Per GA requirements, we are required to add five tests to openshift/origin. This story will encompass part of that work.
Per GA requirements, we are required to add tests to openshift/origin. This story will encompass that work.
This work will involve updating the feature gate in the openshift/api.
Example: https://github.com/openshift/api/pull/1975
This will be blocked by MCO-1304. Once we land those tests, they will need some soaking time as indicated by the GA requirements.
To introduce tests for new permissions required as pre-submit tests on PRs so that PR authors can see whenever their changes affect the minimum required permissions
Currently, the process is that QE installs with the documented minimum permissions, which starts failing whenever something new unknowingly requires additional permissions.
That test runs once a week. When it fails, QE reviews the failures and files bugs; the Installer team then adds the new permissions to the file in the installer repo that tracks the required permissions.
The issue is that it takes some time to get a permissions change implemented by AWS, so the late discovery of a need can become a release blocker.
Early test new minimum permissions required to deploy OCP on AWS so ROSA can be informed before any feature that alters the minimum permissions requirements gets released.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
This is an internal-only feature and should not require any user-facing documentation
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Review, refine and harden the CAPI-based Installer implementation introduced in 4.16
From the implementation of the CAPI-based Installer started with OpenShift 4.16 there is some technical debt that needs to be reviewed and addressed to refine and harden this new installation architecture.
Review existing implementation, refine as required and harden as possible to remove all the existing technical debt
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
There should not be any user-facing documentation required for this work
We need a place to add tasks that are not feature oriented.
The agent installer does not require the infra-env id to be present in the claim to perform the authentication.
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Once a cloud provider uses CAPI by default, the feature gate it used becomes tech debt.
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
This feature aims to comprehensively refactor and standardize various components across HCP, ensuring consistency, maintainability, and reliability. The overarching goal is to increase customer satisfaction by increasing speed to market and to save engineering budget by reducing incidents/bugs. This will be achieved by reducing technical debt, improving code quality, and simplifying the developer experience across multiple areas, including CLI consistency, NodePool upgrade mechanisms, networking flows, and more. By addressing these areas holistically, the project aims to create a more sustainable and scalable codebase that is easier to maintain and extend.
Over time, the HyperShift project has grown organically, leading to areas of redundancy, inconsistency, and technical debt. This comprehensive refactor and standardization effort is a response to these challenges, aiming to improve the project's overall health and sustainability. By addressing multiple components in a coordinated way, the goal is to set a solid foundation for future growth and development.
Ensure all relevant project documentation is updated to reflect the refactored components, new abstractions, and standardized workflows.
This overarching feature is designed to unify and streamline the HCP project, delivering a more consistent, maintainable, and reliable platform for developers, operators, and users.
Goal
Refactor and modularize controllers and other components to improve maintainability, scalability, and ease of use.
Move any NTO-related logic from the NodePool controller into a single reconcile() func implemented in nto.go.
As a dev I want to understand at a glance which conditions are relevant for the NodePool.
As a dev I want to have the ability to add/collapse conditions easily.
As a dev I want any condition expectations to be unit testable.
Abstract away in a single place all the logic related to token and userdata secrets consuming the output of https://issues.redhat.com/browse/HOSTEDCP-1678
This should result in a single abstraction, i.e. "Token", that exposes a thin library, e.g. Reconcile(), and hides all the details of the token/userdata secrets lifecycle.
As a dev I want to easily add and understand which inputs result in triggering a NodePool upgrade.
There are many scattered things that trigger a NodePool rolling upgrade on change.
For code sustainability it would be good to have a common abstraction that discovers all of them based on an input and returns the authoritative hash for any targeted config version at a point in time.
Related https://github.com/openshift/hypershift/pull/4057
https://github.com/openshift/hypershift/pull/3969#discussion_r1587198191
Following up on abstracting pieces into cohesive units, CAPI is the next logical choice since there is a lot of reconciliation business logic for it in the NodePool controller.
Goals:
All capi related logic is driven by a single abstraction/struct.
Almost full unit test coverage
Deeper refactoring of the concrete implementation logic is left out of scope for gradual, test-driven follow-ups.
As a (user persona), I want to be able to:
https://issues.redhat.com//browse/HOSTEDCP-1801 introduced a new abstraction to be used by ControlPlane components. We need to refactor every component to use this abstraction.
Description of criteria:
All ControlPlane Components are refactored:
Example PR to refactor HCCO: https://github.com/openshift/hypershift/pull/4860
docs: https://github.com/openshift/hypershift/blob/main/support/controlplane-component/README.md
Provide a PR with an OAPI standard refactor
Example PR to refactor HCCO: https://github.com/openshift/hypershift/pull/4860
docs: https://github.com/openshift/hypershift/blob/main/support/controlplane-component/README.md
As a (user persona), I want to be able to:
Context:
If you have ever had to add or modify a component of the control plane operator, the need for this becomes very obvious. It should only be possible to add component manifests through a gated interface.
Right now adding a new component requires copy/pasting hundreds of lines of boilerplate and there is plenty of room for side effects. A dev needs to manually remember to set the right config, like AutomountServiceAccountToken false, topology opinions...
We should refactor support/config and all the consumers in the CPO to enforce component creation through audited and common signatures/interfaces.
Adding a new component is only possible through these higher abstractions.
Improve the consistency and reliability of APIs by enforcing immutability and clarifying service publish strategy support.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Focus on the general modernization of the codebase, addressing technical debt, and ensuring that the platform is easy to maintain and extend.
DoD:
Delete conversion webhook https://github.com/openshift/hypershift/pull/2267
This needs to be backward compatible for IBM.
Review IBM PRs:
https://github.com/openshift/hypershift/pull/1939
As a user of HyperShift, I want:
so that I can achieve
Description of criteria:
N/A
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a dev I want the base code to be easier to read, maintain and test
If devs don't have a healthy dev environment, the project won't go far and the business won't make $$.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
<your text here>
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
<your text here>
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Nicole Thoen has already started crafting a technical debt impeding PF6 migrations document, which contains a list of identified tech-debt items, deprecated components, etc.
Locations
frontend/public/components/
search-filter-dropdown.tsx (note: Steve has a branch that's converted this) [merged]
frontend/public/components/monitoring/
kebab-dropdown.tsx – code duplicated at https://github.com/openshift/monitoring-plugin/blob/main/web/src/components/kebab-dropdown.tsx and that version will be updated in https://issues.redhat.com/browse/OU-257 as the console version is eventually going away
ListPageCreate.tsx – addressed in https://issues.redhat.com//browse/CONSOLE-4118
alerting.tsx – code duplicated at https://github.com/openshift/monitoring-plugin/blob/main/web/src/components/alerting.tsx and that version should be updated in https://issues.redhat.com/browse/OU-561 as the console version is eventually going away
AC: Go through the mentioned files and swap the usage of DropdownDeprecated and KebabToggleDeprecated with PF components, based on their semantics (either Dropdown or Select components).
Note:
DropdownDeprecated and KebabToggleDeprecated are replaced with latest components
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/menu-toggle#plain-toggle-with-icon
https://www.patternfly.org/components/menus/dropdown
https://www.patternfly.org/components/menus/select
Part of the PF6 adoption should be replacing TableDeprecated with the Table component
Location:
AC:
Locations
frontend/packages/console-shared/src/components/
GettingStartedGrid.tsx (has KebabToggleDeprecated)
Note
DropdownDeprecated is replaced with latest components
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/menu-toggle#plain-toggle-with-icon
https://www.patternfly.org/components/menus/dropdown
https://www.patternfly.org/components/menus/select
AC: Go through the mentioned files and swap the usage of DropdownDeprecated and KebabToggleDeprecated with PF components, based on their semantics (either Dropdown or Select components).
NodeLogs.tsx (two) [merged]
PerspectiveDropdown.tsx (??? Can not locate this dropdown in the UI. Reached out to Christoph but didn't hear back.)
UserPreferenceDropdownField.tsx [merged]
ClusterConfigurationDropdownField.tsx (??? Can not locate this dropdown in the UI) Dead code
PerspectiveConfiguration.tsx (options have descriptions) [merged]
Acceptance Criteria
SelectDeprecated are replaced with latest Select component
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/select
AC: Go through the mentioned files and swap the usage of SelectDeprecated with PF Select components.
multiselectdropdown.tsx (multiple typeahead with placeholder and noResultsFoundText)
Note
SelectDeprecated and SelectOptionDeprecated are replaced with latest Select component
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/select
AC: Go through the mentioned files and swap the usage of SelectDeprecated with PF Select components.
multiselectdropdown.tsx (multiple typeahead with placeholder and noResultsFoundText) only used in packages/local-storage-operator moved to https://issues.redhat.com/browse/CONSOLE-4227
UtilizationDurationDropdown.tsx (checkbox select, plain toggle, with placeholder text)
SelectInputField.tsx (uses most Select props) moved to https://issues.redhat.com/browse/ODC-7655
QueryBrowser.tsx (Currently using DropdownDeprecated, should be using a Select)
Note
SelectDeprecated and SelectOptionDeprecated are replaced with latest Select component
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/select
AC: Go through the mentioned files and swap the usage of SelectDeprecated with PF Select components.
AC:
PatternFly demo using Dropdown and Menu components
https://www.patternfly.org/components/menus/application-launcher/
operator-channel-version-select.tsx (Two)
Acceptance Criteria
SelectDeprecated are replaced with latest Select component
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/select
AC: Go through the mentioned files and swap the usage of SelectDeprecated with PF Select components.
Replace DropdownDeprecated
Replace SelectDeprecated
Acceptance Criteria
Note:
DropdownDeprecated and KebabToggleDeprecated are replaced with latest components
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/menu-toggle#plain-toggle-with-icon
https://www.patternfly.org/components/menus/dropdown
https://www.patternfly.org/components/menus/select
resource-dropdown.tsx (checkbox, options have tooltips, grouped options, hasInlineFilter which is not supported in V6 Select, convert to Typeahead)
filter-toolbar.tsx (grouped, checkbox select)
monitoring/dashboards/index.tsx (checkbox select, hasInlineFilter which is not supported in V6 Select, convert to Typeahead) covered by https://issues.redhat.com/browse/ODC-7655
silence-form.tsx (Currently using DropdownDeprecated, should be using a Select)
timespan-dropdown.ts (Currently using DropdownDeprecated, should be using a Select) covered by https://issues.redhat.com/browse/ODC-7655
poll-interval-dropdown.tsx (Currently using DropdownDeprecated, should be using a Select) covered by https://issues.redhat.com/browse/ODC-7655
Note
SelectDeprecated are replaced with latest Select component
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/select
AC: Go through the mentioned files and swap the usage of Deprecated components with PF components, based on their semantics (either Dropdown or Select components).
Locations
frontend/packages/console-app/src/components/
NavHeader.tsx [merged]
PDBForm.tsx (This should be a <Select>) [merged]
Acceptance Criteria:
DropdownDeprecated are replaced with latest components
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/dropdown
https://www.patternfly.org/components/menus/select
Nicole Thoen has already started crafting a technical debt impeding PF6 migrations document, which contains a list of identified tech-debt items, deprecated components, etc.
Locations
frontend/packages/pipelines-plugin/src/components/
PipelineQuickSearchVersionDropdown.tsx (Currently using DropdownDeprecated, should be using a Select)
PipelineMetricsTimeRangeDropdown.tsx (Currently using DropdownDeprecated, should be using a Select)
Note
DropdownDeprecated are replaced with latest Select components
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/select
AC: Go through the mentioned files and swap the usage of SelectDeprecated with PF Select components.
SecureRouteFields.tsx (Two)
Acceptance Criteria
SelectDeprecated are replaced with latest Select component
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/select
KindFilterDropdown.tsx (checkbox select with custom content - not options)
FilterDropdown.tsx (checkbox, grouped, switch component in select menu)
NameLabelFilterDropdown.tsx (Should be a Select component; Currently using DropdownDeprecated)
Acceptance Criteria
SelectDeprecated are replaced with latest Select component
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/select
TelemetryConfiguration.tsx (options have descriptions)
TelemetryUserPreferenceDropdown.tsx (options have descriptions)
Acceptance Criteria
SelectDeprecated are replaced with latest Select component
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/select
Locations
frontend/packages/topology/MoveConnectionModal.tsx
Note:
DropdownDeprecated are replaced with latest components
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/dropdown
https://www.patternfly.org/components/menus/select
AC: Go through the mentioned files and swap the usage of SelectDeprecated with PF Select or Dropdown components.
Part of the PF6 adoption should be replacing TableDeprecated with the Table component
Location:
AC:
monitoring/dashboards/index.tsx (checkbox select, hasInlineFilter which is not supported in V6 Select, convert to Typeahead)
timespan-dropdown.ts (Currently using DropdownDeprecated, should be using a Select)
poll-interval-dropdown.tsx (Currently using DropdownDeprecated, should be using a Select)
SelectInputField.tsx (uses most Select props)
`FilterSelect`, `VariableDropdown`, `TimespanDropdown`, and `IntervalDropdown` are the components that need to be updated; frontend/packages/dev-console/src/components/monitoring/MonitoringPage.tsx is the only valid instance usage of `MonitoringDashboardsPage`, as web/src/components/alerting.tsx is orphaned.
Note
SelectDeprecated are replaced with latest Select component
https://www.patternfly.org/components/menus/menu
https://www.patternfly.org/components/menus/select
AC: Go through the mentioned files and swap the usage of Deprecated components with PF components, based on their semantics (either Dropdown or Select components).
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
K8s 1.31 introduces VolumeAttributesClass as beta (code in external provisioner). We should make it available to customers as tech preview.
VolumeAttributesClass allows PVCs to be modified after their creation and while attached. There is a vast number of parameters that can be updated, but the most popular use is changing QoS values. The parameters that can be changed depend on the driver used.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Productise VolumeAttributesClass as TP in anticipation for GA. Customer can start testing VolumeAttributesClass.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | yes |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | all |
Connected / Restricted Network | both |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all |
Operator compatibility | N/A core storage |
Backport needed (list applicable versions) | None |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | TBD for TP |
Other (please specify) | n/A |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
As an OCP user, I want to change parameters of my existing PVC such as the QoS attributes.
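A minimal sketch of the upstream beta API; the driver name and parameter keys are illustrative and depend on what the CSI driver supports:
apiVersion: storage.k8s.io/v1beta1
kind: VolumeAttributesClass
metadata:
  name: gold
driverName: csi.example.com
parameters:
  iops: "8000"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  volumeAttributesClassName: gold
The PVC's volumeAttributesClassName can be changed after creation, which is what triggers the driver-side modification of the existing volume.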
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
UI for TP
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
There have been some limitations and complaints about the fact that PVC attributes are sealed after creation, preventing customers from updating them. This is particularly impactful when a specific QoS is set and the volume requirements change.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Customers should not use it in production at the moment.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Document VolumeAttributesClass creation and how to update a PVC. Mention any limitations. Mention that it is Tech Preview with no upgrade support. Add driver support information if needed.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Check which drivers support it for which parameters.
Support upstream feature "VolumeAttributesClass" in OCP as Beta, i.e. test it and have docs for it.
There are a number of features or use cases supported by metal IPI that currently do not work in the agent-based installer (mostly due to being prevented by validations).
In phased approach, we first need to close all the identified gaps in ABI (this feature).
In a second phase, we would introduce in the IPI flow the ABI technology, once its on par with the IPI feature-set.
Close the gaps identified in Baremetal IPI/ABI Feature Gap Analysis
Given that IPI (starting with 4.10) supports nmstate config, the overall configuration seems very similar, apart from the fact that it is spread across different files.
Given: a configuration that works for the IPI method
When: I do an agent-based installation with the same configuration
Then: it works (with the exception that ISOs are provided manually)
Description of problem:
Currently the AdditionalTrustBundlePolicy is not being used, and setting it to a value other than "Proxyonly" generates a warning message:
Warning AdditionalTrustBundlePolicy: always is ignored
There are certain configurations where it is necessary to set this value; see more discussion in https://redhat-internal.slack.com/archives/C02SPBZ4GPR/p1727793787922199
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. In install-config.yaml set AdditionalTrustBundlePolicy to Always
2. Note the warning message that is output.
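For reference, a minimal install-config.yaml fragment for step 1 (certificate contents elided):
additionalTrustBundlePolicy: Always
additionalTrustBundle: |
  -----BEGIN CERTIFICATE-----
  ...
  -----END CERTIFICATE-----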
Actual results:
AdditionalTrustBundlePolicy is unused.
Expected results:
AdditionalTrustBundlePolicy is used in cluster installation.
Additional info:
As we gain hosted control planes customers, that bring in more diverse network topologies, we should evaluate relevant configurations and topologies and provide a more thorough coverage in CI and promotion testing
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Cut down proxy issues in managed and self-managed hosted control planes
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | No |
Hosted control planes | Yes |
Multi node, Compact (three node), or Single node (SNO), or all | All supported Hosted Control Planes node topologies |
Connected / Restricted Network | Connected |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all |
Operator compatibility | N/A |
Backport needed (list applicable versions) | Coverage over all supported releases |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | No |
Other (please specify) |
There's been a few significant customer bugs related to proxy configurations with Hosted Control Planes
Will increase reliability for customers, preventing regressions
Documentation improvements that better detail the flow of communication and supported configurations
E2E should probably cover both ROSA/HCP and ARO/HCP
As a (user persona), I want to be able to:
so that I can achieve
This requires/does not require a design proposal.
This requires/does not require a feature gate.
A common concern with dealing with escalations/incidents in Managed OpenShift Hosted Control Planes is the resolution time incurred when the fix needs to be delivered in a component of the solution that ships within the OpenShift release payload. This is because OpenShift's release payloads:
This feature seeks to provide mechanisms that put an upper time boundary on delivering such fixes, matching the current HyperShift Operator <24h expectation.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | managed (ROSA and ARO) |
Classic (standalone cluster) | No |
Hosted control planes | Yes |
Multi node, Compact (three node), or Single node (SNO), or all | All supported ROSA/HCP topologies |
Connected / Restricted Network | All supported ROSA/HCP topologies |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All supported ROSA/HCP topologies |
Operator compatibility | CPO and Operators depending on it |
Backport needed (list applicable versions) | TBD |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | No |
Other (please specify) | No |
Discussed previously during incident calls. Design discussion document
SOP needs to be defined for:
Acceptance criteria:
The OpenShift IPsec implementation will be enhanced for a growing set of enterprise use cases, and for larger scale deployments.
The OpenShift IPsec implementation was originally built for purpose-driven use cases from telco NEPs, but was also useful for a specific set of other customer use cases outside of that context. As customer adoption grew and it was adopted by some of the largest (by number of cluster nodes) deployments in the field, it became obvious that some redesign is necessary in order to continue to deliver enterprise-grade IPsec, for both East-West and North-South traffic, and for some of our most demanding customer deployments.
Key enhancements include observability and blocked traffic across paths if IPsec encryption is not functioning properly.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
The OpenShift IPsec feature is fundamental to customer deployments for ensuring that all traffic between cluster nodes (East-West) and between cluster nodes and external-to-the-cluster entities that also are configured for IPsec (North-South) is encrypted by default. This encryption must scale to the largest of deployments.
Questions to be addressed:
Description of problem:
In 4.14, libreswan runs as a containerized process inside the pod. SOS reports and must-gathers are not collecting libreswan logs and xfrm information from the nodes, which makes the debugging process heavier. This should be fixed by working with the sos-report team OR by changing our must-gather scripts in 4.14 alone. From 4.15, libreswan is a systemd process running on the host, so the swan logs are gathered in the sos-report. For 4.14, especially during escalations, gathering individual node data over and over is becoming painful for IPsec. We need to ensure all the data required to debug IPsec is collected in one place.
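For reference, the kind of per-node data described above can be gathered manually with oc debug; the exact gather scripts to change are up to the must-gather/sos-report owners:
$ oc debug node/<node> -- chroot /host ip xfrm state
$ oc debug node/<node> -- chroot /host ip xfrm policy
$ oc debug node/<node> -- chroot /host journalctl -u ipsec --no-pager   # 4.15+, where libreswan runs on the host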
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
As an OpenShift Administrator, I need to ensure that I rotate signing keys for self-managed Openshift Azure Entra Workload ID enabled clusters to comply with PCI-DSS v4 (see #8 on life cycle management) and NIST (see PCI “Tokenization Product Security Guidelines”) rules.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
When creating a self-managed OpenShift cluster on Azure using Azure Entra Workload ID, a dedicated OIDC endpoint is created. This endpoint exposes a document located at .well-known/openid-configuration, which contains the key jwks_uri, which in turn points to the JSON Web Key Sets.
Regular key rotations are an important part of PCI-DSS v4 and NIST rules. To ensure PCI-DSS V4 requirements, a mechanism is needed to seamlessly rotate signing keys. Currently, we can only have one signing/private key present in the OpenShift cluster; however, JWKS supports multiple public keys.
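For context, a quick way to inspect the issuer document and the key IDs it currently publishes (jq assumed available):
$ ISSUER="$(oc get authentication cluster -o jsonpath='{.spec.serviceAccountIssuer}')"
$ curl -s "${ISSUER}/.well-known/openid-configuration" | jq -r .jwks_uri
$ curl -s "$(curl -s "${ISSUER}/.well-known/openid-configuration" | jq -r .jwks_uri)" | jq -r '.keys[].kid'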
This feature will be split into 2 phases:
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Self-managed |
Classic (standalone cluster) | Classic |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | All |
Connected / Restricted Network | All |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_x64, ARM (aarch64) |
Operator compatibility | |
Backport needed (list applicable versions) | TBD (Affects OpenShift 4.14+) |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Related references
Additional references
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
As an OpenShift Administrator, I need to ensure that I rotate signing keys for self-managed short-term credentials enabled clusters (Openshift Azure Entra Workload ID, GCP Workload Identity, AWS STS) to comply with PCI-DSS v4 (see #8 on life cycle management) and NIST (see PCI “Tokenization Product Security Guidelines”) rules.
Add documentation to the cloud-credential-repo for how to rotate the cluster bound-service-account-signing-key to include adding the new key to the Microsoft Azure Workload Identity issuer file. The process should meet the following requirements:
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
<your text here>
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
<your text here>
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
link back to OCPSTRAT-1644 somehow
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Why is this important? (mandatory)
What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Description of problem:
Failed CI jobs:
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.18-multi-nightly-4.18-cpou-upgrade-from-4.15-aws-ipi-mini-perm-arm-f14/1842004955238502400
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.18-arm64-nightly-4.18-cpou-upgrade-from-4.15-azure-ipi-fullyprivate-proxy-f14/1841942041722884096
The 4.15-to-4.18 upgrade failed at the 4.17-to-4.18 stage: the authentication operator is degraded and unavailable due to APIServerDeployment_PreconditionNotFulfilled.
$ omc get clusterversion
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.0-0.nightly-arm64-2024-10-03-172957   True        True          1h44m   Unable to apply 4.18.0-0.nightly-arm64-2024-10-03-125849: the cluster operator authentication is not available
$ omc get co authentication
NAME             VERSION                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.18.0-0.nightly-arm64-2024-10-03-125849   False       False         True       8h
$ omc get co authentication -ojson | jq '.status.conditions[]'
{
  "lastTransitionTime": "2024-10-04T04:22:39Z",
  "message": "APIServerDeploymentDegraded: waiting for .status.latestAvailableRevision to be available\nAPIServerDeploymentDegraded: ",
  "reason": "APIServerDeployment_PreconditionNotFulfilled",
  "status": "True",
  "type": "Degraded"
}
{
  "lastTransitionTime": "2024-10-04T03:54:13Z",
  "message": "AuthenticatorCertKeyProgressing: All is well",
  "reason": "AsExpected",
  "status": "False",
  "type": "Progressing"
}
{
  "lastTransitionTime": "2024-10-04T03:52:34Z",
  "reason": "APIServerDeployment_PreconditionNotFulfilled",
  "status": "False",
  "type": "Available"
}
{
  "lastTransitionTime": "2024-10-03T21:32:31Z",
  "message": "All is well",
  "reason": "AsExpected",
  "status": "True",
  "type": "Upgradeable"
}
{
  "lastTransitionTime": "2024-10-04T00:04:57Z",
  "reason": "NoData",
  "status": "Unknown",
  "type": "EvaluationConditionsDetected"
}
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-arm64-2024-10-03-125849 4.18.0-0.nightly-multi-2024-10-03-193054
How reproducible:
always
Steps to Reproduce:
1. Upgrade from 4.15 to 4.16, then to 4.17, and then to 4.18.
Actual results:
upgrade stuck on authentication operator
Expected results:
upgrade succeed
Additional info:
The issue is found in a control plane only update jobs(with paused worker pool), but it's not cpou specified because it can be reproduced in a normal chain upgrade from 4.15 to 4.18 upgrade.
Add OpenStackLoadBalancerParameters and add an option for setting the load-balancer IP address for only those platforms where it can be implemented.
As a user of on-prem OpenShift, I need to manage DNS for my OpenShift cluster manually. I can already specify an IP address for the API server, but I cannot do this for Ingress. This means that I have to:
I would like to simplify this workflow to:
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Bump openshift/api in cluster-ingress-operator and use the new floatingIP field on platform OpenStack.
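A sketch of what consuming the field could look like on an IngressController, assuming it lands under providerParameters.openstack as proposed; the address is an example:
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: External
      providerParameters:
        type: OpenStack
        openstack:
          floatingIP: 192.0.2.10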
If config drive is available on the machine use it instead of metadata.
To overcome the OVN metadata issue, we are adding an additional IPv4 network so metadata can be reached over IPv4 instead of IPv6 and we got a working installation. Now, let's try with config-drive, so we avoid specifying an IPv4 network and get the VMs to be IPv6 only.
With metadata support over IPv6 being included in OSP, we should update MCO to use the IPv6 address on single-stack IPv6 installs.
As a customer of self managed OpenShift or an SRE managing a fleet of OpenShift clusters I should be able to determine the progress and state of an OCP upgrade and only be alerted if the cluster is unable to progress. Support a cli-status command and status-API which can be used by cluster-admin to monitor the progress. status command/API should also contain data to alert users about potential issues which can make the updates problematic.
Here are common update improvements from customer interactions on Update experience
oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
Update docs for UX and CLI changes
Reference : https://docs.google.com/presentation/d/1cwxN30uno_9O8RayvcIAe8Owlds-5Wjr970sDxn7hPg/edit#slide=id.g2a2b8de8edb_0_22
Epic Goal*
Add a new command `oc adm upgrade status` command which is backed by an API. Please find the mock output of the command output attached in this card.
Why is this important? (mandatory)
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
One piece of information that we lost compared to the oc adm upgrade command is which ClusterOperators are being updated right now. Previously, we presented CVO's Progressing=True message, which says:
waiting on cloud-credential, cluster-autoscaler, console, csi-snapshot-controller, image-registry, ingress, insights, kube-storage-version-migrator, machine-approver, marketplace, monitoring, node-tuning, openshift-apiserver, openshift-controller-manager, operator-lifecycle-manager, service-ca, storage
The oc adm upgrade status output presents counts of updated/updating/pending operators, but does not say which ones are in which state. We should show this information somehow.
This is what we did for this card (for QE to verify):
- In the control plane section, we add a line of "Updating" to display the names of Cluster Operators that are being updated.
The following is an example.
= Control Plane =
Assessment:      Progressing
Target Version:  4.14.1 (from 4.14.0)
Updating:        machine-config
Completion:      97% (32 operators updated, 1 updating, 0 waiting)
Duration:        14m (Est. Time Remaining: <10m)
Operator Health: 32 Healthy, 1 Unavailable

Updating Cluster Operators
NAME             SINCE   REASON   MESSAGE
machine-config   1m10s   -        Working towards 4.14.1
The current format of the worker status line is consistent with the original format of the operator status line. However, the operator status line is being reworked and simplified as part of OTA-1155. The goal of this task is to make the worker status line simpler and somewhat consistent with that newly modified operator status line.
The current worker status line (see the “Worker Status: ...” line):
= Worker Pool =
Worker Pool:   worker
Assessment:    Degraded
Completion:    39%
Worker Status: 59 Total, 46 Available, 5 Progressing, 36 Outdated, 12 Draining, 0 Excluded, 7 Degraded
The exact new format is not defined and is for the assignee to create.
A relevant Slack discussion: https://redhat-internal.slack.com/archives/CEGKQ43CP/p1706727395851369
The main goal of this task is to turn the current worker status line, for example:
Worker Status: 4 Total, 4 Available, 0 Progressing, 3 Outdated, 0 Draining, 0 Excluded, 0 Degraded
into something grouped roughly along these lines:
Worker Status: <Available and Updated>, <Available and Outdated> [of which X are paused], <Unavailable but Progressing (Progressing and thus Unavailable)>, <Unavailable AND NOT Progressing>
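As one illustration only (the struct and wording below are placeholders, not the final format, which is for the assignee to define), the regrouped line could be rendered along these lines in Go:

package main

import "fmt"

// poolCounts regroups the per-node counts along the proposed axes; all names are illustrative.
type poolCounts struct {
	availableUpdated    int
	availableOutdated   int
	pausedOutdated      int
	unavailableUpdating int
	unavailableStuck    int
}

func workerStatusLine(c poolCounts) string {
	return fmt.Sprintf("Worker Status: %d Available and Updated, %d Available and Outdated (%d paused), %d Updating and therefore Unavailable, %d Unavailable and not Progressing",
		c.availableUpdated, c.availableOutdated, c.pausedOutdated, c.unavailableUpdating, c.unavailableStuck)
}

func main() {
	fmt.Println(workerStatusLine(poolCounts{availableUpdated: 1, availableOutdated: 3}))
}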
Definition of Done:
On the call to discuss the oc adm upgrade status roadmap to a server-side implementation (notes), we agreed on a basic architectural direction and we can start moving in that direction:
Let's start building this controller; we can have the controller perform the functionality currently present in the client and just expose it through an API. I am not sure how to deal with the fact that we won't have the API until it merges into o/api, which will not be soon. Maybe we can implement the controller over a temporary fork of o/api and rely on manually inserting the CRD into the cluster when we test the functionality? Not sure.
We need to avoid committing to implementation details and investing effort into things that may change though.
This card only brings a skeleton of the desired functionality to the DevPreviewNoUpgrade feature set. Its purpose is mainly to enable further development by putting the necessary bits in place so that we can start developing more functionality. There's not much point in automating testing of any of the functionality in this card, but it should be useful to start getting familiar with how the new controller is deployed and what its concepts are.
For seeing the new controller in action:
1. Launch a cluster that includes both the code and manifests. As of Nov 11, #1107 is not yet merged so you need to use launch 4.18,openshift/cluster-version-operator#1107 aws,no-spot
2. Enable the DevPreviewNoUpgrade feature set. CVO will restart and will deploy all functionality gated by this feature set, including the USC. It can take a bit of time, ~10-15m should be enough though.
3. Eventually, you should be able to see the new openshift-update-status-controller Namespace created in the cluster
4. You should be able to see an update-status-controller Deployment in that namespace
5. That Deployment should have one replica running and ready. It should not crashloop or anything like that. You can inspect its logs for obvious failures. At this point, its log should, near its end, say something like "the ConfigMap does not exist so doing nothing"
6. Create the ConfigMap that mimics the future API (make sure to create it in the openshift-update-status-controller namespace): oc create configmap -n openshift-update-status-controller status-api-cm-prototype
7. The controller should immediately-ish insert a usc-cv-version key into the ConfigMap. Its content is a YAML-serialized ClusterVersion status insight (see design doc). As of OTA-1269 the content is not that important, but the (1) reference to the CV (2) versions field should be correct.
8. The status insight should have a condition of Updating type. It should be False at this time (the cluster is not updating).
9. Start upgrading the cluster (it's a cluster-bot cluster with an ephemeral 4.18 version, so you'll need to use --to-image=pullspec and probably force it)
10. While updating, you should be able to observe the controller activity in the log (it logs some diffs), but also the content of the status insight in the ConfigMap changing. The versions field should change appropriately (and startedAt too), and the Updating condition should become True.
11. Eventually the update should finish and the Updating condition should flip to False again.
Some of these will turn into automated testcases, but it does not make sense to implement that automation while we're using the ConfigMap instead of the API.
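Until the real API lands, a throwaway Go check like the following can stand in for such a testcase; the namespace, ConfigMap name, key, and condition type come from the steps above, while the insight struct shape is an assumption based on the design doc:

package main

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/yaml"
)

// cvInsight mirrors only the fields of the prototype ClusterVersion status insight
// that this check cares about; the real shape is defined in the design doc.
type cvInsight struct {
	Conditions []metav1.Condition `json:"conditions"`
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	cm, err := client.CoreV1().ConfigMaps("openshift-update-status-controller").
		Get(context.TODO(), "status-api-cm-prototype", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	var insight cvInsight
	if err := yaml.Unmarshal([]byte(cm.Data["usc-cv-version"]), &insight); err != nil {
		panic(err)
	}
	// While an update is running the Updating condition should be True, otherwise False.
	updating := meta.FindStatusCondition(insight.Conditions, "Updating")
	fmt.Printf("Updating condition: %+v\n", updating)
}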
Spun out of https://issues.redhat.com/browse/MCO-668
This aims to capture the work required to rotate the MCS-ignition CA + cert.
Original description copied from MCO-668:
Today in OCP there is a TLS certificate generated by the installer, which is called "root-ca" but is really "the MCS CA".
A key derived from this is injected into the pointer Ignition configuration under the "security.tls.certificateAuthorities" section, and this is how the client verifies it's talking to the expected server.
If this key expires (and by default the CA has a 10 year lifetime), newly scaled up nodes will fail in Ignition (and fail to join the cluster).
The MCO should take over management of this cert, and the corresponding user-data secret field, to implement rotation.
Reading:
- There is a section in the customer-facing documentation that touches on this and needs updating for clarification: https://docs.openshift.com/container-platform/4.13/security/certificate_types_descriptions/machine-config-operator-certificates.html
- There's a pending PR to openshift/api: https://github.com/openshift/api/pull/1484/files
- Also see old (related) bug: https://issues.redhat.com/browse/OCPBUGS-9890
- This is also separate from https://issues.redhat.com/browse/MCO-499, which describes the management of kubelet certs
We currently write the rootCA to disk via this template: https://github.com/openshift/machine-config-operator/blob/master/templates/common/_base/files/root-ca.yaml
Nothing that we know of currently uses this file, and since it is templated via a MachineConfig, any update to the configmap (root-ca in the kube-system namespace) used to generate this template will cause an MC roll-out. We will be updating this configmap as part of cert rotation in MCO-643, so we'd like to prevent unnecessary roll-outs by removing this template.
The machinesets in the machine-api namespace reference a user-data secret (one per pool, and it can be customized) which stores the initial Ignition stub configuration pointing to the MCS, plus the TLS cert. Today this doesn't get updated after creation.
The MCO now has the ability to manage some fields of the machineset object as part of the managed bootimage work. We should extend that to also sync the updated user-data secrets for the Ignition TLS cert.
The MCC should be able to parse both install-time-generated machinesets and user-created ones, so as not to break compatibility. One way users use this today is a custom secret + machineset to set Ignition fields the MCO doesn't manage, for example to partition disks for different device types for nodes in the same pool. Extra care should be taken not to break this use case.
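As a rough, illustrative Go sketch of the secret-patching step (not the MCO's actual code; it assumes the standard Ignition v3 pointer-config layout, and a real implementation would preserve any fields it does not understand rather than overwriting the whole security section):

package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// patchIgnitionCA replaces the certificateAuthorities entry in an Ignition stub
// config with a freshly rotated MCS CA.
func patchIgnitionCA(userData, newCA []byte) ([]byte, error) {
	var cfg map[string]interface{}
	if err := json.Unmarshal(userData, &cfg); err != nil {
		return nil, err
	}
	ign, ok := cfg["ignition"].(map[string]interface{})
	if !ok {
		return nil, fmt.Errorf("user-data is not an Ignition config")
	}
	source := "data:text/plain;charset=utf-8;base64," + base64.StdEncoding.EncodeToString(newCA)
	// Simplification: overwrite the security section wholesale.
	ign["security"] = map[string]interface{}{
		"tls": map[string]interface{}{
			"certificateAuthorities": []interface{}{
				map[string]interface{}{"source": source},
			},
		},
	}
	return json.Marshal(cfg)
}

func main() {
	stub := []byte(`{"ignition":{"version":"3.2.0","security":{"tls":{"certificateAuthorities":[{"source":"data:..."}]}}}}`)
	out, err := patchIgnitionCA(stub, []byte("-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----"))
	fmt.Println(string(out), err)
}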
This feature introduces a new command oc adm upgrade recommend in Tech Preview that improves how cluster administrators evaluate and select version upgrades.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Self-managed |
Classic (standalone cluster) | standalone |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | All |
Connected / Restricted Network | All |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
Add docs for recommend command
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Changes in the web console/GUI and the oc CLI, where we will change the number of update recommendations users see.
No console changes were made in 4.18, but we may follow up with those changes later if the tech-preview oc adm upgrade recommend is well received.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Doesn't have to be recommend, but a new subcommand so that we can rip out the oc adm upgrade output about "how is your cluster currently doing" (Failing=True, mid-update, etc.). The new subcommand would just be focused on "I haven't decided which update I want to move to next; help me pick", including the "I am thinking about 4.y.z, but I'm not completely sure yet; anything I should be aware of for that target?".
Definition of Done:
For this initial ticket, we can just preserve all the current next-hop output, and tuck it behind a feature-gate environment variable, so we can make future pivots in follow-up tickets.
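A minimal sketch of such a gate (the environment variable name here is an assumption for illustration, not necessarily the one oc will use):

package main

import (
	"fmt"
	"os"
)

// recommendEnabled reports whether the hypothetical feature-gate variable is set,
// so the default oc behaviour stays unchanged unless the user opts in.
func recommendEnabled() bool {
	return os.Getenv("OC_ENABLE_CMD_UPGRADE_RECOMMEND") == "true"
}

func main() {
	if !recommendEnabled() {
		fmt.Println("oc adm upgrade recommend stays hidden; set the gate variable to try it")
		return
	}
	fmt.Println("recommend subcommand registered")
}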
Conditional update UXes today are built around the assumption that when an update is conditional, it's a Red Hat issue, and some future release will fix the bug, and an update will become recommended. On this assumption, UXes like oc adm upgrade and the web-console both mention the existence of supported-but-not-recommended update targets, but don't push the associated messages in front of the cluster administrator.
But there are also update issues like exposure to Kubernetes API removals, where we will never restore the APIs, and we want the admin to take action (and maybe eventually accept the risk as part of the update). Do we want to adjust our update-risk UXes to be more open about discussing risks? For example, we could expose the message for the tip-most Recommended!=True update, or something like that, so the cluster admin could read the message and decide for themselves whether it is a "wait for newer releases" thing or a "fix something in my current cluster state" thing. I think this would reduce the current confusion about "which updates is Upgradeable=False blocking?" (OCPBUGS-9013) and similar.
Some customers will want an older release than OTA-1272's longest hops. --show-outdated-version might flood them with many old releases. This card is about giving them an option, maybe --version=4.17.99, that will show them context about that specific release without distracting them with opinions about other releases.
We currently show all the recommended updates in decreasing order, and --include-not-recommended additionally shows all the updates-with-assessed-risks in decreasing order. But sometimes users want to update to the longest hop, even if there are known risks. Or they want to read about the assessed risks, in case there's something they can do to their cluster to mitigate a currently-assessed risk before kicking off the update. This ticket is about adjusting oc's output to order roughly by release freshness. For example, for a 4.y cluster in a 4.(y+1) channel:
Because users are more likely to care about 4.(y+1).tip, even if it has assessed risks, than they are to care about 4.y.reallyOld, even if it doesn't have assessed risks.
Show some number of these by default, and then use --show-outdated-versions or similar to see all the results.
See Scott here and me in OTA-902 both pitching something in this space.
Blocked on OTA-1271, because that will give us a fresh, tech-preview subcommand, where we can iterate without worrying about breaking existing users, until we're happy enough to GA the new approach.
For example, on 4.12.16 in fast-4.13, oc adm upgrade will currently show between 23 and 91 recommended updates (depending on your exposure to declared update risks):
cincinnati-graph-data$ hack/show-edges.py --cincinnati https://api.openshift.com/api/upgrades_info/graph fast-4.13 | grep '^4[.]12[.]16 ->' | wc -l
23
cincinnati-graph-data$ hack/show-edges.py --cincinnati https://api.openshift.com/api/upgrades_info/graph fast-4.13 | grep '^4[.]12[.]16 ' | wc -l
91
but showing folks 4.12.16-to-4.12.17 is not worth the line it takes, because 4.12.17 is so old, and customers would be much better served by 4.12.63 or 4.12.64, which address many bugs that 4.12.17 was exposed to. With this ticket, oc adm upgrade recommend would show something like:
Recommended updates:

  VERSION   IMAGE
  4.12.64   quay.io/openshift-release-dev/ocp-release@sha256:1263000000000000000000000000000000000000000000000000000000000000
  4.12.63   quay.io/openshift-release-dev/ocp-release@sha256:1262000000000000000000000000000000000000000000000000000000000000

Updates with known issues:

  Version: 4.13.49
  Image: quay.io/openshift-release-dev/ocp-release@sha256:1349111111111111111111111111111111111111111111111111111111111111
  Recommended: False
  Reason: ARODNSWrongBootSequence
  Message: Disconnected ARO clusters or clusters with a UDR 0.0.0.0/0 route definition that are blocking the ARO ACR and quay, are not be able to add or replace nodes after an upgrade https://access.redhat.com/solutions/7074686

There are 21 more recommended updates and 67 more updates with known issues. Use --show-outdated-versions to see all older updates.
Goal:
Provide a Technical Preview of Gateway API with Istio to unify the management of cluster ingress with a common, open, expressive, and extensible API.
Description:
Gateway API is the evolution of upstream Kubernetes Ingress APIs. The upstream project is part of Kubernetes, working under SIG-NETWORK. OpenShift is contributing to the development, building a leadership position, and preparing OpenShift to support Gateway API, with Istio as our supported implementation.
The pluggable nature of the Gateway API implementation enables support for additional, optional third-party ingress technologies.
At its core, OpenShift's implementation of Gateway API will be based on the existing Cluster Ingress Operator and OpenShift Service Mesh (OSSM). The Ingress Operator will manage the Gateway API CRDs (gatewayclasses, gateways, httproutes), install and configure OSSM, and configure DNS records for gateways. OSSM will manage the Istio and Envoy deployments for gateways and configure them based on the associated httproutes. Although OSSM in its normal configuration does support service mesh, the Ingress Operator will configure OSSM without service mesh features enabled; for example, using Gateway API will not require the use of sidecar proxies. Istio will be configured specifically to support Gateway API for cluster ingress. See the gateway-api-with-cluster-ingress-operator enhancement proposal for more details.
Additional information on each of the above items can be found here: Networking Definition of Planned
This feature is the place holder for all epics related to technical debt associated with node team
Add a status condition to the nodes.config object with a deprecation message to warn against the use of cgroup v1 mode in case the system is using cgroup v1.
Slack discussion: https://redhat-internal.slack.com/archives/GK6BJJ1J5/p1719508346407769
API change to include the conditions for the status field of nodes.config object.
The ability in OpenShift to create trust and directly consume access tokens issued by external OIDC Authentication Providers using an authentication approach similar to upstream Kubernetes.
BYO Identity will help facilitate CLI-only workflows and the capabilities of the Authentication Provider (such as Keycloak, Dex, Azure AD), similar to upstream Kubernetes.
Ability in OpenShift to provide a direct, pluggable Authentication workflow such that the OpenShift/K8s API server can consume access tokens issued by external OIDC identity providers. Kubernetes provides this integration as described here. Customers/users can then configure their IdPs to support the OIDC protocols and workflows they desire, such as the client credentials flow.
The OpenShift OAuth server is still available as the default option, with the ability to plug in the external OIDC provider as a Day-2 configuration.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Epic Goal
The ability to provide a direct authentication workflow such that OpenShift can consume bearer tokens issued by external OIDC identity providers, replacing the built-in OAuth stack by deactivating/removing its components as necessary.
Why is this important? (mandatory)
OpenShift has its own built-in OAuth server which can be used to obtain OAuth access tokens for authentication to the API. The server can be configured with an external identity provider (including support for OIDC), however it is still the built-in server that issues tokens, and thus authentication is limited to the capabilities of the oauth-server.
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
Contributing Teams(and contacts) (mandatory)
Acceptance Criteria (optional)
Drawbacks or Risk (optional)
Done - Checklist (mandatory)
The test will serve as a development aid to test functionality as it gets added; the test will be extended/adapted as new features are implemented. This test will live behind the "ExternalOIDC" feature gate.
Goals of the baseline test:
Update OpenShift router to recognize a new annotation key "haproxy.router.openshift.io/ip_allowlist" in addition to the old "haproxy.router.openshift.io/ip_whitelist" annotation key. Continue to allow the old annotation key for now, but use the new one if it is present.
In a future release, we may remove the old annotation key, after allowing ample time for route owners to migrate to the new one. (We may also consider replacing the annotation with a formal API field.)
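A sketch of the intended lookup order (the helper is illustrative; the router's actual plumbing differs):

package main

import "fmt"

const (
	allowlistAnnotation = "haproxy.router.openshift.io/ip_allowlist"
	whitelistAnnotation = "haproxy.router.openshift.io/ip_whitelist"
)

// allowedSourceRanges prefers the new allowlist annotation and falls back to the
// deprecated whitelist annotation so existing routes keep working.
func allowedSourceRanges(annotations map[string]string) string {
	if v, ok := annotations[allowlistAnnotation]; ok {
		return v
	}
	return annotations[whitelistAnnotation]
}

func main() {
	fmt.Println(allowedSourceRanges(map[string]string{
		whitelistAnnotation: "10.0.0.0/8",
	}))
}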
Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for the deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.
With OpenShift 4.11, we turned on Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.
With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn".
Epic Goal
Get Pod Security admission to be run in "restricted" mode globally by default alongside with SCC admission.
When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.
To protect platform workloads from such an effect (which, combined with PSa, might result in the workload being rejected once we start enforcing the "restricted" profile), we must pin the required SCC for all workloads in platform namespaces (openshift-*, kube-*, default).
Each workload should pin the least-privileged SCC that satisfies it, except for workloads in runlevel 0 namespaces, which should pin the "privileged" SCC (SCC admission is not enabled in these namespaces, but we should pin an SCC for tracking purposes).
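The pinning itself is done with the openshift.io/required-scc annotation on the workload's pod template; a minimal Go sketch (the deployment and SCC name are examples, not a specific component):

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// pinRequiredSCC records which SCC a platform workload expects, so a
// higher-priority custom SCC cannot mutate it unexpectedly.
func pinRequiredSCC(d *appsv1.Deployment, scc string) {
	if d.Spec.Template.Annotations == nil {
		d.Spec.Template.Annotations = map[string]string{}
	}
	d.Spec.Template.Annotations["openshift.io/required-scc"] = scc
}

func main() {
	d := &appsv1.Deployment{}
	pinRequiredSCC(d, "restricted-v2")
	fmt.Println(d.Spec.Template.Annotations)
}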
The following tables track progress.
| 4.19 | 4.18 | 4.17 | 4.16 | 4.15 | 4.14 |
---|---|---|---|---|---|---|
monitored | 82 | 82 | 82 | 82 | 82 | 82 |
fix needed | 68 | 68 | 68 | 68 | 68 | 68 |
fixed | 39 | 39 | 35 | 32 | 39 | 1 |
remaining | 29 | 29 | 33 | 36 | 29 | 67 |
~ remaining non-runlevel | 8 | 8 | 12 | 15 | 8 | 46 |
~ remaining runlevel (low-prio) | 21 | 21 | 21 | 21 | 21 | 21 |
~ untested | 2 | 2 | 2 | 2 | 82 | 82 |
# | namespace | 4.19 | 4.18 | 4.17 | 4.16 | 4.15 | 4.14 |
---|---|---|---|---|---|---|---|
1 | oc debug node pods | #1763 | #1816 | #1818 | |||
2 | openshift-apiserver-operator | #573 | #581 | ||||
3 | openshift-authentication | #656 | #675 | ||||
4 | openshift-authentication-operator | #656 | #675 | ||||
5 | openshift-catalogd | #50 | #58 | ||||
6 | openshift-cloud-credential-operator | #681 | #736 | ||||
7 | openshift-cloud-network-config-controller | #2282 | #2490 | #2496 | |||
8 | openshift-cluster-csi-drivers | #6 #118 | #524 #131 #306 #265 #75 | #170 #459 | #484 | ||
9 | openshift-cluster-node-tuning-operator | #968 | #1117 | ||||
10 | openshift-cluster-olm-operator | #54 | n/a | n/a | |||
11 | openshift-cluster-samples-operator | #535 | #548 | ||||
12 | openshift-cluster-storage-operator | #516 | #459 #196 | #484 #211 | |||
13 | openshift-cluster-version | #1038 | #1068 | ||||
14 | openshift-config-operator | #410 | #420 | ||||
15 | openshift-console | #871 | #908 | #924 | |||
16 | openshift-console-operator | #871 | #908 | #924 | |||
17 | openshift-controller-manager | #336 | #361 | ||||
18 | openshift-controller-manager-operator | #336 | #361 | ||||
19 | openshift-e2e-loki | #56579 | #56579 | #56579 | #56579 | ||
20 | openshift-image-registry | #1008 | #1067 | ||||
21 | openshift-ingress | #1032 | |||||
22 | openshift-ingress-canary | #1031 | |||||
23 | openshift-ingress-operator | #1031 | |||||
24 | openshift-insights | #1033 | #1041 | #1049 | #915 | #967 | |
25 | openshift-kni-infra | #4504 | #4542 | #4539 | #4540 | ||
26 | openshift-kube-storage-version-migrator | #107 | #112 | ||||
27 | openshift-kube-storage-version-migrator-operator | #107 | #112 | ||||
28 | openshift-machine-api | #1308 #1317 | #1311 | #407 | #315 #282 #1220 #73 #50 #433 | #332 #326 #1288 #81 #57 #443 | |
29 | openshift-machine-config-operator | #4636 | #4219 | #4384 | #4393 | ||
30 | openshift-manila-csi-driver | #234 | #235 | #236 | |||
31 | openshift-marketplace | #578 | #561 | #570 | |||
32 | openshift-metallb-system | #238 | #240 | #241 | |||
33 | openshift-monitoring | #2298 #366 | #2498 | #2335 | #2420 | ||
34 | openshift-network-console | #2545 | |||||
35 | openshift-network-diagnostics | #2282 | #2490 | #2496 | |||
36 | openshift-network-node-identity | #2282 | #2490 | #2496 | |||
37 | openshift-nutanix-infra | #4504 | #4539 | #4540 | |||
38 | openshift-oauth-apiserver | #656 | #675 | ||||
39 | openshift-openstack-infra | #4504 | #4539 | #4540 | |||
40 | openshift-operator-controller | #100 | #120 | ||||
41 | openshift-operator-lifecycle-manager | #703 | #828 | ||||
42 | openshift-route-controller-manager | #336 | #361 | ||||
43 | openshift-service-ca | #235 | #243 | ||||
44 | openshift-service-ca-operator | #235 | #243 | ||||
45 | openshift-sriov-network-operator | #995 | #999 | #1003 | |||
46 | openshift-user-workload-monitoring | #2335 | #2420 | ||||
47 | openshift-vsphere-infra | #4504 | #4542 | #4539 | #4540 | ||
48 | (runlevel) kube-system | ||||||
49 | (runlevel) openshift-cloud-controller-manager | ||||||
50 | (runlevel) openshift-cloud-controller-manager-operator | ||||||
51 | (runlevel) openshift-cluster-api | ||||||
52 | (runlevel) openshift-cluster-machine-approver | ||||||
53 | (runlevel) openshift-dns | ||||||
54 | (runlevel) openshift-dns-operator | ||||||
55 | (runlevel) openshift-etcd | ||||||
56 | (runlevel) openshift-etcd-operator | ||||||
57 | (runlevel) openshift-kube-apiserver | ||||||
58 | (runlevel) openshift-kube-apiserver-operator | ||||||
59 | (runlevel) openshift-kube-controller-manager | ||||||
60 | (runlevel) openshift-kube-controller-manager-operator | ||||||
61 | (runlevel) openshift-kube-proxy | ||||||
62 | (runlevel) openshift-kube-scheduler | ||||||
63 | (runlevel) openshift-kube-scheduler-operator | ||||||
64 | (runlevel) openshift-multus | ||||||
65 | (runlevel) openshift-network-operator | ||||||
66 | (runlevel) openshift-ovn-kubernetes | ||||||
67 | (runlevel) openshift-sdn | ||||||
68 | (runlevel) openshift-storage |
We should be able to correlate flows with network policies:
PoC doc: https://docs.google.com/document/d/14Y3YYFxuOs3o-Lkipf-d7ZZp5gpbk6-01ZT_fTraCu8/edit
There are two possible approaches in terms of implementation:
The PoC describes the former; however, it is probably most interesting to aim for the latter. (95% of the PoC is valid in both cases, i.e. all the "low level" parts: OVS, OVN.) The latter involves more work in FLP.
We need to do a lot of R&D and fix some known issues (e.g., see linked BZs).
R&D is targeted at 4.16, with productization of this feature in 4.17.
Goal
To make the current implementation of the HAProxy config manager the default configuration.
Objectives
https://issues.redhat.com/browse/NE-1788 describes 3 gaps in the implementation of DAC:
Additional gaps were discovered along the way:
This story aims at fixing those gaps.
The goal of this user story is to combine the code from the smoke test user story and results from the spike into an implementation PR.
Since multiple gaps were discovered, a feature gate will be needed to ensure the stability of OCP before the feature can be enabled by default.
Initiative: Improve etcd disaster recovery experience (part3)
With OCPBU-252 and OCPBU-254 we created the foundations for an enhanced recovery procedure in the case of full control-plane loss. This requires researching total control-plane failure scenarios for clusters deployed using the various deployment methodologies.
Epic Goal*
Improve the disaster recovery experience by providing automation for the steps to recover from an etcd quorum loss scenario.
Determining the exact format of the automation (bash script, Ansible playbook, CLI) is part of this epic, but ideally it would be something the admin can initiate on the recovery host that then walks through the disaster recovery steps given the necessary inputs (e.g. backup and static pod files, SSH access to the recovery and non-recovery hosts, etc.).
Why is this important? (mandatory)
There are a large number of manual steps in the currently documented disaster recovery workflow, which customers and support staff have flagged as too cumbersome and error prone.
https://docs.openshift.com/container-platform/4.15/backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html
Providing more automation would improve that experience and also let the etcd team better support and test the disaster recovery workflow.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
(TBD based on the delivery vehicle for the automation):
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
After running the quorum restore script we want to bring the other members back into the cluster automatically.
Currently the init container in
is guarding that case by checking whether the member is part of the cluster already and has an empty datadir.
We need to adjust this check by testing whether the cluster id of the currently configured member and the current datadir refer to the same cluster.
When we detect a mismatch, we can assume the cluster was recovered by quorum restore and we can attempt to move the folder to automatically make the member join the cluster again.
We need to add an e2e test to our disaster recovery suite in order to exercise that the quorum can be restored automatically.
While we're at it, we can also disable the experimental rev bumping introduced with:
https://github.com/openshift/origin/pull/28073
Several steps cover the shutdown of the etcd static pod. We can provide a script to execute, which you can simply run over SSH:
> ssh core@node disable-etcd.sh
That script should move the static pod manifest into a different folder and wait for the containers to shut down.
Currently we have the bump guarded by an env variable:
and a hardcoded bump of 1 billion revisions in:
https://github.com/openshift/cluster-etcd-operator/blob/master/bindata/etcd/restore-pod.yaml#L64-L68
With this story we should remove the feature flag and enable the bumping by default. The bump amount should come from the file created in ETCD-652 plus some slack percentage. If the file doesn't exist we assume the default value of a billion again.
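A sketch of that computation in Go (the file name and slack percentage are assumptions; ETCD-652 defines the actual file):

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

const defaultBump = 1_000_000_000 // fall back to the current hardcoded ~1bn revisions

// bumpAmount reads the latest observed raft index written by the sidecar (see
// ETCD-652) and adds a slack percentage; if the file is missing we keep the default.
func bumpAmount(path string, slackPercent int64) int64 {
	data, err := os.ReadFile(path)
	if err != nil {
		return defaultBump
	}
	v, err := strconv.ParseInt(strings.TrimSpace(string(data)), 10, 64)
	if err != nil {
		return defaultBump
	}
	return v + v*slackPercent/100
}

func main() {
	fmt.Println(bumpAmount("/var/lib/etcd/max-raft-index", 10))
}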
With the downstream carry merged in ETCD-696, we need to implement the flag in the CEO.
Based on --force-new-cluster, we need to add a quorum restore script that will only do that, without any inputs.
To enable resource version bumps on restore, we would need to know how far into the future (in terms of revisions) we need to bump.
We can get this information by requesting endpoint status on each member and using the maximum of all RaftIndex fields as the result. Alternatively by finding the current leader and getting its endpoint status directly.
Even though this is not an expensive operation, it should be polled at a sensible interval, e.g. once every 30s.
The result should be written as a text file to the hostPath /var/lib/etcd, which is already mounted on all relevant pods. An additional etcd sidecar container is probably the most sensible choice to run this code.
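A sketch of that sidecar loop using the etcd clientv3 API (the output file name, the endpoint, and the interval are assumptions):

package main

import (
	"context"
	"fmt"
	"os"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// maxRaftIndex asks every member for its endpoint status and returns the highest
// RaftIndex seen, which is the ceiling we would need to bump past on restore.
func maxRaftIndex(ctx context.Context, cli *clientv3.Client, endpoints []string) (uint64, error) {
	var max uint64
	for _, ep := range endpoints {
		resp, err := cli.Status(ctx, ep)
		if err != nil {
			return 0, err
		}
		if resp.RaftIndex > max {
			max = resp.RaftIndex
		}
	}
	return max, nil
}

func main() {
	endpoints := []string{"https://10.0.0.4:2379"} // assumed; the sidecar would read the real member list
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		panic(err)
	}
	defer cli.Close()
	for range time.Tick(30 * time.Second) {
		idx, err := maxRaftIndex(context.Background(), cli, endpoints)
		if err != nil {
			continue // transient errors are fine; we poll again
		}
		// /var/lib/etcd is already mounted on the relevant pods.
		_ = os.WriteFile("/var/lib/etcd/max-raft-index", []byte(fmt.Sprintf("%d", idx)), 0o600)
	}
}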
Currently the readiness probe (of the guard pod) will constantly fail because the restore pod containers do not have the readyZ sidecar container.
Example error message:
> Oct 16 13:42:52 ci-ln-s2hivzb-72292-6r8kj-master-2 kubenswrapper[2624]: I1016 13:42:52.512331 2624 prober.go:107] "Probe failed" probeType="Readiness" pod="openshift-etcd/etcd-guard-ci-ln-s2hivzb-72292-6r8kj-master-2" podUID="2baa50c6-b5cd-463e-9b35-165570e94b76" containerName="guard" probeResult="failure" output="Get \"https://10.0.0.4:9980/readyz\": dial tcp 10.0.0.4:9980: connect: connection refused"
AC:
To be broken into one feature epic and a spike:
The MCO today has multiple layers of errors. There are, generally speaking, 4 locations where an error message can appear, from highest to lowest:
The error propagation is generally not 1-to-1. The operator status will generally capture the pool status, but the full error from the Controller/Daemon does not fully bubble up to the pool/operator, and journal logs with errors generally don't get bubbled up at all. This is very confusing for customers/admins working with the MCO without a full understanding of the MCO's internal mechanics:
Using “unexpected on-disk state” as an example, this can be caused by any amount of the following:
Etc. etc.
Since error use cases are wide and varied, there are many improvements we can perform for each individual error state. This epic aims to propose targeted improvements to error messaging and propagation specifically. The goals being:
With a side objective of observability, including reporting all the way to the operator status items such as:
Approaches can include:
Description:
The MCC sends a drain alert when a node drain doesn't succeed within the drain timeout period (currently 1 hour). This is to make sure that the admin takes appropriate action, if required, by looking at the MCC pod logs. The alert contains information on where to look for the logs.
Example alert looks like:
Drain failed on Node <node_name>, updates may be blocked. For more details: oc logs -f -n openshift-machine-config-operator machine-config-controller-xxxxx -c machine-config-controller
It is possible that the admin may not be able to determine the exact action to take after looking at the MCC pod logs. Adding a runbook (https://github.com/openshift/runbooks) can help the admin troubleshoot and take appropriate action.
Acceptance Criteria:
Phase 2 Goal:
for Phase-1, incorporating the assets from different repositories to simplify asset management.
Overarching Goal
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone Openshift.
Phase 1 & 2 covers implementing base functionality for CAPI.
Phase 2 also covers migrating MAPI resources to CAPI.
There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
To enable a quick start in CAPI, we want to allow the users to provide just the Machines/MachineSets and the relevant configuration for the Machines. The cluster infrastructure is either not required to be populated, or something they should not care about.
To enable the quick start, we should create and where applicable, populate required fields for the infrastructure cluster.
This will go alongside a generated Cluster object and should mean that the `openshift-cluster-api` Cluster is now infrastructure ready.
We expect every openshift cluster that relies on Cluster API to have an infrastructure cluster and a cluster object.
These resources should exist for the lifetime of the cluster and should not be able to be removed.
We must ensure that infracluster objects from supported platforms cannot be deleted once created.
Changes to go into the cluster-capi-operator.
Implement Migration core for MAPI to CAPI for AWS
When customers use CAPI, there must be no negative effect from switching over to using CAPI: migration of Machine resources should be seamless, and the fields in MAPI/CAPI should reconcile from both CRDs.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
MAPA has support for users to configure the Network DeviceIndex.
According to AWS, the primary network interface must use the value 0.
It appears that CAPA already forces this (it only supports creating one primary network interface) or assigns these values automatically if you are supplying your own network interfaces.
Therefore, it is likely that we do not need to support this value (MAPA only supports a single network interface), but we must be certain.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
We want to build out a sync controller for both Machine and MachineSet resources.
This card is about bootstrapping the basics of the controllers, with more implementation to follow once we have the base structure.
For this card, we are expecting to create 2 controllers, one for Machines, one for MachineSets.
The MachineSet controller should watch MachineSets from both MachineAPI and ClusterAPI in the respective namespaces that we care about. It should also be able to watch the referenced infrastructure templates from the CAPI MachineSets.
For the Machine controller, it should watch both types of Machines in MachineAPI and ClusterAPI in their respective namespaces. It should also be able to watch for InfrastructureMachines for the CAPI Machines in the openshift-cluster-api namespace.
If changes to any of the above resources occur, the controllers should trigger a reconcile which will fetch both the Machine API and Cluster API versions of the resources, and then split the reconcile depending on which version is authoritative.
Deletion logic will be handled by a separate card, but will need a fork point in the main reconcile that accounts for if either of the resources have been deleted, once they have been fetched from the cache.
Note, if a MachineSet exists only in CAPI, the controller can exit and ignore the reconcile request.
If a Machine only exists in CAPI, but is owned by another object (MachineSet for now) that is then mirrored into MAPI, the Machine needs to be reconciled so that we can produce the MAPI mirror of the Machine.
We have now merged a design for the MAPI to CAPI library, but, have not been extensively testing it up to now.
There are a large number of fields that currently cannot be converted, and we should ensure each of these is tested.
Fuzz testing should be used to create round trip testing and pick up issues in conversion.
Fuzz tests auto-generate data for fields, letting us ensure that combinations of fields are converted appropriately; they also pick up when new fields are introduced into the APIs, by ensuring that every field is correctly round-tripped.
We would like to set up a pattern for fuzz testing that can be used across various providers as we implement new provider conversions.
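A sketch of the intended round-trip pattern for AWS (the conversion helpers are stubs standing in for the real library functions and must be wired up before this test is meaningful):

package conversion

import (
	"testing"

	fuzz "github.com/google/gofuzz"
	"k8s.io/apimachinery/pkg/api/equality"

	mapiv1 "github.com/openshift/api/machine/v1beta1"
)

// Stubs standing in for the real conversion library; replace before running.
func convertMAPIToCAPI(in *mapiv1.AWSMachineProviderConfig) (interface{}, error) {
	panic("replace with the real MAPI -> CAPI conversion")
}

func convertCAPIToMAPI(in interface{}) (*mapiv1.AWSMachineProviderConfig, error) {
	panic("replace with the real CAPI -> MAPI conversion")
}

// TestAWSProviderSpecRoundTrip fuzzes a Machine API AWS providerSpec, converts it to
// the CAPI representation and back, and verifies nothing was lost along the way.
func TestAWSProviderSpecRoundTrip(t *testing.T) {
	fuzzer := fuzz.New().NilChance(0.2).NumElements(0, 3)
	for i := 0; i < 1000; i++ {
		in := mapiv1.AWSMachineProviderConfig{}
		fuzzer.Fuzz(&in)

		capi, err := convertMAPIToCAPI(&in)
		if err != nil {
			continue // unconvertible combinations are asserted on separately
		}
		out, err := convertCAPIToMAPI(capi)
		if err != nil {
			t.Fatalf("backward conversion failed: %v", err)
		}
		if !equality.Semantic.DeepEqual(&in, out) {
			t.Errorf("round trip mismatch:\nin:  %+v\nout: %+v", in, out)
		}
	}
}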
When the Machine and MachineSet MAPI resources are non-authoritative, the Machine and MachineSet controllers should observe this condition and should exit, pausing the reconciliation.
When they pause, they should acknowledge this pause by adding a paused condition to the status and ensuring it is set to true.
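A minimal sketch of acknowledging the pause (the condition type and reason strings are illustrative, and the MAPI status may use its own condition type rather than metav1.Condition):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// acknowledgePause records that the MAPI controller has stopped reconciling a
// non-authoritative resource.
func acknowledgePause(conditions *[]metav1.Condition, generation int64) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               "Paused",
		Status:             metav1.ConditionTrue,
		Reason:             "AuthoritativeAPIClusterAPI",
		Message:            "reconciliation is paused because Cluster API is authoritative",
		ObservedGeneration: generation,
	})
}

func main() {
	var conditions []metav1.Condition
	acknowledgePause(&conditions, 1)
	fmt.Printf("%+v\n", conditions)
}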
The core of the Machine API to Cluster API conversion will rely on a bi-directional conversion library that can convert providerSpecs from Machine API into InfraTemplates in Cluster API, and back again.
We should aim to have a platform agnostic interface such that the core logic of the migration mechanism need not care about platforms specific detail.
The library should also be able to return errors when conversion is not possible, which may occur when:
These errors should resemble the API validation errors from webhooks, for familiarity, using utils such as `field.NewPath` and the InvalidValue error types.
We expect this logic to be used in the core sync controllers, responsible for converting Machine API resources to Cluster API resources and vice versa.
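A sketch of what such an error could look like, built with the apimachinery field utilities (the field path and value here are made up for illustration):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/validation/field"
)

// unsupportedFieldError shows the intended shape of conversion errors: they mirror
// API validation errors from webhooks, so they read familiarly to users.
func unsupportedFieldError(value string) field.ErrorList {
	return field.ErrorList{
		field.Invalid(
			field.NewPath("spec", "providerSpec", "value", "unsupportedField"),
			value,
			"cannot be converted to Cluster API: no equivalent field exists",
		),
	}
}

func main() {
	fmt.Println(unsupportedFieldError("example").ToAggregate())
}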
DoD:
To be able to continue to operate MachineSets, we need a backwards conversion once the migration has occurred. We do not expect users to remove the MAPI MachineSets immediately, and the logic will be required for when we remove the MAPI controllers.
This covers the case where the CAPI MachineSet is authoritative or only a CAPI MachineSet exists.
Epic Goal
This is the epic tracking the work to collect a list of TLS artifacts (certificates, keys and CA bundles).
This list will contain a set of required and optional metadata. Examples of required metadata are ownership (name of the Jira component) and the ability to auto-regenerate the certificate after it has expired while offline. In most cases metadata can be set via annotations on the secret/configmap containing the TLS artifact.
Components not meeting the required metadata will fail CI - i.e. when a pull request makes a component create a new secret, the secret is expected to have all necessary metadata present to pass CI.
This PR will enforce it: "WIP API-1789: make TLS registry tests required".
In order to keep track of existing certs/CA bundles and ensure that they adhere to requirements we need to have a TLS artifact registry setup.
The registry would:
Ref: API-1622
To improve automation, governance, and security, AWS customers extensively use AWS Tags to track resources. Customers want the ability to change user tags on day 2 without having to recreate the cluster to add or modify one or more tags.
The observable functionality that the user now has as a result of receiving this feature. Complete during New status.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
This feature will only apply to ROSA with Hosted Control Planes, and ROSA Classic / standalone is excluded.
Support reconciliation of tags on day-2 updates from the Infrastructure.status field (see the sketch after the acceptance criteria).
Acceptance criteria
1. Successful updates on tag information updates.
2. Conflict error handling.
3. Unit testcases.
4. e2e testcases.
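A sketch of the read side of that reconciliation (types are from openshift/api config/v1; the diff-and-tag calls against AWS are left to a hypothetical reconciler):

package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// desiredTags flattens the day-2 user tags from Infrastructure.status; a reconciler
// would diff these against the tags currently on each AWS resource and apply only
// the changes, handling conflicts as per the acceptance criteria.
func desiredTags(infra *configv1.Infrastructure) map[string]string {
	tags := map[string]string{}
	if infra.Status.PlatformStatus == nil || infra.Status.PlatformStatus.AWS == nil {
		return tags
	}
	for _, t := range infra.Status.PlatformStatus.AWS.ResourceTags {
		tags[t.Key] = t.Value
	}
	return tags
}

func main() {
	infra := &configv1.Infrastructure{Status: configv1.InfrastructureStatus{
		PlatformStatus: &configv1.PlatformStatus{AWS: &configv1.AWSPlatformStatus{
			ResourceTags: []configv1.AWSResourceTag{{Key: "team", Value: "hypershift"}},
		}},
	}}
	fmt.Println(desiredTags(infra))
}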
Allow the user to create the Agent ISO image as a minimal ISO (sans rootfs).
This is supported for the external platform, added for OCI in 4.14. This adds support for the rest of the platforms supported by the agent-based installer.
All platforms supported by agent can install using a minimal ISO:
Currently the agent-based installer creates a full ISO for all platforms except OCI (External), for which a minimal ISO is created by default. Work is being done to support the minimal ISO for all platforms. In this case either a new command must be used to create the minimal ISO instead of the full ISO, or a flag must be added to the "agent create image" command.
UPDATE 9/30: Based on feedback from Zane (https://github.com/openshift/installer/pull/9056#discussion_r1777838533), the plan has changed to use a new field in agent-config.yaml to define that a minimal ISO should be generated, instead of either a new command or a flag on the existing command.
Currently minimal ISO support is only provided for the External platform (see https://issues.redhat.com//browse/AGENT-702). As part of the attached Epic, all platforms will now support minimal ISO. The checks that limit minimal ISO to External platform only should be removed.
With the addition of a new field in agent-config.yaml to create a minimal ISO that can be used on all platforms, an integration test should be added to test this support.
The integration test can check that the created ISO is below the size expected for a full ISO and also that any Ignition files are properly set for minimal ISO support.
Currently the internal documentation describes creating a minimal ISO only for the External platform. With the change to support the minimal ISO on all platforms, the documentation should be updated.
Migrate every occurrence of iptables in OpenShift to use nftables, instead.
Implement a full migration from iptables to nftables within a series of "normal" upgrades of OpenShift with the goal of not causing any more network disruption than would normally be required for an OpenShift upgrade. (Different components may migrate from iptables to nftables in different releases; no coordination is needed between unrelated components.)
iptables is going away in RHEL 10; we need to replace all remaining usage of iptables in OCP with nftables before then.
The gcp-routes and azure-routes scripts in MCO use iptables rules and need to be ported to use nftables.
Goal Summary
This feature aims to make sure that the HyperShift operator and the control plane it deploys use Managed Service Identities (MSI) and have access to scoped credentials (potentially also via access to AKS's image gallery). Additionally, operators deployed in the customer's account (system components) would be scoped with Azure workload identities.
The Cluster Storage Operator can authenticate with Service Principal backed by a certificate stored in an Azure Key Vault. The Secrets CSI driver will be used to mount the certificate as a volume on the image registry deployment in a hosted control plane.
Azure SDK
Which degree of coverage should run on AKS e2e vs on existing e2es
CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>
The Cluster Storage Operator needs to pass the Secret Provider Class to the azure-disk and azure-file CSI controllers so they can authenticate with a client certificate.
Azure SDK
CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>
The Cluster Network Operator can authenticate with Service Principal backed by a certificate stored in an Azure Key Vault. The Secrets CSI driver will be used to mount the certificate as a volume on the image registry deployment in a hosted control plane.
Azure SDK
Which degree of coverage should run on AKS e2e vs on existing e2es
CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>
The Cloud Ingress Operator would get Azure credentials using Azure SDK's generic NewDefaultAzureCredential function.
Azure SDK
CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>
Epic Goal*
Support Managed Service Identity (MSI) authentication in Azure.
Why is this important? (mandatory)
This is a requirement to run storage controllers that require cloud access on Azure with hosted control plane topology.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Description of problem:
We discovered that the azure-disk and azure-file-csi-controllers are reusing CCM managed identity. Each of these three components should have their own managed identity and not reuse another's managed identity.
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Create an AKS mgmt cluster
2. Create a HCP with MI
3. Observe azure-disk and azure-file controllers are reusing azure CCM MI
Actual results:
the azure-disk and azure-file-csi-controllers are reusing CCM managed identity
Expected results:
the azure-disk and azure-file-csi-controllers should each have their own managed identity
Additional info:
The Cluster Ingress Operator can authenticate with Service Principal backed by a certificate stored in an Azure Key Vault. The Secrets CSI driver will be used to mount the certificate as a volume on the image registry deployment in a hosted control plane.
Azure SDK
Which degree of coverage should run on AKS e2e vs on existing e2es
CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>
The Cloud Ingress Operator would get Azure credentials using Azure SDK's generic NewDefaultAzureCredential function.
Azure SDK
CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>
The image registry can authenticate with Service Principal backed by a certificate stored in an Azure Key Vault. The Secrets CSI driver will be used to mount the certificate as a volume on the image registry deployment in a hosted control plane.
Azure SDK
Which degree of coverage should run on AKS e2e vs on existing e2es
CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>
The image registry would get Azure credentials using Azure SDK's generic NewDefaultAzureCredential function.
Azure SDK
CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>
Today, Azure installation requires a manually created service principal, which involves relations, permission granting, credential setting, credential storage, credential rotation, credential clean-up, and service principal deletion. This is not only mundane and time-consuming but also less secure, risking access to resources by adversaries due to a lack of credential rotation.
Employ Azure managed credentials, which drastically reduce the required steps to just managed identity creation, permission granting, and resource deletion.
Ideally, this should be a HyperShift-native functionality. I.e., HyperShift should use managed identities for the control plane, the kubelet, and any add-on that needs access to Azure resources.
Operators running on the management side that need to access the Azure customer account will use MSI.
Operands running in the guest cluster should rely on workload identity.
This ticket is to solve the latter.
We need to implement workload identity support in our components that run on the spoke cluster.
Address any TODOs in the code related to this ticket.
https://redhat-external.slack.com/archives/C075PHEFZKQ/p1727710473581569
https://docs.google.com/document/d/1xFJSXi71bl-fpAJBr2MM1iFdUqeQnlcneAjlH8ogQxQ/edit#heading=h.8e4x3inip35u
If we decide to drop the MSI init and adapter and expose the certs in the management cluster directly via Azure Key Vault Secret Store CSI Driver pod volumes, this would remove complexity and avoid the need for highly permissive pods with network access.
Action items:
func azureCreds(options *azidentity.DefaultAzureCredentialOptions) (*azidentity.DefaultAzureCredential, error) {
	if certPath := os.Getenv("AZURE_CLIENT_CERTIFICATE_PATH"); certPath != "" {
		// Set up a watch on our config file; if it changes, we should exit -
		// (we don't have the ability to dynamically reload config changes).
		// watchForChanges and stopCh are defined elsewhere in the PoC.
		if err := watchForChanges(certPath, stopCh); err != nil {
			return nil, err
		}
	}
	return azidentity.NewDefaultAzureCredential(options)
}
Proof of Concept with Ingress as the example OpenShift component - https://github.com/openshift/hypershift/pull/4841/commits/35ac5fd3310b9199309e9e8a47ee661771ec71cf
AZ CLI command to create the key vault
# Create Management Azure Key Vault
az keyvault create \
  --name ${PREFIX} \
  --resource-group ${AKS_RG} \
  --location ${LOCATION} \
  --enable-rbac-authorization
AZ CLI command to create the managed identity for the key vault
## Create the managed identity for the Management Azure Key Vault
az identity create --name "${AZURE_KEY_VAULT_AUTHORIZED_USER}" --resource-group "${AKS_RG}"

AZURE_KEY_VAULT_AUTHORIZED_USER_ID=$(az identity show --name "${AZURE_KEY_VAULT_AUTHORIZED_USER}" --resource-group "${AKS_RG}" --query principalId --output tsv)

az role assignment create \
  --assignee-object-id "${AZURE_KEY_VAULT_AUTHORIZED_USER_ID}" \
  --role "Key Vault Secrets User" \
  --scope /subscriptions/${SUBSCRIPTION_ID}/resourceGroups/"${AKS_RG}" \
  --assignee-principal-type ServicePrincipal
AZ CLI command that creates a Service Principal with a backing cert stored in the Azure Key Vault
az ad sp create-for-rbac \
  --name ingress \
  --role "Contributor" \
  --scopes /subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${MANAGED_RG_NAME} \
  --create-cert \
  --cert ${CERTIFICATE_NAME} \
  --keyvault ${KEY_VAULT_NAME}
This feature focuses on the optimization of resource allocation and image management within NodePools. This will include enabling users to specify resource groups at NodePool creation, integrating external DNS support, ensuring Cluster API (CAPI) and other images are sourced from the payload, and utilizing Image Galleries for Azure VM creation.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Before GAing Azure, let's make sure we do a final API review.
Before GAing Azure, the API needs to go through review.
https://github.com/openshift/hypershift/blob/main/api/hypershift/v1beta1/nodepool_types.go#L430
https://github.com/openshift/hypershift/blob/main/api/hypershift/v1beta1/hostedcluster_types.go#L877
Goal:
As an administrator, I would like to use my own managed DNS solution instead of only specific openshift-install supported DNS services (such as AWS Route53, Google Cloud DNS, etc...) for my OpenShift deployment.
Problem:
While cloud-based DNS services provide convenient hostname management, there's a number of regulatory (ITAR) and operational constraints customers face prohibiting the use of those DNS hosting services on public cloud providers.
Why is this important:
Dependencies (internal and external):
Prioritized epics + deliverables (in scope / not in scope):
Estimate (XS, S, M, L, XL, XXL):
Previous Work:
Open questions:
Link to Epic: https://docs.google.com/document/d/1OBrfC4x81PHhpPrC5SEjixzg4eBnnxCZDr-5h3yF2QI/edit?usp=sharing
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
On the bootstrap node, keep NetworkManager generated resolv.conf updated with the nameserver pointing to the localhost.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
This epic covers the scope of automation-related stories in ODC
Automation enhancements for ODC
Description of problem:
If knative operator is installed without creation of any of its instances tests will fail
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Install the knative operator without creating any one (or all three) of its instances
2. Run the knative e2e tests
Actual results:
Tests will fail saying: Error from server particular instance not found
Expected results:
Mechanism should be present to create missing instance
Additional info:
Description of problem:
Enabling the topology tests in CI
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
After the addition of the CLI method of operator installation, the test doesn't necessarily require admin privileges. Currently the test adds the overhead of creating an admin session and page navigations, which are not required.
KN-05-TC05, KN-02-TC12, and SF-01-TC06 are flaking on CI due to variable resource creation times and some other unknown factors which need to be identified.
Improve onboarding experience for using Shipwright Builds in OpenShift Console
Enable users to create and use Shipwright Builds in OpenShift Console while requiring minimal expertise about Shipwright
Requirements | Notes | IS MVP |
---|---|---|
Enable creating Shipwright Builds using a form | | Yes |
Allow use of Shipwright Builds for image builds during import flows | | Yes |
Enable access to build strategies through navigation | | Yes |
TBD
TBD
Shipwright Builds UX in Console should provide a simple onboarding path for users in order to transition them from BuildConfigs to Shipwright Builds.
TBD
TBD
TBD
TBD
TBD
Creating Shipwright Builds through YAML is complex and requires Shipwright expertise, which makes it difficult for novice users to use Shipwright.
Provide a form for creating Shipwright Builds
To simplify adoption of Shipwright and ease onboarding
Create build
As a user, I want to create a Shipwright build using the form.
Event discovery allows for dynamic and interactive user experiences and event catalogs provide users with a structured way to discover available events within the system. Users can explore different event types, their descriptions, and associated metadata, making it easier to understand the capabilities and functionalities offered by the system.
By providing visibility into the available events and their characteristics, event catalogs help users understand how the system behaves and what events they can expect to occur as well as streamline the process of subscribing to and consuming events within the system.
EventType doc: https://knative.dev/docs/eventing/features/eventtype-auto-creation/#produce-events-to-the-broker
As a user, I want to see the catalogs for the Knative Events.
As a user, I want to subscribe to the Knative service using a form
Placeholder for small Epics not justifying a standalone Feature, in the context of technical debt and the ability to operate and troubleshoot. This Feature is not needed except during planning phases when we plan Features, until we enter Epic planning.
NO MORE ADDITION OF ANY EPIC post 4.18 planning - Meaning NOW. One Feature per Epic from now on!
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Targeted support is equivalent to that of SR-IOV kernel and MACVLAN; see https://issues.redhat.com/browse/CNF-1470 and https://issues.redhat.com/browse/CNF-5528
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Make the multinetwork-policy daemon manage networks of type `bond`.
This can be achieved by updating the argument `--network-plugins` in the cluster-network-operator:
To date our work within the telecommunications radio access network space has focused primarily on x86-based solutions. Industry trends around sustainability, and more specific discussions with partners and customers, indicate a desire to progress towards ARM-based solutions, with a view to production deployments in roughly a 2025 timeframe. This would mean being able to support one or more RAN partners' DU applications on ARM-based servers.
Depending on the source, 75-85% of service provider network power consumption is attributable to the RAN sites, with data centers making up the remainder. This means that in the face of increased downward pressure on both TCO and carbon footprint (the former for company performance reasons, the latter for regulatory reasons), it is an attractive place to make substantial improvements using economies of scale.
There are currently three main obvious thrusts to how to go about this:
This BU priority focuses on the third of these approaches.
Reference Documents:
Both the Node Tuning Operator and TuneD assume the Intel x86 architecture is used when a Performance Profile is applied. For example, they both configure Intel x86 specific kernel parameters (e.g. intel_pstate).
In order to support Telco RAN DU deployments on the ARM architecture, we will need a way to apply a performance profile to configure the server for low latency applications. This will include tuning common to both Intel/ARM and tuning specific to one of the architectures.
The purpose of this Epic:
The validator for the Huge Pages sizes in NTO needs to be updated to account for more valid options.
Currently it only allows the values "1G" and "2M" but we want to be able to use "512M" on ARM. We may also want to support other values (https://docs.kernel.org/6.3/arm64/hugetlbpage.html) and we probably also want to validate that the size selected is at least valid for the architecture being used.
The validation is performed here: https://github.com/openshift/cluster-node-tuning-operator/blob/release-4.16/pkg/apis/performanceprofile/v2/performanceprofile_validation.go#L56
Original slack thread: https://redhat-internal.slack.com/archives/CQNBUEVM2/p1717011791766049
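For illustration, a minimal Go sketch of the kind of architecture-aware check being proposed is below. It is not the actual NTO validator; the function name and the per-architecture size lists are assumptions that would need to be confirmed against the kernel documentation for each architecture.

package validation

import (
	"fmt"
	"runtime"
)

// validHugepageSizes lists the hugepage sizes considered valid per
// architecture. The values are illustrative only; the real list should be
// derived from the kernel documentation for each architecture and page size.
var validHugepageSizes = map[string][]string{
	"amd64": {"2M", "1G"},
	"arm64": {"64K", "2M", "32M", "512M", "1G", "16G"},
}

// validateHugepageSize checks that the requested size is valid for the given
// architecture (defaulting to the build architecture when arch is empty).
func validateHugepageSize(size, arch string) error {
	if arch == "" {
		arch = runtime.GOARCH
	}
	allowed, ok := validHugepageSizes[arch]
	if !ok {
		return fmt.Errorf("unknown architecture %q", arch)
	}
	for _, s := range allowed {
		if s == size {
			return nil
		}
	}
	return fmt.Errorf("hugepage size %q is not valid for architecture %q (valid sizes: %v)", size, arch, allowed)
}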
This story will serve to collect minor upstream enhancements to NTO that do not directly belong to an objective story in the greater epic
Overview
An elevator pitch (value statement) that describes the parts of a Feature in a clear, concise way that will be addressed by this Epic
Acceptance Criteria
The list of requirements to be met to consider this Epic feature-complete
Done Criteria
References
Links to Gdocs, GitHub, and any other relevant information about this epic.
When setting Autorepair to enabled for a NodePool in OCM, the NodePool controller from HyperShift applies a default CAPI MHC, defined at https://github.com/openshift/hypershift/blob/4954df9582cd647243b42f87f4b10d4302e2b270/hypershift-operator/controllers/nodepool/capi.go#L673, which has a NodeStartupTimeout (from creation to joining the cluster) of 20 minutes.
Bare metal instances are known to be slower to boot (see OSD-13791), and so in classic we have defined 2 MHCs for worker nodes:
We should analyse together with the HyperShift team what is the best way forward to cover this use case.
Initial ideas to explore:
The behavior has been observed within XCMSTRAT-1039, but it is already present with bare metal instances and so can cause poor UX (machines are cycled until we are lucky enough to get a faster boot time).
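As one possible direction for the ideas above, the following is a rough Go sketch of a CAPI MachineHealthCheck with a longer NodeStartupTimeout for bare metal NodePools. The name, selector label, and timeout values are illustrative assumptions, not the values HyperShift would necessarily use.

package example

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// bareMetalMHC sketches a MachineHealthCheck with a longer NodeStartupTimeout
// than the default 20 minutes, to accommodate slow-booting bare metal instances.
func bareMetalMHC(clusterName string) *clusterv1.MachineHealthCheck {
	startupTimeout := metav1.Duration{Duration: 40 * time.Minute}
	return &clusterv1.MachineHealthCheck{
		ObjectMeta: metav1.ObjectMeta{Name: clusterName + "-baremetal"},
		Spec: clusterv1.MachineHealthCheckSpec{
			ClusterName: clusterName,
			Selector: metav1.LabelSelector{
				MatchLabels: map[string]string{"cluster.x-k8s.io/cluster-name": clusterName},
			},
			// Give bare metal instances more time to boot and join the cluster.
			NodeStartupTimeout: &startupTimeout,
			UnhealthyConditions: []clusterv1.UnhealthyCondition{
				{Type: corev1.NodeReady, Status: corev1.ConditionFalse, Timeout: metav1.Duration{Duration: 8 * time.Minute}},
			},
		},
	}
}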
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled
2024-09-11: Committed to DevPreview, referencing downstream images using Helm charts from github.com/openshift/cluster-api-operator. This will coincide with the MCE release, but is not included in the MCE 2.7.0 bundle. Bundle inclusion will be MCE 2.8.0.
2024-09-10: Decide on versioning strategy with OCP Install CAPI and HyperShift
2024-08-12: Having a discussion in #forum-capi-ocp on delivery mechanism
2024-08-08: Community meeting discussion on delivery of ROSA-HCP & Sylva Cluster-api-provider-metal3
2024-08-22: F2F meetings, inception of this EPIC
Include minimum CAPI components in ACM for the supported ROSA-HCP.
MCE 2.7.0 enables ROSA-HCP provisioning support, along with OCP starting to use CAPI
I deploy MCE, and should be able to deploy a ROSA-HCP cluster, with correct credentials and information about the cluster.
ROSA-HCP Cluster API for AWS support in MCE 2.7.0
Portal Doc template that you can access from [The Playbook](
and ensure doc acceptance criteria is met.
The downstream capi-operator has a helm chart defined at [1].
We need to:
[1] https://github.com/openshift/cluster-api-operator/blob/main/index.yaml
Portal Doc template that you can access from [The Playbook](
and ensure doc acceptance criteria is met.
We need to create and publish the index.yaml for the ocp cluster-api-operator helm chart here https://github.com/openshift/cluster-api-operator/tree/main/openshift
the 4.17 release is published here https://github.com/stolostron/stolostron/releases/download/2.12/cluster-api-operator-4.17.tgz
We need to know which of the possible error codes reported in the imageregistry_storage_errors_total metric indicate abnormal operations, so that we can create alerts for the relevant metrics.
Current error codes are:
errCodeUnsupportedMethod = "UNSUPPORTED_METHOD"
errCodePathNotFound      = "PATH_NOT_FOUND"
errCodeInvalidPath       = "INVALID_PATH"
errCodeInvalidOffset     = "INVALID_OFFSET"
errCodeReadOnlyFS        = "READ_ONLY_FILESYSTEM"
errCodeFileTooLarge      = "FILE_TOO_LARGE"
errCodeDeviceOutOfSpace  = "DEVICE_OUT_OF_SPACE"
errCodeUnknown           = "UNKNOWN"
Source: openshift/image-registry/pkg/dockerregistry/server/metrics/errorcodes.go
Acceptance Criteria
ACCEPTANCE CRITERIA
ACCEPTANCE CRITERIA
After the https://issues.redhat.com/browse/MGMT-17867 fix, the multipath device includes the wwn hint. However, the path devices include this hint as well.
The current bmh_agent_controller code may choose any of the devices with the wwn hint as the root device hint.
The code has to be fixed so that, in case of multiple devices with the wwn hint, the multipath device is preferred.
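A minimal Go sketch of the intended selection logic is below; the disk structure and field names are hypothetical stand-ins for the actual bmh_agent_controller types.

package example

import "strings"

// disk is a simplified stand-in for the inventory disk structure used by the
// bmh_agent_controller; field names are illustrative.
type disk struct {
	Name      string // e.g. /dev/sda, /dev/dm-0
	WWN       string
	DriveType string // e.g. "HDD", "SSD", "Multipath"
}

// chooseRootDevice returns a disk matching the WWN hint, preferring a
// multipath device when several devices carry the same WWN.
func chooseRootDevice(disks []disk, wwnHint string) *disk {
	var fallback *disk
	for i := range disks {
		d := &disks[i]
		if !strings.EqualFold(d.WWN, wwnHint) {
			continue
		}
		if d.DriveType == "Multipath" {
			return d // prefer the multipath device over its path devices
		}
		if fallback == nil {
			fallback = d
		}
	}
	return fallback
}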
The external platform was created to allow cloud providers to supply their own integration components (cloud controller manager, etc.) without prior integration into openshift release artifacts. We need to support this new platform in assisted-installer in order to provide a user friendly way to enable such clusters, and to enable new-to-openshift cloud providers to quickly establish an installation process that is robust and will guide them toward success.
This epic is a follow-up of MGMT-15654 where the external platform API was implemented in Assisted-Installer.
Please describe what conditions must be met in order to mark this feature as "done".
If the answer is "yes", please make sure to check the corresponding option.
Manage the effort for adding jobs for release-ocm-2.12 on assisted installer
https://docs.google.com/document/d/1WXRr_-HZkVrwbXBFo4gGhHUDhSO4-VgOPHKod1RMKng
The repositories for which we handle the cut-off are currently:
Merge order:
Note - a CI tool for CRUD operations on job configurations is in progress. We should try to use it for the next FF.
Update the BUNDLE_CHANNELS in the Makefile in assisted-service and run bundle generation.
OCP 4.17 GA should be in 1.10.2024
LSO has not been published to the 4.18 redhat-operators catalog, so it cannot be installed on OpenShift 4.18. Until this is resolved, we explicitly install the 4.17 catalog as redhat-operators-v4-17 and then subscribe to the LSO version from the 4.17 rather than the 4.18 catalog.
Convert the Cluster Configuration single-page form into a multi-step wizard. The goal is to avoid overwhelming the user with all of the information on a single page and to provide guidance through the configuration process.
Wireframes:
Phase1:
https://marvelapp.com/prototype/fjj6g57/screen/76442394
Future:
https://marvelapp.com/prototype/78g662d/screen/71444815
https://marvelapp.com/prototype/7ce7ib3/screen/73190117
Phase 1 wireframes: https://marvelapp.com/prototype/fjj6g57/screen/76442399
This requires UX investigation to handle the case when the base DNS is not set yet and the clusters list has several clusters with the same name.
The API for it is https://github.com/openshift/assisted-service/blob/2bbbcb60eea4ea5a782bde995bdec3dd7dfc1f62/swagger.yaml#L5636
Other assets
https://github.com/openshift/installer/blob/master/docs/user/customization.md
Example
Adding day-1 kernel arguments
Marvel
Description of the problem:
V2CreateClusterManifest should block empty manifests
How reproducible:
100%
Steps to reproduce:
1. POST V2CreateClusterManifest manifest with empty content
Actual results:
Succeeds. Then silently breaks bootkube much later.
Expected results:
API call should fail immediately
CMO creates a default Alertmanager configuration on cluster bootstrap. The configuration should have the following snippet when a cluster proxy is configured:
global:
  http_config:
    proxy_from_environment: true
1. Proposed title of this feature request
Prometheus generating disk activity every two hours causing storage backend issues.
2. What is the nature and description of the request?
We're seeing Prometheus doing some type of disk activity every two hours on the hour on all of our clusters. We'd like to change that default setting so that all clusters aren't hitting our storage at the same time. Need help in finding where to make that config change. I see a knowledgebase article which says this is by design, but we'd like to stagger these if possible. [1][2]
3. Why does the customer need this? (List the business requirements here)
It appears to be impacting their storage clusters. They use Netapp Trident NFS as their PVC backing which serves multiple clusters and the Prometheus-k8s pods use Netapp Trident NFS PVCs for their data. It appears that this 2 hour interval job occurs at the exact time in every cluster and their hope is stagger this in each cluster such as:
Those two hours for every cluster are midnight, 2:00AM, 4:00AM, etc... The question I've had is, can we change it so one cluster does midnight, 2:00AM, 4:00AM, etc... and another cluster does 12:15AM, 2:15AM, 4:15AM, etc... so they both aren't writing to storage at the same time? It's still a 2 hr default.
4. List any affected packages or components.
openshift-monitoring
[1] https://access.redhat.com/solutions/6960833
[2] https://prometheus.io/docs/prometheus/latest/storage/
Upstream issue: https://github.com/prometheus/prometheus/issues/8278
change proposal accepted at Prometheus dev summit: https://docs.google.com/document/d/11LC3wJcVk00l8w5P3oLQ-m3Y37iom6INAMEu2ZAGIIE/edit#heading=h.4t8053ed1gi
In terms of risks:
Proposed title of this feature request
Ability to modify UWM Prometheus scrape interval
What is the nature and description of the request?
Customer would like to be able to modify the scrape interval in Prometheus user workload monitoring
Why does the customer need this? (List the business requirements)
Control metric frequency and thus remote write frequency for application monitoring metrics.
List any affected packages or components.
This needs to be done for both Prometheus and Thanos ruler.
The change only affects the UWM Prometheus.
Proposed title of this feature request
Collect accelerator metrics in OCP
What is the nature and description of the request?
With the rise of OpenShift AI, there's a need to collect metrics about accelerator cards (including but not limited to GPUs). It should require little to no configuration from customers, and we recommend deploying a custom text collector with node_exporter.
Why does the customer need this? (List the business requirements)
Display inventory data about accelerators in the OCP admin console (like we do for CPU, memory, ... in the Overview page).
Better understanding of which accelerators are used (Telemetry requirement).
List any affected packages or components.
node_exporter
CMO
Epic Goal
CPMSO support for Power VS was added via PR https://github.com/openshift/installer/pull/7226 and was left in an inactive state by default until testing was complete.
Now we have done enough testing and have the confidence to make it active.
Update the relevant packages in go.mod file of machine-api-provider-powervs repository.
Remove the dependency on blumix-go.
In the cloud-provider-powervs repository we can override the default IAM endpoint by setting iamEndpointOverride: https://github.com/Karthik-K-N/cloud-provider-powervs/blob/7237bad1549aa4f74e5fa1f3d26592605a3f4ca9/ibm/ibm.go#L109.
Make the necessary changes to support this, as sketched below.
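A minimal sketch of the intended endpoint selection is below; the function shape is an assumption, and the default endpoint shown is the public IBM Cloud IAM endpoint.

package example

// defaultIAMEndpoint is the public IBM Cloud IAM endpoint used when no
// override is configured.
const defaultIAMEndpoint = "https://iam.cloud.ibm.com"

// iamEndpoint returns iamEndpointOverride from the cloud provider config when
// set, otherwise the default public IAM endpoint.
func iamEndpoint(iamEndpointOverride string) string {
	if iamEndpointOverride != "" {
		return iamEndpointOverride
	}
	return defaultIAMEndpoint
}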
Epic Goal*
OCP storage components (operators + CSI drivers) should not use environment variables for cloud credentials. This is discouraged by the OCP hardening guide and reported by the compliance operator. Our customers have noticed it: https://issues.redhat.com/browse/OCPBUGS-7270
Why is this important? (mandatory)
We should honor our own recommendations.
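As a rough illustration of the direction, the sketch below prefers a credential mounted as a file (e.g. from a Secret volume) over an environment variable; the file path and variable names are placeholders, not those of any particular driver.

package example

import (
	"fmt"
	"os"
	"strings"
)

// loadCredential prefers a credential mounted as a file (for example from a
// Secret volume) over an environment variable. The goal is to eventually
// remove the environment variable fallback entirely.
func loadCredential(filePath, envVar string) (string, error) {
	if data, err := os.ReadFile(filePath); err == nil {
		return strings.TrimSpace(string(data)), nil
	}
	if v := os.Getenv(envVar); v != "" {
		return v, nil // legacy fallback, discouraged by the hardening guide
	}
	return "", fmt.Errorf("no credential found in %s or $%s", filePath, envVar)
}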
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
none
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Description of problem:
[AWS EBS CSI Driver] cannot provision EBS volumes successfully on CCO manual-mode private clusters
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-20-191204
How reproducible:
Always
Steps to Reproduce:
1. Install a private cluster with manual mode -> https://docs.openshift.com/container-platform/4.16/authentication/managing_cloud_provider_credentials/cco-short-term-creds.html#cco-short-term-creds-format-aws_cco-short-term-creds
2. Create one PVC and a pod that consumes the PVC.
Actual results:
In step 2 the pod,pvc stuck at Pending $ oc logs aws-ebs-csi-driver-controller-75cb7dd489-vvb5j -c csi-provisioner|grep new-pvc I0723 15:25:49.072662 1 controller.go:1366] provision "openshift-cluster-csi-drivers/new-pvc" class "gp3-csi": started I0723 15:25:49.073701 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-cluster-csi-drivers", Name:"new-pvc", UID:"f4f9bbaf-4149-44be-8716-8b7b973e16b8", APIVersion:"v1", ResourceVersion:"185085", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "openshift-cluster-csi-drivers/new-pvc" I0723 15:25:49.656889 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-cluster-csi-drivers", Name:"new-pvc", UID:"f4f9bbaf-4149-44be-8716-8b7b973e16b8", APIVersion:"v1", ResourceVersion:"185085", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "gp3-csi": rpc error: code = Internal desc = Could not create volume "pvc-f4f9bbaf-4149-44be-8716-8b7b973e16b8": could not create volume in EC2: NoCredentialProviders: no valid providers in chain I0723 15:25:50.657418 1 controller.go:1366] provision "openshift-cluster-csi-drivers/new-pvc" class "gp3-csi": started I0723 15:25:50.658112 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-cluster-csi-drivers", Name:"new-pvc", UID:"f4f9bbaf-4149-44be-8716-8b7b973e16b8", APIVersion:"v1", ResourceVersion:"185085", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "openshift-cluster-csi-drivers/new-pvc" I0723 15:25:51.182476 1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-cluster-csi-drivers", Name:"new-pvc", UID:"f4f9bbaf-4149-44be-8716-8b7b973e16b8", APIVersion:"v1", ResourceVersion:"185085", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "gp3-csi": rpc error: code = Internal desc = Could not create volume "pvc-f4f9bbaf-4149-44be-8716-8b7b973e16b8": could not create volume in EC2: NoCredentialProviders: no valid providers in chain
Expected results:
In step 2 the PV should become Bound (volume provisioning succeeds) and the pod should be Running.
Additional info:
Epic Goal*
Our AWS EBS CSI driver operator is missing some nice-to-have functionality. This Epic is meant to track it, so we can finish it in an upcoming OCP release.
Why is this important? (mandatory)
In general, AWS EBS CSI driver controller should be a good citizen in HyperShift's hosted control plane. It should scale appropriately, report metrics and not use kubeadmin privileges in the guest cluster.
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
None
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Our operators use the Unstructured client to read HostedControlPlane. HyperShift has published API types that don't require many dependencies, and we could import their types.go; a minimal sketch of the typed read follows.
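A minimal sketch of what the typed read could look like, assuming the published API package exposes the usual scheme registration helper (AddToScheme); error handling is abbreviated.

package example

import (
	"context"

	hyperv1 "github.com/openshift/hypershift/api/hypershift/v1beta1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

// getHostedControlPlane reads a HostedControlPlane with a typed client
// instead of the Unstructured client used today.
func getHostedControlPlane(ctx context.Context, namespace, name string) (*hyperv1.HostedControlPlane, error) {
	scheme := runtime.NewScheme()
	if err := hyperv1.AddToScheme(scheme); err != nil {
		return nil, err
	}
	cfg, err := config.GetConfig()
	if err != nil {
		return nil, err
	}
	c, err := client.New(cfg, client.Options{Scheme: scheme})
	if err != nil {
		return nil, err
	}
	hcp := &hyperv1.HostedControlPlane{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, hcp); err != nil {
		return nil, err
	}
	return hcp, nil
}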
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).
We get too many false positive bugs like https://issues.redhat.com/browse/OCPBUGS-25333 from SAST scans, especially from the vendor directory. Add a .snyk file like https://github.com/openshift/oc/blob/master/.snyk to each repo to ignore them.
Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.
This includes (but is not limited to):
Operators:
EOL, do not upgrade:
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).
Update OCP release number in OLM metadata manifests of:
OLM metadata of the operators are typically in /config/manifest directory of each operator. Example of such a bump: https://github.com/openshift/aws-efs-csi-driver-operator/pull/56
We should do it early in the release, so QE can identify new operator builds easily and they are not mixed with the old release.
Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.
This includes (but is not limited to):
Operators:
(please cross-check with *-operator + vsphere-problem-detector in our tracking sheet)
EOL, do not upgrade:
The following operators were migrated to csi-operator, do not update these obsolete repos:
tools/library-bump.py and tools/bump-all may be useful. For 4.16, this was enough:
mkdir 4.16-bump
cd 4.16-bump
../library-bump.py --debug --web <file with repo list> STOR-1574 --run "$PWD/../bump-all github.com/google/cel-go@v0.17.7" --commit-message "Bump all deps for 4.16"
4.17 perhaps needs an older prometheus:
../library-bump.py --debug --web <file with repo list> STOR-XXX --run "$PWD/../bump-all github.com/google/cel-go@v0.17.8 github.com/prometheus/common@v0.44.0 github.com/prometheus/client_golang@v1.16.0 github.com/prometheus/client_model@v0.4.0 github.com/prometheus/procfs@v0.10.1" --commit-message "Bump all deps for 4.17"
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled
ABI uses the assisted installer + kube-api (or any other client that communicates with the service); all the building blocks related to day-2 installation exist in those components.
The assisted installer can create an installed cluster and use it to perform day-2 operations.
A doc that explains how it's done with kube-api
Parameters that are required from the user:
Actions required from the user
To keep a similar flow between day 1 and day 2, I suggest running the service on each node that the user is trying to add; it will create the cluster definition and start the installation, and after the first reboot it will pull the ignition from the day-1 cluster.
Add the ability for the node-joiner tool to create a config image (analogous to the one generated by openshift-install agent create config-image) with the configuration necessary to import a cluster and add a day-2 node, but no OS.
The config image is small enough that we could probably create it unconditionally and leave it up to the client to decide which one to download.
Deploy Hypershift Operator component to the MCs in the MSFT INT environment.
Acceptance criteria
We generate the HyperShift operator install manifests by running `hypershift install render`, catching STDOUT and storing the output. If non-critical errors occur during the generation step, the generated manifests are no longer processable.
Example: proxy autodiscovery for external-dns fails if no kubeconfig is given. This does not fail the generation task, but results in error messages intertwined with the rest of the generated manifests, making them unprocessable.
We will add a new config parameter to `hypershift install render` to render the manifests to a file instead of STDOUT; a rough sketch is below.
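The following Go sketch illustrates the intended behavior only; the parameter is hypothetical and the final flag name is still to be decided.

package example

import (
	"fmt"
	"io"
	"os"
)

// writeManifests renders the manifests to the given output file when one is
// provided, so that log and error messages on STDOUT cannot corrupt them;
// otherwise it keeps the current behavior of writing to STDOUT.
func writeManifests(renderedManifests []byte, outputFile string) error {
	var out io.Writer = os.Stdout
	if outputFile != "" {
		f, err := os.Create(outputFile)
		if err != nil {
			return fmt.Errorf("failed to create %s: %w", outputFile, err)
		}
		defer f.Close()
		out = f
	}
	_, err := out.Write(renderedManifests)
	return err
}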
Description
This epic covers the changes needed to the ARO RP for
ACCEPTANCE CRITERIA:
What is "done", and how do we measure it? You might need to duplicate this a few times.
NON GOALS:
Only fill this out for Product Management / customer-driven work. Otherwise, delete it.
BREADCRUMBS:
Where can SREs look for additional information? Mark with "N/A" if these items do not exist yet so Functional Teams know they need to create them.
NOTES:
Need to determine if (in 4.14 azure workload identity functionality) we need to create secrets/secret manifests for each operator manually as part of the ARO cluster install, or if we can leverage credentialsrequests to do this automatically somehow. How will necessary secrets be created?
DESCRIPTION:
ACCEPTANCE CRITERIA:
NON GOALS:
BREADCRUMBS:
Where can SREs look for additional information? Mark with "N/A" if these items do not exist yet so Functional Teams know they need to create them.
What
Merge upstream kube-rbac-proxy v0.18.1 into downstream.
Why
We need to update the deps to get rid of CVE issues.
Clean up GCPLabelsTags feature gate created for OCPSTRAT-768 feature. Feature was made available as TechPreview in 4.14 and GA in 4.17.
GCPLabelsTags feature gate validation checks should be removed in installer, operator and API.
The FeatureGate check added in the installer for userLabels and userTags should be removed, and the reference made in the install-config GCP schema should be removed.
Acceptance Criteria
The GCPLabelsTags feature gate check added in the machine-api-provider-gcp operator for userLabels and userTags should be removed.
The featureGate added in openshift/api should also be removed.
Acceptance Criteria
To ensure the NUMA Resources Operator can be deployed, managed, and utilized effectively within HyperShift hosted OpenShift clusters.
The NUMA resources operator enables NUMA-aware pod scheduling. As HyperShift gains popularity as a cost-effective and portable OpenShift form factor, it becomes important to ensure that the NUMA Resources Operator, like other operators, is fully functional in this environment. This will enable users to leverage NUMA aware pod scheduling, which is important for low-latency and high performance workloads like telco environments.
Deploying the NUMA Resources Operator on a HyperShift hosted OpenShift cluster.
Ensure the operands run correctly on a HyperShift hosted OpenShift cluster.
Pass the e2e test suite on Hypershift hosted OpenShift cluster
NROP needs access to the KubeletConfig so it can pass the TopologyManager policy to RTE.
This story implements: https://github.com/openshift/enhancements/blob/master/enhancements/hypershift/topology-aware-scheduling/topology-aware-scheduling.md
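A minimal Go sketch of extracting the TopologyManager policy from a rendered kubelet configuration is below; how NROP obtains the KubeletConfig on HyperShift (e.g. from a ConfigMap in the hosted control plane namespace) is an assumption left out of scope for this sketch.

package example

import (
	kubeletconfigv1beta1 "k8s.io/kubelet/config/v1beta1"
	"sigs.k8s.io/yaml"
)

// topologyManagerPolicy extracts the TopologyManager policy from a rendered
// kubelet configuration so it can be passed to RTE.
func topologyManagerPolicy(kubeletConfYAML []byte) (string, error) {
	kc := &kubeletconfigv1beta1.KubeletConfiguration{}
	if err := yaml.Unmarshal(kubeletConfYAML, kc); err != nil {
		return "", err
	}
	if kc.TopologyManagerPolicy == "" {
		return "none", nil // kubelet default when the field is unset
	}
	return kc.TopologyManagerPolicy, nil
}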
Previously, when integrating 1_performance_suite, we faced an issue with the test case "Number of CPU requests as multiple of SMT count allowed when HT enabled" failing because the test pod failed to be admitted (the API server couldn't find a node with the worker-cnf label).
We started to investigate this, as we couldn't find how and by whom the worker-cnf label was being added to the pod spec. Since we couldn't figure that out, the workaround we introduced was to reapply the worker-cnf label to the worker nodes after each tuning update.
Another thing we were curious about is why the node lost its labels after the performance profile application. We believe this relates to the nodepool rollingUpdate policy (upgradeType: Replace), which replaces the nodes when the tuning configuration changes.
This issue will track the following items:
1. An answer for how the worker-cnf label was added to the testpod.
2. Check with hypershift folks if we can change the nodepool rollingUpdate policy to Inplace, for our CI tests, and discuss the benefits/drawbacks.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
We should make the upstream test suite run with the single host-networked FRR daemonset, as we do with MetalLB.
This can boil down to running those tests as part of the same frr-k8s lane where we test MetalLB.
Address miscellaneous technical debt items in order to maintain code quality, maintainability, and improved user experience.
Role | Contact |
---|---|
PM | Peter Lauterbach |
Documentation Owner | TBD |
Delivery Owner | (See assignee) |
Quality Engineer | (See QA contact) |
Who | What | Reference |
---|---|---|
DEV | Upstream roadmap issue | <link to GitHub Issue> |
DEV | Upstream code and tests merged | <link to meaningful PR or GitHub Issue> |
DEV | Upstream documentation merged | <link to meaningful PR or GitHub Issue> |
DEV | gap doc updated | <name sheet and cell> |
DEV | Upgrade consideration | <link to upgrade-related test or design doc> |
DEV | CEE/PX summary presentation | label epic with cee-training and add a <link to your support-facing preso> |
QE | Test plans in Polarion | <link or reference to Polarion> |
QE | Automated tests merged | <link or reference to automated tests> |
DOC | Downstream documentation merged | <link to meaningful PR> |
The PR https://github.com/openshift/origin/pull/25483 introduced a report which infers a storage driver's virtualization compatibility by post-processing the openshift-tests results. Unfortunately, this doesn't provide an accurate enough picture of CNV compatibility, and thus we now have and promote the kubevirt-storage-checks. To avoid sending mixed messages, revert this post-processor from openshift-tests.
This epic is to track any stories for hypershift kubevirt development that do not fit cleanly within a larger effort.
Here are some examples of tasks that this "catch all" epic can capture
Users need the ability to set labels on the HostedCluster in order to influence how MCE installs addons into that cluster.
In MCE, when a HostedCluster is created, MCE imports that cluster as a ManagedCluster. MCE has the ability to install addons into ManagedClusters by matching a managedCluster to an install strategy using label selectors. During the import process of importing a HostedCluster as a ManagedCluster, MCE now syncs the labels from the HostedCluster to the ManagedCluster.
This means by being able to set the labels on the HostedCluster, someone can now influence what addons are installed by MCE into that cluster.
Location:
PF component:
AC: Replace react-copy-to-clipboard with PatternFly ClipboardCopy component.
ContainerDropdown
frontend/packages/dev-console/src/components/health-checks/AddHealthChecks.tsx
frontend/public/components/environment.jsx
frontend/public/components/pod-logs.jsx
Move shared Type definitions for CreateSecret to "createsecret/type.ts" file
A.C.
- All CreateSecret components shared Type definitions are in "createsecret/type.ts" file
The SSHAuthSubform component needs to be refactored to address several tech debt issues:
* Rename to SSHAuthSecretForm
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.
As part of the spike to determine outdated plugins, the monaco-editor dev dependency is out of date and needs to be updated.
Acceptance criteria:
Need to follow the steps in https://webpack.js.org/migrate/5/#upgrade-webpack-4-and-its-pluginsloaders in order to migrate to Webpack v5.
Acceptance criteria:
As a developer, I want to take advantage of the `status` prop that was introduced in PatternFly 5.3.0, so that I can use it for stories such as ODC-7655, which need it for form validation
AC:
Content-Security-Policy (CSP) header provides a defense-in-depth measure in client-side security, as a second layer of protection against Cross-site Scripting (XSS) and clickjacking attacks.
It is not yet implemented in the OpenShift web console; however, there are some other related security headers present in the OpenShift console that cover some aspects of CSP functionality:
This story follows up on spike https://issues.redhat.com/browse/CONSOLE-4170
The aim of this story is to add initial CSP implementation for Console web application that will use Content-Security-Policy-Report-Only HTTP header to report on CSP violations.
CSP violations should be handled directly by Console code via custom SecurityPolicyViolationEvent handler, which logs the relevant CSP violation data to browser console.
AC:
CSP violations caused by dynamic plugins should trigger a warning within the cluster dashboard / dynamic plugin status.
AC:
We should add a custom ConsolePlugin details page that shows additional plugin information as well as controls (e.g. enable/disable plugin) for consistency with ConsolePlugin list page.
AC:
CONSOLE-4265 introduced an additional ConsolePlugin CRD field for CSP configuration, so plugins can provide their own list of allowed sources. The console-operator needs to vendor these changes and also provide a way to configure the default CSP directives.
AC:
When serving Console HTML index page, we generate the policy that includes allowed (trustworthy) sources.
It may be necessary for some dynamic plugins to add new sources in order to avoid CSP violations at Console runtime.
AC:
Console HTML index template contains an inline script tag used to set up SERVER_FLAGS and visual theme config.
This inline script tag triggers a CSP violation at Console runtime (see attachment for details).
The proper way to address this error is to allow this script tag - either by generating a SHA hash representing its contents or by generating a cryptographically secure random token (nonce) for the script; a sketch of the hash-based option follows.
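A minimal Go sketch of the hash-based option is below; it is not the Console's actual implementation, and the directive shown is only an example.

package example

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// scriptHashSource returns a CSP source expression ('sha256-...') for an
// inline script body, which can be added to the script-src directive so the
// inline SERVER_FLAGS script no longer triggers a violation.
func scriptHashSource(scriptBody string) string {
	sum := sha256.Sum256([]byte(scriptBody))
	return fmt.Sprintf("'sha256-%s'", base64.StdEncoding.EncodeToString(sum[:]))
}

// buildScriptSrc shows an example directive value, e.g.
// Content-Security-Policy-Report-Only: script-src 'self' 'sha256-...'
func buildScriptSrc(scriptBody string) string {
	return "script-src 'self' " + scriptHashSource(scriptBody)
}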
AC:
As part of the AI we would like to supply/generate a manifest file that will install:
Add to assisted installer an option to install MTV operator post cluster installation
[sig-network-edge] DNS should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
failed log
[sig-network-edge] DNS should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel] github.com/openshift/origin/test/extended/dns/dns.go:499 STEP: Creating a kubernetes client @ 08/12/24 15:55:02.255 STEP: Building a namespace api object, basename dns @ 08/12/24 15:55:02.257 STEP: Waiting for a default service account to be provisioned in namespace @ 08/12/24 15:55:02.517 STEP: Waiting for kube-root-ca.crt to be provisioned in namespace @ 08/12/24 15:55:02.581 STEP: Creating a kubernetes client @ 08/12/24 15:55:02.646 Aug 12 15:55:03.941: INFO: configPath is now "/tmp/configfile2098808007" Aug 12 15:55:03.941: INFO: The user is now "e2e-test-dns-dualstack-9bgpm-user" Aug 12 15:55:03.941: INFO: Creating project "e2e-test-dns-dualstack-9bgpm" Aug 12 15:55:04.299: INFO: Waiting on permissions in project "e2e-test-dns-dualstack-9bgpm" ... Aug 12 15:55:04.632: INFO: Waiting for ServiceAccount "default" to be provisioned... Aug 12 15:55:04.788: INFO: Waiting for ServiceAccount "deployer" to be provisioned... Aug 12 15:55:04.972: INFO: Waiting for ServiceAccount "builder" to be provisioned... Aug 12 15:55:05.132: INFO: Waiting for RoleBinding "system:image-pullers" to be provisioned... Aug 12 15:55:05.213: INFO: Waiting for RoleBinding "system:image-builders" to be provisioned... Aug 12 15:55:05.281: INFO: Waiting for RoleBinding "system:deployers" to be provisioned... Aug 12 15:55:05.641: INFO: Project "e2e-test-dns-dualstack-9bgpm" has been fully provisioned. STEP: creating a dual-stack service on a dual-stack cluster @ 08/12/24 15:55:05.775 STEP: Running these commands:for i in `seq 1 10`; do [ "$$(dig +short +notcp +noall +answer +search v4v6.e2e-dns-2700.svc A | sort | xargs echo)" = "172.31.255.230" ] && echo "test_endpoints@v4v6.e2e-dns-2700.svc"; [ "$$(dig +short +notcp +noall +answer +search v4v6.e2e-dns-2700.svc AAAA | sort | xargs echo)" = "fd02::7321" ] && echo "test_endpoints_v6@v4v6.e2e-dns-2700.svc"; [ "$$(dig +short +notcp +noall +answer +search ipv4.v4v6.e2e-dns-2700.svc A | sort | xargs echo)" = "3.3.3.3 4.4.4.4" ] && echo "test_endpoints@ipv4.v4v6.e2e-dns-2700.svc"; [ "$$(dig +short +notcp +noall +answer +search ipv6.v4v6.e2e-dns-2700.svc AAAA | sort | xargs echo)" = "2001:4860:4860::3333 2001:4860:4860::4444" ] && echo "test_endpoints_v6@ipv6.v4v6.e2e-dns-2700.svc";sleep 1; done @ 08/12/24 15:55:05.935 STEP: creating a pod to probe DNS @ 08/12/24 15:55:05.935 STEP: submitting the pod to kubernetes @ 08/12/24 15:55:05.935 STEP: deleting the pod @ 08/12/24 16:00:06.034 [FAILED] in [It] - github.com/openshift/origin/test/extended/dns/dns.go:251 @ 08/12/24 16:00:06.074 STEP: Collecting events from namespace "e2e-test-dns-dualstack-9bgpm". @ 08/12/24 16:00:06.074 STEP: Found 0 events. @ 08/12/24 16:00:06.207 Aug 12 16:00:06.239: INFO: POD NODE PHASE GRACE CONDITIONS Aug 12 16:00:06.239: INFO: Aug 12 16:00:06.334: INFO: skipping dumping cluster info - cluster too large Aug 12 16:00:06.469: INFO: Deleted {user.openshift.io/v1, Resource=users e2e-test-dns-dualstack-9bgpm-user}, err: <nil> Aug 12 16:00:06.506: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients e2e-client-e2e-test-dns-dualstack-9bgpm}, err: <nil> Aug 12 16:00:06.544: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens sha256~4QgFXAn8lyosshoHOjJeddr3MJbIL2DnCsoIvJVOGb4}, err: <nil> STEP: Destroying namespace "e2e-test-dns-dualstack-9bgpm" for this suite. 
@ 08/12/24 16:00:06.544 STEP: dump namespace information after failure @ 08/12/24 16:00:06.58 STEP: Collecting events from namespace "e2e-dns-2700". @ 08/12/24 16:00:06.58 STEP: Found 2 events. @ 08/12/24 16:00:06.615 Aug 12 16:00:06.615: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for dns-test-d93fff7e-90a3-408e-a197-fc4ff0738b30: { } FailedScheduling: 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling. Aug 12 16:00:06.615: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for dns-test-d93fff7e-90a3-408e-a197-fc4ff0738b30: { } FailedScheduling: skip schedule deleting pod: e2e-dns-2700/dns-test-d93fff7e-90a3-408e-a197-fc4ff0738b30 Aug 12 16:00:06.648: INFO: POD NODE PHASE GRACE CONDITIONS Aug 12 16:00:06.648: INFO: Aug 12 16:00:06.743: INFO: skipping dumping cluster info - cluster too large STEP: Destroying namespace "e2e-dns-2700" for this suite. @ 08/12/24 16:00:06.743 • [FAILED] [304.528 seconds] [sig-network-edge] DNS [It] should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel] github.com/openshift/origin/test/extended/dns/dns.go:499 [FAILED] Failed: timed out waiting for the condition In [It] at: github.com/openshift/origin/test/extended/dns/dns.go:251 @ 08/12/24 16:00:06.074 ------------------------------ Summarizing 1 Failure: [FAIL] [sig-network-edge] DNS [It] should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel] github.com/openshift/origin/test/extended/dns/dns.go:251 Ran 1 of 1 Specs in 304.528 seconds FAIL! -- 0 Passed | 1 Failed | 0 Pending | 0 Skipped fail [github.com/openshift/origin/test/extended/dns/dns.go:251]: Failed: timed out waiting for the condition Ginkgo exit error 1: exit with code 1
failure reason
TODO
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
As an ARO HCP user, I want to be able to:
so that I can remove
Description of criteria:
Detail about what is specifically not being delivered in the story
These are the CRs that need to be manually installed today:
oc apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
oc apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_prometheusrules.yaml
oc apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml
oc apply -f https://raw.githubusercontent.com/openshift/api/master/route/v1/zz_generated.crd-manifests/routes-Default.crd.yaml
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Description of problem:
$ hypershift --help
{"level":"error","ts":"2024-11-05T09:26:54Z","logger":"controller-runtime.client.config","msg":"unable to load in-cluster config","error":"unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined", ...
ERROR Failed to get client {"error": "unable to get kubernetes config: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable"}
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. 2. 3.
Actual results:
Can't run hypershift --help without a kubeconfig
Expected results:
Can run hypershift --help without a kubeconfig
Additional info:
Currently, if we don't specify the NSG ID or VNet ID, the CLI will create these for us in the managed RG. In prod ARO these will be in separate RGs, as they will be provided by the customer; we should reflect this in our env.
This will also make the AKS e2e simpler as the jobs won't have to create these resource groups for each cluster.
Steps to Reproduce:
1. Run any hypershift CLI command in an environment without a live cluster e.g. hypershift create cluster --help 2024-10-30T12:19:21+08:00 ERROR Failed to create default options {"error": "failed to retrieve feature-gate ConfigMap: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://ci-op-68zb-ci-op-68zbrc3h-2-53b8f5-qycsv9k7.hcp.northcentralus.azmk8s.io:443/api/v1\": dial tcp: lookup ci-op-68zb-ci-op-68zbrc3h-2-53b8f5-qycsv9k7.hcp.northcentralus.azmk8s.io: no such host"} github.com/openshift/hypershift/cmd/cluster/azure.NewCreateCommand /Users/fxie/Projects/hypershift/cmd/cluster/azure/create.go:480 github.com/openshift/hypershift/cmd/cluster.NewCreateCommands /Users/fxie/Projects/hypershift/cmd/cluster/cluster.go:36 github.com/openshift/hypershift/cmd/create.NewCommand /Users/fxie/Projects/hypershift/cmd/create/create.go:20 main.main /Users/fxie/Projects/hypershift/main.go:64 runtime.main /usr/local/go/src/runtime/proc.go:271 panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x10455ea8c] goroutine 1 [running]: github.com/spf13/cobra.(*Command).AddCommand(0x1400069db08, {0x14000d91a18, 0x1, 0x1}) /Users/fxie/Projects/hypershift/vendor/github.com/spf13/cobra/command.go:1311 +0xbc github.com/openshift/hypershift/cmd/cluster.NewCreateCommands() /Users/fxie/Projects/hypershift/cmd/cluster/cluster.go:36 +0x4c4 github.com/openshift/hypershift/cmd/create.NewCommand() /Users/fxie/Projects/hypershift/cmd/create/create.go:20 +0x11c main.main() /Users/fxie/Projects/hypershift/main.go:64 +0x368
Actual results:
panic: runtime error: invalid memory address or nil pointer dereference
This is intended to be a place to capture general "tech debt" items so they don't get lost. I very much doubt that this will ever get completed as a feature, but that's okay; the desire is more that stories get pulled out of here and put with feature work "opportunistically" when it makes sense.
If you find a "tech debt" item, and it doesn't have an obvious home with something else (e.g. with MCO-1 if it's metrics and alerting) then put it here, and we can start splitting these out/marrying them up with other epics when it makes sense.
This is a follow up story for: https://issues.redhat.com/browse/OCPBUGS-7836
The pivot command currently prints an error message and warns the user that it will be removed soon. We are planning to land this in 4.15.
This story will be complete when:
tracking here all the work that needs to be done to configure the ironic containers (ironic-image and ironic-agent-image) to be ready for OCP 4.19
this includes also CI configuration, tools and documentation updates
all the configuration bits need to happen at least one sprint BEFORE 4.19 branching (current target November 22 2024)
docs tasks can be completed after the configuration tasks
the CI tasks need to be completed RIGHT AFTER 4.19 branching happens
tag creation is now automated during OCP tags creation
builder creation has been automated
before moving forward with the 4.19 configuration, we need to be sure that the dependency versions in 4.18 are correctly aligned with the latest upper-constraints
The tools that we use to install Python libraries in containers move much faster than the corresponding packages built for the operating system.
In its latest version, sushy now uses pyproject.toml, specifying pbr and setuptools as "build requirements" and using pbr as the "build engine".
Because of this, due to PEP 517 and 518, pip will use an isolated environment to build the package, blocking the usage of system-installed packages as dependencies.
We need to either install pbr, setuptools and wheel from source, including them in the pip isolated build environment, or use pip's "--no-build-isolation" option to allow using system-installed build packages.
During 4.15, the OCP team is working on allowing booting from iscsi. Today that's disabled by the assisted installer. The goal is to enable that for ocp version >= 4.15.
iscsi boot is enabled for ocp version >= 4.15 both in the UI and the backend.
When booting from iSCSI, we need to make sure to add the `rd.iscsi.firmware=1` karg during install to enable iSCSI booting.
yes
Description of the problem:
Since machine networks are computed at installation time in the case of UMN (in the right way), the no-iscsi-nic-belongs-to-machine-cidr validation should be skipped in this case.
We should also skip this validation in the case of day-2 and imported clusters, because those clusters are not created with all the network information that makes this validation work.
How reproducible:
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
In order to successfully install OCP on an iSCSI boot volume, we need to make sure that the machine has 2 network interfaces:
This is required because on startup OVS/OVN will reconfigure the default interface (the network interface used for the default gateway). This behavior makes using the default interface impracticable for iSCSI traffic, because we lose the root volume and the node becomes unusable. See https://issues.redhat.com/browse/OCPBUGS-26071
In the scope of this issue we need to:
In case of CMN or SNO, the user should not be able to select the subnet used for the iSCSI traffic.
Historically, assisted-service has only allowed one mirror configuration that would be applied to all spoke clusters. This was done for assisted service to pull the images needed to install OCP on the spoke cluster. The mirror was then copied over to the spoke cluster.
Feature request: Allow each cluster to have its own mirror configuration
Use-case: This came out of the Sylva CAPI project, where they have a pull-through proxy that caches images from Docker. Each spoke cluster created might not have connectivity on the same network, so each will need a different mirror configuration.
The only way to do this right now is using an install config override for every cluster. https://github.com/openshift/assisted-service/blob/master/docs/user-guide/cloud-with-mirror.md
Add per-cluster support in AgentClusterInstall and update ImageDigestSource in the install config, the same as we are doing for the per-service mirror registry.
Nmstate is included in RHEL CoreOS 4.14+, providing `nmstate.service`, which applies the YAML files in the `/etc/nmstate/` folder. Currently, the assisted installer (and maybe other OCP install methods) uses `nmstatectl gc` to generate NetworkManager keyfiles.
Benefit of using `nmstate.service`:
1. No need to generate keyfiles anymore.
2. `nmstate.service` provides nmpolicy support; for example, the YAML below is an nmpolicy creating a bond whose ports are selected by their MAC addresses, without knowing the interface names.
capture:
  port1: interfaces.mac-address=="00:23:45:67:89:1B"
  port2: interfaces.mac-address=="00:23:45:67:89:1A"
desiredState:
  interfaces:
    - name: bond0
      type: bond
      state: up
      link-aggregation:
        mode: active-backup
        ports:
          - "{{ capture.port1.interfaces.0.name }}"
          - "{{ capture.port2.interfaces.0.name }}"
3. Follow-up day-1 and day-2 network configuration tools could look up `/etc/nmstate` to understand the network topology created on day 0.
4. Fallback support with verification. For example, we can have `00-fallback.yml` holding the fallback network setup and `01-install.yml` holding the user-defined network. Nmstate will apply them sequentially; if `01-install.yml` fails nmstate's verification check, nmstate will roll back to the `00-fallback.yml` state.
Please describe what this feature is going to do.
The installer uses nmstate.service without deploying NetworkManager keyfiles.
Please describe what conditions must be met in order to mark this feature as "done".
Document could mention:
If the answer is "yes", please make sure to check the corresponding option.
Not customer related
Not from architect
Gris Ge <fge@redhat.com>, maintainer of nmstate.
This is internal processing of network setup.
For ISOs that have the nmstate binary, use nmpolicy + nmstate.service instead of pre-generating the nmconnection files and the script.
This is a new task that must be included in all Tekton pipelines by November 1st.
We need to add this task to the following components:
Configure assisted-service to use the image that is built by Konflux.
The epic should contain tasks that ease the process of handling security issues.
Description of the problem:
Dependabot can't merge PRs as it doesn't tidy and vendor other modules.
for example - https://github.com/openshift/assisted-service/pull/6595
It seems like the reason is that dependabot only updates one module at a time: if a package is bumped in module A and module B requires module A, then dependabot should bump this package in module B as well, which is currently not happening.
We want to make sure dependabot is bumping all required versions across all branches/repositories.
How reproducible:
Almost every PR
Actual results:
Failing jobs on dependabot PRs
Expected results:
Dependabot bumping dependencies successfully
The Assisted Installer should support backup/restore and disaster recovery scenarios, either using OADP (OpenShift API for Data Protection) for ACM (Advanced Cluster Management), or, using ZTP (Zero Touch Provisioning) flows. I.e. the assisted-service should be resilient in such scenarios which, for this context and effort, means that restored/moved spoke clusters should keep the same state and behave the same on the new hub cluster.
Provide resiliency in the assisted-service for safe backup/restore flows, allowing spoke clusters to be used without any restriction after DR scenarios or moving between hubs.
TBD
Document outlining issues and potential solutions: https://docs.google.com/document/d/1g77MDYOsULHoTWtjjpr7P5_9L4Fsurn0ZUJHCjQXC1Q/edit?usp=sharing
Backup and restore managed (hosted) clusters installed with hosted control planes with the agent platform (assisted-service).
Yes
During the Govtech spike [1]: backup and restore of HCP clusters from one ACM hub to a new ACM hub, it was discovered that the data currently saved for the first iteration [2] of restoring a host isn't enough.
After restoring, the NodePool, Machine, and AgentMachine still showed they were unready and that they were unable to adopt the Nodes. The Agents were completely missing their statuses, which is likely to have caused this.
We'll need to uncover all the issues and all that needs to be saved in order for the restore to complete successfully.
—
[1] HOSTEDCP-2052 - Slack thread
[2] MGMT-18635
Agents need to have their inventory and state restored in the status in order for the NodePool to completely re-adopt the Nodes on restore.
Discovered in https://redhat-internal.slack.com/archives/C07S20C4SHX/p1729874447167699?thread_ts=1729272921.555049&cid=C07S20C4SHX
Currently, we have a few issues with our OCM authorization:
assisted service rhsso auth type is aligned with OCM
Currently, deployment of assisted-installer using authentication mode "rhsso" doesn't work properly; we need to fix this type of deployment so we can test it.
When an Assisted Service SaaS user performs the creation of a new OpenShift cluster, provide the option to enable the Migration Kit for Virtualization (MTV) operator.
Description of the problem:
When creating an SNO cluster,
the UI blocks the user from selecting the MTV operator.
How reproducible:
Steps to reproduce:
1. Create an SNO cluster on 4.17.
2. Go to the operators page.
3.
Actual results:
The MTV operator is disabled and cannot be selected.
Expected results:
It should be selectable.
Allow users to do a basic OpenShift AI installation with one click on the "operators" page of the cluster creation wizard, similar to how the ODF or MCE operators can be installed.
This feature will be done when users can click the "OpenShift AI" check box on the operators page of the cluster creation wizard and end up with an installation that can be used for basic tasks.
Yes.
Feature origin (who asked for this feature?)
In order to complete the setup of some operators it is necessary to do things that can't be done by creating a simple manifest. For example, in order to complete the setup of ODF so that it can be used by OpenShift AI it is necessary to configure the default storage class, and that can't be done with a simple manifest.
One possible way to overcome that limitation is to create a simple manifest that contains a job, so that the job will execute the required operation. In the example above the job will run something like this:
oc annotate storageclass ocs-storagecluster-ceph-rbd storageclass.kubernetes.io/is-default-class=true
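For illustration, a minimal Job manifest that wraps this command might look like the following sketch (the image, namespace, service account, and RBAC are assumptions, not part of the actual implementation):
oc create -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: set-default-storageclass
  namespace: openshift-storage
spec:
  template:
    spec:
      serviceAccountName: storageclass-annotator   # needs RBAC allowing it to patch StorageClasses
      restartPolicy: OnFailure
      containers:
      - name: annotate
        image: registry.redhat.io/openshift4/ose-cli:latest
        command:
        - /bin/sh
        - -c
        - oc annotate storageclass ocs-storagecluster-ceph-rbd storageclass.kubernetes.io/is-default-class=true --overwrite
EOF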
Doing that is already possible, but the problem is that the assisted installer will not wait for these jobs to complete before declaring that the cluster is ready. The intent of this ticket is to change the installer so that it will wait.
Add to assisted installer the infrastructure to install the OpenShift AI operator.
This has no link to a planning session, as this predates our Epic workflow definition.
Integrate CP test suite into Prow to display a 99% passing history for enabling this by default: https://github.com/openshift/api/pull/1815/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R9.
This epic is to track stories that are not completed in MON-3537
No need to fall back to prometheus-adapter.
Remove the MetricsServer feature gate, as the default has been switched to metrics-server itself and there is no longer an option to install the alternative prometheus-adapter.
The history of this epic starts with this PR which triggered a lengthy conversation around the workings of the image API with respect to importing imagestreams images as single vs manifestlisted. The imagestreams today by default have the `importMode` flag set to `Legacy` to avoid breaking behavior of existing clusters in the field. This makes sense for single arch clusters deployed with a single arch payload, but when users migrate to use the multi payload, more often than not, their intent is to add nodes of other architecture types. When this happens - it gives rise to problems when using imagestreams with the default behavior of importing a single manifest image. The oc commands do have a new flag to toggle the importMode, but this breaks functionality of existing users who just want to create an imagestream and use it with existing commands.
There was a discussion with David Eads and other staff engineers and it was decided that the approach to be taken is to default imagestreams' importMode to `preserveOriginal` if the cluster is installed with/ upgraded to a multi payload. So a few things need to happen to achieve this:
Some open questions:
This change enables setting the import mode through the image config API, which is then synced to the apiserver's observed config, enabling the apiserver to set the import mode based on this value. The import mode in the observed config is also populated by default based on the payload type.
poc: https://github.com/Prashanth684/api/commit/c660fba709b71a884d0fc96dd007581a25d2d17a
Following MULTIARCH-4556, the importMode needs to be synced from the image config. If not present, it should be inferred from the CVO, which should provide status on whether the payload is multi or single.
For the apiserver operator to figure out the payload type and set the import mode defaults, the CVO needs to expose that value through the status field. This information is available today in the conditions list, but it's not pretty to extract it and infer the payload type as it is contained in the message string. The way to do it today is shown here. It would be better for CVO to expose it as a separate field which can be easily consumed by any controller and also be used for telemetry in the future.
Track improvements to IPI on Power VS made in the 4.18 release cycle.
Compare the list of Power VS zones that have PER enabled [0] with the list of zones we offer in the installer [1].
Add any regions that have not been added and check that any hardware type offered in that region is added to the list as well. For example, dal10 has s1022 but we do not expose that in the installer.
[0] https://cloud.ibm.com/docs/power-iaas?topic=power-iaas-per#dcs-per
When SNAT is disabled, the only way to reach the needed private endpoints in Power VS is through a Virtual Private Endpoint. Once these are created in the VPC you are connected to, you'll be able to reach the endpoints.
If we are provisioning a disconnected cluster, ensure the COS, DNS, and IAM VPEs are created.
Bump vendored Kubernetes packages (k8s.io/api, k8s.io/apimachinery, k8s.io/client-go, etc.) to v0.31.0 or newer version.
Keep vendored packages up to date.
Additional information on each of the above items can be found here: Networking Definition of Planned
1. Other vendored dependencies (such as openshift/api and controller-runtime) may also need to be updated to Kubernetes 1.31.
1. We tracked these bumps as bugs in the past. For example, for OpenShift 4.17 and Kubernetes 1.30: OCPBUGS-38079, OCPBUGS-38101, and OCPBUGS-38102.
None.
The openshift/cluster-ingress-operator repository vendors k8s.io/* v0.30.2. OpenShift 4.18 is based on Kubernetes 1.31.
4.18.
Always.
Check https://github.com/openshift/cluster-ingress-operator/blob/release-4.18/go.mod.
The k8s.io/* packages are at v0.30.2.
The k8s.io/* packages are at v0.31.0 or newer.
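A minimal sketch of the bump, assuming the usual Go module workflow in the cluster-ingress-operator repository (matching bumps of other k8s.io/* and openshift/* dependencies may be needed, as noted above):
go get k8s.io/api@v0.31.0 k8s.io/apimachinery@v0.31.0 k8s.io/client-go@v0.31.0
go mod tidy
go mod vendor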
OCPCLOUD-2514 prevented feature gates from being used with the CCMs.
We have been asked not to remove the feature gates themselves until 4.18.
PR to track: https://github.com/openshift/api/pull/1780
We should remove the reliance on the feature gate from this part of the code and clean up references to feature gate access at the call sites.
None
ClusterTask has been deprecated and will be removed in Pipelines Operator 1.17. In the console UI, we have a ClusterTask list page, and ClusterTasks are also listed in the Tasks quick search in the Pipeline builder form.
Remove ClusterTask and references from the console UI and use Tasks from `openshift-pipelines` namespace.
Resolver in Tekton https://tekton.dev/docs/pipelines/resolution-getting-started/
Task resolution: https://tekton.dev/docs/pipelines/cluster-resolver/#task-resolution
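As a sketch of the target pattern (the task name below is illustrative), a Pipeline or TaskRun would reference a Task from the `openshift-pipelines` namespace through the cluster resolver instead of a ClusterTask:
taskRef:
  resolver: cluster
  params:
  - name: kind
    value: task
  - name: name
    value: buildah
  - name: namespace
    value: openshift-pipelines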
Description of problem:
Locally, after setting the flags, we can see the Community tasks. After the change in the PR, ClusterTasks are removed and Community tasks can't be seen even after setting the flag.
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Set window.SERVER_FLAGS.GOARCH and window.SERVER_FLAGS.GOOS.
2. Go to the pipeline builder page.
3.
Actual results:
You can't see any tasks
Expected results:
Community tasks should appear after setting the flag
Additional info:
ClusterTask has been deprecated and will be removed in Pipelines Operator 1.17
We have to use Tasks from the `openshift-pipelines` namespace. This change will happen in the console-plugin repo (dynamic plugin). So in the console repository we have to remove all the ClusterTask dependencies if the Pipelines Operator is 1.17 or above.
Description of problem:
Add a flag to disallow the Pipeline edit URL in the console pipelines-plugin so that it does not conflict between the console and the Pipelines console-plugin.
Description of problem:
Add a disallowed flag to hide the pipelines-plugin pipeline builder route, the add action, and the catalog provider extension, as they are migrated to the Pipelines console-plugin, so that there is no duplicate action in the console.
Networking Definition of Planned
Epic Template descriptions and documentation
Simplify the resolv-prepender process to eliminate consistently problematic aspects. Most notably this will include reducing our reliance on the dispatcher script for proper configuration of /etc/resolv.conf by replacing it with a systemd watch on the /var/run/NetworkManager/resolv.conf file.
Over the past five years or so of on-prem networking, the resolv-prepender script has consistently been a problem. Most of these problems relate to the fact that it is triggered as a NetworkManager dispatcher script, which has proven unreliable, despite years of playing whack-a-mole with various bugs and misbehaviors. We believe there is a simpler, less bug-prone way to do this that will improve both the user experience and reduce the bug load from this particular area of on-prem networking.
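A minimal sketch of what such a watch could look like, assuming a systemd path unit (the unit names are hypothetical and this is not the final implementation):
cat > /etc/systemd/system/on-prem-resolv-prepender.path <<'EOF'
[Unit]
Description=Watch the NetworkManager-generated resolv.conf

[Path]
PathChanged=/var/run/NetworkManager/resolv.conf
Unit=on-prem-resolv-prepender.service

[Install]
WantedBy=multi-user.target
EOF
systemctl enable --now on-prem-resolv-prepender.path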
Additional information on each of the above items can be found here: Networking Definition of Planned
As a followup to https://issues.redhat.com/browse/OPNET-568 we should remove the triggering of resolv-prepender from the dispatcher script. There is still some other functionality in the dispatcher script that we need to keep, but once we have the systemd watch it won't be necessary to trigger the service from the script at all.
This is being tracked separately because it is a more invasive change than just adding the watch, so we probably don't want to backport it.
Networking Definition of Planned
Epic Template descriptions and documentation
The goal is to create a set of E2E tests (in the o/origin repository) testing keepalived and haproxy.
Based on past experience, implementing this will be extremely helpful to various teams when debugging networking (and not only networking) issues.
The network stack is complex, and currently debugging keepalived relies mostly on parsing log lines.
Additional information on each of the above items can be found here: Networking Definition of Planned
We had an incident (outage) in the past where OSUS impacted other applications running in that multi-tenant environment along with itself. Refer to [1][2] for more details.
We initially created all Jira cards as part of OTA-552, but the epic grew very large, so some cards are moving to this epic. The associated Jira cards are created to improve the ability of OSUS to handle more requests without causing issues with other applications in a multi-tenant environment.
Update advice is append-only, with 4.y.z releases being added to channels regularly, and new update risks being declared occasionally. This makes caching a very safe behavior, and client-side caching in the CVO would reduce the disruption caused by OpenShift Update Service (OSUS) outages like OTA-1376.
A single failed update-service retrieval currently clears the cache in 4.18. The code is pretty old, so I expect this behavior goes back through 4.12, our oldest release that's not yet end-of-life.
Every time.
1. Run a happy cluster with update advice.
2. Break the update service, e.g. by using OTA-520 for a mock update service.
3. Wait a few minutes for the cluster to notice the breakage.
4. Check its update recommendations, with oc adm upgrade or the new-in-4.18 oc adm upgrade recommend.
No recommendations while the cluster is RetrievedUpdates=False.
Preserving the cached recommendations while the cluster is RetrievedUpdates=False, at least for 24 hours. I'm not committed to a particular time, but 24h is much longer than any OSUS outage we've ever had, and still not so long that we'd expect much in the way of recommendation changes if the service had remained healthy.
In order to provide customers the option to process alert data externally, we need to provide a way for the data to be downloaded from the OpenShift console. The monitoring plugin uses a Virtualized table from the dynamic plugin SDK. We should include the change in this table so it is available for others.
---
NOTE:
There is a duplicate issue in the OpenShift console board: https://issues.redhat.com//browse/CONSOLE-4185
This is because the console > CI/CD > prow configurations require that any PR in the openshift/console repo needs to have an associated Jira issue in the openshift console Jira board.
Given hostedcluster:hypershift_cluster_vcpus:max now exists, we need to use it to derive a vCPU-hours metric.
Related slack thread: https://redhat-internal.slack.com/archives/C0493H149DK/p1719329224733099?thread_ts=1719252265.181669&cid=C0493H149DK
Draft recording rule:
record: hostedcluster:hypershift_cluster_vcpus:vcpu_hours
expr: max by(_id)(count_over_time(hostedcluster:hypershift_cluster_vcpus:max[1h:5m])) / scalar(count_over_time(vector(1)[1h:5m]))
In order to simplify querying a rosa cluster's effective CPU hours, create a consolidated metric for rosa vcpu-hours
Related slack thread: https://redhat-internal.slack.com/archives/C0493H149DK/p1719329224733099?thread_ts=1719252265.181669&cid=C0493H149DK
Draft recording rule:
record: rosa:cluster:vcpu_hours
expr: (hostedcluster:hypershift_cluster_vcpus:vcpu_hours or on (_id) cluster:usage:workload:capacity_virtual_cpu_hours)
Covers all other tech debt stories targeted for 4.16
OKD updates the samples more or less independently from OCP. It would be good to add support for this in library-sync.sh so that OCP and OKD don't "step on each other's toes" when doing the updates.
library-sync.sh should accept a parameter, say --okd, that when set will update only the OKD samples (all of them, because we don't have unsupported samples in OKD), and when not set will update the supported OCP samples (see the sketch below).
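A usage sketch of the proposed flag (the flag name follows the suggestion above and is not implemented yet):
./library-sync.sh --okd    # update only the OKD samples
./library-sync.sh          # update the supported OCP samples (current behavior)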
We need to bump the Kubernetes version to the latest API version OCP is using.
This is what was done last time:
https://github.com/openshift/cluster-samples-operator/pull/409
Find latest stable version from here: https://github.com/kubernetes/api
This is described in wiki: https://source.redhat.com/groups/public/appservices/wiki/cluster_samples_operator_release_activities
The samples need to be resynced for OCP 4.18. Pay attention to only update the OCP samples. OKD does this independently.
Note that the rails templates are currently out of sync with the upstream, so care needs to be taken to not mess those up and adopt the upstream version again.
ATTENTION: this card is blocked by SPLAT-1158 (implementing the workflow proposed in https://docs.providers.openshift.org/platform-external )
As a follow-up to SPLAT-1158 and SPLAT-1425, we should create a cluster with platform type "External" and workflows/steps/jobs that run on vSphere infrastructure using the regular OpenShift CI e2e workflow, using the provisioning steps proposed in docs.providers (https://docs.providers.openshift.org/platform-external).
There are currently a few platform "External" steps (install) that are associated with vSphere, but supposedly only the OPCT conformance workflow is using them (needs more investigation).
In the ci-operator, these should be used as a reference for building a new test that will deploy OpenShift on vSphere using platform "External" with and without CCM. This will be similar to the vSphere platform "None" (and platform "External" from SPLAT-1782).
Caveats:
Currently there is a workflow "upi-vsphere-platform-external-ccm", but it isn't used by any jobs. On the other hand, there are a few OPCT conformance workflows using the step "upi-vsphere-platform-external-ovn-pre" to install a cluster on vSphere using platform type External.
Recently, in SPLAT-1425, the regular e2e step gained support for the platform External type; we need to create a workflow consuming the default OCP CI e2e workflow to get signals using the same workflow as the other platforms, which engineers are familiar with (see the sketch below).
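An illustrative ci-operator test stanza for such a job might look like the sketch below (the test name and cluster_profile are assumptions; the workflow name is the existing one mentioned above):
tests:
- as: e2e-vsphere-platform-external-ccm
  steps:
    cluster_profile: vsphere-elastic
    workflow: upi-vsphere-platform-external-ccm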
We have decided to remove IPI and UPI support for Alibaba Cloud, which until recently has been in Tech Preview due to the following reasons:
(1) Low customer interest of using Openshift on Alibaba Cloud
(2) Removal of Terraform usage
(3) MAPI to CAPI migration
(4) CAPI adoption for installation (Day 1) in OpenShift 4.16 and Day 2 (post-OpenShift 4.16)
Impacted areas based on CI:
alibaba-cloud-csi-driver/openshift-alibaba-cloud-csi-driver-release-4.16.yaml
alibaba-disk-csi-driver-operator/openshift-alibaba-disk-csi-driver-operator-release-4.16.yaml
cloud-provider-alibaba-cloud/openshift-cloud-provider-alibaba-cloud-release-4.16.yaml
cluster-api-provider-alibaba/openshift-cluster-api-provider-alibaba-release-4.16.yaml
cluster-cloud-controller-manager-operator/openshift-cluster-cloud-controller-manager-operator-release-4.16.yaml
machine-config-operator/openshift-machine-config-operator-release-4.16.yaml
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).
Update OCP release number in OLM metadata manifests of:
OLM metadata of the operators are typically in /config/manifest directory of each operator. Example of such a bump: https://github.com/openshift/aws-efs-csi-driver-operator/pull/56
We should do it early in the release, so QE can identify new operator builds easily and they are not mixed with the old release.
Update all CSI sidecars to the latest upstream release from https://github.com/orgs/kubernetes-csi/repositories
Corresponding downstream repos have `csi-` prefix, e.g. github.com/openshift/csi-external-attacher.
This includes an update of the VolumeSnapshot CRDs in the cluster-csi-snapshot-controller-operator assets and of the client API in go.mod, i.e. copy all snapshot CRDs from upstream to the operator assets + go get -u github.com/kubernetes-csi/external-snapshotter/client/v6 in the operator repo.
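A rough sketch of that bump (the checkout paths and the assets directory below are assumptions about the repository layouts):
# copy the upstream snapshot CRDs into the operator assets
cp external-snapshotter/client/config/crd/*.yaml cluster-csi-snapshot-controller-operator/assets/
# bump the snapshot client API in the operator repo
cd cluster-csi-snapshot-controller-operator
go get -u github.com/kubernetes-csi/external-snapshotter/client/v6
go mod tidy && go mod vendor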
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.
Please wait until openshift/api, openshift/library-go, and openshift/client-go are updated to the newest Kubernetes release! There may be non-trivial changes in these libraries.
This includes (but is not limited to):
Operators:
(please cross-check with *-operator + vsphere-problem-detector in our tracking sheet)
EOL, do not upgrade:
The following operators were migrated to csi-operator, do not update these obsolete repos:
tools/library-bump.py and tools/bump-all may be useful. For 4.16, this was enough:
mkdir 4.16-bump
cd 4.16-bump
../library-bump.py --debug --web <file with repo list> STOR-1574 \
    --run "$PWD/../bump-all github.com/google/cel-go@v0.17.7" \
    --commit-message "Bump all deps for 4.16"
4.17 perhaps needs an older prometheus:
../library-bump.py --debug --web <file with repo list> STOR-XXX \
    --run "$PWD/../bump-all github.com/google/cel-go@v0.17.8 github.com/prometheus/common@v0.44.0 github.com/prometheus/client_golang@v1.16.0 github.com/prometheus/client_model@v0.4.0 github.com/prometheus/procfs@v0.10.1" \
    --commit-message "Bump all deps for 4.17"
4.18 special:
Add "spec.unhealthyEvictionPolicy: AlwaysAllow" to all PodDisruptionBudget objects of all our operators + operands. See WRKLDS-1490 for details
There has been a change in the library-go function called `WithReplicasHook`. See https://github.com/openshift/library-go/pull/1796.
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
This epic is part of the 4.18 initiatives we discussed, it includes:
Once we have an MVP of openshift-tests-extension, migrate k8s-tests in openshift/kubernetes to use it.
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled
Description of problem:
When IBM Cloud Infrastructure bugs/outages prevent proper cleanup of resources, they can prevent the deletion of the Resource Group during cluster destroy. The errors returned because of this are not always helpful and can be confusing.
Version-Release number of selected component (if applicable):
4.16 (and earlier)
How reproducible:
80% when IBM Cloud Infrastructure experiences issues
Steps to Reproduce:
1. When there is a known issue with IBM Cloud Infrastructure (COS, Block Storage, etc.), create an IPI cluster on IBM Cloud.
2. Destroy the cluster.
Actual results:
WARNING Failed to delete resource group us-east-block-test-2-d5ssx: Resource groups with active or pending reclamation instances can't be deleted. Use the CLI commands "ibmcloud resource service-instances --type all" and "ibmcloud resource reclamations" to check for remaining instances, then delete the instances and try again.
Expected results:
More descriptive details on the blocking resource service-instances (not always storage reclamation related). Potentially something helpful to provide to IBM Cloud Support for assistance.
Additional info:
IBM Cloud is working on a PR to help enhance the debug details when these kinds of errors occur. At this time, an ongoing issue, https://issues.redhat.com/browse/OCPBUGS-28870, is causing these failures, where this additional debug information can help identify the problem and guide IBM Cloud Support to resolve it. But this information does not resolve that bug (which is an Infrastructure bug).
Description of problem:
The created Node ISO is missing the architecture (<arch>) in its filename, which breaks consistency with other generated ISOs such as the Agent ISO.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Actual results:
Currently, the Node ISO is being created with the filename node.iso.
Expected results:
Node ISO should be created as node.<arch>.iso to maintain consistency.
Description of problem:
The network-status annotation includes multiple default:true entries for OVN's UDN
Version-Release number of selected component (if applicable):
4.17+
How reproducible:
Always
Steps to Reproduce:
1. Use UDN 2. View network-status annotation, see multiple default:true entries
Actual results:
multiple default:true entries
Expected results:
a single default:true entry
Description of problem:
On the route create page, the Hostname field has id "host", and the Service name field has id "toggle-host", which should be "toggle-service".
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-13-193731
How reproducible:
Always
Steps to Reproduce:
1. Check the hostname and service name elements on the route creation page. 2. 3.
Actual results:
1. Service name field has id "toggle-host". screenshot: https://drive.google.com/file/d/1qkUhhzUPsfFw_o2Gj8XXr9QCISH3g1rK/view?usp=drive_link
Expected results:
1. The id should be "toggle-service".
Additional info:
Description of problem:
user is unable to switch to other projects successfully on network policies list page
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-08-27-051932
How reproducible:
Always
Steps to Reproduce:
1. As cluster-admin or a normal user, visit the network policies list page via Networking -> NetworkPolicies.
2. Open the project dropdown and choose a different project.
3.
Actual results:
2. The user is unable to switch to another project successfully.
Expected results:
2. The user should be able to switch projects any time the project is changed.
Additional info:
Please review the following PR: https://github.com/openshift/cluster-kube-apiserver-operator/pull/1730
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
During the build02 update from 4.14.0-ec.1 to ec.2 I have noticed the following:
$ b02 get clusterversion version -o json | jq '.status.conditions[] | select (.type=="Failing")'
{
  "lastTransitionTime": "2023-06-20T13:40:12Z",
  "message": "Multiple errors are preventing progress:\n* Cluster operator authentication is updating versions\n* Could not update customresourcedefinition \"alertingrules.monitoring.openshift.io\" (512 of 993): the object is invalid, possibly due to local cluster configuration",
  "reason": "MultipleErrors",
  "status": "True",
  "type": "Failing"
}
There is a valid error (the Could not update customresourcedefinition... one) but the whole thing is cluttered by the "Cluster operator authentication is updating versions" message, which is imo not a legit reason for Failing=True condition and should not be there. Before I captured this one I saw the message with three operators instead of just one.
Version-Release number of selected component (if applicable):
4.14.0-ec.2
How reproducible:
No idea
Description of problem:
When using an installer with an amd64 payload, configuring the VMs to use aarch64 is possible through the install-config.yaml:
additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: ci.devcluster.openshift.com
compute:
- architecture: arm64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: arm64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
However, the installation will fail with ambiguous error messages:
ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.build11.ci.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 13.59.207.137:6443: connect: connection refused
The actual error hides in the bootstrap VM's system log:
Red Hat Enterprise Linux CoreOS 417.94.202407010929-0 4.17
SSH host key: SHA256:Ng1GpBIlNHcCik8VJZ3pm9k+bMoq+WdjEcMebmWzI4Y (ECDSA)
SSH host key: SHA256:Mo5RgzEmZc+b3rL0IPAJKUmO9mTmiwjBuoslgNcAa2U (ED25519)
SSH host key: SHA256:ckQ3mPUmJGMMIgK/TplMv12zobr7NKrTpmj+6DKh63k (RSA)
ens5: 10.29.3.15 fe80::1947:eff6:7e1b:baac
Ignition: ran on 2024/08/14 12:34:24 UTC (this boot)
Ignition: user-provided config was applied
Ignition: warning at $.kernelArguments: Unused key kernelArguments
Release image arch amd64 does not match host arch arm64
ip-10-29-3-15 login: [ 89.141099] Warning: Unmaintained driver is detected: nft_compat
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Use amd64 installer to install a cluster with aarch64 nodes
Steps to Reproduce:
1. Download the amd64 installer.
2. Generate the install-config.yaml.
3. Edit install-config.yaml to use aarch64 nodes.
4. Invoke the installer.
Actual results:
installation timed out after ~30mins
Expected results:
The installation should fail immediately with a proper error message indicating that the installation is not possible.
Additional info:
https://redhat-internal.slack.com/archives/C68TNFWA2/p1723640243828379
Description of problem:
"Edit Route" from action list doesn't support Form edit.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-21-014704 4.17.0-rc.5
How reproducible:
Always
Steps to Reproduce:
1. Go to a route detail page, click "Edit Route" from the actions dropdown list. 2. 3.
Actual results:
1. It opens YAML tab directly.
Expected results:
1. Should support both Form and YAML edit.
Additional info:
Description of problem:
A slice creation like idPointers := make([]*string, len(ids)) should be corrected to idPointers := make([]*string, 0, len(ids)). When make is called for a slice with only a length (no capacity), the slice is created with that length and filled with zero values; for instance, _ := make([]int, 5) creates {0, 0, 0, 0, 0}. If the slice is then appended to, rather than filled by index, there are extra values:
1. If we append to the slice, the leading zero values remain (this could change the behavior of the function the slice is passed to), and the slice also grows beyond the intended allocation.
2. If we don't fill the slice completely by index (i.e. create a length of 5 and only set 4 elements), the leftover zero values cause the same issue.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Placeholder for bumping CAPO in the installer.
Tests with dynamic namespaces in the name break aggregation (and everything else):
: [sig-architecture] platform pods in ns/openshift-must-gather-8tbzj that restart more than 2 is considered a flake for now
It's only finding 1 of that test and failing aggregation.
Description of problem:
container_network* metrics disappeared from pods
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-13-031847
How reproducible:
always
Steps to Reproduce:
1. Create a pod.
2. Check container_network* metrics from the pod:
$ oc get --raw /api/v1/nodes/jimabug02-95wr2-worker-westus-b2cpv/proxy/metrics/cadvisor | grep container_network_transmit | grep $pod_name
Actual results:
2. It failed to report container_network* metrics
Expected results:
2. It should report container_network* metrics
Additional info:
This may be a regression issue, we hit it in 4.14 https://issues.redhat.com/browse/OCPBUGS-13741
Description of problem:
i18n misses for some provisioners on the Create StorageClass page.
Navigate to Storage -> StorageClasses -> Create StorageClass page.
For Provisioner -> kubernetes.io/glusterfs, missed: Gluster REST/Heketi URL
For Provisioner -> kubernetes.io/quobyte, missed: User
For Provisioner -> kubernetes.io/vsphere-volume, missed: Disk format
For Provisioner -> kubernetes.io/portworx-volume, missed: Filesystem, Select Filesystem
For Provisioner -> kubernetes.io/scaleio, missed: Reference to a configured Secret object
Also missed: Select Provisioner for the placeholder text
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-08-19-002129
How reproducible:
Always
Steps to Reproduce:
1. Add ?pseudolocalization=true&lng=en at the end of the URL.
2. Navigate to the Storage -> StorageClasses -> Create StorageClass page, click the provisioner dropdown list, and choose a provisioner.
3. Check whether the text is in i18n mode.
Actual results:
the text is not in i18n mode
Expected results:
the text should be in i18n mode
Additional info:
As of now, it is possible to set different architectures for the compute machine pools when both the 'worker' and 'edge' machine pools are defined in the install-config.
Example:
compute:
- name: worker
  architecture: arm64
  ...
- name: edge
  architecture: amd64
  platform:
    aws:
      zones: ${edge_zones_str}
See https://github.com/openshift/installer/blob/master/pkg/types/validation/installconfig.go#L631
Description of problem:
Log in to the admin console and go to the "Observe -> Metrics" page; there is one additional and useless button to the left of the "Actions" button. See picture: https://drive.google.com/file/d/11CxilYmIzRyrcaISHje4QYhMsx9It3TU/view?usp=drive_link
According to 4.17, the button is for the refresh interval, but it failed to load.
NOTE: same issue for the developer console
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-07-200953
How reproducible:
always
Steps to Reproduce:
1. login admin/developer console, go to "Observe -> Metrics" page
Actual results:
Refresh interval button on "Observe -> Metrics" page failed to load
Expected results:
no error
Additional info:
Description of problem:
The samples operator sync for OCP 4.18 includes an update to the ruby imagestream. This removes EOLed versions of Ruby and upgrades the images to be ubi9 based
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
1. Run build suite tests 2. 3.
Actual results:
Tests fail trying to pull image. Example: Error pulling image "image-registry.openshift-image-registry.svc:5000/openshift/ruby:3.0-ubi8": initializing source docker://image-registry.openshift-image-registry.svc:5000/openshift/ruby:3.0-ubi8: reading manifest 3.0-ubi8 in image-registry.openshift-image-registry.svc:5000/openshift/ruby: manifest unknown
Expected results:
Builds can pull image, and the tests succeed.
Additional info:
As part of the continued deprecation of the Samples Operator, these tests should create their own Ruby imagestream that is kept current.
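A sketch of what such a test-owned imagestream could look like (the tag, source image, and namespace are assumptions for illustration):
oc import-image ruby:3.3-ubi9 \
  --from=registry.access.redhat.com/ubi9/ruby-33 \
  --confirm -n <test-namespace>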
Description of problem:
The example fails in the CI of the Samples Operator because it references a base image (perl:5.30-el7) that is no longer available in the OpenShift library. This needs to be fixed to unblock the release of the Samples Operator for OCP 4.17. There are essentially two ways to fix this:
1. Fix the Perl test template to reference a Perl image available in the OpenShift library.
2. Remove the test (which might be OK because the template seems to actually only be used in the tests).
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
The test breaks here: https://github.com/openshift/origin/blob/master/test/extended/image_ecosystem/s2i_perl.go#L78 and the line in the test template that specifies the outdated Perl image is here: https://github.com/openshift/origin/blob/master/test/extended/testdata/image_ecosystem/perl-hotdeploy/perl.json#L50
Description of problem:
When we enable OCB in the worker pool and a new image is built, once the builder pod has finished building the image it takes about 10-20 minutes to start applying this new image on the first node.
Version-Release number of selected component (if applicable):
The issue was found while pre-merge verifying https://github.com/openshift/machine-config-operator/pull/4395
How reproducible:
Always
Steps to Reproduce:
1. Enable techpreview.
2. Create this MOSC:
oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineOSConfig
metadata:
  name: worker
spec:
  buildOutputs:
    currentImagePullSecret:
      name: $(oc get -n openshift-machine-config-operator sa default -ojsonpath='{.secrets[0].name}')
  machineConfigPool:
    name: worker
  buildInputs:
    imageBuilder:
      imageBuilderType: PodImageBuilder
    baseImagePullSecret:
      name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
    renderedImagePushSecret:
      name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
    renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest"
    containerFile:
    - containerfileArch: noarch
      content: |-
        # Pull the centos base image and enable the EPEL repository.
        FROM quay.io/centos/centos:stream9 AS centos
        RUN dnf install -y epel-release
        # Pull an image containing the yq utility.
        FROM docker.io/mikefarah/yq:latest AS yq
        # Build the final OS image for this MachineConfigPool.
        FROM configs AS final
        # Copy the EPEL configs into the final image.
        COPY --from=yq /usr/bin/yq /usr/bin/yq
        COPY --from=centos /etc/yum.repos.d /etc/yum.repos.d
        COPY --from=centos /etc/pki/rpm-gpg/RPM-GPG-KEY-* /etc/pki/rpm-gpg/
        # Install cowsay and ripgrep from the EPEL repository into the final image,
        # along with a custom cow file.
        RUN sed -i 's/\$stream/9-stream/g' /etc/yum.repos.d/centos*.repo && \
            rpm-ostree install cowsay ripgrep
EOF
Actual results:
The machine-os-builder pod will be created, then the build pod will be created too, the image will be built, and then it will take about 10-20 minutes to start applying the new build on the first node.
Expected results:
After MCO finishes building the image, it should not take 10-20 minutes to start applying the image on the first node.
Additional info:
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/129
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This ticket was created by ART pipeline run sync-ci-images
Description of problem:
Circular dependencies in the OCP Console prevent migration to Webpack 5
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Always
Steps to Reproduce:
1. Enable the CHECK_CYCLES env var while building 2. Observe errors 3.
Actual results:
There are errors
Expected results:
No errors
Additional info:
These are the cycles I can observe in public:
webpack compilation dbe21e029f8714842299 41 total cycles, 26 min-length cycles (A -> B -> A) Cycle count per directory: public (41) Index files occurring within cycles: public/components/secrets/create-secret/index.tsx (9) public/components/utils/index.tsx (4) public/module/k8s/index.ts (2) public/components/graphs/index.tsx (1) frontend/public/tokener.html public/tokener.html public/tokener.html frontend/public/index.html public/index.html public/index.html frontend/public/redux.ts public/redux.ts public/reducers/features.ts public/actions/features.ts public/redux.ts frontend/public/co-fetch.ts public/co-fetch.ts public/module/auth.js public/co-fetch.ts frontend/public/actions/features.ts public/actions/features.ts public/redux.ts public/reducers/features.ts public/actions/features.ts frontend/public/components/masthead.jsx public/components/masthead.jsx public/components/masthead-toolbar.jsx public/components/about-modal.tsx public/components/masthead.jsx frontend/public/components/utils/index.tsx public/components/utils/index.tsx public/components/utils/kebab.tsx public/components/utils/index.tsx frontend/public/module/k8s/index.ts public/module/k8s/index.ts public/module/k8s/k8s.ts public/module/k8s/index.ts frontend/public/reducers/features.ts public/reducers/features.ts public/actions/features.ts public/redux.ts public/reducers/features.ts frontend/public/module/auth.js public/module/auth.js public/co-fetch.ts public/module/auth.js frontend/public/components/cluster-settings/cluster-settings.tsx public/components/cluster-settings/cluster-settings.tsx public/components/cluster-settings/cluster-operator.tsx public/components/cluster-settings/cluster-settings.tsx frontend/public/components/monitoring/receiver-forms/alert-manager-receiver-forms.tsx public/components/monitoring/receiver-forms/alert-manager-receiver-forms.tsx public/components/monitoring/receiver-forms/pagerduty-receiver-form.tsx public/components/monitoring/receiver-forms/alert-manager-receiver-forms.tsx frontend/public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/utils.ts public/components/secret.jsx public/components/secrets/create-secret/index.tsx frontend/public/components/masthead-toolbar.jsx public/components/masthead-toolbar.jsx public/components/about-modal.tsx public/components/masthead.jsx public/components/masthead-toolbar.jsx frontend/public/actions/features.gql public/actions/features.gql public/actions/features.gql frontend/public/components/utils/kebab.tsx public/components/utils/kebab.tsx public/components/utils/index.tsx public/components/utils/kebab.tsx frontend/public/module/k8s/k8s.ts public/module/k8s/k8s.ts public/module/k8s/index.ts public/module/k8s/k8s.ts frontend/public/module/k8s/swagger.ts public/module/k8s/swagger.ts public/module/k8s/index.ts public/module/k8s/swagger.ts frontend/public/graphql/client.gql public/graphql/client.gql public/graphql/client.gql frontend/public/components/cluster-settings/cluster-operator.tsx public/components/cluster-settings/cluster-operator.tsx public/components/cluster-settings/cluster-settings.tsx public/components/cluster-settings/cluster-operator.tsx frontend/public/components/monitoring/receiver-forms/pagerduty-receiver-form.tsx public/components/monitoring/receiver-forms/pagerduty-receiver-form.tsx public/components/monitoring/receiver-forms/alert-manager-receiver-forms.tsx public/components/monitoring/receiver-forms/pagerduty-receiver-form.tsx 
frontend/public/components/monitoring/receiver-forms/webhook-receiver-form.tsx public/components/monitoring/receiver-forms/webhook-receiver-form.tsx public/components/monitoring/receiver-forms/alert-manager-receiver-forms.tsx public/components/monitoring/receiver-forms/webhook-receiver-form.tsx frontend/public/components/monitoring/receiver-forms/email-receiver-form.tsx public/components/monitoring/receiver-forms/email-receiver-form.tsx public/components/monitoring/receiver-forms/alert-manager-receiver-forms.tsx public/components/monitoring/receiver-forms/email-receiver-form.tsx frontend/public/components/monitoring/receiver-forms/slack-receiver-form.tsx public/components/monitoring/receiver-forms/slack-receiver-form.tsx public/components/monitoring/receiver-forms/alert-manager-receiver-forms.tsx public/components/monitoring/receiver-forms/slack-receiver-form.tsx frontend/public/components/secrets/create-secret/utils.ts public/components/secrets/create-secret/utils.ts public/components/secret.jsx public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/utils.ts frontend/public/components/secrets/create-secret/CreateConfigSubform.tsx public/components/secrets/create-secret/CreateConfigSubform.tsx public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/CreateConfigSubform.tsx frontend/public/components/secrets/create-secret/UploadConfigSubform.tsx public/components/secrets/create-secret/UploadConfigSubform.tsx public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/UploadConfigSubform.tsx frontend/public/components/secrets/create-secret/WebHookSecretForm.tsx public/components/secrets/create-secret/WebHookSecretForm.tsx public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/WebHookSecretForm.tsx frontend/public/components/secrets/create-secret/SSHAuthSubform.tsx public/components/secrets/create-secret/SSHAuthSubform.tsx public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/SSHAuthSubform.tsx frontend/public/components/secrets/create-secret/GenericSecretForm.tsx public/components/secrets/create-secret/GenericSecretForm.tsx public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/GenericSecretForm.tsx frontend/public/components/secrets/create-secret/KeyValueEntryForm.tsx public/components/secrets/create-secret/KeyValueEntryForm.tsx public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/KeyValueEntryForm.tsx frontend/public/components/secrets/create-secret/CreateSecret.tsx public/components/secrets/create-secret/CreateSecret.tsx public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/CreateSecret.tsx frontend/public/components/secrets/create-secret/SecretSubForm.tsx public/components/secrets/create-secret/SecretSubForm.tsx public/components/secrets/create-secret/index.tsx public/components/secrets/create-secret/SecretSubForm.tsx frontend/public/components/about-modal.tsx public/components/about-modal.tsx public/components/masthead.jsx public/components/masthead-toolbar.jsx public/components/about-modal.tsx frontend/public/components/graphs/index.tsx public/components/graphs/index.tsx public/components/graphs/status.jsx public/components/graphs/index.tsx frontend/public/components/modals/error-modal.tsx public/components/modals/error-modal.tsx public/components/utils/index.tsx public/components/utils/webhooks.tsx 
public/components/modals/error-modal.tsx frontend/public/components/image-stream.tsx public/components/image-stream.tsx public/components/image-stream-timeline.tsx public/components/image-stream.tsx frontend/public/components/graphs/status.jsx public/components/graphs/status.jsx public/components/graphs/index.tsx public/components/graphs/status.jsx frontend/public/components/build-pipeline.tsx public/components/build-pipeline.tsx public/components/utils/index.tsx public/components/utils/build-strategy.tsx public/components/build.tsx public/components/build-pipeline.tsx frontend/public/components/build-logs.jsx public/components/build-logs.jsx public/components/utils/index.tsx public/components/utils/build-strategy.tsx public/components/build.tsx public/components/build-logs.jsx frontend/public/components/image-stream-timeline.tsx public/components/image-stream-timeline.tsx public/components/image-stream.tsx public/components/image-stream-timeline.tsx
Description of problem:
The cluster-wide proxy is automatically injected into the remote-write config of the Prometheus k8s CR in the openshift-monitoring project, which is expected, but the noProxy URLs are not. As a result, if the remote-write endpoint is in the noProxy region, metrics are not transferred.
Version-Release number of selected component (if applicable):
RHOCP 4.16.4
How reproducible:
100%
Steps to Reproduce:
1. Configure the proxy custom resource in the RHOCP 4.16.4 cluster.
2. Create the cluster-monitoring-config configmap in the openshift-monitoring project.
3. Inject the remote-write config (without specifically configuring a proxy for remote-write).
4. After saving the modification in the cluster-monitoring-config configmap, check the remoteWrite config in the Prometheus k8s CR. Now it contains the proxyUrl but NOT the noProxy URL (referenced from the cluster proxy). Example snippet:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  [...]
  name: k8s
  namespace: openshift-monitoring
spec:
  [...]
  remoteWrite:
  - proxyUrl: http://proxy.abc.com:8080    <<<<<====== Injected automatically, but there is no noProxy URL.
    url: http://test-remotewrite.test.svc.cluster.local:9090
Actual results:
The proxy URL from proxy CR is getting injected in Prometheus k8s CR automatically when configuring remoteWrite but it doesn't have noProxy inherited from cluster proxy resource.
Expected results:
The noProxy URL should get injected in Prometheus k8s CR as well.
Additional info:
Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/819
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The 'Are you sure' pop-up windows in the 'Create NetworkPolicy' -> Policy type section, both for Ingress and Egress, do not close automatically after the user triggers the 'Remove all' action.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-09-120947 4.18.0-0.nightly-2024-09-09-212926
How reproducible:
Always
Steps to Reproduce:
1. Navigate to the Networking -> NetworkPolicies page, click the 'Create NetworkPolicy' button, and change to Form view.
2. In the Policy type -> Ingress/Egress section, click the 'Add Ingress rule' button.
3. Click 'Remove all', and trigger the 'remove all' action in the pop-up window.
Actual results:
The ingress/egress data has been removed, but the pop-up windows are not closed automatically.
Expected results:
Compared with the same behavior on OCP 4.16: after the 'Remove all' action is triggered and executed successfully, the window should be closed automatically.
Additional info:
Description of problem:
As part of https://issues.redhat.com/browse/CFE-811, we added a featuregate "RouteExternalCertificate" to release the feature as TP, and all the code implementations were behind this gate. However, it seems https://github.com/openshift/api/pull/1731 inadvertently duplicated "ExternalRouteCertificate" as "RouteExternalCertificate".
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
$ oc get featuregates.config.openshift.io cluster -oyaml
<......>
spec:
  featureSet: TechPreviewNoUpgrade
status:
  featureGates:
    enabled:
    - name: ExternalRouteCertificate
    - name: RouteExternalCertificate
<......>
Actual results:
Both RouteExternalCertificate and ExternalRouteCertificate were added in the API
Expected results:
We should have only one featuregate "RouteExternalCertificate" and the same should be displayed in https://docs.openshift.com/container-platform/4.16/nodes/clusters/nodes-cluster-enabling-features.html
Additional info:
Git commits https://github.com/openshift/api/commit/11f491c2c64c3f47cea6c12cc58611301bac10b3 https://github.com/openshift/api/commit/ff31f9c1a0e4553cb63c3e530e46a3e8d2e30930 Slack thread: https://redhat-internal.slack.com/archives/C06EK9ZH3Q8/p1719867937186219
Description of problem:
Adding a node with `oc adm node-image` fails:
oc adm node-image monitor --ip-addresses 192.168.250.77
time=2024-10-10T11:31:19Z level=info msg=Monitoring IPs: [192.168.250.77]
time=2024-10-10T11:31:19Z level=info msg=Cannot resolve IP address 192.168.250.77 to a hostname. Skipping checks for pending CSRs.
time=2024-10-10T11:31:19Z level=info msg=Node 192.168.250.77: Assisted Service API is available
time=2024-10-10T11:31:19Z level=info msg=Node 192.168.250.77: Cluster is adding hosts
time=2024-10-10T11:31:19Z level=warning msg=Node 192.168.250.77: Platform OpenStack Compute is allowed only for Single Node OpenShift or user-managed networking
Version-Release number of selected component (if applicable):
4.17.0
How reproducible: Always
Extra information:
The cluster is deployed using Platform:None and userManagedNetworking on an OpenStack cluster which is used as a test bed for the real hardware Agent Based Installer.
Bootstrap of the cluster itself is successful, but adding nodes as day-2 operations is not working.
During the cluster bootstrap, we see the following log message:
{\"id\":\"valid-platform-network-settings\",\"status\":\"success\",\"message\":\"Platform OpenStack Compute is allowed\"}So after looking at https://github.com/openshift/assisted-service/blob/master/internal/host/validator.go#L569
we suppose that the error is related to `userManagedNetworking`
being set to true when bootstraping and false when adding a node.
A second related issue is why the platform is seen as OpenStack, as neither the cluster-config-v1 configmap containing the install-config nor the infrastructure/cluster object mentions OpenStack.
Not sure if this is relevant, but an external CNI plugin is used here; we have networkType: Calico in the install config.
Description of problem:
In CONSOLE-4187, the metrics page was removed from the console, but some related packages (i.e., the codemirror ones) remained, even though they are now unnecessary
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
IHAC facing an issue while deploying a Nutanix IPI cluster 4.16.x with DHCP.

ENV DETAILS: Nutanix versions: AOS 6.5.4, NCC 4.6.6.3, PC pc.2023.4.0.2, LCM 3.0.0.1

During the installation process, after the bootstrap node and control-plane nodes are created, the IP addresses on the nodes shown in the Nutanix Dashboard conflict, even when infinite DHCP leases are set. The installation works successfully only when using the Nutanix IPAM. The 4.14 and 4.15 releases also install successfully. The IPs of master0 and master2 are conflicting; please check the attachment.

Sos-reports of master0 and master1: https://drive.google.com/drive/folders/140ATq1zbRfqd1Vbew-L_7N4-C5ijMao3?usp=sharing

The issue was reported via the slack thread: https://redhat-internal.slack.com/archives/C02A3BM5DGS/p1721837567181699
Version-Release number of selected component (if applicable):
How reproducible:
Use the OCP 4.16.z installer to create an OCP cluster with Nutanix using DHCP network. The installation will fail. Always reproducible.
Steps to Reproduce:
1. 2. 3.
Actual results:
The installation will fail.
Expected results:
The installation succeeds to create a Nutanix OCP cluster with the DHCP network.
Additional info:
When provisioning a cluster using IPI with FIPS enabled,
if using virtual media, IPA fails to boot with FIPS; there is an error in machine-os-images:
Oct 29 15:57:19 localhost.localdomain extract-machine-os.sh[3757]: Adding kernel argument ip=dhcp
Oct 29 15:57:19 localhost.localdomain extract-machine-os.sh[3757]: Adding kernel argument fips=1
Oct 29 15:57:19 localhost.localdomain extract-machine-os.sh[3757]: /bin/copy-iso: line 34: [: ip=dhcp: binary operator expected
Description of problem:
[AWS] Installer should have a pre-check for user tags
Version-Release number of selected component (if applicable):
4.18
How reproducible:
always
Steps to Reproduce:
Set user tags as below in install-config (a placement sketch follows below):

userTags:
  usage-user: cloud-team-rebase-bot[bot]

The user tags are applied to many resources, including IAM roles, but the characters "[" and "]" are not allowed in role tags. https://drive.google.com/file/d/148y-cYrfzNQzDwWlUrgMYAGsZAY6gbW4/view?usp=sharing
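For illustration only, a sketch of where these tags sit in install-config.yaml; the region value is an assumption, and the bracket characters in the tag value are what IAM rejects for role tags:

```
# Hypothetical install-config.yaml fragment (illustration only).
# The tag value contains "[" and "]", which IAM does not accept for
# role tags, so IAM role creation fails during the install.
platform:
  aws:
    region: us-east-2          # assumption: any region reproduces this
    userTags:
      usage-user: cloud-team-rebase-bot[bot]
```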
Actual results:
Installation failed as failed to create IAM roles, ref job: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-api-provider-aws/529/pull-ci-openshift-cluster-api-provider-aws-master-regression-clusterinfra-aws-ipi-proxy-techpreview/1852197133122277376
Expected results:
The installer should have a pre-check for this scenario and exit with an error message if the user tags contain unsupported characters
Additional info:
discussion on slack: https://redhat-internal.slack.com/archives/CF8SMALS1/p1730443557188649
Description of problem:
Move the Events option above Event Source and rename it to Event Types. Also keep the Eventing options together on the Add page.
Validation failures in assisted-service are reported to the user in the output of openshift-install agent wait-for bootstrap-complete. However, when reporting issues to support or escalating to engineering, we quite often have only the agent-gather archive to go on.
Most validation failures in assisted-service are host validations. These can be reconstructed with some difficulty from the assisted-service log, and are readily available in that log starting with 4.17 since we enabled debugging in AGENT-944.
However, there are also cluster validation failures and these are not well logged.
Description of problem: Clicking Size control in PVC form throws a warning error. See the below and attached:
`react-dom.development.js:67 Warning: A component is changing an uncontrolled input to be controlled.`
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Go to the PVC form and open the browser dev console
2. Click on the Size control to set a value. The warning `Warning: A component is changing an uncontrolled input to be controlled. This is likely caused by the value changing from undefined to a defined value, which should not happen. Decide between using a controlled or uncontrolled input element for the lifetime of the component.` is logged in the console tab.
Actual results:
Expected results:
Additional info:
Description of problem:
Clicking on any route to view its details wrongly takes the route name as the selected project name
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-09-212926
How reproducible:
Always
Steps to Reproduce:
1. Go to the Routes list page 2. Click on any route name 3.
Actual results:
2. The route name is taken as the selected project name, so the page keeps loading because that project doesn't exist
Expected results:
2. The route detail page should be displayed
Additional info:
The story is to track the i18n upload/download routine tasks which are performed every sprint.
A.C.
- Upload strings to Memsource at the start of the sprint and reach out to the localization team
- Download translated strings from Memsource when it is ready
- Review the translated strings and open a pull request
- Open a followup story for next sprint
Description of problem:
Create a cluster with publish: Mixed by using CAPZ.

1. publish: Mixed + apiserver: Internal

install-config:
=================
publish: Mixed
operatorPublishingStrategy:
  apiserver: Internal
  ingress: External
=================

In this case, the api DNS record should not be created in the public DNS zone, but it was created.

$ az network dns record-set cname show --name api.jima07api --resource-group os4-common --zone-name qe.azure.devcluster.openshift.com
{
  "TTL": 300,
  "etag": "6b13d901-07d1-4cd8-92de-8f3accd92a19",
  "fqdn": "api.jima07api.qe.azure.devcluster.openshift.com.",
  "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/os4-common/providers/Microsoft.Network/dnszones/qe.azure.devcluster.openshift.com/CNAME/api.jima07api",
  "metadata": {},
  "name": "api.jima07api",
  "provisioningState": "Succeeded",
  "resourceGroup": "os4-common",
  "targetResource": {},
  "type": "Microsoft.Network/dnszones/CNAME"
}

2. publish: Mixed + ingress: Internal

install-config:
=============
publish: Mixed
operatorPublishingStrategy:
  apiserver: External
  ingress: Internal
=============

In this case, the load balancer rule on port 6443 should be created in the external load balancer, but it could not be found.

$ az network lb rule list --lb-name jima07ingress-krf5b -g jima07ingress-krf5b-rg
[]
Version-Release number of selected component (if applicable):
4.17 nightly build
How reproducible:
Always
Steps to Reproduce:
1. Specify publish: Mixed plus mixed External/Internal settings for api/ingress 2. Create the cluster 3. Check that the public DNS records and the load balancer rules in the internal/external load balancers are created as expected
Actual results:
See the description; some resources are created unexpectedly or are missing.
Expected results:
Public DNS records and load balancer rules in the internal/external load balancers should be created as expected, based on the settings in install-config
Additional info:
Description of problem:
Multipart upload issues with Cloudflare R2 using S3 api. Some S3 compatible object storage systems like R2 require that all multipart chunks are the same size. This was mostly true before, except the final chunk was larger than the requested chunk size which causes uploads to fail.
Version-Release number of selected component (if applicable):
How reproducible:
Problem shows itself on OpenShift CI clusters intermittently.
Steps to Reproduce:
This behavior has been causing 504 Gateway Timeout issues in the image registry instances in OpenShift CI clusters. It is connected to uploading big images (e.g., 35 GB), but we do not currently have the exact steps to reproduce it. 1. 2. 3.
Actual results:
Expected results:
Additional info:
https://github.com/distribution/distribution/issues/3873 https://github.com/distribution/distribution/issues/3873#issuecomment-2258926705 https://developers.cloudflare.com/r2/api/workers/workers-api-reference/#r2multipartupload-definition (look for "uniform in size")
There is a typo here: https://github.com/openshift/installer/blob/release-4.18/upi/openstack/security-groups.yaml#L370
It should be os_subnet6_range.
That task is only run if os_master_schedulable is defined and greater than 0 in the inventory.yaml
Please review the following PR: https://github.com/openshift/images/pull/195
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-api-provider-aws/pull/525
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-aws-cluster-api-controllers-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Description of problem:
When the user provides an existing VPC, the IBM CAPI will not add ports 443, 5000, and 6443 to the VPC's security group. It is safe to always check for these ports since we only add them if they are missing.
Update kubernetes-apiserver and openshift-apiserver to use k8s 1.31.x which is currently in use for OCP 4.18.
Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/126
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
[Staging] BE 2.35.0, UI 2.34.2 - UI allows LVMS and ODF to be selected and then throws an error
How reproducible:
100%
Steps to reproduce:
1.
Actual results:
Expected results:
Description of problem:
When a normal user tries to create a namespace-scoped network policy, the project selected in the project selection dropdown is not taken into account
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-17-183402
How reproducible:
Always
Steps to Reproduce:
1. As a normal user with a project, view the networkpolicy page /k8s/ns/yapei1-1/networkpolicies/~new/form 2. Click 'affected pods' in the Pod selector section, OR keep everything at the default values and click 'Create'
Actual results:
2. The user sees the following error when clicking 'affected pods': Can't preview pods r: pods is forbidden: User "yapei1" cannot list resource "pods" in API group "" at the cluster scope. The user sees the following error when clicking the 'Create' button: An error occurred: networkpolicies.networking.k8s.io is forbidden: User "yapei1" cannot create resource "networkpolicies" in API group "networking.k8s.io" at the cluster scope
Expected results:
2. Switching to 'YAML view', we can see that the selected project name was not auto-populated in the YAML
Additional info:
Description of problem:
Alerts that have been silenced are still shown on the Console overview page.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Steps to Reproduce:
1. On a cluster installed on version 4.15 2. Silence an alert that is firing by going to Console --> Observe --> Alerting --> Alerts 3. Check that the alert is listed under silenced alerts: Console --> Observe --> Alerting --> Silences 4. Go back to the Console (Overview page); the silenced alert is still shown there
Actual results:
The silenced alert can still be seen on the OCP overview page
Expected results:
The silenced alert should not be shown on the overview page
Additional info:
Description of problem:
Navigation: Storage -> PersistentVolumeClaims -> Details -> Mouse hover on the 'PersistentVolumeClaim details' diagram. Issue: "Available" is translated inside the diagram but not in the mouse hover text
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-01-063526
How reproducible:
Always
Steps to Reproduce:
1. Log into the web console and set the language to non en_US 2. Navigate to Storage -> PersistentVolumeClaims 3. Click on a PersistentVolumeClaim from the list 4. In the Details tab, mouse hover on the 'PersistentVolumeClaim details' diagram 5. The text "xx.yy GiB Available" is in English. 6. The same "Available" is translated inside the diagram but not in the mouse hover text
Actual results:
"Available" translated in-side diagram but not in mouse hover text
Expected results:
"Available" in mouse hover text should be in set language
Additional info:
screenshot reference attached
Description of problem:
FDP released a new OVS 3.4 version that will be used on the host.
We want to maintain the same version in the container.
This is mostly needed for OVN observability feature.
Our e2e jobs fail with:
pods/aws-efs-csi-driver-controller-66f7d8bcf5-zf8vr initContainers[init-aws-credentials-file] must have terminationMessagePolicy="FallbackToLogsOnError"
pods/aws-efs-csi-driver-node-7qj9p containers[csi-driver] must have terminationMessagePolicy="FallbackToLogsOnError"
pods/aws-efs-csi-driver-operator-fcc56998b-2d5x6 containers[aws-efs-csi-driver-operator] must have terminationMessagePolicy="FallbackToLogsOnError"
The jobs should succeed.
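As a rough sketch of what the check expects, assuming the fix is simply to set the field on each flagged container in the operator's asset manifests (the pod name and image below are hypothetical):

```
# Minimal sketch: the origin CI check flags any container that does not
# set terminationMessagePolicy to FallbackToLogsOnError.
apiVersion: v1
kind: Pod
metadata:
  name: example-aws-efs-csi-driver-node   # hypothetical name
spec:
  containers:
    - name: csi-driver
      image: registry.example.com/aws-efs-csi-driver:latest   # hypothetical image
      terminationMessagePolicy: FallbackToLogsOnError
```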
Description of problem:
Various tests in Console's master branch CI are failing due to missing content of <li.pf-v5-c-menu__list-item> element. Check https://search.dptools.openshift.org/?search=within+the+element%3A+%3Cli.pf-v5-c-menu__list-item%3E+but+never+did&maxAge=168h&context=1&type=all&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The actual issue is that a project created via the CLI is not available in the namespace dropdown
When one of our partners was trying to deploy a 4.16 spoke cluster with the ZTP/GitOps approach, they got the following error message in their assisted-service pod:
error msg="failed to get corresponding infraEnv" func="github.com/openshift/assisted-service/internal/controller/controllers.(*PreprovisioningImageReconciler).AddIronicAgentToInfraEnv" file="/remote-source/assisted-service/app/internal/controller/controllers/preprovisioningimage_controller.go:409" error="record not found" go-id=497 preprovisioning_image=storage-1.fi-911.tre.nsn-rdnet.net preprovisioning_image_namespace=fi-911 request_id=cc62d8f6-d31f-4f74-af50-3237df186dc2
After some discussion in the Assisted-Installer forum (https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1723196754444999), Nick Carboni and Alona Paz suggested that "identifier: mac-address" is not supported. The partner currently has ACM 2.11.0 and MCE 2.6.0. However, their older cluster had ACM 2.10 and MCE 2.4.5 and this parameter was working there. Nick and Alona suggested removing "identifier: mac-address" from the siteconfig, and then the installation started to progress. Based on Nick's suggestion, I opened this bug ticket to understand why it stopped working. The partner asked for official documentation on why this parameter no longer works, or confirmation that it is no longer supported.
Please review the following PR: https://github.com/openshift/ironic-static-ip-manager/pull/44
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
On "VolumeSnapshot" list page, when project dropdown is "All Projects", click "Create VolumeSnapshot", the project "Undefined" is shown on project field.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-27-213503 4.18.0-0.nightly-2024-09-28-162600
How reproducible:
Always
Steps to Reproduce:
1.Go to "VolumeSnapshot" list page, set "All Projects" in project dropdown list. 2.Click "Create VolumeSnapshot", check project field on the creation page. 3.
Actual results:
2. The project is "Undefined"
Expected results:
2. The project should be "default".
Additional info:
Description of problem:
Both TestAWSEIPAllocationsForNLB and TestAWSLBSubnets are flaking on verifyExternalIngressController waiting for DNS to resolve.
lb_eip_test.go:119: loadbalancer domain apps.eiptest.ci-op-d2nddmn0-43abb.origin-ci-int-aws.dev.rhcloud.com was unable to resolve:
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
50%
Steps to Reproduce:
1. Run TestAWSEIPAllocationsForNLB or TestAWSLBSubnets in CI
Actual results:
Flakes
Expected results:
Shouldn't flake
Additional info:
CI Search: FAIL: TestAll/parallel/TestAWSEIPAllocationsForNLB
CI Search: FAIL: TestAll/parallel/TestUnmanagedAWSEIPAllocations
Hello Team,
After a hard reboot of all nodes due to a power outage, a failure to pull the NTO image prevents "ocp-tuned-one-shot.service" from starting, which results in a dependency failure for the kubelet and crio services.
------------
journalctl_--no-pager
Aug 26 17:07:46 ocp05 systemd[1]: Reached target The firstboot OS update has completed.
Aug 26 17:07:46 ocp05 resolv-prepender.sh[3577]: NM resolv-prepender: Starting download of baremetal runtime cfg image
Aug 26 17:07:46 ocp05 systemd[1]: Starting Writes IP address configuration so that kubelet and crio services select a valid node IP...
Aug 26 17:07:46 ocp05 systemd[1]: Starting TuneD service from NTO image...
Aug 26 17:07:46 ocp05 nm-dispatcher[3687]: NM resolv-prepender triggered by lo up.
Aug 26 17:07:46 ocp05 resolv-prepender.sh[3644]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cf4faeb258c222ba4e04806fd3a7373d3bc1f43a66e141d4b7ece0307f597c72...
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + [[ OVNKubernetes == \O\V\N\K\u\b\e\r\n\e\t\e\s ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + [[ lo == \W\i\r\e\d\ \C\o\n\n\e\c\t\i\o\n ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + '[' -z ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + echo 'Not a DHCP4 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: Not a DHCP4 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + exit 0
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: + '[' -z '' ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: + echo 'Not a DHCP6 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: Not a DHCP6 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: + exit 0
Aug 26 17:07:46 ocp05 bash[3655]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cf4faeb258c222ba4e04806fd3a7373d3bc1f43a66e141d4b7ece0307f597c72...
Aug 26 17:07:46 ocp05 podman[3661]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b6ace44ba73bc0cef451bcf755c7fcddabe66b79df649058dc4b263e052ae26...
Aug 26 17:07:46 ocp05 podman[3661]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b6ace44ba73bc0cef451bcf755c7fcddabe66b79df649058dc4b263e052ae26: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 10.112.227.10:53: server misbehaving
Aug 26 17:07:46 ocp05 systemd[1]: ocp-tuned-one-shot.service: Main process exited, code=exited, status=125/n/a
Aug 26 17:07:46 ocp05 nm-dispatcher[3793]: NM resolv-prepender triggered by brtrunk up.
Aug 26 17:07:46 ocp05 systemd[1]: ocp-tuned-one-shot.service: Failed with result 'exit-code'.
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + [[ OVNKubernetes == \O\V\N\K\u\b\e\r\n\e\t\e\s ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + [[ brtrunk == \W\i\r\e\d\ \C\o\n\n\e\c\t\i\o\n ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + '[' -z ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + echo 'Not a DHCP4 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: Not a DHCP4 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + exit 0
Aug 26 17:07:46 ocp05 systemd[1]: Failed to start TuneD service from NTO image.
Aug 26 17:07:46 ocp05 systemd[1]: Dependency failed for Dependencies necessary to run kubelet.
Aug 26 17:07:46 ocp05 systemd[1]: Dependency failed for Kubernetes Kubelet.
Aug 26 17:07:46 ocp05 systemd[1]: kubelet.service: Job kubelet.service/start failed with result 'dependency'.
Aug 26 17:07:46 ocp05 systemd[1]: Dependency failed for Container Runtime Interface for OCI (CRI-O).
Aug 26 17:07:46 ocp05 systemd[1]: crio.service: Job crio.service/start failed with result 'dependency'.
Aug 26 17:07:46 ocp05 systemd[1]: kubelet-dependencies.target: Job kubelet-dependencies.target/start failed with result 'dependency'.
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: + '[' -z '' ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: + echo 'Not a DHCP6 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: Not a DHCP6 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: + exit 0
-----------
-----------
$ oc get proxy config cluster -oyaml
status:
httpProxy: http://proxy_ip:8080
httpsProxy: http://proxy_ip:8080
$ cat /etc/mco/proxy.env
HTTP_PROXY=http://proxy_ip:8080
HTTPS_PROXY=http://proxy_ip:8080
-----------
-----------
× ocp-tuned-one-shot.service - TuneD service from NTO image
Loaded: loaded (/etc/systemd/system/ocp-tuned-one-shot.service; enabled; preset: disabled)
Active: failed (Result: exit-code) since Mon 2024-08-26 17:07:46 UTC; 2h 30min ago
Main PID: 3661 (code=exited, status=125)
Aug 26 17:07:46 ocp05 podman[3661]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b6ace44ba73bc0cef451bcf755c7fcddabe66b79df649058dc4b263e052ae26: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 10.112.227.10:53: server misbehaving
-----------
Description of problem:
When we added the new bundle metadata encoding `olm.csv.metadata` in https://github.com/operator-framework/operator-registry/pull/1094 (downstreamed for 4.15+), we created situations where
- Konflux-onboarded operators, encouraged to use upstream:latest to generate FBC from templates, and
- IIB-generated catalog images which used earlier opm versions to serve content
could generate the new format but not be able to serve it. One only has to `opm render` an SQLite catalog image, or expand a catalog template.
Version-Release number of selected component (if applicable):
How reproducible:
every time
Steps to Reproduce:
1. opm render an SQLite catalog image 2. 3.
Actual results:
uses `olm.csv.metadata` in the output
Expected results:
only using `olm.bundle.object` in the output
Additional info:
Description of problem:
When a HostedCluster is upgraded to a new minor version, its OLM catalog imagestreams are not updated to use the tag corresponding to the new minor version.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Create a HostedCluster (4.15.z) 2. Upgrade the HostedCluster to a new minor version (4.16.z)
Actual results:
OLM catalog imagestreams remain at the previous version (4.15)
Expected results:
OLM catalog imagestreams are updated to new minor version (4.16)
Additional info:
Please review the following PR: https://github.com/openshift/prometheus-alertmanager/pull/95
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Last app-sre security scan in production shows issues with the openshift/origin-oauth-proxy image.
https://grafana.stage.devshift.net/d/eds0cjpeszz0ge/acs-cvss?orgId=1
/cc Alona Kaplan
this is case 2 from OCPBUGS-14673
Description of problem:
MHC for the control plane does not work correctly in some cases. Case 2: stop the kubelet service on a master node; the new master reaches Running, the old one is stuck in Deleting, and many cluster operators are degraded. This is a regression, because when I tested this on 4.12 around September 2022, case 2 and case 3 worked correctly. https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-54326
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-05-112833 4.13.0-0.nightly-2023-06-06-194351 4.12.0-0.nightly-2023-06-07-005319
How reproducible:
Always
Steps to Reproduce:
1.Create MHC for control plane apiVersion: machine.openshift.io/v1beta1 kind: MachineHealthCheck metadata: name: control-plane-health namespace: openshift-machine-api spec: maxUnhealthy: 1 selector: matchLabels: machine.openshift.io/cluster-api-machine-type: master unhealthyConditions: - status: "False" timeout: 300s type: Ready - status: "Unknown" timeout: 300s type: Ready liuhuali@Lius-MacBook-Pro huali-test % oc create -f mhc-master3.yaml machinehealthcheck.machine.openshift.io/control-plane-health created liuhuali@Lius-MacBook-Pro huali-test % oc get mhc NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY control-plane-health 1 3 3 machine-api-termination-handler 100% 0 0 Case 2.Stop the kubelet service on the master node, new master get Running, the old one stuck in Deleting, many co degraded. liuhuali@Lius-MacBook-Pro huali-test % oc debug node/huliu-az7c-svq9q-master-1 Starting pod/huliu-az7c-svq9q-master-1-debug ... To use host binaries, run `chroot /host` Pod IP: 10.0.0.6 If you don't see a command prompt, try pressing enter. sh-4.4# chroot /host sh-5.1# systemctl stop kubelet Removing debug pod ... liuhuali@Lius-MacBook-Pro huali-test % oc get node NAME STATUS ROLES AGE VERSION huliu-az7c-svq9q-master-1 Ready control-plane,master 95m v1.26.5+7a891f0 huliu-az7c-svq9q-master-2 Ready control-plane,master 95m v1.26.5+7a891f0 huliu-az7c-svq9q-master-c96k8-0 Ready control-plane,master 19m v1.26.5+7a891f0 huliu-az7c-svq9q-worker-westus-5r8jf Ready worker 34m v1.26.5+7a891f0 huliu-az7c-svq9q-worker-westus-k747l Ready worker 47m v1.26.5+7a891f0 huliu-az7c-svq9q-worker-westus-r2vdn Ready worker 83m v1.26.5+7a891f0 liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-az7c-svq9q-master-1 Running Standard_D8s_v3 westus 97m huliu-az7c-svq9q-master-2 Running Standard_D8s_v3 westus 97m huliu-az7c-svq9q-master-c96k8-0 Running Standard_D8s_v3 westus 23m huliu-az7c-svq9q-worker-westus-5r8jf Running Standard_D4s_v3 westus 39m huliu-az7c-svq9q-worker-westus-k747l Running Standard_D4s_v3 westus 53m huliu-az7c-svq9q-worker-westus-r2vdn Running Standard_D4s_v3 westus 91m liuhuali@Lius-MacBook-Pro huali-test % oc get node NAME STATUS ROLES AGE VERSION huliu-az7c-svq9q-master-1 NotReady control-plane,master 107m v1.26.5+7a891f0 huliu-az7c-svq9q-master-2 Ready control-plane,master 107m v1.26.5+7a891f0 huliu-az7c-svq9q-master-c96k8-0 Ready control-plane,master 32m v1.26.5+7a891f0 huliu-az7c-svq9q-master-jdhgg-1 Ready control-plane,master 2m10s v1.26.5+7a891f0 huliu-az7c-svq9q-worker-westus-5r8jf Ready worker 46m v1.26.5+7a891f0 huliu-az7c-svq9q-worker-westus-k747l Ready worker 59m v1.26.5+7a891f0 huliu-az7c-svq9q-worker-westus-r2vdn Ready worker 95m v1.26.5+7a891f0 liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-az7c-svq9q-master-1 Deleting Standard_D8s_v3 westus 110m huliu-az7c-svq9q-master-2 Running Standard_D8s_v3 westus 110m huliu-az7c-svq9q-master-c96k8-0 Running Standard_D8s_v3 westus 36m huliu-az7c-svq9q-master-jdhgg-1 Running Standard_D8s_v3 westus 5m55s huliu-az7c-svq9q-worker-westus-5r8jf Running Standard_D4s_v3 westus 52m huliu-az7c-svq9q-worker-westus-k747l Running Standard_D4s_v3 westus 65m huliu-az7c-svq9q-worker-westus-r2vdn Running Standard_D4s_v3 westus 103m liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-az7c-svq9q-master-1 Deleting Standard_D8s_v3 westus 3h huliu-az7c-svq9q-master-2 Running Standard_D8s_v3 westus 3h huliu-az7c-svq9q-master-c96k8-0 Running 
Standard_D8s_v3 westus 105m huliu-az7c-svq9q-master-jdhgg-1 Running Standard_D8s_v3 westus 75m huliu-az7c-svq9q-worker-westus-5r8jf Running Standard_D4s_v3 westus 122m huliu-az7c-svq9q-worker-westus-k747l Running Standard_D4s_v3 westus 135m huliu-az7c-svq9q-worker-westus-r2vdn Running Standard_D4s_v3 westus 173m liuhuali@Lius-MacBook-Pro huali-test % oc get node NAME STATUS ROLES AGE VERSION huliu-az7c-svq9q-master-1 NotReady control-plane,master 178m v1.26.5+7a891f0 huliu-az7c-svq9q-master-2 Ready control-plane,master 178m v1.26.5+7a891f0 huliu-az7c-svq9q-master-c96k8-0 Ready control-plane,master 102m v1.26.5+7a891f0 huliu-az7c-svq9q-master-jdhgg-1 Ready control-plane,master 72m v1.26.5+7a891f0 huliu-az7c-svq9q-worker-westus-5r8jf Ready worker 116m v1.26.5+7a891f0 huliu-az7c-svq9q-worker-westus-k747l Ready worker 129m v1.26.5+7a891f0 huliu-az7c-svq9q-worker-westus-r2vdn Ready worker 165m v1.26.5+7a891f0 liuhuali@Lius-MacBook-Pro huali-test % oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.13.0-0.nightly-2023-06-06-194351 True True True 107m APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()... baremetal 4.13.0-0.nightly-2023-06-06-194351 True False False 174m cloud-controller-manager 4.13.0-0.nightly-2023-06-06-194351 True False False 176m cloud-credential 4.13.0-0.nightly-2023-06-06-194351 True False False 3h cluster-autoscaler 4.13.0-0.nightly-2023-06-06-194351 True False False 173m config-operator 4.13.0-0.nightly-2023-06-06-194351 True False False 175m console 4.13.0-0.nightly-2023-06-06-194351 True False False 136m control-plane-machine-set 4.13.0-0.nightly-2023-06-06-194351 True False False 71m csi-snapshot-controller 4.13.0-0.nightly-2023-06-06-194351 True False False 174m dns 4.13.0-0.nightly-2023-06-06-194351 True True False 173m DNS "default" reports Progressing=True: "Have 6 available node-resolver pods, want 7." etcd 4.13.0-0.nightly-2023-06-06-194351 True True True 173m NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.) image-registry 4.13.0-0.nightly-2023-06-06-194351 True True False 165m Progressing: The registry is ready... ingress 4.13.0-0.nightly-2023-06-06-194351 True False False 165m insights 4.13.0-0.nightly-2023-06-06-194351 True False False 168m kube-apiserver 4.13.0-0.nightly-2023-06-06-194351 True True True 171m NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.) kube-controller-manager 4.13.0-0.nightly-2023-06-06-194351 True False True 171m NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.) kube-scheduler 4.13.0-0.nightly-2023-06-06-194351 True False True 171m NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.) 
kube-storage-version-migrator 4.13.0-0.nightly-2023-06-06-194351 True False False 106m machine-api 4.13.0-0.nightly-2023-06-06-194351 True False False 167m machine-approver 4.13.0-0.nightly-2023-06-06-194351 True False False 174m machine-config 4.13.0-0.nightly-2023-06-06-194351 False False True 60m Cluster not available for [{operator 4.13.0-0.nightly-2023-06-06-194351}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)] marketplace 4.13.0-0.nightly-2023-06-06-194351 True False False 174m monitoring 4.13.0-0.nightly-2023-06-06-194351 True False False 106m network 4.13.0-0.nightly-2023-06-06-194351 True True False 177m DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)... node-tuning 4.13.0-0.nightly-2023-06-06-194351 True False False 173m openshift-apiserver 4.13.0-0.nightly-2023-06-06-194351 True True True 107m APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-apiserver () openshift-controller-manager 4.13.0-0.nightly-2023-06-06-194351 True False False 170m openshift-samples 4.13.0-0.nightly-2023-06-06-194351 True False False 167m operator-lifecycle-manager 4.13.0-0.nightly-2023-06-06-194351 True False False 174m operator-lifecycle-manager-catalog 4.13.0-0.nightly-2023-06-06-194351 True False False 174m operator-lifecycle-manager-packageserver 4.13.0-0.nightly-2023-06-06-194351 True False False 168m service-ca 4.13.0-0.nightly-2023-06-06-194351 True False False 175m storage 4.13.0-0.nightly-2023-06-06-194351 True True False 174m AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods... liuhuali@Lius-MacBook-Pro huali-test % ----------------------- There might be an easier way by just rolling a revision in etcd, stopping kubelet and then observing the same issue.
Actual results:
CEO's member removal controller is getting stuck on the IsBootstrapComplete check that was introduced to fix another bug: https://github.com/openshift/cluster-etcd-operator/commit/c96150992a8aba3654835787be92188e947f557c#diff-d91047e39d2c1ab6b35e69359a24e83c19ad9b3e9ad4e44f9b1ac90e50f7b650R97 It turns out IsBootstrapComplete checks whether a revision is currently rolling out (which makes sense), and the one NotReady node with kubelet gone still has a revision going (rev 7, target 9). More info: https://issues.redhat.com/browse/OCPBUGS-14673?focusedId=22726712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-22726712 This causes the etcd member to not be removed, which in turn blocks the vertical scale-down procedure from removing the pre-drain hook, as the member is still present. Effectively you end up with a cluster of 4 control-plane machines, where one is stuck in the Deleting state.
Expected results:
The etcd member should be removed and the machine/node should be deleted
Additional info:
Removing the revision check does fix this issue reliably, but might not be desirable: https://github.com/openshift/cluster-etcd-operator/pull/1087
Description of problem:
Once the min node count is reached, the remaining nodes' taints should not include DeletionCandidateOfClusterAutoscaler
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-arm64-2024-09-13-023103
How reproducible:
Always
Steps to Reproduce:
1. Create an IPI cluster 2. Create a MachineAutoscaler and a ClusterAutoscaler (see the sketch below) 3. Create a workload so that scaling happens
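For step 2, a minimal sketch of the two autoscaler objects; the names, namespace placement for the MachineSet reference, and replica counts are hypothetical, not taken from the report:

```
# Hypothetical ClusterAutoscaler/MachineAutoscaler pair for step 2.
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-scaler                  # hypothetical name
  namespace: openshift-machine-api
spec:
  minReplicas: 1                       # assumed min node count
  maxReplicas: 3                       # assumed max node count
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: example-worker-machineset    # hypothetical MachineSet name
```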
Actual results:
The DeletionCandidateOfClusterAutoscaler taint is present even after the min node count is reached
Expected results:
The above taint should not be present on nodes once the min node count is reached
Additional info:
logs from the test - https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Runner/1037951/console
must-gather - https://drive.google.com/file/d/1zB2r-BRHjC12g17_Abc-xvtEqpJOopI5/view?usp=sharing
We reproduced it manually and waited around 15 minutes; the taint was still present.
Description of problem:
When the TelemeterClientFailures alert fires, there's no runbook link explaining the meaning of the alert and what to do about it.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Check the TelemeterClientFailures alerting rule's annotations 2. 3.
Actual results:
No runbook_url annotation.
Expected results:
runbook_url annotation is present.
Additional info:
This is a consequence of a telemeter server outage that triggered questions from customers about the alert: https://issues.redhat.com/browse/OHSS-25947 https://issues.redhat.com/browse/OCPBUGS-17966 Also in relation to https://issues.redhat.com/browse/OCPBUGS-17797
When adding a BMH with
spec:
  online: true
  customDeploy:
    method: install_coreos
after inspection the BMO will provision the node in ironic
but the node is now being created without any userdata/ignition data,
IPA's ironic_coreos_install then goes down a seldom-used path to create ignition from scratch; the created ignition is invalid and the node fails to boot after it is provisioned.
Boot stalls with an ignition error: "invalid config version (couldn't parse)"
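For reference, a minimal sketch of a BareMetalHost carrying the fields mentioned above; the host name, BMC address, credentials secret, and MAC address are hypothetical:

```
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: example-worker-0                 # hypothetical host
  namespace: openshift-machine-api
spec:
  online: true
  customDeploy:
    method: install_coreos
  bmc:
    address: redfish-virtualmedia://192.0.2.10/redfish/v1/Systems/1   # hypothetical BMC
    credentialsName: example-worker-0-bmc-secret                      # hypothetical secret
  bootMACAddress: 52:54:00:00:00:01                                   # hypothetical MAC
```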
Sync downstream with upstream
Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/853
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Prometheus write_relabel_configs in remoteWrite is unable to drop a metric in Grafana
Version-Release number of selected component (if applicable):
How reproducible:
The customer has tried both configurations to drop an MQ metric, with source_labels (configuration 1) and without source_labels (configuration 2), but neither works. It appears that the drop configuration is not applied properly.

Configuration 1:
```
remoteWrite:
- url: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
  write_relabel_configs:
  - source_labels: ['__name__']
    regex: 'ibmmq_qmgr_uptime'
    action: 'drop'
  basicAuth:
    username:
      name: kubepromsecret
      key: username
    password:
      name: kubepromsecret
      key: password
```

Configuration 2:
```
remoteWrite:
- url: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
  write_relabel_configs:
  - regex: 'ibmmq_qmgr_uptime'
    action: 'drop'
  basicAuth:
    username:
      name: kubepromsecret
      key: username
    password:
      name: kubepromsecret
      key: password
```

The customer wants to know the correct remote_write configuration to drop the metric before it is sent to Grafana.

Document links:
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write
https://docs.openshift.com/container-platform/4.14/observability/monitoring/configuring-the-monitoring-stack.html#configuring-remote-write-storage_configuring-the-monitoring-stack
https://docs.openshift.com/container-platform/4.14/observability/monitoring/configuring-the-monitoring-stack.html#creating-user-defined-workload-monitoring-configmap_configuring-the-monitoring-stack
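For comparison, a minimal sketch of the same drop rule using camelCase keys (writeRelabelConfigs / sourceLabels), on the assumption that the monitoring ConfigMap expects Prometheus Operator-style field names rather than raw Prometheus syntax; this is a sketch to verify against the linked docs, not a confirmed resolution of the bug:

```
# Sketch: drop rule expressed with camelCase keys (assumption: the
# monitoring config follows the Prometheus Operator remoteWrite spec
# rather than write_relabel_configs / source_labels).
remoteWrite:
  - url: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
    writeRelabelConfigs:
      - sourceLabels: ['__name__']
        regex: 'ibmmq_qmgr_uptime'
        action: 'drop'
    basicAuth:
      username:
        name: kubepromsecret
        key: username
      password:
        name: kubepromsecret
        key: password
```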
Steps to Reproduce:
1. 2. 3.
Actual results:
prometheus remote_write configurations NOT droppping metric in Grafana
Expected results:
prometheus remote_write configurations should drop metric in Grafana
Additional info:
Description of problem:
Using payload built with https://github.com/openshift/installer/pull/8666/ so that master instances can be provisioned from gen2 image, which is required when configuring security type in install-config. Enable TrustedLaunch security type in install-config: ================== controlPlane: architecture: amd64 hyperthreading: Enabled name: master platform: azure: encryptionAtHost: true settings: securityType: TrustedLaunch trustedLaunch: uefiSettings: secureBoot: Enabled virtualizedTrustedPlatformModule: Enabled Launch capi-based installation, installer failed after waiting 15min for machines to provision... INFO GalleryImage.ID=/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima08conf01-9vgq5-rg/providers/Microsoft.Compute/galleries/gallery_jima08conf01_9vgq5/images/jima08conf01-9vgq5 INFO GalleryImage.ID=/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima08conf01-9vgq5-rg/providers/Microsoft.Compute/galleries/gallery_jima08conf01_9vgq5/images/jima08conf01-9vgq5-gen2 INFO Created manifest *v1beta1.AzureMachine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-bootstrap INFO Created manifest *v1beta1.AzureMachine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-0 INFO Created manifest *v1beta1.AzureMachine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-1 INFO Created manifest *v1beta1.AzureMachine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-2 INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-bootstrap INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-0 INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-1 INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-2 INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-bootstrap INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master INFO Waiting up to 15m0s (until 6:26AM UTC) for machines [jima08conf01-9vgq5-bootstrap jima08conf01-9vgq5-master-0 jima08conf01-9vgq5-master-1 jima08conf01-9vgq5-master-2] to provision... ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: control-plane machines were not provisioned within 15m0s: client rate limiter Wait returned an error: context deadline exceeded INFO Shutting down local Cluster API control plane... 
INFO Stopped controller: Cluster API INFO Stopped controller: azure infrastructure provider INFO Stopped controller: azureaso infrastructure provider INFO Local Cluster API system has completed operations In openshift-install.log, time="2024-07-08T06:25:49Z" level=debug msg="\tfailed to reconcile AzureMachine: failed to reconcile AzureMachine service virtualmachine: failed to create or update resource jima08conf01-9vgq5-rg/jima08conf01-9vgq5-bootstrap (service: virtualmachine): PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima08conf01-9vgq5-rg/providers/Microsoft.Compute/virtualMachines/jima08conf01-9vgq5-bootstrap" time="2024-07-08T06:25:49Z" level=debug msg="\t--------------------------------------------------------------------------------" time="2024-07-08T06:25:49Z" level=debug msg="\tRESPONSE 400: 400 Bad Request" time="2024-07-08T06:25:49Z" level=debug msg="\tERROR CODE: BadRequest" time="2024-07-08T06:25:49Z" level=debug msg="\t--------------------------------------------------------------------------------" time="2024-07-08T06:25:49Z" level=debug msg="\t{" time="2024-07-08T06:25:49Z" level=debug msg="\t \"error\": {" time="2024-07-08T06:25:49Z" level=debug msg="\t \"code\": \"BadRequest\"," time="2024-07-08T06:25:49Z" level=debug msg="\t \"message\": \"Use of TrustedLaunch setting is not supported for the provided image. Please select Trusted Launch Supported Gen2 OS Image. For more information, see https://aka.ms/TrustedLaunch-FAQ.\"" time="2024-07-08T06:25:49Z" level=debug msg="\t }" time="2024-07-08T06:25:49Z" level=debug msg="\t}" time="2024-07-08T06:25:49Z" level=debug msg="\t--------------------------------------------------------------------------------" time="2024-07-08T06:25:49Z" level=debug msg=" > controller=\"azuremachine\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"AzureMachine\" AzureMachine=\"openshift-cluster-api-guests/jima08conf01-9vgq5-bootstrap\" namespace=\"openshift-cluster-api-guests\" name=\"jima08conf01-9vgq5-bootstrap\" reconcileID=\"bee8a459-c3c8-4295-ba4a-f3d560d6a68b\"" Looks like that capi-based installer missed to enable security features during creating gen2 image, which can be found in terraform code. https://github.com/openshift/installer/blob/master/data/data/azure/vnet/main.tf#L166-L169 Gen2 image definition created by terraform: $ az sig image-definition show --gallery-image-definition jima08conf02-4mrnz-gen2 -r gallery_jima08conf02_4mrnz -g jima08conf02-4mrnz-rg --query 'features' [ { "name": "SecurityType", "value": "TrustedLaunch" } ] It's empty when querying from gen2 image created by using CAPI. $ az sig image-definition show --gallery-image-definition jima08conf01-9vgq5-gen2 -r gallery_jima08conf01_9vgq5 -g jima08conf01-9vgq5-rg --query 'features' $
Version-Release number of selected component (if applicable):
4.17 payload built from cluster-bot with PR https://github.com/openshift/installer/pull/8666/
How reproducible:
Always
Steps to Reproduce:
1. Enable security type in install-config 2. Create cluster by using CAPI 3.
Actual results:
Install failed.
Expected results:
Install succeeded.
Additional info:
It impacts installation with security type ConfidentialVM or TrustedLaunch enabled.
Description of the problem:
Cluster installation with static configuration for IPv4 and IPv6.
Discovery completed, but without the configured IP addresses; installation aborted on bootstrap reboot.
https://redhat-internal.slack.com/archives/C02RD175109/p1727157947875779
Two issues:
#1 The static configuration is not applied because `autoconf: false` is missing.
It was working before, but it is now mandatory for IPv6.
#2 need to update test-infra code.
How reproducible:
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
Description of problem:
When we move one node from one custom MCP to another custom MCP, the MCPs are reporting a wrong number of nodes. For example, we reach this situation (worker-perf MCP is not reporting the right number of nodes) $ oc get mcp,nodes NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE machineconfigpool.machineconfiguration.openshift.io/master rendered-master-c8d23b071e1ccf6cf85c7f1b31c0def6 True False False 3 3 3 0 142m machineconfigpool.machineconfiguration.openshift.io/worker rendered-worker-36ee1fdc485685ac9c324769889c3348 True False False 1 1 1 0 142m machineconfigpool.machineconfiguration.openshift.io/worker-perf rendered-worker-perf-6b5fbffac62c3d437e307e849c44b556 True False False 2 2 2 0 24m machineconfigpool.machineconfiguration.openshift.io/worker-perf-canary rendered-worker-perf-canary-6b5fbffac62c3d437e307e849c44b556 True False False 1 1 1 0 7m52s NAME STATUS ROLES AGE VERSION node/ip-10-0-13-228.us-east-2.compute.internal Ready worker,worker-perf-canary 138m v1.30.4 node/ip-10-0-2-250.us-east-2.compute.internal Ready control-plane,master 145m v1.30.4 node/ip-10-0-34-223.us-east-2.compute.internal Ready control-plane,master 144m v1.30.4 node/ip-10-0-35-61.us-east-2.compute.internal Ready worker,worker-perf 136m v1.30.4 node/ip-10-0-79-232.us-east-2.compute.internal Ready control-plane,master 144m v1.30.4 node/ip-10-0-86-124.us-east-2.compute.internal Ready worker 139m v1.30.4 After 20 minutes or half an hour the MCPs start reporting the right number of nodes
Version-Release number of selected component (if applicable):
IPI on AWS version:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.17.0-0.nightly-2024-09-13-040101 True False 124m Cluster version is 4.17.0-0.nightly-2024-09-13-040101
How reproducible:
Always
Steps to Reproduce:
1. Create a MCP

oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-perf
spec:
  machineConfigSelector:
    matchExpressions:
      - { key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-perf] }
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-perf: ""
EOF

2. Add 2 nodes to the MCP

$ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf=
$ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[1].metadata.name}") node-role.kubernetes.io/worker-perf=

3. Create another MCP

oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-perf-canary
spec:
  machineConfigSelector:
    matchExpressions:
      - { key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-perf,worker-perf-canary] }
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-perf-canary: ""
EOF

4. Move one node from the MCP created in step 1 to the MCP created in step 3

$ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf-canary=
$ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf-
Actual results:
The worker-perf pool is not reporting the right number of nodes. It continues reporting 2 nodes even though one of them was moved to the worker-perf-canary MCP. $ oc get mcp,nodes NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE machineconfigpool.machineconfiguration.openshift.io/master rendered-master-c8d23b071e1ccf6cf85c7f1b31c0def6 True False False 3 3 3 0 142m machineconfigpool.machineconfiguration.openshift.io/worker rendered-worker-36ee1fdc485685ac9c324769889c3348 True False False 1 1 1 0 142m machineconfigpool.machineconfiguration.openshift.io/worker-perf rendered-worker-perf-6b5fbffac62c3d437e307e849c44b556 True False False 2 2 2 0 24m machineconfigpool.machineconfiguration.openshift.io/worker-perf-canary rendered-worker-perf-canary-6b5fbffac62c3d437e307e849c44b556 True False False 1 1 1 0 7m52s NAME STATUS ROLES AGE VERSION node/ip-10-0-13-228.us-east-2.compute.internal Ready worker,worker-perf-canary 138m v1.30.4 node/ip-10-0-2-250.us-east-2.compute.internal Ready control-plane,master 145m v1.30.4 node/ip-10-0-34-223.us-east-2.compute.internal Ready control-plane,master 144m v1.30.4 node/ip-10-0-35-61.us-east-2.compute.internal Ready worker,worker-perf 136m v1.30.4 node/ip-10-0-79-232.us-east-2.compute.internal Ready control-plane,master 144m v1.30.4 node/ip-10-0-86-124.us-east-2.compute.internal Ready worker 139m v1.30.4
Expected results:
MCPs should always report the right number of nodes
Additional info:
It is very similar to this other issue https://bugzilla.redhat.com/show_bug.cgi?id=2090436 That was discussed in this slack conversation https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1653479831004619
Description of problem:
1. Creating a Normal User:
```
$ oc create user test
user.user.openshift.io/test created
$ oc get user
NAME   UID                                    FULL NAME   IDENTITIES
test   cef90f53-715e-4c10-9e26-c431d31de8c3
```
This command worked as expected, and the user appeared correctly in both the CLI and the web console.

2. Using Special Characters:
```
$ oc create user test$*(
> test)
user.user.openshift.io/test( test) created
$ oc get user
NAME      UID                                    FULL NAME   IDENTITIES
test      cef90f53-715e-4c10-9e26-c431d31de8c3
test(...  50f2ad2b-1385-4b3c-b32c-b84531808864
```
In this case, the user was created successfully and displayed correctly in the web console as test( test). However, the CLI output was not as expected.

3. Handling Quoted Names:
```
$ oc create user test'
> test'
$ oc get user
NAME     UID                                    FULL NAME   IDENTITIES
test...  1fdaadf0-7522-4d38-9894-ee046a58d835
```
Similarly, creating a user with quotes produced a discrepancy: the CLI displayed test..., but the web console showed it as test test.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
Given in the description.
Actual results:
The user list is not displayed properly.
Expected results:
1. User should not be created with a line break. 2. If they are being created, then they should be displayed properly.
Additional info:
Please review the following PR: https://github.com/openshift/csi-operator/pull/270
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
After upgrading from 4.12 to 4.14, the customer reports that the pods cannot reach their service when a NetworkAttachmentDefinition is set.
How reproducible:
Create a NetworkAttachmentDefinition
Steps to Reproduce:
1. Create a pod with a service. 2. Curl the service from inside the pod. It works. 3. Create a NetworkAttachmentDefinition (see the example sketch below). 4. The same curl does not work.
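For illustration, a minimal NetworkAttachmentDefinition of the sort used in step 3; the CNI type, master interface, and addressing here are assumptions, not the customer's exact configuration:

```
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: example-macvlan       # hypothetical name
  namespace: example-ns       # hypothetical namespace
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth1",
      "ipam": {
        "type": "static",
        "addresses": [ { "address": "192.0.2.20/24" } ]
      }
    }
```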
Actual results:
Pod does not reach service
Expected results:
Pod reaches service
Additional info:
Specifically updating the bug overview for posterity: the specific issue is that we have pods set up with an exposed port (8080, though the port doesn't matter) and a service with one endpoint pointing to that specific pod. We can call OTHER PODS in the same namespace via their single-endpoint services, but we cannot call OURSELVES from inside the pod. The issue is with hairpin loopback return. It is not affected by NetworkPolicy and appears (as discovered later in this Jira) to be caused by asymmetric routing in the return path to the container after it leaves the local net. This behavior is only observed when a network-attachment-definition is added to the pod and appears to be an issue with the way route rules are defined. A workaround is available: inject a route into the container specifically, or modify the net-attach-def to ensure a loopback route is available in the container space.
KCS for this problem with workarounds + patch fix versions (when available): https://access.redhat.com/solutions/7084866
Description of problem:
Unable to deploy performance profile on multi nodepool hypershift cluster
Version-Release number of selected component (if applicable):
Server Version: 4.17.0-0.nightly-2024-07-28-191830 (management cluster) Server Version: 4.17.0-0.nightly-2024-08-08-013133 (hosted cluster)
How reproducible:
Always
Steps to Reproduce:
1. In a multi nodepool hypershift cluster, attach performance profile unique to each nodepool. 2. Check the configmap and nodepool status.
Actual results:
root@helix52:~# oc get cm -n clusters-foobar2 | grep foo
kubeletconfig-performance-foobar2   1   21h
kubeletconfig-pp2-foobar3           1   21h
machineconfig-performance-foobar2   1   21h
machineconfig-pp2-foobar3           1   21h
nto-mc-foobar2                      1   21h
nto-mc-foobar3                      1   21h
performance-foobar2                 1   21h
pp2-foobar3                         1   21h
status-performance-foobar2          1   21h
status-pp2-foobar3                  1   21h
tuned-performance-foobar2           1   21h
tuned-pp2-foobar3                   1   21h
root@helix52:~# oc get np
NAME      CLUSTER   DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION                         UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
foobar2   foobar2   2               2               False         False        4.17.0-0.ci-2024-08-08-225819   False             True
foobar3   foobar2   1               1               False         False        4.17.0-0.ci-2024-08-08-225819   False             True
Hypershift Pod logs - {"level":"debug","ts":"2024-08-14T08:54:27Z","logger":"events","msg":"there cannot be more than one PerformanceProfile ConfigMap status per NodePool. found: 2 NodePool: foobar3","type":"Warning","object":{"kind":"NodePool","namespace":"clusters","name":"foobar3","uid":"c2ba814a-31fe-409d-88c2-b4e6b9a41b26","apiVersion":"hypershift.openshift.io/v1beta1","resourceVersion":"6411003"},"reason":"ReconcileError"}
Expected results:
Performance profile should apply correctly on both node pools
Additional info:
Please review the following PR: https://github.com/openshift/cloud-provider-openstack/pull/294
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
There are two additional zones, syd05 and us-east(dal13) that have PER capabilities but are not present in the installer. Add them.
Version-Release number of selected component (if applicable):
4.18.0
Description of problem:
When running oc-mirror in mirror-to-disk mode in an air-gapped environment with `graph: true`, and with the UPDATE_URL_OVERRIDE environment variable defined, oc-mirror still reaches out to api.openshift.com to get graph.tar.gz. This causes the mirroring to fail, as this URL is not reachable from an air-gapped environment.
Version-Release number of selected component (if applicable):
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202407260908.p0.gdfed9f1.assembly.stream.el9-dfed9f1", GitCommit:"dfed9f10cd9aabfe3fe8dae0e6a8afe237c901ba", GitTreeState:"clean", BuildDate:"2024-07-26T09:52:14Z", GoVersion:"go1.21.11 (Red Hat 1.21.11-1.el9_4) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. Set up OSUS in a reachable network. 2. Cut all internet connectivity except for the mirror registry and the OSUS service. 3. Run oc-mirror in mirror-to-disk mode with graph:true in the imagesetconfig.
Actual results:
Expected results:
Should not fail
Additional info:
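A minimal sketch of the behaviour the report expects, assuming the override is read from the UPDATE_URL_OVERRIDE environment variable named above; the function and constant names are illustrative and not oc-mirror's actual code:

```go
package main

import (
	"fmt"
	"os"
)

// defaultGraphEndpoint is illustrative; the report only says oc-mirror
// reaches out to api.openshift.com for graph.tar.gz when no override is set.
const defaultGraphEndpoint = "https://api.openshift.com"

// graphEndpoint prefers the UPDATE_URL_OVERRIDE environment variable so
// air-gapped mirror-to-disk runs never contact the public update service.
func graphEndpoint() string {
	if override := os.Getenv("UPDATE_URL_OVERRIDE"); override != "" {
		return override
	}
	return defaultGraphEndpoint
}

func main() {
	fmt.Println("graph data endpoint:", graphEndpoint())
}
```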
Description of problem:
IBM Cloud CCM was reconfigured to use loopback as the bind address in 4.16. However, the liveness probe was not configured to use loopback too, so the CCM constantly fails the liveness probe and restarts continuously.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Create an IPI cluster on IBM Cloud. 2. Watch the IBM Cloud CCM pod; its restart count increases every 5 minutes (liveness probe timeout).
Actual results:
# oc --kubeconfig cluster-deploys/eu-de-4.17-rc2-3/auth/kubeconfig get po -n openshift-cloud-controller-manager
NAME                                            READY   STATUS             RESTARTS          AGE
ibm-cloud-controller-manager-58f7747d75-j82z8   0/1     CrashLoopBackOff   262 (39s ago)     23h
ibm-cloud-controller-manager-58f7747d75-l7mpk   0/1     CrashLoopBackOff   261 (2m30s ago)   23h

Normal   Killing      34m (x2 over 40m)    kubelet   Container cloud-controller-manager failed liveness probe, will be restarted
Normal   Pulled       34m (x2 over 40m)    kubelet   Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5ac9fb24a0e051aba6b16a1f9b4b3f9d2dd98f33554844953dd4d1e504fb301e" already present on machine
Normal   Created      34m (x3 over 45m)    kubelet   Created container cloud-controller-manager
Normal   Started      34m (x3 over 45m)    kubelet   Started container cloud-controller-manager
Warning  Unhealthy    29m (x8 over 40m)    kubelet   Liveness probe failed: Get "https://10.242.129.4:10258/healthz": dial tcp 10.242.129.4:10258: connect: connection refused
Warning  ProbeError   3m4s (x22 over 40m)  kubelet   Liveness probe error: Get "https://10.242.129.4:10258/healthz": dial tcp 10.242.129.4:10258: connect: connection refused body:
Expected results:
CCM runs continuously, as it does on 4.15:
# oc --kubeconfig cluster-deploys/eu-de-4.15.10-1/auth/kubeconfig get po -n openshift-cloud-controller-manager
NAME                                            READY   STATUS    RESTARTS   AGE
ibm-cloud-controller-manager-66d4779cb8-gv8d4   1/1     Running   0          63m
ibm-cloud-controller-manager-66d4779cb8-pxdrs   1/1     Running   0          63m
Additional info:
IBM Cloud have a PR open to fix the liveness probe. https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/360
Description of problem:
The BuildConfig form breaks when the Git URL is entered manually after selecting Git as the source type.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Navigate to the Create BuildConfig form page. 2. Select Git as the source type. 3. Type the Git URL manually; do not paste it or select it from the suggestions.
Actual results:
Console breaks
Expected results:
The console should not break, and the user should be able to create the BuildConfig.
Additional info:
This package is not used within MAPI, but its presence indicates that the operator needs permissions over VNets, specifically to delete VNets. This is a sensitive permission that, if exercised, could lead to an unrecoverable cluster, or to the deletion of other critical infrastructure within the same Azure subscription or resource group that is not related to the cluster itself. This package should be removed, as well as the relevant permissions from the CredentialsRequest.
Tracker issue for bootimage bump in 4.18. This issue should block issues which need a bootimage bump to fix.
Description of problem:
GCP destroy fails to acknowledge the deletion of forwarding rules that have already been removed. Did you intend to change the logic here? The new version appears to ignore the case where the error is http.StatusNotFound (i.e., the resource is already deleted).
time="2024-10-03T23:05:47Z" level=debug msg="Listing regional forwarding rules"
time="2024-10-03T23:05:47Z" level=debug msg="Deleting regional forwarding rule jstuever28743-9q9lk-api-internal"
time="2024-10-03T23:05:47Z" level=debug msg="Deleting regional forwarding rule a36027772a1a948d08721afe4e52fcd4"
time="2024-10-03T23:05:47Z" level=debug msg="Deleting regional forwarding rule a36027772a1a948d08721afe4e52fcd4"
time="2024-10-03T23:05:47Z" level=debug msg="Deleting regional forwarding rule jstuever28743-9q9lk-api-internal"
time="2024-10-03T23:05:47Z" level=debug msg="Listing global forwarding rules"
time="2024-10-03T23:05:47Z" level=debug msg="Deleting global forwarding rule a36027772a1a948d08721afe4e52fcd4"
time="2024-10-03T23:05:48Z" level=debug msg="Deleting global forwarding rule jstuever28743-9q9lk-api-internal"
time="2024-10-03T23:05:48Z" level=debug msg="Deleting global forwarding rule a36027772a1a948d08721afe4e52fcd4"
time="2024-10-03T23:05:48Z" level=debug msg="Deleting global forwarding rule jstuever28743-9q9lk-api-internal"
time="2024-10-03T23:05:48Z" level=debug msg="Listing target pools"
time="2024-10-03T23:05:48Z" level=debug msg="Listing instance groups"
time="2024-10-03T23:05:49Z" level=debug msg="Listing target tcp proxies"
time="2024-10-03T23:05:49Z" level=debug msg="Listing target tcp proxies"
time="2024-10-03T23:05:49Z" level=debug msg="Listing backend services"
time="2024-10-03T23:05:49Z" level=debug msg="Listing backend services"
time="2024-10-03T23:05:49Z" level=debug msg="Deleting backend service a36027772a1a948d08721afe4e52fcd4"
time="2024-10-03T23:05:49Z" level=info msg="Deleted backend service a36027772a1a948d08721afe4e52fcd4"
time="2024-10-03T23:05:49Z" level=debug msg="Backend services: 1 global backend service pending"
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Looping on destroy
Expected results:
Destroy successful
Additional info:
HIVE team found this bug.
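A minimal sketch of treating a 404 as "already deleted" so the destroy loop can converge, assuming the GCP client surfaces a *googleapi.Error; the helper names are illustrative and this is not the installer's actual code:

```go
package main

import (
	"errors"
	"log"
	"net/http"

	"google.golang.org/api/googleapi"
)

// isNotFound reports whether the GCP API said the resource no longer exists.
func isNotFound(err error) bool {
	var apiErr *googleapi.Error
	return errors.As(err, &apiErr) && apiErr.Code == http.StatusNotFound
}

// deleteForwardingRule wraps a delete call and treats a 404 as success, so a
// rule that was already removed is not retried forever.
func deleteForwardingRule(name string, doDelete func(string) error) error {
	err := doDelete(name)
	if err == nil || isNotFound(err) {
		log.Printf("forwarding rule %s deleted (or already gone)", name)
		return nil
	}
	return err
}

func main() {
	// Simulate a rule that was already removed out of band.
	alreadyGone := func(string) error {
		return &googleapi.Error{Code: http.StatusNotFound}
	}
	if err := deleteForwardingRule("example-rule", alreadyGone); err != nil {
		log.Fatal(err)
	}
}
```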
Description of problem:
Creating a C2S/SC2S cluster via Cluster API fails with the following error:
time="2024-05-06T00:57:17-04:00" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: infrastructure was not ready within 15m0s: timed out waiting for the condition"
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-05-102537
How reproducible:
Steps to Reproduce:
1. Install a C2S or an SC2S cluster via Cluster API
Actual results:
See description
Expected results:
Cluster could be created successfully on C2S/SC2S
Additional info:
Description of problem:
On the Administrator -> Observe -> Dashboards page, clicking the dropdown lists for "Time Range" and "Refresh Interval" gives no response. On the Observe -> Metrics page (for both Administrator and Developer perspectives), clicking the dropdown beside "Actions" (originally showing "Refresh off") also gives no response. The error "react-dom.production.min.js:101 Uncaught TypeError: r is not a function" appears in the F12 developer console.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-07-200953
How reproducible:
Always
Steps to Reproduce:
1. Refer to description 2. 3.
Actual results:
1. Dropdown list doesn't work well. There is error “react-dom.production.min.js:101 Uncaught TypeError: r is not a function” in F12 developer console.
Expected results:
1. Dropdown list should work fine.
Additional info:
Description of problem:
If multiple NICs are configured in install-config, the installer will provision nodes properly but will fail in bootstrap due to API validation. > 4.17 will support multiple NICs; < 4.17 will not and will fail.
Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: [#1672] failed to create some manifests:
Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: "cluster-infrastructure-02-config.yml": failed to create infrastructures.v1.config.openshift.io/cluster -n : Infrastructure.config.openshift.io "cluster" is invalid: [spec.platformSpec.vsphere.failureDomains[0].topology.networks: Too many: 2: must have at most 1 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Starting around the beginning of June, `-bm` (real baremetal) jobs started exhibiting a high failure rate. OCPBUGS-33255 was mentioned as a potential cause, but this was filed much earlier.
The start date for this is pretty clear in Sippy, chart here:
Example job run:
More job runs
Slack thread:
https://redhat-internal.slack.com/archives/C01CQA76KMX/p1722871253737309
Affecting these tests:
install should succeed: overall
install should succeed: cluster creation
install should succeed: bootstrap
Description of problem:
The test case occasionally flakes:
--- FAIL: TestRunGraph (1.04s)
    --- FAIL: TestRunGraph/mid-task_cancellation_with_work_in_queue_does_not_deadlock (0.01s)
        task_graph_test.go:943: unexpected error: [context canceled context canceled]
Version-Release number of selected component (if applicable):
Reproducible with current CVO git master revision 00d0940531743e6a0e8bbba151f68c9031bf0df6
How reproducible:
Reproduces well with --race and multiple iterations.
Steps to Reproduce:
1. go test --count 30 --race ./pkg/payload/...
Actual results:
Some failures
Expected results:
no failures
Additional info:
I have seen this occasionally flake over the last few months and finally isolated it, but I didn't feel like digging into the timing-sensitive test code, so I'm at least filing it.
Description of problem:
On the pages under "Observe" -> "Alerting", "Not found" is shown when no resources are found.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-11-082305
How reproducible:
Steps to Reproduce:
1. Check the tabs under "Observe" -> "Alerting" ("Alerts", "Silences", "Alerting rules") when there are no related resources. 2. 3.
Actual results:
1. 'Not found' is shown under each tab.
Expected results:
1. It would be better to show "No <resource> found", like other resource pages do, e.g. "No Deployments found".
Additional info:
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/162
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In some cases, the tmp files for the resolved prepender are not removed on on-prem platforms.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
When deploying Shift on Stack, check /tmp; we should not see any tmp.XXX files anymore.
Actual results:
tmp files are there
Expected results:
tmp files are removed when not needed anymore
Please review the following PR: https://github.com/openshift/csi-operator/pull/242
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/openshift-state-metrics/pull/118
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/machine-api-provider-openstack/pull/125
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Component Readiness has found a potential regression in the following test:
[sig-node][apigroup:config.openshift.io] CPU Partitioning node validation should have correct cpuset and cpushare set in crio containers [Suite:openshift/conformance/parallel]
Probability of significant regression: 100.00%
Sample (being evaluated) Release: 4.18
Start Time: 2024-08-14T00:00:00Z
End Time: 2024-08-21T23:59:59Z
Success Rate: 94.89%
Successes: 128
Failures: 7
Flakes: 2
Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 647
Failures: 0
Flakes: 15
The test is permafailing on the latest payloads on multiple platforms, not just Azure. It does seem to coincide with the arrival of the 4.18 RHCOS images.
{ fail [github.com/openshift/origin/test/extended/cpu_partitioning/crio.go:166]: error getting crio container data from node ci-op-z5sh003f-431b2-r2nm4-master-0
Unexpected error:
    <*errors.errorString | 0xc001e80190>:
    err execing command jq: error (at <stdin>:1): Cannot index array with string "info"
    jq: error (at <stdin>:1): Cannot iterate over null (null)
    {
        s: "err execing command jq: error (at <stdin>:1): Cannot index array with string \"info\"\njq: error (at <stdin>:1): Cannot iterate over null (null)",
    }
occurred
Ginkgo exit error 1: exit with code 1}
The script involved is likely in: https://github.com/openshift/origin/blob/a365380cb3a39cfc26b9f28f04b66418c993a879/test/extended/cpu_partitioning/crio.go#L4
Nightly payloads are fully blocked as multiple blocking aggregated jobs are permafailing this test.
Example failed test:
4/1291 Tests Failed: user system:serviceaccount:openshift-infra:serviceaccount-pull-secrets-controller in ns/openshift-infra must not produce too many applies {had 7618 applies, check the audit log and operator log to figure out why; details in audit log}
Description of problem:
Some references pointed to files that did not exist, e.g., `NetworkPolicyListPage` in `console-app` and `functionsComponent` in `knative-plugin`.
Release number of selected component (if applicable):
4.18.0
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The TestNodePoolReplaceUpgrade e2e test on OpenStack is experiencing common failures like this one: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_hypershift/4515/pull-ci-openshift-hypershift-main-e2e-openstack/1849445285156098048
After investigating this failure, it looks like the imageRollout on OpenStack completes instantly, which gives the nodepool very little time between the node becoming ready and the nodepool status version being set. The short window causes a failure on this check: https://github.com/openshift/hypershift/blob/6f6a78b7ff2932087b47609c5a16436bad5aeb1c/test/e2e/nodepool_upgrade_test.go#L166
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Flaky test
Steps to Reproduce:
1. Run the openstack e2e 2. 3.
Actual results:
TestNodePoolReplaceUpgrade fails
Expected results:
TestNodePoolReplaceUpgrade passes
Additional info:
Please review the following PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/221
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
On "Search" page, search resource Node and filter with label, the filter doesn't work. Similarly, click label in "Node selector" field on one mcp detail page, it won't filter out nodes with this label.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-04-08-024331
How reproducible:
always
Steps to Reproduce:
1. On "Search" page, choose "Node(core/v1)" resource, filter with any label, eg "test=node","node-role.kubernetes.io/worker" 2. On one mcp details page, click label in "Node selector" field on one mcp detail page. 3.
Actual results:
1. The label filter doesn't work. 2. Nodes are listed without being filtered by label.
Expected results:
1. Nodes should be filtered by label. 2. Only nodes with the label should be shown.
Additional info:
Screenshot: https://drive.google.com/drive/folders/1XZh4MTOzgrzZKIT6HcZ44HFAAip3ENwT?usp=drive_link
slack thread: https://redhat-internal.slack.com/archives/C058TF9K37Z/p1722890745089339?thread_ts=1722872764.429919&cid=C058TF9K37Z
Investigate what happens when machines are deleted when cluster is paused
Description of problem:
The test tries to schedule pods on all workers but fails to schedule on infra nodes:
Warning  FailedScheduling  86s  default-scheduler  0/9 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 6 node(s) didn't match pod anti-affinity rules. preemption: 0/9 nodes are available: 3 Preemption is not helpful for scheduling, 6 No preemption victims found for incoming pod.
$ oc get nodes
NAME                          STATUS   ROLES                  AGE   VERSION
ostest-b6fns-infra-0-m4v7t    Ready    infra,worker           19h   v1.30.4
ostest-b6fns-infra-0-pllsf    Ready    infra,worker           19h   v1.30.4
ostest-b6fns-infra-0-vnbp8    Ready    infra,worker           19h   v1.30.4
ostest-b6fns-master-0         Ready    control-plane,master   19h   v1.30.4
ostest-b6fns-master-2         Ready    control-plane,master   19h   v1.30.4
ostest-b6fns-master-lmlxf-1   Ready    control-plane,master   17h   v1.30.4
ostest-b6fns-worker-0-h527q   Ready    worker                 19h   v1.30.4
ostest-b6fns-worker-0-kpvdx   Ready    worker                 19h   v1.30.4
ostest-b6fns-worker-0-xfcjf   Ready    worker                 19h   v1.30.4
Infra nodes should be excluded from the worker nodes used by the test.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-09-09-173813
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
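A minimal sketch of how the test could exclude infra nodes when selecting workers, assuming a client-go clientset; the selector string and helper name are illustrative, not the test's actual code:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// schedulableWorkers lists nodes labelled as workers while excluding nodes
// that also carry the infra role, so the test only targets nodes its pods
// can actually land on.
func schedulableWorkers(ctx context.Context, client kubernetes.Interface) ([]string, error) {
	opts := metav1.ListOptions{
		// "key" means the label must exist, "!key" means it must not.
		LabelSelector: "node-role.kubernetes.io/worker,!node-role.kubernetes.io/infra",
	}
	nodes, err := client.CoreV1().Nodes().List(ctx, opts)
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(nodes.Items))
	for _, n := range nodes.Items {
		names = append(names, n.Name)
	}
	return names, nil
}

func main() {
	fmt.Println("see schedulableWorkers; wire in a real clientset to use it")
}
```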
Description of problem:
The operator cannot successfully remove resources when networkAccess is set to Internal and the management state is then set to Removed. It looks like the authorization error changes from bloberror.AuthorizationPermissionMismatch to bloberror.AuthorizationFailure after the storage account becomes private (networkAccess: Internal). This is caused either by odd behavior in the Azure SDK or in the Azure API itself. The easiest way to solve it is to also handle bloberror.AuthorizationFailure here: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/storage/azure/azure.go?plain=1#L1145
The error condition is the following:
status:
  conditions:
  - lastTransitionTime: "2024-09-27T09:04:20Z"
    message: "Unable to delete storage container: DELETE https://imageregistrywxj927q6bpj.blob.core.windows.net/wxj-927d-jv8fc-image-registry-rwccleepmieiyukdxbhasjyvklsshhee\n--------------------------------------------------------------------------------\nRESPONSE 403: 403 This request is not authorized to perform this operation.\nERROR CODE: AuthorizationFailure\n--------------------------------------------------------------------------------\n\uFEFF<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>AuthorizationFailure</Code><Message>This request is not authorized to perform this operation.\nRequestId:ababfe86-301e-0005-73bd-10d7af000000\nTime:2024-09-27T09:10:46.1231255Z</Message></Error>\n--------------------------------------------------------------------------------\n"
    reason: AzureError
    status: Unknown
    type: StorageExists
  - lastTransitionTime: "2024-09-27T09:02:26Z"
    message: The registry is removed
    reason: Removed
    status: "True"
    type: Available
Version-Release number of selected component (if applicable):
4.18, 4.17, 4.16 (needs confirmation), 4.15 (needs confirmation)
How reproducible:
Always
Steps to Reproduce:
1. Get an Azure cluster 2. In the operator config, set networkAccess to Internal 3. Wait until the operator reconciles the change (watch networkAccess in status with `oc get configs.imageregistry/cluster -oyaml |yq '.status.storage'`) 4. In the operator config, set management state to removed: `oc patch configs.imageregistry/cluster -p '{"spec":{"managementState":"Removed"}}' --type=merge` 5. Watch the cluster operator conditions for the error
Actual results:
Expected results:
Additional info:
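A minimal sketch of the suggested fix, assuming the Azure SDK's bloberror helpers; the function name is illustrative and this is not the operator's actual code:

```go
package azure

import (
	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/bloberror"
)

// isAuthorizationError treats both error codes the same way, because the
// service starts returning AuthorizationFailure instead of
// AuthorizationPermissionMismatch once the storage account is private
// (networkAccess: Internal).
func isAuthorizationError(err error) bool {
	return bloberror.HasCode(err,
		bloberror.AuthorizationPermissionMismatch,
		bloberror.AuthorizationFailure,
	)
}
```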
Description of problem:
4.17: [VSphereCSIDriverOperator] [Upgrade] VMwareVSphereControllerDegraded: runtime error: invalid memory address or nil pointer dereference
A UPI-installed vSphere cluster upgrade failed because the Cluster Storage Operator degraded. Upgrade path: 4.8 -> 4.17.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-10-12-174022
How reproducible:
Always
Steps to Reproduce:
1. Install the OCP cluster on vSphere by UPI with version 4.8. 2. Upgrade the cluster to 4.17 nightly.
Actual results:
In Step 2: The upgrade failed from path 4.16 to 4.17.
Expected results:
In Step 2: The upgrade should be successful.
Additional info:
$ omc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-10-12-102620 True True 1h8m Unable to apply 4.17.0-0.nightly-2024-10-12-174022: wait has exceeded 40 minutes for these operators: storage $ omc get co storage NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE storage 4.17.0-0.nightly-2024-10-12-174022 True True True 15h $ omc get co storage -oyaml ... status: conditions: - lastTransitionTime: "2024-10-13T17:22:06Z" message: |- VSphereCSIDriverOperatorCRDegraded: VMwareVSphereControllerDegraded: panic caught: VSphereCSIDriverOperatorCRDegraded: VMwareVSphereControllerDegraded: runtime error: invalid memory address or nil pointer dereference reason: VSphereCSIDriverOperatorCR_VMwareVSphereController_SyncError status: "True" type: Degraded ... $ omc logs vmware-vsphere-csi-driver-operator-5c7db457-nffp4|tail -n 50 2024-10-13T19:00:02.531545739Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).runWorker(0xc000ab7e00?, {0x3900f30?, 0xc0000b9ae0?}) 2024-10-13T19:00:02.531545739Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:183 +0x4d 2024-10-13T19:00:02.531545739Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run.func2() 2024-10-13T19:00:02.531545739Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:117 +0x65 2024-10-13T19:00:02.531545739Z created by github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run in goroutine 500 2024-10-13T19:00:02.531545739Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:112 +0x2c9 2024-10-13T19:00:02.534308382Z I1013 19:00:02.532858 1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-csi-drivers", Name:"vmware-vsphere-csi-driver-operator", UID:"e44ce388-4878-4400-afae-744530b62281", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'Vmware-Vsphere-Csi-Driver-OperatorPanic' Panic observed: runtime error: invalid memory address or nil pointer dereference 2024-10-13T19:00:03.532125885Z E1013 19:00:03.532044 1 config_yaml.go:208] Unmarshal failed: yaml: unmarshal errors: 2024-10-13T19:00:03.532125885Z line 1: cannot unmarshal !!seq into config.CommonConfigYAML 2024-10-13T19:00:03.532498631Z I1013 19:00:03.532460 1 config.go:272] ReadConfig INI succeeded. INI-based cloud-config is deprecated and will be removed in 2.0. Please use YAML based cloud-config. 
2024-10-13T19:00:03.532708025Z I1013 19:00:03.532571 1 config.go:283] Config initialized 2024-10-13T19:00:03.533270439Z E1013 19:00:03.533160 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) 2024-10-13T19:00:03.533270439Z goroutine 701 [running]: 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/runtime.logPanic({0x2cf3100, 0x54fd210}) 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/runtime/runtime.go:75 +0x85 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0xc0014c54e8, 0x1, 0xc000e7e1c0?}) 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/runtime/runtime.go:49 +0x6b 2024-10-13T19:00:03.533270439Z panic({0x2cf3100?, 0x54fd210?}) 2024-10-13T19:00:03.533270439Z runtime/panic.go:770 +0x132 2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller.(*VSphereController).createVCenterConnection(0xc0008b2788, {0xc0022cf600?, 0xc0014c57c0?}, 0xc0006a3448) 2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller/vspherecontroller.go:491 +0x94 2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller.(*VSphereController).loginToVCenter(0xc0008b2788, {0x3900f30, 0xc0000b9ae0}, 0x3377a7c?) 2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller/vspherecontroller.go:446 +0x5e 2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller.(*VSphereController).sync(0xc0008b2788, {0x3900f30, 0xc0000b9ae0}, {0x38ee700, 0xc0011d08d0}) 2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller/vspherecontroller.go:240 +0x6fc 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).reconcile(0xc000b3ecf0, {0x3900f30, 0xc0000b9ae0}, {0x38ee700?, 0xc0011d08d0?}) 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:201 +0x43 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).processNextWorkItem(0xc000b3ecf0, {0x3900f30, 0xc0000b9ae0}) 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:260 +0x1ae 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).runWorker.func1({0x3900f30, 0xc0000b9ae0}) 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:192 +0x89 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1() 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:259 +0x1f 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc002bb1e80?) 
2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:226 +0x33 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0014c5f10, {0x38cf7e0, 0xc00142b470}, 0x1, 0xc0013ae960) 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:227 +0xaf 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00115bf10, 0x3b9aca00, 0x0, 0x1, 0xc0013ae960) 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:204 +0x7f 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext({0x3900f30, 0xc0000b9ae0}, 0xc00115bf70, 0x3b9aca00, 0x0, 0x1) 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:259 +0x93 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.UntilWithContext(...) 2024-10-13T19:00:03.533270439Z k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:170 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).runWorker(0xc000ab7e00?, {0x3900f30?, 0xc0000b9ae0?}) 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:183 +0x4d 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run.func2() 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:117 +0x65 2024-10-13T19:00:03.533270439Z created by github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run in goroutine 500 2024-10-13T19:00:03.533270439Z github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:112 +0x2c9
Starting about 5/24 or 5/25, we see a massive increase in the number of watch establishments from all clients to the kube-apiserver during non-upgrade jobs. While this could theoretically mean that every single client merged a bug on the same day, the more likely explanation is that the Kubernetes update exposed or introduced some kind of bug.
This is a clear regression and it is only present on 4.17, not 4.16. It is present across all platforms, though I've selected AWS for links and screenshots.
slack thread if there are questions
courtesy screen shot
Our CI job is currently down with an error in CAPO (our CAPI provider): https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.18-periodics-e2e-openstack-conformance/1855132754782457856/artifacts/e2e-openstack-conformance/dump/artifacts/namespaces/clusters-0183964f0514bc3aee5c/core/pods/logs/capi-provider-5988b8b87c-q5zwq-manager.log
We are missing a CRD and we probably need to add it
Description of problem:
After changing the LB type from CLB to NLB, "status.endpointPublishingStrategy.loadBalancer.providerParameters.aws.classicLoadBalancer" is still present, but if a new NLB ingresscontroller is created, "classicLoadBalancer" does not appear.
// after changing default ingresscontroller to NLB
$ oc -n openshift-ingress-operator get ingresscontroller/default -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
classicLoadBalancer:          <<<<
  connectionIdleTimeout: 0s   <<<<
networkLoadBalancer: {}
type: NLB
// create new ingresscontroller with NLB
$ oc -n openshift-ingress-operator get ingresscontroller/nlb -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
networkLoadBalancer: {}
type: NLB
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-08-013133
How reproducible:
100%
Steps to Reproduce:
1. Change the default ingresscontroller to NLB:
$ oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"type":"LoadBalancerService","loadBalancer":{"providerParameters":{"type":"AWS","aws":{"type":"NLB"}},"scope":"External"}}}}'
2. Create a new ingresscontroller with NLB:
kind: IngressController
apiVersion: operator.openshift.io/v1
metadata:
  name: nlb
  namespace: openshift-ingress-operator
spec:
  domain: nlb.<base-domain>
  replicas: 1
  endpointPublishingStrategy:
    loadBalancer:
      providerParameters:
        aws:
          type: NLB
        type: AWS
      scope: External
    type: LoadBalancerService
3. Check the status of both ingresscontrollers.
Actual results:
// after changing default ingresscontroller to NLB
$ oc -n openshift-ingress-operator get ingresscontroller/default -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
classicLoadBalancer:
  connectionIdleTimeout: 0s
networkLoadBalancer: {}
type: NLB
// new ingresscontroller with NLB
$ oc -n openshift-ingress-operator get ingresscontroller/nlb -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
networkLoadBalancer: {}
type: NLB
Expected results:
If type=NLB, then "classicLoadBalancer" should not appear in the status, and the status should be consistent whether an existing ingresscontroller is changed to NLB or a new one is created with NLB.
Additional info:
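A minimal sketch of the expected status handling, assuming the openshift/api operator v1 types; this is illustrative, not the ingress operator's actual reconciliation code:

```go
package ingress

import (
	operatorv1 "github.com/openshift/api/operator/v1"
)

// normalizeAWSStatus drops the stale classicLoadBalancer block from status
// once the effective load balancer type is NLB, so a converted default
// ingresscontroller reports the same shape as a freshly created NLB one.
func normalizeAWSStatus(aws *operatorv1.AWSLoadBalancerParameters) {
	if aws == nil {
		return
	}
	if aws.Type == operatorv1.AWSNetworkLoadBalancer {
		aws.ClassicLoadBalancerParameters = nil
	}
}
```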
Description of problem:
Compared with the same behavior on OCP 4.17, the shortname search function on OCP 4.18 is not working.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-16-094159
How reproducible:
Always
Steps to Reproduce:
1. Create a CRD resource with https://github.com/medik8s/fence-agents-remediation/blob/main/config/crd/bases/fence-agents-remediation.medik8s.io_fenceagentsremediationtemplates.yaml 2. Navigate to the Home -> Search page. 3. Use the shortname 'FAR' to search for the created resource 'FenceAgentsRemediationTemplates'. 4. Search with the shortname 'AM', for example.
Actual results:
3. "No results found" is returned. 4. The first result in the dropdown list is 'Config (sample.operator.openshift)', which is incorrect.
Expected results:
3. The resource 'FenceAgentsRemediationTemplates' should be listed in the dropdown. 4. The first result in the dropdown should be 'Alertmanager'.
Additional info:
Please review the following PR: https://github.com/openshift/frr/pull/64
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
{ fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:134]: user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta3.flowcontrol.apiserver.k8s.io 6 times
All jobs failed on https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-aws-ovn-upgrade-4.18-minor-release-openshift-release-analysis-aggregator/1846018782808510464
Please review the following PR: https://github.com/openshift/machine-api-operator/pull/1283
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-api-provider-azure/pull/319
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-azure-cluster-api-controllers-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Description of problem:
When deploying 4.16, the customer identified an inbound-rule security risk for the "node" security group allowing access from 0.0.0.0/0 to the node port range 30000-32767. This issue did not exist in versions prior to 4.16, and we suspect it may be a regression. It seems to be related to the use of CAPI, which could have changed the behavior. We are trying to understand why this was allowed.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Steps to Reproduce:
1. Install 4.16 cluster *** On 4.12 installations, this is not the case ***
Actual results:
The installer configures an inbound rule for the node security group allowing access from 0.0.0.0/0 for port range 30000-32767.
Expected results:
The installer should *NOT* create an inbound security rule allowing access to node port range 30000-32767 from any CIDR range (0.0.0.0/0)
Additional info:
#forum-ocp-cloud slack discussion: https://redhat-internal.slack.com/archives/CBZHF4DHC/p1728484197441409
Relevant Code :
https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/v2.4.0/pkg/cloud/services/securitygroup/securitygroups.go#L551
Description of problem:
Despite passing in '--attach-default-network false', the nodepool still has attachDefaultNetwork: true
hcp create cluster kubevirt --name ocp-lab-int-6 --base-domain paas.com --cores 6 --memory 64Gi --additional-network "name:default/ppcore-547" --attach-default-network false --cluster-cidr 100.64.0.0/20 --service-cidr 100.64.16.0/20 --network-type OVNKubernetes --node-pool-replicas 3 --ssh-key ~/deploy --pull-secret pull-secret.txt --release-image quay.io/openshift-release-dev/ocp-release:4.16.18-x86_64

platform:
  kubevirt:
    additionalNetworks:
    - name: default/ppcore-547
    attachDefaultNetwork: true
Version-Release number of selected component (if applicable):
Client Version: openshift/hypershift: b9e977da802d07591cd9fb8ad91ba24116f4a3a8. Latest supported OCP: 4.17.0 Server Version: b9e977da802d07591cd9fb8ad91ba24116f4a3a8 Server Supports OCP Versions: 4.17, 4.16, 4.15, 4.14
How reproducible:
Steps to Reproduce:
1. hcp install as per the above 2. 3.
Actual results:
The default network is attached
Expected results:
No default network
Additional info:
Description of problem:
When running on a FIPS-enabled cluster, the e2e test TestFirstBootHasSSHKeys times out.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Open a PR to the MCO repository. 2. Run the e2e-aws-ovn-fips-op job by commenting /test e2e-aws-ovn-fips-op (this job does not run automatically). 3. Eventually, the test will fail.
Actual results:
=== RUN TestFirstBootHasSSHKeys1065mcd_test.go:1019: did not get new node --- FAIL: TestFirstBootHasSSHKeys (1201.83s)
Expected results:
=== RUN TestFirstBootHasSSHKeys mcd_test.go:929: Got ssh key file data: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX --- PASS: TestFirstBootHasSSHKeys (334.86s)
Additional info:
It looks like we're hitting a 20-minute timeout during the test. By comparison, the passing case executes in approximately 5.5 minutes. I have two preliminary hypotheses for this: 1. This operation takes longer in FIPS-enabled clusters for some reason. 2. It is possible that this is occurring due to a difference in which cloud these tests run on. Our normal e2e-gcp-op tests run in GCP, whereas this test suite runs in AWS. The underlying operations performed by the Machine API may just take longer in AWS than they do in GCP. If that is the case, this bug can be resolved as-is.
Must-Gather link: https://drive.google.com/file/d/12GhTIP9bgcoNje0Jvyhr-c-akV3XnGn2/view?usp=sharing
Error from SNYK code:
✗ [High] Cross-site Scripting (XSS)
Path: ignition-server/cmd/start.go, line 250
Info: Unsanitized input from an HTTP header flows into Write, where it is used to render an HTML page returned to the user. This may result in a Reflected Cross-Site Scripting attack (XSS).
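A minimal sketch of the usual remediation, escaping the reflected header value before writing it into the response; the handler and header name below are hypothetical, not the actual ignition-server code:

```go
package main

import (
	"fmt"
	"html"
	"net/http"
)

// ignitionHandler echoes a request header back to the client. Escaping the
// value prevents header-controlled markup from being interpreted as HTML.
func ignitionHandler(w http.ResponseWriter, r *http.Request) {
	// Hypothetical header; the SNYK finding only says "an HTTP header".
	requested := r.Header.Get("TargetConfigVersionHash")
	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	fmt.Fprintf(w, "<p>unknown config version: %s</p>", html.EscapeString(requested))
}

func main() {
	http.HandleFunc("/ignition", ignitionHandler)
	_ = http.ListenAndServe(":8080", nil)
}
```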
Enabling FIPS results in an error during machine-os-images /bin/copy-iso:
/bin/copy-iso: line 29: [: missing `]'
We need the ability to define a different image for the iptables CLI image because it is located in the data plane.
Description of problem:
The namespace value on the Ingress details page is incorrect.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-10-234322
How reproducible:
Always
Steps to Reproduce:
1. Create a sample Ingress in the default namespace. 2. Navigate to Networking -> Ingresses -> the Ingress details page (/k8s/ns/default/ingresses/<ingress sample name>). 3. Check the Namespace value.
Actual results:
It shows the Ingress name, which is incorrect.
Expected results:
It should show the namespace stored in metadata.namespace.
Additional info:
Description of problem:
In the OpenShift WebConsole, when using the Instantiate Template screen, the values entered into the form are automatically cleared. This issue occurs for users with developer roles who do not have administrator privileges, but does not occur for users with the cluster-admin cluster role. Additionally, using the developer tools of the web browser, I observed the following console logs when the values were cleared:
https://console-openshift-console.apps.mmatsuta-blue.apac.aws.cee.support/api/prometheus/api/v1/rules 403 (Forbidden)
https://console-openshift-console.apps.mmatsuta-blue.apac.aws.cee.support/api/alertmanager/api/v2/silences 403 (Forbidden)
It appears that a script attempting to fetch information periodically from PrometheusRule and Alertmanager's silences encounters a 403 error due to insufficient permissions, which causes the script to halt and the values in the form to be reset and cleared. This bug prevents users from successfully creating instances from templates in the WebConsole.
Version-Release number of selected component (if applicable):
4.15 4.14
How reproducible:
YES
Steps to Reproduce:
1. Log in with a non-administrator account. 2. Select a template from the developer catalog and click on Instantiate Template. 3. Enter values into the initially empty form. 4. Wait for several seconds, and the entered values will disappear.
Actual results:
Entered values disappear.
Expected results:
Entered values remain in the form.
Additional info:
I could not find the appropriate component to report this issue. I reluctantly chose Dev Console, but please adjust it to the correct component.
Router pods use the "hostnetwork" SCC even when they do not use the host network.
All versions of OpenShift from 4.11 through 4.17.
100%.
1. Install a new cluster with OpenShift 4.11 or later on a cloud platform.
The router-default pods do not use the host network, yet they use the "hostnetwork" SCC:
% oc -n openshift-ingress get pods -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default -o go-template --template='{{range .items}}{{.metadata.name}} {{with .metadata.annotations}}{{index . "openshift.io/scc"}}{{end}} {{.spec.hostNetwork}}{{"\n"}}{{end}}'
router-default-5ffd4ff7cd-mhhv6 hostnetwork <no value>
router-default-5ffd4ff7cd-wmqnj hostnetwork <no value>
%
The router-default pods should use the "restricted" SCC.
We missed this change from the OCP 4.11 release notes:
The restricted SCC is no longer available to users of new clusters, unless the access is explicitly granted. In clusters originally installed in OpenShift Container Platform 4.10 or earlier, all authenticated users can use the restricted SCC when upgrading to OpenShift Container Platform 4.11 and later.
Artifacts from CI jobs confirm that router pods used "restricted" for new 4.10 clusters and for 4.10→4.11 upgraded clusters, and "hostnetwork" for new 4.11 clusters:
% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1790552355406614528/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]' "restricted" "restricted" % curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial/1790422949342220288/artifacts/e2e-aws-serial/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]' "restricted" "restricted" % curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1793013806733987840/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]' "restricted" "restricted" % curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-serial/1793013781534609408/artifacts/e2e-aws-serial/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]' "hostnetwork" "hostnetwork" % curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade/1793670820518694912/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]' "hostnetwork" "hostnetwork" % curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-serial/1793670819998601216/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]' "hostnetwork" "hostnetwork" % curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-serial/1793062832263139328/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]' "hostnetwork" "hostnetwork" %
Description of problem:
The disk and instance types for GCP machines should be validated further. The current implementation validates each individually, but the disk types and instance types should also be checked against each other for valid combinations. The attached spreadsheet displays the combinations of valid disk and instance types.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
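A minimal sketch of cross-validating the two fields with a lookup table; the table entries below are placeholders, not the support matrix from the attached spreadsheet:

```go
package main

import (
	"fmt"
	"strings"
)

// supportedDiskTypes maps an instance family prefix to the disk types it
// accepts. The entries are illustrative placeholders only.
var supportedDiskTypes = map[string][]string{
	"n2": {"pd-balanced", "pd-ssd", "pd-standard"},
	"c3": {"pd-balanced", "pd-ssd"},
}

// validateCombination checks the disk type against the instance family
// instead of validating each field in isolation.
func validateCombination(instanceType, diskType string) error {
	family := strings.SplitN(instanceType, "-", 2)[0]
	disks, ok := supportedDiskTypes[family]
	if !ok {
		return fmt.Errorf("unknown instance family %q", family)
	}
	for _, d := range disks {
		if d == diskType {
			return nil
		}
	}
	return fmt.Errorf("disk type %q is not supported by instance type %q", diskType, instanceType)
}

func main() {
	fmt.Println(validateCombination("n2-standard-4", "pd-ssd"))
}
```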
Please review the following PR: https://github.com/openshift/cloud-provider-openstack/pull/296
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
1. We are making two API calls to get the logs for PipelineRuns. Instead, we can use the `results.tekton.dev/record` annotation and replace `records` in the annotation value with `logs` to get the logs of the PipelineRuns (see the sketch below). 2. Tekton Results returns only the v1 version of PipelineRun and TaskRun from Pipelines 1.16, so the data type has to be v1 for 1.16 and v1beta1 for lower versions.
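A minimal sketch of deriving the logs endpoint from the annotation by swapping the path segment, as the first point suggests; written in Go purely for illustration, since the actual change lands in the console frontend, and the sample annotation value is hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// logsPathFromRecord turns a Tekton Results record path taken from the
// results.tekton.dev/record annotation into the matching logs path by
// replacing the "records" segment with "logs".
func logsPathFromRecord(recordPath string) string {
	return strings.Replace(recordPath, "/records/", "/logs/", 1)
}

func main() {
	// Hypothetical annotation value, shaped like <parent>/results/<result>/records/<record>.
	record := "ns/results/0e6a5bd1/records/4c5e9f2a"
	fmt.Println(logsPathFromRecord(record)) // ns/results/0e6a5bd1/logs/4c5e9f2a
}
```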
Description of problem:
documentationBaseURL still points to 4.17
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-23-112324
How reproducible:
Always
Steps to Reproduce:
1. Check documentationBaseURL on a 4.18 cluster:
$ oc get cm console-config -n openshift-console -o yaml | grep documentation
documentationBaseURL: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.17/
Actual results:
documentationBaseURL still links to 4.17
Expected results:
documentationBaseURL should link to 4.18
Additional info:
Description of the problem:
Unbinding s390x (Z) hosts no longer reboots them into discovery. Instead the reclaim agent runs on the node and continuously reboots them.
How reproducible:
Steps to reproduce:
1. Boot Z hosts with discovery image and install them to a cluster (original issue did so with hypershift)
2. Unbind the hosts from the cluster (original issue scaled down nodepool) and watch as the hosts constantly reboot (not into discovery)
Actual results:
Hosts are not reclaimed, unbound, and ready to be used again. Instead they are stuck and constantly reboot.
Expected results:
Hosts are unbound and ready to be used.
Additional information
Contents of RHCOS boot config files
# cat ostree-1-rhcos.conf
title Red Hat Enterprise Linux CoreOS 415.92.202311241643-0 (Plow) (ostree:1)
version 1
options ignition.platform.id=metal ostree=/ostree/boot.0/rhcos/5a67059f4750a7dc58bd91275a6e148a5f6e88e4b48842a7969b3c22af6ecd95/0 root=UUID=36ac8acd-bf01-40e4-8043-3682716e3b91 rw rootflags=prjquota boot=UUID=879d4744-c4b2-4cd3-a4a3-ca601d7dadd7
linux /ostree/rhcos-5a67059f4750a7dc58bd91275a6e148a5f6e88e4b48842a7969b3c22af6ecd95/vmlinuz-5.14.0-284.41.1.el9_2.s390x
initrd /ostree/rhcos-5a67059f4750a7dc58bd91275a6e148a5f6e88e4b48842a7969b3c22af6ecd95/initramfs-5.14.0-284.41.1.el9_2.s390x.img
aboot /ostree/deploy/rhcos/deploy/01b96f07863b8bf16cb4e9a187fefe5bcc1b443a825a503355a1f658a2e856d7.0/usr/lib/ostree-boot/aboot.img
abootcfg /ostree/deploy/rhcos/deploy/01b96f07863b8bf16cb4e9a187fefe5bcc1b443a825a503355a1f658a2e856d7.0/usr/lib/ostree-boot/aboot.cfg
$ cat ostree-2-rhcos.conf
title Red Hat Enterprise Linux CoreOS 415.92.202312250243-0 (Plow) (ostree:0)
version 2
options ignition.platform.id=metal ostree=/ostree/boot.0/rhcos/1023d42feb111a96705089345808aa014c74b171248026fd0be18949980bc322/0 root=UUID=36ac8acd-bf01-40e4-8043-3682716e3b91 rw rootflags=prjquota boot=UUID=879d4744-c4b2-4cd3-a4a3-ca601d7dadd7 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1="all" psi=1
linux /ostree/rhcos-1023d42feb111a96705089345808aa014c74b171248026fd0be18949980bc322/vmlinuz-5.14.0-284.45.1.el9_2.s390x
initrd /ostree/rhcos-1023d42feb111a96705089345808aa014c74b171248026fd0be18949980bc322/initramfs-5.14.0-284.45.1.el9_2.s390x.img
aboot /ostree/deploy/rhcos/deploy/90229475c67473a16f77b3679a5b7a3d90d268d70adf24668f14cf00c06d83e5.1/usr/lib/ostree-boot/aboot.img
abootcfg /ostree/deploy/rhcos/deploy/90229475c67473a16f77b3679a5b7a3d90d268d70adf24668f14cf00c06d83e5.1/usr/lib/ostree-boot/aboot.cfg
Interesting journal log
Feb 15 16:51:07 localhost kernel: Kernel command line: ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/5a67059f4750a7dc58bd91275a6e148a5f6e88e4b48842>
Feb 15 16:51:07 localhost kernel: Unknown kernel command line parameters "ostree=/ostree/boot.1/rhcos/5a67059f4750a7dc58bd91275a6e148a5f6e88e4b48842a7969b3c>
See attached images for reclaim files
Please review the following PR: https://github.com/openshift/azure-kubernetes-kms/pull/8
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
This is a placeholder for Hypershift PR(s) related to bumping CAPO to v0.11.0.
Description of problem:
When using an amd64 release image and setting the multi-arch flag to false, HCP CLI cannot create a HostedCluster. The following error happens: /tmp/hcp create cluster aws --role-arn arn:aws:iam::460538899914:role/cc1c0f586e92c42a7d50 --sts-creds /tmp/secret/sts-creds.json --name cc1c0f586e92c42a7d50 --infra-id cc1c0f586e92c42a7d50 --node-pool-replicas 3 --base-domain origin-ci-int-aws.dev.rhcloud.com --region us-east-1 --pull-secret /etc/ci-pull-credentials/.dockerconfigjson --namespace local-cluster --release-image registry.build01.ci.openshift.org/ci-op-0bi6jr1l/release@sha256:11351a958a409b8e34321edfc459f389058d978e87063bebac764823e0ae3183 2024-08-29T06:23:25Z ERROR Failed to create cluster {"error": "release image is not a multi-arch image"} github.com/openshift/hypershift/product-cli/cmd/cluster/aws.NewCreateCommand.func1 /remote-source/app/product-cli/cmd/cluster/aws/create.go:35 github.com/spf13/cobra.(*Command).execute /remote-source/app/vendor/github.com/spf13/cobra/command.go:983 github.com/spf13/cobra.(*Command).ExecuteC /remote-source/app/vendor/github.com/spf13/cobra/command.go:1115 github.com/spf13/cobra.(*Command).Execute /remote-source/app/vendor/github.com/spf13/cobra/command.go:1039 github.com/spf13/cobra.(*Command).ExecuteContext /remote-source/app/vendor/github.com/spf13/cobra/command.go:1032 main.main /remote-source/app/product-cli/main.go:59 runtime.main /usr/lib/golang/src/runtime/proc.go:271 Error: release image is not a multi-arch image release image is not a multi-arch image
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Try to create a HC with an amd64 release image and multi-arch flag set to false
Actual results:
HC does not create and this error is displayed: Error: release image is not a multi-arch image release image is not a multi-arch image
Expected results:
HC should create without errors
Additional info:
This bug seems to have occurred as a result of HOSTEDCP-1778 and this line: https://github.com/openshift/hypershift/blob/e2f75a7247ab803634a1cc7f7beaf99f8a97194c/cmd/cluster/aws/create.go#L520
Description of problem:
The control loop that manages /var/run/keepalived/iptables-rule-exists looks at the error returned by os.Stat and decides that the file exists as long as os.IsNotExist returns false. In other words, if the error is some non-nil error other than NotExist, the sentinel file would not be created.
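A minimal sketch of the distinction at play, assuming the check is built directly on os.Stat (this is not the actual keepalived monitor code; the sentinel path is taken from the report): only a nil error proves the file exists, and any other error should be handled explicitly rather than being treated as "exists".

package main

import (
	"log"
	"os"
)

const sentinel = "/var/run/keepalived/iptables-rule-exists"

// sentinelExists distinguishes the three outcomes of os.Stat instead of
// collapsing them into "IsNotExist or not".
func sentinelExists(path string) (bool, error) {
	_, err := os.Stat(path)
	if err == nil {
		return true, nil // only a nil error proves the file is there
	}
	if os.IsNotExist(err) {
		return false, nil // definitely absent
	}
	// Permission or I/O errors land here; treating this as "exists" (which is
	// what the described check effectively does) silently skips creation.
	return false, err
}

func main() {
	exists, err := sentinelExists(sentinel)
	if err != nil {
		log.Printf("could not check %s: %v", sentinel, err)
		return
	}
	if !exists {
		if f, err := os.Create(sentinel); err != nil {
			log.Printf("could not create %s: %v", sentinel, err)
		} else {
			f.Close()
		}
	}
}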
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The "oc adm node-image create" command sometimes throw a "image can't be pulled" error the first time the command is executed against a cluster. Example: +(./agent/07_agent_add_node.sh:138): case "${AGENT_E2E_TEST_BOOT_MODE}" in +(./agent/07_agent_add_node.sh:42): oc adm node-image create --dir ocp/ostest/add-node/ --registry-config /opt/dev-scripts/pull_secret.json --loglevel=2 I1108 05:09:07.504614 85927 create.go:406] Starting command in pod node-joiner-4r4hq I1108 05:09:07.517491 85927 create.go:826] Waiting for pod **snip** I1108 05:09:39.512594 85927 create.go:826] Waiting for pod I1108 05:09:39.512634 85927 create.go:322] Printing pod logs Error from server (BadRequest): container "node-joiner" in pod "node-joiner-4r4hq" is waiting to start: image can't be pulled
Version-Release number of selected component (if applicable):
4.18
How reproducible:
sometimes
Steps to Reproduce:
1. Install a new cluster 2. Run "oc adm node-image create" to create an image 3.
Actual results:
Error from server (BadRequest): container "node-joiner" in pod "node-joiner-4r4hq" is waiting to start: image can't be pulled
Expected results:
No errors
Additional info:
The error occurs the first time the command is executed. If the command is retried, it succeeds.
Description of problem:
Nodes cannot recover when the worker role is missing from the custom MCP: all of the configuration is missing on the node, and the kubelet and crio services cannot start.
Version-Release number of selected component (if applicable):
OCP 4.14
How reproducible:
Steps to Reproduce:
1. Create a custom MCP without worker role
$ cat mc.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker-t
  generation: 3
  name: 80-user-kernal
spec: {}
$ cat mcp.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-t
spec:
  configuration:
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker-t
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-t: ""
$ oc create -f mc.yaml
$ oc create -f mcp.yaml
2. Add label worker-t to worker03
$ oc get no
NAME STATUS ROLES AGE VERSION
master01.ocp4.danliu.com Ready master 454d v1.27.13+e709aa5
master02.ocp4.danliu.com Ready master 453d v1.27.13+e709aa5
master03.ocp4.danliu.com Ready master 453d v1.27.13+e709aa5
worker01.ocp4.danliu.com Ready worker 453d v1.27.13+e709aa5
worker02.ocp4.danliu.com Ready worker 51d v1.27.13+e709aa5
worker03.ocp4.danliu.com Ready worker,worker-t 69d v1.27.13+e709aa5
$ oc label nodes worker03.ocp4.danliu.com node-role.kubernetes.io/worker-t=
node/worker03.ocp4.danliu.com labeled
Actual results:
worker03 runs into NotReady status; kubelet and crio cannot start.
Expected results:
The MC sync should be prevented when the worker role is missing from the pool.
Additional info:
In previous versions (4.13 & 4.12), the task got stuck with the error below:
Marking Unreconcilable due to: can't reconcile config rendered-worker-8f464eb07d2e2d2fbdb84ab2204fea65 with rendered-worker-t-5b6179e2fb4fedb853c900504edad9ce: ignition passwd user section contains unsupported changes: user core may not be deleted
Description of problem:
The customer is unable to scale a DeploymentConfig in an RHOCP 4.14.21 cluster. If they scale a DeploymentConfig they get the error: "New size: 4; reason: cpu resource utilization (percentage of request) above target; error: Internal error occurred: converting (apps.DeploymentConfig) to (v1beta1.Scale): unknown conversion"
Version-Release number of selected component (if applicable):
4.14.21
How reproducible:
N/A
Steps to Reproduce:
1. deploy apps using DC 2. configure an admission webhook matching the dc/scale subresource 3. create HPA 4. observe pods unable to scale. Also manual scaling fails
Actual results:
Pods are not getting scaled
Expected results:
Pods should be scaled using HPA
Additional info:
Description of problem:
Additional IBM Cloud Services require the ability to override their service endpoints within the Installer. The list of available services provided in openshift/api must be expanded to account for this.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Create an install-config for IBM Cloud 2. Define serviceEndpoints, including one for "resourceCatalog" 3. Attempt to run IPI
Actual results:
Expected results:
Successful IPI installation, using additional IBM Cloud Service endpoint overrides.
Additional info:
IBM Cloud is working on multiple patches to incorporate these additional services. The full list is still a work in progress, but currently includes:
- Resource (Global) Catalog endpoint
- COS Config endpoint
Changes are currently required in the following components; separate Jiras may be opened (if required) to track their progress:
- openshift/api
- openshift-installer
- openshift/cluster-image-registry-operator
Description of problem:
When we add a user CA bundle to a cluster that has MCPs with yum-based RHEL nodes, the MCPs with RHEL nodes are degraded.
Version-Release number of selected component (if applicable):
$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.17.0-0.nightly-2024-08-18-131731 True False 101m Cluster version is 4.17.0-0.nightly-2024-08-18-131731
How reproducible:
Always. In CI we found this issue running test case "[sig-mco] MCO security Author:sregidor-NonHyperShiftHOST-High-67660-MCS generates ignition configs with certs [Disruptive] [Serial]" on prow job periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-gcp-ipi-workers-rhel8-fips-f28-destructive
Steps to Reproduce:
1. Create a certificate $ openssl genrsa -out privateKey.pem 4096 $ openssl req -new -x509 -nodes -days 3600 -key privateKey.pem -out ca-bundle.crt -subj "/OU=MCO qe/CN=example.com" 2. Add the certificate to the cluster # Create the configmap with the certificate $ oc create cm cm-test-cert -n openshift-config --from-file=ca-bundle.crt configmap/cm-test-cert created #Configure the proxy with the new test certificate $ oc patch proxy/cluster --type merge -p '{"spec": {"trustedCA": {"name": "cm-test-cert"}}}' proxy.config.openshift.io/cluster patched 3. Check the MCP status and the MCD logs
Actual results:
The MCP is degraded $ oc get mcp NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-3251b00997d5f49171e70f7cf9b64776 True False False 3 3 3 0 130m worker rendered-worker-05e7664fa4758a39f13a2b57708807f7 False True True 3 0 0 1 130m We can see this message in the MCP - lastTransitionTime: "2024-08-19T11:00:34Z" message: 'Node ci-op-jr7hwqkk-48b44-6mcjk-rhel-1 is reporting: "could not apply update: restarting coreos-update-ca-trust.service service failed. Error: error running systemctl restart coreos-update-ca-trust.service: Failed to restart coreos-update-ca-trust.service: Unit coreos-update-ca-trust.service not found.\n: exit status 5"' reason: 1 nodes are reporting degraded status on sync status: "True" type: NodeDegraded In the MCD logs we can see: I0819 11:38:55.089991 7239 update.go:2665] Removing SIGTERM protection E0819 11:38:55.090067 7239 writer.go:226] Marking Degraded due to: could not apply update: restarting coreos-update-ca-trust.service service failed. Error: error running systemctl restart coreos-update-ca-trust.service: Failed to restart coreos-update-ca-trust.service: Unit coreos-update-ca-trust.service not found.
Expected results:
No degradation should happen. The certificate should be added without problems.
Additional info:
Description of problem:
When a cluster-admin user or a normal user tries to create the first networkpolicy resource for a project, clicking on `affected pods` before submitting the creation form results in an error.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-08-27-051932
How reproducible:
Always
Steps to Reproduce:
1. Open Networking -> NetworkPolicies, normal user or cluster-admin user tries to create the first networkpolicy resource into one project 2. on Form view, click on `affected pods` button before hit on 'Create' button 3.
Actual results:
2. For a cluster-admin user, we see the error "Cannot set properties of undefined (setting 'tabIndex')". For a normal user, we see "undefined has no properties".
Expected results:
no errors
Additional info:
Description of problem:
HCP cluster is being updated but the nodepool is stuck updating: ~~~ NAME CLUSTER DESIRED NODES CURRENT NODES AUTOSCALING AUTOREPAIR VERSION UPDATINGVERSION UPDATINGCONFIG MESSAGE nodepool-dev-cluster dev 2 2 False False 4.15.22 True True ~~~
Version-Release number of selected component (if applicable):
Hosting OCP cluster 4.15 HCP 4.15.23
How reproducible:
N/A
Steps to Reproduce:
1. 2. 3.
Actual results:
Nodepool stuck in upgrade
Expected results:
Upgrade success
Additional info:
I have found this error repeating continually in the ignition-server pods: ~~~ {"level":"error","ts":"2024-08-20T09:02:19Z","msg":"Reconciler error","controller":"secret","controllerGroup":"","controllerKind":"Secret","Secret":{"name":"token-nodepool-dev-cluster-3146da34","namespace":"dev-dev"},"namespace":"dev-dev","name":"token-nodepool-dev-cluster-3146da34","reconcileID":"ec1f0a7f-1657-4245-99ef-c984977ff0f8","error":"error getting ignition payload: failed to download binaries: failed to extract image file: failed to extract image file: file not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"} {"level":"info","ts":"2024-08-20T09:02:20Z","logger":"get-payload","msg":"discovered machine-config-operator image","image":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f3b55cc8f88b9e6564fe6ad0bc431cd7270c0586a06d9b4a19ff2b518c461ede"} {"level":"info","ts":"2024-08-20T09:02:20Z","logger":"get-payload","msg":"created working directory","dir":"/payloads/get-payload4089452863"} {"level":"info","ts":"2024-08-20T09:02:28Z","logger":"get-payload","msg":"extracted image-references","time":"8s"} {"level":"info","ts":"2024-08-20T09:02:38Z","logger":"get-payload","msg":"extracted templates","time":"10s"} {"level":"info","ts":"2024-08-20T09:02:38Z","logger":"image-cache","msg":"retrieved cached file","imageRef":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f3b55cc8f88b9e6564fe6ad0bc431cd7270c0586a06d9b4a19ff2b518c461ede","file":"usr/lib/os-release"} {"level":"info","ts":"2024-08-20T09:02:38Z","logger":"get-payload","msg":"read os-release","mcoRHELMajorVersion":"8","cpoRHELMajorVersion":"9"} {"level":"info","ts":"2024-08-20T09:02:38Z","logger":"get-payload","msg":"copying file","src":"usr/bin/machine-config-operator.rhel9","dest":"/payloads/get-payload4089452863/bin/machine-config-operator"} ~~~
Please review the following PR: https://github.com/openshift/machine-api-provider-azure/pull/118
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-machine-api-provider-azure-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Description of problem:
Since about 4 days ago, the techpreview jobs have been failing on MCO namespace: https://sippy.dptools.openshift.org/sippy-ng/jobs/4.18/analysis?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview-serial%22%7D%5D%7D Example run: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview-serial/1843057579794632704 The daemons appear to be applying MCN's too early in the process, which causes it to degrade for a few loops: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview-serial/1842877807659585536/artifacts/e2e-aws-ovn-techpreview-serial/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-daemon-79f7s_machine-config-daemon.log This is semi-blocking techpreview jobs and should be fixed high priority. This shouldn't be blocking release as MCN is not GA and likely won't be in 4.18.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
This PR introduces graceful shutdown functionality to the Multus daemon by adding a /readyz endpoint alongside the existing /healthz. The /readyz endpoint starts returning 500 once a SIGTERM is received, indicating the daemon is in shutdown mode. During this time, CNI requests can still be processed for a short window. The daemonset configs have been updated to increase terminationGracePeriodSeconds from 10 to 30 seconds, ensuring we have a bit more time for these clean shutdowns. This addresses a race condition during pod transitions where the readiness check might return true, but a subsequent CNI request could fail if the daemon shuts down too quickly. By introducing the /readyz endpoint and delaying the shutdown, we can handle ongoing CNI requests more gracefully, reducing the risk of disruptions during critical transitions.
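A rough sketch of the pattern being described, with made-up port and timing values (this is not the Multus source): /healthz stays green while /readyz flips to 500 after SIGTERM, so the process can drain in-flight requests before exiting.

package main

import (
	"log"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var terminating atomic.Bool

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // liveness is unaffected by shutdown
	})
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if terminating.Load() {
			// Signal "not ready" so no new CNI traffic is routed here.
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	go func() { log.Fatal(http.ListenAndServe(":9091", nil)) }() // port is an arbitrary example

	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM)
	<-sig
	terminating.Store(true)
	// Keep serving in-flight requests during a drain window; the pod's
	// terminationGracePeriodSeconds must be longer than this sleep.
	time.Sleep(25 * time.Second)
}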
Version-Release number of selected component (if applicable):
How reproducible:
Difficult to reproduce, might require CI signal
Description of problem:
Console and OLM engineering and BU have decided to remove the Extension Catalog navigation item until the feature has matured more.
Description of problem:
cluster-openshift-apiserver-operator is still in 1.29 and should be updated to 1.30 to reduce conflicts and other issues
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
As a part of deploying SNO clusters in the field based on the IBI install process we need a way to apply NODE labels to the resulting cluster. As an example, once the cluster has had an IBI config applied to it, it should have a node label of "edge.io/isedgedevice: true" ... the label is only an example, and the user should have the ability to add one or more labels to the resulting node.
See: https://redhat-internal.slack.com/archives/C05JHD9QYTC/p1730298666011899 for additional context.
Description of problem:
While accessing the node terminal of the cluster from the web console, the warning message below is observed. ~~~ Admission Webhook WarningPod master-0.americancluster222.lab.psi.pnq2.redhat.com-debug violates policy 299 - "metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]" ~~~ Note: This is not impacting the cluster; however, it is creating confusion among customers due to the warning message.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Every time.
Steps to Reproduce:
1. Install cluster of version 4.16.11 2. Upgrade the cluster from web-console to the next-minor version 4.16.13 3. Try to access the node terminal from UI
Actual results:
Showing warning while accessing the node terminal.
Expected results:
Does not show any warning.
Additional info:
Please review the following PR: https://github.com/openshift/hypershift/pull/4672
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-update-keys/pull/62
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
v1alpha1 schema is still present in the v1 ConsolePlugin CRD and should be removed manually since the generator is re-adding it automatically.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/csi-operator/pull/126
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The pod of a catalogsource without registryPoll is not recreated during node failure.
jiazha-mac:~ jiazha$ oc get pods NAME READY STATUS RESTARTS AGE certified-operators-rcs64 1/1 Running 0 123m community-operators-8mxh6 1/1 Running 0 123m marketplace-operator-769fbb9898-czsfn 1/1 Running 4 (117m ago) 136m qe-app-registry-5jxlx 1/1 Running 0 106m redhat-marketplace-4bgv9 1/1 Running 0 123m redhat-operators-ww5tb 1/1 Running 0 123m test-2xvt8 1/1 Terminating 0 12m jiazha-mac:~ jiazha$ oc get pods test-2xvt8 -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES test-2xvt8 1/1 Running 0 7m6s 10.129.2.26 qe-daily-417-0708-cv2p6-worker-westus-gcrrc <none> <none> jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc NAME STATUS ROLES AGE VERSION qe-daily-417-0708-cv2p6-worker-westus-gcrrc NotReady worker 116m v1.30.2+421e90e
Version-Release number of selected component (if applicable):
Cluster version is 4.17.0-0.nightly-2024-07-07-131215
How reproducible:
always
Steps to Reproduce:
1. create a catalogsource without the registryPoll configure. jiazha-mac:~ jiazha$ cat cs-32183.yaml apiVersion: operators.coreos.com/v1alpha1 kind: CatalogSource metadata: name: test namespace: openshift-marketplace spec: displayName: Test Operators image: registry.redhat.io/redhat/redhat-operator-index:v4.16 publisher: OpenShift QE sourceType: grpc jiazha-mac:~ jiazha$ oc create -f cs-32183.yaml catalogsource.operators.coreos.com/test created jiazha-mac:~ jiazha$ oc get pods test-2xvt8 -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES test-2xvt8 1/1 Running 0 3m18s 10.129.2.26 qe-daily-417-0708-cv2p6-worker-westus-gcrrc <none> <none> 2. Stop the node jiazha-mac:~ jiazha$ oc debug node/qe-daily-417-0708-cv2p6-worker-westus-gcrrc Temporary namespace openshift-debug-q4d5k is created for debugging node... Starting pod/qe-daily-417-0708-cv2p6-worker-westus-gcrrc-debug-v665f ... To use host binaries, run `chroot /host` Pod IP: 10.0.128.5 If you don't see a command prompt, try pressing enter. sh-5.1# chroot /host sh-5.1# systemctl stop kubelet; sleep 600; systemctl start kubelet Removing debug pod ... Temporary namespace openshift-debug-q4d5k was removed. jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc NAME STATUS ROLES AGE VERSION qe-daily-417-0708-cv2p6-worker-westus-gcrrc NotReady worker 115m v1.30.2+421e90e 3. check it this catalogsource's pod recreated.
Actual results:
No new pod was generated.
jiazha-mac:~ jiazha$ oc get pods NAME READY STATUS RESTARTS AGE certified-operators-rcs64 1/1 Running 0 123m community-operators-8mxh6 1/1 Running 0 123m marketplace-operator-769fbb9898-czsfn 1/1 Running 4 (117m ago) 136m qe-app-registry-5jxlx 1/1 Running 0 106m redhat-marketplace-4bgv9 1/1 Running 0 123m redhat-operators-ww5tb 1/1 Running 0 123m test-2xvt8 1/1 Terminating 0 12m
Once the node recovered, a new pod was generated.
jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc
NAME STATUS ROLES AGE VERSION
qe-daily-417-0708-cv2p6-worker-westus-gcrrc Ready worker 127m v1.30.2+421e90e
jiazha-mac:~ jiazha$ oc get pods
NAME READY STATUS RESTARTS AGE
certified-operators-rcs64 1/1 Running 0 127m
community-operators-8mxh6 1/1 Running 0 127m
marketplace-operator-769fbb9898-czsfn 1/1 Running 4 (121m ago) 140m
qe-app-registry-5jxlx 1/1 Running 0 109m
redhat-marketplace-4bgv9 1/1 Running 0 127m
redhat-operators-ww5tb 1/1 Running 0 127m
test-wqxvg 1/1 Running 0 27s
Expected results:
During the node failure, a new catalog source pod should be generated.
Additional info:
Hi Team,
After investigating the operator-lifecycle-manager source code further, we figured out the reason.
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operator-index
  namespace: openshift-marketplace
spec:
  image: quay-server.bastion.tokyo.com:5000/redhat/redhat-operator-index-logging:logging-vstable-5.8-v5.8.5
  sourceType: grpc
We verified that the catalog pod can be recreated on another node if we add the registryPoll configuration to the catalogsource as follows (the lines marked with <==).
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operator-index
  namespace: openshift-marketplace
spec:
  image: quay-server.bastion.tokyo.com:5000/redhat/redhat-operator-index-logging:logging-vstable-5.8-v5.8.5
  sourceType: grpc
  updateStrategy:    <==
    registryPoll:    <==
      interval: 10m  <==
However, registryPoll is not a mandatory field for a catalogsource, so the commit [1] that tries to fix the issue in EnsureRegistryServer() is not the proper fix; see the illustrative sketch after the reference links below.
[1] https://github.com/operator-framework/operator-lifecycle-manager/pull/3201/files
[2] https://github.com/joelanford/operator-lifecycle-manager/blob/82f499723e52e85f28653af0610b6e7feff096cf/pkg/controller/registry/reconciler/grpc.go#L290
[3] https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/operators/catalog/operator.go#L1009
[4] https://docs.openshift.com/container-platform/4.16/operators/admin/olm-managing-custom-catalogs.html
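The following is a hypothetical, heavily simplified illustration of the behaviour described above, not the actual OLM reconciler: when the dead-pod handling is reachable only for catalogs that configure updateStrategy.registryPoll, a plain grpc CatalogSource never gets its registry pod rescheduled after a node failure.

package main

import "fmt"

type registryPoll struct{ interval string }

type catalogSource struct {
	name         string
	registryPoll *registryPoll // nil when spec.updateStrategy.registryPoll is unset
	podHealthy   bool
}

func ensureRegistryServer(cs *catalogSource) {
	if cs.registryPoll == nil {
		// Without the polling path there is no code path that replaces a pod
		// stuck in Terminating on a NotReady node.
		fmt.Printf("%s: no registryPoll configured, pod left as-is\n", cs.name)
		return
	}
	if !cs.podHealthy {
		fmt.Printf("%s: registry pod unhealthy, recreating it\n", cs.name)
		cs.podHealthy = true
	}
}

func main() {
	ensureRegistryServer(&catalogSource{name: "test", podHealthy: false})
	ensureRegistryServer(&catalogSource{
		name:         "redhat-operator-index",
		registryPoll: &registryPoll{interval: "10m"},
		podHealthy:   false,
	})
}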
Observed in
There was a delay provisioning one of the master nodes; we should figure out why this is happening and whether it can be prevented.
From the Ironic logs, there was a 5-minute delay during cleaning; on the other 2 masters this took a few seconds:
01:20:53 1f90131a...moved to provision state "verifying" from state "enroll"
01:20:59 1f90131a...moved to provision state "manageable" from state "verifying"
01:21:04 1f90131a...moved to provision state "inspecting" from state "manageable"
01:21:35 1f90131a...moved to provision state "inspect wait" from state "inspecting"
01:26:26 1f90131a...moved to provision state "inspecting" from state "inspect wait"
01:26:26 1f90131a...moved to provision state "manageable" from state "inspecting"
01:26:30 1f90131a...moved to provision state "cleaning" from state "manageable"
01:27:17 1f90131a...moved to provision state "clean wait" from state "cleaning"
>>> what's this 5 minute gap about ?? <<<
01:32:07 1f90131a...moved to provision state "cleaning" from state "clean wait"
01:32:08 1f90131a...moved to provision state "clean wait" from state "cleaning"
01:32:12 1f90131a...moved to provision state "cleaning" from state "clean wait"
01:32:13 1f90131a...moved to provision state "available" from state "cleaning"
01:32:23 1f90131a...moved to provision state "deploying" from state "available"
01:32:28 1f90131a...moved to provision state "wait call-back" from state "deploying"
01:32:58 1f90131a...moved to provision state "deploying" from state "wait call-back"
01:33:14 1f90131a...moved to provision state "active" from state "deploying"
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible: Always
Repro Steps:
Add: "bridge=br0:enpf0,enpf2 ip=br0:dhcp" to dracut cmdline. Make sure either enpf0/enpf2 is the primary network of the cluster subnet.
The linux bridge can be configured to add a virtual switch between one or many ports. This can be done by a simple machine config that adds:
"bridge=br0:enpf0,enpf2 ip=br0:dhcp"
to the kernel command line options which will be processed by dracut.
The use case of adding such a virtual bridge for simple IEEE802.1 switching is to support PCIe devices that act as co-processors in a baremetal server. For example:
(Diagram: the host's eth0 is connected over PCIe to the co-processor's enpf0, and the co-processor provides the connection to the network.)
This co-processor could be a "DPU" network interface card. Thus the co-processor can be part of the same underlay network as the cluster and pods can be scheduled on the Host and the Co-processor. This allows for pods to be offloaded to the co-processor for scaling workloads.
Actual results:
ovs-configuration service fails.
Expected results:
ovs-configuration service passes with the bridge interface added to the ovs bridge.
Description of problem:
v4.17 baselineCapabilitySet is not recognized.
# ./oc adm release extract --install-config v4.17-basecap.yaml --included --credentials-requests --from quay.io/openshift-release-dev/ocp-release:4.17.0-rc.1-x86_64 --to /tmp/test
error: unrecognized baselineCapabilitySet "v4.17"
# cat v4.17-basecap.yaml
---
apiVersion: v1
platform:
  gcp:
    foo: bar
capabilities:
  baselineCapabilitySet: v4.17
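A hypothetical illustration of this class of failure, not the actual openshift/api or oc code: if capability sets are resolved through a hard-coded map keyed by version, a new "v4.17" entry is rejected until the map is extended in the tooling's vendored API.

package main

import "fmt"

// Illustrative only; the real capability-set names and contents live in openshift/api.
var baselineCapabilitySets = map[string][]string{
	"None":  {},
	"v4.15": {"Build", "Console", "Insights"}, // contents abbreviated
	"v4.16": {"Build", "Console", "Insights"}, // contents abbreviated
	// "v4.17" missing here -> unrecognized baselineCapabilitySet "v4.17"
}

func resolve(name string) ([]string, error) {
	caps, ok := baselineCapabilitySets[name]
	if !ok {
		return nil, fmt.Errorf("unrecognized baselineCapabilitySet %q", name)
	}
	return caps, nil
}

func main() {
	if _, err := resolve("v4.17"); err != nil {
		fmt.Println("error:", err)
	}
}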
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-04-132247
How reproducible:
always
Steps to Reproduce:
1. Run `oc adm release extract --install-config --included` against an install-config file including baselineCapabilitySet: v4.17. 2. 3.
Actual results:
`oc adm release extract` throw unrecognized error
Expected results:
`oc adm release extract` should extract correct manifests
Additional info:
If specifying baselineCapabilitySet: v4.16, it works well.
TRT has detected a consistent long term trend where the oauth-apiserver appears to have more disruption than it did in 4.16, for minor upgrades on azure.
The problem appears roughly above the 90th percentile; we picked it up at P95, where it shows a consistent 5-8s more disruption than we'd expect given the data at 4.16 GA.
The problem hits ONLY oauth, affecting both new and reused connections, as well as the cached variants meaning etcd should be out of the picture. You'll see a few very short blips where all four of these backends lose connectivity for ~1s throughout the run, several times over. It looks like it may be correlated to the oauth-operator reporting:
source/OperatorAvailable display/true condition/Available reason/APIServices_Error status/False APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https" [2s]
Sample jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669509775822848
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669509775822848/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122854.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827077182283845632
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827077182283845632/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240823-212127.json&overrideDisplayFlag=1&selectedSources=OperatorDegraded&selectedSources=EtcdLog&selectedSources=Disruption&selectedSources=E2EFailed
More can be found using the first link to the dashboard in this post and scrolling down to most recent job runs, and looking for high numbers.
The operator going Degraded is probably the strongest symptom to pursue, as it appears in most of the above.
If you find any runs where other backends are disrupted, especially kube-api, I would suggest ignoring those as they are unlikely to be the same fingerprint as the error being described here.
Context
Some ROSA HCP users host their own container registries (e.g., self-hosted Quay servers) that are only accessible from inside of their VPCs. This is often achieved through the use of private DNS zones that resolve non-public domains like quay.mycompany.intranet to non-public IP addresses. The private registries at those addresses then present self-signed SSL certificates to the client that can be validated against the HCP's additional CA trust bundle.
Problem Description
A user of a ROSA HCP cluster with a configuration like the one described above is encountering errors when attempting to import a container image from their private registry into their HCP's internal registry via oc import-image. Originally, these errors showed up in openshift-apiserver logs as DNS resolution errors, i.e., OCPBUGS-36944. After the user upgraded their cluster to 4.14.37 (which fixes OCPBUGS-36944), openshift-apiserver was able to properly resolve the domain name but complains of HTTP 502 Bad Gateway errors. We suspect these 502 Bad Gateway errors are coming from the Konnectivity-agent while it proxies traffic between the control and data planes.
We've confirmed that the private registry is accessible from the HCP data plane (worker nodes) and that the certificate presented by the registry can be validated against the cluster's additional trust bundle. IOW, curl-ing the private registry from a worker node returns a HTTP 200 OK, but doing the same from a control plane node returns a HTTP 502. Notably, this cluster is not configured with a cluster-wide proxy, nor does the user's VPC feature a transparent proxy.
Version-Release number of selected component
OCP v4.14.37
How reproducible
Can be reliably reproduced, although the network config (see Context above) is quite specific
Steps to Reproduce
oc import-image imagegroup/imagename:v1.2.3 --from=quay.mycompany.intranet/imagegroup/imagename:v1.2.3 --confirm
Actual Results
error: tag v1.2.3 failed: Internal error occurred: quay.mycompany.intranet/imagegroup/imagename:v1.2.3: Get "https://quay.mycompany.intranet/v2/": Bad Gateway imagestream.image.openshift.io/imagename imported with errors Name: imagename Namespace: mynamespace Created: Less than a second ago Labels: <none> Annotations: openshift.io/image.dockerRepositoryCheck=2024-10-01T12:46:02Z Image Repository: default-route-openshift-image-registry.apps.rosa.clustername.abcd.p1.openshiftapps.com/mynamespace/imagename Image Lookup: local=false Unique Images: 0 Tags: 1 v1.2.3 tagged from quay.mycompany.intranet/imagegroup/imagename:v1.2.3 ! error: Import failed (InternalError): Internal error occurred: quay.mycompany.intranet/imagegroup/imagename:v1.2.3: Get "https://quay.mycompany.intranet/v2/": Bad Gateway Less than a second ago error: imported completed with errors
Expected Results
Desired container image is imported from private external image registry into cluster's internal image registry without error
Description of problem:
We ignore errors from the existence check in https://github.com/openshift/baremetal-runtimecfg/blob/723290ec4b31bc4e032ff62198ae3dd0d0e36313/pkg/monitor/iptables.go#L116 and that can make it more difficult to debug errors in the healthchecks. In particular, this made it more difficult to debug an issue with permissions on the monitor container because there were no log messages to let us know the check had failed.
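A small sketch of the kind of change implied, assuming the check wraps os.Stat (names are illustrative, this is not the baremetal-runtimecfg code): log the unexpected error instead of silently treating it as "not found", so problems such as a permissions issue on the monitor container show up in the logs.

package main

import (
	"log"
	"os"
)

// fileExists reports whether path exists and, crucially, logs unexpected
// errors instead of discarding them.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	if err == nil {
		return true
	}
	if !os.IsNotExist(err) {
		// Previously this error was dropped; logging it makes the failed
		// healthcheck visible when debugging.
		log.Printf("existence check for %s failed: %v", path, err)
	}
	return false
}

func main() {
	if !fileExists("/var/run/keepalived/iptables-rule-exists") {
		log.Println("sentinel file not present")
	}
}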
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
We should decrease the verbosity level for the IBM CAPI module. This will affect the output of the file .openshift_install.log
Please review the following PR: https://github.com/openshift/aws-pod-identity-webhook/pull/196
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
While updating an HC with a controllerAvailabilityPolicy of SingleReplica, the HCP doesn't fully roll out, with 3 pods stuck in Pending:
multus-admission-controller-5b5c95684b-v5qgd 0/2 Pending 0 4m36s network-node-identity-7b54d84df4-dxx27 0/3 Pending 0 4m12s ovnkube-control-plane-647ffb5f4d-hk6fg 0/3 Pending 0 4m21s
This is because these deployments all have requiredDuringSchedulingIgnoredDuringExecution zone anti-affinity and maxUnavailable: 25% (i.e. 1).
Thus the old pod blocks scheduling of the new pod.
Kubelet logs contain entries like:
Jun 13 10:05:14.141073 ip-10-0-133-167 kubenswrapper[1385]: I0613 10:05:14.141043 1385 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt"
I'm not sure if that's a problem or not, but it is distracting noise for folks trying to understand Kubelet behavior, and we should either fix the problem, or denoise the red-herring.
Seen in 4.13.44, 4.14.31, and 4.17.0-0.nightly-2024-06-25-162526 (details in Additional info).
Not seen in 4.12.60, so presumably a 4.12 to 4.13 change.
Every time.
1. Run a cluster.
2. Check node/kubelet logs for one control-plane node.
Lots of can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt messages.
No can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt messages.
Checking recent builds in assorted 4.y streams.
4.12.60 > aws-sdn-serial > Artifacts > ... > gather-extra artifacts:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-sdn-serial/1803708035177123840/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/nodes.json | jq -r '.items[].metadata | select(.labels["node-role.kubernetes.io/control-plane"] == "").name' ip-10-0-156-214.us-west-1.compute.internal ip-10-0-158-171.us-west-1.compute.internal ip-10-0-203-59.us-west-1.compute.internal $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-sdn-serial/1803708035177123840/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/nodes/ip-10-0-156-214.us-west-1.compute.internal/journal | zgrep kubelet-ca.crt | tail -n3 Jun 20 08:47:07.734060 ip-10-0-156-214 ignition[1087]: INFO : files: createFilesystemsFiles: createFiles: op(11): [finished] writing file "/sysroot/etc/kubernetes/kubelet-ca.crt" Jun 20 08:49:29.274949 ip-10-0-156-214 kubenswrapper[1384]: I0620 08:49:29.274923 1384 dynamic_cafile_content.go:119] "Loaded a new CA Bundle and Verifier" name="client-ca-bundle::/etc/kubernetes/kubelet-ca.crt" Jun 20 08:49:29.275084 ip-10-0-156-214 kubenswrapper[1384]: I0620 08:49:29.275067 1384 dynamic_cafile_content.go:157] "Starting controller" name="client-ca-bundle::/etc/kubernetes/kubelet-ca.crt"
is clean.
4.13.44 > aws-sdn-serial > Artifacts > ... > gather-extra artifacts:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-aws-sdn-serial/1801188570212339712/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/nodes.json | jq -r '.items[].metadata | select(.labels["node-role.kubernetes.io/control-plane"] == "").name' ip-10-0-133-167.us-west-1.compute.internal ip-10-0-170-3.us-west-1.compute.internal ip-10-0-203-13.us-west-1.compute.internal $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-aws-sdn-serial/1801188570212339712/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/nodes/ip-10-0-133-167.us-west-1.compute.internal/journal | zgrep kubelet-ca.crt | tail -n3 Jun 13 10:05:00.464260 ip-10-0-133-167 kubenswrapper[1385]: I0613 10:05:00.464190 1385 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt" Jun 13 10:05:13.320867 ip-10-0-133-167 kubenswrapper[1385]: I0613 10:05:13.320824 1385 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt" Jun 13 10:05:14.141073 ip-10-0-133-167 kubenswrapper[1385]: I0613 10:05:14.141043 1385 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt"
is exposed.
4.14.31 > aws-ovn-serial > Artifacts > ... > gather-extra artifacts:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1803746771264868352/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/nodes.json | jq -r '.items[].metadata | select(.labels["node-role.kubernetes.io/control-plane"] == "").name' ip-10-0-17-181.us-west-2.compute.internal ip-10-0-66-68.us-west-2.compute.internal ip-10-0-97-83.us-west-2.compute.internal $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1803746771264868352/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/nodes/ip-10-0-17-181.us-west-2.compute.internal/journal | zgrep kubelet-ca.crt | tail -n3 Jun 20 11:42:31.931470 ip-10-0-17-181 kubenswrapper[2226]: I0620 11:42:31.931404 2226 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt" Jun 20 11:42:31.980499 ip-10-0-17-181 kubenswrapper[2226]: I0620 11:42:31.980448 2226 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt" Jun 20 11:42:32.757888 ip-10-0-17-181 kubenswrapper[2226]: I0620 11:42:32.757846 2226 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt"
4.17.0-0.nightly-2024-06-25-162526 > aws-ovn-serial > Artifacts > ... > gather-extra artifacts:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-serial/1805639599624556544/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/nodes.json | jq -r '.items[].metadata | select(.labels["node-role.kubernetes.io/control-plane"] == "").name' ip-10-0-125-200.ec2.internal ip-10-0-47-81.ec2.internal ip-10-0-8-158.ec2.internal $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-serial/1805639599624556544/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/nodes/ip-10-0-8-158.ec2.internal/journal | zgrep kubelet-ca.crt | tail -n3 Jun 25 19:56:13.452559 ip-10-0-8-158 kubenswrapper[2243]: I0625 19:56:13.452512 2243 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="fsnotify: can't remove non-existent watch: /etc/kubernetes/kubelet-ca.crt" Jun 25 19:56:13.512277 ip-10-0-8-158 kubenswrapper[2243]: I0625 19:56:13.512213 2243 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="fsnotify: can't remove non-existent watch: /etc/kubernetes/kubelet-ca.crt" Jun 25 19:56:14.403001 ip-10-0-8-158 kubenswrapper[2243]: I0625 19:56:14.402953 2243 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="fsnotify: can't remove non-existent watch: /etc/kubernetes/kubelet-ca.crt"
gophercloud is outdated; we need to update it to get the latest dependencies and avoid CVEs.
Description of problem:
Version-Release number of selected component (if applicable):
When navigating from Lightspeed's "Don't show again" link, it can be hard to know which element is relevant. We should look at utilizing Spotlight to highlight the relevant user preference. Also, there is an undesirable gap before the Lightspeed user preference caused by an empty div from data-test="console.telemetryAnalytics".
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Description of problem:
Based on feature https://issues.redhat.com/browse/CONSOLE-3243 (Rename "master" to "control plane node" in node pages), the 'master' option in the 'Filter by Node type' dropdown on the Cluster Utilization section of the Overview page should be updated to 'control plane'. But those changes were overwritten by PR https://github.com/openshift/console/pull/14121, which brings this issue.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-15-032107
How reproducible:
Always
Steps to Reproduce:
1. Make sure your node role has 'control-plane', eg: $ oc get nodes -o wide NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME qe-uidaily-1016-dsclx-master-0 Ready control-plane,master 3h v1.31.1 10.0.0.4 <none> Red Hat Enterprise Linux CoreOS 418.94.202410111739-0 5.14.0-427.40.1.el9_4.x86_64 cri-o://1.31.1-4.rhaos4.18.gitd8950b8.el9 qe-uidaily-1016-dsclx-master-1 Ready control-plane,master 3h v1.31.1 10.0.0.5 <none> Red Hat Enterprise Linux CoreOS 418.94.202410111739-0 5.14.0-427.40.1.el9_4.x86_64 cri-o://1.31.1-4.rhaos4.18.gitd8950b8.el9 2. Navigate to the Overview page, check the options in the 'Filter by Node type' dropdown list on the Cluster utilization section 3.
Actual results:
control plane option is missing
Expected results:
the 'master' option should be updated to 'control plane'
Additional info:
Description of problem:
The cert-manager operator from redhat-operators is not yet available in the 4.18 catalog. We'll need to use a different candidate in order to update our default catalog images to 4.18 without creating test failures.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
For resources under the Networking menu (e.g. service, route, ingress, networkpolicy), when accessing a non-existing resource, the page should show "404 not found" instead of loading forever.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-10-133647 4.17.0-0.nightly-2024-09-09-120947
How reproducible:
Always
Steps to Reproduce:
1.Access a non-existing resource under Networking menu, eg "testconsole" service with url "/k8s/ns/openshift-console/services/testconsole". 2. 3.
Actual results:
1. The page will always be loading. screenshot: https://drive.google.com/file/d/1HpH2BfVUACivI0KghXhsKt3FYgYFOhxx/view?usp=drive_link
Expected results:
1. Should show "404 not found"
Additional info:
Perform the SnykDuty
Please review the following PR: https://github.com/openshift/kubevirt-csi-driver/pull/46
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When setting .spec.storage.azure.networkAccess.type: Internal (without providing vnet and subnet names), the image registry will attempt to discover the vnet by tag. Previous to the installer switching to cluster-api, the vnet tagging happened here: https://github.com/openshift/installer/blob/10951c555dec2f156fad77ef43b9fb0824520015/pkg/asset/cluster/azure/azure.go#L79-L92. After the switch to cluster-api, this code no longer seems to be in use, so the tags are no longer there. From inspection of a failed job, the new tags in use seem to be in the form of `sigs.k8s.io_cluster-api-provider-azure_cluster_$infraID` instead of the previous `kubernetes.io_cluster.$infraID`. Image registry operator code responsible for this: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/storage/azure/azure.go?plain=1#L678-L682 More details in slack discussion with installer team: https://redhat-internal.slack.com/archives/C68TNFWA2/p1726732108990319
Version-Release number of selected component (if applicable):
4.17, 4.18
How reproducible:
Always
Steps to Reproduce:
1. Get an Azure 4.17 or 4.18 cluster 2. oc edit configs.imageregistry/cluster 3. set .spec.storage.azure.networkAccess.type to Internal
Actual results:
The operator cannot find the vnet (look for "not found" in operator logs)
Expected results:
The operator should be able to find the vnet by tag and configure the storage account as private
Additional info:
If we make the switch to look for vnet tagged with `sigs.k8s.io_cluster-api-provider-azure_cluster_$infraID`, one thing that needs to be tested is BYO vnet/subnet clusters. What I have currently observed in CI is that the cluster has the new tag key with `owned` value, but for BYO networks the value *should* be `shared`, but I have not tested it. --- Although this bug is a regression, I'm not going to mark it as such because this affects a fairly new feature (introduced on 4.15), and there's a very easy workaround (manually setting the vnet and subnet names when configuring network access to internal).
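A sketch of one possible lookup, using the tag keys quoted above (this is not the cluster-image-registry-operator implementation, and whether BYO networks really carry the "shared" value is, as noted, untested): accept the VNet when either the pre-CAPI tag or the CAPI tag is present with an "owned" or "shared" value.

package main

import "fmt"

// vnetBelongsToCluster accepts a VNet when either tag key format is present
// with an "owned" or "shared" value.
func vnetBelongsToCluster(tags map[string]string, infraID string) bool {
	keys := []string{
		"kubernetes.io_cluster." + infraID,                          // written by the pre-CAPI installer
		"sigs.k8s.io_cluster-api-provider-azure_cluster_" + infraID, // written after the CAPI switch
	}
	for _, k := range keys {
		if v, ok := tags[k]; ok && (v == "owned" || v == "shared") {
			return true
		}
	}
	return false
}

func main() {
	tags := map[string]string{
		"sigs.k8s.io_cluster-api-provider-azure_cluster_mycluster-abcde": "owned",
	}
	fmt.Println(vnetBelongsToCluster(tags, "mycluster-abcde")) // true
}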
Description of problem:
See https://search.dptools.openshift.org/?search=Kubernetes+resource+CRUD+operations+Secret+displays+detail+view+for+newly+created+resource+instance&maxAge=168h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Description of problem:
When using the UPDATE_URL_OVERRIDE env variable, the output is confusing:
./oc-mirror.latest -c config-19.yaml --v2 file://disk-enc1
2024/06/19 12:22:38 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/06/19 12:22:38 [INFO] : 👋 Hello, welcome to oc-mirror
2024/06/19 12:22:38 [INFO] : ⚙️ setting up the environment for you...
2024/06/19 12:22:38 [INFO] : 🔀 workflow mode: mirrorToDisk
I0619 12:22:38.832303 66173 client.go:44] Usage of the UPDATE_URL_OVERRIDE environment variable is unsupported
2024/06/19 12:22:38 [INFO] : 🕵️ going to discover the necessary images...
Version-Release number of selected component (if applicable):
./oc-mirror.latest version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202406131541.p0.g157eb08.assembly.stream.el9-157eb08", GitCommit:"157eb085db0ca66fb689220119ab47a6dd9e1233", GitTreeState:"clean", BuildDate:"2024-06-13T17:25:46Z", GoVersion:"go1.22.1 (Red Hat 1.22.1-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
1) Set registry on the ocp cluster; 2) do mirror2disk + disk2mirror with following isc: apiVersion: mirror.openshift.io/v2alpha1 kind: ImageSetConfiguration mirror: additionalImages: - name: quay.io/openshifttest/bench-army-knife@sha256:078db36d45ce0ece589e58e8de97ac1188695ac155bc668345558a8dd77059f6 platform: channels: - name: stable-4.15 type: ocp minVersion: '4.15.10' maxVersion: '4.15.11' graph: true operators: - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15 packages: - name: elasticsearch-operator 3) set ~/.config/containers/registries.conf [[registry]] location = "quay.io" insecure = false blocked = false mirror-by-digest-only = false prefix = "" [[registry.mirror]] location = "my-route-testzy.apps.yinzhou-619.qe.devcluster.openshift.com" insecure = false 4) use the isc from step 2 and mirror2disk with different dir: `./oc-mirror.latest -c config-19.yaml --v2 file://disk-enc1`
Actual results:
./oc-mirror.latest -c config-19.yaml --v2 file://disk-enc1
2024/06/19 12:22:38 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/06/19 12:22:38 [INFO] : 👋 Hello, welcome to oc-mirror
2024/06/19 12:22:38 [INFO] : ⚙️ setting up the environment for you...
2024/06/19 12:22:38 [INFO] : 🔀 workflow mode: mirrorToDisk
I0619 12:22:38.832303 66173 client.go:44] Usage of the UPDATE_URL_OVERRIDE environment variable is unsupported
2024/06/19 12:22:38 [INFO] : 🕵️ going to discover the necessary images...
2024/06/19 12:22:38 [INFO] : 🔍 collecting release images...
Expected results:
Give clear information to clarify the UPDATE_URL_OVERRIDE environment variable. Slack discussion is here: https://redhat-internal.slack.com/archives/C050P27C71S/p1718800641718869?thread_ts=1718175617.310629&cid=C050P27C71S
The CPO reconciliation aborts when the OIDC/LDAP IDP validation check fails, and this results in a failure to reconcile any components that are reconciled after that point in the code.
This failure should not be fatal to the CPO reconcile and should likely be reported as a condition on the HC instead (see the sketch after the references below).
xref
Customer incident
https://issues.redhat.com/browse/OCPBUGS-38071
RFE for bypassing the check
https://issues.redhat.com/browse/RFE-5638
PR to proxy the IDP check through the data plane network
https://github.com/openshift/hypershift/pull/4273
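A minimal sketch of the non-fatal pattern described above, using the standard apimachinery condition helpers. The condition type and reason names are hypothetical, not the actual HyperShift condition names:
~~~
package example

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// reportIDPValidation records the IDP check result as a status condition
// instead of aborting the reconcile, so later components still get reconciled.
func reportIDPValidation(conditions *[]metav1.Condition, generation int64, validationErr error) {
	cond := metav1.Condition{
		Type:               "ValidIDPConfiguration", // hypothetical condition type
		Status:             metav1.ConditionTrue,
		Reason:             "IDPConfigurationValid", // hypothetical reason
		ObservedGeneration: generation,
	}
	if validationErr != nil {
		cond.Status = metav1.ConditionFalse
		cond.Reason = "IDPValidationFailed"
		cond.Message = validationErr.Error()
	}
	meta.SetStatusCondition(conditions, cond)
	// The caller would continue reconciling the remaining components
	// regardless of validationErr.
}
~~~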
This is a feature request. Sorry, I couldn't find anywhere else to file it. Our team can also potentially implement this feature, so really we're looking for design input before possibly submitting a PR.
User story:
As a user of on-prem OpenShift, I need to manage DNS for my OpenShift cluster manually. I can already specify an IP address for the API server, but I cannot do this for Ingress. This means that I have to:
I would like to simplify this workflow to:
Implementation suggestion:
Our specific target is OpenStack. We could add `OpenStackLoadBalancerParameters` to `ProviderLoadBalancerParameters`, but the parameter we would be adding is `loadBalancerIP`. This isn't OpenStack-specific. For example, it would be equally applicable to users of either OpenStack's built-in Octavia load balancer, or MetalLB, both of which may reasonably be deployed on OpenStack.
I suggest adding an optional LoadBalancerIP to LoadBalancerStrategy here: https://github.com/openshift/cluster-ingress-operator/blob/8252ac492c04d161fbcf60ef82af2989c99f4a9d/vendor/github.com/openshift/api/operator/v1/types_ingress.go#L395-L440
This would be used to pre-populate spec.loadBalancerIP when creating the Service for the default router.
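A rough sketch of what the proposed field might look like on the existing LoadBalancerStrategy type; the field name, JSON tag, and markers are assumptions for discussion, not a committed API design:
~~~
package example

// LoadBalancerStrategy holds parameters for a load balancer (abridged sketch).
type LoadBalancerStrategy struct {
	// ... existing fields such as Scope and ProviderParameters ...

	// loadBalancerIP is an optional IP address to request for the LoadBalancer
	// Service created for the default router. When set, it would be copied into
	// the Service's spec.loadBalancerIP. Whether the address is honored depends
	// on the underlying implementation (e.g. Octavia or MetalLB).
	// +optional
	LoadBalancerIP string `json:"loadBalancerIP,omitempty"`
}
~~~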
Doc links on the list page seem wrong; some are linking to https://docs.openshift.com/dedicated/, while they should have links like
https://access.redhat.com/documentation/en-us/openshift_container_platform/4.17/html/building_applications/deployments
The list of known plugin names for telemetry does not include kuadrant-console-plugin, which is a Red Hat maintained plugin.
Description of problem:
In upstream and downstream automation testing, we see occasional failures coming from monitoring-plugin For example: Check JUnit report for https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_console/14468/pull-ci-openshift-console-master-e2e-gcp-console/1856100921105190912 Check JUnit report for https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_console/14475/pull-ci-openshift-console-master-e2e-gcp-console/1856095554396753920 Check screenshot when visiting /monitoring/alerts https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.18-amd64-nightly-gcp-upi-f7-ui/1855343143403130880/artifacts/gcp-upi-f7-ui/cucushift-e2e/artifacts/ui1/embedded_files/2024-11-09T22:21:41+00:00-screenshot.png
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-11-144244
How reproducible:
more reproducible in automation testing
Steps to Reproduce:
Actual results:
runtime errors
Expected results:
no errors
Additional info:
This is to track the permanent solution for https://issues.redhat.com/browse/OCPBUGS-38289 for >= 4.18, as the field can be set via the Prometheus CR now.
Description of problem:
Log in to the admin console as a normal user; there is a "User workload notifications" option in the "Notifications" menu on the "User Preferences" page. It's not necessary, since normal users have no permission to get alerts.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-05-23-103225
How reproducible:
Always
Steps to Reproduce:
1.Login on admin console with normal user, go to "User Preferences" page. 2.Click "Notifications" menu, check/uncheck "Hide user workload notifications" for "User workload notifications" 3.
Actual results:
2. User could set the option.
Expected results:
3. It's better not to show the option for "User workload notifications", since normal users cannot get alerts and there is no Notification Drawer on the masthead.
Additional info:
Screenshots: https://drive.google.com/drive/folders/15_qGw1IkbK1_rIKNiageNlYUYKTrsdKp?usp=share_link
Description of problem:
The pinned images functionality is not working
Version-Release number of selected component (if applicable):
IPI on AWS version: $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.18.0-0.nightly-2024-10-28-052434 True False 6h46m Cluster version is 4.18.0-0.nightly-2024-10-28-052434
How reproducible:
Always
Steps to Reproduce:
1. Enable techpreview 2. Create a pinnedimagesets resource $ oc create -f - << EOF apiVersion: machineconfiguration.openshift.io/v1alpha1 kind: PinnedImageSet metadata: labels: machineconfiguration.openshift.io/role: worker name: tc-73623-worker-pinned-images spec: pinnedImages: - name: "quay.io/openshifttest/busybox@sha256:0415f56ccc05526f2af5a7ae8654baec97d4a614f24736e8eef41a4591f08019" - name: quay.io/openshifttest/alpine@sha256:be92b18a369e989a6e86ac840b7f23ce0052467de551b064796d67280dfa06d5 EOF
Actual results:
The images are not pinned and the pool is degraded We can see these logs in the MCDs I1028 14:26:32.514096 2341 pinned_image_set.go:304] Reconciling pinned image set: tc-73623-worker-pinned-images: generation: 1 E1028 14:26:32.514183 2341 pinned_image_set.go:240] failed to get image status for "quay.io/openshifttest/busybox@sha256:0415f56ccc05526f2af5a7ae8654baec97d4a614f24736e8eef41a4591f08019": rpc error: code = Unavailable desc = name resolver error: produced zero addresses And we can see the machineconfignodes resources reporting pinnedimagesets degradation: - lastTransitionTime: "2024-10-28T14:27:58Z" message: 'failed to get image status for "quay.io/openshifttest/busybox@sha256:0415f56ccc05526f2af5a7ae8654baec97d4a614f24736e8eef41a4591f08019": rpc error: code = Unavailable desc = name resolver error: produced zero addresses' reason: PrefetchFailed status: "True" type: PinnedImageSetsDegraded
Expected results:
The images should be pinned without errors.
Additional info:
Slack conversation: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1730125766377509 This is Sam's guess (thank you [~sbatschelet] for your quick help, I really appreciate it): My guess is that it is related to https://github.com/openshift/machine-config-operator/pull/4629 Specifically the changes to pkg/daemon/cri/cri.go where we swapped out DialContext for NewClient. Per docs. One subtle difference between NewClient and Dial and DialContext is that the former uses "dns" as the default name resolver, while the latter use "passthrough" for backward compatibility. This distinction should not matter to most users, but could matter to legacy users that specify a custom dialer and expect it to receive the target string directly.
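For illustration only, a hedged sketch of the resolver difference described above, assuming a grpc-go version that provides grpc.NewClient; the real MCO/CRI client code (and its unix-socket dialing) is not reproduced here:
~~~
package example

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// newConn shows one way to keep the legacy behavior when moving from
// grpc.DialContext to grpc.NewClient: NewClient resolves targets with the
// "dns" resolver by default, while prefixing "passthrough:///" hands the
// address to the dialer unchanged, as DialContext used to do.
func newConn(target string) (*grpc.ClientConn, error) {
	return grpc.NewClient("passthrough:///"+target,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
}
~~~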
Description of problem:
Remove the extra '.' from the below INFO message when running the add-nodes workflow: INFO[2024-08-15T12:48:45-04:00] Generated ISO at ../day2-worker-4/node.x86_64.iso.. The ISO is valid up to 2024-08-15T16:48:00Z
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Run oc adm node-image create command to create a node iso 2. See the INFO message at the end 3.
Actual results:
INFO[2024-08-15T12:48:45-04:00] Generated ISO at ../day2-worker-4/node.x86_64.iso.. The ISO is valid up to 2024-08-15T16:48:00Z
Expected results:
INFO[2024-08-15T12:48:45-04:00] Generated ISO at ../day2-worker-4/node.x86_64.iso. The ISO is valid up to 2024-08-15T16:48:00Z
Additional info:
Please review the following PR: https://github.com/openshift/installer/pull/8957
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
TRT has detected a consistent long term trend where the oauth-apiserver appears to have more disruption than it did in 4.16, for minor upgrades on azure.
The problem appears roughly above the 90th percentile; we picked it up at P95, where it shows a consistent 5-8s more disruption than we'd expect given the data at 4.16 GA.
The problem hits ONLY oauth, affecting both new and reused connections as well as the cached variants, meaning etcd should be out of the picture. You'll see a few very short blips where all four of these backends lose connectivity for ~1s throughout the run, several times over. It looks like it may be correlated to the oauth-operator reporting:
source/OperatorAvailable display/true condition/Available reason/APIServices_Error status/False APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https" [2s]
Sample jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669509775822848
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669509775822848/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122854.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827077182283845632
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827077182283845632/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240823-212127.json&overrideDisplayFlag=1&selectedSources=OperatorDegraded&selectedSources=EtcdLog&selectedSources=Disruption&selectedSources=E2EFailed
More can be found using the first link to the dashboard in this post and scrolling down to most recent job runs, and looking for high numbers.
The operator degraded signal is probably the strongest symptom to pursue, as it appears in most of the above.
If you find any runs where other backends are disrupted, especially kube-api, I would suggest ignoring those as they are unlikely to be the same fingerprint as the error being described here.
Description of problem:
Before the kubelet systemd service runs the kubelet binary it calls the restorecon command: https://github.com/openshift/machine-config-operator/blob/master/templates/worker/01-worker-kubelet/on-prem/units/kubelet.service.yaml#L13 But the restorecon command expects a path to be given; providing a path is mandatory, see the man page: https://linux.die.net/man/8/restorecon At the moment the command does nothing and the error is swallowed due to the dash (-) at the beginning of the command. This results in files that are labeled with wrong SELinux labels. For example: after https://github.com/containers/container-selinux/pull/329 got merged, /var/lib/kubelet/pod-resources/* is expected to be running with the kubelet_var_lib_t label, but it's not; it's running with the old label, container_var_lib_t.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
1. Check the SELinux labels of files under the system with ls -Z command.
Actual results:
Files are labeled with wrong SELinux labels
Expected results:
Files' SELinux labels are supposed to match their configuration as captured in the container-selinux package.
Additional info:
Description of problem:
We have an OKD 4.12 cluster which has persistent and increasing ingresswithoutclassname alerts with no ingresses normally present in the cluster. I believe the ingress without a classname being counted is created as part of the ACME validation process managed by the cert-manager operator with its OpenShift Route addon, and is torn down once the ACME validation is complete.
Version-Release number of selected component (if applicable):
4.12.0-0.okd-2023-04-16-041331
How reproducible:
Seems very consistent. It went away during an update but came back shortly after and continues to increase.
Steps to Reproduce:
1. create ingress w/o classname 2. see counter increase 3. delete classless ingress 4. counter does not decrease.
Additional info:
https://github.com/openshift/cluster-ingress-operator/issues/912
Please review the following PR: https://github.com/openshift/csi-operator/pull/241
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-operator/pull/269
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The reality is that a lot of bare-metal clusters end up using platform=none. For example, SNOs only have this platform value, so SNO users can never use a provisioning network (and thus any hardware that does not support virtual media). UPI and UPI-like clusters are by definition something that operators configure for themselves, so locking them out of features makes even less sense.
With OpenStack now being based on OCP, I expect to see a sharp increase in complaints about this topic.
Add e2e tests to verify that the "Show deprecated operators in OperatorHub" work functions as expected.
Open question:
What kind of tests would be most appropriate for this situation, considering the dependencies required for end-to-end (e2e) tests?
Dependencies:
AC:
In https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-shiftstack-ci-release-4.18-e2e-openstack-ovn-etcd-scaling/1834144693181485056 I noticed the following panic:
Undiagnosed panic detected in pod expand_less 0s { pods/openshift-monitoring_prometheus-k8s-1_prometheus_previous.log.gz:ts=2024-09-12T09:30:09.273Z caller=klog.go:124 level=error component=k8s_client_runtime func=Errorf msg="Observed a panic: &runtime.TypeAssertionError{_interface:(*abi.Type)(0x3180480), concrete:(*abi.Type)(0x34a31c0), asserted:(*abi.Type)(0x3a0ac40), missingMethod:\"\"} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Node)\ngoroutine 13218 [running]:\nk8s.io/apimachinery/pkg/util/runtime.logPanic({0x32f1080, 0xc05be06840})\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x90\nk8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc010ef6000?})\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b\npanic({0x32f1080?, 0xc05be06840?})\n\t/usr/lib/golang/src/runtime/panic.go:770 +0x132\ngithub.com/prometheus/prometheus/discovery/kubernetes.NewEndpoints.func11({0x34a31c0?, 0xc05bf3a580?})\n\t/go/src/github.com/prometheus/prometheus/discovery/kubernetes/endpoints.go:170 +0x4e\nk8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/client-go/tools/cache/controller.go:253\nk8s.io/client-go/tools/cache.(*processorListener).run.func1()\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/client-go/tools/cache/shared_informer.go:977 +0x9f\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00fc92f70, {0x456ed60, 0xc031a6ba10}, 0x1, 0xc015a04fc0)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf\nk8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc011678f70, 0x3b9aca00, 0x0, 0x1, 0xc015a04fc0)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f\nk8s.io/apimachinery/pkg/util/wait.Until(...)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161\nk8s.io/client-go/tools/cache.(*processorListener).run(0xc04c607440)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/client-go/tools/cache/shared_informer.go:966 +0x69\nk8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x52\ncreated by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 12933\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73\n"}
This issue seems relatively common on OpenStack; these runs seem to very frequently hit this failure.
Linked test name: Undiagnosed panic detected in pod
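The panic is the classic informer-delete pitfall: an OnDelete handler can receive a cache.DeletedFinalStateUnknown tombstone rather than the typed object. A minimal client-go sketch of the usual handling, not the actual Prometheus patch:
~~~
package example

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// onNodeDelete unwraps a possible DeletedFinalStateUnknown tombstone before
// asserting the concrete type, avoiding the interface-conversion panic.
func onNodeDelete(obj interface{}) {
	node, ok := obj.(*v1.Node)
	if !ok {
		tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
		if !ok {
			return // unexpected type; nothing to do
		}
		node, ok = tombstone.Obj.(*v1.Node)
		if !ok {
			return
		}
	}
	_ = node // handle the deleted Node here
}
~~~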
Description of problem:
Alerts with non-standard severity labels are sent to Telemeter.
Version-Release number of selected component (if applicable):
All supported versions
How reproducible:
Always
Steps to Reproduce:
1. Create an always firing alerting rule with severity=foo. 2. Make sure that telemetry is enabled for the cluster. 3.
Actual results:
The alert can be seen on the telemeter server side.
Expected results:
The alert is dropped by the telemeter allow-list.
Additional info:
Red Hat operators should use standard severities: https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide Looking at the current data, it looks like ~2% of the alerts reported to Telemeter have an invalid severity.
Description of problem:
After upgrading OCP and LSO to version 4.14, elasticsearch pods in the openshift-logging deployment are unable to schedule to their respective nodes and remain Pending, even though the LSO managed PVs are bound to the PVCs. A test pod using a newly created test PV managed by the LSO is able to schedule correctly however.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Consistently
Steps to Reproduce:
1. 2. 3.
Actual results:
Pods consuming previously existing LSO managed PVs are unable to schedule and remain in a Pending state after upgrading OCP and LSO to 4.14.
Expected results:
That pods would be able to consume LSO managed PVs and schedule correctly to nodes.
Additional info:
Description of problem:
When HO is installed without a pull secret, the shared ingress controller fails to create the router pod because the pull secret is missing
Version-Release number of selected component (if applicable):
4.18
How reproducible:
100%
Steps to Reproduce:
1.Install HO without pullsecret 2.Watch HO report error "error":"failed to get pull secret &Secret{ObjectMeta:{ 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [][]},Data:map[string[]byte{},Type:,StringData:map[string]string{},Immutabl:nil,}: Secret \"pull-secret\" not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller. 3. Observe that no Router pod is created in the hypershift sharedingress namespace
Actual results:
Router pod doesn't get created in the hypershift sharedingress namespace
Expected results:
Router pod gets created in the hypershift sharedingress namespace
Additional info:
Description of problem:
The description and name for GCP Pool ID are not consistent. The issue is related to bug https://issues.redhat.com/browse/OCPBUGS-38557
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-08-19-002129
How reproducible:
Always
Steps to Reproduce:
1. Prepare a WI/FI enabled GCP cluster 2. Go to the Web Terminal operator installation page 3. Check the description and name for GCP Pool ID
Actual results:
The description and name for GCP Pool ID are not consistent
Expected results:
The description and name for GCP Pool ID should be consistent
Additional info:
Screenshot: https://drive.google.com/file/d/1PwiH3xk39pGzCgcHPzIHlv3ABzXYqz1O/view?usp=drive_link
When the openshift-install agent wait-for bootstrap-complete command logs the status of the host validations, it logs the same hostname for all validations, regardless of which host they apply to. This makes it impossible for the user to determine which host needs remediation when a validation fails.
Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/68
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
This is a spinoff of https://issues.redhat.com/browse/OCPBUGS-38012. For additional context please see that bug. The TLDR is that Restart=on-failure for oneshot units were only supported in systemd v244 and onwards, meaning any bootimage for 4.12 and previous doesn't support this on firstboot, and upgraded clusters would no longer be able to scale nodes if it references any such service. Right now this is only https://github.com/openshift/machine-config-operator/blob/master/templates/common/openstack/units/afterburn-hostname.service.yaml#L16-L24 which isn't covered by https://issues.redhat.com/browse/OCPBUGS-38012
Version-Release number of selected component (if applicable):
4.16 right now
How reproducible:
Uncertain, but https://issues.redhat.com/browse/OCPBUGS-38012 is 100%
Steps to Reproduce:
1.install old openstack cluster 2.upgrade to 4.16 3.attempt to scale node
Actual results:
Expected results:
Additional info:
Description of problem:
panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x1a774eb]goroutine 11358 [running]: testing.tRunner.func1.2({0x1d3d600, 0x3428a50}) /usr/lib/golang/src/testing/testing.go:1631 +0x24a testing.tRunner.func1() /usr/lib/golang/src/testing/testing.go:1634 +0x377 panic({0x1d3d600?, 0x3428a50?}) /usr/lib/golang/src/runtime/panic.go:770 +0x132 github.com/openshift/cluster-ingress-operator/test/e2e.updateDNSConfig(...) /go/src/github.com/openshift/cluster-ingress-operator/test/e2e/dns_ingressdegrade_test.go:89 github.com/openshift/cluster-ingress-operator/test/e2e.TestIngressStatus(0xc000511380) /go/src/github.com/openshift/cluster-ingress-operator/test/e2e/dns_ingressdegrade_test.go:53 +0x34b testing.tRunner(0xc000511380, 0x218c9f8) /usr/lib/golang/src/testing/testing.go:1689 +0xfb created by testing.(*T).Run in goroutine 11200 /usr/lib/golang/src/testing/testing.go:1742 +0x390 FAIL github.com/openshift/cluster-ingress-operator/test/e2e 1612.553s FAIL make: *** [Makefile:56: test-e2e] Error 1
Version-Release number of selected component (if applicable):
master
How reproducible:
run the cluster-ingress-operator e2e tests against the OpenStack platform.
Steps to Reproduce:
1. 2. 3.
Actual results:
the nil pointer error
Expected results:
no error
Additional info:
Description of problem:
- One node [rendezvous] failed to be added to the cluster and there are some pending CSRs. - omc get csr NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION csr-44qjs 21m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-9n9hc 5m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-9xw24 1h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-brm6f 1h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-dz75g 36m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-l8c7v 1h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-mv7w5 52m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending csr-v6pgd 1h kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending
In order to complete the installation, the customer needs to approve those CSRs manually.
Steps to Reproduce:
agent-based installation.
Actual results:
CSRs are in Pending state.
Expected results:
CSRs should be approved automatically
Additional info:
Logs : https://drive.google.com/drive/folders/1UCgC6oMx28k-_WXy8w1iN_t9h9rtmnfo?usp=sharing
A string comparison is being done with "-eq"; it should be using "=" instead.
[derekh@u07 assisted-installer-agent]$ sudo podman build -f Dockerfile.ocp STEP 1/3: FROM registry.ci.openshift.org/ocp/builder:rhel-9-golang-1.21-openshift-4.16 AS builder STEP 2/3: RUN if [ "$(arch)" -eq "x86_64" ]; then dnf install -y biosdevname dmidecode; fi /bin/sh: line 1: [: x86_64: integer expression expected --> cb5707d9d703 STEP 3/3: RUN if [ "$(arch)" -eq "aarch64" ]; then dnf install -y dmidecode; fi /bin/sh: line 1: [: x86_64: integer expression expected COMMIT --> 0b12a705f47e 0b12a705f47e015f43d7815743f2ad71da764b1358decc151454ec8802a827fc
Please review the following PR: https://github.com/openshift/ironic-rhcos-downloader/pull/99
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-api-provider-ibmcloud/pull/85
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
When discovering ARM hosts and trying to install CNV, I get the following
From inventory, CPU flags are:
cpu":{ "architecture":"aarch64", "count":16, "flags":[ "fp", "asimd", "evtstrm", "aes", "pmull", "sha1", "sha2", "crc32", "atomics", "fphp", "asimdhp", "cpuid", "asimdrdm", "lrcpc", "dcpop", "asimddp", "ssbs" ], "model_name":"Neoverse-N1"
How reproducible:
100%
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
Description of problem:
When a normal user without any projects visits the Networking pages, they are always loading
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-08-130531
How reproducible:
Always
Steps to Reproduce:
1. user without any project visit Services, Routes, Ingresses, NetworkPolicies page 2. 3.
Actual results:
These list pages are always loading
Expected results:
Show the getting started guide and dim the resources list
Additional info:
Description of problem:
The placeholder "Select one or more NetworkAttachmentDefinitions" is highlighted while selecting nad
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
We need to remove the dra_manager_state on kubelet restart to prevent mismatch errors on restart with TechPreview or DevPreview clusters.
failed to run Kubelet: failed to create claimInfo cache: error calling GetOrCreate() on checkpoint state: failed to get checkpoint dra_manager_state: checkpoint is corrupted"
Version-Release number of selected component (if applicable):
4.18+
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
See https://github.com/openshift/console/pull/14030/files/0eba7f7db6c35bbf7bca5e0b8eebd578e47b15cc#r1707020700
The cluster-dns-operator repository vendors controller-runtime v0.17.3, which uses Kubernetes 1.29 packages. The cluster-dns-operator repository also vendors k8s.io/* v0.29.2 packages. However, OpenShift 4.17 is based on Kubernetes 1.30.
4.17.
Always.
Check https://github.com/openshift/cluster-dns-operator/blob/release-4.17/go.mod.
The sigs.k8s.io/controller-runtime package is at v0.17.3, and the k8s.io/* packages are at v0.29.2.
The sigs.k8s.io/controller-runtime package is at v0.18.0 or newer, and the k8s.io/* packages are at v0.30.0 or newer.
The controller-runtime v0.18 release includes some breaking changes; see the release notes at https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.18.0.
Description of problem:
No pagination on the NetworkPolicies table list
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-09-212926 4.17.0-0.nightly-2024-09-09-120947
How reproducible:
Always
Steps to Reproduce:
1. Navigate to the Networking -> NetworkPolicies page, create multiple resources, at least more than 20 2. Check the NetworkPolicies table list 3.
Actual results:
No pagination on the table
Expected results:
Add pagination; it could also be controlled by the 'pagination_nav-control' related button/function
Additional info:
Converted the story tracking i18n upload/download routine tasks to a bug so that it could be backported to 4.17, as this latest translations batch contains missing translations, including the ES language for the 4.17 release.
Original story: https://issues.redhat.com/browse/CONSOLE-4238
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Running oc scale on a nodepool fails with 404 not found
Version-Release number of selected component (if applicable):
Latest hypershift operator
How reproducible:
100%
Steps to Reproduce:
Actual results:
Scaling fails
[2024-10-20 22:13:17] + oc scale nodepool/assisted-test-cluster -n assisted-spoke-cluster --replicas=1
[2024-10-20 22:13:17] Error from server (NotFound): nodepools.hypershift.openshift.io "assisted-test-cluster" not found
Expected results:
Scaling succeeds
Additional info:
Discovered in our CI tests beginning October 17th https://prow.ci.openshift.org/job-history/gs/test-platform-results/logs/periodic-ci-openshift-cluster-api-provider-agent-master-e2e-ai-operator-ztp-capi-periodic
Description of problem:
see from screen recording https://drive.google.com/file/d/1LwNdyISRmQqa8taup3nfLRqYBEXzH_YH/view?usp=sharing
dev console, "Observe -> Metrics" tab, input in in the query-browser input text-area, the cursor would focus in the project drop-down list, this issue exists in 4.17.0-0.nightly-2024-08-19-165854 and 4.18.0-0.nightly-2024-08-19-002129, no such issue with admin console
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-19-165854 and 4.18.0-0.nightly-2024-08-19-002129
How reproducible:
always
Steps to Reproduce:
1. see the description
Actual results:
cursor would focus in the project drop-down
Expected results:
cursor should not move
Additional info:
Description of the problem:
[Staging] BE 2.35.0, UI 2.34.2 - User is not able to select ODF once CNV is selected, as LVMS is repeatedly enabled
How reproducible:
100%
Steps to reproduce:
1. Create new cluster
2. Select cnv
3. LVMS is enabled; disabling it ends up with it being enabled again
Actual results:
Expected results:
Description of problem:
Customer has a cluster in AWS that was born on an old OCP version (4.7) and was upgraded all the way through 4.15. During the lifetime of the cluster they changed the DHCP option in AWS to "domain name". During node provisioning for MachineSet scaling, the Machine can successfully be created at the cloud provider but the Node is never added to the cluster. The CSRs remain pending and do not get auto-approved. This issue is possibly related or similar to the bug fixed via https://issues.redhat.com/browse/OCPBUGS-29290
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
CSRs don't get auto-approved. New nodes have a different domain name when the CSR is approved manually.
Expected results:
CSRs should get approved automatically and the domain name scheme should not change.
Additional info:
Description of problem:
Navigation: Storage -> VolumeSnapshots -> kebab-menu -> Mouse hover on 'Restore as new PVC' Issue: "Volume Snapshot is not Ready" is in English.
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
Steps to Reproduce:
1. Log into webconsole and add "?pseudolocalization=true&lng=en" to URL 2. Navigate to Storage -> VolumeSnapshots -> kebab-menu -> Mouse hover on 'Restore as new PVC' 3. "Volume Snapshot is not Ready" is in English.
Actual results:
Content is not marked for translation
Expected results:
Content should be marked for translation
Additional info:
Reference screenshot added
flowschemas.v1beta3.flowcontrol.apiserver.k8s.io used in manifests/09_flowschema.yaml
Description of problem:
The fix to remove the ssh connection and just add an ssh port test causes a problem with ssh, as the address is not formatted correctly. We see:
level=debug msg=Failed to connect to the Rendezvous Host on port 22: dial tcp: address fd2e:6f44:5dd8:c956::50:22: too many colons in address
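For reference, the usual Go fix for this class of error is to build the address with net.JoinHostPort, which brackets IPv6 literals. A minimal sketch of a port-reachability check along those lines; the installer's actual implementation is not shown here:
~~~
package example

import (
	"net"
	"time"
)

// sshPortReachable reports whether TCP port 22 on host accepts connections.
// net.JoinHostPort produces "[<ipv6>]:22" for IPv6 hosts instead of the
// malformed "<ipv6>:22" that triggers "too many colons in address".
func sshPortReachable(host string) bool {
	conn, err := net.DialTimeout("tcp", net.JoinHostPort(host, "22"), 5*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}
~~~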
Description of problem:
The CSS of some components isn't loading properly (Banner, Jumplinks)
See screenshot: https://photos.app.goo.gl/2Z1cK5puufGBVBcu5
On the screen cast, ex-aao in namespace default is a banner, and should look like: https://photos.app.goo.gl/n4LUgrGNzQT7n1Pr8
The vertical jumplinks should look like: https://photos.app.goo.gl/8GAs71S43PnAS7wH7
You can test our plugin: https://github.com/artemiscloud/activemq-artemis-self-provisioning-plugin/pull/278
1. yarn
2. yarn start
3. navigate to http://localhost:9000/k8s/ns/default/add-broker
Description of problem:
Customers are unable to scale up OCP nodes when the initial setup is done with OCP 4.8/4.9 and then upgraded to 4.15.22/4.15.23. At first the customer observed that the node scale-up failed and /etc/resolv.conf was empty on the nodes. As a workaround, the customer copied the resolv.conf content from a correct resolv.conf, after which setup of the new node continued. They then inspected the rendered MachineConfig assembled from 00-worker and suspected that something was wrong with the on-prem-resolv-prepender.service definition. As a workaround, the customer manually changed this service definition, which allowed them to scale up new nodes.
Version-Release number of selected component (if applicable):
4.15 , 4.16
How reproducible:
100%
Steps to Reproduce:
1. Install OCP vSphere IPI cluster version 4.8 or 4.9 2. Check "on-prem-resolv-prepender.service" service definition 3. Upgrade it to 4.15.22 or 4.15.23 4. Check if the node scaling is working 5. Check "on-prem-resolv-prepender.service" service definition
Actual results:
Unable to scale up nodes with the default service definition. After manually making changes to the service definition, scaling works.
Expected results:
Node scaling should work without making any manual changes to the service definition.
Additional info:
on-prem-resolv-prepender.service content on clusters built with 4.8/4.9 and then upgraded to 4.15.22/4.15.23:
~~~
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service
[Service]
Type=oneshot
Restart=on-failure
RestartSec=10
StartLimitIntervalSec=0
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
~~~
After manually correcting the service definition as below, scaling works on 4.15.22 / 4.15.23:
~~~
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service
StartLimitIntervalSec=0 -----------> this
[Service]
Type=oneshot
#Restart=on-failure -----------> this
RestartSec=10
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
~~~
Below is the on-prem-resolv-prepender.service on a freshly installed 4.15.23 where scaling is working fine:
~~~
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service
StartLimitIntervalSec=0
[Service]
Type=oneshot
Restart=on-failure
RestartSec=10
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
~~~
Observed this in the rendered MachineConfig which is assembled from 00-worker.
Description of problem:
If the `template:` field in the vSphere platform spec is defined, the installer should not be downloading the OVA
Version-Release number of selected component (if applicable):
4.16.x 4.17.x
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Failed to create NetworkAttachmentDefinition for namespace scoped CRD in layer3
Version-Release number of selected component (if applicable):
4.17
How reproducible:
always
Steps to Reproduce:
1. apply CRD yaml file 2. check the NetworkAttachmentDefinition status
Actual results:
status with error
Expected results:
NetworkAttachmentDefinition has been created
Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/275
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
With the newer azure-sdk-for-go replacing go-autorest, there was a change to use ClientCertificateCredential that did not include the `SendCertificateChain` option by default that used to be there. The ARO team requires this be set otherwise the 1p integration for SNI will not work. Old version: https://github.com/Azure/go-autorest/blob/f7ea664c9cff3a5257b6dbc4402acadfd8be79f1/autorest/adal/token.go#L262-L264 New version: https://github.com/openshift/installer-aro/pull/37/files#diff-da950a4ddabbede621d9d3b1058bb34f8931c89179306ee88a0e4d76a4cf0b13R294
Version-Release number of selected component (if applicable):
This was introduced in the OpenShift installer PR: https://github.com/openshift/installer/pull/6003
How reproducible:
Every time we authenticate using SNI in Azure.
Steps to Reproduce:
1. Configure a service principal in the Microsoft tenant using SNI 2. Attempt to run the installer using client-certificate credentials to install a cluster with credentials mode in manual
Actual results:
Installation fails as we're unable to authenticate using SNI.
Expected results:
We're able to authenticate using SNI.
Additional info:
This should not have any affect on existing non-SNI based authentication methods using client certificate credentials. It was previously set in autorest for golang, but is not defaulted to in the newer azure-sdk-for-go. Note that only first party Microsoft services will be able to leverage SNI in Microsoft tenants. The test case for this on the installer side would be to ensure it doesn't break manual credential mode installs using a certificate pinned to a service principal.
All we would need changed is to pass the `SendCertificateChain: true` option only on client certificate credentials. Ideally we could backport this as well to all OpenShift versions which received the migration from AAD to Microsoft Graph changes.
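A minimal sketch of the requested change with the newer azidentity package, assuming the certificate chain and key are already loaded; the installer's actual call site is not reproduced here:
~~~
package example

import (
	"crypto"
	"crypto/x509"

	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
)

// newSNICredential builds a client-certificate credential that also sends the
// certificate chain (x5c header), which first-party SNI authentication needs.
func newSNICredential(tenantID, clientID string, certs []*x509.Certificate, key crypto.PrivateKey) (*azidentity.ClientCertificateCredential, error) {
	return azidentity.NewClientCertificateCredential(tenantID, clientID, certs, key,
		&azidentity.ClientCertificateCredentialOptions{
			// go-autorest used to send the chain implicitly; the new SDK
			// defaults this to false, so it must be set explicitly.
			SendCertificateChain: true,
		})
}
~~~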
Description of problem:
When the image from a build is rolling out on the nodes, the update progress on the node is not displaying correctly.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Enable OCL functionality 2. Opt the pool in by MachineOSConfig 3. Wait for the image to build and roll out 4. Track mcp update status by oc get mcp
Actual results:
The MCP starts with 0 ready nodes. While 1-2 nodes have already been updated, the count still remains 0. The count jumps to 3 only when all the nodes are ready.
Expected results:
The update progress should be reflected in the mcp status correctly.
Additional info:
Please review the following PR: https://github.com/openshift/baremetal-runtimecfg/pull/333
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
CAPA is leaking one EIP in the bootstrap life cycle when creating clusters on 4.16+ with a BYO IPv4 Pool in the config. The install logs show the message about the duplicated EIP; there is a kind of race condition where the EIP is created and an association is attempted while the instance isn't ready (Running state): ~~~ time="2024-05-08T15:49:33-03:00" level=debug msg="I0508 15:49:33.785472 2878400 recorder.go:104] \"Failed to associate Elastic IP for \\\"ec2-i-03de70744825f25c5\\\": InvalidInstanceID: The pending instance 'i-03de70744825f25c5' is not in a valid state for this operation.\\n\\tstatus code: 400, request id: 7582391c-b35e-44b9-8455-e68663d90fed\" logger=\"events\" type=\"Warning\" object=[...]\"name\":\"mrb-byoip-32-kbcz9\",\"[...] reason=\"FailedAssociateEIP\"" time="2024-05-08T15:49:33-03:00" level=debug msg="E0508 15:49:33.803742 2878400 controller.go:329] \"Reconciler error\" err=<" time="2024-05-08T15:49:33-03:00" level=debug msg="\tfailed to reconcile EIP: failed to associate Elastic IP \"eipalloc-08faccab2dbb28d4f\" to instance \"i-03de70744825f25c5\": InvalidInstanceID: The pending instance 'i-03de70744825f25c5' is not in a valid state for this operation." ~~~ The EIP is deleted when the bootstrap node is removed after a successful installation, although the bug impacts any new machine with a public IP set using BYO IPv4 provisioned by CAPA. An upstream issue has been opened: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/5038
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always
Steps to Reproduce:
1. create install-config.yaml setting platform.aws.publicIpv4Pool=poolID 2. create cluster 3. check the AWS Console, EIP page filtering by your cluster, you will see the duplicated EIP, while only one is associated to the correct bootstrap instance
Actual results:
Expected results:
- installer/capa creates only one EIP for bootstrap when provisioning the cluster - no error messages for expected behavior (ec2 association errors in pending state)
Additional info:
CAPA issue: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/5038
openshift/api was bumped in CNO without running codegen; codegen needs to be run.
Please review the following PR: https://github.com/openshift/configmap-reload/pull/64
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
For some provisioners, the access modes are not correct. It would be good if someone from the storage team could confirm the access mode values in https://github.com/openshift/console/blob/master/frontend/public/components/storage/shared.ts#L107
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-12-101500
How reproducible:
Always
Steps to Reproduce:
1. setup a cluster in GCP, check storageclasses $ oc get sc NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE ssd-csi pd.csi.storage.gke.io Delete WaitForFirstConsumer true 5h37m standard-csi (default) pd.csi.storage.gke.io Delete WaitForFirstConsumer true 5h37m 2. goes to PVC creation page, choose any storageclass in the dropdown and check `Access mode` list
Actual results:
there is only `RWO` access mode
Expected results:
pd.csi.storage.gke.io supports both RWO and RWOP access modes; see the supported access modes reference at https://docs.openshift.com/container-platform/4.15/storage/understanding-persistent-storage.html#pv-access-modes_understanding-persistent-storage
Additional info:
The fields for `last_installation_preparation_status` for a cluster are currently reset when the user sends a request to `install` the cluster.
In the case that multiple requests are received, this can lead to this status being illegally cleared when it should not be.
It is safer to move this to the state machine where it can be ensured that states have changed in the correct way prior to the reset of this field.
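A rough sketch of the idea with hypothetical names: the reset happens inside the transition handler that the state machine invokes once the installing transition has been accepted, rather than in the HTTP handler that receives the install request, so a rejected duplicate request can no longer clear the previous status:
~~~
package example

// cluster is a stand-in for the assisted-service cluster model; all field and
// function names here are illustrative only.
type cluster struct {
	Status                            string
	LastInstallationPreparationStatus string
	LastInstallationPreparationReason string
}

// onPreparingForInstallation would run only after the state machine has
// validated and accepted the transition, which is the safe point to reset
// the preparation status fields.
func onPreparingForInstallation(c *cluster) {
	c.LastInstallationPreparationStatus = ""
	c.LastInstallationPreparationReason = ""
	c.Status = "preparing-for-installation"
}
~~~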
Description of problem:
L3 Egress traffic from pod in segmented network does not work.
Version-Release number of selected component (if applicable):
build openshift/ovn-kubernetes#2274,openshift/api#2005
oc version
Client Version: 4.15.9 Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3 Server Version: 4.17.0-0.ci.test-2024-08-28-123437-ci-ln-v5g4wb2-latest Kubernetes Version: v1.30.3-dirty
How reproducible:
Always
Steps to Reproduce:
1. Create a cluster UPI GCP with build from cluster bot
2. Create a namespace test with a NAD as below
oc -n test get network-attachment-definition l3-network-nad -oyaml
apiVersion: k8s.cni.cncf.io/v1 kind: NetworkAttachmentDefinition metadata: creationTimestamp: "2024-08-28T17:44:14Z" generation: 1 name: l3-network-nad namespace: test resourceVersion: "108224" uid: 5db4ca26-39dd-45b7-8016-215664e21f5d spec: config: | { "cniVersion": "0.3.1", "name": "l3-network", "type": "ovn-k8s-cni-overlay", "topology":"layer3", "subnets": "10.150.0.0/16", "mtu": 1300, "netAttachDefName": "test/l3-network-nad", "role": "primary" }
3. Create a pod in the segmented namespace test
oc -n test exec -it hello-pod -- ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0@if25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1360 qdisc noqueue state UP group default link/ether 0a:58:0a:83:00:11 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.131.0.17/23 brd 10.131.1.255 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fe83:11/64 scope link valid_lft forever preferred_lft forever 3: ovn-udn1@if26: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1300 qdisc noqueue state UP group default link/ether 0a:58:0a:96:03:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.150.3.3/24 brd 10.150.3.255 scope global ovn-udn1 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fe96:303/64 scope link valid_lft forever preferred_lft forever
oc -n test exec -it hello-pod -- ip r
default via 10.150.3.1 dev ovn-udn1
10.128.0.0/14 via 10.131.0.1 dev eth0
10.131.0.0/23 dev eth0 proto kernel scope link src 10.131.0.17
10.150.0.0/16 via 10.150.3.1 dev ovn-udn1
10.150.3.0/24 dev ovn-udn1 proto kernel scope link src 10.150.3.3
100.64.0.0/16 via 10.131.0.1 dev eth0
100.65.0.0/16 via 10.150.3.1 dev ovn-udn1
172.30.0.0/16 via 10.150.3.1 dev ovn-udn1
4. Try to curl the IP echo server running outside the cluster to see it fail.
oc -n test exec -it hello-pod -- curl 10.0.0.2:9095 --connect-timeout 5
curl: (28) Connection timeout after 5001 ms command terminated with exit code 28
Actual results:
curl request fails
Expected results:
curl request should pass
Additional info:
The egress from pod in regular namespace works
oc -n test1 exec -it hello-pod -- curl 10.0.0.2:9095 --connect-timeout 5
10.0.128.4
Description of problem:
The catalogsource file generated for mirror2mirror is invalid when using the local cache
Version-Release number of selected component (if applicable):
./oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.18.0-202409091841.p0.g45b1fcd.assembly.stream.el9-45b1fcd", GitCommit:"45b1fcd9df95420d5837dfdd2775891ae3dd6adf", GitTreeState:"clean", BuildDate:"2024-09-09T20:48:47Z", GoVersion:"go1.22.5 (Red Hat 1.22.5-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. run the mirror2mirror command : kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v2alpha1 mirror: operators: - catalog: quay.io/openshifttest/nginxolm-operator-index:mirrortest1 `oc-mirror -c config-head.yaml --workspace file://out-head docker://my-route-zhouy.apps.yinzhou0910.qe.azure.devcluster.openshift.com --v2 --dest-tls-verify=false`
Actual results:
The catalogsource file is invalid and is created twice: 2024/09/10 10:47:35 [INFO] : 📄 Generating CatalogSource file... 2024/09/10 10:47:35 [INFO] : out11re/working-dir/cluster-resources/cs-redhat-operator-index-v4-15.yaml file created 2024/09/10 10:47:35 [INFO] : out11re/working-dir/cluster-resources/cs-redhat-operator-index-v4-15.yaml file created 2024/09/10 10:47:35 [INFO] : mirror time : 1m41.028961606s 2024/09/10 10:47:35 [INFO] : 👋 Goodbye, thank you for using oc-mirror [fedora@preserve-fedora-yinzhou yinzhou]$ ll out11re/working-dir/cluster-resources/ total 8 -rw-r--r--. 1 fedora fedora 242 Sep 10 10:47 cs-redhat-operator-index-v4-15.yaml -rw-r--r--. 1 fedora fedora 289 Sep 10 10:47 idms-oc-mirror.yaml [fedora@preserve-fedora-yinzhou yinzhou]$ cat out11re/working-dir/cluster-resources/cs-redhat-operator-index-v4-15.yaml apiVersion: operators.coreos.com/v1alpha1 kind: CatalogSource metadata: name: cs-redhat-operator-index-v4-15 namespace: openshift-marketplace spec: image: localhost:55000/redhat/redhat-operator-index:v4.15 sourceType: grpc status: {}
Expected results:
The catalogsource file should be created with the registry route, not the local cache
Additional info:
Description of problem:
IDMS is set on HostedCluster and reflected in their respective CR in-cluster. Customers can create, update, and delete these today. In-cluster IDMS has no impact.
Version-Release number of selected component (if applicable):
4.14+
How reproducible:
100%
Steps to Reproduce:
1. Create HCP 2. Create IDMS 3. Observe it does nothing
Actual results:
IDMS doesn't change anything if manipulated in data plane
Expected results:
IDMS either allows updates OR IDMS updates are blocked.
Additional info:
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/304
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When the MachineConfig tab is opened on the console, the below error is displayed: Oh no! Something went wrong Type Error Description: Cannot read properties of undefined (reading 'toString")
Version-Release number of selected component (if applicable):
OCP version 4.17.3
How reproducible:
Every time at the customer's end.
Steps to Reproduce:
1. Go to the console. 2. Under the Compute tab go to the MachineConfig tab.
Actual results:
Oh no! Something went wrong
Expected results:
Should be able to see all the available MachineConfigs.
Additional info:
Description of problem:
When Ingress configuration is specified for a HostedCluster in .spec.configuration.ingress, the configuration never makes it into the hosted cluster because the VAP ingress-config-validation.managed.openshift.io rejects it.
Version-Release number of selected component (if applicable):
4.18 Hosted ROSA
How reproducible:
Always
Steps to Reproduce:
1. Create a hosted cluster in ROSA with spec: configuration: ingress: domain: "" loadBalancer: platform: aws: type: NLB type: AWS 2. Wait for the cluster to come up 3.
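Step 1's flattened spec, re-indented for readability (a sketch; same fields and values as above):

spec:
  configuration:
    ingress:
      domain: ""
      loadBalancer:
        platform:
          aws:
            type: NLB
          type: AWS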
Actual results:
Cluster never finishes applying the payload (reaches Complete) because the console operator fails to reconcile its route.
Expected results:
Cluster finishes applying the payload and reaches Complete
Additional info:
The following error is reported in the hcco log: {"level":"error","ts":"2024-11-12T17:33:09Z","msg":"Reconciler error","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"f4216970-af97-4093-ae72-b7dbe452b767","error":"failed to reconcile global configuration: failed to reconcile ingress config: admission webhook \"ingress-config-validation.managed.openshift.io\" denied the request: Only privileged service accounts may access","errorCauses":[{"error":"failed to reconcile global configuration: failed to reconcile ingress config: admission webhook \"ingress-config-validation.managed.openshift.io\" denied the request: Only privileged service accounts may access"}],"stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:222"}
Description of problem:
Feature https://issues.redhat.com/browse/MGMT-18411 went into assisted-installer v2.34.0 but apparently is not included in any OpenShift version that ABI installation can use.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Went through a loop over the different commits to verify whether this is delivered in any OCP version. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem: https://github.com/openshift/installer/pull/7727 changed the order of some playbooks, and we're expected to run the network.yaml playbook before the metadata.json file is created. This isn't a problem with newer versions of Ansible, which happily ignore missing var_files; however, older Ansible versions fail with:
[cloud-user@installer-host ~]$ ansible-playbook -i "/home/cloud-user/ostest/inventory.yaml" "/home/cloud-user/ostest/network.yaml" PLAY [localhost] ***************************************************************************************************************************************************************************************************************************** ERROR! vars file metadata.json was not found Could not find file on the Ansible Controller. If you are using a module and expect the file to exist on the remote, see the remote_src option
Description of problem:
When "Create NetworkAttachmentDefinition" button is clicked, the app switches to "Administrator" perspective
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Switch to "Virtualization" perspective 2. Navigate to Network -> NetworkAttachmentDefinitions 3. Click "Create NetworkAttachmentDefinition" button
Actual results:
App switches to "Administrator" perspective
Expected results:
App stays in "Virtualization" perspective
Additional info:
Description of the problem:
FYI - OCP 4.12 has reached end of maintenance support; it is now on extended support.
Looks like OCP 4.12 installations started failing lately due to hosts not discovering. For example - https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_assisted-service/6628/pull-ci-openshift-assisted-service-master-edge-e2e-metal-assisted-4-12/1817416612257468416
How reproducible:
Seems like every CI run, haven't tested locally
Steps to reproduce:
Trigger OCP 4.12 installation in the CI
Actual results:
failure, hosts not discovering
Expected results:
Successful cluster installation
Description of problem:
We were told that adding connections to a Transit Gateway also costs an exorbitant amount of money. So the create option tgName now means that we will not clean up those connections during cluster destroy.
Description of problem:
We missed the window to merge the ART 4.17 image PR in time.
Version-Release number of selected component (if applicable):
How reproducible:
Fail to get ART PR merged in time
Steps to Reproduce:
1. Have E2E Tests fail for a while. 2. Go on vacation afterwards.
Actual results:
I got asked about 4.17 OCP images.
Expected results:
I don't get asked about 4.17 OCP images.
Additional info:
Description of problem:
We identified a regression where we can no longer get oauth tokens for HyperShift v4.16 clusters via the OpenShift web console. v4.16.10 works fine, but once clusters are patched to v4.16.16 (or are created at that version) they fail to get the oauth token. This is due to this faulty PR: https://github.com/openshift/hypershift/pull/4496. The oauth openshift deployment was changed and affected the IBM Cloud code path. We need this endpoint to change back to using `socks5`. Bug (diff of the oauth openshift deployment):
< value: socks5://127.0.0.1:8090
---
> value: http://127.0.0.1:8092
98c98
< value: socks5://127.0.0.1:8090
---
> value: http://127.0.0.1:8092
Fix: Change http://127.0.0.1:8092 to socks5://127.0.0.1:8090
Version-Release number of selected component (if applicable):
4.16.16
How reproducible:
Every time.
Steps to Reproduce:
1. Create ROKS v4.16.16 HyperShift-based cluster. 2. Navigate to the OpenShift web console. 2. Click IAM#<username> menu in the top right. 3. Click 'Copy login command'. 4. Click 'Display token'.
Actual results:
Error getting token: Post "https://example.com:31335/oauth/token": http: server gave HTTP response to HTTPS client
Expected results:
The oauth token should be successfully displayed.
Additional info:
Description of problem:
Day2 add-node with the oc binary is not working for ARM64 on baremetal CI runs
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Run a compact agent installation on the arm64 platform 2. After the cluster is ready, run the day2 install 3. The day2 install fails with an error: worker-a-00 is not reachable
Actual results:
Day2 install exit with error.
Expected results:
Day2 install should work
Additional info:
Job link: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/54181/rehearse-54181-periodic-ci-openshift-openshift-tests-private-release-4.17-arm64-nightly-baremetal-compact-abi-ipv4-static-day2-f7/1823641309190033408 Error message from console when running day2 install: rsync: [sender] link_stat "/assets/node.x86_64.iso" failed: No such file or directory (2) command terminated with exit code 23 rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1823) [Receiver=3.2.3] rsync: [Receiver] write error: Broken pipe (32) error: exit status 23 {"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:84","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun","level":"error","msg":"Error executing test process","severity":"error","time":"2024-08-13T14:32:20Z"} error: failed to execute wrapped command: exit status 1
The /boot/efi and /sysroot directories and their subfiles are labeled unlabeled_t
Description of problem:
When adding nodes, the agent-register-cluster.service and start-cluster-installation.service statuses should not be checked; in their place, agent-import-cluster.service and agent-add-node.service should be checked.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
The console message shows that the start-cluster-installation and agent-register-cluster services have not started
Expected results:
The console message shows that the agent import cluster and add host services have started
Additional info:
Please review the following PR: https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/116
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
When the openshift-install agent wait-for bootstrap-complete command cannot connect to either the k8s API or the assisted-service API, it tries to ssh to the rendezvous host to see if it is up.
If there is a running ssh-agent on the local host, we connect to it to make use of its private keys. This is not guaranteed to work, as the private key corresponding to the public key in the agent ISO may not be present on the box.
If there is no running ssh-agent, we use the literal public key as the path to a file that we expect to contain the private key. This is guaranteed not to work.
All of this generates a lot of error messages at DEBUG level that are confusing to users.
If we did succeed in ssh-ing to the host when it has already joined the cluster, the node would end up tainted as a result, which we want to avoid. (This is unlikely in practice though, because by the time the rendezvous host joins, the k8s API should be up so we wouldn't normally run this code at that time.)
We should stop doing all of this, and maybe just ping the rendezvous host to see if it is up.
https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/2781
https://kubernetes.slack.com/archives/CKFGK3SSD/p1704729665056699
https://github.com/okd-project/okd/discussions/1993#discussioncomment-10385535
Description of problem:
INFO Waiting up to 15m0s (until 2:23PM UTC) for machines [vsphere-ipi-b8gwp-bootstrap vsphere-ipi-b8gwp-master-0 vsphere-ipi-b8gwp-master-1 vsphere-ipi-b8gwp-master-2] to provision... E0819 14:17:33.676051 2162 session.go:265] "Failed to keep alive govmomi client, Clearing the session now" err="Post \"https://vctest.ars.de/sdk\": context canceled" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local" E0819 14:17:33.708233 2162 session.go:295] "Failed to keep alive REST client" err="Post \"https://vctest.ars.de/rest/com/vmware/cis/session?~action=get\": context canceled" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local" I0819 14:17:33.708279 2162 session.go:298] "REST client session expired, clearing session" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local"
Description of problem:
Even though fakefish is not a supported redfish interface, it is very useful to have it working for "special" scenarios, like NC-SI, while its support is implemented. On OCP 4.14 and later, converged flow is enabled by default, and on this configuration Ironic sends a soft power_off command to the ironic agent running on the ramdisk. Since this power operation is not going through the redfish interface, it is not processed by fakefish, preventing it from working on some NC-SI configurations, where a full power-off would mean the BMC loses power. Ironic already supports using out-of-band power off for the agent [1], so having an option to use it would be very helpful. [1]- https://opendev.org/openstack/ironic/commit/824ad1676bd8032fb4a4eb8ffc7625a376a64371
Version-Release number of selected component (if applicable):
Seen with OCP 4.14.26 and 4.14.33, expected to happen on later versions
How reproducible:
Always
Steps to Reproduce:
1. Deploy SNO node using ACM and fakefish as redfish interface 2. Check metal3-ironic pod logs
Actual results:
We can see a soft power_off command sent to the ironic agent running on the ramdisk: 2024-08-07 15:00:45.545 1 DEBUG ironic.drivers.modules.agent_client [None req-74c0c3ed-011f-4718-bdce-53f2ba412e85 - - - - - -] Executing agent command standby.power_off for node df006e90-02ee-4847-b532-be4838e844e6 with params {'wait': 'false', 'agent_token': '***'} _command /usr/lib/python3.9/site-packages/ironic/drivers/modules/agent_client.py:197 2024-08-07 15:00:45.551 1 DEBUG ironic.drivers.modules.agent_client [None req-74c0c3ed-011f-4718-bdce-53f2ba412e85 - - - - - -] Agent command standby.power_off for node df006e90-02ee-4847-b532-be4838e844e6 returned result None, error None, HTTP status code 200 _command /usr/lib/python3.9/site-packages/ironic/drivers/modules/agent_client.py:234
Expected results:
There is an option to prevent this soft power_off command, so all power actions happen via redfish. This would allow fakefish to capture them and behave as needed.
Additional info:
Looks relatively new in serial jobs on aws and vsphere. First occurrence I see is Wednesday at around 5am. It's not every run but it is quite common. (10-20% of the time)
Caught by test: Undiagnosed panic detected in pod
Undiagnosed panic detected in pod expand_less 0s { pods/openshift-ovn-kubernetes_ovnkube-control-plane-558bfbcf78-nfbnw_ovnkube-cluster-manager_previous.log.gz:E1106 08:04:15.797587 1 panic.go:261] "Observed a panic" panic="runtime error: invalid memory address or nil pointer dereference" panicGoValue="\"invalid memory address or nil pointer dereference\"" stacktrace=<}
See component readiness for more runs:
Please review the following PR: https://github.com/openshift/csi-operator/pull/81
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
If an invalid mac address is used in the interfaces table in agent-config.yaml, like this {noformat} - name: eno2 macAddress: 98-BE-94-3F-48-42 {noformat} it results in the failing to register the Infraenv with assisted-service and constant retries {noformat} Aug 28 15:23:37 master0 agent-register-infraenv[4606]: time="2024-08-28T19:23:37Z" level=info msg="Registering infraenv" Aug 28 15:23:37 master0 podman[4572]: time="2024-08-28T19:23:37Z" level=info msg="Reference to cluster id: 1f38e4c9-afde-4ac0-aa32-aabc75ec088a" Aug 28 15:23:37 master0 podman[4572]: time="2024-08-28T19:23:37Z" level=info msg="Registering infraenv" Aug 28 15:23:37 master0 agent-register-infraenv[4606]: time="2024-08-28T19:23:37Z" level=info msg="Added 1 nmstateconfigs" Aug 28 15:23:37 master0 podman[4572]: time="2024-08-28T19:23:37Z" level=info msg="Added 1 nmstateconfigs" Aug 28 15:23:37 master0 agent-register-infraenv[4606]: time="2024-08-28T19:23:37Z" level=fatal msg="Failed to register infraenv with assisted-service: response status code does not match any response statuses defined for this endpoint in the swagger spec (status 422): {}" Aug 28 15:23:37 master0 podman[4572]: time="2024-08-28T19:23:37Z" level=fatal msg="Failed to register infraenv with assisted-service: response status code does not match any response statuses defined for this endpoint in the swagger spec (status 422): {}" {noformat} The error above was in 4.15. In 4.18 I can duplicate it and its only marginally better. There is slightly more info due to an assisted-service change, but same net result of retrying continually on the Registering infraenv" {noformat} Sep 11 20:57:26 master-0 agent-register-infraenv[3013]: time="2024-09-11T20:57:26Z" level=fatal msg="Failed to register infraenv with assisted-service: json: cannot unmarshal number into Go struct field Error.code of type string" Sep 11 20:57:26 master-0 podman[2987]: time="2024-09-11T20:57:26Z" level=fatal msg="Failed to register infraenv with assisted-service: json: cannot unmarshal number into Go struct field Error.code of type string" {noformat}
Version-Release number of selected component (if applicable):
Occurs both in latest 4.18 and in 4.15.26
How reproducible:
Steps to Reproduce:
1. Use an invalid mac address in the interface table like this {noformat} interfaces: - name: eth0 macAddress: 00:59:bd:23:23:8c - name: eno12399np0 macAddress: 98-BE-94-3F-51-33 networkConfig: interfaces: - name: eno12399np0 type: ethernet state: up ipv4: enabled: false dhcp: false ipv6: enabled: false dhcp: false - name: eth0 type: ethernet state: up mac-address: 00:59:bd:23:23:8c ipv4: enabled: true address: - ip: 192.168.111.80 prefix-length: 24 dhcp: false {noformat} 2. Generate the agent ISO 3. Install using the agent ISO, I just did an SNO installation.
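For contrast, a minimal sketch of the interfaces table showing the accepted and rejected MAC formats (names and addresses taken from the step above):

interfaces:
- name: eth0
  macAddress: 00:59:bd:23:23:8c    # colon-separated form, accepted
- name: eno12399np0
  macAddress: 98-BE-94-3F-51-33    # dash-separated form, rejected by assisted-service with a 422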
Actual results:
Install fails with the errors: {noformat} level=debug msg=infraenv is not registered in rest API level=debug msg=infraenv is not registered in rest API level=debug msg=infraenv is not registered in rest API level=debug msg=infraenv is not registered in rest API level=debug msg=infraenv is not registered in rest API level=debug msg=infraenv is not registered in rest API level=debug msg=infraenv is not registered in rest API level=debug msg=infraenv is not registered in rest API {noformat}
Expected results:
The invalid mac address should be detected when creating the ISO image so it can be fixed.
Additional info:
The following test is failing:
[sig-api-machinery] ValidatingAdmissionPolicy [Privileged:ClusterAdmin] should type check a CRD [Suite:openshift/conformance/parallel] [Suite:k8s]
Additional context here:
This was a problem back in 4.16 when the test had Beta in the name. https://issues.redhat.com/browse/OCPBUGS-30767
But the test continues to be quite flaky and we just got unlucky and failed a payload on it.
The failure always seems to be:
{ fail [k8s.io/kubernetes/test/e2e/apimachinery/validatingadmissionpolicy.go:380]: wait for type checking: PatchOptions.meta.k8s.io "" is invalid: fieldManager: Required value: is required for apply patch Error: exit with code 1 Ginkgo exit error 1: exit with code 1}
It often works on a re-try. (flakes)
Something is not quite right either with this test or the product.
Description of problem:
cluster-capi-operator is running its controllers on AzureStackCloud, and it shouldn't, because CAPI is not supported on AzureStackCloud.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of the problem:
When removing a spoke BMH resource from the hub cluster, the node is being shut down. Previously, the BMH was just removed and the node wasn't affected in any way. It seems to be due to new behavior in the BMH finalizer that removes the paused annotation from the BMH.
How reproducible:
100%
Steps to reproduce:
1. Install a spoke cluster
2. Remove one of the spoke cluster BMHs from the hub cluster
Actual results:
The corresponding node is shut down
Expected results:
The corresponding node is not shut down
Description of problem:
Same as the admin console bug OCPBUGS-31931, but on the developer console. On a 4.15.17 cluster, a kubeadmin user goes to the developer console UI, clicks "Observe", selects one project (for example openshift-monitoring), selects the Silences tab, clicks "Create silence"; the Creator field is not auto-filled with the user name. Add a label name/value and a Comment to create the silence.
will see error on page
An error occurred createdBy in body is required
see picture: https://drive.google.com/file/d/1PR64hvpYCC-WOHT1ID9A4jX91LdGG62Y/view?usp=sharing
this issue exists in 4.15/4.16/4.17/4.18, no issue with 4.14
Version-Release number of selected component (if applicable):
4.15.17
How reproducible:
always
Steps to Reproduce:
see the description
Actual results:
The Creator field is not auto-filled with the user name
Expected results:
no error
Additional info:
This action returns an empty/blank page
Description of problem:
Filter dropdown doesn't collapse on second click
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-21-132049
How reproducible:
Always
Steps to Reproduce:
1. Navigate to Workloads -> Pod page 2. Click the 'Filter' dropdown component 3. Click the 'Filter' dropdown again
Actual results:
Compared with OCP 4.17, where the dropdown list could be collapsed by a second click, on OCP 4.18 the dropdown list cannot be collapsed
Expected results:
The dropdown should collapse after a second click
Additional info:
Description of problem:
We should add validation in the Installer when public-only subnets is enabled to make sure that: 1. A warning is printed if OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY is set. 2. Since this flag is only applicable to public clusters, we could consider exiting earlier if publish: Internal. 3. Since this flag is only applicable to byo-vpc configurations, we could consider exiting earlier if no subnets are provided in the install-config. See the sketch below.
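A minimal sketch of the install-config.yaml fields the proposed validations would inspect (region and subnet ID are placeholders; OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY itself is an environment variable, not an install-config field):

publish: External              # public-only is not applicable when publish: Internal
platform:
  aws:
    region: us-east-1          # placeholder region
    subnets:                   # byo-vpc: public-only requires existing subnets here
    - subnet-0123456789abcdef0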
Version-Release number of selected component (if applicable):
all versions that support public-only subnets
How reproducible:
always
Steps to Reproduce:
1. Set OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY 2. Do a cluster install without specifying a VPC. 3.
Actual results:
No warning about the invalid configuration.
Expected results:
Additional info:
This is an internal-only feature, so these validations shouldn't affect the normal path used by customers.
Description of problem:
Create an image pull secret with whitespace at the beginning/end of the username and password, then decode the auth in the secret's '.dockerconfigjson'; it still contains whitespace in the password.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-07-29-134911
How reproducible:
Always
Steps to Reproduce:
1.Create image pull secret with whitespace in the beginning/end of username and password, eg: ' testuser ',' testpassword ' 2.Check on the secret details page, reveal values of ".dockerconfigjson", decode the value of 'auth'. 3.
Actual results:
1. The secret is created. 2. There is no whitespace in the displayed values for username and password, but the decoded 'auth' contains whitespace in the password. $ echo 'dGVzdHVzZXI6ICB0ZXN0cGFzc3dvcmQgIA==' | base64 -d testuser: testpassword
Expected results:
1. The password should not contain whitespace after decoding auth, e.g. testuser:testpassword
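For reference, a sketch of the structure the decoded .dockerconfigjson is expected to hold after trimming, shown as YAML for readability (the registry host is a placeholder; the actual secret stores this structure as JSON):

auths:
  registry.example.com:                    # placeholder registry host
    username: testuser                     # leading/trailing whitespace trimmed
    password: testpassword                 # leading/trailing whitespace trimmed
    auth: dGVzdHVzZXI6dGVzdHBhc3N3b3Jk     # base64("testuser:testpassword"), no embedded whitespace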
Additional info:
Description of problem:
The rails example "rails-postgresql-example" no longer runs successfully, because it references a version of ruby that is not available in the library. This is blocking the release of Samples Operator because we check the validity of the templates shipped with the operator. Rails sample is no longer supported by the Samples Operator but is still shipped in an old version. I.e. we just continue shipping the same old version of the sample across releases. This old version references ruby that is no longer present in the openshift library. There are a couple of ways of solving this problem: 1. Start supporting the Rails sample again in Samples Operator (the Rails examples seem to be maintained and made also available through helm-charts). 2. Remove the test that makes sure rails example is buildable to let the test suite pass. We don't support rails anymore in the Samples Operator so this should not be too surprising. 3. Remove rails from the Samples Operator altogether. This is probably the cleanest solution but most likely requires more work than just removing the sample from the assets of Samples Operator (removing the failing test is the most obvious thing that would break, too). We need to decide ASAP how to proceed to unblock the release of Samples Operator for OCP 4.17.
Version-Release number of selected component (if applicable):
How reproducible:
The Samples Operator testsuite runs these tests and results in a failure like this: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-samples-operator/567/pull-ci-openshift-cluster-samples-operator-master-e2e-aws-ovn-image-ecosystem/1829111792509390848
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
The test in question fails here: https://github.com/openshift/origin/blob/master/test/extended/image_ecosystem/s2i_ruby.go#L59 The line in the test output that stands out: I0829 13:02:24.241018 3111 dump.go:53] At 2024-08-29 13:00:21 +0000 UTC - event for rails-postgresql-example: {buildconfig-controller } BuildConfigInstantiateFailed: error instantiating Build from BuildConfig e2e-test-s2i-ruby-q75fj/rails-postgresql-example (0): Error resolving ImageStreamTag ruby:3.0-ubi8 in namespace openshift: unable to find latest tagged image
Please review the following PR: https://github.com/openshift/csi-operator/pull/271
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-bootstrap/pull/107
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When configuring the OpenShift image registry to use a custom Azure storage account in a different resource group, following the official documentation [1], the image-registry CO degrades and the upgrade from version 4.14.x to 4.15.x fails. The image registry operator reports misconfiguration errors related to Azure storage credentials, preventing the upgrade and causing instability in the control plane.
[1] Configuring registry storage in Azure user infrastructure
Version-Release number of selected component (if applicable):
4.14.33, 4.15.33
How reproducible:
Steps to Reproduce:
We got the error
NodeCADaemonProgressing: The daemon set node-ca is deployed Progressing: Unable to apply resources: unable to sync storage configuration: client misconfigured, missing 'TenantID', 'ClientID', 'ClientSecret', 'FederatedTokenFile', 'Creds', 'SubscriptionID' option(s)
The operator will also generate a new secret image-registry-private-configuration with the same content as image-registry-private-configuration-user
$ oc get secret image-registry-private-configuration -o yaml apiVersion: v1 data: REGISTRY_STORAGE_AZURE_ACCOUNTKEY: xxxxxxxxxxxxxxxxx kind: Secret metadata: annotations: imageregistry.operator.openshift.io/checksum: sha256:524fab8dd71302f1a9ade9b152b3f9576edb2b670752e1bae1cb49b4de992eee creationTimestamp: "2024-09-26T19:52:17Z" name: image-registry-private-configuration namespace: openshift-image-registry resourceVersion: "126426" uid: e2064353-2511-4666-bd43-29dd020573fe type: Opaque
2. then we delete the secret image-registry-private-configuration-user
Now the secret image-registry-private-configuration still exists with the same content, but the image-registry CO reports a new error
NodeCADaemonProgressing: The daemon set node-ca is deployed Progressing: Unable to apply resources: unable to sync storage configuration: failed to get keys for the storage account arojudesa: storage.AccountsClient#ListKeys: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource 'Microsoft.Storage/storageAccounts/arojudesa' under resource group 'aro-ufjvmbl1' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix"
3. Apply the workaround of manually changing the installer-cloud-credentials secret's azure_resourcegroup key to the custom storage account's resource group
$ oc get secret installer-cloud-credentials -o yaml apiVersion: v1 data: azure_client_id: xxxxxxxxxxxxxxxxx azure_client_secret: xxxxxxxxxxxxxxxxx azure_region: xxxxxxxxxxxxxxxxx azure_resource_prefix: xxxxxxxxxxxxxxxxx azure_resourcegroup: xxxxxxxxxxxxxxxxx <<<<<-----THIS azure_subscription_id: xxxxxxxxxxxxxxxxx azure_tenant_id: xxxxxxxxxxxxxxxxx kind: Secret metadata: annotations: cloudcredential.openshift.io/credentials-request: openshift-cloud-credential-operator/openshift-image-registry-azure creationTimestamp: "2024-09-26T16:49:57Z" labels: cloudcredential.openshift.io/credentials-request: "true" name: installer-cloud-credentials namespace: openshift-image-registry resourceVersion: "133921" uid: d1268e2c-1825-49f0-aa44-d0e1cbcda383 type: Opaque
The image registry reports healthy and this helps the upgrade continue
Actual results:
The image registry still seems to use the service principal method for Azure storage account authentication
Expected results:
We expect REGISTRY_STORAGE_AZURE_ACCOUNTKEY to be the only thing the image registry operator needs for storage account authentication when the customer provides it
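A minimal sketch of the user-provided pieces involved, following the documented user-infrastructure flow referenced above (account, container, and key values are placeholders):

apiVersion: v1
kind: Secret
metadata:
  name: image-registry-private-configuration-user
  namespace: openshift-image-registry
stringData:
  REGISTRY_STORAGE_AZURE_ACCOUNTKEY: <custom-storage-account-key>
---
# configs.imageregistry.operator.openshift.io/cluster, relevant storage fields only
spec:
  storage:
    azure:
      accountName: <custom-storage-account>   # lives in a different resource group
      container: <container-name>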
Additional info:
Slack : https://redhat-internal.slack.com/archives/CCV9YF9PD/p1727379313014789
Description of problem:
The installer for IBM Cloud currently only checks the first group of subnets (50) when searching for Subnet details by name. It should provide pagination support to search all subnets.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%, though dependent on the order of subnets returned by the IBM Cloud APIs
Steps to Reproduce:
1. Create 50+ IBM Cloud VPC Subnets 2. Use Bring Your Own Network (BYON) configuration (with Subnet names for CP and/or Compute) in install-config.yaml 3. Attempt to create manifests (openshift-install create manifests)
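A minimal sketch of the BYON subnet fields from step 2, using the subnet names that appear in the error below (field layout per the install-config platform.ibmcloud section):

platform:
  ibmcloud:
    region: eu-de
    controlPlaneSubnets:
    - eu-de-subnet-paginate-1-cp-eu-de-1
    - eu-de-subnet-paginate-1-cp-eu-de-2
    - eu-de-subnet-paginate-1-cp-eu-de-3
    computeSubnets:
    - eu-de-subnet-paginate-1-compute-eu-de-1
    - eu-de-subnet-paginate-1-compute-eu-de-2
    - eu-de-subnet-paginate-1-compute-eu-de-3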
Actual results:
ERROR failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: [platform.ibmcloud.controlPlaneSubnets: Not found: "eu-de-subnet-paginate-1-cp-eu-de-1", platform.ibmcloud.controlPlaneSubnets: Not found: "eu-de-subnet-paginate-1-cp-eu-de-2", platform.ibmcloud.controlPlaneSubnets: Not found: "eu-de-subnet-paginate-1-cp-eu-de-3", platform.ibmcloud.controlPlaneSubnets: Invalid value: []string{"eu-de-subnet-paginate-1-cp-eu-de-1", "eu-de-subnet-paginate-1-cp-eu-de-2", "eu-de-subnet-paginate-1-cp-eu-de-3"}: number of zones (0) covered by controlPlaneSubnets does not match number of provided or default zones (3) for control plane in eu-de, platform.ibmcloud.computeSubnets: Not found: "eu-de-subnet-paginate-1-compute-eu-de-1", platform.ibmcloud.computeSubnets: Not found: "eu-de-subnet-paginate-1-compute-eu-de-2", platform.ibmcloud.computeSubnets: Not found: "eu-de-subnet-paginate-1-compute-eu-de-3", platform.ibmcloud.computeSubnets: Invalid value: []string{"eu-de-subnet-paginate-1-compute-eu-de-1", "eu-de-subnet-paginate-1-compute-eu-de-2", "eu-de-subnet-paginate-1-compute-eu-de-3"}: number of zones (0) covered by computeSubnets does not match number of provided or default zones (3) for compute[0] in eu-de]
Expected results:
Successful manifests and cluster creation
Additional info:
IBM Cloud is working on a fix
Please review the following PR: https://github.com/openshift/csi-operator/pull/243
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver/pull/71
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Recently, the sos package was added to the tools image used when invoking oc debug node/<some-node> (details in z).
However, the change just added the sos package without taking into account the other conditions required for sos report to work inside a container.
For reference, the toolbox container has to be launched as follows for sos report to work properly (the command output gives you the template of the right podman run command):
$ podman inspect registry.redhat.io/rhel9/support-tools | jq -r '.[0].Config.Labels.run'
podman run -it --name NAME --privileged --ipc=host --net=host --pid=host -e HOST=/host -e NAME=NAME -e IMAGE=IMAGE -v /run:/run -v /var/log:/var/log -v /etc/machine-id:/etc/machine-id -v /etc/localtime:/etc/localtime -v /:/host IMAGE
The most crucial thing is the HOST=/host environment variable, which makes sos report find the real root of the machine in /host, but the other ones are also required.
So if we are to support sos report in the tools image, the debug node container defaults should be changed so that the container runs with the same settings as in the reference podman run command indicated above.
4.16 only
Always
Start a debug node container (oc debug node/<node>) and try to gather sos report (without chroot /host + toolbox, just from debug container).
(none)
Description of the problem:
When trying to add a node on day2 using assisted-installer, the node reports that the disk is not eligible as an installation disk:
Thread: https://redhat-external.slack.com/archives/C05N3PY1XPH/p1731575515647969
Possible issue: https://github.com/openshift/assisted-service/blob/master/internal/hardware/validator.go#L117-L120 => the openshift version is not filled on day2
How reproducible:
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
When verifying the ACM Alerting UI feature, following the doc, we face the issue 'silence alert action link has a bad format compared to CMO's same action'.
Description of problem:
Camel K provides a list of Kamelets that are able to act as an event source or sink for a Knative eventing message broker. Usually the list of Kamelets installed with the Camel K operator are displayed in the Developer Catalog list of available event sources with the provider "Apache Software Foundation" or "Red Hat Integration". When a user adds a custom Kamelet custom resource to the user namespace the list of default Kamelets coming from the Camel K operator is gone. The Developer Catalog event source list then only displays the custom Kamelet but not the default ones.
Version-Release number of selected component (if applicable):
How reproducible:
Apply a custom Kamelet custom resource to the user namespace and open the list of available event sources in Dev Console Developer Catalog.
Steps to Reproduce:
1. install global Camel K operator in operator namespace (e.g. openshift-operators) 2. list all available event sources in "default" user namespace and see all Kamelets listed as event sources/sinks 3. add a custom Kamelet custom resource to the default namespace 4. see the list of available event sources only listing the custom Kamelet and the default Kamelets are gone from that list
Actual results:
Default Kamelets that act as event source/sink are only displayed in the Developer Catalog when there is no custom Kamelet added to a namespace.
Expected results:
Default Kamelets coming with the Camel K operator (installed in the operator namespace) should always be part of the Developer Catalog list of available event sources/sinks. When the user adds more custom Kamelets these should be listed, too.
Additional info:
Reproduced with Camel K operator 2.2 and OCP 4.14.8
screenshots: https://drive.google.com/drive/folders/1mTpr1IrASMT76mWjnOGuexFr9-mP0y3i?usp=drive_link
Please review the following PR: https://github.com/openshift/cluster-api-provider-gcp/pull/231
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-gcp-cluster-api-controllers-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Compile errors when building an Ironic image look like this:
2024-08-14 09:07:21 + python3 -m compileall --invalidation-mode=timestamp /usr 2024-08-14 09:07:21 Listing '/usr'... 2024-08-14 09:07:21 Listing '/usr/bin'... ... Listing '/usr/share/zsh/site-functions'... Listing '/usr/src'... Listing '/usr/src/debug'... Listing '/usr/src/kernels'... Error: building at STEP "RUN prepare-image.sh && rm -f /bin/prepare-image.sh && /bin/prepare-ipxe.sh && rm -f /tmp/prepare-ipxe.sh": while running runtime: exit status 1
With the actual error lost in 3000+ lines of output, we should suppress the file listings.
Description of problem:
I see that when one release is declared in the ImageSetConfig.yaml everything works well with respect to creating release signature configmap, but when more than one release is added to ImageSetConfig.yaml i see that binaryData content in the signature configmap is duplicated and there is more than specified releases present in the signatures directory. See below ImageSetConfig.yaml: ================= [fedora@preserve-fedora-yinzhou test]$ cat /tmp/clid-232.yaml apiVersion: mirror.openshift.io/v2alpha1 kind: ImageSetConfiguration mirror: platform: channels: - name: stable-4.16 minVersion: 4.16.0 maxVersion: 4.16.0 - name: stable-4.15 minVersion: 4.15.0 maxVersion: 4.15.0 Content in Signatures directory: ========================= [fedora@preserve-fedora-yinzhou test]$ ls -l CLID-232/working-dir/signatures/ total 12 -rw-r--r--. 1 fedora fedora 896 Sep 25 11:27 4.15.0-x86_64-sha256-0da6316466d60a3a4535d5fed3589feb0391989982fba59d47d4c729912d6363 -rw-r--r--. 1 fedora fedora 897 Sep 25 11:27 4.15.31-x86_64-sha256-c03bbdd63fa8832266a2cf0d9fbcd2867692d9ba7e09d31bc77d15dd9903e36f -rw-r--r--. 1 fedora fedora 899 Sep 25 11:27 4.16.0-x86_64-sha256-3717338045df06e31effea46761b2c7e90f543cc4f00547af8158dd6aea868c3 Content in Signature Configmap: ========================== apiVersion: v1 binaryData: sha256-0da6316466d60a3a4535d5fed3589feb0391989982fba59d47d4c729912d6363-2: owGbwMvMwMEoOU9/4l9n2UDGtYzJSWLxRQW5xZnpukWphboVxQEWxbl6SZl5qXe29lQrJRdllmQmJ+YoWSlUK2XmJqanglkp+cnZqUW6uYl5mWmpxSW6KZnpQAoopVSckWhkamZlkJJoZmxoZmJmlmJmkGicaGJqbJpimpaaYmxqYZmWmmRgbGloaWFpaWGUlpRoapliYp5ikmxuZGlpaJRiZmxmrFSro6BUUlkAsk4psSQ/NzNZITk/ryQR6LAiBaBr8xJLSotSlYCqMlNS80oySyqRHVaUmpZalJqXDNZeWJpYqZeZr59fkJpXnJGZVgKUzklNLE7VTUkt089PLoDxrUz0DE31DHQrLMzizUyUakFuyC8oyczPgwZAclEq0C1FIEODUlMUPBJLFPyBhgaDDFUIBjoqMy9dwbG0JCMfGGyVCgZ6BnqGQGM6mWRYGBg5GNhYmUChysDFKQCLgT4zAYZeps1bfryz7j15qafOW3Dqwv8q1gUhm2eahBm6BEgZRp1fNN1LJEQi0PW1qVrTmQnusy7Pq/t2qcrj83LOh7b7uhMlL7AF3j6QM/HdoTTFaZsulu3qm/FU7SCTwhUH+WsaJw2/l2/bpKDEmvI29TPTCs0pJrFt1UGds0OXeuZf/Pvo9Y8WWw/7sA0lrA0daz6Ef9RdPsGdU+SDpjCrRuai8oIbavb9Fz22FvYv/eMk/dv26L6MPqaU1R56Sz8LVJQ1XQrk3Dzl+THGVZ97BOS0znjwn/RLvsNvc/8V8w39xV/XuhvskLMXfjPp5pErMtbKPMte5krmeEefy5uWvyi9dUPesedH/ey8l894t/RM1odKsaZwtx2X8tecb/eZGsd64P/c77cOnYiX62POMY+L2Xom4bVk5DnDncrKsictr/4yDjnO5Heg0uHN6k1rkv88Ez5yy+HU009+V1l3eFUfVVhfahQS/5trr3JrtIvKFln+s9L17+9brQp10wtkeTqt5OOZrftY7Nqk1mcLejxanF7uyHvSIj+vUPDZhk4GU+MAZ4a3zCfSdeb2l4REqdRwVhoXf7u9/6qnYf79L2IOHE4RzOVbghwsXgWa3T715rLQwT7e/SuYBYqWf87c+CFw0/QTPg3vmI/G/qhaKvLf3sy7U+N2TVDe9OUqj0/wvBI/yOV0y0Mpet1ZRt+zH9tllRMkH60PSd23EAA= sha256-0da6316466d60a3a4535d5fed3589feb0391989982fba59d47d4c729912d6363-3: 
owGbwMvMwMEoOU9/4l9n2UDGtYzJSWLxRQW5xZnpukWphboVxQEWxbl6SZl5qXe29lQrJRdllmQmJ+YoWSlUK2XmJqanglkp+cnZqUW6uYl5mWmpxSW6KZnpQAoopVSckWhkamZlkJJoZmxoZmJmlmJmkGicaGJqbJpimpaaYmxqYZmWmmRgbGloaWFpaWGUlpRoapliYp5ikmxuZGlpaJRiZmxmrFSro6BUUlkAsk4psSQ/NzNZITk/ryQR6LAiBaBr8xJLSotSlYCqMlNS80oySyqRHVaUmpZalJqXDNZeWJpYqZeZr59fkJpXnJGZVgKUzklNLE7VTUkt089PLoDxrUz0DE31DHQrLMzizUyUakFuyC8oyczPgwZAclEq0C1FIEODUlMUPBJLFPyBhgaDDFUIBjoqMy9dwbG0JCMfGGyVCgZ6BnqGQGM6mWRYGBg5GNhYmUChysDFKQCLgT4zAYZeps1bfryz7j15qafOW3Dqwv8q1gUhm2eahBm6BEgZRp1fNN1LJEQi0PW1qVrTmQnusy7Pq/t2qcrj83LOh7b7uhMlL7AF3j6QM/HdoTTFaZsulu3qm/FU7SCTwhUH+WsaJw2/l2/bpKDEmvI29TPTCs0pJrFt1UGds0OXeuZf/Pvo9Y8WWw/7sA0lrA0daz6Ef9RdPsGdU+SDpjCrRuai8oIbavb9Fz22FvYv/eMk/dv26L6MPqaU1R56Sz8LVJQ1XQrk3Dzl+THGVZ97BOS0znjwn/RLvsNvc/8V8w39xV/XuhvskLMXfjPp5pErMtbKPMte5krmeEefy5uWvyi9dUPesedH/ey8l894t/RM1odKsaZwtx2X8tecb/eZGsd64P/c77cOnYiX62POMY+L2Xom4bVk5DnDncrKsictr/4yDjnO5Heg0uHN6k1rkv88Ez5yy+HU009+V1l3eFUfVVhfahQS/5trr3JrtIvKFln+s9L17+9brQp10wtkeTqt5OOZrftY7Nqk1mcLejxanF7uyHvSIj+vUPDZhk4GU+MAZ4a3zCfSdeb2l4REqdRwVhoXf7u9/6qnYf79L2IOHE4RzOVbghwsXgWa3T715rLQwT7e/SuYBYqWf87c+CFw0/QTPg3vmI/G/qhaKvLf3sy7U+N2TVDe9OUqj0/wvBI/yOV0y0Mpet1ZRt+zH9tllRMkH60PSd23EAA= sha256-3717338045df06e31effea46761b2c7e90f543cc4f00547af8158dd6aea868c3-1: owGbwMvMwMEoOU9/4l9n2UDGtYzJSWLxRQW5xZnpukWphbqJSUGVJf56SZl5adU9WtVKyUWZJZnJiTlKVgrVSpm5iempYFZKfnJ2apFubmJeZlpqcYluSmY6kAJKKRVnJBqZmlkZmxuaGxtbGJiYpqQZmKUaG6ampaUmmpiZmxkmGSWbp1oapJmaGCcnm6QZGJiamCemWRiaWqSkmCWmJlqYWSQbK9XqKCiVVBaArFNKLMnPzUxWSM7PK0nMzEstUgC6Ni+xpLQoVQmoKjMlNa8ks6QS2WFFqWmpRal5yWDthaWJlXqZ+fr5Bal5xRmZaSVA6ZzUxOJU3ZTUMv385AIY38pEz9BMz0C3wsIs3sxEqRbkhvyCksz8PGgAJBelAt1SBDI0KDVFwSOxRMEfaGgwyFCFYKCjMvPSFRxLSzLygcFWqWCgZ6BnCDSmk0mGhYGRg4GNlQkUqgxcnAKwGNiSIcDQLFrmt8ZarfU0234jphipJx9PrVWR6Ne1P/lzlbnN1blfXt+UWnXz4NW1Ne/eHNI+vNpyxpe0VZozL1YKlMg+VCo+uul5S4t3L+8byXsmb98vdVy61TLumM+0Ta1WuikS3NfVlvPNLJ6y4+6qX74pz9pqnXbr32lxenH6btxcpW+C21ICAxd9tOkST7Vemn7kedPrOXyPCkQ5blZK1BdaPYndXcMZK3AsI7a4SqMsrvH2pNgVRU+X3z1t/umAHWv4FbZowW8zDnZtt1ov5215R/dtsXOw4fwEi5WtClM55h0908FyYOor+7/HI0qPZ3DsP8DPIZy4YOl38fb5PPOCTP8fm8t++erKN9mbAh7+Yo90eO8urXuho6OitC3hcIjpf9HiSBMl13fOt6MEF7zsn7Zj5oI7x5Y2Hr6ys/RNxnPZgjlh/pkdr7OccxM2zLFvXTN7b7n0r3dq277/LvuYl+l+e16u18bpMbmZu2VtkmYY31h94+uCaN3I43tbJLmXTtly97Yyc23LrtxtK7PM5K4oSd0oMJ7zaN3Ssr0bEo8GFIT7m9eY/3leG/76McPKO5uDHji8zpWUnfNyv2L315RVXc+usYuwf/v81PvHlz3Vt/49PTFNILy04pjQv788culLEi1edk2amaH5zTfBvN407aP4i6NzPwi98O5nac/cHbZLzDEw4iXjpHWsuWZPzJhyNF3myQb3SlQ7AQA= sha256-3717338045df06e31effea46761b2c7e90f543cc4f00547af8158dd6aea868c3-5: 
owGbwMvMwMEoOU9/4l9n2UDGtYzJSWLxRQW5xZnpukWphbqJSUGVJf56SZl5adU9WtVKyUWZJZnJiTlKVgrVSpm5iempYFZKfnJ2apFubmJeZlpqcYluSmY6kAJKKRVnJBqZmlkZmxuaGxtbGJiYpqQZmKUaG6ampaUmmpiZmxkmGSWbp1oapJmaGCcnm6QZGJiamCemWRiaWqSkmCWmJlqYWSQbK9XqKCiVVBaArFNKLMnPzUxWSM7PK0nMzEstUgC6Ni+xpLQoVQmoKjMlNa8ks6QS2WFFqWmpRal5yWDthaWJlXqZ+fr5Bal5xRmZaSVA6ZzUxOJU3ZTUMv385AIY38pEz9BMz0C3wsIs3sxEqRbkhvyCksz8PGgAJBelAt1SBDI0KDVFwSOxRMEfaGgwyFCFYKCjMvPSFRxLSzLygcFWqWCgZ6BnCDSmk0mGhYGRg4GNlQkUqgxcnAKwGNiSIcDQLFrmt8ZarfU0234jphipJx9PrVWR6Ne1P/lzlbnN1blfXt+UWnXz4NW1Ne/eHNI+vNpyxpe0VZozL1YKlMg+VCo+uul5S4t3L+8byXsmb98vdVy61TLumM+0Ta1WuikS3NfVlvPNLJ6y4+6qX74pz9pqnXbr32lxenH6btxcpW+C21ICAxd9tOkST7Vemn7kedPrOXyPCkQ5blZK1BdaPYndXcMZK3AsI7a4SqMsrvH2pNgVRU+X3z1t/umAHWv4FbZowW8zDnZtt1ov5215R/dtsXOw4fwEi5WtClM55h0908FyYOor+7/HI0qPZ3DsP8DPIZy4YOl38fb5PPOCTP8fm8t++erKN9mbAh7+Yo90eO8urXuho6OitC3hcIjpf9HiSBMl13fOt6MEF7zsn7Zj5oI7x5Y2Hr6ys/RNxnPZgjlh/pkdr7OccxM2zLFvXTN7b7n0r3dq277/LvuYl+l+e16u18bpMbmZu2VtkmYY31h94+uCaN3I43tbJLmXTtly97Yyc23LrtxtK7PM5K4oSd0oMJ7zaN3Ssr0bEo8GFIT7m9eY/3leG/76McPKO5uDHji8zpWUnfNyv2L315RVXc+usYuwf/v81PvHlz3Vt/49PTFNILy04pjQv788culLEi1edk2amaH5zTfBvN407aP4i6NzPwi98O5nac/cHbZLzDEw4iXjpHWsuWZPzJhyNF3myQb3SlQ7AQA= sha256-c03bbdd63fa8832266a2cf0d9fbcd2867692d9ba7e09d31bc77d15dd9903e36f-4: owGbwMvMwMEoOU9/4l9n2UDGtYwpSWLxRQW5xZnpukWphbpZ+ZXhZuF6SZl5abcZJKuVkosySzKTE3OUrBSqlTJzE9NTwayU/OTs1CLd3MS8zLTU4hLdlMx0IAWUUirOSDQyNbNKNjBOSkpJMTNOS7SwMDYyMjNLNEpOM0ixTEtKTjGyMDM3szRKsUxKNE81sEwxNkxKNjdPMTRNSbG0NDBONTZLU6rVUVAqqSwAWaeUWJKfm5mskJyfV5KYmZdapAB0bV5iSWlRqhJQVWZKal5JZkklssOKUtNSi1LzksHaC0sTK/Uy8/XzC1LzijMy00qA0jmpicWpuimpZfr5yQUwvpWJnqGpnrGhboWFWbyZiVItyBH5BSWZ+XnQEEguSgU6pghkalBqioJHYomCP9DUYJCpCsFAV2XmpSs4lpZk5APDrVLBQM9AzxBoTCeTDAsDIwcDGysTKFgZuDgFYFHwQYP/r7TdX8MJrlqz/3tPL+rjsZNXsNwX8Vxgc++2GI5dkt4r1r1nmrfdcGVn8tVJMtzTrf7m6F+9v5m7uK54b18F3+1JS5ziwtOfTpSpZs1u4z41o2QHo3HJmQNum0OK5ywoMtB4s8Mh+YVo7FSN7Vpr8/fdkHDPmr/plNTxw5EByZreMicnzhWx1TX94bxkYf9X1gehhstDj5Vu+7G6VTv49O9yx+xah4XC4ccvGyj4y374ql1TcsZwscHEagvz1eeFey97Lkj6nX2y+MyjY3yvJMRbxEvZ/iS9W/+b4+zOGZmHdm6pymfO9104VY3JVeO2V3JvfvKi9KKmXh8xyf/lQlprjI52nomwOOSZfIpBLv7Ezf/r9wQ4Lt81dfuJlfO50uc5p5ybIMD3L6ZywY3EA1yvIkNllmkwCTgc9RDwf7hnqrpoxNeLP75tcY7ekplU3FymE1z7YMIli8Trp3c0VFTFHuibLcGn13Rvu0roraAZBpvXV7vL7mExXjJHaoJlenxeOIvZ85ksH29fe3Cp2lVCp8Kh1KjUeyZ7w8PJX/W0Ppp96TTwUPuXNi/ZxXSpxxy19trJysLbLi5In8sytTB08vRLarfc0hiVXgs7m6f0P7xyYpbzVPbZrPYHnRjfCS9ljFNamXL50KzN6T46hww81YT1W84kzvMNZd/M0B+auvfe758FLnyRM3zfrJ43n2tbF1P3Ph7tqngA kind: ConfigMap metadata: labels: release.openshift.io/verification-signatures: "" namespace: openshift-config-managed
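For readability, the ImageSetConfiguration embedded in the description above, re-indented:

apiVersion: mirror.openshift.io/v2alpha1
kind: ImageSetConfiguration
mirror:
  platform:
    channels:
    - name: stable-4.16
      minVersion: 4.16.0
      maxVersion: 4.16.0
    - name: stable-4.15
      minVersion: 4.15.0
      maxVersion: 4.15.0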
Version-Release number of selected component (if applicable):
[fedora@preserve-fedora-yinzhou test]$ ./oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.2.0-alpha.1-298-ga5a32fa", GitCommit:"a5a32fa3", GitTreeState:"clean", BuildDate:"2024-09-25T08:22:44Z", GoVersion:"go1.22.4", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. clone oc-mirror repo, cd oc-mirror, run make build 2. Now use the imageSetConfig.yaml present above and run mirror2disk & disk2mirror commands 3. oc-mirror -c /tmp/clid-232.yaml file://CLID-232 --v2 ; oc-mirror -c /tmp/clid-232.yaml --from file://CLID-232 docker://localhost:5000/clid-232 --dest-tls-verify=false --v2
Actual results:
1. See that the signatures directory contains more releases than expected, as shown in the description. 2. Also see that binaryData is duplicated in the signatureconfigmap.yaml.
Expected results:
1. Should only see the releases that are defined in the imageSetConfig.yaml in the signatures directory 2. Should not see any duplication of binaryData in the signatureconfigmap.yaml file.
Additional info:
The duplication of controllers for hostedcontrolplane v2 has caused some technical debt.
The new controllers are now out of sync with their v1 counterparts.
For example:
control-plane-operator/controllers/hostedcontrolplane/v2/cloud_controller_manager/openstack/config.go is missing a feature that was merged after the v2 controller was merged, so it's out of sync.
Description of problem:
Inspection is failing on hosts which special characters found in serial number of block devices: Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: 2024-07-03 09:16:11.325 1 DEBUG ironic_python_agent.inspector [-] collected data: {'inventory'....'error': "The following errors were encountered:\n* collector logs failed: 'utf-8' codec can't decode byte 0xff in position 12: invalid start byte"} call_inspector /usr/lib/python3.9/site-packages/ironic_python_agent/inspector.py:128 Serial found: "serial": "2HC015KJ0000\udcff\udcff\udcff\udcff\udcff\udcff\udcff\udcff" Interesting stacktrace error: Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1260-1267: surrogates not allowed Full stack trace: ~~~ Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: 2024-07-03 09:16:11.628 1 DEBUG oslo_concurrency.processutils [-] CMD "lsblk -bia --json -oKNAME,MODEL,SIZE,ROTA,TYPE,UUID,PARTUUID,SERIAL" returned: 0 in 0.006s e xecute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:422 Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: --- Logging error --- Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: --- Logging error --- Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: Traceback (most recent call last): Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: File "/usr/lib64/python3.9/logging/__init__.py", line 1086, in emit Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Traceback (most recent call last): Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib64/python3.9/logging/__init__.py", line 1086, in emit Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: stream.write(msg + self.terminator) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1260-1267: surrogates not allowed Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Call stack: Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: stream.write(msg + self.terminator) Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1260-1267: surrogates not allowed Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/bin/ironic-python-agent", line 10, in <module> Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: sys.exit(run()) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/cmd/agent.py", line 50, in run Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: agent.IronicPythonAgent(CONF.api_url, Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: Call stack: Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 485, in run Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: self.process_lookup_data(content) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 400, in process_lookup_data Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: hardware.cache_node(self.node) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 3179, in cache_node Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: dispatch_to_managers('wait_for_disks') Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 3124, in dispatch_to_managers Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: return 
getattr(manager, method)(*args, **kwargs) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 997, in wait_for_disks Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: self.get_os_install_device() Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1518, in get_os_install_device Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: block_devices = self.list_block_devices_check_skip_list( Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1495, in list_block_devices_check_skip_list Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: block_devices = self.list_block_devices( Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1460, in list_block_devices Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: block_devices = list_all_block_devices() Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 526, in list_all_block_devices Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: report = il_utils.execute('lsblk', '-bia', '--json', Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_lib/utils.py", line 111, in execute Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: _log(result[0], result[1]) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: File "/usr/lib/python3.9/site-packages/ironic_lib/utils.py", line 99, in _log Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: LOG.debug('Command stdout is: "%s"', stdout) Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Message: 'Command stdout is: "%s"' Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Arguments: ('{\n "blockdevices": [\n {\n "kname": "loop0",\n "model": null,\n "size": 67467313152,\n "rota": false,\n "type": "loop",\n "uuid": "28f5ff52-7f5b-4e5a-bcf2-59813e5aef5a",\n "partuuid": null,\n "serial": null\n },{\n "kname": "loop1",\n "model": null,\n "size": 1027846144,\n "rota": false,\n "type": "loop",\n "uuid": null,\n "partuuid": null,\n "serial": null\n },{\n "kname": "sda",\n "model": "LITEON IT ECE-12",\n "size": 120034123776,\n "rota": false,\n "type": "disk",\n "uuid": null,\n "partuuid": null,\n "serial": "XXXXXXXXXXXXXXXXXX"\n },{\n "kname": "sdb",\n "model": "LITEON IT ECE-12",\n "size": 120034123776,\n "rota": false,\n "type": "disk",\n "uuid": null,\n "partuuid": null,\n "serial": "XXXXXXXXXXXXXXXXXXXX"\n },{\n "kname": "sdc",\n "model": "External",\n "size": 0,\n "rota": true,\n "type": "disk",\n "uuid": null,\n "partuuid": null,\n "serial": "2HC015KJ0000\udcff\udcff\udcff\udcff\udcff\udcff\udcff\udcff"\n }\n ]\n}\n',) ~~~
Version-Release number of selected component (if applicable):
OCP 4.14.28
How reproducible:
Always
Steps to Reproduce:
1. Add a BMH with a bad utf-8 characters in serial 2. 3.
Actual results:
Inspection fail
Expected results:
Inspection works
Additional info:
Description of problem:
Selecting Add from the event modal in topology redirects to the Add page, but the event modal to add a trigger for a broker persists
Version-Release number of selected component (if applicable):
How reproducible:
Everytime
Steps to Reproduce:
1. Enable event option in config map of knative-eventing namespace 2. Create a broker and associate an event to it 3. In topology select add trigger for the broker 4. Since no service is created it will ask to go to Add page to create a service so select Add from the modal
Actual results:
The modal persists
Expected results:
The modal should be closed after the user is redirected to the Add page
Additional info:
Adding video of the issue
https://drive.google.com/file/d/16hMbtBj0GeqUOLnUdCTMeYR3exY84oEn/view?usp=sharing
Description of problem:
Rotating the root certificates (root CA) requires multiple certificates during the rotation process to prevent downtime as the server and client certificates are updated in the control and data planes. Currently, the HostedClusterConfigOperator uses the cluster-signer-ca from the control plane to create a kubelet-serving-ca on the data plane. The cluster-signer-ca contains only a single certificate that is used for signing certificates for the kube-controller-manager. During a rotation, the kubelet-serving-ca will be updated with the new CA which triggers the metrics-server pod to restart and use the new CA. This will lead to an error in the metrics-server where it cannot scrape metrics as the kubelet has yet to pick up the new certificate. E0808 16:57:09.829746 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.240.0.29:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate signed by unknown authority" node="pres-cqogb7a10b7up68kvlvg-rkcpsms0805-default-00000130" rkc@rmac ~> kubectl get pods -n openshift-monitoring NAME READY STATUS RESTARTS AGE metrics-server-594cd99645-g8bj7 0/1 Running 0 2d20h metrics-server-594cd99645-jmjhj 1/1 Running 0 46h The HostedClusterConfigOperator should likely be using the KubeletClientCABundle from the control plane for the kubelet-serving-ca in the data plane. This CA bundle will contain both the new and old CA such that all data plane components can remain up during the rotation process.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The section is: https://docs.openshift.com/container-platform/4.16/installing/installing_aws/ipi/installing-aws-vpc.html#installation-aws-arm-tested-machine-types_installing-aws-vpc
All tested ARM instances for 4.14+: c6g.* c7g.* m6g.* m7g.* r8g.*
We need to ensure all relevant sections include the "Tested instance types for AWS on 64-bit ARM infrastructures" section and that it has been updated for 4.14+.
Additional info:
In 4.17 the openshift installer will have the `create config iso` functionality (see epic). IBIO should stop implementing this logic; instead it should extract the openshift installer from the release image (already part of the ICI CR) and use it to create the configuration ISO.
Please review the following PR: https://github.com/openshift/route-controller-manager/pull/47
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The console crashes when the user selects SSH as the Authentication type for the git server under add secret in the start pipeline form
Version-Release number of selected component (if applicable):
How reproducible:
Every time. Only in the Developer perspective and only if the Pipelines dynamic plugin is enabled.
Steps to Reproduce:
1. Create a pipeline through add flow and open start pipeline page 2. Under show credentials select add secret 3. In the secret form select `Access to ` as Git server and `Authentication type` as SSH key
Actual results:
Console crashes
Expected results:
UI should work as expected
Additional info:
Attaching console log screenshot
https://drive.google.com/file/d/1bGndbq_WLQ-4XxG5ylU7VuZWZU15ywTI/view?usp=sharing
Please review the following PR: https://github.com/openshift/csi-operator/pull/227
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Dualstack jobs beyond 4.13 (presumably when we added cluster-data.json) are miscategorized as NetworkStack = ipv4 because the code doesn't know how to detect dualstack: https://github.com/openshift/origin/blob/11f7ac3e64e6ee719558fc18d753d4ce1303d815/pkg/monitortestlibrary/platformidentification/types.go#L88
We have the ability to NOT override a variant calculated from jobname if cluster-data disagrees: https://github.com/openshift/sippy/blob/master/pkg/variantregistry/ocp.go#L181
We should fix origin, but we don't want to backport to five releases, so we should also update the variant registry to ignore this field in cluster data if the release is <= 4.18 (assuming that's where we fix this).
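As a rough illustration of the registry-side guard (not sippy's actual code; the type, field names, and release-parsing helper below are assumptions), the skip could look something like this:

~~~
// Minimal sketch: keep the job-name-derived NetworkStack variant for releases
// whose cluster-data.json cannot distinguish dualstack from ipv4.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// clusterData stands in for the parsed cluster-data.json payload.
type clusterData struct {
	NetworkStack string // "ipv4", "ipv6", or "dual" once origin is fixed
}

// releaseAtMost reports whether release (e.g. "4.17") is <= major.minor.
func releaseAtMost(release string, major, minor int) bool {
	parts := strings.SplitN(release, ".", 2)
	if len(parts) != 2 {
		return false
	}
	maj, errMaj := strconv.Atoi(parts[0])
	mnr, errMnr := strconv.Atoi(parts[1])
	if errMaj != nil || errMnr != nil {
		return false
	}
	return maj < major || (maj == major && mnr <= minor)
}

// networkStackVariant ignores cluster-data for releases where origin is known
// to report dualstack clusters as ipv4, and falls back to the job name.
func networkStackVariant(release, fromJobName string, cd clusterData) string {
	if releaseAtMost(release, 4, 18) {
		return fromJobName // ignore cluster-data until the origin fix lands
	}
	if cd.NetworkStack != "" {
		return cd.NetworkStack
	}
	return fromJobName
}

func main() {
	fmt.Println(networkStackVariant("4.16", "dual", clusterData{NetworkStack: "ipv4"}))
}
~~~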
It is worth supporting PATCH requests in the curl_assisted_service func.
E.g. for the appliance flow: https://github.com/openshift/appliance/blob/1c405b5cc722b29edcf4bb6bbe14e44d21a4c066/data/scripts/bin/update-hosts.sh.template#L29-L30
Description of the problem:
Looks like the nmstate service is enabled on the ARM machine.
ARM machine: (Run on CI job)
nvd-srv-17.nvidia.eng.rdu2.redhat.com
[root@worker-0-0 core]# cd /etc/nmstate/ [root@worker-0-0 nmstate]# cat cat catchsegv [root@worker-0-0 nmstate]# ls -l total 8 -rw-r--r--. 1 root root 95 Aug 1 2022 README -rw-------. 1 root root 804 Sep 24 12:36 ymlFile2.yml [root@worker-0-0 nmstate]# cat ymlFile2.yml capture: iface0: interfaces.mac-address == "52:54:00:82:6B:E0" desiredState: dns-resolver: config: server: - 192.168.200.1 interfaces: - ipv4: address: - ip: 192.168.200.53 prefix-length: 24 dhcp: false enabled: true name: "{{ capture.iface0.interfaces.0.name }}" type: ethernet state: up ipv6: address: - ip: fd2e:6f44:5dd8::39 prefix-length: 64 dhcp: false enabled: true routes: config: - destination: 0.0.0.0/0 next-hop-address: 192.168.200.1 next-hop-interface: "{{ capture.iface0.interfaces.0.name }}" table-id: 254 - destination: ::/0 next-hop-address: fd2e:6f44:5dd8::1 next-hop-interface: "{{ capture.iface0.interfaces.0.name }}" table-id: 254[root@worker-0-0 nmstate]#
[root@worker-0-0 nmstate]# systemctl status nmstate.service ● nmstate.service - Apply nmstate on-disk state Loaded: loaded (/usr/lib/systemd/system/nmstate.service; enabled; preset: enabled) Active: active (exited) since Tue 2024-09-24 12:40:05 UTC; 20min ago Docs: man:nmstate.service(8) https://www.nmstate.io Process: 3427 ExecStart=/usr/bin/nmstatectl service (code=exited, status=0/SUCCESS) Main PID: 3427 (code=exited, status=0/SUCCESS) CPU: 36ms Sep 24 12:40:03 worker-0-0 systemd[1]: Starting Apply nmstate on-disk state... Sep 24 12:40:03 worker-0-0 nmstatectl[3427]: [2024-09-24T12:40:03Z INFO nmstatectl] Nmstate version: 2.2.27 Sep 24 12:40:05 worker-0-0 nmstatectl[3427]: [2024-09-24T12:40:05Z INFO nmstate::ip] Static addresses fd2e:6f44:5dd8::39/64 defined when dynamic IP is enabled Sep 24 12:40:05 worker-0-0 nmstatectl[3427]: [2024-09-24T12:40:05Z INFO nmstate::ip] Static addresses fd2e:6f44:5dd8::39/64 defined when dynamic IP is enabled Sep 24 12:40:05 worker-0-0 nmstatectl[3427]: [2024-09-24T12:40:05Z INFO nmstate::query_apply::net_state] Created checkpoint /org/freedesktop/NetworkManager/Checkpoint/1 Sep 24 12:40:05 worker-0-0 nmstatectl[3427]: [2024-09-24T12:40:05Z INFO nmstate::query_apply::net_state] Rollbacked to checkpoint /org/freedesktop/NetworkManager/Checkpoint/1 Sep 24 12:40:05 worker-0-0 nmstatectl[3427]: [2024-09-24T12:40:05Z ERROR nmstatectl::service] Failed to apply state file /etc/nmstate/ymlFile2.yml: NmstateError: NotImplementedError: Autoconf without DHCP is not supported yet Sep 24 12:40:05 worker-0-0 systemd[1]: Finished Apply nmstate on-disk state. [root@worker-0-0 nmstate]# more /usr/lib/systemd/system/nmstate.service [Unit] Description=Apply nmstate on-disk state Documentation=man:nmstate.service(8) https://www.nmstate.io After=NetworkManager.service Before=network-online.target Requires=NetworkManager.service [Service] Type=oneshot ExecStart=/usr/bin/nmstatectl service RemainAfterExit=yes [Install] WantedBy=NetworkManager.service [root@worker-0-0 nmstate]#
How reproducible:
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
Gather the nodenetworkconfigurationpolicy.nmstate.io/v1 and nodenetworkstate.nmstate.io/v1beta1 cluster-scoped resources in the Insights data. These CRs are introduced by the NMState operator.
Description of problem:
A new 'Architecture' chart is added on the Metrics page for some resources, e.g. Deployments, StatefulSets, DaemonSets, and so on. The chart shows 'No datapoints found', which is not correct. The reported issue/question is: Q1. Should the 'Architecture' chart be listed on the Metrics page for those resources? Q2. If yes, it should not show 'No datapoints found'
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-08-075347
How reproducible:
Always
Steps to Reproduce:
1. Navigate to a resource details page, such as a StatefulSet or Deployment details page, and go to the Metrics tab, e.g. k8s/ns/openshift-monitoring/statefulsets/alertmanager-main/metrics 2. Check the new 'Architecture' chart 3.
Actual results:
A new 'Architecture' chart is listed on the Metrics page, and the chart returns 'No datapoints found'.
Expected results:
The 'Architecture' chart should not exist. If it is added by design, it should not return 'No datapoints found'.
Additional info:
For reference: I think the page is impacted by the PR https://github.com/openshift/console/pull/13718
Description of problem:
etcd-operator is using a JSON-based client for core object communication. Instead, it should use the protobuf version.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
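As a rough sketch of the kind of change implied here (not the operator's actual wiring), client-go can negotiate protobuf for built-in types by setting the content type on the rest.Config before building the clientset; protobuf must not be used for CRDs, which only support JSON:

~~~
// Minimal sketch: build a Kubernetes clientset that prefers protobuf for core
// (built-in) objects, falling back to JSON where protobuf is unsupported.
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func protobufClient() (*kubernetes.Clientset, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	// Request protobuf responses and send protobuf bodies where possible.
	cfg.AcceptContentTypes = "application/vnd.kubernetes.protobuf,application/json"
	cfg.ContentType = "application/vnd.kubernetes.protobuf"
	return kubernetes.NewForConfig(cfg)
}

func main() {
	if _, err := protobufClient(); err != nil {
		fmt.Println("could not build client:", err)
	}
}
~~~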
Description of the problem:
When attempting to delete the agentserviceconfig, it gets stuck deleting on the `agentserviceconfig.agent-install.openshift.io/local-cluster-import-deprovision` finalizer.
The following error is reported by the infrastructure operator pod:
time="2024-09-03T12:57:17Z" level=info msg="AgentServiceConfig (LocalClusterImport) Reconcile started" time="2024-09-03T12:57:17Z" level=error msg="could not delete local cluster ClusterDeployment due to error failed to delete ClusterDeployment in namespace : resource name may not be empty" time="2024-09-03T12:57:17Z" level=error msg="failed to clean up local cluster CRs" error="failed to delete ClusterDeployment in namespace : resource name may not be empty" time="2024-09-03T12:57:17Z" level=info msg="AgentServiceConfig (LocalClusterImport) Reconcile ended" {"level":"error","ts":"2024-09-03T12:57:17Z","msg":"Reconciler error","controller":"agentserviceconfig","controllerGroup":"agent-install.openshift.io","controllerKind":"AgentServiceConfig","AgentServiceConfig":{"name":"agent"},"namespace":"","name":"agent","reconcileID":"470afd7d-ec86-4d45-818f-eb6ebb4caa3d","error":"failed to delete ClusterDeployment in namespace : resource name may not be empty","errorVerbose":"resource name may not be empty\nfailed to delete ClusterDeployment in namespace \ngithub.com/openshift/assisted-service/internal/controller/controllers.(*LocalClusterImportReconciler).deleteClusterDeployment\n\t/remote-source/assisted-service/app/internal/controller/controllers/local_cluster_import_controller.go:250\ngithub.com/openshift/assisted-service/internal/controller/controllers.(*LocalClusterImportReconciler).ensureLocalClusterCRsDeleted\n\t/remote-source/assisted-service/app/internal/controller/controllers/local_cluster_import_controller.go:333\ngithub.com/openshift/assisted-service/internal/controller/controllers.(*LocalClusterImportReconciler).Reconcile\n\t/remote-source/assisted-service/app/internal/controller/controllers/local_cluster_import_controller.go:116\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/remote-source/assisted-service/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/assisted-service/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/assisted-service/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/assisted-service/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1695","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/assisted-service/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/assisted-service/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/assisted-service/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}
How reproducible:
100%
Steps to reproduce:
1. Delete AgentServiceConfig resource
Actual results:
The AgentServiceConfig isn't removed
Expected results:
The AgentServiceConfig is removed
Description of problem:
Using the latest main branch hypershift client to create a 4.15 hc, the capi provider crashed with the logs:
$ oc logs capi-provider-647f454bf-sqq9c Defaulted container "manager" out of: manager, token-minter, availability-prober (init) invalid argument "EKS=false,ROSA=false" for "--feature-gates" flag: unrecognized feature gate: ROSA Usage of /bin/cluster-api-provider-aws-controller-manager: invalid argument "EKS=false,ROSA=false" for "--feature-gates" flag: unrecognized feature gate: ROSA
Version-Release number of selected component (if applicable):
4.15 HC
How reproducible:
100%
Steps to Reproduce:
1. Just use the latest main CLI to create a public AWS 4.15 HC 2. 3.
Actual results:
capi-provider pod crashed
Expected results:
the 4.15 hc could be created successfully
Additional info:
probably related to
slack: https://redhat-internal.slack.com/archives/G01QS0P2F6W/p1724249475037359
Description of problem:
Removing third party override of cloud-provider-vsphere's config package
Version-Release number of selected component (if applicable):
4.18, 4.17.z
How reproducible:
Always
Additional info:
The upstream package was overridden to fix logging confusion while we waited for an upstream fix. The fix is now ready, and the third-party override needs to be removed.
Description of problem:
After branching, main branch still publishes Konflux builds to mce-2.7
Version-Release number of selected component (if applicable):
mce-2.7
How reproducible:
100%
Steps to Reproduce:
1. Post a PR to main
2. Check the jobs that run
Actual results:
Both mce-2.7 and main Konflux builds get triggered
Expected results:
Only main branch Konflux builds get triggered
Additional info:
Description of problem:
After installing the MCE operator, I tried to create a MultiClusterEngine instance and it failed with the error: "error applying object Name: mce Kind: ConsolePlugin Error: conversion webhook for console.openshift.io/v1alpha1, Kind=ConsolePlugin failed: Post "https://webhook.openshift-console-operator.svc:9443/crdconvert?timeout=30s": service "webhook" not found". Checked in openshift-console-operator: there is no webhook service, and the deployment "console-conversion-webhook" is also missing.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-06-25-103421
How reproducible:
Always
Steps to Reproduce:
1. Check resources in openshift-console-operator, such as the deployment and service. 2. 3.
Actual results:
1. There is no webhook-related deployment, pod, or service.
Expected results:
1. The webhook-related resources should exist.
Additional info:
Description of problem:
Edit Deployment and Edit DeploymentConfig actions redirect user to project workloads page instead of resource details page
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-22-123921
How reproducible:
Always
Steps to Reproduce:
1. User uses the `Edit Deployment` or `Edit DeploymentConfig` action in either Form or YAML view and saves the changes
Actual results:
1. user will be redirected to project workloads page
Expected results:
1. user should be taken to resource details page
Additional info:
Description of problem:
Cancelling the file browser dialog after an initial file was previously uploaded causes a TypeError crash
Version-Release number of selected component (if applicable):
4.18.0-0.ci-2024-10-30-043000
How reproducible:
always
Steps to Reproduce:
1. User logs in to the console 2. Goes to Secrets -> Create Image pull secret; on the page - Secret name: test-secret - Authentication type: Upload configuration file - click on browse and upload some file. 3. Then we try to browse for another file, but instead of uploading another file we cancel the file chooser dialog; the console crashes with 'Cannot read properties of undefined (reading 'size')' error.
Actual results:
Console crashes with 'Cannot read properties of undefined (reading 'size')' error
Expected results:
Console should not crash.
Additional info:
Description of problem:
On one ingress details page, click "Edit" button for Labels, it opens annotation edit modal.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-10-133647 4.17.0-0.nightly-2024-09-09-120947
How reproducible:
Always
Steps to Reproduce:
1. Go to one ingress details page, click the "Edit" button for Labels. 2. 3.
Actual results:
1. The "Edit annotations" modal is opened.
Expected results:
1. Should open "Edit labels" modal.
Additional info:
Description of problem:
Enabling the Shipwright tests in CI
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
In cluster-capi-operator, if the VsphereCluster object gets deleted, the controller attempts to recreate it and fails while trying to also recreate its corresponding vsphere credentials secret, which instead still exists. The failure is highlighted by the following logs in the controller: `resourceVersion should not be set on objects to be created`
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Delete VsphereCluster 2. Check the cluster-capi-operator logs 3.
Actual results:
VsphereCluster fails to be recreated as the reconciliation fails during ensuring the vsphere credentials secret
Expected results:
VsphereCluster gets recreated
Additional info:
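As a hedged sketch of the usual fix pattern for the "resourceVersion should not be set on objects to be created" error (not necessarily this operator's actual code), the server-populated metadata can be stripped from the cached copy before re-creating the object:

~~~
// Minimal sketch: re-create a Secret from an existing copy without carrying
// over server-populated metadata, which the Create call rejects. Written as a
// fragment meant to live inside a controller package.
package recreate

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// recreateSecret clears the fields the API server owns and submits the copy.
func recreateSecret(ctx context.Context, c client.Client, existing *corev1.Secret) error {
	fresh := existing.DeepCopy()
	fresh.ResourceVersion = ""
	fresh.UID = ""
	fresh.CreationTimestamp = metav1.Time{}
	fresh.ManagedFields = nil
	return c.Create(ctx, fresh)
}
~~~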
Please review the following PR: https://github.com/openshift/csi-operator/pull/231
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
On 1.8.2024, the assisted-installer-agent job started failing the subsystem test "add_multiple_servers". We need to make sure it occurs only in tests, and the fix should be backported.
Description of problem:
There is a spelling error for the word `instal`; it should be `install`
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-03-211053
How reproducible:
Always
Steps to Reproduce:
1. normal user open Lightspeed hover button, check the messages 2. 3.
Actual results:
Must have administrator accessContact your administrator and ask them to instal Red Hat OpenShift Lightspeed.
Expected results:
word `instal` should be `install`
Additional info:
Description of problem:
When we enable OCB functionality and create a MC that configures an enforcing=0 kernel argument, the MCP is degraded, reporting this message: { "lastTransitionTime": "2024-05-30T09:37:06Z", "message": "Node ip-10-0-29-166.us-east-2.compute.internal is reporting: \"unexpected on-disk state validating against quay.io/mcoqe/layering@sha256:654149c7e25a1ada80acb8eedc3ecf9966a8d29e9738b39fcbedad44ddd15ed5: missing expected kernel arguments: [enforcing=0]\"", "reason": "1 nodes are reporting degraded status on sync", "status": "True", "type": "NodeDegraded" },
Version-Release number of selected component (if applicable):
IPI on AWS $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-05-30-021120 True False 97m Error while reconciling 4.16.0-0.nightly-2024-05-30-021120: the cluster operator olm is not available
How reproducible:
Always
Steps to Reproduce:
1. Enable techpreview $ oc patch featuregate cluster --type=merge -p '{"spec":{"featureSet": "TechPreviewNoUpgrade"}}' 2. Configure a MSOC resource to enable OCB functionality in the worker pool When we hit this problem we were using the mcoqe quay repository. A copy of the pull-secret for baseImagePullSecret and renderedImagePushSecret and no currentImagePullSecret configured. apiVersion: machineconfiguration.openshift.io/v1alpha1 kind: MachineOSConfig metadata: name: worker spec: machineConfigPool: name: worker # buildOutputs: # currentImagePullSecret: # name: "" buildInputs: imageBuilder: imageBuilderType: PodImageBuilder baseImagePullSecret: name: pull-copy renderedImagePushSecret: name: pull-copy renderedImagePushspec: "quay.io/mcoqe/layering:latest" 3. Create a MC to use enforing=0 kernel argument { "kind": "List", "apiVersion": "v1", "metadata": {}, "items": [ { "apiVersion": "machineconfiguration.openshift.io/v1", "kind": "MachineConfig", "metadata": { "labels": { "machineconfiguration.openshift.io/role": "worker" }, "name": "change-worker-kernel-selinux-gvr393x2" }, "spec": { "config": { "ignition": { "version": "3.2.0" } }, "kernelArguments": [ "enforcing=0" ] } } ] }
Actual results:
The worker MCP is degraded reporting this message: oc get mcp worker -oyaml .... { "lastTransitionTime": "2024-05-30T09:37:06Z", "message": "Node ip-10-0-29-166.us-east-2.compute.internal is reporting: \"unexpected on-disk state validating against quay.io/mcoqe/layering@sha256:654149c7e25a1ada80acb8eedc3ecf9966a8d29e9738b39fcbedad44ddd15ed5: missing expected kernel arguments: [enforcing=0]\"", "reason": "1 nodes are reporting degraded status on sync", "status": "True", "type": "NodeDegraded" },
Expected results:
The MC should be applied without problems and selinux should be using enforcing=0
Additional info:
Description of problem:
In hostedcluster installations, when the OAuthServer service is configured without any hostname parameter, the oauth route is created in the management cluster with the standard hostname, following the pattern of the ingresscontroller wildcard domain (oauth-<hosted-cluster-namespace>.<wildcard-default-ingress-controller-domain>):

~~~
$ oc get hostedcluster -n <namespace> <hosted-cluster-name> -oyaml
  - service: OAuthServer
    servicePublishingStrategy:
      type: Route
~~~

On the other hand, if any custom hostname parameter is configured, the oauth route is created in the management cluster with the following labels:

~~~
$ oc get hostedcluster -n <namespace> <hosted-cluster-name> -oyaml
  - service: OAuthServer
    servicePublishingStrategy:
      route:
        hostname: oauth.<custom-domain>
      type: Route

$ oc get routes -n hcp-ns --show-labels
NAME    HOST/PORT               LABELS
oauth   oauth.<custom-domain>   hypershift.openshift.io/hosted-control-plane=hcp-ns <---
~~~

The configured label causes the ingresscontroller not to admit the route, as the following configuration is added by the hypershift operator to the default ingresscontroller resource:

~~~
$ oc get ingresscontroller -n openshift-ingress-default default -oyaml
  routeSelector:
    matchExpressions:
    - key: hypershift.openshift.io/hosted-control-plane <---
      operator: DoesNotExist <---
~~~

This configuration should be allowed, as there are use cases where the route should have a customized hostname. Currently the HCP platform is not allowing this configuration and the oauth route does not work.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Easily
Steps to Reproduce:
1. Install HCP cluster 2. Configure OAuthServer with type Route 3. Add a custom hostname different than default wildcard ingress URL from management cluster
Actual results:
Oauth route is not admitted
Expected results:
Oauth route should be admitted by Ingresscontroller
Additional info:
Version of components:
OCP version
4.16.0-0.nightly-2024-11-05-003735
Operator bundle: quay.io/rhobs/observability-operator-bundle:0.4.3-241105092032
Description of issue:
When Tracing UI plugin instance is created. The distributed-tracing-* pod shows the following errors and the Tracing UI is not available in the OCP web console.
% oc logs distributed-tracing-745f655d84-2jk6b time="2024-11-05T13:08:37Z" level=info msg="enabled features: []\n" module=main time="2024-11-05T13:08:37Z" level=error msg="cannot read base manifest file" error="open web/dist/plugin-manifest.json: no such file or directory" module=manifest time="2024-11-05T13:08:37Z" level=info msg="listening on https://:9443" module=server I1105 13:08:37.620932 1 tlsconfig.go:240] "Starting DynamicServingCertificateController" 10.128.0.109 - - [05/Nov/2024:13:08:54 +0000] "GET /plugin-manifest.json HTTP/1.1" 500 62 10.128.0.109 - - [05/Nov/2024:13:08:54 +0000] "GET /plugin-manifest.json HTTP/1.1" 500 62 10.128.0.109 - - [05/Nov/2024:13:09:10 +0000] "GET /plugin-manifest.json HTTP/1.1" 500 62 10.128.0.109 - - [05/Nov/2024:13:09:25 +0000] "GET /plugin-manifest.json HTTP/1.1" 500 62
Steps to reproduce the issue:
* Install the latest operator bundle.
quay.io/rhobs/observability-operator-bundle:0.4.3-241105092032
* Set the -openshift.enabled flag in the CSV.
* Create the Tracing UI plugin instance and check the UI plugin pod logs.
Description of problem: If a customer applies ethtool configuration to the interface used in br-ex, that configuration will be dropped when br-ex is created. We need to read and apply the configuration from the interface to the phys0 connection profile, as described in https://issues.redhat.com/browse/RHEL-56741?focusedId=25465040&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25465040
Version-Release number of selected component (if applicable): 4.16
How reproducible: Always
Steps to Reproduce:
1. Deploy a cluster with an NMState config that sets the ethtool.feature.esp-tx-csum-hw-offload field to "off"
2.
3.
Actual results: The ethtool setting is only applied to the interface profile which is disabled after configure-ovs runs
Expected results: The ethtool setting is present on the configure-ovs-created profile
Additional info:
Affected Platforms: VSphere. Probably baremetal too and possibly others.
Description of problem:
The whereabouts kubeconfig is known to expire. If the cluster credentials and the kubernetes secret change, the whereabouts kubeconfig (which is stored on disk) is not updated to reflect the credential change.
Version-Release number of selected component (if applicable):
>= 4.8.z (all OCP versions which ship Whereabouts)
How reproducible:
With time.
Steps to Reproduce:
1. Wait for cluster credentials to expire (which may take a year depending on cluster configuration) (currently unaware of a technique to force a credentials change to the serviceaccount secret token)
Actual results:
Kubeconfig is out of date and Whereabouts cannot properly authenticate with API server
Expected results:
Kubeconfig is updated and Whereabouts can authenticate with API server
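As a hedged sketch of one possible direction (not the Whereabouts implementation; the output path and refresh interval are assumptions), the on-disk kubeconfig could be re-rendered periodically from the pod's projected service account credentials so a rotated token or CA is picked up:

~~~
// Minimal sketch: periodically rewrite a kubeconfig from the projected
// service account token and CA so credential rotation is reflected on disk.
package main

import (
	"os"
	"time"

	"k8s.io/client-go/tools/clientcmd"
	clientcmdapi "k8s.io/client-go/tools/clientcmd/api"
)

const (
	tokenPath = "/var/run/secrets/kubernetes.io/serviceaccount/token"
	caPath    = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
	// outPath is assumed for illustration only.
	outPath = "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
)

func writeKubeconfig(apiServer string) error {
	token, err := os.ReadFile(tokenPath)
	if err != nil {
		return err
	}
	cfg := clientcmdapi.NewConfig()
	cfg.Clusters["local"] = &clientcmdapi.Cluster{Server: apiServer, CertificateAuthority: caPath}
	cfg.AuthInfos["whereabouts"] = &clientcmdapi.AuthInfo{Token: string(token)}
	cfg.Contexts["whereabouts"] = &clientcmdapi.Context{Cluster: "local", AuthInfo: "whereabouts"}
	cfg.CurrentContext = "whereabouts"
	return clientcmd.WriteToFile(*cfg, outPath)
}

func main() {
	// Re-render on an interval; the kubelet rotates the projected token.
	for {
		_ = writeKubeconfig("https://kubernetes.default.svc")
		time.Sleep(5 * time.Minute)
	}
}
~~~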
Description of the problem:
Trying to create a cluster (multi-node; operators: mtv + cnv + lvms) with minimal requirements,
according to the preflight response (attached below):
We should need 5 vCPU cores as the minimal requirement for a worker;
however, when creating the cluster it is asking for 6 instead of 5.
The tooltip says:
Require at least 6 CPU cores for worker role, found only 5.
{"ocp":{"master":{"qualitative":null,"quantitative":{"cpu_cores":4,"disk_size_gb":20,"installation_disk_speed_threshold_ms":10,"network_latency_threshold_ms":100,"packet_loss_percentage":0,"ram_mib":16384}},"worker":{"qualitative":null,"quantitative":{"cpu_cores":2,"disk_size_gb":20,"installation_disk_speed_threshold_ms":10,"network_latency_threshold_ms":1000,"packet_loss_percentage":10,"ram_mib":8192}}},"operators":[{"dependencies":[],"operator_name":"lso","requirements":{"master":{"qualitative":null,"quantitative":{}},"worker":{"qualitative":null,"quantitative":{}}}},{"dependencies":["lso"],"operator_name":"odf","requirements":{"master":{"qualitative":["Requirements apply only for master-only clusters","At least 3 hosts","At least 1 non-boot SSD or HDD disk on 3 hosts"],"quantitative":{"cpu_cores":6,"ram_mib":19456}},"worker":{"qualitative":["Requirements apply only for clusters with workers","5 GiB of additional RAM for each non-boot disk","2 additional CPUs for each non-boot disk","At least 3 workers","At least 1 non-boot SSD or HDD disk on 3 workers"],"quantitative":{"cpu_cores":8,"ram_mib":19456}}}},{"dependencies":["lso"],"operator_name":"cnv","requirements":{"master":{"qualitative":["Additional 1GiB of RAM per each supported GPU","Additional 1GiB of RAM per each supported SR-IOV NIC","CPU has virtualization flag (vmx or svm)"],"quantitative":{"cpu_cores":4,"ram_mib":150}},"worker":{"qualitative":["Additional 1GiB of RAM per each supported GPU","Additional 1GiB of RAM per each supported SR-IOV NIC","CPU has virtualization flag (vmx or svm)"],"quantitative":{"cpu_cores":2,"ram_mib":360}}}},{"dependencies":[],"operator_name":"lvm","requirements":{"master":{"qualitative":["At least 1 non-boot disk per host","100 MiB of additional RAM","1 additional CPUs for each non-boot disk"],"quantitative":{"cpu_cores":1,"ram_mib":100}},"worker":{"qualitative":null,"quantitative":{}}}},{"dependencies":[],"operator_name":"mce","requirements":{"master":{"qualitative":[],"quantitative":{"cpu_cores":4,"ram_mib":16384}},"worker":{"qualitative":[],"quantitative":{"cpu_cores":4,"ram_mib":16384}}}},{"dependencies":["cnv"],"operator_name":"mtv","requirements":{"master":{"qualitative":["1024 MiB of additional RAM","1 additional CPUs"],"quantitative":{"cpu_cores":1,"ram_mib":1024}},"worker":{"qualitative":["1024 MiB of additional RAM","1 additional CPUs"],"quantitative":{"cpu_cores":1,"ram_mib":1024}}}}]}
How reproducible:
100%
Steps to reproduce:
1. create a multi cluster
2. select mtv + lvms + cnv
3. Add a worker node with 5 CPU cores
Actual results:
Unable to continue the installation process; the cluster is asking for an extra CPU core.
Expected results:
Should be able to install the cluster; 5 CPUs should be enough.
Description of problem:
Trying to install AWS EFS Driver 4.15 in 4.16 OCP. And driver pods get stuck with the below error: $ oc get pods NAME READY STATUS RESTARTS AGE aws-ebs-csi-driver-controller-5f85b66c6-5gw8n 11/11 Running 0 80m aws-ebs-csi-driver-controller-5f85b66c6-r5lzm 11/11 Running 0 80m aws-ebs-csi-driver-node-4mcjp 3/3 Running 0 76m aws-ebs-csi-driver-node-82hmk 3/3 Running 0 76m aws-ebs-csi-driver-node-p7g8j 3/3 Running 0 80m aws-ebs-csi-driver-node-q9bnd 3/3 Running 0 75m aws-ebs-csi-driver-node-vddmg 3/3 Running 0 80m aws-ebs-csi-driver-node-x8cwl 3/3 Running 0 80m aws-ebs-csi-driver-operator-5c77fbb9fd-dc94m 1/1 Running 0 80m aws-efs-csi-driver-controller-6c4c6f8c8c-725f4 4/4 Running 0 11m aws-efs-csi-driver-controller-6c4c6f8c8c-nvtl7 4/4 Running 0 12m aws-efs-csi-driver-node-2frs7 0/3 Pending 0 6m29s aws-efs-csi-driver-node-5cpb8 0/3 Pending 0 6m26s aws-efs-csi-driver-node-bchg5 0/3 Pending 0 6m28s aws-efs-csi-driver-node-brndb 0/3 Pending 0 6m27s aws-efs-csi-driver-node-qcc4m 0/3 Pending 0 6m27s aws-efs-csi-driver-node-wpk5d 0/3 Pending 0 6m27s aws-efs-csi-driver-operator-6b54c78484-gvxrt 1/1 Running 0 13m Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedScheduling 6m58s default-scheduler 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match Pod's node affinity/selector. Warning FailedScheduling 3m42s (x2 over 4m24s) default-scheduler 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match Pod's node affinity/selector.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
all the time
Steps to Reproduce:
1. Install AWS EFS CSI driver 4.15 in 4.16 OCP 2. 3.
Actual results:
EFS CSI drive node pods are stuck in pending state
Expected results:
All pod should be running.
Additional info:
More info on the initial debug here: https://redhat-internal.slack.com/archives/CBQHQFU0N/p1715757611210639
Description of problem:
In a 4.18 Azure Stack Hub cluster, the Azure-Disk CSI Driver doesn't work, failing with the following error when provisioning a volume:
E1024 05:36:01.335536 1 utils.go:110] GRPC error: rpc error: code = Internal desc = PUT https://management.mtcazs.wwtatc.com/subscriptions/de7e09c3-b59a-4c7d-9c77-439c11b92879/resourceGroups/ci-op-wv5kxjrl-cc5c6/providers/Microsoft.Compute/disks/pvc-854653a6-6107-44ff-95e3-a6d588864420 -------------------------------------------------------------------------------- RESPONSE 400: 400 Bad Request ERROR CODE: NoRegisteredProviderFound -------------------------------------------------------------------------------- { "error": { "code": "NoRegisteredProviderFound", "message": "No registered resource provider found for location 'mtcazs' and API version '2023-10-02' for type 'disks'. The supported api-versions are '2017-03-30, 2018-04-01, 2018-06-01, 2018-09-30, 2019-03-01, 2019-07-01, 2019-11-01'. The supported locations are 'mtcazs'." } } --------------------------------------------------------------------------------
Version-Release number of selected component (if applicable):
OCP:4.18.0-0.nightly-2024-10-23-112324 AzureDisk CSI Driver: v1.30.4
How reproducible:
Always
Steps to Reproduce:
1. Create cluster on Azure Stack Hub with prometheus pvc configurated 2. Volume provisioning failed due to "NoRegisteredProviderFound"
Actual results:
Volume provisioning failed
Expected results:
Volume provisioning should succeed
Additional info:
Summary
Duplicate issue of https://issues.redhat.com/browse/OU-258.
To pass the CI/CD requirements of the openshift/console each PR needs to have a issue in a OCP own Jira board.
This issue migrates the rendering of the Developer Perspective > Observe > Metrics page from openshift/console to openshift/monitoring-plugin.
openshift/console PR#4187: Removes the Metrics Page.
openshift/monitoring-plugin PR#138: Adds the Metrics Page & consolidates the code to use the same components as the Administrative > Observe > Metrics Page.
—
Testing
Both openshift/console PR#4187 & openshift/monitoring-plugin PR#138 need to be launched to see the full feature. After launching both the PRs you should see a page like the screenshot attached below.
—
Excerpt from OU-258 : https://issues.redhat.com/browse/OU-258 :
The admin console's alert details page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.
The UX of the two pages differs somewhat, so we will need to decide whether we can change the dev console to use the same UX as the admin page or whether we need to keep some differences. This is an opportunity to bring the improved PromQL editing UX from the admin console to the dev console.
OCPBUGS-36283 introduced the ability to switch on TLS between the BMC and Metal3's httpd server. It is currently off by default to make the change backportable without a high risk of regressions. We need to turn it on for 4.18+ for consistency with CBO-deployed Metal3.
Description of problem:
The kubeconfigs for the DNS Operator and the Ingress Operator are managed by Hypershift and they should only be managed by the cloud service provider. This can lead to the kubeconfig/certificate being invalid in the cases where the cloud service provider further manages the kubeconfig (for example ca-rotation).
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Circular dependencies in the OCP Console prevent the migration to Webpack 5
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Always
Steps to Reproduce:
1. Enable the CHECK_CYCLES env var while building 2. Observe errors 3.
Actual results:
There are errors
Expected results:
No errors
Additional info:
Description of problem:
The OpenShift Pipelines operator automatically installs a OpenShift console plugin. The console plugin metrics reports this as unknown after the plugin was renamed from "pipeline-console-plugin" to "pipelines-console-plugin".
Version-Release number of selected component (if applicable):
4.14+
How reproducible:
Always
Steps to Reproduce:
Actual results:
It shows an "unknown" plugin in the metrics.
Expected results:
It should shows a "pipelines" plugin in the metrics.
Additional info:
None
Description of problem:
We are in a live migration scenario.
If a project has a networkpolicy to allow from the host network (more concretely, to allow from the ingress controllers and the ingress controllers are in the host network), traffic doesn't work during the live migration between any ingress controller node (either migrated or not migrated) and an already migrated application node.
I'll expand later in the description and internal comments, but the TL;DR is that the IPs of the tun0 of not migrated source nodes and the IPs of the ovn-k8s-mp0 from migrated source nodes are not added to the address sets related to the networkpolicy ACL in the target OVN-Kubernetes node, so that traffic is not allowed.
Version-Release number of selected component (if applicable):
4.16.13
How reproducible:
Always
Steps to Reproduce:
1. Before the migration: have a project with a networkpolicy that allows from the ingress controller and the ingress controller in the host network. Everything must work properly at this point.
2. Start the migration
3. During the migration, check connectivity from the host network of either a migrated node or a non-migrated node. Both will fail (checking from the same node doesn't fail)
Actual results:
Pod on the worker node is not reachable from the host network of the ingress controller node (unless the pod is in the same node than the ingress controller), which causes the ingress controller routes to throw 503 error.
Expected results:
Pod on the worker node to be reachable from the ingress controller node, even when the ingress controller node has not migrated yet and the application node has.
Additional info:
This is not a duplicate of OCPBUGS-42578. This bug refers to the host-to-pod communication path while the other one doesn't.
This is a customer issue. More details to be included in private comments for privacy.
Workaround: Creating a networkpolicy that explicitly allows traffic from tun0 and ovn-k8s-mp0 interfaces. However, note that the workaround can be problematic for clusters with hundreds or thousands of projects. Another possible workaround is to temporarily delete all the networkpolicies of the projects. But again, this may be problematic (and a security risk).
"operator conditions kube-apiserver" is showing as regressed in 4.17 (and 4.18) for metal and vsphere.
Stephen Benjamin noted there is one line of JQ used to create the tests and has offered to try to stabilize that code some. Ultimately TRT-1764 is intended to build out a smarter framework. This bug is to see what can be done in the short term.
Description of problem:
Shipwright operator installation through CLI is failing - Failure: # Shipwright build details page.Shipwright build details page Shipwright tab should be default on first open if the operator is installed (ODC-7623): SWB-01-TC01 Error: Failed to install Shipwright Operator - Pod timeout
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
In the Secret details view, if one of the data properties from the Secret contains a tab character, it is considered "unprintable" and the content cannot be viewed in the console. This is not correct. Tab characters can be printed and should not prevent content from being viewed. We have a dependency, "istextorbinary", that will determine if a buffer contains binary. We should use it here.
Version-Release number of selected component (if applicable): 4.18.0
How reproducible:
Always
Steps to Reproduce:
1. Download [this file](https://gist.github.com/TheRealJon/eb1e2eaf80c923938072f8a997fed3cd/raw/04b7307d31a825ae686affd9da0c0914d490abd3/pull-secret-with-tabs.json) 2. Run this command: oc create secret generic test -n default --from-file=.dockerconfigjson=<path-to-file-from-step-1> --type=kubernetes.io/dockerconfigjson 3. In the console, navigate to Workloads -> Secrets and make sure that the "default" project is selected from the project dropdown. 4. Select the Secret named "test" 5. Scroll to the bottom to view the data content of the Secret
Actual results:
The "Save this file" option is shown, and user is unable to reveal the contents of the Secret
Expected results:
The "Save this file" option should not be shown, the obfuscated content should be rendered, and the reveal/hide button should show and hide the content from the pull secret.
Additional info:
There is logic in this view that prevents us from trying to render binary data by detecting "unprintable characters". The regex for this includes the Tab character, which is incorrect, since that character is printable.
Refactor the name to Dockerfile.ocp as a better, version-independent alternative.
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/126
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
[sig-arch] events should not repeat pathologically for ns/openshift-machine-api
The machine-api resource seems to not be responding to the `/healthz` requests from kubelet, causing an increase in probe error events. The pod does seem to be up, and a preliminary look at Loki shows that the `/healthz` endpoint does seem to be up, but it loses the leader lease in between, before starting the health probe again.
(read from bottom up)
I1016 19:51:31.418815 1 server.go:191] "Starting webhook server" logger="controller-runtime.webhook" I1016 19:51:31.418764 1 server.go:247] "Serving metrics server" logger="controller-runtime.metrics" bindAddress=":8082" secure=false I1016 19:51:31.418703 1 server.go:83] "starting server" name="health probe" addr="[::]:9441" I1016 19:51:31.418650 1 server.go:208] "Starting metrics server" logger="controller-runtime.metrics" 2024/10/16 19:51:31 Starting the Cmd. ... 2024/10/16 19:50:44 leader election lost I1016 19:50:44.406280 1 leaderelection.go:297] failed to renew lease openshift-machine-api/cluster-api-provider-machineset-leader: timed out waiting for the condition error E1016 19:50:44.406230 1 leaderelection.go:436] error retrieving resource lock openshift-machine-api/cluster-api-provider-machineset-leader: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-api-provider-machineset-leader": context deadline exceeded error E1016 19:50:37.430054 1 leaderelection.go:429] Failed to update lock optimitically: rpc error: code = DeadlineExceeded desc = context deadline exceeded, falling back to slow path error E1016 19:50:04.423920 1 leaderelection.go:436] error retrieving resource lock openshift-machine-api/cluster-api-provider-machineset-leader: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io cluster-api-provider-machineset-leader) error E1016 19:49:04.422237 1 leaderelection.go:429] Failed to update lock optimitically: rpc error: code = DeadlineExceeded desc = context deadline exceeded, falling back to slow path .... I1016 19:46:21.358989 1 server.go:83] "starting server" name="health probe" addr="[::]:9441" I1016 19:46:21.358891 1 server.go:247] "Serving metrics server" logger="controller-runtime.metrics" bindAddress=":8082" secure=false I1016 19:46:21.358682 1 server.go:208] "Starting metrics server" logger="controller-runtime.metrics" 2024/10/16 19:46:21 Starting the Cmd.
Description of problem:
Circular dependencies in the OCP Console prevent the migration to Webpack 5
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Always
Steps to Reproduce:
1. Enable the CHECK_CYCLES env var while building 2. Observe errors 3.
Actual results:
There are errors
Expected results:
No errors
Additional info:
Description of problem:
The image ecosystem testsuite sometimes fails due to timeouts in samples smoke tests in origin - the tests starting with "[sig-devex][Feature:ImageEcosystem][Slow] openshift sample application repositories". These can be caused by either the build taking too long (for example the rails application tends to take quite a while to build) or the application actually can start quite slowly. There is no bullet proof solution here but to try and increase the timeouts to a value that both provides enough time and doesn't stall the testsuite for too long.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Run the image-ecosystem testsuite 2. 3.
Actual results:
sometime the testsuite fails because of timeouts
Expected results:
no timeouts
Additional info:
Description of problem:
ConsolePlugin example YAML lacks required data
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-08-30-231249
How reproducible:
Always
Steps to Reproduce:
1. goes to ConsolePlugins list page /k8s/cluster/customresourcedefinitions/consoleplugins.console.openshift.io/instances or /k8s/cluster/console.openshift.io~v1~ConsolePlugin 2. Click on 'Create ConsolePlugin' button
Actual results:
The example YAML is quite simple and lacks required data; the user will get various errors when trying to use the example YAML:
apiVersion: console.openshift.io/v1
kind: ConsolePlugin
metadata:
  name: example
spec: {}
Expected results:
We should add complete YAML as an example or create a default Sample
Additional info:
Description of problem:
Add two new props to VirtualizedTable in order to make the header checkbox work: allRowsSelected and canSelectAll. allRowsSelected will check the checkbox, and canSelectAll will control whether the header checkbox is shown or hidden.
Description of problem:
When the vSphere CSI driver is removed (using managementState: Removed), it leaves all existing conditions in the ClusterCSIDriver. IMO it should delete all of them and keep something like "Disabled: true", as we use for the Manila CSI driver operator.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-09-031511
How reproducible: always
Steps to Reproduce:
Actual results: All Deployment + DaemonSet conditions are present
Expected results: The conditions are pruned.
Description of problem:
OpenShift automatically installs the OpenShift networking plugin, but the console plugin metrics reports this as "unknown".
Version-Release number of selected component (if applicable):
4.17+ ???
How reproducible:
Always
Steps to Reproduce:
Actual results:
It shows an "unknown" plugin in the metrics.
Expected results:
It should shows a "networking" plugin in the metrics.
Additional info:
None
Description of problem:
While working on the readiness probes we have discovered that the single member health check always allocates a new client. Since this is an expensive operation, we can make use of the pooled client (that already has a connection open) and change the endpoints for a brief period of time to the single member we want to check. This should reduce CEO's and etcd CPU consumption.
Version-Release number of selected component (if applicable):
any supported version
How reproducible:
always, but technical detail
Steps to Reproduce:
na
Actual results:
CEO creates a new etcd client when it is checking a single member health
Expected results:
CEO should use the existing pooled client to check for single member health
Additional info:
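As a hedged sketch of the idea (not CEO's actual code), the etcd v3 client's maintenance API already lets an existing, pooled client probe one member by dialing the given endpoint directly, which even sidesteps the brief endpoint swap described above; the endpoint value below is a placeholder:

~~~
// Minimal sketch: check a single etcd member's health with an already-pooled
// client instead of constructing a new client per check.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// checkMember probes one member endpoint with the maintenance API; Status
// dials that endpoint directly, so no per-check client is needed.
func checkMember(ctx context.Context, c *clientv3.Client, endpoint string) error {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	resp, err := c.Status(ctx, endpoint)
	if err != nil {
		return fmt.Errorf("member %s unhealthy: %w", endpoint, err)
	}
	if len(resp.Errors) > 0 {
		return fmt.Errorf("member %s reports errors: %v", endpoint, resp.Errors)
	}
	return nil
}

func main() {
	// Placeholder endpoint; in CEO it would come from the member list.
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"https://10.0.0.5:2379"}})
	if err != nil {
		fmt.Println(err)
		return
	}
	defer cli.Close()
	fmt.Println(checkMember(context.Background(), cli, "https://10.0.0.5:2379"))
}
~~~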
Description of problem:
HyperShift currently runs 3 replicas of active/passive HA deployments such as kube-controller-manager, kube-scheduler, etc. In order to reduce the overhead of running a HyperShift control plane, we should be able to run these deployments with 2 replicas. In a 3 zone environment with 2 replicas, we can still use a rolling update strategy, and set the maxSurge value to 1, as the new pod would schedule into the unoccupied zone.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
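As a hedged sketch of what the described rollout looks like in Deployment terms (illustrative helper, not HyperShift's actual code): two replicas with a rolling update of maxSurge=1 and maxUnavailable=0, so the surge pod can schedule into the third, unoccupied zone.

~~~
// Minimal sketch: replica count and rolling-update strategy for a 2-replica
// active/passive HA deployment spread across 3 zones.
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func haStrategy() (int32, appsv1.DeploymentStrategy) {
	replicas := int32(2)
	maxSurge := intstr.FromInt32(1)
	maxUnavailable := intstr.FromInt32(0)
	return replicas, appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxSurge:       &maxSurge,
			MaxUnavailable: &maxUnavailable,
		},
	}
}

func main() {
	replicas, strategy := haStrategy()
	fmt.Println(replicas, strategy.Type)
}
~~~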
Please review the following PR: https://github.com/openshift/vsphere-problem-detector/pull/172
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/images/pull/193
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
openshift install fails with "failed to lease wait: Invalid configuration for device '0'. generated yaml below: additionalTrustBundlePolicy: Proxyonly apiVersion: v1 baseDomain: XXX compute: - architecture: amd64 hyperthreading: Enabled name: worker platform: vsphere: coresPerSocket: 2 cpus: 8 memoryMB: 40960 osDisk: diskSizeGB: 150 zones: - generated-failure-domain replicas: 3 controlPlane: architecture: amd64 hyperthreading: Enabled name: master platform: vsphere: coresPerSocket: 2 cpus: 4 memoryMB: 32768 osDisk: diskSizeGB: 150 zones: - generated-failure-domain replicas: 3 metadata: creationTimestamp: null name: dc3 networking: clusterNetwork: - cidr: 10.128.0.0/14 hostPrefix: 23 machineNetwork: - cidr: 10.0.0.0/16 networkType: OVNKubernetes serviceNetwork: - 172.30.0.0/16 platform: vsphere: apiVIP: 172.21.0.20 apiVIPs: - 172.21.0.20 cluster: SA-LAB datacenter: OVH-SA defaultDatastore: DatastoreOCP failureDomains: - name: generated-failure-domain region: generated-region server: XXX topology: computeCluster: /OVH-SA/host/SA-LAB datacenter: OVH-SA datastore: /OVH-SA/datastore/DatastoreOCP networks: - ocpdemo resourcePool: /OVH-SA/host/SA-LAB/Resources zone: generated-zone ingressVIP: 172.21.0.21 ingressVIPs: - 172.21.0.21 network: ocpdemo ~~~ Truncated~~~
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1.openshift-install create cluster 2.choose Vsphere 3.
Actual results:
Error
Expected results:
Cluster creation
Additional info:
Description of problem:
A regular user can update route spec.tls.certificate/key without extra permissions, but if the user tries to edit/patch spec.tls.externalCertificate, it reports the error: spec.tls.externalCertificate: Forbidden: user does not have update permission on custom-host
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-05-21-221942
How reproducible:
100%
Steps to Reproduce:
1. Log in as a regular user and create a namespace, pod, svc and edge route $ oc create route edge myedge --service service-unsecure --cert tls.crt --key tls.key $ oc get route myedge -oyaml 2. Edit the route and remove one certificate from spec.tls.certificate $ oc edit route myedge $ oc get route myedge 3. Edit the route and restore the original spec.tls.certificate 4. Edit the route with spec.tls.externalCertificate
Actual results:
1. edge route is admitted and works well $ oc get route myedge -oyaml <......> spec: host: myedge-test3.apps.hongli-techprev.qe.azure.devcluster.openshift.com port: targetPort: http tls: certificate: | -----BEGIN CERTIFICATE----- XXXXXXXXXXXXXXXXXXXXXXXXXXX -----END CERTIFICATE----- -----BEGIN CERTIFICATE----- XXXXXXXXXXXXXXXXXXXXXXXX -----END CERTIFICATE----- key: | -----BEGIN RSA PRIVATE KEY----- <......> 2. route is failed validation since "private key does not match public key" $ oc get route myedge NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD myedge ExtendedValidationFailed service-unsecure http edge None 3. route is admitted again after the spec.tls.certificate is restored 4. reports error when updating spec.tls.externalCertificate spec.tls.externalCertificate: Forbidden: user does not have update permission on custom-host
Expected results:
The user should have the same permission to update both spec.tls.certificate and spec.tls.externalCertificate
Additional info:
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/161
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
oc-mirror produces image signature config maps in JSON format, inconsistent with other manifests, which are normally in YAML. That breaks some automation, especially the Multicloud Operators Subscription controller, which expects manifests in YAML only.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Perform release payload mirroring as documented 2. Check 'release-signatures' directory
Actual results:
There is a mix of YAML and JSON files with kubernetes manifests.
Expected results:
Manifests are stored in one format, either YAML or JSON
Additional info:
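For context on how small the conversion is, a hedged sketch (not oc-mirror's code; the ConfigMap content below is illustrative) using sigs.k8s.io/yaml, which understands Kubernetes JSON tags:

~~~
// Minimal sketch: convert a JSON manifest (such as a signature ConfigMap) to
// YAML so all generated manifests share one format.
package main

import (
	"fmt"

	"sigs.k8s.io/yaml"
)

func main() {
	jsonManifest := []byte(`{"apiVersion":"v1","kind":"ConfigMap","metadata":{"name":"sha256-example","namespace":"openshift-config-managed"},"binaryData":{}}`)

	// JSONToYAML preserves field names and values while changing only the
	// serialization format.
	out, err := yaml.JSONToYAML(jsonManifest)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
~~~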
Description of problem:
An unexpected validation failure occurs when creating the agent ISO image if the RendezvousIP is a substring of the next-hop-address set for a worker node.
For example this configuration snippet in agent-config.yaml:
apiVersion: v1alpha1
kind: AgentConfig
metadata:
  name: agent-config
rendezvousIP: 7.162.6.1
hosts:
  ...
  - hostname: worker-0
    role: worker
    networkConfig:
      interfaces:
        - name: eth0
          type: Ethernet
          state: up
          ipv4:
            enabled: true
            address:
              - ip: 7.162.6.4
                prefix-length: 25
            dhcp: false
      routes:
        config:
          - destination: 0.0.0.0/0
            next-hop-address: 7.162.6.126
            next-hop-interface: eth0
            table-id: 254
Will result in the validation failure when creating the image:
FATAL failed to fetch Agent Installer ISO: failed to fetch dependency of "Agent Installer ISO": failed to fetch dependency of "Agent Installer Artifacts": failed to fetch dependency of "Agent Installer Ignition": failed to fetch dependency of "Agent Manifests": failed to fetch dependency of "NMState Config": failed to generate asset "Agent Hosts": invalid Hosts configuration: [Hosts[3].Host: Forbidden: Host worker-0 has role 'worker' and has the rendezvousIP assigned to it. The rendezvousIP must be assigned to a control plane host.
The problem is this check here https://github.com/openshift/installer/pull/6716/files#diff-fa305fe33630f77b65bd21cc9473b620f67cfd9ce35f7ddf24d03b26ec2ccfffR293
It's checking for the IP in the raw nmConfig. The problem is that the routes stanza is also included in the nmConfig, and the route is
next-hop-address: 7.162.6.126
So when the rendezvousIP is 7.162.6.1, that strings.Contains() check returns true and the validation fails.
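A hedged sketch of the kind of fix (not the installer's actual code; the struct models only the fields this check needs): parse the interface addresses out of the nmstate document and compare IPs instead of substring-matching the raw YAML, which also contains route hops.

~~~
// Minimal sketch: decide whether a host owns the rendezvous IP by comparing
// parsed interface addresses rather than substring-matching the raw config.
package main

import (
	"fmt"
	"net"

	"sigs.k8s.io/yaml"
)

// nmState models only the fields this check needs (IPv4 only, for brevity).
type nmState struct {
	Interfaces []struct {
		IPv4 struct {
			Address []struct {
				IP string `json:"ip"`
			} `json:"address"`
		} `json:"ipv4"`
	} `json:"interfaces"`
}

func hostHasIP(rawNMConfig []byte, rendezvousIP string) (bool, error) {
	target := net.ParseIP(rendezvousIP)
	if target == nil {
		return false, fmt.Errorf("invalid rendezvousIP %q", rendezvousIP)
	}
	var state nmState
	if err := yaml.Unmarshal(rawNMConfig, &state); err != nil {
		return false, err
	}
	for _, iface := range state.Interfaces {
		for _, addr := range iface.IPv4.Address {
			if ip := net.ParseIP(addr.IP); ip != nil && ip.Equal(target) {
				return true, nil
			}
		}
	}
	return false, nil
}

func main() {
	raw := []byte(`
interfaces:
  - name: eth0
    ipv4:
      address:
        - ip: 7.162.6.4
          prefix-length: 25
routes:
  config:
    - destination: 0.0.0.0/0
      next-hop-address: 7.162.6.126
`)
	// false: the 7.162.6.126 next hop no longer causes a false match.
	fmt.Println(hostHasIP(raw, "7.162.6.1"))
}
~~~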
Sometimes users want to make modifications while installing IBI, like creating new partitions on the disk. In order to save them and not have them overridden by the coreos-installer command, we need a way to provide params to the coreos-installer command.
Description of problem:
The e2e test, TestMetrics, is repeatedly failing with the following failure message: === RUN TestMetrics utils.go:135: Setting up pool metrics utils.go:636: Applied label "node-role.kubernetes.io/metrics" to node ci-op-k925dznq-1354f-vhpxw-worker-a-sjv6q utils.go:722: Created MachineConfigPool "metrics" utils.go:140: Target Node: ci-op-k925dznq-1354f-vhpxw-worker-a-sjv6q utils.go:124: No MachineConfig provided, will wait for pool "metrics" to include MachineConfig "00-worker" utils.go:252: Pool metrics has rendered configs [00-worker] with rendered-metrics-688bea8bcb23f911e27b5d530a7385bb (waited 6.039157947s) utils.go:286: Pool metrics has completed rendered-metrics-688bea8bcb23f911e27b5d530a7385bb (waited 1m14.043792995s) utils.go:145: Error Trace: /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:145 /go/src/github.com/openshift/machine-config-operator/test/e2e/mco_test.go:149 Error: Expected nil, but got: &fmt.wrapError{msg:"node config change did not occur (waited 37.479869ms): nodes \"ci-op-k925dznq-1354f-vhpxw-worker-a-sjv6q\" not found", err:(*errors.StatusError)(0xc00071a8c0)} Test: TestMetrics
Version-Release number of selected component (if applicable):
How reproducible:
Sporadically, but could potentially block e2e.
Steps to Reproduce:
Run the e2e-gcp-op test
Actual results:
=== RUN TestMetrics utils.go:135: Setting up pool metrics utils.go:636: Applied label "node-role.kubernetes.io/metrics" to node ci-op-k925dznq-1354f-vhpxw-worker-a-sjv6q utils.go:722: Created MachineConfigPool "metrics" utils.go:140: Target Node: ci-op-k925dznq-1354f-vhpxw-worker-a-sjv6q utils.go:124: No MachineConfig provided, will wait for pool "metrics" to include MachineConfig "00-worker" utils.go:252: Pool metrics has rendered configs [00-worker] with rendered-metrics-688bea8bcb23f911e27b5d530a7385bb (waited 6.039157947s) utils.go:286: Pool metrics has completed rendered-metrics-688bea8bcb23f911e27b5d530a7385bb (waited 1m14.043792995s) utils.go:145: Error Trace: /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:145 /go/src/github.com/openshift/machine-config-operator/test/e2e/mco_test.go:149 Error: Expected nil, but got: &fmt.wrapError{msg:"node config change did not occur (waited 37.479869ms): nodes \"ci-op-k925dznq-1354f-vhpxw-worker-a-sjv6q\" not found", err:(*errors.StatusError)(0xc00071a8c0)} Test: TestMetrics
Expected results:
The test should pass
Additional info:
Please review the following PR: https://github.com/openshift/installer/pull/8960
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Following up from OCPBUGS-16357, we should enable health check of stale registration sockets in our operators.
We will need - https://github.com/kubernetes-csi/node-driver-registrar/pull/322 and we will have to enable healthcheck for registration sockets - https://github.com/kubernetes-csi/node-driver-registrar#example
Description of problem:
EncryptionAtHost and DiskEncryptionSets are two features which should not be tightly coupled. They should be able to be enabled / disabled independently. Currently EncryptionAtHost is only enabled if DiskEncryptionSetID is a valid disk encryption set resource ID. https://github.com/openshift/hypershift/blob/0cc82f7b102dcdf6e5d057255be1bdb1593d1203/hypershift-operator/controllers/nodepool/azure.go#L81-L88
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1.See comments
Actual results:
EncryptionAtHost is only set if DiskEncryptionSetID is set.
Expected results:
EncryptionAtHost and DiskEncryptionSetID should be independently settable.
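A minimal sketch of the decoupled behavior expected above; the input type and field names are hypothetical placeholders, not the actual HyperShift NodePool API:
package azure
// diskEncryptionInput is a hypothetical stand-in for the relevant NodePool
// platform fields; it is not the real HyperShift API type.
type diskEncryptionInput struct {
	EncryptionAtHost    string // "Enabled" or "Disabled"
	DiskEncryptionSetID string // optional Azure resource ID
}
// applyEncryptionSettings evaluates the two knobs independently: host
// encryption no longer depends on a DiskEncryptionSetID being present,
// and a DiskEncryptionSetID can be used without host encryption.
func applyEncryptionSettings(in diskEncryptionInput) (encryptionAtHost bool, diskEncryptionSetID string) {
	encryptionAtHost = in.EncryptionAtHost == "Enabled"
	if in.DiskEncryptionSetID != "" {
		diskEncryptionSetID = in.DiskEncryptionSetID
	}
	return encryptionAtHost, diskEncryptionSetID
}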
Additional info:
https://redhat-external.slack.com/archives/C075PHEFZKQ/p1724772123804009
The customer's cloud credentials operator generates millions of the below messages per day in the GCP cluster.
And they want to reduce/stop these logs as they are consuming more disk space. Also, their "cloud credentials" operator runs in manual mode.
time="2024-06-21T08:37:42Z" level=warning msg="read-only creds not found, using root creds client" actuator=gcp cr=openshift-cloud-credential-operator/openshift-gcp-ccm secret=openshift-cloud-credential-operator/cloud-credential-operator-gcp-ro-creds time="2024-06-21T08:37:42Z" level=error msg="error creating GCP client" error="Secret \"gcp-credentials\" not found" time="2024-06-21T08:37:42Z" level=error msg="error determining whether a credentials update is needed" actuator=gcp cr=openshift-cloud-credential-operator/openshift-gcp-ccm error="unable to check whether credentialsRequest needs update" time="2024-06-21T08:37:42Z" level=error msg="error syncing credentials: error determining whether a credentials update is needed" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-ccm secret=openshift-cloud-controller-manager/gcp-ccm-cloud-credentials time="2024-06-21T08:37:42Z" level=error msg="errored with condition: CredentialsProvisionFailure" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-ccm secret=openshift-cloud-controller-manager/gcp-ccm-cloud-credentials time="2024-06-21T08:37:42Z" level=info msg="reconciling clusteroperator status" time="2024-06-21T08:37:42Z" level=info msg="operator detects timed access token enabled cluster (STS, Workload Identity, etc.)" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-pd-csi-driver-operator time="2024-06-21T08:37:42Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-pd-csi-driver-operator time="2024-06-21T08:37:42Z" level=warning msg="read-only creds not found, using root creds client" actuator=gcp cr=openshift-cloud-credential-operator/openshift-gcp-pd-csi-driver-operator secret=openshift-cloud-credential-operator/cloud-credential-operator-gcp-ro-creds
Description of problem:
When the user selects a shared vpc install, the created control plane service account is left over. To verify, after the destruction of the cluster check the principals in the host project for a remaining name XXX-m@some-service-account.com
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
No principal remaining
Additional info:
There were remaining issues from the original issue. A new bug has been opened to address this. This is a clone of issue OCPBUGS-32947. The following is the description of the original issue:
—
Description of problem:
[vSphere] network.devices, template and workspace will be cleared when deleting the controlplanemachineset, updating these fields will not trigger an update
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-04-23-032717
How reproducible:
Always
Steps to Reproduce:
1.Install a vSphere 4.16 cluster, we use automated template: ipi-on-vsphere/versioned-installer liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-04-23-032717 True False 24m Cluster version is 4.16.0-0.nightly-2024-04-23-032717 2.Check the controlplanemachineset, you can see network.devices, template and workspace have value. liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset NAME DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE cluster 3 3 3 3 Active 51m liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset cluster -oyaml apiVersion: machine.openshift.io/v1 kind: ControlPlaneMachineSet metadata: creationTimestamp: "2024-04-25T02:52:11Z" finalizers: - controlplanemachineset.machine.openshift.io generation: 1 labels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl name: cluster namespace: openshift-machine-api resourceVersion: "18273" uid: f340d9b4-cf57-4122-b4d4-0f45f20e4d79 spec: replicas: 3 selector: matchLabels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl machine.openshift.io/cluster-api-machine-role: master machine.openshift.io/cluster-api-machine-type: master state: Active strategy: type: RollingUpdate template: machineType: machines_v1beta1_machine_openshift_io machines_v1beta1_machine_openshift_io: failureDomains: platform: VSphere vsphere: - name: generated-failure-domain metadata: labels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl machine.openshift.io/cluster-api-machine-role: master machine.openshift.io/cluster-api-machine-type: master spec: lifecycleHooks: {} metadata: {} providerSpec: value: apiVersion: machine.openshift.io/v1beta1 credentialsSecret: name: vsphere-cloud-credentials diskGiB: 120 kind: VSphereMachineProviderSpec memoryMiB: 16384 metadata: creationTimestamp: null network: devices: - networkName: devqe-segment-221 numCPUs: 4 numCoresPerSocket: 4 snapshot: "" template: huliu-vs425c-f5tfl-rhcos-generated-region-generated-zone userDataSecret: name: master-user-data workspace: datacenter: DEVQEdatacenter datastore: /DEVQEdatacenter/datastore/vsanDatastore folder: /DEVQEdatacenter/vm/huliu-vs425c-f5tfl resourcePool: /DEVQEdatacenter/host/DEVQEcluster/Resources server: vcenter.devqe.ibmc.devcluster.openshift.com status: conditions: - lastTransitionTime: "2024-04-25T02:59:37Z" message: "" observedGeneration: 1 reason: AsExpected status: "False" type: Error - lastTransitionTime: "2024-04-25T03:03:45Z" message: "" observedGeneration: 1 reason: AllReplicasAvailable status: "True" type: Available - lastTransitionTime: "2024-04-25T03:03:45Z" message: "" observedGeneration: 1 reason: AsExpected status: "False" type: Degraded - lastTransitionTime: "2024-04-25T03:01:04Z" message: "" observedGeneration: 1 reason: AllReplicasUpdated status: "False" type: Progressing observedGeneration: 1 readyReplicas: 3 replicas: 3 updatedReplicas: 3 3.Delete the controlplanemachineset, it will recreate a new one, but those three fields that had values before are now cleared. 
liuhuali@Lius-MacBook-Pro huali-test % oc delete controlplanemachineset cluster controlplanemachineset.machine.openshift.io "cluster" deleted liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset NAME DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE cluster 3 3 3 3 Inactive 6s liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset cluster -oyaml apiVersion: machine.openshift.io/v1 kind: ControlPlaneMachineSet metadata: creationTimestamp: "2024-04-25T03:45:51Z" finalizers: - controlplanemachineset.machine.openshift.io generation: 1 name: cluster namespace: openshift-machine-api resourceVersion: "46172" uid: 45d966c9-ec95-42e1-b8b0-c4945ea58566 spec: replicas: 3 selector: matchLabels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl machine.openshift.io/cluster-api-machine-role: master machine.openshift.io/cluster-api-machine-type: master state: Inactive strategy: type: RollingUpdate template: machineType: machines_v1beta1_machine_openshift_io machines_v1beta1_machine_openshift_io: failureDomains: platform: VSphere vsphere: - name: generated-failure-domain metadata: labels: machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl machine.openshift.io/cluster-api-machine-role: master machine.openshift.io/cluster-api-machine-type: master spec: lifecycleHooks: {} metadata: {} providerSpec: value: apiVersion: machine.openshift.io/v1beta1 credentialsSecret: name: vsphere-cloud-credentials diskGiB: 120 kind: VSphereMachineProviderSpec memoryMiB: 16384 metadata: creationTimestamp: null network: devices: null numCPUs: 4 numCoresPerSocket: 4 snapshot: "" template: "" userDataSecret: name: master-user-data workspace: {} status: conditions: - lastTransitionTime: "2024-04-25T03:45:51Z" message: "" observedGeneration: 1 reason: AsExpected status: "False" type: Error - lastTransitionTime: "2024-04-25T03:45:51Z" message: "" observedGeneration: 1 reason: AllReplicasAvailable status: "True" type: Available - lastTransitionTime: "2024-04-25T03:45:51Z" message: "" observedGeneration: 1 reason: AsExpected status: "False" type: Degraded - lastTransitionTime: "2024-04-25T03:45:51Z" message: "" observedGeneration: 1 reason: AllReplicasUpdated status: "False" type: Progressing observedGeneration: 1 readyReplicas: 3 replicas: 3 updatedReplicas: 3 4.I active the controlplanemachineset and it does not trigger an update, I continue to add these field values back and it does not trigger an update, I continue to edit these fields to add a second network device and it still does not trigger an update. network: devices: - networkName: devqe-segment-221 - networkName: devqe-segment-222 By the way, I can create worker machines with other network device or two network devices. huliu-vs425c-f5tfl-worker-0a-ldbkh Running 81m huliu-vs425c-f5tfl-worker-0aa-r8q4d Running 70m
Actual results:
network.devices, template and workspace will be cleared when deleting the controlplanemachineset, updating these fields will not trigger an update
Expected results:
The field values should not be changed when deleting the controlplanemachineset, and updating these fields should trigger an update. Alternatively, if these fields are not meant to be modified, then modifying them on the controlplanemachineset should have no effect; the current inconsistency is confusing.
Additional info:
Must gather: https://drive.google.com/file/d/1mHR31m8gaNohVMSFqYovkkY__t8-E30s/view?usp=sharing
Please review the following PR: https://github.com/openshift/cloud-provider-gcp/pull/67
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/images/pull/194
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In a vSphere cluster, change clustercsidrivers.managementState from "Managed" to "Removed"; the VSphereProblemDetector check becomes less frequent (once in 24 hours), see log: Scheduled the next check in 24h0m0. That is as expected. Then change clustercsidrivers.managementState back from "Removed" to "Managed"; the VSphereProblemDetector check frequency is still 24 hours.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-01-175607
How reproducible:
Always
Steps to Reproduce:
See Description
Actual results:
The VSphereProblemDetector check frequency is once in 24 hours
Expected results:
The VSphereProblemDetector check frequency should return to once per hour
Additional info:
Component Readiness has found a potential regression in the following test:
[sig-mco][OCPFeatureGate:ManagedBootImages][Serial] Should degrade on a MachineSet with an OwnerReference [apigroup:machineconfiguration.openshift.io] [Suite:openshift/conformance/serial]
A new feature went live that ensures new tests in a release have at least a 95% pass rate. This test showed up immediately with a couple of bad runs in the last 20 attempts. The failures look similar, which indicates the test probably has a problem that could be fixed.
We suspect a timeout issue: the test takes about 25s on average with a 30s timeout.
Test has a 91.67% pass rate, but 95.00% is required.
Sample (being evaluated) Release: 4.18
Start Time: 2024-10-10T00:00:00Z
End Time: 2024-10-17T23:59:59Z
Success Rate: 91.67%
Successes: 22
Failures: 2
Flakes: 0
Insufficient pass rate
Description of problem:
Starting from version 4.16, the installer no longer supports creating a cluster in AWS with the OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY=true flag enabled.
Version-Release number of selected component (if applicable):
How reproducible:
The installation procedure fails systematically when using a predefined VPC
Steps to Reproduce:
1. Follow the procedure at https://docs.openshift.com/container-platform/4.16/installing/installing_aws/ipi/installing-aws-vpc.html#installation-aws-config-yaml_installing-aws-vpc to prepare an install-config.yaml in order to install a cluster with a custom VPC 2. Run `openshift-install create cluster ...' 3. The procedure fails: `failed to create load balancer`
Actual results:
The installation procedure fails.
Expected results:
An OCP cluster to be provisioned in AWS, with public subnets only.
Additional info:
The on-prem-resolv-prepender.path unit is enabled in UPI setups when it should only run for IPI
Description of problem:
The 'Clear all filters' button is counted as part of the resource type count
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-08-19-002129
How reproducible:
Always
Steps to Reproduce:
1. navigate to Home -> Events page, choose 3 resource types, check what's shown on page 2. navigate to Home -> Search page, choose 3 resource types, check what's shown on page. Choose 4 resource types and check what's shown
Actual results:
1. It shows `1 more`, but only the 'Clear all filters' button is shown if we click the `1 more` button 2. The `1 more` button is only displayed when 4 resource types are selected; this part works as expected
Expected results:
1. The 'Clear all filters' button should not be counted as part of the resource count; the 'N more' label should reflect the correct number of resource types
Additional info:
Description of problem:
cluster-capi-operator's manifests-gen tool would generate CAPI providers transport configmaps with missing metadata details
Version-Release number of selected component (if applicable):
4.17, 4.18
How reproducible:
Not impacting payload, only a tooling bug
Description of problem:
On CI, all the OpenStack and Ansible related software is taken from pip and ansible-galaxy instead of the OS repositories.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Cluster's global address "<infra id>-apiserver" not deleted during "destroy cluster"
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-multi-2024-08-15-212448
How reproducible:
Always
Steps to Reproduce:
1. "create install-config", then optionally insert interested settings (see [1]) 2. "create cluster", and make sure the cluster turns healthy finally (see [2]) 3. check the cluster's addresses on GCP (see [3]) 4. "destroy cluster", and make sure everything of the cluster getting deleted (see [4])
Actual results:
The global address "<infra id>-apiserver" is not deleted during "destroy cluster".
Expected results:
Everything belonging to the cluster should get deleted during "destroy cluster".
Additional info:
FYI we had a 4.16 bug once, see https://issues.redhat.com/browse/OCPBUGS-32306
Description of problem:
It is difficult to decide which component this bug should be reported against. The description is the following. Today we can install Red Hat operators either into one specific namespace or into all namespaces, which installs the operator in the "openshift-operators" namespace. If such an operator creates a ServiceMonitor that should be scraped by platform Prometheus, that ServiceMonitor has token authentication and security configured in its definition. But if the operator is installed in the "openshift-operators" namespace, it is user workload monitoring that tries to scrape it, since that namespace does not have the label required for platform monitoring to scrape it, and we don't want to add that label because community operators can also be installed there. The result is that user workload monitoring scrapes this namespace and the ServiceMonitors are skipped, since they are configured with security aimed at platform monitoring and UWM cannot handle this. A possible workaround is to run: oc label namespace openshift-operators openshift.io/user-monitoring=false, losing functionality since some Red Hat operators will not be monitored if installed in openshift-operators.
Version-Release number of selected component (if applicable):
4.16
Please review the following PR: https://github.com/openshift/baremetal-operator/pull/376
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The installation of compact and HA clusters is failing in the vSphere environment. During the cluster setup, two master nodes were observed to be in a "Not Ready" state, and the rendezvous host failed to join the cluster.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-25-131159
How reproducible:
100%
Actual results:
level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected level=info msg=Use the following commands to gather logs from the cluster level=info msg=openshift-install gather bootstrap --help level=error msg=Bootstrap failed to complete: : bootstrap process timed out: context deadline exceeded ERROR: Bootstrap failed. Aborting execution.
Expected results:
Installation should be successful.
Additional info:
Description of problem:
Sometimes the cluster-capi-operator pod gets stuck in CrashLoopBackOff on OSP
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-01-213905
How reproducible:
Sometimes
Steps to Reproduce:
1.Create an osp cluster with TechPreviewNoUpgrade 2.Check cluster-capi-operator pod 3.
Actual results:
cluster-capi-operator pod in CrashLoopBackOff status $ oc get po cluster-capi-operator-74dfcfcb9d-7gk98 0/1 CrashLoopBackOff 6 (2m54s ago) 41m $ oc get po cluster-capi-operator-74dfcfcb9d-7gk98 1/1 Running 7 (7m52s ago) 46m $ oc get po cluster-capi-operator-74dfcfcb9d-7gk98 0/1 CrashLoopBackOff 7 (2m24s ago) 50m E0806 03:44:00.584669 1 kind.go:66] "kind must be registered to the Scheme" err="no kind is registered for the type v1alpha7.OpenStackCluster in scheme \"github.com/openshift/cluster-capi-operator/cmd/cluster-capi-operator/main.go:86\"" logger="controller-runtime.source.EventHandler" E0806 03:44:00.685539 1 controller.go:203] "Could not wait for Cache to sync" err="failed to wait for clusteroperator caches to sync: timed out waiting for cache to be synced for Kind *v1alpha7.OpenStackCluster" controller="clusteroperator" controllerGroup="config.openshift.io" controllerKind="ClusterOperator" I0806 03:44:00.685610 1 internal.go:516] "Stopping and waiting for non leader election runnables" I0806 03:44:00.685620 1 internal.go:520] "Stopping and waiting for leader election runnables" I0806 03:44:00.685646 1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="secret" controllerGroup="" controllerKind="Secret" I0806 03:44:00.685706 1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" I0806 03:44:00.685712 1 controller.go:242] "All workers finished" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" I0806 03:44:00.685717 1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="secret" controllerGroup="" controllerKind="Secret" I0806 03:44:00.685722 1 controller.go:242] "All workers finished" controller="secret" controllerGroup="" controllerKind="Secret" I0806 03:44:00.685718 1 controller.go:242] "All workers finished" controller="secret" controllerGroup="" controllerKind="Secret" I0806 03:44:00.685720 1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="clusteroperator" controllerGroup="config.openshift.io" controllerKind="ClusterOperator" I0806 03:44:00.685823 1 recorder_in_memory.go:80] &Event{ObjectMeta:{dummy.17e906d425f7b2e1 dummy 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:CustomResourceDefinitionUpdateFailed,Message:Failed to update CustomResourceDefinition.apiextensions.k8s.io/openstackclusters.infrastructure.cluster.x-k8s.io: Put "https://172.30.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/openstackclusters.infrastructure.cluster.x-k8s.io": context canceled,Source:EventSource{Component:cluster-capi-operator-capi-installer-apply-client,Host:,},FirstTimestamp:2024-08-06 03:44:00.685748961 +0000 UTC m=+302.946052179,LastTimestamp:2024-08-06 03:44:00.685748961 +0000 UTC m=+302.946052179,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,} I0806 03:44:00.719743 1 capi_installer_controller.go:309] "CAPI Installer Controller is Degraded" logger="CapiInstallerController" controller="clusteroperator" controllerGroup="config.openshift.io" controllerKind="ClusterOperator" ClusterOperator="cluster-api" namespace="" name="cluster-api" reconcileID="6fa96361-4dc2-4865-b1b3-f92378c002cc" E0806 
03:44:00.719942 1 controller.go:329] "Reconciler error" err="error during reconcile: failed to set conditions for CAPI Installer controller: failed to sync status: failed to update cluster operator status: client rate limiter Wait returned an error: context canceled" controller="clusteroperator" controllerGroup="config.openshift.io" controllerKind="ClusterOperator" ClusterOperator="cluster-api" namespace="" name="cluster-api" reconcileID="6fa96361-4dc2-4865-b1b3-f92378c002cc"
Expected results:
cluster-capi-operator pod is always Running
Additional info:
Please review the following PR: https://github.com/openshift/bond-cni/pull/65
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When running /bin/bridge and trying to access localhost:9000 while the frontend is still starting, the bridge crashes as it cannot find frontend/public/dist/index.html
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
Always
Steps to Reproduce:
1. Build the OpenShift Console backend and run /bin/bridge 2. Try to access localhost:9000 while it is still starting
Actual results:
Bridge crash
Expected results:
No crash, either return HTTP 404/500 to the browser or serve a fallback page
Additional info:
This is just a minor dev annoyance
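A minimal sketch of the expected graceful behavior: serve a temporary 503 while dist/index.html does not exist yet instead of crashing. This is illustrative only, not the actual bridge code; the path is taken from the report above:
package main
import (
	"net/http"
	"os"
)
const indexPath = "frontend/public/dist/index.html"
func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if _, err := os.Stat(indexPath); err != nil {
			// The frontend build has not produced index.html yet:
			// answer with 503 instead of crashing the process.
			http.Error(w, "console frontend is still building, retry shortly", http.StatusServiceUnavailable)
			return
		}
		http.ServeFile(w, r, indexPath)
	})
	http.ListenAndServe(":9000", nil)
}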
Description of problem:
When a user tries to create a Re-encrypt route, there is no place to upload the 'Destination CA certificate'
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-09-120947
How reproducible:
Always
Steps to Reproduce:
1. create Secure route, TLS termination: Re-encrypt 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Information on the Lightspeed modal is not as clear as it could be for users to understand what to do next. Users should also have a very clear way to disable it, and those options are not obvious.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
ci/prow/security is failing on google.golang.org/grpc/metadata
Version-Release number of selected component (if applicable):
4.15
How reproducible:
always
Steps to Reproduce:
1. Run the ci/prow/security job on a 4.15 PR 2. 3.
Actual results:
Medium severity vulnerability found in google.golang.org/grpc/metadata
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/apiserver-network-proxy/pull/62
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
The MAC mapping validation added in MGMT-17618 caused a regression on ABI.
To avoid this regression, the validation should be relaxed to validate only non-predictable interface names.
We should still make sure at least one MAC address exists in the MAC map, to be able to detect the relevant host.
slack discussion.
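A minimal sketch of the relaxed check described above, assuming legacy ethN-style names are the "non-predictable" ones that still require an explicit MAC mapping; the heuristic, map shape, and function are assumptions, not the assisted-installer code:
package validation
import (
	"fmt"
	"regexp"
)
// Legacy kernel-assigned names (eth0, eth1, ...) are not predictable across
// boots, so only they are required to appear in the MAC-interface map.
var legacyName = regexp.MustCompile(`^eth\d+$`)
// validateMACMap requires at least one mapping (to identify the host) and a
// mapping for every non-predictable interface name; predictable names are
// accepted without one.
func validateMACMap(macByInterface map[string]string, interfaceNames []string) error {
	if len(macByInterface) == 0 {
		return fmt.Errorf("at least one mac-interface mapping is required to detect the relevant host")
	}
	for _, name := range interfaceNames {
		if legacyName.MatchString(name) {
			if _, ok := macByInterface[name]; !ok {
				return fmt.Errorf("mac-interface mapping for interface %s is missing", name)
			}
		}
	}
	return nil
}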
How reproducible:
100%
Steps to reproduce:
Actual results:
error 'mac-interface mapping for interface xxxx is missing'
Expected results:
Installation succeeds and the interfaces are correctly configured.
Description of problem:
When configuring an OpenID IDP that can only be accessed via the data plane, if the hostname of the provider can only be resolved by the data plane, reconciliation of the IDP fails.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. Configure an OpenID idp on a HostedCluster with a URL that points to a service in the dataplane (like https://keycloak.keycloak.svc)
Actual results:
The oauth server fails to be reconciled
Expected results:
The oauth server reconciles and functions properly
Additional info:
Follow up to OCPBUGS-37753
kube rebase broke TechPreview hypershift on 4.18 with resource.k8s.io group going to v1alpha3
KAS fails to start with
E1010 19:05:25.175819 1 run.go:72] "command failed" err="group version resource.k8s.io/v1alpha2 that has not been registered"
KASO addressed it here
https://github.com/openshift/cluster-kube-apiserver-operator/pull/1731
Description of problem:
There are two enhancements we could have for cns-migration:
1. We can print an error message when the target datastore is not found; currently it exits as if nothing happened:
sh-5.1$ /bin/cns-migration -kubeconfig /tmp/kubeconfig -source vsanDatastore -destination invalid -volume-file /tmp/pv.txt KubeConfig is: /tmp/kubeconfig I0806 07:59:34.884908 131 logger.go:28] logging successfully to vcenter I0806 07:59:36.078911 131 logger.go:28] ----------- Migration Summary ------------ I0806 07:59:36.078944 131 logger.go:28] Migrated 0 volumes I0806 07:59:36.078960 131 logger.go:28] Failed to migrate 0 volumes I0806 07:59:36.078968 131 logger.go:28] Volumes not found 0
See the source datastore checking:
sh-5.1$ /bin/cns-migration -kubeconfig /tmp/kubeconfig -source invalid -destination Datastorenfsdevqe -volume-file /tmp/pv.txt KubeConfig is: /tmp/kubeconfig I0806 08:02:08.719657 138 logger.go:28] logging successfully to vcenter E0806 08:02:08.749709 138 logger.go:10] error listing cns volumes: error finding datastore invalid in datacenter DEVQEdatacenter
2. If the volume-file has an invalid PV name that is not found, for example at the beginning of the list, the tool exits immediately and all the remaining PVs are skipped; it should continue checking the other PVs.
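A minimal sketch of both enhancements, with hypothetical hooks standing in for the real vCenter calls (this is not the actual cns-migration code):
package migration
import "log"
// migrateVolumes validates the destination datastore up front and then keeps
// going when an individual PV cannot be migrated, instead of aborting the run.
func migrateVolumes(findDatastore func(name string) error, migrateVolume func(pv string) error, destination string, pvs []string) {
	// Enhancement 1: report a clear error when the destination datastore is
	// missing, mirroring the existing source datastore check.
	if err := findDatastore(destination); err != nil {
		log.Fatalf("error finding destination datastore %s: %v", destination, err)
	}
	migrated, skipped := 0, 0
	for _, pv := range pvs {
		// Enhancement 2: a PV that cannot be migrated is logged and skipped,
		// not treated as fatal for the whole run.
		if err := migrateVolume(pv); err != nil {
			log.Printf("skipping volume %s: %v", pv, err)
			skipped++
			continue
		}
		migrated++
	}
	log.Printf("Migrated %d volumes, skipped %d volumes", migrated, skipped)
}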
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Always
Steps to Reproduce:
See Description
Description of problem:
A normal user (project admin) visiting the Routes Metrics tab gets only an empty page
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-21-014704
How reproducible:
Always
Steps to Reproduce:
1. normal user has a project and a route 2. visit Networking -> Routes -> Metrics tab 3.
Actual results:
empty page returned
Expected results:
- We may not want to expose the Metrics tab for normal users (compared with the 4.16 behavior) - If the Metrics tab is supposed to be exposed to normal users, then we should return the correct content instead of an empty page
Additional info:
Description of problem:
The prometheus operator fails to reconcile when proxy settings like no_proxy are set in the Alertmanager configuration secret.
Version-Release number of selected component (if applicable):
4.15.z and later
How reproducible:
Always when AlertmanagerConfig is enabled
Steps to Reproduce:
1. Enable UWM with AlertmanagerConfig enableUserWorkload: true alertmanagerMain: enableUserAlertmanagerConfig: true 2. Edit the "alertmanager.yaml" key in the alertmanager-main secret (see attached configuration file) 3. Wait for a couple of minutes.
Actual results:
Monitoring ClusterOperator goes Degraded=True.
Expected results:
No error
Additional info:
The Prometheus operator logs show that it doesn't understand the proxy_from_environment field. The newer proxy fields are only supported since Alertmanager v0.26.0, which corresponds to OCP 4.15 and above.
Please review the following PR: https://github.com/openshift/cloud-credential-operator/pull/753
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
If a release does not contain the kubevirt coreos container image and the kubeVirtContainer flag is set to true, oc-mirror fails to continue.
Version-Release number of selected component (if applicable):
[fedora@preserve-fedora-yinzhou test]$ ./oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.2.0-alpha.1-280-g8a42369", GitCommit:"8a423691", GitTreeState:"clean", BuildDate:"2024-08-03T08:02:06Z", GoVersion:"go1.22.4", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. use imageSetConfig.yaml as shown below 2. Run command oc-mirror -c clid-179.yaml file://clid-179 --v2 3.
Actual results:
fedora@preserve-fedora-yinzhou test]$ ./oc-mirror -c /tmp/clid-99.yaml file://CLID-412 --v2 2024/08/03 09:24:38 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready. 2024/08/03 09:24:38 [INFO] : 👋 Hello, welcome to oc-mirror 2024/08/03 09:24:38 [INFO] : ⚙️ setting up the environment for you... 2024/08/03 09:24:38 [INFO] : 🔀 workflow mode: mirrorToDisk 2024/08/03 09:24:38 [INFO] : 🕵️ going to discover the necessary images... 2024/08/03 09:24:38 [INFO] : 🔍 collecting release images... 2024/08/03 09:24:44 [INFO] : kubeVirtContainer set to true [ including : ] 2024/08/03 09:24:44 [ERROR] : unknown image : reference name is empty 2024/08/03 09:24:44 [INFO] : 👋 Goodbye, thank you for using oc-mirror 2024/08/03 09:24:44 [ERROR] : unknown image : reference name is empty
Expected results:
If the kubevirt coreos container does not exist in a release, oc-mirror should skip it and continue mirroring other operators, but should not fail.
Additional info:
[fedora@preserve-fedora-yinzhou test]$ cat /tmp/clid-99.yaml apiVersion: mirror.openshift.io/v2alpha1 kind: ImageSetConfiguration mirror: platform: channels: - name: stable-4.12 minVersion: 4.12.61 maxVersion: 4.12.61 kubeVirtContainer: true operators: - catalog: oci:///test/ibm-catalog - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15 packages: - name: devworkspace-operator minVersion: "0.26.0" - name: nfd maxVersion: "4.15.0-202402210006" - name: cluster-logging minVersion: 5.8.3 maxVersion: 5.8.4 - name: quay-bridge-operator channels: - name: stable-3.9 minVersion: 3.9.5 - name: quay-operator channels: - name: stable-3.9 maxVersion: "3.9.1" - name: odf-operator channels: - name: stable-4.14 minVersion: "4.14.5-rhodf" maxVersion: "4.14.5-rhodf" additionalImages: - name: registry.redhat.io/ubi8/ubi:latest - name: quay.io/openshifttest/hello-openshift@sha256:61b8f5e1a3b5dbd9e2c35fd448dc5106337d7a299873dd3a6f0cd8d4891ecc27 - name: quay.io/openshifttest/scratch@sha256:b045c6ba28db13704c5cbf51aff3935dbed9a692d508603cc80591d89ab26308
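For the expected skip-and-continue behavior described above, a minimal sketch; the function and its surroundings are hypothetical and do not reflect oc-mirror's actual collector code:
package release
import "log"
// kubeVirtImages returns the kubevirt coreos container reference to mirror,
// or nothing when the release payload does not ship one, so the rest of the
// mirror run can continue instead of failing.
func kubeVirtImages(ref string) []string {
	if ref == "" {
		log.Println("kubeVirtContainer is set to true, but this release has no kubevirt coreos container; skipping")
		return nil
	}
	return []string{ref}
}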
Please review the following PR: https://github.com/openshift/cloud-provider-openstack/pull/295
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The Azure Disk CSI driver operator runs a node DaemonSet that exposes CSI driver metrics on loopback, but there is no kube-rbac-proxy in front of it and no Service / ServiceMonitor for it. Therefore OCP doesn't collect these metrics.
Description of problem:
In 4.16 we can collapse and expand the "Getting started resources" section in the Administrator perspective. But in earlier versions we could directly remove this tab with [X], which is not possible in 4.16. Only expand and collapse are available; the option to remove the tab, as in previous versions, is missing.
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Go to Web console. Click on the "Getting started resources." 2. Then you can expand and collapse this tab. 3. But there is no option to directly remove this tab.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/multus-admission-controller/pull/90
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
build openshift/ovn-kubernetes#2291
How reproducible:
Always
Steps to Reproduce:
1. Create a ns ns1
2. Create a CRD in ns1
% oc get UserDefinedNetwork -n ns1 -o yaml apiVersion: v1 items: - apiVersion: k8s.ovn.org/v1 kind: UserDefinedNetwork metadata: creationTimestamp: "2024-09-09T08:34:49Z" finalizers: - k8s.ovn.org/user-defined-network-protection generation: 1 name: udn-network namespace: ns1 resourceVersion: "73943" uid: c923b0b1-05b4-4889-b076-c6a28f7353de spec: layer3: role: Primary subnets: - cidr: 10.200.0.0/16 hostSubnet: 24 topology: Layer3 status: conditions: - lastTransitionTime: "2024-09-09T08:34:49Z" message: NetworkAttachmentDefinition has been created reason: NetworkAttachmentDefinitionReady status: "True" type: NetworkReady kind: List metadata: resourceVersion: ""
3. Create a service and pods in ns1
% oc get svc -n ns1 NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE test-service ClusterIP 172.30.16.88 <none> 27017/TCP 5m32s % oc get pods -n ns1 NAME READY STATUS RESTARTS AGE test-rc-f54tl 1/1 Running 0 5m4s test-rc-lhnd7 1/1 Running 0 5m4s % oc exec -n ns1 test-rc-f54tl -- ip a 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0@if41: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1360 qdisc noqueue state UP group default link/ether 0a:58:0a:80:02:1b brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.128.2.27/23 brd 10.128.3.255 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fe80:21b/64 scope link valid_lft forever preferred_lft forever 3: ovn-udn1@if42: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1360 qdisc noqueue state UP group default link/ether 0a:58:0a:c8:03:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.200.3.3/24 brd 10.200.3.255 scope global ovn-udn1 valid_lft forever preferred_lft forever inet6 fe80::858:aff:fec8:303/64 scope link valid_lft forever preferred_lft forever 4. Restart ovn pods {code:java} % oc delete pods --all -n openshift-ovn-kubernetes pod "ovnkube-control-plane-76fd6ddbf4-j69j8" deleted pod "ovnkube-control-plane-76fd6ddbf4-vnr2m" deleted pod "ovnkube-node-5pd5w" deleted pod "ovnkube-node-5r9mg" deleted pod "ovnkube-node-6bdtx" deleted pod "ovnkube-node-6v5d7" deleted pod "ovnkube-node-8pmpq" deleted pod "ovnkube-node-cffld" deleted
Actual results: {code:java} % oc get pods -n openshift-ovn-kubernetes NAME READY STATUS RESTARTS AGE ovnkube-control-plane-76fd6ddbf4-9cklv 2/2 Running 0 9m22s ovnkube-control-plane-76fd6ddbf4-gkmlg 2/2 Running 0 9m22s ovnkube-node-bztn5 7/8 CrashLoopBackOff 5 (21s ago) 9m19s ovnkube-node-qhjsw 7/8 Error 5 (2m45s ago) 9m18s ovnkube-node-t5f8p 7/8 Error 5 (2m32s ago) 9m20s ovnkube-node-t8kpp 7/8 Error 5 (2m34s ago) 9m19s ovnkube-node-whbvx 7/8 Error 5 (2m35s ago) 9m20s ovnkube-node-xlzlh 7/8 CrashLoopBackOff 5 (14s ago) 9m18s ovnkube-controller: Container ID: cri-o://977dd8c17320695b1098ea54996bfad69c14dc4219a91dfd4354c818ea433cac Image: registry.build05.ci.openshift.org/ci-ln-y1ypd82/stable@sha256:3110151b89e767644c01c8ce2cf3fec4f26f6d6e011262d0988c1d915d63355f Image ID: registry.build05.ci.openshift.org/ci-ln-y1ypd82/stable@sha256:3110151b89e767644c01c8ce2cf3fec4f26f6d6e011262d0988c1d915d63355f Port: 29105/TCP Host Port: 29105/TCP Command: /bin/bash -c set -xe . /ovnkube-lib/ovnkube-lib.sh || exit 1 start-ovnkube-node ${OVN_KUBE_LOG_LEVEL} 29103 29105 State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: Error Message: :205] Sending *v1.Node event handler 7 for removal I0909 08:45:58.537155 170668 factory.go:542] Stopping watch factory I0909 08:45:58.537167 170668 handler.go:219] Removed *v1.Node event handler 7 I0909 08:45:58.537185 170668 handler.go:219] Removed *v1.Namespace event handler 1 I0909 08:45:58.537198 170668 handler.go:219] Removed *v1.Namespace event handler 5 I0909 08:45:58.537206 170668 handler.go:219] Removed *v1.EgressIP event handler 8 I0909 08:45:58.537207 170668 handler.go:219] Removed *v1.EgressFirewall event handler 9 I0909 08:45:58.537187 170668 handler.go:219] Removed *v1.Node event handler 10 I0909 08:45:58.537219 170668 handler.go:219] Removed *v1.Node event handler 2 I0909 08:45:58.538642 170668 network_attach_def_controller.go:126] [network-controller-manager NAD controller]: shutting down I0909 08:45:58.538703 170668 secondary_layer3_network_controller.go:433] Stop secondary layer3 network controller of network ns1.udn-network I0909 08:45:58.538742 170668 services_controller.go:243] Shutting down controller ovn-lb-controller for network=ns1.udn-network I0909 08:45:58.538767 170668 obj_retry.go:432] Stop channel got triggered: will stop retrying failed objects of type *v1.Node I0909 08:45:58.538754 170668 obj_retry.go:432] Stop channel got triggered: will stop retrying failed objects of type *v1.Pod E0909 08:45:58.5 Exit Code: 1 Started: Mon, 09 Sep 2024 16:44:57 +0800 Finished: Mon, 09 Sep 2024 16:45:58 +0800 Ready: False Restart Count: 5 Requests: cpu: 10m memory: 600Mi
Expected results:
ovn pods should not crash
Additional info:
Description of problem:
Deploy a 4.18 cluster on a PowerVS zone where LoadBalancers are slow to create. We are called with InfraReady. We then create DNS records for the LBs. However, only the public LB exists. So the cluster fails to deploy. The internal LB does eventually complete.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Occasionally, on a zone with slow LB creation.
Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/125
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When a kubevirt-csi pod runs on a worker node of a Guest cluster, the underlying PVC from the infra/host cluster is attached to the Virtual Machine that is the worker node of the Guest cluster. That works well, but only until the VM is rebooted. After the VM is power cycled for some reason, the volumeattachment on the Guest cluster is still there and shows as attached. [guest cluster]# oc get volumeattachment NAME ATTACHER PV NODE ATTACHED AGE csi-976b6b166ef7ea378de9a350c9ef427c23e8c072dc6e76a392241d273c3effdb csi.kubevirt.io pvc-4e375fa9-c1ad-4fa6-a254-03d4c3b1111b hostedcluster2-rlq9m-z2x88 true 39m But the VM does not have the hotplugged disk anymore (its not a persistent hotplug). Its not attached at all. It only has its rhcos disk and cloud-init after the reboot: [host cluster]# oc get vmi -n clusters-hostedcluster2 hostedcluster2-rlq9m-z2x88 -o yaml | yq '.status.volumeStatus' - name: cloudinitvolume size: 1048576 target: vdb - name: rhcos persistentVolumeClaimInfo: accessModes: - ReadWriteOnce capacity: storage: 32Gi claimName: hostedcluster2-rlq9m-z2x88-rhcos filesystemOverhead: "0" requests: storage: "34359738368" volumeMode: Block target: vda The result is all workloads with PVCs now fail to start, as the hotplug is not triggered again. The worker node VM cannot find the disk: 26s Warning FailedMount pod/mypod MountVolume.MountDevice failed for volume "pvc-4e375fa9-c1ad-4fa6-a254-03d4c3b1111b" : rpc error: code = Unknown desc = couldn't find device by serial id So workload pods cannot start.
Version-Release number of selected component (if applicable):
OCP 4.17.3 CNV 4.17.0 MCE 2.7.0
How reproducible:
Always
Steps to Reproduce:
1. Have a pod running with a PV from kubevirt-csi in the guest cluster 2. Shutdown the Worker VM running the Pod and start it again
Actual results:
Workloads fail to start after VM reboot
Expected results:
Hotplug the disk again and let workloads start
Additional info:
Description of problem:
When running the 4.17 installer QE full function test, the following amd64 instance types were detected and tested successfully, so append them to the installer doc[1]: * standardBasv2Family * StandardNGADSV620v1Family * standardMDSHighMemoryv3Family * standardMIDSHighMemoryv3Family * standardMISHighMemoryv3Family * standardMSHighMemoryv3Family [1] https://github.com/openshift/installer/blob/master/docs/user/azure/tested_instance_types_x86_64.md
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
To summarize, when we meet the following three conditions, baremetal nodes cannot boot due to a hostname resolution failure.
According to the following update, the provisioning service checks the BMC address scheme on the target and provides a matching URL for the installation media:
When we create a BMH resource, spec.bmc.address will be an URL of the BMC.
However, when we put a hostname instead of an IP address in the spec.bmc.address like the following example,
<Example BMH definition>
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
:
spec:
bmc:
address: redfish://bmc.hostname.example.com:443/redfish/v1/Systems/1
we observe the following error.
$ oc logs -n openshift-machine-api metal3-baremetal-operator-6779dff98c-9djz7 {"level":"info","ts":1721660334.9622784,"logger":"provisioner.ironic","msg":"Failed to look up the IP address for BMC hostname","host":"myenv~mybmh","hostname":"redfish://bmc.hostname.example.com:443/redfish/v1/Systems/1"}
Because of name resolution failure, baremetal-operator cannot determine if the BMC is IPv4 or IPv6.
Therefore, the IP scheme falls back to IPv4 and ISO images are exposed via an IPv4 address even if the BMC is IPv6 single stack.
In this case, the IPv6 BMC cannot access the ISO image over IPv4; we observe error messages like the following example, and the baremetal host cannot boot from the ISO.
<Error message on iDRAC> Unable to locate the ISO or IMG image file or folder in the network share location because the file or folder path or the user credentials entered are incorrect
The issue is caused by the following implementation.
The following line passes `p.bmcAddress`, which is the whole URL; that is why the name resolution fails.
I think we should pass `parsedURL.Hostname()` instead, which is the hostname part of the URL.
https://github.com/metal3-io/baremetal-operator/blob/main/pkg/provisioner/ironic/ironic.go#L657
ips, err := net.LookupIP(p.bmcAddress)
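A minimal sketch of the suggested fix: parse the BMC address first and resolve only its host portion. This is illustrative, not the exact baremetal-operator patch:
package ironic
import (
	"net"
	"net/url"
)
// bmcIsIPv6 resolves only the hostname part of the BMC address, so a URL such
// as redfish://bmc.hostname.example.com:443/redfish/v1/Systems/1 no longer
// causes the lookup itself to fail.
func bmcIsIPv6(bmcAddress string) (bool, error) {
	parsedURL, err := url.Parse(bmcAddress)
	if err != nil {
		return false, err
	}
	ips, err := net.LookupIP(parsedURL.Hostname())
	if err != nil {
		return false, err
	}
	for _, ip := range ips {
		if ip.To4() == nil {
			return true, nil
		}
	}
	return false, nil
}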
Version-Release number of selected component (if applicable):
We observe this issue on OCP 4.14 and 4.15. But I think this issue occurs even in the latest releases.
How reproducible:
Steps to Reproduce:
Actual results:
Name resolution fails and the baremetal host cannot boot
Expected results:
Name resolution works and the baremetal host can boot
Additional info:
Description of problem:
HyperShift doesn't allow configuring the Failure Domains for node pools, which would help place machines into the desired availability zone.
Please review the following PR: https://github.com/openshift/machine-api-provider-gcp/pull/91
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-machine-api-provider-gcp-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Description of problem:
Creating and destroying transit gateways (TG) during CI testing is costing an abnormal amount of money. Since the monetary cost of creating a TG is high, provide support for a user-created TG when creating an OpenShift cluster.
Version-Release number of selected component (if applicable):
all
How reproducible:
always
Description of problem:
https://search.dptools.openshift.org/?search=Helm+Release&maxAge=168h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
For conditional updates, status.conditionalUpdates.release is also a Release type (https://github.com/openshift/console/blob/master/frontend/public/module/k8s/types.ts#L812-L815), which will also trigger an Admission Webhook Warning
Version-Release number of selected component (if applicable):
4.18.0-ec.2
How reproducible:
Always
Steps to Reproduce:
1.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/azure-file-csi-driver/pull/80
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When creating an OCP cluster on AWS and selecting "publish: Internal," the ingress operator may create external LB mappings to external subnets. This can occur if public subnets were specified in the install-config during installation. https://docs.openshift.com/container-platform/4.15/installing/installing_aws/installing-aws-private.html#private-clusters-about-aws_installing-aws-private A configuration validation should be added to the installer.
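A minimal sketch of the kind of install-config validation suggested above; the types and function are hypothetical placeholders, not the installer's actual API:
package validation
import "fmt"
// subnet is a hypothetical, simplified description of a discovered subnet.
type subnet struct {
	ID     string
	Public bool
}
// validateInternalPublish rejects configurations that declare publish:
// Internal while also supplying public subnets, which would otherwise let the
// ingress operator map load balancers onto them.
func validateInternalPublish(publish string, subnets []subnet) error {
	if publish != "Internal" {
		return nil
	}
	for _, s := range subnets {
		if s.Public {
			return fmt.Errorf("publish is Internal but subnet %s is public; remove public subnets from the install-config", s.ID)
		}
	}
	return nil
}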
Version-Release number of selected component (if applicable):
4.14+ probably older versions as well.
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Slack thread: https://redhat-internal.slack.com/archives/C68TNFWA2/p1714986876688959
Description of problem:
https://issues.redhat.com//browse/OCPBUGS-31919 partially fixed an issue consuming the test image from a custom registry. The fix is about consuming in the test binary the pull-secret of the cluster under tests. To complete it we have to do the same trusting custom CA as the cluster under test. Without that, if the test image is exposed by a registry where the TLS cert is signed by a custom CA, the same tests will fail as for: { fail [github.com/openshift/origin/test/extended/operators/certs.go:120]: Unexpected error: <*errors.errorString | 0xc0023105c0>: unable to determine openshift-tests image oc wrapper with cluster ps: Error running /usr/bin/oc adm release info virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7 --image-for=tests --registry-config /tmp/image-pull-secret2435751342: StdOut> error: unable to read image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority StdErr> error: unable to read image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority exit status 1 { s: "unable to determine openshift-tests image oc wrapper with cluster ps: Error running /usr/bin/oc adm release info virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7 --image-for=tests --registry-config /tmp/image-pull-secret2435751342:\nStdOut>\nerror: unable to read image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7: Get \"https://virthost.ostest.test.metalkube.org:5000/v2/\": tls: failed to verify certificate: x509: certificate signed by unknown authority\nStdErr>\nerror: unable to read image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7: Get \"https://virthost.ostest.test.metalkube.org:5000/v2/\": tls: failed to verify certificate: x509: certificate signed by unknown authority\nexit status 1\n", } occurred Ginkgo exit error 1: exit with code 1}
Version-Release number of selected component (if applicable):
release-4.16, release-4.17 and master branches in origin.
How reproducible:
Always
Steps to Reproduce:
1. try to run the test suite against a cluster where the OCP release (and the test image) comes from a private registry with a cert signed by a custom CA 2. 3.
Actual results:
3 failing tests: : [sig-arch][Late][Jira:"kube-apiserver"] collect certificate data [Suite:openshift/conformance/parallel] expand_more : [sig-arch][Late][Jira:"kube-apiserver"] all registered tls artifacts must have no metadata violation regressions [Suite:openshift/conformance/parallel] expand_more : [sig-arch][Late][Jira:"kube-apiserver"] all tls artifacts must be registered [Suite:openshift/conformance/parallel] expand_more
Expected results:
No failing tests
Additional info:
OCPBUGS-31919 partially fixed it by having the test binary download the pull secret from the cluster under test. But in order to have it fully working, we also have to trust the custom CAs trusted by the cluster under test.
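A minimal sketch of how the cluster's additional trusted CA bundle could be extracted so the test binary can trust it too (the configmap name is whatever the cluster's Proxy object references; the output path is illustrative):
$ CA_CM=$(oc get proxy/cluster -o jsonpath='{.spec.trustedCA.name}')
$ oc get configmap "$CA_CM" -n openshift-config -o jsonpath='{.data.ca-bundle\.crt}' > /tmp/cluster-trusted-ca.crt
# the bundle could then be added to the CA pool used when running 'oc adm release info' against the private registry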
Description of problem:
node-joiner pod does not honour the cluster-wide proxy settings
Version-Release number of selected component (if applicable):
OCP 4.16.6
How reproducible:
Always
Steps to Reproduce:
1. Configure an OpenShift cluster-wide proxy according to https://docs.openshift.com/container-platform/4.16/networking/enable-cluster-wide-proxy.html and add Red Hat URLs (quay.io et al.) to the proxy allow list.
2. Add a node to a cluster using a node-joiner pod, following https://github.com/openshift/installer/blob/master/docs/user/agent/add-node/add-nodes.md
Actual results:
Error retrieving the images on quay.io time=2024-08-22T08:39:02Z level=error msg=Release Image arch could not be found: command '[oc adm release info quay.io/openshift-release-dev/ocp-release@sha256:24ea553ce2e79fab0ff9cf2917d26433cffb3da954583921926034b9d5d309bd -o=go-template={{if and .metadata.metadata (index . "metadata" "metadata" "release.openshift.io/architecture")}}{{index . "metadata" "metadata" "release.openshift.io/architecture"}}{{else}}{{.config.architecture}}{{end}} --insecure=true --registry-config=/tmp/registry-config1164077466]' exited with non-zero exit code 1:time=2024-08-22T08:39:02Z level=error msg=error: unable to read image quay.io/openshift-release-dev/ocp-release@sha256:24ea553ce2e79fab0ff9cf2917d26433cffb3da954583921926034b9d5d309bd: Get "http://quay.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Expected results:
node-joiner is able to download the images using the proxy
Additional info:
By allowing full direct internet access, without a proxy, the node-joiner pod is able to download the image from quay.io.
So there is a strong suspicion that the HTTP timeout error above comes from the pod not being able to use the proxy.
Restricted environments where external internet access is only allowed through a proxy allow list are quite common in corporate environments.
Please consider honouring the OpenShift proxy configuration.
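As a quick check, the cluster-wide proxy settings the node-joiner pod would be expected to pick up can be read from the Proxy object; a minimal sketch:
$ oc get proxy/cluster -o jsonpath='{.status.httpProxy}{"\n"}{.status.httpsProxy}{"\n"}{.status.noProxy}{"\n"}'
# the node-joiner pod would be expected to export these as HTTP_PROXY / HTTPS_PROXY / NO_PROXY for its image pulls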
Description of problem:
Circular dependencies in the OCP Console codebase prevent migration to Webpack 5
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Always
Steps to Reproduce:
1. Enable the CHECK_CYCLES env var while building 2. Observe errors 3.
Actual results:
There are errors
Expected results:
No errors
Additional info:
During install of multi-AZ OSD GCP clusters into customer-provided GCP projects, extra control plane nodes are created by the installer. This may be limited to a few regions, and has shown up in our testing in us-west2 and asia-east2.
When the cluster is installed, the installer provisions three control plane nodes via the cluster-api:
However, the Machine manifests for master-0 and master-2 are written with the wrong AZs (master-0 in AZ *c and master-2 in AZ *a).
When the Machine controller in-cluster starts up and parses the manifests, it cannot find a VM for master-0 in AZ *c, or master-2 in *a, so it proceeds to try to create new VMs for those cases. master-1 is identified correctly, and unaffected.
This results in the cluster coming up with: three control plane Nodes, with master-0 and master-2 having no backing Machines; three control plane Machines, with only master-1 having a Node link and the other two listed in Provisioned state but with no Nodes; and 5 GCP VMs for these control plane nodes:
This happens consistently, across multiple GCP projects, so far in us-west2 and asia-east2 ONLY.
4.16.z clusters work as expected, as do clusters upgraded from 4.16.z to 4.17.z.
4.17.0-rc3 - 4.17.0-rc6 have all been identified as having this issue.
100%
I'm unsure how to replicate this in a vanilla cluster install, but via OSD:
Example:
$ ocm create cluster --provider=gcp --multi-az --ccs --secure-boot-for-shielded-vms --region asia-east2 --service-account-file ~/.config/gcloud/chcollin1-dev-acct.json --channel-group candidate --version openshift-v4.17.0-rc.3-candidate chcollin-4170rc3-gcp
Requesting a GCP install via an install-config with controlPlane.platform.gcp.zones out of order seems to reliably reproduce.
Install will fail in OSD, but a cluster will be created with multiple extra control-plane nodes, and the API server will respond on the master-1 node.
A standard 3 control-plane-node cluster is created.
We're unsure what it is about the two reported Zones or the difference between the primary OSD GCP project and customer-supplied Projects that has an effect.
The only thing we've noticed is the install-config has the order backwards for compute nodes, but not for control plane nodes:
{ "controlPlane": [ "us-west2-a", "us-west2-b", "us-west2-c" ], "compute": [ "us-west2-c", <--- inverted order. Shouldn't matter when building control-plane Machines, but maybe cross-contaminated somehow? "us-west2-b", "us-west2-a" ], "platform": { "defaultMachinePlatform": { <--- nothing about zones in here, although again, the controlPlane block should override any zones configured here "osDisk": { "DiskSizeGB": 0, "diskType": "" }, "secureBoot": "Enabled", "type": "" }, "projectID": "anishpatel", "region": "us-west2" } }
Since we see the divergence at the asset/manifest level, we should be able to reproduce with just an openshift-install create manifests, followed by grep -r zones: or something, without having to wait for an actual install attempt to come up and fail.
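A minimal sketch of that manifest-level reproduction (the directory name is arbitrary and the manifest file names are assumptions based on the installer's usual openshift/ layout):
$ openshift-install create manifests --dir ./gcp-zone-check
$ grep -r 'zone' ./gcp-zone-check/openshift/99_openshift-cluster-api_master-machines-*.yaml
# compare the zone of each master-N Machine manifest against the zones listed in install-config.yaml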
Description of problem:
azure-disk-csi-driver doesn't use registryOverrides
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Set registry override on the CPO
2. Watch that azure-disk-csi-driver continues to use the default registry
Actual results:
azure-disk-csi-driver uses default registry
Expected results:
azure-disk-csi-driver uses the mirrored registry
Additional info:
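One way to observe the behaviour is to compare the images used by the driver against the configured override; a hedged sketch (the deployment name and hosted control plane namespace are assumptions):
$ oc -n <hcp-namespace> get deployment azure-disk-csi-driver-controller \
    -o jsonpath='{range .spec.template.spec.containers[*]}{.image}{"\n"}{end}'
# with a registry override set on the CPO, these images would be expected to point at the mirror, not the default registry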
See this comment for background:
https://github.com/openshift/origin/blob/6b07170abad135bc7c5b22c78b2079ceecfc8b51/test/extended/etcd/vertical_scaling.go#L75-L86
The current vertical scaling test triggers the CPMSO to create a new machine by first deleting an existing machine. In that test we can't validate that the new machine is scaled up before the old one is removed.
Another test we could add is to first disable the CPMSO, then delete an existing machine and manually create a new one, like we did before the CPMSO existed.
That way we can validate that the scale-down does not happen before the scale-up event.
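A hedged sketch of the manual flow the proposed test would script (machine names are placeholders):
$ oc -n openshift-machine-api delete controlplanemachineset.machine.openshift.io cluster
# deleting the CPMS deactivates it; it is regenerated in an Inactive state
$ oc -n openshift-machine-api get machines -l machine.openshift.io/cluster-api-machine-role=master
# copy an existing master Machine manifest, rename it, apply it, and verify the new machine scales up;
# only then delete the old Machine, to check that scale-down never precedes scale-up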
Description of problem:
An error is thrown by the broker form view for a pre-populated application name. The error reads: formData.application.selectedKey must be a `string` type, but the final value was: `null`. If "null" is intended as an empty value be sure to mark the schema as `.nullable()`
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Install serverless operator 2. Create any application in a namespace 3. Now open broker in form view
Actual results:
You have to explicitly select "no application" or another application for the form view to work
Expected results:
Error should not be thrown for the appropriate value
Additional info:
Attaching a video of the error
https://drive.google.com/file/d/1WRp2ftMPlCG0ZiHZwC0QfleES3iVHObq/view?usp=sharing
Note: also notify the Hive team we're doing these bumps.
Maxim Patlasov pointed this out in STOR-1453 but still somehow we missed it. I tested this on 4.15.0-0.ci-2023-11-29-021749.
It is possible to set a custom TLSSecurityProfile without minTLSversion:
$ oc edit apiserver cluster
...
spec:
tlsSecurityProfile:
type: Custom
custom:
ciphers:
- ECDHE-ECDSA-CHACHA20-POLY1305
- ECDHE-ECDSA-AES128-GCM-SHA256
This causes the controller to crash loop:
$ oc get pods -n openshift-cluster-csi-drivers
NAME READY STATUS RESTARTS AGE
aws-ebs-csi-driver-controller-589c44468b-gjrs2 6/11 CrashLoopBackOff 10 (18s ago) 37s
...
because the `${TLS_MIN_VERSION}` placeholder is never replaced:
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
The observed config in the ClusterCSIDriver shows an empty string:
$ oc get clustercsidriver ebs.csi.aws.com -o json | jq .spec.observedConfig
{
"targetcsiconfig": {
"servingInfo":
}
}
which means minTLSVersion is empty when we get to this line, and the string replacement is not done:
So it seems we have a couple of options:
1) completely omit the --tls-min-version arg if minTLSVersion is empty, or
2) set --tls-min-version to the same default value we would use if TLSSecurityProfile is not present in the apiserver object
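Until one of those options lands, a hedged workaround sketch is to always include minTLSVersion in the custom profile so the placeholder gets substituted:
$ oc patch apiserver/cluster --type=merge -p '{"spec":{"tlsSecurityProfile":{"type":"Custom","custom":{"minTLSVersion":"VersionTLS12","ciphers":["ECDHE-ECDSA-CHACHA20-POLY1305","ECDHE-ECDSA-AES128-GCM-SHA256"]}}}}'
# the CSI driver controller pods should then stop crash looping once the operator re-renders the deployment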
Description of problem:
Azure HostedClusters are failing in OCP 4.17 due to issues with the cluster-storage-operator.
- lastTransitionTime: "2024-05-29T19:58:39Z"
  message: 'Unable to apply 4.17.0-0.nightly-multi-2024-05-29-121923: the cluster operator storage is not available'
  observedGeneration: 2
  reason: ClusterOperatorNotAvailable
  status: "True"
  type: ClusterVersionProgressing
I0529 20:05:21.547544 1 status_controller.go:218] clusteroperator/storage diff {"status":{"conditions":[{"lastTransitionTime":"2024-05-29T20:02:00Z","message":"AzureDiskCSIDriverOperatorCRDegraded: AzureDiskDriverGuestStaticResourcesControllerDegraded: \"node_service.yaml\" (string): namespaces \"clusters-test-case4\" not found\nAzureDiskCSIDriverOperatorCRDegraded: AzureDiskDriverGuestStaticResourcesControllerDegraded: ","reason":"AzureDiskCSIDriverOperatorCR_AzureDiskDriverGuestStaticResourcesController_SyncError","status":"True","type":"Degraded"},{"lastTransitionTime":"2024-05-29T20:04:15Z","message":"AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods","reason":"AzureDiskCSIDriverOperatorCR_AzureDiskDriverNodeServiceController_Deploying","status":"True","type":"Progressing"},{"lastTransitionTime":"2024-05-29T19:59:00Z","message":"AzureDiskCSIDriverOperatorCRAvailable: AzureDiskDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service","reason":"AzureDiskCSIDriverOperatorCR_AzureDiskDriverNodeServiceController_Deploying","status":"False","type":"Available"},{"lastTransitionTime":"2024-05-29T19:59:00Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"},{"lastTransitionTime":"2024-05-29T19:59:00Z","reason":"NoData","status":"Unknown","type":"EvaluationConditionsDetected"}]}} I0529 20:05:21.566215 1 event.go:364] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"azure-cloud-controller-manager", UID:"205a4307-67e4-481e-9fee-975b2c5c40fb", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/storage changed: Progressing message changed from "AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods\nAzureFileCSIDriverOperatorCRProgressing: AzureFileDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods" to "AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods"
On the HostedCluster itself, these errors with the csi pods not coming up are:
% k describe pod/azure-disk-csi-driver-node-5hb24 -n openshift-cluster-csi-drivers | grep fail
  Liveness: http-get http://:healthz/healthz delay=10s timeout=3s period=10s #success=1 #failure=5
  Liveness: http-get http://:rhealthz/healthz delay=10s timeout=3s period=10s #success=1 #failure=5
  Warning FailedMount 2m (x28 over 42m) kubelet MountVolume.SetUp failed for volume "metrics-serving-cert" : secret "azure-disk-csi-driver-node-metrics-serving-cert" not found
There was an error with the CO as well:
storage 4.17.0-0.nightly-multi-2024-05-29-121923 False True True 49m AzureDiskCSIDriverOperatorCRAvailable: AzureDiskDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Every time
Steps to Reproduce:
1. Create a HC with a 4.17 nightly
Actual results:
Azure HC does not complete; nodes do join NodePool though
Expected results:
Azure HC should complete
Additional info:
Refactor the name to Dockerfile.ocp as a better, version-independent alternative
Description of problem:
Console dynamic plugins may declare their extensions using TypeScript, e.g. Kubevirt plugin-extensions.ts module.
The EncodedExtension type should be exposed directly via Console plugin SDK, instead of plugins having to import this type from the dependent OpenShift plugin SDK packages.
Description of problem:
IPI Baremetal - BootstrapVM machineNetwork interface restart impacts pulling image and causes ironic service to fail
Version-Release number of selected component (if applicable):
4.16.Z but also seen this in 4.15 and 4.17
How reproducible:
50% of our jobs fail because of this.
Steps to Reproduce:
1. Prepare an IPI baremetal deployment (we have the provisioning network disabled, we are using Virtual Media)
2. Start a deployment, wait for the bootstrapVM to start running and log in via SSH
3. Run the command: journalctl -f | grep "Dependency failed for Ironic baremetal deployment service"
4. If the command above returns something, then print around 70 lines before and check for the NetworkManager entries in the log about the interface in the baremetal network getting restarted, and an error about pulling an image because DNS is not reachable.
Actual results:
Deployments fail 50% of the time, bootstrapVM is not able to pull an image because main machineNetwork interface is getting restarted and DNS resolution fails.
Expected results:
Deployments work 100% of the time, bootstrapVM is able to pull any image because machineNetwork interface is NOT restarted while images are getting pulled.
Additional info:
We have a CI system to test OCP 4.12 through 4.17 deployments and this issue started to occur a few weeks ago, mainly in 4.15, 4.16, and 4.17.
In this log extract of a deployment with OCP 4.16.0-0.nightly-2024-07-07-171226 you can see the image pull error because the registry name cannot be resolved; in the lines before and after you can see that the machineNetwork interface is getting restarted, causing the lack of DNS resolution.
Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Finished Build Ironic environment. Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Starting Extract Machine OS Images... Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Starting Provisioning interface... Mon 2024-07-08 23:15:15 UTC localhost.localdomain extract-machine-os.service[3779]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1370c041f0ecf4f6590c12f3e1b49111aa35052140e7fdd79964c32db47074c1... Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.3899] audit: op="connection-update" uuid="bf7e41e3-f1ea-3eed-98fd-c3d021e35d11" name="Wired connection 1" args="ipv4.addresses" pid=3812 uid=0 result="success" Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <warn> [1720480515.4008] keyfile: load: "/etc/NetworkManager/system-connections/nmconnection": failed to load connection: invalid connection: connection.type: property is missing Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4018] audit: op="connections-reload" pid=3817 uid=0 result="success" Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4159] agent-manager: agent[543677841603162b,:1.67/nmcli-connect/0]: agent registered Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4164] device (ens3): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed') Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4170] manager: NetworkManager state is now CONNECTED_LOCAL Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4172] device (ens3): disconnecting for new activation request. 
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4172] audit: op="connection-activate" uuid="bf7e41e3-f1ea-3eed-98fd-c3d021e35d11" name="Wired connection 1" pid=3821 uid=0 result="success" Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4200] device (ens3): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed') Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4214] dhcp4 (ens3): canceled DHCP transaction Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4215] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds) Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4215] dhcp4 (ens3): state changed no lease Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4216] dhcp6 (ens3): canceled DHCP transaction Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4216] dhcp6 (ens3): activation: beginning transaction (timeout in 45 seconds) Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4216] dhcp6 (ens3): state changed no lease Mon 2024-07-08 23:15:15 UTC localhost.localdomain extract-machine-os.service[3779]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1370c041f0ecf4f6590c12f3e1b49111aa35052140e7fdd79964c32db47074c1: (Mirrors also failed: [registry.dfwt5g.lab:4443/ocp-4.16/4.16.0-0.nightly-2024-07-07-171226@sha256:1370c041f0ecf4f6590c1 2f3e1b49111aa35052140e7fdd79964c32db47074c1: Get "https://registry.dfwt5g.lab:4443/v2/ocp-4.16/4.16.0-0.nightly-2024-07-07-171226/manifests/sha256:1370c041f0ecf4f6590c12f3e1b49111a a35052140e7fdd79964c32db47074c1": dial tcp 192.168.5.9:4443: connect: network is unreachable]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1370c041f0ecf4f6590c12f3e1b491 11aa35052140e7fdd79964c32db47074c1: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 192.168.32.8:53: dial udp 192.168.32.8:53: connect: network is unreachable Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: extract-machine-os.service: Main process exited, code=exited, status=125/n/a Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 2607:b500:410:7700::1 offline Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 69.10.223.134 offline Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4309] policy: set-hostname: set hostname to 'localhost.localdomain' (no hostname found) Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 207.246.65.226 offline Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4309] device (ens3): Activation: starting connection 'Wired connection 1' (bf7e41e3-f1ea-3eed-98fd-c3d021e35d11) Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 2001:470:f1c4:1::42 offline Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4315] device (ens3): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed') Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 
2603:c020:0:8369::feeb:dab offline Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4317] manager: NetworkManager state is now CONNECTING Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 2600:3c01:e000:7e6::123 offline Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4317] device (ens3): state change: prepare -> config (reason 'none', sys-iface-state: 'managed') Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 192.168.32.8 offline Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4322] device (ens3): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed') Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 69.89.207.99 offline Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4326] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds) Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 135.148.100.14 offline Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4347] dhcp4 (ens3): state changed new lease, address=192.168.32.28 Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4350] policy: set 'Wired connection 1' (ens3) as default for IPv4 routing and DNS Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.4385] device (ens3): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed') Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Removed source 192.168.32.8 Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 69.10.223.134 online Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 207.246.65.226 online Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 69.89.207.99 online Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 135.148.100.14 online Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: extract-machine-os.service: Failed with result 'exit-code'. Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Failed to start Extract Machine OS Images. Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Dependency failed for Customized Machine OS Image Server. Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Dependency failed for Ironic baremetal deployment service. Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: ironic.service: Job ironic.service/start failed with result 'dependency'. Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Dependency failed for Metal3 deployment service. Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: metal3-baremetal-operator.service: Job metal3-baremetal-operator.service/start failed with result 'dependency'. Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: image-customization.service: Job image-customization.service/start failed with result 'dependency'. Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Starting Ironic ramdisk logger... Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Starting Update master BareMetalHosts with introspection data... 
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3899]: NM local-dns-prepender triggered by ens3 dhcp4-change. Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3899]: <13>Jul 8 23:15:15 root: NM local-dns-prepender triggered by ens3 dhcp4-change. Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3901]: NM resolv-prepender: Checking for nameservers in /var/run/NetworkManager/resolv.conf Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3903]: nameserver 192.168.32.8 Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3905]: Failed to get unit file state for systemd-resolved.service: No such file or directory Mon 2024-07-08 23:15:15 UTC localhost.localdomain root[3911]: NM local-dns-prepender: Checking if local DNS IP is the first entry in resolv.conf Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3911]: <13>Jul 8 23:15:15 root: NM local-dns-prepender: Checking if local DNS IP is the first entry in resolv.conf Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3917]: NM local-dns-prepender: local DNS IP already is the first entry in resolv.conf Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3917]: <13>Jul 8 23:15:15 root: NM local-dns-prepender: local DNS IP already is the first entry in resolv.conf Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.5372] device (ens3): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed') Mon 2024-07-08 23:15:15 UTC localhost.localdomain provisioning-interface.service[3821]: Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveMon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.5375] device (ens3): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed') Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.5377] manager: NetworkManager state is now CONNECTED_SITE Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.5379] device (ens3): Activation: successful, device activated. Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info> [1720480515.5383] manager: NetworkManager state is now CONNECTED_GLOBAL Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Finished Provisioning interface.
Please review the following PR: https://github.com/openshift/router/pull/624
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Debugging https://issues.redhat.com/browse/OCPBUGS-36808 (the Metrics API failing some of the disruption checks) and taking https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808 as a reproducer of the issue, I think the Kube-aggregator is behind the problem.
According to the disruption checks which forward some relevant errors from the apiserver in the logs, looking at one of the new-connections check failures (from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808/artifacts/e2e-aws-ovn-upgrade-2/openshift-e2e-test/artifacts/junit/backend-disruption_20240816-155051.json)
> "Aug 16 16:43:17.672 - 2s E backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests reason/DisruptionBegan request-audit-id/c62b7d32-856f-49de-86f5-1daed55326b2 backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests stopped responding to GET requests over new connections: error running request: 503 Service Unavailable: error trying to reach service: dial tcp 10.128.2.31:10250: connect: connection refused"
The "error trying to reach service" part comes from: https://github.com/kubernetes/kubernetes/blob/b3c725627b15bb69fca01b70848f3427aca4c3ef/staging/src/k8s.io/apimachinery/pkg/util/proxy/transport.go#L105, the apiserver failing to reach the metrics-server Pod, the problem is that the IP "10.128.2.31" corresponds to a Pod that was deleted some milliseconds before (as part of a node update/draining), as we can see in:
> 2024-08-16T16:19:43.087Z|00195|binding|INFO|openshift-monitoring_metrics-server-7b9d8c5ddb-dtsmr: Claiming 0a:58:0a:80:02:1f 10.128.2.31
...
I0816 16:43:17.650083 2240 kubelet.go:2453] "SyncLoop DELETE" source="api" pods=["openshift-monitoring/metrics-server-7b9d8c5ddb-dtsmr"]
...
The apiserver was using a stale IP to reach a Pod that no longer exists, even though a new Pod that had already replaced the other Pod (Metrics API backend runs on 2 Pods), some minutes before, was available.
According to OVN, a fresher IP 10.131.0.12 of that Pod was already in the endpoints at that time:
> I0816 16:40:24.711048 4651 lb_config.go:1018] Cluster endpoints for openshift-monitoring/metrics-server are: map[TCP/https:
{10250 [10.128.2.31 10.131.0.12] []}]
I think, when "10.128.2.31" failed, the apiserver should have fallen back to "10.131.0.12", maybe it waits for some time/retries before doing so, or maybe it wasn't even aware of "10.131.0.12"
AFAIU, we have "--enable-aggregator-routing" set by default https://github.com/openshift/cluster-kube-apiserver-operator/blob/37df1b1f80d3be6036b9e31975ac42fcb21b6447/bindata/assets/config/defaultconfig.yaml#L101-L103 on the apiservers, so instead of forwarding to the metrics-server's service, apiserver directly reaches the Pods.
For that it keeps track of the relevant services and endpoints https://github.com/kubernetes/kubernetes/blob/ad8a5f5994c0949b5da4240006d938e533834987/staging/src/k8s.io/kube-aggregator/pkg/apiserver/resolvers.go#L40
Bad decisions may be made if the services and/or endpoints caches are stale.
Looking at the metrics-server (the Metrics API backend) endpoints changes in the apiserver audit logs:
> $ grep -hr Event . | grep "endpoints/metrics-server" | jq -c 'select( .verb | match("watch|update"))' | jq -r '[.requestReceivedTimestamp,.user.username,.verb] | @tsv' | sort
2024-08-16T15:39:57.575468Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T15:40:02.005051Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T15:40:35.085330Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T15:40:35.128519Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:19:41.148148Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:19:47.797420Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:20:23.051594Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:20:23.100761Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:20:23.938927Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:21:01.699722Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:39:00.328312Z system:serviceaccount:kube-system:endpoint-controller update ==> At around 16:39:XX the first Pod was rolled out
2024-08-16T16:39:07.260823Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:39:41.124449Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:43:23.701015Z system:serviceaccount:kube-system:endpoint-controller update ==> At around 16:43:23, the new Pod that replaced the second one was created
2024-08-16T16:43:28.639793Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:43:47.108903Z system:serviceaccount:kube-system:endpoint-controller update
We can see that just before the new-connections checks succeeded again at around "2024-08-16T16:43:23", an UPDATE was received/processed, which may have helped the apiserver sync its endpoints cache and/or choose a healthy Pod.
Also, no update was triggered when the second Pod was deleted at "16:43:17", which may explain the stale 10.128.2.31 endpoints entry on the apiserver side.
To summarize, I can see two problems here (maybe one is the consequence of the other):
A Pod was deleted and an Endpoint pointing to it wasn't updated. Apparently the Endpoints controller had/has some sync issues https://github.com/kubernetes/kubernetes/issues/125638
The apiserver resolver had an endpoints cache with one stale and one fresh entry, but it kept trying to reach the stale entry 4-5 times in a row, OR
The endpoints were updated (at around 16:39:XX, when the first Pod was rolled out, see above), but the apiserver resolver cache missed that, ended up with 2 stale entries, and had to wait until around 16:43:23 (when the new Pod that replaced the second one was created, see above) to sync and replace them with 2 fresh entries.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. See "Description of problem" 2. 3.
Actual results:
Expected results:
The kube-aggregator should detect stale APIService endpoints.
Additional info:
The kube-aggregator proxies requests to a stale Endpoints entry/Pod, which makes Metrics API requests falsely fail.
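For anyone triaging similar disruptions, a minimal sketch of comparing what the aggregator should resolve to against the live backends at a given moment (the metrics-server pod label selector is an assumption):
$ oc get apiservice v1beta1.metrics.k8s.io -o jsonpath='{.spec.service.namespace}/{.spec.service.name}{"\n"}'
$ oc -n openshift-monitoring get endpoints metrics-server -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\n"}{end}'
$ oc -n openshift-monitoring get pods -l app.kubernetes.io/name=metrics-server -o wide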
Description of problem:
While running batches of 500 managedclusters upgrading via Image-Based Upgrades (IBU) via RHACM and TALM, frequently the haproxy load balancer configured by default for a bare metal cluster in the openshift-kni-infra namespace would run out of connections despite being tuned for 20,000 connections.
Version-Release number of selected component (if applicable):
Hub OCP - 4.16.3 Spoke Clusters - Originally deployed 4.14.31 then upgraded in sequence to 4.14.32 -> 4.15.20 -> 4.15.21 -> 4.16.1 -> 4.16.3 ACM - 2.11.0-DOWNSTREAM-2024-07-10-21-49-48 TALM - 4.16.0
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
While monitoring the current connections during a CGU batch of 500 SNOs doing an IBU to a new OCP version, I would observe the oc cli returning "net/http: TLS handshake timeout". Monitoring the current connections via rsh into the active haproxy pod:
# oc -n openshift-kni-infra rsh haproxy-d16-h10-000-r650
Defaulted container "haproxy" out of: haproxy, haproxy-monitor, verify-api-int-resolvable (init)
sh-5.1$ echo "show info" | socat stdio /var/lib/haproxy/run/haproxy.sock | grep CurrConns
CurrConns: 20000
sh-5.1$
While capturing this value every 10 or 15 seconds I would observe a high fluctuation of the number of connections, such as:
Thu Aug 8 17:51:57 UTC 2024  CurrConns: 17747
Thu Aug 8 17:52:02 UTC 2024  CurrConns: 18413
Thu Aug 8 17:52:07 UTC 2024  CurrConns: 19147
Thu Aug 8 17:52:12 UTC 2024  CurrConns: 19785
Thu Aug 8 17:52:18 UTC 2024  CurrConns: 20000
Thu Aug 8 17:52:23 UTC 2024  CurrConns: 20000
Thu Aug 8 17:52:28 UTC 2024  CurrConns: 20000
Thu Aug 8 17:52:33 UTC 2024  CurrConns: 20000
A brand new hub cluster without any spoke clusters and without ACM installed runs between 53-56 connections; after installing ACM I would see the connection count rise to 56-60 connections. In a smaller environment with only 297 managedclusters I observed between 1410-1695 connections. I do not have a measurement of approximately how many connections we need in the large environment, however it clearly fluctuates, and the initiation of the IBU upgrades seems to spike it to the current default limit, triggering the timeout error message.
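A minimal sketch of scripting that sampling from outside the pod (the pod name is taken from the example above; the interval is arbitrary):
$ while true; do
    echo -n "$(date -u)  "
    oc -n openshift-kni-infra exec haproxy-d16-h10-000-r650 -c haproxy -- \
      sh -c 'echo "show info" | socat stdio /var/lib/haproxy/run/haproxy.sock' | grep CurrConns
    sleep 10
  done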
The story is to track i18n upload/download routine tasks which are performed every sprint.
A.C.
- Upload strings to Memsource at the start of the sprint and reach out to localization team
- Download translated strings from Memsource when it is ready
- Review the translated strings and open a pull request
- Open a followup story for next sprint
Description of problem:
CSS overrides in the OpenShift console are applied to ACM dropdown menu
Version-Release number of selected component (if applicable):
4.14, 4.15
How reproducible:
Always
Steps to Reproduce:
1. In ACM, view Governance > Policies. 2. Open the Actions dropdown
Actual results:
Actions are indented and preceded by bullets
Expected results:
Dropdown menu style should not be affected
Additional info:
Description of problem:
while applying "oc adm upgrade --to-multi-arch" certain flags such as --to and --to-image are blocked with error message such as: error: --to-multi-arch may not be used with --to or --to-image however if one applies --force, or --to-latest, no error message is generated, only: Requested update to multi cluster architecture and the flags are omitted silently, applying .spec: desiredUpdate: architecture: Multi force: false <- --force silently have no effect here image: version: 4.13.0-ec.2 <- --to-latest omitted silently either
Version-Release number of selected component (if applicable):
4.13.0-ec.2 but seen elsewhere
How reproducible:
100%
Steps to Reproduce:
1. oc adm upgrade --to-multi-arch --force
2. oc adm upgrade --to-multi-arch --to-latest
3. oc adm upgrade --to-multi-arch --force --to-latest
Actual results:
The flags are silently ignored, as explained above
Expected results:
The flags should either be blocked with the same error as --to and --to-image, or, if there is a use case, have the desired effect instead of being silently ignored
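A quick way to confirm whether a flag was silently dropped is to inspect the resulting desiredUpdate after each invocation; a minimal sketch:
$ oc adm upgrade --to-multi-arch --force
$ oc get clusterversion/version -o jsonpath='{.spec.desiredUpdate}{"\n"}'
# with the current behaviour, force stays false and no target version/image from --to-latest shows up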
Please review the following PR: https://github.com/openshift/machine-os-images/pull/40
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/ironic-agent-image/pull/143
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/console-operator/pull/929
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The label data for networking Services is inverted: it should be shown as "key=value", but it's currently shown as "value=key"
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-09-120947 4.18.0-0.nightly-2024-09-09-212926
How reproducible:
Always
Steps to Reproduce:
1. Navigate to the Networking -> Services page and create a sample Service with labels, e.g.:
apiVersion: v1
kind: Service
metadata:
  name: exampleasd
  namespace: default
  labels:
    testkey1: testvalue1
    testkey2: testvalue2
spec:
  selector:
    app: MyApp
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376
2. Check the Labels on the Service details page
3. Check the Labels column on the Networking -> Services page
Actual results:
the data is shown as 'testvalue1=testkey1' and 'testvalue2=testkey2'
Expected results:
it should be shown as 'testkey1=testvalue1' and 'testkey2=testvalue2'
Additional info:
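The same labels can be double-checked from the CLI, which displays them correctly; a minimal sketch using the sample Service above:
$ oc get service exampleasd -n default --show-labels
$ oc get service exampleasd -n default -o jsonpath='{.metadata.labels}{"\n"}'
# both print testkey1=testvalue1 and testkey2=testvalue2, i.e. key=value - the console column should match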
https://download-01.beak-001.prod.iad2.dc.redhat.com/brewroot/work/tasks/7701/65007701/x86_64.log
we need --no-build-isolation in the download command too
this has been verified with ART
Description of problem:
Creating a faulty configmap for UWM results in cluster_operator_up=0 with the reason InvalidConfiguration. With https://issues.redhat.com/browse/MON-3421 we're expecting the reason to match UserWorkload.*
Version-Release number of selected component (if applicable):
4.15.z
How reproducible:
100%
Steps to Reproduce:
Apply the following CM to a cluster with UWM enabled:
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    hah helo! :)
Actual results:
cluster_operator_up=0 with reason InvalidConfiguration
Expected results:
cluster_operator_up=0 with reason matching pattern UserWorkload.*
Additional info:
https://issues.redhat.com/browse/MON-3421 streamlined reasons to allow separation between UWM and cluster monitoring. The above is a leftover that should be updated to match the same pattern.
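A hedged sketch of checking the reason the operator currently reports after applying the faulty configmap:
$ oc get clusteroperator monitoring -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.reason}{"\n"}{end}'
# the Degraded/Available conditions currently carry reason InvalidConfiguration; per MON-3421 a UserWorkload* reason is expected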
Description of problem:
This is a followup of https://issues.redhat.com/browse/OCPBUGS-34996, in which comments led us to better understand the issue customers are facing. LDAP IDP traffic from the oauth pod seems to be going through the configured HTTP(S) proxy, while it should not due to it being a different protocol. This results in customers adding the ldap endpoint to their no-proxy config to circumvent the issue.
Version-Release number of selected component (if applicable):
4.15.11
How reproducible:
Steps to Reproduce:
(From the customer)
1. Configure LDAP IDP
2. Configure Proxy
3. LDAP IDP communication from the control plane oauth pod goes through the proxy instead of going to the ldap endpoint directly
Actual results:
LDAP IDP communication from the control plane oauth pod goes through proxy
Expected results:
LDAP IDP communication from the control plane oauth pod should go to the ldap endpoint directly using the ldap protocol, it should not go through the proxy settings
Additional info:
For more information, see linked tickets.
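For reference, a minimal sketch of the workaround customers currently apply (the LDAP hostname is a placeholder), which should not be necessary once LDAP traffic bypasses the proxy:
$ oc patch proxy/cluster --type=merge -p '{"spec":{"noProxy":"existing-entries,ldap.example.com"}}'
# note: this replaces spec.noProxy, so any existing entries must be preserved in the new comma-separated value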
Please review the following PR: https://github.com/openshift/cluster-dns-operator/pull/421
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The dev console seems to require setting a Project/namespace. However, in the CLI, RoleBinding objects can be created without a namespace with no issues.
$ oc describe rolebinding.rbac.authorization.k8s.io/monitor
Name: monitor
Labels: <none>
Annotations: <none>
Role:
Kind: ClusterRole
Name: view
Subjects:
Kind Name Namespace
---- ---- ---------
ServiceAccount monitor
This is inconsistent with the dev console, causing confusion for developers and administrators and making things cumbersome.
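For comparison, a minimal sketch of the CLI path that succeeds without a subject namespace (the RoleBinding itself lands in the current project):
$ cat <<'EOF' | oc apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: monitor
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
- kind: ServiceAccount
  name: monitor        # no namespace set, mirroring the describe output above
EOF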
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Log in to the web console as Developer.
2. Select Project on the left.
3. Select the 'Project Access' tab.
4. Add access -> select Service Account in the dropdown
Actual results:
Save button is not active when no project is selected
Expected results:
The Save button should be enabled even when no Project is selected, so that the RoleBinding can be created just as it is handled in the CLI.
Additional info:
Description of problem:
openshift-install create cluster leads to an error when the target is a vSphere standard port group: ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed during pre-provisioning: unable to initialize folders and templates: failed to import ova: failed to lease wait: Invalid configuration for device '0'.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. openshift-install create cluster
2. Choose vSphere
3. Fill in the blanks
4. Have a standard port group
Actual results:
error
Expected results:
cluster creation
Additional info:
Description of problem:
The single-page docs are missing the "oc adm policy add-cluster-role-to-*" and "remove-cluster-role-from-*" commands. These options exist in these docs: https://docs.openshift.com/container-platform/4.14/authentication/using-rbac.html but not in these docs: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/cli_tools/index#oc-adm-policy-add-role-to-user
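For reference, the commands whose documentation is missing from the single-page CLI docs (user and group names are placeholders):
$ oc adm policy add-cluster-role-to-user cluster-reader alice
$ oc adm policy add-cluster-role-to-group cluster-reader my-group
$ oc adm policy remove-cluster-role-from-user cluster-reader alice
$ oc adm policy remove-cluster-role-from-group cluster-reader my-group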
Description of problem:
VirtualizedTable, which is exposed to dynamic plugins, is missing the onRowsRendered prop, which is available in VirtualTableBody of the @patternfly/react-virtualized-extension package
Version-Release number of selected component (if applicable):
4.15.z
Actual results:
onRowsRendered prop is not available in VirtualizedTable component
Expected results:
onRowsRendered prop should be available in VirtualizedTable component
Additional info:
Description of problem:
Necessary security group rules are not created when using installer created VPC.
Version-Release number of selected component (if applicable):
4.17.2
How reproducible:
Easily
Steps to Reproduce:
1. Try to deploy a Power VS cluster and have the installer create the VPC, or remove required rules from a VPC you're bringing.
2. Control plane nodes fail to bootstrap.
3. Fail
Actual results:
Install fails
Expected results:
Install succeeds
Additional info:
Fix identified
Description of problem:
In OpenShift 4.13-4.15, when a "rendered" MachineConfig in use is deleted, it is automatically recreated. In OpenShift 4.16, it is not recreated, and the nodes and MCP become degraded due to the "rendered" not found error.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Create a MC to deploy any file in the worker MCP
2. Get the name of the new rendered MC, for example "rendered-worker-bf829671270609af06e077311a39363e"
3. When the first node starts updating, delete the new rendered MC: oc delete mc rendered-worker-bf829671270609af06e077311a39363e
Actual results:
Node degraded with "rendered" not found error
Expected results:
In OCP 4.13 to 4.15, the "rendered" MC is automatically re-created, and the node continues updating to the MC content without issues. It should be the same in 4.16.
Additional info:
The behavior in 4.12 and older is the same as now in 4.16. In 4.13-4.15, the "rendered" MC is re-created and no issues with the nodes/MCPs are shown.
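A quick way to observe the degradation after deleting the in-use rendered MC; a minimal sketch:
$ oc get mcp worker -o jsonpath='{range .status.conditions[?(@.type=="NodeDegraded")]}{.status}{" "}{.message}{"\n"}{end}'
$ oc get mcp worker -o jsonpath='{.status.degradedMachineCount}{"\n"}'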
Description of problem:
Azure-File volume mount failed, it happens on an arm cluster with the multi payload.
$ oc describe pod
Warning FailedMount 6m28s (x2 over 95m) kubelet MountVolume.MountDevice failed for volume "pvc-102ad3bf-3480-410b-a4db-73c64daeb3e2" : rpc error: code = InvalidArgument desc = GetAccountInfo(wduan-0319b-bkp2k-rg#clusterjzrlh#pvc-102ad3bf-3480-410b-a4db-73c64daeb3e2###wduan) failed with error: Retriable: true, RetryAfter: 0s, HTTPStatusCode: -1, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wduan-0319b-bkp2k-rg/providers/Microsoft.Storage/storageAccounts/clusterjzrlh/listKeys?api-version=2021-02-01: StatusCode=0 -- Original Error: adal: Failed to execute the refresh request. Error = 'Post "https://login.microsoftonline.com/6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee/oauth2/token": dial tcp 20.190.190.193:443: i/o timeout'
The node log reports: W0319 09:41:30.745936 1 azurefile.go:806] GetStorageAccountFromSecret(azure-storage-account-clusterjzrlh-secret, wduan) failed with error: could not get secret(azure-storage-account-clusterjzrlh-secret): secrets "azure-storage-account-clusterjzrlh-secret" is forbidden: User "system:serviceaccount:openshift-cluster-csi-drivers:azure-file-csi-driver-node-sa" cannot get resource "secrets" in API group "" in the namespace "wduan"
Checked the role, it looks good, at least the same as previously:
$ oc get clusterrole azure-file-privileged-role -o yaml
...
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - privileged
  resources:
  - securitycontextconstraints
  verbs:
  - use
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-multi-2024-03-13-031451
How reproducible:
2/2
Steps to Reproduce:
1. Checked in CI, azure-file cases failed due to this
2. Create one cluster with the same config and payload, create an azure-file pvc and pod
Actual results:
Pod could not be running
Expected results:
Pod should be running
Additional info:
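A hedged sketch of checking whether the node service account can actually read the storage-account secret in the user namespace, which is what the driver log complains about:
$ oc auth can-i get secrets -n wduan \
    --as=system:serviceaccount:openshift-cluster-csi-drivers:azure-file-csi-driver-node-sa
# a 'no' here would confirm the RBAC gap the mount error points at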
Description of problem:
In the Administrator view under Cluster Settings -> Update Status Pane, the text for the versions is black instead of white when Dark mode is selected on Firefox (128.0.3 Mac). Also happens if you choose System default theme and the system is set to Dark mode.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Open /settings/cluster using Firefox with Dark mode selected 2. 3.
Actual results:
The version numbers under Update status are black
Expected results:
The version numbers under Update status are white
Additional info:
Description of problem:
There are 2 problematic tests in the ImageEcosystem test suite: the rails sample and the s2i perl test. This issue tries to fix them both at once so that we can get a passing image ecosystem test.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Run the imageecosystem testsuite 2. observe the {[Feature:ImageEcosystem][ruby]} and {[Feature:ImageEcosystem][perl]} test fail
Actual results:
The two tests fail
Expected results:
No test failures
Additional info:
Description of the problem:
After multiple re-installations on the exact same baremetal host, re-using the exact same parameters (such as Agent ID, Cluster name, domain, etc.), the eventsURL hits a limit, so there is no direct way to check the progress, even though the postgres database does save the latest entries.
How reproducible:
Steps to reproduce:
1. Install an SNO cluster in a Host
2. Fully wipe out all the resources in RHACM, including SNO project
3. Re-install exact same SNO in the same Host
4. Repeat steps 1-3 multiple times
Actual results:
Last ManagedCluster installed is from 09/09 and the postgres database contains its last installation logs:
installer=> SELECT * FROM events WHERE host_id LIKE 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201' ORDER BY event_time DESC; id | created_at | updated_at | deleted_at | category | cluster_id | event_time | host_id | infra_env_id | message | name | props | request_id | severity --------+-------------------------------+-------------------------------+------------+----------+--------------------------------------+----------------------------+ --------------------------------------+--------------------------------------+--------------------------------------------------------------------------------------- --------------+-------+--------------------------------------+---------- 213102 | 2024-09-09 10:15:54.440757+00 | 2024-09-09 10:15:54.440757+00 | | user | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:15:54.439+00 | aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Host sno1: validation 'api-int-domain-name-resolved-correctly' that used to succeed is now failing | host_validation_failed | | b7785748-9f73-46e8-a11a-afefe2bfeb59 | warning 213088 | 2024-09-09 10:06:16.021777+00 | 2024-09-09 10:06:16.021777+00 | | user | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:06:16.021+00 | aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Host: sno1, reached installation stage Done | host_install_progress_updated | | a711f06b-870f-4f5f-886a-882ed6ea4665 | info 213087 | 2024-09-09 10:06:16.019012+00 | 2024-09-09 10:06:16.019012+00 | | user | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:06:16.018+00 | aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Host sno1: updated status from installing-in-progress to installed (Done) | host_status_updated | | a711f06b-870f-4f5f-886a-882ed6ea4665 | info 213086 | 2024-09-09 10:05:16.029495+00 | 2024-09-09 10:05:16.029495+00 | | user | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:05:16.029+00 | aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Host: sno1, reached installation stage Joined | host_install_progress_updated | | 2a8028c1-a0d0-4145-92cf-ea32e6b3f7e6 | info 213085 | 2024-09-09 10:03:32.06692+00 | 2024-09-09 10:03:32.06692+00 | | user | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:03:32.066+00 | aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Host: sno1, reached installation stage Rebooting: Ironic will reboot the node shortly | host_install_progress_updated | | fced0438-2f03-415f-913e-62da2d43431b | info 213084 | 2024-09-09 10:03:31.998935+00 | 2024-09-09 10:03:31.998935+00 | | user | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:03:31.998+00 | aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Uploaded logs for host sno1 cluster c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | host_logs_uploaded | | df3bc18a-d56a-4a20-84cb-d179fe3040f6 | info 213083 | 2024-09-09 10:03:12.621342+00 | 2024-09-09 10:03:12.621342+00 | | user | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:03:12.621+00 | aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Host: sno1, reached installation stage Writing image to disk: 100% | host_install_progress_updated | | 69cad5b4-b606-406c-921e-4f7b0ababfb6 | info 213082 | 2024-09-09 10:03:12.158359+00 | 2024-09-09 10:03:12.158359+00 | | user | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:03:12.158+00 | aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Host: sno1, reached installation 
stage Writing image to disk: 97%
But opening the Agent eventsURL (from 09/09 installation):
apiVersion: agent-install.openshift.io/v1beta1 kind: Agent metadata: annotations: inventory.agent-install.openshift.io/version: "0.1" creationTimestamp: "2024-09-09T09:55:46Z" finalizers: - agent.agent-install.openshift.io/ai-deprovision generation: 2 labels: agent-install.openshift.io/bmh: sno1 agent-install.openshift.io/clusterdeployment-namespace: sno1 infraenvs.agent-install.openshift.io: sno1 inventory.agent-install.openshift.io/cpu-architecture: x86_64 inventory.agent-install.openshift.io/cpu-virtenabled: "true" inventory.agent-install.openshift.io/host-isvirtual: "true" inventory.agent-install.openshift.io/host-manufacturer: RedHat inventory.agent-install.openshift.io/host-productname: KVM inventory.agent-install.openshift.io/storage-hasnonrotationaldisk: "false" name: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 namespace: sno1 ... ... debugInfo: eventsURL: https://assisted-service-multicluster-engine.apps.hub-sno.nokia-test.lab/api/assisted-install/v2/events?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJpbmZyYV9lbnZfaWQiOiJjZDBkZGRjMy1lODc5LTRjNzItOWU5ZC0zZDk4YmI3ODEzYmIifQ.eMlGvHeR69CoEA6OhtZX0uBZFeQOSRGOhYsqd1b0W3M78cGo1a2kbIKTz1eU80GUb70cU3v3pxKmxd19kpFaQA&host_id=aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 state: installed stateInfo: Done
Clicking on the eventsURL shows the latest event as one from 25 July, which means it is still showing past installations of the host and not the latest one:
{ "cluster_id": "4df40e8d-b28e-4cad-88d3-fa5c37a81939", "event_time": "2024-07-25T00:37:15.538Z", "host_id": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201", "infra_env_id": "f6564380-9d04-47e3-afe9-b348204cf521", "message": "Host sno1: updated status from known to preparing-for-installation (Host finished successfully to prepare for installation)", "name": "host_status_updated", "severity": "info" }
Trying to replicate the behavior on the Postgres database, it looks as if only around 50,000 entries at most are considered and the last one of those is returned, something like:
installer=> SELECT * FROM (SELECT * FROM events WHERE host_id LIKE 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201' LIMIT 50000) AS A ORDER BY event_time DESC LIMIT 1; id | created_at | updated_at | deleted_at | category | cluster_id | event_time | host_id | infra_env_id | message | name | props | request_id | severity --------+-----------------------------+-----------------------------+------------+----------+--------------------------------------+----------------------------+---- ----------------------------------+--------------------------------------+------------------------------------------------------------------------------------------- ----------------------------------+---------------------+-------+--------------------------------------+---------- 170052 | 2024-07-29 04:41:53.4572+00 | 2024-07-29 04:41:53.4572+00 | | user | 4df40e8d-b28e-4cad-88d3-fa5c37a81939 | 2024-07-29 04:41:53.457+00 | aaa aaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | f6564380-9d04-47e3-afe9-b348204cf521 | Host sno1: updated status from known to preparing-for-installation (Host finished successf ully to prepare for installation) | host_status_updated | | 872c267a-499e-4b91-8bbb-fdc7ff4521aa | info
Expected results:
The user can directly see the latest events in the eventsURL; in this scenario, they would all be from the 09/09 installation and not from July.
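For illustration only, a minimal Go sketch (not the assisted-service implementation) of how the events endpoint is expected to behave: select the newest events for the host explicitly, ordering by event_time descending before applying any limit. Table and column names follow the psql output above; the driver, DSN and limit value are assumptions.

package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // assumed Postgres driver for this sketch
)

// latestEvents prints the newest events for a host, newest first.
func latestEvents(db *sql.DB, hostID string, limit int) error {
	rows, err := db.Query(
		`SELECT event_time, message
		   FROM events
		  WHERE host_id = $1
		  ORDER BY event_time DESC
		  LIMIT $2`, hostID, limit)
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		var eventTime time.Time
		var message string
		if err := rows.Scan(&eventTime, &message); err != nil {
			return err
		}
		fmt.Println(eventTime.Format(time.RFC3339), message)
	}
	return rows.Err()
}

func main() {
	// Placeholder DSN; adjust for the real installer database.
	db, err := sql.Open("postgres", "postgres://installer@localhost/installer?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := latestEvents(db, "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201", 25); err != nil {
		log.Fatal(err)
	}
}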
Description of problem:
[AWS-EBS-CSI-Driver] allocatable volumes count incorrect in csinode for AWS vt1*/g4* instance types
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-16-033047
How reproducible:
Always
Steps to Reproduce:
1. Use instance type "vt1.3xlarge"/"g4ad.xlarge"/"g4dn.xlarge" install Openshift cluster on AWS 2. Check the csinode allocatable volumes count $ oc get csinode ip-10-0-53-225.ec2.internal -ojsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}' 26 g4ad.xlarge # 25 g4dn.xlarge # 25 vt1.3xlarge # 26 $ oc get no/ip-10-0-53-225.ec2.internal -oyaml| grep 'instance-type' beta.kubernetes.io/instance-type: vt1.3xlarge node.kubernetes.io/instance-type: vt1.3xlarge 3. Create statefulset with pvc(which use the ebs csi storageclass), nodeAnffinity to the same node and set the replicas to the max volumesallocatable count to verify the the csinode allocatable volumes count is correct and all the pods should become Running # Test data apiVersion: apps/v1 kind: StatefulSet metadata: name: statefulset-vol-limit spec: serviceName: "my-svc" replicas: 26 selector: matchLabels: app: my-svc template: metadata: labels: app: my-svc spec: affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - ip-10-0-53-225.ec2.internal # Make all volume attach to the same node containers: - name: openshifttest image: quay.io/openshifttest/hello-openshift@sha256:56c354e7885051b6bb4263f9faa58b2c292d44790599b7dde0e49e7c466cf339 volumeMounts: - name: data mountPath: /mnt/storage tolerations: - key: "node-role.kubernetes.io/master" effect: "NoSchedule" volumeClaimTemplates: - metadata: name: data spec: accessModes: [ "ReadWriteOnce" ] #storageClassName: gp3-csi resources: requests: storage: 1Gi
Actual results:
In step 3 some pods are stuck in "ContainerCreating" status because their volumes are stuck in attaching state and cannot be attached to the node.
Expected results:
In step 3 all the pods with a PVC should become "Running", and in step 2 the csinode allocatable volumes count should be correct:
-> g4ad.xlarge allocatable count should be 24
-> g4dn.xlarge allocatable count should be 24
-> vt1.3xlarge allocatable count should be 24
Additional info:
... attach or mount volumes: unmounted volumes=[data12 data6], unattached volumes=[data12 data6], failed to process volumes=[]: timed out waiting for the condition 06-25 17:51:23.680 Warning FailedAttachVolume 4m1s (x13 over 14m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-d08d4133-f589-4aa3-bbef-f988058c419a" : rpc error: code = Internal desc = Could not attach volume "vol-0aa138f453d414ec3" to node "i-09d532f5155b3c05d": attachment of disk "vol-0aa138f453d414ec3" failed, expected device to be attached but was attaching 06-25 17:51:23.681 Warning FailedMount 3m40s (x3 over 10m) kubelet Unable to attach or mount volumes: unmounted volumes=[data6 data12], unattached volumes=[data12 data6], failed to process volumes=[]: timed out waiting for the condition ...
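The expected counts above come from attachment slots being shared on these instance types; a rough Go sketch of that accounting follows. All numbers and the exact set of devices that consume slots are assumptions here, not the aws-ebs-csi-driver's actual limit tables.

package main

import "fmt"

// allocatableVolumes subtracts everything already occupying an attachment slot
// on the node (e.g. the root EBS volume, ENIs and any accelerator or
// instance-store devices on instance types where they share the limit) from
// the instance's attachment limit.
func allocatableVolumes(instanceAttachmentLimit, reservedSlots int) int {
	return instanceAttachmentLimit - reservedSlots
}

func main() {
	// Purely illustrative: with a shared limit of 28 and 4 reserved slots the
	// node would advertise 24 allocatable volumes, matching the expectation
	// above for the g4*/vt1* instance types.
	fmt.Println(allocatableVolumes(28, 4))
}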
The story is to track the i18n upload/download routine tasks which are performed every sprint.
A.C.
- Upload strings to Memsource at the start of the sprint and reach out to the localization team
- Download translated strings from Memsource when they are ready
- Review the translated strings and open a pull request
- Open a followup story for next sprint
Description of problem:
Numbers input into NumberSpinnerField that are above 2147483647 are not accepted as integers
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Always
Steps to Reproduce:
1. Enter a number larger than 2147483647 into any NumberSpinnerField
Actual results:
Number is not accepted as an integer
Expected results:
There should be a separate validation error stating the number should be less than 2147483647
Additional info:
See https://github.com/openshift/console/pull/14084
Description of problem:
On the NetworkPolicies page, select MultiNetworkPolicies and create a policy; the created policy is not a MultiNetworkPolicy but a NetworkPolicy.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Create a MultiNetworkPolicy 2. 3.
Actual results:
The policy is a NetworkPolicy, not a MultiNetworkPolicy
Expected results:
It is a MultiNetworkPolicy
Additional info:
TRT has detected a consistent long term trend where the oauth-apiserver appears to have more disruption than it did in 4.16, for minor upgrades on azure.
The problem appears roughly over the 90th percentile, we picked it up at P95 where it shows a consistent 5-8s more than we'd expect given the data in 4.16 ga.
The problem hits ONLY oauth, affecting both new and reused connections, as well as the cached variants, meaning etcd should be out of the picture. You'll see a few very short blips where all four of these backends lose connectivity for ~1s throughout the run, several times over. It looks like it may be correlated to the oauth-operator reporting:
source/OperatorAvailable display/true condition/Available reason/APIServices_Error status/False APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https" [2s]
Sample jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669509775822848
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669509775822848/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122854.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827077182283845632
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827077182283845632/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240823-212127.json&overrideDisplayFlag=1&selectedSources=OperatorDegraded&selectedSources=EtcdLog&selectedSources=Disruption&selectedSources=E2EFailed
More can be found using the first link to the dashboard in this post and scrolling down to most recent job runs, and looking for high numbers.
The operator going degraded is probably the strongest symptom to pursue, as it appears in most of the above.
If you find any runs where other backends are disrupted, especially kube-api, I would suggest ignoring those as they are unlikely to be the same fingerprint as the error being described here.
Tracker issue for bootimage bump in 4.18. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-41259.
Description of problem:
The message of the co olm Upgradeable condition is not correct if a ClusterExtension (without olm.maxOpenShiftVersion) is installed.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-06-223232
How reproducible:
always
Steps to Reproduce:
1.create ClusterCatalog apiVersion: olm.operatorframework.io/v1alpha1 kind: ClusterCatalog metadata: name: catalog-1 labels: example.com/support: "true" provider: olm-1 spec: priority: 1000 source: type: image image: ref: quay.io/openshifttest/nginxolm-operator-index:nginxolm74108 2. create ns and sa 3. create ClusterExtension apiVersion: olm.operatorframework.io/v1alpha1 kind: ClusterExtension metadata: name: test-74108 spec: source: sourceType: Catalog catalog: packageName: nginx74108 channels: - candidate-v1.1 install: serviceAccount: name: sa-74108 namespace: test-74108 4. check co olm status status: conditions: - lastTransitionTime: "2024-10-08T11:51:01Z" message: 'OLMIncompatibleOperatorControllerDegraded: error with cluster extension test-74108: error in bundle nginx74108.v1.1.0: could not convert olm.properties: failed to unmarshal properties annotation: unexpected end of JSON input' reason: OLMIncompatibleOperatorController_SyncError status: "True" type: Degraded - lastTransitionTime: "2024-10-08T02:16:36Z" message: All is well reason: AsExpected status: "False" type: Progressing - lastTransitionTime: "2024-10-08T02:16:36Z" message: All is well reason: AsExpected status: "True" type: Available - lastTransitionTime: "2024-10-08T11:48:26Z" message: 'InstalledOLMOperatorsUpgradeable: error with cluster extension test-74108: error in bundle nginx74108.v1.1.0: could not convert olm.properties: failed to unmarshal properties annotation: unexpected end of JSON input' reason: InstalledOLMOperators_FailureGettingExtensionMetadata status: "False" type: Upgradeable - lastTransitionTime: "2024-10-08T02:09:59Z" reason: NoData status: Unknown type: EvaluationConditionsDetected
Actual results:
co olm is Degraded
Expected results:
co olm is OK
Additional info:
The following CSV annotation is not configured: olm.properties: '[{"type": "olm.maxOpenShiftVersion", "value": "4.8"}]'
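A minimal Go sketch of the direction the expected behaviour implies (not the cluster-olm-operator code): treat a missing or empty olm.properties annotation as "no properties" instead of returning an unmarshal error that degrades the co. The type and function names are assumptions of this sketch.

package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// property mirrors the olm.properties entries shown above.
type property struct {
	Type  string          `json:"type"`
	Value json.RawMessage `json:"value"`
}

// parseProperties returns no properties when the annotation is absent or
// empty, and only errors on genuinely malformed JSON.
func parseProperties(annotation string) ([]property, error) {
	annotation = strings.TrimSpace(annotation)
	if annotation == "" {
		return nil, nil
	}
	var props []property
	if err := json.Unmarshal([]byte(annotation), &props); err != nil {
		return nil, fmt.Errorf("could not parse olm.properties annotation: %w", err)
	}
	return props, nil
}

func main() {
	props, err := parseProperties(`[{"type": "olm.maxOpenShiftVersion", "value": "4.8"}]`)
	fmt.Println(props, err)

	// The failing case from the bug: no annotation at all should not be an error.
	props, err = parseProperties("")
	fmt.Println(props, err)
}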
Description of problem:
When the console is loaded there are errors in the browser's console about failing to fetch networking-console-plugin locales.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The issue is also affecting console CI.
Description of problem:
4.18 EFS controller and node pods are left behind after uninstalling the driver
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-08-075347
How reproducible:
Always
Steps to Reproduce:
1. Install the 4.18 EFS operator and driver on the cluster and check that the EFS pods are all up and Running
2. Uninstall the EFS driver and check whether the controller and node pods get deleted
Execution on 4.16 and 4.18 clusters
4.16 cluster oc create -f og-sub.yaml oc create -f driver.yaml oc get pods | grep "efs" aws-efs-csi-driver-controller-b8858785-72tp9 4/4 Running 0 4s aws-efs-csi-driver-controller-b8858785-gvk4b 4/4 Running 0 6s aws-efs-csi-driver-node-2flqr 3/3 Running 0 9s aws-efs-csi-driver-node-5hsfp 3/3 Running 0 9s aws-efs-csi-driver-node-kxnlv 3/3 Running 0 9s aws-efs-csi-driver-node-qdshm 3/3 Running 0 9s aws-efs-csi-driver-node-ss28h 3/3 Running 0 9s aws-efs-csi-driver-node-v9zwx 3/3 Running 0 9s aws-efs-csi-driver-operator-65b55bf877-4png9 1/1 Running 0 2m53s oc get clustercsidrivers | grep "efs" efs.csi.aws.com 2m26s oc delete -f driver.yaml oc get pods | grep "efs" aws-efs-csi-driver-operator-65b55bf877-4png9 1/1 Running 0 4m40s 4.18 cluster oc create -f og-sub.yaml oc create -f driver.yaml oc get pods | grep "efs" aws-efs-csi-driver-controller-56d68dc976-847lr 5/5 Running 0 9s aws-efs-csi-driver-controller-56d68dc976-9vklk 5/5 Running 0 11s aws-efs-csi-driver-node-46tsq 3/3 Running 0 18s aws-efs-csi-driver-node-7vpcd 3/3 Running 0 18s aws-efs-csi-driver-node-bm86c 3/3 Running 0 18s aws-efs-csi-driver-node-gz69w 3/3 Running 0 18s aws-efs-csi-driver-node-l986w 3/3 Running 0 18s aws-efs-csi-driver-node-vgwpc 3/3 Running 0 18s aws-efs-csi-driver-operator-7cc9bf69b5-hj7zv 1/1 Running 0 2m55s oc get clustercsidrivers efs.csi.aws.com 2m19s oc delete -f driver.yaml oc get pods | grep "efs" aws-efs-csi-driver-controller-56d68dc976-847lr 5/5 Running 0 4m58s aws-efs-csi-driver-controller-56d68dc976-9vklk 5/5 Running 0 5m aws-efs-csi-driver-node-46tsq 3/3 Running 0 5m7s aws-efs-csi-driver-node-7vpcd 3/3 Running 0 5m7s aws-efs-csi-driver-node-bm86c 3/3 Running 0 5m7s aws-efs-csi-driver-node-gz69w 3/3 Running 0 5m7s aws-efs-csi-driver-node-l986w 3/3 Running 0 5m7s aws-efs-csi-driver-node-vgwpc 3/3 Running 0 5m7s aws-efs-csi-driver-operator-7cc9bf69b5-hj7zv 1/1 Running 0 7m44s oc get clustercsidrivers | grep "efs" => Nothing is there
Actual results:
The EFS controller and node pods are left behind
Expected results:
After uninstalling the driver, the EFS controller and node pods should get deleted
Additional info:
On the 4.16 cluster this is working fine.
EFS Operator logs:
oc logs aws-efs-csi-driver-operator-7cc9bf69b5-hj7zv
E1009 07:13:41.460469 1 base_controller.go:266] "LoggingSyncer" controller failed to sync "key", err: clustercsidrivers.operator.openshift.io "efs.csi.aws.com" not found
Discussion: https://redhat-internal.slack.com/archives/C02221SB07R/p1728456279493399
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/249
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Go to the NetworkPolicies page and make sure there are policies in each tab. Go to the MultiNetworkPolicies tab and create a filter, then move to the first tab (NetworkPolicies tab); it does not show the policies any more.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Have policies on NetworkPolicies tab and MultiNetworkPolicies tab 2. Create a filter on MultiNetworkPolicies tab 3. Go to NetworkPolicies tab
Actual results:
It shows "Not found"
Expected results:
the list of networkpolicies shows up
Additional info:
Description of problem:
External network ID should be an optional CLI option, but when it is not given, the HyperShift Operator crashes with a nil pointer error.
Version-Release number of selected component (if applicable):
4.18 and 4.17
Description of problem:
Creation of a pipeline through import from git using a devfile repo does not work
Version-Release number of selected component (if applicable):
How reproducible:
Everytime
Steps to Reproduce:
1. Create a pipeline from import from git form using devfile repo `https://github.com/nodeshift-starters/devfile-sample.git` 2. Check pipelines page 3.
Actual results:
No pipeline is created; instead a build config is created for it
Expected results:
If the pipeline option is shown in the import from git form for a repo, the pipeline should be generated
Additional info:
Description of problem:
Completions column values need to be marked for translation.
Version-Release number of selected component (if applicable):
4.12.0
How reproducible:
Steps to Reproduce:
1. Navigate to Workloads - Jobs 2. Values under Completions column are in English 3.
Actual results:
Content is in English
Expected results:
Content should be in target language
Additional info:
screenshot provided
arm64 has been dev preview for CNV since 4.14. The installer shouldn't block installing it.
Just make sure it is shown in the UI as dev preview.
In all releases tested, in particular 4.16.0-0.okd-scos-2024-08-21-155613, the Samples operator uses incorrect templates, resulting in the following alert:
Samples operator is detecting problems with imagestream image imports. You can look at the "openshift-samples" ClusterOperator object for details. Most likely there are issues with the external image registry hosting the images that needs to be investigated. Or you can consider marking samples operator Removed if you do not care about having sample imagestreams available. The list of ImageStreams for which samples operator is retrying imports: fuse7-eap-openshift fuse7-eap-openshift-java11 fuse7-java-openshift fuse7-java11-openshift fuse7-karaf-openshift-jdk11 golang httpd java jboss-datagrid73-openshift jboss-eap-xp3-openjdk11-openshift jboss-eap-xp3-openjdk11-runtime-openshift jboss-eap-xp4-openjdk11-openshift jboss-eap-xp4-openjdk11-runtime-openshift jboss-eap74-openjdk11-openshift jboss-eap74-openjdk11-runtime-openshift jboss-eap74-openjdk8-openshift jboss-eap74-openjdk8-runtime-openshift jboss-webserver57-openjdk8-tomcat9-openshift-ubi8 jenkins jenkins-agent-base mariadb mysql nginx nodejs perl php postgresql13-for-sso75-openshift-rhel8 postgresql13-for-sso76-openshift-rhel8 python redis ruby sso75-openshift-rhel8 sso76-openshift-rhel8 fuse7-karaf-openshift jboss-webserver57-openjdk11-tomcat9-openshift-ubi8 postgresql
For example, the sample image for Mysql 8.0 is being pulled from registry.redhat.io/rhscl/mysql-80-rhel7:latest (and cannot be found using the dummy pull secret).
Works correctly on OKD FCOS builds.
Please review the following PR: https://github.com/openshift/cloud-provider-vsphere/pull/76
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/216
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In GetMirrorFromRelease() https://github.com/openshift/installer/blob/master/pkg/asset/agent/mirror/registriesconf.go#L313-L328, the agent installer sets the mirror for the release image based on the source url.
This setting is then used in assisted-service to extract images etc. https://github.com/openshift/assisted-service/blob/master/internal/oc/release.go#L328-L340 in conjunction with the icsp file.
The problem is that GetMirrorFromRelease() returns just the first entry in registries.conf, so it is not really the actual mirror in the case where a source has multiple mirrors. A better way to handle this would be to not set the env variable OPENSHIFT_INSTALL_RELEASE_IMAGE_MIRROR and just let the resolving of the mirror be handled by the icsp file. It is currently using the icsp file, but since the source has been changed to the mirror it might not use those mirrors if, for example, the first mirror does not have the manifest file. See the sketch after the mirror config below.
We've had an internal report of a failure when using mirroring:
Oct 01 10:06:16 master-0 agent-register-cluster[7671]: time="2024-10-01T14:06:16Z" level=fatal msg="Failed to register cluster with assisted-service: command 'oc adm release info -o template --template '{{.metadata.version}}' --insecure=true --icsp-file=/tmp/icsp-file2810072099 registry.hub.qe.lab.redhat.com:5000/openshift-release-dev/ocp-v4.0-art-dev@sha256:6078cb4ae197b5b0c526910363b8aff540343bfac62ecb1ead9e068d541da27b --registry-config=/tmp/registry-config204889789' exited with non-zero exit code 1: \nFlag --icsp-file has been deprecated, support for it will be removed in a future release. Use --idms-file instead.\nerror: image \"registry.hub.qe.lab.redhat.com:5000/openshift-release-dev/ocp-v4.0-art-dev@sha256:6078cb4ae197b5b0c526910363b8aff540343bfac62ecb1ead9e068d541da27b\" not found: manifest unknown: manifest unknown\n"
When using the mirror config:
[[registry]]
  location = "quay.io/openshift-release-dev/ocp-release"
  mirror-by-digest-only = true
  prefix = ""

  [[registry.mirror]]
    location = "registry.hub.qe.lab.redhat.com:5000/openshift-release-dev/ocp-v4.0-art-dev"

  [[registry.mirror]]
    location = "registry.hub.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release"

[[registry]]
  location = "quay.io/openshift-release-dev/ocp-v4.0-art-dev"
  mirror-by-digest-only = true
  prefix = ""

  [[registry.mirror]]
    location = "registry.hub.qe.lab.redhat.com:5000/openshift-release-dev/ocp-v4.0-art-dev"

  [[registry.mirror]]
    location = "registry.hub.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release"
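To make the failure mode concrete, here is an illustrative Go sketch (not the installer's code) showing that a single registries.conf source can carry several mirrors, which is why returning only the first entry is lossy. The struct and function names and the example registry host are assumptions.

package main

import (
	"fmt"

	"github.com/BurntSushi/toml"
)

// registriesConf models just the fields used below; the field names mirror the
// TOML keys in the config above.
type registriesConf struct {
	Registries []struct {
		Location string `toml:"location"`
		Mirrors  []struct {
			Location string `toml:"location"`
		} `toml:"mirror"`
	} `toml:"registry"`
}

// mirrorsForSource returns every mirror configured for a source, not just the
// first one.
func mirrorsForSource(confData, source string) ([]string, error) {
	var conf registriesConf
	if _, err := toml.Decode(confData, &conf); err != nil {
		return nil, err
	}
	var mirrors []string
	for _, reg := range conf.Registries {
		if reg.Location != source {
			continue
		}
		for _, m := range reg.Mirrors {
			mirrors = append(mirrors, m.Location)
		}
	}
	return mirrors, nil
}

func main() {
	conf := `
[[registry]]
  location = "quay.io/openshift-release-dev/ocp-v4.0-art-dev"
  [[registry.mirror]]
    location = "registry.example.com:5000/openshift-release-dev/ocp-v4.0-art-dev"
  [[registry.mirror]]
    location = "registry.example.com:5000/openshift-release-dev/ocp-release"
`
	mirrors, err := mirrorsForSource(conf, "quay.io/openshift-release-dev/ocp-v4.0-art-dev")
	fmt.Println(mirrors, err)
}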
Description of problem:
The library-sync.sh script may leave some files of the unsupported samples in the checkout. In particular, the files that have been renamed are not deleted even though they should have been.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Run library-sync.sh
Actual results:
A couple of files under assets/operator/ocp-x86_64/fis are present.
Expected results:
The directory should not be present at all, because it is not supported.
Additional info:
Please review the following PR: https://github.com/openshift/csi-operator/pull/268
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/network-tools/pull/133
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-openshift-apiserver-operator/pull/586
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
https://github.com/search?q=repo%3Aopenshift%2Fconsole+name+%3D%3D%3D+%27%7Enew%27&type=code shows a number of instances in Console code where there is a check for a resource name with a value of "~new". This check is not valid as a resource name cannot include "~". We should remove these invalid checks.
Component Readiness has found a potential regression in the following test:
[bz-Routing] clusteroperator/ingress should not change condition/Available
Probability of significant regression: 97.63%
Sample (being evaluated) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-09-09T23:59:59Z
Success Rate: 89.29%
Successes: 25
Failures: 3
Flakes: 0
Base (historical) Release: 4.16
Start Time: 2024-05-28T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 67
Failures: 0
Flakes: 0
It is worth mentioning that in two of the three failures, ingress operator went available=false at the same time image registry went available=false. This is one example.
The team can investigate, and if a legitimate reason exists, please create an exception with origin and address it at the proper time: https://github.com/openshift/origin/blob/4557bdcecc10d9fa84188c1e9a36b1d7d162c393/pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go#L90
Since this is appearing on component readiness dashboard and the management depends on a green dashboard to make release decisions, please give the initial investigation a high priority. If an exception is needed, please contact TRT team to triage the issue.
We are aiming to find containers that are restarting more than 3 times during an e2e test. Critical pods like metal3-static-ip-set should not be restarting more than 3 times during a test.
Can your team investigate this and aim to fix it?
For now, we will exclude our test from failing.
for an example of how often this container restarts during a test.
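A minimal Go sketch of the kind of check involved (an assumption, not the actual origin monitor test code): walk the pod container statuses and flag anything that restarted more than three times.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// excessiveRestarts flags containers that restarted more than limit times.
func excessiveRestarts(pods []corev1.Pod, limit int32) []string {
	var offenders []string
	for _, pod := range pods {
		for _, cs := range pod.Status.ContainerStatuses {
			if cs.RestartCount > limit {
				offenders = append(offenders,
					fmt.Sprintf("%s/%s container %s restarted %d times",
						pod.Namespace, pod.Name, cs.Name, cs.RestartCount))
			}
		}
	}
	return offenders
}

func main() {
	// Illustrative data only.
	pod := corev1.Pod{}
	pod.Namespace = "openshift-machine-api"
	pod.Name = "metal3-example"
	pod.Status.ContainerStatuses = []corev1.ContainerStatus{
		{Name: "metal3-static-ip-set", RestartCount: 5},
	}
	fmt.Println(excessiveRestarts([]corev1.Pod{pod}, 3))
}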
Description of problem:
See attached screenshots. Different operator versions have different descriptions, but OperatorHub still shows the same description regardless of which operator version is selected.
Version-Release number of selected component (if applicable):
OCP 4.16
How reproducible:
Always
Steps to Reproduce:
1. Open OperatorHub and find the Sail operator 2. Select the Sail Operator 3. Choose different versions and channels
Actual results:
The description is always the same even though the actual description for the given version is different.
Expected results:
Expected behavior - when selecting different operator versions during installation, the description should be updated according to the selected operator version.
Additional info:
See attachments in original issue https://issues.redhat.com/browse/OPECO-3239
Description of problem:
After upgrading the cluster to 4.15, the Prometheus Operator's "Prometheus" tab does not show the Prometheus instances; they can still be viewed and accessed through the "All instances" tab.
Version-Release number of selected component (if applicable):
OCP v4.15
Steps to Reproduce:
1. Install the Prometheus operator from OperatorHub 2. Create a Prometheus instance 3. The instance will be visible under the All instances tab, not under the Prometheus tab
Actual results:
The Prometheus instance is visible in the All instances tab only.
Expected results:
The Prometheus instance should be visible in the All instances tab as well as the Prometheus tab.
Description of problem:
The Azure cloud node manager uses a service account with a cluster role attached that provides it with cluster-wide permissions to update Node objects. This means that, were the service account to become compromised, Node objects could be maliciously updated. To limit the blast radius of a leak, we should determine whether there is a way to limit the Azure cloud node manager to only be able to update the node on which it resides, or to move its functionality centrally within the cluster. Possible paths:
* Check upstream progress for any attempt to move the node manager role into the CCM
* See if we can re-use kubelet credentials, as these are already scoped to updating only the Node on which they reside
* See if there's another admission control method we can use to limit the updates (possibly https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/)
Version-Release number of selected component (if applicable):
4.14.0
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
In Azure Stack, the Azure-Disk CSI Driver node pod CrashLoopBackOff: openshift-cluster-csi-drivers azure-disk-csi-driver-node-57rxv 1/3 CrashLoopBackOff 33 (3m55s ago) 59m 10.0.1.5 ci-op-q8b6n4iv-904ed-kp5mv-worker-mtcazs-m62cj <none> <none> openshift-cluster-csi-drivers azure-disk-csi-driver-node-8wvqm 1/3 CrashLoopBackOff 35 (29s ago) 67m 10.0.0.6 ci-op-q8b6n4iv-904ed-kp5mv-master-1 <none> <none> openshift-cluster-csi-drivers azure-disk-csi-driver-node-97ww5 1/3 CrashLoopBackOff 33 (12s ago) 67m 10.0.0.7 ci-op-q8b6n4iv-904ed-kp5mv-master-2 <none> <none> openshift-cluster-csi-drivers azure-disk-csi-driver-node-9hzw9 1/3 CrashLoopBackOff 35 (108s ago) 59m 10.0.1.4 ci-op-q8b6n4iv-904ed-kp5mv-worker-mtcazs-gjqmw <none> <none> openshift-cluster-csi-drivers azure-disk-csi-driver-node-glgzr 1/3 CrashLoopBackOff 34 (69s ago) 67m 10.0.0.8 ci-op-q8b6n4iv-904ed-kp5mv-master-0 <none> <none> openshift-cluster-csi-drivers azure-disk-csi-driver-node-hktfb 2/3 CrashLoopBackOff 48 (63s ago) 60m 10.0.1.6 ci-op-q8b6n4iv-904ed-kp5mv-worker-mtcazs-kdbpf <none> <none>
The CSI-Driver container log: panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0xc8 pc=0x18ff5db] goroutine 228 [running]: sigs.k8s.io/cloud-provider-azure/pkg/provider.(*Cloud).GetZone(0xc00021ec00, {0xc0002d57d0?, 0xc00005e3e0?}) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/sigs.k8s.io/cloud-provider-azure/pkg/provider/azure_zones.go:182 +0x2db sigs.k8s.io/azuredisk-csi-driver/pkg/azuredisk.(*Driver).NodeGetInfo(0xc000144000, {0x21ebbf0, 0xc0002d5470}, 0x273606a?) /go/src/github.com/openshift/azure-disk-csi-driver/pkg/azuredisk/nodeserver.go:336 +0x13b github.com/container-storage-interface/spec/lib/go/csi._Node_NodeGetInfo_Handler.func1({0x21ebbf0, 0xc0002d5470}, {0x1d71a60?, 0xc0003b0320}) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:7160 +0x72 sigs.k8s.io/azuredisk-csi-driver/pkg/csi-common.logGRPC({0x21ebbf0, 0xc0002d5470}, {0x1d71a60?, 0xc0003b0320?}, 0xc0003b0340, 0xc00050ae10) /go/src/github.com/openshift/azure-disk-csi-driver/pkg/csi-common/utils.go:80 +0x409 github.com/container-storage-interface/spec/lib/go/csi._Node_NodeGetInfo_Handler({0x1ec2f40?, 0xc000144000}, {0x21ebbf0, 0xc0002d5470}, 0xc000054680, 0x20167a0) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:7162 +0x135 google.golang.org/grpc.(*Server).processUnaryRPC(0xc000530000, {0x21ebbf0, 0xc0002d53b0}, {0x21f5f40, 0xc00057b1e0}, 0xc00011cb40, 0xc00052c810, 0x30fa1c8, 0x0) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1343 +0xe03 google.golang.org/grpc.(*Server).handleStream(0xc000530000, {0x21f5f40, 0xc00057b1e0}, 0xc00011cb40) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1737 +0xc4c google.golang.org/grpc.(*Server).serveStreams.func1.1() /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:986 +0x86 created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 260 /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:997 +0x145
The registrar container log: E0321 23:08:02.679727 1 main.go:103] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Unavailable desc = error reading from server: EOF, restarting registration container.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-03-21-152650
How reproducible:
Seen in the CI profile; a manual install failed earlier as well.
Steps to Reproduce:
See Description
Actual results:
Azure-Disk CSI Driver node pod CrashLoopBackOff
Expected results:
Azure-Disk CSI Driver node pod should be running
Additional info:
See gather-extra and must-gather: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-azure-stack-ipi-proxy-fips-f2/1770921405509013504/artifacts/azure-stack-ipi-proxy-fips-f2/
Please review the following PR: https://github.com/openshift/image-registry/pull/411
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
CI Disruption during node updates:
4.18 Minor and 4.17 micro upgrades started failing with the initial 4.17 payload 4.17.0-0.ci-2024-08-09-225819
4.18 Micro upgrade failures began with the initial payload 4.18.0-0.ci-2024-08-09-234503
CI Disruption in the -out-of-change jobs in the nightlies that start with
4.18.0-0.nightly-2024-08-10-011435 and
4.17.0-0.nightly-2024-08-09-223346
The common change in all of those scenarios appears to be:
OCPNODE-2357: templates/master/cri-o: make crun as the default container runtime #4437
OCPNODE-2357: templates/master/cri-o: make crun as the default container runtime #4518
Description of problem:
The openshift-apiserver, which sends traffic through the konnectivity proxy, is also sending traffic intended for the local audit-webhook service through it. The audit-webhook service should be included in the NO_PROXY env var of the openshift-apiserver container.
4.14.z, 4.15.z, 4.16.z
How reproducible:
Always
Steps to Reproduce:
1. Create a rosa hosted cluster 2. Obeserve logs of the konnectivity-proxy sidecar of openshift-apiserver 3.
Actual results:
Logs include requests to the audit-webhook local service
Expected results:
Logs do not include requests to audit-webhook
Additional info:
Slack thread asking apiserver team
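A minimal Go sketch of the expected fix direction (illustrative only, not the HyperShift control-plane-operator code): append the local audit-webhook service name to the NO_PROXY value of the openshift-apiserver container so audit traffic bypasses the konnectivity proxy. The service name "audit-webhook" follows the bug description; the existing NO_PROXY value is made up.

package main

import (
	"fmt"
	"strings"
)

// appendNoProxy joins additional hosts onto an existing NO_PROXY value.
func appendNoProxy(existing string, hosts ...string) string {
	parts := []string{}
	if existing != "" {
		parts = append(parts, existing)
	}
	parts = append(parts, hosts...)
	return strings.Join(parts, ",")
}

func main() {
	fmt.Println(appendNoProxy("kube-apiserver,.cluster.local", "audit-webhook"))
}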
We saw excess pathological events tests failing aggregated jobs in AWS and GCP for 4.18.0-0.ci-2024-09-26-062917 (Azure has them too and has now failed in 4.18.0-0.nightly-2024-09-26-093014). The events are in namespace/openshift-apiserver-operator and namespace/openshift-authentication-operator – reason/DeploymentUpdated Updated Deployment.apps/apiserver -n openshift-oauth-apiserver because it changed
Examples:
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/127
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/azure-workload-identity/pull/24
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
After successful deployment, trying to delete spoke resources.
BMHs are not being removed and are stuck.
How reproducible:
Always
Steps to reproduce:
1. Deploy spoke node (tested in disconnected + IPV6 but CI also fails on ipv4)
2. Try to delete BMH (after deleting agents)
3.
Actual results:
BMH is still in provisioned state and not being deleted.
From assisted logs:
-------
time="2024-09-20T21:02:23Z" level=error msg="failed to delete BMH" func=github.com/openshift/assisted-service/internal/controller/controllers.removeSpokeResources file="/remote-source/assisted-service/app/internal/controller/controllers/agent_controller.go:450" agent=6df557e8-00af-4377-ac93-096b66c8e3c6 agent_namespace=spoke-0 error="failed to remove BMH openshift-machine-api/spoke-worker-0-1 finalizers: Internal error occurred: failed calling webhook \"baremetalhost.metal3.io\": failed to call webhook: Post \"https://baremetal-operatf557e8-00af-4377-ac93-096b66c8e3c6 agent_namespace=spoke-0 error="failed to remove BMH openshift-machine-api/spoke-worker-0-1 finalizers: Internal error occurred: failed calling webhook \"baremetalhost.metal3.io\": failed to call webhook: Post \"https://baremetal-operator-webhook-service.openshift-machine-api.svc:443/validate-metal3-io-v1alpha1-baremetalhost?timeout=10s\": no endpoints available for service \"baremetal-operator-webhook-service\"" go-id=393 hostname=spoke-worker-0-1 machine=spoke-0-f9w48-worker-0-x484f machine_namespace=openshift-machine-api machine_set=spoke-0-f9w48-worker-0 node=spoke-w
--------
Expected results:
BMH should be deleted
must-gather: https://drive.google.com/file/d/1JOeDGTzQNgDy9ZdjlJMcRi-hksB6Iz9h/view?usp=drive_link
Description of the problem:
[Staging] BE 2.35.0, UI 2.34.2 - BE allows LVMS and ODF to be enabled
How reproducible:
100%
Steps to reproduce:
1.
Actual results:
Expected results:
Please review the following PR: https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/364
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-etcd-operator/pull/1332
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The assisted service is throwing an error message stating that the Cloud Controller Manager (CCM) is not enabled, even though the CCM value is correctly set in the install-config file.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-19-045205
How reproducible:
Always
Steps to Reproduce:
1. Prepare install-config and agent-config for external OCI platform. Example of install-config configuration:
.......
.......
platform:
  external:
    platformName: oci
    cloudControllerManager: External
.......
.......
2. Create agent ISO for external OCI platform
3. Boot up nodes using created agent ISO
Actual results:
Oct 21 16:40:47 agent-sno.private.agenttest.oraclevcn.com service[2829]: time="2024-10-21T16:40:47Z" level=info msg="Register cluster: agenttest with id 2666753a-0485-420b-b968-e8732da6898c and params {\"api_vips\":[],\"base_dns_domain\":\"abitest.oci-rhelcert.edge-sro.rhecoeng.com\",\"cluster_networks\":[{\"cidr\":\"10.128.0.0/14\",\"host_prefix\":23}],\"cpu_architecture\":\"x86_64\",\"high_availability_mode\":\"None\",\"ingress_vips\":[],\"machine_networks\":[{\"cidr\":\"10.0.0.0/20\"}],\"name\":\"agenttest\",\"network_type\":\"OVNKubernetes\",\"olm_operators\":null,\"openshift_version\":\"4.18.0-0.nightly-2024-10-19-045205\",\"platform\":{\"external\":{\"cloud_controller_manager\":\"\",\"platform_name\":\"oci\"},\"type\":\"external\"},\"pull_secret\":\"***\",\"schedulable_masters\":false,\"service_networks\":[{\"cidr\":\"172.30.0.0/16\"}],\"ssh_public_key\":\"ssh-rsa XXXXXXXXXXXX\",\"user_managed_networking\":true,\"vip_dhcp_allocation\":false}" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).RegisterClusterInternal" file="/src/internal/bminventory/inventory.go:515" cluster_id=2666753a-0485-420b-b968-e8732da6898c go-id=2110 pkg=Inventory request_id=82e83b31-1c1b-4dea-b435-f7316a1965e
Expected results:
The cluster installation should be successful.
Description of problem:
When doing mirror to mirror, the operator catalog image is counted twice:
✓ 70/81 : (3s) docker://registry.redhat.io/redhat/community-operator-index:v4.15
✓ 71/81 : (3s) docker://registry.redhat.io/redhat/redhat-marketplace-index:v4.15
✓ 72/81 : (3s) docker://registry.redhat.io/redhat/community-operator-index:v4.15
2024/09/06 04:55:05 [INFO] : Mirroring is ongoing. No errors.
✓ 73/81 : (0s) oci:///test/ibm-catalog
✓ 74/81 : (2s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.15
✓ 75/81 : (2s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.15
✓ 76/81 : (3s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.14
✓ 77/81 : (1s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.14
✓ 78/81 : (3s) docker://registry.redhat.io/redhat/certified-operator-index:v4.15
✓ 79/81 : (2s) docker://registry.redhat.io/redhat/certified-operator-index:v4.15
✓ 80/81 : (2s) docker://registry.redhat.io/redhat/redhat-marketplace-index:v4.15
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.0.0-unknown-7b0b3bf2", GitCommit:"7b0b3bf2", GitTreeState:"clean", BuildDate:"2024-09-06T01:32:29Z", GoVersion:"go1.23.0", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. Do mirror2mirror with the following imagesetconfig:
cat config-136.yaml
apiVersion: mirror.openshift.io/v2alpha1
kind: ImageSetConfiguration
mirror:
  operators:
  - catalog: oci:///test/ibm-catalog
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    packages:
    - name: windows-machine-config-operator
    - name: cluster-kube-descheduler-operator
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.14
    packages:
    - name: servicemeshoperator
    - name: windows-machine-config-operator
  - catalog: registry.redhat.io/redhat/certified-operator-index:v4.15
    packages:
    - name: nvidia-network-operator
  - catalog: registry.redhat.io/redhat/community-operator-index:v4.15
    packages:
    - name: skupper-operator
    - name: reportportal-operator
  - catalog: registry.redhat.io/redhat/redhat-marketplace-index:v4.15
    packages:
    - name: dynatrace-operator-rhmp
`oc-mirror -c config-136.yaml docker://localhost:5000/m2m06 --workspace file://m2m6 --v2 --dest-tls-verify=false`
Actual results:
The operator catalog images are counted twice:
✓ 70/81 : (3s) docker://registry.redhat.io/redhat/community-operator-index:v4.15
✓ 71/81 : (3s) docker://registry.redhat.io/redhat/redhat-marketplace-index:v4.15
✓ 72/81 : (3s) docker://registry.redhat.io/redhat/community-operator-index:v4.15
2024/09/06 04:55:05 [INFO] : Mirroring is ongoing. No errors.
✓ 73/81 : (0s) oci:///test/ibm-catalog
✓ 74/81 : (2s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.15
✓ 75/81 : (2s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.15
✓ 76/81 : (3s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.14
✓ 77/81 : (1s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.14
✓ 78/81 : (3s) docker://registry.redhat.io/redhat/certified-operator-index:v4.15
✓ 79/81 : (2s) docker://registry.redhat.io/redhat/certified-operator-index:v4.15
✓ 80/81 : (2s) docker://registry.redhat.io/redhat/redhat-marketplace-index:v4.15
Expected results:
Should count the operator catalog images correctly, i.e. each catalog only once.
Additional info:
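For illustration, a small Go sketch (not oc-mirror's collector code) of the de-duplication the expected results imply: collapse repeated catalog image references before counting and mirroring them.

package main

import "fmt"

// uniqueRefs de-duplicates catalog image references while preserving order,
// so each catalog is mirrored and counted once.
func uniqueRefs(refs []string) []string {
	seen := make(map[string]bool, len(refs))
	out := make([]string, 0, len(refs))
	for _, r := range refs {
		if seen[r] {
			continue
		}
		seen[r] = true
		out = append(out, r)
	}
	return out
}

func main() {
	refs := []string{
		"docker://registry.redhat.io/redhat/community-operator-index:v4.15",
		"docker://registry.redhat.io/redhat/community-operator-index:v4.15",
		"docker://registry.redhat.io/redhat/redhat-marketplace-index:v4.15",
	}
	fmt.Println(uniqueRefs(refs)) // each catalog listed once
}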
Description of problem:
The hypershift CLI has an implicit dependency on the az and jq commands, as it invokes them directly. As a result, the "hypershift-azure-create" chain will not work since it's based on the hypershift-operator image, which lacks these tools.
Expected results:
Refactor the hypershift CLI to handle these dependencies in a Go-native way, so that the CLI is self-contained.
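A minimal illustration of the "Go-native" direction (not the actual HyperShift change): instead of shelling out to jq to pull fields out of az CLI JSON output, decode the JSON with encoding/json. The payload and field names here are hypothetical.

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Hypothetical JSON of the kind az would print.
	raw := []byte(`{"id": "/subscriptions/123/resourceGroups/example-rg", "name": "example-rg"}`)
	var group struct {
		ID   string `json:"id"`
		Name string `json:"name"`
	}
	if err := json.Unmarshal(raw, &group); err != nil {
		panic(err)
	}
	// Previously something like: az group show -n example-rg | jq -r .id
	fmt.Println(group.ID)
}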
Description of problem:
Circular dependencies in OCP Console prevent migration to Webpack 5
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Always
Steps to Reproduce:
1. Enable the CHECK_CYCLES env var while building 2. Observe errors 3.
Actual results:
There are errors
Expected results:
No errors
Additional info:
If the network to the bootstrap VM is slow, the extract-machine-os.service can time out (after 180s). If this happens, it will be restarted but services that depend on it (like ironic) will never be started even once it succeeds. systemd added support for Restart:on-failure for Type:oneshot services, but they still don't behave the same way as other types of services.
This can be simulated in dev-scripts by doing:
sudo tc qdisc add dev ostestbm root netem rate 33Mbit
Please review the following PR: https://github.com/openshift/openshift-apiserver/pull/445
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When deleting an AWS HostedCluster with endpoint access of type PublicAndPrivate or Private, the VPC endpoint for the HostedCluster is not always cleaned up when the HostedCluster is deleted.
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Most of the time
Steps to Reproduce:
1. Create a HostedCluster on AWS with endpoint access PublicAndPrivate 2. Wait for the HostedCluster to finish deploying 3. Delete the HostedCluster by deleting the HostedCluster resource (oc delete hostedcluster/[name] -n clusters)
Actual results:
The VPC endpoint and/or the DNS entries in the hypershift.local hosted zone that correspond to the hosted cluster are not removed.
Expected results:
The vpc endpoint and DNS entries in the hypershift.local hosted zone are deleted when the hosted cluster is cleaned up.
Additional info:
With current code, the namespace is deleted before the control plane operator finishes cleanup of the VPC endpoint and related DNS entries.
Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver/pull/92
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Pulling an image from GCP Artifact Registry fails
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Create repo for gcp artifact registry: zhsun-repo1
2. Login to registry
   gcloud auth login
   gcloud auth configure-docker us-central1-docker.pkg.dev
3. Push image to registry
   $ docker pull openshift/hello-openshift
   $ docker tag openshift/hello-openshift:latest us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest
   $ docker push us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest
4. Create pod
   $ oc new-project hello-gcr
   $ oc new-app --name hello-gcr --allow-missing-images \
     --image us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest
5. Check pod status
Actual results:
Pull image failed. must-gather: https://drive.google.com/file/d/1o9cyJB53vQtHNmL5EV_hIx9I_LzMTB0K/view?usp=sharing kubelet log: https://drive.google.com/file/d/1tL7HGc4fEOjH5_v6howBpx2NuhjGKsTp/view?usp=sharing $ oc get po NAME READY STATUS RESTARTS AGE hello-gcr-658f7f9869-76ssg 0/1 ImagePullBackOff 0 3h24m $ oc describe po hello-gcr-658f7f9869-76ssg Warning Failed 14s (x2 over 15s) kubelet Error: ImagePullBackOff Normal Pulling 2s (x2 over 16s) kubelet Pulling image "us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest" Warning Failed 1s (x2 over 16s) kubelet Failed to pull image "us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest": rpc error: code = Unknown desc = Requesting bearer token: invalid status code from registry 403 (Forbidden)
Expected results:
Pulling the image from Artifact Registry should succeed.
Additional info:
gcr.io works as expected. us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest doesn't work. $ oc get po -n hello-gcr NAME READY STATUS RESTARTS AGE hello-gcr-658f7f9869-76ssg 0/1 ImagePullBackOff 0 156m hello-gcr2-6d98c475ff-vjkt5 1/1 Running 0 163m $ oc get po -n hello-gcr -o yaml | grep image - image: us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest - image: gcr.io/openshift-qe/hello-gcr:latest
Revert https://issues.redhat.com//browse/CNV-39065
as we don't need this chart anymore
Description of problem:
On the MultiNetworkPolicies page, the "learn more" link does not work
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Go to Networking -> NetworkPolicies -> MultiNetworkPolicies 2. 3.
Actual results:
Expected results:
Additional info:
It has been observed that the esp_offload kernel module might be loaded by libreswan even if bond ESP offloads have been correctly turned off.
This might be because the ipsec service and configure-ovs run at the same time, so it is possible that the ipsec service starts when bond offloads are not yet turned off, tricking libreswan into thinking they should be used.
The potential fix would be to run ipsec service after configure-ovs.
Description of problem:
Re-enable the knative and A-04-TC01 tests that were disabled in the PR https://github.com/openshift/console/pull/13931
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/prometheus/pull/226
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Update the installer to use commit c6bcd313bce0fc9866e41bb9e3487d9f61c628a3 of cluster-api-provider-ibmcloud. This includes a couple of necessary Transit Gateway fixes.
In order to ease CI builds and Konflux integration, and to standardise with other observability plugins, we need to migrate away from yarn and use npm.
The monitoring plugin uses npm instead of yarn for development and in Dockerfiles
Please review the following PR: https://github.com/openshift/cluster-api/pull/222
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-cluster-api-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Please review the following PR: https://github.com/openshift/machine-api-provider-nutanix/pull/80
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
oc-mirror should not panic when an invalid loglevel is specified
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202407291514.p0.gdbf115f.assembly.stream.el9-dbf115f", GitCommit:"dbf115f547a19f12ab72e7b326be219a47d460a0", GitTreeState:"clean", BuildDate:"2024-07-29T15:52:52Z", GoVersion:"go1.22.4 (Red Hat 1.22.4-2.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
100%
Steps to Reproduce:
1. Run command: `oc-mirror -c config-36410.yaml --from file://out36410 docker://quay.io/zhouying7780/36410test --v2 --loglevel -h`
Actual results:
The command panic with error: oc-mirror -c config-36410.yaml --from file://out36410 docker://quay.io/zhouying7780/36410test --v2 --loglevel -h 2024/07/31 05:22:41 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready. 2024/07/31 05:22:41 [INFO] : 👋 Hello, welcome to oc-mirror 2024/07/31 05:22:41 [INFO] : ⚙️ setting up the environment for you... 2024/07/31 05:22:41 [INFO] : 🔀 workflow mode: diskToMirror 2024/07/31 05:22:41 [ERROR] : parsing config error parsing local storage configuration : invalid loglevel -h Must be one of [error, warn, info, debug] panic: StorageDriver not registered: goroutine 1 [running]:github.com/distribution/distribution/v3/registry/handlers.NewApp({0x5634e98, 0x76ea4a0}, 0xc000a7c388) /go/src/github.com/openshift/oc-mirror/vendor/github.com/distribution/distribution/v3/registry/handlers/app.go:126 +0x2374github.com/distribution/distribution/v3/registry.NewRegistry({0x5634e98?, 0x76ea4a0?}, 0xc000a7c388) /go/src/github.com/openshift/oc-mirror/vendor/github.com/distribution/distribution/v3/registry/registry.go:141 +0x56github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).setupLocalStorage(0xc000a78488) /go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:571 +0x3c6github.com/openshift/oc-mirror/v2/internal/pkg/cli.NewMirrorCmd.func1(0xc00090f208, {0xc0007ae300, 0x1, 0x8}) /go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:201 +0x27fgithub.com/spf13/cobra.(*Command).execute(0xc00090f208, {0xc0000520a0, 0x8, 0x8}) /go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:987 +0xab1github.com/spf13/cobra.(*Command).ExecuteC(0xc00090f208) /go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:1115 +0x3ffgithub.com/spf13/cobra.(*Command).Execute(0x74bc8d8?) /go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:1039 +0x13main.main() /go/src/github.com/openshift/oc-mirror/cmd/oc-mirror/main.go:10 +0x18
Expected results:
Exit with an error; it should not panic
Please review the following PR: https://github.com/openshift/aws-encryption-provider/pull/21
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Update OWNERS subcomponents for Cluster API Providers
context: https://redhat-internal.slack.com/archives/GE2HQ9QP4/p1725451236624749
https://github.com/openshift/origin/pull/28945 is permafailing on metal
https://github.com/openshift/api/pull/1988 may need to be reverted?
Currently, CMO only tests that the plugin Deployment is rolled out with the appropriate config https://github.com/openshift/cluster-monitoring-operator/blob/f7e92e869c43fa0455d656dcfc89045b60e5baa1/test/e2e/config_test.go#L730
The plugin Deployment does not set any readinessProbe; we're missing a check to ensure the plugin is ready to serve requests.
—
With the new plugin backend, a readiness probe can/will be added (see https://github.com/openshift/cluster-monitoring-operator/pull/2412#issuecomment-2315085438); that will help ensure minimal readiness on payload test flavors.
The CMO test can be more demanding and ask for /plugin-manifest.json
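A minimal sketch of what such a probe could look like on the monitoring-plugin container, assuming the plugin serves its manifest at /plugin-manifest.json over HTTPS on port 9443 (both are assumptions here); the timing values are illustrative, not the merged implementation:
~~~
# Hypothetical readinessProbe for the monitoring-plugin container.
# Path, port and timings are assumptions for illustration only.
readinessProbe:
  httpGet:
    scheme: HTTPS
    path: /plugin-manifest.json
    port: 9443
  initialDelaySeconds: 2
  periodSeconds: 10
  failureThreshold: 3
~~~
With a probe like this in place, the CMO e2e test could wait for the Deployment to report ready replicas instead of only checking that the rollout finished.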
Description of problem:
See:
event happened 183 times, something is wrong: node/ip-10-0-52-0.ec2.internal hmsg/9cff2a8527 - reason/ErrorUpdatingResource error creating gateway for node ip-10-0-52-0.ec2.internal: failed to configure the policy based routes for network "default": invalid host address: 10.0.52.0/18 (17:55:20Z) result=reject |
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
If it is an internal RedHat testing failure:
If it is a CI failure:
If it is a customer / SD issue:
Please review the following PR: https://github.com/openshift/agent-installer-utils/pull/36
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
When the console-operator performs the health check for the active console route, the retry interval is 50ms, which is too short. It should be bumped to at least a couple of seconds to prevent a burst of requests that would likely return the same result and thus be misleading. We also need to add additional logging around the health check for better debugging.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The GCP environment parameters are missing in a GCP STS environment. Based on feature https://issues.redhat.com/browse/CONSOLE-4176: if the cluster is in GCP WIF mode and the operator claims support for it, the operator subscription page should provide 4 additional fields to configure, which will be set on the Subscription's spec.config.env field.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-23-112324
How reproducible:
Always
Steps to Reproduce:
1. Prepare a cluster with GCP WIF mode enabled 2. Navigate to the OperatorHub page and select 'Auth Token GCP' in the Infrastructure features section 3. Choose one operator and click the install button (eg: Web Terminal) 4. Check the Operator subscription page /operatorhub/subscribe?pkg=web-terminal&catalog=redhat-operators&catalogNamespace=openshift-marketplace&targetNamespace=undefined&channel=fast&version=1.11.0&tokenizedAuth=null
Actual results:
The functionality for feature CONSOLE-4176 is missing
Expected results:
1. The WIF warning message should be shown on the subscription page 2. The user can set POOL_ID, PROVIDER_ID, SERVICE_ACCOUNT_EMAIL on the page
Additional info:
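For reference, a hedged sketch of the Subscription shape the feature is expected to produce; the operator name and placeholder values below are illustrative assumptions, and only the three env vars named above are shown:
~~~
# Hypothetical Subscription after the console sets the WIF fields.
# All names and values are placeholders for illustration.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: web-terminal
  namespace: openshift-operators
spec:
  channel: fast
  name: web-terminal
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  config:
    env:
    - name: POOL_ID
      value: <workload-identity-pool-id>
    - name: PROVIDER_ID
      value: <workload-identity-provider-id>
    - name: SERVICE_ACCOUNT_EMAIL
      value: <service-account-email>
~~~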
Please review the following PR: https://github.com/openshift/cloud-provider-aws/pull/94
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Adding an ATB to the HC doesn't create a new user token and doesn't reconcile to the worker nodes
Version-Release number of selected component (if applicable):
4.18 nightly
How reproducible:
100%
Steps to Reproduce:
1. Create a 4.18 nightly HC 2. Add an ATB to the HC 3. Notice that no new user token is generated
Actual results:
no new user token generated so no new payload
Expected results:
new user token generated with new payload
Additional info:
Hello Team,
When we deploy a HyperShift cluster with OpenShift Virtualization and specify the NodePort strategy for services, the requests to ignition, oauth, konnectivity (for oc rsh, oc logs, oc exec) and the virt-launcher (HyperShift node pool) pods fail, because the following netpols get created automatically by default and restrict the traffic on all other ports.
$ oc get netpol NAME POD-SELECTOR AGE kas app=kube-apiserver 153m openshift-ingress <none> 153m openshift-monitoring <none> 153m same-namespace <none> 153m
I resolved this by manually creating the following network policies:
$ cat ingress-netpol
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ingress
spec:
  ingress:
  - ports:
    - port: 31032
      protocol: TCP
  podSelector:
    matchLabels:
      kubevirt.io: virt-launcher
  policyTypes:
  - Ingress
$ cat oauth-netpol
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: oauth
spec:
  ingress:
  - ports:
    - port: 6443
      protocol: TCP
  podSelector:
    matchLabels:
      app: oauth-openshift
      hypershift.openshift.io/control-plane-component: oauth-openshift
  policyTypes:
  - Ingress
$ cat ignition-netpol
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: nodeport-ignition-proxy
spec:
  ingress:
  - ports:
    - port: 8443
      protocol: TCP
  podSelector:
    matchLabels:
      app: ignition-server-proxy
  policyTypes:
  - Ingress
$ cat konn-netpol
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: konn
spec:
  ingress:
  - ports:
    - port: 8091
      protocol: TCP
  podSelector:
    matchLabels:
      app: kube-apiserver
      hypershift.openshift.io/control-plane-component: kube-apiserver
  policyTypes:
  - Ingress
The bug for ignition netpol has already been reported.
--> https://issues.redhat.com/browse/OCPBUGS-39158
--> https://issues.redhat.com/browse/OCPBUGS-39317
It would be helpful if these policies were created automatically as well, or if HyperShift offered an option to disable the automatic management of network policies so that we can take care of the network policies manually.
Description of problem:
ose-aws-efs-csi-driver-operator has an invalid `tools` reference that causes the build to fail; this issue is due to https://github.com/openshift/csi-operator/pull/252/files#r1719471717
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
"import * as icon from '[...].svg' " imports cause errors on webpack5/rspack (can't convert value to primitive type). They should be rewritten as "import icon from '[...].svg'"
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
This is a follow-up issue to https://issues.redhat.com/browse/OCPBUGS-4496
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-08-130531
How reproducible:
Always
Steps to Reproduce:
1. create a ConfigMap ConsoleYAMLSample without 'snippet: true' apiVersion: console.openshift.io/v1 kind: ConsoleYAMLSample metadata: name: cm-example-without-snippet spec: targetResource: apiVersion: v1 kind: ConfigMap title: Example ConfigMap description: An example ConfigMap YAML sample yaml: | apiVersion: v1 kind: ConfigMap metadata: name: game-demo data: player_initial_lives: "3" ui_properties_file_name: "user-interface.properties" game.properties: | enemy.types=aliens,monsters player.maximum-lives=5 user-interface.properties: | color.good=purple color.bad=yellow allow.textmode=true 2. goes to ConfigMap creation page -> YAML view 3. create a ConfigMap ConsoleYAMLSample WITH 'snippet: true' apiVersion: console.openshift.io/v1 kind: ConsoleYAMLSample metadata: name: cm-example-without-snippet spec: targetResource: apiVersion: v1 kind: ConfigMap title: Example ConfigMap description: An example ConfigMap YAML sample snippet: true yaml: | apiVersion: v1 kind: ConfigMap metadata: name: game-demo data: player_initial_lives: "3" ui_properties_file_name: "user-interface.properties" game.properties: | enemy.types=aliens,monsters player.maximum-lives=5 user-interface.properties: | color.good=purple color.bad=yellow allow.textmode=true 4. goes to ConfigMap creation page -> YAML view
Actual results:
2. Sample tab doesn't show up 4. Snippet tab appears
Expected results:
2. Sample tab should show up when there is no snippet: true
Additional info:
Description of problem:
Test Platform has detected a large increase in the amount of time spent waiting for pull secrets to be initialized. Monitoring the audit log, we can see nearly continuous updates to the SA pull secrets in the cluster (~2 per minute for every SA pull secret in the cluster). The controller manager is filled with entries like:
- "Internal registry pull secret auth data does not contain the correct number of entries" ns="ci-op-tpd3xnbx" name="deployer-dockercfg-p9j54" expected=5 actual=4"
- "Observed image registry urls" urls=["172.30.228.83:5000","image-registry.openshift-image-registry.svc.cluster.local:5000","image-registry.openshift-image-registry.svc:5000","registry.build01.ci.openshift.org","registry.build01.ci.openshift.org"
In this "Observed image registry urls" log line, notice the duplicate entries for "registry.build01.ci.openshift.org". We are not sure what is causing this duplicate entry, but when actualized in a pull secret map, the double entry is reduced to one, so the controller-manager finds the cardinality mismatch on the next check.
The duplication is evident in OpenShiftControllerManager/cluster:
dockerPullSecret:
  internalRegistryHostname: image-registry.openshift-image-registry.svc:5000
  registryURLs:
  - registry.build01.ci.openshift.org
  - registry.build01.ci.openshift.org
But there is only one hostname in config.imageregistry.operator.openshift.io/cluster:
routes:
- hostname: registry.build01.ci.openshift.org
  name: public-routes
  secretName: public-route-tls
Version-Release number of selected component (if applicable):
4.17.0-rc.3
How reproducible:
Constant on build01 but not on other build farms
Steps to Reproduce:
1. Something ends up creating duplicate entries in the observed configuration of the openshift-controller-manager. 2. 3.
Actual results:
- Approximately 400K secret patches an hour on build01 vs ~40K on other build farms. Initialization times have increased by two orders of magnitude in new ci-operator namespaces. - The openshift-controller-manager is hot looping and experiencing client throttling.
Expected results:
1. Initialization of pull secrets in a namespace should take < 1 seconds. On build01, it can take over 1.5 minutes. 2. openshift-controller-manager should not possess duplicate entries. 3. If duplicate entries are a configuration error, openshift-controller-manager should de-dupe the entries. 4. There should be alerting when the openshift-controller-manager experiences client-side throttling / pathological behavior.
Additional info:
Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/238
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-cluster-machine-approver-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Description of problem:
Topology screen crashes and reports "Oh no! something went wrong" when a pod in completed state is selected.
Version-Release number of selected component (if applicable):
RHOCP 4.15.18
How reproducible:
100%
Steps to Reproduce:
1. Switch to developer mode 2. Select Topology 3. Select a project that has completed cron jobs like openshift-image-registry 4. Click the green CronJob Object 5. Observe Crash
Actual results:
The Topology screen crashes with error "Oh no! Something went wrong."
Expected results:
After clicking the completed pod / workload, the screen should display the information related to it.
Additional info:
The error below was solved in this PR https://github.com/openshift/hypershift/pull/4723, but we can do better sanitisation of the IgnitionServer payload. This is the suggestion from Alberto in Slack: https://redhat-internal.slack.com/archives/G01QS0P2F6W/p1726257008913779?thread_ts=1726241321.475839&cid=G01QS0P2F6W
✗ [High] Cross-site Scripting (XSS) Path: ignition-server/cmd/start.go, line 250 Info: Unsanitized input from an HTTP header flows into Write, where it is used to render an HTML page returned to the user. This may result in a Reflected Cross-Site Scripting attack (XSS).
Description of problem:
When viewing binary secret data, we also provide the 'Reveal/Hide values' option, which is redundant
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-22-123921
How reproducible:
Always
Steps to Reproduce:
1. create a Key/Value secret when the data is binary file, Workloads -> Secrets -> Create Key/value secret -> upload binary file as secret data -> Create 2. check data on Secret details page 3.
Actual results:
2. Both options, Save file and Reveal/Hide values, are provided. But the `Reveal/Hide values` button makes no sense since the data is a binary file
Expected results:
2. Only show 'Save file' option for binary data
Additional info:
Please review the following PR: https://github.com/openshift/coredns/pull/130
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
HostedCluster dump not working anymore in OpenStack CI
Version-Release number of selected component (if applicable):
4.18 and 4.17
Description of problem:
In the case of OpenStack, the network operator tries and fails to update the infrastructure resource.
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
always
Steps to Reproduce:
1. Install hypershift 2. Create openstack hosted cluster
Actual results:
Network operator fails to report as available due to: - lastTransitionTime: "2024-08-22T15:54:16Z" message: 'Error while updating infrastructures.config.openshift.io/cluster: failed to apply / update (config.openshift.io/v1, Kind=Infrastructure) /cluster: infrastructures.config.openshift.io "cluster" is forbidden: ValidatingAdmissionPolicy ''config'' with binding ''config-binding'' denied request: This resource cannot be created, updated, or deleted. Please ask your administrator to modify the resource in the HostedCluster object.' reason: UpdateInfrastructureSpecOrStatus status: "True" type: network.operator.openshift.io/Degraded
Expected results:
Cluster operator becomes available
Additional info:
This is a bug introduced with https://github.com/openshift/hypershift/pull/4303
Please review the following PR: https://github.com/openshift/kubernetes-metrics-server/pull/32
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem: Unnecessary warning notification message on debug pod.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Use an imageSetConfig with an operator catalog.
Do a mirror-to-mirror.
Without removing the working-dir or the cache, do a mirror-to-mirror again.
It fails with the error: filtered declarative config not found
We think that low disk space is likely the cause of https://issues.redhat.com/browse/OCPBUGS-37785
It's not immediately obvious that this happened during the run without digging into the events.
Could we create a new test to enforce that the kubelet never reports disk pressure during a run?
Rebase openshift/etcd to latest 3.5.17 upstream release.
Description of problem:
IHAC who is facing the same problem as OCPBUGS-17356 in an OCP 4.16 cluster. There is no ContainerCreating pod, and the alert firing appears to be a false positive.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.16.15
Additional info:
This is very similar to OCPBUGS-17356
Description of problem:
When we try to delete the MachineOSConfig while it is still in the building state, the resources related to the MOSC are deleted, but the associated ConfigMap is not. Hence, when we try to apply the MOSC again in the same pool, the status of the MOSB is not properly generated. To resolve the issue we have to manually delete the leftover ConfigMap resources.
Version-Release number of selected component (if applicable):
$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.16.0-0.nightly-2024-07-10-022831 True False 3h15m Cluster version is 4.16.0-0.nightly-2024-07-10-022831
How reproducible:
Steps to Reproduce:
1. Create CustomMCP 2. Apply any MOSC 3. Delete the MOSC while it is still in building stage 4. Again apply the MOSC 5. Check the MOSB status oc get machineosbuilds. NAME PREPARED BUILDING SUCCEEDED INTERRUPTED FAILED infra-rendered-infra-371dc5d02dbe0bb5712857393db95bf3-builder False
Actual results:
oc get machineosbuilds NAME PREPARED BUILDING SUCCEEDED INTERRUPTED FAILED infra-rendered-infra-371dc5d02dbe0bb5712857393db95bf3-builder False
Expected results:
We should be able to see the status
Additional info:
Check the logs of machine-os-builder $ oc logs machine-os-builder-74d56b55cf-mp6mv | grep -i error I0710 11:05:56.750770 1 build_controller.go:474] Error syncing machineosbuild infra3: could not start build for MachineConfigPool infra: could not load rendered MachineConfig mc-rendered-infra-371dc5d02dbe0bb5712857393db95bf3 into configmap: configmaps "mc-rendered-infra-371dc5d02dbe0bb5712857393db95bf3" already exists
After looking at this test run we need to validate the following scenarios:
Do the monitor tests in openshift/origin accurately test these scenarios?
Please review the following PR: https://github.com/openshift/oc/pull/1866
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/images/pull/196
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
ROSA HCP allows customers to select hostedcluster and nodepool OCP z-stream versions, respecting version skew requirements. E.g.:
Version-Release number of selected component (if applicable):
Reproducible on 4.14-4.16.z, this bug report demonstrates it for a 4.15.28 hostedcluster with a 4.15.25 nodepool
How reproducible:
100%
Steps to Reproduce:
1. Create a ROSA HCP cluster, which comes with a 2-replica nodepool with the same z-stream version (4.15.28) 2. Create an additional nodepool at a different version (4.15.25)
Actual results:
Observe that while nodepool objects report the different version (4.15.25), the resulting kernel version of the node is that of the hostedcluster (4.15.28) ❯ k get nodepool -n ocm-staging-2didt6btjtl55vo3k9hckju8eeiffli8 NAME CLUSTER DESIRED NODES CURRENT NODES AUTOSCALING AUTOREPAIR VERSION UPDATINGVERSION UPDATINGCONFIG MESSAGE mshen-hyper-np-4-15-25 mshen-hyper 1 1 False True 4.15.25 False False mshen-hyper-workers mshen-hyper 2 2 False True 4.15.28 False False ❯ k get no -owide NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME ip-10-0-129-139.us-west-2.compute.internal Ready worker 24m v1.28.12+396c881 10.0.129.139 <none> Red Hat Enterprise Linux CoreOS 415.92.202408100433-0 (Plow) 5.14.0-284.79.1.el9_2.aarch64 cri-o://1.28.9-5.rhaos4.15.git674ed4c.el9 ip-10-0-129-165.us-west-2.compute.internal Ready worker 98s v1.28.12+396c881 10.0.129.165 <none> Red Hat Enterprise Linux CoreOS 415.92.202408100433-0 (Plow) 5.14.0-284.79.1.el9_2.aarch64 cri-o://1.28.9-5.rhaos4.15.git674ed4c.el9 ip-10-0-132-50.us-west-2.compute.internal Ready worker 30m v1.28.12+396c881 10.0.132.50 <none> Red Hat Enterprise Linux CoreOS 415.92.202408100433-0 (Plow) 5.14.0-284.79.1.el9_2.aarch64 cri-o://1.28.9-5.rhaos4.15.git674ed4c.el9
Expected results:
Additional info:
Description of problem:
When running the `make fmt` target in the repository the command can fail due to a mismatch of versions between the go language and the goimports dependency.
Version-Release number of selected component (if applicable):
4.16.z
How reproducible:
always
Steps to Reproduce:
1.checkout release-4.16 branch 2.run `make fmt`
Actual results:
INFO[2024-10-01T14:41:15Z] make fmt make[1]: Entering directory '/go/src/github.com/openshift/cluster-cloud-controller-manager-operator' hack/goimports.sh go: downloading golang.org/x/tools v0.25.0 go: golang.org/x/tools/cmd/goimports@latest: golang.org/x/tools@v0.25.0 requires go >= 1.22.0 (running go 1.21.11; GOTOOLCHAIN=local)
Expected results:
successful completion of `make fmt`
Additional info:
Our goimports.sh script references `goimports@latest`, which means this problem will most likely affect older branches as well; we would need to pin a specific version of the goimports package for those branches. Given that the CCCMO includes golangci-lint and uses it for a test, we should include goimports through golangci-lint, which will solve this problem without needing special versions of goimports.
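A minimal sketch of how goimports could be enabled through the existing golangci-lint setup, assuming the v1 configuration format; the local-prefixes value is illustrative:
~~~
# Hypothetical .golangci.yaml fragment; the module path used for
# local-prefixes is an assumption for illustration.
linters:
  enable:
    - goimports
linters-settings:
  goimports:
    local-prefixes: github.com/openshift/cluster-cloud-controller-manager-operator
~~~
Running the existing golangci-lint target would then cover import formatting without a separate goimports download.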
Description of problem:
OLM 4.17 references 4.16 catalogs
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. oc get pods -n openshift-marketplace -o yaml | grep "image: registry.redhat.io"
Actual results:
image: registry.redhat.io/redhat/certified-operator-index:v4.16 image: registry.redhat.io/redhat/certified-operator-index:v4.16 image: registry.redhat.io/redhat/community-operator-index:v4.16 image: registry.redhat.io/redhat/community-operator-index:v4.16 image: registry.redhat.io/redhat/redhat-marketplace-index:v4.16 image: registry.redhat.io/redhat/redhat-marketplace-index:v4.16 image: registry.redhat.io/redhat/redhat-operator-index:v4.16 image: registry.redhat.io/redhat/redhat-operator-index:v4.16
Expected results:
image: registry.redhat.io/redhat/certified-operator-index:v4.17 image: registry.redhat.io/redhat/certified-operator-index:v4.17 image: registry.redhat.io/redhat/community-operator-index:v4.17 image: registry.redhat.io/redhat/community-operator-index:v4.17 image: registry.redhat.io/redhat/redhat-marketplace-index:v4.17 image: registry.redhat.io/redhat/redhat-marketplace-index:v4.17 image: registry.redhat.io/redhat/redhat-operator-index:v4.17 image: registry.redhat.io/redhat/redhat-operator-index:v4.17
Additional info:
Description of problem:
With 'Configuring a private storage endpoint on Azure by enabling the Image Registry Operator to discover VNet and subnet names'[1], if a cluster is created with the internal Image Registry, it creates a storage account with a private endpoint. Once a new PVC uses the same skuName as this private storage account, it hits the mount permission issue. [1] https://docs.openshift.com/container-platform/4.16/post_installation_configuration/configuring-private-cluster.html#configuring-private-storage-endpoint-azure-vnet-subnet-iro-discovery_configuring-private-cluster
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Always
Steps to Reproduce:
1. Create a cluster with the flexy job profile aos-4_17/ipi-on-azure/versioned-installer-customer_vpc-disconnected-fully_private_cluster-arm and specify enable_internal_image_registry: "yes" 2. Create a pod and PVC with the azurefile-csi SC
Actual results:
pod failed to up due to mount error: mount //imageregistryciophgfsnrc.file.core.windows.net/pvc-facecce9-d4b5-4297-b253-9a6200642392 on /var/lib/kubelet/plugins/kubernetes.io/csi/file.csi.azure.com/b4b5e52fb1d21057c9644d0737723e8911d9519ec4c8ddcfcd683da71312a757/globalmount failed with mount failed: exit status 32 Mounting command: mount Mounting arguments: -t cifs -o mfsymlinks,cache=strict,nosharesock,actimeo=30,gid=1018570000,file_mode=0777,dir_mode=0777, //imageregistryciophgfsnrc.file.core.windows.net/pvc-facecce9-d4b5-4297-b253-9a6200642392 /var/lib/kubelet/plugins/kubernetes.io/csi/file.csi.azure.com/b4b5e52fb1d21057c9644d0737723e8911d9519ec4c8ddcfcd683da71312a757/globalmount Output: mount error(13): Permission denied
Expected results:
Pod should be up
Additional info:
We have simple workarounds like using a StorageClass with networkEndpointType: privateEndpoint or specifying another storage account, but using the pre-defined StorageClass azurefile-csi will fail, and the automation is not easy to work around. I'm not sure whether the CSI driver could check if the reused storage account has a private endpoint before using the existing storage account.
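A sketch of the workaround StorageClass mentioned above, assuming the Azure File CSI driver's networkEndpointType parameter; everything else mirrors a typical azurefile-csi class and is an assumption, not the shipped default:
~~~
# Hypothetical workaround StorageClass for the private-endpoint case.
# Parameters other than networkEndpointType are illustrative assumptions.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi-private
provisioner: file.csi.azure.com
parameters:
  skuName: Standard_LRS
  networkEndpointType: privateEndpoint
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
~~~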
Description of problem:
Running "make fmt" in the repository fails with an error about a version mismatch between goimports and the go language version.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. checkout release-4.16 branch 2. run "make fmt" (with golang version 1.21)
Actual results:
openshift-hack/check-fmt.sh go: downloading golang.org/x/tools v0.26.0 go: golang.org/x/tools/cmd/goimports@latest: golang.org/x/tools@v0.26.0 requires go >= 1.22.0 (running go 1.21.11; GOTOOLCHAIN=local) make: *** [openshift.mk:18: fmt] Error 1
Expected results:
completion without errors
Additional info:
This is affecting us currently with 4.16 and earlier, but it will become a persistent problem over time. We can correct this by using a holistic approach, such as calling goimports from the binary that is included in our build images.
Please review the following PR: https://github.com/openshift/csi-operator/pull/114
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Machine got stuck in Provisioning phase after the EC2 gets terminated by AWS. The scenario I got this problem was when running an rehearsal cluster in a under development[1] job[2] for AWS Local Zone. The EC2 created through MachineSet template was launched in the Local Zone us-east-1-qro-1a, but the instance was terminated right after it was created with this message[3] (From AWS Console): ~~~ Client.VolumeLimitExceeded: Volume limit exceeded. You have exceeded the maximum gp2 storage limit of 87040 GiB in this location. Please contact AWS Support for more information. ~~~ When I saw this problem in the Console, I removed the Machine object and the MAPI was able to create a new instance in the same Zone: ~~~ $ oc rsh pod/e2e-aws-ovn-shared-vpc-localzones-openshift-e2e-test Defaulted container "test" out of: test, sidecar, ci-scheduling-dns-wait (init), place-entrypoint (init), cp-entrypoint-wrapper (init) sh-4.4$ oc get machines -A NAMESPACE NAME PHASE TYPE REGION ZONE AGE openshift-machine-api ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx Provisioning 45m sh-4.4$ oc delete machine ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx -n openshift-machine-api machine.machine.openshift.io "ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx" deleted (...) $ oc rsh pod/e2e-aws-ovn-shared-vpc-localzones-openshift-e2e-test Defaulted container "test" out of: test, sidecar, ci-scheduling-dns-wait (init), place-entrypoint (init), cp-entrypoint-wrapper (init) sh-4.4$ oc get machines -n openshift-machine-api -w NAME PHASE TYPE REGION ZONE AGE ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-v675j Provisioned c5.2xlarge us-east-1 us-east-1-qro-1a 2m6s ~~~ The job[2] didn't finish successfully due the timeout checking for node readiness, but the Machine got provisioned correctly (without Console errors) and kept in running state. 
The main problem I can see in the logs of Machine Controller is an endless loop trying to reconcile an terminated machine/instance (i-0fc8f2e7fe7bba939): ~~~ 2023-06-20T19:38:01.016776717Z I0620 19:38:01.016760 1 controller.go:156] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciling Machine 2023-06-20T19:38:01.016776717Z I0620 19:38:01.016767 1 actuator.go:108] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: actuator checking if machine exists 2023-06-20T19:38:01.079829331Z W0620 19:38:01.079800 1 reconciler.go:481] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Failed to find existing instance by id i-0fc8f2e7fe7bba939: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down 2023-06-20T19:38:01.132099118Z E0620 19:38:01.132063 1 utils.go:236] Excluding instance matching ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down 2023-06-20T19:38:01.132099118Z I0620 19:38:01.132080 1 reconciler.go:296] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Instance does not exist 2023-06-20T19:38:01.132146892Z I0620 19:38:01.132096 1 controller.go:349] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciling machine triggers idempotent create 2023-06-20T19:38:01.132146892Z I0620 19:38:01.132101 1 actuator.go:81] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: actuator creating machine 2023-06-20T19:38:01.132489856Z I0620 19:38:01.132460 1 reconciler.go:41] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: creating machine 2023-06-20T19:38:01.190935211Z W0620 19:38:01.190901 1 reconciler.go:481] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Failed to find existing instance by id i-0fc8f2e7fe7bba939: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down 2023-06-20T19:38:01.238693678Z E0620 19:38:01.238661 1 utils.go:236] Excluding instance matching ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down 2023-06-20T19:38:01.238693678Z I0620 19:38:01.238680 1 machine_scope.go:90] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: patching machine 2023-06-20T19:38:01.249796760Z E0620 19:38:01.249761 1 actuator.go:72] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx error: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciler failed to Create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Machine was already created, InstanceID is set in providerStatus. Possible eventual-consistency discrepancy; returning an error to requeue 2023-06-20T19:38:01.249824958Z W0620 19:38:01.249796 1 controller.go:351] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: failed to create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciler failed to Create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Machine was already created, InstanceID is set in providerStatus. 
Possible eventual-consistency discrepancy; returning an error to requeue 2023-06-20T19:38:01.249858967Z E0620 19:38:01.249847 1 controller.go:324] "msg"="Reconciler error" "error"="ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciler failed to Create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Machine was already created, InstanceID is set in providerStatus. Possible eventual-consistency discrepancy; returning an error to requeue" "controller"="machine-controller" "name"="ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx" "namespace"="openshift-machine-api" "object"={"name":"ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx","namespace":"openshift-machine-api"} "reconcileID"="8890f9f7-2fbf-441d-a8b7-a52ec5f4ae2f" ~~~ I also reviewed the Account quotas for EBS gp2 and we are under the limits. The second machine was also provisioned, so I would discard any account quotas, and focus on the capacity issues in the Zone - considering Local Zone does not have high capacity as regular zones, it could happen more frequently. I am asking the AWS teams a RCA, asking more clarification how we can programatically get this error (maybe EC2 API, I didn't described the EC2 when the event happened). [1] https://github.com/openshift/release/pull/39902#issuecomment-1599559108 [2] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/39902/rehearse-39902-pull-ci-openshift-installer-master-e2e-aws-ovn-shared-vpc-localzones/1671215930459295744 [3] https://user-images.githubusercontent.com/3216894/247285243-3cd28306-2972-4576-a9a6-a620e01747a6.png
Version-Release number of selected component (if applicable):
4.14.0-0.ci.test-2023-06-20-191559-ci-op-ljs7pd35-latest
How reproducible:
- Rarely, by AWS (mainly in zone capacity issues - an RCA has been requested from AWS to check if we can find options to reproduce)
Steps to Reproduce:
this is hard to reproduce as the EC2 had been terminated by AWS. I created one script to watch the specific subnet ID and terminate any instances created on it instantaneously, but the Machine is going to the Failed phase and getting stuck on it - and not the "Provisioning" as we got in the CI job. Steps to try to reproduce: 1. Create a cluster with Local Zone support: https://docs.openshift.com/container-platform/4.13/installing/installing_aws/installing-aws-localzone.html 2. Wait for the cluster be created 3. Scale down the MachineSet for the Local Zone 4. Start a new Terminal(#2): watch and terminate EC2 instance created in an Local Zone subnet (example: us-east-1-bue-1a) ~~~ machineset_monitor="byonet1-sc9fb-edge-us-east-1-bue-1a" # discover the subnet ID subnet_id=$(oc get machineset $machineset_monitor -n openshift-machine-api -o json | jq -r .spec.template.spec.providerSpec.value.subnet.id) # discover the zone name zone_name="$(aws ec2 describe-subnets --subnet-ids $subnet_id --query 'Subnets[].AvailabilityZone' --output text)" # Discover instance ids in the subnet and terminate it while true; do echo "$(date): Getting instance in the zone ${zone_name} / subnet ${subnet_id}..." instance_ids=$(aws ec2 describe-instances --filters Name=subnet-id,Values=$subnet_id Name=instance-state-name,Values=pending,running,shutting-down,stopping --query 'Reservations[].Instances[].InstanceId' --output text) echo "$(date): Instances retrieved: $instance_ids" if [[ -n "$instance_ids" ]]; then echo "Terminating instances..." aws ec2 terminate-instances --instance-ids $instance_ids sleep 1 else echo "Awaiting..." sleep 2 fi done ~~~ 4. Scale up the MachineSet 5. Observe the Machines
Actual results:
Expected results:
- The Machine moves to the Failed phase when the EC2 instance is terminated by AWS, or - the Machine self-recovers when the EC2 instance is deleted/terminated (by deleting the Machine object when it is managed by a MachineSet), so we can avoid manual steps
Additional info:
Description of problem:
release-4.18 of openshift/cloud-provider-openstack should be based off upstream release-1.31 branch.
Description of problem:
Multiple monitoring-plugin Pods log the /health response code every 10s, so there will be too many log entries as time goes by
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-09-150616
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
% oc -n openshift-monitoring logs monitoring-plugin-76b8c847f6-m872m time="2024-09-10T07:55:52Z" level=info msg="enabled features: []\n" module=main time="2024-09-10T07:55:52Z" level=warning msg="cannot read config file, serving plugin with default configuration, tried /etc/plugin/config.yaml" error="open /etc/plugin/config.yaml: no such file or directory" module=server time="2024-09-10T07:55:52Z" level=info msg="listening on https://:9443" module=server 10.128.2.2 - - [10/Sep/2024:07:55:53 +0000] "GET /health HTTP/2.0" 200 2 10.128.2.2 - - [10/Sep/2024:07:55:58 +0000] "GET /health HTTP/2.0" 200 2 10.128.2.2 - - [10/Sep/2024:07:56:08 +0000] "GET /health HTTP/2.0" 200 2 10.128.2.2 - - [10/Sep/2024:07:56:18 +0000] "GET /health HTTP/2.0" 200 2 10.128.2.2 - - [10/Sep/2024:07:56:28 +0000] "GET /health HTTP/2.0" 200 2 10.128.2.2 - - [10/Sep/2024:07:56:38 +0000] "GET /health HTTP/2.0" 200 2 ... $ oc -n openshift-monitoring logs monitoring-plugin-76b8c847f6-m872m | grep "GET /health HTTP/2.0" | wc -l 1967
Expected results:
Before we switched to the golang backend, there were usually not many logs
Additional info:
Description of problem:
Running https://github.com/shiftstack/installer/blob/master/docs/user/openstack/README.md#openstack-credentials-update leads to Cinder PVCs stuck in Terminating status:
$ oc get pvc -A NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE cinder-test pvc-0 Terminating pvc-d7d37d04-d8d1-4a61-a3bc-c038e53a13c7 1Gi RWO standard-csi <unset> 12h cinder-test pvc-1 Terminating pvc-32049f0e-b842-4e54-aff8-5f41f51b3c54 1Gi RWO standard-csi <unset> 12h cinder-test pvc-2 Terminating pvc-3eb42d8a-f22f-418b-881e-21c913b89c56 1Gi RWO standard-csi <unset> 12h
The cinder-csi-controller reports the error below:
E1022 07:21:11.772540 1 utils.go:95] [ID:4401] GRPC error: rpc error: code = Internal desc = DeleteVolume failed with error Expected HTTP response code [202 204] when accessing [DELETE https://10.46.44.159:13776/v3/c27fbb9d859e40cc9 6f82e47b5ceebd6/volumes/bd5e6cf9-f27e-4aff-81ac-a83e7bccea86], but got 400 instead: {"badRequest": {"code": 400, "message": "Invalid volume: Volume status must be available or error or error_restoring or error_extending or error_managing and must not be migrating, attached, belong to a group, have snapshots or be disassociated from snapshots after volume transfer."}}
However, in OpenStack, they appear in-use:
stack@undercloud-0 ~]$ OS_CLOUD=shiftstack openstack volume list
/usr/lib/python3.9/site-packages/osc_lib/utils/__init__.py:515: DeprecationWarning: The usage of formatter functions is now discouraged. Consider using cliff.columns.FormattableColumn instead. See reviews linked with bug 1687955 for more
detail.
warnings.warn(
+--------------------------------------+------------------------------------------+-----------+------+------------------------------------------------------+
| ID | Name | Status | Size | Attached to |
+--------------------------------------+------------------------------------------+-----------+------+------------------------------------------------------+
| 093b14c1-a79a-46aa-ab6b-6c71d2adcef9 | pvc-3eb42d8a-f22f-418b-881e-21c913b89c56 | in-use | 1 | Attached to ostest-2nkmx-worker-0-cflkl on /dev/vdd |
| 4342c947-732d-4d23-964c-58bd56b79fd4 | pvc-32049f0e-b842-4e54-aff8-5f41f51b3c54 | in-use | 1 | Attached to ostest-2nkmx-worker-0-cflkl on /dev/vdc |
| 6da3147f-4ce8-4e17-a29a-6f311599a969 | pvc-d7d37d04-d8d1-4a61-a3bc-c038e53a13c7 | in-use | 1 | Attached to ostest-2nkmx-worker-0-cflkl on /dev/vdb |
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-21-010606 RHOS-17.1-RHEL-9-20240701.n.1
How reproducible:
Always (twice in a row)
Additional info:
must-gather provided in private comment
Please review the following PR: https://github.com/openshift/cluster-version-operator/pull/1083
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The "oc adm ocp-certificates regenerate-machine-config-server-serving-cert" is failing
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-10-195326 $ oc version Client Version: 4.18.0-202410080912.p0.g3692450.assembly.stream-3692450 Kustomize Version: v5.4.2 Server Version: 4.18.0-0.nightly-2024-10-10-195326 Kubernetes Version: v1.31.1
How reproducible:
Always
Steps to Reproduce:
1. Execute the "oc adm ocp-certificates regenerate-machine-config-server-serving-cert" with the right oc binary for the tested version 2. 3.
Actual results:
The "oc adm ocp-certificates regenerate-machine-config-server-serving-cert" command fails with this error: $ oc adm ocp-certificates regenerate-machine-config-server-serving-cert W1011 10:13:41.951040 2699876 recorder_logging.go:53] &Event{ObjectMeta:{dummy.17fd5e657c5748ca dummy 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:SecretUpdateFailed,Message:Failed to update Secret/: Secret "machine-config-server-tls" is invalid: type: Invalid value: "kubernetes.io/tls": field is immutable,Source:EventSource{Component:,Host:,},FirstTimestamp:2024-10-11 10:13:41.950941386 +0000 UTC m=+0.377199185,LastTimestamp:2024-10-11 10:13:41.950941386 +0000 UTC m=+0.377199185,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,} The Secret "machine-config-server-tls" is invalid: type: Invalid value: "kubernetes.io/tls": field is immutable
Expected results:
The command should be executed without errors
Additional info:
This line is repeated many times, about once a second when provisioning a new cluster:
level=debug msg= baremetalhost resource not yet available, will retry
Please review the following PR: https://github.com/openshift/cloud-provider-powervs/pull/79
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-image-registry-operator/pull/1113
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-kube-scheduler-operator/pull/548
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/oc/pull/1867
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/oauth-apiserver/pull/118
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cloud-provider-ibm/pull/75
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
must-gather creates an empty monitoring/prometheus/rules.json file due to the error "Unable to connect to the server: x509: certificate signed by unknown authority"
Version-Release number of selected component (if applicable):
4.9
How reproducible:
not sure what customer did on certs
Steps to Reproduce:
1. 2. 3.
Actual results:
monitoring/prometheus/rules.json is empty, while monitoring/prometheus/rules.sterr contains error message "Unable to connect to the server: x509: certificate signed by unknown authority"
Expected results:
As must-gather runs only inside the cluster, it should be safe to skip certificate verification when data is queried from Prometheus
Additional info:
https://attachments.access.redhat.com/hydra/rest/cases/03329385/attachments/e89af78a-3e35-4f1a-a13c-46f05ff755cc?usePresignedUrl=true should contain an example
High flake rate on new EnsureValidatingAdmissionPolicies e2e tests
EnsureValidatingAdmissionPoliciesDontBlockStatusModifications
EnsureValidatingAdmissionPoliciesCheckDeniedRequests
EnsureValidatingAdmissionPoliciesExists
High concentration on quickly completing test clusters like `TestNoneCreateCluster` and `TestHAEtcdChaos`
Please review the following PR: https://github.com/openshift/origin/pull/29071
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Because SNO replaces the api-server during an upgrade, the storage-operator's csi-snapshot-container exits because it cannot retrieve a CR, causing a crash-loop back-off for the period where the api-server is down; this also affects other tests during this same time frame. We will be resolving each one of these individually and updating the tests for the time being to unblock the problems.
Additional context here:
https://redhat-internal.slack.com/archives/C0763QRRUS2/p1728567187172169
Description of problem:
When deploying nodepools on OpenStack, the Nodepool condition complains about unsupported amd64 while we actually support it.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
As part of TRT investigations of k8s API disruptions, we have discovered that there are times when haproxy considers the underlying apiserver as Down, yet from the k8s perspective the apiserver is healthy and functional.
From the customer perspective, during this time any call to the cluster API endpoint will fail. It simply looks like an outage.
Thorough investigation leads us to the following difference in how haproxy perceives apiserver being alive versus how k8s perceives it, i.e.
inter 1s fall 2 rise 3
and
readinessProbe:
  httpGet:
    scheme: HTTPS
    port: 6443
    path: readyz
  initialDelaySeconds: 0
  periodSeconds: 5
  timeoutSeconds: 10
  successThreshold: 1
  failureThreshold: 3
We can see the top check is much stricter. And it belongs to haproxy. As a result, haproxy sees the following
2024-10-08T12:37:32.779247039Z [WARNING] (29) : Server masters/master-2 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 5ms. 0 active and 0 backup servers left. 154 sessions active, 0 requeued, 0 remaining in queue.
much faster than k8s would consider something as wrong.
In order to remediate this issue, it has been agreed the haproxy checks should be softened and adjusted to the k8s readiness probe.
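As an illustration only, a softened server line aligned with the kubelet readiness probe above could look like the following. The backend name, server address, and the exact timing values are assumptions, not the merged fix:
backend masters
  # check every 5s and only mark the server down after 3 consecutive failures,
  # mirroring periodSeconds: 5 / failureThreshold: 3 from the readiness probe
  server master-2 192.0.2.12:6443 check inter 5s fall 3 rise 1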
Description of problem:
periodics are failing due to a change in coreos.
Version-Release number of selected component (if applicable):
4.15,4.16,4.17,4.18
How reproducible:
100%
Steps to Reproduce:
1. Check any periodic conformance jobs 2. 3.
Actual results:
periodic conformance fails with hostedcluster creation
Expected results:
periodic conformance test succeeds
Additional info:
We want to use crun as default in 4.18, but upstream cri-o switched before we're ready.
Description of problem:
Console user settings are saved in a ConfigMap for each user in the namespace openshift-console-user-settings.
The console frontend uses the k8s API to read and write that ConfigMap. The console backend creates a ConfigMap with a Role and RoleBinding for each user, giving that single user read and write access to his/her own ConfigMap.
The number of Roles and RoleBindings might degrade cluster performance. This has happened in the past, especially on the Developer Sandbox, where a long-living cluster creates new users that are then automatically removed after a month. Keeping the Role and RoleBinding around results in performance issues.
The resources had an ownerReference before 4.15 so that the 3 resources (1 ConfigMap, 1 Role, 1 RoleBinding) were automatically removed when the User resource was deleted. This ownerReference was removed in 4.15 to support external OIDC providers.
The ask in this issue is to restore that ownerReference for the OpenShift auth provider.
History:
See also:
Version-Release number of selected component (if applicable):
4.15+
How reproducible:
Always
Steps to Reproduce:
Actual results:
The three resources weren't deleted after the user was deleted.
Expected results:
The three resources should be deleted after the user is deleted.
Additional info:
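A minimal sketch of what the restored ownership could look like on the per-user ConfigMap (the resource name, user name, and uid are placeholders; the same ownerReference would also be set on the Role and RoleBinding):
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-settings-<user-uid>
  namespace: openshift-console-user-settings
  ownerReferences:
    - apiVersion: user.openshift.io/v1
      kind: User
      name: developer
      uid: 00000000-0000-0000-0000-000000000000   # uid of the User; garbage collection removes the ConfigMap with it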
-> While upgrading the cluster from 4.13.38 -> 4.14.18, it is stuck on CCO; clusterversion is complaining about
"Working towards 4.14.18: 690 of 860 done (80% complete), waiting on cloud-credential".
While checking further we see that CCO deployment is yet to rollout.
-> ClusterOperator status.versions[name=operator] isn't a narrow "CCO Deployment is updated", it's "the CCO asserts the whole CC component is updated", which requires (among other things) a functional CCO Deployment. Seems like you don't have a functional CCO Deployment, because logs have it stuck talking about asking for a leader lease. You don't have Kube API audit logs to say if it's stuck generating the Lease request, or waiting for a response from the Kube API server.
Description of problem:
My customer is trying to install OCP 4.15 IPv4/v6 dual stack with IPv6 primary using IPI-OpenStack (platform: openstack) on OSP 17.1. However, it fails with the following error:
~~~
$ ./openshift-install create cluster --dir ./
:
ERROR: Bootstrap failed to complete: Get "https://api.openshift.example.com:6443/version": dial tcp [2001:db8::5]:6443: i/o timeout
~~~
On the bootstrap node, the VIP "2001:db8::5" is not set.
~~~
$ ip addr
:
2: enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether aa:aa:aa:aa:aa:aa brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.3/24 brd 10.0.0.254 scope global dynamic noprefixroute enp3s0
       valid_lft 40000sec preferred_lft 40000sec
    inet6 2001:db8::3/128 scope global noprefixroute
       valid_lft forever preferred_lft forever
    inet6 fe80::aaaa:aaff:feaa:aaaa/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
~~~
As far as I investigated, the reason why the VIP is not set is that "nameserver" is not properly set in /etc/resolv.conf. Because of this, name resolution doesn't work on the bootstrap node.
~~~
$ cat /etc/resolv.conf
# Generated by NetworkManager
nameserver 127.0.0.1
search openshift.example.com
~~~
==> There should be a nameserver entry which is advertised by DHCPv6 or DHCPv4. However, there is only 127.0.0.1.
/var/run/NetworkManager/resolv.conf has a proper "nameserver" entry which is advertised by DHCPv6:
~~~
# cat /var/run/NetworkManager/resolv.conf
# Generated by NetworkManager
search openshift.example.com
nameserver 2001:db8::8888
~~~
In IPI-openstack installation, /etc/resolv.conf is generated from /var/run/NetworkManager/resolv.conf by the following script:
https://github.com/openshift/installer/blob/9938156e81b5c0085774b2ec56a4be075413fd2d/data/data/bootstrap/openstack/files/etc/NetworkManager/dispatcher.d/30-local-dns-prepender
I'm wondering if the above script doesn't work well due to a timing issue, race condition, or something like that.
And according to the customer, this issue depends on the DNS setting:
- When DNS server info is advertised only by IPv4 DHCP: the issue occurs
- When DNS server info is advertised only by IPv6 DHCP: the issue occurs
- When DNS server info is advertised by both IPv4 and IPv6 DHCP: the issue does NOT occur
Version-Release number of selected component (if applicable):
OCP 4.15 IPI-OpenStack
How reproducible:
Steps to Reproduce:
1. Create a provider network on OSP 17.1 2. Create an IPv4 subnet and an IPv6 subnet on the provider network 3. Set dns-nameserver using the "openstack subnet set --dns-nameserver" command on only one of the IPv4 or IPv6 subnets 4. Run IPI-OpenStack installation on the provider network
Actual results:
IPI-openstack installation fails because nameserver of /etc/resolv.conf on bootstrap node is not set properly
Expected results:
IPI-openstack installation succeeds and nameserver of /etc/resolv.conf on bootstrap node is set properly
Additional info:
Please review the following PR: https://github.com/openshift/installer/pull/8965
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-autoscaler-operator/pull/331
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Modify the import to strip or change the bootOptions.efiSecureBootEnabled
https://redhat-internal.slack.com/archives/CLKF3H5RS/p1722368792144319
archive := &importx.ArchiveFlag{Archive: &importx.TapeArchive{Path: cachedImage}}
// Read the OVF descriptor from the OVA (uses govmomi's importx plus crypto/sha256, io, os, and fmt).
ovfDescriptor, err := archive.ReadOvf("*.ovf")
if err != nil {
    // Open the corrupt OVA file
    f, ferr := os.Open(cachedImage)
    if ferr != nil {
        return ferr
    }
    defer f.Close()
    // Get a sha256 on the corrupt OVA file
    // and the size of the file
    h := sha256.New()
    written, cerr := io.Copy(h, f)
    if cerr != nil {
        return cerr
    }
    return fmt.Errorf("ova %s has a sha256 of %x and a size of %d bytes, failed to read the ovf descriptor %w", cachedImage, h.Sum(nil), written, err)
}
ovfEnvelope, err := archive.ReadEnvelope(ovfDescriptor)
if err != nil {
    return err
}
Description of problem:
OCP UI enabled ES and FR recently and a new memsource project template was created for the upload operation. So we need to update the memsource-upload.sh script to make use of the new project template ID.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/322
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-cluster-control-plane-machine-set-operator-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Description of problem:
Users could filter routes by status on OCP 4.16 and earlier, but this filter disappeared on OCP 4.17 and 4.18.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-10-09-114619 4.18.0-0.nightly-2024-10-09-113533
How reproducible:
Always
Steps to Reproduce:
1. Check the routes list page. 2. 3.
Actual results:
1. There is no filter for the status field.
Expected results:
1. There should be a filter for the status field. Refer to the filter on 4.16: https://drive.google.com/file/d/1j0QdO98cMy0ots8rtHdB82MSWilxkOGr/view?usp=drive_link
Additional info:
Description of problem:
When creating a sample application from the OCP Dev Console, the deployments, services, and routes get created, but it does not create any BuildConfigs for the application, and hence the application throws: ImagePullBackOff: Back-off pulling image "nodejs-sample:latest"
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. OCP Web Console -> Developer Mode -> Add -> Samples -> Select Any "Builder Images" type Application -> Create 2. Check BuildConfig for this application. 3.
Actual results:
No BuildConfig gets created.
Expected results:
Application should create a build and the image should be available for the application deployment.
Additional info:
Please review the following PR: https://github.com/openshift/node_exporter/pull/152
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
update the tested instance type for IBMCloud
Version-Release number of selected component (if applicable):
4.17
How reproducible:
1. Some new instance types need to be added 2. Match the memory and CPU limitations
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
https://docs.openshift.com/container-platform/4.16/installing/installing_ibm_cloud_public/installing-ibm-cloud-customizations.html#installation-ibm-cloud-tested-machine-types_installing-ibm-cloud-customizations
Description of problem:
When an IDP name contains whitespace, it causes the oauth-server to panic if built with Go 1.22 or higher.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create a cluster with OCP 4.17 2. Create IDP with whitespaces in the name. 3. oauth-server panics.
Actual results:
oauth-server panics (if Go is at version 1.22 or higher).
Expected results:
NO REGRESSION, it worked with Go 1.21 and lower.
Additional info:
Please review the following PR: https://github.com/openshift/cloud-provider-nutanix/pull/35
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The first time we try to clear the input value on the Expand PVC modal, the value is not set to zero; instead it is cleared and set to 1. We need to clear it again before the input value becomes 0.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-01-053925
How reproducible:
Always
Steps to Reproduce:
1. create a PVC with size 300MiB, and make sure it's in Bound status 2. goes to PVC details -> Actions -> Expand PVC, select the input value and press 'backspace/delete' button
Actual results:
2. the input value is set to 1
Expected results:
2. the input value should be set to 0 on a clear action
Additional info:
screenshot https://drive.google.com/file/d/1Y-FwiCndGpnR6A8ZR1V9weumBi2xzcp0/view?usp=drive_link
Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/852
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/165
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The `aws-ebs-csi-driver-node-` pods appear to be failing to deploy far too often in CI recently.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
in a statistically significant pattern
Steps to Reproduce:
1. run OCP test suite many times for it to matter
Actual results:
fail [github.com/openshift/origin/test/extended/authorization/scc.go:76]: 1 pods failed before test on SCC errors Error creating: pods "aws-ebs-csi-driver-node-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[5]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, provider restricted-v2: .containers[0].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[0].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[0].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider restricted-v2: .containers[1].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[1].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[1].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider restricted-v2: .containers[2].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[2].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for DaemonSet.apps/v1/aws-ebs-csi-driver-node -n openshift-cluster-csi-drivers happened 4 times
Expected results:
Test pass
Additional info:
[sig-auth][Feature:SCC][Early] should not have pod creation failures during install [Suite:openshift/conformance/parallel]
Description of problem:
On the OperatorHub page, the operators are not shown and the page displays the following error message: "Oh no! Something went wrong." TypeError: Description: A.reduce is not a function
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Go to the operator hub page on the web console. 2. 3.
Actual results:
"Oh no! Something went wrong."
Expected results:
Should list all the operators.
Additional info:
Description of problem:
When we view the list page with 'All Projects' selected, it does not show all Ingress resources.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-09-212926
How reproducible:
Always
Steps to Reproduce:
1. create Ingress under different project $ oc get ingress -A NAMESPACE NAME CLASS HOSTS ADDRESS PORTS AGE 39-3 example-2 <none> example.com 80 8m56s default example <none> example.com 80 9m43s 2. goes to Networking -> Ingresses -> choose 'All Projects' 3.
Actual results:
2. Only one Ingress resource listed
Expected results:
2. Should list Ingresses from all projects
Additional info:
Description of problem:
In cri-o, the first interface in a CNI result is used as the Pod.IP in Kubernetes. In net-attach-def client lib version 1.7.4, we use the first CNI result as the "default=true" interface as noted in the network-status. This is problematic for CNV along with OVN-K UDN, as it needs to know that the UDN interface is the default=true one.
Version-Release number of selected component (if applicable):
4.18,4.17
How reproducible:
Reproduction is only possible under specific circumstances without an entire OVN-K stack. Therefore, use https://gist.github.com/dougbtv/a97e047c9872b2a40d275bb27af85789 to validate this functionality. This requires installing a custom CNI plugin using the script in the gist named 'z-dummy-cni-script.sh': create it as /var/lib/cni/bin/dummyresult on a host, make it executable, and then make sure it's on the same node you label with multusdebug=true.
Please review the following PR: https://github.com/openshift/monitoring-plugin/pull/178
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Observing a CI test where the metal3 Pod is deleted and allowed to recreate on another host, it took 5 attempts to start the new pod because static-ip-manager was crashlooping with the following log:
+ '[' -z 172.22.0.3/24 ']' + '[' -z enp1s0 ']' + '[' -n enp1s0 ']' ++ ip -o addr show dev enp1s0 scope global + [[ -n 2: enp1s0 inet 172.22.0.134/24 brd 172.22.0.255 scope global dynamic noprefixroute enp1s0\ valid_lft 3sec preferred_lft 3sec ]] + ip -o addr show dev enp1s0 scope global + grep -q 172.22.0.3/24 ERROR: "enp1s0" is already set to ip address belong to different subset than "172.22.0.3/24" + echo 'ERROR: "enp1s0" is already set to ip address belong to different subset than "172.22.0.3/24"' + exit 1
The error message is misleading about what is actually checked (apart from the whole subnet/subset typo). It doesn't appear this should ever work for IPv4, since we don't ever expect the Provisioning VIP to appear on the interface before we've set it. (With IPv6 this should often work thanks to an appalling and unsafe hack. Not to suggest that grepping for an IPv4 address complete with .'s in it is safe either.)
Eventually the pod does start up, with this in the log:
+ '[' -z 172.22.0.3/24 ']' + '[' -z enp1s0 ']' + '[' -n enp1s0 ']' ++ ip -o addr show dev enp1s0 scope global + [[ -n '' ]] + /usr/sbin/ip address flush dev enp1s0 scope global + /usr/sbin/ip addr add 172.22.0.3/24 dev enp1s0 valid_lft 300 preferred_lft 300
So essentially this only worked because there are no IP addresses on the provisioning interface.
In the original (error) log the machine's IP 172.22.0.134/24 has a valid lifetime of 3s, so that likely explains why it later disappears. The provisioning network is managed, so the IP address comes from dnsmasq in the former incarnation of the metal3 pod. We effectively prevent the new pod from starting until the DHCP addresses have timed out, even though we will later flush them to ensure no stale ones are left behind.
The check was originally added by https://github.com/openshift/ironic-static-ip-manager/pull/27 but that only describes what it does and not the reason. There's no linked ticket to indicate what the purpose was.
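For reference, the failing check boils down to roughly the following. This is paraphrased from the trace above; the variable names and the commented-out relaxation are assumptions, not the actual script:
# If the interface already has any global-scope address, require the
# provisioning IP to be among them; otherwise fail.
if [ -n "$(ip -o addr show dev "$PROVISIONING_INTERFACE" scope global)" ]; then
  if ! ip -o addr show dev "$PROVISIONING_INTERFACE" scope global | grep -q "$PROVISIONING_IP"; then
    echo "ERROR: \"$PROVISIONING_INTERFACE\" already has an address outside \"$PROVISIONING_IP\""
    exit 1
  fi
fi
# Possible relaxation (assumption): flush stale DHCP addresses up front instead
# of failing, since the script flushes global addresses before adding the VIP anyway.
# /usr/sbin/ip address flush dev "$PROVISIONING_INTERFACE" scope global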
Description of problem:
pseudolocalizes navigation test is failing due to https://github.com/openshift/networking-console-plugin/issues/46 and CI is blocked. We discussed this as a team and believe the best option is to remove this test so that future plugin changes do not block CI.
Description of problem:
'Remove alternate Service' button doesn't remove alternative service edit section
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-09-120947
How reproducible:
Always
Steps to Reproduce:
1. goes to Routes creation form, Networking -> Routes -> Create Route -> Form view 2. click on 'Add alternate Service' 3. click on 'Remove alternate Service'
Actual results:
3. The alternate service edit section cannot be removed, and since these fields are mandatory, the user cannot create the Route successfully unless they choose an alternate service; otherwise the user will see the error "Required value" for field "spec.alternateBackends[0].name".
Expected results:
Clicking the 'Remove alternate Service' button should remove the alternate service edit section
Additional info:
Description of problem:
according to doc https://docs.openshift.com/container-platform/4.16/storage/understanding-persistent-storage.html#pv-access-modes_understanding-persistent-storage
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-08-135628
How reproducible:
Always
Steps to Reproduce:
1. goes to PVC creation page and select a storageclass whose provisioner is `file.csi.azure.com` 2. check access mode dropdown values
Actual results:
ROX is disabled
Expected results:
ROX should be enabled, all access modes should be enabled
Additional info:
Description of problem:
When installing OpenShift 4.16 on vSphere using IPI method with a template it fails with below error: 2024-08-07T09:55:51.4052628Z "level=debug msg= Fetching Image...", 2024-08-07T09:55:51.4054373Z "level=debug msg= Reusing previously-fetched Image", 2024-08-07T09:55:51.4056002Z "level=debug msg= Fetching Common Manifests...", 2024-08-07T09:55:51.4057737Z "level=debug msg= Reusing previously-fetched Common Manifests", 2024-08-07T09:55:51.4059368Z "level=debug msg=Generating Cluster...", 2024-08-07T09:55:51.4060988Z "level=info msg=Creating infrastructure resources...", 2024-08-07T09:55:51.4063254Z "level=debug msg=Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202406251923-0/x86_64/rhcos-416.94.202406251923-0-vmware.x86_64.ova?sha256=893a41653b66170c7d7e9b343ad6e188ccd5f33b377f0bd0f9693288ec6b1b73'", 2024-08-07T09:55:51.4065349Z "level=debug msg=image download content length: 12169", 2024-08-07T09:55:51.4066994Z "level=debug msg=image download content length: 12169", 2024-08-07T09:55:51.4068612Z "level=debug msg=image download content length: 12169", 2024-08-07T09:55:51.4070676Z "level=error msg=failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed during pre-provisioning: failed to use cached vsphere image: bad status: 403"
Version-Release number of selected component (if applicable):
4.16
How reproducible:
All the time in user environment
Steps to Reproduce:
1.Try to install disconnected IPI install on vSphere using a template. 2. 3.
Actual results:
No cluster installation
Expected results:
Cluster installed with indicated template
Additional info:
- 4.14 works as expected in customer environment - 4.15 works as expected in customer environment
Description of problem:
Adjust OVS Dynamic Pinning tests to hypershift. Port 7_performance_kubelet_node/cgroups.go and 7_performance_kubelet_node/kubelet.go to hypershift
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This bug is created to port test cases to 4.17 branch
Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver-operator/pull/74
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The term "Label" and "Selector" for pod selector is confusing in NetworkPolicies form. Suggestion: 1. change the term accordingly Label -> Key Selector -> Value 2. redunce the length of the input dialog
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
On 4.17, ABI jobs fail with error level=debug msg=Failed to register infra env. Error: 1 error occurred: level=debug msg= * mac-interface mapping for interface eno12399np0 is missing
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-05-24-193308
How reproducible:
On Prow CI ABI jobs, always
Steps to Reproduce:
1. Generate ABI ISO starting with an agent-config file defining multiple network interfaces with `enabled: false` 2. Boot the ISO 3. Wait for error
Actual results:
Install fails with error 'mac-interface mapping for interface xxxx is missing'
Expected results:
Install completes
Additional info:
The check fails on the 1st network interface defined with `enabled: false` Prow CI ABI Job: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-baremetal-pxe-ha-agent-ipv4-static-connected-f14/1797619997015543808 agent-config.yaml: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-baremetal-pxe-ha-agent-ipv4-static-connected-f14/1797619997015543808/artifacts/baremetal-pxe-ha-agent-ipv4-static-connected-f14/baremetal-lab-agent-install/artifacts/agent-config.yaml
Description of problem:
When you update the IngressController's Scope on PowerVS, Alibaba Cloud, or OpenStack, a Progressing status condition is added that only says: "The IngressController scope was changed from "Internal" to "External" It's missing the instructions we see on AWS which begin with "To effectuate this change, you must delete the service..." These platforms do NOT have mutable scope (meaning you must delete the service to effectuate), so the instructions should be included.
Version-Release number of selected component (if applicable):
4.12+
How reproducible:
100%
Steps to Reproduce:
1. On PowerVS, Alibaba Cloud, or OpenStack, create an IngressController 2. Now change the scope of ingresscontroller.spec.endpointPublishingStrategy.loadBalancer.scope
Actual results:
Missing "To effectuate this change, you must delete the service..." instructions
Expected results:
Should contain "To effectuate this change, you must delete the service..." instructions
Additional info:
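For context, the missing instructions point the admin at a manual step along these lines (the IngressController name "default" is just an example):
# effectuate the scope change by deleting the LoadBalancer service so it is recreated
oc -n openshift-ingress delete service/router-default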
Prometheus HTTP API provides POST endpoints to fetch metrics: https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries
Those endpoints are used in the go client: https://github.com/prometheus/client_golang/blob/main/api/prometheus/v1/api.go#L1438
So a viewer-only program/user relying on the Go client, or using these POST endpoints to fetch metrics, currently needs to create an additional Role+Binding for that purpose [1]
It would be much more convenient if that permission was directly included in the existing cluster-monitoring-view role, since it's actually used for reading.
[1]Role+Binding example
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: metrics
rules:
  - verbs:
      - create
    apiGroups:
      - metrics.k8s.io
    resources:
      - pods
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: metrics
subjects:
  - kind: User
    apiGroup: rbac.authorization.k8s.io
    name: test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: metrics
[internal] cf slack discussion here https://redhat-internal.slack.com/archives/C0VMT03S5/p1724684997333529?thread_ts=1715862728.898369&cid=C0VMT03S5
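A sketch of what folding that permission into the existing role could look like, assuming the final rule simply mirrors the workaround above (the real change may instead use aggregation or a different resource):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-monitoring-view
rules:
  # ...existing read-only rules...
  - apiGroups:
      - metrics.k8s.io
    resources:
      - pods
    verbs:
      - create   # lets viewer-only clients use the POST query endpoints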
Description of problem:
When testing unreleased OCP versions, the NodePool fails with:
- lastTransitionTime: "2024-11-18T07:11:20Z"
  message: 'Failed to get release image: the latest version supported is: "4.18.0". Attempting to use: "4.19.0-0.nightly-2024-11-18-064347"'
  observedGeneration: 1
  reason: ValidationFailed
  status: "False"
  type: ValidReleaseImage
We should allow skipping NodePool image validation with the hypershift.openshift.io/skip-release-image-validation annotation.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
100%
Steps to Reproduce:
1.Try to create a NP with 4.19 payload 2. 3.
Actual results:
Expected results:
Additional info:
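A sketch of applying the proposed annotation (the namespace, NodePool name, and the "true" value are assumptions about how the skip would be expressed):
oc -n clusters annotate nodepool example-nodepool \
  hypershift.openshift.io/skip-release-image-validation="true"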
In OCPBUGS-38414, a new featuregate was turned on that didn't work correctly on metal (or at least its tests didn't). Metal should have techpreview jobs to ensure new features are tested properly. I think the right matrix is:
On standard CI jobs, we incorporate this by wiring in the appropriate FEATURE_SET variable, but metal jobs don't currently have a way to do this as far as I can tell.
These should be release informers.
https://github.com/openshift/release/blob/5ce4d77a6317479f909af30d66bc0285ffd38dbd/ci-operator/step-registry/ipi/conf/ipi-conf-commands.sh#L63-L68 is the relevant step
Please review the following PR: https://github.com/openshift/cluster-policy-controller/pull/152
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Start last run option from the Action menu does not work on the BuildConfig details page
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Create workloads with builds 2. Go to the builds page from navigation 3. Select the build config 4. Select the `Start last run` option from the Action menu
Actual results:
The option doesn't work
Expected results:
The option should work
Additional info:
Attaching video
https://drive.google.com/file/d/10shQqcFbIKfE4Jv60AxNYBXKz08EdUAK/view?usp=sharing
The hypershift team has reported a nil pointer dereference causing a crash when attempting to call the validation method on an NTO performance profile.
This was detected as the hypershift team was attempting to complete a revendoring under OSASINFRA-3643
Appears to be fallout from https://github.com/openshift/cluster-node-tuning-operator/pull/1086
Error:
--- FAIL: TestGetTuningConfig (0.02s) --- FAIL: TestGetTuningConfig/gets_a_single_valid_PerformanceProfileConfig (0.00s) panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x2ec0fc3] goroutine 329 [running]: testing.tRunner.func1.2({0x31c21e0, 0x651c8c0}) /home/emilien/sdk/go1.22.0/src/testing/testing.go:1631 +0x3f7 testing.tRunner.func1() /home/emilien/sdk/go1.22.0/src/testing/testing.go:1634 +0x6b6 panic({0x31c21e0?, 0x651c8c0?}) /home/emilien/sdk/go1.22.0/src/runtime/panic.go:770 +0x132 github.com/openshift/cluster-node-tuning-operator/pkg/apis/performanceprofile/v2.(*PerformanceProfile).getNodesList(0xc000fa4000) /home/emilien/git/github.com/shiftstack/hypershift/vendor/github.com/openshift/cluster-node-tuning-operator/pkg/apis/performanceprofile/v2/performanceprofile_validation.go:594 +0x2a3 github.com/openshift/cluster-node-tuning-operator/pkg/apis/performanceprofile/v2.(*PerformanceProfile).ValidateBasicFields(0xc000fa4000) /home/emilien/git/github.com/shiftstack/hypershift/vendor/github.com/openshift/cluster-node-tuning-operator/pkg/apis/performanceprofile/v2/performanceprofile_validation.go:132 +0x65 github.com/openshift/hypershift/hypershift-operator/controllers/nodepool.validateTuningConfigManifest({0xc000f34a00, 0x1ee, 0x200}) /home/emilien/git/github.com/shiftstack/hypershift/hypershift-operator/controllers/nodepool/nto.go:237 +0x307 github.com/openshift/hypershift/hypershift-operator/controllers/nodepool.(*NodePoolReconciler).getTuningConfig(0xc000075cd8, {0x50bf5f8, 0x65e4e40}, 0xc000e05408) /home/emilien/git/github.com/shiftstack/hypershift/hypershift-operator/controllers/nodepool/nto.go:187 +0x834 github.com/openshift/hypershift/hypershift-operator/controllers/nodepool.TestGetTuningConfig.func1(0xc000e07a00) /home/emilien/git/github.com/shiftstack/hypershift/hypershift-operator/controllers/nodepool/nto_test.go:459 +0x297 testing.tRunner(0xc000e07a00, 0xc000693650) /home/emilien/sdk/go1.22.0/src/testing/testing.go:1689 +0x21f created by testing.(*T).Run in goroutine 325 /home/emilien/sdk/go1.22.0/src/testing/testing.go:1742 +0x826
Description of problem:
Specify long cluster name in install-config, ============== metadata: name: jima05atest123456789test123 Create cluster, installer exited with below error: 08-05 09:46:12.788 level=info msg=Network infrastructure is ready 08-05 09:46:12.788 level=debug msg=Creating storage account 08-05 09:46:13.042 level=debug msg=Collecting applied cluster api manifests... 08-05 09:46:13.042 level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed provisioning resources after infrastructure ready: error creating storage account jima05atest123456789tsh586sa: PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima05atest123456789t-sh586-rg/providers/Microsoft.Storage/storageAccounts/jima05atest123456789tsh586sa 08-05 09:46:13.042 level=error msg=-------------------------------------------------------------------------------- 08-05 09:46:13.042 level=error msg=RESPONSE 400: 400 Bad Request 08-05 09:46:13.043 level=error msg=ERROR CODE: AccountNameInvalid 08-05 09:46:13.043 level=error msg=-------------------------------------------------------------------------------- 08-05 09:46:13.043 level=error msg={ 08-05 09:46:13.043 level=error msg= "error": { 08-05 09:46:13.043 level=error msg= "code": "AccountNameInvalid", 08-05 09:46:13.043 level=error msg= "message": "jima05atest123456789tsh586sa is not a valid storage account name. Storage account name must be between 3 and 24 characters in length and use numbers and lower-case letters only." 08-05 09:46:13.043 level=error msg= } 08-05 09:46:13.043 level=error msg=} 08-05 09:46:13.043 level=error msg=-------------------------------------------------------------------------------- 08-05 09:46:13.043 level=error 08-05 09:46:13.043 level=info msg=Shutting down local Cluster API controllers... 08-05 09:46:13.298 level=info msg=Stopped controller: Cluster API 08-05 09:46:13.298 level=info msg=Stopped controller: azure infrastructure provider 08-05 09:46:13.298 level=info msg=Stopped controller: azureaso infrastructure provider 08-05 09:46:13.298 level=info msg=Shutting down local Cluster API control plane... 08-05 09:46:15.177 level=info msg=Local Cluster API system has completed operations See azure doc[1], the naming rules on storage account name, it must be between 3 and 24 characters in length and may contain numbers and lowercase letters only. The prefix of storage account created by installer seems changed to use infraID with CAPI-based installation, it's "cluster" when installing with terraform. Is it possible to change back to use "cluster" as sa prefix to keep consistent with terraform? because there are several storage accounts being created once cluster installation is completed. One is created by installer starting with "cluster", others are created by image-registry starting with "imageregistry". And QE has some CI profiles[2] and automated test cases relying on installer sa, need to search prefix with "cluster", and not sure if customer also has similar scenarios. [1] https://learn.microsoft.com/en-us/azure/storage/common/storage-account-overview [2] https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/install/heterogeneous/ipi-install-heterogeneous-commands.sh#L241
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Similar to the work done for AWS STS and Azure WIF support, the console UI (specifically OperatorHub) needs to:
CONSOLE-3776 added filtering for the GCP WIF case to the OperatorHub tile view. Part of the change was also checking for the annotation which indicates that the operator supports GCP's WIF:
features.operators.openshift.io/token-auth-gcp: "true"
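For reference, an operator would typically advertise this via its ClusterServiceVersion metadata, roughly like the following (the operator name is illustrative):
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: example-operator.v1.0.0
  annotations:
    features.operators.openshift.io/token-auth-gcp: "true"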
AC:
Please review the following PR: https://github.com/openshift/oc/pull/1870
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The AWS EFS CSI Operator primarily passes credentials to the CSI driver using environment variables. However, this practice is discouraged by the OCP Hardening Guide.
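A minimal sketch of the file-based alternative the hardening guidance prefers for the driver pod spec. The Secret name, mount path, and the use of AWS_SHARED_CREDENTIALS_FILE are assumptions, not the operator's actual wiring:
spec:
  volumes:
    - name: aws-credentials
      secret:
        secretName: aws-efs-cloud-credentials
  containers:
    - name: csi-driver
      volumeMounts:
        - name: aws-credentials
          mountPath: /var/run/secrets/aws
          readOnly: true
      env:
        - name: AWS_SHARED_CREDENTIALS_FILE   # AWS SDKs read the shared credentials file from this path
          value: /var/run/secrets/aws/credentials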
Please review the following PR: https://github.com/openshift/cluster-api-provider-vsphere/pull/50
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-vsphere-cluster-api-controllers-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Description of problem:
Attempting to Migrate from OpenShiftSDN to OVNKubernetes but experiencing the below Error once the Limited Live Migration is started.
+ exec /usr/bin/hybrid-overlay-node --node ip-10-241-1-192.us-east-2.compute.internal --config-file=/run/ovnkube-config/ovnkube.conf --bootstrap-kubeconfig=/var/lib/kubelet/kubeconfig --cert-dir=/etc/ovn/ovnkube-node-certs --cert-duration=24h I0829 14:06:20.313928 82345 config.go:2192] Parsed config file /run/ovnkube-config/ovnkube.conf I0829 14:06:20.314202 82345 config.go:2193] Parsed config: {Default:{MTU:8901 RoutableMTU:0 ConntrackZone:64000 HostMasqConntrackZone:0 OVNMasqConntrackZone:0 HostNodePortConntrackZone:0 ReassemblyConntrackZone:0 EncapType:geneve EncapIP: EncapPort:6081 InactivityProbe:100000 OpenFlowProbe:180 OfctrlWaitBeforeClear:0 MonitorAll:true OVSDBTxnTimeout:1m40s LFlowCacheEnable:true LFlowCacheLimit:0 LFlowCacheLimitKb:1048576 RawClusterSubnets:100.64.0.0/15/23 ClusterSubnets:[] EnableUDPAggregation:true Zone:global} Logging:{File: CNIFile: LibovsdbFile:/var/log/ovnkube/libovsdb.log Level:4 LogFileMaxSize:100 LogFileMaxBackups:5 LogFileMaxAge:0 ACLLoggingRateLimit:20} Monitoring:{RawNetFlowTargets: RawSFlowTargets: RawIPFIXTargets: NetFlowTargets:[] SFlowTargets:[] IPFIXTargets:[]} IPFIX:{Sampling:400 CacheActiveTimeout:60 CacheMaxFlows:0} CNI:{ConfDir:/etc/cni/net.d Plugin:ovn-k8s-cni-overlay} OVNKubernetesFeature:{EnableAdminNetworkPolicy:true EnableEgressIP:true EgressIPReachabiltyTotalTimeout:1 EnableEgressFirewall:true EnableEgressQoS:true EnableEgressService:true EgressIPNodeHealthCheckPort:9107 EnableMultiNetwork:true EnableMultiNetworkPolicy:false EnableStatelessNetPol:false EnableInterconnect:false EnableMultiExternalGateway:true EnablePersistentIPs:false EnableDNSNameResolver:false EnableServiceTemplateSupport:false} Kubernetes:{BootstrapKubeconfig: CertDir: CertDuration:10m0s Kubeconfig: CACert: CAData:[] APIServer:https://api-int.nonamenetwork.sandbox1730.opentlc.com:6443 Token: TokenFile: CompatServiceCIDR: RawServiceCIDRs:198.18.0.0/16 ServiceCIDRs:[] OVNConfigNamespace:openshift-ovn-kubernetes OVNEmptyLbEvents:false PodIP: RawNoHostSubnetNodes:migration.network.openshift.io/plugin= NoHostSubnetNodes:<nil> HostNetworkNamespace:openshift-host-network PlatformType:AWS HealthzBindAddress:0.0.0.0:10256 CompatMetricsBindAddress: CompatOVNMetricsBindAddress: CompatMetricsEnablePprof:false DNSServiceNamespace:openshift-dns DNSServiceName:dns-default} Metrics:{BindAddress: OVNMetricsBindAddress: ExportOVSMetrics:false EnablePprof:false NodeServerPrivKey: NodeServerCert: EnableConfigDuration:false EnableScaleMetrics:false} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} Gateway:{Mode:shared Interface: EgressGWInterface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false V4JoinSubnet:100.64.0.0/16 V6JoinSubnet:fd98::/64 V4MasqueradeSubnet:169.254.169.0/29 V6MasqueradeSubnet:fd69::/125 MasqueradeIPs:{V4OVNMasqueradeIP:169.254.169.1 V6OVNMasqueradeIP:fd69::1 V4HostMasqueradeIP:169.254.169.2 V6HostMasqueradeIP:fd69::2 V4HostETPLocalMasqueradeIP:169.254.169.3 V6HostETPLocalMasqueradeIP:fd69::3 V4DummyNextHopMasqueradeIP:169.254.169.4 V6DummyNextHopMasqueradeIP:fd69::4 V4OVNServiceHairpinMasqueradeIP:169.254.169.5 V6OVNServiceHairpinMasqueradeIP:fd69::5} DisablePacketMTUCheck:false RouterSubnet: SingleNode:false DisableForwarding:false AllowNoUplink:false} MasterHA:{ElectionLeaseDuration:137 ElectionRenewDeadline:107 ElectionRetryPeriod:26} ClusterMgrHA:{ElectionLeaseDuration:137 
ElectionRenewDeadline:107 ElectionRetryPeriod:26} HybridOverlay:{Enabled:true RawClusterSubnets: ClusterSubnets:[] VXLANPort:4789} OvnKubeNode:{Mode:full DPResourceDeviceIdsMap:map[] MgmtPortNetdev: MgmtPortDPResourceName:} ClusterManager:{V4TransitSwitchSubnet:100.88.0.0/16 V6TransitSwitchSubnet:fd97::/64}} F0829 14:06:20.315468 82345 hybrid-overlay-node.go:54] illegal network configuration: built-in join subnet "100.64.0.0/16" overlaps cluster subnet "100.64.0.0/15"
The OpenShift Container Platform 4 cluster has been installed with the below configuration, and therefore the clusterNetwork conflicts with the Join Subnet of OVNKubernetes.
$ oc get cm -n kube-system cluster-config-v1 -o yaml
apiVersion: v1
data:
install-config: |
additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: sandbox1730.opentlc.com
compute:
- architecture: amd64
hyperthreading: Enabled
name: worker
platform: {}
replicas: 3
controlPlane:
architecture: amd64
hyperthreading: Enabled
name: master
platform: {}
replicas: 3
metadata:
creationTimestamp: null
name: nonamenetwork
networking:
clusterNetwork:
- cidr: 100.64.0.0/15
hostPrefix: 23
machineNetwork:
- cidr: 10.241.0.0/16
networkType: OpenShiftSDN
serviceNetwork:
- 198.18.0.0/16
platform:
aws:
region: us-east-2
publish: External
pullSecret: ""
So following the procedure, the below steps were executed but still the problem is being reported.
oc patch network.operator.openshift.io cluster --type='merge' -p='{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipv4":{"internalJoinSubnet": "100.68.0.0/16"}}}}}'
Checking whether the change was applied, one can see it is there/configured.
$ oc get network.operator cluster -o yaml apiVersion: operator.openshift.io/v1 kind: Network metadata: creationTimestamp: "2024-08-29T10:05:36Z" generation: 376 name: cluster resourceVersion: "135345" uid: 37f08c71-98fa-430c-b30f-58f82142788c spec: clusterNetwork: - cidr: 100.64.0.0/15 hostPrefix: 23 defaultNetwork: openshiftSDNConfig: enableUnidling: true mode: NetworkPolicy mtu: 8951 vxlanPort: 4789 ovnKubernetesConfig: egressIPConfig: {} gatewayConfig: ipv4: {} ipv6: {} routingViaHost: false genevePort: 6081 ipsecConfig: mode: Disabled ipv4: internalJoinSubnet: 100.68.0.0/16 mtu: 8901 policyAuditConfig: destination: "null" maxFileSize: 50 maxLogFiles: 5 rateLimit: 20 syslogFacility: local0 type: OpenShiftSDN deployKubeProxy: false disableMultiNetwork: false disableNetworkDiagnostics: false kubeProxyConfig: bindAddress: 0.0.0.0 logLevel: Normal managementState: Managed migration: mode: Live networkType: OVNKubernetes observedConfig: null operatorLogLevel: Normal serviceNetwork: - 198.18.0.0/16 unsupportedConfigOverrides: null useMultiNetworkPolicy: false
Following the above, the Limited Live Migration is triggered, which then suddenly stops because of the error shown.
oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.16.9
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4 with OpenShiftSDN, the configuration shown above and then update to OpenShift Container Platform 4.16
2. Change internalJoinSubnet to prevent a conflict with the Join Subnet of OVNKubernetes (oc patch network.operator.openshift.io cluster --type='merge' -p='{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipv4":{"internalJoinSubnet": "100.68.0.0/16"}}}}}')
3. Initiate the Limited Live Migration running oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'
4. Check the logs of ovnkube-node using oc logs ovnkube-node-XXXXX -c ovnkube-controller
Actual results:
Same hybrid-overlay-node output as shown in the description above, ending with: F0829 14:06:20.315468 82345 hybrid-overlay-node.go:54] illegal network configuration: built-in join subnet "100.64.0.0/16" overlaps cluster subnet "100.64.0.0/15"
Expected results:
OVNKubernetes Limited Live Migration to recognize the change applied for internalJoinSubnet and don't report any CIDR/Subnet overlap during the OVNKubernetes Limited Live Migration
Additional info:
N/A
Affected Platforms:
OpenShift Container Platform 4.16 on AWS
Description of problem:
TestIngressControllerNamespaceSelectorUpdateShouldClearRouteStatus failed due to a previously seen issue with using an outdated IngressController object on update: router_status_test.go:248: failed to update ingresscontroller: Operation cannot be fulfilled on ingresscontrollers.operator.openshift.io "ic-namespace-selector-test": the object has been modified; please apply your changes to the latest version and try again
Version-Release number of selected component (if applicable):
4.12-4.17
How reproducible:
<5% (Seen only once)
Steps to Reproduce:
1. Run TestIngressControllerNamespaceSelectorUpdateShouldClearRouteStatus on a busy cluster with other tests in parallel until it fails
Actual results:
Flake
Expected results:
No flake
Additional info:
"7 runs, 57% failed, 25% of failures match = 14% impact"
I think we should address all possible "Operation cannot be fulfilled on ingresscontroller" flakes together.
Description of problem:
webpack dependency in @openshift-console/dynamic-plugin-sdk-webpack package is listed as "5.75.0" i.e. not a semver range but an exact version.
If a plugin project updates its webpack dependency to a newer version, it may cause the package manager to not hoist node_modules/@openshift/dynamic-plugin-sdk-webpack (which is a dependency of the ☝️ package) which then causes problems during the webpack build.
Steps to Reproduce:
1. git clone https://github.com/kubevirt-ui/kubevirt-plugin 2. modify webpack dependency in package.json to a newer version 3. yarn install # missing node_modules/@openshift/dynamic-plugin-sdk-webpack 4. yarn build # results in build errors due to ^^
Actual results:
Build errors due to missing node_modules/@openshift/dynamic-plugin-sdk-webpack
Expected results:
No build errors
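A sketch of the likely fix on the SDK side, declaring webpack as a semver range so consumer projects can hoist a compatible version. Whether a caret range is the exact choice is an assumption; this is a fragment of the SDK package's package.json:
{
  "dependencies": {
    "webpack": "^5.75.0"
  }
}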
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/76
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cloud-network-config-controller/pull/152
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The quorum checker is becoming very difficult to maintain, and we're having a lot more problems with concurrent controllers, as identified in OCPBUGS-31849.
To avoid plastering the code in all places where a revision rollout could happen, we should invert the control and tell the revision controller when we do not want to have a rollout at all.
Links to some of the discussions:
AC:
Add precondition to the revision controller - this would halt the whole revision process
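As a rough illustration of that inversion of control (a hypothetical shape, not the actual library-go revision controller API), the sync loop could consult precondition hooks before creating any new revision:

package example

import "context"

// PreconditionFunc reports whether a new revision may be rolled out; returning
// false halts the whole revision process for this sync.
type PreconditionFunc func(ctx context.Context) (bool, error)

type revisionController struct {
	preconditions []PreconditionFunc
}

// sync consults every precondition first; if any of them (for example a
// quorum-safety check) says no, the controller skips creating a revision
// instead of every caller having to guard rollouts individually.
func (c *revisionController) sync(ctx context.Context) error {
	for _, pre := range c.preconditions {
		ok, err := pre(ctx)
		if err != nil {
			return err
		}
		if !ok {
			return nil // rollout intentionally halted
		}
	}
	return c.createRevisionIfNeeded(ctx)
}

// createRevisionIfNeeded stands in for the existing revision-creation logic.
func (c *revisionController) createRevisionIfNeeded(ctx context.Context) error {
	return nil
}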
Description of problem:
The following logs are from namespaces/openshift-apiserver/pods/apiserver-6fcd57c747-57rkr/openshift-apiserver/openshift-apiserver/logs/current.log
2024-06-06T15:57:06.628216833Z E0606 15:57:06.628186 1 finisher.go:175] FinishRequest: post-timeout activity - time-elapsed: 139.823053ms, panicked: true, err: <nil>, panic-reason: runtime error: invalid memory address or nil pointer dereference 2024-06-06T15:57:06.628216833Z goroutine 192790 [running]: 2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers/finisher.finishRequest.func1.1() 2024-06-06T15:57:06.628216833Z k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/finisher/finisher.go:105 +0xa5 2024-06-06T15:57:06.628216833Z panic({0x498ac60?, 0x74a51c0?}) 2024-06-06T15:57:06.628216833Z runtime/panic.go:914 +0x21f 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer.(*ImageStreamImporter).importImages(0xc0c5bf0fc0, {0x5626bb0, 0xc0a50c7dd0}, 0xc07055f4a0, 0xc0a2487600) 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer/importer.go:263 +0x1cf5 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer.(*ImageStreamImporter).Import(0xc0c5bf0fc0, {0x5626bb0, 0xc0a50c7dd0}, 0x0?, 0x0?) 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer/importer.go:110 +0x139 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/registry/imagestreamimport.(*REST).Create(0xc0033b2240, {0x5626bb0, 0xc0a50c7dd0}, {0x5600058?, 0xc07055f4a0?}, 0xc08e0b9ec0, 0x56422e8?) 2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/registry/imagestreamimport/rest.go:337 +0x1574 2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers.(*namedCreaterAdapter).Create(0x55f50e0?, {0x5626bb0?, 0xc0a50c7dd0?}, {0xc0b5704000?, 0x562a1a0?}, {0x5600058?, 0xc07055f4a0?}, 0x1?, 0x2331749?) 2024-06-06T15:57:06.628216833Z k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/create.go:254 +0x3b 2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers.CreateResource.createHandler.func1.1() 2024-06-06T15:57:06.628216833Z k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/create.go:184 +0xc6 2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers.CreateResource.createHandler.func1.2() 2024-06-06T15:57:06.628216833Z k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/create.go:209 +0x39e 2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers/finisher.finishRequest.func1() 2024-06-06T15:57:06.628216833Z k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/finisher/finisher.go:117 +0x84
Version-Release number of selected component (if applicable):
We applied it to all clusters in CI and checked 3 of them; all 3 share the same errors.
oc --context build09 get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-rc.3   True        False         3d9h    Error while reconciling 4.16.0-rc.3: the cluster operator machine-config is degraded
oc --context build02 get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-rc.2   True        False         15d     Error while reconciling 4.16.0-rc.2: the cluster operator machine-config is degraded
oc --context build03 get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.16   True        False         34h     Error while reconciling 4.15.16: the cluster operator machine-config is degraded
How reproducible:
We applied this PR https://github.com/openshift/release/pull/52574/files to the clusters.
It breaks at least 3 of them.
"qci-pull-through-cache-us-east-1-ci.apps.ci.l2s4.p1.openshiftapps.com" is a registry cache server https://github.com/openshift/release/blob/master/clusters/app.ci/quayio-pull-through-cache/qci-pull-through-cache-us-east-1.yaml
Additional info:
There are lots of image imports in OpenShift CI jobs.
It feels like the registry cache server returns unexpected results to the openshift-apiserver:
2024-06-06T18:13:13.781520581Z E0606 18:13:13.781459 1 strategy.go:60] unable to parse manifest for "sha256:c5bcd0298deee99caaf3ec88de246f3af84f80225202df46527b6f2b4d0eb3c3": unexpected end of JSON input
Our theory is that the import requests from all CI clusters crashed the cache server, and it sent some unexpected data which caused the apiserver to panic.
The expected behaviour is that if the image cannot be pulled from the first mirror in the ImageDigestMirrorSet, then it fails over to the next one.
Description of problem:
Navigation: Storage -> StorageClasses -> Create StorageClass -> Provisioner -> kubernetes.io/gce-pd
Issue: "Type" "Select GCE type" "Zone" "Zones" "Replication type" "Select Replication type" are in English.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-01-063526
How reproducible:
Always
Steps to Reproduce:
1. Log into web console and set language to non en_US
2. Navigate to Storage -> StorageClasses -> Create StorageClass -> Provisioner
3. Select Provisioner "kubernetes.io/gce-pd"
4. "Type" "Select GCE type" "Zone" "Zones" "Replication type" "Select Replication type" are in English
Actual results:
Content is in English
Expected results:
Content should be in set language.
Additional info:
Screenshot reference attached
Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/313
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
We added the following to openstack-kubelet-nodename.service: https://github.com/openshift/machine-config-operator/pull/4570
But wait-for-br-ex-up.service is disabled, so it doesn't normally do anything. This is why it doesn't break anything on other platforms, even though it's never going to work the way we are currently configuring workers for HCP.
However, this Wants directive enables it when openstack-kubelet-nodename is added to the systemd transaction, so adding it broke us. "Wants" adds it to the transaction, and it hangs. If it failed it would be fine, but it doesn't. It also adds a RequiredBy on node-valid-hostname.
"br-ex" is up, but that doesn't matter because that's not what it's testing. It's testing that /run/nodeip-configuration/br-ex-up exists, which it won't, because it's written by /etc/NetworkManager/dispatcher.d/30-resolv-prepender, which is empty.
Version-Release number of selected component (if applicable):
4.18
Component Readiness has found a potential regression in the following test:
[sig-node] node-lifecycle detects unexpected not ready node
Extreme regression detected.
Fishers Exact probability of a regression: 100.00%.
Test pass rate dropped from 100.00% to 84.62%.
Sample (being evaluated) Release: 4.18
Start Time: 2024-10-29T00:00:00Z
End Time: 2024-11-05T23:59:59Z
Success Rate: 84.62%
Successes: 33
Failures: 6
Flakes: 0
Base (historical) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-10-01T23:59:59Z
Success Rate: 100.00%
Successes: 79
Failures: 0
Flakes: 0
The cns-migration tool should check for supported versions of vCenter before starting the migration of CNS volumes.
Description of problem:
The node-joiner tool does not honour additionalNTPSources. As mentioned in https://docs.openshift.com/container-platform/4.16/installing/installing_with_agent_based_installer/installation-config-parameters-agent.html, setting additionalNTPSources is possible when adding nodes at day 1, but the setting is not honoured at day 2.
How reproducible:
always
Steps to Reproduce:
Create an agent config with:
additionalNTPSources:
- 10.10.10.10
- 10.10.10.11
hosts:
- hostname: extra-worker-0
  interfaces:
  - name: eth0
    macAddress: 0xDEADBEEF
- hostname: extra-worker-1
  interfaces:
  - name: eth0
    macAddress: 00:02:46:e3:9e:8c
- hostname: 0xDEADBEEF
  interfaces:
  - name: eth0
    macAddress: 0xDEADBEEF
Actual results:
NTP on the added node cannot synchronize with the NTP server. ntp-synced Status:failure Message:Host couldn't synchronize with any NTP server
Expected results:
NTP on the added node can contact the NTP server.
Please review the following PR: https://github.com/operator-framework/operator-marketplace/pull/574
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The version page in our docs is out of date and needs to be updated with the current versioning standards we expect.
The minimum OCP management cluster / Kubernetes version needs to be added.
Please review the following PR: https://github.com/openshift/images/pull/192
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In an effort to ensure all HA components are not degraded by design during normal e2e tests or upgrades, we are collecting all operators that are blipping Degraded=True during any payload job run. This card captures the image-registry operator, which blips Degraded=True during upgrade runs.
Example job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-e2e-gcp-ovn-upgrade/1843366977876267008
Reasons associated with the blip: ProgressDeadlineExceeded, NodeCADaemonControllerError
For now, we put an exception in the test, but it is expected that teams take action to fix these and remove the exceptions after the fix goes in. The exception can be found here: https://github.com/openshift/origin/blob/fd6fe36319c39b51ab0f02ecb8e2777c0e1bb210/pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go#L319
See the linked issue for more explanation of the effort.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
After click "Don's show again" on Lightspeed popup modal, it went to user preference page, and highlighted "Hide Lightspeed" part with rectangle lines, if now click "Lightspeed" popup button at the right bottom, the highlighted rectangle lines lay above the popup modal.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-09-150616
How reproducible:
Always
Steps to Reproduce:
1.Clicked "Don's show again" on Lightspeed popup modal, it went to user preference page, and highlighted "Hide Lightspeed" part with rectangle lines. At the same time, click "Lightspeed" popup button at the right bottom. 2. 3.
Actual results:
1. The highlighted rectangular outline lies above the popup modal. Screenshot: https://drive.google.com/drive/folders/15te0dbavJUTGtqRYFt-rM_U8SN7euFK5?usp=sharing
Expected results:
1. The Lightspeed popup modal should be on the top layer.
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The openshift-ingress/router-default never stops reconciling in the ingress operator.
2024-08-22T15:59:22.789Z INFO operator.ingress_controller controller/controller.go:114 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T15:59:22.799Z INFO operator.status_controller controller/controller.go:114 Reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T15:59:22.868Z INFO operator.ingress_controller ingress/deployment.go:135 updated router deployment {"namespace": "openshift-ingress", "name": "router-default", "diff": " &v1.Deployment{\n \tTypeMeta: {},\n \tObjectMeta: {Name: \"router-default\", Namespace: \"openshift-ingress\", UID: \"6cf98392-8782-4741-b5c9-ce63fb77879a\", ResourceVersion: \"6244\", ...},\n \tSpec: v1.DeploymentSpec{\n \t\tReplicas: &1,\n \t\tSelector: &{MatchLabels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\"}},\n \t\tTemplate: {ObjectMeta: {Labels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\", \"ingresscontroller.operator.openshift.io/hash\": \"9c69cc8d\"}, Annotations: {\"target.workload.openshift.io/management\": `{\"effect\": \"PreferredDuringScheduling\"}`}}, Spec: {Volumes: {{Name: \"default-certificate\", VolumeSource: {Secret: &{SecretName: \"default-ingress-cert\", DefaultMode: &420}}}, {Name: \"service-ca-bundle\", VolumeSource: {ConfigMap: &{LocalObjectReference: {Name: \"service-ca-bundle\"}, Items: {{Key: \"service-ca.crt\", Path: \"service-ca.crt\"}}, DefaultMode: &420, Optional: &false}}}, {Name: \"stats-auth\", VolumeSource: {Secret: &{SecretName: \"router-stats-default\", DefaultMode: &420}}}, {Name: \"metrics-certs\", VolumeSource: {Secret: &{SecretName: \"router-metrics-certs-default\", DefaultMode: &420}}}}, Containers: {{Name: \"router\", Image: \"registry.build05.ci.openshift.org/ci-op-vxbb8hxy/stable@sha256:f\"..., Ports: {{Name: \"http\", ContainerPort: 80, Protocol: \"TCP\"}, {Name: \"https\", ContainerPort: 443, Protocol: \"TCP\"}, {Name: \"metrics\", ContainerPort: 1936, Protocol: \"TCP\"}}, Env: {{Name: \"DEFAULT_CERTIFICATE_DIR\", Value: \"/etc/pki/tls/private\"}, {Name: \"DEFAULT_DESTINATION_CA_PATH\", Value: \"/var/run/configmaps/service-ca/service-ca.crt\"}, {Name: \"RELOAD_INTERVAL\", Value: \"5s\"}, {Name: \"ROUTER_ALLOW_WILDCARD_ROUTES\", Value: \"false\"}, ...}, ...}}, RestartPolicy: \"Always\", TerminationGracePeriodSeconds: &3600, ...}},\n \t\tStrategy: v1.DeploymentStrategy{\n- \t\t\tType: \"RollingUpdate\",\n+ \t\t\tType: \"\",\n- \t\t\tRollingUpdate: s\"&RollingUpdateDeployment{MaxUnavailable:25%,MaxSurge:25%,}\",\n+ \t\t\tRollingUpdate: nil,\n \t\t},\n \t\tMinReadySeconds: 30,\n \t\tRevisionHistoryLimit: &10,\n \t\t... 
// 2 identical fields\n \t},\n \tStatus: {ObservedGeneration: 1, Replicas: 1, UpdatedReplicas: 1, ReadyReplicas: 1, ...},\n }\n"} 2024-08-22T15:59:22.884Z ERROR operator.ingress_controller controller/controller.go:114 got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)"} 2024-08-22T15:59:24.820Z INFO operator.ingress_controller handler/enqueue_mapped.go:103 queueing ingress {"name": "default", "related": ""} 2024-08-22T15:59:24.820Z INFO operator.ingress_controller handler/enqueue_mapped.go:103 queueing ingress {"name": "default", "related": ""} 2024-08-22T15:59:24.820Z INFO operator.ingress_controller controller/controller.go:114 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T15:59:24.887Z INFO operator.ingress_controller ingress/deployment.go:135 updated router deployment {"namespace": "openshift-ingress", "name": "router-default", "diff": " &v1.Deployment{\n \tTypeMeta: {},\n \tObjectMeta: {Name: \"router-default\", Namespace: \"openshift-ingress\", UID: \"6cf98392-8782-4741-b5c9-ce63fb77879a\", ResourceVersion: \"7194\", ...},\n \tSpec: v1.DeploymentSpec{\n \t\tReplicas: &1,\n \t\tSelector: &{MatchLabels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\"}},\n \t\tTemplate: {ObjectMeta: {Labels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\", \"ingresscontroller.operator.openshift.io/hash\": \"9c69cc8d\"}, Annotations: {\"target.workload.openshift.io/management\": `{\"effect\": \"PreferredDuringScheduling\"}`}}, Spec: {Volumes: {{Name: \"default-certificate\", VolumeSource: {Secret: &{SecretName: \"default-ingress-cert\", DefaultMode: &420}}}, {Name: \"service-ca-bundle\", VolumeSource: {ConfigMap: &{LocalObjectReference: {Name: \"service-ca-bundle\"}, Items: {{Key: \"service-ca.crt\", Path: \"service-ca.crt\"}}, DefaultMode: &420, Optional: &false}}}, {Name: \"stats-auth\", VolumeSource: {Secret: &{SecretName: \"router-stats-default\", DefaultMode: &420}}}, {Name: \"metrics-certs\", VolumeSource: {Secret: &{SecretName: \"router-metrics-certs-default\", DefaultMode: &420}}}}, Containers: {{Name: \"router\", Image: \"registry.build05.ci.openshift.org/ci-op-vxbb8hxy/stable@sha256:f\"..., Ports: {{Name: \"http\", ContainerPort: 80, Protocol: \"TCP\"}, {Name: \"https\", ContainerPort: 443, Protocol: \"TCP\"}, {Name: \"metrics\", ContainerPort: 1936, Protocol: \"TCP\"}}, Env: {{Name: \"DEFAULT_CERTIFICATE_DIR\", Value: \"/etc/pki/tls/private\"}, {Name: \"DEFAULT_DESTINATION_CA_PATH\", Value: \"/var/run/configmaps/service-ca/service-ca.crt\"}, {Name: \"RELOAD_INTERVAL\", Value: \"5s\"}, {Name: \"ROUTER_ALLOW_WILDCARD_ROUTES\", Value: \"false\"}, ...}, ...}}, RestartPolicy: \"Always\", TerminationGracePeriodSeconds: &3600, ...}},\n \t\tStrategy: v1.DeploymentStrategy{\n- \t\t\tType: \"RollingUpdate\",\n+ \t\t\tType: \"\",\n- \t\t\tRollingUpdate: s\"&RollingUpdateDeployment{MaxUnavailable:25%,MaxSurge:25%,}\",\n+ \t\t\tRollingUpdate: nil,\n \t\t},\n \t\tMinReadySeconds: 30,\n \t\tRevisionHistoryLimit: &10,\n \t\t... 
// 2 identical fields\n \t},\n \tStatus: {ObservedGeneration: 1, Replicas: 1, UpdatedReplicas: 1, ReadyReplicas: 1, ...},\n }\n"} 2024-08-22T15:59:24.911Z INFO operator.route_metrics_controller controller/controller.go:114 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T15:59:24.911Z INFO operator.status_controller controller/controller.go:114 Reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T15:59:24.911Z INFO operator.certificate_controller controller/controller.go:114 Reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T15:59:24.911Z INFO operator.ingressclass_controller controller/controller.go:114 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T15:59:24.913Z INFO operator.ingress_controller controller/controller.go:114 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T15:59:24.924Z INFO operator.status_controller controller/controller.go:114 Reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T15:59:24.984Z INFO operator.ingress_controller ingress/deployment.go:135 updated router deployment {"namespace": "openshift-ingress", "name": "router-default", "diff": " &v1.Deployment{\n \tTypeMeta: {},\n \tObjectMeta: {Name: \"router-default\", Namespace: \"openshift-ingress\", UID: \"6cf98392-8782-4741-b5c9-ce63fb77879a\", ResourceVersion: \"7194\", ...},\n \tSpec: v1.DeploymentSpec{\n \t\tReplicas: &1,\n \t\tSelector: &{MatchLabels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\"}},\n \t\tTemplate: {ObjectMeta: {Labels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\", \"ingresscontroller.operator.openshift.io/hash\": \"9c69cc8d\"}, Annotations: {\"target.workload.openshift.io/management\": `{\"effect\": \"PreferredDuringScheduling\"}`}}, Spec: {Volumes: {{Name: \"default-certificate\", VolumeSource: {Secret: &{SecretName: \"default-ingress-cert\", DefaultMode: &420}}}, {Name: \"service-ca-bundle\", VolumeSource: {ConfigMap: &{LocalObjectReference: {Name: \"service-ca-bundle\"}, Items: {{Key: \"service-ca.crt\", Path: \"service-ca.crt\"}}, DefaultMode: &420, Optional: &false}}}, {Name: \"stats-auth\", VolumeSource: {Secret: &{SecretName: \"router-stats-default\", DefaultMode: &420}}}, {Name: \"metrics-certs\", VolumeSource: {Secret: &{SecretName: \"router-metrics-certs-default\", DefaultMode: &420}}}}, Containers: {{Name: \"router\", Image: \"registry.build05.ci.openshift.org/ci-op-vxbb8hxy/stable@sha256:f\"..., Ports: {{Name: \"http\", ContainerPort: 80, Protocol: \"TCP\"}, {Name: \"https\", ContainerPort: 443, Protocol: \"TCP\"}, {Name: \"metrics\", ContainerPort: 1936, Protocol: \"TCP\"}}, Env: {{Name: \"DEFAULT_CERTIFICATE_DIR\", Value: \"/etc/pki/tls/private\"}, {Name: \"DEFAULT_DESTINATION_CA_PATH\", Value: \"/var/run/configmaps/service-ca/service-ca.crt\"}, {Name: \"RELOAD_INTERVAL\", Value: \"5s\"}, {Name: \"ROUTER_ALLOW_WILDCARD_ROUTES\", Value: \"false\"}, ...}, ...}}, RestartPolicy: \"Always\", TerminationGracePeriodSeconds: &3600, ...}},\n \t\tStrategy: v1.DeploymentStrategy{\n- \t\t\tType: \"RollingUpdate\",\n+ \t\t\tType: \"\",\n- \t\t\tRollingUpdate: s\"&RollingUpdateDeployment{MaxUnavailable:25%,MaxSurge:25%,}\",\n+ \t\t\tRollingUpdate: nil,\n \t\t},\n \t\tMinReadySeconds: 30,\n \t\tRevisionHistoryLimit: &10,\n \t\t... 
// 2 identical fields\n \t},\n \tStatus: {ObservedGeneration: 1, Replicas: 1, UpdatedReplicas: 1, ReadyReplicas: 1, ...},\n }\n"} 2024-08-22T15:59:43.457Z INFO operator.ingress_controller controller/controller.go:114 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T15:59:43.539Z INFO operator.ingress_controller ingress/deployment.go:135 updated router deployment {"namespace": "openshift-ingress", "name": "router-default", "diff": " &v1.Deployment{\n \tTypeMeta: {},\n \tObjectMeta: {Name: \"router-default\", Namespace: \"openshift-ingress\", UID: \"6cf98392-8782-4741-b5c9-ce63fb77879a\", ResourceVersion: \"7194\", ...},\n \tSpec: v1.DeploymentSpec{\n \t\tReplicas: &1,\n \t\tSelector: &{MatchLabels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\"}},\n \t\tTemplate: {ObjectMeta: {Labels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\", \"ingresscontroller.operator.openshift.io/hash\": \"9c69cc8d\"}, Annotations: {\"target.workload.openshift.io/management\": `{\"effect\": \"PreferredDuringScheduling\"}`}}, Spec: {Volumes: {{Name: \"default-certificate\", VolumeSource: {Secret: &{SecretName: \"default-ingress-cert\", DefaultMode: &420}}}, {Name: \"service-ca-bundle\", VolumeSource: {ConfigMap: &{LocalObjectReference: {Name: \"service-ca-bundle\"}, Items: {{Key: \"service-ca.crt\", Path: \"service-ca.crt\"}}, DefaultMode: &420, Optional: &false}}}, {Name: \"stats-auth\", VolumeSource: {Secret: &{SecretName: \"router-stats-default\", DefaultMode: &420}}}, {Name: \"metrics-certs\", VolumeSource: {Secret: &{SecretName: \"router-metrics-certs-default\", DefaultMode: &420}}}}, Containers: {{Name: \"router\", Image: \"registry.build05.ci.openshift.org/ci-op-vxbb8hxy/stable@sha256:f\"..., Ports: {{Name: \"http\", ContainerPort: 80, Protocol: \"TCP\"}, {Name: \"https\", ContainerPort: 443, Protocol: \"TCP\"}, {Name: \"metrics\", ContainerPort: 1936, Protocol: \"TCP\"}}, Env: {{Name: \"DEFAULT_CERTIFICATE_DIR\", Value: \"/etc/pki/tls/private\"}, {Name: \"DEFAULT_DESTINATION_CA_PATH\", Value: \"/var/run/configmaps/service-ca/service-ca.crt\"}, {Name: \"RELOAD_INTERVAL\", Value: \"5s\"}, {Name: \"ROUTER_ALLOW_WILDCARD_ROUTES\", Value: \"false\"}, ...}, ...}}, RestartPolicy: \"Always\", TerminationGracePeriodSeconds: &3600, ...}},\n \t\tStrategy: v1.DeploymentStrategy{\n- \t\t\tType: \"RollingUpdate\",\n+ \t\t\tType: \"\",\n- \t\t\tRollingUpdate: s\"&RollingUpdateDeployment{MaxUnavailable:25%,MaxSurge:25%,}\",\n+ \t\t\tRollingUpdate: nil,\n \t\t},\n \t\tMinReadySeconds: 30,\n \t\tRevisionHistoryLimit: &10,\n \t\t... 
// 2 identical fields\n \t},\n \tStatus: {ObservedGeneration: 1, Replicas: 1, UpdatedReplicas: 1, ReadyReplicas: 1, ...},\n }\n"} 2024-08-22T16:01:07.866Z INFO operator.route_metrics_controller handler/enqueue_mapped.go:103 queueing ingresscontroller {"name": "default"} 2024-08-22T16:01:07.866Z INFO operator.route_metrics_controller handler/enqueue_mapped.go:103 queueing ingresscontroller {"name": "default"} 2024-08-22T16:01:07.866Z INFO operator.route_metrics_controller controller/controller.go:114 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T16:01:07.870Z INFO operator.route_metrics_controller handler/enqueue_mapped.go:103 queueing ingresscontroller {"name": "default"} 2024-08-22T16:01:07.870Z INFO operator.route_metrics_controller handler/enqueue_mapped.go:103 queueing ingresscontroller {"name": "default"} 2024-08-22T16:01:07.870Z INFO operator.route_metrics_controller controller/controller.go:114 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T16:01:07.899Z INFO operator.route_metrics_controller handler/enqueue_mapped.go:103 queueing ingresscontroller {"name": "default"} 2024-08-22T16:01:07.899Z INFO operator.route_metrics_controller handler/enqueue_mapped.go:103 queueing ingresscontroller {"name": "default"} 2024-08-22T16:01:07.899Z INFO operator.route_metrics_controller controller/controller.go:114 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}} 2024-08-22T16:01:08.957Z INFO operator.route_metrics_controller handler/enqueue_mapped.go:103 queueing ingresscontroller {"name": "default"} 2024-08-22T16:01:08.957Z INFO operator.route_metrics_controller handler/enqueue_mapped.go:103 queueing ingresscontroller {"name": "default"} 2024-08-22T16:01:08.957Z INFO operator.route_metrics_controller controller/controller.go:114 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
Version-Release number of selected component (if applicable):
4.17
The diff is:
❯ cat /tmp/msg.json | jq -r '.diff' &v1.Deployment{ TypeMeta: {}, ObjectMeta: {Name: "router-default", Namespace: "openshift-ingress", UID: "6cf98392-8782-4741-b5c9-ce63fb77879a", ResourceVersion: "6244", ...}, Spec: v1.DeploymentSpec{ Replicas: &1, Selector: &{MatchLabels: {"ingresscontroller.operator.openshift.io/deployment-ingresscontroller": "default"}}, Template: {ObjectMeta: {Labels: {"ingresscontroller.operator.openshift.io/deployment-ingresscontroller": "default", "ingresscontroller.operator.openshift.io/hash": "9c69cc8d"}, Annotations: {"target.workload.openshift.io/management": `{"effect": "PreferredDuringScheduling"}`}}, Spec: {Volumes: {{Name: "default-certificate", VolumeSource: {Secret: &{SecretName: "default-ingress-cert", DefaultMode: &420}}}, {Name: "service-ca-bundle", VolumeSource: {ConfigMap: &{LocalObjectReference: {Name: "service-ca-bundle"}, Items: {{Key: "service-ca.crt", Path: "service-ca.crt"}}, DefaultMode: &420, Optional: &false}}}, {Name: "stats-auth", VolumeSource: {Secret: &{SecretName: "router-stats-default", DefaultMode: &420}}}, {Name: "metrics-certs", VolumeSource: {Secret: &{SecretName: "router-metrics-certs-default", DefaultMode: &420}}}}, Containers: {{Name: "router", Image: "registry.build05.ci.openshift.org/ci-op-vxbb8hxy/stable@sha256:f"..., Ports: {{Name: "http", ContainerPort: 80, Protocol: "TCP"}, {Name: "https", ContainerPort: 443, Protocol: "TCP"}, {Name: "metrics", ContainerPort: 1936, Protocol: "TCP"}}, Env: {{Name: "DEFAULT_CERTIFICATE_DIR", Value: "/etc/pki/tls/private"}, {Name: "DEFAULT_DESTINATION_CA_PATH", Value: "/var/run/configmaps/service-ca/service-ca.crt"}, {Name: "RELOAD_INTERVAL", Value: "5s"}, {Name: "ROUTER_ALLOW_WILDCARD_ROUTES", Value: "false"}, ...}, ...}}, RestartPolicy: "Always", TerminationGracePeriodSeconds: &3600, ...}}, Strategy: v1.DeploymentStrategy{ - Type: "RollingUpdate", + Type: "", - RollingUpdate: s"&RollingUpdateDeployment{MaxUnavailable:25%,MaxSurge:25%,}", + RollingUpdate: nil, }, MinReadySeconds: 30, RevisionHistoryLimit: &10, ... // 2 identical fields }, Status: {ObservedGeneration: 1, Replicas: 1, UpdatedReplicas: 1, ReadyReplicas: 1, ...}, }
Description of problem:
Based on the results in [Sippy|https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=gcp&Platform=gcp&Scheduler=default&SecurityMode=default&Suite=unknown&Suite=unknown&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=operator-conditions&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Etcd&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20gcp%20unknown%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-08-19%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-08-13%2000%3A00%3A00&testId=Operator%20results%3A45d55df296fbbfa7144600dce70c1182&testName=operator%20conditions%20etcd], it appears that the periodic tests are not waiting for the etcd operator to complete before exiting. The test is supposed to wait for up to 20 mins after the final control plane machine is rolled, to allow operators to settle. But we are seeing the etcd operator triggering 2 further revisions after this happens. We need to understand if the etcd operator is correctly rolling out vs whether these changes should have rolled out prior to the final machine going away, and, understand if there's a way to add more stability to our checks to make sure that all of the operators stabilise, and, that they have been stable for at least some period (1 minute)
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
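For illustration only, a minimal sketch (not taken from the actual test suite) of requiring a stability condition to hold continuously for one minute before declaring the operators settled:

package example

import (
	"context"
	"time"
)

// waitStableFor polls check and only returns nil once it has reported true
// continuously for the whole stableFor window; any false reading resets the
// window. ctx bounds the overall wait.
func waitStableFor(ctx context.Context, check func(context.Context) (bool, error), stableFor, interval time.Duration) error {
	var stableSince time.Time
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		ok, err := check(ctx)
		if err != nil {
			return err
		}
		switch {
		case !ok:
			stableSince = time.Time{} // reset the stability window
		case stableSince.IsZero():
			stableSince = time.Now()
		case time.Since(stableSince) >= stableFor:
			return nil
		}

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}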
Description of problem:
On the NetworkPolicies page, the position of the title and the tabs does not match other pages. It should have the same style as the others: move the title above the tabs.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
If an instance type is specified in the install-config.yaml, the installer will try to validate its availability in the given region and that it meets the minimum requirements for OCP nodes. When that happens, the `ec2:DescribeInstanceTypes` permission is used, but it is not validated by the installer as a required permission for installs.
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
Always by setting an instanceType in the install-config.yaml
Steps to Reproduce:
1. 2. 3.
Actual results:
If you install with an user with minimal permissions, you'll get the error: level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: [controlPlane.platform.aws: Internal error: error listing instance types: fetching instance types: UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-8phprrsm-ccf9a-minimal-perm is not authorized to perform: ec2:DescribeInstanceTypes because no identity-based policy allows the ec2:DescribeInstanceTypes action level=error msg= status code: 403, request id: 559344f4-0fc3-4a6c-a6ee-738d4e1c0099, compute[0].platform.aws: Internal error: error listing instance types: fetching instance types: UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-8phprrsm-ccf9a-minimal-perm is not authorized to perform: ec2:DescribeInstanceTypes because no identity-based policy allows the ec2:DescribeInstanceTypes action level=error msg= status code: 403, request id: 584cc325-9057-4c31-bb7d-2f4458336605]
Expected results:
The installer fails with an explicit message saying that `ec2:DescribeInstanceTypes` is required.
Additional info:
Description of problem:
The cluster policy controller does not get the same feature flags that other components in the control plane are getting.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
1. Create hosted cluster
2. Get cluster-policy-controller-config configmap from control plane namespace
Actual results:
Default feature gates are not included in the config
Expected results:
Feature gates are included in the config
Additional info:
This E2E test checks whether etcd is able to block the rollout of a new revision when quorum is not safe.
Description of problem:
e980 is a valid system type for the Madrid region, but it is not listed as such in the installer.
Version-Release number of selected component (if applicable):
How reproducible:
Easily
Steps to Reproduce:
1. Try to deploy to mad02 with SysType set to e980
2. Fail
3.
Actual results:
Installer exits
Expected results:
Installer should continue as it's a valid system type.
Additional info:
Please review the following PR: https://github.com/openshift/csi-operator/pull/78
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When the user sets the Chinese language and checks the OpenShift Lightspeed nav modal, "Meet OpenShift Lightspeed" is translated as "OpenShift Lightspeed"; "Meet" is not translated.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-10-133523
How reproducible:
Always
Steps to Reproduce:
1. When Chinese language is set, check the "Meet OpenShift Lightspeed" on OpenShift Lightspeed nav modal. 2. 3.
Actual results:
1. The "Meet OpenShift Lightspeed" is translated to "OpenShift Lightspeed", "Meet" is not translated.
Expected results:
1. "Meet" could be translated in Chinese. It has been translated for other languages.
Additional info:
The HyperShift codebase has numerous examples of MustParse*() functions being used on non-constant input. This is not their intended use, as any failure will cause a panic in the controller.
In a few cases they are called on user-provided input, meaning any authenticated user can (intentionally or unintentionally) deny service to all other users by providing invalid input that continuously crashes the HostedCluster controller.
This is probably a security issue, but as I have already described it in https://github.com/openshift/hypershift/pull/4546 there is no reason to embargo it.
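As a minimal illustration of the distinction (a sketch using the standard library's net/netip, not the HyperShift code itself): MustParse-style helpers are meant for constant, compile-time-known input, while user-supplied values should go through the error-returning variants.

package main

import (
	"fmt"
	"net/netip"
)

// Fine: the input is a constant the author controls, so a panic here would
// only ever indicate a programming error.
var serviceCIDR = netip.MustParsePrefix("10.96.0.0/12")

// parseUserCIDR handles user-provided input with the error-returning variant
// so an invalid value surfaces as a normal error instead of crashing the
// controller.
func parseUserCIDR(s string) (netip.Prefix, error) {
	p, err := netip.ParsePrefix(s)
	if err != nil {
		return netip.Prefix{}, fmt.Errorf("invalid CIDR %q: %w", s, err)
	}
	return p, nil
}

func main() {
	fmt.Println(serviceCIDR)
	if _, err := parseUserCIDR("not-a-cidr"); err != nil {
		fmt.Println("rejected:", err) // reported, no panic
	}
}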
Description of problem:
oc-mirror should not panic when it fails to get the release signature.
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202407291514.p0.gdbf115f.assembly.stream.el9-dbf115f", GitCommit:"dbf115f547a19f12ab72e7b326be219a47d460a0", GitTreeState:"clean", BuildDate:"2024-07-29T15:52:52Z", GoVersion:"go1.22.4 (Red Hat 1.22.4-2.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
100%
Steps to Reproduce:
1) Mirror2disk + disk2mirror with the following imagesetconfig, and mirror to the enterprise registry:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    channels:
    - name: stable-4.15
      type: ocp
      minVersion: '4.15.18'
      maxVersion: '4.15.18'
2) Set up squid with a whitelist containing only the enterprise registry and the OSUS service:
cat /etc/squid/squid.conf
http_port 3128
coredump_dir /var/spool/squid
acl whitelist dstdomain "/etc/squid/whitelist"
http_access allow whitelist
http_access deny !whitelist
cat /etc/squid/whitelist
my-route-zhouy.apps.yinzhou-88.qe.devcluster.openshift.com ------------- registry route (oc get route -n your registry app's project)
update-service-oc-mirror-route-openshift-update-service.apps.yinzhou-88.qe.devcluster.openshift.com --- OSUS route (oc get route -n openshift-update-service)
sudo systemctl restart squid
export https_proxy=http://127.0.0.1:3128
export http_proxy=http://127.0.0.1:3128
3) Set the registry redirect with:
cat ~/.config/containers/registries.conf
[[registry]]
location = "quay.io"
insecure = false
blocked = false
mirror-by-digest-only = false
prefix = ""
[[registry.mirror]]
location = "my-route-zhouy.apps.yinzhou-88.qe.devcluster.openshift.com"
insecure = false
4) Use the same imagesetconfig and mirror to a new folder:
`oc-mirror -c config-38037.yaml file://new-folder --v2`
Actual results:
4) The oc-mirror command panics with the following error:
I0812 06:45:26.026441 199941 core-cincinnati.go:508] Using proxy 127.0.0.1:3128 to request updates from https://update-service-oc-mirror-route-openshift-update-service.apps.yinzhou-417.qe.devcluster.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.15&id=a6097264-8b29-438f-9e71-4aba1e9ec32d
2024/08/12 06:45:26 [ERROR] : http request Get "https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=0f55261077557d1bb909c06b115e0c79b0025677be57ba2f045495c11e2443ee/signature-1": Forbidden
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x3d1e3f6]
goroutine 1 [running]:
github.com/openshift/oc-mirror/v2/internal/pkg/release.SignatureSchema.GenerateReleaseSignatures({
, {0x4c7b348, 0x15}, {0xc000058c60, 0x1c, {...}, {...}, {...}, {...}, ...}, ..., ...}, ...)
/home/fedora/yinzhou/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/release/signature.go:97 +0x676
github.com/openshift/oc-mirror/v2/internal/pkg/release.(*CincinnatiSchema).GetReleaseReferenceImages(0xc0007203c0, {0x55caf68, 0x764c060})
/home/fedora/yinzhou/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/release/cincinnati.go:230 +0x70b
github.com/openshift/oc-mirror/v2/internal/pkg/release.(*LocalStorageCollector).ReleaseImageCollector(0xc000b12388, {0x55caf68, 0x764c060})
/home/fedora/yinzhou/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/release/local_stored_collector.go:58 +0x407
github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).CollectAll(0xc000ae8908, {0x55caf68, 0x764c060})
/home/fedora/yinzhou/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:955 +0x122
github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).RunMirrorToDisk(0xc000ae8908, 0xc0005f3b08, {0xa?, 0x20?, 0x20?})
/home/fedora/yinzhou/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:707 +0x1aa
github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).Run(0xc000ae8908, 0xc0005f1640?, {0xc0005f1640?, 0x0?, 0x0?})
/home/fedora/yinzhou/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:459 +0x149
github.com/openshift/oc-mirror/v2/internal/pkg/cli.NewMirrorCmd.func1(0xc0005f3b08, {0xc0005f1640, 0x1, 0x4})
/home/fedora/yinzhou/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:207 +0x32a
github.com/spf13/cobra.(*Command).execute(0xc0005f3b08, {0xc000166010, 0x4, 0x4})
/home/fedora/yinzhou/oc-mirror/vendor/github.com/spf13/cobra/command.go:987 +0xab1
github.com/spf13/cobra.(*Command).ExecuteC(0xc0005f3b08)
/home/fedora/yinzhou/oc-mirror/vendor/github.com/spf13/cobra/command.go:1115 +0x3ff
github.com/spf13/cobra.(*Command).Execute(0x741ec38?)
/home/fedora/yinzhou/oc-mirror/vendor/github.com/spf13/cobra/command.go:1039 +0x13
main.main()
/home/fedora/yinzhou/oc-mirror/cmd/oc-mirror/main.go:10 +0x18
Expected results:
The command may fail, but it should not panic.
Description of problem:
Before the fix for https://issues.redhat.com/browse/OCPBUGS-42253 is merged upstream and propagated, we can apply a temporary fix directly in the samples operator repo, unblocking us from the need to wait for that to happen.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. oc new-app openshift/rails-postgresql-example 2. 3.
Actual results:
app pod in crash loop
Expected results:
app working
Additional info:
Description of the problem:
BE 2.35.1 - OCP 4.17 ARM64 cluster - Selecting CNV in UI throws the following error:
Local Storage Operator is not available when arm64 CPU architecture is selected
How reproducible:
100%
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
Description of problem:
This PR introduces graceful shutdown functionality to the Multus daemon by adding a /readyz endpoint alongside the existing /healthz. The /readyz endpoint starts returning 500 once a SIGTERM is received, indicating the daemon is in shutdown mode. During this time, CNI requests can still be processed for a short window. The daemonset configs have been updated to increase terminationGracePeriodSeconds from 10 to 30 seconds, ensuring we have a bit more time for these clean shutdowns. This addresses a race condition during pod transitions where the readiness check might return true, but a subsequent CNI request could fail if the daemon shuts down too quickly. By introducing the /readyz endpoint and delaying the shutdown, we can handle ongoing CNI requests more gracefully, reducing the risk of disruptions during critical transitions. A rough sketch of this pattern is shown after this item.
Version-Release number of selected component (if applicable):
How reproducible:
Difficult to reproduce, might require CI signal
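A rough sketch of the readiness-flip pattern described in this item (an assumed shape, not the actual Multus daemon code); the port and drain window below are only examples:

package main

import (
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var shuttingDown atomic.Bool

	// /healthz keeps reporting liveness; /readyz flips to 500 once SIGTERM
	// arrives so new work stops being sent while in-flight CNI requests are
	// still served.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	http.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		if shuttingDown.Load() {
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)
	go func() {
		<-sigCh
		shuttingDown.Store(true)
		// Keep serving for a short drain window, well within the 30s
		// terminationGracePeriodSeconds, then exit.
		time.Sleep(20 * time.Second)
		os.Exit(0)
	}()

	_ = http.ListenAndServe(":8080", nil) // example port
}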
I talked with Gerd Oberlechner; the hack/app-sre/saas_template.yaml file is not used anymore in app-interface.
It should be safe to remove this.
Please review the following PR: https://github.com/openshift/cluster-api-provider-metal3/pull/21
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Rebase openshift/etcd to latest 3.5.16 upstream release.
Description of problem:
The bug fix for https://issues.redhat.com/browse/OCPBUGS-41184 introduced a machine type validation error.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-multi-2024-10-14-021053
How reproducible:
Always
Steps to Reproduce:
1. "create install-config", and then insert the machine type settings (see [1]) 2. "create manifests" (or "create cluster")
Actual results:
ERROR failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: [controlPlane.platform.gcp.type: Not found: "custom", compute[0].platform.gcp.type: Not found: "custom"]
Expected results:
Success
Additional info:
FYI the 4.17 PROW CI test failure: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-gcp-ipi-mini-perm-custom-type-f28/1845589157397663744
The Telemetry userPreference added to the General tab in https://github.com/openshift/console/pull/13587 results in empty nodes being output to the DOM. This results in extra spacing any time a new user preference is added to the bottom of the General tab.
Description of problem:
The issue comes from https://issues.redhat.com/browse/OCPBUGS-37540?focusedId=25386451&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25386451. An error message is shown when gathering the bootstrap log bundle even though the log bundle gzip file is generated: ERROR Invalid log bundle or the bootstrap machine could not be reached and bootstrap logs were not collected.
Version-Release number of selected component (if applicable):
4.17+
How reproducible:
Always
Steps to Reproduce:
1. Run `openshift-install gather bootstrap --dir <install-dir>` 2. 3.
Actual results:
Error message shown in output of command `openshift-install gather bootstrap --dir <install-dir>`
Expected results:
No error message shown there.
Additional info:
Analysis from Rafael, https://issues.redhat.com/browse/OCPBUGS-37540?focusedId=25387767&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25387767
After the multi-VC changes were merged, when we use this tool the following warnings get logged:
E0812 13:04:34.813216 13159 config_yaml.go:208] Unmarshal failed: yaml: unmarshal errors: line 1: cannot unmarshal !!seq into config.CommonConfigYAML I0812 13:04:34.813376 13159 config.go:272] ReadConfig INI succeeded. INI-based cloud-config is deprecated and will be removed in 2.0. Please use YAML based cloud-config.
This looks a bit scarier than it should.
Description of problem:
storageNotConfiguredMessage contains a link to https://docs.openshift.com/container-platform/%s/monitoring/configuring-the-monitoring-stack.html, which leads to a 404; it needs to be changed to https://docs.openshift.com/container-platform/%s/observability/monitoring/configuring-the-monitoring-stack.html
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
The fields in the Shipwright build form show no hints or default values. They should provide examples and hints to help users provide correct values when creating a build.
For example:
Description of problem:
IHAC running a 4.16.1 OCP cluster. In their cluster the image-registry pod is restarting with the messages below:
message: "/image-registry/vendor/github.com/aws/aws-sdk-go/service/s3/api.go:7629 +0x1d0\ngithub.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).doWalk(0xc000a3c120, {0x28924c0, 0xc0001f5b20}, 0xc00083bab8, {0xc00125b7d1, 0x20}, {0x2866860, 0x1}, 0xc00120a8d0)\n\t/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/driver/s3-aws/s3.go:1135 +0x348\ngithub.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).Walk(0xc000675ec0?, {0x28924c0, 0xc0001f5b20}, {0xc000675ec0, 0x20}, 0xc00083bc10?)\n\t/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/driver/s3-aws/s3.go:1095 +0x148\ngithub.com/distribution/distribution/v3/registry/storage/driver/base.(*Base).Walk(0xc000519480, {0x2892778?, 0xc00012cf00?}, {0xc000675ec0, 0x20}, 0x1?)\n\t/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/driver/base/base.go:237 +0x237\ngithub.com/distribution/distribution/v3/registry/storage.getOutstandingUploads({0x2892778, 0xc00012cf00}, {0x289d728?, 0xc000519480})\n\t/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/purgeuploads.go:70 +0x1f9\ngithub.com/distribution/distribution/v3/registry/storage.PurgeUploads({0x2892778, 0xc00012cf00}, {0x289d728?, 0xc000519480?}, {0xc1a937efcf6aec96, 0xfffddc8e973b8a89, 0x3a94520}, 0x1)\n\t/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/purgeuploads.go:34 +0x12d\ngithub.com/distribution/distribution/v3/registry/handlers.startUploadPurger.func1()\n\t/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/handlers/app.go:1139 +0x33f\ncreated by github.com/distribution/distribution/v3/registry/handlers.startUploadPurger in goroutine 1\n\t/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/handlers/app.go:1127 +0x329\n" reason: Error startedAt: "2024-08-27T09:08:14Z" name: registry ready: true restartCount: 250 started: true
Version-Release number of selected component (if applicable):
4.16.1
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
all the pods are restarting
Expected results:
It should not restart.
Additional info:
https://redhat-internal.slack.com/archives/C013VBYBJQH/p1724761756273879 upstream report: https://github.com/distribution/distribution/issues/4358
Service: sorting of the Labels, Pod selector, and Location columns doesn't work
Routes: sorting of all columns doesn't work
Ingress: sorting of the Host column doesn't work
Please review the following PR: https://github.com/openshift/cluster-config-operator/pull/426
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Add a new monitor test: API unreachable interval from the client perspective.
Please review the following PR: https://github.com/openshift/cluster-samples-operator/pull/559
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/364
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
4.15 and 4.16
$ oc explain prometheus.spec.remoteWrite.sendExemplars
GROUP: monitoring.coreos.com
KIND: Prometheus
VERSION: v1
FIELD: sendExemplars <boolean>
DESCRIPTION:
Enables sending of exemplars over remote write. Note that exemplar-storage itself must be enabled using the `spec.enableFeature` option for exemplars to be scraped in the first place. It requires Prometheus >= v2.27.0.
no `spec.enableFeature` option
$ oc explain prometheus.spec.enableFeature
GROUP: monitoring.coreos.com
KIND: Prometheus
VERSION: v1
error: field "enableFeature" does not exist
should be `spec.enableFeatures`
$ oc explain prometheus.spec.enableFeatures
GROUP: monitoring.coreos.com
KIND: Prometheus
VERSION: v1
FIELD: enableFeatures <[]string>
DESCRIPTION:
Enable access to Prometheus feature flags. By default, no features are enabled. Enabling features which are disabled by default is entirely outside the scope of what the maintainers will support and by doing so, you accept that this behaviour may break at any time without notice. For more information see https://prometheus.io/docs/prometheus/latest/feature_flags/
Version-Release number of selected component (if applicable):
4.15 and 4.16
How reproducible:
always
Description of problem:
When a user is trying to deploy a Hosted Cluster using HyperShift, if the hostedCluster CR defines, under Spec.Configuration.Proxy.HTTPSProxy, a proxy URL that is missing the port (because it uses the default port), this value is passed by this code into the "kube-apiserver-proxy" yaml manifest under spec.containers.command, like below:
$ oc get pod -n kube-system kube-apiserver-proxy-xxxxx -o yaml | yq '.spec.containers[].command'
[
  "control-plane-operator",
  "kubernetes-default-proxy",
  "listen-addr=172.20.0.1:6443",
  "proxy-addr=example.proxy.com",
  "-apiserver-addr=<apiserver-IP>:<port>"
]
Then this code will parse these values.
This command has these flags, which will be used by the container to do the API calls.
The net.Dial function from the Go net package expects host/IP:port. Check the docs here: https://pkg.go.dev/net#Dial
For TCP and UDP networks, the address has the form "host:port". The host must be a literal IP address, or a host name that can be resolved to IP addresses. The port must be a literal port number or a service name.
So the pod will end up having this issue:
2024-08-19T06:55:44.831593820Z {"level":"error","ts":"2024-08-19T06:55:44Z","logger":"kubernetes-default-proxy","msg":"failed diaing backend","proxyAddr":"example.proxy.com","error":"dial tcp: address example.proxy.com: missing port in address","stacktrace":"github.com/openshift/hypershift/kubernetes-default-proxy.(*server).run.func1\n\t/hypershift/kubernetes-default-proxy/kubernetes_default_proxy.go:89"}
Some ideas on how to solve this are below:
How reproducible:
Try to deploy a Hosted Cluster using the HyperShift operator with a proxy URL without a port (e.g. <example.proxy.com> instead of <example.proxy.com>:<port>) in the hostedCluster CR under "Spec.Configuration.Proxy.HTTPSProxy". This results in the error below in the kube-apiserver-proxy container: "missing port in address"
Actual results:
The kube-apiserver-proxy container returns "missing port in address"
Expected results:
The kube-apiserver-proxy container should not return "missing port in address".
Additional info:
This can be worked around by adding a ":" and a port number after the proxy IP/URL in the hostedCluster "Spec.Configuration.Proxy.HTTPSProxy" field.
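For illustration, a minimal sketch (an assumption, not the actual control-plane-operator code) of normalizing a proxy address so that net.Dial always receives host:port; the default port used here is only an example:

package main

import (
	"fmt"
	"net"
)

// ensurePort returns addr unchanged when it already has a port and appends
// defaultPort otherwise, so net.Dial("tcp", addr) does not fail with
// "missing port in address". The default port value is an assumption.
func ensurePort(addr, defaultPort string) string {
	if _, _, err := net.SplitHostPort(addr); err == nil {
		return addr
	}
	return net.JoinHostPort(addr, defaultPort)
}

func main() {
	fmt.Println(ensurePort("example.proxy.com", "3128"))      // example.proxy.com:3128
	fmt.Println(ensurePort("example.proxy.com:8080", "3128")) // unchanged
}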
Please review the following PR: https://github.com/openshift/cluster-storage-operator/pull/500
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
AWS EBS, Azure Disk and Azure File operators are now built from cmd/ and pkg/; no code is used from the legacy/ directory, so we should remove it.
There are still test manifests in the legacy/ directory that are in use. They need to be moved somewhere else, and Dockerfile.*.test and the CI steps must be updated.
Technically, this is a copy of STOR-1797, but we need a bug to be able to backport aws-ebs changes to 4.15 and not use legacy/ directory there too.
When https://github.com/openshift/machine-config-operator/pull/4597 landed, bootstrap test startup began to fail because it doesn't install the required CRDs. This is because the CRDs no longer live in the MCO repo and the startup code needs to be reconciled to pick up the MCO-specific CRDs from the o/api repo.
Please review the following PR: https://github.com/openshift/ironic-image/pull/539
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Prow jobs upgrading from 4.9 to 4.16 are failing when they upgrade from 4.12 to 4.13. Nodes become NotReady when MCO tries to apply the new 4.13 configuration to the MCPs. The failing job is: periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.9-azure-ipi-f28
We have reproduced the issue and we found an ordering cycle error in the journal log:
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 systemd-journald.service[838]: Runtime Journal (/run/log/journal/960b04f10e4f44d98453ce5faae27e84) is 8.0M, max 641.9M, 633.9M free.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found ordering cycle on network-online.target/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on node-valid-hostname.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on ovs-configuration.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on firstboot-osupdate.target/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on machine-config-daemon-firstboot.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on machine-config-daemon-pull.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Job network-online.target/start deleted to break ordering cycle starting with machine-config-daemon-pull.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: Queued start job for default target Graphical Interface.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: systemd-journald.service: unit configures an IP firewall, but the local system does not support BPF/cgroup firewalling.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: (This warning is only shown for the first unit using IP firewalling.)
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: systemd-journald.service: Deactivated successfully.
Version-Release number of selected component (if applicable):
Using IPI on Azure, these are the versions involved in the current issue upgrading from 4.9 to 4.13:
version: 4.13.0-0.nightly-2024-07-23-154444
version: 4.12.0-0.nightly-2024-07-23-230744
version: 4.11.59
version: 4.10.67
version: 4.9.59
How reproducible:
Always
Steps to Reproduce:
1. Upgrade an IPI on Azure cluster from 4.9 to 4.13. Theoretically, upgrading from 4.12 to 4.13 should be enough, but we reproduced it following the whole path.
Actual results:
Nodes become not ready
$ oc get nodes
NAME                                                 STATUS                        ROLES    AGE     VERSION
ci-op-g94jvswm-cc71e-998q8-master-0                  Ready                         master   6h14m   v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-master-1                  Ready                         master   6h13m   v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-master-2                  NotReady,SchedulingDisabled   master   6h13m   v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus1-c7ngb   NotReady,SchedulingDisabled   worker   6h2m    v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus2-2ppf6   Ready                         worker   6h4m    v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus3-nqshj   Ready                         worker   6h6m    v1.25.16+306a47e
And in the NotReady nodes we can see the ordering cycle error mentioned in the description of this ticket.
Expected results:
No ordering cycle error should happen and the upgrade should be executed without problems.
Additional info:
Description of problem:
When machineconfig fails to generate, we set upgradeable=false and degrade pools. The expectation is that the CO would also degrade after some time (normally 30 minutes) since master pool is degraded, but that doesn't seem to be happening. Based on our initial investigation, the event/degrade is happening but it seems to be being cleared.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Should be always
Steps to Reproduce:
1. Apply a wrong config, such as a bad image.config object:
   spec:
     registrySources:
       allowedRegistries:
       - test.reg
       blockedRegistries:
       - blocked.reg
2. Upgrade the cluster or roll out a new MCO pod
3. Observe that pools are degraded but the CO isn't
Actual results:
Expected results:
Additional info:
Occasional machine-config daemon panics in test-preview. For example this run has:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1076/pull-ci-openshift-cluster-version-operator-master-e2e-aws-ovn-techpreview/1819082707058036736
And the referenced logs include a full stack trace, the crux of which appears to be:
E0801 19:23:55.012345 2908 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 127 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x2424b80, 0x4166150})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0004d5340?})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x2424b80?, 0x4166150?})
	/usr/lib/golang/src/runtime/panic.go:770 +0x132
github.com/openshift/machine-config-operator/pkg/helpers.ListPools(0xc0007c5208, {0x0, 0x0})
	/go/src/github.com/openshift/machine-config-operator/pkg/helpers/helpers.go:142 +0x17d
github.com/openshift/machine-config-operator/pkg/helpers.GetPoolsForNode({0x0, 0x0}, 0xc0007c5208)
	/go/src/github.com/openshift/machine-config-operator/pkg/helpers/helpers.go:66 +0x65
github.com/openshift/machine-config-operator/pkg/daemon.(*PinnedImageSetManager).handleNodeEvent(0xc000a98480, {0x27e9e60?, 0xc0007c5208})
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/pinned_image_set.go:955 +0x92
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?name=^periodic&type=junit&search=machine-config-daemon.*Observed+a+panic' | grep 'failures match'
periodic-ci-openshift-release-master-ci-4.17-e2e-azure-ovn-techpreview (all) - 37 runs, 62% failed, 13% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-techpreview-serial (all) - 6 runs, 83% failed, 20% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-techpreview (all) - 5 runs, 60% failed, 33% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.17-ocp-e2e-aws-ovn-arm64-techpreview-serial (all) - 10 runs, 40% failed, 25% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-techpreview-serial (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-techpreview-serial (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-techpreview-serial (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.17-ocp-e2e-aws-ovn-arm64-techpreview (all) - 10 runs, 40% failed, 25% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.18-e2e-gcp-ovn-techpreview (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-techpreview-serial (all) - 6 runs, 17% failed, 200% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-single-node-techpreview-serial (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-single-node-techpreview (all) - 7 runs, 57% failed, 50% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-techpreview (all) - 18 runs, 17% failed, 33% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-aws-ovn-arm64-techpreview-serial (all) - 11 runs, 18% failed, 50% of failures match = 9% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-azure-ovn-techpreview-serial (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
Looks like ~15% impact in the CI runs that CI Search turns up.
Run lots of CI. Look for MCD panics.
CI Search results above.
No hits.
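A minimal, illustrative sketch of the kind of defensive guard that avoids this class of panic in an informer event handler, assuming (as the stack trace suggests) that the handler can run before its listers are initialized; the types and names below are simplified stand-ins, not the MCO's actual code:

```go
package main

import "fmt"

// Node stands in for *corev1.Node; the real handler would use the Kubernetes
// API types and generated listers.
type Node struct{ Name string }

type manager struct {
	// listPools is nil until informer caches have synced and listers are wired up.
	listPools func() ([]string, error)
}

// handleNodeEvent guards against both an unexpected object and uninitialized
// state before touching anything that could be nil.
func (m *manager) handleNodeEvent(obj interface{}) {
	node, ok := obj.(*Node)
	if !ok || node == nil {
		fmt.Printf("unexpected object in node event handler: %T\n", obj)
		return
	}
	if m.listPools == nil {
		fmt.Println("pool lister not initialized yet, skipping event")
		return
	}
	pools, err := m.listPools()
	if err != nil {
		fmt.Println("listing pools:", err)
		return
	}
	fmt.Printf("node %s matched against %d pools\n", node.Name, len(pools))
}

func main() {
	m := &manager{}                            // lister not yet wired up
	m.handleNodeEvent(&Node{Name: "master-0"}) // skipped safely instead of panicking
}
```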
Description of problem:
Infrastructure object with platform None is ignored by node-joiner tool
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Always
Steps to Reproduce:
1. Run the node-joiner add-nodes command
Actual results:
Currently the node-joiner tool retrieves the platform type from the kube-system/cluster-config-v1 config map
Expected results:
Retrieve the platform type from the infrastructure cluster object
Additional info:
Description of problem:
All openstack-cinder-csi-driver-node pods are in CrashLoopBackOff status during IPI installation with a proxy configured:
2024-10-18 11:27:41.936 | NAMESPACE NAME READY STATUS RESTARTS AGE
2024-10-18 11:27:41.946 | openshift-cluster-csi-drivers openstack-cinder-csi-driver-node-9dkwz 1/3 CrashLoopBackOff 61 (59s ago) 106m
2024-10-18 11:27:41.956 | openshift-cluster-csi-drivers openstack-cinder-csi-driver-node-cdf2d 1/3 CrashLoopBackOff 53 (19s ago) 90m
2024-10-18 11:27:41.966 | openshift-cluster-csi-drivers openstack-cinder-csi-driver-node-chnj6 1/3 CrashLoopBackOff 61 (85s ago) 106m
2024-10-18 11:27:41.972 | openshift-cluster-csi-drivers openstack-cinder-csi-driver-node-fwgg4 1/3 CrashLoopBackOff 53 (32s ago) 90m
2024-10-18 11:27:41.979 | openshift-cluster-csi-drivers openstack-cinder-csi-driver-node-h5qg8 1/3 CrashLoopBackOff 61 (88s ago) 106m
2024-10-18 11:27:41.989 | openshift-cluster-csi-drivers openstack-cinder-csi-driver-node-jbzj9 1/3 CrashLoopBackOff 52 (42s ago) 90m
The pod complains with below:
2024-10-18T11:20:57.226298852Z W1018 11:20:57.226085 1 main.go:87] Failed to GetOpenStackProvider: Get "https://10.46.44.29:13000/": dial tcp 10.46.44.29:13000: i/o timeout
It looks like it is not using the proxy to reach the OSP API.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-16-094159
Must-gather for 4.18 proxy installation (& must-gather for successful 4.17 proxy installation for comparison) in private comment.
After changing internalJoinSubnet and internalTransitSwitchSubnet on day 2 and performing live migration, the ovnkube node pod crashed.
The network configuration is shown below; the service CIDR uses the same subnet as the OVN default internalTransitSwitchSubnet:
clusterNetwork:
- cidr: 100.64.0.0/15
  hostPrefix: 23
serviceNetwork:
- 100.88.0.0/16
and then:
oc patch network.operator.openshift.io cluster --type='merge' -p='{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipv4":{"internalJoinSubnet": "100.82.0.0/16"}}}}}'
oc patch network.operator.openshift.io cluster --type='merge' -p='{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipv4":{"internalTransitSwitchSubnet": "100.69.0.0/16"}}}}}'
with error:
start-ovnkube-node ${OVN_KUBE_LOG_LEVEL} 29103 29105 State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: Error Message: EmptyLbEvents:false PodIP: RawNoHostSubnetNodes:migration.network.openshift.io/plugin= NoHostSubnetNodes:<nil> HostNetworkNamespace:openshift-host-network PlatformType:AWS HealthzBindAddress:0.0.0.0:10256 CompatMetricsBindAddress: CompatOVNMetricsBindAddress: CompatMetricsEnablePprof:false DNSServiceNamespace:openshift-dns DNSServiceName:dns-default} Metrics:{BindAddress: OVNMetricsBindAddress: ExportOVSMetrics:false EnablePprof:false NodeServerPrivKey: NodeServerCert: EnableConfigDuration:false EnableScaleMetrics:false} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} Gateway:{Mode:shared Interface: EgressGWInterface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false V4JoinSubnet:100.64.0.0/16 V6JoinSubnet:fd98::/64 V4MasqueradeSubnet:100.254.0.0/17 V6MasqueradeSubnet:fd69::/125 MasqueradeIPs:
{V4OVNMasqueradeIP:169.254.169.1 V6OVNMasqueradeIP:fd69::1 V4HostMasqueradeIP:169.254.169.2 V6HostMasqueradeIP:fd69::2 V4HostETPLocalMasqueradeIP:169.254.169.3 V6HostETPLocalMasqueradeIP:fd69::3 V4DummyNextHopMasqueradeIP:169.254.169.4 V6DummyNextHopMasqueradeIP:fd69::4 V4OVNServiceHairpinMasqueradeIP:169.254.169.5 V6OVNServiceHairpinMasqueradeIP:fd69::5}DisablePacke
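A minimal sketch of the kind of up-front validation that would catch this configuration, checking whether the chosen (or default) internalTransitSwitchSubnet overlaps the serviceNetwork; cidrsOverlap is an illustrative helper, not the actual OVN-Kubernetes validation code:

```go
package main

import (
	"fmt"
	"net"
)

// cidrsOverlap reports whether two CIDRs share any addresses: they overlap
// when either network contains the other's base address.
func cidrsOverlap(a, b string) (bool, error) {
	_, na, err := net.ParseCIDR(a)
	if err != nil {
		return false, err
	}
	_, nb, err := net.ParseCIDR(b)
	if err != nil {
		return false, err
	}
	return na.Contains(nb.IP) || nb.Contains(na.IP), nil
}

func main() {
	serviceNetwork := "100.88.0.0/16"
	transitSwitchSubnet := "100.88.0.0/16" // the default subnet, per this report
	overlap, err := cidrsOverlap(serviceNetwork, transitSwitchSubnet)
	if err != nil {
		panic(err)
	}
	fmt.Println("serviceNetwork overlaps internalTransitSwitchSubnet:", overlap) // true
}
```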
Please review the following PR: https://github.com/openshift/prom-label-proxy/pull/376
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
For example, the toggle buttons on the Node and Pod logs pages don't have unique identifiers, so it's hard to locate these buttons during automation.
The `Select a path` toggle button has the definition
<button class="pf-v5-c-menu-toggle" type="button" aria-label="Select a path" aria-expanded="false">
  <span class="pf-v5-c-menu-toggle__text">openshift-apiserver</span>
  <span class="pf-v5-c-menu-toggle__controls">...........
</button>
The `Select a log file` toggle button
<button class="pf-v5-c-menu-toggle" type="button" aria-expanded="false">
  <span class="pf-v5-c-menu-toggle__text">Select a log file </span><span class="pf-v5-c-menu-toggle__controls">.......
</button>
Since we have many toggle buttons on the page, it's quite hard to locate them without distinguishable identifiers.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/126
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Placeholder for bumping CAPO in the installer.
Description of problem:
QE Liang Quan requested a review of https://github.com/openshift/origin/pull/28912 and the OWNERS file doesn't reflect current staff available to review.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
N/A
Steps to Reproduce:
1. 2. 3.
Actual results:
OWNERS file contains:
- danehans
- frobware
- knobunc
- Miciah
- miheer
- sgreene570
Expected results:
Add new OWNERS as reviewers/approvers:
- alebedev87
- candita
- gcs278
- rfredette
- Thealisyed
- grzpiotrowski
Move old OWNERS to emeritus_approvers:
- danehans
- sgreene570
Additional info:
Example in https://github.com/openshift/cluster-ingress-operator/blob/master/OWNERS
Component Readiness has found a potential regression in the following test:
operator conditions control-plane-machine-set
Probability of significant regression: 100.00%
Sample (being evaluated) Release: 4.17
Start Time: 2024-08-03T00:00:00Z
End Time: 2024-08-09T23:59:59Z
Success Rate: 92.05%
Successes: 81
Failures: 7
Flakes: 0
Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 429
Failures: 0
Flakes: 0
Description of problem:
Operator is not getting installed. There are multiple install plans getting created/deleted for the same operator. There is no error indicated in the subscription or anywhere else. The bundle unpacking job is completed.
Images: quay.io/nigoyal/odf-operator-bundle:v0.0.1 quay.io/nigoyal/odf-operator-catalog:v0.0.1
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
Create the below manifests:
---
apiVersion: v1
kind: Namespace
metadata:
  labels:
    openshift.io/cluster-monitoring: "true"
    pod-security.kubernetes.io/audit: baseline
    pod-security.kubernetes.io/audit-version: v1.25
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: v1.25
    pod-security.kubernetes.io/warn: baseline
    pod-security.kubernetes.io/warn-version: v1.25
  name: openshift-storage
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: odf-operatorgroup
  namespace: openshift-storage
spec:
  targetNamespaces:
  - openshift-storage
---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: odf-catalogsource
  namespace: openshift-storage
spec:
  grpcPodConfig:
    securityContextConfig: legacy
  displayName: Openshift Data Foundation
  image: quay.io/nigoyal/odf-operator-catalog:v0.0.1
  priority: 100
  publisher: ODF
  sourceType: grpc
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: odf-subscription
  namespace: openshift-storage
spec:
  channel: alpha
  name: odf-operator
  source: odf-catalogsource
  sourceNamespace: openshift-storage
Actual results:
Operator is not getting installed.
Expected results:
Operator should get installed.
Additional info:
The bundle is a unified bundle created from multiple bundles.
Slack Discussion: https://redhat-internal.slack.com/archives/C3VS0LV41/p1726026365936859
Description of the problem:
When attempting to install a spoke cluster, the AgentClusterInstall is not being generated correctly due to release image certificate not being trusted
- lastProbeTime: "2024-08-20T20:10:16Z" lastTransitionTime: "2024-08-20T20:10:16Z" message: "The Spec could not be synced due to backend error: failed to get release image 'quay.io/openshift-release-dev/ocp-release@sha256:58c9cdeddb33100ee29441e374467592cbd39c3fc56552c57bf2a183a85025f3'. Please ensure the releaseImage field in ClusterImageSet '4.17.0' is valid, (error: command 'oc adm release info -o template --template '{{.metadata.version}}' --insecure=false --icsp-file=/tmp/icsp-file98462205 quay.io/openshift-release-dev/ocp-release@sha256:58c9cdeddb33100ee29441e374467592cbd39c3fc56552c57bf2a183a85025f3 --registry-config=/tmp/registry-config740495490' exited with non-zero exit code 1: \nFlag --icsp-file has been deprecated, support for it will be removed in a future release. Use --idms-file instead.\nerror: unable to read image quay.io/openshift-release-dev/ocp-release@sha256:58c9cdeddb33100ee29441e374467592cbd39c3fc56552c57bf2a183a85025f3: Get \"https://quay.io/v2/\": tls: failed to verify certificate: x509: certificate signed by unknown authority\n)."
How reproducible:
Intermittent
Steps to reproduce:
1. Attempt to create cluster resources after assisted-service is running
Actual results:
AgentClusterInstall fails due to certificate errors
Expected results:
The registry housing the release image has its certificate verified correctly
Additional Info:
Restarting the assisted-service pod fixes the issue. It seems like there is a race condition between the operator setting up the configmap with the correct contents and the assisted pod starting and mounting the configmap to /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
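For illustration, a minimal sketch of how a CA bundle mounted at a path like /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem typically backs TLS verification; if the pod starts before the operator has populated the ConfigMap, the pool lacks the registry CA and requests fail with exactly the "unknown authority" error above. This is not assisted-service's actual implementation:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
)

func clientWithCABundle(bundlePath string) (*http.Client, error) {
	pem, err := os.ReadFile(bundlePath)
	if err != nil {
		return nil, fmt.Errorf("reading CA bundle: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pem) {
		// An empty or partially written bundle leaves the pool without the
		// registry CA, producing "certificate signed by unknown authority".
		return nil, fmt.Errorf("no certificates parsed from %s", bundlePath)
	}
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: pool},
		},
	}, nil
}

func main() {
	c, err := clientWithCABundle("/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem")
	if err != nil {
		fmt.Println("cannot build client:", err)
		return
	}
	resp, err := c.Get("https://quay.io/v2/")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```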
Please review the following PR: https://github.com/openshift/machine-api-provider-aws/pull/111
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-machine-api-provider-aws-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Description of problem:
While upgrading the cluster from the web console, the below warning message was observed:
~~~
Warning alert: Admission Webhook Warning
ClusterVersion version violates policy 299 - "unknown field \"spec.desiredUpdate.channels\"", 299 - "unknown field \"spec.desiredUpdate.url\""
~~~
There are no such fields in the ClusterVersion YAML for which the warning message fired. From the documentation here: https://docs.openshift.com/container-platform/4.16/rest_api/config_apis/clusterversion-config-openshift-io-v1.html it's possible to see that "spec.desiredUpdate" exists, but there is no mention of the values "channels" or "url" under desiredUpdate.
Note: This is not impacting the cluster upgrade. However, it is creating confusion among customers due to the warning message.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Everytime
Steps to Reproduce:
1. Install a cluster of version 4.16.4
2. Upgrade the cluster from the web console to the next minor version
Actual results:
The Admission Webhook Warning described above is shown in the web console during the upgrade.
Expected results:
Upgrade should proceed with no such warnings.
Additional info:
Description of problem:
Upon upgrade of 4.16.15, OLM is failing to upgrade operator cluster service versions due to a TLS validation error. From the OLM controller manager pod, logs show this:
oc logs -n openshift-operator-lifecycle-manager olm-operator-7c9f76554-j22j5 | grep "tls" | head
"tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "Red Hat, Inc.")"
It's also observed in the api-server-operator logs that many webhooks are affected with the following errors:
$ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-8445495998-s6wgd | grep "failed to connect" | tail
W1018 21:44:07.641047 1 degraded_webhook.go:147] failed to connect to webhook "machineautoscalers.autoscaling.openshift.io" via service "cluster-autoscaler-operator.openshift-machine-api.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority
W1018 21:44:08.647623 1 degraded_webhook.go:147] failed to connect to webhook "machineautoscalers.autoscaling.openshift.io" via service "cluster-autoscaler-operator.openshift-machine-api.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority
W1018 21:53:58.542660 1 degraded_webhook.go:147] failed to connect to webhook "clusterautoscalers.autoscaling.openshift.io" via service "cluster-autoscaler-operator.openshift-machine-api.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority
This is causing the OLM controller to hang and is failing to install/upgrade operators based on the OLM controller logs.
How reproducible:
Very reproducible upon upgrade from 4.16.14 to 4.16.15 on any OpenShift Dedicated or ROSA OpenShift cluster.
Steps to Reproduce:
1. Install an OSD or ROSA cluster at 4.16.14 or below
2. Upgrade to 4.16.15
3. Attempt to install or upgrade an operator via a new ClusterServiceVersion
Actual results:
# API SERVER OPERATOR
$ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-666b796d8b-lqp56 | grep "failed to connect" | tail
W1013 20:59:49.131870 1 degraded_webhook.go:147] failed to connect to webhook "webhook.pipeline.tekton.dev" via service "tekton-pipelines-webhook.openshift-pipelines.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "tekton-pipelines-webhook.openshift-pipelines.svc")
W1013 20:59:50.147945 1 degraded_webhook.go:147] failed to connect to webhook "webhook.pipeline.tekton.dev" via service "tekton-pipelines-webhook.openshift-pipelines.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "tekton-pipelines-webhook.openshift-pipelines.svc")
# OLM
$ oc logs -n openshift-operator-lifecycle-manager olm-operator-7c9f76554-j22j5 | grep "tls" | head
2024/10/13 12:00:08 http: TLS handshake error from 10.128.18.80:53006: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "Red Hat, Inc.")
2024/10/14 11:45:05 http: TLS handshake error from 10.130.19.10:36766: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "Red Hat, Inc.")
Expected results:
no tls validation errors upon upgrade or installation of operators via OLM
Additional info:
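For context on the "parent certificate cannot sign this kind of certificate" message: Go's x509 verifier rejects a chain whose parent certificate is not marked as a CA or lacks the certificate-signing key usage. A small, illustrative inspection sketch (not the OLM code); the parent.pem path is a placeholder:

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
)

// inspectParent loads a PEM certificate and reports whether it is even allowed
// to sign other certificates; a serving cert reused as a "CA" fails here, and
// Go reports "parent certificate cannot sign this kind of certificate".
func inspectParent(path string) error {
	data, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	block, _ := pem.Decode(data)
	if block == nil {
		return fmt.Errorf("no PEM block in %s", path)
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return err
	}
	fmt.Printf("subject=%q IsCA=%v certSign=%v\n",
		cert.Subject.CommonName,
		cert.IsCA,
		cert.KeyUsage&x509.KeyUsageCertSign != 0)
	return nil
}

func main() {
	if err := inspectParent("parent.pem"); err != nil {
		fmt.Println("error:", err)
	}
}
```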
Description of problem:
On the route creation page, when "Secure Route" is checked and "Edge" or "Re-encrypt" TLS termination is selected, the help text under "Certificates" reads "TLS certificates for edge and re-encrypt termination. If not specified, the router's default certificate is used." The apostrophe in "router's" is rendered incorrectly and should display as a plain "router's".
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-18-003538 4.18.0-0.nightly-2024-09-17-060032
How reproducible:
Always
Steps to Reproduce:
1.Check on route creation page, when check on "Secure Route", select "Edge" or "Re-encrypt" TLS termination. 2. 3.
Actual results:
1. There is "TLS certificates for edge and re-encrypt termination. If not specified, the router's default certificate is used.
Expected results:
1. "router's" should be "router's"
Additional info:
As an engineer I would like to have a functional test that makes sure the ETCD recovery function works as expected without deploying a full OCP cluster or HostedCluster.
Alternatives:
Description of problem:
On overview page's getting started resources card, there is "OpenShift LightSpeed" link when this operator is available on the cluster, the text should be updated to "OpenShift Lightspeed" to keep consistent with operator name.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-08-013133 4.16.0-0.nightly-2024-08-08-111530
How reproducible:
Always
Steps to Reproduce:
1. Check overview page's getting started resources card, 2. 3.
Actual results:
1. There is "OpenShift LightSpeed" link in "Explore new features and capabilities"
Expected results:
1. The text should be "OpenShift Lightspeed" to keep consistent with the operator name.
Additional info:
Description of the problem:
When provisioning a hosted cluster using a ZTP workflow to create BMH and NodePool CRs, corresponding agents are created for the BMHs, but those agents do not get added to the hostedCluster as they are not set to spec.approved=true
This is a recent change in behavior, and appears to be related to this commit, which was meant to allow BMH CRs to be safely restored by OADP in DR scenarios.
Manually approving the agents results in success.
Setting the PAUSE_PROVISIONED_BMHS boolean to false also results in success.
How reproducible:
Always
Steps to reproduce:
1. Create BMH and NodePool for HostedCluster
2. Observe creation of agents on cluster
3. Observe agents do not join cluster
Actual results:
Agents exist, are not added to nodepool
Expected results:
Agents and their machines are added to the nodepool and the hosted cluster sees nodes appear.
Test is:
[sig-arch][Early] Operators low level operators should have at least the conditions we had in 4.17 [Suite:openshift/conformance/parallel]
Description of problem:
If the folder is undefined and the datacenter exists in a datacenter-based folder, the installer will create the entire path of folders from the root of vCenter, which is incorrect.
This does not occur if folder is defined.
An upstream bug was identified when debugging this:
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/163
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/router/pull/623
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
For light theme, the Lightspeed logo should use the multi-color version. For dark theme, the Lightspeed logo should use the single color version for both the button and the content.
Description of problem:
Get "https://openshift.default.svc/.well-known/oauth-authorization-server": tls: failed to verify certificate: x509: certificate is valid for localhost, kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, kube-apiserver, kube-apiserver.ocm-production-2b0eqpjq13aaba19ncgajh1asp39602g-faldana-hcp.svc, kube-apiserver.ocm-production-2b0eqpjq13aaba19ncgajh1asp39602g-faldana-hcp.svc.cluster.local, api.faldana-hcp.rvvd.p3.openshiftapps.com, api.faldana-hcp.hypershift.local, not openshift.default.svc
Version-Release number of selected component (if applicable):
4.15.9
How reproducible:
stable
Steps to Reproduce:
Get "https://openshift.default.svc/.well-known/oauth-authorization-server"
Actual results:
x509: certificate is valid for ... kubernetes.default.svc ..., not openshift.default.svc
Expected results:
OK
Additional info:
Works fine with ROSA Classic. The context: customer is configuring access to the RHACS console via Openshift Auth Provider. Discussion: https://redhat-internal.slack.com/archives/C028JE84N59/p1715048866276889
Description of problem:
When using an internal publishing strategy, the client is not properly initialized, which causes a code path to be hit that dereferences a nil pointer.
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Easily
Steps to Reproduce:
1. Try to deploy a private cluster
2. Segfault
Actual results:
Expected results:
Additional info:
Description of problem:
When user changes Infrastructure object, e.g. adds a new vCenter, the operator generates a new driver config (Secret named vsphere-csi-config-secret), but the controller pods are not restarted and use the old config.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly *after* 2024-08-09-031511
How reproducible: always
Steps to Reproduce:
Actual results: the controller pods are not restarted
Expected results: the controller pods are restarted
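A common pattern for making the controller pods pick up a regenerated config is to stamp a hash of the Secret's data onto the Deployment's pod template as an annotation, so any content change triggers a rollout. A minimal sketch of that idea; the annotation key is hypothetical and this is not the actual csi-operator code:

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// configHash returns a stable hash of the secret's data; storing it as a pod
// template annotation makes the Deployment roll out whenever the data changes.
func configHash(secretData map[string][]byte) (string, error) {
	// json.Marshal sorts map keys, so equal content yields an equal hash.
	raw, err := json.Marshal(secretData)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return fmt.Sprintf("%x", sum), nil
}

func main() {
	data := map[string][]byte{"cloud.conf": []byte("[VirtualCenter \"vc1\"]\n")}
	h, _ := configHash(data)
	// e.g. deployment.Spec.Template.Annotations["vsphere-csi.openshift.io/config-hash"] = h
	fmt.Println("config hash:", h)
}
```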
Description of problem:
The AWS Cluster API Provider (CAPA) runs a required check to resolve the DNS Name for load balancers it creates. If the CAPA controller (in this case, running in the installer) cannot resolve the DNS record, CAPA will not report infrastructure ready. We are seeing in some cases, that installations running on local hosts (we have not seen this problem in CI) will not be able to resolve the LB DNS name record and the install will fail like this:
DEBUG I0625 17:05:45.939796 7645 awscluster_controller.go:295] "Waiting on API server ELB DNS name to resolve" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/umohnani-4-16test-5ndjw" namespace="openshift-cluster-api-guests" name="umohnani-4-16test-5ndjw" reconcileID="553beb3d-9b53-4d83-b417-9c70e00e277e" cluster="openshift-cluster-api-guests/umohnani-4-16test-5ndjw"
DEBUG Collecting applied cluster api manifests...
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: infrastructure was not ready within 15m0s: client rate limiter Wait returned an error: context deadline exceeded
We do not know why some hosts cannot resolve these records, but it could be something like issues with the local DNS resolver cache, DNS records are slow to propagate in AWS, etc.
Version-Release number of selected component (if applicable):
4.16, 4.17
How reproducible:
Not reproducible / unknown -- this seems to be dependent on specific hosts and we have not determined why some hosts face this issue while others do not.
Steps to Reproduce:
n/a
Actual results:
Install fails because CAPA cannot resolve LB DNS name
Expected results:
As the DNS record does exist, install should be able to proceed.
Additional info:
Slack thread:
https://redhat-internal.slack.com/archives/C68TNFWA2/p1719351032090749
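For illustration, a minimal sketch of the kind of wait-for-DNS loop the "Waiting on API server ELB DNS name to resolve" check implies, retrying resolution until a timeout instead of failing on the first miss; the host name and timings are placeholders, not CAPA's actual code:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// waitForDNS polls until the name resolves or the context expires.
func waitForDNS(ctx context.Context, name string, interval time.Duration) error {
	r := &net.Resolver{}
	for {
		if addrs, err := r.LookupHost(ctx, name); err == nil && len(addrs) > 0 {
			fmt.Println("resolved", name, "->", addrs)
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for %s to resolve: %w", name, ctx.Err())
		case <-time.After(interval):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Minute)
	defer cancel()
	if err := waitForDNS(ctx, "example-elb.us-east-1.elb.amazonaws.com", 10*time.Second); err != nil {
		fmt.Println(err)
	}
}
```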
Description of problem:
When verifying OCPBUGS-38869 (or on 4.18), the MOSB stays in the updating state even though the build pod completed successfully, and an error is visible in the machine-os build pod.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Apply any MOSC
2. See that the build pod is successful
3. But the MOSB is still in the updating state
4. And an error can be seen in the machine-os build pod
Actual results:
I have applied below MOSC
oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineOSConfig
metadata:
  name: abc
spec:
  machineConfigPool:
    name: worker
  buildOutputs:
    currentImagePullSecret:
      name: $(oc get -n openshift-machine-config-operator sa default -ojsonpath='{.secrets[0].name}')
  buildInputs:
    imageBuilder:
      imageBuilderType: PodImageBuilder
    baseImagePullSecret:
      name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
    renderedImagePushSecret:
      name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
    renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest"
    containerFile:
    - containerfileArch: noarch
      content: |-
        # Pull the centos base image and enable the EPEL repository.
        FROM quay.io/centos/centos:stream9 AS centos
        RUN dnf install -y epel-release
        # Pull an image containing the yq utility.
        FROM docker.io/mikefarah/yq:latest AS yq
        # Build the final OS image for this MachineConfigPool.
        FROM configs AS final
        # Copy the EPEL configs into the final image.
        COPY --from=yq /usr/bin/yq /usr/bin/yq
        COPY --from=centos /etc/yum.repos.d /etc/yum.repos.d
        COPY --from=centos /etc/pki/rpm-gpg/RPM-GPG-KEY-* /etc/pki/rpm-gpg/
        # Install cowsay and ripgrep from the EPEL repository into the final image,
        # along with a custom cow file.
        RUN sed -i 's/\$stream/9-stream/g' /etc/yum.repos.d/centos*.repo && \
            rpm-ostree install cowsay ripgrep
EOF
$ oc get machineosconfig
NAME   AGE
abc    45m
$ oc logs build-rendered-worker-c67571b26a7e0d94dc2bf01dca97bbe5 -f
...
Copying blob sha256:a8157ed01dfc7fe15c8f2a86a3a5e30f7fcb7f3e50f8626b32425aaf821ae23d
Copying config sha256:4b15e94c47f72b6c082272cf1547fdd074bd3539b327305285d46926f295a71b
Writing manifest to image destination
+ return 0
$ oc get machineosbuild
NAME                                                              PREPARED   BUILDING   SUCCEEDED   INTERRUPTED   FAILED
worker-rendered-worker-c67571b26a7e0d94dc2bf01dca97bbe5-builder   False      True       False       False         False
$ oc logs machine-os-builder-654fc664bb-qvjkn | grep -i error
I1003 16:12:52.463155 1 pod_build_controller.go:296] Error syncing pod openshift-machine-config-operator/build-rendered-worker-c67571b26a7e0d94dc2bf01dca97bbe5: unable to update with build pod status: could not update MachineOSConfig "abc": MachineOSConfig.machineconfiguration.openshift.io "abc" is invalid: [observedGeneration: Required value, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
Expected results:
MOSB should be successful
Additional info:
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-upgrade/1836280498960207872 has a test for monitoring
The test needs to reliably produce results
Please review the following PR: https://github.com/openshift/cluster-olm-operator/pull/65
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Component Readiness has found a potential regression in the following test:
[sig-network] pods should successfully create sandboxes by adding pod to network
Probability of significant regression: 96.41%
Sample (being evaluated) Release: 4.17
Start Time: 2024-08-27T00:00:00Z
End Time: 2024-09-03T23:59:59Z
Success Rate: 88.37%
Successes: 26
Failures: 5
Flakes: 12
Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 98.46%
Successes: 43
Failures: 1
Flakes: 21
Here is an example run.
We see the following signature for the failure:
namespace/openshift-etcd node/master-0 pod/revision-pruner-11-master-0 hmsg/b90fda805a - 111.86 seconds after deletion - firstTimestamp/2024-09-02T13:14:37Z interesting/true lastTimestamp/2024-09-02T13:14:37Z reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_revision-pruner-11-master-0_openshift-etcd_08346d8f-7d22-4d70-ab40-538a67e21e3c_0(d4b61f9ff9f2ddfd3b64352203e8a3eafc2c3bd7c3d31a0a573bc29e4ac6da57): error adding pod openshift-etcd_revision-pruner-11-master-0 to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: 'ContainerID:"d4b61f9ff9f2ddfd3b64352203e8a3eafc2c3bd7c3d31a0a573bc29e4ac6da57" Netns:"/var/run/netns/97dc5eb9-19da-462f-8b2e-c301cfd7f3cf" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-etcd;K8S_POD_NAME=revision-pruner-11-master-0;K8S_POD_INFRA_CONTAINER_ID=d4b61f9ff9f2ddfd3b64352203e8a3eafc2c3bd7c3d31a0a573bc29e4ac6da57;K8S_POD_UID=08346d8f-7d22-4d70-ab40-538a67e21e3c" Path:"" ERRORED: error configuring pod [openshift-etcd/revision-pruner-11-master-0] networking: Multus: [openshift-etcd/revision-pruner-11-master-0/08346d8f-7d22-4d70-ab40-538a67e21e3c]: error waiting for pod: pod "revision-pruner-11-master-0" not found
The same signature has been reported for both azure and x390x as well.
It is worth mentioning that the sdn-to-ovn transition adds some complication to our analysis. From the component readiness above, you will see most of the failures are for the job periodic-ci-openshift-release-master-nightly-X.X-upgrade-from-stable-X.X-e2e-metal-ipi-ovn-upgrade. This is a new job for 4.17 and therefore misses base stats in 4.16.
So we ask for:
Please review the following PR: https://github.com/openshift/must-gather/pull/441
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-operator/pull/115
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Update our CPO and HO dockerfiles to use appropriate base image versions.
Description of problem:
When running a Cypress test locally with auth disabled while logged in as kubeadmin (e.g., running pipeline-ci.feature within test-cypress-pipelines), the before-each hook fails because it expects there to be an empty message when we are actually logged in as kubeadmin.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
always
Steps to Reproduce:
1. Run console the ./contrib/oc-environment.sh way while logged into kubeadmin
2. Run pipeline-ci.feature within the test-cypress-pipelines yarn script in the frontend folder
Actual results:
The after-each hooks of the tests fail
Expected results:
The after-each hooks of the tests are allowed to pass
Additional info:
As OCP user, I want storage operators restarted quickly and newly started operator to start leading immediately without ~3 minute wait.
This means that the old operator should release its leadership after it receives SIGTERM and before it exits. Right now, storage operators fail to release the leadership in ~50% of cases.
Steps to reproduce:
This is hack'n'hustle work, not tied to any Epic; I'm using it just to get proper QE and to track which operators are being updated (see linked GitHub PRs).
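A minimal client-go sketch of leader election that releases the lease on SIGTERM via ReleaseOnCancel; this is illustrative, not the actual library-go wiring the storage operators use, and the lock name, namespace, and identity are placeholders:

```go
package main

import (
	"context"
	"os"
	"os/signal"
	"syscall"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Cancel the context on SIGTERM so the election loop exits cleanly.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "example-operator-lock", Namespace: "example-namespace"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   137 * time.Second,
		RenewDeadline:   107 * time.Second,
		RetryPeriod:     26 * time.Second,
		ReleaseOnCancel: true, // release the lease on shutdown instead of letting it expire
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { klog.Info("started leading; run controllers here") },
			OnStoppedLeading: func() { klog.Info("stopped leading, lease released") },
		},
	})
}
```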
Description of problem:
See https://github.com/prometheus/prometheus/issues/14503 for more details
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Steps to Reproduce:
1. Make Prometheus scrape a target that exposes multiple samples of the same series with different explicit timestamps, for example:
# TYPE requests_per_second_requests gauge
# UNIT requests_per_second_requests requests
# HELP requests_per_second_requests test-description
requests_per_second_requests 16 1722466225604
requests_per_second_requests 14 1722466226604
requests_per_second_requests 40 1722466227604
requests_per_second_requests 15 1722466228604
# EOF
2. Not all the samples will be ingested
3. If Prometheus continues scraping that target for a moment, the PrometheusDuplicateTimestamps will fire.
Actual results:
Expected results: all the samples should be ingested (of course, if the timestamps are too old or too far in the future, Prometheus may refuse them).
Additional info:
Regression introduced in Prometheus 2.52. Proposed upstream fixes: https://github.com/prometheus/prometheus/pull/14683 https://github.com/prometheus/prometheus/pull/14685
Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/850
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/oc/pull/1871
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The helm charts in the monitoring-plugin can currently deploy the monitoring-plugin either in its CMO state or with the acm-alerting feature flag enabled. Update them so that they also work with the incidents feature flag.
Description of problem:
Should save the release signature in the archive tar file instead of counting on the enterprise cache (or working-dir)
Version-Release number of selected component (if applicable):
oc-mirror version
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202407291514.p0.gdbf115f.assembly.stream.el9-dbf115f", GitCommit:"dbf115f547a19f12ab72e7b326be219a47d460a0", GitTreeState:"clean", BuildDate:"2024-07-29T15:52:52Z", GoVersion:"go1.22.4 (Red Hat 1.22.4-2.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
100%
Steps to Reproduce:
1) Prepare data for the enterprise registry using mirror2disk+disk2mirror mode with the following config and commands:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    graph: true
    channels:
    - name: stable-4.15
`oc-mirror -c config-38037.yaml file://out38037 --v2`
`oc-mirror -c config-38037.yaml --from file://out38037 docker://my-route-zhouy.apps.yinzhou-88.qe.devcluster.openshift.com --v2 --dest-tls-verify=false`
2) Prepare the env to simulate the enclave cluster:
cat /etc/squid/squid.conf
http_port 3128
coredump_dir /var/spool/squid
acl whitelist dstdomain "/etc/squid/whitelist"
http_access allow whitelist
http_access deny !whitelist
cat /etc/squid/whitelist
my-route-zhouy.apps.yinzhou-88.qe.devcluster.openshift.com ------------- registry route (oc get route -n your registry app's project)
update-service-oc-mirror-route-openshift-update-service.apps.yinzhou-88.qe.devcluster.openshift.com --- osus route (oc get route -n openshift-update-service)
sudo systemctl restart squid
export https_proxy=http://127.0.0.1:3128
export http_proxy=http://127.0.0.1:3128
Set the registry redirect with:
cat ~/.config/containers/registries.conf
[[registry]]
location = "quay.io"
insecure = false
blocked = false
mirror-by-digest-only = false
prefix = ""
[[registry.mirror]]
location = "my-route-zhouy.apps.yinzhou-88.qe.devcluster.openshift.com"
insecure = false
3) Simulate the enclave mirror with the same imagesetconfig with the command:
`oc-mirror -c config-38037.yaml file://new-folder --v2`
Actual results:
3) The mirror2disk failed with error :
I0812 06:45:26.026441 199941 core-cincinnati.go:508] Using proxy 127.0.0.1:3128 to request updates from https://update-service-oc-mirror-route-openshift-update-service.apps.yinzhou-417.qe.devcluster.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.15&id=a6097264-8b29-438f-9e71-4aba1e9ec32d
2024/08/12 06:45:26 [ERROR] : http request Get "https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=0f55261077557d1bb909c06b115e0c79b0025677be57ba2f045495c11e2443ee/signature-1": Forbidden
Expected results:
No error; the signature should be included in the archive tar file rather than counting on the enterprise cache (in customer usage, the enclave cluster may be on a different machine, or may not use the same directory).
Description of problem:
From the output of "oc adm upgrade --help": ... --to-latest=false: Use the next available version. ... seems like "Use the latest available version" is more appropriate.
Version-Release number of selected component (if applicable):
4.14.0
How reproducible:
100%
Steps to Reproduce:
1. [kni@ocp-edge119 ~]$ oc adm upgrade --help
Actual results:
... --to-latest=false: Use the next available version. ...
Expected results:
... --to-latest=false: Use the latest available version. ...
Additional info:
Description of problem:
NodePool Controller doesn't respect LatestSupportedVersion https://github.com/openshift/hypershift/blob/main/support/supportedversion/version.go#L19
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create HostedCluster / NodePool
2. Upgrade both HostedCluster and NodePool at the same time to a version higher than the LatestSupportedVersion
Actual results:
NodePool tries to upgrade to the new version while the HostedCluster ValidReleaseImage condition fails with: 'the latest version supported is: "x.y.z". Attempting to use: "x.y.z"'
Expected results:
NodePool ValidReleaseImage condition also fails
Additional info:
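A minimal sketch of the kind of version check the NodePool controller could share with the HostedCluster controller, using the github.com/blang/semver/v4 module (an assumed dependency for this sketch); the LatestSupportedVersion value and function name are illustrative:

```go
package main

import (
	"fmt"

	"github.com/blang/semver/v4"
)

// latestSupportedVersion mirrors the idea of supportedversion.LatestSupportedVersion;
// the value here is only an example.
var latestSupportedVersion = semver.MustParse("4.17.0")

// validateReleaseVersion is an illustrative check: running the same comparison
// in the NodePool controller would make ValidReleaseImage fail consistently on
// both the HostedCluster and the NodePool.
func validateReleaseVersion(requested string) error {
	v, err := semver.Parse(requested)
	if err != nil {
		return fmt.Errorf("invalid version %q: %w", requested, err)
	}
	if v.GT(latestSupportedVersion) {
		return fmt.Errorf("the latest version supported is: %q. Attempting to use: %q",
			latestSupportedVersion, v)
	}
	return nil
}

func main() {
	fmt.Println(validateReleaseVersion("4.18.0")) // rejected
	fmt.Println(validateReleaseVersion("4.17.0")) // nil
}
```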
Description of problem:
IBM ROKS uses Calico as their CNI. In previous versions of OpenShift, OpenShiftSDN would create IPTable rules that would force local endpoint for DNS Service.
Starting in OCP 4.17 with the removal of SDN, IBM ROKS is not using OVN-K and therefore the local endpoint for the DNS service is not working as expected.
IBM ROKS is asking that the code block be restored to restore the functionality previously seen in OCP 4.16
Without this functionality IBM ROKS is not able to GA OCP 4.17
As a user of HyperShift, I want to be able to:
so that I can achieve
Description of criteria:
N/A
% oc adm release info quay.io/openshift-release-dev/ocp-release:4.14.33-multi -a ~/all-the-pull-secrets.json --pullspecs | grep apiserver apiserver-network-proxy
This does not require a design proposal.
This does not require a feature gate.
Description of the problem:
The ingress TLS certificate, which is the one presented to HTTP clients e.g. when requesting resources under *.apps.<cluster-name>.<base-domain>, is not signed by a certificate included in the cluster's CA certificates. This results in those ingress HTTP requests failing with the error: `tls: failed to verify certificate: x509: certificate signed by unknown authority`.
How reproducible:
100%
Steps to reproduce:
1. Before running an IBU, verify that the target cluster's ingress works properly:
2. Run an IBU.
3. Perform steps 1. and 2. again. You will see the error `curl: (60) SSL certificate problem: self signed certificate in certificate chain`.
Alternative steps using openssl:
1. Run an IBU
2. Download the cluster's CA bundle `oc config view --raw -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' | base64 --decode > ca.crt`
3. Download the ingress certificate `openssl s_client -connect oauth.apps.target.ibo1.redhat.com:443 -showcerts </dev/null </dev/null 2>/dev/null | awk '/BEGIN CERTIFICAT/,/END CERTIFICATE/ {print}' > ingress.crt`
4. Try to verify the cert with the CA chain: `openssl verify -CAfile ca.crt ingress.crt` - this step fails.
Actual results:
Ingress HTTP requests using the cluster's CA TLS transport fail with unknown certificate authority error.
Expected results:
Ingress HTTP requests using the cluster's CA TLS transport should succeed.
Related to a component regression we found that looked like we had no clear test to catch, sample runs:
All three runs show a pattern. The actual test failures look unpredictable, some tests are passing at the same time, others fail to talk to the apiserver.
The pattern we see is 1 or more tests failing right at the start of e2e testing, disruption, etcd log messages indicating slowness, and etcd leadership state changes.
Because the tests are unpredictable, we'd like a test that catches this symptom. We think the safest way to do this is to look for disruption within x minutes of the first e2e test.
This would be implemented as a monitortest, likely somewhere around here: https://github.com/openshift/origin/blob/master/pkg/monitortests/kubeapiserver/legacykubeapiservermonitortests/monitortest.go
Although it would be reasonable to add a new monitortest in the parent package above this level.
The test would need to do the following:
Description of problem:
This is essentially an incarnation of the bug https://bugzilla.redhat.com/show_bug.cgi?id=1312444 that was fixed in OpenShift 3 but is now present again.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Select a template in the console web UI, try to enter a multiline value.
Actual results:
It's impossible to enter line breaks.
Expected results:
It should be possible to achieve entering a multiline parameter when creating apps from templates.
Additional info:
I also filed an issue here https://github.com/openshift/console/issues/13317. P.S. It's happening on https://openshift-console.osci.io, not sure what version of OpenShift they're running exactly.
Description of problem:
We need to bump the Kubernetes version to the latest API version OCP is using. This is what was done last time: https://github.com/openshift/cluster-samples-operator/pull/409 Find the latest stable version here: https://github.com/kubernetes/api This is described in the wiki: https://source.redhat.com/groups/public/appservices/wiki/cluster_samples_operator_release_activities
Version-Release number of selected component (if applicable):
How reproducible:
Not really a bug, but we're using OCPBUGS so that automation can manage the PR lifecycle (SO project is no longer kept up-to-date with release versions, etc.).
Description of problem:
Another panic occurred in https://issues.redhat.com/browse/OCPBUGS-34877?focusedId=25580631&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-25580631 and should be fixed.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
AdditionalTrustedCA is not wired correctly, so the configmap is not found by its operator. This feature is meant to be exposed by XCMSTRAT-590, but at the moment it seems to be broken.
Version-Release number of selected component (if applicable):
4.16.5
How reproducible:
Always
Steps to Reproduce:
1. Create a configmap containing a registry and PEM cert, like https://github.com/openshift/openshift-docs/blob/ef75d891786604e78dcc3bcb98ac6f1b3a75dad1/modules/images-configuration-cas.adoc#L17
2. Refer to it in .spec.configuration.image.additionalTrustedCA.name
3. image-registry-config-operator is not able to find the cm and the CO is degraded
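For reference, a minimal sketch of the two objects involved; names and namespace are placeholders and not taken from the report:
apiVersion: v1
kind: ConfigMap
metadata:
  name: registry-additional-ca        # placeholder name
  namespace: <hosted-cluster-namespace>
data:
  registry.example.com: |             # one key per registry hostname
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
---
apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example
  namespace: <hosted-cluster-namespace>
spec:
  configuration:
    image:
      additionalTrustedCA:
        name: registry-additional-ca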
Actual results:
CO is degraded
Expected results:
certs are used.
Additional info:
I think we may be missing a copy of the configmap from the cluster NS to the target NS. The copy should also be deleted when the original is deleted.
% oc get hc -n ocm-adecorte-2d525fsstsvtbv1h8qss14pkv171qhdd -o jsonpath="{.items[0].spec.configuration.image.additionalTrustedCA}" | jq
{
  "name": "registry-additional-ca-q9f6x5i4"
}
% oc get cm -n ocm-adecorte-2d525fsstsvtbv1h8qss14pkv171qhdd registry-additional-ca-q9f6x5i4
NAME                              DATA   AGE
registry-additional-ca-q9f6x5i4   1      16m
logs of cluster-image-registry operator
E0814 13:22:32.586416 1 imageregistrycertificates.go:141] ImageRegistryCertificatesController: unable to sync: failed to update object *v1.ConfigMap, Namespace=openshift-image-registry, Name=image-registry-certificates: image-registry-certificates: configmap "registry-additional-ca-q9f6x5i4" not found, requeuing
CO is degraded
% oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
console 4.16.5 True False False 3h58m
csi-snapshot-controller 4.16.5 True False False 4h11m
dns 4.16.5 True False False 3h58m
image-registry 4.16.5 True False True 3h58m ImageRegistryCertificatesControllerDegraded: failed to update object *v1.ConfigMap, Namespace=openshift-image-registry, Name=image-registry-certificates: image-registry-certificates: configmap "registry-additional-ca-q9f6x5i4" not found
ingress 4.16.5 True False False 3h59m
insights 4.16.5 True False False 4h
kube-apiserver 4.16.5 True False False 4h11m
kube-controller-manager 4.16.5 True False False 4h11m
kube-scheduler 4.16.5 True False False 4h11m
kube-storage-version-migrator 4.16.5 True False False 166m
monitoring 4.16.5 True False False 3h55m
Description of problem:
When an image is referenced by tag and digest, oc-mirror skips the image
Version-Release number of selected component (if applicable):
How reproducible:
Do mirror to disk and disk to mirror using the registry.redhat.io/redhat/redhat-operator-index:v4.16 and the operator multiarch-tuning-operator
Steps to Reproduce:
1. mirror to disk
2. disk to mirror
Actual results:
docker://gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1@sha256:d4883d7c622683b3319b5e6b3a7edfbf2594c18060131a8bf64504805f875522 (Operator bundles: [multiarch-tuning-operator.v0.9.0] - Operators: [multiarch-tuning-operator]) error: Invalid source name docker://localhost:55000/kubebuilder/kube-rbac-proxy:v0.13.1:d4883d7c622683b3319b5e6b3a7edfbf2594c18060131a8bf64504805f875522: invalid reference format
Expected results:
The image should be mirrored
Additional info:
Please review the following PR: https://github.com/openshift/image-customization-controller/pull/128
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When a tuned profile with the annotation tuned.openshift.io/deferred: "update" is created before labeling the target node, and the node is then labeled with profile=, the value of kernel.shmmni is applied immediately, but it shows the message [The TuneD daemon profile is waiting for the next node restart: openshift-profile]. After rebooting the node, kernel.shmmni is restored to its default value instead of being set to the expected value.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create an OCP cluster with the latest 4.18 nightly version
2. Create the tuned profile before labeling the node; please refer to issue 1 in the doc https://docs.google.com/document/d/1h-7AIyqf7sHa5Et2XF7a-RuuejwVkrjhiFFzqZnNfvg/edit if you want to reproduce the issue (a hedged example profile is sketched below)
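A minimal sketch of what such a Tuned object could look like, assuming illustrative names and sysctl values (the real profile from the linked doc may differ):
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile
  namespace: openshift-cluster-node-tuning-operator
  annotations:
    # defer applying the profile update until the next node restart
    tuned.openshift.io/deferred: "update"
spec:
  profile:
  - name: openshift-profile
    data: |
      [main]
      summary=Custom profile with a deferred kernel.shmmni update
      include=openshift-node
      [sysctl]
      kernel.shmmni=8192
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile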
Actual results:
The message [The TuneD daemon profile is waiting for the next node restart: openshift-profile] is shown, and after the node reboot the sysctl value reverts to its default instead of keeping the expected value.
Expected results:
It should show the message [TuneD profile applied] when executing oc get profile, and the sysctl value should remain as expected after the node reboot.
Additional info:
Description of problem:
The [`4.15.0` `oc` client](https://mirror.openshift.com/pub/openshift-v4/multi/clients/ocp/4.15.0/ppc64le/openshift-client-linux.tar.gz) introduces strange (MISSING) output in the `oc adm must-gather --help` output.
Version-Release number of selected component (if applicable):
4.15.0 and higher
How reproducible:
With 4.15.0 and higher you can run the reproducer steps.
Steps to Reproduce:
1. curl -O -L https://mirror.openshift.com/pub/openshift-v4/multi/clients/ocp/4.15.0/ppc64le/openshift-client-linux.tar.gz
2. untar
3. ./oc adm must-gather --help
Actual results:
# ./oc adm must-gather --help --volume-percentage=30: Specify maximum percentage of must-gather pod's allocated volume that can be used. If this limit is exceeded, must-gather will stop gathering, but still copy gathered data. Defaults to 30%!(MISSING)
Expected results:
No (MISSING) content in the output
Additional info:
Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/207
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/console/pull/14238
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
Trying to create a cluster from UI , fails.
How reproducible:
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
Description of problem:
Create VPC and subnets with the following configs [refer to attached CF template]:
Subnets (subnets-pair-default) in CIDR 10.0.0.0/16
Subnets (subnets-pair-134) in CIDR 10.134.0.0/16
Subnets (subnets-pair-190) in CIDR 10.190.0.0/16
Create a cluster into subnets-pair-134; the bootstrap process fails [see attached log-bundle logs]:
level=debug msg=I0605 09:52:49.548166 937 loadbalancer.go:1262] "adding attributes to load balancer" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" namespace="openshift-cluster-api-guests" name="yunjiang29781a-86-rvqd9" reconcileID="a9310bd5-acc7-4b01-8a84-e47139fc0d1d" cluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" attrs=[{"Key":"load_balancing.cross_zone.enabled","Value":"true"}]
level=debug msg=I0605 09:52:49.909861 937 awscluster_controller.go:291] "Looking up IP address for DNS" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" namespace="openshift-cluster-api-guests" name="yunjiang29781a-86-rvqd9" reconcileID="a9310bd5-acc7-4b01-8a84-e47139fc0d1d" cluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" dns="yunjiang29781a-86-rvqd9-int-19a9485653bf29a1.elb.us-east-2.amazonaws.com"
level=debug msg=I0605 09:52:53.483058 937 reflector.go:377] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: forcing resync
level=debug msg=Fetching Bootstrap SSH Key Pair...
Checking security groups:
<infraid>-lb allows 10.0.0.0/16:6443 and 10.0.0.0/16:22623
<infraid>-apiserver-lb allows 10.0.0.0/16:6443 and 10.134.0.0/16:22623 (and 0.0.0.0/0:6443)
Are these settings correct?
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-06-03-060250
How reproducible:
Always
Steps to Reproduce:
1. Create subnets using the attached CF template
2. Create a cluster into the subnets whose CIDR is 10.134.0.0/16
Actual results:
Bootstrap process fails.
Expected results:
Bootstrap succeeds.
Additional info:
No issues if creating the cluster into subnets-pair-default (10.0.0.0/16).
No issues if there is only one CIDR in the VPC, e.g. set VpcCidr to 10.134.0.0/16 in https://github.com/openshift/installer/blob/master/upi/aws/cloudformation/01_vpc.yaml
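As a hedged illustration of installing into the 10.134.0.0/16 subnets, the relevant install-config fields might look like the sketch below; the subnet IDs are placeholders and not taken from the report:
networking:
  machineNetwork:
  - cidr: 10.134.0.0/16
platform:
  aws:
    region: us-east-2
    subnets:
    - subnet-aaaaaaaaaaaaaaaaa   # private subnet from subnets-pair-134 (placeholder ID)
    - subnet-bbbbbbbbbbbbbbbbb   # public subnet from subnets-pair-134 (placeholder ID)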
Description of problem:
When using SecureBoot, TuneD reports the following error because debugfs access is restricted:
tuned.utils.commands: Writing to file '/sys/kernel/debug/sched/migration_cost_ns' error: '[Errno 1] Operation not permitted: '/sys/kernel/debug/sched/migration_cost_ns''
tuned.plugins.plugin_scheduler: Error writing value '5000000' to 'migration_cost_ns'
This issue has been reported with the following tickets:
As this is a confirmed limitation of the NTO due to the TuneD component, we should document this as a limitation in the OpenShift Docs:
https://docs.openshift.com/container-platform/4.16/nodes/nodes/nodes-node-tuning-operator.html
Expected Outcome:
Please review the following PR: https://github.com/openshift/telemeter/pull/543
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-driver-nfs/pull/146
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The smoke test for OLM run by the OpenShift e2e suite is specifying an unavailable operator for installation, causing it to fail.
Version-Release number of selected component (if applicable):
How reproducible:
Always (when using 4.17+ catalog versions)
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The customer uses the Azure File CSI driver, and without this they cannot make use of the Azure Workload Identity work, which was one of the banner features of OCP 4.14. This feature is currently available in 4.16; however, it will take the customer 3-6 months to validate 4.16 and start its rollout, putting their plans to complete a large migration to Azure by end of 2024 at risk. Could you please backport either the 1.29.3 feature for Azure Workload Identity or rebase our Azure File CSI driver in 4.14 and 4.15 to at least 1.29.3, which includes the desired feature.
Version-Release number of selected component (if applicable):
azure-file-csi-driver in 4.14 and 4.15
- In 4.14, azure-file-csi-driver is version 1.28.1
- In 4.15, azure-file-csi-driver is version 1.29.2
How reproducible:
Always
Steps to Reproduce:
1. Install OCP 4.14 with Azure Workload Managed Identity
2. Try to configure Managed Workload Identity with Azure File CSI: https://github.com/kubernetes-sigs/azurefile-csi-driver/blob/master/docs/workload-identity-static-pv-mount.md
Actual results:
It is not usable.
Expected results:
Azure Workload Identity should be manage with Azure File CSi as part of the whole feature
Additional info:
Description of problem:
The sort function on the NetworkPolicies page is incorrect after enabling pagination.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-17-060032
How reproducible:
Always
Steps to Reproduce:
1. Create multiple resources for NetworkPolicies
2. Navigate to Networking -> NetworkPolicies page -> NetworkPolicies tab
3. Make sure the option '15 per page' has been selected
4. Click the 'Name' column button to sort the table
Actual results:
The sort result is not correct. PFA: https://drive.google.com/file/d/12-eURLqMPZM5DNxfAPoWzX1CJr0Wyf_u/view?usp=drive_link
Expected results:
Table data can be sorted by using resource name, even if pagination is enabled
Additional info:
In analytics events, console sends the Organization.id from OpenShift Cluster Manager's Account Service, rather than the Organization.external_id. The external_id is meaningful company-wide at Red Hat, while the plain id is only meaningful within OpenShift Cluster Manager. You can use id to lookup external_id in OCM, but it's an extra step we'd like to avoid if possible.
cc Ali Mobrem
Description of problem:
co/ingress is always good even when the operator pod logs an error: 2024-07-24T06:42:09.580Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "error sending canary HTTP Request: Timeout: Get \"https://canary-openshift-ingress-canary.apps.hongli-aws.qe.devcluster.openshift.com\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-20-191204
How reproducible:
100%
Steps to Reproduce:
1. Install an AWS cluster
2. Update ingresscontroller/default, adding "endpointPublishingStrategy.loadBalancer.allowedSourceRanges", e.g.:
spec:
  endpointPublishingStrategy:
    loadBalancer:
      allowedSourceRanges:
      - 1.1.1.2/32
3. The above setting drops most traffic to the LB, so some operators become degraded
Actual results:
co/authentication and console degraded but co/ingress is still good $ oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.17.0-0.nightly-2024-07-20-191204 False False True 22m OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.hongli-aws.qe.devcluster.openshift.com/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers) console 4.17.0-0.nightly-2024-07-20-191204 False False True 22m RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.hongli-aws.qe.devcluster.openshift.com): Get "https://console-openshift-console.apps.hongli-aws.qe.devcluster.openshift.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers) ingress 4.17.0-0.nightly-2024-07-20-191204 True False False 3h58m check the ingress operator log and see: 2024-07-24T06:59:09.588Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "error sending canary HTTP Request: Timeout: Get \"https://canary-openshift-ingress-canary.apps.hongli-aws.qe.devcluster.openshift.com\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
Expected results:
co/ingress status should reflect the real condition timely
Additional info:
Even though the co/ingress status can be updated in some scenarios, it is always less sensitive than authentication and console. We always rely on authentication/console to know whether routes are healthy, which makes the ingress canary route meaningless.
Since we are not going to support the addition of a second vCenter as a day-2 operation, we need to block users from doing this.
It looks like the must gather pods are the worst culprits but these are not actually considered to be platform pods.
Step 1: Exclude must gather pods from this test.
Step 2: Research the other failures.
Description of the problem:
the GPU data in our host inventory is wrong
How reproducible:
Always
Steps to reproduce:
1.
2.
3.
Actual results:
"gpus": [ \{ "address": "0000:00:0f.0" } ],
Expected results:
Description of problem:
OCPBUGS-42772 is verified, but testing found that the oauth-server panics with OAuth 2.0 IDP names that contain whitespace.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-31-190119
How reproducible:
Always
Steps to Reproduce:
1. Set up a Google IDP as below:
$ oc create secret generic google-secret-1 --from-literal=clientSecret=xxxxxxxx -n openshift-config
$ oc edit oauth cluster
spec:
  identityProviders:
  - google:
      clientID: 9745..snipped..apps.googleusercontent.com
      clientSecret:
        name: google-secret-1
      hostedDomain: redhat.com
    mappingMethod: claim
    name: 'my Google idp'
    type: Google
...
Actual results:
oauth-server panic:
$ oc get po -n openshift-authentication NAME READY STATUS RESTARTS oauth-openshift-59545c6f5-dwr6s 0/1 CrashLoopBackOff 11 (4m10s ago) ... $ oc logs -p -n openshift-authentication oauth-openshift-59545c6f5-dwr6s Copying system trust bundle I1101 03:40:09.883698 1 dynamic_serving_content.go:113] "Loaded a new cert/key pair" name="serving-cert::/var/config/system/secrets/v4-0-config-system-serving-cert/tls.crt::/var/config/system/secrets/v4-0-config-system-serving-cert/tls.key" I1101 03:40:09.884046 1 dynamic_serving_content.go:113] "Loaded a new cert/key pair" name="sni-serving-cert::/var/config/system/secrets/v4-0-config-system-router-certs/apps.hongli-az.qe.azure.devcluster.openshift.com::/var/config/system/secrets/v4-0-config-system-router-certs/apps.hongli-az.qe.azure.devcluster.openshift.com" I1101 03:40:10.335739 1 audit.go:340] Using audit backend: ignoreErrors<log> I1101 03:40:10.347632 1 requestheader_controller.go:244] Loaded a new request header values for RequestHeaderAuthRequestController panic: parsing "/oauth2callback/my Google idp": at offset 0: invalid method "/oauth2callback/my"goroutine 1 [running]: net/http.(*ServeMux).register(...) net/http/server.go:2738 net/http.(*ServeMux).Handle(0x29844c0?, {0xc0008886a0?, 0x2984420?}, {0x2987fc0?, 0xc0006ff4a0?}) net/http/server.go:2701 +0x56 github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).getAuthenticationHandler(0xc0006c28c0, {0x298f618, 0xc0008a4d00}, {0x2984540, 0xc000171450}) github.com/openshift/oauth-server/pkg/oauthserver/auth.go:407 +0x11ad github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).getAuthorizeAuthenticationHandlers(0xc0006c28c0, {0x298f618, 0xc0008a4d00}, {0x2984540, 0xc000171450}) github.com/openshift/oauth-server/pkg/oauthserver/auth.go:243 +0x65 github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).WithOAuth(0xc0006c28c0, {0x2982500, 0xc0000aca80}) github.com/openshift/oauth-server/pkg/oauthserver/auth.go:108 +0x21d github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).buildHandlerChainForOAuth(0xc0006c28c0, {0x2982500?, 0xc0000aca80?}, 0xc000785888) github.com/openshift/oauth-server/pkg/oauthserver/oauth_apiserver.go:342 +0x45 k8s.io/apiserver/pkg/server.completedConfig.New.func1({0x2982500?, 0xc0000aca80?}) k8s.io/apiserver@v0.29.2/pkg/server/config.go:825 +0x28 k8s.io/apiserver/pkg/server.NewAPIServerHandler({0x252ca0a, 0xf}, {0x2996020, 0xc000501a00}, 0xc0005d1740, {0x0, 0x0}) k8s.io/apiserver@v0.29.2/pkg/server/handler.go:96 +0x2ad k8s.io/apiserver/pkg/server.completedConfig.New({0xc000785888?, {0x0?, 0x0?}}, {0x252ca0a, 0xf}, {0x29b41a0, 0xc000171370}) k8s.io/apiserver@v0.29.2/pkg/server/config.go:833 +0x2a5 github.com/openshift/oauth-server/pkg/oauthserver.completedOAuthConfig.New({{0xc0005add40?}, 0xc0006c28c8?}, {0x29b41a0?, 0xc000171370?}) github.com/openshift/oauth-server/pkg/oauthserver/oauth_apiserver.go:322 +0x6a github.com/openshift/oauth-server/pkg/cmd/oauth-server.RunOsinServer(0xc000451cc0?, 0xc000810000?, 0xc00061a5a0) github.com/openshift/oauth-server/pkg/cmd/oauth-server/server.go:45 +0x73 github.com/openshift/oauth-server/pkg/cmd/oauth-server.(*OsinServerOptions).RunOsinServer(0xc00030e168, 0xc00061a5a0) github.com/openshift/oauth-server/pkg/cmd/oauth-server/cmd.go:108 +0x259 github.com/openshift/oauth-server/pkg/cmd/oauth-server.NewOsinServerCommand.func1(0xc00061c300?, {0x251a8c8?, 0x4?, 0x251a8cc?}) github.com/openshift/oauth-server/pkg/cmd/oauth-server/cmd.go:46 +0xed 
github.com/spf13/cobra.(*Command).execute(0xc000780008, {0xc00058d6c0, 0x7, 0x7}) github.com/spf13/cobra@v1.7.0/command.go:944 +0x867 github.com/spf13/cobra.(*Command).ExecuteC(0xc0001a3b08) github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 github.com/spf13/cobra.(*Command).Execute(...) github.com/spf13/cobra@v1.7.0/command.go:992 k8s.io/component-base/cli.run(0xc0001a3b08) k8s.io/component-base@v0.29.2/cli/run.go:146 +0x290 k8s.io/component-base/cli.Run(0xc00061a5a0?) k8s.io/component-base@v0.29.2/cli/run.go:46 +0x17 main.main() github.com/openshift/oauth-server/cmd/oauth-server/main.go:46 +0x2de
Expected results:
No panic
Additional info:
Tried in an old env like 4.16.20 with the same steps, no panic:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.20   True        False         95m     Cluster version is 4.16.20
$ oc get po -n openshift-authentication
NAME                               READY   STATUS    RESTARTS   AGE
oauth-openshift-7dfcd8c8fd-77ltf   1/1     Running   0          116s
oauth-openshift-7dfcd8c8fd-sr97w   1/1     Running   0          89s
oauth-openshift-7dfcd8c8fd-tsrff   1/1     Running   0          62s
Description of problem:
New monitor test api-unreachable-from-client-metrics does not pass in MicroShift. Since this is a monitor test, there is no way to skip it and a fix is needed. This test is breaking the conformance job for MicroShift, which is critical for the job to become blocking.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Run conformance over MicroShift.
Steps to Reproduce:
1. 2. 3.
Actual results:
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-microshift-release-4.18-periodics-e2e-aws-ovn-ocp-conformance/1828583537415032832
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/images/pull/191
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
oc-mirror should fail when the call to the Cincinnati API fails.
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202407291514.p0.gdbf115f.assembly.stream.el9-dbf115f", GitCommit:"dbf115f547a19f12ab72e7b326be219a47d460a0", GitTreeState:"clean", BuildDate:"2024-07-29T15:52:52Z", GoVersion:"go1.22.4 (Red Hat 1.22.4-2.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
100%
Steps to Reproduce:
1) Set up a squid proxy
2) Use the following imagesetconfig to mirror OCP:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    graph: true
    channels:
    - name: stable-4.15
      type: ocp
      minVersion: '4.15.18'
      maxVersion: '4.15.18'
oc-mirror -c config.yaml file://out38037 --v2
Actual results:
2) oc-mirror failed to get the Cincinnati API, but oc-mirror just logs an error, states that there are 0 images to copy, and continues:
oc-mirror -c config-38037.yaml file://out38037 --v2
2024/08/13 04:27:41 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/08/13 04:27:41 [INFO] : 👋 Hello, welcome to oc-mirror
2024/08/13 04:27:41 [INFO] : ⚙️ setting up the environment for you...
2024/08/13 04:27:41 [INFO] : 🔀 workflow mode: mirrorToDisk
2024/08/13 04:27:41 [INFO] : 🕵️ going to discover the necessary images...
2024/08/13 04:27:41 [INFO] : 🔍 collecting release images...
I0813 04:27:41.388376 203687 core-cincinnati.go:508] Using proxy 127.0.0.1:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.15&id=1454eaf7-7f41-4678-ae88-30d4957e24f9
2024/08/13 04:27:41 [ERROR] : get release images: error list APIRequestError: channel "stable-4.15": RemoteFailed: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.15&id=1454eaf7-7f41-4678-ae88-30d4957e24f9": Forbidden
2024/08/13 04:27:41 [WARN] : error during graph image processing - SKIPPING: Get "https://api.openshift.com/api/upgrades_info/graph-data": Forbidden
2024/08/13 04:27:41 [INFO] : 🔍 collecting operator images...
2024/08/13 04:27:41 [INFO] : 🔍 collecting additional images...
2024/08/13 04:27:41 [INFO] : 🚀 Start copying the images...
2024/08/13 04:27:41 [INFO] : images to copy 0
2024/08/13 04:27:41 [INFO] : === Results ===
2024/08/13 04:27:41 [INFO] : 📦 Preparing the tarball archive...
2024/08/13 04:27:41 [INFO] : mirror time : 464.620593ms
2024/08/13 04:27:41 [INFO] : 👋 Goodbye, thank you for using oc-mirror
Expected results:
When the Cincinnati API (api.openshift.com) is not reachable, oc-mirror should fail immediately.
Expected results:
networking-console-plugin deployment has the required-scc annotation
Additional info:
The deployment does not have any annotation about it
CI warning
# [sig-auth] all workloads in ns/openshift-network-console must set the 'openshift.io/required-scc' annotation annotation missing from pod 'networking-console-plugin-7c55b7546c-kc6db' (owners: replicaset/networking-console-plugin-7c55b7546c); suggested required-scc: 'restricted-v2'
Please review the following PR: https://github.com/openshift/installer/pull/8962
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The network section will be delivered using the networking-console-plugin through the cluster-network-operator.
So we have to remove the section from here to avoid duplication.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
Actual results:
Service, Route, Ingress and NetworkPolicy are defined two times in the section
Expected results:
Service, Route, Ingress and NetworkPolicy are defined only one time in the section
Additional info:
From david:
pod/metal3-static-ip-set in namespace/openshift-machine-api should trip some kind of test due to restartCount=5 on its container. Let's say any pod that is created after the install is finished should have restartCount=0, and see how many fail that criterion.
We should have a test that makes sure that pods created after the cluster is up do not have a non-zero restartCount.
Please review the following PR: https://github.com/openshift/openshift-controller-manager/pull/330
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/1140
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When installing a GCP cluster with the CAPI based method, the kube-api firewall rule that is created always uses a source range of 0.0.0.0/0. In the prior terraform based method, internal published clusters were limited to the network_cidr. This change opens up the API to additional sources, which could be problematic such as in situations where traffic is being routed from a non-cluster subnet.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Always
Steps to Reproduce:
1. Install a cluster in GCP with publish: internal 2. 3.
Actual results:
Kube-api firewall rule has source of 0.0.0.0/0
Expected results:
Kube-api firewall rule has a more limited source of network_cidr
Additional info:
Please review the following PR: https://github.com/openshift/csi-operator/pull/87
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Tests:
are being disabled in https://github.com/openshift/kubernetes/blob/master/openshift-hack/e2e/annotate/rules.go
These tests should be enabled after the 1.31 kube bump in oc
Component Readiness has found a potential regression in the following test:
[sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers for ns/openshift-image-registry
Probability of significant regression: 98.02%
Sample (being evaluated) Release: 4.17
Start Time: 2024-08-15T00:00:00Z
End Time: 2024-08-22T23:59:59Z
Success Rate: 94.74%
Successes: 180
Failures: 10
Flakes: 0
Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 89
Failures: 0
Flakes: 0
Also hitting 4.17, I've aligned this bug to 4.18 so the backport process is cleaner.
The problem appears to be a permissions error preventing the pods from starting:
2024-08-22T06:14:14.743856620Z ln: failed to create symbolic link '/etc/pki/ca-trust/extracted/pem/directory-hash/ca-certificates.crt': Permission denied
Originating from this code: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/podtemplatespec.go#L489
Both 4.17 and 4.18 nightlies bumped rhcos and in there is an upgrade like this:
container-selinux-3-2.231.0-1.rhaos4.16.el9-noarch container-selinux-3-2.231.0-2.rhaos4.17.el9-noarch
With slightly different versions in each stream, but both were on 3-2.231.
Hits other tests too:
operator conditions image-registry
Operator upgrade image-registry
[sig-cluster-lifecycle] Cluster completes upgrade
[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
[sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable after finishing upgrade [Late][Suite:upgrade]
Description of problem:
Checked in 4.17.0-0.nightly-2024-09-18-003538: the default thanos-ruler retention time is 24h, not the 15d mentioned in https://github.com/openshift/cluster-monitoring-operator/blob/release-4.17/Documentation/api.md#thanosrulerconfig. The issue exists in 4.12+.
$ for i in $(oc -n openshift-user-workload-monitoring get sts --no-headers | awk '{print $1}'); do echo $i; oc -n openshift-user-workload-monitoring get sts $i -oyaml | grep retention; echo -e "\n"; done
prometheus-user-workload
- --storage.tsdb.retention.time=24h
thanos-ruler-user-workload
- --tsdb.retention=24h
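For context, a hedged sketch of how the retention could be set explicitly in the UWM config; the field names follow the CMO api.md referenced above and the values are illustrative:
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      retention: 24h
    thanosRuler:
      retention: 24h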
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-18-003538
How reproducible:
always
Steps to Reproduce:
1. see the description
Actual results:
The default thanos-ruler retention time is documented as 15d in api.md.
Expected results:
It should be documented as 24h.
Additional info:
Related with https://issues.redhat.com/browse/OCPBUGS-23000
The cluster-autoscaler by default evicts all those pods, including those coming from daemon sets. In the case of EFS CSI drivers, which are mounted as NFS volumes, this is causing stale NFS mounts, and application workloads are not terminated gracefully.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
- While scaling down a node from the cluster-autoscaler-operator, the DS pods are being evicted.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
CSI pods should not be evicted by the cluster autoscaler (at least not prior to workload termination), as eviction might produce data corruption.
Additional info:
It is possible to disable CSI pod eviction by adding the following annotation on the CSI driver pod: cluster-autoscaler.kubernetes.io/enable-ds-eviction: "false"
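A minimal sketch of where that annotation would go on a CSI node DaemonSet; all names and the image are placeholders, not taken from the report:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: efs-csi-node                  # placeholder
  namespace: openshift-cluster-csi-drivers
spec:
  selector:
    matchLabels:
      app: efs-csi-node
  template:
    metadata:
      labels:
        app: efs-csi-node
      annotations:
        # ask the cluster autoscaler not to evict this DaemonSet pod
        cluster-autoscaler.kubernetes.io/enable-ds-eviction: "false"
    spec:
      containers:
      - name: csi-driver
        image: <csi-driver-image>     # placeholder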
Description of problem:
In discussion of https://issues.redhat.com/browse/OCPBUGS-37862 it was noticed that sometimes the haproxy-monitor is reporting "API is not reachable through HAProxy" which means it is removing the firewall rule to direct traffic to HAProxy. This is not ideal since it means keepalived will likely fail over the VIP and it may be breaking existing connections to HAProxy.
There are a few possible reasons for this. One is that we only require two failures of the healthcheck in the monitor to trigger this removal. For something we don't expect to need to happen often during normal operation of a cluster, this is probably a bit too harsh, especially since we only check every 6 seconds so it's not like we're looking for quick error detection. This is more a bootstrapping thing and a last ditch effort to keep the API functional if something has gone terribly wrong in the cluster. If it takes a few more seconds to detect an outage that's better than detecting outages that aren't actually outages.
The first thing we're going to try to fix this is to increase what amounts to the "fall" value for the monitor check. If that doesn't eliminate the problem we will have to look deeper at the HAProxy behavior during node reboots.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Panic seen in below CI job when run the below command
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?name=^periodic&type=junit&search=machine-config-controller.*Observed+a+panic' | grep 'failures match'
periodic-ci-openshift-insights-operator-stage-insights-operator-e2e-tests-periodic (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-insights-operator-release-4.17-insights-operator-e2e-tests-periodic (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
Panic observed:
E0910 09:00:04.283647 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference) goroutine 268 [running]: k8s.io/apimachinery/pkg/util/runtime.logPanic({0x36c8b40, 0x5660c90}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85 k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000ce8540?}) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b panic({0x36c8b40?, 0x5660c90?}) /usr/lib/golang/src/runtime/panic.go:770 +0x132 github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).updateNode(0xc000d6e360, {0x3abd580?, 0xc00224a608}, {0x3abd580?, 0xc001bd2308}) /go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:585 +0x1f3 k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:246 k8s.io/client-go/tools/cache.(*processorListener).run.func1() /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:976 +0xea k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33 k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001933f70, {0x3faaba0, 0xc000759710}, 0x1, 0xc00097bda0) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000750f70, 0x3b9aca00, 0x0, 0x1, 0xc00097bda0) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f k8s.io/apimachinery/pkg/util/wait.Until(...) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 k8s.io/client-go/tools/cache.(*processorListener).run(0xc000dc2630) /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:972 +0x69 k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1() /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x52 created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 261 /go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73 panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x33204b3]
Version-Release number of selected component (if applicable):
How reproducible:
Seen in this CI run -https://prow.ci.openshift.org/job-history/test-platform-results/logs/periodic-ci-openshift-insights-operator-stage-insights-operator-e2e-tests-periodic
Steps to Reproduce:
$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?name=^periodic&type=junit&search=machine-config-controller.*Observed+a+panic' | grep 'failures match'
Actual results:
Expected results:
No panic to observe
Additional info:
Failures beginning in 4.18.0-0.ci-2024-10-08-185524
Suite run returned error: unable to extract image references from release payload: failed extracting image-references from "registry.ci.openshift.org/ocp/release:4.18.0-0.ci-2024-10-08-185524": error during image extract: exit status 1 (error: unable to read image registry.ci.openshift.org/ocp/release:4.18.0-0.ci-2024-10-08-185524: unauthorized: authentication required ) error running options: unable to extract image references from release payload: failed extracting image-references from "registry.ci.openshift.org/ocp/release:4.18.0-0.ci-2024-10-08-185524": error during image extract: exit status 1 (error: unable to read image registry.ci.openshift.org/ocp/release:4.18.0-0.ci-2024-10-08-185524: unauthorized: authentication required )error: unable to extract image references from release payload: failed extracting image-references from "registry.ci.openshift.org/ocp/release:4.18.0-0.ci-2024-10-08-185524": error during image extract: exit status 1 (error: unable to read image registry.ci.openshift.org/ocp/release:4.18.0-0.ci-2024-10-08-185524: unauthorized: authentication required )
Undiagnosed panic detected in pod
This test is failing the majority of the time on hypershift jobs.
The failure looks straightforward:
{ pods/openshift-kube-controller-manager_kube-controller-manager-ip-10-0-18-18.ec2.internal_cluster-policy-controller.log.gz:E1015 12:53:31.246033 1 scctopsamapping.go:336] "Observed a panic" panic="unknown volume type: image" panicGoValue="&errors.errorString{s:\"unknown volume type: image\"}" stacktrace=<
We're close to losing the history needed to see exactly when, but it looks like this may have started Oct 3rd.
For job runs with the test failure see here.
Description of problem:
When hosted zones are created in the cluster creator account, and the ingress role is a role in the cluster creator account, the private link controller fails to create DNS records in the local zone.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
1. Set up shared vpc infrastructure in which the hosted zone and local zone exist in the cluster creator account. 2. Create a hosted cluster
Actual results:
The hosted cluster never gets nodes to join because it is missing records in the local hosted zone.
Expected results:
The hosted cluster completes installation with available nodes.
Additional info:
Creating the hosted zones in the cluster creator account is an alternative way of setting up shared vpc infrastructure. In this mode, the role to assume for creating DNS records is a role in the cluster creator account and not in the vpc account.
Description of problem:
The [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility setup test frequently fails on OpenStack platform, which in turn also causes the [sig-network] can collect pod-to-service poller pod logs and [sig-network] can collect host-to-service poller pod logs tests to fail.
These failures happen frequently in vh-mecha, for example in all CSI jobs, such as 4.16-e2e-openstack-csi-cinder.
Please review the following PR: https://github.com/openshift/cluster-baremetal-operator/pull/442
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
These two tests have been flaking more often lately. The TestLeaderElection flake is partially (but not solely) connected to OCPBUGS-41903. TestOperandProxyConfiguration seems to fail in the teardown while waiting for other cluster operators to become available. Although these flakes aren't customer facing, they considerably slow development cycles (due to retests) and also consume more resources than they should (every retest runs on a new cluster), so we want to backport the fixes.
Version-Release number of selected component (if applicable):
4.18, 4.17, 4.16, 4.15, 4.14
How reproducible:
Sometimes
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
installing into GCP shared VPC with BYO hosted zone failed with error "failed to create the private managed zone"
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-multi-2024-08-26-170521
How reproducible:
Always
Steps to Reproduce:
1. Pre-create the dns private zone in the service project, with the zone's dns name like "<cluster name>.<base domain>" and binding to the shared VPC
2. Activate the service account having minimum permissions, i.e. no permission to bind a private zone to the shared VPC in the host project (see [1])
3. "create install-config" and then insert the interested settings (e.g. see [2])
4. "create cluster"
Actual results:
It still tries to create a private zone, which is unexpected.
failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed provisioning resources after infrastructure ready: failed to create the private managed zone: failed to create private managed zone: googleapi: Error 403: Forbidden, forbidden
Expected results:
The installer should use the pre-configured dns private zone, rather than try to create a new one.
Additional info:
The 4.16 epic adding the support: https://issues.redhat.com/browse/CORS-2591
One PROW CI test which succeeded using Terraform installation: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-multi-nightly-4.17-upgrade-from-stable-4.17-gcp-ipi-xpn-mini-perm-byo-hosted-zone-arm-f28/1821177143447523328
The PROW CI test which failed: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-multi-nightly-gcp-ipi-xpn-mini-perm-byo-hosted-zone-amd-f28-destructive/1828255050678407168
Description of problem:
OCP Conformance MonitorTests can fail depending on the order in which the CSI driver pods and ClusterRole are applied. The SA, ClusterRole, and ClusterRoleBinding should likely be applied first, prior to the deployment/pods.
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
60%
Steps to Reproduce:
1. Create IPI cluster on IBM Cloud 2. Run OCP Conformance w/ MonitorTests
Actual results:
: [sig-auth][Feature:SCC][Early] should not have pod creation failures during install [Suite:openshift/conformance/parallel] { fail [github.com/openshift/origin/test/extended/authorization/scc.go:76]: 1 pods failed before test on SCC errors Error creating: pods "ibm-vpc-block-csi-node-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[5]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[6]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[7]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[9]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, provider restricted-v2: .containers[0].runAsUser: Invalid value: 0: must be in the ranges: [1000180000, 1000189999], provider restricted-v2: .containers[1].runAsUser: Invalid value: 0: must be in the ranges: [1000180000, 1000189999], provider restricted-v2: .containers[1].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[2].runAsUser: Invalid value: 0: must be in the ranges: [1000180000, 1000189999], provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for DaemonSet.apps/v1/ibm-vpc-block-csi-node -n openshift-cluster-csi-drivers happened 7 times Ginkgo exit error 1: exit with code 1}
Expected results:
No pod creation failures using the wrong SCC, because the ClusterRole/ClusterRoleBinding, etc. had not been applied yet.
Additional info:
Sorry, I did not see IBM Cloud Storage listed in the targeted Component for this bug, so I selected the generic Storage component. Please forward as necessary/possible.
Items to consider:
ClusterRole: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/assets/rbac/privileged_role.yaml
ClusterRoleBinding: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/assets/rbac/node_privileged_binding.yaml
The ibm-vpc-block-csi-node-* pods eventually reach running using the privileged SCC.
I do not know whether it is possible to stage the resources that get created first, within the CSI Driver Operator https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/9288e5078f2fe3ce2e69a4be3d94622c164c3dbd/pkg/operator/starter.go#L98-L99
Prior to the CSI Driver daemonset (`node.yaml`), perhaps order matters within the list.
Example of failure in CI: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/8235/pull-ci-openshift-installer-master-e2e-ibmcloud-ovn/1836521032031145984
Description of problem:
On "Search" page, search resource VolumeSnapshots/VolumeSnapshotClasses and filter with label, the filter doesn't work.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-04-08-024331
How reproducible:
always
Steps to Reproduce:
To reproduce the VolumeSnapshot bug:
1. Go to the VolumeSnapshots page under a namespace that has VolumeSnapshotClaims defined (e.g. openshift-pipelines)
2. Create two new VolumeSnapshots - use one of the defined VolumeSnapshotClaims during creation.
3. Click on one of the created VolumeSnapshots and add a label - e.g. "demoLabel".
4. Go to the "Search" page, choose the "VolumeSnapshots" resource, filter with any label, e.g. "demoLabel", "something"
To reproduce the VolumeSnapshotClass bug:
1. Go to the VolumeSnapshotClasses page
2. Create two new VolumeSnapshotClasses.
3. Click on one of the created VolumeSnapshotClasses and add a label - e.g. "demoLabel".
4. Go to the "Search" page, choose the "VolumeSnapshots" resource, filter with any label, e.g. "demoLabel", "something"
Actual results:
1. Label filters don't work.
2. VolumeSnapshots are listed without being filtered by label.
3. VolumeSnapshotClasses are listed without being filtered by label.
Expected results:
1. VSs and VSCs should be filtered by label.
Additional info:
Screenshots VS: https://drive.google.com/drive/folders/1GEUgOn5FXr-l3LJNF-FWBmn-bQ8uE_rD?usp=sharing
Screenshots VSC: https://drive.google.com/drive/folders/1gI7PNCzcCngfmFT5oI1D6Bask5EPsN7v?usp=sharing
Description of problem:
%s is not populated with the authoritativeAPI value when the cluster is enabled for migration.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-23-182657
How reproducible:
Always
Steps to Reproduce:
Set the featuregate as below:
spec:
  featureSet: CustomNoUpgrade
  customNoUpgrade:
    enabled:
    - MachineAPIMigration
Update - use `oc edit --subresource status` to add the `.status.authoritativeAPI` field to see the behaviour of the pausing, e.g.:
oc edit --subresource status machineset.machine.openshift.io miyadav-2709a-5v7g7-worker-eastus2
Actual results:
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2024-09-27T07:22:58Z"
    reason: AuthoritativeAPI is set to MachineAPI
    severity: The AuthoritativeAPI is set to %s
    status: "False"
    type: Paused
  fullyLabeledRepl
Expected results:
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2024-09-27T07:22:58Z"
    reason: AuthoritativeAPI is set to MachineAPI
    severity: The AuthoritativeAPI is set to MachineAPI
    status: "False"
    type: Paused
  fullyLabeledRepl
Additional info:
related to - https://issues.redhat.com/browse/OCPCLOUD-2565
message: 'The AuthoritativeAPI is set to '
reason: AuthoritativeAPIMachineAPI
severity: Info
status: "False"
type: Paused
Description of problem:
Specifying additionalTrustBundle in the HC doesn't propagate down to the worker nodes.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100%
Steps to Reproduce:
1. Create a CM with the additionalTrustBundle (see the sketch below)
2. Specify the CM in HC.Spec.AdditionalTrustBundle
3. Debug the worker nodes and check whether the additionalTrustBundle has been updated
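A minimal sketch of the two objects from steps 1 and 2, assuming placeholder names and that the configmap carries the bundle under the ca-bundle.crt key:
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-ca-bundle         # placeholder
  namespace: clusters          # namespace where the HostedCluster lives
data:
  ca-bundle.crt: |
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
---
apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example
  namespace: clusters
spec:
  additionalTrustBundle:
    name: user-ca-bundle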
Actual results:
The additionalTrustBundle hasn't propagated down to the nodes.
Expected results:
The additionalTrustBundle is propagated down to the nodes.
Additional info:
Description of problem:
A Redfish exception occurred while provisioning a worker using HW RAID configuration on an HP server with iLO 5:
step': 'delete_configuration', 'abortable': False, 'priority': 0}: Redfish exception occurred. Error: The attribute StorageControllers/Name is missing from the resource /redfish/v1/Systems/1/Storage/DE00A000
Spec used:
spec:
  raid:
    hardwareRAIDVolumes:
    - name: test-vol
      level: "1"
      numberOfPhysicalDisks: 2
      sizeGibibytes: 350
      online: true
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Provision an HPE worker with iLO 5 using Redfish 2. 3.
Actual results:
Expected results:
Additional info:
Software production has changed the key they want ART to sign with. ART is currently signing with the original key we were provided and sigstore-3.
Allow CMO tests to be linted as well.
Description of problem:
After configuring remote-write for UWM prometheus named "user-workload" in configmap named user-workload-monitoring-config, the proxyURL (same as cluster proxy resource) is not getting injected at all.
Version-Release number of selected component (if applicable):
4.16.4
How reproducible:
100%
Steps to Reproduce:
1. Configure the proxy custom resource in the RHOCP 4.16.4 cluster
2. Create the user-workload-monitoring-config configmap in the openshift-monitoring project
3. Inject the remote-write config (without specifically configuring a proxy for remote-write)
4. After saving the modification in the user-workload-monitoring-config configmap, check the remoteWrite config in the Prometheus user-workload CR. It does NOT contain the proxyUrl. Example snippet:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  [...]
  name: user-workload
  namespace: openshift-user-workload-monitoring
spec:
  [...]
  remoteWrite:
  - url: http://test-remotewrite.test.svc.cluster.local:9090   <<== No Proxy URL Injected
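For reference, a hedged sketch of the remote-write snippet as it would be added to the UWM configmap (the URL matches the example above; the expectation is that CMO then injects the cluster-wide proxyUrl into the generated Prometheus CR):
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      remoteWrite:
      - url: http://test-remotewrite.test.svc.cluster.local:9090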
Actual results:
UWM prometheus CR named "user-workload" doesn't inherit the proxyURL from cluster proxy resource.
Expected results:
UWM prometheus CR named "user-workload" should inherit proxyURL from cluster proxy resource and it should also respect noProxy which is configured in cluster proxy.
Additional info:
Description of problem:
CNO doesn't report, as a metric, when there is a network overlap when live migration is initiated.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-capi-operator/pull/200
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-cluster-capi-operator-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Please review the following PR: https://github.com/openshift/thanos/pull/151
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled
The GCP regions `us-east2` and `us-east3` don't have zones, and when these regions are used for creating a cluster, the installer crashes with the stack trace below.
$ openshift-install version
openshift-install 4.16.0-0.ci.test-2024-04-23-055943-ci-ln-z602w5b-latest
built from commit 0bbbb0261b724628c8e68569f31f86fd84669436
release image registry.build03.ci.openshift.org/ci-ln-z602w5b/release@sha256:a0df6e54dfd5d45e8ec6f2fcb07fa99cf682f7a79ea3bc07459d3ba1dbb47581
release architecture amd64
$ yq-3.3.0 r test4/install-config.yaml platform
gcp:
projectID: openshift-qe
region: us-east2
userTags:
- parentID: 54643501348
key: ocp_tag_dev
value: bar
- parentID: openshift-qe
key: Su.Shi-Jiang_Cheng_Zi
value: SHI NIAN
userLabels:
- key: createdby
value: installer-qe
- key: a
value: 8
$ yq-3.3.0 r test4/install-config.yaml credentialsMode
Passthrough
$ yq-3.3.0 r test4/install-config.yaml featureSet
TechPreviewNoUpgrade
$ yq-3.3.0 r test4/install-config.yaml metadata.name
jiwei-0424a
$
$ openshift-install create cluster --dir test4
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
panic: runtime error: index out of range [0] with length 0

goroutine 1 [running]:
github.com/openshift/installer/pkg/asset/installconfig/gcp.(*Client).GetMachineTypeWithZones(0xc0017a7f90?, {0x1f6dd998, 0x23ab4be0}, {0xc0007e6650, 0xc}, {0xc0007e6660, 0x8}, {0x7c2b604, 0xd})
/go/src/github.com/openshift/installer/pkg/asset/installconfig/gcp/client.go:142 +0x5e8
github.com/openshift/installer/pkg/asset/installconfig/gcp.ValidateInstanceType({0x1f6fe5e8?, 0xc0007e0428?}, 0xc001a7cde0, {0xc0007e6650?, 0x27f?}, {0xc0007e6660?, 0x40ffcf?}, {0xc000efe980, 0x0, 0x0}, ...)
/go/src/github.com/openshift/installer/pkg/asset/installconfig/gcp/validation.go:80 +0x6c
github.com/openshift/installer/pkg/asset/installconfig/gcp.validateInstanceTypes({0x1f6fe5e8, 0xc0007e0428}, 0xc00107f080)
/go/src/github.com/openshift/installer/pkg/asset/installconfig/gcp/validation.go:189 +0x4f7
github.com/openshift/installer/pkg/asset/installconfig/gcp.Validate({0x1f6fe5e8?, 0xc0007e0428}, 0xc00107f080)
/go/src/github.com/openshift/installer/pkg/asset/installconfig/gcp/validation.go:63 +0xf45
github.com/openshift/installer/pkg/asset/installconfig.(*InstallConfig).platformValidation(0xc0011d8f80)
/go/src/github.com/openshift/installer/pkg/asset/installconfig/installconfig.go:199 +0x21a
github.com/openshift/installer/pkg/asset/installconfig.(*InstallConfig).finish(0xc0011d8f80, {0x7c518a9, 0x13})
/go/src/github.com/openshift/installer/pkg/asset/installconfig/installconfig.go:171 +0x6ce
github.com/openshift/installer/pkg/asset/installconfig.(*InstallConfig).Load(0xc0011d8f80, {0x1f69a550?, 0xc001155c70?})
/go/src/github.com/openshift/installer/pkg/asset/installconfig/installconfig.go:112 +0x55
github.com/openshift/installer/pkg/asset/store.(*storeImpl).load(0xc000c5f440, {0x1f6c8080, 0xc0011d8ac0}, {0xc001163c6c, 0x4})
/go/src/github.com/openshift/installer/pkg/asset/store/store.go:264 +0x33f
github.com/openshift/installer/pkg/asset/store.(*storeImpl).load(0xc000c5f440, {0x1f6cc230, 0xc001199360}, {0x7c056f3, 0x2})
/go/src/github.com/openshift/installer/pkg/asset/store/store.go:247 +0x23a
github.com/openshift/installer/pkg/asset/store.(*storeImpl).load(0xc000c5f440, {0x7f88420a5000, 0x23a57c20}, {0x0, 0x0})
/go/src/github.com/openshift/installer/pkg/asset/store/store.go:247 +0x23a
github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc000c5f440, {0x1f6ddab0, 0xc0011b6eb0}, {0x7f88420a5000, 0x23a57c20}, {0x0, 0x0})
/go/src/github.com/openshift/installer/pkg/asset/store/store.go:201 +0x1b1
github.com/openshift/installer/pkg/asset/store.(*storeImpl).Fetch(0x7fffbc31408f?, {0x1f6ddab0?, 0xc0011b6eb0?}, {0x7f88420a5000, 0x23a57c20}, {0x23a27e60, 0x8, 0x8})
/go/src/github.com/openshift/installer/pkg/asset/store/store.go:77 +0x54
github.com/openshift/installer/pkg/asset/store.(*fetcher).FetchAndPersist(0xc001155c60, {0x1f6ddab0, 0xc0011b6eb0}, {0x23a27e60, 0x8, 0x8})
/go/src/github.com/openshift/installer/pkg/asset/store/assetsfetcher.go:47 +0x165
main.newCreateCmd.runTargetCmd.func3({0x7fffbc31408f?, 0x5?})
/go/src/github.com/openshift/installer/cmd/openshift-install/create.go:301 +0x6a
main.newCreateCmd.runTargetCmd.func4(0xc000fdf600?, {0xc001199260?, 0x4?, 0x7c06e81?})
/go/src/github.com/openshift/installer/cmd/openshift-install/create.go:315 +0x102
github.com/spf13/cobra.(*Command).execute(0x23a324c0, {0xc001199220, 0x2, 0x2})
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:987 +0xaa3
github.com/spf13/cobra.(*Command).ExecuteC(0xc001005500)
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:1115 +0x3ff
github.com/spf13/cobra.(*Command).Execute(...)
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:1039
main.installerMain()
/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:62 +0x385
main.main()
/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:36 +0x11d
Additional info: slack thread discussion https://redhat-internal.slack.com/archives/C01V1DP387R/p1713959395808119
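The panic happens in GetMachineTypeWithZones when the zone list for the region is empty and index 0 is accessed. A minimal sketch of the kind of guard that would avoid it (the interface and function names here are assumptions for illustration, not the installer's actual code):

```go
package gcpvalidation

import (
	"context"
	"fmt"
)

// zoneLister stands in for the installer's GCP client.
type zoneLister interface {
	Zones(ctx context.Context, project, region string) ([]string, error)
}

// validateRegionHasZones returns a normal validation error instead of letting
// an empty zone list cause an index-out-of-range panic later on.
func validateRegionHasZones(ctx context.Context, c zoneLister, project, region string) error {
	zones, err := c.Zones(ctx, project, region)
	if err != nil {
		return fmt.Errorf("failed to list zones in region %s: %w", region, err)
	}
	if len(zones) == 0 {
		return fmt.Errorf("region %s reports no zones; it cannot be used for cluster installation", region)
	}
	return nil
}
```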
1. Proposed title of this feature request
Collect number of resources in etcd
2. What is the nature and description of the request?
The number of resources is useful in several scenarios, like kube-apiserver high memory usage.
3. Why does the customer need this? (List the business requirements here)
The information will be useful for OpenShift Support when investigating scenarios like kube-apiserver high memory usage.
4. List any affected packages or components.
must-gather
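A rough sketch of how such a count could be gathered with the discovery and dynamic clients (illustrative only, not the must-gather implementation; listable-verb checks and pagination are omitted):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	disc := discovery.NewDiscoveryClientForConfigOrDie(cfg)
	dyn := dynamic.NewForConfigOrDie(cfg)

	// Ignore partial discovery errors; we only need the resource lists we got.
	lists, _ := disc.ServerPreferredResources()
	for _, list := range lists {
		gv, err := schema.ParseGroupVersion(list.GroupVersion)
		if err != nil {
			continue
		}
		for _, res := range list.APIResources {
			objs, err := dyn.Resource(gv.WithResource(res.Name)).List(context.TODO(), metav1.ListOptions{})
			if err != nil {
				continue // skip resources we cannot list
			}
			fmt.Printf("%s/%s: %d objects\n", list.GroupVersion, res.Name, len(objs.Items))
		}
	}
}
```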
This feature conditionally creates a button within the VirtualizedTable component that allows clients to download the data within the table as comma-separated values (.csv).
Both PRs are needed to test the feature.
The PRs are
https://github.com/openshift/console/pull/14050
and
https://github.com/openshift/monitoring-plugin/pull/133
The monitoring-plugin passes a string called 'csvData', which contains metrics data formatted as comma-separated values. The console then consumes 'csvData' in the 'VirtualizedTable' component. 'VirtualizedTable' renders the 'Export as CSV' button only if this 'csvData' property is present; without it, the button will not render.
The console's CI/CD pipeline (tide) requires that issues have a valid Jira reference, presumably in this (OpenShift Console) board. This ticket is a duplicate of
https://issues.redhat.com/browse/OU-431
While debugging a problem, I noticed some containers lack FallbackToLogsOnError. This is important for debugging via the API. Found via https://github.com/openshift/origin/pull/28547
Component Readiness has found a potential regression in the following test:
[sig-storage] [Serial] Volume metrics Ephemeral should create volume metrics with the correct BlockMode PVC ref [Suite:openshift/conformance/serial] [Suite:k8s]
Probability of significant regression: 100.00%
Description of problem:
The AWS api-int load balancer is either misconfigured or buggy: it allows new connections while the apiserver is being shut down. The termination log has messages like "Request to %q (source IP %s, user agent %q) through a connection created very late in the graceful termination process (more than 80%% has passed), possibly a sign for a broken load balancer setup", and the in-cluster monitoring suite shows multiple one-second disruptions.
For troubleshooting OSUS cases, the default must-gather doesn't collect OSUS information, and an inspect of the openshift-update-service namespace is missing several OSUS-related resources like UpdateService, ImageSetConfiguration, and maybe more.
Create a specific must-gather image for OSUS (as there are for other operators/components [1]).
The Cluster API provider Azure has a deployment manifest that deploys Azure service operator from mcr.microsoft.com/k8s/azureserviceoperator:v2.6.0 image.
We need to set up OpenShift builds of the operator and update the manifest generator to use the OpenShift image.
Azure has split the API calls out of their provider so that they now use the service operator. We now need to ship the service operator as part of the CAPI operator to make sure that we can support CAPZ.
https://redhat-internal.slack.com/archives/C04TMSTHUHK/p1725998554253779
> We're using version 0.1.0, while the current version is 0.5.7
As an OpenShift engineer, I want to keep the vSphere provider up to date with the most current version of CAPI so that we don't fall behind and cause potential future problems.
For disconnected clusters, we will need to move to use ImageDigestMirrorSet (IDMS) since ImageContentSourcePolicy (ICSP) is currently deprecated and will eventually be removed.
There are several scenarios:
Currently we use the ICSP flag when using oc CLI commands. We need to use the IDMS flag instead.
To allow for easier injection of values and if/else switches, we should move the existing pod template in pod.yaml to a gotemplate.
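As a rough illustration of the direction (the template fields and values are placeholders, not the project's actual pod.yaml), a Go text/template makes value injection and conditional sections straightforward:

```go
package main

import (
	"os"
	"text/template"
)

// podTemplate is a placeholder manifest showing value injection and an
// if/else-style switch; the real pod.yaml content would replace it.
const podTemplate = `apiVersion: v1
kind: Pod
metadata:
  name: {{ .Name }}
spec:
  containers:
  - name: agent
    image: {{ .Image }}
{{- if .Privileged }}
    securityContext:
      privileged: true
{{- end }}
`

func main() {
	tmpl := template.Must(template.New("pod").Parse(podTemplate))
	err := tmpl.Execute(os.Stdout, struct {
		Name, Image string
		Privileged  bool
	}{Name: "example-pod", Image: "quay.io/example/agent:latest", Privileged: true})
	if err != nil {
		panic(err)
	}
}
```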
AC:
As a developer, I want to run yarn check-cycles in CI so that OCPBUGS-44017 won't occur again
Description of problem:
Sippy complains about pathological events in ns/openshift-cluster-csi-drivers in vsphere-ovn-serial jobs. See this job as one example.
Jan noticed that the DaemonSet generation is 10-12, while in 4.17 it is 2. Why is our operator updating the DaemonSet so often?
I wrote a quick "one-liner" to generate json diffs from the vmware-vsphere-csi-driver-operator logs:
prev=''; grep 'DaemonSet "openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-node" changes' openshift-cluster-csi-drivers_vmware-vsphere-csi-driver-operator-5b79c58f6f-hpr6g_vmware-vsphere-csi-driver-operator.log | sed 's/^.*changes: //' | while read -r line; do diff <(echo $prev | jq .) <(echo $line | jq .); prev=$line; echo "####"; done
It really seems to be only operator.openshift.io/spec-hash and operator.openshift.io/dep-* fields changing in the json diffs:
####
4,5c4,5
< "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q==",
< "operator.openshift.io/spec-hash": "fb274874404ad6706171c6774a369876ca54e037fcccc200c0ebf3019a600c36"
---
> "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A==",
> "operator.openshift.io/spec-hash": "27a1bab0c00ace8ac21d95a5fe9a089282e7b2b3ec042045951bd5e26ae01a09"
13c13
< "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q=="
---
> "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A=="
####
4,5c4,5
< "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A==",
< "operator.openshift.io/spec-hash": "27a1bab0c00ace8ac21d95a5fe9a089282e7b2b3ec042045951bd5e26ae01a09"
---
> "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q==",
> "operator.openshift.io/spec-hash": "fb274874404ad6706171c6774a369876ca54e037fcccc200c0ebf3019a600c36"
13c13
< "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A=="
---
> "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q=="
####
The deployment is also changing in the same way. We need to find what is causing the spec-hash and dep-* fields to change and avoid the unnecessary churn that causes new daemonset / deployment rollouts.
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
~20% failure rate in 4.18 vsphere-ovn-serial jobs
Steps to Reproduce:
Actual results:
operator rolls out unnecessary daemonset / deployment changes
Expected results:
don't roll out changes unless there is a spec change
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The goal is to collect metrics about some features used by the OpenTelemetry operator, because this will be useful for improving the product: knowing which features customers use, we can better focus our efforts on improving those features. A hedged sketch of how such gauges could be registered follows the metric list below.
opentelemetry_collector_receivers gauge that represents the number of OpenTelemetry collector instances that use a certain receiver
Labels
Cardinality: 12
opentelemetry_collector_exporters gauge that represents the number of OpenTelemetry collector instances that use a certain exporter
Labels
Cardinality: 9
opentelemetry_collector_processors gauge that represents the number of OpenTelemetry collector instances that use a certain processor
Labels
Cardinality: 11
opentelemetry_collector_extensions gauge that represents the number of OpenTelemetry collector instances that use a certain extension
Labels
Cardinality: 10
opentelemetry_collector_connectors gauge that represents the number of OpenTelemetry collector instances that use a certain connector
Labels
Cardinality: 2
opentelemetry_collector_info gauge that represents the number of OpenTelemetry collector instances that use a certain deployment type
Labels
Cardinality: 4
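A hedged sketch of how gauges like these could be registered and set with client_golang (the label name, the helper, and the registry choice are assumptions, not the operator's actual code):

```go
package telemetry

import (
	"github.com/prometheus/client_golang/prometheus"

	crmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// collectorReceivers mirrors the shape of the proposed
// opentelemetry_collector_receivers metric; the "type" label is an assumption.
var collectorReceivers = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "opentelemetry_collector_receivers",
		Help: "Number of OpenTelemetry collector instances using a given receiver.",
	},
	[]string{"type"},
)

func init() {
	crmetrics.Registry.MustRegister(collectorReceivers)
}

// RecordReceiverUsage sets one gauge sample per receiver type.
func RecordReceiverUsage(countsByReceiver map[string]int) {
	for receiver, n := range countsByReceiver {
		collectorReceivers.WithLabelValues(receiver).Set(float64(n))
	}
}
```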
This test failed 3 times in the last week with the following error:
KubeAPIErrorBudgetBurn was at or above info for at least 2m28s on platformidentification.JobType{Release} (maxAllowed=0s): pending for 1h33m52s, firing for 2m28s:

Sep 16 21:20:56.839 - 148s E namespace/openshift-kube-apiserver alert/KubeAPIErrorBudgetBurn alertstate/firing severity/critical ALERTS{alertname="KubeAPIErrorBudgetBurn", alertstate="firing", long="6h", namespace="openshift-kube-apiserver", prometheus="openshift-monitoring/k8s", severity="critical", short="30m"}
It didn't fail a single time in the previous month on 4.17, nor in the month before we shipped 4.16, so I'm proposing this as a blocker to be investigated. Below is the boilerplate Component Readiness text:
Component Readiness has found a potential regression in the following test:
[bz-kube-apiserver][invariant] alert/KubeAPIErrorBudgetBurn should not be at or above info
Probability of significant regression: 99.04%
Sample (being evaluated) Release: 4.17
Start Time: 2024-09-10T00:00:00Z
End Time: 2024-09-17T23:59:59Z
Success Rate: 85.71%
Successes: 18
Failures: 3
Flakes: 0
Base (historical) Release: 4.16
Start Time: 2024-05-28T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 74
Failures: 0
Flakes: 0
Some of the E2E tests could be considered read-only, such as looping until a PromQL expression is true.
Additionally, some tests are non-disruptive: all their operations are performed within a temporary namespace without impacting the monitoring components' statuses.
We can run them with t.Parallel() to save some minutes (see the sketch after this list).
Also, we can:
Isolate specific tests to enable parallel execution
Enhance the resilience of some tests and fix those prone to errors.
Fix some tests that were running wrong checks.
Make some of the tests idempotent so they can be easily debugged and run locally.
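A minimal sketch of what marking a read-only test as parallel-safe could look like (the test name and the polling body are illustrative, not from the actual CMO suite):

```go
package e2e

import (
	"testing"
	"time"
)

// TestReadOnlyQueryExample only polls for a condition and touches no shared
// cluster state, so it can opt into the shared parallel pool.
func TestReadOnlyQueryExample(t *testing.T) {
	t.Parallel()

	deadline := time.Now().Add(2 * time.Minute)
	for {
		// Placeholder for "evaluate a PromQL expression and check the result".
		if conditionMet() {
			return
		}
		if time.Now().After(deadline) {
			t.Fatal("condition not met within 2 minutes")
		}
		time.Sleep(5 * time.Second)
	}
}

// conditionMet stands in for the real query helper.
func conditionMet() bool { return true }
```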
Description of problem:
The certrotation controller uses the applySecret/applyConfigmap functions from library-go to update secrets/configmaps. This controller has several replicas running in parallel, so it may overwrite changes applied by a different replica, which leads to unexpected signer updates and corrupted CA bundles. applySecret/applyConfigmap does an initial Get and then calls Update, which overwrites the changes made to the copy received from the informer. Instead, it should issue Update calls directly using the copy received from the informer, so that etcd would reject the change if the resourceVersion was updated in parallel.
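A minimal sketch of the suggested direction, assuming a plain client-go client and an informer-cached Secret (the function and its shape are illustrative, not library-go's actual API): mutate a deep copy of the informer object and call Update directly, so a stale resourceVersion surfaces as a Conflict instead of silently clobbering a parallel write.

```go
package certrotation

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// updateSecretFromCache mutates a copy of the informer/lister object and sends
// it as-is, keeping its resourceVersion so the apiserver rejects the write if
// another replica updated the secret in the meantime.
func updateSecretFromCache(ctx context.Context, client kubernetes.Interface, cached *corev1.Secret, mutate func(*corev1.Secret)) (*corev1.Secret, error) {
	desired := cached.DeepCopy()
	mutate(desired)
	updated, err := client.CoreV1().Secrets(desired.Namespace).Update(ctx, desired, metav1.UpdateOptions{})
	if apierrors.IsConflict(err) {
		// Another replica won the race; requeue and retry with a fresher copy.
		return nil, err
	}
	return updated, err
}
```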
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
On "Networking"->"NetworkPolicies" page, when "MultiNetworkPolicies disabled", on "NetworkPolicies" tab, select a project, eg "default" from dropdown list. Then click tab "MultiNetworkPolicies", and click back to "NetworkPolicies" tab, the project dropdown is set to "All Projects" automatically
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-09-09-150616 4.17.0-0.nightly-2024-09-09-120947
How reproducible:
Always
Steps to Reproduce:
1.On "Networking"->"NetworkPolicies" page, when "MultiNetworkPolicies disabled", on "NetworkPolicies" tab, select a project, eg "default" from dropdown list. Then click tab "MultiNetworkPolicies", and click back to "NetworkPolicies" tab 2. 3.
Actual results:
1. The project dropdown is set to "All Projects" automatically
Expected results:
1. The project dropdown should be set to "default" as originally selected.
Additional info:
Add Webb Scales and Baiju as helm owners.
The goal is to collect metrics about Cluster Logging Operator 6.y, so that we can track usage of features in the new release version.
"openshift_logging:log_forwarder_pipelines:sum" represents the number of logging pipelines managed by CLO per namespace.
Labels
The cardinality of the metric is "one per namespace", which for most clusters will be one.
"openshift_logging:log_forwarder_pipelines:count" represents the number of deployed ClusterLogForwarders per namespace.
Labels
The cardinality of the metric is "one per namespace", which for most clusters will be one.
"openshift_logging:log_forwarder_input_type:sum" represents the number of inputs managed by CLO per namespace.
Labels
The cardinality of the metric is "one per namespace and input type". I expect this to be two for most customers.
"openshift_logging:log_forwarder_output_type:sum" represents the number of outputs managed by CLO per namespace.
Labels
The cardinality of the metric is "one per namespace and output type". I expect most customers to use one or two output types.
"openshift_logging:vector_component_received_bytes_total:rate5m" represents current total log rate for a cluster for log collectors managed by CLO.
Labels
The cardinality of the metric is "one per namespace", which for most clusters will be one.
Component exposing the metric: https://github.com/openshift/cluster-logging-operator/blob/master/internal/metrics/telemetry/telemetry.go#L25-L47
The recording rules for these metrics are currently reviewed in this PR: https://github.com/openshift/cluster-logging-operator/pull/2823
As a developer, I want to have automated e2e testing on PRs so that I can make sure the changes for cluster-api-provider-ibmcloud are thoroughly tested.
The monitoring-plugin is still using PatternFly v4; it needs to be upgraded to PatternFly v5. This major version release deprecates components used by the monitoring-plugin, which will need to be replaced or removed to accommodate the version update.
We need to remove the deprecated components from the monitoring plugin, extending the work from CONSOLE-4124
Work to be done:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Delete the openshift-monitoring/monitoring-plugin-cert secret; SCO will re-create a new one with different content.
Actual results:
- monitoring-plugin is still using the old cert content.
- If the cluster doesn't show much activity, the hash may take time to be updated.
Expected results:
CMO should detect that exact change and run a sync to recompute and set the new hash.
Additional info:
- We shouldn't rely on another change to trigger the sync loop.
- CMO should maybe watch that secret? (Its name isn't known in advance.)
Whenever we update dependencies in the main module or the api module, compilation breaks for developers that are using a go workspace locally. We can ensure that the dependencies are kept in sync by running a 'go work sync' in a module where the hypershift repo is a symlinked child.
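As a rough sketch (the directory layout is an assumption, not the repo's actual setup), the workspace that such a `go work sync` run would operate on looks something like this, with the symlinked hypershift checkout and its api module both listed as use directives:

```go
// go.work in the developer's workspace module
go 1.22

use (
	.
	./hypershift
	./hypershift/api
)
```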
Along with disruption monitoring via external endpoint we should add in-cluster monitors which run the same checks over:
These tests should be implemented as deployments with anti-affinity so they land on different nodes. Deployments are used so that the nodes can be drained properly. The deployments write to the host disk, and on restart the pod picks up the existing data. When a special configmap is created, the pod stops collecting disruption data.
The external part of the test will create the deployments (and necessary RBAC objects) when the test is started, create the stop configmap when it ends, and collect data from the nodes. The test will expose the results on the intervals chart, so that the data can be used to find the source of disruption.
Description of problem:
[vmware-vsphere-csi-driver-operator] driver controller/node/webhook update events repeat pathologically
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-03-161006
How reproducible:
Always
Steps to Reproduce:
1. Install an OpenShift cluster on vSphere with a 4.17 nightly build.
2. Upgrade the cluster to a 4.18 nightly build.
3. Check that the driver controller/node/webhook update events do not repeat pathologically.
CI failure record -> https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-vsphere-ovn-upgrade/1854191939318976512
Actual results:
In step 3: the driver controller/node/webhook update events repeat pathologically
Expected results:
In step 3: the driver controller/node/webhook update events should not repeat pathologically
Additional info:
Description of problem:
Setting userTags in the install-config file for AWS does not support all AWS-valid characters as per [1].

  platform:
    aws:
      region: us-east-1
      propagateUserTags: true
      userTags:
        key1: "Test Space"
        key2: value2

  ERROR failed to fetch Metadata: failed to load asset "Install Config": failed to create install config: invalid "install-config.yaml" file: platform.aws.userTags[key1]: Invalid value: "Test Space": value contains invalid characters

The documentation at https://docs.openshift.com/container-platform/4.16/installing/installing_aws/installation-config-parameters-aws.html#installation-configuration-parameters-optional-aws_installation-config-parameters-aws does not mention any restrictions. However, validation is done here: https://github.com/openshift/installer/blob/74ee94f2a34555a41107a5a7da627ab5de0c7373/pkg/types/aws/validation/platform.go#L106 which in turn refers to a regex here: https://github.com/openshift/installer/blob/74ee94f2a34555a41107a5a7da627ab5de0c7373/pkg/types/aws/validation/platform.go#L17 which allows these characters: `^[0-9A-Za-z_.:/=+-@]*$`

[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html#tag-restrictions
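A quick self-contained check against the regex quoted above shows why the value with a space is rejected even though AWS allows spaces in tag values:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Regex copied from the installer validation linked above.
	valueRegex := regexp.MustCompile(`^[0-9A-Za-z_.:/=+-@]*$`)

	for _, v := range []string{"value2", "Test Space"} {
		fmt.Printf("%-12q allowed by installer regex: %v\n", v, valueRegex.MatchString(v))
	}
	// "value2" matches, "Test Space" does not because the space character is
	// outside the allowed class.
}
```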
Version-Release number of selected component (if applicable):
How reproducible:
100 %
Steps to Reproduce:
1. Create an install-config with a userTags value as mentioned in the description.
2. Run the installer.
Actual results:
Command failed with below error: ERROR failed to fetch Metadata: failed to load asset "Install Config": failed to create install config: invalid "install-config.yaml" file: platform.aws.userTags[key1]: Invalid value: "Test Space": value contains invalid characters
Expected results:
Installer should run successfully.
Additional info:
When a userTags value contains a space, the installer fails to validate the install-config.
The ovnkube-node pods are crash looping with:
1010 23:12:06.421980 6605 ovnkube.go:137] failed to run ovnkube: [failed to initialize libovsdb NB client: failed to connect to unix:/var/run/ovn/ovnnb_db.sock: database OVN_Northbound validation error (8):
  database model contains a model for table Sample that does not exist in schema.
  database model contains a model for table Sampling_App that does not exist in schema.
  Mapper Error. Object type nbdb.ACL contains field SampleEst (*string) ovs tag sample_est: Column does not exist in schema.
  Mapper Error. Object type nbdb.NAT contains field Match (string) ovs tag match: Column does not exist in schema.
  database model contains a model for table Sample_Collector that does not exist in schema.
  Mapper Error. Object type nbdb.LogicalRouterPort contains field DhcpRelay (*string) ovs tag dhcp_relay: Column does not exist in schema.
  database model contains a model for table DHCP_Relay that does not exist in schema.
  database model contains a client index for table ACL that does not exist in schema,
failed to start node network controller: error in syncing cache for *v1.Pod informer]
The OVN builds for CS9 are old and have not been built with the latest changes. The team is working to build the RPMs, and once we have them, we need builds of ovn-kubernetes with the latest OVN RPMs to fix this issue.
Description of problem:
There are lots of customers that deploy clusters that are not directly connected to the Internet, so they use a corporate proxy. Customers have been unable to correctly understand how to configure the cluster-wide proxy for a new HostedCluster, and they are finding issues deploying the HostedCluster. For example, given the following configuration:

  apiVersion: hypershift.openshift.io/v1beta1
  kind: HostedCluster
  metadata:
    creationTimestamp: null
    name: cluster-hcp
    namespace: clusters
  spec:
    configuration:
      proxy:
        httpProxy: http://proxy.testlab.local:80
        httpsProxy: http://proxy.testlab.local:80
        noProxy: testlab.local,192.168.0.0/16

A customer would normally add the MachineNetwork CIDR and local domain to the noProxy variable. However, this causes a problem in OpenShift Virtualization: the Hosted Control Plane KAS won't be able to contact the nodes' kubelets, since pods will try to reach tcp/10250 through the proxy, causing an error. So in this scenario, the hub cluster ClusterNetwork CIDR also needs to be added to the noProxy variable:

  noProxy: testlab.local,192.168.0.0/16,10.128.0.0/14

However, I was unable to find this information in our documentation. Also, there is a known issue that is explained in the following KCS: https://access.redhat.com/solutions/7068827

The problem is that the HostedCluster deploys the control-plane-operator binary instead of the haproxy binary in the kube-apiserver-proxy pods under kube-system in the HostedCluster. The KCS explains that the problem is fixed, but it is not clear for the customer which subnetwork should be added to noProxy to trigger the logic that deploys the haproxy image, so that the proxy is not used to expose the kubernetes internal endpoint (172.20.0.1). The code seems to check whether the HostedCluster ClusterNetwork (10.132.0.0/14), the ServiceNetwork (172.31.0.0/16), or the internal kubernetes address (172.20.0.1) is listed in noProxy, in order to honor the noProxy setting and deploy the haproxy images.

This led us, through trial and error, to find the correct way to honor noProxy and allow the HostedCluster to work correctly: connecting from the kube-apiserver-proxy pods to the hosted KAS, and from the hosted KAS to the kubelets, bypassing the cluster-wide proxy.

The questions are:
1. Is it possible to add information to our documentation about the correct way to configure a HostedCluster using noProxy variables?
2. What is the correct subnet that needs to be added to the noProxy variable so that the haproxy images are deployed instead of the control-plane-operator, allowing the kube-apiserver-proxy pods to bypass the cluster-wide proxy?
Version-Release number of selected component (if applicable):
4.14.z, 4.15.z, 4.16.z
How reproducible:
Deploy a HostedCluster using noProxy variables
Steps to Reproduce:
1. 2. 3.
Actual results:
Components of the HostedCluster are still using the proxy, not honoring the noProxy values set.
Expected results:
Hosted Cluster should be able to deploy correctly.
Additional info:
There have been several instances where assisted would start downloading ClusterImageSet images and it could cause issues like
Possible solution ideas:
As mentioned in the previous review when this was added (https://github.com/openshift/assisted-service/pull/4650/files#r1044872735), the "late binding usecase would be broken for OKD", so to prevent this we should detect whether the infra-env is late-bound and not check for the image if it is.
The only time a requested ClusterImageSet is cached is when a Cluster is created.
This leads to problems such as
Installs of recent nightly/stable 4.16 SCOS releases are branded as OpenShift instead of OKD.
Testing on the following versions shows incorrect branding on oauth URL
4.16.0-0.okd-scos-2024-08-15-225023
4.16.0-0.okd-scos-2024-08-20-110455
4.16.0-0.okd-scos-2024-08-21-155613
Description of problem:
Bootstrap process failed because API_URL and API_INT_URL are not resolvable:

  Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
  Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
  Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Consumed 1min 457ms CPU time.
  Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 1.
  Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: Stopped Bootstrap a Kubernetes cluster.
  Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Consumed 1min 457ms CPU time.
  Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster.
  Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Check if API and API-Int URLs are resolvable during bootstrap
  Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Checking if api.yunjiang-dn16d.qe.gcp.devcluster.openshift.com of type API_URL is resolvable
  Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting stage resolve-api-url
  Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Unable to resolve API_URL api.yunjiang-dn16d.qe.gcp.devcluster.openshift.com
  Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Checking if api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com of type API_INT_URL is resolvable
  Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting stage resolve-api-int-url
  Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Unable to resolve API_INT_URL api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com
  Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8905]: https://localhost:2379 is healthy: successfully committed proposal: took = 7.880477ms
  Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting cluster-bootstrap...
  Feb 06 06:41:59 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: Starting temporary bootstrap control plane...
  Feb 06 06:41:59 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: Waiting up to 20m0s for the Kubernetes API
  Feb 06 06:42:00 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: API is up

install logs:

  ...
  time="2024-02-06T06:54:28Z" level=debug msg="Unable to connect to the server: dial tcp: lookup api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com on 169.254.169.254:53: no such host"
  time="2024-02-06T06:54:28Z" level=debug msg="Log bundle written to /var/home/core/log-bundle-20240206065419.tar.gz"
  time="2024-02-06T06:54:29Z" level=error msg="Bootstrap failed to complete: timed out waiting for the condition"
  time="2024-02-06T06:54:29Z" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
  ...
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-02-05-184957,openshift/machine-config-operator#4165
How reproducible:
Always.
Steps to Reproduce:
1. Enable custom DNS on GCP: platform.gcp.userProvisionedDNS: Enabled and featureSet: TechPreviewNoUpgrade
2. Create cluster
3.
Actual results:
Failed to complete bootstrap process.
Expected results:
See description.
Additional info:
I believe 4.15 is affected as well once https://github.com/openshift/machine-config-operator/pull/4165 is backported to 4.15; currently, it fails at an earlier phase, see https://issues.redhat.com/browse/OCPBUGS-28969
This duplicate issue was created because openshift/console github bots require a valid CONSOLE Jira to be associated with all PRs.
Description
Migrate Developer View > Observe > Silences Tab code from openshift/console to openshift/monitoring-plugin. This is part of the ongoing effort to consolidate code between the Administrative and Developer Views of the Observe section.
Related Jira Issue
https://issues.redhat.com/browse/OU-257
Related PRs
As a HyperShift service provider, I want to be able to:
so that I can achieve
Description of criteria:
The cluster-baremetal-operator sets up a number of watches for resources using Owns(), which have no effect because the Provisioning CR does not (and should not) own any resources of the given type, or using EnqueueRequestForObject{}, which similarly has no effect because the resource name and namespace differ from those of the Provisioning CR.
The commit https://github.com/openshift/cluster-baremetal-operator/pull/351/commits/d4e709bbfbae6d316f2e76bec18b0e10a45ac93e should be reverted as it adds considerable complexity to no effect whatsoever.
The correct way to trigger a reconcile of the provisioning CR is using EnqueueRequestsFromMapFunc(watchOCPConfigPullSecret) (note that the map function watchOCPConfigPullSecret() is poorly named - it always returns the name/namespace of the Provisioning CR singleton, regardless of the input, which is what we want). We should replace the ClusterOperator, Proxy, and Machine watches with ones of this form.
See https://github.com/openshift/cluster-baremetal-operator/pull/423/files#r1628777876 and https://github.com/openshift/cluster-baremetal-operator/pull/351/commits/d4e709bbfbae6d316f2e76bec18b0e10a45ac93e#r1628776168 for commentary.
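A hedged sketch of the suggested wiring (type names, import paths, the singleton name, and the exact builder signature depend on the controller-runtime version vendored by CBO, so treat this as illustrative rather than the operator's actual code):

```go
package controllers

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	machinev1beta1 "github.com/openshift/api/machine/v1beta1"
	metal3iov1alpha1 "github.com/openshift/cluster-baremetal-operator/apis/metal3.io/v1alpha1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// provisioningMapFunc maps every watched object to the Provisioning singleton
// so that a change to ClusterOperator/Proxy/Machine re-triggers its reconcile.
func provisioningMapFunc(ctx context.Context, _ client.Object) []reconcile.Request {
	return []reconcile.Request{{
		NamespacedName: types.NamespacedName{Name: "provisioning-configuration"},
	}}
}

func setupWatches(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&metal3iov1alpha1.Provisioning{}).
		Watches(&configv1.ClusterOperator{}, handler.EnqueueRequestsFromMapFunc(provisioningMapFunc)).
		Watches(&configv1.Proxy{}, handler.EnqueueRequestsFromMapFunc(provisioningMapFunc)).
		Watches(&machinev1beta1.Machine{}, handler.EnqueueRequestsFromMapFunc(provisioningMapFunc)).
		Complete(r)
}
```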
Description of problem:
This bug is filed as a result of https://access.redhat.com/support/cases/#/case/03977446. Although both node topologies are equivalent, the PPC reported a false negative:

  Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU corres differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]
Version-Release number of selected component (if applicable):
4.14
How reproducible:
always
Steps to Reproduce:
1.TBD 2. 3.
Actual results:
Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU corres differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]
Expected results:
The topologies match, so the PPC should work fine.
Additional info:
Description of problem:
The tests below fail on an ipv6primary dualstack cluster because the router deployed by the tests is not prepared for dualstack:
[sig-network][Feature:Router][apigroup:image.openshift.io] The HAProxy router should serve a route that points to two services and respect weights [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[sig-network][Feature:Router][apigroup:operator.openshift.io] The HAProxy router should respond with 503 to unrecognized hosts [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[sig-network][Feature:Router][apigroup:operator.openshift.io] The HAProxy router should serve routes that were created from an ingress [apigroup:route.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[sig-network][Feature:Router][apigroup:route.openshift.io][apigroup:operator.openshift.io] The HAProxy router should support reencrypt to services backed by a serving certificate automatically [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router should override the route host for overridden domains with a custom value [apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router should override the route host with a custom value [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router should run even if it has no access to update status [apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router should serve the correct routes when scoped to a single namespace and label set [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[sig-network][Feature:Router][apigroup:route.openshift.io] when FIPS is disabled the HAProxy router should serve routes when configured with a 1024-bit RSA key [Feature:Networking-IPv4] [Suite:openshift/conformance/parallel]
[sig-network][Feature:Router] The HAProxy router should expose prometheus metrics for a route [apigroup:route.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
That is confirmed by accessing the router pod and checking the connectivity locally:
sh-4.4$ curl -k -s -m 5 -o /dev/null -w '%{http_code}\n' --header 'Host: FIRST.example.com' "http://127.0.0.1/Letter"
200
sh-4.4$ echo $?
0

sh-4.4$ curl -k -s -m 5 -o /dev/null -w '%{http_code}\n' --header 'Host: FIRST.example.com' "http://fd01:0:0:5::551/Letter"
000
sh-4.4$ echo $?
3

sh-4.4$ curl -k -s -m 5 -o /dev/null -w '%{http_code}\n' --header 'Host: FIRST.example.com' "http://[fd01:0:0:5::551]/Letter"
000
sh-4.4$ echo $?
7
The default router deployed in the cluster supports dualstack. Hence it is possible, and required, to update the router image configuration used in the tests so it can answer on both IPv4 and IPv6.
Version-Release number of selected component (if applicable): https://github.com/openshift/origin/tree/release-4.15/test/extended/router/
How reproducible: Always.
Steps to Reproduce: Run the tests in ipv6primary dualstack cluster.
Actual results: Tests failing as below:
<*errors.errorString | 0xc001eec080>:
last response from server was not 200:
{
s: "last response from server was not 200:\n",
}
occurred
Ginkgo exit error 1: exit with code 1
Expected results: Test passing
Looking at the logs for ironic-python-agent in a preprovisioning image, we get each log message twice - once directly from the agent process and once from podman:
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.834 1 DEBUG ironic_python_agent.ironic_api_client [-] Heartbeat: announcing callback URL https://10.9.53.20:9999, API version is 1.68 heartbeat /usr/lib/python3.9/site-packages/ironic_python_agent/ironic_api_client.py:186
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.834 1 DEBUG ironic_python_agent.ironic_api_client [-] Heartbeat: announcing callback URL https://10.9.53.20:9999, API version is 1.68 heartbeat /usr/lib/python3.9/site-packages/ironic_python_agent/ironic_api_client.py:186
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent [-] error sending heartbeat to ['https://10.9.49.125:6385']: ironic_python_agent.errors.HeartbeatError: Error heartbeating to agent API: Error 404: Node 6f7546a2-f49e-4d8d-88f6-a462d53868b6 could not be found.
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent Traceback (most recent call last):
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 148, in do_heartbeat
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent self.api.heartbeat(
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent File "/usr/lib/python3.9/site-packages/ironic_python_agent/ironic_api_client.py", line 200, in heartbeat
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent raise errors.HeartbeatError(error)
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent ironic_python_agent.errors.HeartbeatError: Error heartbeating to agent API: Error 404: Node 6f7546a2-f49e-4d8d-88f6-a462d53868b6 could not be found.
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.867 1 INFO ironic_python_agent.agent [-] sleeping before next heartbeat, interval: 5.029721378959369
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent [-] error sending heartbeat to ['https://10.9.49.125:6385']: ironic_python_agent.errors.HeartbeatError: Error heartbeating to agent API: Error 404: Node 6f7546a2-f49e-4d8d-88f6-a462d53868b6 could not be found.
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent Traceback (most recent call last):
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 148, in do_heartbeat
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent self.api.heartbeat(
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent File "/usr/lib/python3.9/site-packages/ironic_python_agent/ironic_api_client.py", line 200, in heartbeat
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent raise errors.HeartbeatError(error)
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent ironic_python_agent.errors.HeartbeatError: Error heartbeating to agent API: Error 404: Node 6f7546a2-f49e-4d8d-88f6-a462d53868b6 could not be found.
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.867 1 INFO ironic_python_agent.agent [-] sleeping before next heartbeat, interval: 5.029721378959369
This is confusing and unnecessary, especially as the two sets of logs can be interleaved (note also the non-monotonic timestamps in the third column).
The log above actually comes from a ZTP deployment (the one in OCPBUGS-44026), but the IPA configuration even for that ultimately comes from the image-customization-controller.
Currently there is no log driver flag passed to podman so we get the default, which is journald. We have the use_stderr option set in the IPA config so logs get written to stderr, which podman will send to journald. We are also running the podman pod in the foreground, which I think will cause it to also pass the stderr to systemd, which also sends it to the journal.
I believe another side-effect of this misconfiguration is that one lot of logs show up red in journalctl and the other don't. Ideally we would have colour-coded logs by severity. This can be turned on for the stderr logging by setting log_color in the IPA config, but it might be better to enable use-journal instead of use-stderr so we get native logging to journald.
Duplication of Issue from OU Jira board
Duplicate of https://issues.redhat.com/browse/OU-259
The openshift/console CI needs a valid issue on the OpenShift Console Jira board.
Overview
This PR aims to consolidate code for the Developer perspective > Observe > Dashboard page. We will remove the code that renders this page from openshift/console. The page will now be rendered by openshift/monitoring-plugin through this PR: openshift/monitoring-plugin#167.
Testing
Must be tested with PR: openshift/monitoring-plugin#167
This PR #14192 removes the Developer perspective > Observe > Dashboard page
This PR openshift/monitoring-plugin#167 adds the Developer perspective > Observe > Dashboard page
Expected Results: All behaviors should be the same as before the migration.
The goal is to collect metrics about some features used by the Tempo operator, because this will be useful for improving the product: knowing which features customers use, we can better focus our efforts on improving those features.
tempo_operator_tempostack_multi_tenancy gauge that represents the number of TempoStack instances that use multi-tenancy
Labels
tempo_operator_tempostack_managed gauge that represents the number of TempoStack instances that are managed/unmanaged
Labels
tempo_operator_tempostack_jaeger_ui gauge that represents the number of TempoStack instances with the Jaeger UI enabled/disabled
Labels
tempo_operator_tempostack_storage_backend gauge that represents the number of TempoStack instances that use a certain storage type
Labels
Description of problem:
Customer wants to boot a VM using the Assisted Installer ISO. The customer originally installed the OpenShift Container Platform cluster using version 4.13; however, in the meantime the cluster was updated to 4.16. As a result, the customer updated the field "osImageVersion" to "4.16.11". This led to the new ISO being generated as expected. However, when reviewing the "status" of the InfraEnv, they can still see the following URL:

  isoDownloadURL: 'https://assisted-image-service-multicluster-engine.cluster.example.com/byapikey/<REDACTED>/4.13/x86_64/minimal.iso'

Other artifacts are also still showing "?version=4.13":

  kernel: 'https://assisted-image-service-multicluster-engine.cluster.example.com/boot-artifacts/kernel?arch=x86_64&version=4.13'
  rootfs: 'https://assisted-image-service-multicluster-engine.cluster.example.com/boot-artifacts/rootfs?arch=x86_64&version=4.13'

The workaround of downloading the ISO with the version replaced in the URL works as expected.
Version-Release number of selected component (if applicable):
RHACM 2.10 OpenShift Container Platform 4.16.11
How reproducible:
Always at customer side
Steps to Reproduce:
1. Create a cluster with an InfraEnv with the "osImageVersion" set to 4.14 (or 4.13)
2. Update the cluster to the next OpenShift Container Platform version
3. Update the InfraEnv "osImageVersion" field with the new version (you may need to create the ClusterImageSet)
Actual results:
URLs in the "status" of the InfraEnv are not updated with the new version
Expected results:
URLs in the "status" of the InfraEnv are updated with the new version
Additional info:
* Discussion in Slack: https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1726483803662359
Description of problem:
Using the 4.16.0/4.17.0 UI with ODF StorageClasses, it is not possible to:
- create an RWOP PVC
- create an RWOP clone
- restore to an RWOP PVC
Please see the attached screenshot. The RWOP access mode should be added to all the relevant screens in the UI.
Version-Release number of selected component (if applicable):
OCP 4.16.0 & 4.17.0 ODF (OpenShift Data Foundation) 4.16.0 & 4.17.0
How reproducible:
Steps to Reproduce:
1. Open UI, go to OperatorHub
2. Install ODF, once installed refresh for ConsolePlugin to get populated
3. Go to operand "StorageSystem" and create the CR using the custom UI (you can just keep on clicking "Next" with the default selected options, it will work well on AWS cluster)
5. Wait for "ocs-storagecluster-cephfs" and "ocs-storagecluster-ceph-rbd" StorageClasses to get created by ODF operator
6. Go to PVC creation page, try to create new PVC (using StorageClasses mentioned in step 5)
7. Try to create clone
8. Try to restore PVC to RWOP pvc from existing snapshot
Actual results:
It's not possible to create an RWOP PVC, to create an RWOP clone, or to restore to an RWOP PVC from a snapshot using the 4.16.0 & 4.17.0 UI.
Expected result:
It should be possible to create an RWOP PVC, to create an RWOP clone, and to restore from a snapshot to an RWOP PVC.
Additional info:
https://github.com/openshift/console/blob/master/frontend/public/components/storage/shared.ts#L111-L119 >> these need to be updated
Description of problem:
Spun out of https://issues.redhat.com/browse/OCPBUGS-38121, we noticed that there were logged requests against a non-existent certificatesigningrequests.v1beta1.certificates.k8s.io API in 4.17. These requests should not be logged if the API doesn't exist. See also slack discussion https://redhat-internal.slack.com/archives/C01CQA76KMX/p1724854657518169
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of the problem:
Orange complains about the following log errors, though they do not cause actual problems:
How reproducible:
every time agentinstalladmission starts
Steps to reproduce:
1. in a k8s cluster install infra operator
2. install the AgentServiceConfig CR:
kind: AgentServiceConfig
metadata:
  name: agent
spec:
  ingress:
    className: nginx
    assistedServiceHostname: assisted-service.example.com
    imageServiceHostname: image-service.example.com
  databaseStorage:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 8Gi
  filesystemStorage:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 8Gi
  imageStorage:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
  osImages: [{"openshiftVersion":"4.17.0","cpuArchitecture":"x86_64","url":"https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/pre-release/4.17.0-ec.3/rhcos-4.17.0-ec.3-x86_64-live.x86_64.iso","version":"4.17.0"}]
3. check the agentinstalladmission container log
Actual results:
some errors show up in the log
Expected results:
The log should be free of errors