Back to index

4.18.0-ec.4

Jump to: Complete Features | Incomplete Features | Complete Epics | Incomplete Epics | Other Complete | Other Incomplete |

Changes from 4.17.10

Note: this page shows the Feature-Based Change Log for a release

Complete Features

These features were completed when this image was assembled

Feature Overview (aka. Goal Summary)

Today we expose two main APIs for HyperShift, namely `HostedCluster` and `NodePool`. We also have metrics to gauge adoption by reporting the # of hosted clusters and nodepools.

But we are still missing other metrics needed to make correct inferences about what we see in the data.

Goals (aka. expected user outcomes)

  • Provide Metrics to highlight # of Nodes per NodePool or # of Nodes per cluster
  • Make sure the error between what appears in CMO via `install_type` and what we report as # Hosted Clusters is minimal.

Use Cases (Optional):

  • Understand product adoption
  • Gauge Health of deployments
  • ...

 

Overview

Today we have hypershift_hostedcluster_nodepools as a metric exposed to provide information on the # of nodepools used per cluster. 

 

Additional NodePools metrics such as hypershift_nodepools_size and hypershift_nodepools_available_replicas are available but not ingested in Telemetry.

In addition to knowing how many nodepools exist per hosted cluster, we would like to expose the nodepool size.

 

This will help inform our decision making and provide some insights on how the product is being adopted/used.

Goals

The main goal of this epic is to show the following NodePools metrics on Telemeter, ideally as recording rules: 

  • hypershift_nodepools_size
  • hypershift_nodepools_available_replicas
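
For illustration, the raw metrics can already be queried on a management cluster before any recording rules land. A minimal sketch, assuming the metrics are scraped by the in-cluster monitoring stack; the aggregation labels are illustrative:

$ TOKEN=$(oc whoami -t)
$ HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
$ # Sum available NodePool replicas per hosted cluster (label names may differ).
$ curl -skG "https://$HOST/api/v1/query" \
    -H "Authorization: Bearer $TOKEN" \
    --data-urlencode 'query=sum by (namespace, name) (hypershift_nodepools_available_replicas)'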

Requirements

The implementation involves updates to the following GitHub repositories; see these similar PRs for reference:

https://github.com/openshift/hypershift/pull/1544
https://github.com/openshift/cluster-monitoring-operator/pull/1710

Feature Overview (aka. Goal Summary)

This feature is about providing workloads within an HCP KubeVirt cluster access to GPU devices. This is an important use case that expands usage of HCP KubeVirt to AI and ML workloads.

Goals (aka. expected user outcomes)

  • Users can assign GPUs to HCP KubeVirt worker nodes using NodePool API

Requirements (aka. Acceptance Criteria):

  • Expose the ability to assign GPUs to KubeVirt NodePools
  • Ensure NVIDIA supports the NVIDIA GPU operator on HCP KubeVirt
  • Document usage and support of NVIDIA GPUs with HCP KubeVirt
  • CI environment and tests to verify that GPU assignment to KubeVirt NodePools functions

 

 

GOAL:

Support running workloads within HCP KubeVirt clusters which need access to GPUs.

Accomplishing this involves multiple efforts

  • The NodePool API must be expanded to allow assignment of GPUs to the KubeVirt worker node VMs (see the sketch after this list).
  • Ensure the NVIDIA GPU operator works within the HCP cluster for GPUs passed through to KubeVirt VMs.
  • Develop a CI environment which allows us to exercise GPU passthrough.
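
A hedged sketch of what GPU assignment through the NodePool API could look like; the hostDevices stanza, device name, and namespaces are assumptions and depend on the final API shape shipped for this epic:

$ oc apply -f - <<EOF
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: example-gpu        # illustrative names
  namespace: clusters
spec:
  clusterName: example
  replicas: 2
  platform:
    type: KubeVirt
    kubevirt:
      # Hypothetical GPU passthrough stanza; verify against the released NodePool API.
      hostDevices:
      - deviceName: nvidia.com/GV100GL_Tesla_V100
        count: 1
EOF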

Diagram of multiple nvidia operator layers

https://docs.google.com/document/d/1HwXVL_r9tUUwqDct8pl7Zz4bhSRBidwvWX54xqXaBwk/edit 

This will be covered by HCP doc team.

We start by contributing the documentation upstream to the hypershift repo, which is published at https://hypershift-docs.netlify.app/. Then we create a task for the ACM docs team to port those changes to the official documentation; they use our content as a seed for the official documentation content. (Contact points are Laura Hinson, currently on parental leave, and Servesha Dudhgaonkar.)

Feature Overview (aka. Goal Summary)  

Graduate the new PV access mode ReadWriteOncePod to GA.

Such a PV/PVC can be used only by a single pod on a single node, compared to the traditional ReadWriteOnce access mode, where a PV/PVC can be used on a single node by many pods.

Goals (aka. expected user outcomes)

The customers can start using the new ReadWriteOncePod access mode.

This new mode allows customers to provision and attach a PV with the guarantee that it cannot be used by another pod on the same node.
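
A minimal example of the new access mode in a PVC (name, storage class defaulting, and size are illustrative):

$ oc apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwop-pvc
spec:
  accessModes:
  - ReadWriteOncePod   # only a single pod on a single node may use the volume
  resources:
    requests:
      storage: 10Gi
EOF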

 

Requirements (aka. Acceptance Criteria):

This new mode should support the same operations as regular ReadWriteOnce PVs, so it should pass the regression tests. We should also ensure that such a PV can't be accessed by another pod on the same node.

 

Use Cases (Optional):

As a user I want to attach a PV to a pod and ensure that it can't be accessed by another local pod.

Background

We are getting this feature from upstream as GA. We need to test it and fully support it.

Customer Considerations

 

Check that there are no limitations or regressions.

Documentation Considerations

Remove tech preview warning. No additional change.

 

Interoperability Considerations

N/A

Epic Goal

Support the upstream feature "New RWO access mode" in OCP as GA, i.e. test it and have docs for it.

This is a continuation of STOR-1171 (Beta/Tech Preview in 4.14); now we just need to mark it as GA and remove all Tech Preview notes from the docs.

Why is this important?

  • We get this upstream feature through Kubernetes rebase. We should ensure it works well in OCP and we have docs for it.

Upstream links

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. External: the feature is currently scheduled for GA in Kubernetes 1.29, i.e. OCP 4.16, but it may change before Kubernetes 1.29 GA.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)

 

Currently the maximum number of snapshots per volume in vSphere CSI is set to 3 and cannot be configured. Customers find this default limit too low and are asking us to make this setting configurable.

The maximum number of snapshots is 32 per volume.

Goals (aka. expected user outcomes)

Customers can override the default (three) value and set it to a custom value.

Make sure we document (or link to) the VMware recommendations in terms of performance.

https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-E0B41C69-7EEB-450F-A73D-5FD2FF39E891.html#GUID-7BA0CDAE-E031-470E-A685-60C82DAE36D2__GUID-D9A97A90-2777-46EA-94EB-F04A27FBB76D

 

https://kb.vmware.com/s/article/1025279

Requirements (aka. Acceptance Criteria):

The setting can be easily configured by the OCP admin, and the configuration is automatically applied. Test that the setting is indeed applied and that the maximum number of snapshots per volume is indeed changed.

No change to the default.

Use Cases (Optional):

As an OCP admin I would like to change the maximum number of snapshots per volume.

Out of Scope

Anything outside of 

https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-E0B41C69-7EEB-450F-A73D-5FD2FF39E891.html#GUID-7BA0CDAE-E031-470E-A685-60C82DAE36D2__GUID-D9A97A90-2777-46EA-94EB-F04A27FBB76D

Background

The default value can't be overridden; reconciliation prevents it.

Customer Considerations

Make sure the customers understand the impact of increasing the number of snapshots per volume.

https://kb.vmware.com/s/article/1025279

Documentation Considerations

Document how to change the value as well as a link to the best practice. Mention that there is a 32 hard limit. Document other limitations if any.

Interoperability Considerations

N/A

Epic Goal*

The goal of this epic is to allow admins to configure the maximum number of snapshots per volume in vSphere CSI, and to find a way to add such an extension to the OCP API.

Possible future candidates:

  • configure EFS volume size monitoring (via a driver command-line argument) - STOR-1422
  • configure OpenStack topology - RFE-11

 
Why is this important? (mandatory)

Currently the maximum number of snapshots per volume in vSphere CSI is set to 3 and cannot be configured. Customers find this default limit too low and are asking us to make this setting configurable.

The maximum number of snapshots is 32 per volume.

https://kb.vmware.com/s/article/1025279

https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-E0B41C69-7EEB-450F-A73D-5FD2FF39E891.html#GUID-7BA0CDAE-E031-470E-A685-60C82DAE36D2__GUID-D9A97A90-2777-46EA-94EB-F04A27FBB76D

 

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. As an admin I would like to configure the maximum number of snapshots per volume.
  2. As a user I would like to create more than 3 snapshots per volume

 
Dependencies (internal and external) (mandatory)

1) Write OpenShift enhancement (STOR-1759)

2) Extend ClusterCSIDriver API (TechPreview) (STOR-1803)

3) Update vSphere operator to use the new snapshot options (STOR-1804)

4) Promote feature from Tech Preview to Accessible-by-default (STOR-1839)

  • prerequisite: add e2e test and demonstrate stability in CI (STOR-1838)

 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - STOR
  • Documentation - STOR
  • QE - STOR
  • PX - Enablement
  • Others -

Acceptance Criteria (optional)

Configure the maximum number of snapshots to a higher value. Check that the config has been updated, and verify that the maximum number of snapshots per volume matches the new setting (see the sketch below).
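
A hedged sketch of applying and verifying the override through the ClusterCSIDriver API; the field names follow the enhancement and should be verified against the shipped API:

$ oc patch clustercsidriver csi.vsphere.vmware.com --type=merge -p \
    '{"spec":{"driverConfig":{"driverType":"vSphere","vSphere":{"globalMaxSnapshotsPerBlockVolume":10}}}}'
$ # Read the setting back to confirm the config was accepted.
$ oc get clustercsidriver csi.vsphere.vmware.com -o jsonpath='{.spec.driverConfig.vSphere}'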

Drawbacks or Risk (optional)

Setting this option to a high value can introduce performance issues. This needs to be documented.

https://kb.vmware.com/s/article/1025279

 

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Feature Overview (aka. Goal Summary)

Support the SMB CSI driver through an OLM operator as tech preview. The SMB CSI driver allows OCP to consume SMB/CIFS storage with a dynamic CSI driver. This enables customers to leverage their existing storage infrastructure with either a Samba or Microsoft environment.

https://github.com/kubernetes-csi/csi-driver-smb

Goals (aka. expected user outcomes)

Customers can start testing connecting OCP to their backends exposing CIFS. This allows them to consume net-new volumes or existing data produced outside OCP.

Requirements (aka. Acceptance Criteria):

The driver already exists and is under the storage SIG umbrella. We need to make sure the driver meets OCP quality requirements and, if so, develop an operator to deploy and maintain it.

Review and clearly define all driver limitations and corner cases.

Use Cases (Optional):

  • As an OCP admin, I want OCP to consume storage exposed via SMB/CIFS to capitalise on my existing infrastructure.
  • As a user, I want to consume external data stored on an SMB/CIFS backend.

Questions to Answer (Optional):

Review the different authentication methods.

Out of Scope

Windows containers support.

Only the storage class login/password authentication method is in scope. Other methods can be reviewed and considered for GA.

Background

Customers expect to consume storage, and possibly existing data, via SMB/CIFS. As of today, vendors' driver support for CIFS is very limited, whereas this protocol is widely used on premises, especially by MS/AD customers.

Customer Considerations

Need to understand what customers expect in terms of authentication.

How to extend this feature to Windows containers.

Documentation Considerations

Document the operator and driver installation, usage capabilities and limitations.

Interoperability Considerations

Future: how to manage interoperability with Windows containers (not for TP).

Feature Overview (aka. Goal Summary)  

Graduate the SMB CSI driver and its operator to GA

Goals (aka. expected user outcomes)

The Goal is to write an operator to deploy and maintain the SMB CSI driver

https://github.com/kubernetes-csi/csi-driver-smb

 

  1. Provide a day 2 OLM based operator that deploys the SMB CSI driver.
  2. Ensure the driver passes all CSI related tests.
  3. Identify all upstream capabilities and limitations. Define what we will support at GA.

Authentication will be limited to a secret referenced in the storage class: NTLM-style authentication only, with no Kerberos support until it is officially supported and documented. This limits the CSI driver to non-FIPS environments.

We're also excluding support for DFS (Distributed File System) at GA; we will look at possible support in a future OCP release.

 

 

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

Operator and driver meets the GA quality criteria. We have a good way to deploy a CIFS backend for CI/Testing.

Identify all upstream capabilities and limitations. Define what we support at GA.

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations (list applicable specific needs; N/A = not applicable):

  • Self-managed, managed, or both: Self-managed
  • Classic (standalone cluster): Yes
  • Hosted control planes: Should work
  • Multi node, Compact (three node), or Single node (SNO), or all: All
  • Connected / Restricted Network: Yes
  • Architectures (e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), IBM Z (s390x)): x86_x64
  • Operator compatibility: OLM
  • Backport needed (list applicable versions): No
  • UI need (e.g. OpenShift Console, dynamic plugin, OCM): No
  • Other (please specify):

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

We have several customer requests to allow pods to access storage shares exposed as SMB/CIFS. This can be because of existing data generated outside OCP, or because the customer's environment already integrates an AD/SMB NAS infrastructure. This is fairly common in on-prem environments.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

How do we automatically deploy an SMB server for automated testing?

What authentication method will we support? - NTLM style only

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

Support of the SMB server itself.

Authentication methods beyond the default one (which references secrets in the StorageClass) and static provisioning; NTLM style only.

No Kerberos support until it is officially supported and documented. This limits the CSI driver to non-FIPS environments.

 

https://github.com/kubernetes-csi/csi-driver-smb/blob/master/deploy/example/e2e_usage.md#option1-storage-class-usage

https://github.com/kubernetes-csi/csi-driver-smb/blob/master/deploy/example/e2e_usage.md#option2-pvpvc-usage
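
A hedged sketch based on the upstream storage-class usage example linked above; the server, share, credentials, and namespace are placeholders, and the parameter names should be checked against the shipped operator documentation:

$ oc create secret generic smbcreds -n openshift-cluster-csi-drivers \
    --from-literal=username='smb-user' --from-literal=password='smb-password'
$ oc apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: smb-csi
provisioner: smb.csi.k8s.io
parameters:
  source: //smb-server.example.com/share
  csi.storage.k8s.io/provisioner-secret-name: smbcreds
  csi.storage.k8s.io/provisioner-secret-namespace: openshift-cluster-csi-drivers
  csi.storage.k8s.io/node-stage-secret-name: smbcreds
  csi.storage.k8s.io/node-stage-secret-namespace: openshift-cluster-csi-drivers
mountOptions:
  - dir_mode=0777
  - file_mode=0777
EOF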

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

The Windows container team can't directly leverage this work at the moment because they can't ship CSI drivers for Windows.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

Customers may want to run this on FIPS-enabled clusters, which requires Kerberos authentication, as NTLM is not FIPS compliant. Unfortunately there is no official OCP Kerberos support today. This will be reassessed when we have it.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Reuse the TP doc and remove the TP warning. Change any delta content between TP and GA. Be explicit about the supported authentication (NTLM, no FIPS) and the supported Samba / Windows versions.

We're also excluding support for DFS (Distributed File System) at GA; we will look at possible support in a future OCP release.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Customers using Windows containers may be interested in this feature.

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

The Azure File CSI driver currently lacks cloning and snapshot restore features. The goal of this feature is to support the cloning feature as Technology Preview. This will help support snapshot restore in a future release.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

As a user I want to easily clone an Azure File volume by creating a new PVC with spec.dataSource referencing the origin volume (see the sketch below).
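
A minimal sketch of the cloning flow; PVC names, storage class, and size are illustrative:

$ oc apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azurefile-clone
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: azurefile-csi
  resources:
    requests:
      storage: 100Gi
  dataSource:
    kind: PersistentVolumeClaim
    name: azurefile-original   # the origin volume being cloned
EOF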

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

This feature only applies to OCP running on Azure / ARO and File CSI.

The usual CSI cloning CI must pass.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations (list applicable specific needs; N/A = not applicable):

  • Self-managed, managed, or both: Both
  • Classic (standalone cluster): Yes
  • Hosted control planes: Yes
  • Multi node, Compact (three node), or Single node (SNO), or all: All, although SNO is rare on Azure
  • Connected / Restricted Network: Both
  • Architectures (e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), IBM Z (s390x)): x86
  • Operator compatibility: Azure File CSI operator
  • Backport needed (list applicable versions): No
  • UI need (e.g. OpenShift Console, dynamic plugin, OCM): No
  • Other (please specify): Ship downstream images built from the forked azcopy

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

Restoring snapshots is out of scope for now.

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Update the CSI capability matrix and any language that mentions that Azure File CSI does not support cloning.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Not impact but benefit Azure / ARO customers.

Epic Goal*

Azure File added support for cloning volumes, which relies on the azcopy command upstream. We need to fork azcopy so we can build and ship downstream images from the forked azcopy. The AWS driver does the same with efs-utils.

Upstream repo: https://github.com/Azure/azure-storage-azcopy

NOTE: using snapshots as a source is currently not supported: https://github.com/kubernetes-sigs/azurefile-csi-driver/blob/7591a06f5f209e4ef780259c1631608b333f2c20/pkg/azurefile/controllerserver.go#L732 

 

Why is this important? (mandatory)

This is required for adding Azure File cloning feature support.

 

Scenarios (mandatory) 

1. As a user I want to easily clone an Azure File volume by creating a new PVC with spec.dataSource referencing the origin volume.

 
Dependencies (internal and external) (mandatory)

1) Write OpenShift enhancement (STOR-1757)

2) Fork upstream repo (STOR-1716)

3) Add ART definition for OCP Component (STOR-1755)

  • prerequisite: Onboard image with DPTP/CI (STOR-1752)
  • prerequisite: Perform a threat model assessment (STOR-1753)
  • prerequisite: Establish common understanding with Product Management / Docs / QE / Product Support (STOR-1753)
  • requirement: ProdSec Review (STOR-1756)

4) Use the new image as base image for Azure File driver (STOR-1794)

5) Ensure e2e cloning tests are in CI (STOR-1818)

 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - yes
  • Documentation - yes
  • QE - yes
  • PX - ???
  • Others - ART

 

Acceptance Criteria (optional)

Downstream Azure File driver image must include azcopy and cloning feature must be tested.

 

Drawbacks or Risk (optional)

No risks detected so far.

 

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be "Release Pending" 

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

  • As an Alibaba Cloud customer, I want to create an OpenShift cluster with the Assisted Installer using the agnostic platform (platform=none) for connected deployments.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

In order to remove IPI/UPI support for Alibaba Cloud in OpenShift (currently Tech Preview, see also OCPSTRAT-1042), we need to provide an alternate method for Alibaba Cloud customers to spin up an OpenShift cluster. To that end, we want customers to use Assisted Installer with platform=none (and later platform=external) to bring up their OpenShift clusters.

  • Stretch goal to do this with platform=external.
  • Note: We can TP with platform=none or platform=external, but for GA it must be with platform=external.
  • Document how to use this installation method

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

  • Hybrid Cloud Console updated to reflect Alibaba Cloud installation with Assisted Installer (Tech Preview).
  • Documentation that tells customer how to use this install method
  • CI for this install method optional for OCP 4.16 (and will be addressed in the future)

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations (list applicable specific needs; N/A = not applicable):

  • Self-managed, managed, or both: Self-managed
  • Classic (standalone cluster): Classic
  • Hosted control planes: N/A
  • Multi node, Compact (three node), or Single node (SNO), or all: Multi-node
  • Connected / Restricted Network: Connected for OCP 4.16 (Future: restricted)
  • Architectures (e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), IBM Z (s390x)): x86_x64
  • Operator compatibility: This should be the same for any operator on platform=none
  • Backport needed (list applicable versions): OpenShift 4.16 onwards
  • UI need (e.g. OpenShift Console, dynamic plugin, OCM): Hybrid Cloud Console changes needed
  • Other (please specify):

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

  • Restricted network deployments, i.e. As an Alibaba Cloud customer, I want to create an OpenShift cluster with the Agent-based Installer using the agnostic platform (platform=none) for restricted network deployments.

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

For OpenShift 4.16, we want to remove IPI support (currently Tech Preview) for Alibaba Cloud (OCPSTRAT-1042). Instead we want to offer the Assisted Installer (Tech Preview) with the agnostic platform for Alibaba Cloud in OpenShift 4.16 (OCPSTRAT-1149).

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Previous UPI-based installation doc: Alibaba Cloud Red Hat OpenShift Container Platform 4.6 Deployment Guide

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

As an Alibaba Cloud customer, I want to create an OpenShift cluster with the Assisted Installer using the agnostic platform (platform=none) for connected deployments.

Epic Goal

  • Start with the original Alibaba Cloud Red Hat OpenShift Container Platform 4.6 Deployment Guide and adjust it to use the Assisted Installer with platform=none.
  • Document the steps for a successful installation using that method and feed the docs team with that information.
  • Narrow down the scope to the minimum viable to achieve Tech Preview in 4.16. We'll handle platform=external and better tools and automation in future releases.
  • Engage with the Assisted Installer team and the Solutions Architect / PTAM of Alibaba for support.
  • Provide frequent updates on work progress (at least weekly).
  • Assist QE in testing.

Why is this important?

  • In order to remove IPI/UPI support for Alibaba Cloud in OpenShift, we need to provide an alternate method for Alibaba Cloud customers to spin up an OpenShift cluster. To that end, we want customers to use Assisted Installer with platform=none (and in future releases platform=external) to bring up their OpenShift clusters.

Acceptance Criteria

  • Reproducible, stable, and documented installation steps using the Assisted Installer with platform=none provided to the docs team and QE.

Out of scope

  1. CI

Previous Work (Optional):

  1. https://www.alibabacloud.com/blog/alibaba-cloud-red-hat-openshift-container-platform-4-6-deployment-guide_597599
  2. https://github.com/kwoodson/terraform-openshift-alibaba for reference, it may help
  3. Alibaba IPI for reference, it may help
  4. Using the Assisted Installer to install a cluster on Oracle Cloud Infrastructure for reference


USER STORY:


As a [type of user], I want [an action] so that [a benefit/a value].

DESCRIPTION:


Required:

...

Nice to have:

...

ACCEPTANCE CRITERIA:


ENGINEERING DETAILS:


Feature Overview (aka. Goal Summary)  

Enable Hosted Control Planes guest clusters to support up to 500 worker nodes. This enables customers to run clusters with a large number of worker nodes.

Goals (aka. expected user outcomes)

Max cluster size of 250+ worker nodes (mainly about the control plane). See XCMSTRAT-371 for additional information.
Service components should not be overwhelmed by additional customer workloads; they should use larger cloud instances when the worker node count exceeds the threshold and smaller cloud instances when it is below it.

Requirements (aka. Acceptance Criteria):

 

Deployment considerations (list applicable specific needs; N/A = not applicable):

  • Self-managed, managed, or both: Managed
  • Classic (standalone cluster): N/A
  • Hosted control planes: Yes
  • Multi node, Compact (three node), or Single node (SNO), or all: N/A
  • Connected / Restricted Network: Connected
  • Architectures (e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), IBM Z (s390x)): x86_64, ARM
  • Operator compatibility: N/A
  • Backport needed (list applicable versions): N/A
  • UI need (e.g. OpenShift Console, dynamic plugin, OCM): N/A
  • Other (please specify):

Questions to Answer (Optional):

Check with OCM and CAPI requirements to expose larger worker node count.

 

Documentation:

  • Design document detailing the autoscaling mechanism and configuration options
  • User documentation explaining how to configure and use the autoscaling feature.

Acceptance Criteria

  • Configure max-node size from CAPI
  • Management cluster nodes automatically scale up and down based on the hosted cluster's size.
  • Scaling occurs without manual intervention.
  • A set of "warm" nodes is maintained for immediate hosted cluster creation.
  • Resizing nodes should not cause significant downtime for the control plane.
  • Scaling operations should be efficient and have minimal impact on cluster performance.

 

Goal

  • Dynamically scale the serving components of control planes

Why is this important?

  • To be able to have clusters with a large number of worker nodes

Scenarios

  1. When a hosted cluster's worker node count increases past a threshold, the serving components are moved to larger cloud instances.
  2. When a hosted cluster's worker node count falls below a threshold, the serving components are moved to smaller cloud instances.

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a service provider, I want to be able to:

  • Configure priority and fairness settings per HostedCluster size and force these settings to be applied on the resulting hosted cluster.

so that I can achieve

  • Prevent users of the hosted cluster from bringing down the HostedCluster kube-apiserver with their workloads.

Acceptance Criteria:

Description of criteria:

  • HostedCluster priority and fairness settings should be configurable per cluster size in the ClusterSizingConfiguration CR
  • Any changes in priority and fairness inside the HostedCluster should be prevented and overridden by whatever is configured on the provider side.
  • With the proper settings, heavy use of the API from user workloads should not result in the KAS pod getting OOMKilled due to lack of resources.

This does not require a design proposal.
This does not require a feature gate.
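
The ClusterSizingConfiguration schema is not spelled out in this story; purely as an illustration of the kind of priority-and-fairness object the provider would enforce inside the hosted cluster, a standard API Priority and Fairness priority level looks like the sketch below (names and shares are illustrative):

$ oc apply -f - <<EOF
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: hosted-cluster-workload   # illustrative name
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 30
    limitResponse:
      type: Queue
      queuing:
        queues: 64
        queueLengthLimit: 50
        handSize: 6
EOF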

Feature Overview (aka. Goal Summary)  

CRIO wipe is an existing feature in OpenShift. When a node reboots, CRIO wipe clears the node of all images so that the node boots clean. When the node comes back up, it needs access to the image registry to pull all images again, which takes time. In telco and edge situations the node might not have access to the image registry and therefore takes longer to come up.

The goal of this feature is to adjust CRIO wipe to wipe only the images that have been corrupted because of a sudden reboot, not all images.

Feature Overview

Phase 2 of the enclave support for oc-mirror with the following goals

  • Incorporate feedback from the field from 4.16 TP
  • Performance improvements

Goals

  • Update the batch processing that uses `containers/image` for copying, so that the number of blobs (layers) to download can be configured
  • Introduce a concurrency worker that can also adjust the number of images to download in parallel to improve overall performance (these values can be tweaked via CLI flags; see the sketch after this list).
  • Collaborate with the UX team to improve the console output while pulling or pushing images.
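
A hedged usage sketch: the mirror-to-mirror invocation below reflects the existing v2 workflow, while the concurrency flag names are placeholders for the tunables described above and may differ in the final CLI:

$ oc mirror --v2 -c imageset-config.yaml \
    --workspace file:///var/lib/oc-mirror/workspace \
    docker://registry.example.com:5000 \
    --parallel-images 8 --parallel-layers 10   # hypothetical flag names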

Feature Overview

Adding nodes to on-prem clusters in OpenShift is in general a complex task. We have numerous methods, and the field keeps adding automation around these methods with a variety of solutions, sometimes unsupported (see "Why is this important" below). Making cluster expansion easier will let users add nodes often and fast, leading to a much improved UX.

This feature adds nodes to any on-prem clusters, regardless of their installation method (UPI, IPI, Assisted, Agent), by booting an ISO image that will add the node to the cluster specified by the user, regardless of how the cluster was installed.

Goals and requirements

  • Users can add a host to an OpenShift cluster on day 2 using a bootable image.
  • At least the baremetal, vSphere, none, and Nutanix platforms are supported
  • Clusters installed with any installation method can be expanded with the image
  • Clusters don't need to run any special agent to allow the new nodes to join.

How this workflow could look

1. Create image:

$ export KUBECONFIG=kubeconfig-of-target-cluster
$ oc adm node-image -o agent.iso --network-data=worker-n.nmstate --role=worker

2. Boot image

3. Check progress

$ oc adm add-node 

Consolidate options

An important goal of this feature is to unify and eliminate some of the existing options to add nodes, aiming to provide a much simpler experience (see "Why is this important" below). We have official and field-documented ways to do this that could be removed once this feature is in place, simplifying the experience, our docs, and the maintenance of said official paths:

  • UPI: Adding RHCOS worker nodes to a user-provisioned infrastructure cluster
    • This feature will replace the need to use this method for the majority of UPI clusters. The current UPI method consists of many manual steps. The new method would replace it with a couple of commands and apply to probably more than 90% of UPI clusters.
  • Field-documented methods and asks
  • IPI:
    • There are instances where adding a node to a bare metal IPI-deployed cluster can't be done via its BMC. This new feature, while not replacing the day-2 IPI workflow, solves the problem for this use case.
  • MCE: Scaling hosts to an infrastructure environment
    • This method is the most time-consuming and in many cases overkill, but currently, along with the UPI method, it is one of the two options we can give to users.
    • We shouldn't need to ask users to install and configure the MCE operator and its infrastructure for single clusters, as it becomes a project even larger than the UPI method; we should save this for when there's more than one cluster to manage.

With this proposed workflow we eliminate the need to use the UPI method in the vast majority of cases. We also eliminate the field-documented methods that keep popping up trying to solve this in multiple formats, as well as the need to recommend MCE to all on-prem users, and finally we add a simpler option for IPI-deployed clusters.

In addition, all the built-in validations in the assisted service would be run, improving the installation success rate and overall UX.

This work would have an initial impact on bare metal, vSphere, Nutanix and platform-agnostic clusters, regardless of how they were installed.

Why is this important

This feature is essential for several reasons. Firstly, it enables easy day2 installation without burdening the user with additional technical knowledge. This simplifies the process of scaling the cluster resources with new nodes, which today is overly complex and presents multiple options (https://docs.openshift.com/container-platform/4.13/post_installation_configuration/cluster-tasks.html#adding-worker-nodes_post-install-cluster-tasks).

Secondly, it establishes a unified experience for expanding clusters, regardless of their installation method. This streamlines the deployment process and enhances user convenience.

Another advantage is the elimination of the requirement to install the Multicluster Engine and Infrastructure Operator, which, besides demanding additional system resources, are overkill for use cases where the user simply wants to add nodes to their existing cluster but isn't managing multiple clusters yet. This results in a more efficient and lightweight cluster scaling experience.

Additionally, in the case of IPI-deployed bare metal clusters, this feature eradicates the need for nodes to have a Baseboard Management Controller (BMC) available, simplifying the expansion of bare metal clusters.

Lastly, this problem is often brought up in the field, where different custom solutions and automations have been put in place by Red Hatters working with customers, adding to inconsistent processes for scaling clusters.

Oracle Cloud Infrastructure

This feature will solve the problem of cluster expansion for OCI. OCI doesn't have MAPI, and CAPI isn't in the mid-term plans. Mitsubishi shared feedback making solving the lack of cluster expansion a requirement for Red Hat and Oracle.

Existing work

We already have the basic technologies to do this with the assisted-service and the agent-based installer, which already do this work for new clusters, and from which we expect to leverage the foundations for this feature.

Day 2 node addition with agent image.

Yet Another Day 2 Node Addition Commands Proposal

Enable day2 add node using agent-install: AGENT-682

 

Epic Goal

  • Cleanup/carryover work from AGENT-682 for the GA release

Why is this important?

  • Address all the required elements for GA, such as FIPS compliance. This will allow a smoother integration of the node-joiner into the oc tool, as planned in OCPSTRAT-784.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. None

Previous Work (Optional):

  1. https://issues.redhat.com/browse/AGENT-682

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Add an integration test to verify that the add-nodes command generates the ISO correctly.

Review the proper usage and download of the envtest-related binaries (api-server and etcd).

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

Enable GCP Workload Identity Webhook

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

Provide the GCP workload identity webhook so a customer can more easily configure their applications to use the service account tokens minted by clusters that use GCP Workload Identity.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations (list applicable specific needs; N/A = not applicable):

  • Self-managed, managed, or both: Both; the scope of this is for self-managed
  • Classic (standalone cluster): Classic
  • Hosted control planes: N/A
  • Multi node, Compact (three node), or Single node (SNO), or all: All
  • Connected / Restricted Network: All
  • Architectures (e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), IBM Z (s390x)): x86_x64
  • Operator compatibility: TBD
  • Backport needed (list applicable versions): N/A
  • UI need (e.g. OpenShift Console, dynamic plugin, OCM): TBD
  • Other (please specify):

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

Just like AWS STS and ARO Entra Workload ID, we want to provide the GCP workload identity webhook so a customer can more easily configure their applications to use the service account tokens minted by clusters that use GCP Workload Identity.

  • For AWS, we deploy the AWS STS pod identity webhook as a customer convenience for configuring their applications to utilize service account tokens minted by a cluster that supports STS. When you create a pod that references a service account, the webhook looks for annotations on that service account and if found, the webhook mutates the deployment in order to set environment variables + mounts the service account token on that deployment so that the pod has everything it needs to make an API client.
  • Our temporary access token (using TAT in place of STS because STS is AWS specific) enablement for (select) third party operators does not rely on the webhook and is instead using CCO to create a secret containing the variables based on the credentials requests. The service account token is also explicitly mounted for those operators. Pod identity webhooks were considered as an alternative to this approach but weren't chosen.
  • Basically, if we deploy this webhook it will be for customer convenience and will enable us to potentially use the Azure pod identity webhook in the future if we so chose. Note that AKS provides this webhook and other clouds like Google offer a webhook solution for configuring customer applications.
  • This is about providing parity with other solutions but not required for anything directly related to the product.
    If we don't provide this Azure pod identity webhook method, customer would need to get the details via some other way like a secret or set explicitly as environment variables. With the webhook, you just annotate your service account.
  • For Azure pod identity webhook, see CCO-363 and https://azure.github.io/azure-workload-identity/docs/installation/mutating-admission-webhook.html.
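
For reference, the AWS convenience described above is driven purely by a service account annotation; a GCP equivalent would follow the same pattern with its own annotation key (the AWS annotation shown below is the existing one; the GCP key is not defined here):

$ # Existing AWS STS pattern: annotate the service account and the webhook
$ # injects the token mount and environment variables into pods that use it.
$ oc annotate serviceaccount my-app -n my-namespace \
    eks.amazonaws.com/role-arn=arn:aws:iam::123456789012:role/my-app-role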

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Will require the following:

Background

  • For AWS, we deploy the AWS STS pod identity webhook as a customer convenience for configuring their applications to utilize service account tokens minted by a cluster that supports STS. When you create a pod that references a service account, the webhook looks for annotations on that service account and if found, the webhook mutates the deployment in order to set environment variables + mounts the service account token on that deployment so that the pod has everything it needs to make an API client.
  • Our temporary access token (using TAT in place of STS because STS is AWS specific) enablement for (select) third party operators does not rely on the webhook and is instead using CCO to create a secret containing the variables based on the credentials requests. The service account token is also explicitly mounted for those operators. Pod identity webhooks were considered as an alternative to this approach but weren't chosen.
  • Basically, if we deploy this webhook it will be for customer convenience and will enable us to potentially use the Azure pod identity webhook in the future if we so chose. Note that AKS provides this webhook and other clouds like Google offer a webhook solution for configuring customer applications.
  • This is about providing parity with other solutions but not required for anything directly related to the product.
    If we don't provide this Azure pod identity webhook method, customer would need to get the details via some other way like a secret or set explicitly as environment variables. With the webhook, you just annotate your service account.
  • For Azure pod identity webhook, see CCO-363 and https://azure.github.io/azure-workload-identity/docs/installation/mutating-admission-webhook.html.

 

Once we have forked the webhook, we need to configure the operator to deploy it, similar to how we do for the other platforms.

  • ccoctl to create the webhook secret
  • podidentity webhook controller to deploy the webhook when secret exists

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer - specifically for IPI deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision GCP infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

  • The GCP IPI Installer no longer contains or uses Terraform.
  • The new provider should aim to provide the same results and have parity with the existing GCP Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.


Epic Goal

  • Provision GCP infrastructure without the use of Terraform

Why is this important?

  • Removing Terraform from Installer

Scenarios

  1. The new provider should aim to provide the same results as the existing GCP Terraform provider.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem:

Installing into a Shared VPC gets stuck waiting for the network infrastructure to become ready.

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-06-10-225505

How reproducible:

Always

Steps to Reproduce:

1. "create install-config" and then insert Shared VPC settings (see [1])
2. activate the service account which has the minimum permissions in the host project (see [2])
3. "create cluster"

FYI The GCP project "openshift-qe" is the service project, and the GCP project "openshift-qe-shared-vpc" is the host project. 

Actual results:

1. The installation gets stuck waiting for the network infrastructure to become ready, until Ctrl+C is pressed.
2. Two firewall rules are created in the service project unexpectedly (see [3]).

Expected results:

The installation should succeed, and no firewall rules should be created in either the service project or the host project.

Additional info:

 

Description of problem:

After a successful IPI or UPI cluster installation using minimum permissions, destroying the cluster keeps reporting the error "failed to list target tcp proxies: googleapi: Error 403: Required 'compute.regionTargetTcpProxies.list' permission" unexpectedly.

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-2024-09-01-175607

How reproducible:

    Always

Steps to Reproduce:

    1. try IPI or UPI installation using minimum permissions, and make sure it succeeds
    2. destroy the cluster using the same GCP credentials    

Actual results:

    It keeps reporting the errors below until timeout.

08-27 14:51:40.508  level=debug msg=Target TCP Proxies: failed to list target tcp proxies: googleapi: Error 403: Required 'compute.regionTargetTcpProxies.list' permission for 'projects/openshift-qe', forbidden
...output omitted...
08-27 15:08:18.801  level=debug msg=Target TCP Proxies: failed to list target tcp proxies: googleapi: Error 403: Required 'compute.regionTargetTcpProxies.list' permission for 'projects/openshift-qe', forbidden

Expected results:

    It should not try to list regional target TCP proxies, because the CAPI-based installation only creates a global target TCP proxy, and the service account given to the installer already has the required compute.targetTcpProxies permissions (see [1] and [2]).

Additional info:

    FYI, the latest IPI Prow CI run was about 19 days earlier and did not hit this issue; see https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-gcp-ipi-mini-perm-custom-type-f28/1823483536926052352

Required GCP permissions for installer-provisioned infrastructure https://docs.openshift.com/container-platform/4.16/installing/installing_gcp/installing-gcp-account.html#minimum-required-permissions-ipi-gcp_installing-gcp-account

Required GCP permissions for user-provisioned infrastructure https://docs.openshift.com/container-platform/4.16/installing/installing_gcp/installing-gcp-user-infra.html#minimum-required-permissions-upi-gcp_installing-gcp-user-infra

Description of problem:

    Shared VPC installation using a service account that has all required permissions failed because the ingress cluster operator degraded with the error "error getting load balancer's firewall: googleapi: Error 403: Required 'compute.firewalls.get' permission for 'projects/openshift-qe-shared-vpc/global/firewalls/k8s-fw-a5b1f420669b3474d959cff80e8452dc'"

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-multi-2024-08-07-221959

How reproducible:

    Always

Steps to Reproduce:

1. "create install-config", then insert the interested settings (see [1])
2. "create cluster" (see [2])

Actual results:

    Installation failed because the ingress cluster operator degraded (see [2] and [3]). 

$ oc get co ingress
NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress             False       True          True       113m    The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: error getting load balancer's firewall: googleapi: Error 403: Required 'compute.firewalls.get' permission for 'projects/openshift-qe-shared-vpc/global/firewalls/k8s-fw-a5b1f420669b3474d959cff80e8452dc', forbidden...
$ 

In fact, the mentioned k8s firewall rule doesn't exist in the host project (see [4]), and the given service account does have sufficient permissions (see [6]).

Expected results:

    Installation succeeds, and all cluster operators are healthy. 

Additional info:

    

Goal

  • Document collecting OpenShift-on-OpenStack metrics together with OpenStack metrics. Document sending metrics to an external Prometheus-compatible instance, and as a stretch goal, to RHOSO's telemetry-operator.

Why is this important?

  • Let users correlate OpenShift-on-OpenStack anomalies with OpenStack metrics.

Scenarios


  1. ...

Acceptance Criteria

  •  

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Document all implementation steps and requirements to configure RHOSO's telemetry-operator to scrape an arbitrary external endpoint (which in our case would be ACM's monitoring operator in OpenShift) to add metrics.

Identify the minimum required access level to add scraping endpoints and OpenShift UI dashboards.

Feature Overview (aka. Goal Summary)

The objective is to create a comprehensive backup and restore mechanism for HCP OpenShift Virtualization Provider. This feature ensures both the HCP state and the worker node state are backed up and can be restored efficiently, addressing the unique requirements of KubeVirt environments.

Goals (aka. Expected User Outcomes)

  • Users will be able to backup and restore the KubeVirt HCP cluster, including both HCP state and worker node state.
  • Ensures continuity and reliability of operations after a restore, minimizing downtime and data loss.
  • Supports seamless re-connection of HCP to worker nodes post-restore.

Requirements (aka. Acceptance Criteria)

  • Backup of KubeVirt CSI infra PVCs
  • Backup of KubeVirt VMs + VM state + (possibly even network attachment definitions)
  • Backup of Cloud Provider KubeVirt Infra Load Balancer services (having IP addresses change here on the service could be problematic)
  • Backup of Any custom network policies associated with VM pods
  • Backup of VMs and state placed on External Infra

Use Cases (Optional)

  1. Disaster Recovery: In case of a disaster, the system can restore the HCP and worker nodes to the previous state, ensuring minimal disruption.
  2. Cluster Migration: Allows migration of hosted clusters across different management clusters.
  3. System Upgrades: Facilitates safe upgrades by providing a reliable restore point.

Out of Scope

  • Real-time synchronization of backup data.
  • Non-disruptive Backup and restore (ideal but not required)

Documentation Considerations

Interoperability Considerations

  • Impact on other projects like ACM/MCE vol-sync.
  • Test scenarios to validate interoperability with existing backup solutions.

The HCP team has delivered OADP backup and restore steps for the Agent and AWS provider here. We need to add the steps necessary to make these steps work for HCP KubeVirt clusters.

Requirements

  • Deliver backup/restore steps that reach feature parity with the documented agent and aws platforms
  • Ensure that kubevirt-csi and cloud-provider-kubevirt LBs can be backed up and restored successfully
  • Ensure this works with external infra

 

Non Requirements

  • VMs do not need to be backed up to reach feature parity because the current aws/agent steps require the cluster to scale down to zero before backing up.

Document this process in the upstream HyperShift documentation.

  • Backup while HCP is live with active worker nodes (don't back up workers, but backup should not disrupt workers)
  • Greenfield restore (meaning previous HCP is removed), HCP nodes are re-created during the restore
  • Restore is limited to the same mgmt cluster the HCP originated on

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

Allow customers to enable EFS CSI usage metrics.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

OCP already supports exposing CSI usage metrics; however, the EFS metrics are not enabled by default. The goal of this feature is to allow customers to optionally turn on EFS CSI usage metrics in order to see them in the OCP console.

The EFS metrics are not enabled by default for a good reason: collecting them can impact performance. They are disabled in OCP because the CSI driver has to walk through the whole volume, which can be very slow on large volumes. For this reason the default will remain the same (no metrics); customers will need to explicitly opt in.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

Clear procedure on how to enable it as a day 2 operation. Default remains no metrics. Once enabled the metrics should be available for visualisation.

 

We should also have a way to disable metrics.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both both
Classic (standalone cluster) yes
Hosted control planes yes
Multi node, Compact (three node), or Single node (SNO), or all AWS only
Connected / Restricted Network both
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) all AWS/EFS supported
Operator compatibility EFS CSI operator
Backport needed (list applicable versions) No
UI need (e.g. OpenShift Console, dynamic plugin, OCM) Should appear in OCP UI automatically
Other (please specify) OCP on AWS only

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

As an OCP user, I want to be able to visualise the EFS CSI metrics.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.


Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

Additional metrics

Enabling metrics by default.

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

Customer request as per 

https://issues.redhat.com/browse/RFE-3290

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

We need to be extra clear on the potential performance impact

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Document how to enable CSI metrics + warning about the potential performance impact.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

It can benefit any cluster on AWS using EFS CSI including ROSA

Epic Goal*

The goal of this epic is to provide a way for admins to turn on EFS CSI usage metrics. Since this could degrade performance (the CSI driver has to walk through the whole volume), the option will not be enabled by default; admins will need to explicitly opt in.

 
Why is this important? (mandatory)

Turning on EFS metrics allows users to monitor how much EFS space is being used by OCP.

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. As an admin I would like to turn on EFS CSI metrics 
  2. As an admin I would like to visualise how much EFS space is used by OCP.

 
Dependencies (internal and external) (mandatory)

None

Contributing Teams(and contacts) (mandatory) 

  • Development - STOR
  • Documentation - STOR
  • QE - STOR
  • PX - Yes, knowledge transfer
  • Others -

Acceptance Criteria (optional)

Enable CSI metrics via the operator: ensure the driver is started with the proper command-line options. Verify that the metrics are collected and exposed to users.
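
For illustration, here is a minimal Go sketch of how the opt-in could be plumbed through to the driver command line. The --vol-metrics-opt-in flag name comes from the upstream aws-efs-csi-driver and should be verified against the vendored driver version; the function and parameter names are placeholders, not the final operator API.

```go
package efsoperator

// controllerArgs sketches how the operator could pass the usage-metrics
// opt-in down to the EFS CSI controller container. The flag name is taken
// from the upstream aws-efs-csi-driver and should be treated as an
// assumption; metricsOptIn would come from the (to-be-defined) day-2 API.
func controllerArgs(metricsOptIn bool) []string {
	args := []string{
		"--endpoint=$(CSI_ENDPOINT)",
	}
	if metricsOptIn {
		args = append(args, "--vol-metrics-opt-in=true")
	}
	return args
}
```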

Drawbacks or Risk (optional)

Metrics are calculated by walking through the whole volume, which can impact performance. For this reason, enabling CSI metrics will need an explicit opt-in from the admin. This risk needs to be explicitly documented.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Feature Overview (aka. Goal Summary)  

As a product manager or business owner of OpenShift Lightspeed, I want to track who is using what feature of OLS and why. I also want to track the product adoption rate so that I can make decisions about the product (add/remove features, add new investment).

Requirements (aka. Acceptance Criteria):

Notes:

Enable monitoring of OLS by default when a user installs the OLS operator ---> check the box by default 

Users will have the ability to disable the monitoring ----> by unchecking the box

 

Refer to this slack conversation :https://redhat-internal.slack.com/archives/C068JAU4Y0P/p1723564267962489 

 

Story

As an OLS developer, I want users to see the "Operator recommended cluster monitoring" box checked, so that the metrics are collected by default.

Acceptance Criteria

  • Make the operator install UI have the "Operator recommended cluster monitoring" box checked by default

AWS CAPI implementation supports "Tenancy" configuration option: https://pkg.go.dev/sigs.k8s.io/cluster-api-provider-aws@v1.5.0/api/v1beta1#AWSMachineSpec

This option corresponds to functionality OCP currently exposes through MAPI:

This option is currently in use by existing ROSA customers, and will need to be exposed in HyperShift NodePools

User Story:

As a (user persona), I want to be able to:

  • Set Tenancy options through the NodePool API.

so that I can achieve

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This does not require a design proposal.
This requires a feature gate.

Wrap the NodePool tenancy API field in a struct, to group placement options and make it easy to add new ones to the API in the future.
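
A rough sketch of what that grouping could look like is below; the type and field names are illustrative placeholders rather than the final HyperShift API.

```go
package v1beta1

// PlacementOptions groups EC2 placement-related settings so that new options
// (for example placement groups) can be added later without new top-level
// NodePool fields. All names here are illustrative, not the final API.
type PlacementOptions struct {
	// Tenancy indicates whether instances run on shared or single-tenant
	// hardware. Typical EC2 values are "default", "dedicated" and "host".
	// +optional
	Tenancy string `json:"tenancy,omitempty"`
}

// AWSNodePoolPlatform (excerpt) shows where such a grouped field could live.
type AWSNodePoolPlatform struct {
	// Placement carries placement options for the NodePool's instances.
	// +optional
	Placement *PlacementOptions `json:"placement,omitempty"`
}
```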

User Story:

As a (user persona), I want to be able to:

  • Set Tenancy options through the NodePool API.

so that I can achieve

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This does not require a design proposal.
This requires a feature gate.

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

Introduce snapshots support for Azure File as Tech Preview

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

After introducing cloning support in 4.17, the goal of this epic is to add the last remaining piece: snapshot support as Tech Preview.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

Should pass all the regular CSI snapshot tests. All failing or known issues should be documented in the RN. Since this feature is TP, we can still introduce it with known issues.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both both
Classic (standalone cluster) yes
Hosted control planes yes
Multi node, Compact (three node), or Single node (SNO), or all all with Azure
Connected / Restricted Network all
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) all
Operator compatibility Azure File CSI
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM) Already covered
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

As an OCP on Azure user I want to perform snapshots of my PVC and be able to restore them as a new PVC.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

Are there any known issues? If so, they should be documented.

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

N/A

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

We already support CSI snapshots for other cloud providers; we need to align the Azure File CSI capabilities with them. Upstream support has lagged.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

User experience should be the same as other CSI drivers.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Add snapshot support in the CSI driver table, if there is any specific information to add, include it in the Azure File CSI driver doc. Any known issue should be documented in the RN.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Can be leveraged by ARO or OSD on Azure.

Epic Goal*

Add support for snapshots in Azure File.

 

Why is this important? (mandatory)

We should track upstream issues and ensure enablement in OpenShift. Snapshots are a standard CSI feature, and the reason we did not support them until now was the lack of upstream support for snapshot restoration.

The snapshot restore feature was added recently in the upstream driver 1.30.3, which we rebased to in 4.17 - https://github.com/kubernetes-sigs/azurefile-csi-driver/pull/1904

Furthermore, we already include the azcopy CLI, which is a dependency of cloning (and snapshots). Enabling snapshots in 4.17 is therefore just a matter of adding a sidecar, a VolumeSnapshotClass and RBAC in csi-operator, which is cheap compared to the gain.

However, we've observed a few issues with cloning that might need further fixes before it can graduate to GA, and we intend to release the cloning feature as Tech Preview in 4.17. Since snapshots are implemented with azcopy too, we expect similar issues and suggest releasing the snapshot feature as Tech Preview first in 4.17 as well.

 
Scenarios (mandatory) 

Users should be able to create a snapshot and restore PVC from snapshots.

 
Dependencies (internal and external) (mandatory)

azcopy - already added in scope of cloning epic

upstream driver support for snapshot restore - already added via 4.17 rebase

 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

Introduce snapshots support for Azure File as Tech Preview

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

After introducing cloning support in 4.17, the goal of this epic is to add the last remaining piece: snapshot support as Tech Preview.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

Should pass all the regular CSI snapshot tests. All failing or known issues should be documented in the RN. Since this feature is TP, we can still introduce it with known issues.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both both
Classic (standalone cluster) yes
Hosted control planes yes
Multi node, Compact (three node), or Single node (SNO), or all all with Azure
Connected / Restricted Network all
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) all
Operator compatibility Azure File CSI
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM) Already covered
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

As an OCP on Azure user I want to perform snapshots of my PVC and be able to restore them as a new PVC.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

Are there any known issues? If so, they should be documented.

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

N/A

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

We already support CSI snapshots for other cloud providers; we need to align the Azure File CSI capabilities with them. Upstream support has lagged.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

User experience should be the same as other CSI drivers.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Add snapshot support in the CSI driver table, if there is any specific information to add, include it in the Azure File CSI driver doc. Any known issue should be documented in the RN.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Can be leveraged by ARO or OSD on Azure.

Feature Overview (aka. Goal Summary)  

This feature only covers the downstream MAPI work to Enable Capacity Blocks. 

Capacity Blocks are needed in managed OpenShift (ROSA with Hosted Control Planes) via CAPI. Once the HCP feature and the OCM feature are completed, a Service Consumer can use upstream CAPI to set capacity reservations in a ROSA+HCP cluster.

https://docs.aws.amazon.com/en_us/AWSEC2/latest/UserGuide/capacity-blocks-using.html#capacity-blocks-purchase 

Feature Overview

Enable sharing ConfigMap and Secret across namespaces

Requirements

Requirement Notes isMvp?
Secrets and ConfigMaps can get shared across namespaces   YES

Questions to answer…

NA

Out of Scope

NA

Background, and strategic fit

Consumption of RHEL entitlements has been a challenge on OCP 4 since it moved to a cluster-based entitlement model, compared to the node-based (RHEL Subscription Manager) entitlement model. In order to provide a sufficiently similar experience to OCP 3, the entitlement certificates that are made available on the cluster (OCPBU-93) should be shared across namespaces, to prevent the need for a cluster admin to copy these entitlements into each namespace, which leads to additional operational challenges for updating and refreshing them. 

Documentation Considerations

Questions to be addressed:
 * What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
 * Does this feature have doc impact?
 * New Content, Updates to existing content, Release Note, or No Doc Impact
 * If unsure and no Technical Writer is available, please contact Content Strategy.
 * What concepts do customers need to understand to be successful in [action]?
 * How do we expect customers will use the feature? For what purpose(s)?
 * What reference material might a customer want/need to complete [action]?
 * Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
 * What is the doc impact (New Content, Updates to existing content, or Release Note)?

Epic Goal*

Remove the Shared Resource CSI Driver as a tech preview feature.
 
Why is this important? (mandatory)

Shared Resources was originally introduced as a tech preview feature in OpenShift Container Platform. After extensive review, we have decided to GA this component through the Builds for OpenShift layered product.

Expected GA will be alongside OpenShift 4.16. Therefore it is safe to remove in OpenShift 4.17

 
Scenarios (mandatory)

  1. Accessing RHEL content in builds/workloads
  2. Sharing other information across namespaces in the cluster (ex: OpenShift pull secret) 

 
Dependencies (internal and external) (mandatory)

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - OpenShift Storage, OpenShift Builds (#forum-openshift-builds)
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

  • Shared Resource CSI driver cannot be installed using OCP feature gates/tech preview feature set.

Drawbacks or Risk (optional)

  • Using Shared Resources requires installation of a layered product, not part of OCP core.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Feature Overview:

Ensure CSI Stack for Azure is running on management clusters with hosted control planes, allowing customers to associate a cluster as "Infrastructure only" and move the following parts of the stack:

  • Azure Disk CSI driver
  • Azure File CSI driver
  • Azure File CSI driver operator

Value Statement:

This feature enables customers to run their Azure infrastructure more efficiently and cost-effectively by using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. Additionally, customers should care most about workloads not the management stack to operate their clusters, this feature gets us closer to this goal.

Goals:

  1. Ability for customers to associate a cluster as "Infrastructure only" and pack control planes on role=infra nodes.
  2. Ability to run cluster-storage-operator (CSO) + Azure Disk CSI driver operator + Azure Disk CSI driver control-plane Pods in the management cluster.
  3. Ability to run the driver DaemonSet in the hosted cluster.

Requirements:

  1. The feature must ensure that the CSI Stack for Azure is installed and running on management clusters with hosted control planes.
  2. The feature must allow customers to associate a cluster as "Infrastructure only" and pack control planes on role=infra nodes.
  3. The feature must enable the Azure Disk CSI driver, Azure File CSI driver, and Azure File CSI driver operator to run on the appropriate clusters.
  4. The feature must enable the cluster-storage-operator (CSO) + Azure Disk CSI driver operator + Azure Disk CSI driver control-plane Pods to run in the management cluster.
  5. The feature must enable the driver DaemonSet to run in the hosted cluster.
  6. The feature must ensure security, reliability, performance, maintainability, scalability, and usability.

Use Cases:

  1. A customer wants to run their Azure infrastructure using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. They use this feature to associate a cluster as "Infrastructure only" and pack control planes on role=infra nodes.
  2. A customer wants to use Azure storage without having to see/manage its stack, especially on a managed service. This would mean that we need to run the cluster-storage-operator (CSO) + Azure Disk CSI driver operator + Azure Disk CSI driver control-plane Pods in the management cluster and the driver DaemonSet in the hosted cluster. 

Questions to Answer:

  1. What Azure-specific considerations need to be made when designing and delivering this feature?
  2. How can we ensure the security, reliability, performance, maintainability, scalability, and usability of this feature?

Out of Scope:

Non-CSI Stack for Azure-related functionalities are out of scope for this feature.

Workload identity authentication is not covered by this feature - see STOR-1748

Background

This feature is designed to enable customers to run their Azure infrastructure more efficiently and cost-effectively by using HyperShift control planes and supporting infrastructure without incurring additional charges from Red Hat.

Documentation Considerations:

Documentation for this feature should provide clear instructions on how to enable the CSI Stack for Azure on management clusters with hosted control planes and associate a cluster as "Infrastructure only." It should also include instructions on how to move the Azure Disk CSI driver, Azure File CSI driver, and Azure File CSI driver operator to the appropriate clusters.

Interoperability Considerations:

This feature impacts the CSI Stack for Azure and any layered products that interact with it. Interoperability test scenarios should be factored by the layered products.

 

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

Run the Azure File CSI driver operator + Azure File CSI driver control-plane Pods in the management cluster, and run the driver DaemonSet in the hosted cluster, allowing customers to associate a cluster as "Infrastructure only".

 

 
Why is this important? (mandatory)

This allows customers to run their Azure infrastructure more efficiently and cost-effectively by using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. Additionally, customers should care most about workloads not the management stack to operate their clusters, this feature gets us closer to this goal.

 
Scenarios (mandatory) 

When leveraging Hosted control planes, the Azure File CSI driver operator + Azure File CSI driver control-plane Pods should run in the management cluster. The driver DaemonSet should run on the managed cluster. This deployment model should provide the same feature set as the regular OCP deployment.

 
Dependencies (internal and external) (mandatory)

Hosted control plane on Azure.

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - STOR
  • Documentation -
  • QE - 
  • PX - 
  • Others -

 

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

Run the Azure Disk CSI driver operator + Azure Disk CSI driver control-plane Pods in the management cluster, and run the driver DaemonSet in the hosted cluster, allowing customers to associate a cluster as "Infrastructure only".

 

 
Why is this important? (mandatory)

This allows customers to run their Azure infrastructure more efficiently and cost-effectively by using hosted control planes and supporting infrastructure without incurring additional charges from Red Hat. Additionally, customers should care most about workloads not the management stack to operate their clusters, this feature gets us closer to this goal.

 
Scenarios (mandatory) 

When leveraging Hosted control planes, the Azure Disk CSI driver operator + Azure Disk CSI driver control-plane Pods should run in the management cluster. The driver DaemonSet should run on the managed cluster. This deployment model should provide the same feature set as the regular OCP deployment.

 
Dependencies (internal and external) (mandatory)

Hosted control plane on Azure.

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - STOR
  • Documentation -
  • QE - 
  • PX - 
  • Others -

 

Done - Checklist (mandatory)

As part of this epic, engineers working on Azure HyperShift should be able to build and use Azure Disk storage on HyperShift guests via developer-preview custom build images.

Goal

The goals of this feature are:

  • optimize and streamline the operations of HyperShift Operator (HO) on Azure Kubernetes Service (AKS) clusters
  • Enable auto-detection of the underlying environment (managed or self-managed) to optimize the HO accordingly.

Placeholder epic to capture all Azure tickets.

TODO: review.

User Story:

As an end user of a hypershift cluster, I want to be able to:

  • Not see internal host information when inspecting a serving certificate of the kubernetes API server

so that I can achieve

  • No knowledge of internal names for the kubernetes cluster.

From slack thread: https://redhat-external.slack.com/archives/C075PHEFZKQ/p1722615219974739 

We need 4 different serving certs, one per SAN group (see the sketch after this list):

  • common SANs
  • internal SAN
  • FQDN
  • service IP
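
For illustration only, the sketch below shows how one serving certificate per SAN group can be issued with Go's standard library, so that a client connecting through the public FQDN never sees internal names. All host names and IPs are placeholders; this is not the actual HyperShift PKI code.

```go
package kaspki

import (
	"crypto/x509"
	"crypto/x509/pkix"
	"math/big"
	"net"
	"time"
)

// servingCertTemplate returns a certificate template restricted to a single
// SAN group (for example only the external FQDN, or only the service IP), so
// internal host names never appear in externally served certificates.
// All names and addresses passed in are placeholders in this sketch.
func servingCertTemplate(dnsNames []string, ips []net.IP) *x509.Certificate {
	return &x509.Certificate{
		SerialNumber: big.NewInt(time.Now().UnixNano()),
		Subject:      pkix.Name{CommonName: "kube-apiserver"},
		DNSNames:     dnsNames, // e.g. {"api.example.hypershift.local"}
		IPAddresses:  ips,      // e.g. {net.ParseIP("172.30.0.1")}
		NotBefore:    time.Now(),
		NotAfter:     time.Now().AddDate(1, 0, 0),
		KeyUsage:     x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}
}
```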

Feature Overview (aka. Goal Summary)  

Unify and update hosted control planes storage operators so that they have similar code patterns and can run properly in both standalone OCP and HyperShift's control plane.

Goals (aka. expected user outcomes)

  • Simplify the operators with a unified code pattern
  • Expose metrics from control-plane components
  • Use proper RBACs in the guest cluster
  • Scale the pods according to HostedControlPlane's AvailabilityPolicy
  • Add proper node selector and pod affinity for mgmt cluster pods

Requirements (aka. Acceptance Criteria):

  • OCP regression tests work in both standalone OCP and HyperShift
  • Code in the operators looks the same
  • Metrics from control-plane components are exposed
  • Proper RBACs are used in the guest cluster
  • Pods scale according to HostedControlPlane's AvailabilityPolicy (see the sketch after this list)
  • Proper node selectors and pod affinity are added for mgmt cluster pods
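
As a minimal sketch of the availability-policy requirement above, the snippet below shows the general shape of deriving control-plane replica counts from the HostedControlPlane availability policy. The constant values and helper name are assumptions; check the hypershift.openshift.io API for the exact names.

```go
package hcpstorage

// Availability policy values as used by HyperShift's HostedControlPlane API.
// The string values are assumed here; check hypershift.openshift.io/v1beta1
// for the authoritative constants.
const (
	highlyAvailable = "HighlyAvailable"
	singleReplica   = "SingleReplica"
)

// replicasFor picks the replica count for a storage operator's control-plane
// Deployment in the management cluster based on the hosted control plane's
// availability policy (two replicas is a common choice for HA operator
// deployments; the exact count is a policy decision).
func replicasFor(availabilityPolicy string) int32 {
	if availabilityPolicy == highlyAvailable {
		return 2
	}
	return 1
}
```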

 

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

 

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Epic Goal*

Our current design of the EBS driver operator's HyperShift support does not scale well to other drivers. The existing design will lead to more code duplication between driver operators and a greater possibility of errors.
 
Why is this important? (mandatory)

An improved design will allow more storage drivers and their operators to be added to hypershift without requiring significant changes in the code internals.
 
Scenarios (mandatory) 

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Finally switch both CI and ART to the refactored aws-ebs-csi-driver-operator.

The functionality and behavior should be the same as the existing operator; however, the code is completely new, so there could be some rough edges. See https://github.com/openshift/enhancements/blob/master/enhancements/storage/csi-driver-operator-merge.md 

 

CI should catch the most obvious errors; however, we need to test features that we do not cover in CI, such as:

  • custom CA bundles
  • cluster-wide proxy
  • custom encryption keys used in install-config.yaml
  • government cluster
  • STS
  • SNO
  • and others

Our CSI driver YAML files are mostly copy-pasted from the initial CSI driver (AWS EBS?). 

As an OCP engineer, I want the YAML files to be generated, so we can easily keep the CSI drivers consistent and make them less error-prone.

It should have no visible impact on the resulting operator behavior.
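
A small sketch of the kind of generation that could replace the copy-pasted manifests is shown below: a shared template rendered with per-driver values. The file name, struct fields, and values are illustrative only; the real csi-operator generator has its own, richer model.

```go
package main

import (
	"os"
	"text/template"
)

// driverValues holds the per-driver data substituted into a shared manifest
// template; the real csi-operator generator has its own, richer model.
type driverValues struct {
	DriverName string
	Namespace  string
	Image      string
}

func main() {
	// "controller.yaml.tmpl" is a placeholder path for a shared template
	// containing the parts that are identical across drivers.
	tmpl := template.Must(template.ParseFiles("controller.yaml.tmpl"))
	v := driverValues{
		DriverName: "ebs.csi.aws.com",
		Namespace:  "openshift-cluster-csi-drivers",
		Image:      "registry.example.com/ebs-csi-driver:latest", // placeholder
	}
	if err := tmpl.Execute(os.Stdout, v); err != nil {
		panic(err)
	}
}
```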

Feature Overview

Support deploying an OpenShift cluster across multiple vSphere clusters, i.e. configuring multiple vCenter servers in one OpenShift cluster.

Goals

Multiple vCenter support in the Cloud Provider Interface (CPI) and the Cloud Storage Interface (CSI).

Use Cases

Customers want to deploy OpenShift across multiple vSphere clusters (vCenters) primarily for high availability.

 


Done Done Done Criteria

This section contains all the test cases that we need to make sure work as part of the done^3 criteria.

  • Clean install of new cluster with multi vCenter configuration
  • Clean install of new cluster with single vCenter still working as previously
  • VMs / machines can be scaled across all vCenters / Failure Domains
  • PVs should be able to be created on all vCenters

Out-of-Scope

This section contains all scenarios that are considered out of scope for this enhancement that will be done via a separate epic / feature / story.

  • Migration of a single-vCenter OCP cluster to multi-vCenter (stretch goal)

Feature Overview

Add authentication to the internal components of the Agent Installer so that the cluster install is secure.

Goals

  • Day1: Only allow agents booted from the same agent ISO to register with the assisted-service and use the agent endpoints
  • Day2: Only allow agents booted from the same node ISO to register with the assisted-service and use the agent endpoints
  •  
  • Only allow access to write endpoints to the internal services
  • Use authentication to read endpoints

 

Epic Goal

  • This epic's scope was originally to encompass both authentication and authorization, but we have split the expanding scope into a separate epic.
  • We want to add authorization to the internal components of Agent Installer so that the cluster install is secure. 

Why is this important?

  • The Agent Installer API server (assisted-service) has several methods for authorization, but none of the existing methods are applicable to the Agent Installer use case. 
  • During the MVP of Agent Installer we attempted to turn on the existing authorization schemes but found we didn't have access to the correct API calls.
  • Without proper authorization it is possible for an unauthorized node to be added to the cluster during install. Currently we expect this to happen by mistake rather than maliciously.

Brainstorming Notes:

Requirements

  • Allow only agents booted from the same ISO to register with the assisted-service and use the agent endpoints
  • Agents already know the InfraEnv ID, so if read access requires authentication then that is sufficient in some existing auth schemes.
  • Prevent access to write endpoints except by the internal systemd services
  • Use some kind of authentication for read endpoints
  • Ideally use existing credentials - admin-kubeconfig client cert and/or kubeadmin-password
  • (Future) Allow UI access in interactive mode only

 

Are there any requirements specific to the auth token?

  • Ephemeral
  • Limited to one cluster: Reuse the existing admin-kubeconfig client cert

 

Actors:

  • Agent Installer: example wait-for
  • Internal systemd: configurations, create cluster infraenv, etc
  • UI: interactive user
  • User: advanced automation user (not supported yet)

 

Do we need more than one auth scheme?

Agent-admin - agent-read-write

Agent-user - agent-read

Options for Implementation:

  1. New auth scheme in assisted-service
  2. Reverse proxy in front of assisted-service API (see the sketch after this list)
  3. Use an existing auth scheme in assisted-service
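
To make option 2 concrete, below is a minimal sketch of a token-checking reverse proxy placed in front of assisted-service. The listen addresses, the bearer-token scheme, and the idea of baking the token into the agent ISO are assumptions for illustration, not the implemented design.

```go
package main

import (
	"crypto/subtle"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// assisted-service address and the shared agent token are placeholders;
	// in the imagined flow the token would be baked into the agent ISO.
	target, err := url.Parse("http://127.0.0.1:8090")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)
	expected := []byte("Bearer agent-token-from-the-iso") // placeholder value

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		got := []byte(r.Header.Get("Authorization"))
		// Constant-time comparison; non-matching (or missing) tokens are
		// rejected before the request ever reaches assisted-service.
		if subtle.ConstantTimeCompare(got, expected) != 1 {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		proxy.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8091", handler))
}
```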

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Previous Work (Optional):

  1. AGENT-60 Originally we wanted to just turn on local authorization for Agent Installer workflows. It was discovered this was not sufficient for our use case.

Open questions::

  1. Which API endpoints do we need for the interactive flow?
  2. What auth scheme does the Assisted UI use if any?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a user, when creating node ISOs, I want to be able to:

  • See the ISO's expiration time logged when the ISO is generated using "oc adm node-image create"

so that I can achieve

  • Enhanced awareness of the ISO expiration date
  • Prevention of unexpected expiration issues
  • Improved overall user experience during node creation

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Epic Goal

  • Update all images that we ship with OpenShift to the latest upstream releases and libraries.
  • Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. Especially new CSI drivers releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to deadline and/or looks too big.

Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).

Why is this important?

  • We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.

This includes (but is not limited to):

  • Kubernetes:
    • client-go
    • controller-runtime
  • OCP:
    • library-go
    • openshift/api
    • openshift/client-go
    • operator-sdk

Operators:

  • aws-ebs-csi-driver-operator (in csi-operator)
  • aws-efs-csi-driver-operator
  • azure-disk-csi-driver-operator
  • azure-file-csi-driver-operator
  • openstack-cinder-csi-driver-operator
  • gcp-pd-csi-driver-operator
  • gcp-filestore-csi-driver-operator
  • csi-driver-manila-operator
  • vmware-vsphere-csi-driver-operator
  • alibaba-disk-csi-driver-operator
  • ibm-vpc-block-csi-driver-operator
  • csi-driver-shared-resource-operator
  • ibm-powervs-block-csi-driver-operator
  • secrets-store-csi-driver-operator

 

  • cluster-storage-operator
  • cluster-csi-snapshot-controller-operator
  • local-storage-operator
  • vsphere-problem-detector

EOL, do not upgrade:

  • github.com/oVirt/csi-driver-operator

Feature Overview

Create a GCP cloud-specific spec.resourceTags entry in the infrastructure CRD. This should create and update tags (or labels in GCP) on any OpenShift cloud resource that we create and manage. The behaviour should also tag existing resources that do not have the tags yet, and once the tags in the infrastructure CRD are changed, all the resources should be updated accordingly.

Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.

Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").

Once we are confident we have all components updated, we should introduce an end-to-end test that makes sure we never create resources that are untagged.
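
A sketch of what the skeleton of such a test could look like is below: read the expected labels from the Infrastructure resource, then compare them against every cluster-owned GCP resource (the GCP lookup is left as a comment). The status field names are assumptions and should be checked against the installed openshift/api version.

```go
package e2e

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"

	configclient "github.com/openshift/client-go/config/clientset/versioned"
)

// expectedGCPLabels reads the user-defined labels from the cluster's
// Infrastructure resource. The ResourceLabels/Key/Value field names are
// assumptions; verify them against the installed openshift/api version.
func expectedGCPLabels(ctx context.Context, cfg *rest.Config) (map[string]string, error) {
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	infra, err := client.ConfigV1().Infrastructures().Get(ctx, "cluster", metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	if infra.Status.PlatformStatus == nil || infra.Status.PlatformStatus.GCP == nil {
		return nil, fmt.Errorf("not a GCP cluster")
	}
	labels := map[string]string{}
	for _, l := range infra.Status.PlatformStatus.GCP.ResourceLabels {
		labels[l.Key] = l.Value
	}
	// The test body would then list every cluster-owned GCP resource (VMs,
	// disks, forwarding rules, ...) and fail if any of them is missing one
	// of these labels.
	return labels, nil
}
```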

 
Goals

  • Functionality on GCP GA
  • inclusion in the cluster backups
  • flexibility of changing tags during cluster lifetime, without recreating the whole cluster

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

List any affected packages or components.

  • Installer
  • Cluster Infrastructure
  • Storage
  • Node
  • NetworkEdge
  • Internal Registry
  • CCO

This is a continuation of the CORS-2455 / CFE-719 work, where support for GCP tags & labels was delivered as TechPreview in 4.14, with the aim of making it GA in 4.15. It involves removing any reference to TechPreview in code and docs and incorporating any feedback received from users.

The TechPreview featureSet check added in the installer for userLabels and userTags should be removed, and the TechPreview reference made in the install-config GCP schema should be removed.

Acceptance Criteria

  • Should be able to define userLabel and userTags without setting TechPreviewNoUpgrade featureSet.

The TechPreview featureSet check added in the machine-api-provider-gcp operator for userLabels and userTags should be removed.

The new featureGate added in openshift/api should also be removed.

Acceptance Criteria

  • Should be able to define userLabel and userTags without setting featureSet.

This Feature covers person-weeks of effort in meetings in #wg-managed-ocp-versions, where OTA helped SD refine how the planned OCM work would help and what that OCM work might look like: https://issues.redhat.com/browse/OTA-996?focusedId=25608383&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-25608383.

 

Background. 

Currently, the ROSA/ARO versions are not managed by the OTA team.
This Feature covers the engineering effort to transfer responsibility for OCP version management in OSD, ROSA and ARO from SRE-P to OTA.

Here is the design document for the effort: https://docs.google.com/document/d/1hgMiDYN9W60BEIzYCSiu09uV4CrD_cCCZ8As2m7Br1s/edit?skip_itp2_check=true&pli=1

Here are some objectives :

  • Managed clusters would get update recommendations directly from the Red Hat hosted OSUS, without intermediate layers such as ClusterImageSet (see the sketch after this list).
  • The new design should reduce the cost of maintaining versions for managed clusters including Hypershift hosted control planes.
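
For context, the sketch below shows the shape of a direct query against the Red Hat hosted OSUS (Cincinnati) graph endpoint; the channel and architecture values are examples, and managed-cluster specifics such as authentication and conditional-update handling are omitted.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Public OpenShift update-graph (OSUS/Cincinnati) endpoint; the channel
	// and arch values below are examples.
	q := url.Values{}
	q.Set("channel", "stable-4.17")
	q.Set("arch", "amd64")
	req, err := http.NewRequest(http.MethodGet,
		"https://api.openshift.com/api/upgrades_info/v1/graph?"+q.Encode(), nil)
	if err != nil {
		panic(err)
	}
	// The graph endpoint serves JSON when asked for it explicitly.
	req.Header.Set("Accept", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("status=%d, graph payload bytes=%d\n", resp.StatusCode, len(body))
}
```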

Presentation from Jeremy Eder :

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

This epic is to transfer the responsibility of OCP version management in OSD, ROSA and ARO from SRE-P to OTA.

  • The responsibility of management of OCP versions available to all self-managed OCP customers lies with the OTA team.
    • This responsibility makes the OTA team the center of excellence around version health.
  • The responsibility of management of OCP versions available to managed customers lies with the SRE-P team.

Why do this project:
    • The SREP team took on version management responsibility out of necessity.  Code was written and maintained to serve a service-tailored "version list".  This code requires effort to maintain.
    • As we begin to sell more managed OCP, it makes sense to move this responsibility back to the OTA team as this is not an "SRE" focused activity.
    • As the CoE for version health, the OTA team has the most comprehensive overview of code health.
    • The OTA team would benefit by coming up to speed on managed OCP-specific lifecycles/policies as well as become aware of the "why" for various policies where they differ from self-managed OCP.

 

Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?   Does it improve security, performance, supportability, etc?  Why is work a priority?

Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Other 

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision Azure infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

  • The Azure IPI Installer no longer contains or uses Terraform.
  • The new provider should aim to provide the same results and have parity with the existing Azure Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Provision Azure infrastructure without the use of Terraform

Why is this important?

  • Removing Terraform from Installer

Scenarios

  1. The new provider should aim to provide the same results as the existing Azure terraform provider.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem:

Failed to create a second cluster in a shared vnet. The error below is thrown while creating the network infrastructure for the 2nd cluster; the installer timed out and exited.
==============
07-23 14:09:27.315  level=info msg=Waiting up to 15m0s (until 6:24AM UTC) for network infrastructure to become ready...
...
07-23 14:16:14.900  level=debug msg=	failed to reconcile cluster services: failed to reconcile AzureCluster service loadbalancers: failed to create or update resource jima0723b-1-x6vpp-rg/jima0723b-1-x6vpp-internal (service: loadbalancers): PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-1-x6vpp-rg/providers/Microsoft.Network/loadBalancers/jima0723b-1-x6vpp-internal
07-23 14:16:14.900  level=debug msg=	--------------------------------------------------------------------------------
07-23 14:16:14.901  level=debug msg=	RESPONSE 400: 400 Bad Request
07-23 14:16:14.901  level=debug msg=	ERROR CODE: PrivateIPAddressIsAllocated
07-23 14:16:14.901  level=debug msg=	--------------------------------------------------------------------------------
07-23 14:16:14.901  level=debug msg=	{
07-23 14:16:14.901  level=debug msg=	  "error": {
07-23 14:16:14.901  level=debug msg=	    "code": "PrivateIPAddressIsAllocated",
07-23 14:16:14.901  level=debug msg=	    "message": "IP configuration /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-1-x6vpp-rg/providers/Microsoft.Network/loadBalancers/jima0723b-1-x6vpp-internal/frontendIPConfigurations/jima0723b-1-x6vpp-internal-frontEnd is using the private IP address 10.0.0.100 which is already allocated to resource /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/frontendIPConfigurations/jima0723b-49hnw-internal-frontEnd.",
07-23 14:16:14.902  level=debug msg=	    "details": []
07-23 14:16:14.902  level=debug msg=	  }
07-23 14:16:14.902  level=debug msg=	}
07-23 14:16:14.902  level=debug msg=	--------------------------------------------------------------------------------

Install-config for 1st cluster:
=========
metadata:
  name: jima0723b
platform:
  azure:
    region: eastus
    baseDomainResourceGroupName: os4-common
    networkResourceGroupName: jima0723b-rg
    virtualNetwork: jima0723b-vnet
    controlPlaneSubnet: jima0723b-master-subnet
    computeSubnet: jima0723b-worker-subnet
publish: External

Install-config for 2nd cluster:
========
metadata:
  name: jima0723b-1
platform:
  azure:
    region: eastus
    baseDomainResourceGroupName: os4-common
    networkResourceGroupName: jima0723b-rg
    virtualNetwork: jima0723b-vnet
    controlPlaneSubnet: jima0723b-master-subnet
    computeSubnet: jima0723b-worker-subnet
publish: External

shared master subnet/worker subnet:
$ az network vnet subnet list -g jima0723b-rg --vnet-name jima0723b-vnet -otable
AddressPrefix    Name                     PrivateEndpointNetworkPolicies    PrivateLinkServiceNetworkPolicies    ProvisioningState    ResourceGroup
---------------  -----------------------  --------------------------------  -----------------------------------  -------------------  ---------------
10.0.0.0/24      jima0723b-master-subnet  Disabled                          Enabled                              Succeeded            jima0723b-rg
10.0.1.0/24      jima0723b-worker-subnet  Disabled                          Enabled                              Succeeded            jima0723b-rg

internal lb frontedIPConfiguration on 1st cluster:
$ az network lb show -n jima0723b-49hnw-internal -g jima0723b-49hnw-rg --query 'frontendIPConfigurations'
[
  {
    "etag": "W/\"7a7531ca-fb02-48d0-b9a6-d3fb49e1a416\"",
    "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/frontendIPConfigurations/jima0723b-49hnw-internal-frontEnd",
    "inboundNatRules": [
      {
        "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/inboundNatRules/jima0723b-49hnw-master-0",
        "resourceGroup": "jima0723b-49hnw-rg"
      },
      {
        "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/inboundNatRules/jima0723b-49hnw-master-1",
        "resourceGroup": "jima0723b-49hnw-rg"
      },
      {
        "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/inboundNatRules/jima0723b-49hnw-master-2",
        "resourceGroup": "jima0723b-49hnw-rg"
      }
    ],
    "loadBalancingRules": [
      {
        "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/loadBalancingRules/LBRuleHTTPS",
        "resourceGroup": "jima0723b-49hnw-rg"
      },
      {
        "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-49hnw-rg/providers/Microsoft.Network/loadBalancers/jima0723b-49hnw-internal/loadBalancingRules/sint-v4",
        "resourceGroup": "jima0723b-49hnw-rg"
      }
    ],
    "name": "jima0723b-49hnw-internal-frontEnd",
    "privateIPAddress": "10.0.0.100",
    "privateIPAddressVersion": "IPv4",
    "privateIPAllocationMethod": "Static",
    "provisioningState": "Succeeded",
    "resourceGroup": "jima0723b-49hnw-rg",
    "subnet": {
      "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima0723b-rg/providers/Microsoft.Network/virtualNetworks/jima0723b-vnet/subnets/jima0723b-master-subnet",
      "resourceGroup": "jima0723b-rg"
    },
    "type": "Microsoft.Network/loadBalancers/frontendIPConfigurations"
  }
]

From the above output, privateIPAllocationMethod is Static and the privateIPAddress is always allocated as 10.0.0.100, which causes the 2nd cluster installation to fail.

Checking the same on a cluster created using terraform, privateIPAllocationMethod is Dynamic.
===============
$ az network lb show -n wxjaz723-pm99k-internal -g wxjaz723-pm99k-rg --query 'frontendIPConfigurations'
[
  {
    "etag": "W/\"e6bec037-843a-47ba-a725-3f322564be58\"",
    "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wxjaz723-pm99k-rg/providers/Microsoft.Network/loadBalancers/wxjaz723-pm99k-internal/frontendIPConfigurations/internal-lb-ip-v4",
    "loadBalancingRules": [
      {
        "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wxjaz723-pm99k-rg/providers/Microsoft.Network/loadBalancers/wxjaz723-pm99k-internal/loadBalancingRules/api-internal-v4",
        "resourceGroup": "wxjaz723-pm99k-rg"
      },
      {
        "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wxjaz723-pm99k-rg/providers/Microsoft.Network/loadBalancers/wxjaz723-pm99k-internal/loadBalancingRules/sint-v4",
        "resourceGroup": "wxjaz723-pm99k-rg"
      }
    ],
    "name": "internal-lb-ip-v4",
    "privateIPAddress": "10.0.0.4",
    "privateIPAddressVersion": "IPv4",
    "privateIPAllocationMethod": "Dynamic",
    "provisioningState": "Succeeded",
    "resourceGroup": "wxjaz723-pm99k-rg",
    "subnet": {
      "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wxjaz723-rg/providers/Microsoft.Network/virtualNetworks/wxjaz723-vnet/subnets/wxjaz723-master-subnet",
      "resourceGroup": "wxjaz723-rg"
    },
    "type": "Microsoft.Network/loadBalancers/frontendIPConfigurations"
  },
...
]

Version-Release number of selected component (if applicable):

  4.17 nightly build

How reproducible:

  Always

Steps to Reproduce:

    1. Create shared vnet / master subnet / worker subnet
    2. Create 1st cluster in shared vnet
    3. Create 2nd cluster in shared vnet
    

Actual results:

    2nd cluster installation failed

Expected results:

    Both clusters are installed successfully.

Additional info:

    

 

Description of problem:

Install an Azure fully private IPI cluster by using CAPI with a payload built from cluster bot including openshift/installer#8727 and openshift/installer#8732.

install-config:
=================
platform:
  azure:
    region: eastus
    outboundType: UserDefinedRouting
    networkResourceGroupName: jima24b-rg
    virtualNetwork: jima24b-vnet
    controlPlaneSubnet: jima24b-master-subnet
    computeSubnet: jima24b-worker-subnet
publish: Internal
featureSet: TechPreviewNoUpgrade

Checking the storage account created by the installer, its property allowBlobPublicAccess is set to True.
$ az storage account list -g jima24b-fwkq8-rg --query "[].[name,allowBlobPublicAccess]" -o tsv
jima24bfwkq8sa    True

This is not consistent with terraform code, https://github.com/openshift/installer/blob/master/data/data/azure/vnet/main.tf#L74

At a minimum, the storage account should have no public access for a fully private cluster.

Version-Release number of selected component (if applicable):

    4.17 nightly build

How reproducible:

    Always

Steps to Reproduce:

    1. Create fully private cluster
    2. Check storage account created by installer
    3.
    

Actual results:

    The storage account has public access on a fully private cluster.

Expected results:

     The storage account should have no public access on a fully private cluster.

Additional info:

    

Description of problem:

In the install-config file, there is no zone/instance type setting under controlPlane or defaultMachinePlatform
==========================
featureSet: CustomNoUpgrade
featureGates:
- ClusterAPIInstallAzure=true
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3

When creating the cluster, master instances should be created across multiple zones, since the default instance type 'Standard_D8s_v3' has availability zones. Actually, the master instances are not created in any zone.
$ az vm list -g jima24a-f7hwg-rg -otable
Name                                        ResourceGroup     Location        Zones
------------------------------------------  ----------------  --------------  -------
jima24a-f7hwg-master-0                      jima24a-f7hwg-rg  southcentralus
jima24a-f7hwg-master-1                      jima24a-f7hwg-rg  southcentralus
jima24a-f7hwg-master-2                      jima24a-f7hwg-rg  southcentralus
jima24a-f7hwg-worker-southcentralus1-wxncv  jima24a-f7hwg-rg  southcentralus  1
jima24a-f7hwg-worker-southcentralus2-68nxv  jima24a-f7hwg-rg  southcentralus  2
jima24a-f7hwg-worker-southcentralus3-4vts4  jima24a-f7hwg-rg  southcentralus  3

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-06-23-145410

How reproducible:

Always

Steps to Reproduce:

1. CAPI-based install on azure platform with default configuration
2. 
3.

Actual results:

master instances are created but not in any zone.

Expected results:

Master instances should be created per zone based on the selected instance type, keeping the same behavior as the terraform-based install.

Additional info:

When setting zones under controlPlane in install-config, master instances can be created per zone.
install-config:
===========================
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure:
      zones: ["1","3"]

$ az vm list -g jima24b-p76w4-rg -otable
Name                                        ResourceGroup     Location        Zones
------------------------------------------  ----------------  --------------  -------
jima24b-p76w4-master-0                      jima24b-p76w4-rg  southcentralus  1
jima24b-p76w4-master-1                      jima24b-p76w4-rg  southcentralus  3
jima24b-p76w4-master-2                      jima24b-p76w4-rg  southcentralus  1
jima24b-p76w4-worker-southcentralus1-bbcx8  jima24b-p76w4-rg  southcentralus  1
jima24b-p76w4-worker-southcentralus2-nmgfd  jima24b-p76w4-rg  southcentralus  2
jima24b-p76w4-worker-southcentralus3-x2p7g  jima24b-p76w4-rg  southcentralus  3

 

Description of problem:

Launching a CAPI-based installation on Azure Government Cloud, the installer timed out when waiting for network infrastructure to become ready.

06-26 09:08:41.153  level=info msg=Waiting up to 15m0s (until 9:23PM EDT) for network infrastructure to become ready...
...
06-26 09:09:33.455  level=debug msg=E0625 21:09:31.992170   22172 azurecluster_controller.go:231] "failed to reconcile AzureCluster" err=<
06-26 09:09:33.455  level=debug msg=	failed to reconcile AzureCluster service group: reconcile error that cannot be recovered occurred: resource is not Ready: The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found.: PUT https://management.azure.com/subscriptions/8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7/resourceGroups/jima26mag-9bqkl-rg
06-26 09:09:33.456  level=debug msg=	--------------------------------------------------------------------------------
06-26 09:09:33.456  level=debug msg=	RESPONSE 404: 404 Not Found
06-26 09:09:33.456  level=debug msg=	ERROR CODE: SubscriptionNotFound
06-26 09:09:33.456  level=debug msg=	--------------------------------------------------------------------------------
06-26 09:09:33.456  level=debug msg=	{
06-26 09:09:33.456  level=debug msg=	  "error": {
06-26 09:09:33.456  level=debug msg=	    "code": "SubscriptionNotFound",
06-26 09:09:33.456  level=debug msg=	    "message": "The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found."
06-26 09:09:33.456  level=debug msg=	  }
06-26 09:09:33.456  level=debug msg=	}
06-26 09:09:33.456  level=debug msg=	--------------------------------------------------------------------------------
06-26 09:09:33.456  level=debug msg=	. Object will not be requeued
06-26 09:09:33.456  level=debug msg= > logger="controllers.AzureClusterReconciler.reconcileNormal" controller="azurecluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster" AzureCluster="openshift-cluster-api-guests/jima26mag-9bqkl" namespace="openshift-cluster-api-guests" reconcileID="f2ff1040-dfdd-4702-ad4a-96f6367f8774" x-ms-correlation-request-id="d22976f0-e670-4627-b6f3-e308e7f79def" name="jima26mag-9bqkl"
06-26 09:09:33.457  level=debug msg=I0625 21:09:31.992215   22172 recorder.go:104] "failed to reconcile AzureCluster: failed to reconcile AzureCluster service group: reconcile error that cannot be recovered occurred: resource is not Ready: The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found.: PUT https://management.azure.com/subscriptions/8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7/resourceGroups/jima26mag-9bqkl-rg\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 Not Found\nERROR CODE: SubscriptionNotFound\n--------------------------------------------------------------------------------\n{\n  \"error\": {\n    \"code\": \"SubscriptionNotFound\",\n    \"message\": \"The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found.\"\n  }\n}\n--------------------------------------------------------------------------------\n. Object will not be requeued" logger="events" type="Warning" object={"kind":"AzureCluster","namespace":"openshift-cluster-api-guests","name":"jima26mag-9bqkl","uid":"20bc01ee-5fbe-4657-9d0b-7013bd55bf96","apiVersion":"infrastructure.cluster.x-k8s.io/v1beta1","resourceVersion":"1115"} reason="ReconcileError"
06-26 09:17:40.081  level=debug msg=I0625 21:17:36.066522   22172 helpers.go:516] "returning early from secret reconcile, no update needed" logger="controllers.reconcileAzureSecret" controller="ASOSecret" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster" AzureCluster="openshift-cluster-api-guests/jima26mag-9bqkl" namespace="openshift-cluster-api-guests" name="jima26mag-9bqkl" reconcileID="2df7c4ba-0450-42d2-901e-683de399f8d2" x-ms-correlation-request-id="b2bfcbbe-8044-472f-ad00-5c0786ebbe84"
06-26 09:23:46.611  level=debug msg=Collecting applied cluster api manifests...
06-26 09:23:46.611  level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: infrastructure is not ready: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
06-26 09:23:46.611  level=info msg=Shutting down local Cluster API control plane...
06-26 09:23:46.612  level=info msg=Stopped controller: Cluster API
06-26 09:23:46.612  level=warning msg=process cluster-api-provider-azure exited with error: signal: killed
06-26 09:23:46.612  level=info msg=Stopped controller: azure infrastructure provider
06-26 09:23:46.612  level=warning msg=process cluster-api-provider-azureaso exited with error: signal: killed
06-26 09:23:46.612  level=info msg=Stopped controller: azureaso infrastructure provider
06-26 09:23:46.612  level=info msg=Local Cluster API system has completed operations
06-26 09:23:46.612  [ERROR] Installation failed with error code '4'. Aborting execution.

From the above log, the Azure Resource Management API endpoint is not correct: the endpoint "management.azure.com" is for the Azure Public cloud, while the expected one for Azure Government is "management.usgovcloudapi.net".
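For illustration, a minimal sketch (not the installer's actual code) of how the Azure SDK for Go derives the ARM endpoint from the cloud configuration passed in the client options, so that Azure Government requests go to management.usgovcloudapi.net instead of management.azure.com. Mapping the install-config cloud name (e.g. "AzureUSGovernmentCloud") to the SDK constant here is an assumption.

package main

import (
	"fmt"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore/arm"
	"github.com/Azure/azure-sdk-for-go/sdk/azcore/cloud"
	"github.com/Azure/azure-sdk-for-go/sdk/azcore/policy"
)

// armOptions picks the cloud configuration that ARM clients should use.
func armOptions(cloudName string) *arm.ClientOptions {
	cfg := cloud.AzurePublic
	if cloudName == "AzureUSGovernmentCloud" {
		cfg = cloud.AzureGovernment
	}
	return &arm.ClientOptions{ClientOptions: policy.ClientOptions{Cloud: cfg}}
}

func main() {
	opts := armOptions("AzureUSGovernmentCloud")
	// Prints the Government ARM endpoint (management.usgovcloudapi.net).
	fmt.Println(opts.Cloud.Services[cloud.ResourceManager].Endpoint)
}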

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-2024-06-23-145410

How reproducible:

    Always

Steps to Reproduce:

    1. Install cluster on Azure Government Cloud, capi-based installation 
    2.
    3.
    

Actual results:

    Installation failed because the wrong Azure Resource Management API endpoint was used.

Expected results:

    Installation succeeded.

Additional info:

    

Description of problem:

    CAPZ creates an empty route table during installs

Version-Release number of selected component (if applicable):

4.17    

How reproducible:

    Very

Steps to Reproduce:

    1.Install IPI cluster using CAPZ
    2.
    3.
    

Actual results:

    Empty route table created and attached to worker subnet

Expected results:

    No route table created

Additional info:

    

Epic Goal*

There was an epic / enhancement to create a cluster-wide TLS config that applies to all OpenShift components:

https://issues.redhat.com/browse/OCPPLAN-4379
https://github.com/openshift/enhancements/blob/master/enhancements/kube-apiserver/tls-config.md

For example, this is how KCM sets --tls-cipher-suites and --tls-min-version based on the observed config:

https://issues.redhat.com/browse/WRKLDS-252
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/506/files

The cluster admin can change the config based on their risk profile, but if they don't change anything, there is a reasonable default.

We should update all CSI driver operators to use this config. Right now we have a hard-coded cipher list in library-go. See OCPBUGS-2083 and OCPBUGS-4347 for background context.
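As a rough illustration (a sketch, not the library-go implementation), the mapping from the cluster-wide TLSSecurityProfile to sidecar flags could look like the snippet below, using the profile definitions shipped in openshift/api. The flag names mirror the KCM example above; how the arguments are wired into each CSI driver operator is assumed. Note that the profile ciphers are in OpenSSL format, and real operators convert them to IANA names before passing them to Kubernetes components.

package main

import (
	"fmt"
	"strings"

	configv1 "github.com/openshift/api/config/v1"
)

// tlsArgs returns --tls-cipher-suites and --tls-min-version style arguments
// for the given profile, falling back to Intermediate when nothing is set.
func tlsArgs(profile *configv1.TLSSecurityProfile) []string {
	spec := configv1.TLSProfiles[configv1.TLSProfileIntermediateType]
	if profile != nil {
		if profile.Type == configv1.TLSProfileCustomType && profile.Custom != nil {
			spec = &profile.Custom.TLSProfileSpec
		} else if s, ok := configv1.TLSProfiles[profile.Type]; ok {
			spec = s
		}
	}
	return []string{
		// Ciphers are OpenSSL names here; convert to IANA names for kube components.
		"--tls-cipher-suites=" + strings.Join(spec.Ciphers, ","),
		"--tls-min-version=" + string(spec.MinTLSVersion),
	}
}

func main() {
	fmt.Println(tlsArgs(&configv1.TLSSecurityProfile{Type: configv1.TLSProfileModernType}))
}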

 
Why is this important? (mandatory)

This will keep the cipher list consistent across many OpenShift components. If the default list is changed, we get that change "for free".

It will reduce support calls from customers and backport requests when the recommended defaults change.

It will provide flexibility to the customer, since they can set their own TLS profile settings without requiring code change for each component.

 
Scenarios (mandatory) 

As a cluster admin, I want to use TLSSecurityProfile to control the cipher list and minimum TLS version for all CSI driver operator sidecars, so that I can adjust the settings based on my own risk assessment.

 
Dependencies (internal and external) (mandatory)

None, the changes we depend on were already implemented.

 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation - 
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be "Release Pending" 

We want to stop building the kube-proxy image out of the openshift-sdn repo, and start building it out of the openshift/kubernetes repo along with the other kubernetes binaries.

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

openshift-sdn is no longer part of OCP in 4.17, so remove references to it in the networking APIs.

Consider whether we can remove the entire network.openshift.io API, which will now be no-ops.

In places where both sdn and ovn-k are supported, remove references to sdn.

In some places (notably the migration API), we will probably leave an API in place that currently has no purpose.

Why is this important?

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be Linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Goal:

As an administrator, I would like to use my own managed DNS solution instead of only specific openshift-install supported DNS services (such as AWS Route53, Google Cloud DNS, etc...) for my OpenShift deployment.

 

Problem:

While cloud-based DNS services provide convenient hostname management, there's a number of regulatory (ITAR) and operational constraints customers face prohibiting the use of those DNS hosting services on public cloud providers.

 

Why is this important:

  • Provides customers with the flexibility to leverage their own custom managed ingress DNS solutions already in use within their organizations.
  • Required for regions like AWS GovCloud in which many customers may not be able to use the Route53 service (only for commercial customers) for both internal or ingress DNS.
  • OpenShift managed internal DNS solution ensures cluster operation and nothing breaks during updates.

 

Dependencies (internal and external):

 

Prioritized epics + deliverables (in scope / not in scope):

  • Ability to bootstrap cluster without an OpenShift managed internal DNS service running yet
  • Scalable, cluster (internal) DNS solution that's not dependent on the operation of the control plane (in case it goes down)
  • Ability to automatically propagate DNS record updates to all nodes running the DNS service within the cluster
  • Option for connecting cluster to customers ingress DNS solution already in place within their organization

 

Estimate (XS, S, M, L, XL, XXL):

 

Previous Work:

 

Open questions:

 

Link to Epic: https://docs.google.com/document/d/1OBrfC4x81PHhpPrC5SEjixzg4eBnnxCZDr-5h3yF2QI/edit?usp=sharing

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Update Installer changes to accompany the move to Installation with CAPG

Why is this important?

  • https://issues.redhat.com/browse/CORS-2460 was completed when installation on GCP used terraform. Now, with the removal of terraform-based installation and the move to CAPI-based installation, some of the previously completed tasks need to be revisited and re-implemented.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
  1. Grab the Network field of GCPClusterStatus
  2. Within the Network struct, grab the APIServerForwardingRule and APIInternalForwardingRule fields
  3. Each of these fields is of type ForwardingRule, which in turn contains the IP address of the LB (see the sketch after this list)
  4. Verify the accuracy of this IP address by calling this method even when custom-dns is not configured, and compare the IP address extracted by this method with the DNS configuration
  5. Use existing methods to add the above IP address to the Infra CR within the bootstrap Ignition
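A minimal sketch of steps 1-3, following the field layout described in the list above. The actual CAPG (cluster-api-provider-gcp) type and field names may differ between versions, so the ForwardingRule, APIServerForwardingRule, and APIInternalForwardingRule shapes below are assumptions taken from the steps, not the real API definitions.

package main

import "fmt"

// Assumed shapes, mirroring the description in the steps above.
type ForwardingRule struct {
	IPAddress string
}

type GCPNetworkStatus struct {
	APIServerForwardingRule   *ForwardingRule
	APIInternalForwardingRule *ForwardingRule
}

type GCPClusterStatus struct {
	Network GCPNetworkStatus
}

// apiIntLBIP returns the internal API load balancer IP from the cluster status,
// which would then be written into the Infra CR in the bootstrap Ignition.
func apiIntLBIP(status GCPClusterStatus) (string, error) {
	rule := status.Network.APIInternalForwardingRule
	if rule == nil || rule.IPAddress == "" {
		return "", fmt.Errorf("internal API forwarding rule not populated yet")
	}
	return rule.IPAddress, nil
}

func main() {
	st := GCPClusterStatus{Network: GCPNetworkStatus{
		APIInternalForwardingRule: &ForwardingRule{IPAddress: "10.0.0.2"},
	}}
	ip, err := apiIntLBIP(st)
	fmt.Println(ip, err)
}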

Currently, the `master.ign` contains the URL from which to download the actual Ignition. On cloud platforms, this value is:
"source":"https://api-int.<cluster domain>:22623/config/master"

Update this value with the API-Int LB IP when custom-dns is enabled on the GCP platform. 
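A minimal sketch of that URL rewrite, assuming the surrounding installer plumbing: when custom DNS is enabled on GCP, the pointer Ignition "source" uses the api-int load balancer IP instead of the api-int.<cluster domain> hostname. The URL shape is taken from the text above; the function name is illustrative.

package main

import "fmt"

// masterIgnitionSource returns the pointer Ignition "source" URL, switching to
// the api-int LB IP when custom DNS is in use.
func masterIgnitionSource(clusterDomain, apiIntLBIP string, customDNS bool) string {
	if customDNS && apiIntLBIP != "" {
		return fmt.Sprintf("https://%s:22623/config/master", apiIntLBIP)
	}
	return fmt.Sprintf("https://api-int.%s:22623/config/master", clusterDomain)
}

func main() {
	fmt.Println(masterIgnitionSource("example.openshift.com", "10.0.0.2", true))
}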

 

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Feature Overview

This feature is to track automation in ODC, related packages, upgrades and some tech debts

Goals

  • Improve automation for Pipelines dynamic plugins
  • Improve automation for OpenShift Developer console
  • Move cypress script into frontend to make it easier to approve changes
  • Update to latest PatternFly QuickStarts

Requirements

  • TBD
Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | No

 

Questions to answer…

  • Is there overlap with what other teams at RH are already planning?  No overlap

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

This won't impact documentation and this feature is to mostly enhance end to end test and job runs on CI

Assumptions

  • ...

Customer Considerations

  • No direct impact to customer

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Problem:

Improving existing tests in CI to run more tests

Goal:

Why is it important?

Use cases:

  1. Improving test execution to get more tests run on CI

Acceptance criteria:

  1. <criteria>

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Incomplete Features

When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release

Epic Goal

  • Rename `local-cluster` in RHACM.

Why is this important?

  • Customers have found it confusing to see the `local-cluster` as a hardcoded object in their ACM clusters list. 
    • They have not complained about the fact that it is there, but rather just the name of it.
  • In particular, as the architecture of RHACM evolves to include a global Hub of Hubs, the management of sub-hubs ("leaf hubs") will become problematic if we start to see numerous managed sub-hubs, all with the same name `local-cluster`, being imported to the global hub.

Scenarios

  1. Customer installs RHACM
  2. Customer sees local-cluster in the all clusters list
  3. Customer can rename local-cluster as needed

Alternate scenario

  1. Customer installs RHACM
  2. customer sees the management hub in the all clusters list with a unique cluster ID, not a user-configurable name
  3. Customer cannot rename local-cluster as needed; instead they could use a label to indicate some colloquial nickname 

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. Too many to accurately list at this point, but we need to consider every component, every part of RHACM.

Previous Work (Optional):

Open questions:

  1. Should the local-cluster object be a standardized unique cluster ID? or should it be user configurable?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR> 

Slack Channel

#acm-1290-rename-local-cluster

Feature goal (what are we trying to solve here?)

Remove hard-coded local-cluster from the import local cluster feature and verify that we don't use it in the infrastructure operator  

DoD (Definition of Done)

Testing the import local cluster and checking the behavior after the upgrade.

Does it need documentation support?

Yes.

Feature origin (who asked for this feature?)

  • A Customer asked for it - No
  • A solution architect asked for it - No
  • Internal request

Reasoning (why it's important?)

  • behavior change in ACM

Competitor analysis reference

  • Do our competitors have this feature?
    • No

Feature usage (do we have numbers/data?)

  • Not relevant

Feature availability (why should/shouldn't it live inside the UI/API?)

  • Not relevant to UI

Presently the name of the local cluster is hardwired to "local-cluster" in the local cluster import tool.
Since it is possible to redefine the name of the "local-cluster" in ACM, the correct local-cluster name needs to be picked up and used by the ManagedCluster.

Suggested approach

1: Obtain the correct "local-cluster" name from the ManagedCluster CR that has been labelled as "local-cluster" (see the sketch below)
2: Use this name to import the local cluster, and annotate the created AgentServiceConfig, ClusterDeployment and InfraEnv as a "local cluster"
3: Handle any updates to the ManagedCluster to keep the name in sync
4: During deletion of local cluster CRs, this annotation may be used to identify the CRs to be deleted
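A minimal sketch of step 1, assuming (per assumption 2 below) that ACM marks the hub's own ManagedCluster with a "local-cluster" label whose value is "true": list ManagedClusters by that label and use the CR name instead of the hardwired "local-cluster" string. This is a function sketch meant to live inside a controller; the label value is an assumption.

package localcluster

import (
	"context"
	"fmt"

	clusterv1 "open-cluster-management.io/api/cluster/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// localClusterName returns the name of the ManagedCluster labelled as the
// local cluster, or an error if none (or more than one) is found.
func localClusterName(ctx context.Context, c client.Client) (string, error) {
	list := &clusterv1.ManagedClusterList{}
	if err := c.List(ctx, list, client.MatchingLabels{"local-cluster": "true"}); err != nil {
		return "", err
	}
	if len(list.Items) != 1 {
		return "", fmt.Errorf("expected exactly one local ManagedCluster, found %d", len(list.Items))
	}
	return list.Items[0].Name, nil
}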

This leaves an edge case: an AgentServiceConfig, ClusterDeployment and InfraEnv will be "left behind" for any users who renamed their ManagedCluster and then performed an upgrade to this new version. Those users will need to manually remove these CRs. (I will discuss further with ACM to determine a suitable course of action here.)

This makes the following assumptions, which should also be checked with the ACM team.

1: ACM users may rename their "local-cluster" in ACM (meaning that we should pick this change up)
2: ACM will use the label "local-cluster" in the ManagedCluster CR to signify a local cluster
3: There will only be one "local-cluster" in ACM (note that it's possible to add a label arbitrarily so this may not be properly enforceable.)

Requirement description:

As a VM Admin, I want to improve overall density. In our traditional VM environments, we find that we are memory bound much more than CPU bound. Even with properly sized VMs, we see a lot of memory just sitting around allocated to the VM but not actually used. Moreover, we always see people requesting VMs that are sized way too big for their workloads. It is better customer service to allow this to some degree and then recover the memory at the hypervisor level.

MVP:

  • Move SWAP to beta (OCP TP)
  • Dashboard for monitoring
  • Make sure the scheduler sees the real memory available, rather than that allocated to the VMs.

Documents:

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Prometheus query for UI:
sum by (instance)(((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) + (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes)) / node_memory_MemTotal_bytes) *100

In human words: this approximates how much memory overcommitment is taking place. A value of 100 means RAM+SWAP usage is 100% of system RAM capacity; 105% means RAM+SWAP usage is 105% of system RAM capacity.

Threshold: Yellow 95%, Red 105%
Based on: https://docs.google.com/document/d/1AbR1LACNMRU2QMqFpe-Se2mCEFLMqW_M9OPKh2v3yYw,

https://docs.google.com/document/d/1E1joajwxQChQiDVTsr9Qk_iIhpQkSI-VQP-o_BMx8Aw

 

Provide a simple way to get a VM-friendly networking setup, without having to configure the underlying physical network.

Goal

Provide a network solution working out of the box, meeting expectations of a typical VM workload.

User Stories

  • As an owner of a VM that is connected only to a secondary overlay network, I want to fetch resources from outside networks (internet).
  • As a developer migrating my VMs to OCP, I do not want to change my application to support multiple NICs.
  • My application needs access to a flat network connecting it to other VMs and Pods.
  • I want to expose my selected applications over the network to users outside the cluster.
  • I'm limited by public cloud networking restrictions and I rely on their LoadBalancer to route traffic to my applications.
  • As a developer who defined a custom primary network in their project,
    I want to connect my VM to this new primary network, so it can utilize it for east/west/north/south, while still being able to connect to KAPI.

Non-Requirements

  • Service mesh integration is not a part of this
  • Seamless live-migration is not a must
  • UI integration is tracked in CNV-46603

Notes

  • porting the persistent IPs tests from u/s to d/s
  • ensure these run in ovn-kubernetes ocp repo as presubmit job
  • gather feedback to graduate the PersistentIPsForVirtualization feature gate to GA

Goal

Primary user-defined networks can be managed from the UI and the user flow is seamless.

User Stories

  • As a cluster admin,
    I want to use the UI to define a ClusterUserDefinedNetwork, assigned with a namespace selector.
  • As a project admin,
    I want to use the UI to define a UserDefinedNetwork in my namespace.
  • As a project admin,
    I want to be queried to create a UserDefinedNetwork before I create any Pods/VMs in my new project.
  • As a project admin running VMs in a namespace with UDN defined,
    I expect the "pod network" to be called "user-defined primary network",
    and I expect that when using it, the proper network binding is used.
  • As a project admin,
    I want to use the UI to request a specific IP for my VM connected to UDN.

UX doc

https://docs.google.com/document/d/1WqkTPvpWMNEGlUIETiqPIt6ZEXnfWKRElBsmAs9OVE0/edit?tab=t.0#heading=h.yn2cvj2pci1l

Non-Requirements

  • <List of things not included in this epic, to alleviate any doubt raised during the grooming process.>

Notes

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Description

“In order to have the same UX/UI in the dev and admin perspectives, we as the Observability UI Team need to reuse the dashboards coming from the monitoring plugin”

Goals & Outcomes

Product Requirements:

  • The dev console dashboards are loaded from the monitoring plugin

Background

The admin console's alert details page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.

The dev console page displays fewer dashboards than the admin version of the page, so that difference will need to be supported by monitoring-plugin.

Outcomes

  • The dev console page for dashboards is loaded from monitoring-plugin and the code for the page is removed from the console codebase.
  • The dev console version of the page has the project selector dropdown, but the admin console page doesn't, so monitoring-plugin will need to be changed to support that difference.
  • We need to check when fetching dashboards that the dev and admin dashboards are fetched from the right endpoint
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Background

The admin console's silences page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.

Outcomes

  • The dev console silences tab is loaded from the monitoring plugin
  • The dev console silences detail is loaded from the monitoring plugin
  • The dev console silences creation is loaded from the monitoring plugin
  • The code for the silences is removed from the console codebase
  • The dev console version of the page has the project selector dropdown, but the admin console page doesn't, so monitoring-plugin will need to be changed to support that difference.

Ensure removal of deprecated patternfly components from kebab-dropdown.tsx and alerting.tsx once this story and OU-561 are completed.

Proposed title of this feature request

Fleet / Multicluster Alert Management User Interface

What is the nature and description of the request?

Large enterprises are drowning in cluster alerts.

side note: Just within my demo RHACM Hub environment, across 12 managed clusters (OCP, SNO, ARO, ROSA, self-managed HCP, xKS), I have 62 alerts being reported! And I have no idea what to do about them!

Customers need the ability to interact with alerts in a meaningful way, to leverage a user interface that can filter, display, multi-select, sort, etc. To multi-select and take actions, for example:

  • alert filter state is warning
  • clusters filter is label environment=development
  • multi-select this result set
  • take action to Silence the alerts!

Why does the customer need this? (List the business requirements)

Platform engineering (sys admin; SRE etc) must maintain the health of the cluster and ensure that the business applications are running stable. There might indeed be another tool and another team which focuses on the Application health itself, but for sure the platform team is interested to ensure that the platform is running optimally and all critical alerts are responded to.

As of TODAY, what the customer must do is perform alert management via CLI. This is tedious, ad-hoc, and error prone. see blog link

The requirements are:

  • filtering fleet alerts
  • multiselect for actions like silence
  • as a bonus, configuring alert forwarding will be amazing to have.

List any affected packages or components.

OCP console Observe dynamic plugin

ACM Multicluster observability (MCO operator)

Description

"In order to provide ACM with the same monitoring capabilities OCP has, we as the Observability UI Team need to allow the monitoring plugin to be installed and work in ACM environments."

Goals & Outcomes

Product Requirements:

  • Be able to install the monitoring plugin without CMO, use COO
  • Allow the monitoring plugin to use a different backend endpoint to fetch alerts; ACM has its own alert manager
  • Add a column to the alerts list to display the cluster that originated the alert
  • Include only the alerting parts which include the alerts list, alert detail and silences

UX Requirements:

  • Align UX text and patterns between ACM concepts (hub cluster, spoke cluster, core operators) and the current monitoring plugin

Open Questions

  • Do the current monitoring plugin and the ACM monitoring plugin need to coexist in a cluster?
  • Do we need to connect to a different prometheus/thanos, or is it just a different alert manager?

Background

In order for ACM to reuse the monitoring plugin, the plugin needs to connect to a different alert manager. It also needs to contain a new column in the alerts list to show the source cluster these alerts are generated from.

 

Check the ACM documentation around alerts for reference: https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.11/html/observability/observing-environments-intro#observability-arch

Outcomes

  • The monitoring plugin can discover alert managers present in the cluster
  • If multiple alert managers are discovered, the plugin should display a dropdown to select an alert manager to connect to; if no alert manager is discovered, the plugin should fall back to the in-cluster one
  • The monitoring plugin can connect to a specific alert manager, to create silences
  • The monitoring plugin can connect to a specific prometheus rules endpoint, to read alerts
  • Add a new column that displays which cluster the alert is coming from

Steps

  1. Use the backend API to list the alert managers that are different from the in-cluster one
  2. If there is only one alert manager, make sure to inform the user which one is selected
  3. On selecting an alert manager, all requests for creating silences and silencing alerts should be targeted to the selected alert manager (see the sketch below)
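A minimal sketch of step 3, assuming the selected alert manager exposes the standard Alertmanager v2 API: silence creation is POSTed to the base URL the user picked in the dropdown rather than to the in-cluster endpoint. The exact routes and proxying used by the monitoring plugin backend are not defined here.

package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type matcher struct {
	Name    string `json:"name"`
	Value   string `json:"value"`
	IsRegex bool   `json:"isRegex"`
}

type silence struct {
	Matchers  []matcher `json:"matchers"`
	StartsAt  time.Time `json:"startsAt"`
	EndsAt    time.Time `json:"endsAt"`
	CreatedBy string    `json:"createdBy"`
	Comment   string    `json:"comment"`
}

// createSilence posts a silence to the Alertmanager selected by the user.
func createSilence(ctx context.Context, alertmanagerBaseURL string, s silence) error {
	body, err := json.Marshal(s)
	if err != nil {
		return err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		alertmanagerBaseURL+"/api/v2/silences", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	return nil
}

func main() {
	// Illustrative only: the base URL comes from the dropdown selection.
	_ = createSilence(context.Background(), "https://alertmanager.example.com", silence{})
	fmt.Println("silence request sketched")
}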

 

Background

In order to include the monitoring plugin image so it can be installed using COO, we need to adjust COO Konflux

Outcomes

  • COO midstream includes the monitoring plugin image
  • COO build configuration includes the monitoring plugin

Background

In order to enable/disable features for monitoring in different OpenShift flavors, the monitoring plugin should support feature flags

Outcomes

  • The monitoring plugin with the Go backend can be deployed with CMO and the image is built correctly by the ART team

Placeholder feature for ccx-ocp-core maintenance tasks.

This epic tracks "business as usual" requirements / enhancements / bug fixing of the Insights Operator.

 Description of problem:

The Insights operator should replace the %s in https://console.redhat.com/api/gathering/v2/%s/gathering_rules error messages, like this failed-to-bootstrap example:

$ jq -r .content osd-ccs-gcp-ad-install.log | sed 's/\\n/\n/g' | grep 'Cluster operator insights'
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights ClusterTransferAvailable is False with Unauthorized: failed to pull cluster transfer: OCM API https://api.openshift.com/api/accounts_mgmt/v1/cluster_transfers/?search=cluster_uuid+is+%REDACTED%27+and+status+is+%27accepted%27 returned HTTP 401: REDACTED"
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights Disabled is False with AsExpected: "
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights RemoteConfigurationAvailable is False with HttpStatus401: received HTTP 401 Unauthorized from https://console.redhat.com/api/gathering/v2/%s/gathering_rules"
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights RemoteConfigurationValid is Unknown with NoValidationYet: "
time="2024-09-05T08:12:51Z" level=info msg="Cluster operator insights SCAAvailable is False with Unauthorized: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 401: REDACTED
level=info msg=Cluster operator insights ClusterTransferAvailable is False with Unauthorized: failed to pull cluster transfer: OCM API https://api.openshift.com/api/accounts_mgmt/v1/cluster_transfers/?search=cluster_uuid+is+%27REDACTED%27+and+status+is+%27accepted%27 returned HTTP 401: REDACTED
level=info msg=Cluster operator insights Disabled is False with AsExpected: 
level=info msg=Cluster operator insights RemoteConfigurationAvailable is False with HttpStatus401: received HTTP 401 Unauthorized from https://console.redhat.com/api/gathering/v2/%s/gathering_rules
level=info msg=Cluster operator insights RemoteConfigurationValid is Unknown with NoValidationYet: 
level=info msg=Cluster operator insights SCAAvailable is False with Unauthorized: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 401: REDACTED
level=info msg=Cluster operator insights UploadDegraded is True with NotAuthorized: Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired: {\"errors\":[{\"meta\":{\"response_by\":\"gateway\"},\"detail\":\"UHC services authentication failed\",\"status\":401}]}

Version-Release number of selected component

Seen in 4.17 RCs. Also in this comment.

How reproducible

Unknown

Steps to Reproduce:

Unknown.

Actual results:

ClusterOperator conditions talking about https://console.redhat.com/api/gathering/v2/%s/gathering_rules

Expected results

URIs we expose in customer-oriented messaging should not have %s placeholders.

Additional detail

Seems like the template is coming in as conditionalGathererEndpoint here. Seems like insights-operator#964 introduced the %s, but I'm not finding the logic that's supposed to populate that placeholder.

Description of problem:

When the Insights Operator is disabled (as described in the docs here or here), the RemoteConfigurationAvailable and RemoteConfigurationValid clusteroperator conditions report the previous state (from before disabling the gathering), which might be Available=True and Valid=True.

 
Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Disable the data gathering in the Insights Operator following the docs links above
    2. Watch the clusteroperator conditions with "oc get co insights -o json | jq .status.conditions"
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

The rapid recommendations enhancement defines this built-in configuration to be used when the operator cannot reach the remote endpoint.

The issue is that the built-in configuration (though currently empty) is not taken into account - i.e. the data requested in the built-in configuration is not gathered.
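A minimal sketch of the intended fallback behavior: if the remote gathering-configuration endpoint cannot be reached, the operator should fall back to the built-in (embedded) configuration instead of gathering nothing. The function and type names below are illustrative, not the operator's actual code.

package main

import (
	"context"
	"fmt"
)

type gatheringConfig struct {
	ContainerLogRequests []string // simplified stand-in for the real configuration
}

// builtInConfig is the embedded default (currently empty upstream).
var builtInConfig = gatheringConfig{}

// remoteConfig fetches the remote configuration; here it always fails to
// illustrate the unreachable-endpoint case.
func remoteConfig(ctx context.Context, endpoint string) (gatheringConfig, error) {
	return gatheringConfig{}, fmt.Errorf("endpoint %q unreachable", endpoint)
}

// effectiveConfig returns the remote configuration when available and the
// built-in configuration otherwise, so the built-in one is actually honoured.
func effectiveConfig(ctx context.Context, endpoint string) gatheringConfig {
	cfg, err := remoteConfig(ctx, endpoint)
	if err != nil {
		return builtInConfig
	}
	return cfg
}

func main() {
	fmt.Printf("%+v\n", effectiveConfig(context.Background(), "https://console.redhat.com/..."))
}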

With the rapid recommendations feature (enhancement) one can request various messages from Pods matching various Pod name regular expressions

The problem is when there is a Pod (e.g. foo-1 from the example below) matching more than one requested Pod name regex:

{
    'namespace': 'test-namespace',
    'pod_name_regex': 'foo-.*',
    'messages': ['regex1', 'regex2']
},
{
    'namespace': 'test-namespace',
    'pod_name_regex': 'foo-1',
    'messages': ['regex3', 'regex4']
}

Assume Pods with names foo-1 and foo-bar. Currently all the regexes (regex1, regex2, regex3, regex4) are filtered for both Pods.

The desired behavior is that foo-1 is filtered with all the regexes, but foo-bar is filtered only with regex1 and regex2.
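A minimal sketch of that desired matching behavior: each Pod is checked against every pod_name_regex entry, and only the message regexes from the entries that actually match that Pod's name are applied to it. The struct shape mirrors the example configuration above; the field and function names are assumptions, not the operator's real types.

package main

import (
	"fmt"
	"regexp"
)

type containerLogRequest struct {
	Namespace    string
	PodNameRegex string
	Messages     []string
}

// messagesForPod returns the union of message regexes from all requests whose
// pod_name_regex matches the given Pod name.
func messagesForPod(requests []containerLogRequest, namespace, podName string) []string {
	var out []string
	for _, r := range requests {
		if r.Namespace != namespace {
			continue
		}
		re, err := regexp.Compile(r.PodNameRegex)
		if err != nil {
			continue // skip invalid expressions in this sketch
		}
		if re.MatchString(podName) {
			out = append(out, r.Messages...)
		}
	}
	return out
}

func main() {
	reqs := []containerLogRequest{
		{Namespace: "test-namespace", PodNameRegex: "foo-.*", Messages: []string{"regex1", "regex2"}},
		{Namespace: "test-namespace", PodNameRegex: "foo-1", Messages: []string{"regex3", "regex4"}},
	}
	fmt.Println(messagesForPod(reqs, "test-namespace", "foo-1"))   // [regex1 regex2 regex3 regex4]
	fmt.Println(messagesForPod(reqs, "test-namespace", "foo-bar")) // [regex1 regex2]
}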

Goal:
Track Insights Operator Data Enhancements epic in 2024

 

 

 

 

Description

We can remove all the hardcoded container log gatherers (except the conditionals) in favor of Rapid Recommendations approach. They can be remove in the 4.18 version

Context:

As we discussed in INSIGHTOCP-1814, it's a good candidate that can help customers fix issues caused by too many unused MachineConfigs.

Required Data:

The total number of MachineConfigs in the cluster and the number of unused MachineConfigs in the cluster.
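A minimal sketch of how such a gatherer could compute those two numbers; the exact definition of "unused" here is an assumption: a MachineConfig counts as used when it is referenced by any MachineConfigPool, either as the pool's rendered config or as one of its source configs.

package gatherers

import (
	"context"

	mcfgv1 "github.com/openshift/api/machineconfiguration/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// machineConfigCounts returns (total, unused) MachineConfig counts.
func machineConfigCounts(ctx context.Context, c client.Client) (int, int, error) {
	mcs := &mcfgv1.MachineConfigList{}
	if err := c.List(ctx, mcs); err != nil {
		return 0, 0, err
	}
	pools := &mcfgv1.MachineConfigPoolList{}
	if err := c.List(ctx, pools); err != nil {
		return 0, 0, err
	}

	// Mark rendered configs and their sources as "used".
	used := map[string]bool{}
	for _, p := range pools.Items {
		used[p.Spec.Configuration.Name] = true
		for _, src := range p.Spec.Configuration.Source {
			used[src.Name] = true
		}
		used[p.Status.Configuration.Name] = true
		for _, src := range p.Status.Configuration.Source {
			used[src.Name] = true
		}
	}

	unused := 0
	for _, mc := range mcs.Items {
		if !used[mc.Name] {
			unused++
		}
	}
	return len(mcs.Items), unused, nil
}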

Backports:

To the OCP versions we support.

 

Proposed title of this feature request

Container scanner aims to gather the data necessary for business analytics of the usage of the RH MW portfolio in the live fleet.

 

The request includes assistance with on-boarding the container scanner and help bringing it up to Insights Operator standards. GA quality requires performance and scalability QE on top of functional testing alone.

 

Enhancement proposal tracked at: https://github.com/openshift/enhancements/pull/1584/files

 

Feature Overview (aka. Goal Summary)

As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision IBM Cloud VPC infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

  • The IBM Cloud VPC IPI Installer no longer contains or uses Terraform.
  • The new provider should aim to provide the same results and have parity with the existing IBM Cloud VPC Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

Feature Overview (Goal Summary)

Hosted Control Planes and HyperShift provide consumers with a different architectural path to OpenShift that aligns best with their multi-cluster deployment needs. However, today’s API surface area in HCP remains “like a box of chocolates, you never know what you're gonna get” ~ Forrest Gump. Some of it is only gated best-effort via the `hcp` CLI (which is suboptimal).

The goal of this feature is to build a standard for communicating features that are GA/Preview. This would allow us:

  • To experiment while setting the right expectations. 
  • Prompt what we deem tested/stable. 
  • Simplify our test matrix and smooth the documentation process.

This can be done following the guidelines in the FeatureGate FAQ. For example, by introducing a structured system of feature gates in our hosted control plane API, such that features are categorized into 'on-by-default', 'accessible-by-default', 'inaccessible-by-default or TechPreviewNoUpgrade', and 'Tech Preview', we would be ensuring clarity, compliance, and a smooth development and user experience.

Requirements (Acceptance Criteria)

  • Feature Categorization: Ability to categorize API features according to OpenShift's guidelines (e.g., DevPreview/TechPreview/GA).
  • Backward Compatibility: Ensures backward compatibility.
  • Upgrade Path: Clear upgrade paths for 'accessible-by-default' features
  • Documentation:  documentation for each category of feature gates.

 Additional resources

There are other teams (e.g., the assisted installer) following a structured pattern for gating features:

Currently there's no rigorous technical mechanism to feature-gate functionality or APIs in HyperShift.
We defer to docs, which results in bad UX, consumer confusion, and a maintainability burden.

We should have a technical implementation that allows features and APIs to run only behind a flag.

Feature Overview

As a cluster-admin, I want to run updates in discrete steps and update the control plane and worker nodes independently.
I also want to back up and restore in case of a problematic upgrade.

 

Background:

This Feature is a continuation of https://issues.redhat.com/browse/OCPSTRAT-180.
Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This is a feature tracking the epics required to get that work done. Below is the list of completed tasks.

  1. OTA-700 Reduce False Positives (such as Degraded) 
  2. OTA-922 - Better able to show the progress made in each discrete step 
  3. [Covered by status command] Better visibility into any errors during the upgrades and documentation of what the errors mean and how to recover. 

Goals

  1. Have an option to do upgrades in more discrete steps under admin control. Specifically, these steps are: 
    • Control plane upgrade
    • Worker nodes upgrade
    • Workload enabling upgrade (i.e. Router, other components) or infra nodes
  2. A user experience around an end-2-end back-up and restore after a failed upgrade 
  3. MCO-530 - Support in Telemetry for the discrete steps of upgrades 

References

Epic Goal

  • Eliminate the gap between measured availability and Available=true

Why is this important?

  • Today it's not uncommon, even for CI jobs, to have multiple operators which blip through either Degraded=True or Available=False conditions
  • We should assume that if our CI jobs do this then when operating in customer environments with higher levels of chaos things will be even worse
  • We have had multiple customers express that they've pursued rolling back upgrades because the cluster is telling them that portions of the cluster are Degraded or Unavailable when they're actually not
  • Since our product is self-hosted, we can reasonably expect that the instability that we experience on our platform workloads (kube-apiserver, console, authentication, service availability), will also impact customer workloads that run exactly the same way: we're just better at detecting it.

Scenarios

  1. In all of the following, assume standard 3 master 0 worker or 3 master 2+ worker topologies
  2. Add/update CI jobs which ensure 100% Degraded=False and Available=True for the duration of upgrade
  3. Add/update CI jobs which measure availability of all components which are not explicitly defined as non-HA (ex: metal's DHCP server is singleton)
  4. Address all identified issues

Acceptance Criteria

  • openshift/enhancements CONVENTIONS outlines these requirements
  • CI - Release blocking jobs include these new/updated tests
  • Release Technical Enablement - N/A; if we do this, we should need no docs
  • No outstanding identified issues

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. Clayton, David, and Trevor identified many issues early in 4.8 development but were unable to ensure all teams addressed them. That list is in this query; teams will be asked to address everything on this list as a 4.9 blocker+ bug, and we will re-evaluate status closer to 4.9 code freeze to see which may be deferred to 4.10:
    https://bugzilla.redhat.com/buglist.cgi?columnlist=product%2Ccomponent%2Cassigned_to%2Cbug_severity%2Ctarget_release%2Cbug_status%2Cresolution%2Cshort_desc%2Cchangeddate&f1=longdesc&f2=cf_environment&j_top=OR&list_id=12012976&o1=casesubstring&o2=casesubstring&query_based_on=ClusterOperator%20conditions&query_format=advanced&v1=should%20not%20change%20condition%2F&v2=should%20not%20change%20condition%2F

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • DEV - Tests in place
  • DEV - No outstanding failing tests
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

These are alarming conditions which may frighten customers, and we don't want to see them in our own, controlled, repeatable update CI. This example job had logs like:

Feb 18 21:11:25.799 E clusteroperator/openshift-apiserver changed Degraded to True: APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()

And the job failed, but none of the failures were "something made openshift-apiserver mad enough to go Degraded".

Definition of done:

  • Same as OTA-362
  • File bugs or reference the existing issues
  • If a bug exists, then add the tests to the exception list.
  • Unless tests are in the exception list, they should fail if we see Degraded != False.

BU Priority Overview

Enable installation and lifecycle support of OpenShift 4 on Oracle Cloud Infrastructure (OCI) Bare metal

Goals

  • Validating OpenShift on OCI baremetal to make it officially supported. 
  • Enable installation of OpenShift 4 on OCI bare metal using Assisted Installer.
  • Provide published installation instructions for how to install OpenShift on OCI baremetal
  • OpenShift 4 on OCI baremetal can be updated, resulting in a cluster and applications that are in a healthy state when the update is completed.
  • Telemetry reports back on clusters using OpenShift 4 on OCI baremetal for connected OpenShift clusters (e.g. platform=external or none + some other indicator to know it's running on OCI baremetal).

Use scenarios

  • As a customer, I want to run OpenShift Virtualization on OpenShift running on OCI baremetal.
  • As a customer, I want to run Oracle BRM on OpenShift running on OCI baremetal.

Why is this important

  • Customers who want to move from on-premises to Oracle cloud baremetal
  • OpenShift Virtualization is currently only supported on baremetal

Requirements

 

Requirements and notes:

  • OCI Bare Metal Shapes must be certified with RHEL. Notes: It must also work with RHCOS (see iSCSI boot notes) as OCI BM standard shapes require RHCOS iSCSI to boot (OCPSTRAT-1246). Certified shapes: https://catalog.redhat.com/cloud/detail/249287
  • Successfully passing the OpenShift Provider conformance testing – this should be fairly similar to the results from the OCI VM test results. Oracle will do these tests.
  • Updating Oracle Terraform files
  • Making the Assisted Installer modifications needed to address the CCM changes and surface the necessary configurations. Notes: Support Oracle Cloud in Assisted-Installer CI: MGMT-14039

 

RFEs:

  • RFE-3635 - Supporting Openshift on Oracle Cloud Infrastructure(OCI) & Oracle Private Cloud Appliance (PCA)

OCI Bare Metal Shapes to be supported

Any bare metal Shape to be supported with OCP has to be certified with RHEL.

From the certified Shapes, those that have local disks will be supported. This is due to the current lack of support in RHCOS for the iSCSI boot feature. OCPSTRAT-749 is tracking adding this support and removing this restriction in the future.

As of Aug 2023 this excludes at least all the Standard shapes, BM.GPU2.2 and BM.GPU3.8, from the published list at: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#baremetalshapes 

Assumptions

  • Pre-requisite: RHEL certification which includes RHEL and OCI baremetal shapes (instance types) has successfully completed.


Feature goal (what are we trying to solve here?)

Please describe what this feature is going to do.

DoD (Definition of Done)

Please describe what conditions must be met in order to mark this feature as "done".

Does it need documentation support?

If the answer is "yes", please make sure to check the corresponding option.

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Name of the customer(s)
    • How many customers asked for it?
    • Can we have a follow-up meeting with the customer(s)?

 

  • A solution architect asked for it

    • Name of the solution architect and contact details
    • How many solution architects asked for it?
    • Can we have a follow-up meeting with the solution architect(s)?

 

  • Internal request

    • Who asked for it?

 

  • Catching up with OpenShift

Reasoning (why it’s important?)

  • Please describe why this feature is important
  • How does this feature help the product?

Competitor analysis reference

  • Do our competitors have this feature?
    • Yes, they have it and we can have some reference
    • No, it's unique or explicit to our product
    • No idea. Need to check

Feature usage (do we have numbers/data?)

  • We have no data - the feature doesn’t exist anywhere
  • Related data - the feature doesn’t exist but we have info about the usage of associated features that can help us
    • Please list all related data usage information
  • We have the numbers and can relate to them
    • Please list all related data usage information

Feature availability (why should/shouldn't it live inside the UI/API?)

  • Please describe the reasoning behind why it should/shouldn't live inside the UI/API
  • If it's for a specific customer we should consider using AMS
  • Does this feature exist in the UI of other installers?

To make iSCSI work, a secondary VNIC must be configured during discovery and when the machine reboots into CoreOS. The configuration is almost the same for discovery and CoreOS.

Currently, we have one script owned by Red Hat for discovery, and a custom manifest owned by Oracle for CoreOS configuration.

I think this configuration should be owned by Oracle because the network configuration depends on the OCI API. Also, we need this script to be the same in order to ensure that the configuration applied on discovery will be the same when the machine reboots into CoreOS. Finally, if a customer has a specific need, they won't be able to tailor the configuration to their needs easily, as they would have to use the REST API of the assisted service.

My suggestion is to ask Oracle to drop the configuration script into their metadata service using Oracle's Terraform template. On the Red Hat side, we would have to pull this script onto the node and execute it via a systemd unit. The same would be done from the custom manifest provided by Oracle.

Feature goal (what are we trying to solve here?)

During 4.15, the OCP team is working on allowing booting from iSCSI. Today that's disabled by the assisted installer. The goal is to enable that for OCP version >= 4.15 when using the OCI external platform.

DoD (Definition of Done)

iSCSI boot is enabled for OCP version >= 4.15 both in the UI and the backend. 

When booting from iSCSI, we need to make sure to add the `rd.iscsi.firmware=1 ip=ibft` kargs during install to enable iSCSI booting.
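
For reference, outside of the Assisted Installer flow, a generic way to persist such kernel arguments on OpenShift nodes is a MachineConfig. A minimal sketch (this is not the Assisted Installer's internal mechanism; the name and role label are illustrative):

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-master-iscsi-kargs          # illustrative name
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  kernelArguments:
    - rd.iscsi.firmware=1              # read the iSCSI firmware/iBFT configuration
    - ip=ibft                          # configure networking from the iBFT table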

Does it need documentation support?

yes

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Oracle

Reasoning (why it’s important?)

  • In OCI there are bare metal instances with iSCSI support and we want to allow customers to use it.

  PR https://github.com/openshift/assisted-service/pull/6257 must be adapted to be used along with the external platform.

Since we ensure that the iSCSI network is not the default route, the PR above will automatically select the subnet used by the default route.

The secondary VNIC must be configured manually in OCI; a script must be injected into the discovery ISO to configure it.

Feature Overview

We are planning to support 5-node control planes to cover a set of active-active failure domains for OpenShift control planes (see OCPSTRAT-1199).

The Agent-Based Installer is required to enable this setup on day-1.

For additional context of the 5-node and 2-node control plane model please read:

Feature Overview

We are planning to support 4/5-node control planes to cover a set of active-active failure domains for OpenShift control planes (see OCPSTRAT-1199).

Assisted Installer must support this new topology too.

For additional context of the 5-node and 2-node control plane model please read:

Currently, in HA clusters, assisted-service enforces exactly 3 control planes. This issue should change this behaviour to enable 3-5 control planes instead. It was decided in https://redhat-internal.slack.com/archives/G01A5NB3S6M/p1728296942806519?thread_ts=1727250326.825979&cid=G01A5NB3S6M that there will be no fail mechanism to continue with the installation in case one of the control planes fails to install. This issue should also align assisted-service behaviour with marking control planes as schedulable if there are fewer than 2 workers in the cluster, and not otherwise. It should also align assisted-service behaviour with failing the installation if the user asked for at least 2 workers and got fewer.

Epic Goal

  • Enable users to install a 4/5-node control plane on day 1.

Why is this important?

  • Users would like more resilient clusters that are deployed across two sites in 3+2 or 2+2 deployments

Scenarios

  1. In a 2 + 1 compact cluster deployment across two sites, the failure of the site with 2 control plane nodes would leave a single control plane node to be used for recovery. Having an extra node on the surviving site would give better odds that the cluster can be recovered in the event another node fails.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Users can set control plane replicas to 4 or 5.
  • Any number of worker nodes can be added to a control plane with 4 or 5 replicas.

Dependencies (internal and external)

  1. https://issues.redhat.com/browse/MGMT-18588
  2. https://issues.redhat.com/browse/OCPSTRAT-1199

Previous Work (Optional):

  1. N/A

Open questions::

  1. Should this be made available only for the baremetal platform? 

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a user, I want to be able to:

  • configure my install-config.yaml with either 4 or 5 control plane replicas

so that I can achieve

  • install a cluster in day 1 with either 4 or 5 control plane nodes
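
A minimal install-config.yaml sketch for this story; the cluster name, domain, and platform stanza are placeholders, and only the control plane replica count is the point of interest:

apiVersion: v1
baseDomain: example.com                # placeholder
metadata:
  name: example-cluster                # placeholder
controlPlane:
  name: master
  replicas: 5                          # 4 or 5 control plane nodes on day 1
compute:
  - name: worker
    replicas: 2                        # any number of workers can still be added
platform:
  baremetal: {}                        # illustrative; platform-specific fields omitted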

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Integration test showing that adding 4 and 5 control plane nodes does not result in an error

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This does not require a design proposal.
This does not require a feature gate.

Feature Overview (aka. Goal Summary)  

Support network isolation and multiple primary networks (with the possibility of overlapping IP subnets) without having to use Kubernetes Network Policies.

Goals (aka. expected user outcomes)

  • Provide a configurable way to indicate that a pod should be connected to a unique network of a specific type via its primary interface.
  • Allow networks to have overlapping IP address space.
  • The primary network defined today will remain in place as the default network that pods attach to when no unique network is specified.
  • Support cluster ingress/egress traffic for unique networks, including secondary networks.
  • Support for ingress/egress features where possible, such as:
    • EgressQoS
    • EgressService
    • EgressIP
    • Load Balancer Services

Requirements (aka. Acceptance Criteria):

  • Support for 10,000 namespaces
  •  

Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to-be-supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Design Document

Use Cases (Optional):

  • As an OpenStack or vSphere/vCenter user, who is migrating to OpenShift Kubernetes, I want to guarantee my OpenStack/vSphere tenant network isolation remains intact as I move into Kubernetes namespaces.
  • As an OpenShift Kubernetes user, I do not want to have to rely on Kubernetes Network Policy and prefer to have native network isolation per tenant using a layer 2 domain.
  • As an OpenShift Network Administrator with multiple identical application deployments across my cluster, I require a consistent IP-addressing subnet per deployment type. Multiple applications in different namespaces must always be accessible using the same, predictable IP address.

Questions to Answer (Optional):

  •  

Out of Scope

  • Multiple External Gateway (MEG) Support - support will remain for default primary network.
  • Pod Ingress support - support will remain for default primary network.
  • Cluster IP Service reachability across networks. Services and endpoints will be available only within the unique network.
  • Allowing different service CIDRs to be used in different networks.
  • Localnet will not be supported initially for primary networks.
  • Allowing multiple primary networks per namespace.
  • Allow connection of multiple networks via explicit router configuration. This may be handled in a future enhancement.
  • Hybrid overlay support on unique networks.

Background

OVN-Kubernetes today allows multiple different types of networks per secondary network: layer 2, layer 3, or localnet. Pods can be connected to different networks without discretion. For the primary network, OVN-Kubernetes only supports all pods connecting to the same layer 3 virtual topology.

As users migrate from OpenStack to Kubernetes, there is a need to provide network parity for those users. In OpenStack, each tenant (analog to a Kubernetes namespace) by default has a layer 2 network, which is isolated from any other tenant. Connectivity to other networks must be specified explicitly as network configuration via a Neutron router. In Kubernetes the paradigm is the opposite; by default all pods can reach other pods, and security is provided by implementing Network Policy.

Network Policy has its issues:

  • it can be cumbersome to configure and manage for a large cluster
  • it can be limiting as it only matches TCP, UDP, and SCTP traffic
  • large amounts of network policy can cause performance issues in CNIs

With all these factors considered, there is a clear need to address network security in a native fashion, by using networks per user to isolate traffic instead of using Kubernetes Network Policy.

Therefore, the scope of this effort is to bring the same flexibility of the secondary network to the primary network and allow pods to connect to different types of networks that are independent of networks that other pods may connect to.
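
As an illustration only, a minimal sketch of selecting a unique primary layer 2 network for a namespace, assuming the OVN-Kubernetes UserDefinedNetwork API; the group/version, field names, and subnet here are assumptions and may differ from the final API:

apiVersion: k8s.ovn.org/v1
kind: UserDefinedNetwork
metadata:
  name: tenant-a-primary            # placeholder name
  namespace: tenant-a               # namespace whose pods attach to this network
spec:
  topology: Layer2
  layer2:
    role: Primary                   # becomes the pods' primary interface instead of the cluster default
    subnets:
      - 10.100.0.0/16               # may overlap with subnets used by other tenants' networks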

Customer Considerations

  •  

Documentation Considerations

  •  

Interoperability Considerations

Test scenarios:

  • E2E upstream and downstream jobs covering supported features across multiple networks.
  • E2E tests ensuring network isolation between OVN networked and host networked pods, services, etc.
  • E2E tests covering network subnet overlap and reachability to external networks.
  • Scale testing to determine limits and impact of multiple unique networks.

Feature Overview (aka. Goal Summary)  

Crun has been GA as a non-default runtime since OCP 4.14. We want to make it the default in 4.18 while still supporting runc as a non-default option.

Benefits of Crun is covered here https://github.com/containers/crun 

 

FAQ.:  https://docs.google.com/document/d/1N7tik4HXTKsXS-tMhvnmagvw6TE44iNccQGfbL_-eXw/edit

***Note -> making crun the default does not mean we will remove support for runc, nor do we have any plans in the foreseeable future to do that  
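
For instance, a cluster admin who wants to keep runc on a pool after crun becomes the default could do so with a ContainerRuntimeConfig. A minimal sketch, assuming the existing defaultRuntime knob and the standard worker pool label:

apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: keep-runc-on-workers           # illustrative name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    defaultRuntime: runc               # explicitly select runc instead of the crun default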

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

Per OCPSTRAT-1278, we want to support OCP on the C3 instance type (baremetal) in order to enable OCP virt on GCP. The C3 instance type supports hyperdisk-balanced disks.

The goal is to validate that our GCP CSI operator can deploy the driver on C3 baremetal nodes and function as expected.

As OCP virt requires RWX to support VM live migration, we need to make sure the driver works with this access mode and with Block volume mode.
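
A minimal sketch of the kind of claim that needs to work for this, assuming the upstream GCP PD CSI driver and its hyperdisk-balanced disk type parameter (names and sizes are placeholders):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hyperdisk-balanced             # placeholder name
provisioner: pd.csi.storage.gke.io
parameters:
  type: hyperdisk-balanced             # assumed driver disk type parameter
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-disk-rwx                    # placeholder name
spec:
  storageClassName: hyperdisk-balanced
  accessModes:
    - ReadWriteMany                    # RWX, required for VM live migration
  volumeMode: Block                    # raw block volume for the VM disk
  resources:
    requests:
      storage: 50Gi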

 
Why is this important? (mandatory)

Product level priority to enabled OCP virt on GCP. Multiple customers are waiting for this solution. See OCPSTRAT-1278 for additional details.

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. As a customer, I want to run OpenShift Virtualization on OpenShift running on GCP baremetal instance types.

 
Dependencies (internal and external) (mandatory)

PD CSI driver to support baremetal / C3 instance type

PD CSI driver to support block RWX

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - STOR
  • Documentation - VIRT / Installer
  • QE - STOR
  • PX - 
  • Others -

Acceptance Criteria (optional)

GCP PD CSI on C3 nodes passes the regular CSI tests + RWX with volumeType block. Actual VM live migration tests will be done by the virt team.

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be "Release Pending" 

Feature Overview

iSCSI boot is supported in RHEL and since the implementation of OCPSTRAT-749 it's also available in RHCOS.

Customers require using this feature in different bare metal environments on-prem and cloud-based.

Assisted Installer implements support for it in Oracle Cloud Infrastructure (MGMT-16167) to support their bare metal standard "shapes".

This feature extends this support to make it generic and supported in the Agent-Based Installer, the Assisted Installer and in ACM/MCE.

Goals

Support iSCSI boot in bare metal nodes, including platform baremetal and platform "none".

Requirements

Assisted installer can boot and install OpenShift on nodes with iSCSI disks.

Agent-Based Installer can boot and install OpenShift on nodes with iSCSI disks.

MCE/ACM can boot and install OpenShift on nodes with iSCSI disks.

The installation can be done on clusters with platform baremetal and clusters with platform "none".

Epic Goal

Support booting from iSCSI using ABI starting with OCP 4.16.

 

The following PRs are the gaps between release-4.17 branch and master that are needed to make the integration work on 4.17.

https://github.com/openshift/assisted-service/pull/6665

https://github.com/openshift/assisted-service/pull/6603

https://github.com/openshift/assisted-service/pull/6661

 

The feature has to be backported to 4.16 as well. TBD - list all the PRs that have to be backported.

 

Instructions to test the AI feature with local env - https://docs.google.com/document/d/1RnRhJN-fgofnVSBTA6mIKcK2_UW7ihbZDLGAVHSdpzc/edit#heading=h.bf4zg53460gu

Why is this important?

  • Oracle has a client with a disconnected environment waiting for it - slack discussion

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • iSCSI boot is enabled on ocp >= 4.16

Dependencies (internal and external)

  1.  https://issues.redhat.com/browse/MGMT-16167 - AI support for iSCSI boot for OCI
  2. https://issues.redhat.com/browse/MGMT-17556 - AI generic support for iSCSI boot

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Add new systemd services (already available in Assisted Service) into ABI to enable iSCSI boot

Feature Overview

Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone OpenShift

Prerequisite work: goals completed in OCPSTRAT-1122.
Complete the design of the Cluster API (CAPI) architecture and build the core operator logic needed for Phase 1, incorporating the assets from different repositories to simplify asset management.

Phases 1 & 2 cover implementing base functionality for CAPI.

Background, and strategic fit

  • Initially CAPI did not meet the requirements for cluster/machine management that OCP had; the project has moved on, and CAPI is a better fit now and also has better community involvement.
  • CAPI has much better community interaction than MAPI.
  • Other projects are considering using CAPI and it would be cleaner to have one solution
  • Long term it will allow us to add new features more easily in one place vs. doing this in multiple places.

Acceptance Criteria

There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.

Epic Goal

  • As we prepare to move over to using Cluster API (CAPI) we need to make sure that we have the providers in place to work with this. This Epic is to track the tech preview of the provider for Azure

Why is this important?

  • What are the benefits to the customer, or to us, that make this worth
    doing? Fulfills a critical need for a customer? Improves
    supportability/debuggability? Improves efficiency/performance? This
    section is used to help justify the priority of this item vs other things
    we can do.

Drawbacks

  • Reasons we should consider NOT doing this such as: limited audience for
    the feature, feature will be superseded by other work that is planned,
    resulting feature will introduce substantial administrative complexity or
    user confusion, etc.

Scenarios

  • Detailed user scenarios that describe who will interact with this
    feature, what they will do with it, and why they want/need to do that thing.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub
    Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub
    Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

As an OpenShift engineer I want the CAPI Providers repositories to use the new generator tool so that they can independently generate CAPI Provider transport ConfigMaps

Background

Once the new CAPI manifests generator tool is ready, we want to make use of that directly from the CAPI Providers repositories so we can avoid storing the generated configuration centrally and independently apply that based on the running platform.

Steps

  • Install new CAPI manifest generator as a go `tool` to all the CAPI provider repositories
  • Setup a make target under the `/openshift/Makefile` to invoke the generator. Make it output the manifests under `/openshift/manifests`
  • Make sure `/openshift/manifests` is mapped to `/manifests` in the openshift/Dockerfile, so that the files are later picked up by CVO
  • Make sure the manifest generation works by triggering a manual generation
  • Check in the newly generated transport ConfigMap + Credential Requests (to let them be applied by CVO)

Stakeholders

  • <Who is interested in this/where did they request this>

Definition of Done

  • CAPI manifest generator tool is installed 
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Feature Overview (aka. Goal Summary)  

Sigstore image verification for namespace  

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

As an OpenShift developer, I want to implement an MCO ctrcfg runtime controller watching the ImagePolicy resources. The controller will update the sigstore verification file that CRI-O's --signature-policy-dir option uses for namespaced policies.
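
For illustration, a minimal sketch of a namespaced ImagePolicy that such a controller would react to; the API version and field names follow the tech-preview sigstore image verification API as understood at the time of writing and may differ, and the scope and key are placeholders:

apiVersion: config.openshift.io/v1alpha1
kind: ImagePolicy
metadata:
  name: example-sigstore-policy        # placeholder name
  namespace: example-namespace         # namespaced policy, rendered into the per-namespace CRI-O policy dir
spec:
  scopes:
    - quay.io/example/app              # placeholder image scope
  policy:
    rootOfTrust:
      policyType: PublicKey
      publicKey:
        keyData: <base64-encoded cosign public key>   # placeholder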

Goal

The goals of this feature are:

  • As part of a Microsoft guideline/requirement for implementing ARO HCP, we need to design a shared ingress to the kube-apiserver, because MSFT has internal restrictions on IPv4 usage.  

Background

Given Microsoft's constraints on IPv4 usage, there is a pressing need to optimize IP allocation and management within Azure-hosted environments.

 

Interoperability Considerations

  • Impact: Which versions will be impacted by the changes?
  • Test Scenarios: Must test across various network and deployment scenarios to ensure compatibility and scale (perf/scale)

There are currently multiple ingress strategies we support for hosted cluster service endpoints (kas, nodePort, router...).
In a context of uncertainty about which use cases would be more critical to support, we initially exposed this in a flexible API that enables choosing potentially any combination of ingress strategies and endpoints.
ARO has internal restrictions on IPv4 usage. Because of this, to simplify the above and to be more cost effective in terms of infra, we'd want to have a common shared ingress solution for the whole fleet of hosted clusters.

As a management cluster owner I want to make sure the shared ingress is resilient to cluster failures

User Story:

Currently the SharedIngress controller waits for a HostedCluster to exist before creating the Service/LoadBalancer of the shared-ingress.

The controller should create the Service/LoadBalancer even when no HostedCluster exists yet.

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Feature Overview 

Introduce the CAPI provider for bare metal in OpenShift as an alternative and long term replacement of MAPI for managing nodes and clusters.

Goals

Technology Preview release, introducing the current implementation of the Cluster API Provider Metal3 (CAPM3) into OpenShift.

https://github.com/metal3-io/cluster-api-provider-metal3 

Goal

Our goal is to be able to deploy baremetal clusters using Cluster API in Openshift. 

Upstream

Metal3, our upstream community, already provides a CAPI provider, and our aim is to bring it downstream. 

Other

We will collaborate with the Cluster Infrastructure team on points of integration as needed.

Scope questions

  • Changes in Ironic?
    • No
  • Changes in Metal3?
    • Bringing in downstream-only MAPI changes to upstream CAPI
  • Changes in OpenShift?
    • Create and set up the CAPI repo
    • Bring in any useful changes from MAPI
  • Spec/Design/Enhancements?
    • Not for this
    • But any follow-up work (replacing BM MAPI with CAPI or doing Hypershift) likely will
  • Dependencies on other teams?
    • Maybe ART?

Feature Overview

Firmware (BIOS) updates and attributes configuration from OpenShift are key in O-RAN clusters. While this can be done on day 1, customers also need to set firmware attributes on hosts that have already been deployed and are part of a cluster.

This feature adds the capability of updating firmware attributes and updating the firmware image for hosts in deployed clusters.

As part of demoing our integration with hardware vendors, we need to show the ability to reconfigure already provisioned hosts: modify their BIOS settings and, in the future, do firmware upgrades. The initial demo will be concentrated on BIOS settings. The demo is expected to be based on 4.15 and to use unmerged patches since 4.15 is closed for feature development. The path to productization will be determined as an outcome of the demo.

The assumed end result is an ability to run firmware upgrades and update BIOS settings for hosts that are already provisioned without fully deprovisioning them. The hosts will still be rebooted, so some external orchestrator (a human or ZTP) will need to drain the nodes first.
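
For context, BIOS attributes are already expressed declaratively through the Metal3 HostFirmwareSettings resource; a minimal sketch follows (attribute names and values are vendor-specific placeholders), with this feature being about honouring such changes for hosts that are already provisioned:

apiVersion: metal3.io/v1alpha1
kind: HostFirmwareSettings
metadata:
  name: worker-0                       # must match the corresponding BareMetalHost name
  namespace: openshift-machine-api
spec:
  settings:
    LogicalProc: Disabled              # vendor-specific BIOS attribute (placeholder)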

Feature Overview (aka. Goal Summary)  

  • With this next-gen OLM GA release (graduated from ‘Tech Preview’), customers can: 
    • discover collections of k8s extension/operator contents released in the FBC format with richer visibility into their release channels, versions, update graphs, and the deprecation information (if any) to make informed decisions about installation and/or update them.
    • install a k8s extension/operator declaratively and potentially automate with GitOps to ensure predictable and reliable deployments.
    • update a k8s extension/operator to a desired target version or keep it updated within a specific version range for security fixes without breaking changes.
    • remove a k8s extension/operator declaratively and entirely including cleaning up its CRDs and other relevant on-cluster resources (with a way to opt out of this coming up in a later release).
  • To address the security needs of 30% of our customers who run clusters in disconnected environments, the GA release will include cluster extension lifecycle management functionality for offline environments.
  • [Tech Preview] (Cluster)Extension lifecycle management can handle runtime signature validation for container images to support OpenShift’s integration with the rising Sigstore project for secure validation of cloud-native artifacts.

Goals (aka. expected user outcomes)

1. Pre-installation:

  • Customers can access a collection of k8s extension contents from a set of default catalogs leveraging the existing catalog images shipped with OpenShift (in the FBC format) with the new Catalog API from the OLM v1 GA release.
  • With the new GAed Catalog API, customers get richer package content visibility in their release channels, versions, update graphs, and the deprecation information (if any) to help make informed decisions about installation and/or update.
  • With the new GAed Catalog API, customers can render the catalog content in their clusters with fewer resources in terms of CPU and memory usage and faster performance.
  • Customers can filter the available packages based on the package name and see the relevant information from the metadata shipped within the package. 

2. Installation:

  • Customers using a ServiceAccount with sufficient permissions can install a k8s extension/operator with a desired target version or the latest version within a specific version range (from the associated channel) to get the latest security fixes.
  • Customers can easily automate the installation flow declaratively with GitOps to ensure predictable and reliable deployments.
  • Customers get protection from having two conflicting k8s extensions/operators owning the same API objects, i.e., no conflicting ownership, ensuring cluster stability.
  • Customers can access the metadata of the installed k8s extension/operator to see essential information such as its provided APIs, example YAMLs of its provided APIs, descriptions, infrastructure features, valid subscriptions, etc.
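
As an illustration of the declarative install flow described above, a minimal ClusterExtension sketch; the exact field names are assumptions based on the upstream operator-controller API around this release and may differ, and the package name, namespace, and ServiceAccount are placeholders:

apiVersion: olm.operatorframework.io/v1
kind: ClusterExtension
metadata:
  name: example-operator                  # placeholder name
spec:
  namespace: example-operator-ns          # namespace the bundle is installed into
  serviceAccount:
    name: example-operator-installer      # ServiceAccount with sufficient permissions
  source:
    sourceType: Catalog
    catalog:
      packageName: example-operator       # package from a ClusterCatalog
      version: ">=1.0.0 <2.0.0"           # pin a version or a range for safe updates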

3. Update:

  • Customers can see what updates are available for their k8s extension/operators in the form of immediate target versions and the associated update channels.
  • Customers can trigger the update of a k8s extension/operator with a desired target version or the latest version within a specific version range (from the associated channel) to get the latest security fixes.
  • Customers get protection from workload or k8s extension/operator breakage due to CustomResourceDefinition (CRD) being upgraded to a backward incompatible version during an update.
  • During OpenShift cluster update, customers get informed when installed k8s extensions/operators do not support the next OpenShift version (when annotated by the package author/provider). Customers must update those k8s extensions/operators to a newer/compatible version before OLM unblocks the OpenShift cluster update. 

4. Uninstallation/Deletion:

  • Customers can cleanly remove an installed k8s extension/operator including deleting CustomResourceDefinitions (CRDs), custom resource objects (CRs) of the CRDs, and other relevant resources to revert the cluster to its original state before the installation declaratively.

5. Disconnected Environments for High-Security Workloads:

  • Approximately 30% of our customers prioritize high security by running their clusters in internet-disconnected environments, especially for mission-critical production workloads. To benefit these users, our supported GA release needs to include cluster extension lifecycle management functionality that functions within these disconnected environments.

6. [Tech Preview] Signature Validation for Secure Workflows:

  • The Red Hat-sponsored Sigstore project is gaining traction in the Kubernetes community, aiming to simplify the signing of cloud-native artifacts. OpenShift leverages Sigstore tooling to enable scalable and flexible signature validation, including support for disconnected environments. This functionality will be available as a Tech Preview in 4.17 and is targeted for General Availability (GA) Tech Preview Phase 2 in the upcoming 4.18 release. To fully support this integration as a Tech Preview release, the (cluster)extension lifecycle management needs to (be prepared to) handle runtime validation of Sigstore signatures for container images.

Requirements (aka. Acceptance Criteria):

All the expected user outcomes and the acceptance criteria in the engineering epics are covered.

Background

OLM: Gateway to the OpenShift Ecosystem

Operator Lifecycle Manager (OLM) has been a game-changer for OpenShift Container Platform (OCP) 4.  Since its launch in 2019, OLM has fostered a rich ecosystem, expanding from a curated set of 25 operators to over 100 officially supported Red Hat operators and hundreds more from certified ISVs and the community.

OLM empowers users to manage diverse technologies with ease, including ACM, ACS, Quay, GitOps, Pipelines, Service Mesh, Serverless, and Virtualization.  It has also facilitated the introduction of groundbreaking operators for entirely new workloads, like Nvidia GPU, PTP, Windows Machine Config, SR-IOV networking, and more.  Today, a staggering 91% of our connected customers leverage OLM's capabilities.

OLM v0: A Stepping Stone

While OLM v0 has been instrumental, it has limitations.  The API design, not fully GitOps-friendly or entirely declarative, presents a steeper learning curve due to its complexity.  Furthermore, OLM v0 was designed with the assumption of namespace-scoped CRDs (Custom Resource Definitions), allowing for independent operator installations and parallel versions within a single cluster.  However, this functionality never materialized in core Kubernetes, and OLM v0's attempt to simulate it has introduced limitations and bugs.

The Operator Framework Team: Building the Future

The Operator Framework team is the cornerstone of the OpenShift ecosystem.  They build and manage OLM, the Operator SDK, operator catalog formats, and tooling (opm, file-based catalogs).  Their work directly impacts how operators are developed, packaged, delivered, and managed by users and SRE teams on OpenShift clusters.

A Streamlined Future with OLM v1

The Operator Framework team has undergone significant restructuring to focus on the next generation of OLM – OLM v1.  This transition includes moving the Operator SDK to a feature-complete state with ongoing maintenance for compatibility with the latest Kubernetes and controller-runtime libraries.  This strategic shift allows the team to dedicate resources to completely revamping OLM's API and management concepts for catalog content delivery.  

Leveraging learnings and customer feedback since OCP 4's inception, OLM v1 is designed to be a major overhaul, and it will be shipped as a Generally Available (GA) feature in OpenShift 4.17.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.


Documentation Considerations

1. Pre-installation:

  • [GA release] Docs provide instructions on how to add Red Hat-provided Operator catalogs with the pull secret for catalogs hosted on a secure registry.
  • [GA release] Docs provide instructions on how to discover the Operator packages from a catalog.
  • [GA release] Docs provide instructions on how to query and inspect the metadata of Operator bundles and find feasible ones to be installed with the OLM v1.

2. Installation:

  • [GA release] Docs provide instructions on how to use a ServiceAccount with sufficient permissions to install a k8s extension/operator with a desired target version or the latest version within a specific version range to get the latest security fixes.
  • [GA release] Docs provide instructions on how to automate the installation flow declaratively with GitOps to ensure predictable and reliable deployments.
  • [GA release] Docs mention the OLM v1’s protection from having two conflicting k8s extensions/operators owning the same API objects, i.e., no conflicting ownership, ensuring cluster stability.
  • [GA release] Docs provide instructions on how to access the metadata of the installed k8s extension/operator to see essential information such as its provided APIs, example YAMLs of its provided APIs, descriptions, infrastructure features, valid subscriptions, etc.
  • [GA release] Docs explain how to create RBACs from a CRD to grant cluster users access to the installed k8s extension/operator's provided APIs.

3. Update:

  • [GA release] Docs provide instructions on how to see what updates are available for their k8s extension/operators in the form of immediate target versions and the associated update channels.
  • [GA release] Docs provide instructions on how to trigger the update of a k8s extension/operator with a desired target version or the latest version within a specific version range to get the latest security fixes.
  • [GA release] Docs mention OLM v1’s protection from workload or k8s extension/operator breakage due to CustomResourceDefinition (CRD) being upgraded to a backward incompatible version during an update.
  • [GA release] Docs mention OLM v1 will block the OpenShift cluster update if installed k8s extensions/operators do not support the next OpenShift version (when annotated by the package author/provider).  Provide instructions on how to find and update to a newer/compatible version before OLM unblocks the OpenShift cluster update.

4. Uninstallation/Deletion:

  • [GA release] Docs provide instructions on how to cleanly remove an installed k8s extension/operator including deleting CustomResourceDefinitions (CRDs), custom resource objects (CRs) of the CRDs, and other relevant resources.
  • [GA release] Docs provide instructions to verify the cluster has been reverted to its original state after uninstalling a k8s extension/operator

Relevant upstream CNCF OLM v1 requirements, engineering brief, and epics:

1. Pre-installation:

2. Installation:

3. Update:

4. Uninstallation/Deletion:

Relevant documents:

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • Today, OLM v0 ships with four catalogs: redhat-operators, certified-operators, community-operators and redhat-marketplace. Since catalogd does not know about the existence of the OLM v0 catalog sources, we need to expose those catalogs to the cluster by default once OLM v1 is GA.
  • The goal of this epic is to ensure that those four catalogs are available by default as Catalog objects so that the operator-controller can resolve and install content without additional user configuration.
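
For illustration, a minimal sketch of what one such default catalog object could look like; the API group/version and field names are assumptions based on the upstream catalogd API at this point and may differ, and the image reference is illustrative:

apiVersion: olm.operatorframework.io/v1
kind: ClusterCatalog
metadata:
  name: openshift-redhat-operators
spec:
  source:
    type: Image
    image:
      ref: registry.redhat.io/redhat/redhat-operator-index:v4.18   # existing OLM v0 catalog image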

Why is this important?

  • ...

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Typically, any non-deployment resource managed by cluster-olm-operator would be handled by a StaticResourceController (usage ref). Unfortunately, the StaticResourceController only knows how to handle specific types, as seen by the usage of the ApplyDirectly function in the StaticResourceController.Sync method. Due to the ApplyDirectly function only handling a set of known resources, the ClusterCatalog resource would likely not be handled the same as other static manifests currently managed by cluster-olm-operator.

In order to enable cluster-olm-operator to properly manage ClusterCatalog resources, it is proposed that we implement a custom factory.Controller that knows how to appropriately apply and manage ClusterCatalog resources such that:

  • Changes to any fields specified in the default ClusterCatalog resources are reverted to the default values
  • Changes to fields not specified in the default ClusterCatalog resources are left untouched

The openshift/library-go project has a lot of packages that will likely make this implementation pretty straightforward. The custom controller implementation will likely also require implementation of some pre-condition logic that ensures the ClusterCatalog API is available on the cluster before attempting to use it.

NOTE: All features will be tech-preview in the first release and then will graduate to GA in the next release or when they are ready for GA.

Epic Goal

  • OLM V1 supports disconnected Environments for High-Security Workloads

Why is this important?

  • A significant number of our customers prioritize high security by running their clusters in internet-disconnected environments, especially for mission-critical production workloads. To benefit these users, our supported GA release needs to include cluster extension lifecycle management functionality that functions within these disconnected environments.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • Currently a placeholder.

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Create a test that builds upon the catalogd happy path by creating a manifest image and then updating the ClusterCatalog to reference that image, then creating a ClusterExtension to deploy the manifests.

The status of the ClusterExtension should then be checked.

The manifests do not need to create a deployment; in fact, it would be better if the manifest included simpler resources such as a ConfigMap or Secret.

This will create the initial openshift/origin tests. This will consist of tests that ensure, while in tech-preview, that the ClusterExtension and ClusterCatalog APIs are present. This includes creating an OWNERS file that will make approving/reviewing future PRs easier.

Test 1:

  1. Create a Bundle with the following property:

apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  annotations:
    olm.properties: '[{"type": "olm.maxOpenShiftVersion", "value": "4.17"}]'

Note the value needs to be equal to the cluster version this is being tested on.

  2. Apply a ClusterExtension resource that installs the bundle
  3. Query the operator conditions to ensure that:
    • Upgradeable is set to False
    • Reason is “IncompatibleOperatorsInstalled”
    • Message is the name of the bundle

 

Test 2

Same as Test 1 but with two bundles. The Message should list the bundle names in alphabetical order.

 

Test 3

Apply a bundle without the annotation. Upgradeable should be True.
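
For reference, the condition these tests assert on would look roughly like the following on the OLM ClusterOperator status (the ClusterOperator name "olm" and the exact message wording are assumptions; the type, status, and reason come from Test 1 above):

status:
  conditions:
  - type: Upgradeable
    status: "False"
    reason: IncompatibleOperatorsInstalled
    message: <name(s) of the installed bundle(s), alphabetically ordered when there is more than one>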

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

https://docs.google.com/document/d/18m-OG0PN8-jjjgGT33WNujzmj_1B2Tqoqd-bVKX4CkE/edit?usp=sharing 

  • Many operators write the MaxOCPVersion field in their bundle metadata. OLM v1 needs to support the same MaxOCPVersion workflow, where OLM blocks a cluster upgrade when that version is set.
  • Outside the scope of this epic, but in a future iteration, we should also respect MinKubeVersion (and potentially support MaxKubeVersion?)

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
  1. cluster-olm-operator watches ClusterExtensions
  2. cluster-olm-operator queries the downstream-only Helm chart metadata in the release secrets of each installed operator
  3. cluster-olm-operator sets Upgradeable=False with the appropriate reason and message when the maxOCPVersion matches the current cluster version

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Once OLM v1.0.0 is feature complete and the team feels comfortable enabling it by default, we should remove the OLM v1 feature flag and deploy it on all clusters by default.
  • We should also introduce OLMv1 behind a CVO capability to give customers the option of leaving it disabled in their clusters.

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • OLMv1 is enabled by default in OCP
  • OLMv1 can be fully disabled at install/upgrade time using CVO capabilities

 

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
  1. We are encountering toleration misses in origin Azure tests which are preventing our components from stabilizing.
  2. cluster-olm-operator is spamming the API server with condition lastUpdateTimes.
  3. The disconnected environment in CI/origin differs from OLMv1 expectations (but we do feel that v1 disconnected functionality is getting enough validation elsewhere to be confident). Created OCPBUGS-44810 to align expectations of the disconnected environments.

 

Refactor cluster-olm-operator to use v1 of the OLM openshift/api/operator API

A/C:

 - cluster-olm-operator now uses OLM v1

 - OLM resource manifest updated to use v1

 - CI is green

OpenShift offers "capabilities" to allow users to select which components to include in the cluster at install time.

It was decided the capability name should be OperatorLifecycleManagerV1.
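
As an illustration of what opting in could look like at install time, using the standard install-config capabilities fields (the capability name comes from above; treat this as a sketch, not the final documented configuration):

# install-config.yaml excerpt
capabilities:
  baselineCapabilitySet: None
  additionalEnabledCapabilities:
  - OperatorLifecycleManagerV1

Omitting OperatorLifecycleManagerV1 from the enabled capabilities (with a baseline set that does not include it) is what would leave OLM v1 disabled.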

A/C:

 - ClusterVersion resource updated with OLM v1 capability
 - cluster-olm-operator manifests updated with capability.openshift.io/name=OperatorLifecycleManagerV1 annotation

Promote OLM API in the OpenShift API from v1alpha1 to v1 (see https://github.com/openshift/api/blob/master/operator/v1alpha1/types_olm.go#L1)

A/C:

 - openshift/api/operator/v1alpha1 OLM promoted to v1

 - openshift/api/operator/v1alpha1 OLM removed

 

 

 

As someone troubleshooting an OLMv1 issue with a cluster, I'd like to be able to see the state of cluster-olm-operator and the OLM resource, so that I can have all the information I need to fix the issue.

 

A/C:

 - must-gather contains cluster-olm-operator namespace and contained resources
 - must-gather contains OLM cluster scoped resource
 - if cluster-olm-operator fails before updating its ClusterOperator, I'd still want the cluster-olm-operator namespace, its resources, and the cluster-scoped OLM resource to be in the must-gather
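
A hedged sketch of gathering the same data by hand today, while the must-gather changes land (the namespace and resource names below are assumptions):

oc adm inspect ns/openshift-cluster-olm-operator --dest-dir=olm-gather
oc adm inspect olms.operator.openshift.io/cluster --dest-dir=olm-gather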

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

Why is this important?

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is implementable
  • No outstanding questions about major work breakdown
  • Are all stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox, which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to using the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Goal:
Update team owned repositories to Kubernetes v1.31

?? is the 1.31 freeze
?? is the 1.31 GA

Problem:<please update links for 1.31>
The following repository must be rebased onto the latest version of Kubernetes:

  1.  oc: https://github.com/openshift/oc/pull/1877

The following repositories should be rebased onto the latest version of Kubernetes:

  1. cluster-kube-controller-manager operator: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/816
  2. cluster-policy-controller: https://github.com/openshift/cluster-policy-controller/pull/156 
  3. cluster-kube-scheduler operator: https://github.com/openshift/cluster-kube-scheduler-operator/pull/547
  4. secondary-scheduler-operator: https://github.com/openshift/secondary-scheduler-operator/pull/225
  5. cluster-capacity: https://github.com/openshift/cluster-capacity/pull/97
  6.  run-once-duration-override-operator: https://github.com/openshift/run-once-duration-override-operator/pull/68
  7.  run-once-duration-override: https://github.com/openshift/run-once-duration-override/pull/36
  8.  cluster-openshift-controller-manager-operator: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/368 
  9.  openshift-controller-manager: https://github.com/openshift/openshift-controller-manager/pull/345 
  10.  cli-manager-operator: https://github.com/openshift/cli-manager-operator/pull/358
  11.  cli-manager: https://github.com/openshift/cli-manager/pull/144
  12. cluster-kube-descheduler-operator: https://github.com/openshift/cluster-kube-descheduler-operator/pull/384
  13. descheduler:

Entirely remove dependencies on k/k repository inside oc.

Why is this important:

  • Customers demand we provide the latest stable version of Kubernetes. 
  • The rebase and upstream participation represents a significant portion of the Workloads team's activity.

 
 
 
 

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Cluster Infrastructure owned components should be running on Kubernetes 1.31
  • This includes
    • The cluster autoscaler (+operator)
    • Machine API operator
      • Machine API controllers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cloud Controller Manager Operator
      • Cloud controller managers for:
        • AWS
        • Azure
        • GCP
        • vSphere
        • OpenStack
        • IBM
        • Nutanix
    • Cluster Machine Approver
    • Cluster API Actuator Package
    • Control Plane Machine Set Operator

Why is this important?

  • ...

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.18 release, dependencies need to be updated to 1.31. This should be done by rebasing/updating as appropriate for the repository

Epic Goal

  • The goal of this epic is to upgrade all OpenShift and Kubernetes components that MCO uses to v1.31, which will keep it on par with the rest of the OpenShift components and the underlying cluster version.

Why is this important?

  • Uncover any possible issues with the openshift/kubernetes rebase before it merges.
  • MCO continues using the latest Kubernetes/OpenShift libraries and the kubelet and kube-proxy components.
  • MCO e2e CI jobs pass on each of the supported platforms with the updated components.

Acceptance Criteria

  • All stories in this epic must be completed.
  • Go version is upgraded for MCO components.
  • CI is running successfully with the upgraded components against the 4.18/master branch.

Dependencies (internal and external)

  1. ART team creating the updated Go toolchain image needed for this upgrade.
  2. OpenShift/kubernetes repository downstream rebase PR merge.

Open questions::

  1. Do we need a checklist for future upgrades as an outcome of this epic? -> Yes, updated below.

Done Checklist

  • Step 1 - Upgrade go version to match rest of the OpenShift and Kubernetes upgraded components.
  • Step 2 - Upgrade Kubernetes client and controller-runtime dependencies (can be done in parallel with step 3)
  • Step 3 - Upgrade OpenShift client and API dependencies
  • Step 4 - Update kubelet and kube-proxy submodules in MCO repository
  • Step 5 - CI is running successfully with the upgraded components and libraries against the master branch.

Please review the following PR: https://github.com/openshift/machine-config-operator/pull/4561

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

User or Developer story

As an MCO developer, I want to pick up the openshift/kubernetes updates for the 1.31 Kubernetes rebase so that MCO tracks the same Kubernetes version as the rest of the OpenShift cluster on Kubernetes 1.31.

Engineering Details

  • Update the go.mod, go.sum, and vendor dependencies to point at the Kubernetes 1.31 libraries. This includes all direct Kubernetes-related libraries as well as openshift/api, openshift/client-go, openshift/library-go, and openshift/runtime-utils (a sketch follows the acceptance criteria below).

Acceptance Criteria:

  • All k8s.io related dependencies should be upgraded to 1.31.
  • openshift/api, openshift/client-go, openshift/library-go, and openshift/runtime-utils should be upgraded to the latest commit from the master branch.
  • All CI tests must be passing.
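
A sketch of the dependency bump described above (module versions and branches are assumptions; the real rebase would pin exact revisions):

go get k8s.io/api@v0.31.1 k8s.io/apimachinery@v0.31.1 k8s.io/client-go@v0.31.1
go get github.com/openshift/api@master github.com/openshift/client-go@master
go get github.com/openshift/library-go@master github.com/openshift/runtime-utils@master
go mod tidy && go mod vendor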

As part of our continuous improvement efforts, we need to update our Dockerfile to utilize the new multi-base images provided in OpenShift 4.18. The current Dockerfile is based on RHEL 8 and RHEL 9 builder images from OpenShift 4.17, and we want to ensure our builds are aligned with the latest supported images for multiple architectures.

Updating the RHEL 9 builder image to

registry.ci.openshift.org/ocp/builder:rhel-9-golang-1.22-builder-multi-openshift-4.18

Updating the RHEL 8 builder image to

registry.ci.openshift.org/ocp/builder:rhel-8-golang-1.22-builder-multi-openshift-4.18

Updating the base image to

registry.ci.openshift.org/ocp-multi/4.18-art-latest-multi:machine-config-operator

or specifying a different tag if we don't want to only do the MCO

Ensuring all references and dependencies in the Dockerfile are compatible with these new images.
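
An illustrative Dockerfile excerpt using the images listed above (the multi-stage layout and build steps are assumptions; only the image references come from this card):

FROM registry.ci.openshift.org/ocp/builder:rhel-9-golang-1.22-builder-multi-openshift-4.18 AS rhel9-builder
# ... build the RHEL 9 binaries ...

FROM registry.ci.openshift.org/ocp/builder:rhel-8-golang-1.22-builder-multi-openshift-4.18 AS rhel8-builder
# ... build the RHEL 8 binaries ...

FROM registry.ci.openshift.org/ocp-multi/4.18-art-latest-multi:machine-config-operator
# ... copy the artifacts from the builder stages ...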

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Cluster Infrastructure owned CAPI components should be running on Kubernetes 1.30
  • target is 4.18 since CAPI is always a release behind upstream

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

To align with the 4.18 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.18 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.18 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.18 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.18 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.18 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository

To align with the 4.18 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository

Epic Goal

  • The goal of this epic is to upgrade all OpenShift and Kubernetes components that CCO uses to v1.31, which keeps it on par with the rest of the OpenShift components and the underlying cluster version.

Why is this important?

  • To make sure that Hive imports of other OpenShift components do not break when those rebase
  • To avoid breaking other OpenShift components importing from CCO.
  • To pick up upstream improvements

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Dependencies (internal and external)

  1. Kubernetes 1.31 is released (August 2024)

Previous Work (Optional):

  1. Similar previous epic CCO-541

Done Checklist

  • CI - CI is running, tests are automated and merged.

Epic Goal*

Drive the technical part of the Kubernetes 1.31 upgrade, including rebasing the openshift/kubernetes repository and coordinating across the OpenShift organization to get e2e tests green for the OCP release.

 
Why is this important? (mandatory)

OpenShift 4.18 cannot be released without Kubernetes 1.31

 
Scenarios (mandatory) 

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

PRs:

Retro: Kube 1.31 Rebase Retrospective Timeline (OCP 4.18)

Retro recording: https://drive.google.com/file/d/1htU-AglTJjd-VgFfwE3z_dH5tKXT1Tes/view?usp=drive_web

Description of problem:

Given two images with different names but the same layers, "oc image mirror" will only mirror one of them. For example:

$ cat images.txt
quay.io/openshift/community-e2e-images:e2e-33-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS quay.io/bertinatto/test-images:e2e-33-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS
quay.io/openshift/community-e2e-images:e2e-31-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS quay.io/bertinatto/test-images:e2e-31-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS

$ oc image mirror -f images.txt
quay.io/
  bertinatto/test-images
    manifests:
      sha256:298dcd808e27fbf96614e4c6f06730f22964dce41dcdc7bf21096c42411ba773 -> e2e-33-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS
  stats: shared=0 unique=0 size=0B

phase 0:
  quay.io bertinatto/test-images blobs=0 mounts=0 manifests=1 shared=0

info: Planning completed in 2.6s
sha256:298dcd808e27fbf96614e4c6f06730f22964dce41dcdc7bf21096c42411ba773 quay.io/bertinatto/test-images:e2e-33-registry-k8s-io-e2e-test-images-resource-consumer-1-13-LT0C2W4wMzShSeGS
info: Mirroring completed in 240ms (0B/s)    

Version-Release number of selected component (if applicable):

4.18    

How reproducible:

Always    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

Only one of the images was mirrored.

Expected results:

Both images should be mirrored.     

Additional info:

    

This PR https://github.com/openshift/origin/pull/29141 loosens the check to ignore the warning message in the output in order to unblock https://github.com/openshift/oc/pull/1877. Once the required PRs are merged, we should revert to `o.Equal`. This issue is created to track this work.

TechPreview clusters are unable to bootstrap because kube-apiserver fails to start with the following error:

E0827 20:29:22.653501 1 run.go:72] "command failed" err="group version resource.k8s.io/v1alpha2 that has not been registered"

This happens because, in Kubernetes 1.31, the group version resource.k8s.io/v1alpha2 was removed and replaced with resource.k8s.io/v1alpha3. This is part of the DynamicResourceAllocation feature, which is currently TechPreview.

After discussing this with the team, we decided that the best approach is to modify the cluster-kube-apiserver-operator to start the kube-apiserver with the correct group version based on the Kubernetes version being used.
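
Conceptually, the operator would render the API server arguments with the group version matching the embedded Kubernetes level, for example (flag values shown for illustration only):

# Kubernetes 1.30 and earlier (DynamicResourceAllocation TechPreview)
--runtime-config=resource.k8s.io/v1alpha2=true
# Kubernetes 1.31
--runtime-config=resource.k8s.io/v1alpha3=true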

As a customer of self-managed OpenShift or an SRE managing a fleet of OpenShift clusters, I should be able to determine the progress and state of an OCP upgrade and only be alerted if the cluster is unable to progress. Support a CLI status command and status API which can be used by a cluster-admin to monitor the progress. The status command/API should also contain data to alert users about potential issues which can make the updates problematic.

Feature Overview (aka. Goal Summary)  

Here are common update improvements from customer interactions on Update experience

  1. Show nodes where pod draining is taking more time.
     Customers often have to dig deeper to find the nodes for further debugging.
     The ask has been to bubble this up in the update progress window.
  2. oc update status?
     From the UI we can see the progress of the update. From the oc CLI we can see this with "oc get clusterversion",
     but the ask is to show more details in a human-readable format.

    Know where the update has stopped. Consider adding at what run level it has stopped.
     
    oc get clusterversion
    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    
    version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
    

Documentation Considerations

Update docs for UX and CLI changes

Reference : https://docs.google.com/presentation/d/1cwxN30uno_9O8RayvcIAe8Owlds-5Wjr970sDxn7hPg/edit#slide=id.g2a2b8de8edb_0_22

Epic Goal*

Add a new `oc adm upgrade status` command which is backed by an API. Please find the mock output of the command attached in this card.
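
For context, invoking the tech-preview command today looks roughly like this (the environment-variable gate is an assumption about the current tech-preview builds of oc):

OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status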

Why is this important? (mandatory)

  • From the UI we can see the progress of the update. Using the oc CLI we can see some of the information using "oc get clusterversion", but the output is not easily readable and there is a lot of extra information to process.
  • Customers are asking us to show more details in a human-readable format as well as to provide an API which they can use for automation.

Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Other 

We utilize MCO annotations to determine whether a node is degraded or unavailable, and we source only the Reason annotation to put into the insight. Many common cases are not covered by this, especially the unavailable ones: nodes can be cordoned, have a condition like DiskPressure, be in the process of termination, etc. It is not clear whether our code or something like the MCO should provide this, but it is captured as a card for now.

Current state:

An update is in progress for 28m42s: Working towards 4.14.1: 700 of 859 done (81% complete), waiting on network

= Control Plane =
...
Completion:      91%

Improvement opportunities

1. Inconsistent info: the CVO message says "700 of 859 done (81% complete)" but the control plane section says "Completion: 91%"
2. Unclear measure of completion: the CVO message counts manifests applied, while the control plane section says "Completion: 91%", which counts upgraded COs. Neither message states what it counts. Manifest count is an internal implementation detail which users likely do not understand. COs are less so, but we should be clearer about what the completion means.
3. We could take advantage of this line and communicate progress in more detail

Definition of Done

We'll only remove the CVO message once the rest of the output functionally covers it, so the inconsistency stays until OTA-1154. Otherwise:

= Control Plane =
...
Completion:      91% (30 operators upgraded, 1 upgrading, 2 waiting)

Upgraded operators are COs that have updated their version, no matter their conditions.
Upgrading operators are COs that have not updated their version and are Progressing=True.
Waiting operators are COs that have not updated their version and are Progressing=False.
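
A rough way to eyeball the same classification by hand (the jsonpath is illustrative only):

oc get clusteroperators -o jsonpath='{range .items[*]}{.metadata.name}{" version="}{.status.versions[?(@.name=="operator")].version}{" Progressing="}{.status.conditions[?(@.type=="Progressing")].status}{"\n"}{end}'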

Description

During an upgrade, once the control plane is successfully updated, status items related to that part of the upgrade cease to be relevant, and therefore we can either hide them entirely or show a simplified version of them. The relevant sections are Control plane and Control plane nodes.

As an OTA engineer,
I would like to make sure the node in a single-node cluster is handled correctly in the upgrade-status command.

Context:
According to the discussion with the MCO team, the node is in the master MCP but not the worker MCP.
This card is to make sure that the node is displayed that way too. My feeling is that the current code probably does the job already; in that case, we should add test coverage for the case to avoid regressions in the future.

AC:

Epic Goal

Address performance and scale issues in Whereabouts IPAM CNI

Why is this important?

Whereabouts is becoming increasingly popular for workloads that operate at scale. Whereabouts was originally built as a convenience function for a handful of IPs; however, more and more customers want to use Whereabouts in scale situations.

Notably, for telco and AI/ML scenarios. Some AI/ML scenarios launch a large number of pods that need to use secondary networks for related traffic.

 

Supporting Documents

Upstream collaboration outline

Acceptance Criteria

  • TBD

Feature Overview

This is Image mode on OpenShift. It uses the rpm-ostree native containers interface and not bootc, but that is an implementation detail.

In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment or potentially an enterprising administrator with some knowledge of OCP Builds could set theirs up on-cluster.

The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.

On-cluster, automated RHCOS Layering builds are important for multiple reasons:

  • One-click/one-command upgrades of OCP are very popular. Many customers may want to make one or just a few customizations but also want to keep that simplified upgrade experience. 
  • Customers who only need to customize RHCOS temporarily (hotfix, driver test package, etc) will find off-cluster builds to be too much friction for one driver.
  • One of OCP's virtues is that the platform and OS are developed, tested, and versioned together. Off-cluster building breaks that connection and leaves it up to the user to keep the OS up-to-date with the platform containers. We must make it easy for customers to add what they need and keep the OS image matched to the platform containers.

Goals & Requirements

  • The goal of this feature is primarily to bring the 4.14 progress (OCPSTRAT-35) to a Tech Preview or GA level of support.
  • Customers should be able to specify a Containerfile with their customizations and "forget it" as long as the automated builds succeed. If they fail, the admin should be alerted and pointed to the logs from the failed build.
    • The admin should then be able to correct the build and resume the upgrade.
  • Intersect with the Custom Boot Images such that a required custom software component can be present on every boot of every node throughout the installation process including the bootstrap node sequence (example: out-of-box storage driver needed for root disk).
  • Users can return a pool to an unmodified image easily.
  • RHEL entitlements should be wired in or at least simple to set up (once).
  • Parity with current features – including the current drain/reboot suppression list, CoreOS Extensions, and config drift monitoring.

This work describes the tech preview state of On Cluster Builds. Major interfaces should be agreed upon at the end of this state.

 

As a cluster admin of user provided infrastructure,
when I apply the machine config that opts a pool into On Cluster Layering,
I want to also be able to remove that config and have the pool revert back to its non-layered state with the previously applied config.
 
As a cluster admin using on cluster layering,
when an image build has failed,
I want it to retry 3 times automatically without my intervention and show me where to find the log of the failure.

 
As a cluster admin,
when I enable On Cluster Layering,
I want to know that the builder image I am building with is stable and will not change unless I change it
so that I keep the same API promises as we do elsewhere in the platform.

 

To test:

As a cluster admin using on cluster layering,
when I try to upgrade my cluster and the Cluster Version Operator is not available,
I want the upgrade operation to be blocked.

As a cluster admin,
when I use a disconnected environment,
I want to still be able to use On Cluster Layering.

As a cluster admin using On Cluster layering,
When there has been config drift of any sort that degrades a node and I have resolved the issue,
I want it to resync without forcing a reboot.

As a cluster admin using on cluster layering,
when a pool is using on cluster layering and references an internal registry
I want that registry available on the host network so that the pool can successfully scale up 
(MCO-770, MCO-578, MCO-574 )

As a cluster admin using on cluster layering,
when a pool is using on cluster layering and I want to scale up nodes,
the nodes should have the same config as the other nodes in the pool.

Maybe:

Entitlements: MCO-1097, MCO-1099

Not Likely:

As a cluster admin using on cluster layering,
when I try to upgrade my cluster,
I want the upgrade operation to succeed at the same rate as non-OCL upgrades do.

Currently, we are using bare pod objects for doing our image builds. While this works, it makes adding retry logic and other things much more difficult since we would have to implement this logic ourselves. Instead, we should use Kubernetes Job objects.

Jobs have built-in mechanisms for retrying, exponential backoff, concurrency controls, etc. This frees us from having to implement complicated retry logic for build failures beyond our control such as pod evictions, etc.

 

Done When:

  • BuildController uses Kubernetes Jobs instead of bare pods to perform builds.
  • All tests have been updated.

The Insights Operator syncs the customer's Simple Content Access certificate to the etc-pki-entitlement secret in the openshift-config-managed namespace every 8 hours. Currently, the user is expected to clone this secret into the MCO namespace prior to initiating a build if they require this cert during the build process. We'd like this step automated so that the user does not have to do it manually.
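
The manual step being automated is roughly the following (the source namespace and secret name come from the description; the target secret name in the MCO namespace is an assumption):

oc extract secret/etc-pki-entitlement -n openshift-config-managed --to=/tmp/entitlement
oc create secret generic etc-pki-entitlement -n openshift-machine-config-operator --from-file=/tmp/entitlement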

Whenever a must-gather is collected, it includes all of the objects at the time of the must-gather creation. Right now, must-gathers do not include MachineOSConfigs and MachineOSBuilds, which would be useful to have for support and debugging purposes.

 

Done When:

  • must-gathers include all MachineOSConfigs / MachineOSBuilds, if present.

This work describes the tech preview state of On Cluster Builds. Major interfaces should be agreed upon at the end of this state.

 

As a cluster admin of user provided infrastructure,
when I apply the machine config that opts a pool into On Cluster Layering,
I want to also be able to remove that config and have the pool revert back to its non-layered state with the previously applied config.
 
As a cluster admin using on cluster layering,
when an image build has failed,
I want it to retry 3 times automatically without my intervention and show me where to find the log of the failure.

 
As a cluster admin,
when I enable On Cluster Layering,
I want to know that the builder image I am building with is stable and will not change unless I change it
so that I keep the same API promises as we do elsewhere in the platform.

 

To test:

As a cluster admin using on cluster layering,
when I try to upgrade my cluster and the Cluster Version Operator is not available,
I want the upgrade operation to be blocked.

As a cluster admin,
when I use a disconnected environment,
I want to still be able to use On Cluster Layering.

As a cluster admin using On Cluster layering,
When there has been config drift of any sort that degrades a node and I have resolved the issue,
I want it to resync without forcing a reboot.

As a cluster admin using on cluster layering,
when a pool is using on cluster layering and references an internal registry
I want that registry available on the host network so that the pool can successfully scale up 
(MCO-770, MCO-578, MCO-574 )

As a cluster admin using on cluster layering,
when a pool is using on cluster layering and I want to scale up nodes,
the nodes should have the same config as the other nodes in the pool.

Maybe:

Entitlements: MCO-1097, MCO-1099

Not Likely:

As a cluster admin using on cluster layering,
when I try to upgrade my cluster,
I want the upgrade operation to succeed at the same rate as non-OCL upgrades do.

Currently, it is not possible for cluster admins to revert from a pool that is opted into on-cluster builds and layered MachineConfig updates. See https://issues.redhat.com/browse/OCPBUGS-16201 for details around what happens.

It is worth mentioning that this is mostly an issue for UPI (user-provided infrastructure) / bare metal users of OpenShift. For IPI cases in AWS / GCP / Azure / et al., one can simply delete the node and the machine, which will cause the Machine API to provision a fresh node to replace it, e.g.:

 

#!/bin/bash

# Accept either "node/<name>" or a bare node name and strip the "node/" prefix.
node_name="$1"
node_name="${node_name/node\//}"

# Look up the Machine backing this node from its machine.openshift.io/machine annotation,
# then strip the "openshift-machine-api/" namespace prefix.
machine_id="$(oc get "node/$node_name" -o jsonpath='{.metadata.annotations.machine\.openshift\.io/machine}')"
machine_id="${machine_id/openshift-machine-api\//}"

# Delete the Machine and the Node; the Machine API will provision a replacement node.
oc delete --wait=false "machine/$machine_id" -n openshift-machine-api
oc delete --wait=false "node/$node_name"

 

Done When

  • The MCD can revert from a node from on-cluster builds / layered MachineConfigs into the legacy behavior.
  • Or we've determined that the above is either infeasible or undesirable.

Description of problem:

When we create a MOSC to enable OCL in a pool and then delete the MOSC resource to revert it, the MOSB and ConfigMaps are garbage collected, but we need to wait a long and unpredictable time until the nodes are updated with the new config.
    

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-0.test-2024-10-15-080246-ci-ln-0gsqflb-latest   True        False         8h      Cluster version is 4.18.0-0.test-2024-10-15-080246-ci-ln-0gsqflb-latest

    

How reproducible:

Always
    

Steps to Reproduce:

    1. Create a MOSC to enable OCL in the worker pool
    2. Wait until the new OCL image is applied to all worker nodes
    3. Remove the MOSC resource created in step 1

    

Actual results:

The MOSB and ConfigMaps are cleaned up, but the nodes are not updated. After a random amount of time (somewhere around 10-20 minutes) the nodes are updated.
    

Expected results:

There should be no long pause between the deletion of the MOSC resource and the beginning of the node update process.
    

Additional info:

As a workaround, if we add any label to the worker pool to force a sync operation, the worker nodes start updating immediately.
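
For example (the label key/value are arbitrary, shown only to illustrate the workaround):

oc label machineconfigpool worker force-sync=1 --overwrite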
    

Description of problem:

When OCL is configured in a cluster that uses a proxy configuration, OCL does not use the proxy to build the image.
    

Version-Release number of selected component (if applicable):

 oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-rc.8   True        False         5h14m   Cluster version is 4.16.0-rc.8

    

How reproducible:

Always
    

Steps to Reproduce:

    1. Create a cluster that uses a proxy and cannot access the internet if not by using this proxy
    
    We can do it by using this flexy-install template, for example:
    https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/5724d9c157d51f175069c5bf09be1872173d0167/functionality-testing/aos-4_16/ipi-on-aws/versioned-installer-customer_vpc-http_proxy-multiblockdevices-fips-ovn-ipsec-ci

    private-templates/functionality-testing/aos-4_16/ipi-on-aws/versioned-installer-customer_vpc-http_proxy-multiblockdevices-fips-ovn-ipsec-ci

    2. Enable OCL in a machineconfigpool by creating a MOSC resrouce 
   
    

Actual results:

The build pod will not use the proxy to build the image and it will fail with a log similar to this one


time="2024-06-25T13:38:19Z" level=debug msg="GET https://quay.io/v1/_ping"
time="2024-06-25T13:38:49Z" level=debug msg="Ping https://quay.io/v1/_ping err Get \"https://quay.io/v1/_ping\": dial tcp 44.216.66.253:443: i/o timeout (&url.Error{Op:\"Get\", URL:\"https://quay.io/v1/_ping\", Err:(*net.OpError)(0xc000220d20)})"
time="2024-06-25T13:38:49Z" level=debug msg="Accessing \"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eaa7835f2ec7d2513a76e30a41c21ce62ec11313fab2f8f3f46dd4999957a883\" failed: pinging container registry quay.io: Get \"https://quay.io/v2/\": dial tcp 44.216.66.253:443: i/o timeout"
time="2024-06-25T13:38:49Z" level=debug msg="Error pulling candidate quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eaa7835f2ec7d2513a76e30a41c21ce62ec11313fab2f8f3f46dd4999957a883: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eaa7835f2ec7d2513a76e30a41c21ce62ec11313fab2f8f3f46dd4999957a883: pinging container registry quay.io: Get \"https://quay.io/v2/\": dial tcp 44.216.66.253:443: i/o timeout"
Error: creating build container: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eaa7835f2ec7d2513a76e30a41c21ce62ec11313fab2f8f3f46dd4999957a883: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp 44.216.66.253:443: i/o timeout
time="2024-06-25T13:38:49Z" level=debug msg="shutting down the store"
time="2024-06-25T13:38:49Z" level=debug msg="exit status 125"




    

Expected results:

The build should be able to access the necessary resources by using the configured proxy
    

Additional info:

When verifying this ticket, we need to pay special attention to https proxies using their own user-ca certificate

We can use this flexy-install template: 
https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/5724d9c157d51f175069c5bf09be1872173d0167/functionality-testing/aos-4_16/ipi-on-osp/versioned-installer-https_proxy

private-templates/functionality-testing/aos-4_16/ipi-on-osp/versioned-installer-https_proxy

In this kind of cluster it is not enough to use the proxy to build the image; we also need to use the /etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt file to be able to reach the yum repositories, since rpm-ostree will complain about an intermediate certificate (the one of the HTTPS proxy) being self-signed.

To test it we can use a custom Containerfile including something similar to:

RUN cd /etc/yum.repos.d/ && curl -LO https://pkgs.tailscale.com/stable/fedora/tailscale.repo && \
    rpm-ostree install tailscale && rpm-ostree cleanup -m && \
    systemctl enable tailscaled && \
    ostree container commit


BuildController is responsible for a lot of things. Unfortunately, it is very difficult to determine where and how BuildController does its job, which makes it more difficult to extend and modify as well as test.

Instead, it may be more useful to think of BuildController as the thing that converts MachineOSBuilds into build pods, jobs, et. al. Similar to how we have a subcontroller for dealing with build pods, we should have another subcontroller whose job is to produce MachineOSBuilds.

 

Done When:

  • A MachineOSBuildController (or similar) is introduced into pkg/controller/build whose sole job is to watch for MachineOSConfig creation / changes, as well as MachineConfigPool config updates.
  • In response to the aforementioned events, MachineOSBuildController should create a MachineOSBuild object using those inputs.
  • If a build is currently in progress and one of the aforementioned events occurs, either MachineOSBuildController or BuildController (TBD which one), should cancel the running build, clean up any ephemeral build objects, and start a new build.
  • BuildController can be simplified to only look for the creation and deletion of MachineOSBuild objects.
  • This, coupled with https://issues.redhat.com/browse/MCO-1326, will go a long way toward making BuildController more resilient, modular, and testable.

Description of problem:

When OCL is enabled and we configure several MOSC resources for several MCPs, the MCD pods are restarted every few seconds.
They should only be restarted once per MOSC; instead, they are restarted continuously.
    

Version-Release number of selected component (if applicable):

IPI on AWS version 4.17.0-0.test-2024-10-02-080234-ci-ln-2c0xsqb-latest
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Enable techpreview
    2. Create 5 custom MCPs
    3. Create one MOSC resource for each new MCP
    

Actual results:

MCD pods will be restarted every few seconds

$ oc get pods
NAME                                                             READY   STATUS    RESTARTS   AGE
kube-rbac-proxy-crio-ip-10-0-31-199.us-east-2.compute.internal   1/1     Running   4          4h51m
kube-rbac-proxy-crio-ip-10-0-31-37.us-east-2.compute.internal    1/1     Running   4          4h43m
kube-rbac-proxy-crio-ip-10-0-38-189.us-east-2.compute.internal   1/1     Running   4          4h51m
kube-rbac-proxy-crio-ip-10-0-54-127.us-east-2.compute.internal   1/1     Running   3          4h43m
kube-rbac-proxy-crio-ip-10-0-69-126.us-east-2.compute.internal   1/1     Running   4          4h51m
machine-config-controller-d6bdf7d85-2wb22                        2/2     Running   0          113m
machine-config-daemon-d7t4d                                      2/2     Running   0          6s
machine-config-daemon-f7vv2                                      2/2     Running   0          12s
machine-config-daemon-h8t8z                                      2/2     Running   0          8s
machine-config-daemon-q9fhr                                      2/2     Running   0          10s
machine-config-daemon-xvff2                                      2/2     Running   0          4s
machine-config-operator-56cdd7f8fd-wlsdd                         2/2     Running   0          105m
machine-config-server-klggk                                      1/1     Running   1          4h48m
machine-config-server-pmx2n                                      1/1     Running   1          4h48m
machine-config-server-vwxjx                                      1/1     Running   1          4h48m
machine-os-builder-7fb58586bc-sq9rj                              1/1     Running   0          50m

    

Expected results:

MCD pods should only be restarted once for every MOSC
    

Additional info:


    

As an OpenShift cluster admin, I would like to try out on-cluster layering (OCL) to better understand how it works, how to set it up, and how to use it. To that end, a quick-start guide for what I need to do to get started as well as a troubleshooting guide would be indispensable.

 

Done When:

Within BuildController, there is a lot of code concerned with creating all of the ephemeral objects for performing a build, converting secrets from one form to another, cleaning up after the build is completed, etc. Unfortunately, because of how BuildController is currently written, this code has become a bit unwieldy and difficult to modify and test. In addition, it is very difficult to reason about what is actually happening. Therefore, it should be broken up and refactored into separate modules within pkg/controller/build.

By doing this, we can have very high test granularity as well as tighter assertions for the places where it is needed the most while simultaneously allowing looser and more flexible testing for BuildController itself.

 

Done When:

  • ImageBuildRequest and all of the various helpers and test code has been repackaged into a submodule within pkg/controller/build.
  • Repackaged code should only have a few ways to use it as opposed to global structs, methods, and functions. This will ensure that the code is effectively modularized.
  • Unneeded code is removed from BuildController, such as anything referring to OpenShift Image Builds.
  • Unit tests are updated.

Feature Overview 

ETCD backup API was delivered behind a feature gate in 4.14. This feature is to complete the work for allowing any OCP customer to benefit from the automatic etcd backup capability.

The feature introduces automated backups of the etcd database and cluster resources in OpenShift clusters, eliminating the need for user-supplied configuration. This feature ensures that backups are taken and stored on each master node from the day of cluster installation, enhancing disaster recovery capabilities.

Why is it important?

The current method of backing up etcd and cluster resources relies on user-configured CronJobs, which can be cumbersome and prone to errors. This new feature addresses the following key issues:

  • User Experience: Automates backups without requiring any user configuration, improving the overall user experience.
  • Disaster Recovery: Ensures backups are available on all master nodes, significantly improving the chances of successful recovery in disaster scenarios where multiple control-plane nodes are lost.
  • Cluster Stability: Maintains cluster availability by avoiding any impact on etcd and API server operations during the backup process.

Requirements

Complete work to auto-provision internal PVCs when using the local PVC backup option. (right now, the user needs to create PVC before enabling the service).

Out of Scope

The feature does not include saving cluster backups to remote cloud storage (e.g., S3 Bucket), automating cluster restoration, or providing automated backups for non-self-hosted architectures like Hypershift. These could be future enhancements (see OCPSTRAT-464)

 

Epic Goal*

Provide automated backups of etcd saved locally on the cluster on Day 1 with no additional config from the user.

 
Why is this important? (mandatory)

The current etcd automated backups feature requires some configuration on the user's part to save backups to a user specified PersistentVolume.
See: https://github.com/openshift/api/blob/ba11c1587003dc84cb014fd8db3fa597a3faaa63/config/v1alpha1/types_backup.go#L46

Before the feature can be shipped as GA, we would require the capability to save backups automatically by default without any configuration. This would help all customers have an improved disaster recovery experience by always having a somewhat recent backup. 
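
For reference, the tech-preview configuration that this epic aims to make unnecessary looks roughly like the following (field names are recalled from the v1alpha1 Backup API linked above and should be treated as an assumption):

apiVersion: config.openshift.io/v1alpha1
kind: Backup
metadata:
  name: default
spec:
  etcd:
    schedule: "0 */6 * * *"
    timeZone: "UTC"
    retentionPolicy:
      retentionType: RetentionNumber
      retentionNumber:
        maxNumberOfBackups: 5
    pvcName: etcd-backup-pvc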

 
Scenarios (mandatory) 

  • After a cluster is installed the etcd-operator should take etcd backups and save them to local storage.
  • The backups must be pruned according to a "reasonable" default retention policy so it doesn't exhaust local storage.
  • A warning alert must be generated upon failure to take backups.

Implementation details:
One issue we need to figure out during the design of this feature is how the current API might change as it is inherently tied to the configuration of the PVC name.
See:
https://github.com/openshift/api/blob/ba11c1587003dc84cb014fd8db3fa597a3faaa63/config/v1alpha1/types_backup.go#L99
and 
https://github.com/openshift/api/blob/ba11c1587003dc84cb014fd8db3fa597a3faaa63/operator/v1alpha1/types_etcdbackup.go#L44

Additionally we would need to figure out how the etcd-operator knows about the available space on local storage of the host so it can prune and spread backups accordingly.
 

Dependencies (internal and external) (mandatory)

Depends on changes to the etcd-operator and the tech preview APIs 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - etcd team
  • Documentation - etcd docs team
  • QE - Sandeep Kundu
  • PX - 
  • Others -

Acceptance Criteria (optional)

Upon installing a tech-preview cluster, backups must be saved locally and their status and path must be visible to the user, e.g. on the operator.openshift.io/v1 Etcd cluster object.

An e2e test to verify that the backups are being saved locally with some default retention policy.

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Feature Overview

  • oc-mirror by default leverages OCI 1.1 referrers or its fallback (tag-based discovery) to discover related image signatures for any image that it mirrors
  • this feature is enabled by default and can be disabled globally
  • Optionally, oc-mirror can be configured to include other referring artifacts, e.g. SBOMs or in-toto attestations referenced by their OCI artifact media type

Goals

  • As part of OCPSTRAT-918 and OCPSTRAT-1245 we are introducing broad coverage in the OpenShift platform for signatures produced with the SigStore tooling, which allow for scalable and flexibly validation of the signatures, incl. offline environments
  • In order to enable offline verification, oc-mirror needs to detect whether any image that is in scope for its mirroring operation has one or more related SigStore signatures referring to it, by using the OCI 1.1 referrers API or its fallback, or cosign's tag naming convention for signatures, and mirror those artifacts as well (see the illustration after this list)
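
An illustration of cosign's tag-based fallback convention referenced above (repository and digest are placeholders):

quay.io/example/app@sha256:<digest>          # the mirrored image
quay.io/example/app:sha256-<digest>.sig      # its SigStore signature artifact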

Requirements

  • SigStore-style signatures should be mirrored by default, but an opt-out has to be available
  • The Red Hat public key and the Red Hat public Rekor key used to sign product images need to be available offline
  • SigStore-style attachments should optionally be discoverable and mirrorable as an opt-in; the user should be able to supply a list of OCI media types they are interested in (e.g. text/spdx or application/vnd.cyclonedx for SBOMs)
Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

Background, and strategic fit

OpenShift is planning to ship all payload and layered product images signed consistently via cosign with OpenShift 4.17. oc-mirror should be able to leverage this to provide a seamless signature verification experience in an offline environment by automatically making all required signature artifacts available in the offline registry.

 

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

<your text here>

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

<your text here>

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Overview

This task is to ensure oc-mirror v2 has backward compatibility with what v1 was doing regarding signatures.

Goal

Ensure the correct ConfigMaps are generated and stored in a folder so that the user can deploy the related artifacts to the cluster as in v1

Feature Overview

As a user deploying OpenShift on bare metal I want the installer to use the NTP servers that I specify at install time.

Problem

When the Ironic pre-provisioning image containing IPA is running, there is no way to sync the clocks to a custom NTP server. This causes issues with certificates - IPA generates a certificate for itself to be valid starting 1 hour in the past (see OCPBUGSM-21571), so if the hardware clock is more than 1 hour ahead of the real time then the certificate will be rejected by Ironic.

A new field is required in install-config.yaml where the user can specify additional NTP servers that can then be used to set up a chrony config in the IPA ISO. (Potentially this could also be used to automatically generate the MachineConfig manifests to add the same config to the cluster.)

See initial discussion here: OCPBUGS-22957

 

When the Ironic pre-provisioning image containing IPA is running, there is no way to sync the clocks to a custom NTP server. This causes issues with certificates - IPA generates a certificate for itself to be valid starting 1 hour in the past (see OCPBUGSM-21571), so if the hardware clock is more than 1 hour ahead of the real time then the certificate will be rejected by Ironic.

A new field is required in install-config.yaml where the user can specify additional NTP servers that can then be used to set up a chrony config in the IPA ISO. (Potentially this could also be used to automatically generate the MachineConfig manifests to add the same config to the cluster.)

See initial discussion here: OCPBUGS-22957

Create an ICC patch that will read the new environment variable for additional NTP servers and use it to create a chrony ignition file (a minimal sketch follows).
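
A minimal sketch of what generating that chrony configuration and wrapping it in an Ignition file could look like; the environment-variable handling is omitted, and the file path, mode, and Ignition version shown are assumptions for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strings"
)

// chronyConf renders a minimal chrony configuration from the user-supplied NTP servers.
func chronyConf(servers []string) string {
	var b strings.Builder
	for _, s := range servers {
		fmt.Fprintf(&b, "server %s iburst\n", s)
	}
	b.WriteString("driftfile /var/lib/chrony/drift\nmakestep 1.0 3\n")
	return b.String()
}

// ignitionWithChrony wraps the chrony config in a pared-down Ignition document
// as a data: URL file entry; the field names follow the Ignition v3 schema, but
// this is an illustration rather than the full spec.
func ignitionWithChrony(servers []string) ([]byte, error) {
	conf := chronyConf(servers)
	doc := map[string]any{
		"ignition": map[string]any{"version": "3.2.0"},
		"storage": map[string]any{
			"files": []map[string]any{{
				"path": "/etc/chrony.conf",
				"mode": 420, // 0644
				"contents": map[string]any{
					"source": "data:," + url.PathEscape(conf),
				},
			}},
		},
	}
	return json.MarshalIndent(doc, "", "  ")
}

func main() {
	out, err := ignitionWithChrony([]string{"ntp1.example.com", "ntp2.example.com"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```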

Feature description

oc-mirror v2 focuses on major enhancements that make oc-mirror faster and more robust, introduce caching, and address more complex air-gapped scenarios. oc-mirror v2 is a rewritten version with three goals:

  • Manage complex air-gapped scenarios, providing support for the enclaves feature
  • Faster and more robust: introduces caching and doesn’t rebuild catalogs from scratch
  • Improves code maintainability, making it more reliable and easier to add features and fixes, and includes a feature plugin interface

 

The 4.17 version of the delete functionality needs some improvements regarding:

  • CLID-196: should be able to delete operators previously mirrored using mirror to mirror
  • CLID-224: should not delete blobs that are shared with images that were not deleted.

Check whether it is possible to delete operators using the delete command when the previous command was mirror-to-mirror. It probably won't work because in mirror-to-mirror the cache is not updated.

It is necessary to find a solution for this scenario.

oc-mirror should account for users who are relying on oc-mirror v1 in production and accommodate an easy migration:

  • namespaces used for release mirroring should be the same
  • ICSP to IDMS migration

 

The way of tagging images for releases, operators and additional images is different between v1 and v2. So it is necessary to have some kind of migration feature in order to enable customers to migrate from one version to the other.

Use cases:

  • As an oc-mirror user, I'd like to be able to use the delete feature of v2 to delete images mirrored with v1, so that I can keep the registry volume under control.
    • Since the algorithm is different between versions, the delete feature of v2 won't find the images mirrored by v1 and the customer won't be able to delete them.
  • As an oc-mirror user switching to v2, I'd like for ICSP and IDMS/ITMS cohabitation to not cause major cluster problems. 
    • the namespace used for releases in v1 is always ocp/release. In v2 this is different, so the IDMS/ITMS generated by v2 won't recognize release images mirrored by v1.
  • As an oc-mirror user switching to v2, I'd like to apply the new catalog source files without them colliding with the ones generated by v1
  • As an oc-mirror v1 user, I'd like the images already mirrored in v1 to be reusable (recognized) when using oc-mirror v2, so that I don't double the storage volume of my registry unnecessarily
  • As an oc-mirror user switching to v2, I'd like to be able to easily construct the openshift-install.yaml file which is necessary to create a disconnected cluster
  • From Naval: Previously we were doing the mirroring with oc adm release, which was pushing to the directory of our choice, but when switching to oc-mirror v2 for release mirroring, I wasn't able to "reuse" the image path that was already created using oc adm release for 4.16.5, because when specifying a path to oc-mirror it forces two new repositories whose names we can't choose. My suggestion would be to leave this as an option, to avoid having to do dark things like I did (honestly, since we are on a 100 Mb connection for pulling images, I ended up doing a crane copy of the oc adm release mirrored path into the oc-mirror v2 created path, then running oc-mirror v2 again to let it detect that the images were already there...)

The solution is still to be discussed.

Feature Overview (aka. Goal Summary)  

Customers who deploy a large number of OpenShift on OpenStack clusters want to minimise the resource requirements of their cluster control planes.

Customers deploying RHOSO (OpenShift services for OpenStack, i.e. OpenStack control plane on bare metal OpenShift) already have a bare metal management cluster capable of serving Hosted Control Planes.

We should enable self-hosted (i.e. on-prem) Hosted Control Planes to serve Hosted Control Planes to OpenShift on OpenStack clusters, with a specific focus of serving Hosted Control Planes from the RHOSO management cluster.

Goals (aka. expected user outcomes)

As an enterprise IT department and OpenStack customer, I want to provide self-managed OpenShift clusters to my internal customers with minimum cost to the business.

As an internal customer of said enterprise, I want to be able to provision an OpenShift cluster for myself using the business's existing OpenStack infrastructure.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

TBD
 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Goal

  • Ability to run cinder and manila operators as controller Pods in a hosted control plane
  • Ability to run the node DaemonSets in guest clusters

Why is this important?

  • Continue supporting usage of CSI drivers for the guest cluster, just as is possible with standalone OpenShift clusters.

Scenarios


  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://github.com/openshift/enhancements/blob/master/enhancements/storage/storage-hypershift.md
  2. https://issues.redhat.com/browse/OCPSTRAT-210
  3.  

Open questions::

In OSASINFRA-3483, we modified openshift/cluster-storage-operator to integrate support for kustomize and provide the infrastructure to generate two sets of assets: one for standalone deployment, and one for hypershift deployment. In this story, we will track actually adding support for the latter.

In OSASINFRA-3610, we merged the openshift/csi-driver-manila-operator repository into openshift/csi-operator and modified it to take advantage of the new generator framework provided therein. Now, we want to build on this, adding Hypershift-specific assets and tweaking whatever else is needed.

In OSASINFRA-3608, we merged the openshift/openstack-cinder-csi-driver-operator repository into openshift/csi-operator and modified it to take advantage of the new generator framework provided therein. Now, we want to build on this, adding Hypershift-specific assets and tweaking whatever else is needed.

In OSASINFRA-3483, we modified openshift/cluster-storage-operator to integrate support for kustomize and provide the infrastructure to generate two sets of assets: one for standalone deployment, and one for hypershift deployment. In this story, we will track actually adding support for the latter.

We want to prepare cluster-storage-operator for eventual Hypershift integration. To this end, we need to migrate the assets and references to same to integrate kustomize. This will likely look similar to https://github.com/openshift/cluster-storage-operator/pull/318 once done (albeit, without the Hypershift work).

This is a container Epic for tasks which we know need to be done for Tech Preview but which we don't intend to do now. It needs to be groomed before it is useful for planning.

Matthew Booth is worried about the feature we added to pre-create a FIP and assign it to the Service object for router-default. This is indeed racy and could be problematic: if another controller takes over that field as well, it will create infinite loops and the result wouldn't be great for customers.

The idea is to remove that feature now and eventually add it back later when it's safer (e.g. feature added to the Ingress operator?). It's worth noting that core kubernetes has deprecated the loadBalancerIP field in the Service object, and it now works with annotations. Maybe we need to investigate that path.

Right now, our pods are SingleReplica because to have multiple replicas we need more than one zone for nodes, which translates into availability zones (AZs) in OpenStack. We need to figure that out.

We should not have to explicitly configure the location of the clouds.yaml file, since there is a list of well-known places where it can be found (see the sketch below). We should also be able to configure which cloud is used from the chosen clouds.yaml.
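
A minimal sketch of probing the conventional clouds.yaml locations; the candidate list and the OS_CLIENT_CONFIG_FILE override mirror the usual OpenStack client conventions, but the exact precedence here is an assumption:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// findCloudsYAML probes the conventional locations for clouds.yaml and returns
// the first one that exists. An OS_CLIENT_CONFIG_FILE override is honored first;
// the rest follow the usual OpenStack client search order.
func findCloudsYAML() (string, bool) {
	home, _ := os.UserHomeDir()
	candidates := []string{
		os.Getenv("OS_CLIENT_CONFIG_FILE"),
		"clouds.yaml",
		filepath.Join(home, ".config", "openstack", "clouds.yaml"),
		"/etc/openstack/clouds.yaml",
	}
	for _, c := range candidates {
		if c == "" {
			continue
		}
		if _, err := os.Stat(c); err == nil {
			return c, true
		}
	}
	return "", false
}

func main() {
	if path, ok := findCloudsYAML(); ok {
		fmt.Println("using", path)
	} else {
		fmt.Println("no clouds.yaml found in the well-known locations")
	}
}
```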

Being able to connect the node pools to additional networks, as we already support on standalone clusters.

This task will be necessary for some use cases, like using Manila CSI on a storage network, running NFV workloads on an SR-IOV provider network, or running IPv6 dual-stack workloads on a provider network.

I see at least 2 options:

  • We patch CAPO to support AdditionalNetworks (type: []NetworkParam), and we append what is in the spec.ports when creating the OpenStackMachine with the additional networks
  • We provision the CAPO cluster with ManagedSubnets and provide ports to OpenStackMachine.Spec.Ports directly. No need to patch CAPO.
    Either way, I am starting to think that we could simplify HCP/OSP:
    in BYON, the user will have to provide the Router, Network and Subnets in OpenStackCluster.Spec.
    In non-BYON (the current default and supported way), we would ALWAYS provide ManagedSubnets (we'll have a default []SubnetSpec) so that, whether or not we have additional networks, we keep control over OpenStackMachine.Spec.Ports.

 

One thing we also need to solve is that when a Node has more than one port, kubelet won't necessarily listen on the primary interface. It seems CPO has an option to define the primary network name: https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/openstack-cloud-controller-manager/using-openstack-cloud-controller-manager.md#networking

If we don't solve that, the nodepool (worker) won't join the cluster since Kubelet might listen on the wrong interface.

When the management cluster runs on AWS, make sure we update the DNS record for *.apps, so ingress can work out of the box.

HyperShift should be able to deploy the minimum useful OpenShift cluster on OpenStack. This is the minimum requirement to be able to test it. It is not sufficient for GA.

Goal

Stop using the openshift/installer-aro repo during installation of an ARO cluster. installer-aro is a fork of openshift/installer with carried patches. Currently it is vendored into openshift/installer-aro-wrapper in place of the upstream installer.

Benefit Hypothesis

Maintaining this fork requires considerable resources from the ARO team and results in delays in offering new OCP releases through ARO. Removing the fork will eliminate the work involved in keeping it up to date.

Resources

https://docs.google.com/document/d/1xBdl2rrVv0EX5qwhYhEQiCLb86r5Df6q0AZT27fhlf8/edit?usp=sharing

It appears that the only work required to complete this is to move the additional assets that installer-aro adds for the purpose of adding data to the ignition files. These changes can be directly added to the ignition after it is generated by the wrapper. This is the same thing that would be accomplished by OCPSTRAT-732, but that ticket involves adding a Hive API to do this in a generic way.

Responsibilities

The OCP Installer team will contribute code changes to installer-aro-wrapper necessary to eliminate the fork. The ARO team will review and test changes.

Success Criteria

The fork repo is no longer vendored in installer-aro-wrapper.

Results

Add results here once the Initiative is started. Recommend discussions & updates once per quarter in bullets.

 

Epic Goal

  • Eliminate the need to use the openshift/installer-aro fork of openshift/installer during the installation of an ARO cluster.

Why is this important?

  • Maintaining the fork is time-consuming for the ARO team and causes delays in rolling out new releases of OpenShift to ARO.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. CORS-1888
  2. CORS-2743
  3. CORS-2744
  4. https://github.com/openshift/installer/pull/7600/files

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Currently the Azure client can only be mocked in unit tests of the pkg/asset/installconfig/azure package. Using the mockable interface consistently and adding a public interface to set it up will allow other packages to write unit tests for code involving the Azure client (a generic sketch of the pattern follows).
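
A generic sketch of the pattern (not the actual installer API): define an interface over the Azure calls, keep the real SDK-backed implementation behind it, and expose a setter so other packages can inject a mock in their unit tests. All names here are hypothetical:

```go
// Package azureclient sketches a mockable client pattern; names are illustrative.
package azureclient

import "context"

// API is a hypothetical interface over the Azure calls the install-config
// validation needs; the real installer would list its actual methods here.
type API interface {
	GetVirtualNetwork(ctx context.Context, resourceGroup, name string) (string, error)
}

// client is the real implementation backed by the Azure SDK (omitted here).
type client struct{}

func (c *client) GetVirtualNetwork(ctx context.Context, resourceGroup, name string) (string, error) {
	// ... call the Azure SDK ...
	return "", nil
}

// newClient is the package-level factory; tests swap it out via SetClientFactory.
var newClient = func() API { return &client{} }

// SetClientFactory lets other packages install a mock implementation in unit tests.
func SetClientFactory(f func() API) { newClient = f }

// NewClient returns whichever implementation is currently configured.
func NewClient() API { return newClient() }
```

A test in another package would call SetClientFactory with a fake implementation of API before exercising the code under test, then restore the original factory afterwards.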

We deprecated "DeploymentConfig" in-favor of "Deployment" in OCP 4.14

Now in 4.18  we want to make "Deployment " as default out of box that means customer will get Deployment when they install OCP 4.18 . 

Deployment Config will still be available in 4.18 as non default for user who still want to use it . 

FYI "DeploymentConfig" is tier 1 API in Openshift and cannot be removed from 4.x product 

Please Review this FAQ : https://docs.google.com/document/d/1OnIrGReZKpc5kzdTgqJvZYWYha4orrGMVjfP1fUpljY/edit#heading=h.oranye5nwtsy 

Epic Goal*

WRKLDS-695 was implemented to make DeploymentConfig (DC) support enabled through a capability in 4.14. In order to prepare customers for the migration to Deployments, the capability was enabled by default. After three releases we need to reconsider whether disabling the capability by default is feasible.

More about capabilities in https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md#capability-sets.
 
Why is this important? (mandatory)

Disabling a capability by default makes an OCP installation lighter. Fewer components running by default reduces the security risk/vulnerability surface.

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  Users can still enable the capability in vanilla clusters. Existing clusters will keep the DC capability enabled during a cluster upgrade.

 
Dependencies (internal and external) (mandatory)

None

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - Workloads team
  • Documentation - Docs team
  • QE - Workloads QE team
  • PX - 
  • Others -

Acceptance Criteria (optional)

  • The DC capability is disabled by default in vanilla OCP installations
  • The DC capability can be enabled in a vanilla OCP installation
  • The DC capability is enabled after an upgrade in OCP clusters that have the capability already enabled before the upgrade
  • The DC capability is disabled after an upgrade in OCP clusters that have the capability disabled before the upgrade

Drawbacks or Risk (optional)

None. The DC capability can be enabled if needed.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be "Release Pending" 

Before DCs can be disabled by default, all the relevant e2e tests relying on DCs need to be migrated to Deployments to maintain the same testing coverage.

Feature Overview

This feature enables users of Hosted Control Planes (HCP) on bare metal to provision spoke clusters from ACM at scale, supporting hundreds to low thousands of clusters per hub cluster. It will use ACM's multi-tenancy to prevent interference across clusters. The implementation assumes the presence of workers in hosted clusters (either bare metal or KubeVirt).

Why is this important

We have a customer requirement to allow for massive scale & TCO reduction via Multiple ACM Hubs on a single OCP Cluster - Kubevirt Version

Resources

Feature Overview (aka. Goal Summary)  

When using OpenShift in a mixed, multi-architecture environment, some key details or checks are not always available. With this feature we will take a first pass at improving the UI/UX for customers as adoption of this configuration continues at pace.

Goals (aka. expected user outcomes)

The UI/UX experience should be improved when OpenShift is used in a mixed-architecture OCP cluster

Requirements (aka. Acceptance Criteria):

  • check that only the relevant CSI drivers are deployed to the relevant architectures
  • Improve filtering/auto-detection of architectures in OperatorHub
  • Console improvements, especially node views

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both Y
Classic (standalone cluster) Y
Hosted control planes Y
Multi node, Compact (three node), or Single node (SNO), or all Y
Connected / Restricted Network Y
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) All architectures
Operator compatibility n/a
Backport needed (list applicable versions) n/a
UI need (e.g. OpenShift Console, dynamic plugin, OCM) OpenShift Console
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Epic Goal

  • Console improvements, especially node views

Why is this important?

  •  

Scenarios
1. …

Acceptance Criteria

  • (Enter a list of Acceptance Criteria unique to the Epic)

Dependencies (internal and external)
1. …

Previous Work (Optional):
1. …

Open questions::
1. …

Done Checklist

  • CI - For new features (non-enablement), existing Multi-Arch CI jobs are not broken by the Epic
  • Release Enablement: <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - If the Epic is adding a new stream, downstream build attached to advisory: <link to errata>
  • QE - Test plans in Test Plan tracking software (e.g. Polarion, RQM, etc.): <link or reference to the Test Plan>
  • QE - Automated tests merged: <link or reference to automated tests>
  • QE - QE to verify documentation when testing
  • DOC - Downstream documentation merged: <link to meaningful PR>
  • All the stories, tasks, sub-tasks and bugs that belong to this epic need to have been completed and indicated by a status of 'Done'.

Feature Overview (aka. Goal Summary)  

Add support for the GCP N4 Machine Series to be used as Control Plane and Compute Nodes when deploying OpenShift on Google Cloud

Goals (aka. expected user outcomes)

As a user, I want to deploy OpenShift on Google Cloud using the N4 Machine Series for the Control Plane and Compute Nodes so I can take advantage of these new Machine types

Requirements (aka. Acceptance Criteria):

OpenShift can be deployed in Google Cloud using the new N4 Machine Series for the Control Plane and Compute Nodes

 

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  both
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all all
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Background

Google has made the N4 Machine Series available in their cloud offering. These Machine Series use "hyperdisk-balanced" disks for the boot device, which are not currently supported

Documentation Considerations

The documentation will be updated adding the new disk type that needs to be supported as part of this enablement. Also the N4 Machine Series will be added as tested Machine types for Google Cloud when deploying OpenShift

Epic Goal

Why is this important?

  • This is a new Machine Series Google has introduced that customers will use for their OpenShift deployments

Scenarios

  1. Deploy an OpenShift Cluster with both the Control Plane and Compute Nodes running on N4 GCP Machines

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. https://issues.redhat.com/browse/CORS-3561

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

As an oc-mirror user, I would like mirrored operator catalogs to reflect the mirrored operators only, so that, after I mirror my catalog, I can check that it contains the filtered operators using: 

$ oc-mirror list operators --catalog mirror.syangsao.net:8443/ocp4/redhat/redhat-operator-index:v4.12 

Context

In oc-mirror v2 (and in v1 after bug fix OCPBUGS-31536), oc-mirror doesn't rebuild catalogs.

  • The filtered declarative config isn't recreated based on the imagesetconfig filter
  • The catalog cache isn't regenerated
  • The catalog image isn't rebuilt based on the above 2 elements
    Instead, the original catalog image is pushed as is to the mirror registry. Its declarative config will show all operators, and for each operator all channels and all bundles.
    This behavior is causing some inconvenience to our users.

Concerns, complexity

  • How to deal with caches
  • How to deal with default channels
  • How to deal with keeping a single valid channel head
  • What to do when cross channel filtering is involved

Known ongoing/related work

Additional info:

 

As an oc-mirror user, I would like mirrored operator catalogs to reflect the mirrored operators only, so that, after I mirror my catalog, I can check that it contains the filtered operators using:

oc-mirror list operators --catalog mirror.syangsao.net:8443/ocp4/redhat/redhat-operator-index:v4.12

Context

In oc-mirror v2 (and in v1 after bug fix OCPBUGS-31536), oc-mirror doesn't rebuild catalogs.

  • The filtered declarative config isn't recreated based on the imagesetconfig filter
  • The catalog cache isn't regenerated
  • The catalog image isn't rebuilt based on the above 2 elements
    Instead, the original catalog image is pushed as is to the mirror registry. Its declarative config will show all operators, and for each operator all channels and all bundles.
    This behavior is causing some inconvenience to our users.

Concerns, complexity

  • How to deal with caches
  • How to deal with default channels
  • How to deal with keeping a single valid channel head
  • What to do when cross channel filtering is involved
  • How to deal with mirroring by bundle selection (how to rebuild the update graph)
  • Make multi-arch catalogs

Known ongoing/related work

Additional info:

This user story is to cover all the scenarios that were not covered by CLID-230

I found some problems

  • signatures,
  • image previously pulled via podman: it becomes single-arch, and therefore we cannot push the manifest list anymore at the end of the rebuild. The error message is:

image is not a manifest list
and the only way out was to rm -fr $HOME/.local/share/containers/storage

Each filtered catalog should have its own folder, named by the digest of its contents (a naming sketch follows the list below); inside this folder the following items should be present:

  • declarative config file with only the filtered operators in it
  • container file used to generate the catalog image
  • error log file containing any errors that occurred during the rebuild
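
A minimal sketch of the content-addressed folder naming, assuming the name is derived from a SHA-256 digest of the filtered declarative config bytes; the directory layout and digest truncation are illustrative:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"os"
	"path/filepath"
)

// catalogDirName derives a directory name from the filtered declarative config
// bytes, so each filtered catalog gets a stable, content-addressed folder.
// Hashing the rendered config with SHA-256 is an assumption for illustration.
func catalogDirName(declarativeConfig []byte) string {
	sum := sha256.Sum256(declarativeConfig)
	return fmt.Sprintf("sha256-%x", sum[:8]) // shortened digest for readability
}

func main() {
	cfg := []byte(`{"schema":"olm.package","name":"example-operator"}`)
	dir := filepath.Join("working-dir", "catalogs", catalogDirName(cfg))
	if err := os.MkdirAll(dir, 0o755); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	// The declarative config, the container file and an error log would be
	// written into this folder.
	fmt.Println("catalog folder:", dir)
}
```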

From this comment: https://github.com/openshift/oc-mirror/pull/937#discussion_r1806084982

Since o.Opts is already passed to imagebuilder.NewBuilder(), o.Opts.SrcImage.TlsVerify and o.Opts.DestImage.TlsVerify do not need to be passed as additional arguments.

From this comment: https://github.com/openshift/oc-mirror/pull/937#discussion_r1812461948

Ideally ImageBuilderInterface would be the interface to build any kind of image. Since RebuildCatalogs is very specific to catalog images, it would be better to have a separate interface only for that, or to reuse BuildAndPush.

This implies that we generate a new declarative config containing only a portion of the original declarative config (a minimal filtering sketch follows the acceptance criteria).
Acceptance criteria:

  • Only necessary operators should remain
  • Only necessary bundles should remain
  • Channels remaining should reflect the upgrade graph that is possible among remaining bundles
  • cross channel filtering
  • no multiple heads
  • default channel
  • should construct an upgrade graph when selectedBundles are used
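
A minimal filtering sketch using simplified stand-in types rather than the real operator-registry declarative config API; it only shows the package/channel selection step, while head validation, default-channel handling, and cross-channel graphs would come on top:

```go
package main

import "fmt"

// Simplified stand-ins for file-based catalog entries; the real implementation
// would use the operator-registry declarative config types.
type Channel struct {
	Package string
	Name    string
	Bundles []string
}

type Catalog struct {
	Packages map[string]bool // set of package names present in the catalog
	Channels []Channel
}

// filterCatalog keeps only the packages listed in keep, and only the channels
// (and the bundles they reference) belonging to those packages. Single-head and
// default-channel validation of the remaining graph would follow this step.
func filterCatalog(in Catalog, keep map[string]bool) Catalog {
	out := Catalog{Packages: map[string]bool{}}
	for pkg := range in.Packages {
		if keep[pkg] {
			out.Packages[pkg] = true
		}
	}
	for _, ch := range in.Channels {
		if keep[ch.Package] {
			out.Channels = append(out.Channels, ch)
		}
	}
	return out
}

func main() {
	in := Catalog{
		Packages: map[string]bool{"aws-load-balancer-operator": true, "node-observability-operator": true},
		Channels: []Channel{
			{Package: "aws-load-balancer-operator", Name: "stable-v1", Bundles: []string{"v1.0.0", "v1.1.0"}},
			{Package: "node-observability-operator", Name: "alpha", Bundles: []string{"v0.1.0"}},
		},
	}
	out := filterCatalog(in, map[string]bool{"aws-load-balancer-operator": true})
	fmt.Printf("kept %d package(s), %d channel(s)\n", len(out.Packages), len(out.Channels))
}
```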

This story is about creating an image that contains opm, the declarative config (and optionally the cache)
Multiple solutions here:

  • go-containerregistry as in v1
  • buildah (with a dockerfile generated from opm generate dockerfile)
  • oras (with a dockerfile generated from opm generate dockerfile)
  • an external call to podman (sketched after this list)
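
A minimal sketch of the last option, shelling out to opm and podman: generate a Dockerfile for the declarative config with opm, then build and push the catalog image. The opm/podman invocations shown are the commonly used ones and the file names are illustrative, so treat the exact flags as assumptions; a real implementation would also honor registries.conf, proxies, and multi-arch builds.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// buildCatalogImage generates a Dockerfile for the filtered declarative config
// directory and builds and pushes the catalog image with podman.
func buildCatalogImage(fbcDir, dockerfile, imageRef string) error {
	steps := [][]string{
		{"opm", "generate", "dockerfile", fbcDir},                  // write a Dockerfile for the FBC directory
		{"podman", "build", "-f", dockerfile, "-t", imageRef, "."}, // build the catalog image
		{"podman", "push", imageRef},                               // push to the mirror registry
	}
	for _, args := range steps {
		cmd := exec.Command(args[0], args[1:]...)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			return fmt.Errorf("%v failed: %w", args, err)
		}
	}
	return nil
}

func main() {
	// Paths and image reference are hypothetical.
	if err := buildCatalogImage("catalogs/redhat-operator-index",
		"catalogs/redhat-operator-index.Dockerfile",
		"registry.example.com/olm/redhat-operator-index:v4.18"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```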

 

Acceptance criteria:

  • should build image in enclave environment (with registries.conf)
  • should build image behind a proxy
  • should build a multi-arch image
  • Should build an image with catalog source in a way that it runs properly on clusters

 

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

Merge the CSI driver operator into csi-operator repo and re-use asset generator and CSI operator code there.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

Maintaining a separate CSI driver operator repo is hard, especially when dealing with CVEs and library bumps. In addition, we could share even more code when moving all CSI driver operators into a single repo. Having a common repo across drivers will ease the maintenance burden.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

As cluster admin, I upgrade my cluster to a version with this epic implemented and I do not see any change, the CSI driver works the same as before. (Some pods, their containers or services may get renamed during the upgrade process).

As OCP developer, I have 1 less repo to worry about when fixing a CVE / bumping library-go or Kubernetes libraries.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both yes
Classic (standalone cluster) yes
Hosted control planes all
Multi node, Compact (three node), or Single node (SNO), or all all
Connected / Restricted Network all
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) all
Operator compatibility  
Backport needed (list applicable versions) no
UI need (e.g. OpenShift Console, dynamic plugin, OCM) no
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

As cluster admin, I upgrade my cluster to a version with this epic implemented and I do not see any change, the CSI driver works the same as before. (Some pods, their containers or services may get renamed during the upgrade process).

As OCP developer, I have 1 less repo to worry about when fixing a CVE / bumping library-go or Kubernetes libraries.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

N/A; this includes all the CSI operators Red Hat manages as part of OCP

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

This effort started with the CSI operators that we included for HCP; we want to align all CSI operators to use the same approach in order to limit maintenance efforts.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

Not customer facing, this should not introduce any regression.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

No doc needed

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

N/A, it's purely tech debt / internal

Epic Goal*

Merge the CSI driver operator into csi-operator repo and re-use asset generator and CSI operator code there.

 
Why is this important? (mandatory)

Maintaining a separate CSI driver operator repo is hard, especially when dealing with CVEs and library bumps. In addition, we could share even more code when moving all CSI driver operators into a single repo.

 
Scenarios (mandatory) 

As cluster admin, I upgrade my cluster to a version with this epic implemented and I do not see any change, the CSI driver works the same as before. (Some pods, their containers or services may get renamed during the upgrade process).

As OCP developer, I have 1 less repo to worry about when fixing a CVE / bumping library-go or Kubernetes libraries.

 
Dependencies (internal and external) (mandatory)

None, this can be done just by the storage team and independently on other operators / features.

Contributing Teams(and contacts) (mandatory) 

  • Development - 
  • QE - 

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be "Release Pending" 

Epic Goal*

Merge the CSI driver operator into csi-operator repo and re-use asset generator and CSI operator code there.

 
Why is this important? (mandatory)

Maintaining a separate CSI driver operator repo is hard, especially when dealing with CVEs and library bumps. In addition, we could share even more code when moving all CSI driver operators into a single repo.

 
Scenarios (mandatory) 

As cluster admin, I upgrade my cluster to a version with this epic implemented and I do not see any change, the CSI driver works the same as before. (Some pods, their containers or services may get renamed during the upgrade process).

As OCP developer, I have 1 less repo to worry about when fixing a CVE / bumping library-go or Kubernetes libraries.

Note: we do not plan to make any changes for HyperShift. The EFS CSI driver will still fully run in the guest cluster, including its control plane.

Dependencies (internal and external) (mandatory)

None, this can be done just by the storage team and independently on other operators / features.

Contributing Teams(and contacts) (mandatory) 

  • Development - 
  • QE - 

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be "Release Pending" 

Epic Goal*

Merge the CSI driver operator into csi-operator repo and re-use asset generator and CSI operator code there.

 
Why is this important? (mandatory)

Maintaining a separate CSI driver operator repo is hard, especially when dealing with CVEs and library bumps. In addition, we could share even more code when moving all CSI driver operators into a single repo.

 
Scenarios (mandatory) 

As cluster admin, I upgrade my cluster to a version with this epic implemented and I do not see any change, the CSI driver works the same as before. (Some pods, their containers or services may get renamed during the upgrade process).

As OCP developer, I have 1 less repo to worry about when fixing a CVE / bumping library-go or Kubernetes libraries.

 
Dependencies (internal and external) (mandatory)

None, this can be done just by the storage team and independently on other operators / features.

Contributing Teams(and contacts) (mandatory) 

  • Development - 
  • QE - 

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be "Release Pending" 

Following the steps described in the enhancement, we should do the following:

  • Move existing code from openstack-cinder-csi-driver-operator into a legacy directory in the csi-operator repository,
  • Add a Dockerfile for building the operator image from the new location,
  • Update openshift/release to build the image from new location,
  • Change ocp-build-data repository to ship image from new location, and
  • Coordinate merges in ocp-build-data and release repository

Once this is done, we can work towards rewriting the operator to take advantage of the new generator tooling used for existing migrated operators.

In OSASINFRA-3609 we moved the existing Cinder CSI Driver Operator from openshift/openstack-cinder-csi-driver-operator to openshift/csi-operator, adding the contents of the former in a legacy/openstack-cinder-csi-driver-operator directory in the latter. Now, we need to rework or adapt this migrated code to integrate it fully into csi-operator.

Following the steps described in the enhancement, we should do the following:

  • Move the operator to the new structure in csi-operator, and
  • Make post-migration changes, including:
    • Ensuring that we have the test manifests available in the test/e2e directory,
    • Ensuring nothing in the release repository relies on the legacy directory, and
    • Removing the legacy directory

Once this work is complete, we can investigate adding HyperShift support to this driver. That work will be tracked and addressed via a separate epic.

Feature Overview (aka. Goal Summary)  

Intel VROC (Virtual RAID on CPU) is a nontraditional RAID option that can offer some management and potential performance improvements compared to a traditional hardware raid. RAID devices can be set up from firmware or via remote management tools and present as MD devices.

Initial support was delivered in OpenShift 4.16. This feature is to enhance that support by:

  • streamlining the process
  • plumbing through to the agent installer and baremetal IPI

Out of Scope

Any technologies not already supported by the RHEL kernel.

Background

https://www.intel.com/content/www/us/en/software/virtual-raid-on-cpu-vroc.html

 

 
Interoperability Considerations

Feature goal (what are we trying to solve here?)

Allow users of Intel VROC hardware to deploy OpenShift to it via the Assisted Installer.

https://www.intel.com/content/www/us/en/software/virtual-raid-on-cpu-vroc.html

Currently the support only exists with UPI deployments. The Assisted Installer blocks it.

DoD (Definition of Done)

Assisted Installer can deploy to hardware using the Intel VROC.

Does it need documentation support?

Yes

Feature origin (who asked for this feature?)

  • A Customer asked for it

 

  • Catching up with OpenShift

Intel VROC support exists in OpenShift, just not the Assisted Installer, this epic seeks to add it.

Reasoning (why it’s important?)

We support Intel VROC with OpenShift UPI but Assisted Installer blocks it. Please see https://issues.redhat.com/browse/SUPPORTEX-22763 for full details of testing and results.

Customers using Intel VROC with OpenShift will want to use Assisted Installer for their deployments. As do we.

Competitor analysis reference

TBC

Feature usage (do we have numbers/data?)

Assisted installer is part of NPSS so this will benefit Telco customers using NPSS with Intel VROC.

Feature availability (why should/shouldn't it live inside the UI/API?)

Brings Assisted installer into alignment with the rest of the product.

Feature Overview (aka. Goal Summary)  

Generally speaking, customers and partners should not be installing packages client-side, i.e. `rpm-ostree install $pkg` directly on the nodes. It's not officially supported outside of troubleshooting situations, but the documentation is not very explicit on this point and we have anecdotal data that customers and partners do in fact install packages directly on hosts.

Adding some telemetry to help understand how common this is among data-reporting clusters. Hopefully such data will help us understand how important it is to preserve this ability in the bootc-world. While it's not a pattern we want to encourage, we should be careful about dropping it without considering how to avoid breaking users' clusters in unexpected ways.

Goals (aka. expected user outcomes)

Understand what % of machines (or a proxy thereof) have locally layered packages which aren't CoreOS extensions.

Requirements (aka. Acceptance Criteria):

This needs to be backported to 4.14 so we have a better sense of the fleet as it is.

4.12 might be useful as well, but is optional.

Questions to Answer (Optional):

Why not simply block upgrades if there are locally layered packages?

That is indeed an option. This card is only about gathering data.

Customer Considerations

Some customers are known to layer packages locally but it's worse if the issue is a third party integration. In such a case, if the add-on breaks, the customer will call the 3rd party first because that's what appears to be broken. It may be a long, undelightful trip to get to a satisfying resolution. If they are blocked on upgrade due to that 3rd party integration they may not be able to upgrade the OCP y-version. That could be a lengthy delay.

Description copied from attached feature card: https://issues.redhat.com/browse/OCPSTRAT-1521

 

Feature Overview (aka. Goal Summary)  

Generally speaking, customers and partners should not be installing packages client-side, i.e. `rpm-ostree install $pkg` directly on the nodes. It's not officially supported outside of troubleshooting situations, but the documentation is not very explicit on this point and we have anecdotal data that customers and partners do in fact install packages directly on hosts.

Adding some telemetry to help understand how common this is among data-reporting clusters. Hopefully such data will help us understand how important it is to preserve this ability in the bootc-world. While it's not a pattern we want to encourage, we should be careful about dropping it without considering how to avoid breaking users' clusters in unexpected ways.

Goals (aka. expected user outcomes)

Understand what % of machines (or a proxy thereof) have locally layered packages which aren't CoreOS extensions.

Requirements (aka. Acceptance Criteria):

This needs to be backported to 4.14 so we have a better sense of the fleet as it is.

4.12 might be useful as well, but is optional.

Questions to Answer (Optional):

Why not simply block upgrades if there are locally layered packages?

That is indeed an option. This card is only about gathering data.

Customer Considerations

Some customers are known to layer packages locally but it's worse if the issue is a third party integration. In such a case, if the add-on breaks, the customer will call the 3rd party first because that's what appears to be broken. It may be a long, undelightful trip to get to a satisfying resolution. If they are blocked on upgrade due to that 3rd party integration they may not be able to upgrade the OCP y-version. That could be a lengthy delay.

Create an e2e test that confirms that metrics collection in the MCD works and that it collects unsupported package installations made using rpm-ostree.

Implement the logic in the MCO daemon to collect the defined metrics and send them to Prometheus. On the Prometheus side, this will involve some manipulation in `metrics.go` (a collection sketch follows the acceptance criteria).

Acceptance Criteria:
1. The MCO daemon should collect package installation data (defined from the spike MCO-1275) during its normal operation.
2. The daemon should report this data to Prometheus at a specified time interval (defined from spike MCO-1277).
3. Include error handling for scenarios where the rpm-ostree command fails or returns unexpected results.
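
A minimal sketch of the collection step, assuming `rpm-ostree status --json` as the data source and a hypothetical gauge name; filtering out CoreOS extensions and the reporting interval (MCO-1277) are left out:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"

	"github.com/prometheus/client_golang/prometheus"
)

// layeredPackages is a hypothetical gauge; the real metric name and labels
// would come out of the MCO-1275 spike.
var layeredPackages = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "mcd_local_layered_packages",
	Help: "Number of locally layered rpm-ostree packages on this node (hypothetical metric).",
})

// rpmOstreeStatus mirrors only the parts of `rpm-ostree status --json` this
// sketch cares about; treat the field names as an assumption.
type rpmOstreeStatus struct {
	Deployments []struct {
		Booted   bool     `json:"booted"`
		Packages []string `json:"packages"`
	} `json:"deployments"`
}

func collectLayeredPackages() error {
	out, err := exec.Command("rpm-ostree", "status", "--json").Output()
	if err != nil {
		// Error handling per the acceptance criteria: surface the failure
		// instead of silently reporting zero.
		return fmt.Errorf("rpm-ostree status failed: %w", err)
	}
	var st rpmOstreeStatus
	if err := json.Unmarshal(out, &st); err != nil {
		return fmt.Errorf("unexpected rpm-ostree output: %w", err)
	}
	for _, d := range st.Deployments {
		if d.Booted {
			layeredPackages.Set(float64(len(d.Packages)))
		}
	}
	return nil
}

func main() {
	prometheus.MustRegister(layeredPackages)
	if err := collectLayeredPackages(); err != nil {
		fmt.Println(err)
	}
	// In the daemon this would run on the defined interval and the registry
	// would be scraped by the cluster monitoring stack.
}
```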

Feature Overview

Ability to install OpenShift on Nutanix with nodes having multiple NICs (multiple subnets) from IPI and for autoscaling with MachineSets.

 

Feature Overview

Ability to install OpenShift on Nutanix with nodes having multiple NICs (multiple subnets) from IPI and for autoscaling with MachineSets.

Phase 3 Deliverable:

TBD

Epic Goal

  • To be refined based on initial feedback on GA

Why is this important?

  •  

Scenarios

  1. As a cluster admin, I want to reconfigure sudo without disrupting workloads.
  2. As a cluster admin, I want to update or reconfigure sshd and reload the service without disrupting workloads.
  3.  As a cluster admin, I want to remove mirroring rules from an ICSP, ITMS, or IDMS object without disrupting workloads, because the scenario in which this might lead to non-pullable images at an undefined later point in time doesn't apply to me.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

Implement authorization to secure API access for different user personas/actors in the agent-based installer.

User Personas:

  • Read-Only Access: For "wait-for" and "monitor-add-nodes" commands.
  • Read-Write Access: For systemd services and the agent service.

This is 

Goals

The agent-based installer APIs have implemented basic security measures through authentication, as covered in AGENT-145.

To further enhance security, it is crucial to implement user persona/actor-based authorization, allowing for differentiated access control, such as read-only or read-write permissions, based on the user's role.

The goal of this implementation is to provide a more robust and secure API framework, ensuring that users can only perform actions appropriate to their role.

Epic Goal

  • Implement authorization to secure API access for different user personas/actors in the agent-based installer.
  • User Personas:
    • Read-Only Access: For "wait-for" and "monitor-add-nodes" commands.
    • Read-Write Access: For systemd services and the agent service.

Why is this important?

  • The agent-based installer APIs have implemented basic security measures through authentication, as covered in AGENT-145. To further enhance security, it is crucial to implement user persona/actor-based authorization, allowing for differentiated access control, such as read-only or read-write permissions, based on the user's role. This approach will provide a more robust and secure API framework, ensuring that users can only perform actions appropriate to their role. 

Scenarios

  1. Users running the wait-for or monitor-add-nodes commands should have read-only permissions. They should not be able to write to the API. If they attempt to perform write operations, appropriate error messages could be displayed, indicating that they are not authorized to write.
  2. Users associated with running systemd services should have both read and write permissions.
  3. Users associated with running the agent service should also have read and write permissions.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1.  

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a developer working on the Assisted Service, I want to:

  • Identify the type of authorization provided to each endpoint.
  • Ensure that the token claims match the authorization level that the endpoint is supposed to have based on the authenticator scheme.
  • Use the authenticator scheme to derive the security definition, ensuring that the correct authorization is enforced for each endpoint.

so that I can achieve

  • Proper authorization control across all endpoints.
  • Verification that only authorized users can access specific endpoints based on their token claims.

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a wait-for and monitor-add-nodes user, I want to be able to:

  • Access relevant endpoints in a read-only capacity to monitor the progress of hosts joining a cluster.
  • View and verify the addition of nodes to an existing cluster without the ability to make changes.
  • Receive accurate and up-to-date information related to waiting for hosts and monitoring added nodes without requiring administrative permissions.

So that I can achieve:

  • Secure and restricted read-only access to essential information, ensuring that there is no risk of unintended modifications.
  • Prevent unauthorized or unintended changes to the system, maintaining the security and integrity of the environment.
  • Facilitate clear and appropriate authorization for the read-only role

Acceptance Criteria:

Description of criteria:

  • The swagger.yaml file must be updated to include read-only security definitions specifically for the wait-for and monitor-add-nodes user personas.
  • The relevant endpoints should be configured to utilize these read-only security definitions.
  • Ensure that the wait-for and monitor-add-nodes users can only view data without the ability to make changes.
  • The changes must be tested and validated to confirm the correct implementation of read-only access.
  •  

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a user with userAuth, agentAuth, and watcherAuth persona (wait-for and monitor-add-nodes):

  • I want to be able to authorize actions specific to each user persona (user, agent, watcher) based on predefined claims.
  • I want to ensure that each persona's actions are validated against the claims agreed upon by the installer and Assisted Service.
  • I want to enforce role-based permissions to control access and operations during the installation process.

So that I can achieve:

  • Proper authorization of actions according to each persona's role.
  • Secure execution of tasks by validating them against agreed claims.
  • Controlled access to resources and operations, reducing the risk of unauthorized actions during installation.

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Feature Overview

Support mapping OpenShift zones to vSphere host groups, in addition to vSphere clusters.

When defining zones for vSphere, administrators can map regions to vSphere datacenters and zones to vSphere clusters.

There are use cases where a vSphere environment has only one cluster construct containing all its ESXi hosts, but the administrators want to divide the ESXi hosts into host groups. A common example is vSphere stretched clusters, where there is only one logical vSphere cluster but the ESXi nodes are distributed across two physical sites and grouped by site in vSphere host groups.

In order for OpenShift to be able to distribute its nodes on vSphere matching the physical grouping of hosts, OpenShift zones have to be able to map to vSphere host groups too.

Requirements

  • Users can define OpenShift zones mapping them to host groups at installation time (day 1)
  • Users can use host groups as OpenShift zones post-installation (day 2)

Epic Goal

Support mapping OpenShift zones to vSphere host groups, in addition to vSphere clusters.

When defining zones for vSphere, administrators can map regions to vSphere datacenters and zones to vSphere clusters.

There are use cases where a vSphere environment has only one cluster construct containing all its ESXi hosts, but the administrators want to divide the ESXi hosts into host groups. A common example is vSphere stretched clusters, where there is only one logical vSphere cluster but the ESXi nodes are distributed across two physical sites and grouped by site in vSphere host groups.

In order for OpenShift to be able to distribute its nodes on vSphere matching the physical grouping of hosts, OpenShift zones have to be able to map to vSphere host groups too.

Requirements

  • Users can define OpenShift zones mapping them to host groups at installation time (day 1)
  • Users can use host groups as OpenShift zones post-installation (day 2)

Feature Overview

Support in the IPI installer for OpenShift on vSphere to create the OpenShift node VMs with multiple NICs and subnets.

This is necessary when users want to have dedicated network links in the node VMs, for example for storage or database traffic, in addition to the service network link that we create today.

Requirements

Users can specify multiple NICs for the OpenShift VMs that will be created for the OpenShift cluster nodes with different subnets.

Epic Goal

Support in the IPI installer for OpenShift on vSphere to create the OpenShift node VMs with multiple NICs and subnets.

This is necessary when users want to have dedicated network links in the node VMs, for example for storage or database traffic, in addition to the service network link that we create today.

Requirements

Users can specify multiple NICs for the OpenShift VMs that will be created for the OpenShift cluster nodes with different subnets.

Description:

The machine config operator needs to be bumped to pick up the API change:

I0819 17:50:00.396986       1 machineconfig.go:87] ControllerConfig not found, creating new one
E0819 17:50:00.400599       1 machineconfig.go:90] Failed to create ControllerConfig: ControllerConfig.machineconfiguration.openshift.io "machine-config-controller" is invalid: [spec.infra.spec.platformSpec.vsphere.failureDomains[0].topology.networks: Too many: 2: must have at most 1 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
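For reference, the value the error refers to can be inspected directly (a diagnostic sketch only, using the field path taken from the error message):

$ oc get controllerconfigs.machineconfiguration.openshift.io machine-config-controller \
    -o jsonpath='{.spec.infra.spec.platformSpec.vsphere.failureDomains[0].topology.networks}'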

 

Acceptance Criteria:

Description:

The infrastructure spec validation needs to be updated to change the network count restriction to 10 (https://configmax.esp.vmware.com/guest?vmwareproduct=vSphere&release=vSphere%208.0&categories=1-0).

 

When multiple NICs are enabled (does the installer allow this?), bootstrapping fails with:

Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: [#1673] failed to create some manifests:
Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: "cluster-infrastructure-02-config.yml": failed to create infrastructures.v1.config.openshift.io/cluster -n : Infrastructure.config.openshift.io "cluster" is invalid: [spec.platformSpec.vsphere.failureDomains[0].topology.networks: Too many: 2: must have at most 1 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]

 

Acceptance Criteria:

  • API changes are tested in a payload along with MAPI and the installer

 

issue created by splat-bot

 

USER STORY:

As an OpenShift provisioner, I want to provision a cluster in which nodes have multiple network adapters so that I can implement the desired network topology.

DESCRIPTION:

Customers need to provision nodes with multiple adapters on day 0. CAPV supports the ability to specify multiple adapters in its clone spec. The installer should be augmented to support additional NICs.

Required:

  •  

Nice to have:

...

ACCEPTANCE CRITERIA:

  • install-config.yaml is updated to allow multiple NICs
  • CI job testing an install with 2 network adapters
  • Validation of multiple network adapters

ENGINEERING DETAILS:

The machine API is failing to render compute nodes when multiple NICs are configured:

Unable to apply 4.17.0-0.ci.test-2024-08-15-193100-ci-ln-igm0nhk-latest: ControllerConfig.machineconfiguration.openshift.io "machine-config-controller" is invalid: [spec.infra.spec.platformSpec.vsphere.failureDomains[0].topology.networks: Too many: 2: must have at most 1 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]

Description:

Bump machine-api to pick up changes in openshift/api#2002.

Acceptance Criteria:

  • openshift/api#2002 is merged
  • openshift/library-go#1777 is merged
  • this PR is merged

issue created by splat-bot

Feature Overview

Improve the cluster expansion with the agent workflow added in OpenShift 4.16 (TP) and OpenShift 4.17 (GA) with:

  • Caching the RHCOS image for faster node addition (i.e., no extraction of the image every time)
  • Adding a single node with just one command, with no need to write config files describing the node
  • Support for creating PXE artifacts

Goals

Improve the user experience and functionality of the commands to add nodes to clusters using the image creation functionality.

Epic Goal

  • Cleanup/carryover work from AGENT-682 and WRKLDS-937 that were non-urgent for GA of the day 2 implementation

Currently dev-scripts supports the add-nodes workflow only via the ISO. We should be able to select the mode used to add a node via explicit config variables, so that the PXE approach can also be used.

Improve the output shown by the monitor command, especially in the case of multiple nodes, so that it is more readable.

Note
A possible approach could be to change the monitoring logic into a polling loop where nodes are grouped by "stages". A stage represents the point the node has reached in the add workflow (the stages have not yet been defined).

The add-nodes-image command may also generate PXE artifacts (instead of the ISO). This will require an additional command flag (and a review of the command name).

(Also evaluate the possibility of using a sub-command instead.)

Currently the oc node-image create command looks for the kube-system/cluster-config-v1 resource to infer some of the required elements for generating the ISO.
The main issue is that the kube-system/cluster-config-v1 resource may be stale, since it contains information used when the cluster was installed, and that may have changed during the lifetime of the cluster.

tech note about the replacement

Field: Source
APIDNSName: oc get infrastructure cluster -o=jsonpath='{.status.apiServerURL}'
ImageDigestSource: oc get imagedigestmirrorsets image-digest-mirror -o=jsonpath='{.spec.imageDigestMirrors}'
ImageContentSources: oc get imagecontentsourcepolicy
ClusterName: derived from APIDNSName (api.<cluster name>.<base domain>)
SSHKey: oc get machineconfig 99-worker-ssh -o jsonpath='{.spec.config.passwd.users[0].sshAuthorizedKeys}'
FIPS: oc get machineconfig 99-worker-ssh -o jsonpath='{.spec.fips}'
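As an illustration of the ClusterName derivation above (a sketch only; the exact parsing is an assumption):

$ API_URL=$(oc get infrastructure cluster -o=jsonpath='{.status.apiServerURL}')
$ # strip the scheme, the leading "api." label and the port to obtain <cluster name>.<base domain>
$ echo "${API_URL}" | sed -e 's|^https://api\.||' -e 's|:[0-9]*$||'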

(see also Zane Bitter comment in https://issues.redhat.com/browse/OCPBUGS-38802)

Currently the oc node-image create command does not report any relevant information that could help the user understand where each element was retrieved from (for example, the SSH key), making it more difficult to troubleshoot a potential issue.

For this reason, it could be useful for the node-joiner tool to produce a proper JSON file reporting all the details about the relevant resources fetched to generate the image. The oc command should be able to expose them when required (i.e., via a command flag).

Currently the error reporting of the oc node-image create command is pretty rough, as it prints to the console the log traces captured from the node-joiner pod's standard output. Even though this could help the user understand the problem, many unnecessary technical details are exposed, making the overall experience cumbersome.

For this reason, the node-joiner tool should generate a proper JSON file with the outcome of the action, including any error messages encountered.
The oc command should fetch that JSON output and report it in the console, instead of showing the node-joiner pod logs.

Also provide a flag to report the full pod logs for troubleshooting.
Manage backward compatibility with older versions of node-joiner that do not support the enhanced output.

Support adding nodes using PXE files instead of ISO.

 

Questions

  • What kind of interface would be recommended?
    • Use a different command for generating the pxe artifacts
    • Use a flag for the existing commands

 

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

A set of capabilities need to be added to the Hypershift Operator that will enable AWS Shared-VPC deployment for ROSA w/ HCP.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

Build capabilities into HyperShift Operator to enable AWS Shared-VPC deployment for ROSA w/ HCP.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

Antoni Segura Puimedon Please help with providing what Hypershift will need on the OCPSTRAT side.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations (list applicable specific needs; N/A = not applicable):
  • Self-managed, managed, or both: both (perhaps)
  • Classic (standalone cluster):
  • Hosted control planes: yes
  • Multi node, Compact (three node), or Single node (SNO), or all:
  • Connected / Restricted Network:
  • Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): x86_64 and Arm
  • Operator compatibility:
  • Backport needed (list applicable versions): 4.14+
  • UI need (e.g. OpenShift Console, dynamic plugin, OCM): no (this is an advanced feature not being exposed via web-UI elements)
  • Other (please specify): ROSA w/ HCP

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>


Currently the same security group (SG) is used for both workers and the VPC endpoint. Create a separate SG for the VPC endpoint and open only the necessary ports on each.

"Shared VPCs" are a unique AWS infrastructure design: https://docs.aws.amazon.com/vpc/latest/userguide/vpc-sharing.html

See prior work/explanations/etc here: https://issues.redhat.com/browse/SDE-1239

 

Summary is that in a Shared VPC environment, a VPC is created in Account A and shared to Account B. The owner of Account B wants to create a ROSA cluster, however Account B does not have permissions to create a private hosted zone in the Shared VPC. So they have to ask Account A to create the private hosted zone and link it to the Shared VPC. OpenShift then needs to be able to accept the ID of that private hosted zone for usage instead of creating the private hosted zone itself.
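As a rough sketch of that hand-off (placeholder names and IDs only; not a prescribed procedure):

$ # run with Account A credentials (the Shared VPC owner)
$ aws route53 create-hosted-zone \
    --name <cluster-domain> \
    --caller-reference "$(date +%s)" \
    --vpc VPCRegion=<region>,VPCId=<shared-vpc-id>
$ # the returned hosted zone ID is then handed to Account B for the cluster installation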

QE should have some environments or testing scripts available to test the Shared VPC scenario

 

The AWS endpoint controller in the CPO currently uses the control plane operator role to create the private link endpoint for the hosted cluster as well as the corresponding DNS records in the hypershift.local hosted zone. If a role is created to allow it to create that VPC endpoint in the VPC owner's account, the controller would have to explicitly assume that role so it can create the VPC endpoint, and potentially a separate role for populating DNS records in the hypershift.local zone.

The users would need to create a custom policy to enable this.

Add the necessary API fields to support a Shared VPC infrastructure, and enable development/testing of Shared VPC support by adding the Shared VPC capability to the hypershift CLI.

Feature Overview

Console enhancements based on customer RFEs that improve customer user experience.

 

Goals

  • This Section: Provide a high-level goal statement, providing user context and the expected user outcome(s) for this feature

 

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.

 

  • CI - MUST be running successfully with test automation (this is a requirement for ALL features) - isMvp: YES
  • Release Technical Enablement - Provide necessary release enablement details and documents - isMvp: YES

 

(Optional) Use Cases

This Section: 

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

 

Questions to answer…

  • ...

 

Out of Scope

 

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

 

Assumptions

  • ...

 

Customer Considerations

  • ...

 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Problem: 

As a user, I want to access the Import from Git and Container image form from the admin perspective as well.

Goal:

Provide Import from Git and Container image options that redirect users to the respective forms.

Why is it important?

Use cases:

  1. Users can navigate to Import from Git and Container image form from the Admin perspective. 

Acceptance criteria:

  1. Change Import YAML to a dropdown
  2. Add 3 menu actions 
    1. Import YAML
    2. Import from GIT
    3. Container image
  3. Add a tooltip `Quick create`

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

As a user, I want to access the Import from Git and Container image forms from anywhere in the console.

Acceptance Criteria

  1. Update Import YAML button to a dropdown
  2. Add 3 options 
    1. Import YAML
    2. Import from Git
    3. Container Image
  3. Add a tooltip 'Quick create'
  4. Add e2e tests

Additional Details:


The customer would like to be able to start individual CronJobs manually via a button in the OpenShift web console, without having to use the oc CLI.

To start a Job from a CronJob using the CLI, the following command is used:

$ oc create job a-cronjob --from=cronjob/a-cronjob

 

AC:

  • Add a 'Start Job' option to both List and Details pages for a CronJob
  • Add an integration test

 

Created from https://issues.redhat.com/browse/RFE-6131 

As a cluster admin, I want a cluster-wide setting to hide the "Getting started resources" banner from the Overview page for all console users.

 

AC: 

  • Add a new field to the console-operator's config, in its 'spec.customization' section, which controls the visibility of the banner. The new field should be named 'GettingStartedBanner' and should be an enum with states "Show" and "Hide".
  • By default the banner should be shown (the "Show" state)
  • Pass the state variable to the console-config CM
  • Add e2e and integration test

 

RFE: https://issues.redhat.com/browse/RFE-4475
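If the field lands as described above, toggling the banner could look roughly like this (hypothetical field name and casing; the final API shape may differ):

$ oc patch consoles.operator.openshift.io cluster --type=merge \
    -p '{"spec":{"customization":{"gettingStartedBanner":"Hide"}}}'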

As a cluster admin, I want a cluster-wide setting to hide the "Getting started resources" banner from the Overview page for all console users.

 

AC: 

  • Console will read the value of 'GettingStartedBannerState' on start and set it as a SERVER_FLAG. Based on the value it will render the "Getting started resources" banner
  • Add integration test

 

RFE: https://issues.redhat.com/browse/RFE-4475

Problem: ODC UX improvements based on customer RFEs.

Goal:

Why is it important?

Use cases:

  1. <case>

Acceptance criteria:

  1. Add dark/light mode support for the YAML editor, matching the console theme

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

When typing quickly in a search field in OperatorHub and other catalogs, the browser slows down. The input should be debounced, so that searching for operators on OperatorHub will feel faster.

Acceptance Criteria

  1. Searching in tile view pages is debounced

Additional Details:

Description

As a user who is visually impaired, or a user who is out in the sun, when I switch the console theme to Light mode and then try to edit text files (e.g., the YAML configuration for a pod) using the web console, I want the editor to use a light theme.

Acceptance Criteria

  1. The CodeEditor component should change its base theme from vs-dark to vs-light when the console theme is changed from dark to light.
  2. Similarly, when the console theme is changed from light to dark, the base theme for the Monaco editor should change from vs-light to vs-dark.

Additional Details:

Feature Overview

Allow users to create an RHCOS image to be used for bootstrapping new clusters.

Goals

The IPI installer currently uploads the RHCOS image to all AOS clusters. In environments where each cluster is on a different subnet, this uses unnecessary bandwidth and takes a long time on low-bandwidth networks.

The goal is to use a pre-existing VM image in Prism Central to bootstrap the cluster.

Epic Goal

Allow users to create an RHCOS image to be used for bootstrapping new clusters.

The IPI installer currently uploads the RHCOS image to all AOS clusters. In environments where each cluster is on a different subnet, this uses unnecessary bandwidth and takes a long time on low-bandwidth networks.

The goal is to use a pre-existing VM image in Prism Central to bootstrap the cluster.

Feature Overview (aka. Goal Summary)  

Add support to GCP C4/C4A Machine Series to be used as Control Plane and Compute Nodes when deploying Openshift on Google Cloud

Goals (aka. expected user outcomes)

As a user, I want to deploy OpenShift on Google Cloud using C4/C4A Machine Series for the Control Plane and Compute Node so I can take advantage of these new Machine types

Requirements (aka. Acceptance Criteria):

OpenShift can be deployed in Google Cloud using the new C4/C4A Machine Series for the Control Plane and Compute Nodes starting in OpenShift 4.17.z

 

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  both
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all all
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Background

Google has made C4/C4A Machine Series available on their cloud offering.

Documentation Considerations

The documentation will be updated adding the new disk type that needs to be supported as part of this enablement. Also the C4/C4A Machine Series will be added as tested Machine types for Google Cloud when deploying OpenShift

Epic Goal

Why is this important?

  • This is a new Machine Series Google has introduced that customers will use for their OpenShift deployments

Scenarios

  1. Deploy an OpenShift Cluster with both the Control Plane and Compute Nodes running on C4 GCP Machines
  2. Deploy an OpenShift Cluster with both the Control Plane and Compute Nodes running on C4A GCP Machines

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. Hyperdisk-balanced enablement work via OCPSTRAT-1496

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

1. Add C4 and C4A instances to the list of tested instances in the docs.

2. Document that not all zones can be used for installation of these machine types. The installer has no way to know whether these instances can actually be installed in a given zone. To install successfully in such zones, specify the zones in the control plane and compute machine pools (in the install config).
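One way to check ahead of time which zones offer a given machine type (an illustrative gcloud query; the machine type name below is only an example):

$ gcloud compute machine-types list --filter="name=c4a-standard-4" --format="value(zone)" | sort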

Feature Overview (aka. Goal Summary)

The transition from runc to crun is part of OpenShift’s broader strategy for improved performance and security. In OpenShift clusters with hosted control planes, retaining the original runtime during upgrades was considered complex and unnecessary, given the success of crun in tests and the lack of proof for significant risk. This decision aligns with OpenShift’s default container runtime upgrade and simplifies long-term support.

Requirements (aka. Acceptance Criteria)

  1. Transparent Runtime Change: The switch to crun should be seamless, with minimal disruption to the user experience. Any workload impacts should be minimal and well-communicated.
  2. Documentation: Clear documentation should be provided, explaining the automatic runtime switch, outlining potential performance impacts, and offering guidance on testing workloads after the transition.

Deployment Considerations

Deployment configurations and specific needs:
  • Self-managed, managed, or both: Both
  • Classic (standalone cluster): N/A
  • Hosted control planes: Yes
  • Multi-node, Compact (three-node), SNO: All
  • Connected / Restricted Network: N/A
  • Architectures (x86_64, ARM, IBM Power, IBM Z): All
  • Backport needed: None
  • UI Needs: No additional UI needs. OCM may require an acknowledgment for the runtime change.

Use Cases

Scenario 1:
A user upgrading from OpenShift 4.17 to 4.18 in a HyperShift environment has NodePools running runc. After the upgrade, the NodePools automatically switch to crun without user intervention, providing consistency across all clusters.

 

Scenario 2:
A user concerned about performance with crun in 4.18 can create a new NodePool to test workloads with crun while keeping existing NodePools running runc. This allows for gradual migration, but default behavior aligns with the crun upgrade. 

 

Scenario 2 needs to be well documented as a best practice.

Questions to Answer

  • How will this automatic transition to crun affect workloads that rely on specific performance characteristics of runc?
  • Are there edge cases where switching to crun might cause compatibility issues with older OpenShift configurations or third-party tools?

Out of Scope

  • Supporting retention of runc as the default runtime post-upgrade is not part of this feature.
  • Direct runtime configuration options for individual NodePools are not within scope, as the goal is to align with OpenShift defaults and reduce complexity.

Documentation Considerations

Based on this conversation, we should make sure we document the following:

  • Canary Update Strategy:
    • Highlight the benefits of HyperShift and HCP as an architecture that decouples NodePool and control plane upgrades, better enabling the canary upgrade pattern.
    • Reuse or create new docs around canary upgrades with HCP NodePools.
    • Gradually upgrade a small subset of nodes / NodePools ("canaries") first to test the new runtime in a production environment before rolling the upgrade out to the rest of the nodes.
  • Release Notes:
    • Clearly announce the switch from runc to crun as the default runtime in version 4.18 and explain HCP's behaviour.
    • Briefly explain the rationale behind the change, emphasizing the expected transparency and minimal user impact.
    • Reference the documentation on the canary update strategy for users seeking further information or guidance.

Goal

  • Stay aligned with OCP Standalone on the runtime when relevant events happen.
    • Are there any implications for ControlPlane upgrades?
      • Since the HostedControlPlane is a management cluster workload, it will be affected like any other workload; we need to verify whether that affects us in any way.
    • NodePool Updates from 4.18
    • NodePool updates from 4.17 to 4.18 (Changing runtime release)
  • Testing
  • Implications of maintaining a NodePool with the non-default runtime
    • Deprecation of runc at some point, forcing customers to create a new NodePool with the new runtime
    • Performance implications
    • Footprint
    • Testing (duplicated tests?)
    • MultiArch config
    • Backup and Restore
  • Documentation
    • How to change the runtime
    • Implications of changing and use
  • Service Delivery impact

Why is this important?

  • Prevent issues with runtime use and migration for customer workloads on SaaS and self-managed platforms
  • Stay aligned with OCP Standalone

Scenarios

  1. Scenario 1: A user upgrading from OpenShift 4.17 to 4.18 wants to ensure that their nodepools continue using runc. The upgrade proceeds without changes to the container runtime, preserving the existing environment.
  2. Scenario 2: A user intends to switch to crun post-upgrade. They create a new nodepool explicitly configured with crun, ensuring a controlled transition.

Acceptance Criteria

  • Dev
    • Validated upgrade from 4.17 to 4.18 does not modify the runtime
    • Validated upgrade from 4.18 does not modify the runtime
    • Validated from 4.18 the new nodepools are based on crun
    • Questions from above answered
    • Keep aligned with OCP Standalone
  • CI
    • MUST be running successfully with tests automated
    • E2E test to validate the desired behaviour
    •  
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • Doc
    • Documentation in place Upstream and the docs team aware of the new additions for downstream

Open questions:

  1. How will the automatic retention of runc impact long-term support for crun as the default runtime?
    1. Deprecation of runc at some point, forcing customers to create a new NodePool with the new runtime
    2. Performance implications
    3. Footprint
    4. MultiArch config
    5. Backup and Restore
  2. Are there edge cases where the automatic retention of runc could cause issues with newer OpenShift features or configurations?
  3. Is there any implication in ControlPlane upgrades?
  4. Will we cover the testing of both runtimes?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a customer, I would like to know how the runtime change from runc to crun could affect me. For that we will need to:

  • Document how to change the runtime in HostedControlPlanes
  • Accepted and not accepted scenarios
  • Implications of changing the runtime (downtime, restarts, etc.)
  • How to validate a good "migration"

Acceptance Criteria:

Description of criteria:

  • Upstream and downstream documentation

User Story:

As a customer, I want to upgrade my HostedCluster from 4.17 to 4.18 so that I can verify:

  • The runtime changes from runc to crun.
  • The new NodePools are created with crun runtime.
  • The disconnected deployments keep working fine.

If any of the points above fails, we need to file a bug to resolve it and place it under the same Epic as this user story.

Acceptance Criteria:

Description of criteria:

  • Upstream and downstream documentation.
  • Validated the above scenarios.
  • Filed the proper issues, if applicable.
  • Modify the Upgrade E2E tests to check the runtime.

Feature Overview (Goal Summary)  

We aim to continue establishing a comprehensive testing strategy for Hosted Control Planes (HCP) that aligns with Red Hat’s support requirements and ensures customer satisfaction. This involves testing across various permutations, including providers, lifecycle, upgrades, and version compatibility. The testing must span management clusters, hubs, MCE, control planes, and nodepools, while coordinating across multiple QE teams to avoid duplication and inefficiencies. We aim to sustain an evolving testing matrix to meet product demands, especially as new versions and extended OCP lifecycles are introduced.

Goals (Expected User Outcomes)

  • Provide a scalable, systematic approach for testing HCP across multiple environments and scenarios.
  • Ensure coordination between all QE teams (ACM/MCE, HCP, KubeVirt, Agent) to avoid redundancies and inefficiencies in testing.
  • Establish a robust testing framework that can handle upgrades and version compatibility while maintaining compliance with Red Hat’s lifecycle policies.
  • Offer a clear view of coverage across different permutations of control planes and node pools.

 

Requirements (Acceptance Criteria)

  • Testing matrix covers all relevant permutations of management clusters, hubs, MCE, control planes, and node pools.
  • Use of representative sampling to ensure critical combinations are tested without unnecessary resource strain.
  • Ensure testing for upgrades includes fresh install scenarios to streamline coverage.
  • Automated processes in place to trigger relevant tests for new MCE builds or HCP updates.
  • Comprehensive tracking of QE teams’ coverage to avoid duplicated efforts.
  • Test execution time is optimized to reduce delays in delivery without compromising coverage.

 

Deployment Considerations

  • Self-managed, managed, or both: self-managed.
  • Classic (standalone cluster): No.
  • Hosted control planes: Yes.
  • Multi-node, Compact (three node), or Single node (SNO), or all: N/A.
  • Connected / Restricted Network: Yes.
  • Architectures: x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x).
  • Operator compatibility: Yes, ensure operator updates don't break testing workflows.
  • Backport needed: N/A
  • UI need:N/A
  • Other: N/A.

Use Cases (Optional)

 

See: https://docs.google.com/spreadsheets/d/1j8TjMfyCfEt8OzTgvrAG3tuC6WMweBh5ElzWu6oAvUw/edit?gid=0#gid=0 

  • Same hub, multiple HCP versions: Using the same management/hub cluster (e.g., 4.15) to provision up to n+4 newer cluster versions
  • MCE ft. Management cluster compatibility. 
  • MCE ft. HCP versions compatibility
  • Upgrade Scenarios: Testing a management cluster upgrade from version 4.14 to 4.15, ensuring all connected node pools and control planes operate seamlessly.
  • Fresh Install Scenarios: Testing a new deployment with different node pool versions to ensure all configurations work correctly without requiring manual interventions.

Background

The HCP architecture introduces decoupled control planes and worker nodes, significantly increasing the number of testing permutations. Ensuring these scenarios are tested is crucial to maintaining product quality and customer satisfaction, and to staying compliant as an OpenShift form factor.

 

Feature Overview

OCP 4 clusters still maintain pinned boot images. We have numerous clusters installed whose boot media are pinned to first-boot images as early as 4.1. In the future these boot images may not be certified by the OEM and may fail to boot on updated datacenter or cloud hardware platforms. These "pinned" boot images should be updatable so that customers can avoid this problem and, better still, scale out nodes with boot media that matches the running cluster version.

In phase 1 provided tech preview for GCP.

In phase 2, GCP support goes to GA and AWS goes to TP.

In phase 3, AWS support goes to GA and vSphere goes to TP.

Requirements

Feature Overview (aka. Goal Summary)  

To introduce tests for new permissions required as pre-submit tests on PRs so that PR authors can see whenever their changes affect the minimum required permissions

Goals (aka. expected user outcomes)

Currently, the process is that QE installs with the documented minimum permissions, which starts failing whenever something new unknowingly requires additional permissions.

That test runs once a week. When it fails, QE reviews and files bugs; the Installer team then adds the new permissions to a file in the installer repo which tracks the required permissions.

The issue is that it takes some time to get a permissions change implemented by AWS, so the late discovery of a need can become a release blocker.

Requirements (aka. Acceptance Criteria):

Test the minimum permissions required to deploy OCP on AWS early, so that ROSA can be informed before any feature that alters the minimum permission requirements is released.
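As one possible shape for such an early check (purely illustrative; the policy file name and action list are assumptions, not the actual CI implementation), the documented minimum-permissions policy could be simulated against the actions a change introduces:

$ aws iam simulate-custom-policy \
    --policy-input-list file://minimum-permissions-policy.json \
    --action-names ec2:CreateVpc ec2:RunInstances \
    --query 'EvaluationResults[?EvalDecision!=`allowed`].EvalActionName'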

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Documentation Considerations

This is an internal-only feature and should not require any user-facing documentation

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • To introduce tests for new permissions required as pre-submit tests on PRs so that PR authors can see whenever their changes affect the minimum required permissions

Why is this important?

  • Currently the process is that QE installs with the documented minimum permissions, which starts failing whenever something new unknowingly requires additional permissions. That test runs once a week. When it fails, QE reviews and files bugs; the Installer team then adds the new permissions to a file in the installer repo which tracks the required permissions.
  • The issue is that it takes some time to get a permissions change implemented by AWS, so the late discovery of a need can become a release blocker

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Open questions::

  1. The details of what happens when the tests fail and a new permission is required; for example, would it be a new PR, and what documentation do we need to put in place to explain the process?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

Review, refine and harden the CAPI-based Installer implementation introduced in 4.16

Goals (aka. expected user outcomes)

From the implementation of the CAPI-based Installer started with OpenShift 4.16 there is some technical debt that needs to be reviewed and addressed to refine and harden this new installation architecture.

Requirements (aka. Acceptance Criteria):

Review existing implementation, refine as required and harden as possible to remove all the existing technical debt

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Documentation Considerations

There should not be any user-facing documentation required for this work

Epic Goal

  • This epic includes tasks the team would like to tackle to improve our process, QOL, CI. It may include tasks like updating the RHEL base image and vendored assisted-service.

Why is this important?

 

We need a place to add tasks that are not feature oriented.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

 The agent installer does not require the infra-env id to be present in the claim to perform the authentication.

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

The agent installer does not require the infra-env id to be present in the claim to perform the authentication.

 

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • Continue to refine and harden aspects of CAPI-based Installs launched in 4.16

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

Once a cloud provider uses CAPI by default, the feature gate it used becomes tech debt.

Acceptance Criteria:

Description of criteria:

  • openshift/api PR removing the feature gate
  • remove feature gate conditionals from the installer
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Feature Overview (aka. Goal Summary)

This feature aims to comprehensively refactor and standardize various components across HCP, ensuring consistency, maintainability, and reliability. The overarching goal is to increase customer satisfaction by increasing speed to market and to save engineering budget by reducing incidents/bugs. This will be achieved by reducing technical debt, improving code quality, and simplifying the developer experience across multiple areas, including CLI consistency, NodePool upgrade mechanisms, networking flows, and more. By addressing these areas holistically, the project aims to create a more sustainable and scalable codebase that is easier to maintain and extend.

Goals (aka. Expected User Outcomes)

  • Unified Codebase: Achieve a consistent and unified codebase across different HCP components, reducing redundancy and making the code easier to understand and maintain.
  • Enhanced Developer Experience: Streamline the developer workflow by reducing boilerplate code, standardizing interfaces, and improving documentation, leading to faster and safer development cycles.
  • Improved Maintainability: Refactor large, complex components into smaller, modular, and more manageable pieces, making the codebase more maintainable and easier to evolve over time.
  • Increased Reliability: Enhance the reliability of the platform by increasing test coverage, enforcing immutability where necessary, and ensuring that all components adhere to best practices for code quality.
  • Simplified Networking and Upgrade Mechanisms: Standardize and simplify the handling of networking flows and NodePool upgrade triggers, providing a clear, consistent, and maintainable approach to these critical operations.

Requirements (aka. Acceptance Criteria)

  • Standardized CLI Implementation: Ensure that the CLI is consistent across all supported platforms, with increased unit test coverage and refactored dependencies.
  • Unified NodePool Upgrade Logic: Implement a common abstraction for NodePool upgrade triggers, consolidating scattered inputs and ensuring a clear, consistent upgrade process.
  • Refactored Controllers: Break down large, monolithic controllers into modular, reusable components, improving maintainability and readability.
  • Improved Networking Documentation and Flows: Update networking documentation to reflect the current state, and refactor network proxies for simplicity and reusability.
  • Centralized Logic for Token and Userdata Generation: Abstract the logic for token and userdata generation into a single, reusable library, improving code clarity and reducing duplication.
  • Enforced Immutability for Critical API Fields: Ensure that immutable fields within key APIs are enforced through proper validation mechanisms, maintaining API coherence and predictability.
  • Documented and Clarified Service Publish Strategies: Provide clear documentation on supported service publish strategies, and lock down the API to prevent unsupported configurations.

Use Cases (Optional)

  • Developer Onboarding: New developers can quickly understand and contribute to the HCP project due to the reduced complexity and improved documentation.
  • Consistent Operations: Operators and administrators experience a more predictable and consistent platform, with reduced bugs and operational overhead due to the standardized and refactored components.

Out of Scope

  • Introduction of new features or functionalities unrelated to the refactor and standardization efforts.
  • Major changes to user-facing commands or APIs beyond what is necessary for standardization.

Background

Over time, the HyperShift project has grown organically, leading to areas of redundancy, inconsistency, and technical debt. This comprehensive refactor and standardization effort is a response to these challenges, aiming to improve the project's overall health and sustainability. By addressing multiple components in a coordinated way, the goal is to set a solid foundation for future growth and development.

Customer Considerations

  • Minimal Disruption: Ensure that existing users experience minimal disruption during the refactor, with clear communication about any changes that might impact their workflows.
  • Enhanced Stability: Customers should benefit from a more stable and reliable platform as a result of the increased test coverage and standardization efforts.

Documentation Considerations

Ensure all relevant project documentation is updated to reflect the refactored components, new abstractions, and standardized workflows.

This overarching feature is designed to unify and streamline the HCP project, delivering a more consistent, maintainable, and reliable platform for developers, operators, and users.

Goal

Refactor and modularize controllers and other components to improve maintainability, scalability, and ease of use.

As a dev I want to understand at a glance which conditions are relevant for the NodePool.
As a dev I want the ability to add or collapse conditions easily.
As a dev I want any condition expectations to be unit testable.
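
A minimal Go sketch of what unit-testable NodePool conditions could look like, assuming a registry of small, pure per-condition functions; the types and names below are illustrative, not the real HyperShift API:

// Hypothetical sketch: every NodePool condition is computed by a small, pure
// function registered in one place, so each condition is visible at a glance
// and can be unit tested with a table test. NodePool here is a stub type.
package nodepool

// NodePool is a stub standing in for the real API type.
type NodePool struct {
	AutoscalingEnabled bool
}

// Condition is a simplified status condition.
type Condition struct {
	Type    string
	Status  string // "True" or "False"
	Reason  string
	Message string
}

// conditionFunc computes one condition from the NodePool.
type conditionFunc func(np NodePool) Condition

// conditionFuncs lists every condition in one place, making it easy to add or
// collapse conditions.
var conditionFuncs = []conditionFunc{
	autoscalingEnabledCondition,
}

func autoscalingEnabledCondition(np NodePool) Condition {
	if np.AutoscalingEnabled {
		return Condition{Type: "AutoscalingEnabled", Status: "True", Reason: "AsExpected"}
	}
	return Condition{Type: "AutoscalingEnabled", Status: "False", Reason: "Disabled"}
}

// computeConditions runs every registered condition function.
func computeConditions(np NodePool) []Condition {
	out := make([]Condition, 0, len(conditionFuncs))
	for _, f := range conditionFuncs {
		out = append(out, f(np))
	}
	return out
}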

Abstract away in a single place all the logic related to token and userdata secrets, consuming the output of https://issues.redhat.com/browse/HOSTEDCP-1678
This should result in a single abstraction, i.e. "Token", that exposes a thin library, e.g. Reconcile(), and hides all details of the token/userdata secret lifecycle.
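
A minimal sketch, assuming a thin wrapper over whatever secret client already exists, of how such a "Token" abstraction could look; the names and the secret contents are illustrative only:

// Hypothetical sketch: a single Token type owns the lifecycle of the token and
// userdata secrets and exposes only a thin Reconcile() entry point.
package token

import (
	"context"
	"fmt"
)

// secretWriter abstracts whatever client is used to persist secrets.
type secretWriter interface {
	CreateOrUpdateSecret(ctx context.Context, name string, data map[string][]byte) error
}

// Token hides all details of token/userdata secret generation and rotation.
type Token struct {
	configHash string // authoritative hash of the targeted config/version
	client     secretWriter
}

func New(client secretWriter, configHash string) *Token {
	return &Token{client: client, configHash: configHash}
}

// Reconcile is the only public lifecycle operation callers need.
func (t *Token) Reconcile(ctx context.Context) error {
	tokenSecret := map[string][]byte{"token": []byte("generated-token")}
	if err := t.client.CreateOrUpdateSecret(ctx, "token-"+t.configHash, tokenSecret); err != nil {
		return fmt.Errorf("reconciling token secret: %w", err)
	}
	userdata := map[string][]byte{"userdata": []byte("ignition-pointer")}
	if err := t.client.CreateOrUpdateSecret(ctx, "userdata-"+t.configHash, userdata); err != nil {
		return fmt.Errorf("reconciling userdata secret: %w", err)
	}
	return nil
}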

As a dev I want to easily add and understand which inputs result in triggering a NodePool upgrade.

There are many scattered things that trigger a NodePool rolling upgrade on change.
For code sustainability it would be good to have a common abstraction that discovers all of them based on an input and returns the authoritative hash for any targeted config version in time (see the sketch after the related links below).
Related https://github.com/openshift/hypershift/pull/4057
https://github.com/openshift/hypershift/pull/3969#discussion_r1587198191
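
A minimal sketch, under the assumption that each upgrade input can be reduced to a stable string, of a single abstraction that registers every rollout trigger and returns one authoritative hash; the trigger names and hashing scheme are illustrative:

// Hypothetical sketch: gather every input that should trigger a NodePool
// rolling upgrade and reduce them to one authoritative, deterministic hash.
package rollout

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
)

// Trigger returns a stable string for one upgrade input (release image,
// MachineConfig payload, proxy settings, ...).
type Trigger func() string

type Triggers struct {
	inputs map[string]Trigger
}

func New() *Triggers { return &Triggers{inputs: map[string]Trigger{}} }

// Register adds a new named input; adding a trigger becomes a one-liner here
// instead of another scattered hash computation in the controller.
func (t *Triggers) Register(name string, f Trigger) { t.inputs[name] = f }

// Hash returns the authoritative hash for the currently targeted config version.
func (t *Triggers) Hash() string {
	names := make([]string, 0, len(t.inputs))
	for n := range t.inputs {
		names = append(names, n)
	}
	sort.Strings(names) // deterministic ordering
	h := sha256.New()
	for _, n := range names {
		h.Write([]byte(n))
		h.Write([]byte(t.inputs[n]()))
	}
	return hex.EncodeToString(h.Sum(nil))[:8]
}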

Following up on abstracting pieces into cohesive units, CAPI is the next logical choice since much of the reconciliation business logic for it lives in the NodePool controller.
Goals (a sketch follows below):
All CAPI-related logic is driven by a single abstraction/struct.
Almost full unit test coverage.
A deeper refactor of the concrete implementation logic is left out of scope for gradual, test-driven follow-ups.
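
A minimal sketch of what driving all CAPI logic through a single struct could look like, with the dependency hidden behind a small interface so the concrete client can be faked in unit tests; the names are illustrative, not the real NodePool controller code:

// Hypothetical sketch: all CAPI-related reconciliation for a NodePool is driven
// by one struct with a narrow entry point.
package capi

import "context"

// machineTemplateClient abstracts the calls the CAPI logic needs, which keeps
// the concrete implementation swappable in unit tests.
type machineTemplateClient interface {
	EnsureMachineDeployment(ctx context.Context, name string, replicas int32) error
}

// CAPI owns every CAPI manifest reconciled on behalf of a NodePool.
type CAPI struct {
	client   machineTemplateClient
	nodePool string
	replicas int32
}

func New(c machineTemplateClient, nodePool string, replicas int32) *CAPI {
	return &CAPI{client: c, nodePool: nodePool, replicas: replicas}
}

// Reconcile is the single place the NodePool controller calls into.
func (c *CAPI) Reconcile(ctx context.Context) error {
	return c.client.EnsureMachineDeployment(ctx, c.nodePool, c.replicas)
}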

User Story:

As a (user persona), I want to be able to:

  • As an external dev I want to be able to add new components to the CPO easily
  • As a core dev I want to feel safe when adding new components to the CPO
  • As a core dev I want to add new components to the CPO without copy/pasting big chunks of code

 

https://issues.redhat.com//browse/HOSTEDCP-1801 introduced a new abstraction to be used by ControlPlane components. We need to refactor every component to use this abstraction. 

Acceptance Criteria:

Description of criteria:

All ControlPlane Components are refactored:

  • HCCO
  • kube-apiserver (Mulham)
  • kube-controller-manager (Mulham)
  • ocm (Mulham)
  • etcd (Mulham)
  • oapi (Mulham)
  • scheduler (Mulham)
  • clusterpolicy (Mulham)
  • CVO (Mulham)
  • oauth (Mulham)
  • hcp-router (Mulham)
  • CCO (Mulham)
  • CNO (Jparrill)
  • CSI (Jparrill)
  • dnsoperator
  • ignition (Ahmed)
  • ingressoperator (Bryan)
  • machineapprover
  • nto
  • olm
  • pkioperator
  • registryoperator (Bryan)
  • snapshotcontroller
  • storage

 

Example PR to refactor HCCO: https://github.com/openshift/hypershift/pull/4860
docs: https://github.com/openshift/hypershift/blob/main/support/controlplane-component/README.md
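
A minimal sketch, loosely inspired by the idea described in the linked README, of a gated component interface plus a shared deploy path that applies common defaults; the interface shape and helper names are assumptions, not the actual support/controlplane-component API:

// Hypothetical sketch: new components can only be contributed by satisfying a
// small interface, and the shared Deploy path applies the opinionated defaults
// (service account token automount, topology opinions, ...) so they cannot be
// forgotten.
package component

import "context"

// ControlPlaneComponent is the only way to contribute a component to the CPO.
type ControlPlaneComponent interface {
	Name() string
	// Manifests returns the component's raw manifests (simplified to strings here).
	Manifests() []string
}

// Deploy applies common, audited defaults before handing manifests to the apply
// machinery, instead of each component copy/pasting that boilerplate.
func Deploy(ctx context.Context, c ControlPlaneComponent, apply func(ctx context.Context, manifest string) error) error {
	for _, m := range c.Manifests() {
		m = withCommonDefaults(m) // e.g. automountServiceAccountToken: false, topology opinions
		if err := apply(ctx, m); err != nil {
			return err
		}
	}
	return nil
}

func withCommonDefaults(manifest string) string {
	// Placeholder for the shared mutations every component must receive.
	return manifest
}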

User Story:

As a (user persona), I want to be able to:

  • As an external dev I want to be able to add new components to the CPO easily
  • As a core dev I want to feel safe when adding new components to the CPO
  • As a core dev I want to add new components to the CPO without copy/pasting big chunks of code

Acceptance Criteria:

Context:
If you ever had to add or modify a component in the control plane operator, the need for this becomes very obvious. It should only be possible to add component manifests through a gated interface.
Right now adding a new component requires copy/pasting hundreds of lines of boilerplate and there's plenty of room for side effects. A dev needs to manually remember to set the right config, like AutomountServiceAccountToken: false, topology opinions, etc.

We should refactor support/config and all the consumers in the CPO to enforce component creation through audited and common signatures/interfaces.
Adding a new component should only be possible through these higher abstractions.

More Details

  • If you ever had to add or modify a component in the control plane operator, the need for this becomes very obvious. It should only be possible to add component manifests through a gated interface.
  • Right now adding a new component requires copy/pasting hundreds of lines of boilerplate and there's plenty of room for side effects.

Goal

Improve the consistency and reliability of APIs by enforcing immutability and clarifying service publish strategy support.

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Goal

Focus on the general modernization of the codebase, addressing technical debt, and ensuring that the platform is easy to maintain and extend.

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a user of HyperShift, I want:

  • the hardcoded catalog images removed and fetched from the OCP release image in the HCP

so that I can achieve

  • use the catalog images related to the OCP release image rather than a hardcoded value

Acceptance Criteria:

Description of criteria:

  • Hardcoded catalog images removed
  • Catalog image versions derived from OCP release image version listed in HCP

Out of Scope:

N/A

Engineering Details:

  • Hardcoded images are here
  • Every branching event, we have to remember to update this hardcoded value. Removing the hardcoded value and deriving the version from the OCP release image removes this requirement (see the sketch after this list).
  • Patryk Stefanski had code in a related PR that would do this, that can be reused here.
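
A minimal sketch of deriving catalog image pullspecs from the release image instead of hardcoding them; lookupImageFor is a hypothetical stand-in for whatever release-payload lookup already exists, and the catalog names are illustrative:

// Hypothetical sketch: resolve OLM catalog image pullspecs from the
// HostedControlPlane's release image rather than hardcoding them.
package olm

import "fmt"

// lookupImageFor resolves a named image from the release payload's
// image-references (assumed helper, not a real function name).
type lookupImageFor func(releaseImage, name string) (string, error)

var catalogNames = []string{
	"redhat-operators",
	"certified-operators",
	"community-operators",
	"redhat-marketplace",
}

// CatalogImages returns the catalog pullspecs matching the given release image,
// removing the need to bump hardcoded values on every branching event.
func CatalogImages(releaseImage string, lookup lookupImageFor) (map[string]string, error) {
	out := map[string]string{}
	for _, name := range catalogNames {
		img, err := lookup(releaseImage, name)
		if err != nil {
			return nil, fmt.Errorf("resolving %s from %s: %w", name, releaseImage, err)
		}
		out[name] = img
	}
	return out, nil
}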

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Goal

As a dev I want the base code to be easier to read, maintain and test

Why is this important?

If devs don't have a healthy dev environment, the project will stall and the business won't make money.

Scenarios

  1. ...

Acceptance Criteria

  • 80% unit tested code
  • No file > 1000 lines of code

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

<your text here>

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

<your text here>

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Epic Goal

  • The goal of this epic is to prepare the console codebase as well as the dynamic plugins SDK. In order to do that, we need to identify the areas in console that need to be updated and the issues which need to be fixed.

Why is this important?

  • Console as well as its dynamic plugins will need to support PF6 once it's available in a stable version

Acceptance Criteria

  • Identify all the areas of code that need to be updated or fixed
  • Create stories which will address those updates and fixes

Open questions::

  1. Should we be removing PF4 as part of 4.16 ?

NOTE:

Nicole Thoen has already started crafting a document on technical debt impeding PF6 migration, which contains a list of identified tech-debt items, deprecated components, etc.

Locations

frontend/public/components/‎

search-filter-dropdown.tsx (note: Steve has a branch that's converted this) [merged]

 

‎frontend/public/components/monitoring/‎

kebab-dropdown.tsx – code duplicated at https://github.com/openshift/monitoring-plugin/blob/main/web/src/components/kebab-dropdown.tsx and that version will be updated in https://issues.redhat.com/browse/OU-257 as the console version is eventually going away

ListPageCreate.tsx – addressed in https://issues.redhat.com//browse/CONSOLE-4118

alerting.tsx – code duplicated at https://github.com/openshift/monitoring-plugin/blob/main/web/src/components/alerting.tsx and that version should be updated in https://issues.redhat.com/browse/OU-561 as the console version is eventually going away

 

AC: Go through the mentioned files and swap the usage of DropdownDeprecated and KebabToggleDeprecated with PF components, based on their semantics (either Dropdown or Select components).

 

Note:

DropdownDeprecated and KebabToggleDeprecated are replaced with latest components

https://www.patternfly.org/components/menus/menu

https://www.patternfly.org/components/menus/menu-toggle#plain-toggle-with-icon

https://www.patternfly.org/components/menus/dropdown

https://www.patternfly.org/components/menus/select

 

Part of the PF6 adoption should be replacing TableDeprecated with the Table component

Location:

  • frontend/packages/console-app/src/components/console-operator/ConsoleOperatorConfig.tsx
  • frontend/public/components/custom-resource-definition.tsx
  • frontend/public/components/factory/table.tsx
  • frontend/public/components/factory/Table/VirtualizedTable.tsx

AC: 

  • Change the TableDeprecated component in the locations above in favour of PF Table component.
  • Remove the patternfly/react-table/deprecated package from console dependencies.

Locations

‎frontend/packages/console-shared/src/components/‎

GettingStartedGrid.tsx (has KebabToggleDeprecated)

 

Note

DropdownDeprecated is replaced with latest components

https://www.patternfly.org/components/menus/menu

https://www.patternfly.org/components/menus/menu-toggle#plain-toggle-with-icon

https://www.patternfly.org/components/menus/dropdown

https://www.patternfly.org/components/menus/select

 

AC: Go through the mentioned files and swap the usage of DropdownDeprecated and KebabToggleDeprecated with PF components, based on their semantics (either Dropdown or Select components).

multiselectdropdown.tsx (multiple typeahead with placeholder and noResultsFoundText) 

 

Note

SelectDeprecated and SelectOptionDeprecated are replaced with latest Select component

https://www.patternfly.org/components/menus/menu

https://www.patternfly.org/components/menus/select

 

AC: Go through the mentioned files and swap the usage of SelectDeprecated with PF Select components.

multiselectdropdown.tsx (multiple typeahead with placeholder and noResultsFoundText) only used in packages/local-storage-operator moved to https://issues.redhat.com/browse/CONSOLE-4227

UtilizationDurationDropdown.tsx (checkbox select, plain toggle, with placeholder text)

SelectInputField.tsx  (uses most Select props) moved to https://issues.redhat.com/browse/ODC-7655

QueryBrowser.tsx  (Currently using DropdownDeprecated, should be using a Select)

 

Note

SelectDeprecated and SelectOptionDeprecated are replaced with latest Select component

https://www.patternfly.org/components/menus/menu

https://www.patternfly.org/components/menus/select

 

AC: Go through the mentioned files and swap the usage of SelectDeprecated with PF Select components.

AC:

  • Replace ApplicationLauncher, ApplicationLauncherGroup, ApplicationLauncherItem, ApplicationLauncherSeparator with Dropdown and Menu components.
  • Update integration tests

 

PatternFly demo using Dropdown and Menu components

https://www.patternfly.org/components/menus/application-launcher/

 

 

 

resource-dropdown.tsx (checkbox, options have tooltips, grouped options, hasInlineFilter which is not supported in V6 Select, convert to Typeahead)

resource-log.tsx

filter-toolbar.tsx (grouped, checkbox select)

monitoring/dashboards/index.tsx  (checkbox select, hasInlineFilter which is not supported in V6 Select, convert to Typeahead) covered by https://issues.redhat.com/browse/ODC-7655

silence-form.tsx (Currently using DropdownDeprecated, should be using a Select)

timespan-dropdown.ts (Currently using DropdownDeprecated, should be using a Select) covered by https://issues.redhat.com/browse/ODC-7655

poll-interval-dropdown.tsx (Currently using DropdownDeprecated, should be using a Select) covered by https://issues.redhat.com/browse/ODC-7655

 

Note

SelectDeprecated are replaced with latest Select component

https://www.patternfly.org/components/menus/menu

https://www.patternfly.org/components/menus/select

 

AC: Go through the mentioned files and swap the usage of deprecated components with PF components, based on their semantics (either Dropdown or Select components).

 

Locations

‎frontend/packages/console-app/src/components/‎

NavHeader.tsx [merged]

PDBForm.tsx (This should be a <Select>) [merged]

 

Acceptance Criteria:

  • Change the DropdownDeprecated component in NavHeader.tsx in favour of PF Select component.
  • Change the DropdownDeprecated component in OAuthConfigDetails.tsx in favour of PF Dropdown component.
  • Change the DropdownDeprecated component in PDBForm.tsx in favour of PF Select component.
  • Create a wrapper for these replacements, if necessary.
  • Update integration tests, if necessary.
  • Add an integration test to verify if the wrapper is accessible via keyboard.

 

DropdownDeprecated are replaced with latest components

https://www.patternfly.org/components/menus/menu

https://www.patternfly.org/components/menus/dropdown

https://www.patternfly.org/components/menus/select

 

 

 

Epic Goal

  • The goal of this epic is to prepare the console codebase as well as the dynamic plugins SDK. In order to do that, we need to identify the areas in console that need to be updated and the issues which need to be fixed.

Why is this important?

  • Console as well as its dynamic plugins will need to support PF6 once it's available in a stable version

Acceptance Criteria

  • Identify all the areas of code that need to be updated or fixed
  • Create stories which will address those updates and fixes

Open questions::

  1. Should we be removing PF4 as part of 4.16 ?

NOTE:

Nicole Thoen has already started crafting a document on technical debt impeding PF6 migration, which contains a list of identified tech-debt items, deprecated components, etc.

Locations

‎frontend/packages/pipelines-plugin/src/components/‎

PipelineQuickSearchVersionDropdown.tsx (Currently using DropdownDeprecated, should be using a Select)

PipelineMetricsTimeRangeDropdown.tsx (Currently using DropdownDeprecated, should be using a Select)

 

Note

DropdownDeprecated are replaced with latest Select components

https://www.patternfly.org/components/menus/menu

https://www.patternfly.org/components/menus/select

 

AC: Go through the mentioned files and swap the usage of SelectDeprecated with PF Select components.

Part of the PF6 adoption should be replacing TableDeprecated with the Table component

Location:

  • frontend/packages/knative-plugin/src/components/overview/FilterTable.tsx
  • frontend/packages/pipelines-plugin/src/components/shared/results/ResultsList.tsx
  • frontend/packages/rhoas-plugin/src/components/service-table/ServiceInstanceTable.tsx dead project
  • frontend/public/components/monitoring/metrics.tsx
  • frontend/public/components/monitoring/dashboards/table.tsx (file will be removed from console in CONSOLE-4236, belongs to OU-499)

AC: 

  • Change the TableDeprecated component in the locations above in favour of PF Table component.
  • Remove the patternfly/react-table/deprecated package from console dependencies.

monitoring/dashboards/index.tsx  (checkbox select, hasInlineFilter which is not supported in V6 Select, convert to Typeahead)

timespan-dropdown.ts (Currently using DropdownDeprecated, should be using a Select) 

poll-interval-dropdown.tsx (Currently using DropdownDeprecated, should be using a Select) 

SelectInputField.tsx  (uses most Select props)

 

`FilterSelect`, `VariableDropdown`, `TimespanDropdown`, and `IntervalDropdown` are the components that need to be updated; frontend/packages/dev-console/src/components/monitoring/MonitoringPage.tsx is the only valid usage of `MonitoringDashboardsPage`, as web/src/components/alerting.tsx is orphaned.

 

Note

SelectDeprecated are replaced with latest Select component

https://www.patternfly.org/components/menus/menu

https://www.patternfly.org/components/menus/select

 

AC: Go through the mentioned files and swap the usage of deprecated components with PF components, based on their semantics (either Dropdown or Select components).

 

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

K8s 1.31 introduces VolumeAttributesClass as beta (code in external provisioner). We should make it available to customers as tech preview.

VolumeAttributesClass allows PVCs to be modified after their creation and while attached. There is a vast number of parameters that can be updated, but the most popular use case is changing QoS values. The parameters that can be changed depend on the driver used.
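
A minimal client-go sketch, assuming a cluster with the feature enabled and a CSI driver that supports volume modification: an existing, bound PVC is pointed at a different VolumeAttributesClass to change attributes such as QoS. The namespace, PVC name, and class name are illustrative:

// Hypothetical sketch: patch an existing PVC to reference a different
// VolumeAttributesClass ("gold" is illustrative; its parameters depend on the
// CSI driver).
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	patch := []byte(`{"spec":{"volumeAttributesClassName":"gold"}}`)
	pvc, err := client.CoreV1().PersistentVolumeClaims("default").Patch(
		context.TODO(), "data-pvc", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("updated PVC:", pvc.Name)
}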

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

Productise VolumeAttributesClass as Tech Preview in anticipation of GA. Customers can start testing VolumeAttributesClass.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

  • Disabled by default
  • put it under TechPreviewNoUpgrade
  • make sure VolumeAttributeClass object is available in beta APIs
  • enable the feature in external-provisioner and external-resizer at least in AWS EBS CSI driver, check the other drivers.
    • Add RBAC rules for these objects
  • make sure we run its tests in one of TechPreviewNoUpgrade CI jobs (with hostpath CSI driver)
  • reuse / add a job with AWS EBS CSI driver + tech preview.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both both
Classic (standalone cluster) yes
Hosted control planes yes
Multi node, Compact (three node), or Single node (SNO), or all all
Connected / Restricted Network both
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) all
Operator compatibility N/A core storage
Backport needed (list applicable versions) None
UI need (e.g. OpenShift Console, dynamic plugin, OCM) TBD for TP
Other (please specify) n/A

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

As an OCP user, I want to change parameters of my existing PVC such as the QoS attributes.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

UI for TP

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

There have been some limitations and complaints about the fact that PVC attributes are sealed after creation, preventing customers from updating them. This is particularly impactful when a specific QoS is set and the volume requirements change.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

Customers should not use it in production at the moment.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Document VolumeAttributesClass creation and how to update a PVC. Mention any limitations. Mention that it is Tech Preview (no upgrade). Add driver support details if needed.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Check which drivers support it for which parameters.

Epic Goal

Support upstream feature "VolumeAttributesClass" in OCP as Beta, i.e. test it and have docs for it.

Why is this important?

  • We get this upstream feature through Kubernetes rebase. We should ensure it works well in OCP and we have docs for it.

Upstream links

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

There are a number of features or use cases supported by metal IPI that currently do not work in the agent-based installer (mostly due to being prevented by validations). 

In phased approach, we first need to close all the identified gaps in ABI (this feature).

In a second phase, we would introduce the ABI technology into the IPI flow, once it is on par with the IPI feature set.

Goals

Close the gaps identified in Baremetal IPI/ABI Feature Gap Analysis

 

Given that IPI (starting with 4.10) supports nmstate configuration, the overall configuration is very similar, apart from the fact that it is spread across different files.

Given: a configuration that works for the IPI method

When: I do an agent-based installation with the same configuration

Then: it works (with the exception that ISOs are provided manually)

Description of problem:

Currently the AdditionalTrustBundlePolicy is not being used, and setting it to a value other than "Proxyonly" generates a warning message:

Warning AdditionalTrustBundlePolicy: always is ignored

There are certain configurations where it's necessary to set this value; see more discussion in https://redhat-internal.slack.com/archives/C02SPBZ4GPR/p1727793787922199

    

Version-Release number of selected component (if applicable):

4.16

    

How reproducible:

Always

    

Steps to Reproduce:

    1. In install-config.yaml set AdditionalTrustBundlePolicy to Always
    2. Note the warning message that is output.
    3.
    

Actual results:

AdditionalTrustBundlePolicy is unused.

    

Expected results:

AdditionalTrustBundlePolicy is used in cluster installation.

    

Additional info:


    

Feature Overview (aka. Goal Summary)  

As we gain hosted control plane customers that bring in more diverse network topologies, we should evaluate the relevant configurations and topologies and provide more thorough coverage in CI and promotion testing.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

Cut down proxy issues in managed and self-managed hosted control planes

Requirements (aka. Acceptance Criteria):

  • E2E testing for Managed Hosted Control Planes with a good trade-off of different topology/configuration coverage
  • E2E testing for self Managed Hosted Control Planes with a good trade-off of different topology/configuration coverage
     
Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both both
Classic (standalone cluster) No
Hosted control planes Yes
Multi node, Compact (three node), or Single node (SNO), or all All supported Hosted Control Planes node topologies
Connected / Restricted Network Connected
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) all
Operator compatibility N/A
Backport needed (list applicable versions) Coverage over all supported releases
UI need (e.g. OpenShift Console, dynamic plugin, OCM) No
Other (please specify)  

Background

There's been a few significant customer bugs related to proxy configurations with Hosted Control Planes

Customer Considerations

Will increase reliability for customers, preventing regressions

Documentation Considerations

Documentation improvements that better detail the flow of communication and supported configurations

Interoperability Considerations

E2E should probably cover both ROSA/HCP and ARO/HCP

Goal

  • Solid proxy configuration/topology coverage

Why is this important?

  • Cut down on proxy bugs/incidents/regressions

Scenarios

  • Access external services through proxy (idp, image registries)
  • Workers needing a proxy to access the ignition endpoint
  • Workers needing a proxy to access the APIServer
  • Management cluster uses a proxy for external traffic

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>

User Story:

As a (user persona), I want to be able to:

  • Specify the VPC CIDR when creating a cluster with the hypershift CLI

so that I can achieve

  • Create separate VPCs that can be peered in CI so that we can test proxy use cases.

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Feature Overview (aka. Goal Summary)  

A common concern with dealing with escalations/incidents in Managed OpenShift Hosted Control Planes is the resolution time incurred when the fix needs to be delivered in a component of the solution that ships within the OpenShift release payload. This is because OpenShift's release payloads:

  •  Have a hotfix process that is customer/support-exception targeted rather than fleet targeted
  • Can take weeks to be available for Managed OpenShift

This feature seeks to provide mechanisms that bring the upper time bound for delivering such fixes in line with the current HyperShift Operator <24h expectation.

Goals (aka. expected user outcomes)

  • Hosted Control Plane fixes are delivered through Konflux builds
  • No additional upgrade edges
  • Release specific
  • Adequate, fleet representative, automated testing coverage
  • Reduced human interaction

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

  • Overriding Hosted Control Plane components can be done automatically once the PRs are ready and the affected versions have been properly identified
  • Managed OpenShift Hosted Clusters have their Control Planes fix applied without requiring customer intervention and without workload disruption beyond what might already be incurred because of the incident it is solving
  • Fix can be promoted through integration, stage and production canary with a good degree of observability

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both managed (ROSA and ARO)
Classic (standalone cluster) No
Hosted control planes Yes
Multi node, Compact (three node), or Single node (SNO), or all All supported ROSA/HCP topologies
Connected / Restricted Network All supported ROSA/HCP topologies
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) All supported ROSA/HCP topologies
Operator compatibility CPO and Operators depending on it
Backport needed (list applicable versions) TBD
UI need (e.g. OpenShift Console, dynamic plugin, OCM) No
Other (please specify) No

Use Cases (Optional):

  • Incident response when the engineering solution is partially or completely in the Hosted Control Plane side rather than in the HyperShift Operator

Out of Scope

  • HyperShift Operator binary bundling

Background

Discussed previously during incident calls. Design discussion document

Customer Considerations

  • Because the Managed Control Plane version does not change but it is overridden, customer visibility and impact should be limited as much as possible.

Documentation Considerations

SOP needs to be defined for:

  • Requesting and approving the fleet wide fixes described above
  • Building and delivering them
  • Identifying clusters with deployed fleet wide fixes

Goal

  • Have a Konflux build for every supported branch on every pull request / merge that modifies the Control Plane Operator

Why is this important?

  • In order to build the Control Plane Operator images to be used for management cluster wide overrides.
  • To be able to deliver managed Hosted Control Plane fixes to managed OpenShift with a similar SLO as the fixes for the HyperShift Operator.

Scenarios

  1. A PR that modifies the control plane in a supported branch is posted for a fix affecting managed OpenShift

Acceptance Criteria

  • Dev - Konflux application and component per supported release
  • Dev - SOPs for managing/troubleshooting the Konflux Application
  • Dev - Release Plan that delivers to the appropriate AppSre production registry
  • QE - HyperShift Operator versions that encode an override must be tested with the CPO Konflux builds that they make

Dependencies (internal and external)

  1. Konflux

Previous Work (Optional):

  1. HOSTEDCP-2027

Open questions:

  1. Antoni Segura Puimedon  How long or how many times should the CPO override be tested?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • DEV - Konflux App link: <link to Konflux App for CPO>
  • DEV - SOP: <link to meaningful PR or GitHub Issue>
  • QE - Test plan in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Acceptance criteria:

  • Workspace: crt-redhat-acm
  • Components: One per supported branch
  • Separate Containerfile
  • Should only build for area/control-plane-operator

Goal

  • Deliver Control Plane fixes within the same time constraints that we deliver HyperShift Operator fixes for Managed Hosted Control Planes

Why is this important?

  • Drastically cut SLO and contractual risk incurred from outages caused by Control Plane components in Managed OpenShift Hosted Control Planes
  • Improved Managed OpenShift Hosted Control Planes user experience in receiving fixes
  • Reduced SRE / Eng toil

Scenarios

  1. Incident response when the engineering solution is partially or completely in the Hosted Control Plane side rather than in the HyperShift Operator

Acceptance Criteria

  • Dev - Has a valid enhancement
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • SOPs for
    • Requesting and approving the fleet wide fixes described above
    • Building and delivering them
    • Identifying clusters with deployed fleet wide fixes
  • Managed OpenShift Hosted Clusters have their Control Planes fix applied without requiring customer intervention and without workload disruption beyond what might already be incurred because of the incident it is solving

Dependencies (internal and external)

  1. TBD: Konflux automated pipeline for building and delivering these fixes (needs another EPIC)

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a developer, I want to be able to:

  • Set up a mapping in the HyperShift operator to replace the CPO for a specific OpenShift release (see the sketch after this story)

so that I can

  • Deliver CPO fixes to managed services quickly
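
A hypothetical sketch of what such a mapping could look like inside the HyperShift operator: a per-release lookup that, when populated, substitutes the control plane operator image for that OpenShift version. The mechanism, names, and example entry are assumptions, not the actual implementation:

// Hypothetical sketch: per-release CPO image overrides consulted before falling
// back to the image shipped in the release payload.
package override

// cpoOverrides maps an OpenShift release version to a replacement CPO image.
var cpoOverrides = map[string]string{
	// "4.17.9": "quay.io/example/control-plane-operator@sha256:...", // illustrative entry
}

// ControlPlaneOperatorImage returns the override for a release if one exists,
// otherwise the image from the release payload.
func ControlPlaneOperatorImage(releaseVersion, payloadImage string) string {
	if img, ok := cpoOverrides[releaseVersion]; ok {
		return img
	}
	return payloadImage
}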

Feature Overview

The OpenShift IPsec implementation will be enhanced for a growing set of enterprise use cases, and for larger scale deployments. 

Goals

The OpenShift IPsec implementation was originally built for purpose-driven use cases from telco NEPs, but was also useful for a specific set of other customer use cases outside of that context. As customer adoption grew and it was adopted by some of the largest (by number of cluster nodes) deployments in the field, it became obvious that some redesign is necessary in order to continue to deliver enterprise-grade IPsec, for both East-West and North-South traffic, and for some of our most demanding customer deployments.

Key enhancements include observability, and blocking traffic across paths if IPsec encryption is not functioning properly.

Requirements

 

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

Questions to answer…

  •  

Out of Scope

  • Configuration of external-to-cluster IPsec endpoints for N-S IPsec. 

Background, and strategic fit

The OpenShift IPsec feature is fundamental to customer deployments for ensuring that all traffic between cluster nodes (East-West) and between cluster nodes and external-to-the-cluster entities that also are configured for IPsec (North-South) is encrypted by default.  This encryption must scale to the largest of deployments. 

Assumptions

  •  

Customer Considerations

  • Customers require the option to use their own certificates or CA for IPsec. 
  • Customers require observability of configuration (e.g. is the IPsec tunnel up and passing traffic)
  • If the IPsec tunnel is not up or otherwise functioning, traffic across the intended-to-be-encrypted network path should be blocked. 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Description of problem:

    In 4.14 libreswan runs as a containerized process inside the pod. SOS reports and must-gathers are not collecting libreswan logs and xfrm information from the nodes, which makes the debugging process heavier. This should be fixed by working with the sos-report team OR by changing our must-gather scripts in 4.14 alone.

    From 4.15 libreswan is a systemd process running on the host so the swan logs are gathered in sos-report

For 4.14, especially during escalations, gathering individual node data over and over is becoming painful for IPsec. We need to ensure all the data required to debug IPsec is collected in one place.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

As an OpenShift Administrator, I need to ensure that I rotate signing keys for self-managed Openshift Azure Entra Workload ID enabled clusters to comply with PCI-DSS v4 (see #8 on life cycle management) and NIST (see PCI “Tokenization Product Security Guidelines”) rules.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

When creating a self-managed OpenShift cluster on Azure using Azure Entra Workload ID, a dedicated OIDC endpoint is created. This endpoint exposes a document located at .well-known/openid_configuration, which contains the key jwks_uri, which in turn points to the JSON Web Key Sets.

Regular key rotations are an important part of PCI-DSS v4 and NIST rules. To ensure PCI-DSS V4 requirements, a mechanism is needed to seamlessly rotate signing keys. Currently, we can only have one signing/private key present in the OpenShift cluster; however, JWKS supports multiple public keys.

This feature will be split into 2 phases:

  • Phase 1: document the feature.
  • Phase 2 (post Phase 1): automate as much as we can of the feature to be informed by what's possible based on what we do in Phase 1 – this will be in a future OCPSTRAT (TBD).

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both Self-managed
Classic (standalone cluster) Classic
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all All
Connected / Restricted Network All
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) x86_x64, ARM (aarch64)
Operator compatibility  
Backport needed (list applicable versions) TBD
(Affects OpenShift 4.14+)
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

Related references

Additional references

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

As an OpenShift Administrator, I need to ensure that I rotate signing keys for self-managed short-term credentials enabled clusters (Openshift Azure Entra Workload ID, GCP Workload Identity, AWS STS) to comply with PCI-DSS v4 (see #8 on life cycle management) and NIST (see PCI “Tokenization Product Security Guidelines”) rules.

Add documentation to the cloud-credential-repo for how to rotate the cluster bound-service-account-signing-key, including adding the new key to the Microsoft Azure Workload Identity issuer file (a JWKS-merge sketch follows the list below). The process should meet the following requirements:

  • The next-bound-service-account-signing-key is (re)generated by the cluster.
  • The (next-)bound-service-account-signing-key private key never leaves the cluster.
  • There is minimal downtime (preferably zero) for pods using Microsoft Azure WI credentials while authenticating to the Azure API.
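
A minimal sketch, assuming the issuer's JWKS is stored as a plain JSON document that the administrator can download and re-upload: the new bound-service-account public key (already converted to a JWK) is appended to the existing key set so that tokens signed by either key keep validating during the rotation. File paths are illustrative:

// Hypothetical sketch: merge a new JWK into an existing JWKS document so both
// the old and new public keys are published during key rotation.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type jwks struct {
	Keys []json.RawMessage `json:"keys"`
}

func main() {
	existing, err := os.ReadFile("openid/v1/jwks") // path is illustrative
	if err != nil {
		panic(err)
	}
	var doc jwks
	if err := json.Unmarshal(existing, &doc); err != nil {
		panic(err)
	}

	newKey, err := os.ReadFile("next-key.jwk") // JWK for the new public key
	if err != nil {
		panic(err)
	}
	doc.Keys = append(doc.Keys, json.RawMessage(newKey))

	merged, err := json.MarshalIndent(doc, "", "  ")
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile("jwks-merged.json", merged, 0o644); err != nil {
		panic(err)
	}
	fmt.Println("wrote merged JWKS with", len(doc.Keys), "keys")
}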

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

<your text here>

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

<your text here>

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

link back to OCPSTRAT-1644 somehow

 

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

 
Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?   Does it improve security, performance, supportability, etc?  Why is work a priority?

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Description of problem:

    Failed ci jobs:
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.18-multi-nightly-4.18-cpou-upgrade-from-4.15-aws-ipi-mini-perm-arm-f14/1842004955238502400

https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.18-arm64-nightly-4.18-cpou-upgrade-from-4.15-azure-ipi-fullyprivate-proxy-f14/1841942041722884096

The 4.15-4.18 upgrade failed at the stage of the 4.17 to 4.18 update, with the authentication operator degraded and unavailable due to APIServerDeployment_PreconditionNotFulfilled

$ omc get clusterversion
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.0-0.nightly-arm64-2024-10-03-172957   True        True          1h44m   Unable to apply 4.18.0-0.nightly-arm64-2024-10-03-125849: the cluster operator authentication is not available

$ omc get co authentication
NAME             VERSION                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.18.0-0.nightly-arm64-2024-10-03-125849   False       False         True       8h

$ omc get co authentication -ojson|jq .status.conditions[]
{
  "lastTransitionTime": "2024-10-04T04:22:39Z",
  "message": "APIServerDeploymentDegraded: waiting for .status.latestAvailableRevision to be available\nAPIServerDeploymentDegraded: ",
  "reason": "APIServerDeployment_PreconditionNotFulfilled",
  "status": "True",
  "type": "Degraded"
}
{
  "lastTransitionTime": "2024-10-04T03:54:13Z",
  "message": "AuthenticatorCertKeyProgressing: All is well",
  "reason": "AsExpected",
  "status": "False",
  "type": "Progressing"
}
{
  "lastTransitionTime": "2024-10-04T03:52:34Z",
  "reason": "APIServerDeployment_PreconditionNotFulfilled",
  "status": "False",
  "type": "Available"
}
{
  "lastTransitionTime": "2024-10-03T21:32:31Z",
  "message": "All is well",
  "reason": "AsExpected",
  "status": "True",
  "type": "Upgradeable"
}
{
  "lastTransitionTime": "2024-10-04T00:04:57Z",
  "reason": "NoData",
  "status": "Unknown",
  "type": "EvaluationConditionsDetected"
}

Version-Release number of selected component (if applicable):

 4.18.0-0.nightly-arm64-2024-10-03-125849
 4.18.0-0.nightly-multi-2024-10-03-193054

How reproducible:

    always

Steps to Reproduce:

    1. upgrade from 4.15 to 4.16, and then to 4.17, and then to 4.18
    2.
    3.
    

Actual results:

    upgrade stuck on authentication operator

Expected results:

    upgrade succeed

Additional info:

    The issue was found in control plane only update jobs (with a paused worker pool), but it is not specific to control plane only updates because it can be reproduced in a normal chained upgrade from 4.15 to 4.18.

Feature Overview (aka. Goal Summary)  

Add OpenStackLoadBalancerParameters and add an option for setting the load-balancer IP address for only those platforms where it can be implemented.

Goals (aka. expected user outcomes)

As a user of on-prem OpenShift, I need to manage DNS for my OpenShift cluster manually. I can already specify an IP address for the API server, but I cannot do this for Ingress. This means that I have to:

  1. Manually create the API endpoint IP
  2. Add DNS for the API endpoint
  3. Create the cluster
  4. Discover the created Ingress endpoint
  5. Add DNS for the Ingress endpoint

I would like to simplify this workflow to the following (a hypothetical API sketch follows the list):

  1. Manually create the API and Ingress endpoint IPs
  2. Add DNS for the API and Ingress endpoints
  3. Create the cluster
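
A hypothetical sketch of what the API addition could look like: an OpenStack-specific parameters struct on the IngressController load balancer carrying the pre-allocated IP. The field and type names are assumptions, not the final openshift/api shape:

// Hypothetical sketch of OpenStack-specific load balancer parameters.
package v1

// OpenStackLoadBalancerParameters holds OpenStack-specific load balancer options.
type OpenStackLoadBalancerParameters struct {
	// FloatingIP is the pre-allocated IP the Ingress load balancer should use,
	// so DNS can be created before the cluster exists.
	// +optional
	FloatingIP string `json:"floatingIP,omitempty"`
}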

Requirements (aka. Acceptance Criteria):

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Questions to Answer (Optional):

Out of Scope

  • Although the Service API's loadBalancerIP API field was defined to be platform-agnostic, it wasn't consistently supported across platforms, and Kubernetes 1.24 has even deprecated it for this reason: https://github.com/kubernetes/kubernetes/pull/107235. We would not want to add a generic option to set loadBalancerIP given that it is deprecated and that it would work only on some platforms and not on others.

Background

  • This request is similar to RFE-843 (for AWS), RFE-2238 (for GCP), RFE-2824 (for AWS and MetalLB, and maybe others), RFE-2884 (for AWS, Azure, and GCP), and RFE-3498 (for AWS). However, it might make sense to keep this RFE specifically for OpenStack.

Customer Considerations

Documentation Considerations

Interoperability Considerations

Goal

  • Make Ingress work on day 1 without extra steps for the customer

Why is this important?

Scenarios


  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Goal

  • As an operator installing OCP on OSP with IPI, I would like to have single stack IPv6 enabled in day 1.

Why is this important?

  • Scarcity of IPv4 addresses

Scenarios

  1. Install IPv6 OCP cluster on IPv6 OpenStack with IPI and any address type (slaac, stateful and stateless).

Out of scope

  1. Fast datapath
  2. Conversion from dual-stack to single stack IPv6
  3. UPI

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Docs

Dependencies (internal and external)

  1. https://bugzilla.redhat.com/show_bug.cgi?id=2236671
  2. https://github.com/coreos/ignition/pull/1909
  3. (If SLAAC is necessary) https://bugzilla.redhat.com/show_bug.cgi?id=2304331

Previous Work (Optional):

  1. https://docs.google.com/document/d/1vT8-G2SFvanoeZWx38FiYY272RJjIv-l_5B6cOLzjVY/edit

Open questions::

To overcome the OVN metadata issue, we are adding an additional IPv4 network so metadata can be reached over IPv4 instead of IPv6, and we got a working installation. Now, let's try with config-drive, so we avoid specifying an IPv4 network and get the VMs to be IPv6-only.

  • Update the validation to allow the controlPlanePort field with one subnet (a hedged install-config sketch follows below)
  • Revisit the floating IP creation step that is done after infraReady, to avoid creation of a Floating IP
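
A hedged install-config sketch for the single-stack IPv6 scenario above, assuming the controlPlanePort validation is relaxed to accept a single (IPv6) subnet; the network and subnet names are placeholders:

  apiVersion: v1
  baseDomain: example.com
  metadata:
    name: ipv6-cluster
  networking:
    networkType: OVNKubernetes
    machineNetwork:
    - cidr: fd2e:6f44:5dd8:c956::/64
    clusterNetwork:
    - cidr: fd01::/48
      hostPrefix: 64
    serviceNetwork:
    - fd02::/112
  platform:
    openstack:
      cloud: openstack
      controlPlanePort:
        network:
          name: ipv6-network        # placeholder
        fixedIPs:
        - subnet:
            name: ipv6-subnet       # placeholder; a single IPv6 subnet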

As a customer of self-managed OpenShift, or an SRE managing a fleet of OpenShift clusters, I should be able to determine the progress and state of an OCP upgrade and only be alerted if the cluster is unable to progress. Support a CLI status command and a status API which can be used by cluster admins to monitor the progress. The status command/API should also contain data to alert users about potential issues which can make updates problematic.

Feature Overview (aka. Goal Summary)  

Here are common update improvements from customer interactions on Update experience

  1. Show nodes where pod draining is taking more time.
    Customers often have to dig deeper to find the nodes for further debugging.
    The ask has been to bubble this up in the update progress window.
  2. oc update status?
    From the UI we can see the progress of the update. From the oc CLI we can see this with "oc get clusterversion",
    but the ask is to show more details in a human-readable format.

    Know where the update has stopped. Consider adding at what run level it has stopped.
     
    oc get clusterversion
    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    
    version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
    

     

Documentation Considerations

Update docs for UX and CLI changes

Reference : https://docs.google.com/presentation/d/1cwxN30uno_9O8RayvcIAe8Owlds-5Wjr970sDxn7hPg/edit#slide=id.g2a2b8de8edb_0_22

Epic Goal*

Add a new `oc adm upgrade status` command which is backed by an API. Please find the mock output of the command attached in this card.

Why is this important? (mandatory)

  • From the UI we can see the progress of the update. Using the oc CLI we can see some of the information using "oc get clusterversion", but the output is not very readable and there is a lot of extra information to process. 
  • Customers are asking us to show more details in a human-readable format, as well as to provide an API which they can use for automation.

Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Other 

One piece of information that we lost compared to oc adm upgrade command is which ClusterOperators are updated right now. Previously, we presented CVO's Progressing=True message that says:

waiting on cloud-credential, cluster-autoscaler, console, csi-snapshot-controller, image-registry, ingress, insights, kube-storage-version-migrator, machine-approver, marketplace, monitoring, node-tuning, openshift-apiserver, openshift-controller-manager, operator-lifecycle-manager, service-ca, storage

The oc adm upgrade status output presents counts of updated/updating/pending operators, but does not say which ones are in which state. We should show this information somehow.

 

This is what we did for this card (for QE to verify):

  • In the control plane section, we add an "Updating" line to display the names of the Cluster Operators that are currently being updated.
  • In the "detail" mode, we add a table for those Cluster Operators to show their details.
  • These two new parts are hidden completely if there are no updating Cluster Operators at the moment.

The following is an example.

= Control Plane =
Assessment:      Progressing
Target Version:  4.14.1 (from 4.14.0)
Updating:        machine-config
Completion:      97% (32 operators updated, 1 updating, 0 waiting)
Duration:        14m (Est. Time Remaining: <10m)
Operator Health: 32 Healthy, 1 Unavailable

Updating Cluster Operators
NAME             SINCE   REASON   MESSAGE
machine-config   1m10s   -        Working towards 4.14.1

The current format of the worker status line is consistent with the original format of the operator status line. However, the operator status line is being reworked and simplified as part of OTA-1155. The goal of this task is to make the worker status line somewhat consistent with the newly modified, simplified operator status line.

The current worker status line (see the “Worker Status:  ...” line):

= Worker Pool =
Worker Pool:     worker
Assessment:      Degraded
Completion:      39%
Worker Status:   59 Total, 46 Available, 5 Progressing, 36 Outdated, 12 Draining, 0 Excluded, 7 Degraded 

The exact new format is not defined and is for the assignee to create.

A relevant Slack discussion: https://redhat-internal.slack.com/archives/CEGKQ43CP/p1706727395851369 

 

The main goal of this task is to:

  •  Make the worker status line consistent with the new operator status line.
  • Simplify the output information.
    • For example, we don’t have to display zero non-happy values such as “0 Degraded” or zero redundant values such as “0 Excluded”: 
      Worker Status: 4 Total, 4 Available, 0 Progressing, 3 Outdated, 0 Draining, 0 Excluded, 0 Degraded
    • For example, 1 Available and Updated, 2 Available and Outdated may be more clear than 3 Total, 3 Available, 2 Outdated
    • A possible solution:
      Worker Status: <Available and Updated>, <Available and Outdated> [from which X are paused], <Unavailable but Progressing (Progressing and thus Unavailable)>, <Unavailable AND NOT Progressing>

 

Definition of Done:

  • A pull request of a modified worker status line is merged.
  • The new worker status line is somewhat consistent with the new operator status line (OTA-1155) and is simplified.
  • Update the “Omitted additional…” line (shown when a large number of nodes is present and --details=false) accordingly as well.

On the call to discuss the oc adm upgrade status roadmap to a server-side implementation (notes) we agreed on a basic architectural direction, and we can start moving in that direction:

  • status API will be backed by a new controller
  • new controller will be a separate binary but delivered in the CVO image (=release payload) to avoid needing new ClusterOperator
  • new operator will maintain a singleton resource of a new UpgradeStatus CRD - this is the interface to the consumers

Let's start building this controller; we can implement the controller to perform the functionality currently present in the client, and just expose it through an API. I am not sure how to deal with the fact that we won't have the API merged until it merges into o/api, which is not soon. Maybe we can implement the controller over a temporary fork of o/api and rely on manually inserting the CRD into the cluster when we test the functionality? Not sure.

We need to avoid committing to implementation details and investing effort into things that may change though.

Definition of Done

  • CVO repository has a new controller (a new cluster-version-operator cobra subcommand sounds like a good option; an alternative would be a completely new binary included in the CVO image)
  • The payload contains manifests (SA, RBAC, Deployment) to deploy the new controller when DevPreviewNoUpgrade feature set is enabled (but not TechPreview)
  • The controller uses properly scoped minimal necessary RBAC through a dedicated SA
  • The controller will react on ClusterVersion changes in the cluster through an informer
  • The controller will maintain a single ClusterVersion status insight as specified by the Update Health API Draft
  • The controller does not need to maintain all fields precisely: it can use placeholders or even ignore fields that need more complicated logic over more resources (estimated finish, completion, assessment)
  • The controller will publish the serialized CV status insight (in yaml or json) through a ConfigMap (this is a provisionary measure until we can get the necessary API and client-go changes merged) under a key that identifies the kube resource ("cv-version")
  • The controller only includes the necessary types code from o/api PR together with the necessary generated code (like deepcopy). These local types will need to be replaced with the types eventually merged into o/api and vendored to o/cluster-version-operator

Testing notes

This card only brings a skeleton of the desired functionality to the DevPreviewNoUpgrade feature set. Its purpose is mainly to enable further development by putting the necessary bits in place so that we can start developing more functionality. There's not much point in automating testing of any of the functionality in this card, but it should be useful to start getting familiar with how the new controller is deployed and what its concepts are.

For seeing the new controller in action:

1. Launch a cluster that includes both the code and manifests. As of Nov 11, #1107 is not yet merged so you need to use launch 4.18,openshift/cluster-version-operator#1107 aws,no-spot
2. Enable the DevPreviewNoUpgrade feature set. CVO will restart and will deploy all functionality gated by this feature set, including the USC. It can take a bit of time, ~10-15m should be enough though.
3. Eventually, you should be able to see the new openshift-update-status-controller Namespace created in the cluster
4. You should be able to see an update-status-controller Deployment in that namespace
5. That Deployment should have one replica running and ready. It should not crashloop or anything like that. You can inspect its logs for obvious failures and such. At this point, its log should, near its end, say something like "the ConfigMap does not exist so doing nothing"
6. Create the ConfigMap that mimics the future API (make sure to create it in the openshift-update-status-controller namespace): oc create configmap -n openshift-update-status-controller status-api-cm-prototype
7. The controller should immediately-ish insert a usc-cv-version key into the ConfigMap. Its content is a YAML-serialized ClusterVersion status insight (see design doc). As of OTA-1269 the content is not that important, but (1) the reference to the CV and (2) the versions field should be correct.
8. The status insight should have a condition of Updating type. It should be False at this time (the cluster is not updating).
9. Start upgrading the cluster (it's a cluster bot cluster with an ephemeral 4.18 version, so you'll need to use --to-image=pullspec and probably force it)
10. While updating, you should be able to observe the controller activity in the log (it logs some diffs), but also the content of the status insight in the ConfigMap changing. The versions field should change appropriately (and startedAt too), and the Updating condition should become True.
11. Eventually the update should finish and the Updating condition should flip to False again.

Some of these will turn into automated testcases, but it does not make sense to implement that automation while we're using the ConfigMap instead of the API.
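
For steps 6-8 above, a couple of commands for creating and inspecting the prototype ConfigMap (the namespace, ConfigMap name, and key are taken from the steps above; the exact content shape may still change):

  $ oc create configmap -n openshift-update-status-controller status-api-cm-prototype
  $ oc get configmap -n openshift-update-status-controller status-api-cm-prototype \
      -o jsonpath='{.data.usc-cv-version}'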

Spun out of https://issues.redhat.com/browse/MCO-668

This aims to capture the work required to rotate the MCS-ignition CA + cert.

 

Original description copied from MCO-668:

Today in OCP there is a TLS certificate generated by the installer, which is called "root-ca" but is really "the MCS CA".

A key derived from this is injected into the pointer Ignition configuration under the "security.tls.certificateAuthorities" section, and this is how the client verifies it's talking to the expected server.

If this key expires (and by default the CA has a 10 year lifetime), newly scaled up nodes will fail in Ignition (and fail to join the cluster).

The MCO should take over management of this cert, and the corresponding user-data secret field, to implement rotation.
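
For illustration, a trimmed pointer Ignition config of the kind described above (the URL and CA payload are placeholders); the entry under security.tls.certificateAuthorities is derived from the MCS CA and is what would need to be rotated:

  {
    "ignition": {
      "version": "3.2.0",
      "config": {
        "merge": [
          { "source": "https://api-int.example.com:22623/config/worker" }
        ]
      },
      "security": {
        "tls": {
          "certificateAuthorities": [
            { "source": "data:text/plain;charset=utf-8;base64,<base64-encoded MCS CA bundle>" }
          ]
        }
      }
    }
  }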

Reading:

 - There is a section in the customer-facing documentation that touches on this and needs updating for clarification: https://docs.openshift.com/container-platform/4.13/security/certificate_types_descriptions/machine-config-operator-certificates.html
 
 - There's a pending PR to openshift/api: https://github.com/openshift/api/pull/1484/files

 - Also see old (related) bug: https://issues.redhat.com/browse/OCPBUGS-9890 
 

 - This is also separate to https://issues.redhat.com/browse/MCO-499 which describes the management of kubelet certs

We are currently writing the rootCA to disk via this template: https://github.com/openshift/machine-config-operator/blob/master/templates/common/_base/files/root-ca.yaml 

Nothing that we know of currently uses this file, and since it is templated via a MachineConfig, any update to the ConfigMap (root-ca in the kube-system namespace) used to generate this template will cause a MachineConfig roll-out. We will be updating this ConfigMap as part of cert rotation in MCO-643, so we'd like to prevent unnecessary rotation by removing this template.

The machinesets in the machine-api namespace reference a user-data secret (one per pool, which can be customized) that stores the initial Ignition stub configuration pointing to the MCS, along with the TLS cert. Today this does not get updated after creation.

 

The MCO now has the ability to manage some fields of the machineset object as part of the managed bootimage work. We should extend that to also sync in the updated user-data secrets for the ignition tls cert.

 

The MCC should be able to parse both install-time-generated machinesets as well as user-created ones, so as not to break compatibility. One way users use this today is with a custom secret + machineset to set Ignition fields that are not MCO-compatible, for example, to partition disks differently for different device types for nodes in the same pool. Extra care should be taken not to break this use case.
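
As a hedged illustration of the stub the MCC would have to parse and update, the relevant part of an existing user-data secret can be inspected like this (the secret name shown is the common default for the worker pool; customized pools may use other names):

  $ oc get secret -n openshift-machine-api worker-user-data \
      -o jsonpath='{.data.userData}' | base64 -d | jq '.ignition.security.tls'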

Feature Overview (aka. Goal Summary)  

This feature introduces a new command oc adm upgrade recommend in Tech Preview that improves how cluster administrators evaluate and select version upgrades.

Goals (aka. expected user outcomes)

  • Enable users (especially those with limited OpenShift expertise) to work through the upgrade selection process easily.
  • Reduce information overload
    • Provide clear, actionable recommendations for next version
    • Shows relevant warnings and risks per cluster
  • Help customers make informed decisions about when to initiate upgrades

Requirements (aka. Acceptance Criteria):

  • Allows targeting specific versions with --version
  • Shows conditional update risks
  • Shows only recent relevant releases instead of all versions
    • Limited to 2 per release
    • highlight the recommended version
  • Shows upgrade blockers and risks
  • Shows documentation/KCS links for more details about recommendation
  • Shows grouping based on security/performance/features changes in versions

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both Self-managed
Classic (standalone cluster) standalone
Hosted control planes yes
Multi node, Compact (three node), or Single node (SNO), or all All
Connected / Restricted Network All
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) All
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Add docs for recommend command

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Epic Goal

Changes in the web console/GUI and the oc CLI, where we will change the number of update recommendations users see.

  • We want to limit the number of update recommendations users see by default, because in our opinion a long list of older versions does not help. 
    • We want to provide a command-line option for users to see the older versions only when they are interested.
  • Move conditional risks/known risks out of the switch/button in the GUI. By default users should see recommended updates as well as update options with known issues/conditional risks
    • Conditional risks should have some way to identify them as risks/known issues (example: adding asterisks at the beginning or end of the risk)
  • We will adjust the ordering of recommended updates based on freshness

No console changes were made in 4.18, but we may follow up with those changes later if the tech-preview oc adm upgrade recommend is well received.

Why

  • Customers still think that RH removes edges because they are not aware of the flag that hides update options with known issues/ conditional risks.
  • Even when they notice the output advertising --include-not-recommended, customers might assume anything "not recommended" is too complicated to be worth reading about, when sometimes the assessed risk has a straightforward mitigation, and is closer to being a release note. We want to make those messages more accessible, without requiring customers to opt in.
  • A long list of update recommendations does not help users/customers. We want to reduce the paradox of choices.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Other 

Doesn't have to be recommend, but a new subcommand, so that we can rip out the oc adm upgrade output about "how is your cluster currently doing" (Failing=True, mid-update, etc.). The new subcommand would be focused solely on "I haven't decided which update I want to move to next; help me pick", including "I am thinking about 4.y.z, but I'm not completely sure yet; anything I should be aware of for that target?". 

 

Definition of Done:

For this initial ticket, we can just preserve all the current next-hop output, and tuck it behind a feature-gate environment variable, so we can make future pivots in follow-up tickets.

 

 

Conditional update UXes today are built around the assumption that when an update is conditional, it's a Red Hat issue, and some future release will fix the bug, and an update will become recommended. On this assumption, UXes like oc adm upgrade and the web-console both mention the existence of supported-but-not-recommended update targets, but don't push the associated messages in front of the cluster administrator.

But there are also update issues like exposure to Kubernetes API removals, where we will never restore the APIs, and we want the admin to take action (and maybe eventually accept the risk as part of the update). Do we want to adjust our update-risk UXes to be more open about discussing risks? For example, we could expose the message for the tip-most Recommended!=True update, or something like that. So the cluster admin could read the message, and decide for themselves if it was a "wait for newer releases" thing or a "fix something in my current cluster state" thing. I think this would reduce current confusion about "which updates is Upgradeable=False blocking?" (OCPBUGS-9013) and similar.

Some customers will want an older release than OTA-1272's longest-hops. --show-outdated-version might flood them with many old releases. This card is about giving them an option, maybe --version=4.17.99 that will show them context about that specific release, without distracting them with opinions about other releases.

We currently show all the recommended updates in decreasing order, and with --include-not-recommended all the updates-with-assessed-risks in decreasing order as well.  But sometimes users want to update to the longest hop, even if there are known risks.  Or they want to read about the assessed risks, in case there's something they can do to their cluster to mitigate a currently-assessed risk before kicking off the update.  This ticket is about adjusting oc's output to order roughly by release freshness.  For example, for a 4.y cluster in a 4.(y+1) channel:

  • 4.(y+1).tip
  • 4.y.tip
  • 4.(y+1).(tip-1)
  • 4.y.(tip-1)
    ...

Because users are more likely to care about 4.(y+1).tip, even if it has assessed risks, than they are to care about 4.y.reallyOld, even if it doesn't have assessed risks.

Show some number of these by default, and then use --show-outdated-versions or similar to see all the results.

See Scott here and me in OTA-902 both pitching something in this space.

Blocked on OTA-1271, because that will give us a fresh, tech-preview subcommand, where we can iterate without worrying about breaking existing users, until we're happy enough to GA the new approach.

For example, on 4.12.16 in fast-4.13, oc adm upgrade will currently show between 23 and 91 recommended updates (depending on your exposure to declared update risks):

cincinnati-graph-data$ hack/show-edges.py --cincinnati https://api.openshift.com/api/upgrades_info/graph fast-4.13 | grep '^4[.]12[.]16 ->' | wc -l
23
cincinnati-graph-data$ hack/show-edges.py --cincinnati https://api.openshift.com/api/upgrades_info/graph fast-4.13 | grep '^4[.]12[.]16 ' | wc -l
91

but showing folks 4.12.16-to-4.12.17 is not worth the line it takes, because 4.12.17 is so old, and customers would be much better served by 4.12.63 or 4.12.64, which address many bugs that 4.12.17 was exposed to. With this ticket, oc adm upgrade recommend would show something like:

Recommended updates:

  VERSION                   IMAGE
  4.12.64 quay.io/openshift-release-dev/ocp-release@sha256:1263000000000000000000000000000000000000000000000000000000000000
  4.12.63 quay.io/openshift-release-dev/ocp-release@sha256:1262000000000000000000000000000000000000000000000000000000000000

Updates with known issues:

  Version: 4.13.49
  Image: quay.io/openshift-release-dev/ocp-release@sha256:1349111111111111111111111111111111111111111111111111111111111111
  Recommended: False
  Reason: ARODNSWrongBootSequence
  Message: Disconnected ARO clusters or clusters with a UDR 0.0.0.0/0 route definition that are blocking the ARO ACR and quay, are not be able to add or replace nodes after an upgrade https://access.redhat.com/solutions/7074686

There are 21 more recommended updates and 67 more updates with known issues.  Use --show-outdated-versions to see  all older updates.

Goal:
Provide a Technical Preview of Gateway API with Istio to unify the management of cluster ingress with a common, open, expressive, and extensible API.

Description:
Gateway API is the evolution of upstream Kubernetes Ingress APIs. The upstream project is part of Kubernetes, working under SIG-NETWORK. OpenShift is contributing to the development, building a leadership position, and preparing OpenShift to support Gateway API, with Istio as our supported implementation.

The pluggable nature of the implementation of Gateway API enables support for additional and optional 3rd-party Ingress technologies.

At its core, OpenShift's implementation of Gateway API will be based on the existing Cluster Ingress Operator and OpenShift Service Mesh (OSSM). The Ingress Operator will manage the Gateway API CRDs (gatewayclasses, gateways, httproutes), install and configure OSSM, and configure DNS records for gateways. OSSM will manage the Istio and Envoy deployments for gateways and configure them based on the associated httproutes. Although OSSM in its normal configuration does support service mesh, the Ingress Operator will configure OSSM without service mesh features enabled; for example, using Gateway API will not require the use of sidecar proxies. Istio will be configured specifically to support Gateway API for cluster ingress. See the gateway-api-with-cluster-ingress-operator enhancement proposal for more details.
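
For orientation, a minimal sketch of the three resource kinds the Ingress Operator would manage (gatewayclasses, gateways, httproutes); the controllerName value and the names below are assumptions for illustration only:

  apiVersion: gateway.networking.k8s.io/v1
  kind: GatewayClass
  metadata:
    name: openshift-default
  spec:
    controllerName: openshift.io/gateway-controller   # assumed identifier
  ---
  apiVersion: gateway.networking.k8s.io/v1
  kind: Gateway
  metadata:
    name: example-gateway
    namespace: openshift-ingress
  spec:
    gatewayClassName: openshift-default
    listeners:
    - name: http
      protocol: HTTP
      port: 80
  ---
  apiVersion: gateway.networking.k8s.io/v1
  kind: HTTPRoute
  metadata:
    name: example-route
    namespace: demo
  spec:
    parentRefs:
    - name: example-gateway
      namespace: openshift-ingress
    rules:
    - backendRefs:
      - name: demo-service
        port: 8080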

Epic Goal

  • Test GWAPI release v1.0.0-* custom resources with current integration

    Why is this important?

  • Help find bugs in the v1.0.0 upstream release
  • Determine if any updates are needed in cluster-ingress-operator based on v1.0.0

    Planning Done Checklist

    The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)

Additional information on each of the above items can be found here: Networking Definition of Planned

Feature Overview (aka. Goal Summary)  

The ability in OpenShift to create trust and directly consume access tokens issued by external OIDC Authentication Providers using an authentication approach similar to upstream Kubernetes.

BYO Identity will help facilitate CLI only workflows and capabilities of the Authentication Provider (such as Keycloak, Dex, Azure AD) similar to upstream Kubernetes. 

Goals (aka. expected user outcomes)

Ability in OpenShift to provide a direct, pluggable Authentication workflow such that the OpenShift/K8s API server can consume access tokens issued by external OIDC identity providers. Kubernetes provides this integration as described here. Customer/Users can then configure their IDPs to support the OIDC protocols and workflows they desire such as Client credential flow.

OpenShift OAuth server is still available as default option, with the ability to tune in the external OIDC provider as a Day-2 configuration.

Requirements (aka. Acceptance Criteria):

  1. The customer should be able to tie into RBAC functionality, similar to how it is closely aligned with OpenShift OAuth 
  2.  

Use Cases (Optional):

  1. As a customer, I would like to integrate my OIDC Identity Provider directly with the OpenShift API server.
  2. As a customer in multi-cluster cloud environment, I have both K8s and non-K8s clusters using my IDP and hence I need seamless authentication directly to the OpenShift/K8sAPI using my Identity Provider 
  3.  

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Epic Goal

The ability to provide a direct authentication workflow such that OpenShift can consume bearer tokens issued by external OIDC identity providers, replacing the built-in OAuth stack by deactivating/removing its components as necessary.

 
Why is this important? (mandatory)

OpenShift has its own built-in OAuth server which can be used to obtain OAuth access tokens for authentication to the API. The server can be configured with an external identity provider (including support for OIDC), however it is still the built-in server that issues tokens, and thus authentication is limited to the capabilities of the oauth-server.

 
Scenarios (mandatory) 

  • As a customer, I want to integrate my OIDC Identity Provider directly with OpenShift so that I can fully use its capabilities in machine-to-machine workflows.
  • As a customer in a hybrid cloud environment, I want to seamlessly use my OIDC Identity Provider across all of my fleet.

 
Dependencies (internal and external) (mandatory)

  • Support in the console/console-operator (already completed)
  • Support in the OpenShift CLI `oc` (already completed)

Contributing Teams(and contacts) (mandatory) 

  • Development - OCP Auth
  • Documentation - OCP Auth
  • QE - OCP Auth
  • PX - 
  • Others -

Acceptance Criteria (optional)

  • external OIDC provider can be configured to be used directly via the kube-apiserver to issue tokens
  • built-in oauth stack no longer operational in the cluster; respective APIs, resources and components deactivated
  • changing back to the built-in oauth stack possible

Drawbacks or Risk (optional)

  • Enabling an external OIDC provider to an OCP cluster will result in the oauth-apiserver being removed from the system; this inherently means that the two API Services it is serving (v1.oauth.openshift.io, v1.user.openshift.io) will be gone from the cluster, and therefore any related data will be lost. It is the user's responsibility to create backups of any required data.
  • Configuring an external OIDC identity provider for authentication by definition means that any security updates or patches must be managed independently from the cluster itself, i.e. cluster updates will not resolve security issues relevant to the provider itself; the provider will have to be updated separately. Additionally, new functionality or features on the provider's side might need integration work in OpenShift (depending on their nature).

Done - Checklist (mandatory)

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

The test will serve as a development aid to test functionality as it gets added; the test will be extended/adapted as new features are implemented. This test will live behind the "ExternalOIDC" feature gate.

Goals of the baseline test:

  • deploy keycloak in the cluster, to use as an OIDC provider
  • configure the OIDC as a direct provider in the KAS
    • update the authentication CR with the OIDC provider configuration (a hedged example follows after this list)
    • sync the oidc provider's CA, if necessary, to the KAS pods static resources
    • patch the cluster proxy and the KAS CLI args to provide the OIDC configuration
    • wait for the changes to get rolled out
  • run some basic keycloak sanity checks
  • run some baseline authentication checks via the KAS
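
A hedged sketch of the authentication CR update mentioned above, assuming the tech-preview external OIDC fields keep their current shape (the issuer URL, audience, CA ConfigMap, and claim names are placeholders):

  apiVersion: config.openshift.io/v1
  kind: Authentication
  metadata:
    name: cluster
  spec:
    type: OIDC
    oidcProviders:
    - name: keycloak
      issuer:
        issuerURL: https://keycloak.example.com/realms/openshift   # placeholder
        audiences:
        - openshift-aud                                            # placeholder
        issuerCertificateAuthority:
          name: keycloak-oidc-ca    # ConfigMap in openshift-config (placeholder)
      claimMappings:
        username:
          claim: preferred_username
        groups:
          claim: groups
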
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Update OpenShift router to recognize a new annotation key "haproxy.router.openshift.io/ip_allowlist" in addition to the old "haproxy.router.openshift.io/ip_whitelist" annotation key. Continue to allow the old annotation key for now, but use the new one if it is present.

In a future release, we may remove the old annotation key, after allowing ample time for route owners to migrate to the new one. (We may also consider replacing the annotation with a formal API field.)
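
For example, a Route carrying the new annotation key (the address list is illustrative; the value format matches the existing whitelist annotation, i.e. space-separated IPs/CIDRs):

  apiVersion: route.openshift.io/v1
  kind: Route
  metadata:
    name: frontend
    annotations:
      # New key; the router should prefer this when both keys are present.
      haproxy.router.openshift.io/ip_allowlist: 192.0.2.0/24 203.0.113.7
  spec:
    to:
      kind: Service
      name: frontend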

Upstream Kubernetes deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations behind the deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission. 

With OpenShift 4.11, we turned on Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.

With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn". 
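
For reference, these are the standard per-namespace Pod Security Admission labels involved; with the 4.15 change, the synchronization mechanism would manage the enforce label rather than only warn and audit:

  apiVersion: v1
  kind: Namespace
  metadata:
    name: my-app
    labels:
      pod-security.kubernetes.io/enforce: restricted
      pod-security.kubernetes.io/warn: restricted
      pod-security.kubernetes.io/audit: restricted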

Epic Goal

Get Pod Security admission to be run in "restricted" mode globally by default alongside with SCC admission.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.

To protect platform workloads from such an effect (which, combined with PSa, might result in rejecting the workload once we start enforcing the "restricted" profile) we must pin the required SCC to all workloads in platform namespaces (openshift-, kube-, default).

Each workload should pin the SCC with the least-privilege, except workloads in runlevel 0 namespaces that should pin the "privileged" SCC (SCC admission is not enabled on these namespaces, but we should pin an SCC for tracking purposes).
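
Pinning is done on the workload's pod template; a minimal sketch, assuming the openshift.io/required-scc annotation is the pinning mechanism (the workload names and image are placeholders):

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: example-operator          # placeholder
    namespace: openshift-example    # placeholder
  spec:
    template:
      metadata:
        annotations:
          # Pin the least-privileged SCC this workload actually needs.
          openshift.io/required-scc: restricted-v2
      spec:
        containers:
        - name: operator
          image: example.com/operator:latest   # placeholder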

The following tables track progress.

Progress summary

namespaces                         4.19  4.18  4.17  4.16  4.15  4.14
monitored                            82    82    82    82    82    82
fix needed                           68    68    68    68    68    68
fixed                                39    39    35    32    39     1
remaining                            29    29    33    36    29    67
~ remaining non-runlevel              8     8    12    15     8    46
~ remaining runlevel (low-prio)      21    21    21    21    21    21
~ untested                            2     2     2     2    82    82

Progress breakdown

# namespace 4.19 4.18 4.17 4.16 4.15 4.14
1 oc debug node pods #1763 #1816 #1818  
2 openshift-apiserver-operator #573 #581  
3 openshift-authentication #656 #675  
4 openshift-authentication-operator #656 #675  
5 openshift-catalogd #50 #58  
6 openshift-cloud-credential-operator #681 #736  
7 openshift-cloud-network-config-controller #2282 #2490 #2496    
8 openshift-cluster-csi-drivers #6 #118 #524 #131 #306 #265 #75   #170 #459 #484  
9 openshift-cluster-node-tuning-operator #968 #1117  
10 openshift-cluster-olm-operator #54 n/a n/a
11 openshift-cluster-samples-operator #535 #548  
12 openshift-cluster-storage-operator #516   #459 #196 #484 #211  
13 openshift-cluster-version       #1038 #1068  
14 openshift-config-operator #410 #420  
15 openshift-console #871 #908 #924  
16 openshift-console-operator #871 #908 #924  
17 openshift-controller-manager #336 #361  
18 openshift-controller-manager-operator #336 #361  
19 openshift-e2e-loki #56579 #56579 #56579 #56579  
20 openshift-image-registry       #1008 #1067  
21 openshift-ingress   #1032        
22 openshift-ingress-canary   #1031        
23 openshift-ingress-operator   #1031        
24 openshift-insights #1033 #1041 #1049 #915 #967  
25 openshift-kni-infra #4504 #4542 #4539 #4540  
26 openshift-kube-storage-version-migrator #107 #112  
27 openshift-kube-storage-version-migrator-operator #107 #112  
28 openshift-machine-api #1308 #1317 #1311 #407 #315 #282 #1220 #73 #50 #433 #332 #326 #1288 #81 #57 #443  
29 openshift-machine-config-operator #4636 #4219 #4384 #4393  
30 openshift-manila-csi-driver #234 #235 #236  
31 openshift-marketplace #578 #561 #570
32 openshift-metallb-system #238 #240 #241    
33 openshift-monitoring #2298 #366 #2498   #2335 #2420  
34 openshift-network-console #2545        
35 openshift-network-diagnostics #2282 #2490 #2496    
36 openshift-network-node-identity #2282 #2490 #2496    
37 openshift-nutanix-infra #4504 #4539 #4540  
38 openshift-oauth-apiserver #656 #675  
39 openshift-openstack-infra #4504   #4539 #4540  
40 openshift-operator-controller #100 #120  
41 openshift-operator-lifecycle-manager #703 #828  
42 openshift-route-controller-manager #336 #361  
43 openshift-service-ca #235 #243  
44 openshift-service-ca-operator #235 #243  
45 openshift-sriov-network-operator #995 #999 #1003  
46 openshift-user-workload-monitoring #2335 #2420  
47 openshift-vsphere-infra #4504 #4542 #4539 #4540  
48 (runlevel) kube-system            
49 (runlevel) openshift-cloud-controller-manager            
50 (runlevel) openshift-cloud-controller-manager-operator            
51 (runlevel) openshift-cluster-api            
52 (runlevel) openshift-cluster-machine-approver            
53 (runlevel) openshift-dns            
54 (runlevel) openshift-dns-operator            
55 (runlevel) openshift-etcd            
56 (runlevel) openshift-etcd-operator            
57 (runlevel) openshift-kube-apiserver            
58 (runlevel) openshift-kube-apiserver-operator            
59 (runlevel) openshift-kube-controller-manager            
60 (runlevel) openshift-kube-controller-manager-operator            
61 (runlevel) openshift-kube-proxy            
62 (runlevel) openshift-kube-scheduler            
63 (runlevel) openshift-kube-scheduler-operator            
64 (runlevel) openshift-multus            
65 (runlevel) openshift-network-operator            
66 (runlevel) openshift-ovn-kubernetes            
67 (runlevel) openshift-sdn            
68 (runlevel) openshift-storage            

We should be able to correlate flows with network policies:

  • which policy allowed that flow?
  • what are the dropped flows?
  • provide global stats on dropped / accepted traffic

 

PoC doc: https://docs.google.com/document/d/14Y3YYFxuOs3o-Lkipf-d7ZZp5gpbk6-01ZT_fTraCu8/edit

There are two possible approaches in terms of implementation:

  • Add new "netpolicy flows" on top of existing flows
  • Enrich existing flows with netpolicy info.

The PoC describes the former; however, it is probably most interesting to aim for the latter. (95% of the PoC is valid in both cases, i.e. all the "low level" parts: OvS, OVN). The latter involves more work in FLP.

Epic Goal

Implement observability for ovn-k using OVS sampling.

Why is this important?

This feature should improve packet tracing and debuggability.

 

We need to do a lot of R&D and fix some known issues (e.g., see linked BZs). 

 

R&D targeted at 4.16 and productisation of this feature in 4.17

 

Goal
To make the current implementation of the HAProxy config manager the default configuration.

Objectives

  • Disable pre-allocation route blueprints
  • Limit dynamic server allocation
  • Provide customer opt-out
    • Offer customers a handler to opt out of the default config manager implementation.

 

https://issues.redhat.com/browse/NE-1788 describes 3 gaps in the implementation of DAC:

  • Idled services are woken up by the health check from the servers set by DAC (server-template).
  • ALPN TLS extension is not enabled for reencrypt routes.
  • Dynamic servers produce dummy metrics.

Additional gaps were discovered along the way:

This story aims at fixing those gaps.

The goal of this user story is to combine the code from the smoke test user story and results from the spike into an implementation PR.

Since multiple gaps were discovered, a feature gate will be needed to ensure the stability of OCP before the feature can be enabled by default.

Overview

Initiative: Improve etcd disaster recovery experience (part3)

With OCPBU-252 and OCPBU-254 we create the foundations for an enhanced experience of a recovery procedure in the case of full control plane loss. This requires researching total control-plane failure scenarios of clusters deployed using the various deployment methodologies.

Scope of this feature:

  • Spike to research whether restoring a full control plane with identical properties to the original control plane allows re-importing workers, and to document workload behavior
  • Document procedure to restore from full control plane failure using compact cluster to restore control plane and the re-attachment of workers
  • Enhanced e2e testing for validation of the updated manual procedure under this feature

Epic Goal*

Improve the disaster recovery experience by providing automation for the steps to recover from an etcd quorum loss scenario.

Determining the exact format of the automation (bash script, ansible playbook, CLI) is a part of this epic, but ideally it would be something the admin can initiate on the recovery host that then walks through the disaster recovery steps given the necessary inputs (e.g. backup and static pod files, SSH access to the recovery and non-recovery hosts, etc.).

 
Why is this important? (mandatory)

There are a large number of manual steps in the currently documented disaster recovery workflow, which customers and support staff have voiced concerns about as being too cumbersome and error prone.
https://docs.openshift.com/container-platform/4.15/backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html

Providing more automation would improve that experience and also let the etcd team better support and test the disaster recovery workflow.

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.

(TBD based on the delivery vehicle for the automation):

  1. As a cluster admin in a DR scenario I can trigger the quorum recovery procedure (e.g via CLI cmd on a recovery host) to reestablish quorum and recover a stable control-plane with API availability. 

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - etcd team
  • Documentation - etcd docs
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

After running the quorum restore script we want to bring the other members back into the cluster automatically. 

Currently the init container in 

https://github.com/openshift/etcd/blob/openshift-4.17/openshift-tools/pkg/discover-etcd-initial-cluster/initial-cluster.go

is guarding that case by checking whether the member is part of the cluster already and has an empty datadir.

We need to adjust this check by testing whether the cluster id of the currently configured member and the current datadir refer to the same cluster.

When we detect a mismatch, we can assume the cluster was recovered by quorum restore and we can attempt to move the folder to automatically make the member join the cluster again.

We need to add an e2e test to our disaster recovery suite in order to exercise that the quorum can be restored automatically.

While we're at it, we can also disable the experimental rev bumping introduced with:

https://github.com/openshift/origin/pull/28073

 

Several steps cover the shutdown of the etcd static pod. We can provide a script to execute, which you can simply run through ssh:

> ssh core@node disable-etcd.sh

That script should move the static pod manifest into a different folder and wait for the containers to shut down.
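
A hedged sketch of what such a disable-etcd.sh could do (paths follow the usual static pod layout; the destination directory name is illustrative):

  #!/usr/bin/env bash
  # Move the etcd static pod manifest aside so kubelet stops the pod,
  # then wait until no etcd container is running anymore.
  set -euo pipefail

  MANIFEST=/etc/kubernetes/manifests/etcd-pod.yaml
  STOPPED_DIR=/etc/kubernetes/static-pod-resources/etcd-pod-disabled   # illustrative

  mkdir -p "${STOPPED_DIR}"
  if [ -f "${MANIFEST}" ]; then
    mv "${MANIFEST}" "${STOPPED_DIR}/"
  fi

  # Wait for kubelet to tear the containers down.
  until [ -z "$(crictl ps --name etcd -q)" ]; do
    echo "waiting for etcd containers to stop..."
    sleep 5
  done
  echo "etcd static pod is stopped on this node"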

Currently we have the bump guarded by an env variable:

https://github.com/openshift/cluster-etcd-operator/blob/master/bindata/etcd/cluster-restore.sh#L151-L155

and a hardcoded bump of one billion revisions in:

https://github.com/openshift/cluster-etcd-operator/blob/master/bindata/etcd/restore-pod.yaml#L64-L68

With this story we should remove the feature flag and enable the bumping by default. The bump amount should come from the file created in ETCD-652 plus some slack percentage. If the file doesn't exist we assume the default value of a billion again.
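
A small shell sketch of the intended default behaviour, assuming the sidecar from the related story writes the maximum observed raft index into a file under /var/lib/etcd (the file name, slack percentage, and format are assumptions):

  # Hypothetical: the sidecar writes the max observed raft index into this file;
  # the name and format are assumptions for illustration.
  RAFT_INDEX_FILE=/var/lib/etcd/max-raft-index
  DEFAULT_BUMP=1000000000   # one billion, the previous hardcoded value
  SLACK_PERCENT=10          # illustrative slack

  if [ -r "${RAFT_INDEX_FILE}" ]; then
    base=$(cat "${RAFT_INDEX_FILE}")
    BUMP_REV=$(( base + base * SLACK_PERCENT / 100 ))
  else
    BUMP_REV=${DEFAULT_BUMP}
  fi
  echo "bumping revisions by ${BUMP_REV}"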

 

Based on --force-new-cluster, we need to add a quorum restore script that will only do that, without any inputs.

To enable resource version bumps on restore, we would need to know how far into the future (in terms of revisions) we need to bump. 

We can get this information by requesting endpoint status on each member and using the maximum of all RaftIndex fields as the result. Alternatively by finding the current leader and getting its endpoint status directly.

Even though this is not an expensive operation, this should be polled at a sensible interval, e.g. once every 30s. 

The result should be written as a textfile in the hostPath /var/lib/etcd that is already mounted on all relevant pods. An additional etcd sidecar container should be the most sensible choice to run this code.
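
A hedged shell sketch of that polling loop using etcdctl's JSON output (the output file name matches the assumption above; auth/cert flags are omitted, and a small sidecar using the client library directly would work just as well):

  # Poll endpoint status every 30s and persist the maximum raft index
  # across all members into the hostPath already mounted on the etcd pods.
  OUT=/var/lib/etcd/max-raft-index   # assumed file name
  while true; do
    etcdctl endpoint status --cluster -w json \
      | jq 'map(.Status.raftIndex) | max' > "${OUT}.tmp" \
      && mv "${OUT}.tmp" "${OUT}"
    sleep 30
  done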

 

Currently the readiness probe (of the guard pod) will constantly fail because the restore pod containers do not have the readyZ sidecar container. 

Example error message:

> Oct 16 13:42:52 ci-ln-s2hivzb-72292-6r8kj-master-2 kubenswrapper[2624]: I1016 13:42:52.512331    2624 prober.go:107] "Probe failed" probeType="Readiness" pod="openshift-etcd/etcd-guard-ci-ln-s2hivzb-72292-6r8kj-master-2" podUID="2baa50c6-b5cd-463e-9b35-165570e94b76" containerName="guard" probeResult="failure" output="Get \"https://10.0.0.4:9980/readyz\": dial tcp 10.0.0.4:9980: connect: connection refused"

 

AC:

  • The guard pod does not complain about the missing readyZ anymore

 

To be broken into one feature epic and a spike:

  • feature: error type disambiguation and error propagation into operator status
  • spike: general improvement on making errors more actionable for the end user

 

The MCO today has multiple layers of errors. There are generally speaking 4 locations where an error message can appear, from highest to lowest:

  1. The MCO operator status
  2. The MCPool status
  3. The MCController/Daemon pod logs
  4. The journal logs on the node

 

The error propagation is generally speaking not 1-to-1. The operator status will generally capture the pool status, but the full error from Controller/Daemon does not fully bubble up to pool/operator, and the journal logs with error generally don’t get bubbled up at all. This is very confusing for customers/admins working with the MCO without full understanding of the MCO’s internal mechanics:

  1. The real error is hard to find
  2. The error message is often generic and ambiguous
  3. The solution/workaround is not clear at all

 

Using “unexpected on-disk state” as an example, this can be caused by any amount of the following:

  1. An incomplete update happened, and something rebooted the node
  2. The node upgrade was successful until rpm-ostree, which failed and atomically rolled back
  3. The user modified something manually
  4. Another operator modified something manually
  5. Some other service/network manager overwrote something MCO writes

Etc. etc.

 

Since error use cases are wide and varied, there are many improvements we can perform for each individual error state. This epic aims to propose targeted improvements to error messaging and propagation specifically. The goals being:

 

  1. Disambiguating different error cases with the same message
  2. Adding more error catching, including journal logs and rpm-ostree errors
  3. Propagating full error messages further up the stack, up to the operator status in a clear manner
  4. Adding actionable fix/information messages alongside the error message

 

With a side objective of observability, including reporting all the way to the operator status items such as:

  1. Reporting the status of all pools
  2. Pointing out current status of update/upgrade per pool
  3. What the update/upgrade is blocking on
  4. How to unblock the upgrade

Approaches can include:

  1. Better error messaging starting with common error cases
  2. Disambiguating config mismatch
  3. Capturing rpm-ostree logs from previous boot, in case of osimageurl mismatch errors
  4. Capturing full daemon error message back to pool/operator status
  5. Adding a new field to the MCO operator spec, that attempts to suggest fixes or where to look next, when an error occurs
  6. Adding better alerting messages for MCO errors

The error propagation is generally speaking not 1-to-1. The operator status will generally capture the pool status, but the full error from Controller/Daemon does not fully bubble up to pool/operator, and the journal logs with error generally don’t get bubbled up at all. This is very confusing for customers/admins working with the MCO without full understanding of the MCO’s internal mechanics:

  1. The real error is hard to find
  2. The error message is often generic and ambiguous
  3. The solution/workaround is not clear at all

 



 

Description:

MCC sends a drain alert when a node drain does not succeed within the drain timeout period (currently 1 hour). This is to make sure that the admin takes appropriate action, if required, by looking at the MCC pod logs. The alert contains information on where to find the logs.

Example alert looks like:

Drain failed on Node <node_name>, updates may be blocked. For more details: oc logs -f -n openshift-machine-config-operator machine-config-controller-xxxxx -c machine-config-controller

It is possible that the admin may not be able to determine the exact action to take after looking at the MCC pod logs. Adding a runbook (https://github.com/openshift/runbooks) can help the admin troubleshoot and take appropriate action.

 

Acceptance Criteria:

  • Runbook doc is created for MCCDrainError alert
  • Created runbook link is accessible to cluster admin with MCCDrainError alert

 

Feature Overview (aka. Goal Summary)  

Phase 2 Goal:  

  • Complete the design of the Cluster API (CAPI) architecture and build the core operator logic
  • attach and detach of load balancers for internal and external load balancers for control plane machines on AWS, Azure, GCP and other relevant platforms
  • manage the lifecycle of Cluster API components within OpenShift standalone clusters
  • E2E tests

for Phase-1, incorporating the assets from different repositories to simplify asset management.

Background, and strategic fit

Overarching Goal
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone Openshift.
Phase 1 & 2 covers implementing base functionality for CAPI.
Phase 2 also covers migrating MAPI resources to CAPI.

  • Initially CAPI did not meet the requirements for cluster/machine management that OCP had, but the project has moved on; CAPI is now a better fit and also has better community involvement.
  • CAPI has much better community interaction than MAPI.
  • Other projects are considering using CAPI and it would be cleaner to have one solution
  • Long term it will allow us to add new features more easily in one place vs. doing this in multiple places.

Acceptance Criteria

There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • To add support for generating Cluster and Infrastructure Cluster resources on Cluster API based clusters

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

To enable a quick start in CAPI, we want to allow the users to provide just the Machines/MachineSets and the relevant configuration for the Machines. The cluster infrastructure is either not required to be populated, or something they should not care about.

To enable the quick start, we should create and where applicable, populate required fields for the infrastructure cluster.

This will go alongside a generated Cluster object and should mean that the `openshift-cluster-api` Cluster is now infrastructure ready.

Steps

  • Create a separate go module in cluster-api-provider-azure/openshift
  • Create a Go controller in the above module to manage the AzureCluster resource for non-CAPI bootstrapped clusters
  • Ensure the AzureCluster controller is only enabled for Azure platform clusters
  • Create an "externally-managed" AzureCluster resource and manage the status to ensure Machines can be created correctly
  • Populate any required spec/status fields in the AzureCluster spec using the controller
  • (Refer to openstack implementation)
  •  
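
A minimal sketch of the controller described in the steps above, assuming controller-runtime and CAPZ's v1beta1 AzureCluster type; the reconcile shown only marks the externally managed cluster as infrastructure-ready and is illustrative, not the eventual implementation:

package controllers

import (
    "context"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"

    infrav1 "sigs.k8s.io/cluster-api-provider-azure/api/v1beta1"
)

// AzureClusterReconciler manages the externally managed AzureCluster for
// non-CAPI-bootstrapped clusters.
type AzureClusterReconciler struct {
    client.Client
}

func (r *AzureClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    azureCluster := &infrav1.AzureCluster{}
    if err := r.Get(ctx, req.NamespacedName, azureCluster); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // The infrastructure already exists (installer-provisioned), so the job
    // here is only to populate any required spec/status fields and mark the
    // cluster infrastructure as ready so Machines can be created.
    if !azureCluster.Status.Ready {
        azureCluster.Status.Ready = true
        if err := r.Status().Update(ctx, azureCluster); err != nil {
            return ctrl.Result{}, err
        }
    }
    return ctrl.Result{}, nil
}

func (r *AzureClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&infrav1.AzureCluster{}).
        Complete(r)
}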

Stakeholders

  • Cluster Infra

Definition of Done

  • AzureCluster resource is correctly created and populated on Azure clusters
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Background

To enable a quick start in CAPI, we want to allow the users to provide just the Machines/MachineSets and the relevant configuration for the Machines. The cluster infrastructure is either not required to be populated, or something they should not care about.

To enable the quick start, we should create and where applicable, populate required fields for the infrastructure cluster.

This will go alongside a generated Cluster object and should mean that the `openshift-cluster-api` Cluster is now infrastructure ready.

Steps

  • Create a separate go module in cluster-api-provider-vsphere/openshift
  • Create a Go controller in the above module to manage the VSphereCluster resource for non-CAPI bootstrapped clusters
  • Ensure the VSphereCluster controller is only enabled for VSphere platform clusters
  • Create an "externally-managed" VSphereCluster resource and manage the status to ensure Machines can be created correctly
  • Populate any required spec/status fields in the VSphereCluster spec using the controller
  • (Refer to openstack implementation)
  •  

Stakeholders

  • Cluster Infra

Definition of Done

  • VSphereCluster resource is correctly created and populated on VSphere clusters
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Background

We expect every openshift cluster that relies on Cluster API to have an infrastructure cluster and a cluster object.

These resources should exist for the lifetime of the cluster and should not be able to be removed.

We must ensure that infracluster objects from supported platforms cannot be deleted once created.

Changes to go into the cluster-capi-operator.

Steps

  • Build validating admission that prevents InfraCluster objects from being deleted
  • Either use a webhook, or ValidatingAdmissionPolicy to achieve this
  • Apply only to the infracluster object in the openshift-cluster-api namespace
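
One possible shape for the webhook variant (a sketch using controller-runtime's admission package; whether this is implemented as a webhook or a ValidatingAdmissionPolicy is exactly the choice called out above):

package webhooks

import (
    "context"
    "fmt"

    admissionv1 "k8s.io/api/admission/v1"
    "sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// infraClusterDeletionBlocker rejects DELETE requests for infrastructure
// cluster objects in the openshift-cluster-api namespace.
type infraClusterDeletionBlocker struct{}

func (b *infraClusterDeletionBlocker) Handle(ctx context.Context, req admission.Request) admission.Response {
    if req.Operation != admissionv1.Delete {
        return admission.Allowed("")
    }
    if req.Namespace != "openshift-cluster-api" {
        return admission.Allowed("")
    }
    return admission.Denied(fmt.Sprintf(
        "%s %q is managed by the cluster-capi-operator and must not be deleted",
        req.Kind.Kind, req.Name,
    ))
}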

Stakeholders

  • Cluster infra

Definition of Done

  • When installed into a cluster, the cluster's infracluster object cannot be removed using `oc delete`
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Background

To enable a quick start in CAPI, we want to allow the users to provide just the Machines/MachineSets and the relevant configuration for the Machines. The cluster infrastructure is either not required to be populated, or something they should not care about.

To enable the quick start, we should create and where applicable, populate required fields for the infrastructure cluster.

This will go alongside a generated Cluster object and should mean that the `openshift-cluster-api` Cluster is now infrastructure ready.

Steps

  • Create a separate go module in cluster-api-provider-gcp/openshift
  • Create a Go controller in the above module to manage the GCPCluster resource for non-CAPI bootstrapped clusters
  • Ensure the GCPCluster controller is only enabled for GCP platform clusters
  • Create an "externally-managed" GCPCluster resource and manage the status to ensure Machines can be created correctly
  • Populate any required spec/status fields in the GCPCluster spec using the controller
  • (Refer to openstack implementation)
  •  

Stakeholders

  • Cluster Infra

Definition of Done

  • GCPCluster resource is correctly created and populated on GCP clusters
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Feature Overview (aka. Goal Summary)  

Implement Migration core for MAPI to CAPI for AWS

  • This feature covers the design and implementation of converting from using the Machine API (MAPI) to Cluster API (CAPI) for AWS
  • This Design investigates possible solutions for AWS
  • Once AWS shim/sync layer is implemented use the architecture for other clouds in phase-2 & phase 3

Acceptance Criteria

There must be no negative effect for customers switching over to using CAPI: migration of Machine resources must be seamless, and the fields in MAPI/CAPI should reconcile from both CRDs.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • To bring MAPI and CAPI to feature parity and unblock conversions between MAPI and CAPI resources

Why is this important?

  • Blocks migration to Cluster API

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

MAPA has support for users to configure the Network DeviceIndex.

According to AWS, the primary network interface must use the value 0.

It appears that CAPA already forces this (it only supports creating one primary network interface) or assigns these values automatically if you are supplying your own network interfaces.

Therefore, it is likely that we do not need to support this value (MAPA only supports a single network interface), but we must be certain.

Steps

  • Test what happens if the DeviceIndex is a non-zero value in MAPA
  • If it works, we need to come up with a way to convince CAPA to support a custom device index
  • If it does not work, then in theory no customer could be using this, and dropping support should be fine. Document this in the conversion library.

Stakeholders

  • Cluster Infra

Definition of Done

  • We have made a decision about the device index field and how to handle it. Be that always erroring perpetually, or finding a way to get this support into CAPA.
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

Why is this important?

  • We need to build out the core so that development of the migration for individual providers can then happen in parallel
  •  

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

We want to build out a sync controller for both Machine and MachineSet resources.

This card is about bootstrapping the basics of the controllers, with more implementation to follow once we have the base structure.

For this card, we are expecting to create 2 controllers, one for Machines, one for MachineSets.

The MachineSet controller should watch MachineSets from both MachineAPI and ClusterAPI in the respective namespaces that we care about. It should also be able to watch the referenced infrastructure templates from the CAPI MachineSets.

For the Machine controller, it should watch both types of Machines in MachineAPI and ClusterAPI in their respective namespaces. It should also be able to watch for InfrastructureMachines for the CAPI Machines in the openshift-cluster-api namespace.

If changes to any of the above resources occur, the controllers should trigger a reconcile which will fetch both the Machine API and Cluster API versions of the resources, and then split the reconcile depending on which version is authoritative.

Deletion logic will be handled by a separate card, but will need a fork point in the main reconcile that accounts for if either of the resources have been deleted, once they have been fetched from the cache.

Note, if a MachineSet exists only in CAPI, the controller can exit and ignore the reconcile request.

If a Machine only exists in CAPI, but is owned by another object (MachineSet for now) that is then mirrored into MAPI, the Machine needs to be reconciled so that we can produce the MAPI mirror of the Machine.
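
A rough sketch of that fork for the MachineSet controller, assuming the new `status.authoritativeAPI` field on the Machine API resource; the helper method bodies are left as the templates described below, and all names are illustrative:

package controllers

import (
    "context"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"

    machinev1beta1 "github.com/openshift/api/machine/v1beta1"
    clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

type MachineSetSyncReconciler struct {
    client.Client
    MAPINamespace string // e.g. "openshift-machine-api"
    CAPINamespace string // e.g. "openshift-cluster-api"
}

func (r *MachineSetSyncReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    mapiMS := &machinev1beta1.MachineSet{}
    mapiErr := r.Get(ctx, client.ObjectKey{Namespace: r.MAPINamespace, Name: req.Name}, mapiMS)

    capiMS := &clusterv1.MachineSet{}
    capiErr := r.Get(ctx, client.ObjectKey{Namespace: r.CAPINamespace, Name: req.Name}, capiMS)

    switch {
    case client.IgnoreNotFound(mapiErr) != nil:
        return ctrl.Result{}, mapiErr
    case client.IgnoreNotFound(capiErr) != nil:
        return ctrl.Result{}, capiErr
    case mapiErr != nil:
        // The MachineSet only exists in CAPI and is not mirrored: ignore it.
        return ctrl.Result{}, nil
    }

    // Split the reconcile based on which API is authoritative. The status
    // field used here is the one assumed to be added by this work.
    if string(mapiMS.Status.AuthoritativeAPI) == "MachineAPI" {
        return r.reconcileMAPIToCAPI(ctx, mapiMS, capiMS)
    }
    return r.reconcileCAPIToMAPI(ctx, capiMS, mapiMS)
}

// reconcileMAPIToCAPI and reconcileCAPIToMAPI are the templates for future
// work: conversion, mirroring and deletion handling land here.
func (r *MachineSetSyncReconciler) reconcileMAPIToCAPI(ctx context.Context, mapiMS *machinev1beta1.MachineSet, capiMS *clusterv1.MachineSet) (ctrl.Result, error) {
    return ctrl.Result{}, nil
}

func (r *MachineSetSyncReconciler) reconcileCAPIToMAPI(ctx context.Context, capiMS *clusterv1.MachineSet, mapiMS *machinev1beta1.MachineSet) (ctrl.Result, error) {
    return ctrl.Result{}, nil
}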

Steps

  • Bootstrap base controllers
  • Fetch resources in the controllers as per description above
  • Set up reconcileMAPItoCAPI and reconcileCAPItoMAPI functions as templates for future work.
  • Set up watches based on the description above - note this will need some dynamic watching since the infrastructure refs may refer to any resource
  • Add envtest based testsuite setup for controllers

Stakeholders

  • Cluster Infra

Definition of Done

  • We have the basis of the sync controllers implemented so that we can start implementing that actual business logic.
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Background

We have now merged a design for the MAPI to CAPI library, but, have not been extensively testing it up to now.

There are a large number of fields that currently cannot be converted, and we should ensure each of these is tested.

Steps

  • Update the testutils providerSpec generation in actuator pkg to be able to configure more fields (those that need to be cleared or configured to specific values)
  • Identify a "base" that will pass the conversion and build a test structure that allows this base to be mutated to create specific test cases
  • Add a test case for each of the expected failures to verify the output error message when misconfiguration occurs
  • Cover AWS MAPI to CAPI, Machine MAPI to CAPI, MachineSet MAPI to CAPI
  • And then reverse the above by doing the same in the CAPI to MAPI version
  • This could be broken down into several tasks and implemented as separate PRs

Stakeholders

  • Cluster Infra

Definition of Done

  • We have both positive and extensive negative testing for the MAPI to CAPI conversions in the capi operator repo
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Background

Fuzz testing should be used to create round trip testing and pick up issues in conversion.

Fuzz tests auto generate data to put into fields and we can ensure that combinations of fields are converted appropriately and also pick up when new fields are introduced into the APIs by fuzz testing and ensuring that fields are correctly round tripped.

We would like to set up a pattern for fuzz testing that can be used across various providers as we implement new provider conversions.
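
A sketch of the round-trip pattern, using github.com/google/gofuzz with custom funcs for the usual exception types; the roundTripMachineSetSpec helper is a placeholder for the real MAPI -> CAPI -> MAPI conversion entry points:

package conversiontest

import (
    "reflect"
    "testing"

    fuzz "github.com/google/gofuzz"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"

    machinev1beta1 "github.com/openshift/api/machine/v1beta1"
)

// roundTripMachineSetSpec stands in for the real conversion entry points,
// which this sketch does not reproduce.
func roundTripMachineSetSpec(in *machinev1beta1.MachineSetSpec) (*machinev1beta1.MachineSetSpec, error) {
    return in.DeepCopy(), nil
}

func TestMachineSetSpecRoundTrip(t *testing.T) {
    fuzzer := fuzz.New().NilChance(0.2).NumElements(1, 3).Funcs(
        // Known exception types get custom funcs; a real test would fuzz the
        // providerSpec type and re-encode it into the RawExtension.
        func(r *runtime.RawExtension, c fuzz.Continue) {
            r.Raw = []byte("{}")
            r.Object = nil
        },
        func(tm *metav1.Time, c fuzz.Continue) {
            *tm = metav1.Now()
        },
    )

    for i := 0; i < 100; i++ {
        in := &machinev1beta1.MachineSetSpec{}
        fuzzer.Fuzz(in)

        out, err := roundTripMachineSetSpec(in)
        if err != nil {
            t.Fatalf("conversion failed: %v", err)
        }
        if !reflect.DeepEqual(in, out) {
            t.Errorf("round trip mismatch:\n in: %+v\nout: %+v", in, out)
        }
    }
}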

Steps

  • Set up fuzz testing for AWS MAPI to CAPI conversion
  • Factor the fuzz tests into utils that can easily be re-used
  • Add TODOs where current conversions create fuzz exceptions
  • Do the same for CAPI to MAPI

Stakeholders

  • Cluster Infra

Definition of Done

  • Fuzz testing is introduced to catch future breakages and help identify issues in round trip conversions.
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Background

When the Machine and MachineSet MAPI resource are non-authoritative, the Machine and MachineSet controllers should observe this condition and should exit, pausing the reconciliation.

When they pause, they should acknowledge this pause by adding a paused condition to the status and ensuring it is set to true.

Behaviours

  • Should not reconcile when .status.authoritativeAPI is not MachineAPI
  • Except when it is empty (prior to defaulting migration webhook)

Steps

  • Ensure MAO has new API fields vendored
  • Add checks in Machine/MachineSet for authoritative API in status not Machine API
  • When not machine API, set paused condition == true, otherwise paused == false (same as CAPI)
    • Condition should be giving reasons for both false and true
  • This feature must be gated on the ClusterAPIMigration feature gate
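
A small sketch of the decision described above; the condition is expressed as a metav1.Condition here and the authoritativeAPI value is the assumed new status field, so treat the names as illustrative:

package controllers

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const pausedConditionType = "Paused"

// pausedConditionFor mirrors the behaviour above: reconcile only when
// MachineAPI is authoritative (or the field is still empty, i.e. before the
// defaulting/migration webhook has run), and report the decision as a
// condition with a reason for both the true and false cases.
func pausedConditionFor(authoritativeAPI string) (paused bool, cond metav1.Condition) {
    if authoritativeAPI == "" || authoritativeAPI == "MachineAPI" {
        return false, metav1.Condition{
            Type:    pausedConditionType,
            Status:  metav1.ConditionFalse,
            Reason:  "AuthoritativeAPIMachineAPI",
            Message: "MachineAPI is authoritative; the controller is reconciling this resource",
        }
    }
    return true, metav1.Condition{
        Type:    pausedConditionType,
        Status:  metav1.ConditionTrue,
        Reason:  "AuthoritativeAPINotMachineAPI",
        Message: "another API is authoritative; reconciliation of this resource is paused",
    }
}

The Machine/MachineSet controllers would then write this condition back to status and return early from the reconcile whenever paused is true.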

Stakeholders

  • Cluster Infra

Definition of Done

  • When the status of Machine indicates that the Machine API is not authoritative, the Paused condition should be set and no action should be taken.
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

The core of the Machine API to Cluster API conversion will rely on a bi-directional conversion library that can convert providerSpecs from Machine API into InfraTemplates in Cluster API, and back again.

We should aim to have a platform agnostic interface such that the core logic of the migration mechanism need not care about platforms specific detail.

The library should also be able to return errors when conversion is not possible, which may occur when:

  • A feature in MAPI is not implemented in CAPI
  • A feature in CAPI is not implemented in MAPI
  • A value in MAPI, that now exists on the infrastructure cluster, is not compatible with the existing infrastructure cluster

These errors should resemble the API validation errors from webhooks, for familiarity, using utils such as `field.NewPath` and the InvalidValue error types.

We expect this logic to be used in the core sync controllers, responsible for converting Machine API resources to Cluster API resources and vice versa.
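
For example, the AWS conversion could reject a providerSpec setting that has no CAPI equivalent using the same field error utilities, which keeps the messages consistent with webhook validation errors (a sketch; the surrounding conversion code is hypothetical):

package mapi2capi

import (
    machinev1beta1 "github.com/openshift/api/machine/v1beta1"
    "k8s.io/apimachinery/pkg/util/validation/field"
)

// validateAWSDeviceIndex shows the error shape for a MAPI feature that is not
// implemented in CAPI; callers would pass something like
// field.NewPath("providerSpec", "value") as the parent path.
func validateAWSDeviceIndex(spec *machinev1beta1.AWSMachineProviderConfig, fldPath *field.Path) field.ErrorList {
    var errs field.ErrorList
    if spec.DeviceIndex != 0 {
        errs = append(errs, field.Invalid(
            fldPath.Child("deviceIndex"),
            spec.DeviceIndex,
            "only device index 0 is supported when converting to Cluster API",
        ))
    }
    return errs
}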

DoD:

  • Flesh out a design for a code interface for the conversion of providerSpecs to InfraTemplates
  • Create an implementation of the interface that handles conversion for AWS providerSpecs
  • Implement testing for the conversion from MAPI to CAPI and vice versa, exercising fields that are known to convert well, as well as including error cases.
  • Commit the code into a package within the cluster-capi-operator repository, ready for use by the conversion controller core code

Background

To be able to continue to operate MachineSets, we need a backwards conversion once the migration has occurred. We do not expect users to remove the MAPI MachineSets immediately, and the logic will be required for when we remove the MAPI controllers.

This covers the case where the CAPI MachineSet is authoritative or only a CAPI MachineSet exists.

Behaviours

  • If the MachineSet only exists in CAPI, do nothing
  • If the MachineSet is mirrored in MAPI
    • Convert the InfraTemplate to a providerSpec, and update the MAPI resource
    • Mirror labels from the CAPI MachineSet to the MAPI MachineSet
    • Ensure spec and status fields (replicas, taints etc) are mirrored between the MachineSets
  • On Failure
    • Set the Synchronized condition to False and apply an appropriate message
  • On Success
    • Set the Synchronized condition to True and update the synchronizedGeneration

Steps

  •  

Stakeholders

  • <Who is interested in this/where did they request this>

Definition of Done

  • <Add items that need to be completed for this card>
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Epic Goal

This is the epic tracking the work to collect a list of TLS artifacts (certificates, keys and CA bundles).

This list will contain a set of required and optional metadata. Required metadata examples are ownership (name of the Jira component) and the ability to auto-regenerate a certificate after it has expired while offline. In most cases metadata can be set via annotations on the secret/configmap containing the TLS artifact.

Components not meeting the required metadata will fail CI - i.e. when a pull request makes a component create a new secret, the secret is expected to have all necessary metadata present to pass CI.

This will be enforced by the (currently WIP) PR "API-1789: make TLS registry tests required".

In order to keep track of existing certs/CA bundles and ensure that they adhere to requirements we need to have a TLS artifact registry setup.

The registry would:

  • have a test which automatically collects existing certs/CA bundles from secrets/configmaps/files on disk
  • have a test which collects the necessary metadata from them (from cert contents or annotations)
  • ensure that new certs match the expected metadata and carry the necessary annotations when they are added

Ref: API-1622

Feature Overview (aka. Goal Summary)  

To improve automation, governance and security, AWS customers extensively use AWS Tags to track resources. Customers wants the ability to change user tags on day 2 without having to recreate a new cluster to have 1 or more tags added/modified.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Complete during New status.

  • Cluster administrator can add one or more tags to an existing cluster. 
  • Cluster administrator can remove one or more tags from an existing cluster.
  • Cluster administrator can add one or more tags just to machine-pool / node-pool in the ROSA with HCP cluster.
  • All ROSA client interfaces (ROSA CLI, API, UI) can utilise the day2 tagging feature on ROSA with HCP clusters
  • All OSD client interfaces (API, UI, CLI) can utilize the day2 tagging feature on ROSA with HCP clusters
  • This feature does not affect the Red Hat owned day1 tags built into OCP/ROSA (there are 10 reserved spaces for tags, of the 50 available, leaving 40 spaces for customer provided tags)

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

  • Following capabilities are available for AWS on standalone and HCP clusters.
  • OCP automatically tags the cloud resources with the Cluster's External ID. 
  • Tags added by default on Day 1 are not affected.
  • All existing active AWS resources in the OCP clusters have the tagging changes propagated.
  • All new AWS resources created by OCP reflect the changes to tagging.
  • Hive to support additional list of key=value strings on MachinePools
    • These are AWS user-defined / custom tags, not to be confused with node labels
    • ROSA CLI can accept a list of key=value strings with additional tag values
      • it currently can do this during cluster-install
    • The default tag(s) is/are still applied
    • NOTE: AWS limit of 50 tags per object (2 used automatically by OCP, with a third to be added soon; 10 reserved for Red Hat overall, as at least 2-3 are used by Managed Services) - customers can specify at most 40 tags
    • Must be able to modify tags after creation 
  • Support for OpenShift 4.15 onwards.

Out-of-scope

This feature will only apply to ROSA with Hosted Control Planes, and ROSA Classic / standalone is excluded.

Why is this important?

  • Customers want to use custom tagging for
    • access controls
    • chargeback/showback
    • cloud IAM conditional permissions

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

Allow the user to create the Agent ISO image as a minimal ISO (sans rootfs).

This is supported for the external platform, added for OCI in 4.14. This adds support for the rest of the platforms supported by the agent-based installer.

Requirements

All platforms supported by agent can install using a minimal ISO:

  • Bare metal
  • none
  • vSphere
  • Nutanix
  • External

Use Cases

  1. User in a connected environment generates a minimal ISO; rootfs is automatically downloaded from mirror.openshift.com.
  2. User in a disconnected environment generates a minimal ISO and rootfs, then uploads the rootfs to the bootArtifactsBaseURL they specified in agent-config.yaml.
  3. By default users continue to generate a fully self-contained ISO (except on the external platform, where minimal is required).

Epic Goal

  • Allow the user to create the Agent ISO image as a minimal ISO (sans rootfs). We already support/require this for the external platform; we should make it possible on any platform.

Why is this important?

  • Some BMCs do not support images as large as the 1GB CoreOS ISO when using virtualmedia. By generating a minimal ISO, we unlock use of the agent installer on these servers

Scenarios

  1. User in a connected environment generates a minimal ISO; rootfs is automatically downloaded from mirror.openshift.com.
  2. User in a disconnected environment generates a minimal ISO and rootfs, then uploads the rootfs to the bootArtifactsBaseURL they specified in agent-config.yaml.
  3. By default users continue to generate a fully self-contained ISO (except on the external platform).

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. AGENT-656 implemented minimal ISO support in the agent-based installer, including support for minimal ISOs in disconnected environments, for the external platform only. There is no UI to select this, as the external platform is assumed to always require a minimal ISO.

Open questions::

  1. How the user should request a minimal ISO. Perhaps openshift-install agent create minimal-image?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Currently the agent-based installer creates a full ISO for all platforms, except for OCI (External) platforms, where a minimal ISO is created by default. Work is being done to support the minimal ISO for all platforms. In this case either a new command must be used to create the minimal ISO instead of the full ISO, or a flag must be added to the "agent create image" command.

UPDATE 9/30: Based on feedback from Zane (https://github.com/openshift/installer/pull/9056#discussion_r1777838533), the plan has changed to use a new field in agent-config.yaml to define that a minimal ISO should be generated, instead of either a new command or a flag to the existing command.

Currently minimal ISO support is only provided for the External platform (see https://issues.redhat.com//browse/AGENT-702). As part of the attached Epic, all platforms will now support minimal ISO. The checks that limit minimal ISO to External platform only should be removed.

With the addition of a new field in agent-config.yaml to create a minimal ISO that can be used on all platforms, an integration test should be added to test this support.

The integration test can check that the ISO created is below the size expected for a full ISO, and also that any ignition files are properly set up for minimal ISO support.
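
A minimal sketch of the size half of that check; the output path and the threshold below are illustrative assumptions, not the values the real integration test will use:

package integration

import (
    "os"
    "testing"
)

// A full agent ISO is on the order of 1 GiB; a minimal ISO (without the
// rootfs) should be far smaller. The threshold here is illustrative.
const maxMinimalISOSize = 300 * 1024 * 1024 // 300 MiB

func TestMinimalISOSize(t *testing.T) {
    isoPath := "output/agent.x86_64.iso" // assumed output location

    info, err := os.Stat(isoPath)
    if err != nil {
        t.Fatalf("stat %s: %v", isoPath, err)
    }
    if info.Size() > maxMinimalISOSize {
        t.Errorf("ISO is %d bytes; expected a minimal ISO below %d bytes", info.Size(), maxMinimalISOSize)
    }
}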

Currently the internal documentation describes creating a minimal ISO only for the External platform. With the change to support the minimal ISO on all platforms, the documentation should be updated.

Feature Overview (aka. Goal Summary)

Migrate every occurrence of iptables in OpenShift to use nftables, instead.

Goals (aka. expected user outcomes)

Implement a full migration from iptables to nftables within a series of "normal" upgrades of OpenShift with the goal of not causing any more network disruption than would normally be required for an OpenShift upgrade. (Different components may migrate from iptables to nftables in different releases; no coordination is needed between unrelated components.)

Requirements (aka. Acceptance Criteria):

  • Discover what components are using iptables (directly or indirectly, e.g. via ipfailover) and reduce the “unknown unknowns”.
  • Port components away from iptables.

Use Cases (Optional):

Questions to Answer (Optional):

  • Do we need a better “warning: you are using iptables” warning for customers? (eg, per-container rather than per-node, which always fires because OCP itself is using iptables). This could help provide improved visibility of the issue to other components that aren't sure if they need to take action and migrate to nftables, as well.

Out of Scope

  • Non-OVN primary CNI plug-in solutions

Background

Customer Considerations

  • What happens to clusters that don't migrate all iptables use to nftables?
    • In RHEL 9.x it will generate a single log message during node startup on every OpenShift node. There are Insights rules that will trigger on all OpenShift nodes.
    • In RHEL 10 iptables will just no longer work at all. Neither the command-line tools nor the kernel modules will be present.

Documentation Considerations

Interoperability Considerations

iptables is going away in RHEL 10; we need to replace all remaining usage of iptables in OCP with nftables before then.

The gcp-routes and azure-routes scripts in MCO use iptables rules and need to be ported to use nftables.

Goal Summary

This feature aims to make sure that the HyperShift operator and the control plane it deploys use Managed Service Identities (MSI) and have access to scoped credentials (potentially also via access to AKS's image gallery). Additionally, operators deployed in the customer's account (system components) would be scoped with Azure workload identities.

Epic Goal

The Cluster Storage Operator can authenticate with a Service Principal backed by a certificate stored in an Azure Key Vault. The Secrets CSI driver will be used to mount the certificate as a volume on the component's deployment in a hosted control plane.

Why is this important?

  • This is needed to enable authentication with Service Principal with backing certificates for ARO HCP.

Acceptance Criteria

  • Cluster Storage Operator is able to authenticate with Azure in ARO HCP using Service Principal with a backing certificate.
  • Updated documentation
  • ARO HCP CI coverage

Dependencies (internal and external)

Azure SDK

Previous Work (Optional):

STOR-1697

Open questions:

Which degree of coverage should run on AKS e2e vs on existing e2es

Done Checklist

CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

General

The Cluster Storage Operator needs to pass the Secret Provider Class to the azure-disk and azure-file CSI controllers so they can authenticate with a client certificate.
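
As a rough illustration of what passing the Secret Provider Class can look like on the controller deployment, the certificate is typically surfaced through a Secrets Store CSI inline volume (a sketch using the upstream core types; the volume name, mount path, and helper are assumptions):

package config

import (
    corev1 "k8s.io/api/core/v1"
)

// secretProviderClassVolume returns a volume that mounts the certificate
// referenced by the given SecretProviderClass via the Secrets Store CSI
// driver, plus the matching mount for the CSI controller container.
func secretProviderClassVolume(secretProviderClass string) (corev1.Volume, corev1.VolumeMount) {
    readOnly := true
    volume := corev1.Volume{
        Name: "azure-client-cert", // illustrative name
        VolumeSource: corev1.VolumeSource{
            CSI: &corev1.CSIVolumeSource{
                Driver:   "secrets-store.csi.k8s.io",
                ReadOnly: &readOnly,
                VolumeAttributes: map[string]string{
                    "secretProviderClass": secretProviderClass,
                },
            },
        },
    }
    mount := corev1.VolumeMount{
        Name:      "azure-client-cert",
        MountPath: "/mnt/certs", // illustrative path
        ReadOnly:  true,
    }
    return volume, mount
}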

Why is this important?

  • This is also needed to enable authentication with Service Principal with backing certificates for ARO HCP.

Acceptance Criteria

  • Refactored code passes the Secret Provider Class to the azure-disk and azure-file CSI controllers so they can authenticate with a client certificate
  • Updated documentation
  • ARO HCP CI coverage

Dependencies (internal and external)

Azure SDK

Done Checklist

CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

The Cluster Network Operator can authenticate with a Service Principal backed by a certificate stored in an Azure Key Vault. The Secrets CSI driver will be used to mount the certificate as a volume on the component's deployment in a hosted control plane.

Why is this important?

  • This is needed to enable authentication with Service Principal with backing certificates for ARO HCP.

Acceptance Criteria

  • Cluster Network Operator is able to authenticate with Azure in ARO HCP using Service Principal with a backing certificate.
  • Updated documentation
  • ARO HCP CI coverage

Dependencies (internal and external)

Azure SDK

Previous Work (Optional):

SDN-4450

Open questions:

Which degree of coverage should run on AKS e2e vs on existing e2es

Done Checklist

CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

General

The Cloud Ingress Operator would get Azure credentials using Azure SDK's generic NewDefaultAzureCredential function

Why is this important?

  • Different OpenShift components implement different patterns of setting up environment variables to get Azure credentials for different Azure authentication methods.
  • Refactoring the pattern to use `NewDefaultAzureCredential` will enable OpenShift components to have the same pattern in setting up Azure credentials
  • This is also needed to enable authentication with Service Principal with backing certificates for ARO HCP.
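
For reference, the pattern boils down to something like the sketch below; which credential DefaultAzureCredential actually uses (client secret, certificate, federated/workload identity, or managed identity) is determined by the environment variables that are set. The token scope shown is only an example.

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/Azure/azure-sdk-for-go/sdk/azcore/policy"
    "github.com/Azure/azure-sdk-for-go/sdk/azidentity"
)

func main() {
    // DefaultAzureCredential picks the authentication method from the
    // environment (AZURE_CLIENT_ID, AZURE_CLIENT_SECRET,
    // AZURE_CLIENT_CERTIFICATE_PATH, AZURE_FEDERATED_TOKEN_FILE, ...),
    // so the component no longer needs per-method wiring.
    cred, err := azidentity.NewDefaultAzureCredential(nil)
    if err != nil {
        log.Fatalf("creating credential: %v", err)
    }

    token, err := cred.GetToken(context.Background(), policy.TokenRequestOptions{
        Scopes: []string{"https://management.azure.com/.default"},
    })
    if err != nil {
        log.Fatalf("getting token: %v", err)
    }
    fmt.Println("token expires at:", token.ExpiresOn)
}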

Acceptance Criteria

  • Refactored code that uses `NewDefaultAzureCredential`
  • Updated documentation
  • ARO HCP CI coverage

Dependencies (internal and external)

Azure SDK

Done Checklist

CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal*

Support Managed Service Identity (MSI) authentication in Azure.

 
Why is this important? (mandatory)

This is a requirement to run storage controllers that require cloud access on Azure with hosted control plane topology.

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be "Release Pending" 

Description of problem:

    We discovered that the azure-disk and azure-file CSI controllers are reusing the CCM managed identity. Each of these three components (azure-disk CSI, azure-file CSI, and CCM) should have its own managed identity and not reuse another component's managed identity.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Every time

Steps to Reproduce:

    1. Create an AKS mgmt cluster
    2. Create a HCP with MI
    3. Observe azure-disk and azure-file controllers are reusing azure CCM MI
    

Actual results:

    the azure-disk and azure-file-csi-controllers are reusing CCM managed identity

Expected results:

    the azure-disk and azure-file-csi-controllers should each have their own managed identity

Additional info:

    

Epic Goal

The Cluster Ingress Operator can authenticate with a Service Principal backed by a certificate stored in an Azure Key Vault. The Secrets CSI driver will be used to mount the certificate as a volume on the component's deployment in a hosted control plane.

Why is this important?

  • This is needed to enable authentication with Service Principal with backing certificates for ARO HCP.

Acceptance Criteria

  • Cluster Ingress Operator is able to authenticate with Azure in ARO HCP using Service Principal with a backing certificate.
  • Updated documentation
  • ARO HCP CI coverage

Dependencies (internal and external)

Azure SDK

Previous Work (Optional):

NE-1504

Open questions:

Which degree of coverage should run on AKS e2e vs on existing e2es

Done Checklist

CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

General

The Cloud Ingress Operator would get Azure credentials using Azure SDK's generic NewDefaultAzureCredential function

Why is this important?

  • Different OpenShift components implement different patterns of setting up environment variables to get Azure credentials for different Azure authentication methods.
  • Refactoring the pattern to use `NewDefaultAzureCredential` will enable OpenShift components to have the same pattern in setting up Azure credentials
  • This is also needed to enable authentication with Service Principal with backing certificates for ARO HCP.

Acceptance Criteria

  • Refactored code that uses `NewDefaultAzureCredential`
  • Updated documentation
  • ARO HCP CI coverage

Dependencies (internal and external)

Azure SDK

Done Checklist

CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

The image registry can authenticate with Service Principal backed by a certificate stored in an Azure Key Vault. The Secrets CSI driver will be used to mount the certificate as a volume on the image registry deployment in a hosted control plane.

Why is this important?

  • This is needed to enable authentication with Service Principal with backing certificates for ARO HCP.

Acceptance Criteria

  • Image registry is able to authenticate with Azure in ARO HCP using Service Principal with a backing certificate.
  • Updated documentation
  • ARO HCP CI coverage

Dependencies (internal and external)

Azure SDK

Previous Work (Optional):

IR-460

Open questions:

Which degree of coverage should run on AKS e2e vs on existing e2es

Done Checklist

CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

General

The image registry would get Azure credentials using Azure SDK's generic NewDefaultAzureCredential function

Why is this important?

  • Different OpenShift components implement different patterns of setting up environment variables to get Azure credentials for different Azure authentication methods.
  • Refactoring the pattern to use `NewDefaultAzureCredential` will enable OpenShift components to have the same pattern in setting up Azure credentials
  • This is also needed to enable authentication with Service Principal with backing certificates for ARO HCP.

Acceptance Criteria

  • Refactored code that uses `NewDefaultAzureCredential`
  • Updated documentation
  • ARO HCP CI coverage

Dependencies (internal and external)

Azure SDK

Done Checklist

CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • Support Managed Service Identity (MSI) authentication in Azure.

Why is this important?

  • MSI authentication is required for any component that will run on the control plane side in ARO hosted control planes.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Problem

Today Azure installation requires a manually created service principal, which involves relations, permission granting, credential setting, credential storage, credential rotation, credential cleanup, and service principal deletion. This is not only mundane and time-consuming but also less secure, and it risks access to resources by adversaries due to the lack of credential rotation.

Goal

Employ Azure managed credentials which drastically reduce the steps required to just managed identity creation, permission granting, and resource deletion. 

Ideally, this should be a HyperShift-native functionality. I.e., HyperShift should use managed identities for the control plane, the kubelet, and any add-on that needs access to Azure resources.  

Operators running on the management side that need to access the Azure customer account will use MSI.
Operands running in the guest cluster should rely on workload identity.
This ticket is to solve the latter.

 

We need to implement workload identity support in our components that run on the spoke cluster.

 

Address any TODOs in the code related to this ticket.

https://redhat-external.slack.com/archives/C075PHEFZKQ/p1727710473581569
https://docs.google.com/document/d/1xFJSXi71bl-fpAJBr2MM1iFdUqeQnlcneAjlH8ogQxQ/edit#heading=h.8e4x3inip35u

If we decide to drop the msi init and adapter and expose the certs in the management cluster directly via an Azure Key Vault Secret Store CSI Driver pod volume, this would remove complexity and avoid the need for highly permissive pods with network access.

Action items:

  • Each OpenShift component should authenticate via azidentity.NewDefaultAzureCredential
    and let it choose based on the exposed env variables. To account for cert rotation we might either leverage in-process reloading, or use the common OpenShift fsnotify + os.Exit() pattern.
    All of the above could be implemented in a "library"-like fashion that components ideally use as their only auth path. Otherwise they can, for now, keep their switch case based on the use case for a gradual transition, e.g. https://github.com/openshift/cluster-ingress-operator/compare/master...enxebre:cluster-ingress-operator:dev?expand=1

func azureCreds(options *azidentity.DefaultAzureCredentialOptions) (*azidentity.DefaultAzureCredential, error) {
    if certPath := os.Getenv("AZURE_CLIENT_CERTIFICATE_PATH"); certPath != "" {
        // Set up a watch on our config file; if it changes, we should exit -
        // (we don't have the ability to dynamically reload config changes).
        if err := watchForChanges(certPath, stopCh); err != nil {
            return nil, err
        }
    }

    return azidentity.NewDefaultAzureCredential(options)
}

  • For mocking besides production and getting CI passing:
    • Management cluster
      • Add the `--enable-addons azure-keyvault-secrets-provider` flag to the AZ CLI command that creates the AKS management cluster. This enables/installs the CSI secrets driver on the worker nodes.
      • Create an Azure Key Vault on the AKS cluster where the SP certs will be stored.
      • Create a managed identity that serve as the "user/thing" that reads the certs out of the Azure Key Vault for the CSI secret driver.
    • HyperShift Side
      • HyperShift CLI should provision a service principal to represent the MSI per HCP component and assign the right roles/permissions to them.
        • The Secret Store CSI driver/kube component should treat the cert as an MSI backing certificate. Regardless of whether you're using NewDefaultAzureCredential or NewClientCertificateCredential, it should work
      • The backing certificate for the SP should be stored in azure key vault in the same resource group as the AKS management cluster.
      • The HyperShift API will store:
        • The client ID and cert name of each SP for each HCP component
        • The Azure Key Vault name
        • The Azure Key Vault tenant ID
        • The client ID of the managed identity created to read the certs out of the Azure Key Vault
      • When the HCP deploys an OpenShift component that needs to authenticate with Azure:
        • HCP will supply the client ID for each SP 
        • HCP will add the volume mount for the CSI driver with the SP's cert name and the client ID of the managed identity that will read the secret from the Azure Key Vault

Engineering Notes:

Proof of Concept with Ingress as the example OpenShift component - https://github.com/openshift/hypershift/pull/4841/commits/35ac5fd3310b9199309e9e8a47ee661771ec71cf 

 

AZ CLI command to create the key vault

# Create Management Azure Key Vault
az keyvault create \
--name ${PREFIX} \
--resource-group ${AKS_RG} \
--location ${LOCATION} \
--enable-rbac-authorization 

 

AZ CLI command to create the managed identity for the key vault

## Create the managed identity for the Management Azure Key Vault
az identity create --name "${AZURE_KEY_VAULT_AUTHORIZED_USER}" --resource-group "${AKS_RG}"
AZURE_KEY_VAULT_AUTHORIZED_USER_ID=$(az identity show --name "${AZURE_KEY_VAULT_AUTHORIZED_USER}" --resource-group "${AKS_RG}" --query principalId --output tsv)
az role assignment create \
--assignee-object-id "${AZURE_KEY_VAULT_AUTHORIZED_USER_ID}" \
--role "Key Vault Secrets User" \
--scope /subscriptions/${SUBSCRIPTION_ID}/resourceGroups/"${AKS_RG}" \
--assignee-principal-type ServicePrincipal 

 

AZ CLI command that creates a Service Principal with a backing cert stored in the Azure Key Vault

az ad sp create-for-rbac --name ingress --role "Contributor" --scopes /subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${MANAGED_RG_NAME} --create-cert --cert ${CERTIFICATE_NAME} --keyvault ${KEY_VAULT_NAME} 

 

Feature Overview (Goal Summary)

This feature focuses on the optimization of resource allocation and image management within NodePools. This will include enabling users to specify resource groups at NodePool creation, integrating external DNS support, ensuring Cluster API (CAPI) and other images are sourced from the payload, and utilizing Image Galleries for Azure VM creation.

Goal

  • Support configuring azure diagnostics for boot diagnostics on nodepools

Why is this important?

  • When a node fails to join the cluster, serial console logs are useful in troubleshooting, especially for managed services. 

Scenarios

  1. Customer scales / creates nodepool
    1. nodes created
    2. one or more nodes fail to join the cluster
    3. cannot ssh to nodes because ssh daemon did not come online
    4. Can use diagnostics + a managed storage account to fetch serial console logs to troubleshoot

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. CAPZ already supports this, so the dependency is on the HyperShift team implementing it: https://github.com/openshift/cluster-api-provider-azure/blob/master/api/v1beta1/azuremachine_types.go#L117
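
For reference, the CAPZ API linked above models this roughly as shown in the sketch below (assuming CAPZ's v1beta1 Diagnostics types); a NodePool-level field would need to map onto this:

package nodepool

import (
    infrav1 "sigs.k8s.io/cluster-api-provider-azure/api/v1beta1"
)

// bootDiagnostics returns a CAPZ Diagnostics block enabling platform-managed
// boot diagnostics, so serial console logs can be fetched for nodes that
// never join the cluster.
func bootDiagnostics() *infrav1.Diagnostics {
    return &infrav1.Diagnostics{
        Boot: &infrav1.BootDiagnostics{
            StorageAccountType: infrav1.ManagedDiagnosticsStorage,
        },
    }
}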

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Goal

Before GAing Azure let's make sure we do a final API review

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Goal:

As an administrator, I would like to use my own managed DNS solution instead of only specific openshift-install supported DNS services (such as AWS Route53, Google Cloud DNS, etc...) for my OpenShift deployment.

 

Problem:

While cloud-based DNS services provide convenient hostname management, there are a number of regulatory (ITAR) and operational constraints customers face that prohibit the use of those DNS hosting services on public cloud providers.

 

Why is this important:

  • Provides customers with the flexibility to leverage their own custom managed ingress DNS solutions already in use within their organizations.
  • Required for regions like AWS GovCloud in which many customers may not be able to use the Route53 service (available only to commercial customers) for either internal or ingress DNS.
  • OpenShift managed internal DNS solution ensures cluster operation and nothing breaks during updates.

 

Dependencies (internal and external):

 

Prioritized epics + deliverables (in scope / not in scope):

  • Ability to bootstrap cluster without an OpenShift managed internal DNS service running yet
  • Scalable, cluster (internal) DNS solution that’s not dependent on the operation of the control plane (in case it goes down)
  • Ability to automatically propagate DNS record updates to all nodes running the DNS service within the cluster
  • Option for connecting the cluster to the customer's ingress DNS solution already in place within their organization

 

Estimate (XS, S, M, L, XL, XXL):

 

Previous Work:

 

Open questions:

 

Link to Epic: https://docs.google.com/document/d/1OBrfC4x81PHhpPrC5SEjixzg4eBnnxCZDr-5h3yF2QI/edit?usp=sharing

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • For a customer-provided DNS to be utilized with AWS platforms, the Installer should be updated to not configure the default cloud-provided DNS (Route 53). The Installer is also responsible for updating the Infrastructure config resource and the bootstrap ignition with the LB IPs for API and API-Int. These would be used to stand up in-cluster CoreDNS pods on the bootstrap and control plane nodes.
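
A hedged sketch of what opting out of cloud-provided DNS could look like in install-config.yaml; the field name mirrors the existing GCP userProvisionedDNS option, and its availability and exact placement for AWS are assumptions of this epic rather than settled API:

platform:
  aws:
    region: us-gov-west-1
    userProvisionedDNS: Enabled   # assumed field, modeled on the GCP option; tells the installer to skip Route 53 records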

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

On the bootstrap node, keep the NetworkManager-generated resolv.conf updated with the nameserver pointing to localhost.

User Story:

As a (user persona), I want to be able to:

  • Capability 1
  • Capability 2
  • Capability 3

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a (user persona), I want to be able to:

  • Add the User Provisioned DNS data to the installer
  • Vendor the API changes

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

User Story:

As a (user persona), I want to be able to:

  • clusterHostedDNS enabled/disabled values should be added
  • similar to GCP, cloudLoadBalancerConfig should be added to the AWS config so that the infra CR can be populated with the correct information for LB addresses (see the sketch after this story)

so that I can achieve

  • Outcome 1
  • Outcome 2
  • Outcome 3

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.
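
A hedged sketch of the Infrastructure config resource this story would populate. The cloudLoadBalancerConfig shape is modeled on the existing GCP platform status API; placing the same structure under the AWS platform status is precisely the assumption being proposed here, and the IPs are placeholders:

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  platformStatus:
    type: AWS
    aws:
      cloudLoadBalancerConfig:
        dnsType: ClusterHosted            # in-cluster CoreDNS serves the api, api-int and ingress records
        clusterHosted:
          apiLoadBalancerIPs:
          - 203.0.113.10
          apiIntLoadBalancerIPs:
          - 10.0.0.10
          ingressLoadBalancerIPs:
          - 203.0.113.20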

Feature Overview

  • This feature is to track automation in ODC, related packages, upgrades and some tech debts

Goals

  • Improve automation for Pipelines dynamic plugins
  • Improve automation for OpenShift Developer console
  • Engineering tech debt for ODC

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver in order to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP requirement gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.
Requirement | Notes | isMvp?
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES
Release Technical Enablement | Provide necessary release enablement details and documents. | YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer...

  • ...

Out of Scope

  • ...

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Problem:

This epic covers the scope of automation-related stories in ODC

Goal:

Why is it important?

Automation enhancements for ODC

Use cases:

  1. <case>

Acceptance criteria:

  1. Automation enhancements to improve test execution time and efficiency
  2. Improving test execution to get more tests run on CI

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description of problem:

If the Knative operator is installed without the creation of any of its instances, the tests will fail
    

Version-Release number of selected component (if applicable):


    

How reproducible:

Every time
    

Steps to Reproduce:

    1. Install the Knative operator without creating any one, or all three, of its instances
    2. Run knative e2e tests
    3.
    

Actual results:

Tests will fail saying: Error from server particular instance not found
    

Expected results:

A mechanism should be present to create the missing instance
    

Additional info:


    

Description of problem:

Enabling the topology tests in CI
    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Description

After the addition of the CLI method of operator installation, the test doesn't necessarily require admin privileges. Currently, the test adds the overhead of creating an admin session and page navigations, which are not required.

Acceptance Criteria

  1. Tests utilising the CLI method for operator installation should run with limited privileges.

Additional Details:

Description

The current Shipwright e2e tests running on CI are not enough; additional tests are required.

Acceptance Criteria

  1. <criteria>

Additional Details:

Description

The main goal is to incorporate Operator CLI installation method with Operator Availability checks to enable quick issue identification.

Acceptance Criteria

  1. <criteria>

Additional Details:

Description of problem:

KN-05-TC05, KN-02-TC12, and SF-01-TC06 are flaking on CI due to variable resource creation times and some other unknown factors that need to be identified.

Prerequisites (if any, like setup, operators/versions):

Steps to Reproduce

  1. <steps>

Actual results:

Expected results:

Reproducibility (Always/Intermittent/Only Once):

Build Details:

Workaround:

Additional info:

Feature Overview

Improve onboarding experience for using Shipwright Builds in OpenShift Console

Goals

Enable users to create and use Shipwright Builds in OpenShift Console while requiring minimal expertise about Shipwright 

Requirements

 

Requirement | Notes | Is MVP?
Enable creating Shipwright Builds using a form | | Yes
Allow use of Shipwright Builds for image builds during import flows | | Yes
Enable access to build strategies through navigation | | Yes

Use Cases

  • Use Shipwright Builds for image builds in import flows
  • Enable form-based creation of Shipwright Builds without YAML expertise
  • Provide access to Shipwright resources through navigation

Out of scope

TBD

Dependencies

TBD

Background, and strategic fit

Shipwright Builds UX in Console should provide a simple onboarding path for users in order to transition them from BuildConfigs to Shipwright Builds.

Assumptions

TBD

Customer Considerations

TBD

Documentation/QE Considerations

TBD

Impact

TBD

Related Architecture/Technical Documents

TBD

Definition of Ready

  • The objectives of the feature are clearly defined and aligned with the business strategy.
  • All feature requirements have been clearly defined by Product Owners.
  • The feature has been broken down into epics.
  • The feature has been stack ranked.
  • Definition of the business outcome is in the Outcome Jira (which must have a parent Jira).

 
 

Problem:

Creating Shipwright Builds through YAML is complex and requires Shipwright expertise, which makes it difficult for a novice user to use Shipwright

Goal:

Provide a form for creating Shipwright Builds

Why is it important?

To simplify adoption of Shipwright and ease onboarding

Use cases:

Create build

Acceptance criteria:

  • User can create Shipwright Builds through a form (instead of YAML editor)
  • The Shipwright Build form asks the user for the following input
    • User can provide Git repository url
    • User can choose to see the advanced options for Git url and provide additional details
      • Branch/tag/ref
      • Context dir
      • Source secret
    • User is able to create a source secret without navigating away from the form
    • User can select a build strategy from strategies that are available in the cluster (cluster-wide or in the namespace)
    • User can provide param values related to the selected build strategy
    • User can provide environment variables (text, from configmap, from secret)
    • User can provide output image url to an image registry and push secret
    • User is able to create a push secret without navigating away from the form
    • User can add volumes to the build
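
For illustration, a Build resembling what such a form could generate (API version, strategy name, repository URL, secrets, and registry below are placeholders or assumptions, not a prescription of the console's output):

apiVersion: shipwright.io/v1beta1
kind: Build
metadata:
  name: nodejs-app-build
spec:
  source:
    type: Git
    git:
      url: https://github.com/example/nodejs-app   # Git repository URL from the form
    contextDir: .
  strategy:
    name: buildah                                  # selected from strategies available in the cluster
    kind: ClusterBuildStrategy
  paramValues:
  - name: dockerfile                               # strategy-specific parameter value
    value: Dockerfile
  output:
    image: image-registry.openshift-image-registry.svc:5000/demo/nodejs-app:latest
    pushSecret: demo-push-secret                   # push secret created from the form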

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

As a user, I want to create a Shipwright Build using the form.

Acceptance Criteria

  1. The create page should have a Form/YAML switcher
  2. Users should provide Shipwright build name
  3. Users should provide Git repository url
  4. Users should choose to see the advanced options for Git url and provide additional details
    • Branch/tag/ref
    • Context dir
    • Source secret
  5. Users should create a source secret without navigating away from the form
  6. Users should select a build strategy from strategies that are available in the cluster (cluster-wide or in the namespace)
  7. Users should provide param values related to the selected build strategy
  8. Users should provide environment variables
  9. Users should provide output image URL to an image registry and push secret
  10. Users should create a push secret without navigating away from the form
  11. Users should add volumes to the build
  12. Add e2e tests

Additional Details:

Event discovery allows for dynamic and interactive user experiences and event catalogs provide users with a structured way to discover available events within the system. Users can explore different event types, their descriptions, and associated metadata, making it easier to understand the capabilities and functionalities offered by the system.

 

By providing visibility into the available events and their characteristics, event catalogs help users understand how the system behaves and what events they can expect to occur as well as streamline the process of subscribing to and consuming events within the system.

Problem:

Goal:

Why is it important?

Event catalogs provide users with a structured way to discover available events within the system. Users can explore different event types, their descriptions, and associated metadata, making it easier to understand the capabilities and functionalities offered by the system.

Use cases:

  1. <case>

Acceptance criteria:

  1. Create a catalog for Events
  2. Add an add card on the Add page that takes the user to the Events catalog page
  3. Show Event details, listing attributes and values, on the side panel of the Event along with a Subscribe button
  4. Provide a Subscribe form for the Event which creates a Trigger and redirects to the Trigger details page

Dependencies (External/Internal):

Design Artifacts:

Exploration:

EventType doc: https://knative.dev/docs/eventing/features/eventtype-auto-creation/#produce-events-to-the-broker

 

Note:

Description

As a user, I want to see the catalogs for the Knative Events.

Acceptance Criteria

  1. Should create a Knative Events catalog
  2. Should add Events add card on the Add page to access the Events catalog
  3. Should show attributes and values along with descriptions on the side panel of the Knative Event
  4. Should add the Subscribe button on the side panel which redirects the user to the Subscribe form.

Additional Details:

Description

As a user, I want to subscribe to the Knative service using a form

Acceptance Criteria

  1. Form should have a Name field
  2. Form should have a subscriber dropdown that lists Knative services present in the namespace
  3. Form should have an attribute and value field to add filters.
  4. Form should create a Trigger resource
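
For reference, the kind of Trigger such a form would create (names, broker, and filter attributes below are placeholders):

apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: order-created-trigger           # Name field from the form
spec:
  broker: default
  filter:
    attributes:
      type: com.example.order.created   # attribute/value filter from the form
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: order-processor             # Knative Service chosen from the subscriber dropdown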

Additional Details:

Feature Overview

Placeholder for small Epics not justifying a standalone Feature, in the context of technical debt and the ability to operate and troubleshoot. This Feature is not needed except during planning phases when we plan Features, until we enter the Epic planning phase.

NO MORE ADDITION OF ANY EPIC post 4.18 planning - Meaning NOW. One Feature per Epic from now on!

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Support `neighbor X remote-as [internal|external|auto]`
  • internal: iBGP peering
  • external: eBGP peering
  • auto: the peering can be iBGP or eBGP; it is automatically detected and adjusted from the OPEN message. This value option was recently introduced in FRR master (> 9.1), so it may not be possible to support it downstream until RHEL ships a version of FRR newer than 9.1!

Why is this important?

  • Simplicity: Reduces the need to specify ASNs explicitly, especially useful in large configurations or dynamic environments.
  • Readability: Makes the configuration easier to understand by clearly indicating the type of relationship (internal vs. external).
  • Flexibility: Helps in dynamic configurations where the ASN of peers may not be fixed or is subject to change.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. FRR > 9.1 version for `auto` value.

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Add support in Multi Network Policy for IPVLAN CNI

Why is this important?

Scenarios

  1. Test parity with MACVLAN and SR-IOV VF (CNF-1470, CNF-5528) including IPv6
  2. Multi Network Policies enforced on Pods with IPVLAN with `linkInContainer=false` (default)
  3. Multi Network Policies enforced on Pods with IPVLAN with `linkInContainer=true`
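
A minimal sketch of what an enforced policy looks like for an IPVLAN secondary network, assuming a NetworkAttachmentDefinition named ipvlan-net in the same namespace (names and labels are placeholders):

apiVersion: k8s.cni.cncf.io/v1beta1
kind: MultiNetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: demo
  annotations:
    k8s.v1.cni.cncf.io/policy-for: demo/ipvlan-net   # the IPVLAN net-attach-def this policy applies to
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend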

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. CNF-1470
  2. CNF-5528

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Support for pod-level bonding with SR-IOV interfaces on Multus multi-network policy

Why is this important?

  • Multi-network policies do not support filtering traffic for pod-level bonded SR-IOV interfaces attached to Pods. It works fine on single interfaces (generates the appropriate iptables rules per the MultiNetworkPolicy) such as MACVLAN and SR-IOV VFs.

Scenarios

  1. As a Pod user, I want secondary Pod interfaces to be fault-tolerant (Pod-level bonding with SR-IOV VFs) and with restricted access (multi-network policies) so that only a pre-approved set of source IPs are permitted.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

BU Priority Overview

To date our work within the telecommunications radio access network space has focused primarily on x86-based solutions. Industry trends around sustainability, and more specific discussions with partners and customers, indicate a desire to progress towards ARM-based solutions with a view to production deployments in roughly a 2025 timeframe. This would mean being able to support one or more RAN partners' DU applications on ARM-based servers.

Goals

  • Introduce ARM CPUs for the RAN DU scenario (SNO deployment) with feature parity to Intel Ice Lake/SPR-EE/SPR-SP without QAT for DU, with:
    • STD kernel (the RT kernel is not supported by RHEL)
    • SR-IOV and DPDK over SR-IOV
    • PTP (OC, BC). The partner asked for LLS-C3; according to Fujitsu, ptp4l and phc2sys need to work with the NVIDIA Aerial SDK
  • Characterize ARM-based RAN DU solution performance and power metrics (unless performance parameters are specified by partners, we should propose them; see Open questions)
  • Productize the ARM-based RAN DU solution by 2024 (partner's expectation).

State of the Business

Depending on the source, 75-85% of service provider network power consumption is attributable to the RAN sites, with data centers making up the remainder. This means that in the face of increased downward pressure on both TCO and carbon footprint (the former for company performance reasons, the latter for regulatory reasons) it is an attractive place to make substantial improvements using economies of scale.

There are currently three main obvious thrusts to how to go about this:

  • Introducing tools that improve overall observability w.r.t. power utilization of the network.
  • Improvement of existing RAN architectures via smarter orchestration of workloads, fine tuning hardware utilization on a per site basis in response to network usage, etc.
  • Introducing alternative architectures which have been designed from the ground up with lower power utilization as a goal.

This BU priority focuses on the third of these approaches.

BoM

Out of scope

Open questions:

  • What are the latency KPIs? Do we need a RT-kernel to meet them?
  • What page size is expected?
  • What are the performance/throughput requirements?

Reference Documents:

Planning call notes from Apr 15

Epic Goal

Both the Node Tuning Operator and TuneD assume the Intel x86 architecture is used when a Performance Profile is applied. For example, they both configure Intel x86 specific kernel parameters (e.g. intel_pstate).

In order to support Telco RAN DU deployments on the ARM architecture, we will need a way to apply a performance profile to configure the server for low latency applications. This will include tuning common to both Intel/ARM and tuning specific to one of the architectures.

The purpose of this Epic:

  • Design an NTO/TuneD solution that will support Intel, ARM and AMD specific tunings. Investigate whether the best approach will be to have a common performance profile that can apply to all architectures or separate performance profiles for each architecture.
  • Implement the NTO/TuneD changes to enable multi-architecture support. Depending on the scope of the changes, additional epics may be required.

Why is this important?

  • In order to support Telco RAN DU deployments on the ARM architecture, we need a way to apply a performance profile to configure the server for low latency applications.

Scenarios

  1. SNO configured with Telco 5G RAN DU reference configuration

Acceptance Criteria

  • Design for ARM support in NTO/TuneD has been reviewed and approved by appropriate stakeholders.
  • NTO/TuneD changes implemented to enable multi-architecture support. Not all of the ARM-specific tunings will be known yet, but the framework to support these tunings needs to be there.

Dependencies (internal and external)

  1. Obtaining an ARM server to do the investigation/testing.

Previous Work (Optional):

  1. Some initial prototyping on an HPE ARM server has been done and some initial tuned issues have been documented: https://docs.google.com/presentation/d/1dBQpdVXe3kIjlLjj1orKShktEr1zIqtXoBcIo6oykrs/edit#slide=id.g2ac442e1556_0_69

Open questions::

  1. TBD

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The validator for the Huge Pages sizes in NTO needs to be updated to account for more valid options.

Currently it allows only the values "1G" and "2M", but we want to be able to use "512M" on ARM. We may also want to support other values (https://docs.kernel.org/6.3/arm64/hugetlbpage.html), and we probably also want to validate that the selected size is valid for the architecture being used (see the sketch below).

The validation is performed here: https://github.com/openshift/cluster-node-tuning-operator/blob/release-4.16/pkg/apis/performanceprofile/v2/performanceprofile_validation.go#L56

Original slack thread: https://redhat-internal.slack.com/archives/CQNBUEVM2/p1717011791766049
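
A hedged sketch of the kind of profile the relaxed validator should accept on ARM (CPU ranges, counts, and the profile name are placeholders; 512M is the value the current validator rejects):

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: arm-du-profile
spec:
  cpu:
    reserved: "0-3"
    isolated: "4-63"
  hugepages:
    defaultHugepagesSize: "512M"   # currently rejected; only "1G" and "2M" pass validation today
    pages:
    - size: "512M"
      count: 32
  nodeSelector:
    node-role.kubernetes.io/worker: ""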

Overview

An elevator pitch (value statement) that describes the parts of a Feature, in a clear and concise way, that will be addressed by this Epic

Acceptance Criteria
The list of requirements to be met to consider this Epic feature-complete


Done Criteria

  • All Acceptance Criteria are met
  • All existing/affected SOPs have been updated.
  • New SOPs have been written.
  • Internal training has been developed and delivered.
  • The feature has full, automated test suites passing in all pipelines.
  • If the feature requires QE involvement, QE has signed off.
  • The feature exposes metrics necessary to monitor.
  • The feature has had a security review / Contract impact assessment.
  • Service documentation is fully updated and complete.
  • Product Manager signed off.

References
Links to Gdocs, GitHub, and any other relevant information about this epic.

When AutoRepair is enabled for a NodePool in OCM, the NodePool controller in HyperShift applies a default CAPI MHC, defined at https://github.com/openshift/hypershift/blob/4954df9582cd647243b42f87f4b10d4302e2b270/hypershift-operator/controllers/nodepool/capi.go#L673, which has a NodeStartupTimeout (from creation to joining the cluster) of 20 minutes.

 

Bare metal instances are known to be slower to boot (see OSD-13791), and so in classic we have defined two MHCs for worker nodes.

We should analyse together with the HyperShift team what is the best way forward to cover this use case.

Initial ideas to explore:

  • The NodePool API already exposes some health check parameters. Can we add an annotation to override the default timeout? This will be explored first.
  • Not using the AutoRepair logic at all, and instead deploying the MHC externally based on instance type (ACM policy?). This is a larger effort and an architecture change, which ideally should not be needed or wanted.
     

The behavior was observed in XCMSTRAT-1039, but it is already present with bare metal instances and so can cause a poor UX (the machine is cycled until we are lucky enough to get a faster boot time).
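
A minimal sketch of the first idea above, assuming an annotation-based override; the annotation name below is hypothetical and is exactly what needs to be agreed with the HyperShift team:

apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: baremetal-workers
  namespace: clusters
  annotations:
    # Hypothetical annotation name, for illustration only.
    hypershift.openshift.io/node-startup-timeout: "45m"
spec:
  management:
    autoRepair: true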

Complete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled

STATUS

2024-09-11: Committed to Dev Preview, referencing downstream images using Helm charts from github.com/openshift/cluster-api-operator. This will coincide with the MCE release, but is not included in the MCE 2.7.0 bundle. Bundle inclusion will come with MCE 2.8.0.

2024-09-10: Decide on versioning strategy with OCP Install CAPI and HyperShift

2024-08-12: Having a discussion in #forum-capi-ocp on delivery mechanism

2024-08-08: Community meeting discussion on delivery of ROSA-HCP & Sylva Cluster-api-provider-metal3

2024-08-22: F2F meetings, inception of this EPIC

Epic Goal

Include minimum CAPI components in ACM for the supported ROSA-HCP.

  1. CRDs: RosaCluster, RosaMachine, ... plus any needed shared CAPI CRDs
  2. Controllers
  3. Make sure these are NOT shipped with OCP.
  4. MCH needs to detect if present and block/fail enabling (we should just always apply this in the TP, since it will not be active by default).

 

Why is this important?

MCE 2.7.0 enables ROSA-HCP provisioning support, along with OCP starting to use CAPI

Scenarios

I deploy MCE, and should be able to deploy a ROSA-HCP cluster, with correct credentials and information about the cluster.

Acceptance Criteria

ROSA-HCP Cluster API for AWS support in MCE 2.7.0

Dependencies (internal and external)

  1. OCP 4.17 does not include the CRD or controllers

Previous Work (Optional):

  1. OCP TP for Cluster API for AWS includes the CRD's and operator

Open questions:

  1. Confirmation from OCP (Joel Speed) that we agree on both CRDs and controllers
  2. Just CRDs?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement
    • Konflux
    • CPAAS
  • DEV - Upstream code and tests merged:
  • DEV - Upstream documentation merged:
  • DEV - Downstream code built via konflux or CPAAS
  • QE - Test plans in Polarion:
  • QE - Automated tests merged:
  • DOC - Doc issue opened with a completed template. Separate doc issue
    opened for any deprecation, removal, or any current known
    issue/troubleshooting removal from the doc, if applicable.
  • FIPs Ready
  • Infra-node deployment selector support
  • Must Gather
  • Global Proxy support

Value Statement

 

 

Definition of Done for Engineering Story Owner (Checklist)

  • Be able to deploy CAPI & CAPA in an OCP cluster using the predefined CRs

Development Complete

  • Create the required CRs with the required namespace CRs and secret

Tests Automated

  • [ ] Unit/function tests have been automated and incorporated into the
    build.
  • [ ] 100% automated unit/function test coverage for new or changed APIs.

Secure Design

  • [ ] Security has been assessed and incorporated into your threat model.

Multidisciplinary Teams Readiness

  • [ ] Create an informative documentation issue using the Customer Portal Doc template, which you can access from The Playbook (https://docs.google.com/document/d/1YTqpZRH54Bnn4WJ2nZmjaCoiRtqmrc2w6DdQxe_yLZ8/edit#heading=h.9fvyr2rdriby), and ensure the doc acceptance criteria are met.
  • Call out this sentence as its own action:
  • [ ] Link the development issue to the doc issue.

Support Readiness

  • [ ] The must-gather script has been updated.

Value Statement

The downstream capi-operator has a helm chart defined at [1].

We need to:

  1. The Helm chart must use the downstream images from openshift/cluster-api-operator, not the upstream kubernetes-sigs/cluster-api-operator images.
  2. Have an end-to-end test for deploying the Helm chart on an OCP cluster.
  3. Reference for the chart template in the downstream repo: https://github.com/openshift/cluster-api-operator/tree/main/hack/charts

[1] https://github.com/openshift/cluster-api-operator/blob/main/index.yaml

Definition of Done for Engineering Story Owner (Checklist)

Development Complete

  • The code is complete.
  • Functionality is working.
  • Any required downstream Docker file changes are made.

Tests Automated

  • [ ] Unit/function tests have been automated and incorporated into the
    build.
  • [ ] 100% automated unit/function test coverage for new or changed APIs.

Secure Design

  • [ ] Security has been assessed and incorporated into your threat model.

Multidisciplinary Teams Readiness

  • [ ] Create an informative documentation issue using the Customer Portal Doc template, which you can access from The Playbook (https://docs.google.com/document/d/1YTqpZRH54Bnn4WJ2nZmjaCoiRtqmrc2w6DdQxe_yLZ8/edit#heading=h.9fvyr2rdriby), and ensure the doc acceptance criteria are met.
  • Call out this sentence as its own action:
  • [ ] Link the development issue to the doc issue.

Support Readiness

  • [ ] The must-gather script has been updated.
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

We need to know which of the possible error codes reported in the imageregistry_storage_errors_total metric indicate abnormal operations, so that we can create alerts for the relevant metrics.

Current error codes are:

errCodeUnsupportedMethod = "UNSUPPORTED_METHOD"
errCodePathNotFound      = "PATH_NOT_FOUND"
errCodeInvalidPath       = "INVALID_PATH"
errCodeInvalidOffset     = "INVALID_OFFSET"
errCodeReadOnlyFS        = "READ_ONLY_FILESYSTEM"
errCodeFileTooLarge      = "FILE_TOO_LARGE"
errCodeDeviceOutOfSpace  = "DEVICE_OUT_OF_SPACE"
errCodeUnknown           = "UNKNOWN" 

Source: openshift/image-registry/pkg/dockerregistry/server/metrics/errorcodes.go

Acceptance Criteria

  • Understand and document each of the error codes in errorcodes.go
  • Documentation should include when these errors might happen, and how severe/recoverable they are

ACCEPTANCE CRITERIA

  • When storage (pvc) is mounted as read-only, an alert shows up in OCP monitoring
  • The alert remains on-going until the issue is fixed, regardless of whether new read-only errors pop up (TODO: how to achieve this? is it possible?)
  • The alert correctly clears up once the storage (pvc) becomes writable (TODO: this also needs research)

ACCEPTANCE CRITERIA

  • When the storage (pvc) is mounted as read-only, the registry should export a metric
  • The exported metric should be specific enough that an alert can be created for it
  • The metric should be grouped under imageregistry_storage_errors_total
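
A hedged sketch of an alert built on that metric. The label key carrying the error code ("code") is an assumption and should be checked against the exporter in errorcodes.go, and the thresholds are placeholders. Note that, as the TODOs in the previous acceptance criteria point out, an increase()-based expression clears once new errors stop, so a dedicated read-only gauge may be needed for a persistent alert:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: image-registry-storage-alerts
  namespace: openshift-image-registry
spec:
  groups:
  - name: image-registry-storage
    rules:
    - alert: ImageRegistryStorageReadOnly
      # "code" is an assumed label name for the exported error code.
      expr: increase(imageregistry_storage_errors_total{code="READ_ONLY_FILESYSTEM"}[30m]) > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: The image registry storage appears to be mounted read-only.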

Epic Goal

  • To support multipath disks for installation

Why is this important?

  • Some partners are using SANs for their installation disks

Scenarios

  1. User installs on a multipath Fibre Channel (FC) SAN volume

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

After the fix for https://issues.redhat.com/browse/MGMT-17867, the multipath device includes the wwn hint. However, the underlying path devices include this hint as well.

The current bmh_agent_controller code may choose any of the devices with the wwn hint as the root device hint.

The code has to be fixed so that, when multiple devices match the wwn hint, the multipath device is preferred.

Feature goal (what are we trying to solve here?)

The external platform was created to allow cloud providers to supply their own integration components (cloud controller manager, etc.) without prior integration into OpenShift release artifacts. We need to support this new platform in assisted-installer in order to provide a user-friendly way to enable such clusters, and to enable new-to-OpenShift cloud providers to quickly establish an installation process that is robust and will guide them toward success.

This epic is a follow-up of MGMT-15654 where the external platform API was implemented in Assisted-Installer.

Please describe what conditions must be met in order to mark this feature as "done".

Does it need documentation support?

If the answer is "yes", please make sure to check the corresponding option.

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Name of the customer(s)
    • How many customers asked for it?
    • Can we have a follow-up meeting with the customer(s)?

 

  • A solution architect asked for it

    • Name of the solution architect and contact details
    • How many solution architects asked for it?
    • Can we have a follow-up meeting with the solution architect(s)?

 

  • Internal request

    • Who asked for it?

 

  • Catching up with OpenShift

Reasoning (why it’s important?)

  • Please describe why this feature is important
  • How does this feature help the product?

Competitor analysis reference

  • Do our competitors have this feature?
    • Yes, they have it and we can have some reference
    • No, it's unique or explicit to our product
    • No idea. Need to check

Feature usage (do we have numbers/data?)

  • We have no data - the feature doesn’t exist anywhere
  • Related data - the feature doesn’t exist but we have info about the usage of associated features that can help us
    • Please list all related data usage information
  • We have the numbers and can relate to them
    • Please list all related data usage information

Feature availability (why should/shouldn't it live inside the UI/API?)

  • Please describe the reasoning behind why it should/shouldn't live inside the UI/API
  • If it's for a specific customer we should consider using AMS
  • Does this feature exist in the UI of other installers?

Manage the effort for adding jobs for release-ocm-2.12 on assisted installer

https://docs.google.com/document/d/1WXRr_-HZkVrwbXBFo4gGhHUDhSO4-VgOPHKod1RMKng

 

The repositories for which we currently handle the cut-off are:

  • assisted-service
  • assisted-image-service
  • assisted-installer
  • assisted-installer-agent
  • assisted-test-infra
  • cluster-api-provider-agent

 

Merge order:

  1. Add temporary image streams for Assisted Installer migration - the day before (make sure images were created)
  2. Add Assisted Installer fast forwards for the ocm-2.x release <depends on #1> - needs approval from the test-platform team at https://coreos.slack.com/archives/CBN38N3MW (these jobs also branch out automatically) - on the day of the FF
  3. Prevent merging into release-ocm-2.x - <depends on #3> - on the day of the FF
  4. Update BUNDLE_CHANNELS to ocm-2.x on master - <depends on #3> - on the day of the FF
  5. ClusterServiceVersion for the release 2.(x-1) branch references the "latest" tag <depends on #5> - after #5
  6. Update external components to AI 2.x <depends on #3> - after a week, if there are no issues, update external branches
  7. Remove unused jobs - after 2 weeks

Note - There is a CI tool for CRUD operations on jobs configuration in progress. We should try to use it for the next FF.

 

LSO has not been published to the 4.18 redhat-operators catalog, so it cannot be installed on OpenShift 4.18. Until this is resolved, we explicitly install the 4.17 catalog as redhat-operators-v4-17 and then subscribe to the LSO version from the 4.17 catalog rather than the 4.18 one.
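
A hedged sketch of that workaround; the index image tag, channel, and namespaces below are the usual defaults and should be checked against the actual environment:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operators-v4-17
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: registry.redhat.io/redhat/redhat-operator-index:v4.17
  displayName: Red Hat Operators v4.17
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: local-storage-operator
  namespace: openshift-local-storage
spec:
  source: redhat-operators-v4-17          # the explicitly installed 4.17 catalog
  sourceNamespace: openshift-marketplace
  name: local-storage-operator
  channel: stable                         # assumed channel name for LSO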

Convert Cluster Configuration single page form into a multi-step wizard. The goal is to avoid overwhelming user with all information on a single page, provide guidance through the configuration process.

Wireframes: 

Phase1:
https://marvelapp.com/prototype/fjj6g57/screen/76442394

Future:
https://marvelapp.com/prototype/78g662d/screen/71444815
https://marvelapp.com/prototype/7ce7ib3/screen/73190117

 

Phase 1 wireframes: https://marvelapp.com/prototype/fjj6g57/screen/76442399

 

This requires UX investigation to handle the case when the base DNS is not set yet and the clusters list has several clusters with the same name.

Description of the problem:

V2CreateClusterManifest should block empty manifests

How reproducible:

100%

Steps to reproduce:

1. POST V2CreateClusterManifest manifest with empty content

Actual results:

Succeeds. Then silently breaks bootkube much later.

Expected results:
API call should fail immediately

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

CMO creates a default Alertmanager configuration on cluster bootstrap. The configuration should have the following snippet when a cluster proxy is configured:

global:
  http_config:
    proxy_from_environment: true

 

1. Proposed title of this feature request

Prometheus generating disk activity every two hours causing storage backend issues.

 

2. What is the nature and description of the request?

We're seeing Prometheus doing some type of disk activity every two hours on the hour on all of our clusters. We'd like to change that default setting so that all clusters aren't hitting our storage at the same time. Need help in finding where to make that config change. I see a knowledgebase article which says this is by design, but we'd like to stagger these if possible. [1][2]

 

3. Why does the customer need this? (List the business requirements here)

It appears to be impacting their storage clusters. They use NetApp Trident NFS as their PVC backing, which serves multiple clusters, and the prometheus-k8s pods use NetApp Trident NFS PVCs for their data. It appears that this 2-hour interval job occurs at the exact same time in every cluster, and their hope is to stagger this in each cluster, such as:

 

Those two hours for every cluster are midnight, 2:00AM, 4:00AM, etc... The question I've had is, can we change it so one cluster does midnight, 2:00AM, 4:00AM, etc... and another cluster does 12:15AM, 2:15AM, 4:15AM, etc... so they both aren't writing to storage at the same time? It's still a 2 hr default.

 

4. List any affected packages or components.

openshift-monitoring

 

[1] https://access.redhat.com/solutions/6960833
[2] https://prometheus.io/docs/prometheus/latest/storage/

 

Upstream issue: https://github.com/prometheus/prometheus/issues/8278

change proposal accepted at Prometheus dev summit: https://docs.google.com/document/d/11LC3wJcVk00l8w5P3oLQ-m3Y37iom6INAMEu2ZAGIIE/edit#heading=h.4t8053ed1gi 

Epic Goal

  • Allow user-defined monitoring administrators to define PrometheusRules objects spanning multiple/all user namespaces.

Why is this important?

  • There's often a need to define similar alerting rules for multiple user namespaces (typically when the rule works on platform metrics such as kube-state-metrics or kubelet metrics).
  • In the current situation, such a rule would have to be duplicated in each user namespace, which doesn't scale well:
    • 100 expressions selecting 1 namespace each are more expensive than 1 expression selecting 100 namespaces.
    • updating 100 PrometheusRule resources is more time-consuming and error-prone than updating 1 PrometheusRule object.

Scenarios

  1. A user-defined monitoring admin can provision a PrometheusRules object for which the PromQL expressions aren't scoped to the namespace where the object is defined.
  2. A cluster admin can forbid user-defined monitoring admins to use cross-namespace rules.
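
A minimal sketch of scenario 1, assuming the final design lets a UWM admin opt a rule out of namespace enforcement (the opt-out mechanism itself is not shown because it is part of this epic's design work); object and namespace names are placeholders:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cross-namespace-pod-alerts
  namespace: team-observability        # a user namespace managed by the UWM admin
spec:
  groups:
  - name: workload-health
    rules:
    - alert: PodsNotReadyAcrossNamespaces
      # The expression deliberately aggregates over every namespace instead of being
      # rewritten to namespace="team-observability".
      expr: sum by (namespace) (kube_pod_status_ready{condition="false"}) > 0
      for: 15m
      labels:
        severity: warning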

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Follow FeatureGate Guidelines
  • ...

Dependencies (internal and external)

  1. None (Prometheus-operator supports defining namespace-enforcement exceptions for PrometheusRules).

Previous Work (Optional):

  1.  

Open questions::

In terms of risks:

  • UWM admins may configure rules which overload the platform Prometheus and Thanos Querier.
    • This is not very different from the current situation where ThanosRuler can run many UWM rules.
    • All requests go first through the Thanos Querier which should "protect" Prometheus from DoS queries (there's a hard limit of 4 in-flight queries per Thanos Querier pod).
  • UWM admins may configure rules that access platform metrics unavailable for application owners (e.g. without a namespace label or for an openshift-* label).
    • In practice, UWM admins already have access to these metrics so it isn't a big change.
    • It also enables use cases such as ROSA admin customers that can't deploy their platform alerts to openshift-monitoring today. With this new feature, the limitation will be lifted.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Proposed title of this feature request

Ability to modify UWM Prometheus scrape interval

What is the nature and description of the request?

Customer would like to be able to modify the scrape interval in Prometheus user workload monitoring

Why does the customer need this? (List the business requirements)

Control metric frequency and thus remote write frequency for application monitoring metrics.

List any affected packages or components.

  • Prometheus
  • cluster-monitoring-operator
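
A hedged sketch of what the requested knob could look like in the user-workload-monitoring-config ConfigMap, assuming it is exposed under the prometheus section as scrapeInterval (the exact field name and placement are for the monitoring team to decide):

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      scrapeInterval: 2m   # assumed field name; controls how often UWM Prometheus scrapes targets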

Proposed title of this feature request

Collect accelerator metrics in OCP

What is the nature and description of the request?

With the rise of OpenShift AI, there's a need to collect metrics about accelerator cards (including but not limited to GPUs). It should require little to no configuration from customers, and we recommend deploying a custom text-file collector with node_exporter.

Why does the customer need this? (List the business requirements)

Display inventory data about accelerators in the OCP admin console (like we do for CPU, memory, ... in the Overview page).

Better understanding of which accelerators are used (Telemetry requirement).

List any affected packages or components.

node_exporter

CMO

Epic Goal

  • Improve IPI on Power VS in the 4.16 cycle
    • Switch to CAPI provisioned bootstrap and control plane resources

Epic Goal*

OCP storage components (operators + CSI drivers) should not use environment variables for cloud credentials. This is discouraged by the OCP hardening guide and reported by the Compliance Operator. Our customers have noticed it: https://issues.redhat.com/browse/OCPBUGS-7270 

 
Why is this important? (mandatory)

We should honor our own recommendations.

 
Scenarios (mandatory) 

  1. users/cluster admins should not notice any change when this epic is implemented.
  2. storage operators + CSI drivers use Secret files to get credentials.
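
A minimal sketch of the pattern, with container and Secret names modeled on the AWS EBS driver as placeholders: the credentials Secret is mounted as a file, and only a file path, not the key material, passes through the environment.

apiVersion: v1
kind: Pod
metadata:
  name: csi-driver-controller-example
spec:
  containers:
  - name: csi-driver
    image: example.invalid/aws-ebs-csi-driver:placeholder   # placeholder image
    env:
    # Only a file path goes through the environment; the credentials themselves
    # stay in the mounted Secret file.
    - name: AWS_SHARED_CREDENTIALS_FILE
      value: /etc/aws-credentials/credentials
    volumeMounts:
    - name: aws-credentials
      mountPath: /etc/aws-credentials
      readOnly: true
  volumes:
  - name: aws-credentials
    secret:
      secretName: ebs-cloud-credentials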

Dependencies (internal and external) (mandatory)

none

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Description of problem:

[AWS EBS CSI Driver] cannot provision EBS volumes successfully on CCO manual-mode private clusters

Version-Release number of selected component (if applicable):

 4.17.0-0.nightly-2024-07-20-191204   

How reproducible:

Always    

Steps to Reproduce:

    1. Install a private cluster with manual mode ->
       https://docs.openshift.com/container-platform/4.16/authentication/managing_cloud_provider_credentials/cco-short-term-creds.html#cco-short-term-creds-format-aws_cco-short-term-creds     
    2. Create one pvc and pod consume the pvc.
    

Actual results:

In step 2, the pod and PVC are stuck at Pending.
$ oc logs aws-ebs-csi-driver-controller-75cb7dd489-vvb5j -c csi-provisioner|grep new-pvc
I0723 15:25:49.072662       1 controller.go:1366] provision "openshift-cluster-csi-drivers/new-pvc" class "gp3-csi": started
I0723 15:25:49.073701       1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-cluster-csi-drivers", Name:"new-pvc", UID:"f4f9bbaf-4149-44be-8716-8b7b973e16b8", APIVersion:"v1", ResourceVersion:"185085", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "openshift-cluster-csi-drivers/new-pvc"
I0723 15:25:49.656889       1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-cluster-csi-drivers", Name:"new-pvc", UID:"f4f9bbaf-4149-44be-8716-8b7b973e16b8", APIVersion:"v1", ResourceVersion:"185085", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "gp3-csi": rpc error: code = Internal desc = Could not create volume "pvc-f4f9bbaf-4149-44be-8716-8b7b973e16b8": could not create volume in EC2: NoCredentialProviders: no valid providers in chain
I0723 15:25:50.657418       1 controller.go:1366] provision "openshift-cluster-csi-drivers/new-pvc" class "gp3-csi": started
I0723 15:25:50.658112       1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-cluster-csi-drivers", Name:"new-pvc", UID:"f4f9bbaf-4149-44be-8716-8b7b973e16b8", APIVersion:"v1", ResourceVersion:"185085", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "openshift-cluster-csi-drivers/new-pvc"
I0723 15:25:51.182476       1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"openshift-cluster-csi-drivers", Name:"new-pvc", UID:"f4f9bbaf-4149-44be-8716-8b7b973e16b8", APIVersion:"v1", ResourceVersion:"185085", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "gp3-csi": rpc error: code = Internal desc = Could not create volume "pvc-f4f9bbaf-4149-44be-8716-8b7b973e16b8": could not create volume in EC2: NoCredentialProviders: no valid providers in chain

Expected results:

In step 2, the PV should become Bound (volume provisioning succeeds) and the pod should be Running.

Additional info:

    

Epic Goal*

Our AWS EBS CSI driver operator is missing some nice-to-have functionality. This Epic is meant to track it, so we can finish it in a future OCP release.

 
Why is this important? (mandatory)

In general, the AWS EBS CSI driver controller should be a good citizen in HyperShift's hosted control plane. It should scale appropriately, report metrics, and not use kubeadmin privileges in the guest cluster.

Scenarios (mandatory) 

 
Dependencies (internal and external) (mandatory)

None

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • QE -

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be "Release Pending" 

Our operators use an Unstructured client to read HostedControlPlane. HyperShift has published API types that don't require many dependencies, and we could import their types.go.

Epic Goal

  • Update all images that we ship with OpenShift to the latest upstream releases and libraries.
  • Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. Especially new CSI drivers releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to deadline and/or looks too big.

Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).

Why is this important?

  • We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.

This includes (but is not limited to):

  • Kubernetes:
    • client-go
    • controller-runtime
  • OCP:
    • library-go
    • openshift/api
    • openshift/client-go
    • operator-sdk

Operators:

  • csi-operator
  • aws-efs-csi-driver-operator
  • azure-file-csi-driver-operator
  • openstack-cinder-csi-driver-operator
  • gcp-pd-csi-driver-operator
  • gcp-filestore-csi-driver-operator
  • csi-driver-manila-operator
  • vmware-vsphere-csi-driver-operator
  • ibm-vpc-block-csi-driver-operator
  • csi-driver-shared-resource-operator
  • ibm-powervs-block-csi-driver-operator
  • secrets-store-csi-driver-operator

 

  • cluster-storage-operator
  • cluster-csi-snapshot-controller-operator
  • local-storage-operator
  • vsphere-problem-detector

EOL, do not upgrade:

  • github.com/oVirt/csi-driver-operator
  • alibaba-disk-csi-driver-operator
  • aws-ebs-csi-driver-operator (now part of csi-operator)
  • azure-disk-csi-driver-operator (now part of csi-operator)
  •  

Epic Goal

  • Update all images that we ship with OpenShift to the latest upstream releases and libraries.
  • Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. Especially new CSI drivers releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to deadline and/or looks too big.

Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).

Why is this important?

  • We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Update OCP release number in OLM metadata manifests of:

  • local-storage-operator
  • aws-efs-csi-driver-operator
  • gcp-filestore-csi-driver-operator
  • secrets-store-csi-driver-operator
  • smb-csi-driver-operator

OLM metadata of the operators are typically in /config/manifest directory of each operator. Example of such a bump: https://github.com/openshift/aws-efs-csi-driver-operator/pull/56 

We should do it early in the release, so QE can identify new operator builds easily and they are not mixed with the old release.

Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.

This includes (but is not limited to):

  • Kubernetes:
    • client-go
    • controller-runtime
  • OCP:
    • library-go
    • openshift/api
    • openshift/client-go
    • operator-sdk

Operators:

  • csi-operator
  • openstack-cinder-csi-driver-operator
  • gcp-pd-csi-driver-operator
  • gcp-filestore-csi-driver-operator
  • csi-driver-manila-operator
  • vmware-vsphere-csi-driver-operator
  • ibm-vpc-block-csi-driver-operator
  • csi-driver-shared-resource-operator
  • ibm-powervs-block-csi-driver-operator
  • secrets-store-csi-driver-operator
  • cluster-storage-operator
  • cluster-csi-snapshot-controller-operator
  • local-storage-operator
  • vsphere-problem-detector

(please cross-check with *-operator + vsphere-problem-detector in our tracking sheet)

EOL, do not upgrade:

  • github.com/oVirt/csi-driver-operator
  • github.com/openshift/alibaba-disk-csi-driver-operator

The following operators were migrated to csi-operator, do not update these obsolete repos:

  • github.com/openshift/aws-efs-csi-driver-operator
  • github.com/openshift/azure-disk-csi-driver-operator
  • github.com/openshift/azure-file-csi-driver-operator

tools/library-bump.py and tools/bump-all may be useful. For 4.16, this was enough:

mkdir 4.16-bump
cd 4.16-bump
../library-bump.py --debug --web <file with repo list> STOR-1574 --run "$PWD/../bump-all github.com/google/cel-go@v0.17.7" --commit-message "Bump all deps for 4.16" 

4.17 perhaps needs an older prometheus:

../library-bump.py --debug --web <file with repo list> STOR-XXX --run "$PWD/../bump-all github.com/google/cel-go@v0.17.8 github.com/prometheus/common@v0.44.0 github.com/prometheus/client_golang@v1.16.0 github.com/prometheus/client_model@v0.4.0 github.com/prometheus/procfs@v0.10.1" --commit-message "Bump all deps for 4.17" 

Incomplete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled

Epic Goal

  • Enable users to add an appliance node on day 2 to a cluster installed with the appliance method

Why is this important?

  • There is currently no other way to expand an appliance on day 2
  • Unified experience for day1 and day2 installation for the agent based installer
  • Unified experience for day1 and day2 installation for appliance workflow
  • Eliminate the requirement of installing MCE, which has high resource requirements (4 cores and 16 GB RAM for a multi-node cluster; if the infrastructure operator is included, it also requires storage)

Scenarios

  1. The appliance workflow uses ABI for day1 and doesn't want to require users to install MCE on the cluster; reusing ABI will enable a simple and consistent UX for users.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. AGENT-682

Previous Work (Optional):

ABI uses assisted installer + kube-api or any other client that communicates with the service; all the building blocks related to day2 installation already exist in those components.

The assisted installer can create an installed cluster and use it to perform day2 operations.

A doc that explains how it's done with kube-api 

Parameters that are required from the user:

  • Cluster name, Domain name - used to derive the URL that will be used to pull the ignition
  • Kubeconfig to day1 cluster 
  • Openshift version
  • Network configuration

Actions required from the user

  • Post reboot the user will need to manually approve the node on the day1 cluster

Implementation suggestion:

To keep a similar flow between day1 and day2, I suggest running the service on each node that the user is trying to add. It will create the cluster definition and start the installation; after the first reboot it will pull the ignition from the day1 cluster.

Open questions::

  1. How should ABI detect an existing day1 cluster? Yes
  2.  Using a provided kubeconfig file? Yes 
  3. If so, should it be added to the config-image API (i.e. included in the ISO) - question to the agent team
  4. Should we add a new command for day2 in agent-installer-client? E.g. question to the agent team 
  5. So this command would import the existing day1 cluster and add the day2 node. It means that every day2 node should run an assisted-service first. Yes; those instances do not depend on each other.

Add the ability for the node-joiner tool to create a config image (analogous to the one generated by openshift-install agent create config-image) with the configuration necessary to import a cluster and add a day 2 node, but no OS.

The config image is small enough that we could probably create it unconditionally and leave it up to the client to decide which one to download.

Deploy Hypershift Operator component to the MCs in the MSFT INT environment.

Acceptance criteria

  • Hypershift Operator is packaged as helm chart
  • Hypershift Operator is installed to the MSFT INT tenant
  • installation to DEV is adapted to leverage the helm chart
  • installation is integrated into ADO/EV2 based on the work from ARO-10333

User Story

Implement helm chart for hypershift operator installation

Acceptance Criteria

  • helm chart needs to be low maintenance (no need to update every time there is a new hypershift release)
  • use templatization to allow installation in higher environments

We generate the hypershift operator install manifests by running `hypershift install render`, catching STDOUT and storing it. If non-critical errors occur during the generation step, the generated manifests are no longer processable.

Example: proxy autodiscovery for external-dns fails if no kubeconfig is given. This does not fail the generation task but results in error messages intertwined with the rest of the generated manifests, making them unprocessable.

We will add a new config parameter to `hypershift install render` to render the manifests to a file instead of STDOUT.

Description
This epic covers the changes needed to the ARO RP for

  1. Introducing workload identity support in the ARO RP
    1. Creating (or validating) user assigned identities during installation
      1. Adding role assignments to the custom roles and proper scope
      2. Validating permissions
    2. Creating OIDC blob store for the cluster
      1. Attaching a cert/pem file to the kube-serviceaccount signer
      2. Exposing the issuerURL to the customer
  2. Introduce MSI / Cluster MSI

ACCEPTANCE CRITERIA:

What is "done", and how do we measure it? You might need to duplicate this a few times.
 

  1. RP level changes
    1. Create keypair, and generate OIDC documents
      1. Generate issuerURL
      2. Validate permissions / roles on the identities
  1. Code changes are completed and merged to ARO-RP and its components to allow customers to install workload identity clusters.
    1. Clusters are configured with proper credentialsrequests
    2. The authentication configuration is configured with the correct service account issuer
    3. The pod identity webhook config is created
    4. Bound service account key is passed to the installer

NON GOALS:

  1. Release the API
  2. Support migration between tenants
  3. Hive changes
  4. Allow migration between service principal clusters and workload identity clusters
  5. Support key rotation for OIDC
     
    CUSTOMER EXPERIENCE:

Only fill this out for Product Management / customer-driven work. Otherwise, delete it.

  • Does this feature require customer facing documentation? YES/NO
    • If yes, provide the link once available
  • Does this feature need to be communicated with the customer? YES/NO
      • How far in advance does the customer need to be notified?
      • Ensure PM signoff that communications for enabling this feature are complete
  • Does this feature require a feature enablement run (i.e. feature flags update) YES/NO
    • If YES, what feature flags need to change?
      • FLAG1=valueA
    • If YES, is it safe to bundle this feature enablement with other feature enablement tasks? YES/NO

 
BREADCRUMBS:

Where can SREs look for additional information? Mark with "N/A" if these items do not exist yet so Functional Teams know they need to create them.

NOTES:

Need to determine if (in 4.14 azure workload identity functionality) we need to create secrets/secret manifests for each operator manually as part of the ARO cluster install, or if we can leverage credentialsrequests to do this automatically somehow. How will necessary secrets be created?

DESCRIPTION:

  • This story covers investigating SDK / code changes to allow auth to cluster storage accounts with RBAC / Azure AD rather than shared keys.
    • Currently, our cluster storage accounts deploy with shared key access enabled. The flow for auth is:
      • Image registry operator uses a secret in its own namespace to pull the account keys. It uses these keys to fetch an access token that then grants data plane access to the storage account.
      • Cluster storage account is accessed by the RP using a SAS token. This storage account is used to host ignition files, graph, and boot diagnostics.
        • RP accesses it for boot diagnostics when SRE executes a geneva action to view them. Graph and ignition are also stored here.
          • This storage account doesn’t appear to be accessed from inside of the cluster, only by the first party service principal
  • Image registry team has asked ARO for assistance in identifying how to best migrate away from the shared key access.

ACCEPTANCE CRITERIA:

  • Image registry uses managed identity auth instead of SAS tokens. SRE understands how to make the changes.
  • Cluster storage account uses managed identity auth instead of SAS tokens. SRE understands how to make the changes

NON GOALS:

  •  

BREADCRUMBS:

Where can SREs look for additional information? Mark with "N/A" if these items do not exist yet so Functional Teams know they need to create them.

What

  • Maintain kube-rbac-proxy by taking care of issues and PRs.
  • Push the project towards k8s/sig-auth ownership

Why

  • It is a widely used and core component in OpenShift and within other companies.

What

Merge upstream kube-rbac-proxy v0.18.1 into downstream.

Why

We need to update the deps to get rid of CVE issues.

Clean up GCPLabelsTags feature gate created for OCPSTRAT-768 feature. Feature was made available as TechPreview in 4.14 and GA in 4.17.

GCPLabelsTags feature gate validation checks should be removed in installer, operator and API.

The FeatureGate check added in the installer for userLabels and userTags should be removed, and the reference made in the install-config GCP schema should be removed as well.
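For reference, these are the kinds of install-config fields involved; a minimal sketch with illustrative values, not an authoritative schema:

apiVersion: v1
metadata:
  name: example-cluster
platform:
  gcp:
    projectID: example-project
    region: us-central1
    # Previously gated behind the GCPLabelsTags feature gate:
    userLabels:
    - key: team
      value: storage
    userTags:
    - parentID: example-project
      key: cost-center
      value: engineering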

Acceptance Criteria

  • Defining userLabels and userTags is not restricted through a feature gate.

The GCPLabelsTags feature gate check added in the machine-api-provider-gcp operator for userLabels and userTags should be removed.

The featureGate added in openshift/api should also be removed.

Acceptance Criteria

  • userLabels and userTags should no longer be controlled through a feature gate.

Epic Goal

To ensure the NUMA Resources Operator can be deployed, managed, and utilized effectively within HyperShift hosted OpenShift clusters.

Why is this important?

The NUMA resources operator enables NUMA-aware pod scheduling. As HyperShift gains popularity as a cost-effective and portable OpenShift form factor, it becomes important to ensure that the NUMA Resources Operator, like other operators, is fully functional in this environment. This will enable users to leverage NUMA aware pod scheduling, which is important for low-latency and high performance workloads like telco environments.

Scenarios

Deploying the NUMA Resources Operator on a HyperShift hosted OpenShift cluster.

Ensure the operands run correctly on a HyperShift hosted OpenShift cluster.
Pass the e2e test suite on Hypershift hosted OpenShift cluster

Acceptance Criteria

  • NUMA Resources Operator can be successfully deployed on a HyperShift hosted OpenShift cluster.
  • CI pipeline is established and running successfully with automated tests covering all the mentioned scenarios.
  • The e2e test suite passes on a HyperShift hosted OpenShift cluster.
  • Release technical enablement details and documents are provided.
  • Support and SRE teams are equipped with the necessary skills to support the NUMA Resources Operator in a production environment.

Dependencies (internal and external)

  1. N/A

Previous Work (Optional):

  1. N/A

Open questions:

  1. N/A

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • To have PAO working in a HyperShift environment with feature parity with SNO (as much as possible) - GA the feature
  • Compatible, functioning HyperShift CI
  • Full QE coverage for HyperShift downstream

Why is this important?

  • HyperShift is a very interesting platform for clients, and PAO has a key role in node tuning, so making it work in HyperShift is a good way to ease the migration to this new platform, as clients will not lose their tuning capabilities.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • No regressions are introduced

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. CNF-11959 [TP] PAO operations in a Hypershift hosted cluster

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Previously, when integrating 1_performance_suite, we faced a failure in the test case "Number of CPU requests as multiple of SMT count allowed when HT enabled", which occurred because the testpod failed to be admitted (the API server couldn't find a node with the worker-cnf label).
We started to investigate this as we couldn't find how and by whom the worker-cnf label was being added to the podspec. Since we couldn't figure that out, the workaround we introduced was to reapply the worker-cnf label to the worker nodes after each tuning update.
Another thing we were curious about is why the node lost its labels after the performance profile application. We believe this relates to the nodepool rollingUpdate policy (upgradeType: Replace), which replaces the nodes when the tuning configuration changes.

This issue will track the following items:
1. An answer for how the worker-cnf label was added to the testpod.
2. Checking with hypershift folks whether we can change the nodepool rollingUpdate policy to InPlace for our CI tests, and discussing the benefits/drawbacks.


Epic Goal

  • This is to support the change of frr-k8s from being deployed by the metallb-operator to being deployed by CNO

Why is this important?

  • For letting the users know about the change
  • For validating the change of deployment and ensuring metallb keeps working

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Goal

Address miscellaneous technical debt items in order to maintain code quality, maintainability, and improved user experience.

User Stories

Non-Requirements

Notes

  • Any additional details or decisions made/needed

Owners

Role Contact
PM Peter Lauterbach
Documentation Owner TBD
Delivery Owner (See assignee)
Quality Engineer (See QA contact)

Done Checklist

Who What Reference
DEV Upstream roadmap issue <link to GitHub Issue>
DEV Upstream code and tests merged <link to meaningful PR or GitHub Issue>
DEV Upstream documentation merged <link to meaningful PR or GitHub Issue>
DEV gap doc updated <name sheet and cell>
DEV Upgrade consideration <link to upgrade-related test or design doc>
DEV CEE/PX summary presentation label epic with cee-training and add a <link to your support-facing preso>
QE Test plans in Polarion <link or reference to Polarion>
QE Automated tests merged <link or reference to automated tests>
DOC Downstream documentation merged <link to meaningful PR>

The PR https://github.com/openshift/origin/pull/25483 introduced a report which infers a storage driver's virtualization compatibility by post-processing the openshift-tests results. Unfortunately this doesn't provide an accurate enough picture about CNV compatibility and thus we now have and promote the kubevirt-storage-checks. Avoid sending mixed messages and revert this post-processor from openshift-tests.

This epic is to track any stories for hypershift kubevirt development that do not fit cleanly within a larger effort.

Here are some examples of tasks that this "catch all" epic can capture

  • dependency update maintenance tasks
  • ci and testing changes/fixes
  • investigation spikes

Users need the ability to set labels on the HostedCluster in order to influence how MCE installs addons into that cluster.

In MCE, when a HostedCluster is created, MCE imports that cluster as a ManagedCluster. MCE has the ability to install addons into ManagedClusters by matching a managedCluster to an install strategy using label selectors. During the import process of importing a HostedCluster as a ManagedCluster, MCE now syncs the labels from the HostedCluster to the ManagedCluster.

This means by being able to set the labels on the HostedCluster, someone can now influence what addons are installed by MCE into that cluster.

https://docs.google.com/spreadsheets/d/15jtfdjgAZZf3F8jtdCcrsJ-YeygZWDale7j4x7xLWEQ/edit?usp=sharing
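A minimal sketch of the idea, assuming an illustrative label key (the actual keys an MCE addon install strategy selects on are environment-specific):

apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example
  namespace: clusters
  labels:
    # Illustrative label; MCE syncs HostedCluster labels to the ManagedCluster,
    # where an addon install strategy can match on it via a label selector.
    addons.example.com/observability: "enabled"
spec:
  # ... remaining HostedCluster spec omitted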

Epic Goal

  • Goal is to locate and replace old custom components with PatternFly components.

Why is this important?

  • Custom components require supportive css to mimic the visual theme of PatternFly. Over time these supportive styles have grown and interspersed through the console codebase, which require ongoing efforts to carry along, update and maintain consistency across product areas and packages.
  • Also, custom components can have varying behaviors that diverge from PatternFly components, causing bugs and creating discordance across the product.
  • Future PatternFly version upgrades will be more straightforward and require less work.

Acceptance Criteria

  • Identify custom components that have a PatternFly equivalent component.
  • Create stories which will address those updates and fixes
  • Update integration tests if necessary.

Open questions::

ContainerDropdown

frontend/packages/dev-console/src/components/health-checks/AddHealthChecks.tsx

frontend/public/components/environment.jsx

frontend/public/components/pod-logs.jsx

 

Epic Goal

  • Migrate all components to functional components
  • Remove all HOC patterns
  • Break the file down into smaller files
  • Improve type definitions
  • Improve naming for better self-documentation
  • Address any React anti-patterns like nested components, or mirroring props in state.
  • Address issues with handling binary data
  • Add unit tests to these components

Acceptance Criteria

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Move shared Type definitions for CreateSecret to "createsecret/type.ts" file

 

A.C.

  - All shared CreateSecret component Type definitions are in the "createsecret/type.ts" file

The SSHAuthSubform component needs to be refactored to address several tech debt issues:

  • Rename to SSHAuthSecretForm
  • Refactor into a function component
  • Remove i18n withTranslation HOC pattern
  • Improve type definitions

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.

As part of the spike to determine outdated plugins, the monaco-editor dev dependency is out of date and needs to be updated.

Acceptance criteria:

Need to follow the steps in https://webpack.js.org/migrate/5/#upgrade-webpack-4-and-its-pluginsloaders in order to migrate to Webpack v5

 

Acceptance criteria:

  • All webpack-related packages are updated to the last versions that support v4 and with no new warnings

As a developer, I want to take advantage of the `status` prop that was introduced in PatternFly 5.3.0, so that I can use it for stories such as ODC-7655, which need it for form validation

AC:

  • @patternfly/react-core is updated to at least 5.3.0, preferably 5.4.0 for the bugfixes

Motivation:

Content-Security-Policy (CSP) header provides a defense-in-depth measure in client-side security, as a second layer of protection against Cross-site Scripting (XSS) and clickjacking attacks.

It is not yet implemented in the OpenShift web console, however, there are some other related security headers present in the OpenShift console that cover some aspects of CSP functionality:

  • X-Frame-Options: When set to DENY, this disallows attempts to iframe the site (related CSP directive: `frame-ancestors`)
  • X-XSS-Protection: Protects against reflected XSS attacks in Chrome and Internet Explorer (related CSP directive: `unsafe-inline`)
  • X-Content-Type-Options: Protects against loading of external scripts and stylesheets unless the server indicates the correct MIME type, which can lead to some types of XSS attacks.

Epic Goal

  • Implement CSP in a two phases
    • 1. Report only, in which the console will report the violations to the user and to telemetry, so developers can have a better idea of what types of violations are appearing.
      • This phase should ideally span several releases, at least two, during which data about the observed violations would be gathered through telemetry. This gives plugin creators time to adapt to the change.
      • During this phase an API needs to be added to the ConsolePlugin CRD, to give plugin creators an option for providing the list of their content sources, which console will register and take into account.
      • Also, Console itself should remove any CSP violations which it is causing.
    • 2. Enforcing, in which console will start enforcing the CSP, which will prevent plugins from loading assets from non-registered sources.

Why is this important?

  • Add additional security level to console

Acceptance Criteria

  • Implement 1 phase of the CSP
    • report only mode
    • surface the violations to the user
    • extend the ConsolePlugins API to give plugin creators an option to extend the plugin's sources
    • extend CI to catch any CSP violations
    • update README and announce this feature so developers are aware of it

Open questions::

  1. TBD

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

This story follows up on spike https://issues.redhat.com/browse/CONSOLE-4170

The aim of this story is to add initial CSP implementation for Console web application that will use Content-Security-Policy-Report-Only HTTP header to report on CSP violations.

CSP violations should be handled directly by Console code via custom SecurityPolicyViolationEvent handler, which logs the relevant CSP violation data to browser console.

AC:

  • Console HTML index page must be served with CSP report-only response header
  • Running dynamic demo plugin in Console must not trigger any CSP violations
  • CSP violations must be logged to browser console
  • dynamic plugins README should contain a section that describes CSP usage in Console

CSP violations caused by dynamic plugins should trigger a warning within the cluster dashboard / dynamic plugin status.

 

AC:

  • Add additional section to the DynamicPluginsPopover with plugins which are violating the CSP and showing their count together with linking to the ConsolePlugin list page.

 

 

We should add a custom ConsolePlugin details page that shows additional plugin information as well as controls (e.g. enable/disable plugin) for consistency with ConsolePlugin list page.

 

AC:

  • details page
    • display name
    • mechanism for enable/disable plugin
    • CSP violations
    • plugin status (reported by plugin store)
    • backend service - linkable
    • proxy services - linkable if its as k8s Service type
    • content of plugin manifest JSON (possibly into a separate tab or section) - only available when a plugin is loaded in plugin store
    • version - only available when a plugin is loaded in plugin store
    • description - only available when a plugin is loaded in plugin store

CONSOLE-4265 introduced an additional ConsolePlugin CRD field for CSP configuration, so plugins can provide their own list of allowed sources. Console-operator needs to vendor these changes and also provide a way to configure the default CSP directives.

AC:

  • Vendor changes from CONSOLE-4265 in to console-operator and console
    • console-operator
      • Aggregate the CSP directives of each enabled plugin and set them on the console-config.yaml CM
      • Add integration tests
      • Add unit-tests
    • console
      • consume the aggregated CSP directives from the console-config.yaml CM and append them to the default CSP directives which the console backend is setting
      • Add unit-tests

When serving Console HTML index page, we generate the policy that includes allowed (trustworthy) sources.

It may be necessary for some dynamic plugins to add new sources in order to avoid CSP violations at Console runtime.

AC:

  • Add new API field to ConsolePlugin CRD for allowing additional CSP sources
  • Add unit tests for the 
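To make this concrete, one possible shape for such a field is sketched below; the field name and schema shown are illustrative assumptions, not the final API:

apiVersion: console.openshift.io/v1
kind: ConsolePlugin
metadata:
  name: example-plugin
spec:
  displayName: Example Plugin
  backend:
    type: Service
    service:
      name: example-plugin
      namespace: example-plugin-ns
      port: 9443
      basePath: /
  # Illustrative CSP field: additional allowed sources per directive that the
  # console would merge into the policy it serves.
  contentSecurityPolicy:
  - directive: ScriptSrc
    values:
    - https://cdn.example.com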

Console HTML index template contains an inline script tag used to set up SERVER_FLAGS and visual theme config.

This inline script tag triggers a CSP violation at Console runtime (see attachment for details).

The proper way to address this error is to allow this script tag - either generate a SHA hash representing its contents or generate a cryptographically secure random token for the script.

AC:

  • There is no CSP violation reported for inline script tag.

As part of the AI we would like to supply/generate a manifest file that will install: 

  • nmstate
  • MTV
    • vddk
    • Installing the operator
    • Creating a forklift controller 
    • The user supplies the vCenter creds
    • Creating a new MTV provider (for the vCenter) 
  • local storage would be enough with warning 
  • Preflights: 
    • Annotation on the storage class (default for virtualization) 
    • Check if the MTV host is configured
    • Check the storage disks 
    • Connectivity with the vCenter: warning only, not blocking the installation (we should check if it's part of creating a new MTV provider and, if not, deploy a new feature)
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)
[sig-network-edge] DNS should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]

failed
job link: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.16-periodics-mce-e2e-agent-connected-ovn-dualstack-metal3-conformance/1822988278547091456 

failed log

  [sig-network-edge] DNS should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
  github.com/openshift/origin/test/extended/dns/dns.go:499
    STEP: Creating a kubernetes client @ 08/12/24 15:55:02.255
    STEP: Building a namespace api object, basename dns @ 08/12/24 15:55:02.257
    STEP: Waiting for a default service account to be provisioned in namespace @ 08/12/24 15:55:02.517
    STEP: Waiting for kube-root-ca.crt to be provisioned in namespace @ 08/12/24 15:55:02.581
    STEP: Creating a kubernetes client @ 08/12/24 15:55:02.646
  Aug 12 15:55:03.941: INFO: configPath is now "/tmp/configfile2098808007"
  Aug 12 15:55:03.941: INFO: The user is now "e2e-test-dns-dualstack-9bgpm-user"
  Aug 12 15:55:03.941: INFO: Creating project "e2e-test-dns-dualstack-9bgpm"
  Aug 12 15:55:04.299: INFO: Waiting on permissions in project "e2e-test-dns-dualstack-9bgpm" ...
  Aug 12 15:55:04.632: INFO: Waiting for ServiceAccount "default" to be provisioned...
  Aug 12 15:55:04.788: INFO: Waiting for ServiceAccount "deployer" to be provisioned...
  Aug 12 15:55:04.972: INFO: Waiting for ServiceAccount "builder" to be provisioned...
  Aug 12 15:55:05.132: INFO: Waiting for RoleBinding "system:image-pullers" to be provisioned...
  Aug 12 15:55:05.213: INFO: Waiting for RoleBinding "system:image-builders" to be provisioned...
  Aug 12 15:55:05.281: INFO: Waiting for RoleBinding "system:deployers" to be provisioned...
  Aug 12 15:55:05.641: INFO: Project "e2e-test-dns-dualstack-9bgpm" has been fully provisioned.
    STEP: creating a dual-stack service on a dual-stack cluster @ 08/12/24 15:55:05.775
    STEP: Running these commands:for i in `seq 1 10`; do [ "$$(dig +short +notcp +noall +answer +search v4v6.e2e-dns-2700.svc A | sort | xargs echo)" = "172.31.255.230" ] && echo "test_endpoints@v4v6.e2e-dns-2700.svc"; [ "$$(dig +short +notcp +noall +answer +search v4v6.e2e-dns-2700.svc AAAA | sort | xargs echo)" = "fd02::7321" ] && echo "test_endpoints_v6@v4v6.e2e-dns-2700.svc"; [ "$$(dig +short +notcp +noall +answer +search ipv4.v4v6.e2e-dns-2700.svc A | sort | xargs echo)" = "3.3.3.3 4.4.4.4" ] && echo "test_endpoints@ipv4.v4v6.e2e-dns-2700.svc"; [ "$$(dig +short +notcp +noall +answer +search ipv6.v4v6.e2e-dns-2700.svc AAAA | sort | xargs echo)" = "2001:4860:4860::3333 2001:4860:4860::4444" ] && echo "test_endpoints_v6@ipv6.v4v6.e2e-dns-2700.svc";sleep 1; done
     @ 08/12/24 15:55:05.935
    STEP: creating a pod to probe DNS @ 08/12/24 15:55:05.935
    STEP: submitting the pod to kubernetes @ 08/12/24 15:55:05.935
    STEP: deleting the pod @ 08/12/24 16:00:06.034
    [FAILED] in [It] - github.com/openshift/origin/test/extended/dns/dns.go:251 @ 08/12/24 16:00:06.074
    STEP: Collecting events from namespace "e2e-test-dns-dualstack-9bgpm". @ 08/12/24 16:00:06.074
    STEP: Found 0 events. @ 08/12/24 16:00:06.207
  Aug 12 16:00:06.239: INFO: POD  NODE  PHASE  GRACE  CONDITIONS
  Aug 12 16:00:06.239: INFO: 
  Aug 12 16:00:06.334: INFO: skipping dumping cluster info - cluster too large
  Aug 12 16:00:06.469: INFO: Deleted {user.openshift.io/v1, Resource=users  e2e-test-dns-dualstack-9bgpm-user}, err: <nil>
  Aug 12 16:00:06.506: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients  e2e-client-e2e-test-dns-dualstack-9bgpm}, err: <nil>
  Aug 12 16:00:06.544: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens  sha256~4QgFXAn8lyosshoHOjJeddr3MJbIL2DnCsoIvJVOGb4}, err: <nil>
    STEP: Destroying namespace "e2e-test-dns-dualstack-9bgpm" for this suite. @ 08/12/24 16:00:06.544
    STEP: dump namespace information after failure @ 08/12/24 16:00:06.58
    STEP: Collecting events from namespace "e2e-dns-2700". @ 08/12/24 16:00:06.58
    STEP: Found 2 events. @ 08/12/24 16:00:06.615
  Aug 12 16:00:06.615: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for dns-test-d93fff7e-90a3-408e-a197-fc4ff0738b30: { } FailedScheduling: 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
  Aug 12 16:00:06.615: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for dns-test-d93fff7e-90a3-408e-a197-fc4ff0738b30: { } FailedScheduling: skip schedule deleting pod: e2e-dns-2700/dns-test-d93fff7e-90a3-408e-a197-fc4ff0738b30
  Aug 12 16:00:06.648: INFO: POD  NODE  PHASE  GRACE  CONDITIONS
  Aug 12 16:00:06.648: INFO: 
  Aug 12 16:00:06.743: INFO: skipping dumping cluster info - cluster too large
    STEP: Destroying namespace "e2e-dns-2700" for this suite. @ 08/12/24 16:00:06.743
  • [FAILED] [304.528 seconds]
  [sig-network-edge] DNS [It] should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
  github.com/openshift/origin/test/extended/dns/dns.go:499

    [FAILED] Failed: timed out waiting for the condition
    In [It] at: github.com/openshift/origin/test/extended/dns/dns.go:251 @ 08/12/24 16:00:06.074
  ------------------------------

  Summarizing 1 Failure:
    [FAIL] [sig-network-edge] DNS [It] should answer A and AAAA queries for a dual-stack service [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
    github.com/openshift/origin/test/extended/dns/dns.go:251

  Ran 1 of 1 Specs in 304.528 seconds
  FAIL! -- 0 Passed | 1 Failed | 0 Pending | 0 Skipped
fail [github.com/openshift/origin/test/extended/dns/dns.go:251]: Failed: timed out waiting for the condition
Ginkgo exit error 1: exit with code 1

failure reason
TODO

Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

User Story:

As an ARO HCP user, I want to be able to:

  • automatically install the prometheus and OpenShift route CRDs when I install the HO

so that I can remove

  • the burden of manually applying those CRDs

Acceptance Criteria:

Description of criteria:

  • prometheus and OpenShift route CRDs are automatically installed when the HO is installed, for AKS mgmt clusters only

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

These are the CRDs that need to be manually installed today:

oc apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
oc apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_prometheusrules.yaml
oc apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml
oc apply -f https://raw.githubusercontent.com/openshift/api/master/route/v1/zz_generated.crd-manifests/routes-Default.crd.yaml 

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Description of problem:

    $ hypershift  --help
{"level":"error","ts":"2024-11-05T09:26:54Z","logger":"controller-runtime.client.config","msg":"unable to load in-cluster config","error":"unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must bedefined",
...
ERROR	Failed to get client	{"error": "unable to get kubernetes config: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable"}

Version-Release number of selected component (if applicable):

    

How reproducible:

    100%

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

Can't run hypershift --help without a kubeconfig

Expected results:

Can run hypershift --help without kubeconfig

Additional info:

    

Currently, if we don't specify the NSG ID or VNet ID, the CLI will create these for us in the managed RG. In prod ARO these will be in separate RGs, as they will be provided by the customer; we should reflect this in our env.

This will also make the AKS e2e simpler as the jobs won't have to create these resource groups for each cluster.

Steps to Reproduce:

1. Run any hypershift CLI command in an environment without a live cluster e.g.

hypershift create cluster --help
2024-10-30T12:19:21+08:00	ERROR	Failed to create default options	{"error": "failed to retrieve feature-gate ConfigMap: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://ci-op-68zb-ci-op-68zbrc3h-2-53b8f5-qycsv9k7.hcp.northcentralus.azmk8s.io:443/api/v1\": dial tcp: lookup ci-op-68zb-ci-op-68zbrc3h-2-53b8f5-qycsv9k7.hcp.northcentralus.azmk8s.io: no such host"}
github.com/openshift/hypershift/cmd/cluster/azure.NewCreateCommand
	/Users/fxie/Projects/hypershift/cmd/cluster/azure/create.go:480
github.com/openshift/hypershift/cmd/cluster.NewCreateCommands
	/Users/fxie/Projects/hypershift/cmd/cluster/cluster.go:36
github.com/openshift/hypershift/cmd/create.NewCommand
	/Users/fxie/Projects/hypershift/cmd/create/create.go:20
main.main
	/Users/fxie/Projects/hypershift/main.go:64
runtime.main
	/usr/local/go/src/runtime/proc.go:271
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x10455ea8c]


goroutine 1 [running]:
github.com/spf13/cobra.(*Command).AddCommand(0x1400069db08, {0x14000d91a18, 0x1, 0x1})
	/Users/fxie/Projects/hypershift/vendor/github.com/spf13/cobra/command.go:1311 +0xbc
github.com/openshift/hypershift/cmd/cluster.NewCreateCommands()
	/Users/fxie/Projects/hypershift/cmd/cluster/cluster.go:36 +0x4c4
github.com/openshift/hypershift/cmd/create.NewCommand()
	/Users/fxie/Projects/hypershift/cmd/create/create.go:20 +0x11c
main.main()
	/Users/fxie/Projects/hypershift/main.go:64 +0x368
    

Actual results:

panic: runtime error: invalid memory address or nil pointer dereference

Background

This is intended to be a place to capture general "tech debt" items so they don't get lost. I very much doubt that this will ever get completed as a feature, but that's okay; the desire is more that stories get pulled out of here and put with feature work "opportunistically" when it makes sense.

Goal

If you find a "tech debt" item, and it doesn't have an obvious home with something else (e.g. with MCO-1 if it's metrics and alerting) then put it here, and we can start splitting these out/marrying them up with other epics when it makes sense.

 

This is a follow up story for: https://issues.redhat.com/browse/OCPBUGS-7836

The pivot command currently prints an error message and warns the user that it will be removed soon. We are planning to land this in 4.15.

This story will be complete when:

  • pivot.go(and all related files) completely removed from the daemon. 

Tracking here all the work that needs to be done to configure the ironic containers (ironic-image and ironic-agent-image) to be ready for OCP 4.19.
This also includes CI configuration, tools, and documentation updates.

All the configuration bits need to happen at least one sprint BEFORE 4.19 branching (current target November 22 2024).
Docs tasks can be completed after the configuration tasks.
The CI tasks need to be completed RIGHT AFTER 4.19 branching happens.

tag creation is now automated during OCP tags creation

builder creation has been automated

Before moving forward with the 4.19 configuration, we need to be sure that the dependency versions in 4.18 are correctly aligned with the latest upper-constraints.

The tools that we use to install Python libraries in containers move much faster than the corresponding packages built for the operating system.
In its latest version, sushy now uses pyproject.toml, specifying pbr and setuptools as "build requirements" and using pbr as the "build engine".
Because of this, due to PEP 517 and 518, pip will use an isolated environment to build the package, blocking the use of system-installed packages as dependencies.
We need to either install pbr, setuptools and wheel from source, including them in the pip isolated build environment, or use the "--no-build-isolation" pip option to allow using system-installed build packages.

Feature goal (what are we trying to solve here?)

During 4.15, the OCP team is working on allowing booting from iSCSI. Today that's disabled by the assisted installer. The goal is to enable that for OCP versions >= 4.15.

DoD (Definition of Done)

iSCSI boot is enabled for OCP versions >= 4.15, both in the UI and the backend.

When booting from iSCSI, we need to make sure to add the `rd.iscsi.firmware=1` kernel argument during install to enable iSCSI booting.
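For illustration only, if this were expressed as a MachineConfig manifest it could look like the sketch below (the assisted installer may instead apply kernel arguments through its own install APIs):

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-iscsi-kargs
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
  - rd.iscsi.firmware=1   # read the iSCSI configuration from the firmware (iBFT) so the root volume can be mounted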

Does it need documentation support?

yes

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Oracle
    • NetApp
    • Cisco

Reasoning (why it’s important?)

  • In OCI there are bare metal instances with iSCSI support and we want to allow customers to use them

Description of the problem:
Since machine networks are computed at installation time in the case of UMN (in the right way), the validation no-iscsi-nic-belongs-to-machine-cidr should be skipped in this case.

We should also skip this validation in the case of day2 and imported clusters, because those clusters are not created with all the network information that makes this validation work.


 

How reproducible:

 

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

In order to successfully install OCP on an iSCSI boot volume, we need to make sure that the machine has 2 network interfaces:

  • an interface connected to the iSCSI volume
  • an interface used as default gateway that will be used by OCP

This is required because on startup OVS/OVN will reconfigure the default interface (the network interface used for the default gateway). This behavior makes using the default interface impracticable for iSCSI traffic because we lose the root volume, and the node becomes unusable. See https://issues.redhat.com/browse/OCPBUGS-26071

In the scope of this issue we need to:

  • report iSCSI host IP address from the assisted agent
  • check that the network interface used for the iSCSI boot volume is not the default one (the default gateway goes through one of the other interfaces) => implies 2 network interfaces
  • ensure that the network interface connected to the iSCSI network is configured with DHCP in the kernel args in order to mount the root volume over iSCSI
  • workaround https://issues.redhat.com/browse/OCPBUGS-26580 by dropping a script in a MachineConfig manifest that will reload the network interfaces on first boot

  In case of CMN or SNO, the user should not be able to select the subnet used for the iSCSI traffic.

 Historically, assisted-service has only allowed one mirror configuration that would be applied to all spoke clusters. This was done for assisted service to pull the images needed to install OCP on the spoke cluster. The mirror was then copied over to the spoke cluster.

 

Feature request: Allow each cluster to have its own mirror configuration

 

Use-case: This came out of the Sylva CAPI project where they have a pull-through proxy that caches images from Docker. Each spoke cluster created might not have connectivity on the same network, so they will need different mirror configurations per cluster created.

 

The only way to do this right now is using an install config override for every cluster. https://github.com/openshift/assisted-service/blob/master/docs/user-guide/cloud-with-mirror.md

 

Add per-cluster support in AgentClusterInstall and update ImageDigestSource in the install-config, the same as we are doing for the per-service mirror registry.
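As a rough sketch of the install-config section in question (registry hostnames are placeholders), a per-cluster mirror configuration would ultimately be rendered as imageDigestSources entries like:

# Illustrative install-config fragment; the actual entries would be generated
# from the per-cluster mirror configuration referenced by AgentClusterInstall.
imageDigestSources:
- source: quay.io/openshift-release-dev/ocp-release
  mirrors:
  - mirror.example.com:5000/openshift/release-images
- source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
  mirrors:
  - mirror.example.com:5000/openshift/release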

 

Feature goal (what are we trying to solve here?)

Nmstate is included in RHEL CoreOS 4.14+, providing `nmstate.service`, which applies the YAML files in the `/etc/nmstate/` folder. Currently, the assisted installer (and maybe other OCP install methods also) uses `nmstatectl gc` to generate NetworkManager keyfiles.

Benefit of using `nmstate.service`:
1. No need to generate keyfiles anymore.
2. `nmstate.service` provides nmpolicy support; for example, the YAML below is an nmpolicy that creates a bond with ports holding specified MAC addresses, without knowing the interface names.

capture: 
  port1: interfaces.mac-address=="00:23:45:67:89:1B"
  port2: interfaces.mac-address=="00:23:45:67:89:1A"
desiredState: 
  interfaces: 
  - name: bond0
    type: bond
    state: up
    link-aggregation: 
      mode: active-backup
      ports: 
        - "{{ capture.port1.interfaces.0.name }}"
        - "{{ capture.port2.interfaces.0.name }}"

3. Follow-up day1 and day2 network configuration tools could look up `/etc/nmstate` to understand the network topology created on day0.

4. Fallback support with verification. For example, we can have `00-fallback.yml` holding the fallback network setup and `01-install.yml` holding the user-defined network. Nmstate will apply them sequentially; if `01-install.yml` fails nmstate's verification check, nmstate will roll back to the `00-fallback.yml` state.

Please describe what this feature is going to do.

DoD (Definition of Done)

The installer uses nmstate.service without deploying NetworkManager keyfiles.

Please describe what conditions must be met in order to mark this feature as "done".

Does it need documentation support?

Document could mention:

  • nmpolicy is supported for installation with common use cases.
  • Fallback network support.

If the answer is "yes", please make sure to check the corresponding option.

Feature origin (who asked for this feature?)

A Customer asked for it

Not customer related
 

A solution architect asked for it

Not from architect

Internal request

Gris Ge <fge@redhat.com>, maintainer of nmstate.
 

  • Catching up with OpenShift

Reasoning (why it’s important?)

  • Benefits listed above.

Competitor analysis reference

    • No idea. Need to check

Feature usage (do we have numbers/data?)

  • We have no data - the feature doesn’t exist anywhere

Feature availability (why should/shouldn't it live inside the UI/API?)

This is internal processing of network setup.

For ISOs that have the nmstate binary, use nmpolicy + nmstate.service instead of pre-generating the nmconnection files and the script.

The epic should contain tasks that ease the process of handling security issues.

Description of the problem:

Dependabot can't merge PRs as it doesn't tidy and vendor other modules.

for example - https://github.com/openshift/assisted-service/pull/6595

It seems the reason is that dependabot only updates one module at a time; if a package is bumped in module A and module B requires module A, then dependabot should bump this package in module B as well, which is currently not happening.

 

We want to make sure dependabot is bumping all required versions across all branches/repositories.

 

How reproducible:

Almost each PR

 

Actual results:

Failing jobs on dependabot PRs

 

Expected results:

Dependabot  bumping dependencies successfully

Feature goal (what are we trying to solve here?)

The Assisted Installer should support backup/restore and disaster recovery scenarios, either using OADP (OpenShift API for Data Protection) for ACM (Advanced Cluster Management), or, using ZTP (Zero Touch Provisioning) flows. I.e. the assisted-service should be resilient in such scenarios which, for this context and effort, means that restored/moved spoke clusters should keep the same state and behave the same on the new hub cluster.

Reasoning (why it’s important?)

Provide resiliency in the assisted-service for safe backup/restore flows, allowing spoke clusters to be used without any restriction after DR scenarios or moving between hubs.

Acceptance Criteria

  • Restored or moved spoke clusters should have the same status and support all functionality on the target hub.
  • Existing e2e tests should pass on these clusters and all scenarios should work as expected.

Dependencies (internal and external)

TBD

Previous Work (Optional):

Document outlining issues and potential solutions: https://docs.google.com/document/d/1g77MDYOsULHoTWtjjpr7P5_9L4Fsurn0ZUJHCjQXC1Q/edit?usp=sharing

Feature goal (what are we trying to solve here?)

Backup and restore managed (hosted) clusters installed with hosted control planes with the agent platform (assisted-service).

DoD (Definition of Done)

  • Managed (hosted) cluster can be restored from one hub to another
  • Restored managed (hosted) cluster has all of the workloads from before restoring
  • Restored managed (hosted) cluster can function as if it were installed on the original hub

Does it need documentation support?

Yes

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Name of the customer(s)
    • How many customers asked for it? [several]
    • Can we have a follow-up meeting with the customer(s)? [unsure]

 

  • A solution architect asked for it

    • Name of the solution architect and contact details
    • How many solution architects asked for it?
    • Can we have a follow-up meeting with the solution architect(s)?

 

  • Internal request

    • Who asked for it? Hypershift PM

Reasoning (why it’s important?)

  • Please describe why this feature is important
  • How does this feature help the product?
    •  

Competitor analysis reference

  • Do our competitors have this feature?
    • No idea. Need to check

Feature usage (do we have numbers/data?)

  • Related data - the feature doesn’t exist but we have info about the usage of associated features that can help us
    • Please list all related data usage information
  • We have the numbers and can relate to them
    • Please list all related data usage information

Feature availability (why should/shouldn't it live inside the UI/API?)

  • Please describe the reasoning behind why it should/shouldn't live inside the UI/API
  • If it's for a specific customer we should consider using AMS
  • Does this feature exist in the UI of other installers?

 During the Govtech spike [1]: backup and restore of HCP clusters from one ACM hub to a new ACM hub, it was discovered that the data currently saved for the first iteration [2] of restoring a host isn't enough.

After restoring, the NodePool, Machine, and AgentMachine still showed they were unready and that they were unable to adopt the Nodes. The Agents were completely missing their statuses, which is likely to have caused this.

We'll need to uncover all the issues and all that needs to be saved in order for the restore to complete successfully.

 

[1] HOSTEDCP-2052 Slack thread

[2] MGMT-18635

 

Feature goal (what are we trying to solve here?)

Currently, we have few issues with our current OCM authorization -

  • organizational tenancy - it is enabled in stage but disabled in prod. We want to align the behavior for both environments.
  • assisted service may not handle all possible OCM roles (requires investigation)
  • assisted-service may not handle cluster ownership transfer well (requires investigation)
  • assisted service testing deployment with rhsso auth is currently not working. It is important to fix that to be able to test.
  • More documentation is required.

DoD (Definition of Done)

 

assisted service rhsso auth type is aligned with OCM

OCM docs - https://docs.redhat.com/en/documentation/openshift_cluster_manager/1-latest/html/managing_clusters/index

Currently, deployment of assisted-installer using authentication mode "rhsso" doesn't work properly; we need to fix this type of deployment in order to test it.

Epic goal

When an Assisted Service SaaS user performs the creation of a new OpenShift cluster, provide the option to enable the Migration Kit for Virtualization (MTV) operator.

Why is this important?

  • Expose users in the Assisted Service SaaS to the value of MTV.
  • Customers/users want to leverage the migration toolkit for virtualization capabilities on premises environment.
  • Out of box install of MTV for Assisted Migrations.

Scenarios

  1. When a RH cloud user logs into console.redhat SaaS, they can leverage the Assisted Service SaaS flow to create a new cluster
  2. During the Assisted Service SaaS create flow, a RH cloud user can see a list of available operators that they want to install at the same time as the cluster create. 
  3. An option is offered to check a box next to "Migration Toolkit for Virtualization (MTV)"
  4. The RH cloud user can read a tool-tip or info-box with a short description of MTV and click a link for more details to review the MTV documentation

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Ensure MTV release channel can automatically deploy the latest x.y.z without needing any DevOps/SRE intervention
  • Ensure MTV release channel can be updated quickly (if not automatically) to ensure the later release x.y can be offered to the cloud user.

 

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: PR
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>{}
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of the problem:
When creating a SNO cluster,
the UI blocks the user from selecting the MTV operator.
 

How reproducible:

 

Steps to reproduce:

1. create sno cluster 4.17

2. go to the operators page

3.

Actual results:
The MTV operator is disabled and cannot be selected.
 

Expected results:
The MTV operator should be selectable.

Feature goal (what are we trying to solve here?)

Allow users to do a basic OpenShift AI installation with one click on the "operators" page of the cluster creation wizard, similar to how the ODF or MCE operators can be installed.

DoD (Definition of Done)

This feature will be done when users can click on the "OpenShift AI" check box on the operators page of the cluster creation wizard, and end up with an installation that can be used for basic tasks.

Does it need documentation support?

Yes.

Feature origin (who asked for this feature?)

  • Internal request

Feature usage (do we have numbers/data?)

  • According to existing data most of the existing RHOAI clusters have been installed using assisted installer.

Feature availability (why should/shouldn't it live inside the UI/API?)

  • It needs to be in the API because that is the way to automate it in assisted installer.
  • It needs to be in the UI because we want the user experience to be just clicking one check-box.

In order to complete the setup of some operators it is necessary to do things that can't be done by creating a simple manifest. For example, in order to complete the setup of ODF so that it can be used by OpenShift AI it is necessary to configure the default storage class, and that can't be done with a simple manifest.

One possible way to overcome that limitation is to create a simple manifest that contains a job, so that the job will execute the required operation. In the example above the job will run something like this:

oc annotate storageclass ocs-storagecluster-ceph-rbd storageclass.kubernetes.io/is-default-class=true
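
A minimal sketch of what such a manifest could look like, assuming a service account with RBAC permission to patch StorageClasses (the namespace, service account name, and image below are illustrative, not part of this ticket):

apiVersion: batch/v1
kind: Job
metadata:
  name: set-default-storage-class
  namespace: openshift-storage
spec:
  template:
    spec:
      # Illustrative service account; it needs RBAC permission to patch StorageClasses.
      serviceAccountName: storage-class-setter
      restartPolicy: OnFailure
      containers:
      - name: annotate
        # Illustrative image; any image that ships the oc binary works.
        image: registry.redhat.io/openshift4/ose-cli:latest
        command:
        - oc
        - annotate
        - storageclass
        - ocs-storagecluster-ceph-rbd
        - storageclass.kubernetes.io/is-default-class=true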

Doing that is already possible, but the problem is that the assisted installer will not wait for these jobs to complete before declaring that the cluster is ready. The intent of this ticket is to change the installer so that it will wait.

Epic Goal

  • Scrape Profiles was introduced as Tech Preview in 4.13; the goal is now to promote it to GA
  • The Scrape Profiles Enhancement Proposal should be merged
  • OpenShift developers that want to adopt the feature should have the necessary tooling and documentation on how to do so
  • OpenShift CI should validate, where possible, changes in profiles that might break a profile or cluster functionality

This has no link to a planning session, as this predates our Epic workflow definition.

Why is this important?

  • Enables users to minimize the resource overhead for Monitoring.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://issues.redhat.com/browse/MON-2483

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Provide precise and automatically tested examples/use cases for how to use the various monitoring APIs.

Why is this important?

  • Currently, the doc only provides examples for some use cases.
  • The provided examples assume the user has access to the API.
  • The provided examples are not regularly tested and can diverge.
  • The rest of the examples are split across multiple KCS articles, which makes them hard to find; those examples are also not regularly tested.
  • Support requests about a procedure or a permission not working are time consuming.

Scenarios

  1. ...

Acceptance Criteria

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
  • Utilities that will run the procedures/examples as CMO e2e tests
  • Utilities to have those procedures/examples as scripts/code blocks.

The history of this epic starts with this PR, which triggered a lengthy conversation around the workings of the image API with respect to importing imagestream images as single-manifest vs manifest-listed. Imagestreams today have the `importMode` flag set to `Legacy` by default to avoid breaking the behavior of existing clusters in the field. This makes sense for single-arch clusters deployed with a single-arch payload, but when users migrate to the multi payload, more often than not their intent is to add nodes of other architecture types. When this happens, it gives rise to problems when using imagestreams with the default behavior of importing a single-manifest image. The oc commands do have a new flag to toggle the importMode, but this breaks functionality for existing users who just want to create an imagestream and use it with existing commands.

There was a discussion with David Eads and other staff engineers, and it was decided that the approach to be taken is to default imagestreams' importMode to `preserveOriginal` if the cluster is installed with or upgraded to a multi payload. So a few things need to happen to achieve this:

  • CVO would need to expose a field in the status section indicative of the type of payload in the cluster (single vs multi)
  • cluster-openshift-apiserver-operator would read this field and add it to the apiserver configmap. openshift-apiserver would use this value to determine the setting of importMode value.
  • Document clearly that the behavior of imagestreams in a cluster with multi payload is different from the traditional single payload

Some open questions:

  • What happens to existing imagestreams on upgrades
  • How do we handle CVO managed imagestreams (IMO, CVO managed imagestreams should always set importMode to preserveOriginal as the images are associated with the payload)

 

This change enables setting the import mode through the image config API, which is then synced to the apiserver's observed config, allowing the apiserver to set the import mode based on this value. The import mode in the observed config is also populated by default based on the payload type:

  • single => Legacy
  • multi => PreserveOriginal

poc: https://github.com/Prashanth684/api/commit/c660fba709b71a884d0fc96dd007581a25d2d17a
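
For reference, a minimal sketch of where the per-tag import mode surfaces on an ImageStream today (field names from the image.openshift.io/v1 API; the image reference is illustrative). The change described here is about defaulting this value from the payload type rather than requiring users to set it explicitly:

apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: example
spec:
  tags:
  - name: latest
    from:
      kind: DockerImage
      name: quay.io/example/app:latest
    importPolicy:
      # Legacy imports a single sub-manifest; PreserveOriginal keeps the whole manifest list.
      importMode: PreserveOriginal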

For the apiserver operator to figure out the payload type and set the import mode defaults, the CVO needs to expose that value through the status field. This information is available today in the conditions list, but extracting it and inferring the payload type is awkward because it is embedded in the message string. The way to do it today is shown here. It would be better for CVO to expose it as a separate field that can be easily consumed by any controller and also be used for telemetry in the future.
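
As a purely hypothetical illustration of that proposal (the field name and placement are open questions, not decided by this card), a consuming controller would ideally be able to read something like the following from the ClusterVersion status instead of parsing a condition message:

# Hypothetical sketch only; the exact field name and location are not defined yet.
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  name: version
status:
  desired:
    version: 4.18.0
    architecture: Multi   # hypothetical: would indicate a multi-arch payload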

 

Track improvements to IPI on Power VS made in the 4.18 release cycle.

Epic Goal

Bump vendored Kubernetes packages (k8s.io/api, k8s.io/apimachinery, k8s.io/client-go, etc.) to v0.31.0 or newer version.

Why is this important?

Keep vendored packages up to date.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • -Release Technical Enablement - Provide necessary release enablement
    details and documents.-

Dependencies (internal and external)

1. Other vendored dependencies (such as openshift/api and controller-runtime) may also need to be updated to Kubernetes 1.31.

Previous Work (Optional)

1. We tracked these bumps as bugs in the past. For example, for OpenShift 4.17 and Kubernetes 1.30: OCPBUGS-38079, OCPBUGS-38101, and OCPBUGS-38102.

Open questions

None.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem

The openshift/cluster-ingress-operator repository vendors k8s.io/* v0.30.2. OpenShift 4.18 is based on Kubernetes 1.31.

Version-Release number of selected component (if applicable)

4.18.

How reproducible

Always.

Steps to Reproduce

Check https://github.com/openshift/cluster-ingress-operator/blob/release-4.18/go.mod.

Actual results

The k8s.io/* packages are at v0.30.2.

Expected results

The k8s.io/* packages are at v0.31.0 or newer.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Stop setting `-cloud-provider` and `-cloud-config` arguments on KAS, KCM and MCO
  • Remove `CloudControllerOwner` condition from CCM and KCM ClusterOperators
  • Remove feature gating reliance in library-go IsCloudProviderExternal
  • Remove CloudProvider feature gates from openshift/api

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

OCPCLOUD-2514 prevented feature gates from being used with the CCMs.
We have been asked not to remove the feature gates themselves until 4.18.

PR to track: https://github.com/openshift/api/pull/1780

We should remove the reliance on the feature gate from this part of the code and clean up references to feature gate access at the call sites.

Steps

  • Update library go to remove reliance on feature gates
  • Update callers to no longer rely on feature gate accessor (KCMO, KASO, MCO, CCMO)
  • Remove feature gates from API repo

Stakeholders

  • Cluster Infra
  • MCO team
  • Workloads team
  • API server team

Definition of Done

  • Feature gates for external cloud providers are removed from the product
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Goal

  • Stabilize the new Kernel RT CI lanes (go green)
  • Transfer Kernel RT blocking lanes from GCP to the new EC2 Metal lanes

Why is this important?

  • The new EC2 metal-based lanes are a better representation of how customers will use the real-time kernel and as such should replace the outdated lanes that were built on virtualized GCP hardware. 

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully (Green)

Dependencies (internal and external)

  1.  

Previous Work (Optional):

Open questions:

None

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Problem:

ClusterTask has been deprecated and will be removed in Pipelines Operator 1.17. In the console UI, we have a ClusterTask list page, and ClusterTasks are also listed in the Tasks quick search in the Pipeline builder form.

Goal:

Remove ClusterTask and references from the console UI and use Tasks from `openshift-pipelines` namespace.

Why is it important?

Use cases:

  1. <case>

Acceptance criteria:

  1. Remove the ClusterTasks tab and the list page
  2. Remove ClusterTasks from Tasks quick search
  3. List Tasks from the `openshift-pipelines` namespace in the Tasks quick search
  4. Users should be able to create pipelines using the tasks from `openshift-pipelines` namespace in the Pipeline builder.
  5. Remove the ClusterTasks tab, list page, and Task quick search entries from the static plugin only if the Pipelines operator 1.17 is installed.
  6. Backport the static plugin changes to the previous OCP version supported by 1.17

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Resolver in Tekton https://tekton.dev/docs/pipelines/resolution-getting-started/ 

Task resolution: https://tekton.dev/docs/pipelines/cluster-resolver/#task-resolution 

Note:

Description of problem:

Locally, after setting the flags, we can see the Community tasks. After the change in the PR, ClusterTasks are removed and Community tasks can't be seen even after setting the flag.
    

Version-Release number of selected component (if applicable):


    

How reproducible:

Every time
    

Steps to Reproduce:

    1. Set the window.SERVER_FLAGS.GOARCH and window.SERVER_FLAGS.GOOS
    2. Go to the pipeline builder page
    3.  
    

Actual results:

You can't see any tasks     

Expected results:

Community tasks should appear after setting the flag
    

Additional info:


    

Description

ClusterTask has been deprecated and will be removed in Pipelines Operator 1.17

We have to use Tasks from the `openshift-pipelines` namespace. This change will happen in the console-plugin repo (dynamic plugin), so in the console repository we have to remove all dependencies on ClusterTask if the Pipelines Operator is 1.17 or above.

Acceptance Criteria

  1. Remove ClusterTask list page in search menu
  2. Remove ClusterTask list page tab in Tasks navigation menu
  3. ClusterTask to be removed from quick search in Pipelines builder
  4. Update the test cases (can we remove the ClusterTask tests for Pipelines 1.17 and above?)

Additional Details:

Problem:

Goal:

Acceptance criteria:

  1. Move the PipelinesBuilder to the dynamic plugin
  2. The Pipeline Builder should work without and with the new dynamic plugin

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description of problem:

Add a flag to disallow the Pipeline edit URL in the console pipelines-plugin so that it will not conflict between the console and the Pipelines console-plugin

Description of problem:

    Add a disallowed flag to hide the pipelines-plugin Pipeline builder route, add action, and catalog provider extension, as they have been migrated to the Pipelines console-plugin, so that there is no duplicate action in the console

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

Simplify the resolv-prepender process to eliminate consistently problematic aspects. Most notably this will include reducing our reliance on the dispatcher script for proper configuration of /etc/resolv.conf by replacing it with a systemd watch on the /var/run/NetworkManager/resolv.conf file.

Why is this important?

Over the past five years or so of on-prem networking, the resolv-prepender script has consistently been a problem. Most of these problems relate to the fact that it is triggered as a NetworkManager dispatcher script, which has proven unreliable, despite years of playing whack-a-mole with various bugs and misbehaviors. We believe there is a simpler, less bug-prone way to do this that will both improve the user experience and reduce the bug load from this particular area of on-prem networking.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement
    details and documents.

...

Dependencies (internal and external)

1.

...

Previous Work (Optional):

1. …

Open questions::

1. …

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As a followup to https://issues.redhat.com/browse/OPNET-568 we should remove the triggering of resolv-prepender from the dispatcher script. There is still some other functionality in the dispatcher script that we need to keep, but once we have the systemd watch it won't be necessary to trigger the service from the script at all.

This is being tracked separately because it is a more invasive change than just adding the watch, so we probably don't want to backport it.

Template:

Networking Definition of Planned

Epic Template descriptions and documentation

Epic Goal

The goal is to create a set of E2E tests (in the o/origin repository) testing keepalived and haproxy.

Why is this important?

Based on past experience, implementing this will be extremely helpful to various teams when debugging networking (and other) issues.

Network stack is complex and currently debugging keepalived relies mostly on parsing log lines.

Planning Done Checklist

The following items must be completed on the Epic prior to moving the Epic from Planning to the ToDo status

  • Priority is set by engineering
  • Epic must be linked to a Parent Feature
  • Target version must be set
  • Assignee must be set
  • Enhancement Proposal is Implementable
  • No outstanding questions about major work breakdown
  • Are all Stakeholders known? Have they all been notified about this item?
  • Does this epic affect SD? Have they been notified? (View plan definition for current suggested assignee)
    1. Please use the “Discussion Needed: Service Delivery Architecture Overview” checkbox to facilitate the conversation with SD Architects. The SD architecture team monitors this checkbox which should then spur the conversation between SD and epic stakeholders. Once the conversation has occurred, uncheck the “Discussion Needed: Service Delivery Architecture Overview” checkbox and record the outcome of the discussion in the epic description here.
    2. The guidance here is that unless it is very clear that your epic doesn’t have any managed services impact, default to use the Discussion Needed checkbox to facilitate that conversation.

Additional information on each of the above items can be found here: Networking Definition of Planned

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Dependencies (internal and external)

-

Previous Work (Optional):

-

Open questions::

-

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

We had an incident (outage) in the past where OSUS impacted other applications running in that multi-tenant environment along with itself. Refer to [1][2] for more details.

We initially created all Jira cards as part of OTA-552, but the epic grew very large, so we are moving some cards to this epic. The associated Jira cards are created to improve the ability of OSUS to handle more requests without causing issues with other applications in a multi-tenant environment.

 [1]https://docs.google.com/document/d/1saZDbZTraComFUFsDehCSPhyuuj6uEsy0s9c1pSo3aI/edit#heading=h.egy1agkrq2v1

 [2]https://docs.google.com/document/d/1wxXTiXLK8v7JuwOnm7Jte5-jU50qAhEljFTsIexM6Ho/edit#heading=h.lzjvop3jozc5

Why is this important?

  • ...

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Description of problem

Update advice is append-only, with 4.y.z releases being added to channels regularly, and new update risks being declared occasionally. This makes caching a very safe behavior, and client-side caching in the CVO would reduce the disruption caused by OpenShift Update Service (OSUS) outages like OTA-1376.

Version-Release number of selected component

A single failed update-service retrieval currently clears the cache in 4.18. The code is pretty old, so I expect this behavior goes back through 4.12, our oldest release that's not yet end-of-life.

How reproducible

Every time.

Steps to Reproduce

1. Run a happy cluster with update advice.
2. Break the update service, e.g. by using OTA-520 for a mock update service.
3. Wait a few minutes for the cluster to notice the breakage.
4. Check its update recommendations, with oc adm upgrade or the new-in-4.18 oc adm upgrade recommend.

Actual results

No recommendations while the cluster is RetrievedUpdates=False.

Expected results

Preserving the cached recommendations while the cluster is RetrievedUpdates=False, at least for 24 hours. I'm not committed to a particular time, but 24h is much larger than any OSUS outage we've ever had, and still not so long that we'd expect much in the way of recommendation changes if the service had remained healthy.

Description

This is a placeholder epic to group refactoring and maintenance work required in the monitoring plugin

Background

In order to provide customers the option to process alert data externally, we need to provide a way for the data to be downloaded from the OpenShift console. The monitoring plugin uses a Virtualized table from the dynamic plugin SDK. We should include the change in this table so it is available for others.

Outcomes

  • a CSV can be downloaded from the alerts table, including the alert labels, severity and state (firing)

 

--- 

NOTE: 

There is a duplicate issue in the OpenShift console board: https://issues.redhat.com/browse/CONSOLE-4185

This is because the console > CI/CD > prow configurations require that any PR in the openshift/console repo have an associated Jira issue in the openshift console Jira board.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Given hostedcluster:hypershift_cluster_vcpus:max now exists, we need to use it to derive a vCPU-hours metric.

Related slack thread: https://redhat-internal.slack.com/archives/C0493H149DK/p1719329224733099?thread_ts=1719252265.181669&cid=C0493H149DK

Draft recording rule:

record:  hostedcluster:hypershift_cluster_vcpus:vcpu_hours
expr: max by(_id)(count_over_time(hostedcluster:hypershift_cluster_vcpus:max[1h:5m])) / scalar(count_over_time(vector(1)[1h:5m])) 
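
A sketch of how that draft could be packaged as a Prometheus recording-rule group; the expression is the draft above, unchanged, while the group name and eventual file placement are illustrative:

groups:
- name: hypershift.vcpu.rules   # illustrative group name
  rules:
  - record: hostedcluster:hypershift_cluster_vcpus:vcpu_hours
    expr: |
      max by(_id)(count_over_time(hostedcluster:hypershift_cluster_vcpus:max[1h:5m]))
        / scalar(count_over_time(vector(1)[1h:5m]))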

In order to simplify querying a ROSA cluster's effective vCPU-hours, create a consolidated metric for ROSA vCPU-hours

Related slack thread: https://redhat-internal.slack.com/archives/C0493H149DK/p1719329224733099?thread_ts=1719252265.181669&cid=C0493H149DK

Draft recording rule:

record:  rosa:cluster:vcpu_hours
expr: (hostedcluster:hypershift_cluster_vcpus:vcpu_hours or on (_id) cluster:usage:workload:capacity_virtual_cpu_hours)

OKD updates the samples more or less independently from OCP. It would be good to add support for this in library-sync.sh so that OCP and OKD don't "step on each other's toes" when doing the updates.

library-sync.sh should accept a parameter, say --okd, that, when set, will update only the OKD samples (all of them, because we don't have unsupported samples in OKD) and, when not set, will update the supported OCP samples.

The samples need to be resynced for OCP 4.18. Pay attention to only update the OCP samples. OKD does this independently.

Note that the Rails templates are currently out of sync with upstream, so care needs to be taken to not mess those up by adopting the upstream version again.

Epic Goal

  • Create an E2E workflow (core/default) to validate the OpenShift installation with platform External in a well-known cloud provider (AWS).
  • Create native support in the regular/default e2e CI workflow for the platform External installation type
  • Create/reuse infrastructure provisioning steps that use the UPI provisioning flow to test a well-known cloud provider (AWS), supported by CI, with platform External on OpenShift CI, consuming the native e2e workflows and post-analysis tooling (Sippy, JUnit processors, collectors).
  • Implement the E2E described in the documentation shared with partners: https://docs.providers.openshift.org/platform-external/installing/

Epic Non-goal

  • Create workflows using Assisted Installer
  • Create E2E workflows for new providers or ones not supported by the CI infrastructure
  • Expand research to new Cloud Providers
  • Write new e2e tests

Why is this important?

  • Platform External is a native OCP feature introduced in 4.13. Clusters installed with the Platform External type do not have installation automation, requiring a UPI-style flow to provision the infrastructure resources needed to install OpenShift. Currently there is no workflow supporting the default E2E CI step[1], only the OPCT workflow[2]. The E2E workflow has many integrations with the OpenShift CI ecosystem, including results processing and external tooling like Sippy providing feedback.
  • The default tooling we are advising partners to self-run, and self-evaluate with, is OPCT[3]. The tool aims to give our partners quick access to results and post-processing outside the OpenShift CI infra. Having the native E2E workflow and the OPCT workflow running side by side in the OpenShift CI infrastructure would help to:
    • Unblock OCP engineers to implement custom e2e tests using the native conformance workflow and test in OpenShift CI using a well-known provider (AWS)
    • Feed OPCT with native E2E conformance tests executed by CI, allowing comparison/benchmarking of the tool to improve the quality of the results from OPCT and decrease the risk of tool-specific issues.
    • Unblock OCP engineers to implement workflows using OPCT, the same tool used by partners, decreasing the knowledge gap and the need to run the tool manually

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running with the default E2E openshift/conformance/parallel suite in clusters installed on AWS using platform type External
  • CI - MUST be running with the default E2E openshift/conformance/parallel suite in clusters installed on AWS using platform type External and CCM installed on Day-0/1 (as described in the documentation)
  • CI - MUST be running with the OPCT default workflow in clusters installed on AWS using platform type External
  • CI - MUST be running with the OPCT default workflow in clusters installed on AWS using platform type External and CCM installed on Day-0/1 (as described in the documentation)
  •  
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  •  

ATTENTION: this card is blocked by SPLAT-1158 (implementing the workflow proposed in https://docs.providers.openshift.org/platform-external )

Background

As a followup to SPLAT-1158 and SPLAT-1425, we should create a cluster with platform type "External" and workflows/steps/jobs that run on vSphere infrastructure using the regular OpenShift CI e2e workflow, using the provisioning steps proposed in docs.providers (https://docs.providers.openshift.org/platform-external).

There are currently a few platform "External" steps (install) that are associated with vSphere, but supposedly only the OPCT conformance workflow is using them (needs more investigation).

In the ci-operator, these should be used as a reference for building a new test that will deploy OpenShift on vSphere using platform "External" with and without CCM. This will be similar to the vSphere platform "None" (and platform "External" from SPLAT-1782).

Steps

  • research existing platform External steps and workflows used by OPCT, and whether any use the regular e2e workflow.
  • create/aggregate steps to deploy a platform External for vSphere
  • create a new workflow/chain to capture a full test run using regular e2e workflow
  • create a new job in ci-operator to run as a periodic test, once a week to begin with
  • review if OPCT jobs needs to be updated
     

Caveats:

Currently there is a workflow "upi-vsphere-platform-external-ccm", but it isn't used for any jobs. On the other hand, there are a few workflows in OPCT Conformance using the step "upi-vsphere-platform-external-ovn-pre" to install a cluster on vSphere using platform type External.

Recently, in SPLAT-1425, the regular e2e step incorporated support for the platform External type; we need to create a workflow consuming the default OCP CI e2e workflow to get signal using the same workflow as the other platforms, one that engineers are familiar with.

Stakeholders

  • openshift eng

Definition of Done

  • platform External test is passing on vSphere
  • Docs
  • n/a
  • Testing
  • n/a

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • We want to remove official UPI and IPI support for the Alibaba Cloud provider. Going forward, we are recommending installations on Alibaba Cloud with either the external platform or the agnostic platform installation method.

Why is this important?


We have decided to remove IPI and UPI support for Alibaba Cloud, which until recently has been in Tech Preview, due to the following reasons:

(1) Low customer interest of using Openshift on Alibaba Cloud

(2) Removal of Terraform usage

(3) MAPI to CAPI migration

(4) CAPI adoption for installation (Day 1) in OpenShift 4.16 and Day 2 (post-OpenShift 4.16)

Scenarios

Impacted areas based on CI:

alibaba-cloud-csi-driver/openshift-alibaba-cloud-csi-driver-release-4.16.yaml
alibaba-disk-csi-driver-operator/openshift-alibaba-disk-csi-driver-operator-release-4.16.yaml
cloud-provider-alibaba-cloud/openshift-cloud-provider-alibaba-cloud-release-4.16.yaml
cluster-api-provider-alibaba/openshift-cluster-api-provider-alibaba-release-4.16.yaml
cluster-cloud-controller-manager-operator/openshift-cluster-cloud-controller-manager-operator-release-4.16.yaml
machine-config-operator/openshift-machine-config-operator-release-4.16.yaml

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI jobs are removed
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>


*USER STORY:*


As a [type of user], I want [an action] so that [a benefit/a value].

*DESCRIPTION:*


*Required:*

...

*Nice to have:*

...

*ACCEPTANCE CRITERIA:*


*ENGINEERING DETAILS:*


Epic Goal

  • Update all images that we ship with OpenShift to the latest upstream releases and libraries.
  • The exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. In particular, new CSI driver releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in previous releases. Of course, QE or docs can reject an update if it's too close to the deadline and/or looks too big.

Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).

Why is this important?

  • We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Update OCP release number in OLM metadata manifests of:

  • local-storage-operator
  • aws-efs-csi-driver-operator
  • gcp-filestore-csi-driver-operator
  • secrets-store-csi-driver-operator
  • smb-csi-driver-operator

OLM metadata of the operators is typically in the /config/manifest directory of each operator. Example of such a bump: https://github.com/openshift/aws-efs-csi-driver-operator/pull/56

We should do it early in the release, so QE can identify new operator builds easily and they are not mixed with the old release.

Update all CSI sidecars to the latest upstream release from https://github.com/orgs/kubernetes-csi/repositories

  • external-attacher
  • external-provisioner
  • external-resizer
  • external-snapshotter
  • node-driver-registrar
  • livenessprobe

Corresponding downstream repos have `csi-` prefix, e.g. github.com/openshift/csi-external-attacher.

This includes an update of the VolumeSnapshot CRDs in the cluster-csi-snapshot-controller-operator assets and of the client API in go.mod, i.e. copy all snapshot CRDs from upstream to the operator assets and run go get -u github.com/kubernetes-csi/external-snapshotter/client/v6 in the operator repo.

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

https://github.com/openshift/vmware-vsphere-csi-driver

Update all OCP and Kubernetes libraries in storage operators to the appropriate version for the OCP release.
Please wait until openshift/api, openshift/library-go, and openshift/client-go are updated to the newest Kubernetes release! There may be non-trivial changes in these libraries.

This includes (but is not limited to):

  • Kubernetes:
    • client-go
    • controller-runtime
  • OCP:
    • library-go
    • openshift/api
    • openshift/client-go
    • operator-sdk

Operators:

  • csi-operator
  • openstack-cinder-csi-driver-operator
  • gcp-pd-csi-driver-operator
  • gcp-filestore-csi-driver-operator
  • csi-driver-manila-operator
  • vmware-vsphere-csi-driver-operator
  • ibm-vpc-block-csi-driver-operator
  • ibm-powervs-block-csi-driver-operator
  • secrets-store-csi-driver-operator
  • cluster-storage-operator
  • cluster-csi-snapshot-controller-operator
  • local-storage-operator
  • vsphere-problem-detector

(please cross-check with *-operator + vsphere-problem-detector in our tracking sheet)

EOL, do not upgrade:

  • github.com/oVirt/csi-driver-operator
  • github.com/openshift/alibaba-disk-csi-driver-operator
  • github.com/openshift/csi-driver-shared-resource-operator

The following operators were migrated to csi-operator, do not update these obsolete repos:

  • github.com/openshift/aws-efs-csi-driver-operator
  • github.com/openshift/azure-disk-csi-driver-operator
  • github.com/openshift/azure-file-csi-driver-operator

tools/library-bump.py  and tools/bump-all  may be useful. For 4.16, this was enough:

mkdir 4.16-bump
cd 4.16-bump
../library-bump.py --debug --web <file with repo list> STOR-1574 --run "$PWD/../bump-all github.com/google/cel-go@v0.17.7" --commit-message "Bump all deps for 4.16" 

4.17 perhaps needs an older prometheus:

../library-bump.py --debug --web <file with repo list> STOR-XXX --run "$PWD/../bump-all github.com/google/cel-go@v0.17.8 github.com/prometheus/common@v0.44.0 github.com/prometheus/client_golang@v1.16.0 github.com/prometheus/client_model@v0.4.0 github.com/prometheus/procfs@v0.10.1" --commit-message "Bump all deps for 4.17" 

4.18 special:

Add "spec.unhealthyEvictionPolicy: AlwaysAllow" to all PodDisruptionBudget objects of all our operators + operands. See WRKLDS-1490 for details

There has been change in library-go function called `WithReplicasHook`. See https://github.com/openshift/library-go/pull/1796.

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

https://github.com/openshift/azure-file-csi-driver/

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

https://github.com/openshift/aws-ebs-csi-driver

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

https://github.com/openshift/gcp-pd-csi-driver

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

https://github.com/openshift/azure-disk-csi-driver

This epic is part of the 4.18 initiatives we discussed, it includes:

  1. Expanding external testing sources beyond openshift/kubernetes
  2. Test graduation from informing -> blocking
  3. Enforcing 95% pass rate on newly added tests to OCP in Component Readiness
  4. Finding regressions in tests for low frequency but high importance variants

Once we have an MVP of openshift-tests-extension, migrate k8s-tests in openshift/kubernetes to use it.

Other Complete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled

Description of problem:

    When IBM Cloud Infrastructure bugs/outages prevent proper cleanup of resources, they can prevent the deletion of the Resource Group during cluster destroy. The errors returned because of this are not always helpful and can be confusing.

Version-Release number of selected component (if applicable):

    4.16 (and earlier)

How reproducible:

    80% when IBM Cloud Infrastructure experiences issues

Steps to Reproduce:

    1. When there is a known issue with IBM Cloud Infrastructure (COS, Block Storage, etc.), create an IPI cluster on IBM Cloud
    2. Destroy the cluster
    

Actual results:

    WARNING Failed to delete resource group us-east-block-test-2-d5ssx: Resource groups with active or pending reclamation instances can't be deleted. Use the CLI commands "ibmcloud resource service-instances --type all" and "ibmcloud resource reclamations" to check for remaining instances, then delete the instances and try again.

Expected results:

    More descriptive details on the blocking resource service-instances (not always storage reclamation related). Potentially something helpful to provide to IBM Cloud Support for assistance.

Additional info:

    IBM Cloud is working on a PR to help enhance the debug details when these kinds of errors occur.
At this time, an ongoing issue, https://issues.redhat.com/browse/OCPBUGS-28870, is causing these failures; this additional debug information can help identify the problem and guide IBM Cloud Support to resolve it. However, this information does not resolve that bug (which is an Infrastructure bug).

Description of problem:

The created Node ISO is missing the architecture (<arch>) in its filename, which breaks consistency with other generated ISOs such as the Agent ISO.

Version-Release number of selected component (if applicable):

4.17

How reproducible:

100%    

Actual results:

Currently, the Node ISO is being created with the filename node.iso.

Expected results:

Node ISO should be created as node.<arch>.iso to maintain consistency.

Description of problem:

The network-status annotation includes multiple default:true entries for OVN's UDN    

Version-Release number of selected component (if applicable):

    4.17+

How reproducible:

    Always

Steps to Reproduce:

    1. Use UDN
    2. View network-status annotation, see multiple default:true entries
    

Actual results:

multiple default:true entries

Expected results:

a single default:true entry

Description of problem:

On the Route create page, the Hostname field has id "host", and the Service name field has id "toggle-host", which should be "toggle-service".
    

Version-Release number of selected component (if applicable):

 4.17.0-0.nightly-2024-09-13-193731
    

How reproducible:

Always
    

Steps to Reproduce:

    1.Check hostname and service name elements for route creation page,
    2.
    3.
    

Actual results:

1. Service name field has id "toggle-host".
screenshot: https://drive.google.com/file/d/1qkUhhzUPsfFw_o2Gj8XXr9QCISH3g1rK/view?usp=drive_link
    

Expected results:

1. The id should be "toggle-service".
    

Additional info:


    

Description of problem:

 user is unable to switch to other projects successfully on network policies list page   

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-08-27-051932    

How reproducible:

Always    

Steps to Reproduce:

    1. cluster-admin or normal user visit network policies list page via Networking -> NetworkPolicies
    2. open project dropdown and choose different project
    3.
    

Actual results:

2. user is unable to switch to other project successfully   

Expected results:

2. The user should be able to switch projects whenever a different project is selected

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-kube-apiserver-operator/pull/1730

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

During the build02 update from 4.14.0-ec.1 to ec.2 I have noticed the following:


$ b02 get clusterversion version -o json | jq '.status.conditions[] | select (.type=="Failing")'
{
  "lastTransitionTime": "2023-06-20T13:40:12Z",
  "message": "Multiple errors are preventing progress:\n* Cluster operator authentication is updating versions\n* Could not update customresourcedefinition \"alertingrules.monitoring.openshift.io\" (512 of 993): the object is invalid, possibly due to local cluster configuration",
  "reason": "MultipleErrors",
  "status": "True",
  "type": "Failing"
}

There is a valid error (the Could not update customresourcedefinition... one) but the whole thing is cluttered by the "Cluster operator authentication is updating versions" message, which is imo not a legit reason for Failing=True condition and should not be there. Before I captured this one I saw the message with three operators instead of just one.

Version-Release number of selected component (if applicable):

4.14.0-ec.2

How reproducible:

No idea

Description of problem:

When using an installer with an amd64 payload, configuring the VMs to use aarch64 is possible through the install-config.yaml:

additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: ci.devcluster.openshift.com
compute:
- architecture: arm64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: arm64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3

However, the installation will fail with ambiguous error messages:

ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.build11.ci.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 13.59.207.137:6443: connect: connection refused

The actual error hides in the bootstrap VM's System Log:

Red Hat Enterprise Linux CoreOS 417.94.202407010929-0 4.17

SSH host key: SHA256:Ng1GpBIlNHcCik8VJZ3pm9k+bMoq+WdjEcMebmWzI4Y (ECDSA)

SSH host key: SHA256:Mo5RgzEmZc+b3rL0IPAJKUmO9mTmiwjBuoslgNcAa2U (ED25519)

SSH host key: SHA256:ckQ3mPUmJGMMIgK/TplMv12zobr7NKrTpmj+6DKh63k (RSA)

ens5: 10.29.3.15 fe80::1947:eff6:7e1b:baac

Ignition: ran on 2024/08/14 12:34:24 UTC (this boot)

Ignition: user-provided config was applied

Ignition: warning at $.kernelArguments: Unused key kernelArguments



Release image arch amd64 does not match host arch arm64

ip-10-29-3-15 login: [   89.141099] Warning: Unmaintained driver is detected: nft_compat

    

Version-Release number of selected component (if applicable):

4.16
    

How reproducible:

Use amd64 installer to install a cluster with aarch64 nodes
    

Steps to Reproduce:

    1. download amd64 installer
    2. generate the install-config.yaml
    3. edit install-config.yaml to use aarch64 nodes
    4. invoke the installer
    

Actual results:

installation timed out after ~30mins
    

Expected results:

the installation should fail immediately with a proper error message indicating that the installation is not possible
    

Additional info:

https://redhat-internal.slack.com/archives/C68TNFWA2/p1723640243828379
    

Description of problem:

"Edit Route" from action list doesn't support Form edit.
    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-09-21-014704
    4.17.0-rc.5
    

How reproducible:

Always
    

Steps to Reproduce:

    1.Go to one route detail page, click "Edit Route" from action dropdown list.
    2.
    3.
    

Actual results:

1. It opens YAML tab directly.
    

Expected results:

1. Should support both Form and YAML edit.
    

Additional info:


    

Description of problem:

    A slice of something like

idPointers := make([]*string, len(ids))

should be corrected to 

idPointers := make([]*string, 0, len(ids))

When the initial length is not provided as zero to make during slice creation, the slice is created with the given length (last argument) and filled with the zero value. For instance, _ := make([]int, 5) creates a slice {0, 0, 0, 0, 0}. If this slice is then appended to, rather than having its elements set by index, the zero values are left in place.

1. If we append to the slice, we leave behind the zero values (this could change the behavior of the function that the slice is passed to). It also causes unnecessary allocation.
2. If we don't fill the slice completely (i.e. create a length of 5 and only set 4 elements), the same issue as above comes into play.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Tests with dynamic namespaces in the name break aggregation (and everything else):

 

: [sig-architecture] platform pods in ns/openshift-must-gather-8tbzj that restart more than 2 is considered a flake for now

 

It's only finding 1 of that test and failing aggregation.

Description of problem:

    container_network* metrics disappeared from pods

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-2024-08-13-031847

How reproducible:

    always

Steps to Reproduce:

    1.create a pod
    2.check container_network* metrics from the pod
$oc get --raw /api/v1/nodes/jimabug02-95wr2-worker-westus-b2cpv/proxy/metrics/cadvisor  | grep container_network_transmit | grep $pod_name
 
    

Actual results:

2 It failed to report container_network* metrics

Expected results:

2 It should report container_network* metrics   

Additional info:

This may be a regression issue, we hit it in 4.14 https://issues.redhat.com/browse/OCPBUGS-13741

 

Description of problem:

i18n misses for some provisioner on Create storageclass page

Navigation to Storage -> StorageClasses -> Create StorageClass page 

For Provisioner -> kubernetes.io/glusterfs
Missed: Gluster REST/Heketi URL

For Provisioner -> kubernetes.io/quobyte
Missed: User

For Provisioner -> kubernetes.io/vsphere-volume
Missed: Disk format

For Provisioner -> kubernetes.io/portworx-volume
Missed: Filesystem, Select Filesystem, 

For Provisioner -> kubernetes.io/scaleio
Missed: Reference to a configured Secret object

Missed: Select Provisioner  for placeholder text

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-08-19-002129

How reproducible:

    Always

Steps to Reproduce:

    1. Add  ?pseudolocalization=true&lng=en at the end of URL
    2. Navigation to Storage -> StorageClasses -> Create StorageClass page,click the provisioner dropdown list, choose the provisioner
    3. Check whether the text is in i18n mode
    

Actual results:

    the text is not in i18n mode

Expected results:

    the text should in i18n mode

Additional info:

    

As of now, it is possible to set different architectures for the compute machine pools when both the 'worker' and 'edge' machine pools are defined in the install-config.

Example:

compute:
- name: worker
  architecture: arm64
...
- name: edge
  architecture: amd64
  platform:
    aws:
      zones: ${edge_zones_str}

See https://github.com/openshift/installer/blob/master/pkg/types/validation/installconfig.go#L631

Description of problem:

Log in to the admin console and go to the "Observe -> Metrics" page; there is one additional and useless button to the left of the "Actions" button. See picture: https://drive.google.com/file/d/11CxilYmIzRyrcaISHje4QYhMsx9It3TU/view?usp=drive_link

According to 4.17, the button is for the Refresh interval, but it fails to load

NOTE: same issue for the developer console

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-10-07-200953

How reproducible:

always

Steps to Reproduce:

1. login admin/developer console, go to "Observe -> Metrics" page     

Actual results:

Refresh interval button on "Observe -> Metrics" page failed to load

Expected results:

no error

Additional info:

    

Description of problem:

The samples operator sync for OCP 4.18 includes an update to the ruby imagestream. This removes EOLed versions of Ruby and upgrades the images to be ubi9 based
    

Version-Release number of selected component (if applicable):

4.18
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Run build suite tests
    2.
    3.
    

Actual results:

Tests fail trying to pull image. Example: Error pulling image "image-registry.openshift-image-registry.svc:5000/openshift/ruby:3.0-ubi8": initializing source docker://image-registry.openshift-image-registry.svc:5000/openshift/ruby:3.0-ubi8: reading manifest 3.0-ubi8 in image-registry.openshift-image-registry.svc:5000/openshift/ruby: manifest unknown
    

Expected results:

Builds can pull image, and the tests succeed.
    

Additional info:

As part of the continued deprecation of the Samples Operator, these tests should create their own Ruby imagestream that is kept current.
    

Description of problem:

The example fails in the CI of the Samples Operator because it references a base image (perl:5.30-el7) that is no longer available in the OpenShift library.

This needs to be fixed to unblock the release of the Samples Operator for OCP 4.17.

There are essentially 2 ways to fix this:

1. Fix the Perl test template to reference a Perl image available in the OpenShift library.
2. Remove the test (which might be OK because the template seems to actually only be used in the tests).

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

 

Actual results:

    

Expected results:

    

Additional info:

The test breaks here: https://github.com/openshift/origin/blob/master/test/extended/image_ecosystem/s2i_perl.go#L78

and the line in the test template that specifies the outdated Perl image is here: https://github.com/openshift/origin/blob/master/test/extended/testdata/image_ecosystem/perl-hotdeploy/perl.json#L50

Description of problem:

When we enable OCB in the worker pool and a new image is built, it takes about 10-20 minutes after the builder pod has finished building the image before the new image starts being applied on the first node.

Version-Release number of selected component (if applicable):

The issue was found while pre-merge verifying https://github.com/openshift/machine-config-operator/pull/4395

How reproducible:

Always

Steps to Reproduce:

1. Enable techpreview 
2. Create this MOSC

oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineOSConfig
metadata:
  name: worker
spec:
  buildOutputs:
    currentImagePullSecret:
      name: $(oc get -n openshift-machine-config-operator sa default -ojsonpath='{.secrets[0].name}')
  machineConfigPool:
    name: worker
  buildInputs:
    imageBuilder:
      imageBuilderType: PodImageBuilder
    baseImagePullSecret:
      name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
    renderedImagePushSecret:
      name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
    renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest"
    containerFile:
    - containerfileArch: noarch
      content: |-
        # Pull the centos base image and enable the EPEL repository.
        FROM quay.io/centos/centos:stream9 AS centos
        RUN dnf install -y epel-release

        # Pull an image containing the yq utility.
        FROM docker.io/mikefarah/yq:latest AS yq

        # Build the final OS image for this MachineConfigPool.
        FROM configs AS final

        # Copy the EPEL configs into the final image.
        COPY --from=yq /usr/bin/yq /usr/bin/yq
        COPY --from=centos /etc/yum.repos.d /etc/yum.repos.d
        COPY --from=centos /etc/pki/rpm-gpg/RPM-GPG-KEY-* /etc/pki/rpm-gpg/

        # Install cowsay and ripgrep from the EPEL repository into the final image,
        # along with a custom cow file.
        RUN sed -i 's/\$stream/9-stream/g' /etc/yum.repos.d/centos*.repo && \
            rpm-ostree install cowsay ripgrep
EOF

Actual results:

The machine-os-builder pod is created, then the build pod is created, the image is built, and then it takes about 10-20 minutes before the new build starts being applied on the first node.


Expected results:

After MCO finishes building the image it should not take 10-20 minutes to start applying the image on the first node.

Additional info:


Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/129

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This ticket was created by ART pipeline run sync-ci-images

Description of problem:

    Circular dependencies in OCP Console prevent migration to Webpack 5

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    Always

Steps to Reproduce:

    1. Enable the CHECK_CYCLES env var while building
    2. Observe errors
    3.
    

Actual results:

    There are errors

Expected results:

    No errors

Additional info:

 

These are the cycles I can observe in public:

 webpack compilation dbe21e029f8714842299

41 total cycles, 26 min-length cycles (A -> B -> A)

Cycle count per directory:
  public (41)

Index files occurring within cycles:
  public/components/secrets/create-secret/index.tsx (9)
  public/components/utils/index.tsx (4)
  public/module/k8s/index.ts (2)
  public/components/graphs/index.tsx (1)

frontend/public/tokener.html
  public/tokener.html
  public/tokener.html

frontend/public/index.html
  public/index.html
  public/index.html

frontend/public/redux.ts
  public/redux.ts
  public/reducers/features.ts
  public/actions/features.ts
  public/redux.ts

frontend/public/co-fetch.ts
  public/co-fetch.ts
  public/module/auth.js
  public/co-fetch.ts

frontend/public/actions/features.ts
  public/actions/features.ts
  public/redux.ts
  public/reducers/features.ts
  public/actions/features.ts

frontend/public/components/masthead.jsx
  public/components/masthead.jsx
  public/components/masthead-toolbar.jsx
  public/components/about-modal.tsx
  public/components/masthead.jsx

frontend/public/components/utils/index.tsx
  public/components/utils/index.tsx
  public/components/utils/kebab.tsx
  public/components/utils/index.tsx

frontend/public/module/k8s/index.ts
  public/module/k8s/index.ts
  public/module/k8s/k8s.ts
  public/module/k8s/index.ts

frontend/public/reducers/features.ts
  public/reducers/features.ts
  public/actions/features.ts
  public/redux.ts
  public/reducers/features.ts

frontend/public/module/auth.js
  public/module/auth.js
  public/co-fetch.ts
  public/module/auth.js

frontend/public/components/cluster-settings/cluster-settings.tsx
  public/components/cluster-settings/cluster-settings.tsx
  public/components/cluster-settings/cluster-operator.tsx
  public/components/cluster-settings/cluster-settings.tsx

frontend/public/components/monitoring/receiver-forms/alert-manager-receiver-forms.tsx
  public/components/monitoring/receiver-forms/alert-manager-receiver-forms.tsx
  public/components/monitoring/receiver-forms/pagerduty-receiver-form.tsx
  public/components/monitoring/receiver-forms/alert-manager-receiver-forms.tsx

frontend/public/components/secrets/create-secret/index.tsx
  public/components/secrets/create-secret/index.tsx
  public/components/secrets/create-secret/utils.ts
  public/components/secret.jsx
  public/components/secrets/create-secret/index.tsx

frontend/public/components/masthead-toolbar.jsx
  public/components/masthead-toolbar.jsx
  public/components/about-modal.tsx
  public/components/masthead.jsx
  public/components/masthead-toolbar.jsx

frontend/public/actions/features.gql
  public/actions/features.gql
  public/actions/features.gql

frontend/public/components/utils/kebab.tsx
  public/components/utils/kebab.tsx
  public/components/utils/index.tsx
  public/components/utils/kebab.tsx

frontend/public/module/k8s/k8s.ts
  public/module/k8s/k8s.ts
  public/module/k8s/index.ts
  public/module/k8s/k8s.ts

frontend/public/module/k8s/swagger.ts
  public/module/k8s/swagger.ts
  public/module/k8s/index.ts
  public/module/k8s/swagger.ts

frontend/public/graphql/client.gql
  public/graphql/client.gql
  public/graphql/client.gql

frontend/public/components/cluster-settings/cluster-operator.tsx
  public/components/cluster-settings/cluster-operator.tsx
  public/components/cluster-settings/cluster-settings.tsx
  public/components/cluster-settings/cluster-operator.tsx

frontend/public/components/monitoring/receiver-forms/pagerduty-receiver-form.tsx
  public/components/monitoring/receiver-forms/pagerduty-receiver-form.tsx
  public/components/monitoring/receiver-forms/alert-manager-receiver-forms.tsx
  public/components/monitoring/receiver-forms/pagerduty-receiver-form.tsx

frontend/public/components/monitoring/receiver-forms/webhook-receiver-form.tsx
  public/components/monitoring/receiver-forms/webhook-receiver-form.tsx
  public/components/monitoring/receiver-forms/alert-manager-receiver-forms.tsx
  public/components/monitoring/receiver-forms/webhook-receiver-form.tsx

frontend/public/components/monitoring/receiver-forms/email-receiver-form.tsx
  public/components/monitoring/receiver-forms/email-receiver-form.tsx
  public/components/monitoring/receiver-forms/alert-manager-receiver-forms.tsx
  public/components/monitoring/receiver-forms/email-receiver-form.tsx

frontend/public/components/monitoring/receiver-forms/slack-receiver-form.tsx
  public/components/monitoring/receiver-forms/slack-receiver-form.tsx
  public/components/monitoring/receiver-forms/alert-manager-receiver-forms.tsx
  public/components/monitoring/receiver-forms/slack-receiver-form.tsx

frontend/public/components/secrets/create-secret/utils.ts
  public/components/secrets/create-secret/utils.ts
  public/components/secret.jsx
  public/components/secrets/create-secret/index.tsx
  public/components/secrets/create-secret/utils.ts

frontend/public/components/secrets/create-secret/CreateConfigSubform.tsx
  public/components/secrets/create-secret/CreateConfigSubform.tsx
  public/components/secrets/create-secret/index.tsx
  public/components/secrets/create-secret/CreateConfigSubform.tsx

frontend/public/components/secrets/create-secret/UploadConfigSubform.tsx
  public/components/secrets/create-secret/UploadConfigSubform.tsx
  public/components/secrets/create-secret/index.tsx
  public/components/secrets/create-secret/UploadConfigSubform.tsx

frontend/public/components/secrets/create-secret/WebHookSecretForm.tsx
  public/components/secrets/create-secret/WebHookSecretForm.tsx
  public/components/secrets/create-secret/index.tsx
  public/components/secrets/create-secret/WebHookSecretForm.tsx

frontend/public/components/secrets/create-secret/SSHAuthSubform.tsx
  public/components/secrets/create-secret/SSHAuthSubform.tsx
  public/components/secrets/create-secret/index.tsx
  public/components/secrets/create-secret/SSHAuthSubform.tsx

frontend/public/components/secrets/create-secret/GenericSecretForm.tsx
  public/components/secrets/create-secret/GenericSecretForm.tsx
  public/components/secrets/create-secret/index.tsx
  public/components/secrets/create-secret/GenericSecretForm.tsx

frontend/public/components/secrets/create-secret/KeyValueEntryForm.tsx
  public/components/secrets/create-secret/KeyValueEntryForm.tsx
  public/components/secrets/create-secret/index.tsx
  public/components/secrets/create-secret/KeyValueEntryForm.tsx

frontend/public/components/secrets/create-secret/CreateSecret.tsx
  public/components/secrets/create-secret/CreateSecret.tsx
  public/components/secrets/create-secret/index.tsx
  public/components/secrets/create-secret/CreateSecret.tsx

frontend/public/components/secrets/create-secret/SecretSubForm.tsx
  public/components/secrets/create-secret/SecretSubForm.tsx
  public/components/secrets/create-secret/index.tsx
  public/components/secrets/create-secret/SecretSubForm.tsx

frontend/public/components/about-modal.tsx
  public/components/about-modal.tsx
  public/components/masthead.jsx
  public/components/masthead-toolbar.jsx
  public/components/about-modal.tsx

frontend/public/components/graphs/index.tsx
  public/components/graphs/index.tsx
  public/components/graphs/status.jsx
  public/components/graphs/index.tsx

frontend/public/components/modals/error-modal.tsx
  public/components/modals/error-modal.tsx
  public/components/utils/index.tsx
  public/components/utils/webhooks.tsx
  public/components/modals/error-modal.tsx

frontend/public/components/image-stream.tsx
  public/components/image-stream.tsx
  public/components/image-stream-timeline.tsx
  public/components/image-stream.tsx

frontend/public/components/graphs/status.jsx
  public/components/graphs/status.jsx
  public/components/graphs/index.tsx
  public/components/graphs/status.jsx

frontend/public/components/build-pipeline.tsx
  public/components/build-pipeline.tsx
  public/components/utils/index.tsx
  public/components/utils/build-strategy.tsx
  public/components/build.tsx
  public/components/build-pipeline.tsx

frontend/public/components/build-logs.jsx
  public/components/build-logs.jsx
  public/components/utils/index.tsx
  public/components/utils/build-strategy.tsx
  public/components/build.tsx
  public/components/build-logs.jsx

frontend/public/components/image-stream-timeline.tsx
  public/components/image-stream-timeline.tsx
  public/components/image-stream.tsx
  public/components/image-stream-timeline.tsx

    

Description of problem:

The cluster-wide proxy URL is automatically injected into the remote-write config of the Prometheus k8s CR in the openshift-monitoring project, which is expected, but the noProxy URLs are not. As a result, if the remote-write endpoint is within the noProxy range, metrics are not transferred.

Version-Release number of selected component (if applicable):

RHOCP 4.16.4

How reproducible:

100%

Steps to Reproduce:

1. Configure proxy custom resource in RHOCP 4.16.4 cluster
2. Create cluster-monitoring-config configmap in openshift-monitoring project
3. Inject remote-write config (without specifically configuring proxy for remote-write)
4. After saving the modification in the cluster-monitoring-config configmap, check the remoteWrite config in the Prometheus k8s CR. It now contains the proxyUrl but NOT the noProxy URL (referenced from the cluster proxy). Example snippet:
==============
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
[...]
  name: k8s
  namespace: openshift-monitoring
spec:
[...]
  remoteWrite:
  - proxyUrl: http://proxy.abc.com:8080     <<<<<====== Injected Automatically but there is no noProxy URL.
    url: http://test-remotewrite.test.svc.cluster.local:9090
    

Actual results:

The proxy URL from the proxy CR is automatically injected into the Prometheus k8s CR when configuring remoteWrite, but the noProxy list is not inherited from the cluster proxy resource.

Expected results:

The noProxy URL should get injected in Prometheus k8s CR as well.
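For reference, a minimal sketch of what the injected config could look like if the noProxy list were propagated as well (this assumes the Prometheus Operator version shipped with CMO exposes a noProxy field under remoteWrite; the noProxy values below are illustrative, taken from the cluster proxy status):

  remoteWrite:
  - url: http://test-remotewrite.test.svc.cluster.local:9090
    proxyUrl: http://proxy.abc.com:8080
    noProxy: .cluster.local,.svc,localhost,127.0.0.1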

Additional info:

 

Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/819

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    The 'Are you sure?' pop-up windows in the 'Create NetworkPolicy' form -> Policy type section, for both Ingress and Egress, do not close automatically after the user triggers the 'Remove all' action

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-2024-09-09-120947
    4.18.0-0.nightly-2024-09-09-212926

How reproducible:

    Always

Steps to Reproduce:

    1. Navigate to the Networking -> NetworkPolicies page, click the 'Create NetworkPolicy' button, and change to Form view
    2. In the Policy type -> Ingress/Egress section, click the 'Add Ingress rule' button
    3. Click 'Remove all', and trigger the 'Remove all' action in the pop-up window
    

Actual results:

The ingress/egress data is removed, but the pop-up windows are not closed automatically

Expected results:

Compared with the same behavior on OCP 4.16: after the 'Remove all' action is triggered and executed successfully, the window is closed automatically

Additional info:

    

Description of problem:

    As part of https://issues.redhat.com/browse/CFE-811, we added a featuregate "RouteExternalCertificate" to release the feature as TP, and all the code implementations were behind this gate.

However, it seems https://github.com/openshift/api/pull/1731 inadvertently duplicated "ExternalRouteCertificate" as "RouteExternalCertificate".

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    100%

Steps to Reproduce:

   $ oc get featuregates.config.openshift.io cluster -oyaml 
<......>
spec:
  featureSet: TechPreviewNoUpgrade
status:
  featureGates:
    enabled:
    - name: ExternalRouteCertificate
    - name: RouteExternalCertificate
<......>     

Actual results:

    Both RouteExternalCertificate and ExternalRouteCertificate were added in the API

Expected results:

We should have only one featuregate "RouteExternalCertificate" and the same should be displayed in https://docs.openshift.com/container-platform/4.16/nodes/clusters/nodes-cluster-enabling-features.html
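For reference, once the duplicate gate is removed, the status would be expected to list only the single gate, along the lines of:

$ oc get featuregates.config.openshift.io cluster -oyaml 
<......>
status:
  featureGates:
    enabled:
    - name: RouteExternalCertificate
<......>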

Additional info:

 Git commits

https://github.com/openshift/api/commit/11f491c2c64c3f47cea6c12cc58611301bac10b3

https://github.com/openshift/api/commit/ff31f9c1a0e4553cb63c3e530e46a3e8d2e30930

Slack thread: https://redhat-internal.slack.com/archives/C06EK9ZH3Q8/p1719867937186219

Description of problem:

Adding a node with `oc adm node-image` fails:

oc adm node-image monitor --ip-addresses 192.168.250.77
time=2024-10-10T11:31:19Z level=info msg=Monitoring IPs: [192.168.250.77]
time=2024-10-10T11:31:19Z level=info msg=Cannot resolve IP address 192.168.250.77 to a hostname. Skipping checks for pending CSRs.
time=2024-10-10T11:31:19Z level=info msg=Node 192.168.250.77: Assisted Service API is available
time=2024-10-10T11:31:19Z level=info msg=Node 192.168.250.77: Cluster is adding hosts
time=2024-10-10T11:31:19Z level=warning msg=Node 192.168.250.77: Platform OpenStack Compute is allowed only for Single Node OpenShift or user-managed networking

 

Version-Release number of selected component (if applicable):

    4.17.0

How reproducible: Always

Extra information:

The cluster is deployed using Platform:None and userManagedNetworking on an OpenStack cluster which is used as a test bed for the real hardware Agent Based Installer.

Bootstrap of the cluster itself is successful, but adding nodes on day 2 is not working.

During the cluster bootstrap, we see the following log message:

{\"id\":\"valid-platform-network-settings\",\"status\":\"success\",\"message\":\"Platform OpenStack Compute is allowed\"}

So after looking at https://github.com/openshift/assisted-service/blob/master/internal/host/validator.go#L569

we suppose that the error is related to `userManagedNetworking` being set to true when bootstrapping and false when adding a node.

A second, related issue is why the platform is seen as OpenStack, as neither the cluster-config-v1 configmap containing the install-config nor the infrastructure/cluster object mentions OpenStack.

Not sure if this is relevant, but an external CNI plugin is used here; we have networkType: Calico in the install config.

Description of problem:

In CONSOLE-4187, the metrics page was removed from the console, but some related packages (i.e., the codemirror ones) remained, even though they are now unnecessary

 

Version-Release number of selected component (if applicable):

4.18.0    

How reproducible:

    always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

A customer is facing an issue while deploying a Nutanix IPI cluster 4.16.x with DHCP.

ENV DETAILS: Nutanix versions: AOS 6.5.4, NCC 4.6.6.3, PC pc.2023.4.0.2, LCM 3.0.0.1.

During the installation process, after the bootstrap node and control planes are created, the IP addresses on the nodes shown in the Nutanix dashboard conflict, even when infinite DHCP leases are set. The installation only works successfully when using the Nutanix IPAM. The 4.14 and 4.15 releases also install successfully. The IPs of master0 and master2 are conflicting; please check the attachment.

Sos-reports of master0 and master1: https://drive.google.com/drive/folders/140ATq1zbRfqd1Vbew-L_7N4-C5ijMao3?usp=sharing

The issue was reported via the Slack thread: https://redhat-internal.slack.com/archives/C02A3BM5DGS/p1721837567181699

Version-Release number of selected component (if applicable):

    

How reproducible:

Use the OCP 4.16.z installer to create an OCP cluster with Nutanix using DHCP network. The installation will fail. Always reproducible.    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    The installation will fail. 

Expected results:

    The installation succeeds to create a Nutanix OCP cluster with the DHCP network.

Additional info:

    

When provisioning a cluster using IPI with FIPS enabled and using virtual media, IPA fails to boot with FIPS; there is an error in machine-os-images:

 

Oct 29 15:57:19 localhost.localdomain extract-machine-os.sh[3757]: Adding kernel argument ip=dhcp
Oct 29 15:57:19 localhost.localdomain extract-machine-os.sh[3757]: Adding kernel argument fips=1
Oct 29 15:57:19 localhost.localdomain extract-machine-os.sh[3757]: /bin/copy-iso: line 34: [: ip=dhcp: binary operator expected             
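That shell error is typically produced by an unquoted variable expansion inside a [ ] test, so a multi-word kernel-argument string is split into several operands. A minimal sketch of the failure mode and the usual quoting fix (hypothetical variable names, not the actual copy-iso source):

# karg holds more than one word once FIPS adds a second argument
karg="ip=dhcp fips=1"
[ -n $karg ]      # unquoted: test receives several operands and errors out with "binary operator expected"
[ -n "$karg" ]    # quoted: the whole string stays a single operand and the test works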

 

Description of problem:

    [AWS] The installer should have a pre-check for user tags

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    always

Steps to Reproduce:

Setting user tags as below in install-config: 

    userTags:
      usage-user: cloud-team-rebase-bot[bot]

The user tags are applied to many resources, including IAM roles, but the characters '[' and ']' are not allowed in tags on roles.

https://drive.google.com/file/d/148y-cYrfzNQzDwWlUrgMYAGsZAY6gbW4/view?usp=sharing

    

Actual results:

Installation failed because the IAM roles could not be created; ref job: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-api-provider-aws/529/pull-ci-openshift-cluster-api-provider-aws-master-regression-clusterinfra-aws-ipi-proxy-techpreview/1852197133122277376

Expected results:

The installer should have a pre-check for this scenario and exit with an error message if user tags contain unsupported characters.
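As an illustration of such a pre-check, tag values could be validated against the documented IAM tag character set (letters, digits, spaces and _ . : / = + - @) before any resources are created. A rough sketch using an ASCII approximation of that set (the tag value is the one from this report):

value='cloud-team-rebase-bot[bot]'
if ! printf '%s' "$value" | grep -Eq '^[[:alnum:][:space:]_.:/=+@-]+$'; then
  echo "tag value '$value' contains characters that IAM tags do not accept" >&2
fi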

Additional info:

    discussion on slack: https://redhat-internal.slack.com/archives/CF8SMALS1/p1730443557188649

Description of problem:

  Move the Events option above Event Source and rename it to Event Types, and keep the Eventing options grouped together on the Add page.

Validation failures in assisted-service are reported to the user in the output of openshift-install agent wait-for bootstrap-complete. However, when reporting issues to support or escalating to engineering, we quite often have only the agent-gather archive to go on.

Most validation failures in assisted-service are host validations. These can be reconstructed with some difficulty from the assisted-service log, and are readily available in that log starting with 4.17 since we enabled debugging in AGENT-944.

However, there are also cluster validation failures and these are not well logged.

Description of problem: Clicking the Size control in the PVC form throws a warning error. See below and the attachment:

`react-dom.development.js:67 Warning: A component is changing an uncontrolled input to be controlled.`

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Goto PVC form and open the Browser Dev console
    2. Click on the Size control to set a value
    The warning `Warning: A component is changing an uncontrolled input to be controlled. This is likely caused by the value changing from undefined to a defined value, which should not happen. Decide between using a controlled or uncontrolled input element for the lifetime of the component.` is logged in the console tab.

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Clicking on any route to view its details wrongly takes the route name as the selected project name

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-09-09-212926    

How reproducible:

Always    

Steps to Reproduce:

    1. goes to Routes list page
    2. click on any route name
    3.
    

Actual results:

2. The route name is taken as the selected project name, so the page keeps loading forever because that project doesn't exist

Expected results:

2. The route details page should be shown

Additional info:

    

The story is to track the i18n upload/download routine tasks which are performed every sprint.

 

A.C.

  - Upload strings to Memsource at the start of the sprint and reach out to the localization team

  - Download translated strings from Memsource when they are ready

  -  Review the translated strings and open a pull request

  -  Open a followup story for next sprint

Description of problem:

Create cluster with publish:Mixed by using CAPZ,
1. publish: Mixed + apiserver: Internal
install-config:
=================
publish: Mixed
operatorPublishingStrategy:
  apiserver: Internal
  ingress: External

In this case, the api DNS record should not be created in the public DNS zone, but it was created.
==================
$ az network dns record-set cname show --name api.jima07api --resource-group os4-common --zone-name qe.azure.devcluster.openshift.com
{
  "TTL": 300,
  "etag": "6b13d901-07d1-4cd8-92de-8f3accd92a19",
  "fqdn": "api.jima07api.qe.azure.devcluster.openshift.com.",
  "id": "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/os4-common/providers/Microsoft.Network/dnszones/qe.azure.devcluster.openshift.com/CNAME/api.jima07api",
  "metadata": {},
  "name": "api.jima07api",
  "provisioningState": "Succeeded",
  "resourceGroup": "os4-common",
  "targetResource": {},
  "type": "Microsoft.Network/dnszones/CNAME"
}

2. publish: Mixed + ingress: Internal
install-config:
=============
publish: Mixed
operatorPublishingStrategy:
  apiserver: External
  ingress: Internal

In this case, a load balancer rule on port 6443 should be created in the external load balancer, but it could not be found.
================
$ az network lb rule list --lb-name jima07ingress-krf5b -g jima07ingress-krf5b-rg
[]

Version-Release number of selected component (if applicable):

    4.17 nightly build

How reproducible:

    Always

Steps to Reproduce:

    1. Specify publish: Mixed + mixed External/Internal for api/ingress 
    2. Create cluster
    3. Check that public DNS records and load balancer rules in the internal/external load balancers are created as expected
    

Actual results:

    See description: some resources are created unexpectedly and others are missing.

Expected results:

    Public DNS records and load balancer rules in the internal/external load balancers are created as expected, based on the settings in install-config.

Additional info:

    

Description of problem:

Multipart upload issues with Cloudflare R2 using the S3 API. Some S3-compatible object storage systems, such as R2, require that all multipart chunks are the same size (only the final chunk may be smaller). This was mostly true before, except that the final chunk could be larger than the requested chunk size, which causes uploads to fail. For example, with a 100 MiB chunk size and 250 MiB of data, the upload previously ended with a 150 MiB final part instead of parts of 100, 100 and 50 MiB.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Problem shows itself on OpenShift CI clusters intermittently.

Steps to Reproduce:

This behavior has been causing 504 Gateway Timeout issues in the image registry instances in OpenShift CI clusters.
It is connected to uploading big images (e.g. 35 GB), but we do not currently have the exact steps that reproduce it.

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    https://github.com/distribution/distribution/issues/3873 
    https://github.com/distribution/distribution/issues/3873#issuecomment-2258926705
    https://developers.cloudflare.com/r2/api/workers/workers-api-reference/#r2multipartupload-definition (look for "uniform in size")

Please review the following PR: https://github.com/openshift/images/pull/195

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-api-provider-aws/pull/525

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.

Component name: ose-aws-cluster-api-controllers-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml

Description of problem:

When the user provides an existing VPC, the IBM CAPI will not add ports 443, 5000, and 6443 to the VPC's security group. It is safe to always check for these ports since we only add them if they are missing.
    

Update kubernetes-apiserver and openshift-apiserver to use k8s 1.31.x which is currently in use for OCP 4.18.

Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/126

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of the problem:

[Staging] BE 2.35.0, UI 2.34.2 - UI allows LVMS and ODF to be selected and then throws an error

How reproducible:

100%

Steps to reproduce:

1.

Actual results:

 

Expected results:

Description of problem:

When a normal user tries to create a namespace-scoped network policy, the project selected in the project selection dropdown is not taken into account

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-07-17-183402    

How reproducible:

Always    

Steps to Reproduce:

1. As a normal user with a project, view the networkpolicy creation page
/k8s/ns/yapei1-1/networkpolicies/~new/form
2. Click on 'affected pods' in the Pod selector section OR keep everything at the default values and click on 'Create'

Actual results:

2. The user sees the following error when clicking on 'affected pods':
Can't preview pods
r: pods is forbidden: User "yapei1" cannot list resource "pods" in API group "" at the cluster scope

The user sees the following error when clicking on the 'Create' button:
An error occurred: networkpolicies.networking.k8s.io is forbidden: User "yapei1" cannot create resource "networkpolicies" in API group "networking.k8s.io" at the cluster scope

Expected results:

2. Switching to the YAML view, we can see that the selected project name was not auto-populated in the YAML

Additional info:

    

Description of problem:

Alerts that have been silenced are still shown on the Console overview page.

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    

Steps to Reproduce:

    1. For a cluster installed on version 4.15
    2. Silence an alert that is firing by going to Console --> Observe --> Alerting --> Alerts
    3. Check that the alert is added to the silences: Console --> Observe --> Alerting --> Silences
    4. Go back to the Console (Overview page); the silenced alert is still seen there

Actual results:

    The silenced alert can still be seen on the OCP overview page

Expected results:

    Silenced alerts should not be seen on the overview page

Additional info:

    

Description of problem:

Navigation:
           Storage -> PersistentVolumeClaims -> Details -> Mouse hover on 'PersistentVolumeClaim details' diagram
Issue:
           "Available" translated in-side diagram but not in mouse hover text

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-06-01-063526

How reproducible:

Always

Steps to Reproduce:

1. Log into web console and set language to non en_US
2. Navigate to Storage -> PersistentVolumeClaims
3. Click on PersistentVolumeClaim from list
4. In Details tab, mouse hover on 'PersistentVolumeClaim details' diagram
5. Text "xx.yy GiB Available" in English.
6. Same "Available" translated in-side diagram but not in mouse hover text

Actual results:

"Available" translated in-side diagram but not in mouse hover text

Expected results:

"Available" in mouse hover text should be in set language

Additional info:

screenshot reference attached

Description of problem:

FDP released a new OVS 3.4 version that will be used on the host.

We want to maintain the same version in the container.

This is mostly needed for the OVN observability feature.

Our e2e jobs fail with:

pods/aws-efs-csi-driver-controller-66f7d8bcf5-zf8vr initContainers[init-aws-credentials-file] must have terminationMessagePolicy="FallbackToLogsOnError"
pods/aws-efs-csi-driver-node-7qj9p containers[csi-driver] must have terminationMessagePolicy="FallbackToLogsOnError"
pods/aws-efs-csi-driver-operator-fcc56998b-2d5x6 containers[aws-efs-csi-driver-operator] must have terminationMessagePolicy="FallbackToLogsOnError" 

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/55652/rehearse-55652-periodic-ci-openshift-csi-operator-release-4.19-periodic-e2e-aws-efs-csi/1824483696548253696

The jobs should succeed.
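The usual fix is to set the policy explicitly on every container and init container in the operator and driver manifests; a minimal sketch of the relevant field on a container spec (container name taken from the failing pod above):

      containers:
      - name: csi-driver
        terminationMessagePolicy: FallbackToLogsOnError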

Description of problem:

Various tests in Console's master branch CI are failing due to missing content of <li.pf-v5-c-menu__list-item> element.

Check https://search.dptools.openshift.org/?search=within+the+element%3A+%3Cli.pf-v5-c-menu__list-item%3E+but+never+did&maxAge=168h&context=1&type=all&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    The actual issue is that a project created via the CLI is not available in the namespace dropdown

 When one of our partners was trying to deploy a 4.16 spoke cluster with the ZTP/GitOps approach, they got the following error message in their assisted-service pod:

error msg="failed to get corresponding infraEnv" func="github.com/openshift/assisted-service/internal/controller/controllers.(*PreprovisioningImageReconciler).AddIronicAgentToInfraEnv" file="/remote-source/assisted-service/app/internal/controller/controllers/preprovisioningimage_controller.go:409" error="record not found" go-id=497 preprovisioning_image=storage-1.fi-911.tre.nsn-rdnet.net preprovisioning_image_namespace=fi-911 request_id=cc62d8f6-d31f-4f74-af50-3237df186dc2

 

After some discussion in the Assisted-Installer forum (https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1723196754444999), Nick Carboni and Alona Paz suggested that "identifier: mac-address" is not supported. The partner currently has ACM 2.11.0 and MCE 2.6.0. However, their older cluster had ACM 2.10 and MCE 2.4.5, where this parameter worked. Nick and Alona suggested removing "identifier: mac-address" from the siteconfig, and the installation then started to progress. Based on Nick's suggestion, I opened this bug ticket to understand why it no longer works. The partner has asked for official documentation on why this parameter no longer works, or whether it is no longer supported.

Please review the following PR: https://github.com/openshift/ironic-static-ip-manager/pull/44

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

On "VolumeSnapshot" list page, when project dropdown is "All Projects", click "Create VolumeSnapshot", the project "Undefined" is shown on project field.

    

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-09-27-213503
4.18.0-0.nightly-2024-09-28-162600
    

How reproducible:

Always
    

Steps to Reproduce:

    1.Go to "VolumeSnapshot" list page, set "All Projects" in project dropdown list.
    2.Click "Create VolumeSnapshot", check project field on the creation page.
    3.
    

Actual results:

2. The project is "Undefined"
    

Expected results:

2. The project should be "default".
    

Additional info:


    

Description of problem:

 

Both TestAWSEIPAllocationsForNLB and TestAWSLBSubnets are flaking on verifyExternalIngressController waiting for DNS to resolve.

Example error:

lb_eip_test.go:119: loadbalancer domain apps.eiptest.ci-op-d2nddmn0-43abb.origin-ci-int-aws.dev.rhcloud.com was unable to resolve:

 

Version-Release number of selected component (if applicable):

4.17.0

How reproducible:

50%

Steps to Reproduce:

    1. Run TestAWSEIPAllocationsForNLB or TestAWSLBSubnets in CI

Actual results:

    Flakes

Expected results:

    Shouldn't flake

Additional info:

CI Search: FAIL: TestAll/parallel/TestAWSEIPAllocationsForNLB

CI Search: FAIL: TestAll/parallel/TestUnmanagedAWSEIPAllocations

CI Search: FAIL: TestAll/parallel/TestAWSLBSubnets

Hello Team,

 

After the hard reboot of all nodes due to a power outage, a failure to pull the NTO image prevents "ocp-tuned-one-shot.service" from starting, which results in a dependency failure for the kubelet and crio services:

------------

journalctl_--no-pager

Aug 26 17:07:46 ocp05 systemd[1]: Reached target The firstboot OS update has completed.
Aug 26 17:07:46 ocp05 resolv-prepender.sh[3577]: NM resolv-prepender: Starting download of baremetal runtime cfg image
Aug 26 17:07:46 ocp05 systemd[1]: Starting Writes IP address configuration so that kubelet and crio services select a valid node IP...
Aug 26 17:07:46 ocp05 systemd[1]: Starting TuneD service from NTO image...
Aug 26 17:07:46 ocp05 nm-dispatcher[3687]: NM resolv-prepender triggered by lo up.
Aug 26 17:07:46 ocp05 resolv-prepender.sh[3644]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cf4faeb258c222ba4e04806fd3a7373d3bc1f43a66e141d4b7ece0307f597c72...
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + [[ OVNKubernetes == \O\V\N\K\u\b\e\r\n\e\t\e\s ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + [[ lo == \W\i\r\e\d\ \C\o\n\n\e\c\t\i\o\n ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + '[' -z ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + echo 'Not a DHCP4 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: Not a DHCP4 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3720]: + exit 0
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: + '[' -z '' ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: + echo 'Not a DHCP6 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: Not a DHCP6 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3722]: + exit 0
Aug 26 17:07:46 ocp05 bash[3655]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cf4faeb258c222ba4e04806fd3a7373d3bc1f43a66e141d4b7ece0307f597c72...
Aug 26 17:07:46 ocp05 podman[3661]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b6ace44ba73bc0cef451bcf755c7fcddabe66b79df649058dc4b263e052ae26...
Aug 26 17:07:46 ocp05 podman[3661]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b6ace44ba73bc0cef451bcf755c7fcddabe66b79df649058dc4b263e052ae26: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 10.112.227.10:53: server misbehaving
Aug 26 17:07:46 ocp05 systemd[1]: ocp-tuned-one-shot.service: Main process exited, code=exited, status=125/n/a
Aug 26 17:07:46 ocp05 nm-dispatcher[3793]: NM resolv-prepender triggered by brtrunk up.
Aug 26 17:07:46 ocp05 systemd[1]: ocp-tuned-one-shot.service: Failed with result 'exit-code'.
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + [[ OVNKubernetes == \O\V\N\K\u\b\e\r\n\e\t\e\s ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + [[ brtrunk == \W\i\r\e\d\ \C\o\n\n\e\c\t\i\o\n ]]
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + '[' -z ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + echo 'Not a DHCP4 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: Not a DHCP4 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3803]: + exit 0
Aug 26 17:07:46 ocp05 systemd[1]: Failed to start TuneD service from NTO image.
Aug 26 17:07:46 ocp05 systemd[1]: Dependency failed for Dependencies necessary to run kubelet.
Aug 26 17:07:46 ocp05 systemd[1]: Dependency failed for Kubernetes Kubelet.
Aug 26 17:07:46 ocp05 systemd[1]: kubelet.service: Job kubelet.service/start failed with result 'dependency'.
Aug 26 17:07:46 ocp05 systemd[1]: Dependency failed for Container Runtime Interface for OCI (CRI-O).
Aug 26 17:07:46 ocp05 systemd[1]: crio.service: Job crio.service/start failed with result 'dependency'.
Aug 26 17:07:46 ocp05 systemd[1]: kubelet-dependencies.target: Job kubelet-dependencies.target/start failed with result 'dependency'.
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: + '[' -z '' ']'
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: + echo 'Not a DHCP6 address. Ignoring.'
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: Not a DHCP6 address. Ignoring.
Aug 26 17:07:46 ocp05 nm-dispatcher[3804]: + exit 0

-----------

-----------

$ oc get proxy config cluster  -oyaml
  status:
    httpProxy: http://proxy_ip:8080
    httpsProxy: http://proxy_ip:8080

$ cat /etc/mco/proxy.env
HTTP_PROXY=http://proxy_ip:8080
HTTPS_PROXY=http://proxy_ip:8080

-----------

-----------
× ocp-tuned-one-shot.service - TuneD service from NTO image
     Loaded: loaded (/etc/systemd/system/ocp-tuned-one-shot.service; enabled; preset: disabled)
     Active: failed (Result: exit-code) since Mon 2024-08-26 17:07:46 UTC; 2h 30min ago
   Main PID: 3661 (code=exited, status=125)

Aug 26 17:07:46 ocp05 podman[3661]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4b6ace44ba73bc0cef451bcf755c7fcddabe66b79df649058dc4b263e052ae26: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 10.112.227.10:53: server misbehaving
-----------

  • The customer has a proxy configured in their environment. However, the nodes cannot start after the hard reboot of all nodes because the NTO image pull appears to ignore the cluster-wide proxy settings. To resolve the NTO image pull issue, the customer has to add the proxy variables to /etc/systemd/system.conf manually.
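As a less invasive workaround than editing /etc/systemd/system.conf globally, a systemd drop-in could load the MCO-managed proxy file for just this unit (a sketch of the workaround only, not the actual fix, which would be for the service to honour the cluster-wide proxy itself):

# /etc/systemd/system/ocp-tuned-one-shot.service.d/10-proxy.conf
[Service]
EnvironmentFile=-/etc/mco/proxy.env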

Description of problem:

    When we added new bundle metadata encoding as `olm.csv.metadata` in https://github.com/operator-framework/operator-registry/pull/1094 (downstreamed for 4.15+) we created situations where
- konflux onboarded operators, encouraged to use upstream:latest to generate FBC from templates; and
- IIB-generated catalog images which used earlier opm versions to serve content

could generate the new format but not be able to serve it. 

One only has to `opm render` an SQLite catalog image, or expand a catalog template.
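A quick way to check whether a rendered catalog contains the newer encoding is to count the schema occurrences in the output (the image reference below is a placeholder):

opm render <sqlite-based-catalog-image> | grep -c '"schema": "olm.csv.metadata"'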

 

 

Version-Release number of selected component (if applicable):

    

How reproducible:

every time    

Steps to Reproduce:

    1. opm render an SQLite catalog image
    2.
    3.
    

Actual results:

    uses `olm.csv.metadata` in the output

Expected results:

    only using `olm.bundle.object` in the output

Additional info:

    

Description of problem:

    When a HostedCluster is upgraded to a new minor version, its OLM catalog imagestreams are not updated to use the tag corresponding to the new minor version.

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    Always

Steps to Reproduce:

    1. Create a HostedCluster (4.15.z)
    2. Upgrade the HostedCluster to a new minor version (4.16.z)
    

Actual results:

    OLM catalog imagestreams remain at the previous version (4.15)

Expected results:

    OLM catalog imagestreams are updated to new minor version (4.16)

Additional info:

    

Please review the following PR: https://github.com/openshift/prometheus-alertmanager/pull/95

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

this is case 2 from OCPBUGS-14673 

Description of problem:

MHC for the control plane does not work correctly in some cases.

2. Stop the kubelet service on the master node: the new master gets Running, the old one is stuck in Deleting, and many cluster operators are degraded.

This is a regression bug: when I tested this on 4.12 around September 2022, case 2 and case 3 worked correctly.
https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-54326

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-05-112833
4.13.0-0.nightly-2023-06-06-194351
4.12.0-0.nightly-2023-06-07-005319

How reproducible:

Always

Steps to Reproduce:

1.Create MHC for control plane

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: control-plane-health
  namespace: openshift-machine-api
spec:
  maxUnhealthy: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-type: master
  unhealthyConditions:
  - status: "False"
    timeout: 300s
    type: Ready
  - status: "Unknown"
    timeout: 300s
    type: Ready


liuhuali@Lius-MacBook-Pro huali-test % oc create -f mhc-master3.yaml 
machinehealthcheck.machine.openshift.io/control-plane-health created
liuhuali@Lius-MacBook-Pro huali-test % oc get mhc
NAME                              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
control-plane-health              1              3                  3
machine-api-termination-handler   100%           0                  0 

Case 2. Stop the kubelet service on the master node: the new master gets Running, the old one is stuck in Deleting, and many cluster operators are degraded.
liuhuali@Lius-MacBook-Pro huali-test % oc debug node/huliu-az7c-svq9q-master-1 
Starting pod/huliu-az7c-svq9q-master-1-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.6
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# systemctl stop kubelet


Removing debug pod ...
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                   STATUS   ROLES                  AGE   VERSION
huliu-az7c-svq9q-master-1              Ready    control-plane,master   95m   v1.26.5+7a891f0
huliu-az7c-svq9q-master-2              Ready    control-plane,master   95m   v1.26.5+7a891f0
huliu-az7c-svq9q-master-c96k8-0        Ready    control-plane,master   19m   v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-5r8jf   Ready    worker                 34m   v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-k747l   Ready    worker                 47m   v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-r2vdn   Ready    worker                 83m   v1.26.5+7a891f0
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                   PHASE     TYPE              REGION   ZONE   AGE
huliu-az7c-svq9q-master-1              Running   Standard_D8s_v3   westus          97m
huliu-az7c-svq9q-master-2              Running   Standard_D8s_v3   westus          97m
huliu-az7c-svq9q-master-c96k8-0        Running   Standard_D8s_v3   westus          23m
huliu-az7c-svq9q-worker-westus-5r8jf   Running   Standard_D4s_v3   westus          39m
huliu-az7c-svq9q-worker-westus-k747l   Running   Standard_D4s_v3   westus          53m
huliu-az7c-svq9q-worker-westus-r2vdn   Running   Standard_D4s_v3   westus          91m
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                   STATUS     ROLES                  AGE     VERSION
huliu-az7c-svq9q-master-1              NotReady   control-plane,master   107m    v1.26.5+7a891f0
huliu-az7c-svq9q-master-2              Ready      control-plane,master   107m    v1.26.5+7a891f0
huliu-az7c-svq9q-master-c96k8-0        Ready      control-plane,master   32m     v1.26.5+7a891f0
huliu-az7c-svq9q-master-jdhgg-1        Ready      control-plane,master   2m10s   v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-5r8jf   Ready      worker                 46m     v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-k747l   Ready      worker                 59m     v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-r2vdn   Ready      worker                 95m     v1.26.5+7a891f0
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                   PHASE      TYPE              REGION   ZONE   AGE
huliu-az7c-svq9q-master-1              Deleting   Standard_D8s_v3   westus          110m
huliu-az7c-svq9q-master-2              Running    Standard_D8s_v3   westus          110m
huliu-az7c-svq9q-master-c96k8-0        Running    Standard_D8s_v3   westus          36m
huliu-az7c-svq9q-master-jdhgg-1        Running    Standard_D8s_v3   westus          5m55s
huliu-az7c-svq9q-worker-westus-5r8jf   Running    Standard_D4s_v3   westus          52m
huliu-az7c-svq9q-worker-westus-k747l   Running    Standard_D4s_v3   westus          65m
huliu-az7c-svq9q-worker-westus-r2vdn   Running    Standard_D4s_v3   westus          103m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                   PHASE      TYPE              REGION   ZONE   AGE
huliu-az7c-svq9q-master-1              Deleting   Standard_D8s_v3   westus          3h
huliu-az7c-svq9q-master-2              Running    Standard_D8s_v3   westus          3h
huliu-az7c-svq9q-master-c96k8-0        Running    Standard_D8s_v3   westus          105m
huliu-az7c-svq9q-master-jdhgg-1        Running    Standard_D8s_v3   westus          75m
huliu-az7c-svq9q-worker-westus-5r8jf   Running    Standard_D4s_v3   westus          122m
huliu-az7c-svq9q-worker-westus-k747l   Running    Standard_D4s_v3   westus          135m
huliu-az7c-svq9q-worker-westus-r2vdn   Running    Standard_D4s_v3   westus          173m
liuhuali@Lius-MacBook-Pro huali-test % oc get node   
NAME                                   STATUS     ROLES                  AGE    VERSION
huliu-az7c-svq9q-master-1              NotReady   control-plane,master   178m   v1.26.5+7a891f0
huliu-az7c-svq9q-master-2              Ready      control-plane,master   178m   v1.26.5+7a891f0
huliu-az7c-svq9q-master-c96k8-0        Ready      control-plane,master   102m   v1.26.5+7a891f0
huliu-az7c-svq9q-master-jdhgg-1        Ready      control-plane,master   72m    v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-5r8jf   Ready      worker                 116m   v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-k747l   Ready      worker                 129m   v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-r2vdn   Ready      worker                 165m   v1.26.5+7a891f0
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.13.0-0.nightly-2023-06-06-194351   True        True          True       107m    APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
baremetal                                  4.13.0-0.nightly-2023-06-06-194351   True        False         False      174m    
cloud-controller-manager                   4.13.0-0.nightly-2023-06-06-194351   True        False         False      176m    
cloud-credential                           4.13.0-0.nightly-2023-06-06-194351   True        False         False      3h      
cluster-autoscaler                         4.13.0-0.nightly-2023-06-06-194351   True        False         False      173m    
config-operator                            4.13.0-0.nightly-2023-06-06-194351   True        False         False      175m    
console                                    4.13.0-0.nightly-2023-06-06-194351   True        False         False      136m    
control-plane-machine-set                  4.13.0-0.nightly-2023-06-06-194351   True        False         False      71m     
csi-snapshot-controller                    4.13.0-0.nightly-2023-06-06-194351   True        False         False      174m    
dns                                        4.13.0-0.nightly-2023-06-06-194351   True        True          False      173m    DNS "default" reports Progressing=True: "Have 6 available node-resolver pods, want 7."
etcd                                       4.13.0-0.nightly-2023-06-06-194351   True        True          True       173m    NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
image-registry                             4.13.0-0.nightly-2023-06-06-194351   True        True          False      165m    Progressing: The registry is ready...
ingress                                    4.13.0-0.nightly-2023-06-06-194351   True        False         False      165m    
insights                                   4.13.0-0.nightly-2023-06-06-194351   True        False         False      168m    
kube-apiserver                             4.13.0-0.nightly-2023-06-06-194351   True        True          True       171m    NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-controller-manager                    4.13.0-0.nightly-2023-06-06-194351   True        False         True       171m    NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-scheduler                             4.13.0-0.nightly-2023-06-06-194351   True        False         True       171m    NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-storage-version-migrator              4.13.0-0.nightly-2023-06-06-194351   True        False         False      106m    
machine-api                                4.13.0-0.nightly-2023-06-06-194351   True        False         False      167m    
machine-approver                           4.13.0-0.nightly-2023-06-06-194351   True        False         False      174m    
machine-config                             4.13.0-0.nightly-2023-06-06-194351   False       False         True       60m     Cluster not available for [{operator 4.13.0-0.nightly-2023-06-06-194351}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)]
marketplace                                4.13.0-0.nightly-2023-06-06-194351   True        False         False      174m    
monitoring                                 4.13.0-0.nightly-2023-06-06-194351   True        False         False      106m    
network                                    4.13.0-0.nightly-2023-06-06-194351   True        True          False      177m    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)...
node-tuning                                4.13.0-0.nightly-2023-06-06-194351   True        False         False      173m    
openshift-apiserver                        4.13.0-0.nightly-2023-06-06-194351   True        True          True       107m    APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-apiserver ()
openshift-controller-manager               4.13.0-0.nightly-2023-06-06-194351   True        False         False      170m    
openshift-samples                          4.13.0-0.nightly-2023-06-06-194351   True        False         False      167m    
operator-lifecycle-manager                 4.13.0-0.nightly-2023-06-06-194351   True        False         False      174m    
operator-lifecycle-manager-catalog         4.13.0-0.nightly-2023-06-06-194351   True        False         False      174m    
operator-lifecycle-manager-packageserver   4.13.0-0.nightly-2023-06-06-194351   True        False         False      168m    
service-ca                                 4.13.0-0.nightly-2023-06-06-194351   True        False         False      175m    
storage                                    4.13.0-0.nightly-2023-06-06-194351   True        True          False      174m    AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...
liuhuali@Lius-MacBook-Pro huali-test % 

-----------------------

There might be an easier way to reproduce this: roll a new revision in etcd, stop the kubelet, and then observe the same issue.

Actual results:

CEO's member removal controller is getting stuck on the IsBootstrapComplete check that was introduced to fix another bug:

https://github.com/openshift/cluster-etcd-operator/commit/c96150992a8aba3654835787be92188e947f557c#diff-d91047e39d2c1ab6b35e69359a24e83c19ad9b3e9ad4e44f9b1ac90e50f7b650R97

It turns out IsBootstrapComplete checks whether a revision is currently rolling out (which makes sense), and the one NotReady node whose kubelet is gone still has a revision in progress (revision 7, target 9).

more info: https://issues.redhat.com/browse/OCPBUGS-14673?focusedId=22726712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-22726712

This causes the etcd member to not be removed.

This in turn blocks the vertical scale-down procedure from removing the pre-drain hook, as the member is still present. Effectively you end up with a cluster of four control-plane machines, where one is stuck in the Deleting state.

 

Expected results:

The etcd member should be removed and the machine/node should be deleted

Additional info:

Removing the revision check does fix this issue reliably, but might not be desirable:
https://github.com/openshift/cluster-etcd-operator/pull/1087  

Description of problem:

    Once the minimum node count is reached, the remaining nodes should no longer carry the DeletionCandidateOfClusterAutoscaler taint

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-arm64-2024-09-13-023103

How reproducible:

    Always

Steps to Reproduce:

    1. Create an IPI cluster
    2. Create a MachineAutoscaler and a ClusterAutoscaler (see the sketch after these steps)
    3. Create a workload so that scaling happens
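For reference, a minimal sketch of the autoscaler objects from step 2, using the standard OpenShift autoscaling APIs (the MachineAutoscaler name and the MachineSet reference are placeholders, not taken from the test environment):

```
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  scaleDown:
    enabled: true            # enable scale-down so candidate nodes get tainted
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-scaler        # placeholder name
  namespace: openshift-machine-api
spec:
  minReplicas: 1
  maxReplicas: 3
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: <worker-machineset-name>   # placeholder
```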
    
    

Actual results:

    The DeletionCandidateOfClusterAutoscaler taint is still present even after the minimum node count is reached

Expected results:

    The taint above should not be present on the nodes once the minimum node count is reached

Additional info:

    logs from the test - https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Runner/1037951/console

must-gather - https://drive.google.com/file/d/1zB2r-BRHjC12g17_Abc-xvtEqpJOopI5/view?usp=sharing

We reproduced it manually and waited around 15 minutes; the taint was still present.

Description of problem:

When the TelemeterClientFailures alert fires, there's no runbook link explaining the meaning of the alert and what to do about it.
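For illustration only, the fix amounts to adding a `runbook_url` annotation to the alerting rule, roughly along these lines (the expression is unchanged and elided, and the runbook path is a placeholder, not the final link):

```
    rules:
    - alert: TelemeterClientFailures
      expr: <existing expression, unchanged>
      annotations:
        runbook_url: https://github.com/openshift/runbooks/...   # placeholder path
```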

Version-Release number of selected component (if applicable):


How reproducible:

Always

Steps to Reproduce:

1. Check the TelemeterClientFailures alerting rule's annotations
2.
3.

Actual results:

No runbook_url annotation.

Expected results:

runbook_url annotation is present.

Additional info:

This is a consequence of a telemeter server outage that triggered questions from customers about the alert:
https://issues.redhat.com/browse/OHSS-25947
https://issues.redhat.com/browse/OCPBUGS-17966
Also in relation to https://issues.redhat.com/browse/OCPBUGS-17797

When adding a BMH with 

 

spec:
  online: true
  customDeploy:
    method: install_coreos

 

after inspection the BMO will provision the node in Ironic,

but the node is now created without any userdata/Ignition data. IPA's ironic_coreos_install step then goes down a seldom-used path that creates an Ignition config from scratch; the resulting Ignition config is invalid and the node fails to boot after it is provisioned.

 

Boot stalls with an Ignition error: "invalid config version (couldn't parse)".
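For context, a fuller (hedged) sketch of such a BareMetalHost, including the userData secret reference that would normally carry the Ignition config; all names and addresses below are placeholders:

```
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-0                          # placeholder
  namespace: openshift-machine-api
spec:
  online: true
  customDeploy:
    method: install_coreos
  userData:                               # Ignition config secret; absent in the failing scenario
    name: worker-0-user-data              # placeholder
    namespace: openshift-machine-api
  bootMACAddress: "52:54:00:00:00:01"     # placeholder
  bmc:
    address: redfish://192.0.2.1/redfish/v1/Systems/1   # placeholder
    credentialsName: worker-0-bmc-secret                # placeholder
```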

Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/853

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

   Prometheus write_relabel_configs in remotewrite unable to drop metric in Grafana  

Version-Release number of selected component (if applicable):

    

How reproducible:

The customer has tried both configurations to drop the MQ metric, with source_labels (configuration 1) and without source_labels (configuration 2), but neither works.

It seems that the drop configuration is not being applied properly.


Configuration 1:

```
 remoteWrite:
        - url: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
          write_relabel_configs:
          - source_labels: ['__name__']
            regex: 'ibmmq_qmgr_uptime'
            action: 'drop'
          basicAuth:
            username:
              name: kubepromsecret
              key: username
            password:
              name: kubepromsecret
              key: password
```

Configuration 2:
```
remoteWrite:
        - url: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
          write_relabel_configs:
          - regex: 'ibmmq_qmgr_uptime'
            action: 'drop'
          basicAuth:
            username:
              name: kubepromsecret
              key: username
            password:
              name: kubepromsecret
              key: password
```


The customer wants to know the correct remote write configuration to drop the metric before it reaches Grafana.

Document links:
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write
https://docs.openshift.com/container-platform/4.14/observability/monitoring/configuring-the-monitoring-stack.html#configuring-remote-write-storage_configuring-the-monitoring-stack
https://docs.openshift.com/container-platform/4.14/observability/monitoring/configuring-the-monitoring-stack.html#creating-user-defined-workload-monitoring-configmap_configuring-the-monitoring-stack
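Note that both configurations mix the raw Prometheus snake_case keys (`write_relabel_configs`, `source_labels`) with the camelCase keys expected in the cluster-monitoring-config ConfigMap (`basicAuth`); the monitoring stack's remoteWrite settings follow the Prometheus Operator field names, so the relabeling section is spelled in camelCase as well. A hedged sketch of the drop rule in that form:

```
remoteWrite:
- url: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
  writeRelabelConfigs:
  - sourceLabels: ['__name__']
    regex: 'ibmmq_qmgr_uptime'
    action: 'drop'
  basicAuth:
    username:
      name: kubepromsecret
      key: username
    password:
      name: kubepromsecret
      key: password
```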

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    The Prometheus remote write configuration is NOT dropping the metric in Grafana

Expected results:

The Prometheus remote write configuration should drop the metric in Grafana

Additional info:

    

Description of problem:

Using a payload built with https://github.com/openshift/installer/pull/8666/ so that master instances can be provisioned from a Gen2 image, which is required when configuring a security type in install-config.

Enable TrustedLaunch security type in install-config:
==================
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure: 
      encryptionAtHost: true
      settings:
        securityType: TrustedLaunch
        trustedLaunch:
          uefiSettings:
            secureBoot: Enabled
            virtualizedTrustedPlatformModule: Enabled

Launch a CAPI-based installation; the installer failed after waiting 15 minutes for machines to provision:
INFO GalleryImage.ID=/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima08conf01-9vgq5-rg/providers/Microsoft.Compute/galleries/gallery_jima08conf01_9vgq5/images/jima08conf01-9vgq5 
INFO GalleryImage.ID=/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima08conf01-9vgq5-rg/providers/Microsoft.Compute/galleries/gallery_jima08conf01_9vgq5/images/jima08conf01-9vgq5-gen2 
INFO Created manifest *v1beta1.AzureMachine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-bootstrap 
INFO Created manifest *v1beta1.AzureMachine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-0 
INFO Created manifest *v1beta1.AzureMachine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-1 
INFO Created manifest *v1beta1.AzureMachine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-2 
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-bootstrap 
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-0 
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-1 
INFO Created manifest *v1beta1.Machine, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master-2 
INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-bootstrap 
INFO Created manifest *v1.Secret, namespace=openshift-cluster-api-guests name=jima08conf01-9vgq5-master 
INFO Waiting up to 15m0s (until 6:26AM UTC) for machines [jima08conf01-9vgq5-bootstrap jima08conf01-9vgq5-master-0 jima08conf01-9vgq5-master-1 jima08conf01-9vgq5-master-2] to provision... 
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: control-plane machines were not provisioned within 15m0s: client rate limiter Wait returned an error: context deadline exceeded 
INFO Shutting down local Cluster API control plane... 
INFO Stopped controller: Cluster API              
INFO Stopped controller: azure infrastructure provider 
INFO Stopped controller: azureaso infrastructure provider 
INFO Local Cluster API system has completed operations 

In openshift-install.log,
time="2024-07-08T06:25:49Z" level=debug msg="\tfailed to reconcile AzureMachine: failed to reconcile AzureMachine service virtualmachine: failed to create or update resource jima08conf01-9vgq5-rg/jima08conf01-9vgq5-bootstrap (service: virtualmachine): PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima08conf01-9vgq5-rg/providers/Microsoft.Compute/virtualMachines/jima08conf01-9vgq5-bootstrap"
time="2024-07-08T06:25:49Z" level=debug msg="\t--------------------------------------------------------------------------------"
time="2024-07-08T06:25:49Z" level=debug msg="\tRESPONSE 400: 400 Bad Request"
time="2024-07-08T06:25:49Z" level=debug msg="\tERROR CODE: BadRequest"
time="2024-07-08T06:25:49Z" level=debug msg="\t--------------------------------------------------------------------------------"
time="2024-07-08T06:25:49Z" level=debug msg="\t{"
time="2024-07-08T06:25:49Z" level=debug msg="\t  \"error\": {"
time="2024-07-08T06:25:49Z" level=debug msg="\t    \"code\": \"BadRequest\","
time="2024-07-08T06:25:49Z" level=debug msg="\t    \"message\": \"Use of TrustedLaunch setting is not supported for the provided image. Please select Trusted Launch Supported Gen2 OS Image. For more information, see https://aka.ms/TrustedLaunch-FAQ.\""
time="2024-07-08T06:25:49Z" level=debug msg="\t  }"
time="2024-07-08T06:25:49Z" level=debug msg="\t}"
time="2024-07-08T06:25:49Z" level=debug msg="\t--------------------------------------------------------------------------------"
time="2024-07-08T06:25:49Z" level=debug msg=" > controller=\"azuremachine\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"AzureMachine\" AzureMachine=\"openshift-cluster-api-guests/jima08conf01-9vgq5-bootstrap\" namespace=\"openshift-cluster-api-guests\" name=\"jima08conf01-9vgq5-bootstrap\" reconcileID=\"bee8a459-c3c8-4295-ba4a-f3d560d6a68b\""

It looks like the CAPI-based installer misses enabling the security features when creating the Gen2 image, something the Terraform code does:
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/main.tf#L166-L169

Gen2 image definition created by terraform:
$ az sig image-definition show --gallery-image-definition jima08conf02-4mrnz-gen2 -r gallery_jima08conf02_4mrnz -g jima08conf02-4mrnz-rg --query 'features'
[
  {
    "name": "SecurityType",
    "value": "TrustedLaunch"
  }
]
The result is empty when querying the Gen2 image created by CAPI:
$ az sig image-definition show --gallery-image-definition jima08conf01-9vgq5-gen2 -r gallery_jima08conf01_9vgq5 -g jima08conf01-9vgq5-rg --query 'features'
$ 

Version-Release number of selected component (if applicable):

4.17 payload built from cluster-bot with PR https://github.com/openshift/installer/pull/8666/

How reproducible:

Always

Steps to Reproduce:

    1. Enable security type in install-config
    2. Create cluster by using CAPI
    3. 
    

Actual results:

    Install failed.

Expected results:

    Install succeeded.

Additional info:

   This impacts installations with the ConfidentialVM or TrustedLaunch security type enabled.

 

Description of the problem:
Cluster ** installation with static network configuration for IPv4 and IPv6:
discovery completes, but without the configured IP addresses, and the installation is aborted on the bootstrap reboot.

https://redhat-internal.slack.com/archives/C02RD175109/p1727157947875779

Two issues:
#1 The static configuration is not applied because `autoconf: 'false'` is missing. It worked before, but it is now mandatory for IPv6.

#2 The test-infra code needs to be updated.

 

 

How reproducible:

 

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

Description of problem:

When we move one node from one custom MCP to another custom MCP, the MCPs report a wrong number of nodes.

For example, we reach this situation (the worker-perf MCP is not reporting the right number of nodes):

$ oc get mcp,nodes
NAME                                                                     CONFIG                                                         UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
machineconfigpool.machineconfiguration.openshift.io/master               rendered-master-c8d23b071e1ccf6cf85c7f1b31c0def6               True      False      False      3              3                   3                     0                      142m
machineconfigpool.machineconfiguration.openshift.io/worker               rendered-worker-36ee1fdc485685ac9c324769889c3348               True      False      False      1              1                   1                     0                      142m
machineconfigpool.machineconfiguration.openshift.io/worker-perf          rendered-worker-perf-6b5fbffac62c3d437e307e849c44b556          True      False      False      2              2                   2                     0                      24m
machineconfigpool.machineconfiguration.openshift.io/worker-perf-canary   rendered-worker-perf-canary-6b5fbffac62c3d437e307e849c44b556   True      False      False      1              1                   1                     0                      7m52s

NAME                                             STATUS   ROLES                       AGE    VERSION
node/ip-10-0-13-228.us-east-2.compute.internal   Ready    worker,worker-perf-canary   138m   v1.30.4
node/ip-10-0-2-250.us-east-2.compute.internal    Ready    control-plane,master        145m   v1.30.4
node/ip-10-0-34-223.us-east-2.compute.internal   Ready    control-plane,master        144m   v1.30.4
node/ip-10-0-35-61.us-east-2.compute.internal    Ready    worker,worker-perf          136m   v1.30.4
node/ip-10-0-79-232.us-east-2.compute.internal   Ready    control-plane,master        144m   v1.30.4
node/ip-10-0-86-124.us-east-2.compute.internal   Ready    worker                      139m   v1.30.4



After 20 minutes or half an hour the MCPs start reporting the right number of nodes

    

Version-Release number of selected component (if applicable):
IPI on AWS version:

$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.17.0-0.nightly-2024-09-13-040101 True False 124m Cluster version is 4.17.0-0.nightly-2024-09-13-040101

    

How reproducible:
Always

    

Steps to Reproduce:

    1. Create a MCP
    
     oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-perf
spec:
  machineConfigSelector:
    matchExpressions:
      - {
         key: machineconfiguration.openshift.io/role,
         operator: In,
         values: [worker,worker-perf]
        }
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-perf: ""
EOF

    
    2. Add 2 nodes to the MCP
    
   $ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf=
   $ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[1].metadata.name}") node-role.kubernetes.io/worker-perf=

    3. Create another MCP
    oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-perf-canary
spec:
  machineConfigSelector:
    matchExpressions:
      - {
         key: machineconfiguration.openshift.io/role,
         operator: In,
         values: [worker,worker-perf,worker-perf-canary]
        }
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-perf-canary: ""
EOF

    4. Move one node from the MCP created in step 1 to the MCP created in step 3
    $ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf-canary=
    $ oc label node $(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/worker-perf-
    
    
    

Actual results:

The worker-perf pool is not reporting the right number of nodes. It continues reporting 2 nodes even though one of them was moved to the worker-perf-canary MCP.
$ oc get mcp,nodes
NAME                                                                     CONFIG                                                         UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
machineconfigpool.machineconfiguration.openshift.io/master               rendered-master-c8d23b071e1ccf6cf85c7f1b31c0def6               True      False      False      3              3                   3                     0                      142m
machineconfigpool.machineconfiguration.openshift.io/worker               rendered-worker-36ee1fdc485685ac9c324769889c3348               True      False      False      1              1                   1                     0                      142m
machineconfigpool.machineconfiguration.openshift.io/worker-perf          rendered-worker-perf-6b5fbffac62c3d437e307e849c44b556          True      False      False      2              2                   2                     0                      24m
machineconfigpool.machineconfiguration.openshift.io/worker-perf-canary   rendered-worker-perf-canary-6b5fbffac62c3d437e307e849c44b556   True      False      False      1              1                   1                     0                      7m52s

NAME                                             STATUS   ROLES                       AGE    VERSION
node/ip-10-0-13-228.us-east-2.compute.internal   Ready    worker,worker-perf-canary   138m   v1.30.4
node/ip-10-0-2-250.us-east-2.compute.internal    Ready    control-plane,master        145m   v1.30.4
node/ip-10-0-34-223.us-east-2.compute.internal   Ready    control-plane,master        144m   v1.30.4
node/ip-10-0-35-61.us-east-2.compute.internal    Ready    worker,worker-perf          136m   v1.30.4
node/ip-10-0-79-232.us-east-2.compute.internal   Ready    control-plane,master        144m   v1.30.4
node/ip-10-0-86-124.us-east-2.compute.internal   Ready    worker                      139m   v1.30.4


    

Expected results:

MCPs should always report the right number of nodes
    

Additional info:

It is very similar to this other issue:
https://bugzilla.redhat.com/show_bug.cgi?id=2090436
which was discussed in this Slack conversation:
https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1653479831004619
    

Description of problem:

1. Creating a Normal User:

```
$ oc create user test
user.user.openshift.io/test created

$ oc get user
NAME       UID                                    FULL NAME   IDENTITIES
test       cef90f53-715e-4c10-9e26-c431d31de8c3               
```

This command worked as expected, and the user appeared correctly in both the CLI and the web console.

2. Using Special Characters:

```
$ oc create user test$*(
> test)
user.user.openshift.io/test(
test) created

$ oc get user
NAME       UID                                    FULL NAME   IDENTITIES
test       cef90f53-715e-4c10-9e26-c431d31de8c3               
test(...   50f2ad2b-1385-4b3c-b32c-b84531808864
```

In this case, the user was created successfully and displayed correctly in the web console as test( test). However, the CLI output was not as expected.

3. Handling Quoted Names:

```
$ oc create user test'
> test'

$ oc get user
NAME       UID                                    FULL NAME   IDENTITIES
test...    1fdaadf0-7522-4d38-9894-ee046a58d835
```

Similarly, creating a user with quotes produced a discrepancy: the CLI displayed test..., but the web console showed it as test test.

Version-Release number of selected component (if applicable):

4.16

How reproducible:

100%

Steps to Reproduce:

Given in the description.     

Actual results:

The user list is not displayed properly.

Expected results:

1. Users should not be created with a line break in their name.
2. If such users can be created, they should be displayed properly.

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-operator/pull/270

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

After upgrading from 4.12 to 4.14, the customer reports that the pods cannot reach their service when a NetworkAttachmentDefinition is set.

How reproducible:

    Create a NetworkAttachmentDefinition

Steps to Reproduce:

    1. Create a pod with a service.
    2. Curl the service from inside the pod. It works.
    3. Create a NetworkAttachmentDefinition.
    4. The same curl no longer works.

Actual results:

Pod does not reach service    

Expected results:

Pod reaches service 

Additional info:

    Specifically updating the bug overview for posterity: the issue is that we have pods set up with an exposed port (8080, though the port doesn't matter) and a service with one endpoint pointing to that specific pod. We can call OTHER PODS in the same namespace via their single-endpoint services, but we cannot call OURSELVES from inside the pod.

The issue is with the hairpin loopback return path. It is not affected by NetworkPolicy and appears (as discovered later in this Jira) to be caused by asymmetric routing on the return path to the container after traffic leaves the local net.

This behavior is only observed when a NetworkAttachmentDefinition is added to the pod and appears to be an issue with the way route rules are defined.

A workaround is available: inject the container with a route explicitly, or modify the NetworkAttachmentDefinition to ensure a loopback route is available to the container's network namespace.
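As an illustration of the second workaround, a minimal sketch of a NetworkAttachmentDefinition whose static IPAM section adds an explicit return route; the interface, addresses, and destination CIDR are placeholders and must be adapted to the affected network:

```
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: example-macvlan            # placeholder
  namespace: example-ns            # placeholder
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens3",
      "mode": "bridge",
      "ipam": {
        "type": "static",
        "addresses": [
          { "address": "192.0.2.10/24" }
        ],
        "routes": [
          { "dst": "172.30.0.0/16", "gw": "192.0.2.1" }
        ]
      }
    }
```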

KCS for this problem with workarounds + patch fix versions (when available): https://access.redhat.com/solutions/7084866 

Description of problem:

Unable to deploy a performance profile on a multi-NodePool HyperShift cluster

Version-Release number of selected component (if applicable):

Server Version: 4.17.0-0.nightly-2024-07-28-191830 (management cluster)
Server Version: 4.17.0-0.nightly-2024-08-08-013133 (hosted cluster)

How reproducible:

    Always

Steps to Reproduce:

    1. In a multi-NodePool HyperShift cluster, attach a unique performance profile to each NodePool (see the sketch below).
    2. Check the ConfigMap and NodePool status.
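A hedged sketch of the NodePool spec fragment used for step 1, attaching a tuning ConfigMap (which carries the PerformanceProfile manifest) per NodePool via `spec.tuningConfig`; the ConfigMap name is illustrative and must exist in the NodePool's namespace:

```
# Fragment of a NodePool in the "clusters" namespace
spec:
  tuningConfig:
  - name: perfprofile-foobar2      # ConfigMap containing the PerformanceProfile manifest (illustrative name)
```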

Actual results:

root@helix52:~# oc get cm -n clusters-foobar2 | grep foo
kubeletconfig-performance-foobar2            1      21h
kubeletconfig-pp2-foobar3                    1      21h
machineconfig-performance-foobar2            1      21h
machineconfig-pp2-foobar3                    1      21h
nto-mc-foobar2                               1      21h
nto-mc-foobar3                               1      21h
performance-foobar2                          1      21h
pp2-foobar3                                  1      21h
status-performance-foobar2                   1      21h
status-pp2-foobar3                           1      21h
tuned-performance-foobar2                    1      21h
tuned-pp2-foobar3                            1      21h
root@helix52:~# oc get np
NAME      CLUSTER   DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION                         UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
foobar2   foobar2   2               2               False         False        4.17.0-0.ci-2024-08-08-225819   False             True             
foobar3   foobar2   1               1               False         False        4.17.0-0.ci-2024-08-08-225819   False             True      
HyperShift pod logs:

{"level":"debug","ts":"2024-08-14T08:54:27Z","logger":"events","msg":"there cannot be more than one PerformanceProfile ConfigMap status per NodePool. found: 2 NodePool: foobar3","type":"Warning","object":{"kind":"NodePool","namespace":"clusters","name":"foobar3","uid":"c2ba814a-31fe-409d-88c2-b4e6b9a41b26","apiVersion":"hypershift.openshift.io/v1beta1","resourceVersion":"6411003"},"reason":"ReconcileError"}

Expected results:

   The performance profile should apply correctly to both NodePools

Additional info:

    

Please review the following PR: https://github.com/openshift/cloud-provider-openstack/pull/294

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

There are two additional zones, syd05 and us-east (dal13), that have PER capabilities but are not present in the installer. Add them.

Version-Release number of selected component (if applicable):

4.18.0    

Description of problem:

When running oc-mirror in mirror-to-disk mode in an air-gapped environment with `graph: true`, and with the UPDATE_URL_OVERRIDE environment variable defined, oc-mirror still reaches out to api.openshift.com to get graph.tar.gz. This causes the mirroring to fail, as this URL is not reachable from an air-gapped environment.
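For reference, a minimal sketch of the ImageSetConfiguration used in this scenario (the channel name is illustrative; adjust the apiVersion if using the oc-mirror v2 schema):

```
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
mirror:
  platform:
    graph: true              # requests graph.tar.gz, which should honor UPDATE_URL_OVERRIDE
    channels:
    - name: stable-4.16      # illustrative channel
```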
    

Version-Release number of selected component (if applicable):

WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.16.0-202407260908.p0.gdfed9f1.assembly.stream.el9-dfed9f1", GitCommit:"dfed9f10cd9aabfe3fe8dae0e6a8afe237c901ba", GitTreeState:"clean", BuildDate:"2024-07-26T09:52:14Z", GoVersion:"go1.21.11 (Red Hat 1.21.11-1.el9_4) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Set up OSUS in a reachable network.
    2. Cut all internet connectivity except for the mirror registry and the OSUS service.
    3. Run oc-mirror in mirror-to-disk mode with `graph: true` in the ImageSetConfiguration.
    

Actual results:


    

Expected results:

Should not fail
    

Additional info:


    

Description of problem:

IBM Cloud CCM was reconfigured to use loopback as the bind address in 4.16. However, the liveness probe was not configured to use loopback too, so the CCM constantly fails the liveness probe and restarts continuously.    
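Not necessarily what the eventual fix does, but a hedged sketch of a liveness probe aligned with a loopback bind address (port and path taken from the probe failures shown below; the delay values are illustrative):

```
livenessProbe:
  httpGet:
    host: 127.0.0.1          # match the loopback bind address instead of the pod IP
    path: /healthz
    port: 10258
    scheme: HTTPS
  initialDelaySeconds: 30    # illustrative values
  periodSeconds: 10
```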

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    100%

Steps to Reproduce:

    1. Create a IPI cluster on IBM Cloud
    2. Watch the IBM Cloud CCM pod and restarts, increase every 5 mins (liveness probe timeout)
    

Actual results:

    # oc --kubeconfig cluster-deploys/eu-de-4.17-rc2-3/auth/kubeconfig get po -n openshift-cloud-controller-manager
NAME                                            READY   STATUS             RESTARTS          AGE
ibm-cloud-controller-manager-58f7747d75-j82z8   0/1     CrashLoopBackOff   262 (39s ago)     23h
ibm-cloud-controller-manager-58f7747d75-l7mpk   0/1     CrashLoopBackOff   261 (2m30s ago)   23h



  Normal   Killing     34m (x2 over 40m)    kubelet            Container cloud-controller-manager failed liveness probe, will be restarted
  Normal   Pulled      34m (x2 over 40m)    kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5ac9fb24a0e051aba6b16a1f9b4b3f9d2dd98f33554844953dd4d1e504fb301e" already present on machine
  Normal   Created     34m (x3 over 45m)    kubelet            Created container cloud-controller-manager
  Normal   Started     34m (x3 over 45m)    kubelet            Started container cloud-controller-manager
  Warning  Unhealthy   29m (x8 over 40m)    kubelet            Liveness probe failed: Get "https://10.242.129.4:10258/healthz": dial tcp 10.242.129.4:10258: connect: connection refused
  Warning  ProbeError  3m4s (x22 over 40m)  kubelet            Liveness probe error: Get "https://10.242.129.4:10258/healthz": dial tcp 10.242.129.4:10258: connect: connection refused
body:

Expected results:

    CCM runs continuously, as it does on 4.15

# oc --kubeconfig cluster-deploys/eu-de-4.15.10-1/auth/kubeconfig get po -n openshift-cloud-controller-manager
NAME                                            READY   STATUS    RESTARTS   AGE
ibm-cloud-controller-manager-66d4779cb8-gv8d4   1/1     Running   0          63m
ibm-cloud-controller-manager-66d4779cb8-pxdrs   1/1     Running   0          63m

Additional info:

    IBM Cloud has a PR open to fix the liveness probe.
https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/360

Description of problem:

The BuildConfig form breaks when the Git URL is entered manually after selecting Git as the source type.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Navigate to the Create BuildConfig form page.
    2. Select Git as the source type.
    3. Enter the Git URL by typing it manually; do not paste it or select it from the suggestions.
    

Actual results:

Console breaks    

Expected results:

   The console should not break, and the user should be able to create the BuildConfig.

Additional info:

    

https://github.com/openshift/machine-api-provider-azure/tree/main/pkg/cloud/azure/services/virtualnetworks

This package is not used within MAPI, but its presence indicates that the operator needs permissions over VNets, specifically to delete VNets. This is a sensitive permission that, if exercised, could lead to an unrecoverable cluster, or to the deletion of other critical infrastructure within the same Azure subscription or resource group that is not related to the cluster itself. This package should be removed, along with the relevant permissions from the CredentialsRequest.

Tracker issue for bootimage bump in 4.18. This issue should block issues which need a bootimage bump to fix.

Description of problem:

gcp destroy fails to acknowledge the deletion of forwarding rules that have already been removed. Did you intend to change the logic here? The new version appears to ignore the case where the error is http.StatusNotFound (i.e., the resource is already deleted).

time="2024-10-03T23:05:47Z" level=debug msg="Listing regional forwarding rules"
time="2024-10-03T23:05:47Z" level=debug msg="Deleting regional forwarding rule jstuever28743-9q9lk-api-internal"
time="2024-10-03T23:05:47Z" level=debug msg="Deleting regional forwarding rule a36027772a1a948d08721afe4e52fcd4"
time="2024-10-03T23:05:47Z" level=debug msg="Deleting regional forwarding rule a36027772a1a948d08721afe4e52fcd4"
time="2024-10-03T23:05:47Z" level=debug msg="Deleting regional forwarding rule jstuever28743-9q9lk-api-internal"
time="2024-10-03T23:05:47Z" level=debug msg="Listing global forwarding rules"
time="2024-10-03T23:05:47Z" level=debug msg="Deleting global forwarding rule a36027772a1a948d08721afe4e52fcd4"
time="2024-10-03T23:05:48Z" level=debug msg="Deleting global forwarding rule jstuever28743-9q9lk-api-internal"
time="2024-10-03T23:05:48Z" level=debug msg="Deleting global forwarding rule a36027772a1a948d08721afe4e52fcd4"
time="2024-10-03T23:05:48Z" level=debug msg="Deleting global forwarding rule jstuever28743-9q9lk-api-internal"
time="2024-10-03T23:05:48Z" level=debug msg="Listing target pools"
time="2024-10-03T23:05:48Z" level=debug msg="Listing instance groups"
time="2024-10-03T23:05:49Z" level=debug msg="Listing target tcp proxies"
time="2024-10-03T23:05:49Z" level=debug msg="Listing target tcp proxies"
time="2024-10-03T23:05:49Z" level=debug msg="Listing backend services"
time="2024-10-03T23:05:49Z" level=debug msg="Listing backend services"
time="2024-10-03T23:05:49Z" level=debug msg="Deleting backend service a36027772a1a948d08721afe4e52fcd4"
time="2024-10-03T23:05:49Z" level=info msg="Deleted backend service a36027772a1a948d08721afe4e52fcd4"
time="2024-10-03T23:05:49Z" level=debug msg="Backend services: 1 global backend service pending"

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    Looping on destroy

Expected results:

Destroy successful    

Additional info:

    HIVE team found this bug.

Description of problem:

Creating a C2S/SC2S cluster via the Cluster API produced the following error:

time="2024-05-06T00:57:17-04:00" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: infrastructure was not ready within 15m0s: timed out waiting for the condition"

    

Version-Release number of selected component (if applicable):


    

How reproducible:

4.16.0-0.nightly-2024-05-05-102537
    

Steps to Reproduce:

1. Install a C2S or an SC2S cluster via Cluster API

    

Actual results:

See description

    

Expected results:


    

Additional info:

Cluster could be created successfully on C2S/SC2S
    

Description of problem:

On the Administrator -> Observe -> Dashboards page, clicking the dropdown lists for "Time Range" and "Refresh Interval" gives no response.
On the Observe -> Metrics page (for both the Administrator and Developer perspectives), clicking the dropdown list beside "Actions" (which initially reads "Refresh off") gives no response.
The error “react-dom.production.min.js:101 Uncaught TypeError: r is not a function” appears in the F12 developer console.

    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-10-07-200953
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Refer to description
    2.
    3.
    

Actual results:

1. The dropdown lists do not work. The error “react-dom.production.min.js:101 Uncaught TypeError: r is not a function” appears in the F12 developer console.
    

Expected results:

1. Dropdown list should work fine.
    

Additional info:


    

Description of problem:

    If multiple NICs are configured in install-config, the installer provisions nodes properly but fails during bootstrap due to API validation. Releases newer than 4.17 will support multiple NICs; releases older than 4.17 will not and will fail.

Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: [#1672] failed to create some manifests:
Aug 15 18:30:57 2.252.83.01.in-addr.arpa cluster-bootstrap[4889]: "cluster-infrastructure-02-config.yml": failed to create infrastructures.v1.config.openshift.io/cluster -n : Infrastructure.config.openshift.io "cluster" is invalid: [spec.platformSpec.vsphere.failureDomains[0].topology.networks: Too many: 2: must have at most 1 items, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
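For reference, a hedged fragment of the install-config that triggers this validation, with two networks under a vSphere failure domain (names are placeholders, and other required failure-domain fields are omitted):

```
platform:
  vsphere:
    failureDomains:
    - name: fd-1                   # placeholder
      topology:
        networks:
        - port-group-a             # placeholder port group names
        - port-group-b
```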

 

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

The test case occasionally flakes:

--- FAIL: TestRunGraph (1.04s)
    --- FAIL: TestRunGraph/mid-task_cancellation_with_work_in_queue_does_not_deadlock (0.01s)
        task_graph_test.go:943: unexpected error: [context canceled context canceled]

Version-Release number of selected component (if applicable):

Reproducible with current CVO git master revision 00d0940531743e6a0e8bbba151f68c9031bf0df6

How reproducible:

Reproduces fairly well with --race and repeated iterations

Steps to Reproduce:

1. go test --count 30 --race ./pkg/payload/...

Actual results:

Some failures

Expected results:

no failures

Additional info:

We have seen this occasionally flake over the last few months. I finally isolated it, but I didn't feel like digging into the timing test code, so I'm at least filing it instead.

Description of problem:

Pages under "Observe" -> "Alerting" show "Not found" when no resources are found.
    

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-07-11-082305
    

How reproducible:


    

Steps to Reproduce:

    1. Check the tabs under "Observe" -> "Alerting" ("Alerts", "Silences", "Alerting rules") when there are no related resources.
    2.
    3.
    

Actual results:

1. 'Not found' is shown under each tab.
    

Expected results:

1. It would be better to show "No <resource> found" like other resource pages, e.g. "No Deployments found".
    

Additional info:


    

Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/162

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

In some cases, the tmp files created by the resolv-prepender script are not removed on on-prem platforms.

Version-Release number of selected component (if applicable):

4.18    

How reproducible:

When deploying Shift on Stack, check /tmp; we should not see any tmp.XXX files anymore.

Actual results:

tmp files are there

Expected results:

tmp files are removed when not needed anymore

Please review the following PR: https://github.com/openshift/csi-operator/pull/242

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/openshift-state-metrics/pull/118

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/machine-api-provider-openstack/pull/125

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Component Readiness has found a potential regression in the following test:

[sig-node][apigroup:config.openshift.io] CPU Partitioning node validation should have correct cpuset and cpushare set in crio containers [Suite:openshift/conformance/parallel]

Probability of significant regression: 100.00%

Sample (being evaluated) Release: 4.18
Start Time: 2024-08-14T00:00:00Z
End Time: 2024-08-21T23:59:59Z
Success Rate: 94.89%
Successes: 128
Failures: 7
Flakes: 2

Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 647
Failures: 0
Flakes: 15

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=azure&Platform=azure&Scheduler=default&SecurityMode=default&Suite=unknown&Suite=unknown&Topology=ha&Topology=ha&Upgrade=micro&Upgrade=micro&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=Other&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Node%20%2F%20Kubelet&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20azure%20unknown%20ha%20micro&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-08-21%2023%3A59%3A59&sampleRelease=4.18&sampleStartTime=2024-08-14%2000%3A00%3A00&testId=openshift-tests%3A9292c0072700a528a33e44338d37a514&testName=%5Bsig-node%5D%5Bapigroup%3Aconfig.openshift.io%5D%20CPU%20Partitioning%20node%20validation%20should%20have%20correct%20cpuset%20and%20cpushare%20set%20in%20crio%20containers%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D

The test is permafailing on the latest payloads on multiple platforms, not just Azure. It seems to coincide with the arrival of the 4.18 RHCOS images.

{  fail [github.com/openshift/origin/test/extended/cpu_partitioning/crio.go:166]: error getting crio container data from node ci-op-z5sh003f-431b2-r2nm4-master-0
Unexpected error:
    <*errors.errorString | 0xc001e80190>: 
    err execing command jq: error (at <stdin>:1): Cannot index array with string "info"
    jq: error (at <stdin>:1): Cannot iterate over null (null)
    {
        s: "err execing command jq: error (at <stdin>:1): Cannot index array with string \"info\"\njq: error (at <stdin>:1): Cannot iterate over null (null)",
    }
occurred
Ginkgo exit error 1: exit with code 1}

The script involved is likely in: https://github.com/openshift/origin/blob/a365380cb3a39cfc26b9f28f04b66418c993a879/test/extended/cpu_partitioning/crio.go#L4

Nightly payloads are fully blocked as multiple blocking aggregated jobs are permafailing this test.

Example failed test:

4/1291 Tests Failed: user system:serviceaccount:openshift-infra:serviceaccount-pull-secrets-controller in ns/openshift-infra must not produce too many applies

{had 7618 applies, check the audit log and operator log to figure out why; details in audit log}

Description of problem:

Some references point to files that do not exist, e.g., `NetworkPolicyListPage` in `console-app` and `functionsComponent` in `knative-plugin`.

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    Always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    The TestNodePoolReplaceUpgrade e2e test on OpenStack is experiencing common failures like this one: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_hypershift/4515/pull-ci-openshift-hypershift-main-e2e-openstack/1849445285156098048

After investigating this failure, it looks like the image rollout on OpenStack completes instantly, which gives the NodePool very little time between the node becoming ready and the NodePool status version being set.

The short amount of time causes a failure in this check: https://github.com/openshift/hypershift/blob/6f6a78b7ff2932087b47609c5a16436bad5aeb1c/test/e2e/nodepool_upgrade_test.go#L166

 

 

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    Flaky test

Steps to Reproduce:

    1. Run the openstack e2e
    2.
    3.
    

Actual results:

    TestNodePoolReplaceUpgrade fails

Expected results:

    TestNodePoolReplaceUpgrade passes

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/221

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

On the "Search" page, when searching for the Node resource and filtering by label, the filter doesn't work.
Similarly, clicking a label in the "Node selector" field on an MCP details page does not filter the nodes by that label.
    

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-04-08-024331
    

How reproducible:

always
    

Steps to Reproduce:

    1. On the "Search" page, choose the "Node (core/v1)" resource and filter with any label, e.g. "test=node" or "node-role.kubernetes.io/worker".
    2. On an MCP details page, click a label in the "Node selector" field.
    3.
    

Actual results:

1. The label filter doesn't work.
2. Nodes are listed without being filtered by the label.
    

Expected results:

1. Nodes should be filtered by the label.
2. Only nodes with the label should be shown.

    

Additional info:

Screenshot: https://drive.google.com/drive/folders/1XZh4MTOzgrzZKIT6HcZ44HFAAip3ENwT?usp=drive_link
    

Description of problem:

    The test tries to schedule pods on all workers but fails to schedule on infra nodes

 Warning  FailedScheduling  86s                default-scheduler  0/9 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 6 node(s) didn't match pod anti-affinity rules. preemption: 0/9 nodes are available: 3 Preemption is not helpful for scheduling, 6 No preemption victims found for incoming pod.

$ oc get nodes
NAME                          STATUS   ROLES                  AGE   VERSION
ostest-b6fns-infra-0-m4v7t    Ready    infra,worker           19h   v1.30.4
ostest-b6fns-infra-0-pllsf    Ready    infra,worker           19h   v1.30.4
ostest-b6fns-infra-0-vnbp8    Ready    infra,worker           19h   v1.30.4
ostest-b6fns-master-0         Ready    control-plane,master   19h   v1.30.4
ostest-b6fns-master-2         Ready    control-plane,master   19h   v1.30.4
ostest-b6fns-master-lmlxf-1   Ready    control-plane,master   17h   v1.30.4
ostest-b6fns-worker-0-h527q   Ready    worker                 19h   v1.30.4
ostest-b6fns-worker-0-kpvdx   Ready    worker                 19h   v1.30.4
ostest-b6fns-worker-0-xfcjf   Ready    worker                 19h   v1.30.4

Infra nodes should be excluded from the set of worker nodes used by the test
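For context, the infra nodes carry a taint that pods only tolerate with something like the following (a generic Kubernetes sketch, not a proposed fix for the test):

```
tolerations:
- key: node-role.kubernetes.io/infra
  operator: Exists
  effect: NoSchedule
```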

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-09-09-173813

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    The operator cannot successfully remove resources when networkAccess is set to Internal and the management state is then set to Removed.
    It looks like the authorization error changes from bloberror.AuthorizationPermissionMismatch to bloberror.AuthorizationFailure after the storage account becomes private (networkAccess: Internal).
    This is caused either by odd behavior in the Azure SDK or in the Azure API itself.
    The easiest way to solve it is to also handle bloberror.AuthorizationFailure here: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/storage/azure/azure.go?plain=1#L1145

    The error condition is the following:

status:
  conditions:
  - lastTransitionTime: "2024-09-27T09:04:20Z"
    message: "Unable to delete storage container: DELETE https://imageregistrywxj927q6bpj.blob.core.windows.net/wxj-927d-jv8fc-image-registry-rwccleepmieiyukdxbhasjyvklsshhee\n--------------------------------------------------------------------------------\nRESPONSE
      403: 403 This request is not authorized to perform this operation.\nERROR CODE:
      AuthorizationFailure\n--------------------------------------------------------------------------------\n\uFEFF<?xml
      version=\"1.0\" encoding=\"utf-8\"?><Error><Code>AuthorizationFailure</Code><Message>This
      request is not authorized to perform this operation.\nRequestId:ababfe86-301e-0005-73bd-10d7af000000\nTime:2024-09-27T09:10:46.1231255Z</Message></Error>\n--------------------------------------------------------------------------------\n"
    reason: AzureError
    status: Unknown
    type: StorageExists
  - lastTransitionTime: "2024-09-27T09:02:26Z"
    message: The registry is removed
    reason: Removed
    status: "True"
    type: Available 

Version-Release number of selected component (if applicable):

    4.18, 4.17, 4.16 (needs confirmation), 4.15 (needs confirmation)

How reproducible:

    Always

Steps to Reproduce:

    1. Get an Azure cluster
    2. In the operator config, set networkAccess to Internal (see the sketch after these steps)
    3. Wait until the operator reconciles the change (watch networkAccess in status with `oc get configs.imageregistry/cluster -oyaml |yq '.status.storage'`)
    4. In the operator config, set management state to removed: `oc patch configs.imageregistry/cluster -p '{"spec":{"managementState":"Removed"}}' --type=merge`
    5. Watch the cluster operator conditions for the error
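A hedged sketch of the config fragment from step 2 (the field path shown follows the image registry operator's Azure storage configuration and is illustrative):

```
# Fragment of configs.imageregistry.operator.openshift.io/cluster
spec:
  storage:
    azure:
      networkAccess:
        type: Internal
```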

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

4.17: [VSphereCSIDriverOperator] [Upgrade] VMwareVSphereControllerDegraded: runtime error: invalid memory address or nil pointer dereference 

A UPI-installed vSphere cluster upgrade failed because the cluster storage operator (CSO) became degraded.
Upgrade path: 4.8 -> 4.17

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-10-12-174022      

How reproducible:

 Always   

Steps to Reproduce:

    1. Install the OCP cluster on vSphere by UPI with version 4.8.
    2. Upgrade the cluster to 4.17 nightly.
    

Actual results:

    In step 2, the upgrade failed on the 4.16 -> 4.17 hop.

Expected results:

    In Step 2: The upgrade should be successful.

Additional info:

$ omc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-10-12-102620   True        True          1h8m    Unable to apply 4.17.0-0.nightly-2024-10-12-174022: wait has exceeded 40 minutes for these operators: storage
$ omc get co storage
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE
storage   4.17.0-0.nightly-2024-10-12-174022   True        True          True       15h  
$  omc get co storage -oyaml   
...
status:
  conditions:
  - lastTransitionTime: "2024-10-13T17:22:06Z"
    message: |-
      VSphereCSIDriverOperatorCRDegraded: VMwareVSphereControllerDegraded: panic caught:
      VSphereCSIDriverOperatorCRDegraded: VMwareVSphereControllerDegraded: runtime error: invalid memory address or nil pointer dereference
    reason: VSphereCSIDriverOperatorCR_VMwareVSphereController_SyncError
    status: "True"
    type: Degraded
...

$ omc logs vmware-vsphere-csi-driver-operator-5c7db457-nffp4|tail -n 50
2024-10-13T19:00:02.531545739Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).runWorker(0xc000ab7e00?, {0x3900f30?, 0xc0000b9ae0?})
2024-10-13T19:00:02.531545739Z     github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:183 +0x4d
2024-10-13T19:00:02.531545739Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run.func2()
2024-10-13T19:00:02.531545739Z     github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:117 +0x65
2024-10-13T19:00:02.531545739Z created by github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run in goroutine 500
2024-10-13T19:00:02.531545739Z     github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:112 +0x2c9
2024-10-13T19:00:02.534308382Z I1013 19:00:02.532858       1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-csi-drivers", Name:"vmware-vsphere-csi-driver-operator", UID:"e44ce388-4878-4400-afae-744530b62281", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'Vmware-Vsphere-Csi-Driver-OperatorPanic' Panic observed: runtime error: invalid memory address or nil pointer dereference
2024-10-13T19:00:03.532125885Z E1013 19:00:03.532044       1 config_yaml.go:208] Unmarshal failed: yaml: unmarshal errors:
2024-10-13T19:00:03.532125885Z   line 1: cannot unmarshal !!seq into config.CommonConfigYAML
2024-10-13T19:00:03.532498631Z I1013 19:00:03.532460       1 config.go:272] ReadConfig INI succeeded. INI-based cloud-config is deprecated and will be removed in 2.0. Please use YAML based cloud-config.
2024-10-13T19:00:03.532708025Z I1013 19:00:03.532571       1 config.go:283] Config initialized
2024-10-13T19:00:03.533270439Z E1013 19:00:03.533160       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
2024-10-13T19:00:03.533270439Z goroutine 701 [running]:
2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/runtime.logPanic({0x2cf3100, 0x54fd210})
2024-10-13T19:00:03.533270439Z     k8s.io/apimachinery@v0.30.3/pkg/util/runtime/runtime.go:75 +0x85
2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0xc0014c54e8, 0x1, 0xc000e7e1c0?})
2024-10-13T19:00:03.533270439Z     k8s.io/apimachinery@v0.30.3/pkg/util/runtime/runtime.go:49 +0x6b
2024-10-13T19:00:03.533270439Z panic({0x2cf3100?, 0x54fd210?})
2024-10-13T19:00:03.533270439Z     runtime/panic.go:770 +0x132
2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller.(*VSphereController).createVCenterConnection(0xc0008b2788, {0xc0022cf600?, 0xc0014c57c0?}, 0xc0006a3448)
2024-10-13T19:00:03.533270439Z     github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller/vspherecontroller.go:491 +0x94
2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller.(*VSphereController).loginToVCenter(0xc0008b2788, {0x3900f30, 0xc0000b9ae0}, 0x3377a7c?)
2024-10-13T19:00:03.533270439Z     github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller/vspherecontroller.go:446 +0x5e
2024-10-13T19:00:03.533270439Z github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller.(*VSphereController).sync(0xc0008b2788, {0x3900f30, 0xc0000b9ae0}, {0x38ee700, 0xc0011d08d0})
2024-10-13T19:00:03.533270439Z     github.com/openshift/vmware-vsphere-csi-driver-operator/pkg/operator/vspherecontroller/vspherecontroller.go:240 +0x6fc
2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).reconcile(0xc000b3ecf0, {0x3900f30, 0xc0000b9ae0}, {0x38ee700?, 0xc0011d08d0?})
2024-10-13T19:00:03.533270439Z     github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:201 +0x43
2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).processNextWorkItem(0xc000b3ecf0, {0x3900f30, 0xc0000b9ae0})
2024-10-13T19:00:03.533270439Z     github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:260 +0x1ae
2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).runWorker.func1({0x3900f30, 0xc0000b9ae0})
2024-10-13T19:00:03.533270439Z     github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:192 +0x89
2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
2024-10-13T19:00:03.533270439Z     k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:259 +0x1f
2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc002bb1e80?)
2024-10-13T19:00:03.533270439Z     k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:226 +0x33
2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0014c5f10, {0x38cf7e0, 0xc00142b470}, 0x1, 0xc0013ae960)
2024-10-13T19:00:03.533270439Z     k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:227 +0xaf
2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00115bf10, 0x3b9aca00, 0x0, 0x1, 0xc0013ae960)
2024-10-13T19:00:03.533270439Z     k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:204 +0x7f
2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext({0x3900f30, 0xc0000b9ae0}, 0xc00115bf70, 0x3b9aca00, 0x0, 0x1)
2024-10-13T19:00:03.533270439Z     k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:259 +0x93
2024-10-13T19:00:03.533270439Z k8s.io/apimachinery/pkg/util/wait.UntilWithContext(...)
2024-10-13T19:00:03.533270439Z     k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:170
2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).runWorker(0xc000ab7e00?, {0x3900f30?, 0xc0000b9ae0?})
2024-10-13T19:00:03.533270439Z     github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:183 +0x4d
2024-10-13T19:00:03.533270439Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run.func2()
2024-10-13T19:00:03.533270439Z     github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:117 +0x65
2024-10-13T19:00:03.533270439Z created by github.com/openshift/library-go/pkg/controller/factory.(*baseController).Run in goroutine 500
2024-10-13T19:00:03.533270439Z     github.com/openshift/library-go@v0.0.0-20240904190755-22d0c848b7a2/pkg/controller/factory/base_controller.go:112 +0x2c9

Starting about 5/24 or 5/25, we see a massive increase in the number of watch establishments from all clients to the kube-apiserver during non-upgrade jobs. While this could theoretically mean that every single client merged a bug on the same day, the more likely explanation is that the kube update exposed or introduced some kind of bug.

 

This is a clear regression and it is only present on 4.17, not 4.16.  It is present across all platforms, though I've selected AWS for links and screenshots.

 

4.17 graph - shows the change

4.16 graph - shows no change

slack thread if there are questions

courtesy screen shot

Description of problem:

    After changing the LB type from CLB to NLB, the "status.endpointPublishingStrategy.loadBalancer.providerParameters.aws.classicLoadBalancer" field is still there, but if a new NLB ingresscontroller is created, "classicLoadBalancer" does not appear.

// after changing default ingresscontroller to NLB
$ oc -n openshift-ingress-operator get ingresscontroller/default -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
classicLoadBalancer:                   <<<< 
  connectionIdleTimeout: 0s            <<<<
networkLoadBalancer: {}
type: NLB

// create new ingresscontroller with NLB
$ oc -n openshift-ingress-operator get ingresscontroller/nlb -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
networkLoadBalancer: {}
type: NLB



Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-2024-08-08-013133

How reproducible:

    100%

Steps to Reproduce:

    1. Change the default ingresscontroller to NLB
$ oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"type":"LoadBalancerService","loadBalancer":{"providerParameters":{"type":"AWS","aws":{"type":"NLB"}},"scope":"External"}}}}'

    2. Create a new ingresscontroller with NLB
kind: IngressController
apiVersion: operator.openshift.io/v1
metadata:
  name: nlb
  namespace: openshift-ingress-operator
spec:
  domain: nlb.<base-domain>
  replicas: 1
  endpointPublishingStrategy:
    loadBalancer:
      providerParameters:
        aws:
          type: NLB
        type: AWS
      scope: External
    type: LoadBalancerService

    3. Check the status of both ingresscontrollers
    

Actual results:

// after changing default ingresscontroller to NLB 
$ oc -n openshift-ingress-operator get ingresscontroller/default -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
classicLoadBalancer:
  connectionIdleTimeout: 0s
networkLoadBalancer: {}
type: NLB
 
// new ingresscontroller with NLB
$ oc -n openshift-ingress-operator get ingresscontroller/nlb -oyaml | yq .status.endpointPublishingStrategy.loadBalancer.providerParameters.aws
networkLoadBalancer: {}
type: NLB
 

Expected results:

    If type=NLB, then "classicLoadBalancer" should not appear in the status, and the status should stay consistent whether an existing ingresscontroller is changed to NLB or a new one is created with NLB.

Additional info:

    

Description of problem:

   Compared with the same behavior on OCP 4.17, the shortname search function on OCP 4.18 is not working

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-10-16-094159

How reproducible:

    Always

Steps to Reproduce:

    1. Create a CRD resource with code https://github.com/medik8s/fence-agents-remediation/blob/main/config/crd/bases/fence-agents-remediation.medik8s.io_fenceagentsremediationtemplates.yaml
    2. Navigate to Home -> Search page
    3. Use Shortname 'FAR' to search the created resource 'FenceAgentsRemediationTemplates'
    4. Search the resource with shortname 'AM' for example

Actual results:

    3. No result is found
    4. The first result in the dropdown list is 'Config (sample.operator.openshift)', which is incorrect

Expected results:

    3. The resource 'FenceAgentsRemediationTemplates' should be listed in the dropdown
    4. The first result in the dropdown list should be 'Alertmanager'

Additional info:

    

Please review the following PR: https://github.com/openshift/frr/pull/64

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

{  fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:134]: user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta3.flowcontrol.apiserver.k8s.io 6 times
 

All jobs failed on https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-aws-ovn-upgrade-4.18-minor-release-openshift-release-analysis-aggregator/1846018782808510464

Please review the following PR: https://github.com/openshift/machine-api-operator/pull/1283

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-api-provider-azure/pull/319

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.

Component name: ose-azure-cluster-api-controllers-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml

Description of problem:

When deploying 4.16, the customer identified an inbound rule security risk for the "node" security group allowing access from 0.0.0.0/0 to the node port range 30000-32767.
This issue did not exist in versions prior to 4.16, so we suspect this may be a regression. It seems to be related to the use of CAPI, which could have changed the behavior.
Trying to understand why this was allowed.

Version-Release number of selected component (if applicable):

4.16

How reproducible:

  

Steps to Reproduce:

    1. Install 4.16 cluster

*** On 4.12 installations, this is not the case ***
    

Actual results:

The installer configures an inbound rule for the node security group allowing access from 0.0.0.0/0 for port range 30000-32767.     

Expected results:

The installer should *NOT* create an inbound security rule allowing access to node port range 30000-32767 from any CIDR range (0.0.0.0/0)

Additional info:

#forum-ocp-cloud slack discussion:
https://redhat-internal.slack.com/archives/CBZHF4DHC/p1728484197441409

Relevant Code :

https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/v2.4.0/pkg/cloud/services/securitygroup/securitygroups.go#L551

Description of problem:

Despite passing in '--attach-default-network false', the nodepool still has attachDefaultNetwork: true


hcp create cluster kubevirt --name ocp-lab-int-6 --base-domain paas.com --cores 6 --memory 64Gi --additional-network "name:default/ppcore-547" --attach-default-network false --cluster-cidr 100.64.0.0/20 --service-cidr 100.64.16.0/20 --network-type OVNKubernetes --node-pool-replicas 3 --ssh-key ~/deploy --pull-secret pull-secret.txt --release-image quay.io/openshift-release-dev/ocp-release:4.16.18-x86_64

  platform:
    kubevirt:
      additionalNetworks:
        - name: default/ppcore-547
      attachDefaultNetwork: true


    

Version-Release number of selected component (if applicable):

Client Version: openshift/hypershift: b9e977da802d07591cd9fb8ad91ba24116f4a3a8. Latest supported OCP: 4.17.0
Server Version: b9e977da802d07591cd9fb8ad91ba24116f4a3a8
Server Supports OCP Versions: 4.17, 4.16, 4.15, 4.14 
    

How reproducible:


    

Steps to Reproduce:

    1. hcp install as per the above
    2.
    3.
    

Actual results:

The default network is attached
    

Expected results:

No default network 
    

Additional info:


    

Description of problem:

When running on a FIPS-enabled cluster, the e2e test TestFirstBootHasSSHKeys times out.

Version-Release number of selected component (if applicable):

    

How reproducible:

Always

Steps to Reproduce:

1. Open a PR to the MCO repository.
2. Run the e2e-aws-ovn-fips-op job by commenting /test e2e-aws-ovn-fips-op (this job does not run automatically).
3. Eventually, the test will fail.

Actual results:

=== RUN   TestFirstBootHasSSHKeys
    mcd_test.go:1019: did not get new node
--- FAIL: TestFirstBootHasSSHKeys (1201.83s)

Expected results:

=== RUN   TestFirstBootHasSSHKeys
    mcd_test.go:929: Got ssh key file data: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
        
--- PASS: TestFirstBootHasSSHKeys (334.86s)

Additional info:

It looks like we're hitting a 20-minute timeout during the test. By comparison, the passing case seems to execute in approximately 5.5 minutes.

I have two preliminary hypotheses for this:
1. This operation takes longer in FIPS-enabled clusters for some reason.
2. It is possible that this is occurring due to a difference in which cloud these tests run in. Our normal e2e-gcp-op tests run in GCP, whereas this test suite runs in AWS. The underlying operations performed by the Machine API may just take longer in AWS than they do in GCP. If that is the case, this bug can be resolved as-is.

 

Failing job link: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4172/pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-fips-op/1757476347388628992

 

Must-Gather link: https://drive.google.com/file/d/12GhTIP9bgcoNje0Jvyhr-c-akV3XnGn2/view?usp=sharing

Error from SNYK code:

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/56687/rehearse-56687-pull-ci-openshift-hypershift-main-security/1834567227643269120 

✗ [High] Cross-site Scripting (XSS) 
  Path: ignition-server/cmd/start.go, line 250 
  Info: Unsanitized input from an HTTP header flows into Write, where it is used to render an HTML page returned to the user. This may result in a Reflected Cross-Site Scripting attack (XSS).
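
A generic sketch of the remediation pattern the finding points at (escape the reflected value and serve it as plain text); this is not the actual ignition-server handler, and the header name and route below are assumptions:

package main

import (
	"fmt"
	"html"
	"log"
	"net/http"
)

// handler reflects a request header back in the response. Escaping the value
// with html.EscapeString and forcing a text/plain content type keeps a
// malicious header value from being rendered as HTML/JS by a browser.
func handler(w http.ResponseWriter, r *http.Request) {
	token := r.Header.Get("Authorization") // header name is illustrative
	w.Header().Set("Content-Type", "text/plain; charset=utf-8")
	fmt.Fprintf(w, "bad token: %s\n", html.EscapeString(token))
}

func main() {
	http.HandleFunc("/ignition", handler)
	log.Fatal(http.ListenAndServe(":9090", nil))
}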

Enabling FIPS results in an error during machine-os-images /bin/copy-iso

 

    /bin/copy-iso: line 29: [: missing `]'

Description of problem:

    The namespace value on the Ingress details page is incorrect

Version-Release number of selected component (if applicable):

  4.18.0-0.nightly-2024-09-10-234322    

How reproducible:

    Always

Steps to Reproduce:

    1. Create a sample ingress into default namespace
    2. Navigate to Networking -> Ingresses -> Ingresses details page
       /k8s/ns/default/ingresses/<ingress sample name>
    3. Check the Namespace value
    

Actual results:

    It shows the Ingress name, which is incorrect

Expected results:

    It should show the value stored in metadata.namespace

Additional info:

    

Description of problem:

In the OpenShift WebConsole, when using the Instantiate Template screen, the values entered into the form are automatically cleared.

This issue occurs for users with developer roles who do not have administrator privileges, but does not occur for users with the cluster-admin cluster role. 


Additionally, using the developer tools of the web browser, I observed the following console logs when the values were cleared:


https://console-openshift-console.apps.mmatsuta-blue.apac.aws.cee.support/api/prometheus/api/v1/rules 403 (Forbidden)
https://console-openshift-console.apps.mmatsuta-blue.apac.aws.cee.support/api/alertmanager/api/v2/silences 403 (Forbidden)


It appears that a script attempting to fetch information periodically from PrometheusRule and Alertmanager's silences encounters a 403 error due to insufficient permissions, which causes the script to halt and the values in the form to be reset and cleared.


This bug prevents users from successfully creating instances from templates in the WebConsole.

Version-Release number of selected component (if applicable):

4.15 4.14 

How reproducible:

YES

Steps to Reproduce:

1. Log in with a non-administrator account.
2. Select a template from the developer catalog and click on Instantiate Template.
3. Enter values into the initially empty form.
4. Wait for several seconds, and the entered values will disappear.

Actual results:

Entered values disappear

Expected results:

Entered values remain in the form

Additional info:

I could not find the appropriate component to report this issue. I reluctantly chose Dev Console, but please adjust it to the correct component.

Description of problem

Router pods use the "hostnetwork" SCC even when they do not use the host network.

Version-Release number of selected component (if applicable)

All versions of OpenShift from 4.11 through 4.17.

How reproducible

100%.

Steps to Reproduce

1. Install a new cluster with OpenShift 4.11 or later on a cloud platform.

Actual results

The router-default pods do not use the host network, yet they use the "hostnetwork" SCC:

% oc -n openshift-ingress get pods -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default -o go-template --template='{{range .items}}{{.metadata.name}} {{with .metadata.annotations}}{{index . "openshift.io/scc"}}{{end}} {{.spec.hostNetwork}}{{"\n"}}{{end}}'
router-default-5ffd4ff7cd-mhhv6 hostnetwork <no value>
router-default-5ffd4ff7cd-wmqnj hostnetwork <no value>
% 

Expected results

The router-default pods should use the "restricted" SCC.

Additional info

We missed this change from the OCP 4.11 release notes:

The restricted SCC is no longer available to users of new clusters, unless the access is explicitly granted. In clusters originally installed in OpenShift Container Platform 4.10 or earlier, all authenticated users can use the restricted SCC when upgrading to OpenShift Container Platform 4.11 and later.

Artifacts from CI jobs confirm that router pods used "restricted" for new 4.10 clusters and for 4.10→4.11 upgraded clusters, and "hostnetwork" for new 4.11 clusters:

% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1790552355406614528/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]'
"restricted"
"restricted"
% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial/1790422949342220288/artifacts/e2e-aws-serial/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]'
"restricted"
"restricted"
% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1793013806733987840/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]'
"restricted"
"restricted"
% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-serial/1793013781534609408/artifacts/e2e-aws-serial/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]'
"hostnetwork"
"hostnetwork"
% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade/1793670820518694912/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]'
"hostnetwork"
"hostnetwork"
% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-serial/1793670819998601216/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]'
"hostnetwork"
"hostnetwork"
% curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-serial/1793062832263139328/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/pods.json' | jq '.items|.[]|select(.metadata.name|startswith("router-default-"))|.metadata.annotations["openshift.io/scc"]'
"hostnetwork"
"hostnetwork"
% 

Description of problem:

    The disk and instance types for GCP machines should be validated further. The current implementation validates each individually, but the disk types and instance types should also be checked against each other for valid combinations.

The attached spreadsheet displays the combinations of valid disk and instance types.
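
A minimal sketch in Go of what such a cross-check could look like; the family/disk-type entries in the table are illustrative assumptions, not the authoritative matrix from the attached spreadsheet:

package main

import "fmt"

// validDiskTypes maps a GCP machine-type family to the disk types assumed to
// be supported with it (entries are illustrative only).
var validDiskTypes = map[string]map[string]bool{
	"n2": {"pd-standard": true, "pd-ssd": true, "pd-balanced": true},
	"c3": {"pd-ssd": true, "pd-balanced": true, "hyperdisk-balanced": true},
	"n4": {"hyperdisk-balanced": true},
}

// validateDiskInstanceCombo rejects combinations of instance family and disk
// type that are not in the table, instead of validating each field in isolation.
func validateDiskInstanceCombo(family, diskType string) error {
	disks, ok := validDiskTypes[family]
	if !ok {
		return fmt.Errorf("unknown machine-type family %q", family)
	}
	if !disks[diskType] {
		return fmt.Errorf("disk type %q is not supported with %s machine types", diskType, family)
	}
	return nil
}

func main() {
	fmt.Println(validateDiskInstanceCombo("n4", "pd-standard")) // rejected combination
}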

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/cloud-provider-openstack/pull/296

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

1. We are making 2 API calls to get the logs for the PipelineRuns. Instead, we can use the `results.tekton.dev/record` annotation and replace `records` in the annotation value with `logs` to get the logs of the PipelineRuns (see the sketch below).

2. Tekton Results will return only the v1 version of PipelineRun and TaskRun from Pipelines 1.16, so the data type has to be the v1 version for 1.16 and v1beta1 for lower versions.
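
A minimal sketch of the annotation rewrite described in point 1, assuming the annotation value uses the usual .../records/... path segment; the namespace and UUIDs below are made up for illustration:

package main

import (
	"fmt"
	"strings"
)

// recordToLogsPath derives a Tekton Results logs path from the value of the
// results.tekton.dev/record annotation by swapping the "records" segment for
// "logs", avoiding the second API call.
func recordToLogsPath(record string) string {
	return strings.Replace(record, "/records/", "/logs/", 1)
}

func main() {
	record := "my-namespace/results/00000000-0000-0000-0000-000000000000/records/11111111-1111-1111-1111-111111111111"
	fmt.Println(recordToLogsPath(record))
	// my-namespace/results/00000000-0000-0000-0000-000000000000/logs/11111111-1111-1111-1111-111111111111
}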

Description of problem:

documentationBaseURL still points to 4.17

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-10-23-112324

How reproducible:

Always

Steps to Reproduce:

1. check documentationBaseURL on a 4.18 cluster
$ oc get cm console-config -n openshift-console -o yaml | grep documentationBaseURL
      documentationBaseURL: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.17/
2.
3.

Actual results:

documentationBaseURL still links to 4.17

Expected results:

documentationBaseURL should link to 4.18

Additional info:

 

Description of the problem:

Unbinding s390x (Z) hosts no longer reboots them into discovery. Instead the reclaim agent runs on the node and continuously reboots them. 

 

How reproducible:

 

Steps to reproduce:

1. Boot Z hosts with discovery image and install them to a cluster (original issue did so with hypershift) 

2. Unbind the hosts from the cluster (original issue scaled down nodepool) and watch as the hosts constantly reboot (not into discovery)

 

Actual results:

Hosts are not reclaimed, unbound, and ready to be used again. Instead they are stuck and constantly reboot.

Expected results:

Hosts are unbound and ready to be used.

 


Additional information

Contents of RHCOS boot config files

 

#  cat ostree-1-rhcos.conf 
title Red Hat Enterprise Linux CoreOS 415.92.202311241643-0 (Plow) (ostree:1)
version 1
options ignition.platform.id=metal ostree=/ostree/boot.0/rhcos/5a67059f4750a7dc58bd91275a6e148a5f6e88e4b48842a7969b3c22af6ecd95/0 root=UUID=36ac8acd-bf01-40e4-8043-3682716e3b91 rw rootflags=prjquota boot=UUID=879d4744-c4b2-4cd3-a4a3-ca601d7dadd7
linux /ostree/rhcos-5a67059f4750a7dc58bd91275a6e148a5f6e88e4b48842a7969b3c22af6ecd95/vmlinuz-5.14.0-284.41.1.el9_2.s390x
initrd /ostree/rhcos-5a67059f4750a7dc58bd91275a6e148a5f6e88e4b48842a7969b3c22af6ecd95/initramfs-5.14.0-284.41.1.el9_2.s390x.img
aboot /ostree/deploy/rhcos/deploy/01b96f07863b8bf16cb4e9a187fefe5bcc1b443a825a503355a1f658a2e856d7.0/usr/lib/ostree-boot/aboot.img
abootcfg /ostree/deploy/rhcos/deploy/01b96f07863b8bf16cb4e9a187fefe5bcc1b443a825a503355a1f658a2e856d7.0/usr/lib/ostree-boot/aboot.cfg

$ cat ostree-2-rhcos.conf 
title Red Hat Enterprise Linux CoreOS 415.92.202312250243-0 (Plow) (ostree:0)
version 2
options ignition.platform.id=metal ostree=/ostree/boot.0/rhcos/1023d42feb111a96705089345808aa014c74b171248026fd0be18949980bc322/0 root=UUID=36ac8acd-bf01-40e4-8043-3682716e3b91 rw rootflags=prjquota boot=UUID=879d4744-c4b2-4cd3-a4a3-ca601d7dadd7 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1="all" psi=1
linux /ostree/rhcos-1023d42feb111a96705089345808aa014c74b171248026fd0be18949980bc322/vmlinuz-5.14.0-284.45.1.el9_2.s390x
initrd /ostree/rhcos-1023d42feb111a96705089345808aa014c74b171248026fd0be18949980bc322/initramfs-5.14.0-284.45.1.el9_2.s390x.img
aboot /ostree/deploy/rhcos/deploy/90229475c67473a16f77b3679a5b7a3d90d268d70adf24668f14cf00c06d83e5.1/usr/lib/ostree-boot/aboot.img
abootcfg /ostree/deploy/rhcos/deploy/90229475c67473a16f77b3679a5b7a3d90d268d70adf24668f14cf00c06d83e5.1/usr/lib/ostree-boot/aboot.cfg 

Interesting journal log

Feb 15 16:51:07 localhost kernel: Kernel command line: ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/5a67059f4750a7dc58bd91275a6e148a5f6e88e4b48842>
Feb 15 16:51:07 localhost kernel: Unknown kernel command line parameters "ostree=/ostree/boot.1/rhcos/5a67059f4750a7dc58bd91275a6e148a5f6e88e4b48842a7969b3c> 

See attached images for reclaim files

 

Please review the following PR: https://github.com/openshift/azure-kubernetes-kms/pull/8

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

User Story:

As a (user persona), I want to be able to:

  • install Hypershift with the minimum set of required CAPI/CAPx CRDs

so that I can achieve

  • CRDs not utilized by Hypershift shouldn't be installed 

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

  • Today, the `hypershift install` command installs ALL CAPI provider CRDs, which includes, for example, `ROSACluster` & `ROSAMachinePool`, which are not needed by Hypershift.
  • We need to review and remove any CRD that is not required.
     

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Description of problem:

    When using an amd64 release image and setting the multi-arch flag to false, HCP CLI cannot create a HostedCluster. The following error happens:
/tmp/hcp create cluster aws --role-arn arn:aws:iam::460538899914:role/cc1c0f586e92c42a7d50 --sts-creds /tmp/secret/sts-creds.json --name cc1c0f586e92c42a7d50 --infra-id cc1c0f586e92c42a7d50 --node-pool-replicas 3 --base-domain origin-ci-int-aws.dev.rhcloud.com --region us-east-1 --pull-secret /etc/ci-pull-credentials/.dockerconfigjson --namespace local-cluster --release-image registry.build01.ci.openshift.org/ci-op-0bi6jr1l/release@sha256:11351a958a409b8e34321edfc459f389058d978e87063bebac764823e0ae3183
2024-08-29T06:23:25Z	ERROR	Failed to create cluster	{"error": "release image is not a multi-arch image"}
github.com/openshift/hypershift/product-cli/cmd/cluster/aws.NewCreateCommand.func1
	/remote-source/app/product-cli/cmd/cluster/aws/create.go:35
github.com/spf13/cobra.(*Command).execute
	/remote-source/app/vendor/github.com/spf13/cobra/command.go:983
github.com/spf13/cobra.(*Command).ExecuteC
	/remote-source/app/vendor/github.com/spf13/cobra/command.go:1115
github.com/spf13/cobra.(*Command).Execute
	/remote-source/app/vendor/github.com/spf13/cobra/command.go:1039
github.com/spf13/cobra.(*Command).ExecuteContext
	/remote-source/app/vendor/github.com/spf13/cobra/command.go:1032
main.main
	/remote-source/app/product-cli/main.go:59
runtime.main
	/usr/lib/golang/src/runtime/proc.go:271
Error: release image is not a multi-arch image
release image is not a multi-arch image

Version-Release number of selected component (if applicable):

    

How reproducible:

    Every time

Steps to Reproduce:

    1. Try to create a HC with an amd64 release image and multi-arch flag set to false
    

Actual results:

   The HC is not created and this error is displayed:
Error: release image is not a multi-arch image release image is not a multi-arch image 

Expected results:

    HC should create without errors

Additional info:

  This bug seems to have occurred as a result of HOSTEDCP-1778 and this line:  https://github.com/openshift/hypershift/blob/e2f75a7247ab803634a1cc7f7beaf99f8a97194c/cmd/cluster/aws/create.go#L520
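
A minimal sketch of the gating the report suggests is missing, assuming the "must be multi-arch" validation should only apply when the multi-arch flag is set; the names and logic are hypothetical, not the actual HyperShift code:

package main

import (
	"errors"
	"fmt"
)

// validateReleaseArch only insists on a multi-arch manifest list when the
// caller explicitly requested multi-arch.
func validateReleaseArch(multiArchRequested, imageIsMultiArch bool) error {
	if multiArchRequested && !imageIsMultiArch {
		return errors.New("release image is not a multi-arch image")
	}
	return nil
}

func main() {
	// An amd64-only release image with the multi-arch flag set to false should pass.
	fmt.Println(validateReleaseArch(false, false)) // <nil>
}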

Description of problem:

The control loop that manages /var/run/keepalived/iptables-rule-exists looks at the error returned by os.Stat and decides that the file exists as long as os.IsNotExist returns false. In other words, if the error is some non-nil error other than NotExist, the sentinel file would not be created.
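
A minimal sketch of the pattern in Go, assuming the intent is "create the sentinel unless it already exists"; this is not the actual control-loop code. Treating any non-NotExist error as "file exists" silently swallows permission or I/O errors, so a more defensive variant only skips creation when Stat clearly succeeds:

package main

import (
	"log"
	"os"
)

const sentinelPath = "/var/run/keepalived/iptables-rule-exists" // path from the report

// ensureSentinel creates the sentinel file unless Stat proves it exists.
// Unexpected Stat errors are surfaced instead of being treated as "exists".
func ensureSentinel(path string) error {
	_, err := os.Stat(path)
	if err == nil {
		return nil // file definitely exists
	}
	if !os.IsNotExist(err) {
		// Permission, I/O, or other failure: report it rather than
		// silently assuming the file is present.
		return err
	}
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	return f.Close()
}

func main() {
	if err := ensureSentinel(sentinelPath); err != nil {
		log.Printf("ensure sentinel: %v", err)
	}
}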

Version-Release number of selected component (if applicable):

4.17

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    The "oc adm node-image create" command sometimes throw a "image can't be pulled" error the first time the command is executed against a cluster.

Example:

+(./agent/07_agent_add_node.sh:138): case "${AGENT_E2E_TEST_BOOT_MODE}" in
+(./agent/07_agent_add_node.sh:42): oc adm node-image create --dir ocp/ostest/add-node/ --registry-config /opt/dev-scripts/pull_secret.json --loglevel=2
I1108 05:09:07.504614   85927 create.go:406] Starting command in pod node-joiner-4r4hq
I1108 05:09:07.517491   85927 create.go:826] Waiting for pod
**snip**
I1108 05:09:39.512594   85927 create.go:826] Waiting for pod
I1108 05:09:39.512634   85927 create.go:322] Printing pod logs
Error from server (BadRequest): container "node-joiner" in pod "node-joiner-4r4hq" is waiting to start: image can't be pulled

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    sometimes

Steps to Reproduce:

    1. Install a new cluster
    2. Run "oc adm node-image create" to create an image
    3.
    

Actual results:

    Error from server (BadRequest): container "node-joiner" in pod "node-joiner-4r4hq" is waiting to start: image can't be pulled

Expected results:

    No errors

Additional info:

    The error occurs the first time the command is executed. If the command is run again, it succeeds.

Description of problem:
Nodes couldn't recover when the worker role was missing in the custom MCP: all of the configuration was missing on the node, and the kubelet and crio services couldn't start.

Version-Release number of selected component (if applicable):
OCP 4.14

How reproducible:
Steps to Reproduce:

1. Create a custom MCP without worker role
$ cat mc.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker-t
  generation: 3
  name: 80-user-kernal
spec: {}

$ cat mcp.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-t
spec:
  configuration:
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker-t
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-t: ""

$ oc create -f mc.yaml
$ oc create -f mcp.yaml

2. Add label worker-t to worker03

$ oc get no
NAME STATUS ROLES AGE VERSION
master01.ocp4.danliu.com Ready master 454d v1.27.13+e709aa5
master02.ocp4.danliu.com Ready master 453d v1.27.13+e709aa5
master03.ocp4.danliu.com Ready master 453d v1.27.13+e709aa5
worker01.ocp4.danliu.com Ready worker 453d v1.27.13+e709aa5
worker02.ocp4.danliu.com Ready worker 51d v1.27.13+e709aa5
worker03.ocp4.danliu.com Ready worker,worker-t 69d v1.27.13+e709aa5

$ oc label nodes worker03.ocp4.danliu.com node-role.kubernetes.io/worker-t=
node/worker03.ocp4.danliu.com labeled

Actual results:
worker03 ran into NotReady status; kubelet and crio couldn't start up.

Expected results:
The MC sync should be prevented when the worker role is missing.

Additional info:
In previous versions (4.13 & 4.12), the sync got stuck with the error below:

Marking Unreconcilable due to: can't reconcile config rendered-worker-8f464eb07d2e2d2fbdb84ab2204fea65 with rendered-worker-t-5b6179e2fb4fedb853c900504edad9ce: ignition passwd user section contains unsupported changes: user core may not be deleted

Description of problem:

The customer is unable to scale a DeploymentConfig in an RHOCP 4.14.21 cluster.

If they scale a DeploymentConfig they get the error: "New size: 4; reason: cpu resource utilization (percentage of request) above target; error: Internal error occurred: converting (apps.DeploymentConfig) to (v1beta1.Scale): unknown conversion"

Version-Release number of selected component (if applicable):

4.14.21    

How reproducible:

N/A    

Steps to Reproduce:

    1. deploy apps using DC
    2. configure an admission webhook matching the dc/scale subresource
    3. create HPA
    4. observe pods unable to scale. Also manual scaling fails
    

Actual results:

Pods are not getting scaled    

Expected results:

Pods should be scaled using HPA    

Additional info:

    

Description of problem:

Additional IBM Cloud Services require the ability to override their service endpoints within the Installer. The list of available services provided in openshift/api must be expanded to account for this.

Version-Release number of selected component (if applicable):

4.17

How reproducible:

100%

Steps to Reproduce:

    1. Create an install-config for IBM Cloud
    2. Define serviceEndpoints, including one for "resourceCatalog"
    3. Attempt to run IPI
    

Actual results:

 

Expected results:

Successful IPI installation, using additional IBM Cloud Service endpoint overrides.

Additional info:

IBM Cloud is working on multiple patches to incorporate these additional services. The full list is still a work in progress, but currently includes:
- Resource (Global) Catalog endpoint
- COS Config endpoint

Changes are currently required in the following components. Separate Jiras may be opened (if required) to track their progress.
- openshift/api
- openshift-installer
- openshift/cluster-image-registry-operator

Description of problem:

When we add a user CA bundle to a cluster that has MCPs with yum-based RHEL nodes, the MCPs with RHEL nodes become degraded.
    

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.0-0.nightly-2024-08-18-131731   True        False         101m    Cluster version is 4.17.0-0.nightly-2024-08-18-131731

    

How reproducible:

Always

In the CI we found this issue running test case "[sig-mco] MCO security Author:sregidor-NonHyperShiftHOST-High-67660-MCS generates ignition configs with certs [Disruptive] [Serial]" on prow job periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-gcp-ipi-workers-rhel8-fips-f28-destructive

    

Steps to Reproduce:

    1. Create a certificate 
    
   	$ openssl genrsa -out privateKey.pem 4096
    	$ openssl req -new -x509 -nodes -days 3600 -key privateKey.pem -out ca-bundle.crt -subj "/OU=MCO qe/CN=example.com"
    
    2. Add the certificate to the cluster
    
   	# Create the configmap with the certificate
	$ oc create cm cm-test-cert -n openshift-config --from-file=ca-bundle.crt
	configmap/cm-test-cert created

	#Configure the proxy with the new test certificate
	$ oc patch proxy/cluster --type merge -p '{"spec": {"trustedCA": {"name": "cm-test-cert"}}}'
	proxy.config.openshift.io/cluster patched
    
    3. Check the MCP status and the MCD logs
    

Actual results:

    
    The MCP is degraded
    $ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-3251b00997d5f49171e70f7cf9b64776   True      False      False      3              3                   3                     0                      130m
worker   rendered-worker-05e7664fa4758a39f13a2b57708807f7   False     True       True       3              0                   0                     1                      130m

    We can see this message in the MCP
      - lastTransitionTime: "2024-08-19T11:00:34Z"
    message: 'Node ci-op-jr7hwqkk-48b44-6mcjk-rhel-1 is reporting: "could not apply
      update: restarting coreos-update-ca-trust.service service failed. Error: error
      running systemctl restart coreos-update-ca-trust.service: Failed to restart
      coreos-update-ca-trust.service: Unit coreos-update-ca-trust.service not found.\n:
      exit status 5"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded

In the MCD logs we can see:

I0819 11:38:55.089991    7239 update.go:2665] Removing SIGTERM protection
E0819 11:38:55.090067    7239 writer.go:226] Marking Degraded due to: could not apply update: restarting coreos-update-ca-trust.service service failed. Error: error running systemctl restart coreos-update-ca-trust.service: Failed to restart coreos-update-ca-trust.service: Unit coreos-update-ca-trust.service not found.
    

Expected results:

	No degradation should happen. The certificate should be added without problems.
    

Additional info:


    

Description of problem:

When a cluster-admin user or a normal user tries to create the first networkpolicy resource for a project, clicking on `affected pods` before submitting the creation form results in an error

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-08-27-051932    

How reproducible:

Always    

Steps to Reproduce:

    1. Open Networking -> NetworkPolicies; as a normal user or a cluster-admin user, try to create the first networkpolicy resource in a project
    2. In the Form view, click the `affected pods` button before hitting the 'Create' button
    3.
    

Actual results:

2. For the cluster-admin user, we see the error
Cannot set properties of undefined (setting 'tabIndex')   

For the normal user, we see
undefined has no properties
 

Expected results:

no errors    

Additional info:

    

Description of problem:

HCP cluster is being updated but the nodepool is stuck updating:
~~~
NAME                   CLUSTER   DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION   UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
nodepool-dev-cluster   dev       2               2               False         False        4.15.22   True              True
~~~

Version-Release number of selected component (if applicable):

Hosting OCP cluster 4.15
HCP 4.15.23

How reproducible:

N/A

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

Nodepool stuck in upgrade

Expected results:

Upgrade success

Additional info:

I have found this error repeating continually in the ignition-server pods:
~~~
{"level":"error","ts":"2024-08-20T09:02:19Z","msg":"Reconciler error","controller":"secret","controllerGroup":"","controllerKind":"Secret","Secret":{"name":"token-nodepool-dev-cluster-3146da34","namespace":"dev-dev"},"namespace":"dev-dev","name":"token-nodepool-dev-cluster-3146da34","reconcileID":"ec1f0a7f-1657-4245-99ef-c984977ff0f8","error":"error getting ignition payload: failed to download binaries: failed to extract image file: failed to extract image file: file not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}

{"level":"info","ts":"2024-08-20T09:02:20Z","logger":"get-payload","msg":"discovered machine-config-operator image","image":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f3b55cc8f88b9e6564fe6ad0bc431cd7270c0586a06d9b4a19ff2b518c461ede"}
{"level":"info","ts":"2024-08-20T09:02:20Z","logger":"get-payload","msg":"created working directory","dir":"/payloads/get-payload4089452863"}

{"level":"info","ts":"2024-08-20T09:02:28Z","logger":"get-payload","msg":"extracted image-references","time":"8s"}

{"level":"info","ts":"2024-08-20T09:02:38Z","logger":"get-payload","msg":"extracted templates","time":"10s"}
{"level":"info","ts":"2024-08-20T09:02:38Z","logger":"image-cache","msg":"retrieved cached file","imageRef":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f3b55cc8f88b9e6564fe6ad0bc431cd7270c0586a06d9b4a19ff2b518c461ede","file":"usr/lib/os-release"}
{"level":"info","ts":"2024-08-20T09:02:38Z","logger":"get-payload","msg":"read os-release","mcoRHELMajorVersion":"8","cpoRHELMajorVersion":"9"}
{"level":"info","ts":"2024-08-20T09:02:38Z","logger":"get-payload","msg":"copying file","src":"usr/bin/machine-config-operator.rhel9","dest":"/payloads/get-payload4089452863/bin/machine-config-operator"}
~~~

Please review the following PR: https://github.com/openshift/machine-api-provider-azure/pull/118

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.

Component name: ose-machine-api-provider-azure-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml

Description of problem:

    Since about 4 days ago, the techpreview jobs have been failing on MCO namespace: https://sippy.dptools.openshift.org/sippy-ng/jobs/4.18/analysis?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview-serial%22%7D%5D%7D

Example run: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview-serial/1843057579794632704

The daemons appear to be applying MCNs too early in the process, which causes them to degrade for a few loops: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview-serial/1842877807659585536/artifacts/e2e-aws-ovn-techpreview-serial/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-daemon-79f7s_machine-config-daemon.log

This is semi-blocking techpreview jobs and should be fixed with high priority. This shouldn't block the release, as MCN is not GA and likely won't be in 4.18.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    This PR introduces graceful shutdown functionality to the Multus daemon by adding a /readyz endpoint alongside the existing /healthz. The /readyz endpoint starts returning 500 once a SIGTERM is received, indicating the daemon is in shutdown mode. During this time, CNI requests can still be processed for a short window. The daemonset configs have been updated to increase terminationGracePeriodSeconds from 10 to 30 seconds, ensuring we have a bit more time for these clean shutdowns. This addresses a race condition during pod transitions where the readiness check might return true, but a subsequent CNI request could fail if the daemon shuts down too quickly. By introducing the /readyz endpoint and delaying the shutdown, we can handle ongoing CNI requests more gracefully, reducing the risk of disruptions during critical transitions.
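
A minimal sketch of the described /readyz behavior, assuming a plain net/http server; the port, drain duration, and handler paths are illustrative and this is not the actual Multus daemon code:

package main

import (
	"log"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var shuttingDown atomic.Bool

	// /readyz flips to 500 once SIGTERM is received; /healthz stays 200 so the
	// process is still considered alive while it drains in-flight CNI requests.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if shuttingDown.Load() {
			http.Error(w, "shutting down", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	go func() {
		<-sigs
		shuttingDown.Store(true)
		// Drain window; must finish inside terminationGracePeriodSeconds (30s).
		time.Sleep(25 * time.Second)
		os.Exit(0)
	}()

	log.Fatal(http.ListenAndServe(":8080", nil)) // port is illustrative only
}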

Version-Release number of selected component (if applicable):

    

How reproducible:

    Difficult to reproduce, might require CI signal

Description of problem:

    Console and OLM engineering and BU have decided to remove the Extension Catalog navigation item until the feature has matured more.

Description of problem:

    cluster-openshift-apiserver-operator is still in 1.29 and should be updated to 1.30 to reduce conflicts and other issues

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

 As part of deploying SNO clusters in the field based on the IBI install process, we need a way to apply node labels to the resulting cluster. As an example, once the cluster has had an IBI config applied to it, it should have a node label of "edge.io/isedgedevice: true" ... the label is only an example, and the user should have the ability to add one or more labels to the resulting node.

 

See: https://redhat-internal.slack.com/archives/C05JHD9QYTC/p1730298666011899 for additional context.

Description of problem:

While accessing the node terminal of the cluster from the web console, the below warning message is observed.
~~~
Admission Webhook WarningPod master-0.americancluster222.lab.psi.pnq2.redhat.com-debug violates policy 299 - "metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]"
~~~



Note: This is not impacting the cluster. However, the warning message is creating confusion among customers.

Version-Release number of selected component (if applicable):

4.16    

How reproducible:

    Everytime.

Steps to Reproduce:

    1. Install cluster of version 4.16.11 
    2. Upgrade the cluster from web-console to the next-minor version 4.16.13
    3. Try to access the node terminal from UI
    

Actual results:

    Showing warning while accessing the node terminal.

Expected results:

    Does not show any warning.

Additional info:

    

Please review the following PR: https://github.com/openshift/hypershift/pull/4672

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-update-keys/pull/62

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

v1alpha1 schema is still present in the v1 ConsolePlugin CRD and should be removed manually since the generator is re-adding it automatically.    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-operator/pull/126

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The pod of a catalogsource without registryPoll wasn't recreated during node failure

    jiazha-mac:~ jiazha$ oc get pods 
NAME                                    READY   STATUS        RESTARTS       AGE
certified-operators-rcs64               1/1     Running       0              123m
community-operators-8mxh6               1/1     Running       0              123m
marketplace-operator-769fbb9898-czsfn   1/1     Running       4 (117m ago)   136m
qe-app-registry-5jxlx                   1/1     Running       0              106m
redhat-marketplace-4bgv9                1/1     Running       0              123m
redhat-operators-ww5tb                  1/1     Running       0              123m
test-2xvt8                              1/1     Terminating   0              12m

jiazha-mac:~ jiazha$ oc get pods test-2xvt8 -o wide 
NAME         READY   STATUS    RESTARTS   AGE    IP            NODE                                          NOMINATED NODE   READINESS GATES
test-2xvt8   1/1     Running   0          7m6s   10.129.2.26   qe-daily-417-0708-cv2p6-worker-westus-gcrrc   <none>           <none>

jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc
NAME                                          STATUS     ROLES    AGE    VERSION
qe-daily-417-0708-cv2p6-worker-westus-gcrrc   NotReady   worker   116m   v1.30.2+421e90e

Version-Release number of selected component (if applicable):

     Cluster version is 4.17.0-0.nightly-2024-07-07-131215

How reproducible:

    always

Steps to Reproduce:

    1. Create a catalogsource without the registryPoll configuration.

jiazha-mac:~ jiazha$ cat cs-32183.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test
  namespace: openshift-marketplace
spec:
  displayName: Test Operators
  image: registry.redhat.io/redhat/redhat-operator-index:v4.16
  publisher: OpenShift QE
  sourceType: grpc

jiazha-mac:~ jiazha$ oc create -f cs-32183.yaml 
catalogsource.operators.coreos.com/test created

jiazha-mac:~ jiazha$ oc get pods test-2xvt8 -o wide 
NAME         READY   STATUS    RESTARTS   AGE     IP            NODE                                          NOMINATED NODE   READINESS GATES
test-2xvt8   1/1     Running   0          3m18s   10.129.2.26   qe-daily-417-0708-cv2p6-worker-westus-gcrrc   <none>           <none>


     2. Stop the node 
jiazha-mac:~ jiazha$ oc debug node/qe-daily-417-0708-cv2p6-worker-westus-gcrrc 
Temporary namespace openshift-debug-q4d5k is created for debugging node...
Starting pod/qe-daily-417-0708-cv2p6-worker-westus-gcrrc-debug-v665f ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.5
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# systemctl stop kubelet; sleep 600; systemctl start kubelet


Removing debug pod ...
Temporary namespace openshift-debug-q4d5k was removed.

jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc
NAME                                          STATUS     ROLES    AGE    VERSION
qe-daily-417-0708-cv2p6-worker-westus-gcrrc   NotReady   worker   115m   v1.30.2+421e90e


    3. Check whether this catalogsource's pod is recreated.

    

Actual results:

No new pod was generated. 

    jiazha-mac:~ jiazha$ oc get pods 
NAME                                    READY   STATUS        RESTARTS       AGE
certified-operators-rcs64               1/1     Running       0              123m
community-operators-8mxh6               1/1     Running       0              123m
marketplace-operator-769fbb9898-czsfn   1/1     Running       4 (117m ago)   136m
qe-app-registry-5jxlx                   1/1     Running       0              106m
redhat-marketplace-4bgv9                1/1     Running       0              123m
redhat-operators-ww5tb                  1/1     Running       0              123m
test-2xvt8                              1/1     Terminating   0              12m

Once the node recovered, a new pod was generated.


jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc
NAME                                          STATUS   ROLES    AGE    VERSION
qe-daily-417-0708-cv2p6-worker-westus-gcrrc   Ready    worker   127m   v1.30.2+421e90e

jiazha-mac:~ jiazha$ oc get pods 
NAME                                    READY   STATUS    RESTARTS       AGE
certified-operators-rcs64               1/1     Running   0              127m
community-operators-8mxh6               1/1     Running   0              127m
marketplace-operator-769fbb9898-czsfn   1/1     Running   4 (121m ago)   140m
qe-app-registry-5jxlx                   1/1     Running   0              109m
redhat-marketplace-4bgv9                1/1     Running   0              127m
redhat-operators-ww5tb                  1/1     Running   0              127m
test-wqxvg                              1/1     Running   0              27s 

Expected results:

During the node failure, a new catalog source pod should be generated.

    

Additional info:

Hi Team,

After some more investigation of the operator-lifecycle-manager source code, we figured out the reason.

  • Commit [1] tries to fix this issue by adding a "force delete dead pods" step to the ensurePod() function.
  • ensurePod() is called by EnsureRegistryServer() [2].
  • However, syncRegistryServer() returns immediately, without calling EnsureRegistryServer(), if there is no registryPoll in the catalog [3].
  • There is no registryPoll defined in catalogsources that were generated when we built the catalog image following the doc [4].
    apiVersion: operators.coreos.com/v1alpha1
    kind: CatalogSource
    metadata:
      name: redhat-operator-index
      namespace: openshift-marketplace
    spec:
      image: quay-server.bastion.tokyo.com:5000/redhat/redhat-operator-index-logging:logging-vstable-5.8-v5.8.5
      sourceType: grpc
    
  • So the catalog pod created by the catalogsource cannot be recovered.

We verified that the catalog pod can be recreated on another node if we add the registryPoll configuration to the catalogsource as follows (see the lines marked with <==).

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operator-index
  namespace: openshift-marketplace
spec:
  image: quay-server.bastion.tokyo.com:5000/redhat/redhat-operator-index-logging:logging-vstable-5.8-v5.8.5
  sourceType: grpc
  updateStrategy:   <==
    registryPoll:   <==
      interval: 10m <==

registryPoll is NOT mandatory for a catalogsource.
So the commit [1] that tries to fix the issue in EnsureRegistryServer() is not the proper fix.

[1] https://github.com/operator-framework/operator-lifecycle-manager/pull/3201/files
[2] https://github.com/joelanford/operator-lifecycle-manager/blob/82f499723e52e85f28653af0610b6e7feff096cf/pkg/controller/registry/reconciler/grpc.go#L290
[3] https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/operators/catalog/operator.go#L1009
[4] https://docs.openshift.com/container-platform/4.16/operators/admin/olm-managing-custom-catalogs.html

Observed in 

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-serial-ovn-ipv6/1786198211774386176

 

There was a delay provisioning one of the master nodes; we should figure out why this is happening and whether it can be prevented.

 

From the ironic logs, there was a 5-minute delay during cleaning; on the other 2 masters this took a few seconds.

 

 

01:20:53 1f90131a...moved to provision state "verifying" from state "enroll"
01:20:59 1f90131a...moved to provision state "manageable" from state "verifying"
01:21:04 1f90131a...moved to provision state "inspecting" from state "manageable"
01:21:35 1f90131a...moved to provision state "inspect wait" from state "inspecting"
01:26:26 1f90131a...moved to provision state "inspecting" from state "inspect wait" 
01:26:26 1f90131a...moved to provision state "manageable" from state "inspecting"
01:26:30 1f90131a...moved to provision state "cleaning" from state "manageable"
01:27:17 1f90131a...moved to provision state "clean wait" from state "cleaning"
>>> what's this 5-minute gap about? <<<
01:32:07 1f90131a...moved to provision state "cleaning" from state "clean wait" 
01:32:08 1f90131a...moved to provision state "clean wait" from state "cleaning"
01:32:12 1f90131a...moved to provision state "cleaning" from state "clean wait"
01:32:13 1f90131a...moved to provision state "available" from state "cleaning"
01:32:23 1f90131a...moved to provision state "deploying" from state "available"
01:32:28 1f90131a...moved to provision state "wait call-back" from state "deploying"
01:32:58 1f90131a...moved to provision state "deploying" from state "wait call-back"
01:33:14 1f90131a...moved to provision state "active" from state "deploying"

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible: Always

Repro Steps:

Add: "bridge=br0:enpf0,enpf2 ip=br0:dhcp" to dracut cmdline. Make sure either enpf0/enpf2 is the primary network of the cluster subnet.

The Linux bridge can be configured to add a virtual switch between one or many ports. This can be done by a simple machine config that adds:
"bridge=br0:enpf0,enpf2 ip=br0:dhcp"
to the kernel command line options, which will be processed by dracut.

The use case of adding such a virtual bridge for simple IEEE802.1 switching is to support PCIe devices that act as co-processors in a baremetal server. For example:
--------                  ---------------------
 Host                      PCIe Co-processor
 eth0 <-------> enpf0 <br0> enpf2 <---> network
--------                  ---------------------
This co-processor could be a "DPU" network interface card. Thus the co-processor can be part of the same underlay network as the cluster and pods can be scheduled on the Host and the Co-processor. This allows for pods to be offloaded to the co-processor for scaling workloads.
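
A minimal sketch of how these kernel arguments could be delivered through a MachineConfig (the object name and role label are illustrative; only the kernelArguments entries matter here):

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-bridge-kargs   # illustrative name
spec:
  kernelArguments:
    # processed by dracut at boot to create br0 enslaving enpf0 and enpf2
    - bridge=br0:enpf0,enpf2
    - ip=br0:dhcp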

Actual results:

ovs-configuration service fails.

Expected results:

ovs-configuration service passes with the bridge interface added to the ovs bridge.

Description of problem:

v4.17 baselineCapabilitySet is not recognized.
  
# ./oc adm release extract --install-config v4.17-basecap.yaml --included --credentials-requests --from quay.io/openshift-release-dev/ocp-release:4.17.0-rc.1-x86_64 --to /tmp/test

error: unrecognized baselineCapabilitySet "v4.17"

# cat v4.17-basecap.yaml
---
apiVersion: v1
platform:
  gcp:
    foo: bar
capabilities:
  baselineCapabilitySet: v4.17 

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-2024-09-04-132247

How reproducible:

    always

Steps to Reproduce:

    1. Run `oc adm release extract --install-config --included` against an install-config file including baselineCapabilitySet: v4.17. 
    2.
    3.
    

Actual results:

    `oc adm release extract` throws an unrecognized error

Expected results:

    `oc adm release extract` should extract correct manifests

Additional info:

    When specifying baselineCapabilitySet: v4.16, it works well.

TRT has detected a consistent long term trend where the oauth-apiserver appears to have more disruption than it did in 4.16, for minor upgrades on azure.

The problem appears roughly above the 90th percentile; we picked it up at P95, where it shows a consistent 5-8s more disruption than we'd expect given the data at 4.16 GA.

The problem hits ONLY oauth, affecting both new and reused connections, as well as the cached variants meaning etcd should be out of the picture. You'll see a few very short blips where all four of these backends lose connectivity for ~1s throughout the run, several times over. It looks like it may be correlated to the oauth-operator reporting:

source/OperatorAvailable display/true condition/Available reason/APIServices_Error status/False APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https" [2s]

Sample jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669509775822848
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669509775822848/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122854.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827077182283845632
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827077182283845632/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240823-212127.json&overrideDisplayFlag=1&selectedSources=OperatorDegraded&selectedSources=EtcdLog&selectedSources=Disruption&selectedSources=E2EFailed

More can be found using the first link to the dashboard in this post and scrolling down to most recent job runs, and looking for high numbers.

The operator Degraded condition is probably the strongest symptom to pursue, as it appears in most of the above.

If you find any runs where other backends are disrupted, especially kube-api, I would suggest ignoring those as they are unlikely to be the same fingerprint as the error being described here.

Context
Some ROSA HCP users host their own container registries (e.g., self-hosted Quay servers) that are only accessible from inside of their VPCs. This is often achieved through the use of private DNS zones that resolve non-public domains like quay.mycompany.intranet to non-public IP addresses. The private registries at those addresses then present self-signed SSL certificates to the client that can be validated against the HCP's additional CA trust bundle.

Problem Description
A user of a ROSA HCP cluster with a configuration like the one described above is encountering errors when attempting to import a container image from their private registry into their HCP's internal registry via oc import-image. Originally, these errors showed up in openshift-apiserver logs as DNS resolution errors, i.e., OCPBUGS-36944. After the user upgraded their cluster to 4.14.37 (which fixes OCPBUGS-36944), openshift-apiserver was able to properly resolve the domain name but now complains of HTTP 502 Bad Gateway errors. We suspect these 502 Bad Gateway errors are coming from the Konnectivity-agent while it proxies traffic between the control and data planes.

We've confirmed that the private registry is accessible from the HCP data plane (worker nodes) and that the certificate presented by the registry can be validated against the cluster's additional trust bundle. IOW, curl-ing the private registry from a worker node returns a HTTP 200 OK, but doing the same from a control plane node returns a HTTP 502. Notably, this cluster is not configured with a cluster-wide proxy, nor does the user's VPC feature a transparent proxy.

Version-Release number of selected component
OCP v4.14.37

How reproducible
Can be reliably reproduced, although the network config (see Context above) is quite specific

Steps to Reproduce

  1. Run the following command from the HCP data plane
    oc import-image imagegroup/imagename:v1.2.3 --from=quay.mycompany.intranet/imagegroup/imagename:v1.2.3 --confirm
    
  2. Observe the command output, the resulting ImageStream object, and openshift-apiserver logs

Actual Results

error: tag v1.2.3 failed: Internal error occurred: quay.mycompany.intranet/imagegroup/imagename:v1.2.3: Get "https://quay.mycompany.intranet/v2/": Bad Gateway
imagestream.image.openshift.io/imagename imported with errors

Name:            imagename
Namespace:        mynamespace
Created:        Less than a second ago
Labels:            <none>
Annotations:        openshift.io/image.dockerRepositoryCheck=2024-10-01T12:46:02Z
Image Repository:    default-route-openshift-image-registry.apps.rosa.clustername.abcd.p1.openshiftapps.com/mynamespace/imagename
Image Lookup:        local=false
Unique Images:        0
Tags:            1

v1.2.3
  tagged from quay.mycompany.intranet/imagegroup/imagename:v1.2.3

  ! error: Import failed (InternalError): Internal error occurred: quay.mycompany.intranet/imagegroup/imagename:v1.2.3: Get "https://quay.mycompany.intranet/v2/": Bad Gateway
      Less than a second ago

error: imported completed with errors

Expected Results
Desired container image is imported from private external image registry into cluster's internal image registry without error

Description of problem:

We ignore errors from the existence check in https://github.com/openshift/baremetal-runtimecfg/blob/723290ec4b31bc4e032ff62198ae3dd0d0e36313/pkg/monitor/iptables.go#L116 and that can make it more difficult to debug errors in the healthchecks. In particular, this made it more difficult to debug an issue with permissions on the monitor container because there were no log messages to let us know the check had failed.

Version-Release number of selected component (if applicable):

4.17    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

 

Description of problem:

We should decrease the verbosity level for the IBM CAPI module.  This will affect the output of the file .openshift_install.log
    

Please review the following PR: https://github.com/openshift/aws-pod-identity-webhook/pull/196

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

While updating an HC with a controllerAvailabilityPolicy of SingleReplica, the HCP doesn't fully roll out, with 3 pods stuck in Pending:

multus-admission-controller-5b5c95684b-v5qgd          0/2     Pending   0               4m36s
network-node-identity-7b54d84df4-dxx27                0/3     Pending   0               4m12s
ovnkube-control-plane-647ffb5f4d-hk6fg                0/3     Pending   0               4m21s

This is because these deployments all have requiredDuringSchedulingIgnoredDuringExecution zone anti-affinity and maxUnavailable: 25% (i.e., 1).

Thus the old pod blocks scheduling of the new pod.
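
A minimal sketch of the combination that produces this deadlock (not the actual HCP manifests; names and labels are illustrative): required zone anti-affinity plus a rolling update that keeps the old pod around, so the surged new pod has nowhere left to schedule.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-control-plane-component   # illustrative
spec:
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 25%         # a new pod is surged first
      maxUnavailable: 25%   # the old pod is kept until the new one is Ready
  selector:
    matchLabels:
      app: example-control-plane-component
  template:
    metadata:
      labels:
        app: example-control-plane-component
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: example-control-plane-component
              # the new pod must land in a different zone than the old pod,
              # which is impossible while the old pod holds the only available zone
              topologyKey: topology.kubernetes.io/zone
      containers:
        - name: example
          image: registry.example.com/example:latest   # illustrative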

Description of problem

Kubelet logs contain entries like:

Jun 13 10:05:14.141073 ip-10-0-133-167 kubenswrapper[1385]: I0613 10:05:14.141043    1385 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt"

I'm not sure if that's a problem or not, but it is distracting noise for folks trying to understand Kubelet behavior, and we should either fix the problem or denoise the red herring.

Version-Release number of selected component

Seen in 4.13.44, 4.14.31, and 4.17.0-0.nightly-2024-06-25-162526 (details in Additional info).
Not seen in 4.12.60, so presumably a 4.12 to 4.13 change.

How reproducible

Every time.

Steps to Reproduce

1. Run a cluster.
2. Check node/kubelet logs for one control-plane node.

Actual results

Lots of can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt messages.

Expected results

No can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt messages.

Additional info

Checking recent builds in assorted 4.y streams.

4.12.60

4.12.60 > aws-sdn-serial > Artifacts > ... > gather-extra artifacts:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-sdn-serial/1803708035177123840/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/nodes.json | jq -r '.items[].metadata | select(.labels["node-role.kubernetes.io/control-plane"] == "").name'
ip-10-0-156-214.us-west-1.compute.internal
ip-10-0-158-171.us-west-1.compute.internal
ip-10-0-203-59.us-west-1.compute.internal
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-sdn-serial/1803708035177123840/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/nodes/ip-10-0-156-214.us-west-1.compute.internal/journal | zgrep kubelet-ca.crt | tail -n3
Jun 20 08:47:07.734060 ip-10-0-156-214 ignition[1087]: INFO     : files: createFilesystemsFiles: createFiles: op(11): [finished] writing file "/sysroot/etc/kubernetes/kubelet-ca.crt"
Jun 20 08:49:29.274949 ip-10-0-156-214 kubenswrapper[1384]: I0620 08:49:29.274923    1384 dynamic_cafile_content.go:119] "Loaded a new CA Bundle and Verifier" name="client-ca-bundle::/etc/kubernetes/kubelet-ca.crt"
Jun 20 08:49:29.275084 ip-10-0-156-214 kubenswrapper[1384]: I0620 08:49:29.275067    1384 dynamic_cafile_content.go:157] "Starting controller" name="client-ca-bundle::/etc/kubernetes/kubelet-ca.crt"

is clean.

4.13.44

4.13.44 > aws-sdn-serial > Artifacts > ... > gather-extra artifacts:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-aws-sdn-serial/1801188570212339712/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/nodes.json |  jq -r '.items[].metadata | select(.labels["node-role.kubernetes.io/control-plane"] == "").name'
ip-10-0-133-167.us-west-1.compute.internal
ip-10-0-170-3.us-west-1.compute.internal
ip-10-0-203-13.us-west-1.compute.internal
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-aws-sdn-serial/1801188570212339712/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/nodes/ip-10-0-133-167.us-west-1.compute.internal/journal | zgrep kubelet-ca.crt | tail -n3
Jun 13 10:05:00.464260 ip-10-0-133-167 kubenswrapper[1385]: I0613 10:05:00.464190    1385 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt"
Jun 13 10:05:13.320867 ip-10-0-133-167 kubenswrapper[1385]: I0613 10:05:13.320824    1385 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt"
Jun 13 10:05:14.141073 ip-10-0-133-167 kubenswrapper[1385]: I0613 10:05:14.141043    1385 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt"

is exposed.

4.14.31

4.14.31 > aws-ovn-serial > Artifacts > ... > gather-extra artifacts:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1803746771264868352/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/nodes.json | jq -r '.items[].metadata | select(.labels["node-role.kubernetes.io/control-plane"] == "").name'
ip-10-0-17-181.us-west-2.compute.internal
ip-10-0-66-68.us-west-2.compute.internal
ip-10-0-97-83.us-west-2.compute.internal
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial/1803746771264868352/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/nodes/ip-10-0-17-181.us-west-2.compute.internal/journal | zgrep kubelet-ca.crt | tail -n3
Jun 20 11:42:31.931470 ip-10-0-17-181 kubenswrapper[2226]: I0620 11:42:31.931404    2226 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt"
Jun 20 11:42:31.980499 ip-10-0-17-181 kubenswrapper[2226]: I0620 11:42:31.980448    2226 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt"
Jun 20 11:42:32.757888 ip-10-0-17-181 kubenswrapper[2226]: I0620 11:42:32.757846    2226 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="can't remove non-existent watcher: /etc/kubernetes/kubelet-ca.crt"

4.17.0-0.nightly-2024-06-25-162526

4.17.0-0.nightly-2024-06-25-162526 > aws-ovn-serial > Artifacts > ... > gather-extra artifacts:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-serial/1805639599624556544/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/nodes.json | jq -r '.items[].metadata | select(.labels["node-role.kubernetes.io/control-plane"] == "").name'
ip-10-0-125-200.ec2.internal
ip-10-0-47-81.ec2.internal
ip-10-0-8-158.ec2.internal
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-serial/1805639599624556544/artifacts/e2e-aws-ovn-serial/gather-extra/artifacts/nodes/ip-10-0-8-158.ec2.internal/journal | zgrep kubelet-ca.crt | tail -n3
Jun 25 19:56:13.452559 ip-10-0-8-158 kubenswrapper[2243]: I0625 19:56:13.452512    2243 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="fsnotify: can't remove non-existent watch: /etc/kubernetes/kubelet-ca.crt"
Jun 25 19:56:13.512277 ip-10-0-8-158 kubenswrapper[2243]: I0625 19:56:13.512213    2243 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="fsnotify: can't remove non-existent watch: /etc/kubernetes/kubelet-ca.crt"
Jun 25 19:56:14.403001 ip-10-0-8-158 kubenswrapper[2243]: I0625 19:56:14.402953    2243 dynamic_cafile_content.go:211] "Failed to remove file watch, it may have been deleted" file="/etc/kubernetes/kubelet-ca.crt" err="fsnotify: can't remove non-existent watch: /etc/kubernetes/kubelet-ca.crt"

Description of problem:

When navigating from Lightspeed's "Don't show again" link, it can be hard to know which element is relevant. We should look at utilizing Spotlight to highlight the relevant user preference.

Also, there is an undesirable gap before the Lightspeed user preference caused by an empty div from data-test="console.telemetryAnalytics".

Version-Release number of selected component (if applicable):

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

User Story:

As a (user persona), I want to be able to:

  • Run a capg install and select a pd-balanced disk type
  • Run a capg install and select a hyperdisk-balanced disk type

so that I can achieve

  • Installations with capg where no regressions are introduced.
  • Support N4 and Metal Machine Types in GCP

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.
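
As a rough illustration of the disk-type selection in this story, an install-config compute pool fragment might look like the following (a sketch only; the machine type and the acceptance of these diskType values are exactly what this story targets, not something guaranteed to work today):

compute:
  - name: worker
    replicas: 3
    platform:
      gcp:
        type: n4-standard-4              # assumed machine type for the N4 case
        osDisk:
          diskSizeGB: 128
          diskType: hyperdisk-balanced   # or pd-balanced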

Description of problem:

    Based on feature https://issues.redhat.com/browse/CONSOLE-3243 - Rename "master" to "control plane node" in node pages,
    the 'master' option in the 'Filter by Node type' dropdown in the Cluster Utilization section on the Overview page should be updated to 'control plane'.
    However, the changes have been covered by PR https://github.com/openshift/console/pull/14121, which brought in this issue.

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-10-15-032107

How reproducible:

    Always

Steps to Reproduce:

    1. Make sure your node role has 'control plane'
       eg: 
$ oc get nodes -o wide
NAME                                         STATUS   ROLES                  AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                KERNEL-VERSION                 CONTAINER-RUNTIME
qe-uidaily-1016-dsclx-master-0               Ready    control-plane,master   3h     v1.31.1   10.0.0.4      <none>        Red Hat Enterprise Linux CoreOS 418.94.202410111739-0   5.14.0-427.40.1.el9_4.x86_64   cri-o://1.31.1-4.rhaos4.18.gitd8950b8.el9
qe-uidaily-1016-dsclx-master-1               Ready    control-plane,master   3h     v1.31.1   10.0.0.5      <none>        Red Hat Enterprise Linux CoreOS 418.94.202410111739-0   5.14.0-427.40.1.el9_4.x86_64   cri-o://1.31.1-4.rhaos4.18.gitd8950b8.el9


     2. Navigate to Overview page, check the option on the 'Filter by Node type' dropdown list on Cluster utilization section
    3.
    

Actual results:

    control plane option is missing 

Expected results:

    the master option should be updated to 'control plane'

Additional info:

    

Description of problem:

    The cert-manager operator from redhat-operators is not yet available in the 4.18 catalog. We'll need to use a different candidate in order to update our default catalog images to 4.18 without creating test failures.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

For resources under the Networking menu (e.g., service, route, ingress, networkpolicy), when accessing a non-existing resource, the page should show "404 not found" instead of loading indefinitely.
    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-09-10-133647
4.17.0-0.nightly-2024-09-09-120947
    

How reproducible:

Always
    

Steps to Reproduce:

    1.Access a non-existing resource under Networking menu, eg "testconsole" service with url "/k8s/ns/openshift-console/services/testconsole".
    2.
    3.
    

Actual results:

1. The page keeps loading indefinitely.
screenshot: https://drive.google.com/file/d/1HpH2BfVUACivI0KghXhsKt3FYgYFOhxx/view?usp=drive_link
    

Expected results:

1. Should show "404 not found"
    

Additional info:


    

Perform the SnykDuty

  • As Toni mentioned in the 1:1 conversation, in this Snyk duty session I will add the openshift-ci-security job to all our presubmit jobs.

Please review the following PR: https://github.com/openshift/kubevirt-csi-driver/pull/46

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    When setting .spec.storage.azure.networkAccess.type: Internal (without providing vnet and subnet names), the image registry will attempt to discover the vnet by tag. 

Previous to the installer switching to cluster-api, the vnet tagging happened here: https://github.com/openshift/installer/blob/10951c555dec2f156fad77ef43b9fb0824520015/pkg/asset/cluster/azure/azure.go#L79-L92.

After the switch to cluster-api, this code no longer seems to be in use, so the tags are no longer there.

From inspection of a failed job, the new tags in use seem to be in the form of `sigs.k8s.io_cluster-api-provider-azure_cluster_$infraID` instead of the previous `kubernetes.io_cluster.$infraID`.

Image registry operator code responsible for this: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/storage/azure/azure.go?plain=1#L678-L682

More details in slack discussion with installer team: https://redhat-internal.slack.com/archives/C68TNFWA2/p1726732108990319

Version-Release number of selected component (if applicable):

    4.17, 4.18

How reproducible:

    Always

Steps to Reproduce:

    1. Get an Azure 4.17 or 4.18 cluster
    2. oc edit configs.imageregistry/cluster
    3. set .spec.storage.azure.networkAccess.type to Internal  
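
For reference, the relevant fragment of the config after step 3 looks roughly like this (only the networkAccess stanza matters; no vnet or subnet names are provided, which is what triggers the discovery-by-tag path):

apiVersion: imageregistry.operator.openshift.io/v1
kind: Config
metadata:
  name: cluster
spec:
  storage:
    azure:
      networkAccess:
        # Internal without internal.vnetName/subnetName forces the operator
        # to discover the vnet by tag
        type: Internal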

Actual results:

    The operator cannot find the vnet (look for "not found" in operator logs)

Expected results:

    The operator should be able to find the vnet by tag and configure the storage account as private

Additional info:

If we make the switch to look for a vnet tagged with `sigs.k8s.io_cluster-api-provider-azure_cluster_$infraID`, one thing that needs to be tested is BYO vnet/subnet clusters. What I have currently observed in CI is that the cluster has the new tag key with the `owned` value; for BYO networks the value *should* be `shared`, though I have not tested it.
---

Although this bug is a regression, I'm not going to mark it as such because this affects a fairly new feature (introduced in 4.15), and there's a very easy workaround (manually setting the vnet and subnet names when configuring network access to internal).

 

Description of problem:

See https://search.dptools.openshift.org/?search=Kubernetes+resource+CRUD+operations+Secret+displays+detail+view+for+newly+created+resource+instance&maxAge=168h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Description of problem:

When the UPDATE_URL_OVERRIDE env variable is used, the information is confusing:

./oc-mirror.latest -c config-19.yaml --v2 file://disk-enc1 

2024/06/19 12:22:38  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/06/19 12:22:38  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/06/19 12:22:38  [INFO]   : ⚙️  setting up the environment for you...
2024/06/19 12:22:38  [INFO]   : 🔀 workflow mode: mirrorToDisk 
I0619 12:22:38.832303   66173 client.go:44] Usage of the UPDATE_URL_OVERRIDE environment variable is unsupported
2024/06/19 12:22:38  [INFO]   : 🕵️  going to discover the necessary images...

 

Version-Release number of selected component (if applicable):

./oc-mirror.latest  version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202406131541.p0.g157eb08.assembly.stream.el9-157eb08", GitCommit:"157eb085db0ca66fb689220119ab47a6dd9e1233", GitTreeState:"clean", BuildDate:"2024-06-13T17:25:46Z", GoVersion:"go1.22.1 (Red Hat 1.22.1-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

always

Steps to Reproduce:

1) Set registry on the ocp cluster;
2) do mirror2disk + disk2mirror with following isc:
apiVersion: mirror.openshift.io/v2alpha1
kind: ImageSetConfiguration
mirror:
  additionalImages:
   - name: quay.io/openshifttest/bench-army-knife@sha256:078db36d45ce0ece589e58e8de97ac1188695ac155bc668345558a8dd77059f6
  platform:
    channels:
    - name: stable-4.15
      type: ocp
      minVersion: '4.15.10'
      maxVersion: '4.15.11'
    graph: true
  operators:
    - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
      packages:
       - name: elasticsearch-operator

 3) set  ~/.config/containers/registries.conf
[[registry]]
  location = "quay.io"
  insecure = false
  blocked = false
  mirror-by-digest-only = false
  prefix = ""
  [[registry.mirror]]
    location = "my-route-testzy.apps.yinzhou-619.qe.devcluster.openshift.com"
    insecure = false

4) use the isc from step 2 and mirror2disk with different dir:
`./oc-mirror.latest -c config-19.yaml --v2 file://disk-enc1`

Actual results: 

 

./oc-mirror.latest -c config-19.yaml --v2 file://disk-enc1 
2024/06/19 12:22:38  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/06/19 12:22:38  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/06/19 12:22:38  [INFO]   : ⚙️  setting up the environment for you...
2024/06/19 12:22:38  [INFO]   : 🔀 workflow mode: mirrorToDisk 
I0619 12:22:38.832303   66173 client.go:44] Usage of the UPDATE_URL_OVERRIDE environment variable is unsupported
2024/06/19 12:22:38  [INFO]   : 🕵️  going to discover the necessary images...
2024/06/19 12:22:38  [INFO]   : 🔍 collecting release images...
 

Expected results:

Clear information should be given to clarify the UPDATE_URL_OVERRIDE environment variable.


Slack discussion is here: https://redhat-internal.slack.com/archives/C050P27C71S/p1718800641718869?thread_ts=1718175617.310629&cid=C050P27C71S

The CPO reconciliation aborts when the OIDC/LDAP IDP validation check fails, and this results in a failure to reconcile any components that are reconciled after that point in the code.

This failure should not be fatal to the CPO reconcile and should likely be reported as a condition on the HC.

xref

Customer incident
https://issues.redhat.com/browse/OCPBUGS-38071

RFE for bypassing the check
https://issues.redhat.com/browse/RFE-5638

PR to proxy the IDP check through the data plane network
https://github.com/openshift/hypershift/pull/4273

 

This is a feature request. Sorry, I couldn't find anywhere else to file it. Our team can also potentially implement this feature, so really we're looking for design input before possibly submitting a PR.

User story:

As a user of on-prem OpenShift, I need to manage DNS for my OpenShift cluster manually. I can already specify an IP address for the API server, but I cannot do this for Ingress. This means that I have to:

  • Manually create the API endpoint IP
  • Add DNS for the API endpoint
  • Create the cluster
  • Discover the created Ingress endpoint
  • Add DNS for the Ingress endpoint

I would like to simplify this workflow to:

  • Manually create the API and Ingress endpoint IPs
  • Add DNS for the API and Ingress endpoints
  • Create the cluster

Implementation suggestion:

Our specific target is OpenStack. We could add `OpenStackLoadBalancerParameters` to `ProviderLoadBalancerParameters`, but the parameter we would be adding is `loadBalancerIP`. This isn't OpenStack-specific. For example, it would be equally applicable to users of either OpenStack's built-in Octavia loadbalancer, or MetalLB, both of which may reasonably be deployed on OpenStack.

I suggest adding an optional LoadBalancerIP to LoadBalancerStrategy here: https://github.com/openshift/cluster-ingress-operator/blob/8252ac492c04d161fbcf60ef82af2989c99f4a9d/vendor/github.com/openshift/api/operator/v1/types_ingress.go#L395-L440

This would be used to pre-populate spec.loadBalancerIP when creating the Service for the default router.
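
To illustrate the proposal, the desired IngressController configuration might look like the sketch below; the loadBalancerIP field is hypothetical (it is the field being proposed here), while the surrounding endpointPublishingStrategy fields already exist:

apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: External
      # hypothetical proposed field: pre-populates spec.loadBalancerIP on the
      # Service created for the default router
      loadBalancerIP: 192.0.2.10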

The list of known plugin names for telemetry does not include kuadrant-console-plugin, which is a Red Hat maintained plugin.

Description of problem:

In upstream and downstream automation testing, we see occasional failures coming from monitoring-plugin

For example:

Check JUnit report for https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_console/14468/pull-ci-openshift-console-master-e2e-gcp-console/1856100921105190912


Check JUnit report for https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_console/14475/pull-ci-openshift-console-master-e2e-gcp-console/1856095554396753920 


Check screenshot when visiting /monitoring/alerts 
https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.18-amd64-nightly-gcp-upi-f7-ui/1855343143403130880/artifacts/gcp-upi-f7-ui/cucushift-e2e/artifacts/ui1/embedded_files/2024-11-09T22:21:41+00:00-screenshot.png     

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-11-11-144244    

How reproducible:

more reproducible in automation testing

Steps to Reproduce:

 

Actual results:

runtime errors

Expected results:

no errors    

Additional info:

    

Description of problem:

Log in to the admin console with a normal user; there is a "User workload notifications" option in the "Notifications" menu on the "User Preferences" page. It's not necessary, since normal users have no permission to get alerts.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-05-23-103225

How reproducible:

Always

Steps to Reproduce:

1.Login on admin console with normal user, go to "User Preferences" page.
2.Click "Notifications" menu, check/uncheck "Hide user workload notifications" for "User workload notifications"
3.

Actual results:

2. User could set the option.

Expected results:

3. It's better not to show the option for "User workload notifications", since normal users cannot get alerts and there is no Notification Drawer on the masthead.

Additional info:

Screenshots: https://drive.google.com/drive/folders/15_qGw1IkbK1_rIKNiageNlYUYKTrsdKp?usp=share_link

Description of problem:

The pinned images functionality is not working
    

Version-Release number of selected component (if applicable):

IPI on AWS version:
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-0.nightly-2024-10-28-052434   True        False         6h46m   Cluster version is 4.18.0-0.nightly-2024-10-28-052434

    

How reproducible:

Always
    

Steps to Reproduce:

    1. Enable techpreview
    2.  Create a pinnedimagesets resource

$ oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: PinnedImageSet
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: tc-73623-worker-pinned-images
spec:
  pinnedImages:
  - name: "quay.io/openshifttest/busybox@sha256:0415f56ccc05526f2af5a7ae8654baec97d4a614f24736e8eef41a4591f08019"
  - name: quay.io/openshifttest/alpine@sha256:be92b18a369e989a6e86ac840b7f23ce0052467de551b064796d67280dfa06d5
EOF

    

Actual results:

The images are not pinned and the pool is degraded

We can see these logs in the MCDs

 I1028 14:26:32.514096    2341 pinned_image_set.go:304] Reconciling pinned image set: tc-73623-worker-pinned-images: generation: 1
E1028 14:26:32.514183    2341 pinned_image_set.go:240] failed to get image status for "quay.io/openshifttest/busybox@sha256:0415f56ccc05526f2af5a7ae8654baec97d4a614f24736e8eef41a4591f08019": rpc error: code = Unavailable desc = name resolver error: produced zero addresses

And we can see the machineconfignodes resources reporting pinnedimagesets degradation:

  - lastTransitionTime: "2024-10-28T14:27:58Z"
    message: 'failed to get image status for "quay.io/openshifttest/busybox@sha256:0415f56ccc05526f2af5a7ae8654baec97d4a614f24736e8eef41a4591f08019":
      rpc error: code = Unavailable desc = name resolver error: produced zero addresses'
    reason: PrefetchFailed
    status: "True"
    type: PinnedImageSetsDegraded

    

Expected results:

The images should be pinned without errors.

    

Additional info:


Slack conversation: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1730125766377509

This is Sam's guess (thank you [~sbatschelet] for your quick help, I really appreciate it):
My guess is that it is related to https://github.com/openshift/machine-config-operator/pull/4629
Specifically the changes to pkg/daemon/cri/cri.go where we swapped out DialContext for NewClient. Per docs.
One subtle difference between NewClient and Dial and DialContext is that the former uses "dns" as the default name resolver, while the latter use "passthrough" for backward compatibility. This distinction should not matter to most users, but could matter to legacy users that specify a custom dialer and expect it to receive the target string directly.
    

Description of problem:

Remove the extra '.' from the INFO message below when running the add-nodes workflow:

INFO[2024-08-15T12:48:45-04:00] Generated ISO at ../day2-worker-4/node.x86_64.iso.. The ISO is valid up to 2024-08-15T16:48:00Z 

Version-Release number of selected component (if applicable):

    

How reproducible:

    100%

Steps to Reproduce:

    1. Run  oc adm node-image create command to create a node iso
    2. See the INFO message at the end
    3.
    

Actual results:

 INFO[2024-08-15T12:48:45-04:00] Generated ISO at ../day2-worker-4/node.x86_64.iso.. The ISO is valid up to 2024-08-15T16:48:00Z   

Expected results:

    INFO[2024-08-15T12:48:45-04:00] Generated ISO at ../day2-worker-4/node.x86_64.iso. The ISO is valid up to 2024-08-15T16:48:00Z 

Additional info:

    

Please review the following PR: https://github.com/openshift/installer/pull/8957

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

TRT has detected a consistent long term trend where the oauth-apiserver appears to have more disruption than it did in 4.16, for minor upgrades on azure.

The problem appears roughly above the 90th percentile; we picked it up at P95, where it shows a consistent 5-8s more disruption than we'd expect given the data at 4.16 GA.

The problem hits ONLY oauth, affecting both new and reused connections, as well as the cached variants meaning etcd should be out of the picture. You'll see a few very short blips where all four of these backends lose connectivity for ~1s throughout the run, several times over. It looks like it may be correlated to the oauth-operator reporting:

source/OperatorAvailable display/true condition/Available reason/APIServices_Error status/False APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https" [2s]

Sample jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669509775822848
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669509775822848/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122854.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827077182283845632
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827077182283845632/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240823-212127.json&overrideDisplayFlag=1&selectedSources=OperatorDegraded&selectedSources=EtcdLog&selectedSources=Disruption&selectedSources=E2EFailed

More can be found using the first link to the dashboard in this post and scrolling down to most recent job runs, and looking for high numbers.

The operator Degraded condition is probably the strongest symptom to pursue, as it appears in most of the above.

If you find any runs where other backends are disrupted, especially kube-api, I would suggest ignoring those as they are unlikely to be the same fingerprint as the error being described here.

Description of problem:

Before the kubelet systemd service runs the kubelet binary, it calls the restorecon command:
https://github.com/openshift/machine-config-operator/blob/master/templates/worker/01-worker-kubelet/on-prem/units/kubelet.service.yaml#L13 

But the restorecon command expects a path to be given;
providing a path is mandatory.
See the man page: https://linux.die.net/man/8/restorecon

At the moment the command does nothing and the error
is swallowed due to the dash (-) at the beginning
of the command.

This results in files that are labeled with the wrong SELinux labels.
For example:
After https://github.com/containers/container-selinux/pull/329 got merged, /var/lib/kubelet/pod-resources/* is expected to carry the kubelet_var_lib_t label, but it doesn't; it still carries the old label, container_var_lib_t.

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    Always

Steps to Reproduce:

    1. Check the SELinux labels of files under the system with ls -Z command.
    

Actual results:

    Files are labeled with wrong SELinux labels.

Expected results:

Files' SELinux labels are supposed to match their configuration as captured in the container-selinux package.

Additional info:

    

Description of problem:

We have an OKD 4.12 cluster which has persistent and increasing ingresswithoutclassname alerts with no ingresses normally present in the cluster. I believe the ingresswithoutclassname being counted is created as part of the ACME validation process managed by the cert-manager operator with its openshift route addon, and is torn down once the ACME validation is complete.

Version-Release number of selected component (if applicable):

 4.12.0-0.okd-2023-04-16-041331

How reproducible:

Seems very consistent. It went away during an update but came back shortly after and continues to increase.

Steps to Reproduce:

1. Create an ingress without a classname (see the example after this list)
2. see counter increase
3. delete classless ingress
4. counter does not decrease.
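
A minimal ingress for step 1 (names are illustrative); it has no ingressClassName, which is what the alert counts:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: no-class-example
  namespace: default
spec:
  # intentionally no ingressClassName
  defaultBackend:
    service:
      name: example
      port:
        number: 80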

Additional info:

https://github.com/openshift/cluster-ingress-operator/issues/912

Please review the following PR: https://github.com/openshift/csi-operator/pull/241

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/csi-operator/pull/269

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

The reality is that a lot of bare-metal clusters end up using platform=none. For example, SNOs only have this platform value, so SNO users can never use a provisioning network (and thus any hardware that does not support virtual media). UPI and UPI-like clusters are by definition something that operators configure for themselves, so locking them out of features makes even less sense.

With OpenStack based on OCP nowadays, I expect to see a sharp increase in complaints about this topic.

Add e2e tests to verify that "Show deprecated operators in OperatorHub" works.

 

Open question:

What kind of tests would be most appropriate for this situation, considering the dependencies required for end-to-end (e2e) tests?

Dependencies:

  • Create a CatalogSource resource
  • Install a test operator i.e. Kiali Community Operator

AC:

  • Add integration tests for both pre and post installation steps [2]
    • Here we should use the Kiali Community Operator, which we should install through the CLI, so the test won't take too much time.

In https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-shiftstack-ci-release-4.18-e2e-openstack-ovn-etcd-scaling/1834144693181485056 I noticed the following panic:

 Undiagnosed panic detected in pod expand_less 	0s
{  pods/openshift-monitoring_prometheus-k8s-1_prometheus_previous.log.gz:ts=2024-09-12T09:30:09.273Z caller=klog.go:124 level=error component=k8s_client_runtime func=Errorf msg="Observed a panic: &runtime.TypeAssertionError{_interface:(*abi.Type)(0x3180480), concrete:(*abi.Type)(0x34a31c0), asserted:(*abi.Type)(0x3a0ac40), missingMethod:\"\"} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Node)\ngoroutine 13218 [running]:\nk8s.io/apimachinery/pkg/util/runtime.logPanic({0x32f1080, 0xc05be06840})\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x90\nk8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc010ef6000?})\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b\npanic({0x32f1080?, 0xc05be06840?})\n\t/usr/lib/golang/src/runtime/panic.go:770 +0x132\ngithub.com/prometheus/prometheus/discovery/kubernetes.NewEndpoints.func11({0x34a31c0?, 0xc05bf3a580?})\n\t/go/src/github.com/prometheus/prometheus/discovery/kubernetes/endpoints.go:170 +0x4e\nk8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/client-go/tools/cache/controller.go:253\nk8s.io/client-go/tools/cache.(*processorListener).run.func1()\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/client-go/tools/cache/shared_informer.go:977 +0x9f\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00fc92f70, {0x456ed60, 0xc031a6ba10}, 0x1, 0xc015a04fc0)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf\nk8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc011678f70, 0x3b9aca00, 0x0, 0x1, 0xc015a04fc0)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f\nk8s.io/apimachinery/pkg/util/wait.Until(...)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161\nk8s.io/client-go/tools/cache.(*processorListener).run(0xc04c607440)\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/client-go/tools/cache/shared_informer.go:966 +0x69\nk8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x52\ncreated by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 12933\n\t/go/src/github.com/prometheus/prometheus/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73\n"}

This issue seems relatively common on OpenStack; these runs very frequently hit this failure.

Linked test name: Undiagnosed panic detected in pod

Description of problem:

    Alerts with non-standard severity labels are sent to Telemeter.

Version-Release number of selected component (if applicable):

    All supported versions

How reproducible:

    Always

Steps to Reproduce:

    1. Create an always-firing alerting rule with severity=foo (see the example after this list).
    2. Make sure that telemetry is enabled for the cluster.
    3.
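
For step 1, an always-firing rule with a non-standard severity can be created with a PrometheusRule like this (a minimal sketch; the rule name and namespace are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-nonstandard-severity
  namespace: openshift-monitoring
spec:
  groups:
    - name: example
      rules:
        - alert: ExampleAlwaysFiring
          expr: vector(1)   # always evaluates to a firing alert
          labels:
            severity: foo   # non-standard severity that should be dropped by the allow-list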
    

Actual results:

    The alert can be seen on the telemeter server side.

Expected results:

    The alert is dropped by the telemeter allow-list.

Additional info:

Red Hat operators should use standard severities: https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide
Looking at the current data, it looks like ~2% of the alerts reported to Telemeter have an invalid severity.

Description of problem:

    After upgrading OCP and LSO to version 4.14, Elasticsearch pods in the openshift-logging deployment are unable to schedule to their respective nodes and remain Pending, even though the LSO-managed PVs are bound to the PVCs. However, a test pod using a newly created test PV managed by the LSO is able to schedule correctly.

Version-Release number of selected component (if applicable):

    4.14

How reproducible:

    Consistently

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    Pods consuming previously existing LSO managed PVs are unable to schedule and remain in a Pending state after upgrading OCP and LSO to 4.14.

Expected results:

    That pods would be able to consume LSO managed PVs and schedule correctly to nodes.

Additional info:

    

Description of problem:

    When the HO is installed without a pull secret, the shared ingress controller fails to create the router pod because the pull secret is missing.

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    100%

Steps to Reproduce:

    1.Install HO without pullsecret
    2.Watch HO report error   "error":"failed to get pull secret &Secret{ObjectMeta:{
      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [][]},Data:map[string[]byte{},Type:,StringData:map[string]string{},Immutabl:nil,}: Secret \"pull-secret\" not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.
    3. Observe that no Router pod is created in the hypershift sharedingress namespace 

 

Actual results:

    The router pod doesn't get created in the hypershift sharedingress namespace.

Expected results:

    The router pod gets created in the hypershift sharedingress namespace.

Additional info:

    

Description of problem:

    The description and name for GCP Pool ID are not consistent.
Issue is related to bug https://issues.redhat.com/browse/OCPBUGS-38557 

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-08-19-002129

How reproducible:

    Always

Steps to Reproduce:

    1. Prepare a WI/FI enabled GCP cluster
    2. Go to the Web Terminal operator installation page
    3. Check the description and name for GCP Pool ID
    

Actual results:

    The description and name for GCP Pool ID are not consistent.

Expected results:

    The description and name for GCP Pool ID should be consistent.

Additional info:

    Screenshot: https://drive.google.com/file/d/1PwiH3xk39pGzCgcHPzIHlv3ABzXYqz1O/view?usp=drive_link

When the openshift-install agent wait-for bootstrap-complete command logs the status of the host validations, it logs the same hostname for all validations, regardless of which host they apply to. This makes it impossible for the user to determine which host needs remediation when a validation fails.

Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/68

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

This is a spinoff of https://issues.redhat.com/browse/OCPBUGS-38012. For additional context please see that bug.

The TLDR is that Restart=on-failure for oneshot units were only supported in systemd v244 and onwards, meaning any bootimage for 4.12 and previous doesn't support this on firstboot, and upgraded clusters would no longer be able to scale nodes if it references any such service.

Right now this is only https://github.com/openshift/machine-config-operator/blob/master/templates/common/openstack/units/afterburn-hostname.service.yaml#L16-L24 which isn't covered by https://issues.redhat.com/browse/OCPBUGS-38012

Version-Release number of selected component (if applicable):

4.16 right now

How reproducible:

Uncertain, but https://issues.redhat.com/browse/OCPBUGS-38012 is 100%

Steps to Reproduce:

    1.install old openstack cluster
    2.upgrade to 4.16
    3.attempt to scale node
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x1a774eb]goroutine 11358 [running]:
testing.tRunner.func1.2({0x1d3d600, 0x3428a50})
    /usr/lib/golang/src/testing/testing.go:1631 +0x24a
testing.tRunner.func1()
    /usr/lib/golang/src/testing/testing.go:1634 +0x377
panic({0x1d3d600?, 0x3428a50?})
    /usr/lib/golang/src/runtime/panic.go:770 +0x132
github.com/openshift/cluster-ingress-operator/test/e2e.updateDNSConfig(...)
    /go/src/github.com/openshift/cluster-ingress-operator/test/e2e/dns_ingressdegrade_test.go:89
github.com/openshift/cluster-ingress-operator/test/e2e.TestIngressStatus(0xc000511380)
    /go/src/github.com/openshift/cluster-ingress-operator/test/e2e/dns_ingressdegrade_test.go:53 +0x34b
testing.tRunner(0xc000511380, 0x218c9f8)
    /usr/lib/golang/src/testing/testing.go:1689 +0xfb
created by testing.(*T).Run in goroutine 11200
    /usr/lib/golang/src/testing/testing.go:1742 +0x390
FAIL    github.com/openshift/cluster-ingress-operator/test/e2e    1612.553s
FAIL
make: *** [Makefile:56: test-e2e] Error 1

Version-Release number of selected component (if applicable):

    master

How reproducible:

    run the cluster-ingress-operator e2e tests against the OpenStack platform.

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    the nil pointer error

Expected results:

    no error

Additional info:

    

Description of problem:

- One node [rendezvous] failed to be added to the cluster and there are some pending CSRs.

- omc get csr 
NAME                                                            AGE   SIGNERNAME                                    REQUESTOR                                                                   REQUESTEDDURATION   CONDITION
csr-44qjs                                                       21m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-9n9hc                                                       5m    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-9xw24                                                       1h    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-brm6f                                                       1h    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-dz75g                                                       36m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-l8c7v                                                       1h    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-mv7w5                                                       52m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
csr-v6pgd                                                       1h    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
In order to complete the installation, the customer needs to approve those CSRs manually.
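
For reference, a hedged sketch of the manual workaround (approving whatever node-bootstrapper CSRs are currently Pending):

~~~
# Approve all currently Pending CSRs; review the list first in a real environment.
oc get csr --no-headers | awk '/Pending/ {print $1}' | xargs --no-run-if-empty oc adm certificate approve
~~~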

Steps to Reproduce:

   agent-based installation. 
    

Actual results:

    CSRs are in Pending state.

Expected results:

    CSRs should be approved automatically.

Additional info:

Logs : https://drive.google.com/drive/folders/1UCgC6oMx28k-_WXy8w1iN_t9h9rtmnfo?usp=sharing

A string comparison is being done with "-eq"; it should use "=".

 

[derekh@u07 assisted-installer-agent]$ sudo podman build -f Dockerfile.ocp 
STEP 1/3: FROM registry.ci.openshift.org/ocp/builder:rhel-9-golang-1.21-openshift-4.16 AS builder
STEP 2/3: RUN if [ "$(arch)" -eq "x86_64" ]; then dnf install -y biosdevname dmidecode; fi
/bin/sh: line 1: [: x86_64: integer expression expected
--> cb5707d9d703
STEP 3/3: RUN if [ "$(arch)" -eq "aarch64" ]; then dnf install -y dmidecode; fi
/bin/sh: line 1: [: x86_64: integer expression expected
COMMIT
--> 0b12a705f47e
0b12a705f47e015f43d7815743f2ad71da764b1358decc151454ec8802a827fc
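
For reference, with the fix described above the shell conditions inside those RUN steps would use string equality, roughly:

~~~
# Use the string operator "=" (not the integer operator "-eq") when comparing arch output.
if [ "$(arch)" = "x86_64" ]; then dnf install -y biosdevname dmidecode; fi
if [ "$(arch)" = "aarch64" ]; then dnf install -y dmidecode; fi
~~~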

 

Please review the following PR: https://github.com/openshift/ironic-rhcos-downloader/pull/99

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-api-provider-ibmcloud/pull/85

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of the problem:

When discovering an ARM host and trying to install CNV, I get the following:

  • CNV requirements: CPU does not have virtualization support.

From inventory, CPU flags are:

cpu":{
      "architecture":"aarch64",
      "count":16,
      "flags":[
         "fp",
         "asimd",
         "evtstrm",
         "aes",
         "pmull",
         "sha1",
         "sha2",
         "crc32",
         "atomics",
         "fphp",
         "asimdhp",
         "cpuid",
         "asimdrdm",
         "lrcpc",
         "dcpop",
         "asimddp",
         "ssbs"
      ],
      "model_name":"Neoverse-N1" 

How reproducible:
100%

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

Description of problem:

A normal user without any projects visiting the Networking pages sees them loading indefinitely.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-09-08-130531    

How reproducible:

Always    

Steps to Reproduce:

    1. A user without any projects visits the Services, Routes, Ingresses, and NetworkPolicies pages
    2.
    3.
    

Actual results:

These list pages are always loading.

Expected results:

Show the getting started guide and dim the resources list.

Additional info:

    

Description of problem:

The placeholder "Select one or more NetworkAttachmentDefinitions" is highlighted while selecting a NAD.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

We need to remove the dra_manager_state checkpoint on kubelet restart to prevent mismatch errors on restart with TechPreview or DevPreview clusters.
failed to run Kubelet: failed to create claimInfo cache: error calling GetOrCreate() on checkpoint state: failed to get checkpoint dra_manager_state: checkpoint is corrupted"   
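
As a rough sketch only (the exact location of the checkpoint file is an assumption here, not confirmed by this report), clearing the checkpoint before the kubelet starts would look something like:

~~~
# Hedged sketch: remove a stale dra_manager_state checkpoint so it cannot block
# kubelet startup. The path below is an assumption; verify where the checkpoint
# actually lives on the node before automating this (e.g. in a unit drop-in).
find /var/lib/kubelet -maxdepth 2 -name dra_manager_state -delete
~~~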

Version-Release number of selected component (if applicable):

4.18+

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

See https://github.com/openshift/console/pull/14030/files/0eba7f7db6c35bbf7bca5e0b8eebd578e47b15cc#r1707020700

Description of problem

The cluster-dns-operator repository vendors controller-runtime v0.17.3, which uses Kubernetes 1.29 packages. The cluster-dns-operator repository also vendors k8s.io/* v0.29.2 packages. However, OpenShift 4.17 is based on Kubernetes 1.30.

Version-Release number of selected component (if applicable)

4.17.

How reproducible

Always.

Steps to Reproduce

Check https://github.com/openshift/cluster-dns-operator/blob/release-4.17/go.mod.

Actual results

The sigs.k8s.io/controller-runtime package is at v0.17.3, and the k8s.io/* packages are at v0.29.2.

Expected results

The sigs.k8s.io/controller-runtime package is at v0.18.0 or newer, and the k8s.io/* packages are at v0.30.0 or newer.

Additional info

The controller-runtime v0.18 release includes some breaking changes; see the release notes at https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.18.0.
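
A minimal sketch of the kind of dependency bump described in the expected results, run inside the cluster-dns-operator repository (exact patch versions and any repo-specific vendoring steps are assumptions):

~~~
# Bump controller-runtime and the k8s.io/* modules, then refresh the vendor tree.
go get sigs.k8s.io/controller-runtime@v0.18.0
go get k8s.io/api@v0.30.0 k8s.io/apimachinery@v0.30.0 k8s.io/client-go@v0.30.0
go mod tidy
go mod vendor
~~~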

Description of problem:

    No pagination on the NetworkPolicies table list

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-09-09-212926
    4.17.0-0.nightly-2024-09-09-120947

How reproducible:

    Always

Steps to Reproduce:

    1. Navigate to the Networking -> NetworkPolicies page, create multiple resources (at least more than 20)
    2. Check the NetworkPolicies table list
    3.
    

Actual results:

    No pagination on the table

Expected results:

    Add pagination; it should also be controllable via the 'pagination_nav-control' related button/function.

Additional info:

    

Converted the story tracking the routine i18n upload/download tasks to a bug so that it can be backported to 4.17, as this latest translations batch contains missing translations, including the ES (Spanish) language, for the 4.17 release.

Original story: https://issues.redhat.com/browse/CONSOLE-4238

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Running oc scale on a nodepool fails with 404 not found 

Version-Release number of selected component (if applicable):

Latest hypershift operator

How reproducible:

100%

Steps to Reproduce:

  1. Deploy latest hypershift operator and create a hosted cluster
  2. Scale the nodepool with oc scale nodepool

Actual results:

Scaling fails

 

[2024-10-20 22:13:17] + oc scale nodepool/assisted-test-cluster -n assisted-spoke-cluster --replicas=1
[2024-10-20 22:13:17] Error from server (NotFound): nodepools.hypershift.openshift.io "assisted-test-cluster" not found 

 

 

Expected results:

Scaling succeeds

Additional info:

Discovered in our CI tests beginning October 17th https://prow.ci.openshift.org/job-history/gs/test-platform-results/logs/periodic-ci-openshift-cluster-api-provider-agent-master-e2e-ai-operator-ztp-capi-periodic

  • Note: we had to put in a workaround (directly patching the nodepool) so tests could succeed starting from Oct 22; see the sketch below.

Slack thread discussion
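
For reference, a minimal sketch of the patch workaround mentioned above (object name and namespace taken from the failing command; the patched field is assumed to be spec.replicas):

~~~
# Hedged sketch of the CI workaround: patch the replica count directly instead of
# relying on the (currently failing) scale subresource.
oc patch nodepool/assisted-test-cluster -n assisted-spoke-cluster \
  --type merge -p '{"spec":{"replicas":1}}'
~~~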

 

Description of problem:

see from screen recording https://drive.google.com/file/d/1LwNdyISRmQqa8taup3nfLRqYBEXzH_YH/view?usp=sharing

In the dev console, on the "Observe -> Metrics" tab, typing in the query-browser text area causes the cursor to jump to the project drop-down list. This issue exists in 4.17.0-0.nightly-2024-08-19-165854 and 4.18.0-0.nightly-2024-08-19-002129; there is no such issue with the admin console.

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-08-19-165854 and 4.18.0-0.nightly-2024-08-19-002129

How reproducible:

always

Steps to Reproduce:

1. see the description
    

Actual results:

The cursor jumps to the project drop-down.

Expected results:

cursor should not move

Additional info:

    

Description of the problem:

[Staging] BE 2.35.0, UI 2.34.2 - User is not able to select ODF once CNV is selected, as LVMS is repeatedly enabled.

How reproducible:

100%

Steps to reproduce:

1. Create new cluster

2. Select cnv

3. LVMS is enabled; disabling it ends up with it being enabled again

Actual results:

 

Expected results:

Description of problem:

Customer has a cluster in AWS that was born on an old OCP version (4.7) and was upgraded all the way through 4.15.
During the lifetime of the cluster they changed the DHCP option in AWS to "domain name". 
During node provisioning while scaling a MachineSet, the Machine is successfully created at the cloud provider but the Node is never added to the cluster. 
The CSRs remain pending and do not get auto-approved.

This issue is eventually related or similar to the bug fixed via https://issues.redhat.com/browse/OCPBUGS-29290

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

   CSRs don't get auto-approved. New nodes have a different domain name when the CSR is approved manually.

Expected results:

    CSRs should get approved automatically and the domain name scheme should not change.

Additional info:

    

Description of problem:

    Navigation:
               Storage -> VolumeSnapshots -> kebab-menu -> Mouse hover on 'Restore as new PVC'

    Issue:
               "Volume Snapshot is not Ready" is in English.

Version-Release number of selected component (if applicable):

    4.17.0

How reproducible:

    

Steps to Reproduce:

    1. Log into webconsole and add "?pseudolocalization=true&lng=en" to URL
    2. Navigate to Storage -> VolumeSnapshots -> kebab-menu -> Mouse hover on 'Restore as new PVC'     
    3. "Volume Snapshot is not Ready" is in English.     

Actual results:

    Content is not marked for translation

Expected results:

    Content should be marked for translation

Additional info:

    Reference screenshot added

Description of problem:

The fix to remove the ssh connection and just add an ssh port test causes a problem, as the address is not formatted correctly. We see:

level=debug msg=Failed to connect to the Rendezvous Host on port 22: dial tcp: address fd2e:6f44:5dd8:c956::50:22: too many colons in address
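
For context, an IPv6 literal must be wrapped in brackets when it is combined with a port into a single host:port string, i.e. [fd2e:6f44:5dd8:c956::50]:22 rather than fd2e:6f44:5dd8:c956::50:22. A quick sketch of a port check that sidesteps the formatting pitfall:

~~~
# nc takes the host and port as separate arguments, so no bracketing is needed
# (depending on the nc implementation, adding -6 may be required).
nc -z -w 5 fd2e:6f44:5dd8:c956::50 22 && echo "port 22 reachable"
~~~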

Description of problem:

    The CSS of some components (Banner, Jumplinks) isn't loading properly.

See screenshot: https://photos.app.goo.gl/2Z1cK5puufGBVBcu5

On the screen cast, ex-aao in namespace default is a banner, and should look like: https://photos.app.goo.gl/n4LUgrGNzQT7n1Pr8

The vertical jumplinks should look like: https://photos.app.goo.gl/8GAs71S43PnAS7wH7

 

You can test our plugin: https://github.com/artemiscloud/activemq-artemis-self-provisioning-plugin/pull/278

 

1. yarn

2. yarn start

3. navigate to http://localhost:9000/k8s/ns/default/add-broker

 

Description of problem:

Customers are unable to scale up OCP nodes when the initial setup was done with OCP 4.8/4.9 and the cluster was then upgraded to 4.15.22/4.15.23.

At first the customer observed that node scale-up failed and /etc/resolv.conf was empty on the nodes.
As a workaround, the customer copied the resolv.conf content from a correct resolv.conf, and setup of the new node then continued.

They then inspected the rendered MachineConfig assembled from 00-worker, and suspected that something could be wrong with the on-prem-resolv-prepender.service definition.
As a workaround, the customer manually changed this service definition, which helped them scale up new nodes.

Version-Release number of selected component (if applicable):

4.15 , 4.16

How reproducible:

100%

Steps to Reproduce:

1. Install OCP vSphere IPI cluster version 4.8 or 4.9
2. Check "on-prem-resolv-prepender.service" service definition
3. Upgrade it to 4.15.22 or 4.15.23
4. Check if the node scaling is working 
5. Check "on-prem-resolv-prepender.service" service definition     

Actual results:

Unable to scale up a node with the default service definition. After manually making changes in the service definition, scaling works.

Expected results:

Node scaling should work without making any manual changes in the service definition.

Additional info:

on-prem-resolv-prepender.service content on clusters built with 4.8 / 4.9 and then upgraded to 4.15.22 / 4.15.23:
~~~
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service
[Service]
Type=oneshot
Restart=on-failure
RestartSec=10
StartLimitIntervalSec=0
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
~~~

After manually correcting the service definition as below, scaling works on 4.15.22 / 4.15.23 :
~~~
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service
StartLimitIntervalSec=0                -----------> this
[Service]
Type=oneshot
#Restart=on-failure                    -----------> this
RestartSec=10
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
~~~

Below is the on-prem-resolv-prepender.service on a freshly installed 4.15.23 where scaling works fine:
~~~
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service
StartLimitIntervalSec=0
[Service]
Type=oneshot
Restart=on-failure
RestartSec=10
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
~~~

This was observed in the rendered MachineConfig, which is assembled from 00-worker.

Description of problem:

    If the `template:` field in the vSphere platform spec is defined, the installer should not download the OVA.

Version-Release number of selected component (if applicable):

    4.16.x 4.17.x

How reproducible:

    Always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Failed to create NetworkAttachmentDefinition for namespace scoped CRD in layer3

Version-Release number of selected component (if applicable):

4.17

How reproducible:

always

Steps to Reproduce:

1. apply CRD yaml file
2. check the NetworkAttachmentDefinition status

Actual results:

status with error 

Expected results:

NetworkAttachmentDefinition has been created 

 

 

Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/275

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

With the newer azure-sdk-for-go replacing go-autorest, there was a change to use ClientCertificateCredential that no longer includes the `SendCertificateChain` option by default, which used to be set. The ARO team requires this to be set; otherwise the first-party (1P) integration for SNI will not work.

Old version: https://github.com/Azure/go-autorest/blob/f7ea664c9cff3a5257b6dbc4402acadfd8be79f1/autorest/adal/token.go#L262-L264

New version: https://github.com/openshift/installer-aro/pull/37/files#diff-da950a4ddabbede621d9d3b1058bb34f8931c89179306ee88a0e4d76a4cf0b13R294

    

Version-Release number of selected component (if applicable):

This was introduced in the OpenShift installer PR: https://github.com/openshift/installer/pull/6003    

How reproducible:

Every time we authenticate using SNI in Azure.  

Steps to Reproduce:

    1.  Configure a service principal in the Microsoft tenant using SNI
    2.  Attempt to run the installer using client-certificate credentials to install a cluster with credentials mode in manual
    

Actual results:

Installation fails as we're unable to authenticate using SNI.  
    

Expected results:

We're able to authenticate using SNI.  
    

Additional info:

This should not have any effect on existing non-SNI-based authentication methods using client certificate credentials. It was previously set in autorest for golang, but is not defaulted to in the newer azure-sdk-for-go.


Note that only first party Microsoft services will be able to leverage SNI in Microsoft tenants.  The test case for this on the installer side would be to ensure it doesn't break manual credential mode installs using a certificate pinned to a service principal.  

 

 

All we need changed is to pass the `SendCertificateChain: true` option only on client certificate credentials. Ideally we could backport this to all OpenShift versions which received the migration from AAD to Microsoft Graph.

Description of problem:

    When the image from a build is rolling out on the nodes, the update progress on the node is not displayed correctly.

Version-Release number of selected component (if applicable):

    

How reproducible:

Always     

Steps to Reproduce:

    1. Enable OCL functionality 
    2. Opt the pool in by MachineOSConfig 
    3. Wait for the image to build and roll out
    4. Track mcp update status by oc get mcp 
    

Actual results:

The MCP starts with 0 ready machines. While 1-2 machines have already been updated, the count still remains 0; the count jumps to 3 only when all machines are ready.

Expected results:

The update progress should be reflected in the MCP status correctly.

Additional info:

    

Please review the following PR: https://github.com/openshift/baremetal-runtimecfg/pull/333

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

CAPA is leaking one EIP in the bootstrap life cycle when creating clusters on 4.16+ with a BYO IPv4 pool in the config.

The install logs show a duplicated-EIP message; there is a race condition where the EIP is created and the association is attempted while the instance isn't ready (Running state):

~~~
time="2024-05-08T15:49:33-03:00" level=debug msg="I0508 15:49:33.785472 2878400 recorder.go:104] 
\"Failed to associate Elastic IP for \\\"ec2-i-03de70744825f25c5\\\": InvalidInstanceID: 
The pending instance 'i-03de70744825f25c5' is not in a valid state for this operation.\\n\\tstatus code: 
400, request id: 7582391c-b35e-44b9-8455-e68663d90fed\" logger=\"events\" type=\"Warning\" 
object=[...]\"name\":\"mrb-byoip-32-kbcz9\",\"[...] reason=\"FailedAssociateEIP\""

time="2024-05-08T15:49:33-03:00" level=debug msg="E0508 15:49:33.803742 2878400 controller.go:329] \"Reconciler error\" err=<"

time="2024-05-08T15:49:33-03:00" level=debug msg="\tfailed to reconcile EIP: failed to associate Elastic IP 
\"eipalloc-08faccab2dbb28d4f\" to instance \"i-03de70744825f25c5\": 
InvalidInstanceID: The pending instance 'i-03de70744825f25c5' is not in a valid state for this operation."
~~~

The EIP is deleted when the bootstrap node is removed after a successful installation, although the bug impacts any new machine with a public IP set using BYO IPv4 provisioned by CAPA. An upstream issue has been opened: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/5038

Version-Release number of selected component (if applicable):

   4.16+

How reproducible:

    always

Steps to Reproduce:

    1. create install-config.yaml setting platform.aws.publicIpv4Pool=poolID
    2. create cluster
    3. check the AWS Console, EIP page filtering by your cluster, you will see the duplicated EIP, while only one is associated to the correct bootstrap instance
    

Actual results:

    

Expected results:

- installer/capa creates only one EIP for bootstrap when provisioning the cluster
- no error messages for expected behavior (ec2 association errors in pending state)     

Additional info:

    CAPA issue: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/5038 

 

Please review the following PR: https://github.com/openshift/configmap-reload/pull/64

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

For some provisioners, the access modes are not correct. It would be good to have someone from the storage team confirm the access mode values in https://github.com/openshift/console/blob/master/frontend/public/components/storage/shared.ts#L107

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-06-12-101500

How reproducible:

Always

Steps to Reproduce:

1. setup a cluster in GCP, check storageclasses
$ oc get sc
NAME                     PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
ssd-csi                  pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   true                   5h37m
standard-csi (default)   pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   true                   5h37m

2. Go to the PVC creation page, choose any storageclass in the dropdown and check the `Access mode` list

Actual results:

there is only `RWO` access mode

Expected results:

pd.csi.storage.gke.io supports both RWO and RWOP.

supported access mode reference https://docs.openshift.com/container-platform/4.15/storage/understanding-persistent-storage.html#pv-access-modes_understanding-persistent-storage

Additional info:

 

 The fields for `last_installation_preparation_status` for a cluster are currently reset when the user sends a request to `install` the cluster.

In the case that multiple requests are received, this can lead to this status being illegally cleared when it should not be.

It is safer to move this to the state machine where it can be ensured that states have changed in the correct way prior to the reset of this field.

Description of problem:

Layer3 egress traffic from a pod in a segmented network does not work.

Version-Release number of selected component (if applicable):

build openshift/ovn-kubernetes#2274,openshift/api#2005

oc version

Client Version: 4.15.9
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: 4.17.0-0.ci.test-2024-08-28-123437-ci-ln-v5g4wb2-latest
Kubernetes Version: v1.30.3-dirty
 

 

How reproducible:

Always

Steps to Reproduce:

1. Create a cluster UPI GCP with build from cluster bot

2. Create a namespace test with a NAD as below

 oc -n test get network-attachment-definition l3-network-nad -oyaml

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  creationTimestamp: "2024-08-28T17:44:14Z"
  generation: 1
  name: l3-network-nad
  namespace: test
  resourceVersion: "108224"
  uid: 5db4ca26-39dd-45b7-8016-215664e21f5d
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "l3-network",
      "type": "ovn-k8s-cni-overlay",
      "topology":"layer3",
      "subnets": "10.150.0.0/16",
      "mtu": 1300,
      "netAttachDefName": "test/l3-network-nad",
      "role": "primary"
    }

3. Create a pod in the segmented namespace test

 oc -n test exec -it hello-pod -- ip a

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0@if25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1360 qdisc noqueue state UP group default 
    link/ether 0a:58:0a:83:00:11 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.131.0.17/23 brd 10.131.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fe83:11/64 scope link 
       valid_lft forever preferred_lft forever
3: ovn-udn1@if26: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1300 qdisc noqueue state UP group default 
    link/ether 0a:58:0a:96:03:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.150.3.3/24 brd 10.150.3.255 scope global ovn-udn1
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fe96:303/64 scope link 
       valid_lft forever preferred_lft forever

 oc -n test exec -it hello-pod -- ip r

default via 10.150.3.1 dev ovn-udn1 
10.128.0.0/14 via 10.131.0.1 dev eth0 
10.131.0.0/23 dev eth0 proto kernel scope link src 10.131.0.17 
10.150.0.0/16 via 10.150.3.1 dev ovn-udn1 
10.150.3.0/24 dev ovn-udn1 proto kernel scope link src 10.150.3.3 
100.64.0.0/16 via 10.131.0.1 dev eth0 
100.65.0.0/16 via 10.150.3.1 dev ovn-udn1 
172.30.0.0/16 via 10.150.3.1 dev ovn-udn1 

4. Try to curl the IP echo server running outside the cluster to see it fail.

 oc -n test exec -it hello-pod -- curl 10.0.0.2:9095 --connect-timeout 5

curl: (28) Connection timeout after 5001 ms
command terminated with exit code 28
 

Actual results:

curl request fails

Expected results:

curl request should pass

Additional info:

Egress from a pod in a regular namespace works:

 oc -n test1 exec -it hello-pod -- curl 10.0.0.2:9095 --connect-timeout 5

10.0.128.4

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure
  2. customer issue / SD
  3. internal RedHat testing failure

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

If it is a CI failure:

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

If it is a customer / SD issue:

  • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
  • Don’t presume that Engineering has access to Salesforce.
  • Do presume that Engineering will access attachments through supportshell.
  • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • When showing the results from commands, include the entire command in the output.  
  • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
  • Do not set the priority, that is owned by Engineering and will be set when the bug is evaluated
  • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”
  • For guidance on using this template please see
    OCPBUGS Template Training for Networking  components

Description of problem:

    The CatalogSource file generated for mirror-to-mirror is invalid: it points at the local cache.

Version-Release number of selected component (if applicable):

  ./oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.18.0-202409091841.p0.g45b1fcd.assembly.stream.el9-45b1fcd", GitCommit:"45b1fcd9df95420d5837dfdd2775891ae3dd6adf", GitTreeState:"clean", BuildDate:"2024-09-09T20:48:47Z", GoVersion:"go1.22.5 (Red Hat 1.22.5-1.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

     Always
    

Steps to Reproduce:

  1. run the mirror2mirror command : 

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  operators:
  - catalog: quay.io/openshifttest/nginxolm-operator-index:mirrortest1


`oc-mirror -c config-head.yaml --workspace file://out-head docker://my-route-zhouy.apps.yinzhou0910.qe.azure.devcluster.openshift.com  --v2 --dest-tls-verify=false`     

Actual results:

The CatalogSource file is invalid and is created twice:

2024/09/10 10:47:35  [INFO]   : 📄 Generating CatalogSource file...
2024/09/10 10:47:35  [INFO]   : out11re/working-dir/cluster-resources/cs-redhat-operator-index-v4-15.yaml file created
2024/09/10 10:47:35  [INFO]   : out11re/working-dir/cluster-resources/cs-redhat-operator-index-v4-15.yaml file created
2024/09/10 10:47:35  [INFO]   : mirror time     : 1m41.028961606s
2024/09/10 10:47:35  [INFO]   : 👋 Goodbye, thank you for using oc-mirror
[fedora@preserve-fedora-yinzhou yinzhou]$ ll out11re/working-dir/cluster-resources/
total 8
-rw-r--r--. 1 fedora fedora 242 Sep 10 10:47 cs-redhat-operator-index-v4-15.yaml
-rw-r--r--. 1 fedora fedora 289 Sep 10 10:47 idms-oc-mirror.yaml
[fedora@preserve-fedora-yinzhou yinzhou]$ cat out11re/working-dir/cluster-resources/cs-redhat-operator-index-v4-15.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: cs-redhat-operator-index-v4-15
  namespace: openshift-marketplace
spec:
  image: localhost:55000/redhat/redhat-operator-index:v4.15
  sourceType: grpc
status: {}

Expected results:

    The CatalogSource file should be created with the registry route, not the local cache.
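
For comparison, a rough sketch of what the expected CatalogSource content would look like, with the image host being the registry route passed on the command line rather than the local cache (the exact repository path is whatever oc-mirror pushed; this is an assumption, not output from the tool):

~~~
# Expected shape of the generated CatalogSource (sketch only).
cat <<'EOF'
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: cs-redhat-operator-index-v4-15
  namespace: openshift-marketplace
spec:
  image: my-route-zhouy.apps.yinzhou0910.qe.azure.devcluster.openshift.com/redhat/redhat-operator-index:v4.15
  sourceType: grpc
EOF
~~~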

Additional info:

 

Description of problem:

    IDMS is set on HostedCluster and reflected in their respective CR in-cluster.  Customers can create, update, and delete these today.  In-cluster IDMS has no impact.

Version-Release number of selected component (if applicable):

    4.14+

How reproducible:

    100%

Steps to Reproduce:

    1. Create HCP
    2. Create IDMS
    3. Observe it does nothing
    

Actual results:

    IDMS doesn't change anything if manipulated in data plane

Expected results:

    IDMS either allows updates OR IDMS updates are blocked.

Additional info:

    

Please review the following PR: https://github.com/openshift/prometheus-operator/pull/304

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When the MachineConfig tab is opened on the console, the below error is displayed.

Oh no! Something went wrong
Type Error
Description:
Cannot read properties of undefined (reading 'toString')

Version-Release number of selected component (if applicable):

    OCP version 4.17.3

How reproducible:

    Every time at the customer's end.

Steps to Reproduce:

    1. Go on console.
    2. Under the Compute tab, go to the MachineConfig tab.
    
    

Actual results:

     Oh no! Something went wrong 

Expected results:

     Should be able to see all the available MachineConfigs.

Additional info:

    

Description of problem:

When Ingress configuration is specified for a HostedCluster in .spec.configuration.ingress, the configuration fails to make it into the hosted cluster because the VAP `ingress-config-validation.managed.openshift.io` prevents it.
    

Version-Release number of selected component (if applicable):

4.18 Hosted ROSA
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Create a hosted cluster in ROSA with 
spec:
  configuration:
     ingress:
       domain: ""
       loadBalancer:
         platform:
           aws:
             type: NLB
           type: AWS
    2. Wait for the cluster to come up
    3.
    

Actual results:

    Cluster never finishes applying the payload (reaches Complete) because the console operator fails to reconcile its route.
    

Expected results:

    Cluster finishes applying the payload and reaches Complete
    

Additional info:

The following error is reported in the hcco log:

{"level":"error","ts":"2024-11-12T17:33:09Z","msg":"Reconciler error","controller":"resources","object":{"name":""},"namespace":"","name":"","reconcileID":"f4216970-af97-4093-ae72-b7dbe452b767","error":"failed to reconcile global configuration: failed to reconcile ingress config: admission webhook \"ingress-config-validation.managed.openshift.io\" denied the request: Only privileged service accounts may access","errorCauses":[{"error":"failed to reconcile global configuration: failed to reconcile ingress config: admission webhook \"ingress-config-validation.managed.openshift.io\" denied the request: Only privileged service accounts may access"}],"stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:222"}


    

Description of problem:

   Feature: https://issues.redhat.com/browse/MGMT-18411
went into assisted-installer v2.34.0 but apparently is not included in any OpenShift version, so it cannot be used in ABI installation.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Went through the different commits to verify whether this is delivered in any OCP version.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem: https://github.com/openshift/installer/pull/7727 changed the order of some playbooks, and we're now expected to run the network.yaml playbook before the metadata.json file is created. This isn't a problem with newer versions of Ansible, which will happily ignore missing var_files; however, it is a problem with older Ansible versions, which fail with:

[cloud-user@installer-host ~]$ ansible-playbook -i "/home/cloud-user/ostest/inventory.yaml" "/home/cloud-user/ostest/network.yaml"

PLAY [localhost] *****************************************************************************************************************************************************************************************************************************
ERROR! vars file metadata.json was not found                                                                                       
Could not find file on the Ansible Controller.                                                                                      
If you are using a module and expect the file to exist on the remote, see the remote_src option

Description of problem:

When "Create NetworkAttachmentDefinition" button is clicked, the app switches to "Administrator" perspective

Version-Release number of selected component (if applicable):

 

How reproducible:

always

Steps to Reproduce:

1. Switch to "Virtualization" perspective
2. Navigate to Network -> NetworkAttachmentDefinitions
3. Click "Create NetworkAttachmentDefinition" button 

Actual results:

App switches to "Administrator" perspective

Expected results:

App stays in "Virtualization" perspective

Additional info:

 

Description of the problem:

 

FYI - OCP 4.12 has reached end of maintenance support; now it is on extended support.

 

Looks like OCP 4.12 installations started failing lately due to hosts not being discovered. For example: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_assisted-service/6628/pull-ci-openshift-assisted-service-master-edge-e2e-metal-assisted-4-12/1817416612257468416 

 

How reproducible:

 

Seems like every CI run, haven't tested locally

 

Steps to reproduce:

 

Trigger OCP 4.12 installation in the CI

 

Actual results:

 

failure, hosts not discovering

 

Expected results:

 

Successful cluster installation

Description of problem:

We were told that adding connections to a Transit Gateway also costs an exorbitant amount of money. So the create option tgName now means that we will not clean up the connections during cluster destroy.
    

Description of problem:

    We missed the window to merge the ART 4.17 image PR in time.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Fail to get ART PR merged in time

Steps to Reproduce:

    1. Have E2E Tests fail for a while.
    2. Go on vacation afterwards.
    

Actual results:

    I got asked about 4.17 OCP images.

Expected results:

    I don't get asked about 4.17 OCP images.

Additional info:

    

Description of problem:

We identified a regression where we can no longer get oauth tokens for HyperShift v4.16 clusters via the OpenShift web console. v4.16.10 works fine, but once clusters are patched to v4.16.16 (or are created at that version) they fail to get the oauth token. 

This is due to this faulty PR: https://github.com/openshift/hypershift/pull/4496.

The oauth openshift deployment was changed and affected the IBM Cloud code path.  We need this endpoint to change back to using `socks5`.

Bug:
<           value: socks5://127.0.0.1:8090
---
>           value: http://127.0.0.1:8092
98c98
<           value: socks5://127.0.0.1:8090
---
>           value: http://127.0.0.1:8092
Fix:
Change http://127.0.0.1:8092 to socks5://127.0.0.1:8090

 

 

Version-Release number of selected component (if applicable):

4.16.16

How reproducible:

Every time.

Steps to Reproduce:

    1. Create ROKS v4.16.16 HyperShift-based cluster.
    2. Navigate to the OpenShift web console.
    3. Click the IAM#<username> menu in the top right.
    4. Click 'Copy login command'.
    5. Click 'Display token'.
    

Actual results:

Error getting token: Post "https://example.com:31335/oauth/token": http: server gave HTTP response to HTTPS client    

Expected results:

The oauth token should be successfully displayed.

Additional info:

    

Description of problem:

Day 2 add-node with the oc binary is not working for ARM64 on the baremetal CI.

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

    1. Run a compact agent installation on the arm64 platform
    2. After the cluster is ready, run the day 2 install
    3. The day 2 install fails with an error: worker-a-00 is not reachable

Actual results:

    Day 2 install exits with an error.

Expected results:

    Day 2 install should work.

Additional info:

Job link: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/54181/rehearse-54181-periodic-ci-openshift-openshift-tests-private-release-4.17-arm64-nightly-baremetal-compact-abi-ipv4-static-day2-f7/1823641309190033408

Error message from console when running day2 install:
rsync: [sender] link_stat "/assets/node.x86_64.iso" failed: No such file or directory (2) command terminated with exit code 23 rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1823) [Receiver=3.2.3] rsync: [Receiver] write error: Broken pipe (32) error: exit status 23 {"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:84","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun","level":"error","msg":"Error executing test process","severity":"error","time":"2024-08-13T14:32:20Z"} error: failed to execute wrapped command: exit status 1    

Description of problem:

When adding nodes, the agent-register-cluster.service and start-cluster-installation.service status should not be checked; instead, agent-import-cluster.service and agent-add-node.service should be checked.
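
As a hedged sketch (unit names taken from the description above), manually checking the correct units on the node during an add-nodes run would look like:

~~~
# Inspect the add-nodes workflow units rather than the install-time ones.
systemctl --no-pager status agent-import-cluster.service agent-add-node.service
journalctl --no-pager -u agent-import-cluster.service -u agent-add-node.service | tail -n 50
~~~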

Version-Release number of selected component (if applicable):

4.17

How reproducible:

always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    The console message shows that the start-cluster-installation and agent-register-cluster services have not started.

Expected results:

    The console message shows that the agent-import-cluster and agent-add-node services have started.

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/116

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

User Story:

As an SRE managing hypershift clusters, I want to

  • be alerted when a hosted cluster's etcd needs manual intervention

so that I can achieve

  • only intervening when needed

When the openshift-install agent wait-for bootstrap-complete command cannot connect to either the k8s API or the assisted-service API, it tries to ssh to the rendezvous host to see if it is up.

If there is a running ssh-agent on the local host, we connect to it to make use of its private keys. This is not guaranteed to work, as the private key corresponding to the public key in the agent ISO may not be present on the box.

If there is no running ssh-agent, we use the literal public key as the path to a file that we expect to contain the private key. This is guaranteed not to work.

All of this generates a lot of error messages at DEBUG level that are confusing to users.

If we did succeed in ssh-ing to the host when it has already joined the cluster, the node would end up tainted as a result, which we want to avoid. (This is unlikely in practice though, because by the time the rendezvous host joins, the k8s API should be up so we wouldn't normally run this code at that time.)

We should stop doing all of this, and maybe just ping the rendezvous host to see if it is up.

https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/2781
https://kubernetes.slack.com/archives/CKFGK3SSD/p1704729665056699
https://github.com/okd-project/okd/discussions/1993#discussioncomment-10385535

Description of problem:

INFO Waiting up to 15m0s (until 2:23PM UTC) for machines [vsphere-ipi-b8gwp-bootstrap vsphere-ipi-b8gwp-master-0 vsphere-ipi-b8gwp-master-1 vsphere-ipi-b8gwp-master-2] to provision...
E0819 14:17:33.676051    2162 session.go:265] "Failed to keep alive govmomi client, Clearing the session now" err="Post \"https://vctest.ars.de/sdk\": context canceled" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local"
E0819 14:17:33.708233    2162 session.go:295] "Failed to keep alive REST client" err="Post \"https://vctest.ars.de/rest/com/vmware/cis/session?~action=get\": context canceled" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local"
I0819 14:17:33.708279    2162 session.go:298] "REST client session expired, clearing session" server="vctest.ars.de" datacenter="" username="administrator@vsphere.local"

Description of problem:

Even though fakefish is not a supported redfish interface, it is very useful to have it working for "special" scenarios, like NC-SI, while its support is implemented.

On OCP 4.14 and later, converged flow is enabled by default, and on this configuration Ironic sends a soft power_off command to the ironic agent running on the ramdisk. Since this power operation is not going through the redfish interface, it is not processed by fakefish, preventing it from working on some NC-SI configurations, where a full power-off would mean the BMC loses power.

Ironic already supports using out-of-band power off for the agent [1], so having an option to use it would be very helpful.

[1]- https://opendev.org/openstack/ironic/commit/824ad1676bd8032fb4a4eb8ffc7625a376a64371

Version-Release number of selected component (if applicable):

Seen with OCP 4.14.26 and 4.14.33, expected to happen on later versions    

How reproducible:

Always

Steps to Reproduce:

    1. Deploy SNO node using ACM and fakefish as redfish interface
    2. Check metal3-ironic pod logs    

Actual results:

We can see a soft power_off command sent to the ironic agent running on the ramdisk:

2024-08-07 15:00:45.545 1 DEBUG ironic.drivers.modules.agent_client [None req-74c0c3ed-011f-4718-bdce-53f2ba412e85 - - - - - -] Executing agent command standby.power_off for node df006e90-02ee-4847-b532-be4838e844e6 with params {'wait': 'false', 'agent_token': '***'} _command /usr/lib/python3.9/site-packages/ironic/drivers/modules/agent_client.py:197
2024-08-07 15:00:45.551 1 DEBUG ironic.drivers.modules.agent_client [None req-74c0c3ed-011f-4718-bdce-53f2ba412e85 - - - - - -] Agent command standby.power_off for node df006e90-02ee-4847-b532-be4838e844e6 returned result None, error None, HTTP status code 200 _command /usr/lib/python3.9/site-packages/ironic/drivers/modules/agent_client.py:234

Expected results:

There is an option to prevent this soft power_off command, so all power actions happen via redfish. This would allow fakefish to capture them and behave as needed.

Additional info:

    

Looks relatively new in serial jobs on aws and vsphere. First occurrence I see is Wednesday at around 5am. It's not every run but it is quite common. (10-20% of the time)

Caught by test: Undiagnosed panic detected in pod

Undiagnosed panic detected in pod expand_less 	0s
{  pods/openshift-ovn-kubernetes_ovnkube-control-plane-558bfbcf78-nfbnw_ovnkube-cluster-manager_previous.log.gz:E1106 08:04:15.797587       1 panic.go:261] "Observed a panic" panic="runtime error: invalid memory address or nil pointer dereference" panicGoValue="\"invalid memory address or nil pointer dereference\"" stacktrace=<}

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial/1854031870619029504

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-upi-serial/1854427071271407616

See component readiness for more runs:

https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&LayeredProduct=none&Network=ovn&Network=ovn&NetworkAccess=default&Platform=aws&Platform=aws&Procedure=none&Scheduler=default&SecurityMode=default&Suite=serial&Suite=serial&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2024-10-01%2023%3A59%3A59&baseRelease=4.17&baseStartTime=2024-09-01%2000%3A00%3A00&capability=Other&columnGroupBy=Architecture%2CNetwork%2CPlatform%2CTopology&component=Test%20Framework&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20aws%20serial%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=CGroupMode%3Av2&includeVariant=ContainerRuntime%3Arunc&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Network%3Aovn&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&includeVariant=Topology%3Amicroshift&minFail=3&passRateAllTests=0&passRateNewTests=95&pity=5&sampleEndTime=2024-11-08%2023%3A59%3A59&sampleRelease=4.18&sampleStartTime=2024-11-01%2000%3A00%3A00&testId=Symptom%20Detection%3A171acaa74f3d5ea96e3b687038d0cf13&testName=Undiagnosed%20panic%20detected%20in%20pod

https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=upi&Installer=upi&LayeredProduct=none&Network=ovn&Network=ovn&NetworkAccess=default&Platform=vsphere&Platform=vsphere&Procedure=none&Scheduler=default&SecurityMode=default&Suite=serial&Suite=serial&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2024-10-01%2023%3A59%3A59&baseRelease=4.17&baseStartTime=2024-09-01%2000%3A00%3A00&capability=Other&columnGroupBy=Architecture%2CNetwork%2CPlatform%2CTopology&component=Test%20Framework&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20upi%20ovn%20vsphere%20serial%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=CGroupMode%3Av2&includeVariant=ContainerRuntime%3Arunc&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Network%3Aovn&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&includeVariant=Topology%3Amicroshift&minFail=3&passRateAllTests=0&passRateNewTests=95&pity=5&sampleEndTime=2024-11-08%2023%3A59%3A59&sampleRelease=4.18&sampleStartTime=2024-11-01%2000%3A00%3A00&testId=Symptom%20Detection%3A171acaa74f3d5ea96e3b687038d0cf13&testName=Undiagnosed%20panic%20detected%20in%20pod

Please review the following PR: https://github.com/openshift/csi-operator/pull/81

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:


If an invalid mac address is used in the interfaces table in agent-config.yaml, like this
 
{noformat}
   - name: eno2
       macAddress: 98-BE-94-3F-48-42
{noformat}

it results in a failure to register the InfraEnv with assisted-service and constant retries

{noformat}
Aug 28 15:23:37 master0 agent-register-infraenv[4606]: time="2024-08-28T19:23:37Z" level=info msg="Registering infraenv"
Aug 28 15:23:37 master0 podman[4572]: time="2024-08-28T19:23:37Z" level=info msg="Reference to cluster id: 1f38e4c9-afde-4ac0-aa32-aabc75ec088a"
Aug 28 15:23:37 master0 podman[4572]: time="2024-08-28T19:23:37Z" level=info msg="Registering infraenv"
Aug 28 15:23:37 master0 agent-register-infraenv[4606]: time="2024-08-28T19:23:37Z" level=info msg="Added 1 nmstateconfigs"
Aug 28 15:23:37 master0 podman[4572]: time="2024-08-28T19:23:37Z" level=info msg="Added 1 nmstateconfigs"
Aug 28 15:23:37 master0 agent-register-infraenv[4606]: time="2024-08-28T19:23:37Z" level=fatal msg="Failed to register infraenv with assisted-service: response status code does not match any response statuses defined for this endpoint in the swagger spec (status 422): {}"
Aug 28 15:23:37 master0 podman[4572]: time="2024-08-28T19:23:37Z" level=fatal msg="Failed to register infraenv with assisted-service: response status code does not match any response statuses defined for this endpoint in the swagger spec (status 422): {}"
{noformat}

The error above was in 4.15. In 4.18 I can reproduce it and it's only marginally better: there is slightly more info due to an assisted-service change, but the same net result of continually retrying "Registering infraenv".
{noformat}
Sep 11 20:57:26 master-0 agent-register-infraenv[3013]: time="2024-09-11T20:57:26Z" level=fatal msg="Failed to register infraenv with assisted-service: json: cannot unmarshal number into Go struct field Error.code of type string"
Sep 11 20:57:26 master-0 podman[2987]: time="2024-09-11T20:57:26Z" level=fatal msg="Failed to register infraenv with assisted-service: json: cannot unmarshal number into Go struct field Error.code of type string"
{noformat}

    

Version-Release number of selected component (if applicable):

 Occurs both in the latest 4.18 and in 4.15.26

    

How reproducible:


    

Steps to Reproduce:

    1. Use an invalid mac address in the interface table like this

{noformat}
      interfaces:
        - name: eth0
          macAddress: 00:59:bd:23:23:8c
        - name: eno12399np0
          macAddress: 98-BE-94-3F-51-33
      networkConfig:
        interfaces:
          - name: eno12399np0
            type: ethernet
            state: up
            ipv4:
              enabled: false
              dhcp: false
            ipv6:
              enabled: false
              dhcp: false
          - name: eth0
            type: ethernet
            state: up
            mac-address: 00:59:bd:23:23:8c
            ipv4:
              enabled: true
              address:
                - ip: 192.168.111.80
                  prefix-length: 24
              dhcp: false

{noformat}

    2. Generate the agent ISO
    3. Install using the agent ISO, I just did an SNO installation.
    

Actual results:

Install fails with the errors:

{noformat}
level=debug msg=infraenv is not registered in rest API
level=debug msg=infraenv is not registered in rest API
level=debug msg=infraenv is not registered in rest API
level=debug msg=infraenv is not registered in rest API
level=debug msg=infraenv is not registered in rest API
level=debug msg=infraenv is not registered in rest API
level=debug msg=infraenv is not registered in rest API
level=debug msg=infraenv is not registered in rest API
{noformat}

    

Expected results:

The invalid mac address should be detected when creating the ISO image so it can be fixed.
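A minimal sketch of the kind of up-front check the ISO generation could run over the interfaces table, assuming the downstream components only accept colon-separated MAC addresses (the hyphen-separated form above is what currently slips through); the helper name is hypothetical.

{noformat}
package main

import (
	"fmt"
	"regexp"
)

// colonMAC matches the colon-separated form expected downstream,
// e.g. 00:59:bd:23:23:8c.
var colonMAC = regexp.MustCompile(`^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$`)

// validateMACAddress rejects MAC addresses that are not colon-separated so the
// problem surfaces at ISO creation time instead of at InfraEnv registration.
func validateMACAddress(mac string) error {
	if !colonMAC.MatchString(mac) {
		return fmt.Errorf("invalid macAddress %q: expected colon-separated form like 00:59:bd:23:23:8c", mac)
	}
	return nil
}

func main() {
	fmt.Println(validateMACAddress("98-BE-94-3F-48-42")) // error
	fmt.Println(validateMACAddress("00:59:bd:23:23:8c")) // <nil>
}
{noformat}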

    

Additional info:


    

The following test is failing:

[sig-api-machinery] ValidatingAdmissionPolicy [Privileged:ClusterAdmin] should type check a CRD [Suite:openshift/conformance/parallel] [Suite:k8s]

Additional context here:

https://sippy.dptools.openshift.org/sippy-ng/tests/4.18/analysis?test=%5Bsig-api-machinery%5D%20ValidatingAdmissionPolicy%20%5BPrivileged%3AClusterAdmin%5D%20should%20type%20check%20a%20CRD%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D%20%5BSuite%3Ak8s%5D&filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22%5Bsig-api-machinery%5D%20ValidatingAdmissionPolicy%20%5BPrivileged%3AClusterAdmin%5D%20should%20type%20check%20a%20CRD%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D%20%5BSuite%3Ak8s%5D%22%7D%2C%7B%22columnField%22%3A%22variants%22%2C%22not%22%3Atrue%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%22never-stable%22%7D%2C%7B%22columnField%22%3A%22variants%22%2C%22not%22%3Atrue%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%22aggregated%22%7D%5D%2C%22linkOperator%22%3A%22and%22%7D

This was a problem back in 4.16 when the test had Beta in the name. https://issues.redhat.com/browse/OCPBUGS-30767

But the test continues to be quite flaky and we just got unlucky and failed a payload on it.

The failure always seems to be:

{  fail [k8s.io/kubernetes/test/e2e/apimachinery/validatingadmissionpolicy.go:380]: wait for type checking: PatchOptions.meta.k8s.io "" is invalid: fieldManager: Required value: is required for apply patch
Error: exit with code 1
Ginkgo exit error 1: exit with code 1}

See: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.18-ocp-e2e-gcp-ovn-multi-x-ax/1828347537950511104

It often works on a re-try. (flakes)

Something is not quite right either with this test or the product.
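For context, the error above is the standard server-side-apply rejection when an apply patch is sent without a field manager. A minimal client-go sketch (not the test's actual code) of a correct call:

{noformat}
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// applyConfigMap sends a server-side-apply patch. PatchOptions.FieldManager
// must be set; omitting it yields
// "fieldManager: Required value: is required for apply patch".
func applyConfigMap(ctx context.Context, cs kubernetes.Interface) error {
	patch := []byte(`{"apiVersion":"v1","kind":"ConfigMap","metadata":{"name":"example"},"data":{"k":"v"}}`)
	_, err := cs.CoreV1().ConfigMaps("default").Patch(
		ctx, "example", types.ApplyPatchType, patch,
		metav1.PatchOptions{FieldManager: "e2e-type-checker"},
	)
	return err
}
{noformat}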

Description of problem:

cluster-capi-operator is running its controllers on AzureStackCloud, but it shouldn't, because CAPI is not supported on AzureStackCloud.
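A minimal sketch of the kind of guard the operator could apply before starting its controllers, assuming it reads the cluster Infrastructure status; the helper name and wiring are hypothetical.

{noformat}
package example

import (
	configv1 "github.com/openshift/api/config/v1"
)

// shouldRunCAPIControllers returns false on platforms where CAPI is not
// supported, such as Azure Stack Hub (AzureStackCloud).
func shouldRunCAPIControllers(infra *configv1.Infrastructure) bool {
	ps := infra.Status.PlatformStatus
	if ps == nil {
		return false
	}
	if ps.Type == configv1.AzurePlatformType && ps.Azure != nil &&
		ps.Azure.CloudName == configv1.AzureStackCloud {
		return false
	}
	return true
}
{noformat}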

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of the problem:

When removing a spoke BMH resource from the hub cluster, the corresponding node is shut down. Previously, the BMH was just removed and the node wasn't affected in any way. This seems to be due to new behavior in the BMH finalizer that removes the paused annotation from the BMH.

How reproducible:

100%

Steps to reproduce:

1. Install a spoke cluster

2. Remove one of the spoke cluster BMHs from the hub cluster

Actual results:

The corresponding node is shut down

Expected results:

The corresponding node is not shut down

Description of problem:

Same as admin console bug OCPBUGS-31931, but on the developer console. On a 4.15.17 cluster, the kubeadmin user goes to the developer console UI, clicks "Observe", selects one project (for example openshift-monitoring), selects the Silences tab, and clicks "Create silence". The Creator field is not auto-filled with the user name. Add a label name/value and a Comment to create the silence.

The following error is shown on the page:

An error occurred
createdBy in body is required 

see picture: https://drive.google.com/file/d/1PR64hvpYCC-WOHT1ID9A4jX91LdGG62Y/view?usp=sharing

this issue exists in 4.15/4.16/4.17/4.18, no issue with 4.14
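For reference, the silence object the console posts to Alertmanager requires a non-empty createdBy, which is why the Creator field must be pre-filled. A minimal sketch of the payload shape (a Go struct used purely for illustration, not the console's code):

{noformat}
package example

import "time"

// Matcher and Silence mirror the Alertmanager v2 silence payload. If CreatedBy
// is empty, Alertmanager rejects the request with
// "createdBy in body is required".
type Matcher struct {
	Name    string `json:"name"`
	Value   string `json:"value"`
	IsRegex bool   `json:"isRegex"`
}

type Silence struct {
	Matchers  []Matcher `json:"matchers"`
	StartsAt  time.Time `json:"startsAt"`
	EndsAt    time.Time `json:"endsAt"`
	CreatedBy string    `json:"createdBy"` // should be pre-filled with the logged-in user name
	Comment   string    `json:"comment"`
}
{noformat}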

Version-Release number of selected component (if applicable):

4.15.17

How reproducible:

Always

Steps to Reproduce:

see the description

Actual results:

The Creator field is not auto-filled with the user name

Expected results:

no error

Additional info:

    

Description of problem:

    Filter dropdown doesn't collapse on second click

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-10-21-132049

How reproducible:

    Always

Steps to Reproduce:

    1. Navigate to Workloads -> Pod page
    2. Click the 'Filter' dropdown component
    3. Click the 'Filter' dropdown again
    

Actual results:

    Compared with OCP 4.17, where the dropdown list collapses after the second click, on OCP 4.18 the dropdown list does not collapse

Expected results:

    the dropdown should collapse after the second click

Additional info:

    

Description of problem:

    We should add validation in the Installer when public-only subnets are enabled to make sure that we (see the sketch after this list):

	1. Print a warning if OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY is set.
	2. Since this flag is only applicable to public clusters, consider exiting early if publish: Internal.
	3. Since this flag is only applicable to BYO-VPC configurations, consider exiting early if no subnets are provided in install-config.
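A minimal, self-contained sketch of these checks; the helper and parameter names are hypothetical, and this is not the installer's actual validation code.

{noformat}
package example

import (
	"errors"
	"fmt"
	"os"
)

// validatePublicOnly performs the three checks listed above. publish and
// subnets would come from the parsed install-config.
func validatePublicOnly(publish string, subnets []string) error {
	if os.Getenv("OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY") == "" {
		return nil // flag not set, nothing to validate
	}
	fmt.Println("WARNING: OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY is set (internal-only, unsupported option)")
	if publish == "Internal" {
		return errors.New("public-only subnets are only applicable to public clusters (publish: External)")
	}
	if len(subnets) == 0 {
		return errors.New("public-only subnets require a BYO VPC: provide subnets in install-config")
	}
	return nil
}
{noformat}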

Version-Release number of selected component (if applicable):

    all versions that support public-only subnets

How reproducible:

    always

Steps to Reproduce:

    1. Set OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY
    2. Do a cluster install without specifying a VPC.
    3.
    

Actual results:

    No warning about the invalid configuration.

Expected results:

    

Additional info:

    This is an internal-only feature, so these validations shouldn't affect the normal path used by customers.

Description of problem:

Create an image pull secret with whitespace at the beginning/end of the username and password, then decode the auth in the '.dockerconfigjson' of the secret: it still contains whitespace in the password.
    

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-07-29-134911
    

How reproducible:

Always
    

Steps to Reproduce:

    1.Create image pull secret with whitespace in the beginning/end of username and password, eg: '  testuser  ','  testpassword  '
    2.Check on the secret details page, reveal values of ".dockerconfigjson", decode the value of 'auth'.
    3.
    

Actual results:

1. Secret is created.
2. There is no whitespace in the displayed username and password values, but the decoded result of 'auth' contains whitespace in the password.
$ echo 'dGVzdHVzZXI6ICB0ZXN0cGFzc3dvcmQgIA==' | base64 -d
testuser:  testpassword  
    

Expected results:

1. The decoded auth should not contain whitespace in the password, e.g.:
testuser:testpassword
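A minimal sketch of how the auth entry could be built, trimming whitespace before encoding; this is illustrative only and not the console's actual implementation.

{noformat}
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// dockerAuth builds the base64 "auth" value for .dockerconfigjson, trimming
// surrounding whitespace so "  testuser  " / "  testpassword  " encode as
// "testuser:testpassword".
func dockerAuth(username, password string) string {
	u := strings.TrimSpace(username)
	p := strings.TrimSpace(password)
	return base64.StdEncoding.EncodeToString([]byte(u + ":" + p))
}

func main() {
	auth := dockerAuth("  testuser  ", "  testpassword  ")
	decoded, _ := base64.StdEncoding.DecodeString(auth)
	fmt.Println(string(decoded)) // testuser:testpassword
}
{noformat}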
    

Additional info:


    

Description of problem:

The rails example "rails-postgresql-example" no longer runs successfully, because it references a version of ruby that is not available in the library.

This is blocking the release of Samples Operator because we check the validity of the templates shipped with the operator.

The Rails sample is no longer supported by the Samples Operator but is still shipped in an old version, i.e. we just continue shipping the same old version of the sample across releases. This old version references a ruby version that is no longer present in the openshift library.

There are a couple of ways of solving this problem:

1. Start supporting the Rails sample again in Samples Operator (the Rails examples seem to be maintained and made also available through helm-charts).

2. Remove the test that makes sure rails example is buildable to let the test suite pass. We don't support rails anymore in the Samples Operator so this should not be too surprising.

3. Remove rails from the Samples Operator altogether. This is probably the cleanest solution but most likely requires more work than just removing the sample from the assets of Samples Operator (removing the failing test is the most obvious thing that would break, too).

We need to decide ASAP how to proceed to unblock the release of Samples Operator for OCP 4.17.

Version-Release number of selected component (if applicable):

    

How reproducible:

The Samples Operator testsuite runs these tests and results in a failure like this:

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-samples-operator/567/pull-ci-openshift-cluster-samples-operator-master-e2e-aws-ovn-image-ecosystem/1829111792509390848    

Steps to Reproduce:

 

Actual results:

    

Expected results:

    

Additional info:

The test in question fails here: https://github.com/openshift/origin/blob/master/test/extended/image_ecosystem/s2i_ruby.go#L59

The line in the test output that stands out:

 I0829 13:02:24.241018 3111 dump.go:53] At 2024-08-29 13:00:21 +0000 UTC - event for rails-postgresql-example: {buildconfig-controller } BuildConfigInstantiateFailed: error instantiating Build from BuildConfig e2e-test-s2i-ruby-q75fj/rails-postgresql-example (0): Error resolving ImageStreamTag ruby:3.0-ubi8 in namespace openshift: unable to find latest tagged image

Please review the following PR: https://github.com/openshift/csi-operator/pull/271

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-bootstrap/pull/107

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When configuring the OpenShift image registry to use a custom Azure storage account in a different resource group, following the official documentation [1], the image-registry CO degrades and the upgrade from version 4.14.x to 4.15.x fails. The image registry operator reports misconfiguration errors related to Azure storage credentials, preventing the upgrade and causing instability in the control plane.

[1] Configuring registry storage in Azure user infrastructure

Version-Release number of selected component (if applicable):

   4.14.33, 4.15.33

How reproducible:

  1. Set up ARO:
    • Deploy an ARO or OpenShift cluster on Azure, version 4.14.x.
  2. Configure Image Registry:
    • Follow the official documentation [1] to configure the image registry to use a custom Azure storage account located in a different resource group.
    • Ensure that the image-registry-private-configuration-user secret is created in the openshift-image-registry namespace.
    • Do not modify the installer-cloud-credentials secret.
  3. Check the image registry CO status.
  4. Initiate Upgrade:
    • Attempt to upgrade the cluster to OpenShift version 4.15.x.

Steps to Reproduce:

  1. If we have the image-registry-private-configuration-user secret in place and the installer-cloud-credentials secret unmodified,

we get the error:

    NodeCADaemonProgressing: The daemon set node-ca is deployed Progressing: Unable to apply resources: unable to sync storage configuration: client misconfigured, missing 'TenantID', 'ClientID', 'ClientSecret', 'FederatedTokenFile', 'Creds', 'SubscriptionID' option(s) 

The operator will also generate a new secret, image-registry-private-configuration, with the same content as image-registry-private-configuration-user.

$ oc get secret  image-registry-private-configuration -o yaml
apiVersion: v1
data:
  REGISTRY_STORAGE_AZURE_ACCOUNTKEY: xxxxxxxxxxxxxxxxx
kind: Secret
metadata:
  annotations:
    imageregistry.operator.openshift.io/checksum: sha256:524fab8dd71302f1a9ade9b152b3f9576edb2b670752e1bae1cb49b4de992eee
  creationTimestamp: "2024-09-26T19:52:17Z"
  name: image-registry-private-configuration
  namespace: openshift-image-registry
  resourceVersion: "126426"
  uid: e2064353-2511-4666-bd43-29dd020573fe
type: Opaque 

 

2. Then we delete the secret image-registry-private-configuration-user.

Now the secret image-registry-private-configuration still exists with the same content, but the image-registry CO reports a new error:

 

NodeCADaemonProgressing: The daemon set node-ca is deployed Progressing: Unable to apply resources: unable to sync storage configuration: failed to get keys for the storage account arojudesa: storage.AccountsClient#ListKeys: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource 'Microsoft.Storage/storageAccounts/arojudesa' under resource group 'aro-ufjvmbl1' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix" 

3. Apply the workaround of manually changing the azure_resourcegroup key in the installer-cloud-credentials secret to the custom storage account's resource group.

$ oc get secret installer-cloud-credentials -o yaml
apiVersion: v1
data:
  azure_client_id: xxxxxxxxxxxxxxxxx
  azure_client_secret: xxxxxxxxxxxxxxxxx
  azure_region: xxxxxxxxxxxxxxxxx
  azure_resource_prefix: xxxxxxxxxxxxxxxxx
  azure_resourcegroup: xxxxxxxxxxxxxxxxx <<<<<-----THIS
  azure_subscription_id: xxxxxxxxxxxxxxxxx
  azure_tenant_id: xxxxxxxxxxxxxxxxx
kind: Secret
metadata:
  annotations:
    cloudcredential.openshift.io/credentials-request: openshift-cloud-credential-operator/openshift-image-registry-azure
  creationTimestamp: "2024-09-26T16:49:57Z"
  labels:
    cloudcredential.openshift.io/credentials-request: "true"
  name: installer-cloud-credentials
  namespace: openshift-image-registry
  resourceVersion: "133921"
  uid: d1268e2c-1825-49f0-aa44-d0e1cbcda383
type: Opaque 

 

The image-registry CO then reports healthy, which allows the upgrade to continue.

 

Actual results:

    The image registry still seems to use the service principal method for Azure storage account authentication.

Expected results:

    We expect REGISTRY_STORAGE_AZURE_ACCOUNTKEY to be the only thing the image registry operator needs for storage account authentication when the customer provides it.
  • The image registry continues to function using the custom Azure storage account in the different resource group.

Additional info:

  • Reproducibility: The issue is consistently reproducible by following the official documentation to configure the image registry with a custom storage account in a different resource group and then attempting an upgrade.
  • Related Issues:
    • Similar problems have been reported in previous incidents, suggesting a systemic issue with the image registry operator's handling of Azure storage credentials.
  • Critical Customer Impact: Customers are required to perform manual interventions after every upgrade for each cluster, which is not sustainable and leads to operational overhead.

 

Slack : https://redhat-internal.slack.com/archives/CCV9YF9PD/p1727379313014789

Description of problem:

    The installer for IBM Cloud currently only checks the first group of subnets (50) when searching for Subnet details by name. It should provide pagination support to search all subnets.
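A minimal sketch of the expected lookup behavior, assuming a hypothetical page-fetching helper that wraps the IBM Cloud VPC list-subnets call; the real SDK types and pagination tokens differ.

{noformat}
package example

// subnetPage is a hypothetical page of results plus the token for the next page.
type subnetPage struct {
	Names []string
	Next  string // empty when there are no more pages
}

// findSubnetByName walks every page instead of looking only at the first 50
// results returned by the API.
func findSubnetByName(name string, fetchPage func(start string) (subnetPage, error)) (bool, error) {
	start := ""
	for {
		page, err := fetchPage(start)
		if err != nil {
			return false, err
		}
		for _, n := range page.Names {
			if n == name {
				return true, nil
			}
		}
		if page.Next == "" {
			return false, nil
		}
		start = page.Next
	}
}
{noformat}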

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    100%, though dependent on the order of subnets returned by the IBM Cloud APIs

Steps to Reproduce:

    1. Create 50+ IBM Cloud VPC Subnets
    2. Use Bring Your Own Network (BYON) configuration (with Subnet names for CP and/or Compute) in install-config.yaml
    3. Attempt to create manifests (openshift-install create manifests)
    

Actual results:

    ERROR failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: [platform.ibmcloud.controlPlaneSubnets: Not found: "eu-de-subnet-paginate-1-cp-eu-de-1", platform.ibmcloud.controlPlaneSubnets: Not found: "eu-de-subnet-paginate-1-cp-eu-de-2", platform.ibmcloud.controlPlaneSubnets: Not found: "eu-de-subnet-paginate-1-cp-eu-de-3", platform.ibmcloud.controlPlaneSubnets: Invalid value: []string{"eu-de-subnet-paginate-1-cp-eu-de-1", "eu-de-subnet-paginate-1-cp-eu-de-2", "eu-de-subnet-paginate-1-cp-eu-de-3"}: number of zones (0) covered by controlPlaneSubnets does not match number of provided or default zones (3) for control plane in eu-de, platform.ibmcloud.computeSubnets: Not found: "eu-de-subnet-paginate-1-compute-eu-de-1", platform.ibmcloud.computeSubnets: Not found: "eu-de-subnet-paginate-1-compute-eu-de-2", platform.ibmcloud.computeSubnets: Not found: "eu-de-subnet-paginate-1-compute-eu-de-3", platform.ibmcloud.computeSubnets: Invalid value: []string{"eu-de-subnet-paginate-1-compute-eu-de-1", "eu-de-subnet-paginate-1-compute-eu-de-2", "eu-de-subnet-paginate-1-compute-eu-de-3"}: number of zones (0) covered by computeSubnets does not match number of provided or default zones (3) for compute[0] in eu-de]

Expected results:

    Successful manifests and cluster creation

Additional info:

    IBM Cloud is working on a fix

Please review the following PR: https://github.com/openshift/csi-operator/pull/243

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver/pull/71

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem

Recently, the sos package was added to the tools image used when invoking oc debug node/<some-node> (details in z).

However, the change just added the sos package without taking into account the other conditions required for sos report to work inside a container.

For reference, the toolbox container has to be launched as follows for sos report to work properly (the command output tells you the template of the right podman run command):

$ podman inspect registry.redhat.io/rhel9/support-tools | jq -r '.[0].Config.Labels.run' 
podman run -it --name NAME --privileged --ipc=host --net=host --pid=host -e HOST=/host -e NAME=NAME -e IMAGE=IMAGE -v /run:/run -v /var/log:/var/log -v /etc/machine-id:/etc/machine-id -v /etc/localtime:/etc/localtime -v /:/host IMAGE

The most crucial thing is the HOST=/host environment variable, which makes sos report find the real root of the machine in /host, but the other ones are also required.

So if we are to support sos report in the tools image, the debug node container defaults should be changed so that the container runs with the same settings as in the reference podman run command indicated above.

Version-Release number of selected component (if applicable)

4.16 only

How reproducible

Always

Steps to Reproduce

Start a debug node container (oc debug node/<node>) and try to gather sos report (without chroot /host + toolbox, just from debug container).

Actual results

  • The debug container doesn't have the right environment for sos report.
  • Sos report runs but generates a wrong sos report with limited and meaningless information about the debug container itself.

Expected results:

  • oc debug node/<node> to spawn a debug pod with the right environment for sos report to run as correctly as it would do in toolbox.
  • Sos report to work as expected in debug pod.

Additional info

(none)

Description of the problem:

 When trying to add a node on day 2 using assisted-installer, the node reports the disk as not eligible as an installation disk:

Thread: https://redhat-external.slack.com/archives/C05N3PY1XPH/p1731575515647969
Possible issue: https://github.com/openshift/assisted-service/blob/master/internal/hardware/validator.go#L117-L120 => the openshift version is not filled on day2

How reproducible:

 

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

When verifying the ACM Alerting UI feature, following the doc, we face the issue 'silence alert action link has bad format compared to CMO's same action'.

Description of problem:

Camel K provides a list of Kamelets that are able to act as an event source or sink for a Knative eventing message broker.

Usually the list of Kamelets installed with the Camel K operator is displayed in the Developer Catalog list of available event sources with the provider "Apache Software Foundation" or "Red Hat Integration".

When a user adds a custom Kamelet custom resource to the user namespace the list of default Kamelets coming from the Camel K operator is gone. The Developer Catalog event source list then only displays the custom Kamelet but not the default ones.

Version-Release number of selected component (if applicable):

    

How reproducible:

Apply a custom Kamelet custom resource to the user namespace and open the list of available event sources in Dev Console Developer Catalog.

Steps to Reproduce:

    1. install global Camel K operator in operator namespace (e.g. openshift-operators)
    2. list all available event sources in "default" user namespace and see all Kamelets listed as event sources/sinks
    3. add a custom Kamelet custom resource to the default namespace
    4. see the list of available event sources only listing the custom Kamelet and the default Kamelets are gone from that list
    

Actual results:

Default Kamelets that act as event source/sink are only displayed in the Developer Catalog when there is no custom Kamelet added to a namespace.    

Expected results:

Default Kamelets coming with the Camel K operator (installed in the operator namespace) should always be part of the Developer Catalog list of available event sources/sinks. When the user adds more custom Kamelets these should be listed, too.   

Additional info:

Reproduced with Camel K operator 2.2 and OCP 4.14.8

screenshots: https://drive.google.com/drive/folders/1mTpr1IrASMT76mWjnOGuexFr9-mP0y3i?usp=drive_link

 

Please review the following PR: https://github.com/openshift/cluster-api-provider-gcp/pull/231

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.

Component name: ose-gcp-cluster-api-controllers-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml

Compile errors when building an ironic image look like this:

2024-08-14 09:07:21 + python3 -m compileall --invalidation-mode=timestamp /usr
2024-08-14 09:07:21 Listing '/usr'...
2024-08-14 09:07:21 Listing '/usr/bin'...
...
Listing '/usr/share/zsh/site-functions'...
Listing '/usr/src'...
Listing '/usr/src/debug'...
Listing '/usr/src/kernels'...
Error: building at STEP "RUN prepare-image.sh && rm -f /bin/prepare-image.sh && /bin/prepare-ipxe.sh && rm -f /tmp/prepare-ipxe.sh": while running runtime: exit status 1

With the actual error lost in 3000+ lines of output, we should suppress the file listings (for example, by passing compileall's -q flag so that only errors are printed).

Description of problem:

I see that when one release is declared in the ImageSetConfig.yaml everything works well with respect to creating the release signature configmap, but when more than one release is added to the ImageSetConfig.yaml, the binaryData content in the signature configmap is duplicated and there are more releases present in the signatures directory than specified. See below.

ImageSetConfig.yaml:
=================
[fedora@preserve-fedora-yinzhou test]$ cat /tmp/clid-232.yaml 
apiVersion: mirror.openshift.io/v2alpha1
kind: ImageSetConfiguration
mirror:
  platform:
    channels:
    - name: stable-4.16
      minVersion: 4.16.0
      maxVersion: 4.16.0
    - name: stable-4.15
      minVersion: 4.15.0
      maxVersion: 4.15.0

Content in Signatures directory:
=========================
[fedora@preserve-fedora-yinzhou test]$ ls -l CLID-232/working-dir/signatures/
total 12
-rw-r--r--. 1 fedora fedora 896 Sep 25 11:27 4.15.0-x86_64-sha256-0da6316466d60a3a4535d5fed3589feb0391989982fba59d47d4c729912d6363
-rw-r--r--. 1 fedora fedora 897 Sep 25 11:27 4.15.31-x86_64-sha256-c03bbdd63fa8832266a2cf0d9fbcd2867692d9ba7e09d31bc77d15dd9903e36f
-rw-r--r--. 1 fedora fedora 899 Sep 25 11:27 4.16.0-x86_64-sha256-3717338045df06e31effea46761b2c7e90f543cc4f00547af8158dd6aea868c3

Content in Signature Configmap:
==========================
apiVersion: v1
binaryData:
  sha256-0da6316466d60a3a4535d5fed3589feb0391989982fba59d47d4c729912d6363-2: owGbwMvMwMEoOU9/4l9n2UDGtYzJSWLxRQW5xZnpukWphboVxQEWxbl6SZl5qXe29lQrJRdllmQmJ+YoWSlUK2XmJqanglkp+cnZqUW6uYl5mWmpxSW6KZnpQAoopVSckWhkamZlkJJoZmxoZmJmlmJmkGicaGJqbJpimpaaYmxqYZmWmmRgbGloaWFpaWGUlpRoapliYp5ikmxuZGlpaJRiZmxmrFSro6BUUlkAsk4psSQ/NzNZITk/ryQR6LAiBaBr8xJLSotSlYCqMlNS80oySyqRHVaUmpZalJqXDNZeWJpYqZeZr59fkJpXnJGZVgKUzklNLE7VTUkt089PLoDxrUz0DE31DHQrLMzizUyUakFuyC8oyczPgwZAclEq0C1FIEODUlMUPBJLFPyBhgaDDFUIBjoqMy9dwbG0JCMfGGyVCgZ6BnqGQGM6mWRYGBg5GNhYmUChysDFKQCLgT4zAYZeps1bfryz7j15qafOW3Dqwv8q1gUhm2eahBm6BEgZRp1fNN1LJEQi0PW1qVrTmQnusy7Pq/t2qcrj83LOh7b7uhMlL7AF3j6QM/HdoTTFaZsulu3qm/FU7SCTwhUH+WsaJw2/l2/bpKDEmvI29TPTCs0pJrFt1UGds0OXeuZf/Pvo9Y8WWw/7sA0lrA0daz6Ef9RdPsGdU+SDpjCrRuai8oIbavb9Fz22FvYv/eMk/dv26L6MPqaU1R56Sz8LVJQ1XQrk3Dzl+THGVZ97BOS0znjwn/RLvsNvc/8V8w39xV/XuhvskLMXfjPp5pErMtbKPMte5krmeEefy5uWvyi9dUPesedH/ey8l894t/RM1odKsaZwtx2X8tecb/eZGsd64P/c77cOnYiX62POMY+L2Xom4bVk5DnDncrKsictr/4yDjnO5Heg0uHN6k1rkv88Ez5yy+HU009+V1l3eFUfVVhfahQS/5trr3JrtIvKFln+s9L17+9brQp10wtkeTqt5OOZrftY7Nqk1mcLejxanF7uyHvSIj+vUPDZhk4GU+MAZ4a3zCfSdeb2l4REqdRwVhoXf7u9/6qnYf79L2IOHE4RzOVbghwsXgWa3T715rLQwT7e/SuYBYqWf87c+CFw0/QTPg3vmI/G/qhaKvLf3sy7U+N2TVDe9OUqj0/wvBI/yOV0y0Mpet1ZRt+zH9tllRMkH60PSd23EAA=
  sha256-0da6316466d60a3a4535d5fed3589feb0391989982fba59d47d4c729912d6363-3: owGbwMvMwMEoOU9/4l9n2UDGtYzJSWLxRQW5xZnpukWphboVxQEWxbl6SZl5qXe29lQrJRdllmQmJ+YoWSlUK2XmJqanglkp+cnZqUW6uYl5mWmpxSW6KZnpQAoopVSckWhkamZlkJJoZmxoZmJmlmJmkGicaGJqbJpimpaaYmxqYZmWmmRgbGloaWFpaWGUlpRoapliYp5ikmxuZGlpaJRiZmxmrFSro6BUUlkAsk4psSQ/NzNZITk/ryQR6LAiBaBr8xJLSotSlYCqMlNS80oySyqRHVaUmpZalJqXDNZeWJpYqZeZr59fkJpXnJGZVgKUzklNLE7VTUkt089PLoDxrUz0DE31DHQrLMzizUyUakFuyC8oyczPgwZAclEq0C1FIEODUlMUPBJLFPyBhgaDDFUIBjoqMy9dwbG0JCMfGGyVCgZ6BnqGQGM6mWRYGBg5GNhYmUChysDFKQCLgT4zAYZeps1bfryz7j15qafOW3Dqwv8q1gUhm2eahBm6BEgZRp1fNN1LJEQi0PW1qVrTmQnusy7Pq/t2qcrj83LOh7b7uhMlL7AF3j6QM/HdoTTFaZsulu3qm/FU7SCTwhUH+WsaJw2/l2/bpKDEmvI29TPTCs0pJrFt1UGds0OXeuZf/Pvo9Y8WWw/7sA0lrA0daz6Ef9RdPsGdU+SDpjCrRuai8oIbavb9Fz22FvYv/eMk/dv26L6MPqaU1R56Sz8LVJQ1XQrk3Dzl+THGVZ97BOS0znjwn/RLvsNvc/8V8w39xV/XuhvskLMXfjPp5pErMtbKPMte5krmeEefy5uWvyi9dUPesedH/ey8l894t/RM1odKsaZwtx2X8tecb/eZGsd64P/c77cOnYiX62POMY+L2Xom4bVk5DnDncrKsictr/4yDjnO5Heg0uHN6k1rkv88Ez5yy+HU009+V1l3eFUfVVhfahQS/5trr3JrtIvKFln+s9L17+9brQp10wtkeTqt5OOZrftY7Nqk1mcLejxanF7uyHvSIj+vUPDZhk4GU+MAZ4a3zCfSdeb2l4REqdRwVhoXf7u9/6qnYf79L2IOHE4RzOVbghwsXgWa3T715rLQwT7e/SuYBYqWf87c+CFw0/QTPg3vmI/G/qhaKvLf3sy7U+N2TVDe9OUqj0/wvBI/yOV0y0Mpet1ZRt+zH9tllRMkH60PSd23EAA=
  sha256-3717338045df06e31effea46761b2c7e90f543cc4f00547af8158dd6aea868c3-1: owGbwMvMwMEoOU9/4l9n2UDGtYzJSWLxRQW5xZnpukWphbqJSUGVJf56SZl5adU9WtVKyUWZJZnJiTlKVgrVSpm5iempYFZKfnJ2apFubmJeZlpqcYluSmY6kAJKKRVnJBqZmlkZmxuaGxtbGJiYpqQZmKUaG6ampaUmmpiZmxkmGSWbp1oapJmaGCcnm6QZGJiamCemWRiaWqSkmCWmJlqYWSQbK9XqKCiVVBaArFNKLMnPzUxWSM7PK0nMzEstUgC6Ni+xpLQoVQmoKjMlNa8ks6QS2WFFqWmpRal5yWDthaWJlXqZ+fr5Bal5xRmZaSVA6ZzUxOJU3ZTUMv385AIY38pEz9BMz0C3wsIs3sxEqRbkhvyCksz8PGgAJBelAt1SBDI0KDVFwSOxRMEfaGgwyFCFYKCjMvPSFRxLSzLygcFWqWCgZ6BnCDSmk0mGhYGRg4GNlQkUqgxcnAKwGNiSIcDQLFrmt8ZarfU0234jphipJx9PrVWR6Ne1P/lzlbnN1blfXt+UWnXz4NW1Ne/eHNI+vNpyxpe0VZozL1YKlMg+VCo+uul5S4t3L+8byXsmb98vdVy61TLumM+0Ta1WuikS3NfVlvPNLJ6y4+6qX74pz9pqnXbr32lxenH6btxcpW+C21ICAxd9tOkST7Vemn7kedPrOXyPCkQ5blZK1BdaPYndXcMZK3AsI7a4SqMsrvH2pNgVRU+X3z1t/umAHWv4FbZowW8zDnZtt1ov5215R/dtsXOw4fwEi5WtClM55h0908FyYOor+7/HI0qPZ3DsP8DPIZy4YOl38fb5PPOCTP8fm8t++erKN9mbAh7+Yo90eO8urXuho6OitC3hcIjpf9HiSBMl13fOt6MEF7zsn7Zj5oI7x5Y2Hr6ys/RNxnPZgjlh/pkdr7OccxM2zLFvXTN7b7n0r3dq277/LvuYl+l+e16u18bpMbmZu2VtkmYY31h94+uCaN3I43tbJLmXTtly97Yyc23LrtxtK7PM5K4oSd0oMJ7zaN3Ssr0bEo8GFIT7m9eY/3leG/76McPKO5uDHji8zpWUnfNyv2L315RVXc+usYuwf/v81PvHlz3Vt/49PTFNILy04pjQv788culLEi1edk2amaH5zTfBvN407aP4i6NzPwi98O5nac/cHbZLzDEw4iXjpHWsuWZPzJhyNF3myQb3SlQ7AQA=
  sha256-3717338045df06e31effea46761b2c7e90f543cc4f00547af8158dd6aea868c3-5: owGbwMvMwMEoOU9/4l9n2UDGtYzJSWLxRQW5xZnpukWphbqJSUGVJf56SZl5adU9WtVKyUWZJZnJiTlKVgrVSpm5iempYFZKfnJ2apFubmJeZlpqcYluSmY6kAJKKRVnJBqZmlkZmxuaGxtbGJiYpqQZmKUaG6ampaUmmpiZmxkmGSWbp1oapJmaGCcnm6QZGJiamCemWRiaWqSkmCWmJlqYWSQbK9XqKCiVVBaArFNKLMnPzUxWSM7PK0nMzEstUgC6Ni+xpLQoVQmoKjMlNa8ks6QS2WFFqWmpRal5yWDthaWJlXqZ+fr5Bal5xRmZaSVA6ZzUxOJU3ZTUMv385AIY38pEz9BMz0C3wsIs3sxEqRbkhvyCksz8PGgAJBelAt1SBDI0KDVFwSOxRMEfaGgwyFCFYKCjMvPSFRxLSzLygcFWqWCgZ6BnCDSmk0mGhYGRg4GNlQkUqgxcnAKwGNiSIcDQLFrmt8ZarfU0234jphipJx9PrVWR6Ne1P/lzlbnN1blfXt+UWnXz4NW1Ne/eHNI+vNpyxpe0VZozL1YKlMg+VCo+uul5S4t3L+8byXsmb98vdVy61TLumM+0Ta1WuikS3NfVlvPNLJ6y4+6qX74pz9pqnXbr32lxenH6btxcpW+C21ICAxd9tOkST7Vemn7kedPrOXyPCkQ5blZK1BdaPYndXcMZK3AsI7a4SqMsrvH2pNgVRU+X3z1t/umAHWv4FbZowW8zDnZtt1ov5215R/dtsXOw4fwEi5WtClM55h0908FyYOor+7/HI0qPZ3DsP8DPIZy4YOl38fb5PPOCTP8fm8t++erKN9mbAh7+Yo90eO8urXuho6OitC3hcIjpf9HiSBMl13fOt6MEF7zsn7Zj5oI7x5Y2Hr6ys/RNxnPZgjlh/pkdr7OccxM2zLFvXTN7b7n0r3dq277/LvuYl+l+e16u18bpMbmZu2VtkmYY31h94+uCaN3I43tbJLmXTtly97Yyc23LrtxtK7PM5K4oSd0oMJ7zaN3Ssr0bEo8GFIT7m9eY/3leG/76McPKO5uDHji8zpWUnfNyv2L315RVXc+usYuwf/v81PvHlz3Vt/49PTFNILy04pjQv788culLEi1edk2amaH5zTfBvN407aP4i6NzPwi98O5nac/cHbZLzDEw4iXjpHWsuWZPzJhyNF3myQb3SlQ7AQA=
  sha256-c03bbdd63fa8832266a2cf0d9fbcd2867692d9ba7e09d31bc77d15dd9903e36f-4: owGbwMvMwMEoOU9/4l9n2UDGtYwpSWLxRQW5xZnpukWphbpZ+ZXhZuF6SZl5abcZJKuVkosySzKTE3OUrBSqlTJzE9NTwayU/OTs1CLd3MS8zLTU4hLdlMx0IAWUUirOSDQyNbNKNjBOSkpJMTNOS7SwMDYyMjNLNEpOM0ixTEtKTjGyMDM3szRKsUxKNE81sEwxNkxKNjdPMTRNSbG0NDBONTZLU6rVUVAqqSwAWaeUWJKfm5mskJyfV5KYmZdapAB0bV5iSWlRqhJQVWZKal5JZkklssOKUtNSi1LzksHaC0sTK/Uy8/XzC1LzijMy00qA0jmpicWpuimpZfr5yQUwvpWJnqGpnrGhboWFWbyZiVItyBH5BSWZ+XnQEEguSgU6pghkalBqioJHYomCP9DUYJCpCsFAV2XmpSs4lpZk5APDrVLBQM9AzxBoTCeTDAsDIwcDGysTKFgZuDgFYFHwQYP/r7TdX8MJrlqz/3tPL+rjsZNXsNwX8Vxgc++2GI5dkt4r1r1nmrfdcGVn8tVJMtzTrf7m6F+9v5m7uK54b18F3+1JS5ziwtOfTpSpZs1u4z41o2QHo3HJmQNum0OK5ywoMtB4s8Mh+YVo7FSN7Vpr8/fdkHDPmr/plNTxw5EByZreMicnzhWx1TX94bxkYf9X1gehhstDj5Vu+7G6VTv49O9yx+xah4XC4ccvGyj4y374ql1TcsZwscHEagvz1eeFey97Lkj6nX2y+MyjY3yvJMRbxEvZ/iS9W/+b4+zOGZmHdm6pymfO9104VY3JVeO2V3JvfvKi9KKmXh8xyf/lQlprjI52nomwOOSZfIpBLv7Ezf/r9wQ4Lt81dfuJlfO50uc5p5ybIMD3L6ZywY3EA1yvIkNllmkwCTgc9RDwf7hnqrpoxNeLP75tcY7ekplU3FymE1z7YMIli8Trp3c0VFTFHuibLcGn13Rvu0roraAZBpvXV7vL7mExXjJHaoJlenxeOIvZ85ksH29fe3Cp2lVCp8Kh1KjUeyZ7w8PJX/W0Ppp96TTwUPuXNi/ZxXSpxxy19trJysLbLi5In8sytTB08vRLarfc0hiVXgs7m6f0P7xyYpbzVPbZrPYHnRjfCS9ljFNamXL50KzN6T46hww81YT1W84kzvMNZd/M0B+auvfe758FLnyRM3zfrJ43n2tbF1P3Ph7tqngA
kind: ConfigMap
metadata:
  labels:
    release.openshift.io/verification-signatures: ""
  namespace: openshift-config-managed

    

Version-Release number of selected component (if applicable):

     [fedora@preserve-fedora-yinzhou test]$ ./oc-mirror version
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.2.0-alpha.1-298-ga5a32fa", GitCommit:"a5a32fa3", GitTreeState:"clean", BuildDate:"2024-09-25T08:22:44Z", GoVersion:"go1.22.4", Compiler:"gc", Platform:"linux/amd64"}

    

How reproducible:

    Always
    

Steps to Reproduce:

    1.  clone oc-mirror repo, cd oc-mirror, run make build
    2.  Now use the imageSetConfig.yaml present above and run mirror2disk & disk2mirror commands
    3. oc-mirror -c /tmp/clid-232.yaml file://CLID-232 --v2 ; oc-mirror -c /tmp/clid-232.yaml --from file://CLID-232 docker://localhost:5000/clid-232 --dest-tls-verify=false --v2
    

Actual results:

   1. The signatures directory contains more releases than expected, as shown in the description.
    2. The binaryData is duplicated in signatureconfigmap.yaml.
    

Expected results:

    1. Should only see the releases that are defined in the imageSetConfig.yaml in the signatures directory
    2. Should not see any duplication of binaryData in the signatureconfigmap.yaml file.
    

Additional info:


    

The duplication of controllers for hostedcontrolplane v2 has caused some technical debt.

The new controllers are now out of sync with their v1 counterparts.

For example:

control-plane-operator/controllers/hostedcontrolplane/v2/cloud_controller_manager/openstack/config.go is missing a feature that was merged into the v1 controller around the time the v2 controller was merged, so it's out of sync.

Description of problem:

Inspection is failing on hosts where special characters are found in the serial number of block devices:

Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: 2024-07-03 09:16:11.325 1 DEBUG ironic_python_agent.inspector [-] collected data: {'inventory'....'error': "The following errors were encountered:\n* collector logs failed: 'utf-8' codec can't decode byte 0xff in position 12: invalid start byte"} call_inspector /usr/lib/python3.9/site-packages/ironic_python_agent/inspector.py:128

Serial found:
"serial": "2HC015KJ0000\udcff\udcff\udcff\udcff\udcff\udcff\udcff\udcff"

Interesting stacktrace error:
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1260-1267: surrogates not allowed

Full stack trace:
~~~
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: 2024-07-03 09:16:11.628 1 DEBUG oslo_concurrency.processutils [-] CMD "lsblk -bia --json -oKNAME,MODEL,SIZE,ROTA,TYPE,UUID,PARTUUID,SERIAL" returned: 0 in 0.006s e
xecute /usr/lib/python3.9/site-packages/oslo_concurrency/processutils.py:422
Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: --- Logging error ---
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: --- Logging error ---
Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: Traceback (most recent call last):
Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]:   File "/usr/lib64/python3.9/logging/__init__.py", line 1086, in emit
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Traceback (most recent call last):
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib64/python3.9/logging/__init__.py", line 1086, in emit
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     stream.write(msg + self.terminator)
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1260-1267: surrogates not allowed
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Call stack:
Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]:     stream.write(msg + self.terminator)
Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 1260-1267: surrogates not allowed
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/bin/ironic-python-agent", line 10, in <module>
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     sys.exit(run())
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_python_agent/cmd/agent.py", line 50, in run
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     agent.IronicPythonAgent(CONF.api_url,
Jul 03 09:16:11 master3.xxxxxx.yyy ironic-agent[2272]: Call stack:
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 485, in run
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     self.process_lookup_data(content)
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 400, in process_lookup_data
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     hardware.cache_node(self.node)
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 3179, in cache_node
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     dispatch_to_managers('wait_for_disks')
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 3124, in dispatch_to_managers
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     return getattr(manager, method)(*args, **kwargs)
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 997, in wait_for_disks
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     self.get_os_install_device()
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1518, in get_os_install_device
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     block_devices = self.list_block_devices_check_skip_list(
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1495, in list_block_devices_check_skip_list
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     block_devices = self.list_block_devices(
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 1460, in list_block_devices
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     block_devices = list_all_block_devices()
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_python_agent/hardware.py", line 526, in list_all_block_devices
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     report = il_utils.execute('lsblk', '-bia', '--json',
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_lib/utils.py", line 111, in execute
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     _log(result[0], result[1])
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:   File "/usr/lib/python3.9/site-packages/ironic_lib/utils.py", line 99, in _log
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]:     LOG.debug('Command stdout is: "%s"', stdout)
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Message: 'Command stdout is: "%s"'
Jul 03 09:16:11 master3.xxxxxx.yyy podman[2234]: Arguments: ('{\n   "blockdevices": [\n      {\n         "kname": "loop0",\n         "model": null,\n         "size": 67467313152,\n         "rota": false,\n         "type": "loop",\n         "uuid": "28f5ff52-7f5b-4e5a-bcf2-59813e5aef5a",\n         "partuuid": null,\n         "serial": null\n      },{\n         "kname": "loop1",\n         "model": null,\n         "size": 1027846144,\n         "rota": false,\n         "type": "loop",\n         "uuid": null,\n         "partuuid": null,\n         "serial": null\n      },{\n         "kname": "sda",\n         "model": "LITEON IT ECE-12",\n         "size": 120034123776,\n         "rota": false,\n         "type": "disk",\n         "uuid": null,\n         "partuuid": null,\n         "serial": "XXXXXXXXXXXXXXXXXX"\n      },{\n         "kname": "sdb",\n         "model": "LITEON IT ECE-12",\n         "size": 120034123776,\n         "rota": false,\n         "type": "disk",\n         "uuid": null,\n         "partuuid": null,\n         "serial": "XXXXXXXXXXXXXXXXXXXX"\n      },{\n         "kname": "sdc",\n         "model": "External",\n         "size": 0,\n         "rota": true,\n         "type": "disk",\n         "uuid": null,\n         "partuuid": null,\n         "serial": "2HC015KJ0000\udcff\udcff\udcff\udcff\udcff\udcff\udcff\udcff"\n      }\n   ]\n}\n',)
~~~
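The root cause is that the raw lsblk output can contain bytes that are not valid UTF-8, which later surface as surrogate escapes when logging or serializing. A minimal Go sketch of the kind of sanitization needed for such values (the actual fix belongs in ironic-python-agent, which is Python):

{noformat}
package main

import (
	"fmt"
	"strings"
)

// sanitizeSerial replaces runs of invalid UTF-8 bytes with the Unicode
// replacement character so the value can be safely logged and serialized.
func sanitizeSerial(s string) string {
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	raw := "2HC015KJ0000\xff\xff\xff\xff"
	fmt.Printf("%q\n", sanitizeSerial(raw)) // "2HC015KJ0000\ufffd"
}
{noformat}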

Version-Release number of selected component (if applicable):

OCP 4.14.28

How reproducible:

Always

Steps to Reproduce:

    1. Add a BMH with a bad utf-8 characters in serial
    2.
    3.
    

Actual results:

Inspection fail

Expected results:

Inspection works

Additional info:

    

 

Description of problem:

Selecting Add from the Event modal in topology redirects to the Add page, but the modal for adding a trigger for a broker persists.

Version-Release number of selected component (if applicable):


    

How reproducible:

Everytime
    

Steps to Reproduce:

    1. Enable event option in config map of knative-eventing namespace
    2. Create a broker and associate an event to it
    3. In topology select add trigger for the broker
    4. Since no service is created, it will ask to go to the Add page to create a service, so select Add from the modal
    

Actual results:

The modal persists
    

Expected results:

The modal should be closed after the user is redirected to the Add page
    

Additional info:

Adding video of the issue
    

https://drive.google.com/file/d/16hMbtBj0GeqUOLnUdCTMeYR3exY84oEn/view?usp=sharing

Description of problem:

    Rotating the root certificates (root CA) requires multiple certificates during the rotation process to prevent downtime as the server and client certificates are updated in the control and data planes. Currently, the HostedClusterConfigOperator uses the cluster-signer-ca from the control plane to create a kubelet-serving-ca on the data plane. The cluster-signer-ca contains only a single certificate that is used for signing certificates for the kube-controller-manager. 

During a rotation, the kubelet-serving-ca will be updated with the new CA, which triggers the metrics-server pod to restart and use the new CA. This leads to an error in the metrics-server where it cannot scrape metrics, as the kubelet has yet to pick up the new certificate.

E0808 16:57:09.829746       1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.240.0.29:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate signed by unknown authority" node="pres-cqogb7a10b7up68kvlvg-rkcpsms0805-default-00000130"

rkc@rmac ~> kubectl get pods -n openshift-monitoring
NAME                                                     READY   STATUS    RESTARTS   AGE
metrics-server-594cd99645-g8bj7                          0/1     Running   0          2d20h
metrics-server-594cd99645-jmjhj                          1/1     Running   0          46h 

The HostedClusterConfigOperator should likely be using the KubeletClientCABundle from the control plane for the kubelet-serving-ca in the data plane. This CA bundle will contain both the new and old CAs so that all data plane components can remain up during the rotation process.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

the section is: https://docs.openshift.com/container-platform/4.16/installing/installing_aws/ipi/installing-aws-vpc.html#installation-aws-arm-tested-machine-types_installing-aws-vpc  

All tested ARM instances for 4.14+:
c6g.*
c7g.*
m6g.*
m7g.*
r8g.*

We need to ensure that all relevant sections include a "Tested instance types for AWS on 64-bit ARM infrastructures" section that has been updated for 4.14+.

Additional info:

    

In 4.17 the openshift installer will have the `create config iso` functionality (see epic). IBIO should stop implementing this logic; instead it should extract the openshift installer from the release image (already part of the ICI CR) and use it to create the configuration ISO.

Please review the following PR: https://github.com/openshift/route-controller-manager/pull/47

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The console crashes when the user selects SSH as the Authentication type for the git server under add secret in the start pipeline form     

Version-Release number of selected component (if applicable):

    

How reproducible:

Every time. Only in the developer perspective and only if the Pipelines dynamic plugin is enabled.
    

Steps to Reproduce:

    1. Create a pipeline through add flow and open start pipeline page 
    2. Under show credentials select add secret
    3. In the secret form select `Access to ` as Git server and `Authentication type` as SSH key
    

Actual results:

Console crashes
    

Expected results:

UI should work as expected
    

Additional info:

Attaching console log screenshot
    

https://drive.google.com/file/d/1bGndbq_WLQ-4XxG5ylU7VuZWZU15ywTI/view?usp=sharing

Please review the following PR: https://github.com/openshift/csi-operator/pull/227

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Dualstack jobs beyond 4.13 (presumably when we added cluster-data.json) are miscategorized as NetworkStack = ipv4 because the code doesn't know how to detect dualstack: https://github.com/openshift/origin/blob/11f7ac3e64e6ee719558fc18d753d4ce1303d815/pkg/monitortestlibrary/platformidentification/types.go#L88

We have the ability to NOT override a variant calculated from jobname if cluster-data disagrees: https://github.com/openshift/sippy/blob/master/pkg/variantregistry/ocp.go#L181

We should fix origin, but we don't want to backport to five releases, so we should also update the variant registry to ignore this field in cluster data if the release is <= 4.18 (assuming that's where we fix this)
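A minimal sketch of how origin could classify the network stack from the cluster network CIDRs recorded in cluster-data; the field names and wiring are assumptions, not the actual origin code.

{noformat}
package main

import (
	"fmt"
	"net"
)

// networkStack returns "ipv4", "ipv6", or "dual" based on which address
// families appear in the configured cluster network CIDRs.
func networkStack(cidrs []string) string {
	var v4, v6 bool
	for _, c := range cidrs {
		ip, _, err := net.ParseCIDR(c)
		if err != nil {
			continue
		}
		if ip.To4() != nil {
			v4 = true
		} else {
			v6 = true
		}
	}
	switch {
	case v4 && v6:
		return "dual"
	case v6:
		return "ipv6"
	default:
		return "ipv4"
	}
}

func main() {
	fmt.Println(networkStack([]string{"10.128.0.0/14", "fd01::/48"})) // dual
}
{noformat}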

Description of the problem:

It looks like the nmstate service is enabled on the ARM machine.

ARM machine: (Run on CI job)
nvd-srv-17.nvidia.eng.rdu2.redhat.com

https://ci-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/CI/job/assisted-saas-api-dualstack-staticip-mno-arm/

[root@worker-0-0 core]# cd /etc/nmstate/
[root@worker-0-0 nmstate]# ls -l
total 8
-rw-r--r--. 1 root root 95 Aug 1 2022 README
-rw-------. 1 root root 804 Sep 24 12:36 ymlFile2.yml
[root@worker-0-0 nmstate]# cat ymlFile2.yml
capture:
  iface0: interfaces.mac-address == "52:54:00:82:6B:E0"
desiredState:
  dns-resolver:
    config:
      server:
      - 192.168.200.1
  interfaces:
  - ipv4:
      address:
      - ip: 192.168.200.53
        prefix-length: 24
      dhcp: false
      enabled: true
    name: "{{ capture.iface0.interfaces.0.name }}"
    type: ethernet
    state: up
    ipv6:
      address:
      - ip: fd2e:6f44:5dd8::39
        prefix-length: 64
      dhcp: false
      enabled: true
  routes:
    config:
    - destination: 0.0.0.0/0
      next-hop-address: 192.168.200.1
      next-hop-interface: "{{ capture.iface0.interfaces.0.name }}"
      table-id: 254
    - destination: ::/0
      next-hop-address: fd2e:6f44:5dd8::1
      next-hop-interface: "{{ capture.iface0.interfaces.0.name }}"
      table-id: 254

 

[root@worker-0-0 nmstate]# systemctl status nmstate.service
● nmstate.service - Apply nmstate on-disk state
     Loaded: loaded (/usr/lib/systemd/system/nmstate.service; enabled; preset: enabled)
     Active: active (exited) since Tue 2024-09-24 12:40:05 UTC; 20min ago
       Docs: man:nmstate.service(8)
             https://www.nmstate.io
    Process: 3427 ExecStart=/usr/bin/nmstatectl service (code=exited, status=0/SUCCESS)
   Main PID: 3427 (code=exited, status=0/SUCCESS)
        CPU: 36ms

Sep 24 12:40:03 worker-0-0 systemd[1]: Starting Apply nmstate on-disk state...
Sep 24 12:40:03 worker-0-0 nmstatectl[3427]: [2024-09-24T12:40:03Z INFO nmstatectl] Nmstate version: 2.2.27
Sep 24 12:40:05 worker-0-0 nmstatectl[3427]: [2024-09-24T12:40:05Z INFO nmstate::ip] Static addresses fd2e:6f44:5dd8::39/64 defined when dynamic IP is enabled
Sep 24 12:40:05 worker-0-0 nmstatectl[3427]: [2024-09-24T12:40:05Z INFO nmstate::ip] Static addresses fd2e:6f44:5dd8::39/64 defined when dynamic IP is enabled
Sep 24 12:40:05 worker-0-0 nmstatectl[3427]: [2024-09-24T12:40:05Z INFO nmstate::query_apply::net_state] Created checkpoint /org/freedesktop/NetworkManager/Checkpoint/1
Sep 24 12:40:05 worker-0-0 nmstatectl[3427]: [2024-09-24T12:40:05Z INFO nmstate::query_apply::net_state] Rollbacked to checkpoint /org/freedesktop/NetworkManager/Checkpoint/1
Sep 24 12:40:05 worker-0-0 nmstatectl[3427]: [2024-09-24T12:40:05Z ERROR nmstatectl::service] Failed to apply state file /etc/nmstate/ymlFile2.yml: NmstateError: NotImplementedError: Autoconf without DHCP is not supported yet
Sep 24 12:40:05 worker-0-0 systemd[1]: Finished Apply nmstate on-disk state.

[root@worker-0-0 nmstate]# more /usr/lib/systemd/system/nmstate.service
[Unit]
Description=Apply nmstate on-disk state
Documentation=man:nmstate.service(8) https://www.nmstate.io
After=NetworkManager.service
Before=network-online.target
Requires=NetworkManager.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nmstatectl service
RemainAfterExit=yes

[Install]
WantedBy=NetworkManager.service
[root@worker-0-0 nmstate]#

How reproducible:

 

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

Gather the nodenetworkconfigurationpolicy.nmstate.io/v1 and nodenetworkstate.nmstate.io/v1beta1 cluster-scoped resources in the Insights data. These CRs are introduced by the NMState operator.

Description of problem:

  A new 'Architecture' chart is added on the Metrics page for some resources, e.g. Deployments, StatefulSets, DaemonSets, and so on. The chart shows 'No datapoints found', which is not correct

The reported issues/questions are: 
Q1. Should the 'Architecture' chart be listed on the Metrics page for those resources?
Q2. If yes, it should not show 'No datapoints found'

Version-Release number of selected component (if applicable):

  4.18.0-0.nightly-2024-10-08-075347  

How reproducible:

    Always

Steps to Reproduce:

    1. Navigate to a resource details page, such as a StatefulSet or Deployment details page, and go to the Metrics tab
       eg: k8s/ns/openshift-monitoring/statefulsets/alertmanager-main/metrics
    2. Check the new chart 'Architecture'
    3.
    

Actual results:

    A new chart 'Architecture' is listed on the Metrics page
    and the data in the chart returns 'No datapoints found'

Expected results:

    The 'Architecture' chart should not exist
    If it is added by design, it should not return 'No datapoints found'

Additional info:

For reference: I think the page is impacted by the PR https://github.com/openshift/console/pull/13718

Description of problem:

    etcd-operator is using a JSON-based client for core object communication. Instead, it should use the protobuf version.
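As a rough sketch of the kind of change implied (not the operator's actual code), a client-go rest.Config can be switched to the protobuf content type for built-in resources:

package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newProtobufClientset copies a rest.Config and switches it to the protobuf
// content type for built-in (core) API objects, falling back to JSON where
// protobuf is not served.
func newProtobufClientset(cfg *rest.Config) (*kubernetes.Clientset, error) {
	protoCfg := rest.CopyConfig(cfg)
	protoCfg.ContentType = "application/vnd.kubernetes.protobuf"
	protoCfg.AcceptContentTypes = "application/vnd.kubernetes.protobuf,application/json"
	return kubernetes.NewForConfig(protoCfg)
}

func main() {
	cfg, err := rest.InClusterConfig() // assumes the code runs inside a pod
	if err != nil {
		fmt.Println("not running in-cluster:", err)
		return
	}
	if _, err := newProtobufClientset(cfg); err != nil {
		fmt.Println("client construction failed:", err)
	}
}

Note that protobuf encoding only applies to built-in API types; clients for CRD-based resources continue to use JSON.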

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of the problem:

When attempting to delete the agentserviceconfig, it gets stuck deleting on the `agentserviceconfig.agent-install.openshift.io/local-cluster-import-deprovision` finalizer. 

 

The following errors are reported by the infrastructure operator pod:

time="2024-09-03T12:57:17Z" level=info msg="AgentServiceConfig (LocalClusterImport) Reconcile started"
time="2024-09-03T12:57:17Z" level=error msg="could not delete local cluster ClusterDeployment due to error failed to delete ClusterDeployment  in namespace : resource name may not be empty"
time="2024-09-03T12:57:17Z" level=error msg="failed to clean up local cluster CRs" error="failed to delete ClusterDeployment  in namespace : resource name may not be empty"
time="2024-09-03T12:57:17Z" level=info msg="AgentServiceConfig (LocalClusterImport) Reconcile ended"
{"level":"error","ts":"2024-09-03T12:57:17Z","msg":"Reconciler error","controller":"agentserviceconfig","controllerGroup":"agent-install.openshift.io","controllerKind":"AgentServiceConfig","AgentServiceConfig":{"name":"agent"},"namespace":"","name":"agent","reconcileID":"470afd7d-ec86-4d45-818f-eb6ebb4caa3d","error":"failed to delete ClusterDeployment  in namespace : resource name may not be empty","errorVerbose":"resource name may not be empty\nfailed to delete ClusterDeployment  in namespace \ngithub.com/openshift/assisted-service/internal/controller/controllers.(*LocalClusterImportReconciler).deleteClusterDeployment\n\t/remote-source/assisted-service/app/internal/controller/controllers/local_cluster_import_controller.go:250\ngithub.com/openshift/assisted-service/internal/controller/controllers.(*LocalClusterImportReconciler).ensureLocalClusterCRsDeleted\n\t/remote-source/assisted-service/app/internal/controller/controllers/local_cluster_import_controller.go:333\ngithub.com/openshift/assisted-service/internal/controller/controllers.(*LocalClusterImportReconciler).Reconcile\n\t/remote-source/assisted-service/app/internal/controller/controllers/local_cluster_import_controller.go:116\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/remote-source/assisted-service/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/assisted-service/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/assisted-service/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/assisted-service/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1695","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/assisted-service/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/assisted-service/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/assisted-service/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"} 

How reproducible:

100%

Steps to reproduce:

1. Delete AgentServiceConfig resource

Actual results:

The AgentServiceConfig isn't removed

Expected results:

The AgentServiceConfig is removed

Description of problem:

When using the latest main-branch hypershift client to create a 4.15 HC, the capi-provider crashed with the following logs:

$ oc logs capi-provider-647f454bf-sqq9c
Defaulted container "manager" out of: manager, token-minter, availability-prober (init)
invalid argument "EKS=false,ROSA=false" for "--feature-gates" flag: unrecognized feature gate: ROSA
Usage of /bin/cluster-api-provider-aws-controller-manager:
invalid argument "EKS=false,ROSA=false" for "--feature-gates" flag: unrecognized feature gate: ROSA

Version-Release number of selected component (if applicable):

4.15 HC    

How reproducible:

    100%

Steps to Reproduce:

    1. Use the latest main-branch CLI to create a public AWS 4.15 HC
    2.
    3.
    

Actual results:

capi-provider pod crashed     

Expected results:

    The 4.15 HC should be created successfully

Additional info:

probably related to

4576

slack: https://redhat-internal.slack.com/archives/G01QS0P2F6W/p1724249475037359

 

Description of problem:

Removing third party override of cloud-provider-vsphere's config package

Version-Release number of selected component (if applicable):

4.18, 4.17.z

How reproducible:

Always

Additional info:

The upstream package was overridden to fix logging confusion while we waited for an upstream fix. The fix is now ready, and the third-party override needs to be removed.

Description of problem:

    After branching, main branch still publishes Konflux builds to mce-2.7

Version-Release number of selected component (if applicable):

    mce-2.7

How reproducible:

    100%

Steps to Reproduce:

    1. Post a PR to main

    2. Check the jobs that run
    

Actual results:

Both mce-2.7 and main Konflux builds get triggered    

Expected results:

Only the main branch Konflux builds get triggered

Additional info:

    

Description of problem:

After installing the MCE operator, creating a MultiClusterEngine instance failed with the error:
 "error applying object Name: mce Kind: ConsolePlugin Error: conversion webhook for console.openshift.io/v1alpha1, Kind=ConsolePlugin failed: Post "https://webhook.openshift-console-operator.svc:9443/crdconvert?timeout=30s": service "webhook" not found"
Checking in openshift-console-operator, there is no webhook service, and the deployment "console-conversion-webhook" is also missing.
    

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-06-25-103421
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Check resources in openshift-console-operator, such as the deployment and service.
    2.
    3.
    

Actual results:

1. There is no webhook-related deployment, pod, or service. 
    

Expected results:

1. Webhook-related resources should exist.
    

Additional info:


    

Description of problem:

Edit Deployment and Edit DeploymentConfig actions redirect user to project workloads page instead of resource details page

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-10-22-123921    

How reproducible:

    Always

Steps to Reproduce:

    1. Use the `Edit Deployment` or `Edit DeploymentConfig` action in either Form or YAML view and save the changes
    

Actual results:

1. user will be redirected to project workloads page    

Expected results:

1. user should be taken to resource details page    

Additional info:

    

Description of problem:

    Cancelling the file browser dialog after an initial file was previously uploaded causes a TypeError crash

Version-Release number of selected component (if applicable):

4.18.0-0.ci-2024-10-30-043000
    

How reproducible:

    always

Steps to Reproduce:

1. Log in to the console 
2. Go to Secrets -> Create Image pull secret; on the page set Secret name: test-secret and Authentication type: Upload configuration file, then click Browse and upload a file.
3. Browse for another file, but instead of uploading it, cancel the file chooser dialog; the console crashes with a 'Cannot read properties of undefined (reading 'size')' error.

Actual results:

Console crashes with 'Cannot read properties of undefined (reading 'size')' error

Expected results:

Console should not crash.

Additional info:

    

Description of problem:

On an Ingress details page, clicking the "Edit" button for Labels opens the annotation edit modal.
    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-09-10-133647
4.17.0-0.nightly-2024-09-09-120947
    
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Go to an Ingress details page and click the "Edit" button for Labels.
    2.
    3.
    

Actual results:

1. The "Edit annotations" modal is opened.
    

Expected results:

1. Should open "Edit labels" modal.
    

Additional info:


    

Description of problem:

    Enabling the Shipwright tests in CI

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

In cluster-capi-operator, if the VsphereCluster object gets deleted, the controller attempts to recreate it and fails while also trying to recreate its corresponding vSphere credentials secret, which still exists.

The failure is highlighted by the following logs in the controller: `resourceVersion should not be set on objects to be created`
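A minimal sketch of the usual remedy for that error, assuming a controller-runtime client and a hypothetical helper name: strip the server-populated metadata from the cached copy before calling Create.

package controllers

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// recreateObject clears server-populated metadata from a previously fetched
// object before creating it again; the API server rejects Create calls that
// still carry a resourceVersion.
func recreateObject(ctx context.Context, c client.Client, obj client.Object) error {
	obj.SetResourceVersion("")
	obj.SetUID("")
	return c.Create(ctx, obj)
}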

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Delete VsphereCluster
    2. Check the cluster-capi-operator logs
    3.
    

Actual results:

    VsphereCluster fails to be recreated, as reconciliation fails while ensuring the vSphere credentials secret

Expected results:

    VsphereCluster gets recreated

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-operator/pull/231

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

 On 1.8.2024, the assisted-installer-agent job started failing the subsystem test "add_multiple_servers". We need to make sure it occurs only in tests, and the fix should be backported.

Description of problem:

    there is a spelling error for the word `instal`; it should be `install`

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-2024-09-03-211053

How reproducible:

    Always

Steps to Reproduce:

    1. As a normal user, open the Lightspeed hover button and check the messages
    2.
    3.
    

Actual results:

Must have administrator accessContact your administrator and ask them to instal Red Hat OpenShift Lightspeed.    

Expected results:

word `instal` should be `install`     

Additional info:

    

Description of problem:

When we enable the OCB functionality and create a MC that configures an enforcing=0 kernel argument, the MCP is degraded, reporting this message:

              {
                  "lastTransitionTime": "2024-05-30T09:37:06Z",
                  "message": "Node ip-10-0-29-166.us-east-2.compute.internal is reporting: \"unexpected on-disk state validating against quay.io/mcoqe/layering@sha256:654149c7e25a1ada80acb8eedc3ecf9966a8d29e9738b39fcbedad44ddd15ed5: missing expected kernel arguments: [enforcing=0]\"",
                  "reason": "1 nodes are reporting degraded status on sync",
                  "status": "True",
                  "type": "NodeDegraded"
              },


    

Version-Release number of selected component (if applicable):

IPI on AWS

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-05-30-021120   True        False         97m     Error while reconciling 4.16.0-0.nightly-2024-05-30-021120: the cluster operator olm is not available

    

How reproducible:

Always
    

Steps to Reproduce:

    1. Enable techpreview
$ oc patch featuregate cluster --type=merge -p '{"spec":{"featureSet": "TechPreviewNoUpgrade"}}'

    2. Configure a MSOC resource to enable OCB functionality in the worker pool

When we hit this problem we were using the mcoqe quay repository, with a copy of the pull-secret for baseImagePullSecret and renderedImagePushSecret, and no currentImagePullSecret configured.

apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineOSConfig
metadata:
  name: worker
spec:
  machineConfigPool:
    name: worker
#  buildOutputs:
#    currentImagePullSecret:
#      name: ""
  buildInputs:
    imageBuilder:
      imageBuilderType: PodImageBuilder
    baseImagePullSecret:
      name: pull-copy 
    renderedImagePushSecret:
      name: pull-copy 
    renderedImagePushspec: "quay.io/mcoqe/layering:latest"

    3. Create a MC to use the enforcing=0 kernel argument

{
    "kind": "List",
    "apiVersion": "v1",
    "metadata": {},
    "items": [
        {
            "apiVersion": "machineconfiguration.openshift.io/v1",
            "kind": "MachineConfig",
            "metadata": {
                "labels": {
                    "machineconfiguration.openshift.io/role": "worker"
                },
                "name": "change-worker-kernel-selinux-gvr393x2"
            },
            "spec": {
                "config": {
                    "ignition": {
                        "version": "3.2.0"
                    }
                },
                "kernelArguments": [
                    "enforcing=0"
                ]
            }
        }
    ]
}

    

Actual results:

The worker MCP is degraded reporting this message:

oc get mcp worker -oyaml
....

              {
                  "lastTransitionTime": "2024-05-30T09:37:06Z",
                  "message": "Node ip-10-0-29-166.us-east-2.compute.internal is reporting: \"unexpected on-disk state validating against quay.io/mcoqe/layering@sha256:654149c7e25a1ada80acb8eedc3ecf9966a8d29e9738b39fcbedad44ddd15ed5: missing expected kernel arguments: [enforcing=0]\"",
                  "reason": "1 nodes are reporting degraded status on sync",
                  "status": "True",
                  "type": "NodeDegraded"
              },

    

Expected results:

The MC should be applied without problems and SELinux should be using enforcing=0
    

Additional info:


    

Description of problem:

In hostedcluster installations, when the OAuthServer service below is configured without any hostname parameter, the oauth route is created in the management cluster with the standard hostname, which follows the pattern of the ingresscontroller wildcard domain (oauth-<hosted-cluster-namespace>.<wildcard-default-ingress-controller-domain>):  

~~~
$ oc get hostedcluster -n <namespace> <hosted-cluster-name> -oyaml
  - service: OAuthServer
    servicePublishingStrategy:
      type: Route
~~~  

On the other hand, if any custom hostname parameter is configured, the oauth route is created in the management cluster with the following labels: 

~~~
$ oc get hostedcluster -n <namespace> <hosted-cluster-name> -oyaml
  - service: OAuthServer
    servicePublishingStrategy:
      route:
        hostname: oauth.<custom-domain>
      type: Route

$ oc get routes -n hcp-ns --show-labels
NAME    HOST/PORT             LABELS
oauth oauth.<custom-domain>  hypershift.openshift.io/hosted-control-plane=hcp-ns <---
~~~

The configured label makes the ingresscontroller not admit the route, as the following configuration is added by the hypershift operator to the default ingresscontroller resource: 

~~~
$ oc get ingresscontroller -n openshift-ingress-operator default -oyaml
    routeSelector:
      matchExpressions:
      - key: hypershift.openshift.io/hosted-control-plane <---
        operator: DoesNotExist <---
~~~

This configuration should be allowed as there are use-cases where the route should have a customized hostname. Currently the HCP platform is not allowing this configuration and the oauth route does not work.

Version-Release number of selected component (if applicable):

   4.15

How reproducible:

    Easily

Steps to Reproduce:

    1. Install HCP cluster 
    2. Configure OAuthServer with type Route 
    3. Add a custom hostname different than default wildcard ingress URL from management cluster
    

Actual results:

    Oauth route is not admitted

Expected results:

    Oauth route should be admitted by Ingresscontroller

Additional info:

    

Version of components:
OCP version 

4.16.0-0.nightly-2024-11-05-003735

Operator bundle: quay.io/rhobs/observability-operator-bundle:0.4.3-241105092032

Description of issue:
When the Tracing UI plugin instance is created, the distributed-tracing-* pod shows the following errors and the Tracing UI is not available in the OCP web console. 

 % oc logs distributed-tracing-745f655d84-2jk6b
time="2024-11-05T13:08:37Z" level=info msg="enabled features: []\n" module=main
time="2024-11-05T13:08:37Z" level=error msg="cannot read base manifest file" error="open web/dist/plugin-manifest.json: no such file or directory" module=manifest
time="2024-11-05T13:08:37Z" level=info msg="listening on https://:9443" module=server
I1105 13:08:37.620932       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
10.128.0.109 - - [05/Nov/2024:13:08:54 +0000] "GET /plugin-manifest.json HTTP/1.1" 500 62
10.128.0.109 - - [05/Nov/2024:13:08:54 +0000] "GET /plugin-manifest.json HTTP/1.1" 500 62
10.128.0.109 - - [05/Nov/2024:13:09:10 +0000] "GET /plugin-manifest.json HTTP/1.1" 500 62
10.128.0.109 - - [05/Nov/2024:13:09:25 +0000] "GET /plugin-manifest.json HTTP/1.1" 500 62 

Steps to reproduce the issue:

*Install the latest operator bundle. 

quay.io/rhobs/observability-operator-bundle:0.4.3-241105092032

*Set the -openshift.enabled flag in the CSV.

*Create the Tracing UI plugin instance and check the UI plugin pod logs.

Description of problem: If a customer applies ethtool configuration to the interface used in br-ex, that configuration will be dropped when br-ex is created. We need to read and apply the configuration from the interface to the phys0 connection profile, as described in https://issues.redhat.com/browse/RHEL-56741?focusedId=25465040&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25465040

Version-Release number of selected component (if applicable): 4.16

How reproducible: Always

Steps to Reproduce:

1. Deploy a cluster with an NMState config that sets the ethtool.feature.esp-tx-csum-hw-offload field to "off"

2.

3.

Actual results: The ethtool setting is only applied to the interface profile which is disabled after configure-ovs runs

Expected results: The ethtool setting is present on the configure-ovs-created profile

Additional info:

Affected Platforms: VSphere. Probably baremetal too and possibly others.

Description of problem:

The whereabouts kubeconfig is known to expire: if the cluster credentials and the kubernetes secret change, the whereabouts kubeconfig (which is stored on disk) is not updated to reflect the credential change.
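A minimal sketch of the direction a fix could take, assuming the kubeconfig is rendered from the mounted serviceaccount credentials; the kubeconfig path and function name are hypothetical, not Whereabouts' actual code:

package kubeconfig

import (
	"fmt"
	"os"
	"strings"
)

const (
	tokenPath      = "/var/run/secrets/kubernetes.io/serviceaccount/token"
	caPath         = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
	kubeconfigPath = "/host/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig" // hypothetical location
)

// syncKubeconfig re-renders the on-disk kubeconfig from the current
// serviceaccount token so a rotated credential is picked up.
func syncKubeconfig(apiServer string) error {
	token, err := os.ReadFile(tokenPath)
	if err != nil {
		return err
	}
	kubeconfig := fmt.Sprintf(`apiVersion: v1
kind: Config
clusters:
- name: local
  cluster:
    server: %s
    certificate-authority: %s
users:
- name: whereabouts
  user:
    token: %s
contexts:
- name: whereabouts
  context:
    cluster: local
    user: whereabouts
current-context: whereabouts
`, apiServer, caPath, strings.TrimSpace(string(token)))
	return os.WriteFile(kubeconfigPath, []byte(kubeconfig), 0o600)
}

Running something like this periodically, or on a watch of the token file, would keep the on-disk kubeconfig valid across credential rotations.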

Version-Release number of selected component (if applicable):

>= 4.8.z (all OCP versions which ship Whereabouts)

How reproducible:

With time.

Steps to Reproduce:

1. Wait for cluster credentials to expire (which may take a year depending on cluster configuration) (currently unaware of a technique to force a credentials change to the serviceaccount secret token)

Actual results:

Kubeconfig is out of date and Whereabouts cannot properly authenticate with API server

Expected results:

Kubeconfig is updated and Whereabouts can authenticate with API server

Description of the problem:

 Trying to create a cluster (multi-node, operators: mtv + cnv + lvms) with minimal requirements

according to preflight response (attached below):
We should need 5 vcpu cores as minimal req:

  • basic: 2
  • additional for mtv 1
  • additional for cnv 2
  • additional for lvms 0
  • should be 5

However, when creating the cluster it is asking for 6 instead of 5.

tooltip says
Require at least 6 CPU cores for worker role, found only 5.

{"ocp":{"master":{"qualitative":null,"quantitative":{"cpu_cores":4,"disk_size_gb":20,"installation_disk_speed_threshold_ms":10,"network_latency_threshold_ms":100,"packet_loss_percentage":0,"ram_mib":16384}},"worker":{"qualitative":null,"quantitative":{"cpu_cores":2,"disk_size_gb":20,"installation_disk_speed_threshold_ms":10,"network_latency_threshold_ms":1000,"packet_loss_percentage":10,"ram_mib":8192}}},"operators":[{"dependencies":[],"operator_name":"lso","requirements":{"master":{"qualitative":null,"quantitative":{}},"worker":{"qualitative":null,"quantitative":{}}}},{"dependencies":["lso"],"operator_name":"odf","requirements":{"master":{"qualitative":["Requirements apply only for master-only clusters","At least 3 hosts","At least 1 non-boot SSD or HDD disk on 3 hosts"],"quantitative":{"cpu_cores":6,"ram_mib":19456}},"worker":{"qualitative":["Requirements apply only for clusters with workers","5 GiB of additional RAM for each non-boot disk","2 additional CPUs for each non-boot disk","At least 3 workers","At least 1 non-boot SSD or HDD disk on 3 workers"],"quantitative":{"cpu_cores":8,"ram_mib":19456}}}},{"dependencies":["lso"],"operator_name":"cnv","requirements":{"master":{"qualitative":["Additional 1GiB of RAM per each supported GPU","Additional 1GiB of RAM per each supported SR-IOV NIC","CPU has virtualization flag (vmx or svm)"],"quantitative":{"cpu_cores":4,"ram_mib":150}},"worker":{"qualitative":["Additional 1GiB of RAM per each supported GPU","Additional 1GiB of RAM per each supported SR-IOV NIC","CPU has virtualization flag (vmx or svm)"],"quantitative":{"cpu_cores":2,"ram_mib":360}}}},{"dependencies":[],"operator_name":"lvm","requirements":{"master":{"qualitative":["At least 1 non-boot disk per host","100 MiB of additional RAM","1 additional CPUs for each non-boot disk"],"quantitative":{"cpu_cores":1,"ram_mib":100}},"worker":{"qualitative":null,"quantitative":{}}}},{"dependencies":[],"operator_name":"mce","requirements":{"master":{"qualitative":[],"quantitative":{"cpu_cores":4,"ram_mib":16384}},"worker":{"qualitative":[],"quantitative":{"cpu_cores":4,"ram_mib":16384}}}},{"dependencies":["cnv"],"operator_name":"mtv","requirements":{"master":{"qualitative":["1024 MiB of additional RAM","1 additional CPUs"],"quantitative":{"cpu_cores":1,"ram_mib":1024}},"worker":{"qualitative":["1024 MiB of additional RAM","1 additional CPUs"],"quantitative":{"cpu_cores":1,"ram_mib":1024}}}}]}

How reproducible:

 100%

Steps to reproduce:

1. Create a multi-node cluster

2. select mtv + lvms + cnv

3. Add a worker node with 5 CPU cores

Actual results:

 Unable to continue the installation process; the cluster is asking for an extra CPU core

Expected results:
Should be able to install the cluster; 5 CPUs should be enough

Description of problem:

Trying to install the AWS EFS driver 4.15 in 4.16 OCP, the driver pods get stuck with the error below:
$ oc get pods
NAME                                             READY   STATUS    RESTARTS   AGE
aws-ebs-csi-driver-controller-5f85b66c6-5gw8n    11/11   Running   0          80m
aws-ebs-csi-driver-controller-5f85b66c6-r5lzm    11/11   Running   0          80m
aws-ebs-csi-driver-node-4mcjp                    3/3     Running   0          76m
aws-ebs-csi-driver-node-82hmk                    3/3     Running   0          76m
aws-ebs-csi-driver-node-p7g8j                    3/3     Running   0          80m
aws-ebs-csi-driver-node-q9bnd                    3/3     Running   0          75m
aws-ebs-csi-driver-node-vddmg                    3/3     Running   0          80m
aws-ebs-csi-driver-node-x8cwl                    3/3     Running   0          80m
aws-ebs-csi-driver-operator-5c77fbb9fd-dc94m     1/1     Running   0          80m
aws-efs-csi-driver-controller-6c4c6f8c8c-725f4   4/4     Running   0          11m
aws-efs-csi-driver-controller-6c4c6f8c8c-nvtl7   4/4     Running   0          12m
aws-efs-csi-driver-node-2frs7                    0/3     Pending   0          6m29s
aws-efs-csi-driver-node-5cpb8                    0/3     Pending   0          6m26s
aws-efs-csi-driver-node-bchg5                    0/3     Pending   0          6m28s
aws-efs-csi-driver-node-brndb                    0/3     Pending   0          6m27s
aws-efs-csi-driver-node-qcc4m                    0/3     Pending   0          6m27s
aws-efs-csi-driver-node-wpk5d                    0/3     Pending   0          6m27s
aws-efs-csi-driver-operator-6b54c78484-gvxrt     1/1     Running   0          13m

Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  6m58s                  default-scheduler  0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  3m42s (x2 over 4m24s)  default-scheduler  0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match Pod's node affinity/selector.

 

 

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    all the time

Steps to Reproduce:

    1. Install AWS EFS CSI driver 4.15 in 4.16 OCP
    2.
    3.
    

Actual results:

    EFS CSI drive node pods are stuck in pending state

Expected results:

    All pods should be running.

Additional info:

    More info on the initial debug here: https://redhat-internal.slack.com/archives/CBQHQFU0N/p1715757611210639

Description of problem:

    In a 4.18 Azure Stack Hub cluster, the Azure Disk CSI driver doesn't work; it fails with the following error when provisioning a volume:
E1024 05:36:01.335536       1 utils.go:110] GRPC error: rpc error: code = Internal desc = PUT https://management.mtcazs.wwtatc.com/subscriptions/de7e09c3-b59a-4c7d-9c77-439c11b92879/resourceGroups/ci-op-wv5kxjrl-cc5c6/providers/Microsoft.Compute/disks/pvc-854653a6-6107-44ff-95e3-a6d588864420
--------------------------------------------------------------------------------
RESPONSE 400: 400 Bad Request
ERROR CODE: NoRegisteredProviderFound
--------------------------------------------------------------------------------
{
  "error": {
    "code": "NoRegisteredProviderFound",
    "message": "No registered resource provider found for location 'mtcazs' and API version '2023-10-02' for type 'disks'. The supported api-versions are '2017-03-30, 2018-04-01, 2018-06-01, 2018-09-30, 2019-03-01, 2019-07-01, 2019-11-01'. The supported locations are 'mtcazs'."
  }
}
--------------------------------------------------------------------------------

Version-Release number of selected component (if applicable):

    OCP:4.18.0-0.nightly-2024-10-23-112324
    AzureDisk CSI Driver: v1.30.4

 

How reproducible:

    Always

Steps to Reproduce:

    1. Create cluster on Azure Stack Hub with prometheus pvc configurated
    2. Volume provisioning failed due to "NoRegisteredProviderFound" 
    

Actual results:

    Volume provisioning failed 

Expected results:

    Volume provisioning should succeed     

Additional info:

    

Summary

Duplicate issue of https://issues.redhat.com/browse/OU-258

To pass the CI/CD requirements of openshift/console, each PR needs to have an issue in OCP's own Jira board. 

This issue migrates the rendering of the Developer Perspective > Observe > Metrics page from openshift/console to openshift/monitoring-plugin. 

openshift/console PR#4187: Removes the Metrics Page. 

openshift/monitoring-plugin PR#138: Adds the Metrics Page and consolidates the code to use the same components as the Administrative > Observe > Metrics Page. 

Testing

Both openshift/console PR#4187 and openshift/monitoring-plugin PR#138 need to be launched to see the full feature. After launching both PRs you should see a page like the screenshot attached below.  

Excerpt from OU-258 (https://issues.redhat.com/browse/OU-258):

Background

The admin console's alert details page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.

The UX of the two pages differs somewhat, so we will need to decide whether we can change the dev console to use the same UX as the admin page or whether we need to keep some differences. This is an opportunity to bring the improved PromQL editing UX from the admin console to the dev console.

Outcomes
  • The dev console Metrics page is loaded from monitoring-plugin, and the code that is not shared with other components in the console is removed from the console codebase.
  • The dev console version of the page has the project selector dropdown, but the admin console page doesn't, so monitoring-plugin will need to be changed to support that difference.

 

OCPBUGS-36283 introduced the ability to switch on TLS between the BMC and Metal3's httpd server. It is currently off by default to make the change backportable without a high risk of regressions. We need to turn it on for 4.18+ for consistency with CBO-deployed Metal3.

Description of problem:

    The kubeconfigs for the DNS Operator and the Ingress Operator are managed by Hypershift and they should only be managed by the cloud service provider. This can lead to the kubeconfig/certificate being invalid in the cases where the cloud service provider further manages the kubeconfig (for example ca-rotation).

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    Circular dependencies in the OCP Console prevent migration to Webpack 5

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    Always

Steps to Reproduce:

    1. Enable the CHECK_CYCLES env var while building
    2. Observe errors
    3.
    

Actual results:

    There are errors

Expected results:

    No errors

Additional info:

    

 

Description of problem:
The OpenShift Pipelines operator automatically installs a OpenShift console plugin. The console plugin metrics reports this as unknown after the plugin was renamed from "pipeline-console-plugin" to "pipelines-console-plugin".

Version-Release number of selected component (if applicable):
4.14+

How reproducible:
Always

Steps to Reproduce:

  1. Install the OpenShift Pipelines operator with the plugin
  2. Navigate to Observe > Metrics
  3. Check the metrics console_plugins_info

Actual results:
It shows an "unknown" plugin in the metrics.

Expected results:
It should shows a "pipelines" plugin in the metrics.

Additional info:
None

Description of problem:

We are in a live migration scenario.

If a project has a networkpolicy to allow from the host network (more concretely, to allow from the ingress controllers and the ingress controllers are in the host network), traffic doesn't work during the live migration between any ingress controller node (either migrated or not migrated) and an already migrated application node.

I'll expand later in the description and internal comments, but the TL;DR is that the IPs of tun0 on not-yet-migrated source nodes and the IPs of ovn-k8s-mp0 on migrated source nodes are not added to the address sets related to the networkpolicy ACL on the target OVN-Kubernetes node, so that traffic is not allowed.

Version-Release number of selected component (if applicable):

4.16.13

How reproducible:

Always

Steps to Reproduce:

1. Before the migration: have a project with a networkpolicy that allows from the ingress controller and the ingress controller in the host network. Everything must work properly at this point.

2. Start the migration

3. During the migration, check connectivity from the host network of either a migrated node or a non-migrated node. Both will fail (checking from the same node doesn't fail)

Actual results:

Pod on the worker node is not reachable from the host network of the ingress controller node (unless the pod is in the same node than the ingress controller), which causes the ingress controller routes to throw 503 error.

Expected results:

Pod on the worker node to be reachable from the ingress controller node, even when the ingress controller node has not migrated yet and the application node has.

Additional info:

This is not a duplicate of OCPBUGS-42578. This bug refers to the host-to-pod communication path while the other one doesn't.

This is a customer issue. More details to be included in private comments for privacy.

Workaround: Creating a networkpolicy that explicitly allows traffic from tun0 and ovn-k8s-mp0 interfaces. However, note that the workaround can be problematic for clusters with hundreds or thousands of projects. Another possible workaround is to temporarily delete all the networkpolicies of the projects. But again, this may be problematic (and a security risk).

The test "operator conditions kube-apiserver"

is showing as regressed in 4.17 (and 4.18) for metal and vsphere.

Stephen Benjamin noted there is one line of jq used to create the tests and has offered to try to stabilize that code somewhat. Ultimately TRT-1764 is intended to build out a smarter framework. This bug is to see what can be done in the short term.

Description of problem:

    Shipwright operator installation through CLI is failing - 

Failure:

# Shipwright build details page.Shipwright build details page Shipwright tab should be default on first open if the operator is installed (ODC-7623): SWB-01-TC01
Error: Failed to install Shipwright Operator - Pod timeout

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

In the Secret details view, if one of the data properties from the Secret contains a tab character, it is considered "unprintable" and the content cannot be viewed in the console. This is not correct. Tab characters can be printed and should not prevent content from being viewed. 

We have a dependency "istextorbinary" that will determine if a buffer contains binary. We should use it here.

Version-Release number of selected component (if applicable)  4.18.0

How reproducible:

    Always

Steps to Reproduce:

    1. Download [this file](https://gist.github.com/TheRealJon/eb1e2eaf80c923938072f8a997fed3cd/raw/04b7307d31a825ae686affd9da0c0914d490abd3/pull-secret-with-tabs.json)
    2. Run this command:
oc create secret generic test -n default --from-file=.dockerconfigjson=<path-to-file-from-step-1> --type=kubernetes.io/dockerconfigjson
    3. In the console, navigate to Workloads -> Secrets and make sure that the "default" project is selected from the project dropdown.
    4. Select the Secret named "test"
    5. Scroll to the bottom to view the data content of the Secret

Actual results:

    The "Save this file" option is shown, and user is unable to reveal the contents of the Secret

Expected results:

    The "Save this file" option should not be shown, the obfuscated content should be rendered, and the reveal/hide button should show and hide the content from the pull secret.

 

Additional info:

    There is logic in this view that prevents us from trying to render binary data by detecting "unprintable characters". The regex for this includes the Tab character, which is incorrect, since that character is printable.

Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/126

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

[sig-arch] events should not repeat pathologically for ns/openshift-machine-api

The machine-api pod seems to not be responding to the `/healthz` requests from the kubelet, causing an increase in probe error events. The pod does seem to be up, and a preliminary look at Loki shows that the `/healthz` endpoint does come up, but the controller loses its leader lease in between, before starting the health probe again.

Prow Link
Loki General Query

Loki Start/Stop/Query

(read from bottom up)

I1016 19:51:31.418815       1 server.go:191] "Starting webhook server" logger="controller-runtime.webhook"
I1016 19:51:31.418764       1 server.go:247] "Serving metrics server" logger="controller-runtime.metrics" bindAddress=":8082" secure=false
I1016 19:51:31.418703       1 server.go:83] "starting server" name="health probe" addr="[::]:9441"
I1016 19:51:31.418650       1 server.go:208] "Starting metrics server" logger="controller-runtime.metrics"		
2024/10/16 19:51:31 Starting the Cmd.

...

2024/10/16 19:50:44 leader election lost
I1016 19:50:44.406280       1 leaderelection.go:297] failed to renew lease openshift-machine-api/cluster-api-provider-machineset-leader: timed out waiting for the condition
error
E1016 19:50:44.406230       1 leaderelection.go:436] error retrieving resource lock openshift-machine-api/cluster-api-provider-machineset-leader: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-api-provider-machineset-leader": context deadline exceeded
error
E1016 19:50:37.430054       1 leaderelection.go:429] Failed to update lock optimitically: rpc error: code = DeadlineExceeded desc = context deadline exceeded, falling back to slow path
error
E1016 19:50:04.423920       1 leaderelection.go:436] error retrieving resource lock openshift-machine-api/cluster-api-provider-machineset-leader: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io cluster-api-provider-machineset-leader)
error
E1016 19:49:04.422237       1 leaderelection.go:429] Failed to update lock optimitically: rpc error: code = DeadlineExceeded desc = context deadline exceeded, falling back to slow path
....

I1016 19:46:21.358989       1 server.go:83] "starting server" name="health probe" addr="[::]:9441"
I1016 19:46:21.358891       1 server.go:247] "Serving metrics server" logger="controller-runtime.metrics" bindAddress=":8082" secure=false
I1016 19:46:21.358682       1 server.go:208] "Starting metrics server" logger="controller-runtime.metrics"		
2024/10/16 19:46:21 Starting the Cmd.

Event Filter

Description of problem:

    Circular dependencies in the OCP Console prevent migration to Webpack 5

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    Always

Steps to Reproduce:

    1. Enable the CHECK_CYCLES env var while building
    2. Observe errors
    3.
    

Actual results:

    There are errors

Expected results:

    No errors

Additional info:

    

 

Description of problem:

The image ecosystem testsuite sometimes fails due to timeouts in samples smoke tests in origin - the tests starting with "[sig-devex][Feature:ImageEcosystem][Slow] openshift sample application repositories".

These can be caused by either the build taking too long (for example, the rails application tends to take quite a while to build) or the application starting quite slowly.

There is no bulletproof solution here other than to try to increase the timeouts to a value that both provides enough time and doesn't stall the testsuite for too long.
    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1. Run the image-ecosystem testsuite
    2.
    3.
    

Actual results:

sometime the testsuite fails because of timeouts
    

Expected results:

no timeouts
    

Additional info:


    

Description of problem:

ConsolePlugin example YAML lacks required data    

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-08-30-231249

How reproducible:

    Always

Steps to Reproduce:

1. Go to the ConsolePlugins list page 
 /k8s/cluster/customresourcedefinitions/consoleplugins.console.openshift.io/instances  
or 
/k8s/cluster/console.openshift.io~v1~ConsolePlugin
2. Click on 'Create ConsolePlugin' button
    

Actual results:

The example YAML is quite simple and lacking required data; the user will get various errors if trying to create from the example YAML

apiVersion: console.openshift.io/v1
kind: ConsolePlugin
metadata:
  name: example
spec: {}
    

Expected results:

we should add a complete YAML as an example or create a default Sample

Additional info:

    

Description of problem:

Add two new props to VirtualizedTable in order to make the header checkbox work: allRowsSelected and canSelectAll.

allRowsSelected will check the header checkbox, and canSelectAll will control whether the header checkbox is shown or hidden. 

Description of problem:

When the vSphere CSI driver is removed (using managementState: Removed), it leaves all existing conditions in the ClusterCSIDriver. IMO it should delete all of them and keep something like "Disabled: true", as we do for the Manila CSI driver operator.

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-08-09-031511

How reproducible: always

Steps to Reproduce:

  1. Edit ClusterCSIDriver and set `managementState: Removed`.
  2. See the CSI driver deployment + DaemonSet are removed.
  3. Check ClusterCSIDriver conditions

Actual results: All Deployment + DaemonSet conditions are present

Expected results: The conditions are pruned.

Description of problem:
OpenShift automatically installs the OpenShift networking plugin, but the console plugin metrics reports this as "unknown".

Version-Release number of selected component (if applicable):
4.17+ ???

How reproducible:
Always

Steps to Reproduce:

  1. Navigate to Observe > Metrics
  2. Check the metric console_plugins_info

Actual results:
It shows an "unknown" plugin in the metrics.

Expected results:
It should shows a "networking" plugin in the metrics.

Additional info:
None

Description of problem:

While working on the readiness probes we have discovered that the single member health check always allocates a new client. 

Since this is an expensive operation, we can make use of the pooled client (that already has a connection open) and change the endpoints for a brief period of time to the single member we want to check.

This should reduce CEO's and etcd CPU consumption.
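A minimal sketch of the idea using the etcd v3 client, assuming a pooled *clientv3.Client is already connected (illustrative only, not CEO's actual implementation):

package health

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// checkMemberHealth points the already-connected pooled client at a single
// member, runs a probe, and then restores the original endpoint list.
func checkMemberHealth(ctx context.Context, pooled *clientv3.Client, memberURL string) error {
	original := pooled.Endpoints()
	pooled.SetEndpoints(memberURL)
	defer pooled.SetEndpoints(original...)

	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	// A read of the "health" key (the same probe etcdctl uses) against the
	// single endpoint; only the error matters.
	_, err := pooled.Get(ctx, "health")
	return err
}

Since the endpoint list is shared state on the pooled client, callers would need to serialize these checks rather than run them in parallel.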

Version-Release number of selected component (if applicable):

any supported version    

How reproducible:

always, but technical detail

Steps to Reproduce:

 na    

Actual results:

CEO creates a new etcd client when it is checking a single member health

Expected results:

CEO should use the existing pooled client to check for single member health    

Additional info:

    

Description of problem:

    HyperShift currently runs 3 replicas of active/passive HA deployments such as kube-controller-manager, kube-scheduler, etc. In order to reduce the overhead of running a HyperShift control plane, we should be able to run these deployments with 2 replicas.

In a 3 zone environment with 2 replicas, we can still use a rolling update strategy, and set the maxSurge value to 1, as the new pod would schedule into the unoccupied zone.
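As a rough sketch of the relevant Deployment fields (illustrative values, using the upstream k8s.io/api types; only the fields discussed here are shown):

package config

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// twoReplicaHASpec sketches an active/passive control-plane deployment spec:
// two replicas spread across zones, with a surge capacity of one so the extra
// pod can land in the third, unoccupied zone during a rollout.
func twoReplicaHASpec() appsv1.DeploymentSpec {
	replicas := int32(2)
	maxSurge := intstr.FromInt(1)
	maxUnavailable := intstr.FromInt(0)
	return appsv1.DeploymentSpec{
		Replicas: &replicas,
		Strategy: appsv1.DeploymentStrategy{
			Type: appsv1.RollingUpdateDeploymentStrategyType,
			RollingUpdate: &appsv1.RollingUpdateDeployment{
				MaxSurge:       &maxSurge,
				MaxUnavailable: &maxUnavailable,
			},
		},
	}
}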

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/vsphere-problem-detector/pull/172

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/images/pull/193

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

   openshift-install fails with "failed to lease wait: Invalid configuration for device '0'". The generated YAML is below:
additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: XXX
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    vsphere:
      coresPerSocket: 2
      cpus: 8
      memoryMB: 40960
      osDisk:
        diskSizeGB: 150
      zones:
      - generated-failure-domain
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    vsphere:
      coresPerSocket: 2
      cpus: 4
      memoryMB: 32768
      osDisk:
        diskSizeGB: 150
      zones:
      - generated-failure-domain
  replicas: 3
metadata:
  creationTimestamp: null
  name: dc3
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
platform:
  vsphere:
    apiVIP: 172.21.0.20
    apiVIPs:
    - 172.21.0.20
    cluster: SA-LAB
    datacenter: OVH-SA
    defaultDatastore: DatastoreOCP
    failureDomains:
    - name: generated-failure-domain
      region: generated-region
      server: XXX
      topology:
        computeCluster: /OVH-SA/host/SA-LAB
        datacenter: OVH-SA
        datastore: /OVH-SA/datastore/DatastoreOCP
        networks:
        - ocpdemo
        resourcePool: /OVH-SA/host/SA-LAB/Resources
      zone: generated-zone
    ingressVIP: 172.21.0.21
    ingressVIPs:
    - 172.21.0.21
    network: ocpdemo

~~~ Truncated~~~

 

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

    1. openshift-install create cluster
    2. Choose vSphere
    3.
    

Actual results:

    Error

Expected results:

    Cluster creation

Additional info:

    

Description of problem:

    A regular user can update route spec.tls.certificate/key without extra permissions, but if the user tries to edit/patch spec.tls.externalCertificate, it reports the error:
spec.tls.externalCertificate: Forbidden: user does not have update permission on custom-host 

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-05-21-221942    

How reproducible:

    100%

Steps to Reproduce:

    1. Log in as a regular user and create a namespace, pod, svc, and edge route
$ oc create route edge myedge --service service-unsecure --cert tls.crt --key tls.key
$ oc get route myedge -oyaml

    2. edit the route and remove one certificate from spec.tls.certificate 
$ oc edit route myedge
$ oc get route myedge

    3. edit the route and restore the original spec.tls.certificate

    4. edit the route with spec.tls.externalCertificate
     

Actual results:

    1. edge route is admitted and works well
$ oc get route myedge -oyaml
<......>
spec:
  host: myedge-test3.apps.hongli-techprev.qe.azure.devcluster.openshift.com
  port:
    targetPort: http
  tls:
    certificate: |
      -----BEGIN CERTIFICATE-----
      XXXXXXXXXXXXXXXXXXXXXXXXXXX
      -----END CERTIFICATE-----
      -----BEGIN CERTIFICATE-----
      XXXXXXXXXXXXXXXXXXXXXXXX 
      -----END CERTIFICATE-----

   key: |
      -----BEGIN RSA PRIVATE KEY-----
<......>

    2. The route fails validation since "private key does not match public key"
$ oc get route myedge
NAME     HOST/PORT                  PATH   SERVICES           PORT   TERMINATION   WILDCARD
myedge   ExtendedValidationFailed          service-unsecure   http   edge          None

    3. route is admitted again after the spec.tls.certificate is restored

    4. reports error when updating spec.tls.externalCertificate 
spec.tls.externalCertificate: Forbidden: user does not have update permission on custom-host 

Expected results:

    The user should have the same permission to update both spec.tls.certificate and spec.tls.externalCertificate

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/161

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    oc-mirror produces image signature config maps in JSON format, inconsistent with other manifests, which are normally in YAML. That breaks some automation, especially the Multicloud Operators Subscription controller, which expects manifests in YAML only.

Version-Release number of selected component (if applicable):

    

How reproducible:

Always    

Steps to Reproduce:

    1. Perform release payload mirroring as documented
    2. Check 'release-signatures' directory 

   

Actual results:

    There is a mix of YAML and JSON files with kubernetes manifests.

Expected results:

    Manifests are stored in one format, either YAML or JSON

Additional info:

 

Description of problem:


An unexpected validation failure occurs when creating the agent ISO image if the RendezvousIP is a substring of the next-hop-address set for a worker node.

For example this configuration snippet in agent-config.yaml:

apiVersion: v1alpha1
kind: AgentConfig
metadata:
  name: agent-config
rendezvousIP: 7.162.6.1
hosts:
...
  - hostname: worker-0
    role: worker
    networkConfig:
      interfaces:
        - name: eth0
          type: Ethernet
          state: up
          ipv4:
            enabled: true
            address:
              - ip: 7.162.6.4
                prefix-length: 25
            dhcp: false
      routes:
        config:
          - destination: 0.0.0.0/0
            next-hop-address: 7.162.6.126
            next-hop-interface: eth0
            table-id: 254

Will result in the validation failure when creating the image:

FATAL failed to fetch Agent Installer ISO: failed to fetch dependency of "Agent Installer ISO": failed to fetch dependency of "Agent Installer Artifacts": failed to fetch dependency of "Agent Installer Ignition": failed to fetch dependency of "Agent Manifests": failed to fetch dependency of "NMState Config": failed to generate asset "Agent Hosts": invalid Hosts configuration: [Hosts[3].Host: Forbidden: Host worker-0 has role 'worker' and has the rendezvousIP assigned to it. The rendezvousIP must be assigned to a control plane host.

The problem is this check here: https://github.com/openshift/installer/pull/6716/files#diff-fa305fe33630f77b65bd21cc9473b620f67cfd9ce35f7ddf24d03b26ec2ccfffR293
It's checking for the IP in the raw nmConfig. The problem is that the routes stanza is also included in the nmConfig, and the route is
next-hop-address: 7.162.6.126
So when the rendezvousIP is 7.162.6.1, that strings.Contains() check returns true and the validation fails.
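A minimal sketch of a safer comparison (a hypothetical helper, not the installer's actual code): parse the interface addresses out of the config and compare them to the rendezvousIP exactly, rather than substring-matching the raw nmConfig with the routes included.

package main

import (
	"fmt"
	"net"
)

// hostHasRendezvousIP reports whether any of the host's configured interface
// addresses equals the rendezvous IP exactly.
func hostHasRendezvousIP(hostAddresses []string, rendezvousIP string) bool {
	target := net.ParseIP(rendezvousIP)
	if target == nil {
		return false
	}
	for _, addr := range hostAddresses {
		if ip := net.ParseIP(addr); ip != nil && ip.Equal(target) {
			return true
		}
	}
	return false
}

func main() {
	// The worker's interface address from the example above; the route's
	// next-hop-address is no longer consulted, so 7.162.6.1 does not
	// spuriously match 7.162.6.126.
	fmt.Println(hostHasRendezvousIP([]string{"7.162.6.4"}, "7.162.6.1")) // false
}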

Sometimes users want to make modifications while installing IBI, like creating new partitions on the disk. In order to save them and not have them overridden by the coreos-installer command, we need a way to provide parameters to the coreos-installer command.

Description of problem:

 

The e2e test, TestMetrics, is repeatedly failing with the following failure message:

=== RUN   TestMetrics
    utils.go:135: Setting up pool metrics
    utils.go:636: Applied label "node-role.kubernetes.io/metrics" to node ci-op-k925dznq-1354f-vhpxw-worker-a-sjv6q
    utils.go:722: Created MachineConfigPool "metrics"
    utils.go:140: Target Node: ci-op-k925dznq-1354f-vhpxw-worker-a-sjv6q
    utils.go:124: No MachineConfig provided, will wait for pool "metrics" to include MachineConfig "00-worker"
    utils.go:252: Pool metrics has rendered configs [00-worker] with rendered-metrics-688bea8bcb23f911e27b5d530a7385bb (waited 6.039157947s)
    utils.go:286: Pool metrics has completed rendered-metrics-688bea8bcb23f911e27b5d530a7385bb (waited 1m14.043792995s)
    utils.go:145: 
            Error Trace:    /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:145
                                       /go/src/github.com/openshift/machine-config-operator/test/e2e/mco_test.go:149
            Error:          Expected nil, but got: &fmt.wrapError{msg:"node config change did not occur (waited 37.479869ms): nodes \"ci-op-k925dznq-1354f-vhpxw-worker-a-sjv6q\" not found", err:(*errors.StatusError)(0xc00071a8c0)}
            Test:           TestMetrics

Version-Release number of selected component (if applicable):

    

How reproducible:

Sporadically, but could potentially block e2e.

Steps to Reproduce:

Run the e2e-gcp-op test

Actual results:

=== RUN   TestMetrics
    utils.go:135: Setting up pool metrics
    utils.go:636: Applied label "node-role.kubernetes.io/metrics" to node ci-op-k925dznq-1354f-vhpxw-worker-a-sjv6q
    utils.go:722: Created MachineConfigPool "metrics"
    utils.go:140: Target Node: ci-op-k925dznq-1354f-vhpxw-worker-a-sjv6q
    utils.go:124: No MachineConfig provided, will wait for pool "metrics" to include MachineConfig "00-worker"
    utils.go:252: Pool metrics has rendered configs [00-worker] with rendered-metrics-688bea8bcb23f911e27b5d530a7385bb (waited 6.039157947s)
    utils.go:286: Pool metrics has completed rendered-metrics-688bea8bcb23f911e27b5d530a7385bb (waited 1m14.043792995s)
    utils.go:145: 
            Error Trace:    /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:145
                                       /go/src/github.com/openshift/machine-config-operator/test/e2e/mco_test.go:149
            Error:          Expected nil, but got: &fmt.wrapError{msg:"node config change did not occur (waited 37.479869ms): nodes \"ci-op-k925dznq-1354f-vhpxw-worker-a-sjv6q\" not found", err:(*errors.StatusError)(0xc00071a8c0)}
            Test:           TestMetrics
    

Expected results:

The test should pass

Additional info:

    

Please review the following PR: https://github.com/openshift/installer/pull/8960

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    EncryptionAtHost and DiskEncryptionSets are two features which should not be tightly coupled.  They should be able to be enabled / disabled independently.  Currently EncryptionAtHost is only enabled if DiskEncryptionSetID is a valid disk encryption set resource ID.


https://github.com/openshift/hypershift/blob/0cc82f7b102dcdf6e5d057255be1bdb1593d1203/hypershift-operator/controllers/nodepool/azure.go#L81-L88
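A minimal sketch of the expected decoupling (illustrative only; the type and field names below are simplified stand-ins, not the actual HyperShift or CAPZ types):

package main

import "fmt"

type azureNodePoolSpec struct {
	EncryptionAtHost    string // "Enabled" or "Disabled"
	DiskEncryptionSetID string // optional disk encryption set resource ID
}

type vmSecuritySettings struct {
	EncryptionAtHost    bool
	DiskEncryptionSetID string
}

// buildSecuritySettings sets each option independently instead of only
// honoring EncryptionAtHost when a DiskEncryptionSetID is present.
func buildSecuritySettings(spec azureNodePoolSpec) vmSecuritySettings {
	out := vmSecuritySettings{}
	if spec.EncryptionAtHost == "Enabled" {
		out.EncryptionAtHost = true
	}
	if spec.DiskEncryptionSetID != "" {
		out.DiskEncryptionSetID = spec.DiskEncryptionSetID
	}
	return out
}

func main() {
	// EncryptionAtHost enabled without any disk encryption set.
	fmt.Printf("%+v\n", buildSecuritySettings(azureNodePoolSpec{EncryptionAtHost: "Enabled"}))
}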

Version-Release number of selected component (if applicable):

    

How reproducible:

    Every time

Steps to Reproduce:

    1.See comments    

Actual results:

   EncryptionAtHost is only set if DiskEncryptionSetID is set.      

Expected results:

    EncryptionAtHost and DiskEncryptionSetID should be independently settable.  

Additional info:

    https://redhat-external.slack.com/archives/C075PHEFZKQ/p1724772123804009

The customer's cloud-credential-operator generates millions of the messages below per day in their GCP cluster.

They want to reduce or stop these logs because they consume a significant amount of disk space. Also, their cloud-credential-operator runs in manual mode.

time="2024-06-21T08:37:42Z" level=warning msg="read-only creds not found, using root creds client" actuator=gcp cr=openshift-cloud-credential-operator/openshift-gcp-ccm secret=openshift-cloud-credential-operator/cloud-credential-operator-gcp-ro-creds
time="2024-06-21T08:37:42Z" level=error msg="error creating GCP client" error="Secret \"gcp-credentials\" not found"
time="2024-06-21T08:37:42Z" level=error msg="error determining whether a credentials update is needed" actuator=gcp cr=openshift-cloud-credential-operator/openshift-gcp-ccm error="unable to check whether credentialsRequest needs update"
time="2024-06-21T08:37:42Z" level=error msg="error syncing credentials: error determining whether a credentials update is needed" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-ccm secret=openshift-cloud-controller-manager/gcp-ccm-cloud-credentials
time="2024-06-21T08:37:42Z" level=error msg="errored with condition: CredentialsProvisionFailure" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-ccm secret=openshift-cloud-controller-manager/gcp-ccm-cloud-credentials
time="2024-06-21T08:37:42Z" level=info msg="reconciling clusteroperator status"
time="2024-06-21T08:37:42Z" level=info msg="operator detects timed access token enabled cluster (STS, Workload Identity, etc.)" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-pd-csi-driver-operator
time="2024-06-21T08:37:42Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-gcp-pd-csi-driver-operator
time="2024-06-21T08:37:42Z" level=warning msg="read-only creds not found, using root creds client" actuator=gcp cr=openshift-cloud-credential-operator/openshift-gcp-pd-csi-driver-operator secret=openshift-cloud-credential-operator/cloud-credential-operator-gcp-ro-creds

Description of problem:

    When the user selects a shared VPC install, the control plane service account that was created is left over. To verify, after destroying the cluster, check the principals in the host project for a remaining name XXX-m@some-service-account.com

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    No principal remaining

Additional info:

    

There were remaining issues from the original issue. A new bug has been opened to address this. This is a clone of issue OCPBUGS-32947. The following is the description of the original issue:

Description of problem:

    [vSphere] network.devices, template and workspace will be cleared when deleting the controlplanemachineset, updating these fields will not trigger an update

Version-Release number of selected component (if applicable):

    4.16.0-0.nightly-2024-04-23-032717

How reproducible:

    Always

Steps to Reproduce:

    1.Install a vSphere 4.16 cluster, we use automated template: ipi-on-vsphere/versioned-installer
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-04-23-032717   True        False         24m     Cluster version is 4.16.0-0.nightly-2024-04-23-032717     

    2.Check the controlplanemachineset, you can see network.devices, template and workspace have value.
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset     
NAME      DESIRED   CURRENT   READY   UPDATED   UNAVAILABLE   STATE    AGE
cluster   3         3         3       3                       Active   51m
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset cluster -oyaml
apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
metadata:
  creationTimestamp: "2024-04-25T02:52:11Z"
  finalizers:
  - controlplanemachineset.machine.openshift.io
  generation: 1
  labels:
    machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl
  name: cluster
  namespace: openshift-machine-api
  resourceVersion: "18273"
  uid: f340d9b4-cf57-4122-b4d4-0f45f20e4d79
spec:
  replicas: 3
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl
      machine.openshift.io/cluster-api-machine-role: master
      machine.openshift.io/cluster-api-machine-type: master
  state: Active
  strategy:
    type: RollingUpdate
  template:
    machineType: machines_v1beta1_machine_openshift_io
    machines_v1beta1_machine_openshift_io:
      failureDomains:
        platform: VSphere
        vsphere:
        - name: generated-failure-domain
      metadata:
        labels:
          machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl
          machine.openshift.io/cluster-api-machine-role: master
          machine.openshift.io/cluster-api-machine-type: master
      spec:
        lifecycleHooks: {}
        metadata: {}
        providerSpec:
          value:
            apiVersion: machine.openshift.io/v1beta1
            credentialsSecret:
              name: vsphere-cloud-credentials
            diskGiB: 120
            kind: VSphereMachineProviderSpec
            memoryMiB: 16384
            metadata:
              creationTimestamp: null
            network:
              devices:
              - networkName: devqe-segment-221
            numCPUs: 4
            numCoresPerSocket: 4
            snapshot: ""
            template: huliu-vs425c-f5tfl-rhcos-generated-region-generated-zone
            userDataSecret:
              name: master-user-data
            workspace:
              datacenter: DEVQEdatacenter
              datastore: /DEVQEdatacenter/datastore/vsanDatastore
              folder: /DEVQEdatacenter/vm/huliu-vs425c-f5tfl
              resourcePool: /DEVQEdatacenter/host/DEVQEcluster/Resources
              server: vcenter.devqe.ibmc.devcluster.openshift.com
status:
  conditions:
  - lastTransitionTime: "2024-04-25T02:59:37Z"
    message: ""
    observedGeneration: 1
    reason: AsExpected
    status: "False"
    type: Error
  - lastTransitionTime: "2024-04-25T03:03:45Z"
    message: ""
    observedGeneration: 1
    reason: AllReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2024-04-25T03:03:45Z"
    message: ""
    observedGeneration: 1
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2024-04-25T03:01:04Z"
    message: ""
    observedGeneration: 1
    reason: AllReplicasUpdated
    status: "False"
    type: Progressing
  observedGeneration: 1
  readyReplicas: 3
  replicas: 3
  updatedReplicas: 3     

    3. Delete the controlplanemachineset; it will recreate a new one, but those three fields that had values before are now cleared.

liuhuali@Lius-MacBook-Pro huali-test % oc delete controlplanemachineset cluster
controlplanemachineset.machine.openshift.io "cluster" deleted
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset
NAME      DESIRED   CURRENT   READY   UPDATED   UNAVAILABLE   STATE      AGE
cluster   3         3         3       3                       Inactive   6s
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset cluster -oyaml
apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
metadata:
  creationTimestamp: "2024-04-25T03:45:51Z"
  finalizers:
  - controlplanemachineset.machine.openshift.io
  generation: 1
  name: cluster
  namespace: openshift-machine-api
  resourceVersion: "46172"
  uid: 45d966c9-ec95-42e1-b8b0-c4945ea58566
spec:
  replicas: 3
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl
      machine.openshift.io/cluster-api-machine-role: master
      machine.openshift.io/cluster-api-machine-type: master
  state: Inactive
  strategy:
    type: RollingUpdate
  template:
    machineType: machines_v1beta1_machine_openshift_io
    machines_v1beta1_machine_openshift_io:
      failureDomains:
        platform: VSphere
        vsphere:
        - name: generated-failure-domain
      metadata:
        labels:
          machine.openshift.io/cluster-api-cluster: huliu-vs425c-f5tfl
          machine.openshift.io/cluster-api-machine-role: master
          machine.openshift.io/cluster-api-machine-type: master
      spec:
        lifecycleHooks: {}
        metadata: {}
        providerSpec:
          value:
            apiVersion: machine.openshift.io/v1beta1
            credentialsSecret:
              name: vsphere-cloud-credentials
            diskGiB: 120
            kind: VSphereMachineProviderSpec
            memoryMiB: 16384
            metadata:
              creationTimestamp: null
            network:
              devices: null
            numCPUs: 4
            numCoresPerSocket: 4
            snapshot: ""
            template: ""
            userDataSecret:
              name: master-user-data
            workspace: {}
status:
  conditions:
  - lastTransitionTime: "2024-04-25T03:45:51Z"
    message: ""
    observedGeneration: 1
    reason: AsExpected
    status: "False"
    type: Error
  - lastTransitionTime: "2024-04-25T03:45:51Z"
    message: ""
    observedGeneration: 1
    reason: AllReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2024-04-25T03:45:51Z"
    message: ""
    observedGeneration: 1
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2024-04-25T03:45:51Z"
    message: ""
    observedGeneration: 1
    reason: AllReplicasUpdated
    status: "False"
    type: Progressing
  observedGeneration: 1
  readyReplicas: 3
  replicas: 3
  updatedReplicas: 3     

    4. I activated the controlplanemachineset and it did not trigger an update. I then added these field values back and it did not trigger an update. I then edited these fields to add a second network device and it still did not trigger an update.


            network:
              devices:
              - networkName: devqe-segment-221
              - networkName: devqe-segment-222


By the way, I can create worker machines with a different network device or with two network devices.
huliu-vs425c-f5tfl-worker-0a-ldbkh    Running                          81m
huliu-vs425c-f5tfl-worker-0aa-r8q4d   Running                          70m

Actual results:

    network.devices, template and workspace will be cleared when deleting the controlplanemachineset, updating these fields will not trigger an update

Expected results:

    The field values should not be changed when deleting the controlplanemachineset.
    Updating these fields should trigger an update; or, if these fields are not meant to be modified, then modifying them in the controlplanemachineset should have no effect, because the current inconsistency is confusing.

Additional info:

    Must gather:  https://drive.google.com/file/d/1mHR31m8gaNohVMSFqYovkkY__t8-E30s/view?usp=sharing 

Please review the following PR: https://github.com/openshift/cloud-provider-gcp/pull/67

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/images/pull/194

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

In a vSphere cluster, change clustercsidrivers.managementState from "Managed" to "Removed"; the VSphereProblemDetector check becomes less frequent (once every 24 hours), see log: Scheduled the next check in 24h0m0. This is as expected.
Then change clustercsidrivers.managementState back from "Removed" to "Managed"; the VSphereProblemDetector check frequency is still 24 hours.
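A minimal sketch of the expected scheduling behavior (hypothetical helper, not the operator's actual code): derive the next-check interval from the current managementState on every reconcile, so flipping back to Managed restores the shorter interval immediately:

package main

import (
	"fmt"
	"time"
)

const (
	managedInterval = time.Hour
	removedInterval = 24 * time.Hour
)

// nextCheckInterval is recomputed on every reconcile, so flipping
// managementState from Removed back to Managed immediately restores the
// shorter check interval.
func nextCheckInterval(managementState string) time.Duration {
	if managementState == "Removed" {
		return removedInterval
	}
	return managedInterval
}

func main() {
	fmt.Println(nextCheckInterval("Managed")) // 1h0m0s
	fmt.Println(nextCheckInterval("Removed")) // 24h0m0s
}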

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-09-01-175607    

How reproducible:

Always

Steps to Reproduce:

See Description     

Actual results:

The VSphereProblemDetector check frequency is still once every 24 hours.

Expected results:

The VSphereProblemDetector check frequency should return to once per hour.

Additional info:

    

Component Readiness has found a potential regression in the following test:

[sig-mco][OCPFeatureGate:ManagedBootImages][Serial] Should degrade on a MachineSet with an OwnerReference [apigroup:machineconfiguration.openshift.io] [Suite:openshift/conformance/serial]

A new feature went live that ensures new tests in a release have at least a 95% pass rate. This test showed up immediately with a couple of bad runs in the last 20 attempts. The failures look similar, which indicates the test probably has a problem that could be fixed.

We suspect a timeout issue: the test takes about 25s on average with a 30s timeout.

Test has a 91.67% pass rate, but 95.00% is required.

Sample (being evaluated) Release: 4.18
Start Time: 2024-10-10T00:00:00Z
End Time: 2024-10-17T23:59:59Z
Success Rate: 91.67%
Successes: 22
Failures: 2
Flakes: 0

Insufficient pass rate

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=gcp&Platform=gcp&Scheduler=default&SecurityMode=default&Suite=serial&Suite=serial&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2024-10-01%2023%3A59%3A59&baseRelease=4.17&baseStartTime=2024-09-01%2000%3A00%3A00&capability=OCPFeatureGate%3AManagedBootImages&columnGroupBy=Architecture%2CNetwork%2CPlatform&component=Machine%20Config%20Operator&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20gcp%20serial%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=CGroupMode%3Av2&includeVariant=ContainerRuntime%3Arunc&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&passRateAllTests=0&passRateNewTests=95&pity=5&sampleEndTime=2024-10-17%2023%3A59%3A59&sampleRelease=4.18&sampleStartTime=2024-10-10%2000%3A00%3A00&testId=openshift-tests%3A94bbe8be59569d92f1c5afdef12b26dd&testName=%5Bsig-mco%5D%5BOCPFeatureGate%3AManagedBootImages%5D%5BSerial%5D%20Should%20degrade%20on%20a%20MachineSet%20with%20an%20OwnerReference%20%5Bapigroup%3Amachineconfiguration.openshift.io%5D%20%5BSuite%3Aopenshift%2Fconformance%2Fserial%5D

Description of problem:

Starting from version 4.16, the installer no longer supports creating a cluster in AWS with the OPENSHIFT_INSTALL_AWS_PUBLIC_ONLY=true flag enabled.

Version-Release number of selected component (if applicable):

    

How reproducible:

The installation procedure fails systematically when using a predefined VPC.

Steps to Reproduce:

    1. Follow the procedure at https://docs.openshift.com/container-platform/4.16/installing/installing_aws/ipi/installing-aws-vpc.html#installation-aws-config-yaml_installing-aws-vpc to prepare an install-config.yaml in order to install a cluster with a custom VPC
    2. Run `openshift-install create cluster ...'
    3. The procedure fails: `failed to create load balancer`
    

Actual results:

The installation procedure fails.

Expected results:

An OCP cluster to be provisioned in AWS, with public subnets only.    

Additional info:

    

Description of problem:

The 'Clear all filters' button is counted as part of the resource type count.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-08-19-002129    

How reproducible:

Always    

Steps to Reproduce:

    1. navigate to Home -> Events page, choose 3 resource types, check what's shown on page
    2. navigate to Home -> Search page, choose 3 resource types, check what's shown on page. Choose 4 resource types and check what's shown    

Actual results:

1. It shows `1 more`; only the 'Clear all filters' button is shown when we click the `1 more` button
2. The `1 more` button is only displayed when 4 resource types are selected; this is working as expected

Expected results:

1. The 'Clear all filters' button should not be counted in the resource type number; the 'N more' number should reflect the correct number of resource types

Additional info:

    

Description of problem:

cluster-capi-operator's manifests-gen tool generates CAPI provider transport ConfigMaps with missing metadata details

Version-Release number of selected component (if applicable):

4.17, 4.18

How reproducible:

Not impacting payload, only a tooling bug

Description of problem:

In CI, all the software for the OpenStack and Ansible related pieces is taken from pip and ansible-galaxy instead of the OS repositories.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Cluster's global address "<infra id>-apiserver" not deleted during "destroy cluster"

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-multi-2024-08-15-212448    

How reproducible:

Always

Steps to Reproduce:

1. "create install-config", then optionally insert interested settings (see [1])
2. "create cluster", and make sure the cluster turns healthy finally (see [2])
3. check the cluster's addresses on GCP (see [3])
4. "destroy cluster", and make sure everything of the cluster getting deleted (see [4])

Actual results:

The global address "<infra id>-apiserver" is not deleted during "destroy cluster".

Expected results:

Everything belonging to the cluster should be deleted during "destroy cluster".

Additional info:

FYI we had a 4.16 bug once, see https://issues.redhat.com/browse/OCPBUGS-32306    

Description of problem:


It is difficult to tell in which component this bug should be reported. The description is as follows.

Today we can install Red Hat operators either in one specific namespace or in all namespaces, which installs the operator in the "openshift-operators" namespace.

If such an operator creates a ServiceMonitor that should be scraped by platform Prometheus, that ServiceMonitor has token authentication and security configured in its definition.

But if the operator is installed in the "openshift-operators" namespace, it is user workload monitoring that will try to scrape it, since that namespace does not carry the label required for platform monitoring to scrape it, and we do not want it to have that label because community operators can also be installed in this namespace.

The result is that user workload monitoring scrapes this namespace and the ServiceMonitors are skipped, since they are configured with security settings aimed at platform monitoring, which UWM cannot handle.

A possible workaround is:

oc label namespace openshift-operators openshift.io/user-monitoring=false

but this loses functionality, since some Red Hat operators will not be monitored if installed in openshift-operators.



    

Version-Release number of selected component (if applicable):

 4.16

    

Please review the following PR: https://github.com/openshift/baremetal-operator/pull/376

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The installation of compact and HA clusters is failing in the vSphere environment. During the cluster setup, two master nodes were observed to be in a "Not Ready" state, and the rendezvous host failed to join the cluster. 

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-09-25-131159    

How reproducible:

100%    

Actual results:

level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
level=info msg=Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
level=info msg=Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
level=info msg=Use the following commands to gather logs from the cluster
level=info msg=openshift-install gather bootstrap --help
level=error msg=Bootstrap failed to complete: : bootstrap process timed out: context deadline exceeded
ERROR: Bootstrap failed. Aborting execution.

Expected results:

Installation should be successful.    

Additional info:

Agent Gather: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/pr-logs/pull/openshift_release/54459/rehearse-54459-periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-vsphere-agent-compact-fips-f14/1839389511629410304/artifacts/vsphere-agent-compact-fips-f14/cucushift-agent-gather/artifacts/agent-gather.tar.xz

Description of problem:

    sometimes cluster-capi-operator pod stuck in CrashLoopBackOff on osp

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-08-01-213905    

How reproducible:

    Sometimes

Steps to Reproduce:

    1.Create an osp cluster with TechPreviewNoUpgrade
    2.Check cluster-capi-operator pod
    3.
    

Actual results:

cluster-capi-operator pod in CrashLoopBackOff status
$ oc get po                               
cluster-capi-operator-74dfcfcb9d-7gk98          0/1     CrashLoopBackOff   6 (2m54s ago)   41m

$ oc get po         
cluster-capi-operator-74dfcfcb9d-7gk98          1/1     Running   7 (7m52s ago)   46m

$ oc get po                                                               
cluster-capi-operator-74dfcfcb9d-7gk98          0/1     CrashLoopBackOff   7 (2m24s ago)   50m

E0806 03:44:00.584669       1 kind.go:66] "kind must be registered to the Scheme" err="no kind is registered for the type v1alpha7.OpenStackCluster in scheme \"github.com/openshift/cluster-capi-operator/cmd/cluster-capi-operator/main.go:86\"" logger="controller-runtime.source.EventHandler"
E0806 03:44:00.685539       1 controller.go:203] "Could not wait for Cache to sync" err="failed to wait for clusteroperator caches to sync: timed out waiting for cache to be synced for Kind *v1alpha7.OpenStackCluster" controller="clusteroperator" controllerGroup="config.openshift.io" controllerKind="ClusterOperator"
I0806 03:44:00.685610       1 internal.go:516] "Stopping and waiting for non leader election runnables"
I0806 03:44:00.685620       1 internal.go:520] "Stopping and waiting for leader election runnables"
I0806 03:44:00.685646       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="secret" controllerGroup="" controllerKind="Secret"
I0806 03:44:00.685706       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster"
I0806 03:44:00.685712       1 controller.go:242] "All workers finished" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster"
I0806 03:44:00.685717       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="secret" controllerGroup="" controllerKind="Secret"
I0806 03:44:00.685722       1 controller.go:242] "All workers finished" controller="secret" controllerGroup="" controllerKind="Secret"
I0806 03:44:00.685718       1 controller.go:242] "All workers finished" controller="secret" controllerGroup="" controllerKind="Secret"
I0806 03:44:00.685720       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="clusteroperator" controllerGroup="config.openshift.io" controllerKind="ClusterOperator"
I0806 03:44:00.685823       1 recorder_in_memory.go:80] &Event{ObjectMeta:{dummy.17e906d425f7b2e1  dummy    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:CustomResourceDefinitionUpdateFailed,Message:Failed to update CustomResourceDefinition.apiextensions.k8s.io/openstackclusters.infrastructure.cluster.x-k8s.io: Put "https://172.30.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/openstackclusters.infrastructure.cluster.x-k8s.io": context canceled,Source:EventSource{Component:cluster-capi-operator-capi-installer-apply-client,Host:,},FirstTimestamp:2024-08-06 03:44:00.685748961 +0000 UTC m=+302.946052179,LastTimestamp:2024-08-06 03:44:00.685748961 +0000 UTC m=+302.946052179,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
I0806 03:44:00.719743       1 capi_installer_controller.go:309] "CAPI Installer Controller is Degraded" logger="CapiInstallerController" controller="clusteroperator" controllerGroup="config.openshift.io" controllerKind="ClusterOperator" ClusterOperator="cluster-api" namespace="" name="cluster-api" reconcileID="6fa96361-4dc2-4865-b1b3-f92378c002cc"
E0806 03:44:00.719942       1 controller.go:329] "Reconciler error" err="error during reconcile: failed to set conditions for CAPI Installer controller: failed to sync status: failed to update cluster operator status: client rate limiter Wait returned an error: context canceled" controller="clusteroperator" controllerGroup="config.openshift.io" controllerKind="ClusterOperator" ClusterOperator="cluster-api" namespace="" name="cluster-api" reconcileID="6fa96361-4dc2-4865-b1b3-f92378c002cc"

Expected results:

    cluster-capi-operator pod is always Running

Additional info:

    

Please review the following PR: https://github.com/openshift/bond-cni/pull/65

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    When running /bin/bridge and trying to access localhost:9000 while the frontend is still starting, the bridge crashes as it cannot find frontend/public/dist/index.html

Version-Release number of selected component (if applicable):

    4.17.0

How reproducible:

    Always

Steps to Reproduce:

    1. Build the OpenShift Console backend and run /bin/bridge 
    2. Try to access localhost:9000 while it is still starting
    

Actual results:

    Bridge crash

Expected results:

    No crash, either return HTTP 404/500 to the browser or serve a fallback page

Additional info:

    This is just a minor dev annoyance
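A minimal sketch of the expected behavior (not the console's actual handler; paths and messages are illustrative): check that the built index.html exists before serving it and return an error response instead of crashing:

package main

import (
	"log"
	"net/http"
	"os"
)

const indexPath = "frontend/public/dist/index.html"

// indexHandler serves the built console index page, but fails gracefully
// with a 503 while the frontend build has not produced the file yet.
func indexHandler(w http.ResponseWriter, r *http.Request) {
	if _, err := os.Stat(indexPath); err != nil {
		http.Error(w, "console frontend is still building, retry shortly", http.StatusServiceUnavailable)
		return
	}
	http.ServeFile(w, r, indexPath)
}

func main() {
	http.HandleFunc("/", indexHandler)
	log.Fatal(http.ListenAndServe("localhost:9000", nil))
}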

Description of problem:

when the user tries to create a Re-encrypt route, there is no place to upload the 'Destination CA certificate'

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-2024-09-09-120947

How reproducible:

    Always

Steps to Reproduce:

    1. create Secure route, TLS termination: Re-encrypt
    2. 
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

The information on the Lightspeed modal is not as clear as it could be for users to understand what to do next. Users should also have a very clear way to disable it, and those options are not obvious.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    ci/prow/security is failing on google.golang.org/grpc/metadata

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

always    

Steps to Reproduce:

    1. run the ci/prow/security job on a 4.15 PR
    2.
    3.
    

Actual results:

    Medium severity vulnerability found in google.golang.org/grpc/metadata

Expected results:

    

Additional info:

 

Please review the following PR: https://github.com/openshift/apiserver-network-proxy/pull/62

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of the problem:

The MAC mapping validation added in MGMT-17618 caused a regression on ABI.

To avoid this regression, the validation should be relaxed to validate only non-predictable interface names.

We should still make sure at least one MAC address exists in the MAC map, so that the relevant host can be detected.

See the Slack discussion.

 

 

How reproducible:

100%

 

Steps to reproduce:

  1. Install on a node with two interfaces (statically configured via nmstate YAML) that use a predictable name format (not eth*).
  2. Add the MAC address of only one of the interfaces to the MAC map.

 

Actual results:
error 'mac-interface mapping for interface xxxx is missing'
Expected results:

Installation succeeds and the interfaces are correctly configured.
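A minimal sketch of the proposed relaxation (hedged; the function and prefix check are illustrative, not the assisted-installer's actual code): only require a MAC mapping for interfaces with non-predictable names, while still requiring at least one MAC so the host can be identified:

package main

import (
	"fmt"
	"strings"
)

// validateMACMapping requires a mapping only for non-predictable interface
// names (the kernel's ethN scheme), but always requires at least one entry
// so the relevant host can still be detected.
func validateMACMapping(interfaceNames []string, macMap map[string]string) error {
	if len(macMap) == 0 {
		return fmt.Errorf("at least one mac-interface mapping is required")
	}
	for _, name := range interfaceNames {
		if !strings.HasPrefix(name, "eth") {
			continue // predictable name: the interface can be matched without a MAC
		}
		if _, ok := macMap[name]; !ok {
			return fmt.Errorf("mac-interface mapping for interface %s is missing", name)
		}
	}
	return nil
}

func main() {
	// Two predictable interfaces, only one MAC in the map: should pass.
	fmt.Println(validateMACMapping([]string{"ens3", "ens4"}, map[string]string{"ens3": "52:54:00:aa:bb:cc"}))
}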

Description of problem:

When configuring an OpenID IDP that can only be accessed via the data plane, if the hostname of the provider can only be resolved by the data plane, reconciliation of the IDP fails.

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    always

Steps to Reproduce:

    1. Configure an OpenID idp on a HostedCluster with a URL that points to a service in the dataplane (like https://keycloak.keycloak.svc)
    

Actual results:

    The oauth server fails to be reconciled

Expected results:

    The oauth server reconciles and functions properly

Additional info:

    Follow up to OCPBUGS-37753

The kube rebase broke TechPreview HyperShift on 4.18, with the resource.k8s.io group moving to v1alpha3.

KAS fails to start with

E1010 19:05:25.175819       1 run.go:72] "command failed" err="group version resource.k8s.io/v1alpha2 that has not been registered"

KASO addressed it here
https://github.com/openshift/cluster-kube-apiserver-operator/pull/1731
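A minimal sketch of the kind of scheme-registration change involved (hedged; the actual HyperShift/KASO fix may differ), assuming the k8s.io/api/resource/v1alpha3 package introduced by the 1.31 rebase:

package main

import (
	resourcev1alpha3 "k8s.io/api/resource/v1alpha3"
	"k8s.io/apimachinery/pkg/runtime"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
)

var scheme = runtime.NewScheme()

func init() {
	// After the rebase the DRA API is served as resource.k8s.io/v1alpha3,
	// so the scheme must register that version instead of the removed v1alpha2.
	utilruntime.Must(resourcev1alpha3.AddToScheme(scheme))
}

func main() {}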

Description of problem:
There are two enhancements we could have for cns-migration:
1. We can print an error message when the target datastore is not found; currently it exits as if nothing happened:

sh-5.1$ /bin/cns-migration -kubeconfig /tmp/kubeconfig -source vsanDatastore -destination invalid -volume-file /tmp/pv.txt
KubeConfig is: /tmp/kubeconfig
I0806 07:59:34.884908     131 logger.go:28] logging successfully to vcenter
I0806 07:59:36.078911     131 logger.go:28] ----------- Migration Summary ------------
I0806 07:59:36.078944     131 logger.go:28] Migrated 0 volumes
I0806 07:59:36.078960     131 logger.go:28] Failed to migrate 0 volumes
I0806 07:59:36.078968     131 logger.go:28] Volumes not found 0    

Compare with the source datastore check, which does report an error:

sh-5.1$ /bin/cns-migration -kubeconfig /tmp/kubeconfig -source invalid -destination Datastorenfsdevqe -volume-file /tmp/pv.txt
KubeConfig is: /tmp/kubeconfig
I0806 08:02:08.719657     138 logger.go:28] logging successfully to vcenter
E0806 08:02:08.749709     138 logger.go:10] error listing cns volumes: error finding datastore invalid in datacenter DEVQEdatacenter

 

 

2. If the volume-file contains an invalid PV name that is not found, for example at the beginning of the list, the tool exits immediately and all the remaining PVs are skipped; we could let it continue checking the other PVs.

 

Version-Release number of selected component (if applicable):

4.17    

How reproducible:

    Always

Steps to Reproduce:

    See Description     
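A minimal sketch of the two suggested improvements (hedged; the function names and stub lookups are illustrative, not the actual cns-migration code): surface a clear error when the destination datastore is missing, and keep going when an individual PV from the volume file is not found:

package main

import (
	"fmt"
	"log"
)

// checkDatastore fails fast with a clear error instead of silently printing
// an empty migration summary when the destination datastore does not exist.
func checkDatastore(exists func(string) bool, name string) error {
	if !exists(name) {
		return fmt.Errorf("error finding destination datastore %s", name)
	}
	return nil
}

// migrateVolumes skips PVs that cannot be found and keeps going, instead of
// exiting on the first missing PV in the volume file.
func migrateVolumes(pvNames []string, lookup func(string) (string, error)) (migrated, notFound int) {
	for _, name := range pvNames {
		volumeID, err := lookup(name)
		if err != nil {
			log.Printf("volume %s not found, skipping: %v", name, err)
			notFound++
			continue
		}
		log.Printf("migrating %s (volume id %s)", name, volumeID)
		migrated++
	}
	return migrated, notFound
}

func main() {
	exists := func(ds string) bool { return ds == "Datastorenfsdevqe" }
	if err := checkDatastore(exists, "invalid"); err != nil {
		log.Println(err) // surfaced instead of a silent empty summary
	}
	lookup := func(pv string) (string, error) {
		if pv == "missing-pv" {
			return "", fmt.Errorf("not found")
		}
		return "vol-" + pv, nil
	}
	m, n := migrateVolumes([]string{"missing-pv", "pvc-123"}, lookup)
	fmt.Printf("Migrated %d volumes, %d not found\n", m, n)
}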

Description of problem:

A normal user (project admin) visiting the Routes Metrics tab gets only an empty page.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-09-21-014704        

How reproducible:

Always    

Steps to Reproduce:

    1. normal user has a project and a route
    2. visit Networking -> Routes -> Metrics tab
    3.
    

Actual results:

empty page returned    

Expected results:

- we probably should not expose the Metrics tab to a normal user (compared with 4.16 behavior)
- if the Metrics tab is supposed to be exposed to a normal user, then we should return correct content instead of an empty page

Additional info:

    

Description of problem:

The prometheus operator fails to reconcile when proxy settings like no_proxy are set in the Alertmanager configuration secret.    

Version-Release number of selected component (if applicable):

4.15.z and later    

How reproducible:

    Always when AlertmanagerConfig is enabled

Steps to Reproduce:

    1. Enable UWM with AlertmanagerConfig
    enableUserWorkload: true
    alertmanagerMain:
      enableUserAlertmanagerConfig: true
    2. Edit the "alertmanager.yaml" key in the alertmanager-main secret (see attached configuration file)
    3. Wait for a couple of minutes.
    

Actual results:

Monitoring ClusterOperator goes Degraded=True.
    

Expected results:

No error
    

Additional info:

The Prometheus operator logs show that it doesn't understand the proxy_from_environment field.
The newer proxy fields are supported since Alertmanager v0.26.0 which is equivalent to OCP 4.15 and above. 
    

Please review the following PR: https://github.com/openshift/cloud-credential-operator/pull/753

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Context

In order to be able to use UIPlugins when installing COO, we need to onboard the plugins with COO using Konflux.

We might need to create a new Dockerfile in the plugin repos that is based on RHEL 8.

Outcome

  • The plugins used by COO are onboarded and can be included in the COO payload

Description of problem:

I see that if a release does not contain the kubevirt coreos container image and the kubeVirtContainer flag is set to true, oc-mirror fails to continue.
    

Version-Release number of selected component (if applicable):

     [fedora@preserve-fedora-yinzhou test]$ ./oc-mirror version
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.2.0-alpha.1-280-g8a42369", GitCommit:"8a423691", GitTreeState:"clean", BuildDate:"2024-08-03T08:02:06Z", GoVersion:"go1.22.4", Compiler:"gc", Platform:"linux/amd64"}

    

How reproducible:

     Always
    

Steps to Reproduce:

    1. use imageSetConfig.yaml as shown below
    2. Run command oc-mirror -c clid-179.yaml file://clid-179 --v2
    3.
    

Actual results:

    fedora@preserve-fedora-yinzhou test]$ ./oc-mirror -c /tmp/clid-99.yaml file://CLID-412 --v2

2024/08/03 09:24:38  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/08/03 09:24:38  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/08/03 09:24:38  [INFO]   : ⚙️  setting up the environment for you...
2024/08/03 09:24:38  [INFO]   : 🔀 workflow mode: mirrorToDisk 
2024/08/03 09:24:38  [INFO]   : 🕵️  going to discover the necessary images...
2024/08/03 09:24:38  [INFO]   : 🔍 collecting release images...
2024/08/03 09:24:44  [INFO]   : kubeVirtContainer set to true [ including :  ]
2024/08/03 09:24:44  [ERROR]  : unknown image : reference name is empty
2024/08/03 09:24:44  [INFO]   : 👋 Goodbye, thank you for using oc-mirror
2024/08/03 09:24:44  [ERROR]  : unknown image : reference name is empty 

    

Expected results:

    If the kubevirt coreos container image does not exist in a release, oc-mirror should skip it and continue mirroring the other content, but should not fail.
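A minimal sketch of the expected handling (hedged; not oc-mirror's actual code, the function name is made up): when the release does not carry a kubevirt coreos container reference, log and skip it rather than returning an error:

package main

import "log"

// kubeVirtImages returns the kubevirt coreos container reference to mirror,
// or nothing when the release does not ship one, instead of failing the
// whole mirror run with "reference name is empty".
func kubeVirtImages(ref string) []string {
	if ref == "" {
		log.Println("kubeVirtContainer is true, but this release has no kubevirt coreos container image; skipping")
		return nil
	}
	return []string{ref}
}

func main() {
	_ = kubeVirtImages("") // e.g. a 4.12 release payload without the image
}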
    

Additional info:

    [fedora@preserve-fedora-yinzhou test]$ cat /tmp/clid-99.yaml 
apiVersion: mirror.openshift.io/v2alpha1
kind: ImageSetConfiguration
mirror:
  platform:
    channels:
      - name: stable-4.12
        minVersion: 4.12.61
        maxVersion: 4.12.61
    kubeVirtContainer: true
  operators:
  - catalog: oci:///test/ibm-catalog
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    packages:
    - name: devworkspace-operator
      minVersion: "0.26.0"
    - name: nfd
      maxVersion: "4.15.0-202402210006"
    - name: cluster-logging
      minVersion: 5.8.3
      maxVersion: 5.8.4
    - name: quay-bridge-operator
      channels:
      - name: stable-3.9
        minVersion: 3.9.5
    - name: quay-operator
      channels:
      - name: stable-3.9
        maxVersion: "3.9.1"
    - name: odf-operator
      channels:
      - name: stable-4.14
        minVersion: "4.14.5-rhodf"
        maxVersion: "4.14.5-rhodf"
  additionalImages:
  - name: registry.redhat.io/ubi8/ubi:latest
  - name: quay.io/openshifttest/hello-openshift@sha256:61b8f5e1a3b5dbd9e2c35fd448dc5106337d7a299873dd3a6f0cd8d4891ecc27
  - name: quay.io/openshifttest/scratch@sha256:b045c6ba28db13704c5cbf51aff3935dbed9a692d508603cc80591d89ab26308

    

Please review the following PR: https://github.com/openshift/cloud-provider-openstack/pull/295

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

The Azure Disk CSI driver operator runs a node DaemonSet that exposes CSI driver metrics on loopback, but there is no kube-rbac-proxy in front of it and there is no Service / ServiceMonitor for it. Therefore OCP doesn't collect these metrics.

Description of problem:

In version 4.16 we can now collapse and expand the "Getting started resources" section in the Administrator perspective.

But in earlier versions we could directly remove this section with the [X] button, which is not available in 4.16.

Only the expand and collapse functions are available; the option to remove the section, which existed in previous versions, is missing.


 

Version-Release number of selected component (if applicable):

    

How reproducible:

    Every time

Steps to Reproduce:

    1. Go to Web console. Click on the "Getting started resources." 
    2. Then you can expand and collapse this tab.
    3. But there is no option to directly remove this tab.      

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/multus-admission-controller/pull/90

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Version-Release number of selected component (if applicable):
build openshift/ovn-kubernetes#2291

How reproducible:
Always

Steps to Reproduce:

1. Create a ns ns1

2. Create a UserDefinedNetwork CR in ns1

% oc get UserDefinedNetwork -n ns1 -o yaml
apiVersion: v1
items:
- apiVersion: k8s.ovn.org/v1
  kind: UserDefinedNetwork
  metadata:
    creationTimestamp: "2024-09-09T08:34:49Z"
    finalizers:
    - k8s.ovn.org/user-defined-network-protection
    generation: 1
    name: udn-network
    namespace: ns1
    resourceVersion: "73943"
    uid: c923b0b1-05b4-4889-b076-c6a28f7353de
  spec:
    layer3:
      role: Primary
      subnets:
      - cidr: 10.200.0.0/16
        hostSubnet: 24
    topology: Layer3
  status:
    conditions:
    - lastTransitionTime: "2024-09-09T08:34:49Z"
      message: NetworkAttachmentDefinition has been created
      reason: NetworkAttachmentDefinitionReady
      status: "True"
      type: NetworkReady
kind: List
metadata:
  resourceVersion: ""

3. Create a service and pods in ns1

 % oc get svc -n ns1
NAME           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)     AGE
test-service   ClusterIP   172.30.16.88   <none>        27017/TCP   5m32s
% oc get pods -n ns1
NAME            READY   STATUS    RESTARTS   AGE
test-rc-f54tl   1/1     Running   0          5m4s
test-rc-lhnd7   1/1     Running   0          5m4s
% oc exec -n ns1 test-rc-f54tl -- ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0@if41: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1360 qdisc noqueue state UP group default 
    link/ether 0a:58:0a:80:02:1b brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.128.2.27/23 brd 10.128.3.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fe80:21b/64 scope link 
       valid_lft forever preferred_lft forever
3: ovn-udn1@if42: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1360 qdisc noqueue state UP group default 
    link/ether 0a:58:0a:c8:03:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.200.3.3/24 brd 10.200.3.255 scope global ovn-udn1
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fec8:303/64 scope link 
       valid_lft forever preferred_lft forever

4. Restart ovn pods

% oc delete pods --all -n openshift-ovn-kubernetes
pod "ovnkube-control-plane-76fd6ddbf4-j69j8" deleted
pod "ovnkube-control-plane-76fd6ddbf4-vnr2m" deleted
pod "ovnkube-node-5pd5w" deleted
pod "ovnkube-node-5r9mg" deleted
pod "ovnkube-node-6bdtx" deleted
pod "ovnkube-node-6v5d7" deleted
pod "ovnkube-node-8pmpq" deleted
pod "ovnkube-node-cffld" deleted


Actual results:

 % oc get pods -n openshift-ovn-kubernetes                        
NAME                                     READY   STATUS             RESTARTS        AGE
ovnkube-control-plane-76fd6ddbf4-9cklv   2/2     Running            0               9m22s
ovnkube-control-plane-76fd6ddbf4-gkmlg   2/2     Running            0               9m22s
ovnkube-node-bztn5                       7/8     CrashLoopBackOff   5 (21s ago)     9m19s
ovnkube-node-qhjsw                       7/8     Error              5 (2m45s ago)   9m18s
ovnkube-node-t5f8p                       7/8     Error              5 (2m32s ago)   9m20s
ovnkube-node-t8kpp                       7/8     Error              5 (2m34s ago)   9m19s
ovnkube-node-whbvx                       7/8     Error              5 (2m35s ago)   9m20s
ovnkube-node-xlzlh                       7/8     CrashLoopBackOff   5 (14s ago)     9m18s

ovnkube-controller:
    Container ID:  cri-o://977dd8c17320695b1098ea54996bfad69c14dc4219a91dfd4354c818ea433cac
    Image:         registry.build05.ci.openshift.org/ci-ln-y1ypd82/stable@sha256:3110151b89e767644c01c8ce2cf3fec4f26f6d6e011262d0988c1d915d63355f
    Image ID:      registry.build05.ci.openshift.org/ci-ln-y1ypd82/stable@sha256:3110151b89e767644c01c8ce2cf3fec4f26f6d6e011262d0988c1d915d63355f
    Port:          29105/TCP
    Host Port:     29105/TCP
    Command:
      /bin/bash
      -c
      set -xe
      . /ovnkube-lib/ovnkube-lib.sh || exit 1
      start-ovnkube-node ${OVN_KUBE_LOG_LEVEL} 29103 29105
      
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   :205] Sending *v1.Node event handler 7 for removal
I0909 08:45:58.537155  170668 factory.go:542] Stopping watch factory
I0909 08:45:58.537167  170668 handler.go:219] Removed *v1.Node event handler 7
I0909 08:45:58.537185  170668 handler.go:219] Removed *v1.Namespace event handler 1
I0909 08:45:58.537198  170668 handler.go:219] Removed *v1.Namespace event handler 5
I0909 08:45:58.537206  170668 handler.go:219] Removed *v1.EgressIP event handler 8
I0909 08:45:58.537207  170668 handler.go:219] Removed *v1.EgressFirewall event handler 9
I0909 08:45:58.537187  170668 handler.go:219] Removed *v1.Node event handler 10
I0909 08:45:58.537219  170668 handler.go:219] Removed *v1.Node event handler 2
I0909 08:45:58.538642  170668 network_attach_def_controller.go:126] [network-controller-manager NAD controller]: shutting down
I0909 08:45:58.538703  170668 secondary_layer3_network_controller.go:433] Stop secondary layer3 network controller of network ns1.udn-network
I0909 08:45:58.538742  170668 services_controller.go:243] Shutting down controller ovn-lb-controller for network=ns1.udn-network
I0909 08:45:58.538767  170668 obj_retry.go:432] Stop channel got triggered: will stop retrying failed objects of type *v1.Node
I0909 08:45:58.538754  170668 obj_retry.go:432] Stop channel got triggered: will stop retrying failed objects of type *v1.Pod
E0909 08:45:58.5
      Exit Code:    1
      Started:      Mon, 09 Sep 2024 16:44:57 +0800
      Finished:     Mon, 09 Sep 2024 16:45:58 +0800
    Ready:          False
    Restart Count:  5
    Requests:
      cpu:      10m
      memory:   600Mi

Expected results:
ovn pods should not crash

Additional info:


Description of problem:

Deploy a 4.18 cluster on a PowerVS zone where LoadBalancers are slow to create.
We are called with InfraReady. We then create DNS records for the LBs. However, only the public LB exists. So the cluster fails to deploy.  The internal LB does eventually complete.
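A minimal sketch of the kind of wait that addresses this (hedged; the real installer code differs, and getHostnames is a hypothetical callback), using apimachinery's wait helpers to poll until both load balancer hostnames exist before creating DNS records:

package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForLoadBalancers polls until both the public and internal load
// balancer hostnames exist (or the timeout expires) before the caller goes
// on to create DNS records for them.
func waitForLoadBalancers(ctx context.Context, getHostnames func() (public, internal string, err error)) (string, string, error) {
	var pub, internal string
	err := wait.PollUntilContextTimeout(ctx, 30*time.Second, 30*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			p, i, err := getHostnames()
			if err != nil {
				return false, nil // transient lookup error, keep polling
			}
			if p == "" || i == "" {
				return false, nil // the internal LB can lag behind the public one
			}
			pub, internal = p, i
			return true, nil
		})
	return pub, internal, err
}

func main() {
	fmt.Println("use waitForLoadBalancers before creating DNS records")
}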
    

Version-Release number of selected component (if applicable):

4.18
    

How reproducible:

Occasionally, on a zone with slow LB creation.
    

Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/125

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When a kubevirt-csi pod runs on a worker node of a Guest cluster, the underlying PVC from the infra/host cluster is attached to the Virtual Machine that is the worker node of the Guest cluster.

That works well, but only until the VM is rebooted.

After the VM is power cycled for some reason, the volumeattachment on the Guest cluster is still there and shows as attached.

[guest cluster]# oc get volumeattachment
NAME                                                                   ATTACHER          PV                                         NODE                         ATTACHED   AGE
csi-976b6b166ef7ea378de9a350c9ef427c23e8c072dc6e76a392241d273c3effdb   csi.kubevirt.io   pvc-4e375fa9-c1ad-4fa6-a254-03d4c3b1111b   hostedcluster2-rlq9m-z2x88   true       39m

But the VM does not have the hotplugged disk anymore (it's not a persistent hotplug). It's not attached at all.

It only has its rhcos disk and cloud-init after the reboot:

[host cluster]# oc get vmi -n clusters-hostedcluster2 hostedcluster2-rlq9m-z2x88 -o yaml | yq '.status.volumeStatus'
- name: cloudinitvolume
  size: 1048576
  target: vdb
- name: rhcos
  persistentVolumeClaimInfo:
    accessModes:
      - ReadWriteOnce
    capacity:
      storage: 32Gi
    claimName: hostedcluster2-rlq9m-z2x88-rhcos
    filesystemOverhead: "0"
    requests:
      storage: "34359738368"
    volumeMode: Block
  target: vda

The result is that all workloads with PVCs now fail to start, as the hotplug is not triggered again. The worker node VM cannot find the disk:

26s         Warning   FailedMount                                  pod/mypod                             MountVolume.MountDevice failed for volume "pvc-4e375fa9-c1ad-4fa6-a254-03d4c3b1111b" : rpc error: code = Unknown desc = couldn't find device by serial id

So workload pods cannot start.

Version-Release number of selected component (if applicable):

    OCP 4.17.3
    CNV 4.17.0
    MCE 2.7.0

How reproducible:

    Always

Steps to Reproduce:

    1. Have a pod running with a PV from kubevirt-csi in the guest cluster
    2. Shutdown the Worker VM running the Pod and start it again
    

Actual results:

    Workloads fail to start after VM reboot

Expected results:

    Hotplug the disk again and let workloads start

Additional info:

    

Description of problem:

When running the 4.17 installer QE full-function test, the following amd64 instance types were detected and tested successfully, so append them to the installer doc [1]: 
* standardBasv2Family
* StandardNGADSV620v1Family 
* standardMDSHighMemoryv3Family
* standardMIDSHighMemoryv3Family
* standardMISHighMemoryv3Family
* standardMSHighMemoryv3Family

[1] https://github.com/openshift/installer/blob/master/docs/user/azure/tested_instance_types_x86_64.md 

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

To summarize, when we meet the following conditions, baremetal nodes cannot boot due to a hostname resolution failure.

  • HubCluster is IPv4/IPv6 Dual Stack
  • BMC of managed baremetal hosts are IPv6 single stack
  • A hostname is used instead of an IP address in "spec.bmc.address" of the BMH resource
  • The hostname is resolved only to IPv6 address, not IPv4

According to the following update, the provisioning service checks the BMC address scheme on the target and provides a matching URL for the installation media:

When we create a BMH resource, spec.bmc.address will be a URL of the BMC.
However, when we put a hostname instead of an IP address in spec.bmc.address, like the following example,

 

<Example BMH definition>
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
  :
spec:
  bmc:
    address: redfish://bmc.hostname.example.com:443/redfish/v1/Systems/1

we observe the following error.

$ oc logs -n openshift-machine-api metal3-baremetal-operator-6779dff98c-9djz7

{"level":"info","ts":1721660334.9622784,"logger":"provisioner.ironic","msg":"Failed to look up the IP address for BMC hostname","host":"myenv~mybmh","hostname":"redfish://bmc.hostname.example.com:443/redfish/v1/Systems/1"} 

Because of the name resolution failure, baremetal-operator cannot determine whether the BMC is IPv4 or IPv6.
Therefore, the IP scheme falls back to IPv4 and ISO images are exposed via an IPv4 address even if the BMC is IPv6 single stack.
In this case, the IPv6 BMC cannot access the ISO image over IPv4, we observe error messages like the following example, and the baremetal host cannot boot from the ISO.

<Error message on iDRAC>
Unable to locate the ISO or IMG image file or folder in the network share location because the file or folder path or the user credentials entered are incorrect

The issue is caused by the following implementation.
The following line passes `p.bmcAddress`, which is the whole URL; that's why the name resolution fails.
I think we should pass `parsedURL.Hostname()` instead, which is the hostname part of the URL.

https://github.com/metal3-io/baremetal-operator/blob/main/pkg/provisioner/ironic/ironic.go#L657

		ips, err := net.LookupIP(p.bmcAddress) 
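A minimal sketch of that suggested change (parse the URL first, then resolve only the host part); illustrative only, not the actual baremetal-operator patch:

package ironic

import (
    "net"
    "net/url"
)

// lookupBMCIPs resolves only the hostname portion of the BMC address
// (e.g. "bmc.hostname.example.com") instead of the full redfish:// URL,
// so the lookup can return the BMC's IPv6 address.
func lookupBMCIPs(bmcAddress string) ([]net.IP, error) {
    parsedURL, err := url.Parse(bmcAddress)
    if err != nil {
        return nil, err
    }
    return net.LookupIP(parsedURL.Hostname())
}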

 

Version-Release number of selected component (if applicable):
We observe this issue on OCP 4.14 and 4.15. But I think this issue occurs even in the latest releases.

How reproducible:

  • HubCluster is IPv4/IPv6 Dual Stack
  • BMC of managed baremetal hosts are IPv6 single stack
  • A hostname is used instead of an IP address in "spec.bmc.address" of the BMH resource
  • The hostname is resolved only to IPv6 address, not IPv4

Steps to Reproduce:

  1. Create a HubCluster with IPv4/IPv6 Dual Stack
  2. Prepare a baremetal host and BMC with IPv6 single stack
  3. Prepare a DNS server with an AAAA record entry which resolves the BMC hostname to an IPv6 address
  4. Create a BMH resource and use the hostname in the URL of "spec.bmc.address"
  5. BMC cannot boot due to IPv4/IPv6 mismatch

Actual results:
Name resolution fails and the baremetal host cannot boot

Expected results:
Name resolution works and the baremetal host can boot

Additional info:

 

Description of problem:

HyperShift doesn't allow configuring failure domains for node pools, which would help place machines in the desired availability zone.

Please review the following PR: https://github.com/openshift/machine-api-provider-gcp/pull/91

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.

Component name: ose-machine-api-provider-gcp-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml

Description of problem:

Creating and destroying transit gateways (TGs) during CI testing is costing an abnormal amount of money. Since the monetary cost of creating a TG is high, provide support for a user-created TG when creating an OpenShift cluster.
    

Version-Release number of selected component (if applicable):

all
    

How reproducible:

always
    

Description of problem:

https://search.dptools.openshift.org/?search=Helm+Release&maxAge=168h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

For conditional updates, status.conditionalUpdates.release is also a Release type (https://github.com/openshift/console/blob/master/frontend/public/module/k8s/types.ts#L812-L815), which will also trigger the Admission Webhook warning

Version-Release number of selected component (if applicable):

4.18.0-ec.2

How reproducible:

Always    

Steps to Reproduce:

1.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/azure-file-csi-driver/pull/80

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When creating an OCP cluster on AWS and selecting "publish: Internal," 
the ingress operator may create external LB mappings to external 
subnets.

This can occur if public subnets were specified in the install-config during installation.

https://docs.openshift.com/container-platform/4.15/installing/installing_aws/installing-aws-private.html#private-clusters-about-aws_installing-aws-private 

A configuration validation should be added to the installer.    

Version-Release number of selected component (if applicable):

    4.14+ probably older versions as well.

How reproducible:

    always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    Slack thread: https://redhat-internal.slack.com/archives/C68TNFWA2/p1714986876688959

Description of problem:

https://issues.redhat.com//browse/OCPBUGS-31919 partially fixed an issue with consuming the test image from a custom registry.
The fix makes the test binary consume the pull secret of the cluster under test.
To complete it, we have to do the same for custom CAs trusted by the cluster under test.

Without that, if the test image is exposed by a registry whose TLS cert is signed by a custom CA, the same tests will fail with:

{  fail [github.com/openshift/origin/test/extended/operators/certs.go:120]: Unexpected error:
    <*errors.errorString | 0xc0023105c0>: 
    unable to determine openshift-tests image oc wrapper with cluster ps: Error running /usr/bin/oc adm release info virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7 --image-for=tests --registry-config /tmp/image-pull-secret2435751342:
    StdOut>
    error: unable to read image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority
    StdErr>
    error: unable to read image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority
    exit status 1
    
    {
        s: "unable to determine openshift-tests image oc wrapper with cluster ps: Error running /usr/bin/oc adm release info virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7 --image-for=tests --registry-config /tmp/image-pull-secret2435751342:\nStdOut>\nerror: unable to read image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7: Get \"https://virthost.ostest.test.metalkube.org:5000/v2/\": tls: failed to verify certificate: x509: certificate signed by unknown authority\nStdErr>\nerror: unable to read image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:d368cc92e8d274744aac655e070d3a346f351fc5bd5f18a227b73452fd5c58b7: Get \"https://virthost.ostest.test.metalkube.org:5000/v2/\": tls: failed to verify certificate: x509: certificate signed by unknown authority\nexit status 1\n",
    }
occurred
Ginkgo exit error 1: exit with code 1}

Version-Release number of selected component (if applicable):

    release-4.16, release-4.17 and master branches in origin.

How reproducible:

Always    

Steps to Reproduce:

    1. try to run the test suite against a cluster where the OCP release (and the test image) comes from a private registry with a cert signed by a custom CA
    2.
    3.
    

Actual results:

    3 failing tests:
: [sig-arch][Late][Jira:"kube-apiserver"] collect certificate data [Suite:openshift/conformance/parallel] expand_more
: [sig-arch][Late][Jira:"kube-apiserver"] all registered tls artifacts must have no metadata violation regressions [Suite:openshift/conformance/parallel] expand_more
: [sig-arch][Late][Jira:"kube-apiserver"] all tls artifacts must be registered [Suite:openshift/conformance/parallel] expand_more

Expected results:

    No failing tests

Additional info:

    OCPBUGS-31919 partially fixed it by having the test binary download the pull secret from the cluster under test. But in order to have it working, we also have to trust the custom CAs trusted by the cluster under test

Description of problem:

The node-joiner pod does not honour the cluster-wide proxy settings

Version-Release number of selected component (if applicable):

OCP 4.16.6

How reproducible:

Always

Steps to Reproduce:

    1. Configure an OpenShift cluster-wide proxy according to https://docs.openshift.com/container-platform/4.16/networking/enable-cluster-wide-proxy.html and add Red Hat URLs (quay.io and others) to the proxy allow list.
    2. Add a node to a cluster using a node joiner pod, following https://github.com/openshift/installer/blob/master/docs/user/agent/add-node/add-nodes.md
    

Actual results:

Error retrieving the images on quay.io
time=2024-08-22T08:39:02Z level=error msg=Release Image arch could not be found: command '[oc adm release info quay.io/openshift-release-dev/ocp-release@sha256:24ea553ce2e79fab0ff9cf2917d26433cffb3da954583921926034b9d5d309bd -o=go-template={{if and .metadata.metadata (index . "metadata" "metadata" "release.openshift.io/architecture")}}{{index . "metadata" "metadata" "release.openshift.io/architecture"}}{{else}}{{.config.architecture}}{{end}} --insecure=true --registry-config=/tmp/registry-config1164077466]' exited with non-zero exit code 1:time=2024-08-22T08:39:02Z level=error msg=error: unable to read image quay.io/openshift-release-dev/ocp-release@sha256:24ea553ce2e79fab0ff9cf2917d26433cffb3da954583921926034b9d5d309bd: Get "http://quay.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)    

Expected results:

  node-joiner is able to download the images using the proxy

Additional info:
By allowing full direct internet access, without a proxy, the node-joiner pod is able to download images from quay.io.

So there is a strong suspicion that the HTTP timeout error above comes from the pod not being able to use the proxy.

Restricted environments where external internet access is only allowed through a proxy allow list are quite common in corporate environments.

Please consider honouring the OpenShift proxy configuration.
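For illustration only (this is not the node-joiner code, just a standard-library sketch): honouring the cluster-wide proxy in Go means building the HTTP transport from the proxy configuration, e.g. via http.ProxyFromEnvironment, which respects HTTP_PROXY, HTTPS_PROXY and NO_PROXY, instead of dialing quay.io directly.

package main

import (
    "fmt"
    "net/http"
)

func main() {
    // ProxyFromEnvironment routes requests through the configured proxy
    // (HTTP_PROXY/HTTPS_PROXY) and bypasses it for NO_PROXY entries.
    client := &http.Client{
        Transport: &http.Transport{Proxy: http.ProxyFromEnvironment},
    }
    resp, err := client.Get("https://quay.io/v2/")
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}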

Description of problem:

    Circular dependencies in the OCP Console prevent migration to Webpack 5

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    Always

Steps to Reproduce:

    1. Enable the CHECK_CYCLES env var while building
    2. Observe errors
    3.
    

Actual results:

    There are errors

Expected results:

    No errors

Additional info:

    

 

Description of problem

During install of multi-AZ OSD GCP clusters into customer-provided GCP projects, extra control plane nodes are created by the installer. This may be limited to a few regions, and has shown up in our testing in us-west2 and asia-east2.

When the cluster is installed, the installer provisions three control plane nodes via the cluster-api:

  • master-0 in AZ *a
  • master-1 in AZ *b
  • master-2 in AZ *c

However, the Machine manifests for master-0 and master-2 are written with the wrong AZs (master-0 in AZ *c and master-2 in AZ *a).

When the Machine controller in-cluster starts up and parses the manifests, it cannot find a VM for master-0 in AZ *c, or master-2 in *a, so it proceeds to try to create new VMs for those cases. master-1 is identified correctly, and unaffected.

This results in the cluster coming up with three control plane nodes (master-0 and master-2 having no backing Machines), three control plane Machines (only master-1 having a Node link, the other two listed in Provisioned state but with no Nodes), and 5 GCP VMs for these control plane nodes:

  • master-0 in AZ *a
  • master-0 in AZ *c
  • master-1 in AZ *b
  • master-2 in AZ *a
  • master-2 in AZ *c

This happens consistently, across multiple GCP projects, so far in us-west2 and asia-east2 ONLY.

4.16.z clusters work as expected, as do clusters upgraded from 4.16.z to 4.17.z.

Version-Release number of selected component

4.17.0-rc3 - 4.17.0-rc6 have all been identified as having this issue.

How reproducible

100%

Steps to Reproduce

I'm unsure how to replicate this in a vanilla cluster install, but via OSD:

  1. Create a multi-AZ cluster in one of the reported regions, with a supplied GCP project (not the core OSD shared project, i.e. CCS, or "Customer Cloud Subscription").

Example:

$ ocm create cluster --provider=gcp --multi-az --ccs --secure-boot-for-shielded-vms --region asia-east2 --service-account-file ~/.config/gcloud/chcollin1-dev-acct.json --channel-group candidate --version openshift-v4.17.0-rc.3-candidate chcollin-4170rc3-gcp

Requesting a GCP install via an install-config with controlPlane.platform.gcp.zones out of order seems to reliably reproduce.

Actual results

Install will fail in OSD, but a cluster will be created with multiple extra control-plane nodes, and the API server will respond on the master-1 node.

Expected results

A standard 3 control-plane-node cluster is created.

Additional info

We're unsure what it is about the two reported Zones or the difference between the primary OSD GCP project and customer-supplied Projects that has an effect.

The only thing we've noticed is the install-config has the order backwards for compute nodes, but not for control plane nodes:

{
  "controlPlane": [
    "us-west2-a",
    "us-west2-b",
    "us-west2-c"
  ],
  "compute": [
    "us-west2-c",     <--- inverted order.  Shouldn't matter when building control-plane Machines, but maybe cross-contaminated somehow?
    "us-west2-b",
    "us-west2-a"
  ],
  "platform": {
    "defaultMachinePlatform": {  <--- nothing about zones in here, although again, the controlPlane block should override any zones configured here
      "osDisk": {
        "DiskSizeGB": 0,
        "diskType": ""
      },
      "secureBoot": "Enabled",
      "type": ""
    },
    "projectID": "anishpatel",
    "region": "us-west2"
  }
}

Since we see the divergence at the asset/manifest level, we should be able to reproduce with just an openshift-install create manifests, followed by a grep -r zones: or something, without having to wait for an actual install attempt to come up and fail.

Description of problem:

 azure-disk-csi-driver doesn't use registryOverrides

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    100%

Steps to Reproduce:

    1. Set the registry override on the CPO
    2. Watch that azure-disk-csi-driver continues to use the default registry
    3.
    

Actual results:

    azure-disk-csi-driver uses default registry

Expected results:

    azure-disk-csi-driver uses the mirrored registry

Additional info:
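As an illustration of what honouring registryOverrides would mean for the driver images, a hypothetical helper (not the actual HyperShift/CPO code) that rewrites the registry host of an image pullspec before the Deployment is rendered; the image and registry names below are made up for the example:

package main

import (
    "fmt"
    "strings"
)

// replaceRegistry swaps the registry host of an image pullspec for the
// mirrored registry when the pullspec uses the default registry.
func replaceRegistry(image, defaultRegistry, mirrorRegistry string) string {
    if strings.HasPrefix(image, defaultRegistry+"/") {
        return mirrorRegistry + strings.TrimPrefix(image, defaultRegistry)
    }
    return image
}

func main() {
    img := "quay.io/openshift/azure-disk-csi-driver:latest"
    // Prints mirror.example.com:5000/openshift/azure-disk-csi-driver:latest
    fmt.Println(replaceRegistry(img, "quay.io", "mirror.example.com:5000"))
}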

    

See this comment for background:
https://github.com/openshift/origin/blob/6b07170abad135bc7c5b22c78b2079ceecfc8b51/test/extended/etcd/vertical_scaling.go#L75-L86

The current vertical scaling test triggers the CPMSO to create a new machine by first deleting an existing machine. In that test we can't validate that the new machine is scaled up before the old one is removed.

Another test we could add is to first disable the CPMSO and then delete an existing machine and manually create a new machine like we did before the CPMSO.

https://docs.openshift.com/container-platform/4.16/machine_management/control_plane_machine_management/cpmso-disabling.html

That way we can validate that the scale-down does not happen before the scale-up event.

Description of problem:

An error is thrown by the broker form view for a pre-populated application name. The error reads: formData.application.selectedKey must be a `string` type, but the final value was: `null`. If "null" is intended as an empty value be sure to mark the schema as `.nullable()`

Version-Release number of selected component (if applicable):


    

How reproducible:

 Every time
    

Steps to Reproduce:

    1. Install serverless operator
    2. Create any application in a namespace 
    3. Now open broker in form view
    

Actual results:

You have to select no application or any other application for the form view to work
    

Expected results:

Error should not be thrown for the appropriate value
    

Additional info:

Attaching a video of the error
    

https://drive.google.com/file/d/1WRp2ftMPlCG0ZiHZwC0QfleES3iVHObq/view?usp=sharing

Maxim Patlasov pointed this out in STOR-1453 but still somehow we missed it. I tested this on 4.15.0-0.ci-2023-11-29-021749.

It is possible to set a custom TLSSecurityProfile without minTLSversion:

$ oc edit apiserver cluster
...
spec:
  tlsSecurityProfile:
    type: Custom
    custom:
      ciphers:
      - ECDHE-ECDSA-CHACHA20-POLY1305
      - ECDHE-ECDSA-AES128-GCM-SHA256

This causes the controller to crash loop:

$ oc get pods -n openshift-cluster-csi-drivers
NAME                                             READY   STATUS             RESTARTS       AGE
aws-ebs-csi-driver-controller-589c44468b-gjrs2   6/11    CrashLoopBackOff   10 (18s ago)   37s
...

because the `${TLS_MIN_VERSION}` placeholder is never replaced:

        - --tls-min-version=${TLS_MIN_VERSION}
        - --tls-min-version=${TLS_MIN_VERSION}
        - --tls-min-version=${TLS_MIN_VERSION}
        - --tls-min-version=${TLS_MIN_VERSION}
        - --tls-min-version=${TLS_MIN_VERSION}

The observed config in the ClusterCSIDriver shows an empty string:

$ oc get clustercsidriver ebs.csi.aws.com -o json | jq .spec.observedConfig
{
  "targetcsiconfig": {
    "servingInfo":

{       "cipherSuites": [         "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256",         "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256"       ],       "minTLSVersion": ""     }

  }
}

which means minTLSVersion is empty when we get to this line, and the string replacement is not done:

[https://github.com/openshift/library-go/blob/c7f15dcc10f5d0b89e8f4c5d50cd313ae158de20/pkg/operator/csi/csidrivercontrollerservicecontroller/helpers.go#L234]

So it seems we have a couple of options:

1) completely omit the --tls-min-version arg if minTLSVersion is empty, or
2) set --tls-min-version to the same default value we would use if TLSSecurityProfile is not present in the apiserver object
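A minimal sketch of option 1 with a hypothetical argument builder (not the actual library-go code): simply drop the flag when minTLSVersion is empty instead of leaving the ${TLS_MIN_VERSION} placeholder unreplaced.

package main

import (
    "fmt"
    "strings"
)

// buildTLSArgs omits --tls-min-version entirely when the observed config
// carries an empty minTLSVersion (custom profile without minTLSVersion).
func buildTLSArgs(cipherSuites []string, minTLSVersion string) []string {
    var args []string
    if len(cipherSuites) > 0 {
        args = append(args, "--tls-cipher-suites="+strings.Join(cipherSuites, ","))
    }
    if minTLSVersion != "" {
        args = append(args, "--tls-min-version="+minTLSVersion)
    }
    return args
}

func main() {
    // With an empty minTLSVersion only the cipher-suites flag is rendered.
    fmt.Println(buildTLSArgs([]string{"TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256"}, ""))
}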

Description of problem:

Azure HostedClusters are failing in OCP 4.17 due to issues with the cluster-storage-operator.
- lastTransitionTime: "2024-05-29T19:58:39Z"
          message: 'Unable to apply 4.17.0-0.nightly-multi-2024-05-29-121923: the cluster operator storage is not available'
          observedGeneration: 2
          reason: ClusterOperatorNotAvailable
          status: "True"
          type: ClusterVersionProgressing  
I0529 20:05:21.547544       1 status_controller.go:218] clusteroperator/storage diff {"status":{"conditions":[{"lastTransitionTime":"2024-05-29T20:02:00Z","message":"AzureDiskCSIDriverOperatorCRDegraded: AzureDiskDriverGuestStaticResourcesControllerDegraded: \"node_service.yaml\" (string): namespaces \"clusters-test-case4\" not found\nAzureDiskCSIDriverOperatorCRDegraded: AzureDiskDriverGuestStaticResourcesControllerDegraded: ","reason":"AzureDiskCSIDriverOperatorCR_AzureDiskDriverGuestStaticResourcesController_SyncError","status":"True","type":"Degraded"},{"lastTransitionTime":"2024-05-29T20:04:15Z","message":"AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods","reason":"AzureDiskCSIDriverOperatorCR_AzureDiskDriverNodeServiceController_Deploying","status":"True","type":"Progressing"},{"lastTransitionTime":"2024-05-29T19:59:00Z","message":"AzureDiskCSIDriverOperatorCRAvailable: AzureDiskDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service","reason":"AzureDiskCSIDriverOperatorCR_AzureDiskDriverNodeServiceController_Deploying","status":"False","type":"Available"},{"lastTransitionTime":"2024-05-29T19:59:00Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"},{"lastTransitionTime":"2024-05-29T19:59:00Z","reason":"NoData","status":"Unknown","type":"EvaluationConditionsDetected"}]}} I0529 20:05:21.566215       1 event.go:364] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"azure-cloud-controller-manager", UID:"205a4307-67e4-481e-9fee-975b2c5c40fb", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/storage changed: Progressing message changed from "AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods\nAzureFileCSIDriverOperatorCRProgressing: AzureFileDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods" to "AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods"

 

On the HostedCluster itself, these are the errors with the CSI pods not coming up:

% k describe pod/azure-disk-csi-driver-node-5hb24 -n openshift-cluster-csi-drivers | grep fail
    Liveness:  http-get http://:healthz/healthz delay=10s timeout=3s period=10s #success=1 #failure=5
    Liveness:     http-get http://:rhealthz/healthz delay=10s timeout=3s period=10s #success=1 #failure=5
  Warning  FailedMount  2m (x28 over 42m)  kubelet            MountVolume.SetUp failed for volume "metrics-serving-cert" : secret "azure-disk-csi-driver-node-metrics-serving-cert" not found  

There was an error with the CO as well:

storage                                    4.17.0-0.nightly-multi-2024-05-29-121923   False       True          True       49m     AzureDiskCSIDriverOperatorCRAvailable: AzureDiskDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service  

 

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    Every time

Steps to Reproduce:

    1. Create a HC with a 4.17 nightly
    

Actual results:

    Azure HC does not complete; nodes do join NodePool though

Expected results:

    Azure HC should complete

Additional info:

    

Description of problem:

Console dynamic plugins may declare their extensions using TypeScript, e.g. Kubevirt plugin-extensions.ts module.

The EncodedExtension type should be exposed directly via the Console plugin SDK, instead of plugins having to import this type from the dependent OpenShift plugin SDK packages.

Description of problem:

IPI baremetal: a bootstrap VM machineNetwork interface restart impacts image pulling and causes the ironic service to fail
    

Version-Release number of selected component (if applicable):

4.16.z, but also seen in 4.15 and 4.17
    

How reproducible:

50% of our jobs fail because of this.
    

Steps to Reproduce:

    1. Prepare an IPI baremetal deployment (we have provisioning network disabled, we are using Virtual Media)
    2. Start a deployment, wait for the bootstrapVM to start running and login via SSH
    3. Run the command: journalctl -f | grep "Dependency failed for Ironic baremetal deployment service"
    4. If the command above returns something, print around 70 lines before it and check for NetworkManager entries in the log showing the interface on the baremetal network getting restarted, and for an error about pulling an image because DNS is not reachable.

    

Actual results:

Deployments fail 50% of the time; the bootstrap VM is not able to pull an image because the main machineNetwork interface is getting restarted and DNS resolution fails.
    

Expected results:

Deployments work 100% of the time; the bootstrap VM is able to pull any image because the machineNetwork interface is NOT restarted while images are being pulled.
    

Additional info:

We have a CI system to test OCP 4.12 through 4.17 deployments, and this issue started to occur a few weeks ago, mainly in 4.15, 4.16, and 4.17.
    

In this log extract of a deployment with OCP 4.16.0-0.nightly-2024-07-07-171226, you can see the image pull error because the registry name cannot be resolved, and in the lines before and after you can see that the machineNetwork interface is getting restarted, causing the loss of DNS resolution.

Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Finished Build Ironic environment.
Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Starting Extract Machine OS Images...
Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Starting Provisioning interface...
Mon 2024-07-08 23:15:15 UTC localhost.localdomain extract-machine-os.service[3779]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1370c041f0ecf4f6590c12f3e1b49111aa35052140e7fdd79964c32db47074c1...
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.3899] audit: op="connection-update" uuid="bf7e41e3-f1ea-3eed-98fd-c3d021e35d11" 
name="Wired connection 1" args="ipv4.addresses" pid=3812 uid=0 result="success"
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <warn>  [1720480515.4008] keyfile: load: "/etc/NetworkManager/system-connections/nmconnection": failed to load connection: invalid connection: connection.type: property is missing
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4018] audit: op="connections-reload" pid=3817 uid=0 result="success"
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4159] agent-manager: agent[543677841603162b,:1.67/nmcli-connect/0]: agent registered
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4164] device (ens3): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4170] manager: NetworkManager state is now CONNECTED_LOCAL
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4172] device (ens3): disconnecting for new activation request.
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4172] audit: op="connection-activate" uuid="bf7e41e3-f1ea-3eed-98fd-c3d021e35d11" name="Wired connection 1" pid=3821 uid=0 result="success"
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4200] device (ens3): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4214] dhcp4 (ens3): canceled DHCP transaction
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4215] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds)
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4215] dhcp4 (ens3): state changed no lease
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4216] dhcp6 (ens3): canceled DHCP transaction
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4216] dhcp6 (ens3): activation: beginning transaction (timeout in 45 seconds)
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4216] dhcp6 (ens3): state changed no lease

Mon 2024-07-08 23:15:15 UTC localhost.localdomain extract-machine-os.service[3779]: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1370c041f0ecf4f6590c12f3e1b49111aa35052140e7fdd79964c32db47074c1: (Mirrors also failed: [registry.dfwt5g.lab:4443/ocp-4.16/4.16.0-0.nightly-2024-07-07-171226@sha256:1370c041f0ecf4f6590c1
2f3e1b49111aa35052140e7fdd79964c32db47074c1: Get "https://registry.dfwt5g.lab:4443/v2/ocp-4.16/4.16.0-0.nightly-2024-07-07-171226/manifests/sha256:1370c041f0ecf4f6590c12f3e1b49111a
a35052140e7fdd79964c32db47074c1": dial tcp 192.168.5.9:4443: connect: network is unreachable]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1370c041f0ecf4f6590c12f3e1b491
11aa35052140e7fdd79964c32db47074c1: pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp: lookup quay.io on 192.168.32.8:53: dial udp 192.168.32.8:53: connect: network is unreachable

Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: extract-machine-os.service: Main process exited, code=exited, status=125/n/a
Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 2607:b500:410:7700::1 offline
Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 69.10.223.134 offline
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4309] policy: set-hostname: set hostname to 'localhost.localdomain' (no hostname found)
Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 207.246.65.226 offline
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4309] device (ens3): Activation: starting connection 'Wired connection 1' (bf7e41e3-f1ea-3eed-98fd-c3d021e35d11)
Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 2001:470:f1c4:1::42 offline
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4315] device (ens3): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 2603:c020:0:8369::feeb:dab offline
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4317] manager: NetworkManager state is now CONNECTING
Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 2600:3c01:e000:7e6::123 offline
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4317] device (ens3): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 192.168.32.8 offline
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4322] device (ens3): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 69.89.207.99 offline
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4326] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds)
Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 135.148.100.14 offline
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4347] dhcp4 (ens3): state changed new lease, address=192.168.32.28
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4350] policy: set 'Wired connection 1' (ens3) as default for IPv4 routing and DNS
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.4385] device (ens3): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Removed source 192.168.32.8
Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 69.10.223.134 online
Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 207.246.65.226 online
Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 69.89.207.99 online
Mon 2024-07-08 23:15:15 UTC localhost.localdomain chronyd.service[1764]: Source 135.148.100.14 online
Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: extract-machine-os.service: Failed with result 'exit-code'.
Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Failed to start Extract Machine OS Images.
Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Dependency failed for Customized Machine OS Image Server.
Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Dependency failed for Ironic baremetal deployment service.
Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: ironic.service: Job ironic.service/start failed with result 'dependency'.
Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Dependency failed for Metal3 deployment service.
Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: metal3-baremetal-operator.service: Job metal3-baremetal-operator.service/start failed with result 'dependency'.
Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: image-customization.service: Job image-customization.service/start failed with result 'dependency'.
Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Starting Ironic ramdisk logger...
Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Starting Update master BareMetalHosts with introspection data...
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3899]: NM local-dns-prepender triggered by ens3 dhcp4-change.
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3899]: <13>Jul  8 23:15:15 root: NM local-dns-prepender triggered by ens3 dhcp4-change.
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3901]: NM resolv-prepender: Checking for nameservers in /var/run/NetworkManager/resolv.conf
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3903]: nameserver 192.168.32.8
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3905]: Failed to get unit file state for systemd-resolved.service: No such file or directory
Mon 2024-07-08 23:15:15 UTC localhost.localdomain root[3911]: NM local-dns-prepender: Checking if local DNS IP is the first entry in resolv.conf
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3911]: <13>Jul  8 23:15:15 root: NM local-dns-prepender: Checking if local DNS IP is the first entry in resolv.conf
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3917]: NM local-dns-prepender: local DNS IP already is the first entry in resolv.conf
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager-dispatcher.service[3917]: <13>Jul  8 23:15:15 root: NM local-dns-prepender: local DNS IP already is the first entry in resolv.conf
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.5372] device (ens3): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Mon 2024-07-08 23:15:15 UTC localhost.localdomain provisioning-interface.service[3821]: Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveMon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.5375] device (ens3): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.5377] manager: NetworkManager state is now CONNECTED_SITE
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.5379] device (ens3): Activation: successful, device activated.
Mon 2024-07-08 23:15:15 UTC localhost.localdomain NetworkManager.service[1741]: <info>  [1720480515.5383] manager: NetworkManager state is now CONNECTED_GLOBAL
Mon 2024-07-08 23:15:15 UTC localhost.localdomain init.scope[1]: Finished Provisioning interface.
    

Please review the following PR: https://github.com/openshift/router/pull/624

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Debugging https://issues.redhat.com/browse/OCPBUGS-36808 (the Metrics API failing some of the disruption checks) and taking https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808 as a reproducer of the issue, I think the Kube-aggregator is behind the problem.

According to the disruption checks which forward some relevant errors from the apiserver in the logs, looking at one of the new-connections check failures (from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808/artifacts/e2e-aws-ovn-upgrade-2/openshift-e2e-test/artifacts/junit/backend-disruption_20240816-155051.json)

> "Aug 16 16:43:17.672 - 2s E backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests reason/DisruptionBegan request-audit-id/c62b7d32-856f-49de-86f5-1daed55326b2 backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests stopped responding to GET requests over new connections: error running request: 503 Service Unavailable: error trying to reach service: dial tcp 10.128.2.31:10250: connect: connection refused"

The "error trying to reach service" part comes from: https://github.com/kubernetes/kubernetes/blob/b3c725627b15bb69fca01b70848f3427aca4c3ef/staging/src/k8s.io/apimachinery/pkg/util/proxy/transport.go#L105, the apiserver failing to reach the metrics-server Pod, the problem is that the IP "10.128.2.31" corresponds to a Pod that was deleted some milliseconds before (as part of a node update/draining), as we can see in:

> 2024-08-16T16:19:43.087Z|00195|binding|INFO|openshift-monitoring_metrics-server-7b9d8c5ddb-dtsmr: Claiming 0a:58:0a:80:02:1f 10.128.2.31
...
I0816 16:43:17.650083 2240 kubelet.go:2453] "SyncLoop DELETE" source="api" pods=["openshift-monitoring/metrics-server-7b9d8c5ddb-dtsmr"]
...

The apiserver was using a stale IP to reach a Pod that no longer exists, even though a new Pod, which had already replaced the other Pod some minutes before (the Metrics API backend runs on 2 Pods), was available.
According to OVN, a fresher IP of that Pod, 10.131.0.12, was already in the endpoints at that time:

> I0816 16:40:24.711048 4651 lb_config.go:1018] Cluster endpoints for openshift-monitoring/metrics-server are: map[TCP/https:

{10250 [10.128.2.31 10.131.0.12] []}

]

I think, when "10.128.2.31" failed, the apiserver should have fallen back to "10.131.0.12", maybe it waits for some time/retries before doing so, or maybe it wasn't even aware of "10.131.0.12"

AFAIU, we have "--enable-aggregator-routing" set by default https://github.com/openshift/cluster-kube-apiserver-operator/blob/37df1b1f80d3be6036b9e31975ac42fcb21b6447/bindata/assets/config/defaultconfig.yaml#L101-L103 on the apiservers, so instead of forwarding to the metrics-server's service, apiserver directly reaches the Pods.

For that it keeps track of the relevant services and endpoints https://github.com/kubernetes/kubernetes/blob/ad8a5f5994c0949b5da4240006d938e533834987/staging/src/k8s.io/kube-aggregator/pkg/apiserver/resolvers.go#L40

Bad decisions may be made if the services and/or endpoints caches are stale.

Looking at the metrics-server (the Metrics API backend) endpoints changes in the apiserver audit logs:

> $ grep -hr Event . | grep "endpoints/metrics-server" | jq -c 'select( .verb | match("watch|update"))' | jq -r '[.requestReceivedTimestamp,.user.username,.verb] | @tsv' | sort
2024-08-16T15:39:57.575468Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T15:40:02.005051Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T15:40:35.085330Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T15:40:35.128519Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:19:41.148148Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:19:47.797420Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:20:23.051594Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:20:23.100761Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:20:23.938927Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:21:01.699722Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:39:00.328312Z system:serviceaccount:kube-system:endpoint-controller update ==> At around 16:39:XX the first Pod was rolled out
2024-08-16T16:39:07.260823Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:39:41.124449Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:43:23.701015Z system:serviceaccount:kube-system:endpoint-controller update ==> At around 16:43:23, the new Pod that replaced the second one was created
2024-08-16T16:43:28.639793Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:43:47.108903Z system:serviceaccount:kube-system:endpoint-controller update

We can see that just before the new-connections checks succeeded again at around "2024-08-16T16:43:23", an UPDATE was received/processed, which may have helped the apiserver sync its endpoints cache and/or choose a healthy Pod.

Also, no update was triggered when the second Pod was deleted at "16:43:17", which may explain the stale 10.128.2.31 endpoints entry on the apiserver side.

To summarize, I can see two problems here (maybe one is the consequence of the other):

1. A Pod was deleted and the Endpoints entry pointing to it wasn't updated. Apparently the Endpoints controller had/has some sync issues: https://github.com/kubernetes/kubernetes/issues/125638
2. Either the apiserver resolver had an endpoints cache with one stale and one fresh entry but kept trying the stale entry 4-5 times in a row, or the endpoints were updated (at around 16:39:XX, when the first Pod was rolled out, see above) but the apiserver resolver cache missed that, ended up with 2 stale entries, and had to wait until around 16:43:23 (when the new Pod that replaced the second one was created, see above) to sync and replace them with 2 fresh entries.
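As a toy illustration of the fallback behaviour one would expect here (this is not the aggregator's actual resolver code), trying each cached endpoint IP in turn instead of failing hard on the stale one:

package main

import (
    "errors"
    "fmt"
    "net"
    "time"
)

// dialFirstHealthy tries each cached endpoint IP in order and returns the
// first connection that succeeds, instead of repeatedly using a stale IP.
func dialFirstHealthy(ips []string, port string) (net.Conn, error) {
    var lastErr error
    for _, ip := range ips {
        conn, err := net.DialTimeout("tcp", net.JoinHostPort(ip, port), 2*time.Second)
        if err == nil {
            return conn, nil
        }
        lastErr = err
    }
    return nil, errors.Join(errors.New("all cached endpoints failed"), lastErr)
}

func main() {
    // The two endpoint IPs from the incident above: the stale one first,
    // the fresh one second.
    conn, err := dialFirstHealthy([]string{"10.128.2.31", "10.131.0.12"}, "10250")
    if err != nil {
        fmt.Println(err)
        return
    }
    conn.Close()
}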

Version-Release number of selected component (if applicable):
    

How reproducible:

    

Steps to Reproduce:

    1. See "Description of problem"
    2.
    3.
    

Actual results:

    

Expected results:

The kube-aggregator should detect stale APIService endpoints.
    

Additional info:

The kube-aggregator proxies requests to a stale Endpoints entry/Pod, which makes Metrics API requests falsely fail.
    

Description of problem:

While running batches of 500 managed clusters upgrading via Image-Based Upgrade (IBU) through RHACM and TALM, the haproxy load balancer configured by default for a bare metal cluster in the openshift-kni-infra namespace would frequently run out of connections despite being tuned for 20,000 connections.

Version-Release number of selected component (if applicable):

Hub OCP - 4.16.3
Spoke Clusters - Originally deployed 4.14.31 then upgraded in sequence to 4.14.32 -> 4.15.20 -> 4.15.21 -> 4.16.1 -> 4.16.3
ACM - 2.11.0-DOWNSTREAM-2024-07-10-21-49-48
TALM - 4.16.0    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

While monitoring the current connections during a CGU batch of 500 SNOs doing an IBU to a new OCP version, I would observe the oc CLI returning "net/http: TLS handshake timeout", and if I monitored the current connections via rsh into the active haproxy pod:

# oc  -n openshift-kni-infra rsh haproxy-d16-h10-000-r650 
Defaulted container "haproxy" out of: haproxy, haproxy-monitor, verify-api-int-resolvable (init)
sh-5.1$ echo "show info" | socat stdio /var/lib/haproxy/run/haproxy.sock | grep CurrConns
CurrConns: 20000
sh-5.1$ 

While capturing this value every 10 or 15 seconds, I would observe high fluctuation in the number of connections, such as:
Thu Aug  8 17:51:57 UTC 2024
CurrConns: 17747
Thu Aug  8 17:52:02 UTC 2024
CurrConns: 18413
Thu Aug  8 17:52:07 UTC 2024
CurrConns: 19147
Thu Aug  8 17:52:12 UTC 2024
CurrConns: 19785
Thu Aug  8 17:52:18 UTC 2024
CurrConns: 20000
Thu Aug  8 17:52:23 UTC 2024
CurrConns: 20000
Thu Aug  8 17:52:28 UTC 2024
CurrConns: 20000
Thu Aug  8 17:52:33 UTC 2024
CurrConns: 20000

A brand new hub cluster without any spoke clusters and without ACM installed runs between 53-56 connections; after installing ACM I would see the connection count rise to 56-60 connections. In a smaller environment with only 297 managed clusters I observed between 1410-1695 connections. I do not have a measurement of approximately how many connections we need in the large environment; however, it clearly fluctuates, and the initiation of the IBU upgrades seems to spike it to the current default limit, triggering the timeout error message.

 

This story tracks the routine i18n upload/download tasks which are performed every sprint.

 

A.C.

  - Upload strings to Memsource at the start of the sprint and reach out to the localization team

  - Download translated strings from Memsource when they are ready

  - Review the translated strings and open a pull request

  - Open a follow-up story for the next sprint

Description of problem:

CSS overrides in the OpenShift console are applied to the ACM dropdown menu

Version-Release number of selected component (if applicable):

4.14, 4.15    

How reproducible:

Always    

Steps to Reproduce:

In ACM, view Governance > Policies and open the Actions dropdown.
    

Actual results:

Actions are indented and preceded by bullets

Expected results:

Dropdown menu style should not be affected    

Additional info:

    

Description of problem:

while applying "oc adm upgrade --to-multi-arch"
certain flags such as --to and --to-image are blocked with error message such as: 
error: --to-multi-arch may not be used with --to or --to-image
however if one applies --force, or --to-latest, no error message is generated, only:
Requested update to multi cluster architecture
and the flags are omitted silently, applying .spec:
desiredUpdate:
    architecture:    Multi
    force:    false   <- --force silently have no effect here
    image:    
    version:    4.13.0-ec.2  <- --to-latest omitted silently either 

Version-Release number of selected component (if applicable):

4.13.0-ec.2 but seen elsewhere

How reproducible:

100%

Steps to Reproduce:

1. oc adm upgrade --to-multi-arch --force
2. oc adm upgrade --to-multi-arch --to-latest
3. oc adm upgrade --to-multi-arch --force --to-latest 

Actual results:

The flags are silently omitted, as explained above.

Expected results:

Either the flags should be blocked with the same error as --to and --to-image,
or, if there is a use case, they should have the desired effect and not be silently omitted.

 

Please review the following PR: https://github.com/openshift/machine-os-images/pull/40

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/ironic-agent-image/pull/143

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/console-operator/pull/929

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    The label data for networking Services is inverted: it should be displayed as "key=value", but it's currently displayed as "value=key".

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-2024-09-09-120947
    4.18.0-0.nightly-2024-09-09-212926

How reproducible:

    Always

Steps to Reproduce:

    1. Navigate to the Networking -> Services page and create a sample Service with labels,
       e.g.:
apiVersion: v1
kind: Service
metadata:
  name: exampleasd
  namespace: default
  labels:
    testkey1: testvalue1
    testkey2: testvalue2
spec:
  selector:
    app: MyApp
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376    

    2. Check the Labels on the Service details page
    3. Check the Labels column on the Networking -> Services page
    

Actual results:

    the data is shown as 'testvalue1=testkey1' and 'testvalue2=testkey2'

Expected results:

    it should be shown as 'testkey1=testvalue1' and 'testkey2=testvalue2'

Additional info:

    

Description of problem:

    Creating a faulty configmap for UWM results in cluster_operator_up=0 with the reason InvalidConfiguration. With https://issues.redhat.com/browse/MON-3421 we're expecting the reason to match UserWorkload.*

Version-Release number of selected component (if applicable):

    4.15.z

How reproducible:

    100%

Steps to Reproduce:

apply the following CM to a cluster with UWM enabled:

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    hah helo! :)     

Actual results:

    cluster_operator_up=0 with reason InvalidConfiguration

Expected results:

    cluster_operator_up=0 with reason matching pattern UserWorkload.*

Additional info:

https://issues.redhat.com/browse/MON-3421 streamlined reasons to allow separation between UWM and cluster monitoring. The above is a leftover that should be updated to match the same pattern.    
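
For contrast, a syntactically valid user-workload-monitoring-config looks like the following (the retention value is only an illustrative assumption); the bug is solely about the reason reported when parsing fails, not about what a valid config contains:

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      retention: 24h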

Description of problem:

This is a followup of https://issues.redhat.com/browse/OCPBUGS-34996, in which comments led us to better understand the issue customers are facing.

LDAP IDP traffic from the oauth pod seems to be going through the configured HTTP(S) proxy, while it should not due to it being a different protocol. This results in customers adding the ldap endpoint to their no-proxy config to circumvent the issue. 

Version-Release number of selected component (if applicable):

4.15.11     

How reproducible:

    

Steps to Reproduce:

 (From the customer)   
    1. Configure LDAP IDP
    2. Configure Proxy
    3. LDAP IDP communication from the control plane oauth pod goes through proxy instead of going to the ldap endpoint directly
    

Actual results:

    LDAP IDP communication from the control plane oauth pod goes through proxy 

Expected results:

    LDAP IDP communication from the control plane oauth pod should go directly to the LDAP endpoint using the ldap protocol; it should not go through the proxy settings

Additional info:

For more information, see linked tickets.    
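
As a sketch of the workaround mentioned above (hostnames are placeholders, not taken from the affected environment), customers currently add the LDAP endpoint to the cluster-wide noProxy list:

apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  name: cluster
spec:
  httpProxy: http://proxy.example.com:3128
  httpsProxy: http://proxy.example.com:3128
  # workaround only: exclude the LDAP endpoint from proxying
  noProxy: ldap.example.com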

Please review the following PR: https://github.com/openshift/cluster-dns-operator/pull/421

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

  1. Until OCP 4.11, the 'Dev Console -> Project -> Project Access' tab had a form with Name and Role; it appears to have been changed to a form with Subject, Name, and Role through OCPBUGS-7800. Here, when the Subject is ServiceAccount, the Save button is not available unless a Project is selected.

This seems to make setting a Project/namespace a requirement. However, in the CLI, RoleBinding objects can be created without a namespace with no issues.

$ oc describe rolebinding.rbac.authorization.k8s.io/monitor
Name:         monitor
Labels:       <none>
Annotations:  <none>
Role:
  Kind:  ClusterRole
  Name:  view
Subjects:
  Kind            Name     Namespace
  ----            ----     ---------
  ServiceAccount  monitor

This is inconsistent with the dev console, causing confusion for developers and administrators and making things cumbersome for administrators.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Login to the web console for Developer.
    2. Select Project on the left.
    3. Select 'Project Access' tab.
    4. Add access -> Select Service Account in the dropdown
   

Actual results:

   Save button is not active when no project is selected      

Expected results:

    The Save button should be enabled even when no Project is selected, so that the RoleBinding can be created just as it is handled in the CLI.

Additional info:
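
    For illustration, a RoleBinding equivalent to the object described above can be applied from YAML without a namespace on the ServiceAccount subject (the binding's own namespace below is an assumption):

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: monitor
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
- kind: ServiceAccount
  name: monitor  # no namespace field on the subject; the API server accepts this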

    

Description of problem:

    openshift-install create cluster leads to error:
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed during pre-provisioning: unable to initialize folders and templates: failed to import ova: failed to lease wait: Invalid configuration for device '0'. 

vSphere standard port group

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

    1. openshift-install create cluster
    2. Choose vSphere
    3. Fill in the blanks
    4. Have a standard port group
    

Actual results:

    error

Expected results:

    cluster creation

Additional info:

    

Description of problem:

    The single-page docs are missing the "oc adm policy add-cluster-role-to-*" and "remove-cluster-role-from-*" commands. These options exist in these docs:

https://docs.openshift.com/container-platform/4.14/authentication/using-rbac.html

but not in these docs:

https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/cli_tools/index#oc-adm-policy-add-role-to-user 
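
For reference, the commands missing from the single-page CLI docs are of the following form (role, user, and group names are placeholders):

oc adm policy add-cluster-role-to-user <clusterrole> <user>
oc adm policy remove-cluster-role-from-user <clusterrole> <user>
oc adm policy add-cluster-role-to-group <clusterrole> <group>
oc adm policy remove-cluster-role-from-group <clusterrole> <group>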

Description of problem:

    VirtualizedTable, which is exposed to dynamic plugins, is missing the onRowsRendered prop that is available in VirtualTableBody of the @patternfly/react-virtualized-extension package

Version-Release number of selected component (if applicable):

    4.15.z

Actual results:

    onRowsRendered prop is not available in VirtualizedTable component

Expected results:

    onRowsRendered prop should be available in VirtualizedTable component

Additional info:

    

Description of problem:

Necessary security group rules are not created when using an installer-created VPC.

Version-Release number of selected component (if applicable):

    4.17.2

How reproducible:

    Easily

Steps to Reproduce:

    1. Try to deploy a Power VS cluster and have the installer create the VPC, or remove required rules from a VPC you're bringing.
    2. Control plane nodes fail to bootstrap.
    3. Fail
    

Actual results:

    Install fails

Expected results:

    Install succeeds

Additional info:

    Fix identified

Description of problem:

In OpenShift 4.13-4.15, when a "rendered" MachineConfig in use is deleted, it's automatically recreated. In OpenShift 4.16, it's not recreated, and nodes and the MCP become degraded due to the "rendered" not found error.

 

Version-Release number of selected component (if applicable):

4.16

 

How reproducible:

Always

 

Steps to Reproduce:

1. Create a MC to deploy any file in the worker MCP

2. Get the name of the new rendered MC, like for example "rendered-worker-bf829671270609af06e077311a39363e"

3. When the first node starts updating, delete the new rendered MC

    oc delete mc rendered-worker-bf829671270609af06e077311a39363e     

 

Actual results:

Node degraded with "rendered" not found error

 

Expected results:

In OCP 4.13 to 4.15, the "rendered" MC is automatically re-created, and the node continues updating to the MC content without issues. It should be the same in 4.16.

 

Additional info:

The behavior in 4.12 and older is the same as it is now in 4.16. In 4.13-4.15, the "rendered" MC is re-created and no issues with the nodes/MCPs are shown.
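
A minimal MachineConfig that can be used for step 1 of the reproducer (the name and file contents are illustrative assumptions):

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-test-file
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - path: /etc/test-file
        mode: 420
        contents:
          source: data:,hello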

Description of problem:

Azure-File volume mounts fail; it happens on an ARM cluster with the multi payload

$ oc describe pod
  Warning  FailedMount       6m28s (x2 over 95m)  kubelet            MountVolume.MountDevice failed for volume "pvc-102ad3bf-3480-410b-a4db-73c64daeb3e2" : rpc error: code = InvalidArgument desc = GetAccountInfo(wduan-0319b-bkp2k-rg#clusterjzrlh#pvc-102ad3bf-3480-410b-a4db-73c64daeb3e2###wduan) failed with error: Retriable: true, RetryAfter: 0s, HTTPStatusCode: -1, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/wduan-0319b-bkp2k-rg/providers/Microsoft.Storage/storageAccounts/clusterjzrlh/listKeys?api-version=2021-02-01: StatusCode=0 -- Original Error: adal: Failed to execute the refresh request. Error = 'Post "https://login.microsoftonline.com/6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee/oauth2/token": dial tcp 20.190.190.193:443: i/o timeout'

 

The node log reports:
W0319 09:41:30.745936 1 azurefile.go:806] GetStorageAccountFromSecret(azure-storage-account-clusterjzrlh-secret, wduan) failed with error: could not get secret(azure-storage-account-clusterjzrlh-secret): secrets "azure-storage-account-clusterjzrlh-secret" is forbidden: User "system:serviceaccount:openshift-cluster-csi-drivers:azure-file-csi-driver-node-sa" cannot get resource "secrets" in API group "" in the namespace "wduan"

 

 
 

Checked the ClusterRole, which looks good, at least the same as before: 
$ oc get clusterrole azure-file-privileged-role -o yaml
...
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - privileged
  resources:
  - securitycontextconstraints
  verbs:
  - use

 

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-multi-2024-03-13-031451

How reproducible:

2/2

Steps to Reproduce:

    1. Checked in CI, azure-file cases failed due to this
    2. Create one cluster with the same config and payload, create azure-file pvc and pod
    3.
    

Actual results:

Pod could not be running    

Expected results:

Pod should be running 

Additional info:

    

Description of problem:

    In the Administrator view under Cluster Settings -> Update Status Pane, the text for the versions is black instead of white when Dark mode is selected on Firefox (128.0.3 Mac). Also happens if you choose System default theme and the system is set to Dark mode.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. Open /settings/cluster using Firefox with Dark mode selected
    2.
    3.
    

Actual results:

    The version numbers under Update status are black

Expected results:

    The version numbers under Update status are white

Additional info:

    

Description of problem:

There are 2 problematic tests in the ImageEcosystem test suite: the Rails sample and the s2i Perl test. This issue tries to fix them both at once so that we can get a passing image ecosystem test.
    

Version-Release number of selected component (if applicable):


    

How reproducible:

always
    

Steps to Reproduce:

    1. Run the imageecosystem testsuite
    2. observe the {[Feature:ImageEcosystem][ruby]} and {[Feature:ImageEcosystem][perl]} test fail
    

Actual results:

The two tests fail
    

Expected results:

No test failures
    

Additional info:


    

Description of the problem:

After multiple re-installations on the exact same bare-metal host, re-using the exact same parameters (such as Agent ID, cluster name, domain, etc.), the eventsURL hits a limit even though the postgres database does save the latest entries, so there is no direct way to check the progress.

How reproducible:

 

Steps to reproduce:

1. Install an SNO cluster in a Host

2. Fully wipe out all the resources in RHACM, including SNO project

3. Re-install exact same SNO in the same Host

4. Repeat steps 1-3 multiple times

 

Actual results:

Last ManagedCluster installed is from 09/09 and the postgres database contains its last installation logs:

installer=> SELECT * FROM events WHERE host_id LIKE 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201' ORDER BY event_time DESC;
   id   |          created_at           |          updated_at           | deleted_at | category |              cluster_id              |         event_time         |
               host_id                |             infra_env_id             |                                                                                       
                                    message                                                                                                                          
  |             name              | props |              request_id              | severity 
--------+-------------------------------+-------------------------------+------------+----------+--------------------------------------+----------------------------+
--------------------------------------+--------------------------------------+---------------------------------------------------------------------------------------
--------------+-------+--------------------------------------+----------
 213102 | 2024-09-09 10:15:54.440757+00 | 2024-09-09 10:15:54.440757+00 |            | user     | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:15:54.439+00 |
 aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Host sno1: validation 'api-int-domain-name-resolved-correctly' that used to succeed is
 now failing                                                                                                                                                         
  | host_validation_failed        |       | b7785748-9f73-46e8-a11a-afefe2bfeb59 | warning
 213088 | 2024-09-09 10:06:16.021777+00 | 2024-09-09 10:06:16.021777+00 |            | user     | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:06:16.021+00 |
 aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Host: sno1, reached installation stage Done                                           
                                                                                                                                                                
  | host_install_progress_updated |       | a711f06b-870f-4f5f-886a-882ed6ea4665 | info
 213087 | 2024-09-09 10:06:16.019012+00 | 2024-09-09 10:06:16.019012+00 |            | user     | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:06:16.018+00 |
 aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Host sno1: updated status from installing-in-progress to installed (Done)             
                                                                                                                                                                     
  | host_status_updated           |       | a711f06b-870f-4f5f-886a-882ed6ea4665 | info
 213086 | 2024-09-09 10:05:16.029495+00 | 2024-09-09 10:05:16.029495+00 |            | user     | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:05:16.029+00 |
 aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Host: sno1, reached installation stage Joined                                         
                                                                                                                                                                     
  | host_install_progress_updated |       | 2a8028c1-a0d0-4145-92cf-ea32e6b3f7e6 | info
 213085 | 2024-09-09 10:03:32.06692+00  | 2024-09-09 10:03:32.06692+00  |            | user     | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:03:32.066+00 |
 aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Host: sno1, reached installation stage Rebooting: Ironic will reboot the node shortly 
                                                                                                                                                                     
  | host_install_progress_updated |       | fced0438-2f03-415f-913e-62da2d43431b | info
 213084 | 2024-09-09 10:03:31.998935+00 | 2024-09-09 10:03:31.998935+00 |            | user     | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:03:31.998+00 |
 aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Uploaded logs for host sno1 cluster c5b3b1d3-0cc6-4674-8ba6-62140e9dea16              
                                                                                                                                                                     
  | host_logs_uploaded            |       | df3bc18a-d56a-4a20-84cb-d179fe3040f6 | info
 213083 | 2024-09-09 10:03:12.621342+00 | 2024-09-09 10:03:12.621342+00 |            | user     | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:03:12.621+00 |
 aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Host: sno1, reached installation stage Writing image to disk: 100%                    
                                                                                                                                                                     
  | host_install_progress_updated |       | 69cad5b4-b606-406c-921e-4f7b0ababfb6 | info
 213082 | 2024-09-09 10:03:12.158359+00 | 2024-09-09 10:03:12.158359+00 |            | user     | c5b3b1d3-0cc6-4674-8ba6-62140e9dea16 | 2024-09-09 10:03:12.158+00 |
 aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | cd0dddc3-e879-4c72-9e9d-3d98bb7813bb | Host: sno1, reached installation stage Writing image to disk: 97%       

But opening the Agent eventsURL (from 09/09 installation): 

 

apiVersion: agent-install.openshift.io/v1beta1
kind: Agent
metadata:
  annotations:
    inventory.agent-install.openshift.io/version: "0.1"
  creationTimestamp: "2024-09-09T09:55:46Z"
  finalizers:
  - agent.agent-install.openshift.io/ai-deprovision
  generation: 2
  labels:
    agent-install.openshift.io/bmh: sno1
    agent-install.openshift.io/clusterdeployment-namespace: sno1
    infraenvs.agent-install.openshift.io: sno1
    inventory.agent-install.openshift.io/cpu-architecture: x86_64
    inventory.agent-install.openshift.io/cpu-virtenabled: "true"
    inventory.agent-install.openshift.io/host-isvirtual: "true"
    inventory.agent-install.openshift.io/host-manufacturer: RedHat
    inventory.agent-install.openshift.io/host-productname: KVM
    inventory.agent-install.openshift.io/storage-hasnonrotationaldisk: "false"
  name: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201
  namespace: sno1
...
...
  debugInfo:
    eventsURL: https://assisted-service-multicluster-engine.apps.hub-sno.nokia-test.lab/api/assisted-install/v2/events?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJpbmZyYV9lbnZfaWQiOiJjZDBkZGRjMy1lODc5LTRjNzItOWU5ZC0zZDk4YmI3ODEzYmIifQ.eMlGvHeR69CoEA6OhtZX0uBZFeQOSRGOhYsqd1b0W3M78cGo1a2kbIKTz1eU80GUb70cU3v3pxKmxd19kpFaQA&host_id=aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201
    state: installed
    stateInfo: Done 

Clicking on the eventsURL shows the latest event as one from 25/07, which means it is still showing past installations of the host and not the last one:

  {
    "cluster_id": "4df40e8d-b28e-4cad-88d3-fa5c37a81939",
    "event_time": "2024-07-25T00:37:15.538Z",
    "host_id": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201",
    "infra_env_id": "f6564380-9d04-47e3-afe9-b348204cf521",
    "message": "Host sno1: updated status from known to preparing-for-installation (Host finished successfully to prepare for installation)",
    "name": "host_status_updated",
    "severity": "info"
  } 

Trying to replicate the behavior on the postgres database, it looks as if there is a maximum of around 50,000 entries and the last one within that limit is shown, something like:

installer=> SELECT * FROM (SELECT * FROM events WHERE host_id LIKE 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201' LIMIT 50000) AS A ORDER BY event_time DESC LIMIT 1;
   id   |         created_at          |         updated_at          | deleted_at | category |              cluster_id              |         event_time         |    
           host_id                |             infra_env_id             |                                                           message                         
                                  |        name         | props |              request_id              | severity 
--------+-----------------------------+-----------------------------+------------+----------+--------------------------------------+----------------------------+----
----------------------------------+--------------------------------------+-------------------------------------------------------------------------------------------
----------------------------------+---------------------+-------+--------------------------------------+----------
 170052 | 2024-07-29 04:41:53.4572+00 | 2024-07-29 04:41:53.4572+00 |            | user     | 4df40e8d-b28e-4cad-88d3-fa5c37a81939 | 2024-07-29 04:41:53.457+00 | aaa
aaaaa-aaaa-aaaa-aaaa-aaaaaaaa0201 | f6564380-9d04-47e3-afe9-b348204cf521 | Host sno1: updated status from known to preparing-for-installation (Host finished successf
ully to prepare for installation) | host_status_updated |       | 872c267a-499e-4b91-8bbb-fdc7ff4521aa | info
 

 

Expected results:

The user should be able to directly see the latest events in the eventsURL; in this scenario, they would all be from the 09/09 installation and not from July

Description of problem:

[AWS-EBS-CSI-Driver] allocatable volumes count incorrect in csinode for AWS vt1*/g4* instance types    

Version-Release number of selected component (if applicable):

 4.17.0-0.nightly-2024-07-16-033047   

How reproducible:

 Always

Steps to Reproduce:

1. Use instance type "vt1.3xlarge"/"g4ad.xlarge"/"g4dn.xlarge" to install an OpenShift cluster on AWS

2. Check the csinode allocatable volumes count 
$ oc get csinode ip-10-0-53-225.ec2.internal -ojsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'
26

g4ad.xlarge # 25 
g4dn.xlarge # 25
vt1.3xlarge # 26                                                              

$ oc get no/ip-10-0-53-225.ec2.internal -oyaml| grep 'instance-type'
    beta.kubernetes.io/instance-type: vt1.3xlarge
    node.kubernetes.io/instance-type: vt1.3xlarge
3. Create a StatefulSet with PVCs (which use the EBS CSI StorageClass), with node affinity to the same node, and set the replicas to the max allocatable volumes count to verify that the csinode allocatable volumes count is correct and all the pods become Running 

# Test data
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: statefulset-vol-limit
spec:
  serviceName: "my-svc"
  replicas: 26
  selector:
    matchLabels:
      app: my-svc
  template:
    metadata:
      labels:
        app: my-svc
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - ip-10-0-53-225.ec2.internal # Make all volume attach to the same node
      containers:
      - name: openshifttest
        image: quay.io/openshifttest/hello-openshift@sha256:56c354e7885051b6bb4263f9faa58b2c292d44790599b7dde0e49e7c466cf339
        volumeMounts:
        - name: data
          mountPath: /mnt/storage
      tolerations:
        - key: "node-role.kubernetes.io/master"
          effect: "NoSchedule"
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      #storageClassName: gp3-csi
      resources:
        requests:
          storage: 1Gi

Actual results:

In step 3, some pods are stuck in "ContainerCreating" status because volumes are stuck in attaching status and couldn't be attached to the node    

Expected results:

 In step 3 all the pods with PVCs should become "Running", and in step 2 the csinode allocatable volumes count should be correct

-> g4ad.xlarge allocatable count should be 24
-> g4dn.xlarge allocatable count should be 24
-> vt1.3xlarge allocatable count should be 24   

Additional info:

  ...
attach or mount volumes: unmounted volumes=[data12 data6], unattached volumes=[data12 data6], failed to process volumes=[]: timed out waiting for the condition
06-25 17:51:23.680      Warning  FailedAttachVolume      4m1s (x13 over 14m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-d08d4133-f589-4aa3-bbef-f988058c419a" : rpc error: code = Internal desc = Could not attach volume "vol-0aa138f453d414ec3" to node "i-09d532f5155b3c05d": attachment of disk "vol-0aa138f453d414ec3" failed, expected device to be attached but was attaching
06-25 17:51:23.681      Warning  FailedMount             3m40s (x3 over 10m)  kubelet                  Unable to attach or mount volumes: unmounted volumes=[data6 data12], unattached volumes=[data12 data6], failed to process volumes=[]: timed out waiting for the condition
...  

The story is to track i18n upload/download routine tasks which are performed every sprint. 

 

A.C.

  - Upload strings to Memsource at the start of the sprint and reach out to the localization team

  - Download translated strings from Memsource when it is ready

  -  Review the translated strings and open a pull request

  -  Open a followup story for next sprint

Description of problem:

  Numbers input into NumberSpinnerField that are above 2147483647 are not accepted as integers  

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    Always

Steps to Reproduce:

    1. Enter a number larger than 2147483647 into any NumberSpinnerField
    

Actual results:

    Number is not accepted as an integer

Expected results:

    There should be a separate validation error stating the number should be less than 2147483647

Additional info:

    See https://github.com/openshift/console/pull/14084

Description of problem:

On the NetworkPolicies page, select the MultiNetworkPolicies tab and create a policy; the created policy is not a MultiNetworkPolicy but a NetworkPolicy.

Version-Release number of selected component (if applicable):

4.17

How reproducible:

100%

Steps to Reproduce:

1. Create a MultiNetworkPolicy
2.
3.

Actual results:

The policy is a NetworkPolicy, not MultiNetworkPolicy

Expected results:

The created policy is a MultiNetworkPolicy

Additional info:
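
For reference, the created object should be of the MultiNetworkPolicy kind, along the lines of this minimal example (the names and the policy-for network are illustrative assumptions):

apiVersion: k8s.cni.cncf.io/v1beta1
kind: MultiNetworkPolicy
metadata:
  name: example-policy
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/policy-for: default/macvlan-net
spec:
  podSelector: {}
  policyTypes:
  - Ingress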


TRT has detected a consistent long term trend where the oauth-apiserver appears to have more disruption than it did in 4.16, for minor upgrades on azure.

The problem appears roughly above the 90th percentile; we picked it up at P95, where it shows a consistent 5-8s more than we'd expect given the data at 4.16 GA.

The problem hits ONLY oauth, affecting both new and reused connections, as well as the cached variants meaning etcd should be out of the picture. You'll see a few very short blips where all four of these backends lose connectivity for ~1s throughout the run, several times over. It looks like it may be correlated to the oauth-operator reporting:

source/OperatorAvailable display/true condition/Available reason/APIServices_Error status/False APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https" [2s]

Sample jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669509775822848
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669509775822848/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122854.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827077182283845632
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827077182283845632/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240823-212127.json&overrideDisplayFlag=1&selectedSources=OperatorDegraded&selectedSources=EtcdLog&selectedSources=Disruption&selectedSources=E2EFailed

More can be found using the first link to the dashboard in this post and scrolling down to most recent job runs, and looking for high numbers.

The operator Degraded condition is probably the strongest symptom to pursue, as it appears in most of the above.

If you find any runs where other backends are disrupted, especially kube-api, I would suggest ignoring those as they are unlikely to be the same fingerprint as the error being described here.

Tracker issue for bootimage bump in 4.18. This issue should block issues which need a bootimage bump to fix.

The previous bump was OCPBUGS-41259.

Description of problem:

    The message of the co olm Upgradeable condition is not correct if a ClusterExtension (without olm.maxOpenShiftVersion) is installed.


Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-10-06-223232

How reproducible:

    always

Steps to Reproduce:

    1.create ClusterCatalog
apiVersion: olm.operatorframework.io/v1alpha1
kind: ClusterCatalog
metadata:
  name: catalog-1
  labels:
    example.com/support: "true"
    provider: olm-1
spec:
  priority: 1000
  source:
    type: image
    image:
      ref: quay.io/openshifttest/nginxolm-operator-index:nginxolm74108
    2. create ns and sa
3. create ClusterExtension
apiVersion: olm.operatorframework.io/v1alpha1
kind: ClusterExtension
metadata:
  name: test-74108
spec:
  source:
    sourceType: Catalog
    catalog:
      packageName: nginx74108
      channels:
        - candidate-v1.1
  install:
    serviceAccount:
      name: sa-74108
    namespace: test-74108   

 4. check co olm status
status:
  conditions:
  - lastTransitionTime: "2024-10-08T11:51:01Z"
    message: 'OLMIncompatibleOperatorControllerDegraded: error with cluster extension
      test-74108: error in bundle nginx74108.v1.1.0: could not convert olm.properties:
      failed to unmarshal properties annotation: unexpected end of JSON input'
    reason: OLMIncompatibleOperatorController_SyncError
    status: "True"
    type: Degraded
  - lastTransitionTime: "2024-10-08T02:16:36Z"
    message: All is well
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2024-10-08T02:16:36Z"
    message: All is well
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2024-10-08T11:48:26Z"
    message: 'InstalledOLMOperatorsUpgradeable: error with cluster extension test-74108:
      error in bundle nginx74108.v1.1.0: could not convert olm.properties: failed
      to unmarshal properties annotation: unexpected end of JSON input'
    reason: InstalledOLMOperators_FailureGettingExtensionMetadata
    status: "False"
    type: Upgradeable
  - lastTransitionTime: "2024-10-08T02:09:59Z"
    reason: NoData
    status: Unknown
    type: EvaluationConditionsDetected


Actual results:

    co olm is Degraded

Expected results:

    co olm is OK

Additional info:

    The below annotation of the CSV is not configured:
olm.properties: '[{"type": "olm.maxOpenShiftVersion", "value": "4.8"}]'
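
A correctly formed annotation, for comparison, would carry valid JSON (the version value here is only an example):

olm.properties: '[{"type": "olm.maxOpenShiftVersion", "value": "4.19"}]'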

Description of problem:

    When the console is loaded, there are errors in the browser's console about failing to fetch networking-console-plugin locales.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    The issue is also affecting console CI

Description of problem:

4.18 EFS controller and node pods are left behind after uninstalling the driver

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-10-08-075347

How reproducible:

Always

Steps to Reproduce:

1. Install the 4.18 EFS operator and driver on the cluster and check that the EFS pods are all up and Running
2. Uninstall the EFS driver and check that the controller and node pods get deleted

Execution on 4.16 and 4.18 clusters 

4.16 cluster

oc create -f og-sub.yaml
oc create -f driver.yaml

oc get pods | grep "efs"
aws-efs-csi-driver-controller-b8858785-72tp9     4/4     Running   0          4s
aws-efs-csi-driver-controller-b8858785-gvk4b     4/4     Running   0          6s
aws-efs-csi-driver-node-2flqr                    3/3     Running   0          9s
aws-efs-csi-driver-node-5hsfp                    3/3     Running   0          9s
aws-efs-csi-driver-node-kxnlv                    3/3     Running   0          9s
aws-efs-csi-driver-node-qdshm                    3/3     Running   0          9s
aws-efs-csi-driver-node-ss28h                    3/3     Running   0          9s
aws-efs-csi-driver-node-v9zwx                    3/3     Running   0          9s
aws-efs-csi-driver-operator-65b55bf877-4png9     1/1     Running   0          2m53s

oc get clustercsidrivers | grep "efs"
efs.csi.aws.com   2m26s

oc delete -f driver.yaml

oc get pods | grep "efs"
aws-efs-csi-driver-operator-65b55bf877-4png9     1/1     Running   0          4m40s

4.18 cluster
oc create -f og-sub.yaml
oc create -f driver.yaml

oc get pods | grep "efs" 
aws-efs-csi-driver-controller-56d68dc976-847lr   5/5     Running   0               9s
aws-efs-csi-driver-controller-56d68dc976-9vklk   5/5     Running   0               11s
aws-efs-csi-driver-node-46tsq                    3/3     Running   0               18s
aws-efs-csi-driver-node-7vpcd                    3/3     Running   0               18s
aws-efs-csi-driver-node-bm86c                    3/3     Running   0               18s
aws-efs-csi-driver-node-gz69w                    3/3     Running   0               18s
aws-efs-csi-driver-node-l986w                    3/3     Running   0               18s
aws-efs-csi-driver-node-vgwpc                    3/3     Running   0               18s
aws-efs-csi-driver-operator-7cc9bf69b5-hj7zv     1/1     Running   0               2m55s

oc get clustercsidrivers 
efs.csi.aws.com   2m19s

oc delete -f driver.yaml

oc get pods | grep "efs"              
aws-efs-csi-driver-controller-56d68dc976-847lr   5/5     Running   0               4m58s
aws-efs-csi-driver-controller-56d68dc976-9vklk   5/5     Running   0               5m
aws-efs-csi-driver-node-46tsq                    3/3     Running   0               5m7s
aws-efs-csi-driver-node-7vpcd                    3/3     Running   0               5m7s
aws-efs-csi-driver-node-bm86c                    3/3     Running   0               5m7s
aws-efs-csi-driver-node-gz69w                    3/3     Running   0               5m7s
aws-efs-csi-driver-node-l986w                    3/3     Running   0               5m7s
aws-efs-csi-driver-node-vgwpc                    3/3     Running   0               5m7s
aws-efs-csi-driver-operator-7cc9bf69b5-hj7zv     1/1     Running   0               7m44s

oc get clustercsidrivers  | grep "efs" => Nothing is there

Actual results:

The EFS controller and node pods are left behind

Expected results:

After uninstalling the driver, the EFS controller and node pods should get deleted

Additional info:

 On 4.16 cluster this is working fine

EFS Operator logs:

oc logs aws-efs-csi-driver-operator-7cc9bf69b5-hj7zv
E1009 07:13:41.460469       1 base_controller.go:266] "LoggingSyncer" controller failed to sync "key", err: clustercsidrivers.operator.openshift.io "efs.csi.aws.com" not found

Discussion: https://redhat-internal.slack.com/archives/C02221SB07R/p1728456279493399 
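
The driver.yaml used above is presumably a ClusterCSIDriver along these lines (the managementState value is an assumption):

apiVersion: operator.openshift.io/v1
kind: ClusterCSIDriver
metadata:
  name: efs.csi.aws.com
spec:
  managementState: Managed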

Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/249

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Go to the NetworkPolicies page and make sure there are policies in each tab.
Go to the MultiNetworkPolicies tab and create a filter, then move to the first tab (the NetworkPolicies tab); it does not show the policies any more.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1. Have policies on NetworkPolicies tab and MultiNetworkPolicies tab
2. Create a filter on MultiNetworkPolicies tab
3. Go to NetworkPolicies tab 

Actual results:

It shows "Not found"

Expected results:

The list of NetworkPolicies shows up

Additional info:


Description of problem:

The external network ID should be an optional CLI option, but when it is not given, the HyperShift Operator crashes with a nil pointer error.

Version-Release number of selected component (if applicable):

    4.18 and 4.17
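
A minimal sketch of the kind of guard that would avoid the crash; the function and field names are illustrative assumptions, not the actual HyperShift code:

package main

import "fmt"

// externalNetworkOrDefault treats the external network ID as optional and
// avoids dereferencing a nil pointer when the CLI option is not given.
func externalNetworkOrDefault(externalNetworkID *string) string {
	if externalNetworkID == nil || *externalNetworkID == "" {
		return "" // caller falls back to discovery instead of panicking
	}
	return *externalNetworkID
}

func main() {
	var unset *string
	fmt.Printf("unset: %q\n", externalNetworkOrDefault(unset))
	id := "b3f1c2d4-example"
	fmt.Printf("set:   %q\n", externalNetworkOrDefault(&id))
}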

Description of problem:

Creation of a Pipeline through Import from Git using a devfile repo does not work
    

Version-Release number of selected component (if applicable):


    

How reproducible:

Everytime
    

Steps to Reproduce:

    1. Create a pipeline from import from git form using devfile repo `https://github.com/nodeshift-starters/devfile-sample.git`
    2. Check pipelines page
    3.
    

Actual results:

No Pipeline is created; instead, a BuildConfig is created for it
    

Expected results:

If the Pipeline option is shown in the Import from Git form for a repo, the Pipeline should be generated
    

Additional info:


    

Description of problem:

Completions column values need to be marked for translation.

Version-Release number of selected component (if applicable):

4.12.0

How reproducible:

 

Steps to Reproduce:

1. Navigate to Workloads - Jobs
2. Values under Completions column are in English
3.

Actual results:

Content is in English

Expected results:

Content should be in target language

Additional info:

screenshot provided

arm64 has been dev preview for CNV since 4.14. The installer shouldn't block installing it.

Just make sure it is shown in the UI as dev preview.

In all releases tested, in particular, 4.16.0-0.okd-scos-2024-08-21-155613, Samples operator uses incorrect templates, resulting in following alert:

Samples operator is detecting problems with imagestream image imports. You can look at the "openshift-samples" ClusterOperator object for details. Most likely there are issues with the external image registry hosting the images that needs to be investigated. Or you can consider marking samples operator Removed if you do not care about having sample imagestreams available. The list of ImageStreams for which samples operator is retrying imports: fuse7-eap-openshift fuse7-eap-openshift-java11 fuse7-java-openshift fuse7-java11-openshift fuse7-karaf-openshift-jdk11 golang httpd java jboss-datagrid73-openshift jboss-eap-xp3-openjdk11-openshift jboss-eap-xp3-openjdk11-runtime-openshift jboss-eap-xp4-openjdk11-openshift jboss-eap-xp4-openjdk11-runtime-openshift jboss-eap74-openjdk11-openshift jboss-eap74-openjdk11-runtime-openshift jboss-eap74-openjdk8-openshift jboss-eap74-openjdk8-runtime-openshift jboss-webserver57-openjdk8-tomcat9-openshift-ubi8 jenkins jenkins-agent-base mariadb mysql nginx nodejs perl php postgresql13-for-sso75-openshift-rhel8 postgresql13-for-sso76-openshift-rhel8 python redis ruby sso75-openshift-rhel8 sso76-openshift-rhel8 fuse7-karaf-openshift jboss-webserver57-openjdk11-tomcat9-openshift-ubi8 postgresql

For example, the sample image for Mysql 8.0 is being pulled from registry.redhat.io/rhscl/mysql-80-rhel7:latest (and cannot be found using the dummy pull secret).

Works correctly on OKD FCOS builds.

The openshift-apiserver is using Kube 1.29.2. However, library-go has been bumped to 1.30.1.

In order to have smooth vendoring of changes from library-go to openshift-apiserver, both repositories should use the same Kube version.
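
As an illustration (the module versions below are assumptions derived from the stated Kube levels), aligning the openshift-apiserver go.mod with library-go means bumping the k8s.io staging modules to the same minor version:

require (
    k8s.io/api v0.30.1
    k8s.io/apimachinery v0.30.1
    k8s.io/client-go v0.30.1
)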

Please review the following PR: https://github.com/openshift/cloud-provider-vsphere/pull/76

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/216

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:


In GetMirrorFromRelease() https://github.com/openshift/installer/blob/master/pkg/asset/agent/mirror/registriesconf.go#L313-L328, the agent installer sets the mirror for the release image based on the source url.

This setting is then used in assisted-service to extract images etc. https://github.com/openshift/assisted-service/blob/master/internal/oc/release.go#L328-L340 in conjunction with the icsp file.

The problem is that GetMirrorFromRelease() returns just the first entry in registries.conf, so it's not really the actual mirror when a source has multiple mirrors. A better way to handle this would be to not set the env variable OPENSHIFT_INSTALL_RELEASE_IMAGE_MIRROR and just let the resolving of the mirror be handled by the icsp file. It currently uses the icsp file, but since the source has been changed to the mirror, it might not use these mirrors if, for example, the first mirror does not have the manifest file.

We've had an internal report of a failure when using mirroring:

Oct 01 10:06:16 master-0 agent-register-cluster[7671]: time="2024-10-01T14:06:16Z" level=fatal msg="Failed to register cluster with assisted-service: command 'oc adm release info -o template --template '{{.metadata.version}}' --insecure=true --icsp-file=/tmp/icsp-file2810072099 registry.hub.qe.lab.redhat.com:5000/openshift-release-dev/ocp-v4.0-art-dev@sha256:6078cb4ae197b5b0c526910363b8aff540343bfac62ecb1ead9e068d541da27b --registry-config=/tmp/registry-config204889789' exited with non-zero exit code 1: \nFlag --icsp-file has been deprecated, support for it will be removed in a future release. Use --idms-file instead.\nerror: image \"registry.hub.qe.lab.redhat.com:5000/openshift-release-dev/ocp-v4.0-art-dev@sha256:6078cb4ae197b5b0c526910363b8aff540343bfac62ecb1ead9e068d541da27b\" not found: manifest unknown: manifest unknown\n"

When using the mirror config:

[[registry]]
  location = "quay.io/openshift-release-dev/ocp-release"
  mirror-by-digest-only = true
  prefix = ""

  [[registry.mirror]]
    location = "registry.hub.qe.lab.redhat.com:5000/openshift-release-dev/ocp-v4.0-art-dev"

  [[registry.mirror]]
    location = "registry.hub.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release"

[[registry]]
  location = "quay.io/openshift-release-dev/ocp-v4.0-art-dev"
  mirror-by-digest-only = true
  prefix = ""

  [[registry.mirror]]
    location = "registry.hub.qe.lab.redhat.com:5000/openshift-release-dev/ocp-v4.0-art-dev"

  [[registry.mirror]]
    location = "registry.hub.qe.lab.redhat.com:5000/openshift-release-dev/ocp-release"

Description of problem:

The library-sync.sh script may leave some files of the unsupported samples in the checkout. In particular, files that have been renamed are not deleted even though they should have been.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Run library-sync.sh

Actual results:

A couple of files under assets/operator/ocp-x86_64/fis are present.    

Expected results:

The directory should not be present at all, because it is not supported.    

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-operator/pull/268

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/network-tools/pull/133

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-openshift-apiserver-operator/pull/586

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

https://github.com/search?q=repo%3Aopenshift%2Fconsole+name+%3D%3D%3D+%27%7Enew%27&type=code shows a number of instances in Console code where there is a check for a resource name with a value of "~new".  This check is not valid as a resource name cannot include "~".  We should remove these invalid checks.

Component Readiness has found a potential regression in the following test:

[bz-Routing] clusteroperator/ingress should not change condition/Available

Probability of significant regression: 97.63%

Sample (being evaluated) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-09-09T23:59:59Z
Success Rate: 89.29%
Successes: 25
Failures: 3
Flakes: 0

Base (historical) Release: 4.16
Start Time: 2024-05-28T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 67
Failures: 0
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=upi&Installer=upi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=vsphere&Platform=vsphere&Scheduler=default&SecurityMode=default&Suite=serial&Suite=serial&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-28%2000%3A00%3A00&capability=Operator&columnGroupBy=Architecture%2CNetwork%2CPlatform&component=Networking%20%2F%20router&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20upi%20ovn%20vsphere%20serial%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-09-09%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-09-01%2000%3A00%3A00&testId=openshift-tests%3Ab690e68fb6372a8924d84a0d6aa2f552&testName=%5Bbz-Routing%5D%20clusteroperator%2Fingress%20should%20not%20change%20condition%2FAvailable

 

It is worth mentioning that in two of the three failures, ingress operator went available=false at the same time image registry went available=false. This is one example.

 

The team can investigate, and if a legitimate reason exists, please create an exception with origin and address it at the proper time: https://github.com/openshift/origin/blob/4557bdcecc10d9fa84188c1e9a36b1d7d162c393/pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go#L90

 

Since this is appearing on the Component Readiness dashboard and management depends on a green dashboard to make release decisions, please give the initial investigation a high priority. If an exception is needed, please contact the TRT team to triage the issue. 

We are aiming to find containers that restart more than 3 times over the course of an e2e test. Critical pods like metal3-static-ip-set should not be restarting more than 3 times over the course of a test.

Can your team investigate this and aim to fix it?

For now, we will exclude our test from failing.

See https://search.dptools.openshift.org/?search=restarted+.*+times+at%3A&maxAge=168h&context=1&type=junit&name=4.18&excludeName=okd%7Csingle&maxMatches=5&maxBytes=20971520&groupBy=job

for an example of how often this container restarts over the course of a test.

Description of problem:

See attached screenshots. Different operator versions have different descriptions, but OperatorHub still shows the same description whichever operator version is selected.

Version-Release number of selected component (if applicable):

    OCP 4.16

How reproducible:

    Always

Steps to Reproduce:

    1. Open OperatorHub and find the Sail operator
    2. Select the Sail operator
    3. Choose different versions and channels
    

Actual results:

    Description is always the same even though actual description for given version is different.

Expected results:

Expected behavior: when selecting different operator versions during installation, the description should be updated according to the selected operator version. 

Additional info:

    See attachments in original issue https://issues.redhat.com/browse/OPECO-3239

Description of problem:

After upgrading the cluster to 4.15, the Prometheus Operator's "Prometheus" tab does not show the Prometheus instances; they can still be viewed and accessed through the "All instances" tab

Version-Release number of selected component (if applicable):

OCP v4.15

Steps to Reproduce:

    1. Install prometheus operator from operator hub
    2. create prometheus instance
    3. The instance is visible under the "All instances" tab, not under the "Prometheus" tab
    

Actual results:

The Prometheus instance is visible in the "All instances" tab only

Expected results:

The Prometheus instance should be visible in the "All instances" tab as well as in the "Prometheus" tab

Description of problem:

The Azure cloud node manager uses a service account with a cluster role attached that provides it with cluster wide permissions to update Node objects.

This means, were the service account to become compromised, Node objects could be maliciously updated.

To limit the blast radius of a leak, we should determine if there is a way to limit the Azure Cloud Node Manager to only be able to update the node on which it resides, or to move its functionality centrally within the cluster.

Possible paths:
* Check upstream progress for any attempt to move the node manager role into the CCM
* See if we can re-use kubelet credentials as these are already scoped to updating only the Node on which they reside
* See if there's another admission control method we can use to limit the updates (possibly https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/) 

Version-Release number of selected component (if applicable):

4.14.0

How reproducible:

Always

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

In Azure Stack, the Azure-Disk CSI Driver node pods are in CrashLoopBackOff:

openshift-cluster-csi-drivers                      azure-disk-csi-driver-node-57rxv                                      1/3     CrashLoopBackOff   33 (3m55s ago)   59m     10.0.1.5       ci-op-q8b6n4iv-904ed-kp5mv-worker-mtcazs-m62cj   <none>           <none>
openshift-cluster-csi-drivers                      azure-disk-csi-driver-node-8wvqm                                      1/3     CrashLoopBackOff   35 (29s ago)     67m     10.0.0.6       ci-op-q8b6n4iv-904ed-kp5mv-master-1              <none>           <none>
openshift-cluster-csi-drivers                      azure-disk-csi-driver-node-97ww5                                      1/3     CrashLoopBackOff   33 (12s ago)     67m     10.0.0.7       ci-op-q8b6n4iv-904ed-kp5mv-master-2              <none>           <none>
openshift-cluster-csi-drivers                      azure-disk-csi-driver-node-9hzw9                                      1/3     CrashLoopBackOff   35 (108s ago)    59m     10.0.1.4       ci-op-q8b6n4iv-904ed-kp5mv-worker-mtcazs-gjqmw   <none>           <none>
openshift-cluster-csi-drivers                      azure-disk-csi-driver-node-glgzr                                      1/3     CrashLoopBackOff   34 (69s ago)     67m     10.0.0.8       ci-op-q8b6n4iv-904ed-kp5mv-master-0              <none>           <none>
openshift-cluster-csi-drivers                      azure-disk-csi-driver-node-hktfb                                      2/3     CrashLoopBackOff   48 (63s ago)     60m     10.0.1.6       ci-op-q8b6n4iv-904ed-kp5mv-worker-mtcazs-kdbpf   <none>           <none>
The CSI-Driver container log:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xc8 pc=0x18ff5db]
goroutine 228 [running]:
sigs.k8s.io/cloud-provider-azure/pkg/provider.(*Cloud).GetZone(0xc00021ec00, {0xc0002d57d0?, 0xc00005e3e0?})
 /go/src/github.com/openshift/azure-disk-csi-driver/vendor/sigs.k8s.io/cloud-provider-azure/pkg/provider/azure_zones.go:182 +0x2db
sigs.k8s.io/azuredisk-csi-driver/pkg/azuredisk.(*Driver).NodeGetInfo(0xc000144000, {0x21ebbf0, 0xc0002d5470}, 0x273606a?)
 /go/src/github.com/openshift/azure-disk-csi-driver/pkg/azuredisk/nodeserver.go:336 +0x13b
github.com/container-storage-interface/spec/lib/go/csi._Node_NodeGetInfo_Handler.func1({0x21ebbf0, 0xc0002d5470}, {0x1d71a60?, 0xc0003b0320})
 /go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:7160 +0x72
sigs.k8s.io/azuredisk-csi-driver/pkg/csi-common.logGRPC({0x21ebbf0, 0xc0002d5470}, {0x1d71a60?, 0xc0003b0320?}, 0xc0003b0340, 0xc00050ae10)
 /go/src/github.com/openshift/azure-disk-csi-driver/pkg/csi-common/utils.go:80 +0x409
github.com/container-storage-interface/spec/lib/go/csi._Node_NodeGetInfo_Handler({0x1ec2f40?, 0xc000144000}, {0x21ebbf0, 0xc0002d5470}, 0xc000054680, 0x20167a0)
 /go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:7162 +0x135
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000530000, {0x21ebbf0, 0xc0002d53b0}, {0x21f5f40, 0xc00057b1e0}, 0xc00011cb40, 0xc00052c810, 0x30fa1c8, 0x0)
 /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1343 +0xe03
google.golang.org/grpc.(*Server).handleStream(0xc000530000, {0x21f5f40, 0xc00057b1e0}, 0xc00011cb40)
 /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1737 +0xc4c
google.golang.org/grpc.(*Server).serveStreams.func1.1()
 /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:986 +0x86
created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 260
 /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:997 +0x145  

 

The registrar container log:
E0321 23:08:02.679727       1 main.go:103] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Unavailable desc = error reading from server: EOF, restarting registration container. 

Version-Release number of selected component (if applicable):

    4.16.0-0.nightly-2024-03-21-152650    

How reproducible:

    Seen in the CI profile; a manual install also failed earlier.

Steps to Reproduce:

    See Description     

Actual results:

    Azure-Disk CSI Driver node pod CrashLoopBackOff

Expected results:

    Azure-Disk CSI Driver node pod should be running

Additional info:

    See gather-extra and must-gather: 
https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-azure-stack-ipi-proxy-fips-f2/1770921405509013504/artifacts/azure-stack-ipi-proxy-fips-f2/

Please review the following PR: https://github.com/openshift/image-registry/pull/411

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

CI Disruption during node updates:
4.18 minor and 4.17 micro upgrades started failing with the initial 4.17 payload 4.17.0-0.ci-2024-08-09-225819

4.18 micro upgrade failures began with the initial payload 4.18.0-0.ci-2024-08-09-234503

CI Disruption in the -out-of-change jobs in the nightlies that start with
4.18.0-0.nightly-2024-08-10-011435 and
4.17.0-0.nightly-2024-08-09-223346

The common change in all of those scenarios appears to be:
OCPNODE-2357: templates/master/cri-o: make crun as the default container runtime #4437
OCPNODE-2357: templates/master/cri-o: make crun as the default container runtime #4518

Description of problem:

The openshift-apiserver, which sends traffic through the konnectivity proxy, is also sending traffic intended for the local audit-webhook service through that proxy. The audit-webhook service should be included in the NO_PROXY env var of the openshift-apiserver container.
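For illustration, a minimal Go sketch of the NO_PROXY behavior (not the HyperShift code; it uses golang.org/x/net/http/httpproxy, and the proxy address and NO_PROXY entries are assumptions): once the audit-webhook service name is listed in NoProxy, the proxy resolver returns nil for it and the request bypasses the konnectivity proxy.

package main

import (
	"fmt"
	"net/url"

	"golang.org/x/net/http/httpproxy"
)

func main() {
	// Hypothetical proxy settings for the openshift-apiserver container.
	// Listing "audit-webhook" in NoProxy means requests to the local
	// audit-webhook service bypass the konnectivity proxy sidecar.
	cfg := &httpproxy.Config{
		HTTPSProxy: "http://127.0.0.1:8090", // assumed konnectivity-proxy sidecar address
		NoProxy:    "audit-webhook,localhost,127.0.0.1",
	}
	proxyFunc := cfg.ProxyFunc()

	for _, target := range []string{"https://audit-webhook:443", "https://kubernetes.default.svc"} {
		u, _ := url.Parse(target)
		proxyURL, _ := proxyFunc(u)
		fmt.Printf("%s -> proxy: %v\n", target, proxyURL) // nil means the proxy is bypassed
	}
}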

    

Version-Release number of selected component (if applicable):

4.14.z, 4.15.z, 4.16.z

How reproducible:

    Always

Steps to Reproduce:

    1. Create a rosa hosted cluster
    2. Observe logs of the konnectivity-proxy sidecar of openshift-apiserver
    3.
    

Actual results:

     Logs include requests to the audit-webhook local service

    

Expected results:

      Logs do not include requests to audit-webhook 
    

Additional info:


    

Slack thread asking apiserver team

We saw excess pathological events that failed aggregated jobs in AWS and GCP for 4.18.0-0.ci-2024-09-26-062917 (Azure has them too and now failed in 4.18.0-0.nightly-2024-09-26-093014). The events are in namespace/openshift-apiserver-operator and namespace/openshift-authentication-operator – reason/DeploymentUpdated Updated Deployment.apps/apiserver -n openshift-oauth-apiserver because it changed
Examples:

Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/127

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/azure-workload-identity/pull/24

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of the problem:

After a successful deployment, trying to delete the spoke resources.
BMHs are not being removed and are stuck.

How reproducible:

Always

Steps to reproduce:

1. Deploy spoke node (tested in disconnected + IPv6, but CI also fails on IPv4)

2. Try to delete BMH (after deleting agents)

3.

Actual results:

BMH is still in provisioned state and not being deleted.
From assisted logs:
-------
time="2024-09-20T21:02:23Z" level=error msg="failed to delete BMH" func=github.com/openshift/assisted-service/internal/controller/controllers.removeSpokeResources file="/remote-source/assisted-service/app/internal/controller/controllers/agent_controller.go:450" agent=6df557e8-00af-4377-ac93-096b66c8e3c6 agent_namespace=spoke-0 error="failed to remove BMH openshift-machine-api/spoke-worker-0-1 finalizers: Internal error occurred: failed calling webhook \"baremetalhost.metal3.io\": failed to call webhook: Post \"https://baremetal-operatf557e8-00af-4377-ac93-096b66c8e3c6 agent_namespace=spoke-0 error="failed to remove BMH openshift-machine-api/spoke-worker-0-1 finalizers: Internal error occurred: failed calling webhook \"baremetalhost.metal3.io\": failed to call webhook: Post \"https://baremetal-operator-webhook-service.openshift-machine-api.svc:443/validate-metal3-io-v1alpha1-baremetalhost?timeout=10s\": no endpoints available for service \"baremetal-operator-webhook-service\"" go-id=393 hostname=spoke-worker-0-1 machine=spoke-0-f9w48-worker-0-x484f machine_namespace=openshift-machine-api machine_set=spoke-0-f9w48-worker-0 node=spoke-w
--------

Expected results:
BMH should be deleted

must-gather: https://drive.google.com/file/d/1JOeDGTzQNgDy9ZdjlJMcRi-hksB6Iz9h/view?usp=drive_link 

Description of the problem:

[Staging] BE 2.35.0, UI 2.34.2 - BE allows LVMS and ODF to be enabled

How reproducible:

100%

Steps to reproduce:

1.

Actual results:

 

Expected results:

Please review the following PR: https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/364

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-etcd-operator/pull/1332

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The assisted service is throwing an error message stating that the Cloud Controller Manager (CCM) is not enabled, even though the CCM value is correctly set in the install-config file.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-10-19-045205

How reproducible:

Always

Steps to Reproduce:

    1. Prepare install-config and agent-config for external OCI platform.
      example of install-config configuration
.......
.......
platform:
  external:
    platformName: oci
    cloudControllerManager: External
.......
.......
    2. Create agent ISO for external OCI platform     
    3. Boot up nodes using created agent ISO     

Actual results:

Oct 21 16:40:47 agent-sno.private.agenttest.oraclevcn.com service[2829]: time="2024-10-21T16:40:47Z" level=info msg="Register cluster: agenttest with id 2666753a-0485-420b-b968-e8732da6898c and params {\"api_vips\":[],\"base_dns_domain\":\"abitest.oci-rhelcert.edge-sro.rhecoeng.com\",\"cluster_networks\":[{\"cidr\":\"10.128.0.0/14\",\"host_prefix\":23}],\"cpu_architecture\":\"x86_64\",\"high_availability_mode\":\"None\",\"ingress_vips\":[],\"machine_networks\":[{\"cidr\":\"10.0.0.0/20\"}],\"name\":\"agenttest\",\"network_type\":\"OVNKubernetes\",\"olm_operators\":null,\"openshift_version\":\"4.18.0-0.nightly-2024-10-19-045205\",\"platform\":{\"external\":{\"cloud_controller_manager\":\"\",\"platform_name\":\"oci\"},\"type\":\"external\"},\"pull_secret\":\"***\",\"schedulable_masters\":false,\"service_networks\":[{\"cidr\":\"172.30.0.0/16\"}],\"ssh_public_key\":\"ssh-rsa XXXXXXXXXXXX\",\"user_managed_networking\":true,\"vip_dhcp_allocation\":false}" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).RegisterClusterInternal" file="/src/internal/bminventory/inventory.go:515" cluster_id=2666753a-0485-420b-b968-e8732da6898c go-id=2110 pkg=Inventory request_id=82e83b31-1c1b-4dea-b435-f7316a1965e

Expected results:

The cluster installation should be successful. 

Description of problem:

When doing a mirror-to-mirror, oc-mirror counts the operator catalog images twice:
 ✓   70/81 : (3s) docker://registry.redhat.io/redhat/community-operator-index:v4.15 
 ✓   71/81 : (3s) docker://registry.redhat.io/redhat/redhat-marketplace-index:v4.15 
 ✓   72/81 : (3s) docker://registry.redhat.io/redhat/community-operator-index:v4.15 
2024/09/06 04:55:05  [INFO]   : Mirroring is ongoing. No errors.
 ✓   73/81 : (0s) oci:///test/ibm-catalog 
 ✓   74/81 : (2s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.15 
 ✓   75/81 : (2s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.15 
 ✓   76/81 : (3s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.14 
 ✓   77/81 : (1s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.14 
 ✓   78/81 : (3s) docker://registry.redhat.io/redhat/certified-operator-index:v4.15 
 ✓   79/81 : (2s) docker://registry.redhat.io/redhat/certified-operator-index:v4.15 
 ✓   80/81 : (2s) docker://registry.redhat.io/redhat/redhat-marketplace-index:v4.15

 

Version-Release number of selected component (if applicable):

   oc-mirror version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.0.0-unknown-7b0b3bf2", GitCommit:"7b0b3bf2", GitTreeState:"clean", BuildDate:"2024-09-06T01:32:29Z", GoVersion:"go1.23.0", Compiler:"gc", Platform:"linux/amd64"}      

How reproducible:

     Always
    

Steps to Reproduce:

    1. do mirror2mirror with following imagesetconfig: 
cat config-136.yaml 
apiVersion: mirror.openshift.io/v2alpha1
kind: ImageSetConfiguration
mirror:
  operators:
  - catalog: oci:///test/ibm-catalog
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    packages:
    - name: windows-machine-config-operator
    - name: cluster-kube-descheduler-operator
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.14
    packages:
    - name: servicemeshoperator 
    - name: windows-machine-config-operator
  - catalog: registry.redhat.io/redhat/certified-operator-index:v4.15
    packages:
    - name: nvidia-network-operator
  - catalog: registry.redhat.io/redhat/community-operator-index:v4.15
    packages:
    - name: skupper-operator
    - name: reportportal-operator
  - catalog: registry.redhat.io/redhat/redhat-marketplace-index:v4.15
    packages:
    - name: dynatrace-operator-rhmp

`oc-mirror -c config-136.yaml docker://localhost:5000/m2m06 --workspace file://m2m6 --v2  --dest-tls-verify=false`

Actual results:

The operator catalog images are counted twice:

 ✓   70/81 : (3s) docker://registry.redhat.io/redhat/community-operator-index:v4.15 
 ✓   71/81 : (3s) docker://registry.redhat.io/redhat/redhat-marketplace-index:v4.15 
 ✓   72/81 : (3s) docker://registry.redhat.io/redhat/community-operator-index:v4.15 
2024/09/06 04:55:05  [INFO]   : Mirroring is ongoing. No errors.
 ✓   73/81 : (0s) oci:///test/ibm-catalog 
 ✓   74/81 : (2s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.15 
 ✓   75/81 : (2s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.15 
 ✓   76/81 : (3s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.14 
 ✓   77/81 : (1s) docker://registry.redhat.io/redhat/redhat-operator-index:v4.14 
 ✓   78/81 : (3s) docker://registry.redhat.io/redhat/certified-operator-index:v4.15 
 ✓   79/81 : (2s) docker://registry.redhat.io/redhat/certified-operator-index:v4.15 
 ✓   80/81 : (2s) docker://registry.redhat.io/redhat/redhat-marketplace-index:v4.15      

Expected results:

   Should count each operator catalog image only once.

Additional info:

    

Description of problem:

The hypershift CLI has an implicit dependency on the az and jq commands, as it invokes them directly. 

As a result, the "hypershift-azure-create" chain will not work since it's based on the hypershift-operator image, which lacks these tools. 

Expected results:

Refactor the hypershift CLI to handle these dependencies in a Go-native way, so that the CLI is self-contained.
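As one illustration of the Go-native direction (a sketch only, not the actual refactor; the JSON structure and field names are made up), output that is currently piped through jq can be decoded with encoding/json instead:

package main

import (
	"encoding/json"
	"fmt"
)

// Instead of shelling out to `az ... | jq -r '.id'`, decode the JSON in-process.
// This struct is a made-up example, not the real `az` output schema.
type resourceGroup struct {
	ID       string `json:"id"`
	Location string `json:"location"`
}

func main() {
	raw := []byte(`{"id": "/subscriptions/123/resourceGroups/example", "location": "eastus"}`)

	var rg resourceGroup
	if err := json.Unmarshal(raw, &rg); err != nil {
		panic(err)
	}
	fmt.Println(rg.ID) // what `jq -r '.id'` would have printed
}

The az calls themselves would similarly be replaced with the Azure SDK for Go, so the CLI no longer depends on external binaries being present in the image.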

Description of problem:

    Circular dependencies in OCP Console prevent migration to Webpack 5 

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    Always

Steps to Reproduce:

    1. Enable the CHECK_CYCLES env var while building
    2. Observe errors
    3.
    

Actual results:

    There are errors

Expected results:

    No errors

Additional info:

    

 

If the network to the bootstrap VM is slow, extract-machine-os.service can time out (after 180s). If this happens, it will be restarted, but services that depend on it (like ironic) will never be started, even once it succeeds. systemd added support for Restart=on-failure for Type=oneshot services, but they still don't behave the same way as other types of services.

This can be simulated in dev-scripts by doing:

sudo tc qdisc add dev ostestbm root netem rate 33Mbit

Please review the following PR: https://github.com/openshift/openshift-apiserver/pull/445

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    When deleting an AWS HostedCluster with endpoint access of type PublicAndPrivate or Private, the VPC endpoint for the HostedCluster is not always cleaned up when the HostedCluster is deleted.

Version-Release number of selected component (if applicable):

4.18.0    

How reproducible:

    Most of the time

Steps to Reproduce:

    1. Create a HostedCluster on AWS with endpoint access PublicAndPrivate
    2. Wait for the HostedCluster to finish deploying
    3. Delete the HostedCluster by deleting the HostedCluster resource (oc delete hostedcluster/[name] -n clusters)    

Actual results:

    The vpc endpoint and/or the DNS entries in the hypershift.local hosted zone that corresponds to the hosted cluster are not removed.

Expected results:

    The vpc endpoint and DNS entries in the hypershift.local hosted zone are deleted when the hosted cluster is cleaned up.

Additional info:

With current code, the namespace is deleted before the control plane operator finishes cleanup of the VPC endpoint and related DNS entries.    

Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver/pull/92

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Pulling an image from GCP Artifact Registry fails

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    100%

Steps to Reproduce:

1. Create repo for gcp artifact registry: zhsun-repo1
 
2. Login to registry
gcloud auth login
gcloud auth configure-docker us-central1-docker.pkg.dev 
    
3. Push image to registry
$ docker pull openshift/hello-openshift
$ docker tag openshift/hello-openshift:latest us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest
$ docker push us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest
4. Create pod
$ oc new-project hello-gcr
$ oc new-app --name hello-gcr --allow-missing-images \  
  --image us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest
5. Check pod status

Actual results:

Pull image failed.
must-gather: https://drive.google.com/file/d/1o9cyJB53vQtHNmL5EV_hIx9I_LzMTB0K/view?usp=sharing
kubelet log: https://drive.google.com/file/d/1tL7HGc4fEOjH5_v6howBpx2NuhjGKsTp/view?usp=sharing
$ oc get po               
NAME                          READY   STATUS             RESTARTS   AGE
hello-gcr-658f7f9869-76ssg    0/1     ImagePullBackOff   0          3h24m

$ oc describe po hello-gcr-658f7f9869-76ssg 
  Warning  Failed          14s (x2 over 15s)  kubelet  Error: ImagePullBackOff
  Normal   Pulling         2s (x2 over 16s)   kubelet  Pulling image "us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest"
  Warning  Failed          1s (x2 over 16s)   kubelet  Failed to pull image "us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest": rpc error: code = Unknown desc = Requesting bearer token: invalid status code from registry 403 (Forbidden)

Expected results:

Pulling the image from Artifact Registry should succeed

Additional info:

gcr.io works as expected. 
us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest doesn't work.
$ oc get po -n hello-gcr            
NAME                          READY   STATUS             RESTARTS   AGE
hello-gcr-658f7f9869-76ssg    0/1     ImagePullBackOff   0          156m
hello-gcr2-6d98c475ff-vjkt5   1/1     Running            0          163m
$ oc get po -n hello-gcr -o yaml | grep image                                                                                                                                       
    - image: us-central1-docker.pkg.dev/openshift-qe/zhsun-repo1/hello-gcr:latest
    - image: gcr.io/openshift-qe/hello-gcr:latest

Description of problem:

On the MultiNetworkPolicies page, the "learn more" link does not work

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1. Go to Networking -> NetworkPolicies -> MultiNetworkPolicies
2.
3.

Actual results:


Expected results:


Additional info:


It has been observed that the esp_offload kernel module might be loaded by libreswan even if bond ESP offloads have been correctly turned off.

This might be because the ipsec service and configure-ovs run at the same time, so it is possible that the ipsec service starts when the bond offloads are not yet turned off, tricking libreswan into thinking they should be used.

The potential fix would be to run the ipsec service after configure-ovs.

Description of problem:

Re-enable the knative and A-04-TC01 tests that were disabled in PR https://github.com/openshift/console/pull/13931

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/prometheus/pull/226

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Update the installer to use commit c6bcd313bce0fc9866e41bb9e3487d9f61c628a3 of cluster-api-provider-ibmcloud.  This includes a couple of necessary Transit Gateway fixes.
    

Context

In order to ease CI builds and Konflux integrations, and to standardise with other observability plugins, we need to migrate away from yarn and use npm.

Outcome

The monitoring plugin uses npm instead of yarn for development and in Dockerfiles

Steps

  • Migrate the yarn.lock into a package-lock.json, check the resolutions that were added to resolve CVEs
  • Update the makefile to remove yarn calls
  • Remove yarn specific installations from the Dockerfiles
  • Update docs that have yarn references

Please review the following PR: https://github.com/openshift/cluster-api/pull/222

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.

Component name: ose-cluster-api-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml

Please review the following PR: https://github.com/openshift/machine-api-provider-nutanix/pull/80

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

oc-mirror should not panic when an invalid loglevel is specified

Version-Release number of selected component (if applicable):

oc-mirror version
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202407291514.p0.gdbf115f.assembly.stream.el9-dbf115f", GitCommit:"dbf115f547a19f12ab72e7b326be219a47d460a0", GitTreeState:"clean", BuildDate:"2024-07-29T15:52:52Z", GoVersion:"go1.22.4 (Red Hat 1.22.4-2.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

100%

Steps to Reproduce:

1. Run command: `oc-mirror -c config-36410.yaml --from file://out36410 docker://quay.io/zhouying7780/36410test --v2  --loglevel -h`

Actual results:

The command panic with error: 
oc-mirror -c config-36410.yaml --from file://out36410 docker://quay.io/zhouying7780/36410test --v2  --loglevel -h
2024/07/31 05:22:41  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/07/31 05:22:41  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/07/31 05:22:41  [INFO]   : ⚙️  setting up the environment for you...
2024/07/31 05:22:41  [INFO]   : 🔀 workflow mode: diskToMirror 
2024/07/31 05:22:41  [ERROR]  : parsing config error parsing local storage configuration : invalid loglevel -h Must be one of [error, warn, info, debug]
panic: StorageDriver not registered: 
goroutine 1 [running]:
github.com/distribution/distribution/v3/registry/handlers.NewApp({0x5634e98, 0x76ea4a0}, 0xc000a7c388)
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/distribution/distribution/v3/registry/handlers/app.go:126 +0x2374
github.com/distribution/distribution/v3/registry.NewRegistry({0x5634e98?, 0x76ea4a0?}, 0xc000a7c388)
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/distribution/distribution/v3/registry/registry.go:141 +0x56
github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).setupLocalStorage(0xc000a78488)
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:571 +0x3c6
github.com/openshift/oc-mirror/v2/internal/pkg/cli.NewMirrorCmd.func1(0xc00090f208, {0xc0007ae300, 0x1, 0x8})
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:201 +0x27f
github.com/spf13/cobra.(*Command).execute(0xc00090f208, {0xc0000520a0, 0x8, 0x8})
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:987 +0xab1
github.com/spf13/cobra.(*Command).ExecuteC(0xc00090f208)
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:1115 +0x3ff
github.com/spf13/cobra.(*Command).Execute(0x74bc8d8?)
	/go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:1039 +0x13
main.main()
	/go/src/github.com/openshift/oc-mirror/cmd/oc-mirror/main.go:10 +0x18

Expected results:

Exit with an error; the command should not panic.
 

 

Please review the following PR: https://github.com/openshift/aws-encryption-provider/pull/21

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Currently, CMO only tests that the plugin Deployment is rolled out with the appropriate config https://github.com/openshift/cluster-monitoring-operator/blob/f7e92e869c43fa0455d656dcfc89045b60e5baa1/test/e2e/config_test.go#L730

The plugin Deployment does not set any readinessProbe; we're missing a check to ensure the plugin is ready to serve requests.

With the new plugin backend, a readiness probe can/will be added (see https://github.com/openshift/cluster-monitoring-operator/pull/2412#issuecomment-2315085438); that will help ensure minimal readiness on payload test flavors.

The CMO test can be more demanding and ask for /plugin-manifest.json
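For illustration, a minimal sketch of such a check: request /plugin-manifest.json from the plugin service and require a 200 response. The service URL and scheme here are assumptions, and this is not the CMO e2e code.

package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Assumed service URL for the monitoring console plugin; a real test
	// would resolve this from the Deployment/Service it already inspects.
	url := "http://monitoring-plugin.openshift-monitoring.svc:9443/plugin-manifest.json"

	resp, err := http.Get(url)
	if err != nil {
		fmt.Println("plugin not ready:", err)
		return
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		fmt.Println("plugin not ready, status:", resp.Status)
		return
	}
	fmt.Println("plugin served /plugin-manifest.json, it is ready to serve requests")
}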

Description of problem:

https://search.dptools.openshift.org/?search=failed+to+configure+the+policy+based+routes+for+network&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job 

 

See:

event happened 183 times, something is wrong: node/ip-10-0-52-0.ec2.internal hmsg/9cff2a8527 - reason/ErrorUpdatingResource error creating gateway for node ip-10-0-52-0.ec2.internal: failed to configure the policy based routes for network "default": invalid host address: 10.0.52.0/18 (17:55:20Z) result=reject

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.

2.

3.

Actual results:

Expected results:

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure
  2. customer issue / SD
  3. internal RedHat testing failure

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (specially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

If it is a CI failure:

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

If it is a customer / SD issue:

  • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
  • Don’t presume that Engineering has access to Salesforce.
  • Do presume that Engineering will access attachments through supportshell.
  • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • When showing the results from commands, include the entire command in the output.  
  • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
  • Do not set the priority, that is owned by Engineering and will be set when the bug is evaluated
  • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”
  • For guidance on using this template please see
    OCPBUGS Template Training for Networking  components

Please review the following PR: https://github.com/openshift/agent-installer-utils/pull/36

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    When the console-operator performs the health check for the active console route, the retry interval is 50ms, which is too short. It should be bumped to at least a couple of seconds, to prevent a burst of requests that would likely return the same result and thus be misleading.
We also need to add additional logging around the health check for better debugging.
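A minimal sketch of what a longer retry interval with per-attempt logging could look like (illustrative only; the endpoint, attempt count, and interval are assumptions, and this is not the console-operator code):

package main

import (
	"fmt"
	"net/http"
	"time"
)

// checkRouteOnce performs a single health check against the (hypothetical) console route URL.
func checkRouteOnce(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	const (
		attempts = 5
		interval = 3 * time.Second // instead of a 50ms retry, wait a few seconds between attempts
	)
	url := "https://console-openshift-console.apps.example.com/health" // assumed endpoint

	var err error
	for i := 1; i <= attempts; i++ {
		if err = checkRouteOnce(url); err == nil {
			fmt.Printf("attempt %d: route healthy\n", i)
			return
		}
		// Log every failed attempt so the health check is easier to debug.
		fmt.Printf("attempt %d: health check failed: %v (retrying in %s)\n", i, err, interval)
		time.Sleep(interval)
	}
	fmt.Printf("route health check did not succeed after %d attempts: %v\n", attempts, err)
}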

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    The GCP environment parameters are missing on a GCP STS environment

Based on feature https://issues.redhat.com/browse/CONSOLE-4176:
If the cluster is in GCP WIF mode and the operator claims support for it, the operator subscription page provides 4 additional fields to configure, which will be set on the Subscription's spec.config.env field.

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-10-23-112324

How reproducible:

    Always

Steps to Reproduce:

    1. Prepare a GCP WIF mode enable cluster
    2. Navigate to Operator Hub page, and selected 'Auth Token GCP' on the Infrastructure features section
    3. Choose one operator and click install button (eg: Web Terminal)
    4. Check the Operator subscription page
       /operatorhub/subscribe?pkg=web-terminal&catalog=redhat-operators&catalogNamespace=openshift-marketplace&targetNamespace=undefined&channel=fast&version=1.11.0&tokenizedAuth=null     

Actual results:

    The functionality for feature CONSOLE-4176 is missing 

Expected results:

    1. The WIF warning message can be shown on the subscription page
    2. The user can set POOL_ID, PROVIDER_ID, SERVICE_ACCOUNT_EMAIL on the page 

Additional info:

    

Please review the following PR: https://github.com/openshift/cloud-provider-aws/pull/94

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Adding an ATB to the HC doesn't create a new user token and doesn't reconcile to the worker nodes    

Version-Release number of selected component (if applicable):

    4.18 nightly

How reproducible:

    100%

Steps to Reproduce:

    1. Create a 4.18 nightly HC
    2. Add an ATB to the HC
    3. Notice there is no new user token 
    

Actual results:

    no new user token generated so no new payload

Expected results:

    new user token generated with new payload

Additional info:

    

Hello Team,

When we deploy the HyperShift cluster with OpenShift Virtualization by specifying the NodePort strategy for services, the requests to ignition, oauth, konnectivity (for oc rsh, oc logs, oc exec), and the virt-launcher-hypershift-node-pool pod fail, because by default the following netpols get created automatically and restrict the traffic on all other ports.

 

$ oc get netpol
NAME                      POD-SELECTOR           AGE
kas                       app=kube-apiserver     153m
openshift-ingress         <none>                 153m
openshift-monitoring      <none>                 153m
same-namespace            <none>                 153m 

I resolved this by manually creating the following NetworkPolicies:

$ cat ingress-netpol
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ingress
spec:
  ingress:
  - ports:
    - port: 31032
      protocol: TCP
  podSelector:
    matchLabels:
      kubevirt.io: virt-launcher
  policyTypes:
  - Ingress


$ cat oauth-netpol
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: oauth
spec:
  ingress:
  - ports:
    - port: 6443
      protocol: TCP
  podSelector:
    matchLabels:
      app: oauth-openshift
      hypershift.openshift.io/control-plane-component: oauth-openshift
  policyTypes:
  - Ingress


$ cat ignition-netpol
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: nodeport-ignition-proxy
spec:
  ingress:
  - ports:
    - port: 8443
      protocol: TCP
  podSelector:
    matchLabels:
      app: ignition-server-proxy
  policyTypes:
  - Ingress


$ cat konn-netpol
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: konn
spec:
  ingress:
  - ports:
    - port: 8091
      protocol: TCP
  podSelector:
    matchLabels:
      app: kube-apiserver
      hypershift.openshift.io/control-plane-component: kube-apiserver
  policyTypes:
  - Ingress

The bug for ignition netpol has already been reported.

--> https://issues.redhat.com/browse/OCPBUGS-39158

--> https://issues.redhat.com/browse/OCPBUGS-39317

 

It would be helpful if these policies were created automatically as well, or if HyperShift provided an option to disable the automatic management of network policies so that we can take care of them manually.

 

Description of problem:

    ose-aws-efs-csi-driver-operator has an invalid 'tools' reference that causes the build to fail.
This issue is due to https://github.com/openshift/csi-operator/pull/252/files#r1719471717

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    "import * as icon from '[...].svg' " imports cause errors on webpack5/rspack (can't convert value to primitive type). They should be rewritten as "import icon from '[...].svg'"

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    Always

Steps to Reproduce:

    1. 
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

 As a follow-up issue of https://issues.redhat.com/browse/OCPBUGS-4496 

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-09-08-130531    

How reproducible:

    Always

Steps to Reproduce:

    1. create a ConfigMap ConsoleYAMLSample without 'snippet: true'
apiVersion: console.openshift.io/v1
kind: ConsoleYAMLSample
metadata:
  name: cm-example-without-snippet
spec:
  targetResource:
    apiVersion: v1
    kind: ConfigMap
  title: Example ConfigMap
  description: An example ConfigMap YAML sample
  yaml: |
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: game-demo
    data:
      player_initial_lives: "3"
      ui_properties_file_name: "user-interface.properties"
      game.properties: |
        enemy.types=aliens,monsters
        player.maximum-lives=5    
      user-interface.properties: |
        color.good=purple
        color.bad=yellow
        allow.textmode=true
    2. goes to ConfigMap creation page -> YAML view
    3. create a ConfigMap ConsoleYAMLSample WITH 'snippet: true'
apiVersion: console.openshift.io/v1
kind: ConsoleYAMLSample
metadata:
  name: cm-example-without-snippet
spec:
  targetResource:
    apiVersion: v1
    kind: ConfigMap
  title: Example ConfigMap
  description: An example ConfigMap YAML sample
  snippet: true
  yaml: |
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: game-demo
    data:
      player_initial_lives: "3"
      ui_properties_file_name: "user-interface.properties"
      game.properties: |
        enemy.types=aliens,monsters
        player.maximum-lives=5    
      user-interface.properties: |
        color.good=purple
        color.bad=yellow
        allow.textmode=true
    4. goes to ConfigMap creation page -> YAML view

Actual results:

2. Sample tab doesn't show up
4. Snippet tab appears    

Expected results:

2. Sample tab should show up when there is no snippet: true 

Additional info:

    

Description of problem:

Test Platform has detected a large increase in the amount of time spent waiting for pull secrets to be initialized.
Monitoring the audit log, we can see nearly continuous updates to the SA pull secrets in the cluster (~2 per minute for every SA pull secret in the cluster).

Controller manager is filled with entries like: 
- "Internal registry pull secret auth data does not contain the correct number of entries" ns="ci-op-tpd3xnbx" name="deployer-dockercfg-p9j54" expected=5 actual=4"
- "Observed image registry urls" urls=["172.30.228.83:5000","image-registry.openshift-image-registry.svc.cluster.local:5000","image-registry.openshift-image-registry.svc:5000","registry.build01.ci.openshift.org","registry.build01.ci.openshift.org"

In this "Observed image registry urls" log line, notice the duplicate entries for "registry.build01.ci.openshift.org" . We are not sure what is causing this but it leads to duplicate entry, but when actualized in a pull secret map, the double entry is reduced to one. So the controller-manager finds the cardinality mismatch on the next check.

The duplication is evident in OpenShiftControllerManager/cluster:
      dockerPullSecret:
        internalRegistryHostname: image-registry.openshift-image-registry.svc:5000
        registryURLs:
        - registry.build01.ci.openshift.org
        - registry.build01.ci.openshift.org


But there is only one hostname in config.imageregistry.operator.openshift.io/cluster:
  routes:
  - hostname: registry.build01.ci.openshift.org
    name: public-routes
    secretName: public-route-tls
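Assuming the observed registry URLs are meant to be treated as a set, a minimal de-duplication sketch in Go (illustrative only, not the openshift-controller-manager implementation):

package main

import "fmt"

// dedupe returns the input slice with duplicate entries removed, preserving order.
func dedupe(urls []string) []string {
	seen := make(map[string]struct{}, len(urls))
	out := make([]string, 0, len(urls))
	for _, u := range urls {
		if _, ok := seen[u]; ok {
			continue
		}
		seen[u] = struct{}{}
		out = append(out, u)
	}
	return out
}

func main() {
	observed := []string{
		"registry.build01.ci.openshift.org",
		"registry.build01.ci.openshift.org", // duplicate observed in the config
	}
	fmt.Println(dedupe(observed)) // [registry.build01.ci.openshift.org]
}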

Version-Release number of selected component (if applicable):

4.17.0-rc.3

How reproducible:

Constant on build01 but not on other build farms

Steps to Reproduce:

    1. Something ends up creating duplicate entries in the observed configuration of the openshift-controller-manager.
    2.
    3.
    

Actual results:

- Approximately 400K secret patches an hour on build01 vs ~40K on other build farms. Initialization times have increased by two orders of magnitude in new ci-operator namespaces.    
- The openshift-controller-manager is hot looping and experiencing client throttling.

Expected results:

1. Initialization of pull secrets in a namespace should take < 1 second. On build01, it can take over 1.5 minutes.
2. openshift-controller-manager should not possess duplicate entries.
3. If duplicate entries are a configuration error, openshift-controller-manager should de-dupe the entries.
4. There should be alerting when the openshift-controller-manager experiences client-side throttling / pathological behavior.

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/238

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.

Component name: ose-cluster-machine-approver-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml

Description of problem:

Topology screen crashes and reports "Oh no! something went wrong" when a pod in completed state is selected.

Version-Release number of selected component (if applicable):

RHOCP 4.15.18    

How reproducible:

100%

Steps to Reproduce:

1. Switch to developer mode
2. Select Topology
3. Select a project that has completed cron jobs like openshift-image-registry
4. Click the green CronJob Object
5. Observe Crash

Actual results:

The Topology screen crashes with error "Oh no! Something went wrong."

Expected results:

After clicking the completed pod / workload, the screen should display the information related to it.

Additional info:

    

The error below was solved in this PR https://github.com/openshift/hypershift/pull/4723, but we can do a better sanitisation of the IgnitionServer payload. This is the suggestion from Alberto in Slack: https://redhat-internal.slack.com/archives/G01QS0P2F6W/p1726257008913779?thread_ts=1726241321.475839&cid=G01QS0P2F6W

✗ [High] Cross-site Scripting (XSS) 
  Path: ignition-server/cmd/start.go, line 250 
  Info: Unsanitized input from an HTTP header flows into Write, where it is used to render an HTML page returned to the user. This may result in a Reflected Cross-Site Scripting attack (XSS).
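A minimal sketch of the kind of sanitisation meant here (illustrative only, not the actual ignition-server handler; the header name and message are assumptions): escape the header value before writing it into the HTML response.

package main

import (
	"fmt"
	"html"
	"log"
	"net/http"
)

// handler echoes a request header back in an HTML error page. Escaping the
// value with html.EscapeString prevents a reflected XSS if the header
// contains markup.
func handler(w http.ResponseWriter, r *http.Request) {
	token := r.Header.Get("Authorization") // untrusted client input
	safe := html.EscapeString(token)
	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	w.WriteHeader(http.StatusBadRequest)
	fmt.Fprintf(w, "<html><body>invalid token: %s</body></html>", safe)
}

func main() {
	http.HandleFunc("/ignition", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}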

Description of problem:

When viewing binary secret data, we also provide the 'Reveal/Hide values' option, which is redundant.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-10-22-123921    

How reproducible:

Always    

Steps to Reproduce:

    1. create a Key/Value secret when the data is binary file, Workloads -> Secrets -> Create Key/value secret -> upload binary file as secret data -> Create
    2. check data on Secret details page
    3.
    

Actual results:

2. Both options, 'Save file' and 'Reveal/Hide values', are provided, but the 'Reveal/Hide values' button makes no sense since the data is a binary file

Expected results:

2. Only show 'Save file' option for binary data    

Additional info:

    

Please review the following PR: https://github.com/openshift/coredns/pull/130

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

In the case of OpenStack, the network operator tries and fails to update the infrastructure resource.   

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    always

Steps to Reproduce:

    1. Install hypershift
    2. Create openstack hosted cluster
    

Actual results:

Network operator fails to report as available due to:
 - lastTransitionTime: "2024-08-22T15:54:16Z"
    message: 'Error while updating infrastructures.config.openshift.io/cluster: failed
      to apply / update (config.openshift.io/v1, Kind=Infrastructure) /cluster: infrastructures.config.openshift.io
      "cluster" is forbidden: ValidatingAdmissionPolicy ''config'' with binding ''config-binding''
      denied request: This resource cannot be created, updated, or deleted. Please
      ask your administrator to modify the resource in the HostedCluster object.'
    reason: UpdateInfrastructureSpecOrStatus
    status: "True"
    type: network.operator.openshift.io/Degraded    

Expected results:

    Cluster operator becomes available

Additional info:

    This is a bug introduced with https://github.com/openshift/hypershift/pull/4303 

Please review the following PR: https://github.com/openshift/kubernetes-metrics-server/pull/32

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem: Unnecessary warning notification message on debug pod.

 

Code: https://github.com/openshift/console/blob/bdb211350a66fe96ab215a655d41c45864dc3cef/frontend/public/components/debug-terminal.tsx#L114

 

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Use an imageSetConfig with an operator catalog.

Do a mirror-to-mirror.

Without removing the working-dir or the cache, do a mirror-to-mirror again.

It fails with the error: filtered declarative config not found

 

We think that low disk space is likely the cause of https://issues.redhat.com/browse/OCPBUGS-37785

It's not immediately obvious that this happened during the run without digging into the events.

Could we create a new test to enforce that the kubelet never reports disk pressure during a run?
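A rough sketch of such a check using client-go to assert that no node reports the DiskPressure condition (illustrative only; a real monitor test in openshift/origin would hook into the interval/event collection rather than poll like this):

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			// The test idea: fail if any node ever reports DiskPressure=True during the run.
			if cond.Type == corev1.NodeDiskPressure && cond.Status == corev1.ConditionTrue {
				fmt.Printf("node %s reports DiskPressure: %s\n", node.Name, cond.Message)
			}
		}
	}
}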

 

Description of problem:

IHAC (I have a customer) who is facing the same problem as OCPBUGS-17356 in an OCP 4.16 cluster. There is no ContainerCreating pod, and the firing alert appears to be a false positive.    

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.16.15

Additional info:
This is very similar to OCPBUGS-17356

Description of problem:

When we try to delete the MachineOSConfig while it is still in the building state, the resources related to the MOSC are deleted, but the ConfigMap is not. Hence, when we apply the MOSC again in the same pool, the status of the MOSB is not properly generated.

To resolve the issue we have to manually delete the ConfigMap(s) that were created.

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-07-10-022831   True        False         3h15m   Cluster version is 4.16.0-0.nightly-2024-07-10-022831

How reproducible:

    

Steps to Reproduce:

1. Create CustomMCP
2. Apply any MOSC
3. Delete the MOSC while it is still in building stage
4. Again apply the MOSC
5. Check the MOSB status
oc get machineosbuilds.
NAME                                                            PREPARED   BUILDING   SUCCEEDED   INTERRUPTED   FAILED
infra-rendered-infra-371dc5d02dbe0bb5712857393db95bf3-builder                                                   False

Actual results:

oc get machineosbuilds
 NAME                                                            PREPARED   BUILDING   SUCCEEDED   INTERRUPTED   FAILED
infra-rendered-infra-371dc5d02dbe0bb5712857393db95bf3-builder                                                   False 

Expected results:

We should be able to see the status     

Additional info:

Check the logs of machine-os-builder
$ oc logs machine-os-builder-74d56b55cf-mp6mv | grep -i error I0710 11:05:56.750770       1 build_controller.go:474] Error syncing machineosbuild infra3: could not start build for MachineConfigPool infra: could not load rendered MachineConfig mc-rendered-infra-371dc5d02dbe0bb5712857393db95bf3 into configmap: configmaps "mc-rendered-infra-371dc5d02dbe0bb5712857393db95bf3" already exists     

After looking at this test run we need to validate the following scenarios:

  1. Monitor test for nodes should fail when nodes go ready=false unexpectedly.
  2. Monitor test for nodes should fail when the unreachable taint is placed on them.
  3. Monitor test for node leases should create timeline entries when leases are not renewed “on time”.  This could also fail after N failed renewal cycles.

 

Do the monitor tests in openshift/origin accurately test these scenarios?

Please review the following PR: https://github.com/openshift/oc/pull/1866

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/images/pull/196

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

ROSA HCP allows customers to select hostedcluster and nodepool OCP z-stream versions, respecting version skew requirements. E.g.:

  • A 4.15.28 hostedcluster with
  • A 4.15.28 nodepool
  • A 4.15.25 nodepool

Version-Release number of selected component (if applicable):

Reproducible on 4.14-4.16.z, this bug report demonstrates it for a 4.15.28 hostedcluster with a 4.15.25 nodepool

How reproducible:

100%    

Steps to Reproduce:

    1. Create a ROSA HCP cluster, which comes with a 2-replica nodepool with the same z-stream version (4.15.28)
    2. Create an additional nodepool at a different version (4.15.25)
    

Actual results:

Observe that while nodepool objects report the different version (4.15.25), the resulting kernel version of the node is that of the hostedcluster (4.15.28)

❯ k get nodepool -n ocm-staging-2didt6btjtl55vo3k9hckju8eeiffli8                                                                                    
NAME                     CLUSTER       DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION   UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
mshen-hyper-np-4-15-25   mshen-hyper   1               1               False         True         4.15.25   False             False            
mshen-hyper-workers      mshen-hyper   2               2               False         True         4.15.28   False             False  


❯ k get no -owide                                            
NAME                                         STATUS   ROLES    AGE   VERSION            INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                  CONTAINER-RUNTIME
ip-10-0-129-139.us-west-2.compute.internal   Ready    worker   24m   v1.28.12+396c881   10.0.129.139   <none>        Red Hat Enterprise Linux CoreOS 415.92.202408100433-0 (Plow)   5.14.0-284.79.1.el9_2.aarch64   cri-o://1.28.9-5.rhaos4.15.git674ed4c.el9
ip-10-0-129-165.us-west-2.compute.internal   Ready    worker   98s   v1.28.12+396c881   10.0.129.165   <none>        Red Hat Enterprise Linux CoreOS 415.92.202408100433-0 (Plow)   5.14.0-284.79.1.el9_2.aarch64   cri-o://1.28.9-5.rhaos4.15.git674ed4c.el9
ip-10-0-132-50.us-west-2.compute.internal    Ready    worker   30m   v1.28.12+396c881   10.0.132.50    <none>        Red Hat Enterprise Linux CoreOS 415.92.202408100433-0 (Plow)   5.14.0-284.79.1.el9_2.aarch64   cri-o://1.28.9-5.rhaos4.15.git674ed4c.el9
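
A minimal way to surface this skew (columns and paths are illustrative; NodePool exposes its reconciled version under .status.version):

~~~
# Compare what each NodePool reports against the OS/kernel actually running on its nodes.
oc get nodepool -n ocm-staging-2didt6btjtl55vo3k9hckju8eeiffli8 \
  -o custom-columns=NAME:.metadata.name,VERSION:.status.version

# The nodes live in the hosted cluster, so this second command uses the guest kubeconfig.
oc get nodes \
  -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion,OSIMAGE:.status.nodeInfo.osImage
~~~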

Expected results:

    Nodes in the 4.15.25 nodepool should run an RHCOS/kernel build matching 4.15.25, not the hostedcluster's 4.15.28 payload.

Additional info:

 

Description of problem:

    When running the `make fmt` target in the repository, the command can fail due to a version mismatch between the Go toolchain and the goimports dependency.

 

Version-Release number of selected component (if applicable):

    4.16.z

How reproducible:

    always

Steps to Reproduce:

    1. Check out the release-4.16 branch
    2. Run `make fmt`
    

Actual results:

INFO[2024-10-01T14:41:15Z] make fmt
make[1]: Entering directory '/go/src/github.com/openshift/cluster-cloud-controller-manager-operator'
hack/goimports.sh
go: downloading golang.org/x/tools v0.25.0
go: golang.org/x/tools/cmd/goimports@latest: golang.org/x/tools@v0.25.0 requires go >= 1.22.0 (running go 1.21.11; GOTOOLCHAIN=local)

Expected results:

    successful completion of `make fmt`

Additional info:

    Our goimports.sh script references `goimports@latest`, which means this problem will most likely affect older branches as well. We will need to pin a specific version of the goimports package for those branches.

Given that the CCCMO already includes golangci-lint and uses it for a test, we could run goimports through golangci-lint, which would solve this problem without needing to pin special versions of goimports.
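
A minimal sketch of the pinning option, assuming hack/goimports.sh is where the version is selected; the version and directories below are only illustrative, chosen so the module's required Go matches the branch toolchain:

~~~
# Pin goimports instead of @latest so older branches keep working with their Go toolchain.
GOIMPORTS_VERSION=${GOIMPORTS_VERSION:-v0.17.0}   # illustrative pin for a Go 1.21 branch
go run "golang.org/x/tools/cmd/goimports@${GOIMPORTS_VERSION}" -w ./pkg ./cmd   # paths are placeholders
~~~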

Description of problem:

  OLM 4.17 references 4.16 catalogs  

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. oc get pods -n openshift-marketplace -o yaml | grep "image: registry.redhat.io"
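
An alternative way to list the index images the catalog sources reference, without grepping pod YAML (a sketch):

~~~
# Print each default CatalogSource and the index image it points at.
oc -n openshift-marketplace get catalogsources \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.image}{"\n"}{end}'
~~~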
    

Actual results:

      image: registry.redhat.io/redhat/certified-operator-index:v4.16
      image: registry.redhat.io/redhat/certified-operator-index:v4.16
      image: registry.redhat.io/redhat/community-operator-index:v4.16
      image: registry.redhat.io/redhat/community-operator-index:v4.16
      image: registry.redhat.io/redhat/redhat-marketplace-index:v4.16
      image: registry.redhat.io/redhat/redhat-marketplace-index:v4.16
      image: registry.redhat.io/redhat/redhat-operator-index:v4.16
      image: registry.redhat.io/redhat/redhat-operator-index:v4.16

Expected results:

      image: registry.redhat.io/redhat/certified-operator-index:v4.17
      image: registry.redhat.io/redhat/certified-operator-index:v4.17
      image: registry.redhat.io/redhat/community-operator-index:v4.17
      image: registry.redhat.io/redhat/community-operator-index:v4.17
      image: registry.redhat.io/redhat/redhat-marketplace-index:v4.17
      image: registry.redhat.io/redhat/redhat-marketplace-index:v4.17
      image: registry.redhat.io/redhat/redhat-operator-index:v4.17
      image: registry.redhat.io/redhat/redhat-operator-index:v4.17

Additional info:

    

Description of problem:

With "Configuring a private storage endpoint on Azure by enabling the Image Registry Operator to discover VNet and subnet names"[1], a cluster created with the internal Image Registry gets a storage account that uses a private endpoint. Once a new PVC reuses the same skuName, and therefore that private storage account, it hits a mount permission issue.
 

[1] https://docs.openshift.com/container-platform/4.16/post_installation_configuration/configuring-private-cluster.html#configuring-private-storage-endpoint-azure-vnet-subnet-iro-discovery_configuring-private-cluster

Version-Release number of selected component (if applicable):

4.17

How reproducible:

Always

Steps to Reproduce:

    1. Create a cluster with the flexy job profile aos-4_17/ipi-on-azure/versioned-installer-customer_vpc-disconnected-fully_private_cluster-arm and specify enable_internal_image_registry: "yes"
    2. Create a pod and PVC with the azurefile-csi storage class

Actual results:

pod failed to up due to mount error:

mount //imageregistryciophgfsnrc.file.core.windows.net/pvc-facecce9-d4b5-4297-b253-9a6200642392 on /var/lib/kubelet/plugins/kubernetes.io/csi/file.csi.azure.com/b4b5e52fb1d21057c9644d0737723e8911d9519ec4c8ddcfcd683da71312a757/globalmount failed with mount failed: exit status 32
  Mounting command: mount
  Mounting arguments: -t cifs -o mfsymlinks,cache=strict,nosharesock,actimeo=30,gid=1018570000,file_mode=0777,dir_mode=0777, //imageregistryciophgfsnrc.file.core.windows.net/pvc-facecce9-d4b5-4297-b253-9a6200642392 /var/lib/kubelet/plugins/kubernetes.io/csi/file.csi.azure.com/b4b5e52fb1d21057c9644d0737723e8911d9519ec4c8ddcfcd683da71312a757/globalmount
  Output: mount error(13): Permission denied 

Expected results:

Pod should be up

Additional info:

There are simple workarounds, such as using a storage class with networkEndpointType: privateEndpoint or specifying another storage account, but the pre-defined azurefile-csi storage class will still fail, and that is not easy to work around in automation.

It is unclear whether the CSI driver could check whether a reused storage account sits behind a private endpoint before using it.
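
A minimal sketch of the workaround storage class mentioned above (the class name is illustrative, and some setups may need additional VNet parameters); it asks the Azure File CSI driver to create/use a storage account behind a private endpoint:

~~~
oc apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi-private   # hypothetical name
provisioner: file.csi.azure.com
parameters:
  skuName: Standard_LRS
  networkEndpointType: privateEndpoint
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
EOF
~~~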

Description of problem:

    Running "make fmt" in the repository fails with an error about a version mismatch between goimports and the go language version.

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    always

Steps to Reproduce:

    1. checkout release-4.16 branch
    2. run "make fmt" (with golang version 1.21)
    

Actual results:

openshift-hack/check-fmt.sh
go: downloading golang.org/x/tools v0.26.0
go: golang.org/x/tools/cmd/goimports@latest: golang.org/x/tools@v0.26.0 requires go >= 1.22.0 (running go 1.21.11; GOTOOLCHAIN=local)
make: *** [openshift.mk:18: fmt] Error 1

Expected results:

    completion without errors

Additional info:

This is affecting us currently on 4.16 and earlier, but it will become a persistent problem over time.

We can correct this with a more holistic approach, such as calling the goimports binary that is already included in our build images.
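
A rough sketch of that approach, assuming the build image ships a goimports binary on PATH (the directories are placeholders):

~~~
# Prefer the preinstalled goimports over downloading golang.org/x/tools at @latest.
if command -v goimports >/dev/null 2>&1; then
  goimports -w ./pkg ./cmd
else
  echo "goimports not found in the build image" >&2
  exit 1
fi
~~~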

Please review the following PR: https://github.com/openshift/csi-operator/pull/114

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

A Machine got stuck in the Provisioning phase after its EC2 instance was terminated by AWS.

I hit this problem while running a rehearsal cluster in an under-development[1] job[2] for AWS Local Zones. The EC2 instance created through the MachineSet template was launched in the Local Zone us-east-1-qro-1a, but it was terminated right after creation with this message[3] (from the AWS Console):
~~~
Client.VolumeLimitExceeded: Volume limit exceeded. You have exceeded the maximum gp2 storage limit of 87040 GiB in this location. Please contact AWS Support for more information.
~~~

When I saw this problem in the Console, I removed the Machine object and the MAPI was able to create a new instance in the same Zone:

~~~
$ oc rsh pod/e2e-aws-ovn-shared-vpc-localzones-openshift-e2e-test
Defaulted container "test" out of: test, sidecar, ci-scheduling-dns-wait (init), place-entrypoint (init), cp-entrypoint-wrapper (init)
sh-4.4$ oc get machines -A
NAMESPACE               NAME                                                     PHASE          TYPE         REGION      ZONE         AGE
openshift-machine-api   ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx   Provisioning                                         45m

sh-4.4$ oc delete machine ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx -n openshift-machine-api
machine.machine.openshift.io "ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx" deleted

(...)
$ oc rsh pod/e2e-aws-ovn-shared-vpc-localzones-openshift-e2e-test
Defaulted container "test" out of: test, sidecar, ci-scheduling-dns-wait (init), place-entrypoint (init), cp-entrypoint-wrapper (init)
sh-4.4$ oc get machines -n openshift-machine-api -w
NAME                                                     PHASE         TYPE         REGION      ZONE               AGE
ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-v675j   Provisioned   c5.2xlarge   us-east-1   us-east-1-qro-1a   2m6s
~~~

The job[2] didn't finish successfully due to the timeout while checking for node readiness, but the Machine got provisioned correctly (without Console errors) and stayed in the running state.

The main problem I can see in the Machine controller logs is an endless loop trying to reconcile a terminated machine/instance (i-0fc8f2e7fe7bba939):

~~~
2023-06-20T19:38:01.016776717Z I0620 19:38:01.016760       1 controller.go:156] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciling Machine
2023-06-20T19:38:01.016776717Z I0620 19:38:01.016767       1 actuator.go:108] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: actuator checking if machine exists
2023-06-20T19:38:01.079829331Z W0620 19:38:01.079800       1 reconciler.go:481] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Failed to find existing instance by id i-0fc8f2e7fe7bba939: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down
2023-06-20T19:38:01.132099118Z E0620 19:38:01.132063       1 utils.go:236] Excluding instance matching ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down
2023-06-20T19:38:01.132099118Z I0620 19:38:01.132080       1 reconciler.go:296] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Instance does not exist
2023-06-20T19:38:01.132146892Z I0620 19:38:01.132096       1 controller.go:349] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciling machine triggers idempotent create
2023-06-20T19:38:01.132146892Z I0620 19:38:01.132101       1 actuator.go:81] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: actuator creating machine
2023-06-20T19:38:01.132489856Z I0620 19:38:01.132460       1 reconciler.go:41] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: creating machine
2023-06-20T19:38:01.190935211Z W0620 19:38:01.190901       1 reconciler.go:481] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Failed to find existing instance by id i-0fc8f2e7fe7bba939: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down
2023-06-20T19:38:01.238693678Z E0620 19:38:01.238661       1 utils.go:236] Excluding instance matching ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down
2023-06-20T19:38:01.238693678Z I0620 19:38:01.238680       1 machine_scope.go:90] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: patching machine
2023-06-20T19:38:01.249796760Z E0620 19:38:01.249761       1 actuator.go:72] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx error: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciler failed to Create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Machine was already created, InstanceID is set in providerStatus. Possible eventual-consistency discrepancy; returning an error to requeue
2023-06-20T19:38:01.249824958Z W0620 19:38:01.249796       1 controller.go:351] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: failed to create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciler failed to Create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Machine was already created, InstanceID is set in providerStatus. Possible eventual-consistency discrepancy; returning an error to requeue
2023-06-20T19:38:01.249858967Z E0620 19:38:01.249847       1 controller.go:324]  "msg"="Reconciler error" "error"="ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciler failed to Create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Machine was already created, InstanceID is set in providerStatus. Possible eventual-consistency discrepancy; returning an error to requeue" "controller"="machine-controller" "name"="ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx" "namespace"="openshift-machine-api" "object"={"name":"ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx","namespace":"openshift-machine-api"} "reconcileID"="8890f9f7-2fbf-441d-a8b7-a52ec5f4ae2f"
~~~

I also reviewed the account quotas for EBS gp2, and we are under the limits. The second machine was provisioned successfully, so I would rule out account quotas and focus on capacity issues in the zone; given that Local Zones do not have as much capacity as regular zones, this could happen more frequently.

I am asking the AWS team for an RCA and for clarification on how we can programmatically detect this error (maybe via the EC2 API; I did not describe the EC2 instance while the event was happening).


[1] https://github.com/openshift/release/pull/39902#issuecomment-1599559108
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/39902/rehearse-39902-pull-ci-openshift-installer-master-e2e-aws-ovn-shared-vpc-localzones/1671215930459295744
[3] https://user-images.githubusercontent.com/3216894/247285243-3cd28306-2972-4576-a9a6-a620e01747a6.png
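
One possible way to get at this programmatically via the EC2 API (instance ID taken from the log above, purely illustrative):

~~~
# StateReason carries codes such as Client.VolumeLimitExceeded for instances terminated by AWS;
# note that terminated instances only remain queryable for a limited time.
aws ec2 describe-instances \
  --instance-ids i-0fc8f2e7fe7bba939 \
  --query 'Reservations[].Instances[].[InstanceId,State.Name,StateReason.Code,StateReason.Message]' \
  --output table
~~~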

Version-Release number of selected component (if applicable):

4.14.0-0.ci.test-2023-06-20-191559-ci-op-ljs7pd35-latest

How reproducible:

- Rarely, and triggered by AWS (mainly zone capacity issues; an RCA has been requested from AWS to check whether we can find a way to reproduce)

Steps to Reproduce:

This is hard to reproduce, as the EC2 instance was terminated by AWS itself.

I created a script to watch the specific subnet ID and terminate any instance created in it immediately, but in that case the Machine goes to the Failed phase and gets stuck there, not in "Provisioning" as we saw in the CI job.

Steps to try to reproduce:
1. Create a cluster with Local Zone support: https://docs.openshift.com/container-platform/4.13/installing/installing_aws/installing-aws-localzone.html
2. Wait for the cluster to be created
3. Scale down the MachineSet for the Local Zone
4. In a new terminal (#2), watch for and terminate any EC2 instance created in a Local Zone subnet (example: us-east-1-bue-1a):
~~~
machineset_monitor="byonet1-sc9fb-edge-us-east-1-bue-1a"

# discover the subnet ID
subnet_id=$(oc get machineset $machineset_monitor -n openshift-machine-api -o json | jq -r .spec.template.spec.providerSpec.value.subnet.id)

# discover the zone name
zone_name="$(aws ec2 describe-subnets --subnet-ids $subnet_id --query 'Subnets[].AvailabilityZone' --output text)"

# Discover instance ids in the subnet and terminate it
while true; do
    echo "$(date): Getting instance in the zone ${zone_name} / subnet ${subnet_id}..."
    
    instance_ids=$(aws ec2 describe-instances --filters Name=subnet-id,Values=$subnet_id Name=instance-state-name,Values=pending,running,shutting-down,stopping --query 'Reservations[].Instances[].InstanceId' --output text)
    
    echo "$(date): Instances retrieved: $instance_ids"
    if [[ -n "$instance_ids" ]]; then
        echo "Terminating instances..."
        aws ec2 terminate-instances --instance-ids $instance_ids
        sleep 1
    else
        echo "Awaiting..."
        sleep 2
    fi
done
~~~

5. Scale up the MachineSet
6. Observe the Machines

Actual results:

 

Expected results:

- The Machine should move to the Failed phase when the EC2 instance is terminated by AWS, or
- the Machine should self-recover when its EC2 instance is deleted/terminated (for example by deleting the Machine object when it is managed by a MachineSet), so that manual steps are not needed

Additional info:

 
  • `appProtocol: kubernetes.io/h2c` has been adopted upstream, but the OCP router does not support it, so Serverless needs to revert its use downstream.
  • Recently we faced an issue where a gRPC client uses an edge-terminated route whose ALPN is empty when we use `appProtocol: h2c`.
    With the latest gRPC client versions an ALPN is required, see here and here. Could this be fixed as well?

Description of problem:

    Multiple monitoring-plugin pods log the response code of every health probe (roughly every 10s), so the logs keep growing over time

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-09-09-150616

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

% oc -n openshift-monitoring logs monitoring-plugin-76b8c847f6-m872m
time="2024-09-10T07:55:52Z" level=info msg="enabled features: []\n" module=main
time="2024-09-10T07:55:52Z" level=warning msg="cannot read config file, serving plugin with default configuration, tried /etc/plugin/config.yaml" error="open /etc/plugin/config.yaml: no such file or directory" module=server
time="2024-09-10T07:55:52Z" level=info msg="listening on https://:9443" module=server
10.128.2.2 - - [10/Sep/2024:07:55:53 +0000] "GET /health HTTP/2.0" 200 2
10.128.2.2 - - [10/Sep/2024:07:55:58 +0000] "GET /health HTTP/2.0" 200 2
10.128.2.2 - - [10/Sep/2024:07:56:08 +0000] "GET /health HTTP/2.0" 200 2
10.128.2.2 - - [10/Sep/2024:07:56:18 +0000] "GET /health HTTP/2.0" 200 2
10.128.2.2 - - [10/Sep/2024:07:56:28 +0000] "GET /health HTTP/2.0" 200 2
10.128.2.2 - - [10/Sep/2024:07:56:38 +0000] "GET /health HTTP/2.0" 200 2
...

$ oc -n openshift-monitoring logs monitoring-plugin-76b8c847f6-m872m | grep "GET /health HTTP/2.0" | wc -l
1967

 

Expected results:

    Before the switch to the Golang backend there were usually not this many log lines

Additional info:

    

Description of problem:

Running the credentials-rotation procedure from https://github.com/shiftstack/installer/blob/master/docs/user/openstack/README.md#openstack-credentials-update leaves Cinder PVCs stuck in Terminating status:

 $ oc get pvc -A
NAMESPACE              NAME                                 STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS         VOLUMEATTRIBUTESCLASS   AGE
cinder-test            pvc-0                                Terminating   pvc-d7d37d04-d8d1-4a61-a3bc-c038e53a13c7   1Gi        RWO            standard-csi         <unset>                 12h
cinder-test            pvc-1                                Terminating   pvc-32049f0e-b842-4e54-aff8-5f41f51b3c54   1Gi        RWO            standard-csi         <unset>                 12h
cinder-test            pvc-2                                Terminating   pvc-3eb42d8a-f22f-418b-881e-21c913b89c56   1Gi        RWO            standard-csi         <unset>                 12h

The cinder-csi-controller reports below error:

E1022 07:21:11.772540       1 utils.go:95] [ID:4401] GRPC error: rpc error: code = Internal desc = DeleteVolume failed with error Expected HTTP response code [202 204] when accessing [DELETE https://10.46.44.159:13776/v3/c27fbb9d859e40cc96f82e47b5ceebd6/volumes/bd5e6cf9-f27e-4aff-81ac-a83e7bccea86], but got 400 instead: {"badRequest": {"code": 400, "message": "Invalid volume: Volume status must be available or error or error_restoring or error_extending or error_managing and must not be migrating, attached, belong to a group, have snapshots or be disassociated from snapshots after volume transfer."}}

However, in OpenStack, the volumes appear in-use:

stack@undercloud-0 ~]$ OS_CLOUD=shiftstack openstack volume list                                                                                                                                                                            
/usr/lib/python3.9/site-packages/osc_lib/utils/__init__.py:515: DeprecationWarning: The usage of formatter functions is now discouraged. Consider using cliff.columns.FormattableColumn instead. See reviews linked with bug 1687955 for more
detail.
  warnings.warn(
+--------------------------------------+------------------------------------------+-----------+------+------------------------------------------------------+                                                                                
| ID                                   | Name                                     | Status    | Size | Attached to                                          |                                                                                
+--------------------------------------+------------------------------------------+-----------+------+------------------------------------------------------+                                                                                
| 093b14c1-a79a-46aa-ab6b-6c71d2adcef9 | pvc-3eb42d8a-f22f-418b-881e-21c913b89c56 | in-use    |    1 | Attached to ostest-2nkmx-worker-0-cflkl on /dev/vdd  |                                                                                
| 4342c947-732d-4d23-964c-58bd56b79fd4 | pvc-32049f0e-b842-4e54-aff8-5f41f51b3c54 | in-use    |    1 | Attached to ostest-2nkmx-worker-0-cflkl on /dev/vdc  |                                                                                
| 6da3147f-4ce8-4e17-a29a-6f311599a969 | pvc-d7d37d04-d8d1-4a61-a3bc-c038e53a13c7 | in-use    |    1 | Attached to ostest-2nkmx-worker-0-cflkl on /dev/vdb  |                                   

 

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-10-21-010606    
RHOS-17.1-RHEL-9-20240701.n.1

How reproducible:

Always (twice in a row)

Additional info:

must-gather provided in private comment

Please review the following PR: https://github.com/openshift/cluster-version-operator/pull/1083

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The "oc adm ocp-certificates regenerate-machine-config-server-serving-cert" is failing

    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-10-10-195326


$ oc  version
Client Version: 4.18.0-202410080912.p0.g3692450.assembly.stream-3692450
Kustomize Version: v5.4.2
Server Version: 4.18.0-0.nightly-2024-10-10-195326
Kubernetes Version: v1.31.1

    

How reproducible:

Always
    

Steps to Reproduce:

    1. Execute "oc adm ocp-certificates regenerate-machine-config-server-serving-cert" with the oc binary that matches the tested version
    2.
    3.
    

Actual results:


The "oc adm ocp-certificates regenerate-machine-config-server-serving-cert"  command fails with this error:

$ oc adm ocp-certificates regenerate-machine-config-server-serving-cert
W1011 10:13:41.951040 2699876 recorder_logging.go:53] &Event{ObjectMeta:{dummy.17fd5e657c5748ca  dummy    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:SecretUpdateFailed,Message:Failed to update Secret/: Secret "machine-config-server-tls" is invalid: type: Invalid value: "kubernetes.io/tls": field is immutable,Source:EventSource{Component:,Host:,},FirstTimestamp:2024-10-11 10:13:41.950941386 +0000 UTC m=+0.377199185,LastTimestamp:2024-10-11 10:13:41.950941386 +0000 UTC m=+0.377199185,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
The Secret "machine-config-server-tls" is invalid: type: Invalid value: "kubernetes.io/tls": field is immutable


    

Expected results:

The command should be executed without errors
    

Additional info:

    

This line is repeated many times, about once a second when provisioning a new cluster:

    level=debug msg=    baremetalhost resource not yet available, will retry

 

  1. The regular expression for matching the audience string is incorrect.
  2. STS functionality behaves incorrectly due to convoluted logic (detected by QE).

Please review the following PR: https://github.com/openshift/cloud-provider-powervs/pull/79

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-image-registry-operator/pull/1113

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-kube-scheduler-operator/pull/548

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/oc/pull/1867

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/oauth-apiserver/pull/118

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cloud-provider-ibm/pull/75

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

must-gather creates an empty monitoring/prometheus/rules.json file due to the error "Unable to connect to the server: x509: certificate signed by unknown authority"

Version-Release number of selected component (if applicable):

4.9

How reproducible:

not sure what customer did on certs

Steps to Reproduce:

1.
2.
3.

Actual results:

monitoring/prometheus/rules.json is empty, while monitoring/prometheus/rules.sterr contains error message "Unable to connect to the server: x509: certificate signed by unknown authority"

Expected results:

As must-gather runs only inside the cluster, it should be safe to skip certificate verification when querying data from Prometheus.

Additional info:

https://attachments.access.redhat.com/hydra/rest/cases/03329385/attachments/e89af78a-3e35-4f1a-a13c-46f05ff755cc?usePresignedUrl=true should contain an example

Please review the following PR: https://github.com/openshift/origin/pull/29071

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Because SNO replaces the API server during an upgrade, the storage operator's csi-snapshot container exits because it cannot retrieve a CR, causing a crash-loop back-off for the period where the API server is down. This also affects other tests during the same time frame. We will be resolving each of these individually and updating the tests in the meantime to unblock the problems.

Additional context here:

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-aws-ovn-single-node-upgrade-4.18-micro-release-openshift-release-analysis-aggregator/1844300620022943744

https://redhat-internal.slack.com/archives/C0763QRRUS2/p1728567187172169

 

Description of problem:

    When deploying nodepools on OpenStack, the Nodepool condition complains about unsupported amd64 while we actually support it.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

As part of TRT investigations of k8s API disruptions, we have discovered that there are times when haproxy considers the underlying apiserver Down, yet from the k8s perspective the apiserver is healthy and functional.

From the customer perspective, during this time any call to the cluster API endpoint will fail. It simply looks like an outage.

A thorough investigation leads us to the following difference between how haproxy decides the apiserver is alive and how k8s decides it, i.e.

inter 1s fall 2 rise 3

and

     readinessProbe:
      httpGet:
        scheme: HTTPS
        port: 6443
        path: readyz
      initialDelaySeconds: 0
      periodSeconds: 5
      timeoutSeconds: 10
      successThreshold: 1
      failureThreshold: 3

We can see that the top check, which belongs to haproxy, is much stricter. As a result, haproxy sees the following

2024-10-08T12:37:32.779247039Z [WARNING]  (29) : Server masters/master-2 is DOWN, reason: Layer7 wrong status, code: 500, info: "Internal Server Error", check duration: 5ms. 0 active and 0 backup servers left. 154 sessions active, 0 requeued, 0 remaining in queue.

much sooner than k8s would consider anything to be wrong.

To remediate this issue, it has been agreed that the haproxy checks should be softened and aligned with the k8s readiness probe.
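
A sketch of how to inspect the current haproxy check parameters on a control-plane node (the node name is a placeholder, and the config path assumes the on-prem API load balancer layout):

~~~
# Look at the inter/fall/rise values haproxy currently applies to the apiserver backends.
oc debug node/master-2 -- chroot /host \
  grep -nE 'inter|fall|rise' /etc/haproxy/haproxy.cfg
~~~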

Description of problem:

Periodic jobs are failing due to a change in CoreOS.

Version-Release number of selected component (if applicable):

    4.15,4.16,4.17,4.18

How reproducible:

    100%

Steps to Reproduce:

    1. Check any periodic conformance jobs
    2.
    3.
    

Actual results:

    Periodic conformance jobs fail during hostedcluster creation

Expected results:

    Periodic conformance tests succeed

Additional info:

    

Description of problem:
Console user settings are saved in a ConfigMap for each user in the namespace openshift-console-user-settings.

The console frontend uses the k8s API to read and write that ConfigMap. The console backend creates a ConfigMap with a Role and RoleBinding for each user, giving that single user read and write access to his/her own ConfigMap.

The number of Roles and RoleBindings might decrease cluster performance. This has happened in the past, especially on the Developer Sandbox, where a long-living cluster creates new users that are then automatically removed after a month. Keeping the stale Roles and RoleBindings results in performance issues.

The resources had an ownerReference before 4.15, so the three resources (1 ConfigMap, 1 Role, 1 RoleBinding) were automatically removed when the User resource was deleted. This ownerReference was removed in 4.15 to support external OIDC providers.

The ask in this issue is to restore that ownerReference for the OpenShift auth provider.
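
A rough sketch of what restoring that ownerReference could look like for the per-user ConfigMap; the user name, the ConfigMap naming scheme, and the patch mechanism below are illustrative, not the console backend's actual code path:

~~~
# With an ownerReference to the cluster-scoped User, deleting the User lets garbage
# collection remove the ConfigMap (and, analogously, the Role and RoleBinding).
USER_NAME=alice                                              # placeholder user
USER_UID=$(oc get user "$USER_NAME" -o jsonpath='{.metadata.uid}')
PATCH=$(cat <<EOF
{"metadata":{"ownerReferences":[{"apiVersion":"user.openshift.io/v1","kind":"User","name":"${USER_NAME}","uid":"${USER_UID}"}]}}
EOF
)
# The "user-settings-<uid>" ConfigMap name is an assumption for illustration.
oc -n openshift-console-user-settings patch configmap "user-settings-${USER_UID}" \
  --type=merge -p "${PATCH}"
~~~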

History:

  • The user settings feature was introduced in 2020 with 4.7 (ODC-4370) without an ownerReference for these resources.
  • After noticing performance issues on Dev Sandbox in 2022 (BZ 2019564), we added an ownerReference in 4.11 (PR 11130) and backported this change to 4.10 and 4.9.
  • The ownerReference was removed in 4.15 with CONSOLE-3829/OCPBUGS-16814/PR 13321. This is a regression.

See also:

Version-Release number of selected component (if applicable):
4.15+

How reproducible:
Always

Steps to Reproduce:

  1. Create a new user
  2. Login into the console
  3. Check that the user settings ConfigMap, Role, and RoleBinding exist for that user.
  4. Delete the user
  5. The resources should now be removed...

Actual results:
The three resources weren't deleted after the user was deleted.

Expected results:
The three resources should be deleted after the user is deleted.

Additional info:

-> While upgrading the cluster from 4.13.38 to 4.14.18, it is stuck on CCO, and clusterversion is complaining:

"Working towards 4.14.18: 690 of 860 done (80% complete), waiting on cloud-credential".

Checking further, we see that the CCO deployment has yet to roll out.

-> ClusterOperator status.versions[name=operator] isn't a narrow "the CCO Deployment is updated"; it's "the CCO asserts the whole CC component is updated", which requires (among other things) a functional CCO Deployment. It seems you don't have a functional CCO Deployment, because its logs show it stuck asking for a leader lease. There are no Kube API audit logs to say whether it is stuck generating the Lease request or waiting for a response from the Kube API server.

Description of problem:

My customer is trying to install OCP 4.15 IPv4/v6 dual stack with IPv6 primary using IPI-OpenStack (platform: openstack) on OSP 17.1.
However, it fails with the following error

~~~
$ ./openshift-install create cluster --dir ./
   :
ERROR: Bootstrap failed to complete: Get "https://api.openshift.example.com:6443/version": dial tcp [2001:db8::5]:6443: i/o timeout
~~~

On the bootstrap node, the VIP "2001:db8::5" is not set.

~~~
$ ip addr
      :
2: enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether aa:aa:aa:aa:aa:aa brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.3/24 brd 10.0.0.254 scope global dynamic noprefixroute enp3s0
       valid_lft 40000sec preferred_lft 40000sec
    inet6 2001:db8::3/128 scope global noprefixroute 
       valid_lft forever preferred_lft forever
    inet6 fe80::aaaa:aaff:feaa:aaaa/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
~~~

As far as I investigated, the reason why VIP is not set is that "nameserver" is not properly set on /etc/resolv.conf.
Because of this, name resolution doesn't work on the bootstrap node.

~~~
$ cat /etc/resolv.conf
# Generated by NetworkManager
nameserver 127.0.0.1
search openshift.example.com
~~~
==> There should be a nameserver entry which is advertised by DHCPv6 or DHCPv4. However, there is only 127.0.0.1

/var/run/NetworkManager/resolv.conf has a proper "nameserver" entry which is advertised by DHCPv6:

~~~
# cat /var/run/NetworkManager/resolv.conf
# Generated by NetworkManager
search openshift.example.com
nameserver 2001:db8::8888
~~~

In IPI-openstack installation, /etc/resolv.conf is generated from /var/run/NetworkManager/resolv.conf by the following script:

https://github.com/openshift/installer/blob/9938156e81b5c0085774b2ec56a4be075413fd2d/data/data/bootstrap/openstack/files/etc/NetworkManager/dispatcher.d/30-local-dns-prepender

I wonder whether the above script misbehaves due to a timing issue, a race condition, or something similar.

And according to the customer, this issue depends on DNS setting.

- When DNS server info is advertised only by IPv4 DHCP: The issue occurs
- When DNS server info is advertised only by IPv6 DHCP: The issue occurs
- When DNS server info is advertised by both IPv4 and IPv6 DHCP: The issue does NOT occur
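
A quick way to confirm the mismatch on an affected bootstrap node (a sketch, run over SSH on the bootstrap host):

~~~
# Compare the nameservers NetworkManager learned from DHCP with what ended up in /etc/resolv.conf.
diff <(grep '^nameserver' /var/run/NetworkManager/resolv.conf) \
     <(grep '^nameserver' /etc/resolv.conf)
~~~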

Version-Release number of selected component (if applicable):

    OCP 4.15 IPI-OpenStack

How reproducible:

Steps to Reproduce:

    1. Create a provider network on OSP 17.1
    2. Create IPv4 subnet and IPv6 subnet on the provider network
    3. Set the dns-nameserver using the "openstack subnet set --dns-nameserver" command on only one of the IPv4 or IPv6 subnets
    4. Run IPI-OpenStack installation on the provider network

Actual results:

    IPI-openstack installation fails because nameserver of /etc/resolv.conf on bootstrap node is not set properly

Expected results:

    IPI-openstack installation succeeds and nameserver of /etc/resolv.conf on bootstrap node is set properly

Additional info:

    

Please review the following PR: https://github.com/openshift/installer/pull/8965

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-autoscaler-operator/pull/331

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Modify the import to strip or change the bootOptions.efiSecureBootEnabled

https://redhat-internal.slack.com/archives/CLKF3H5RS/p1722368792144319

archive := &importx.ArchiveFlag{Archive: &importx.TapeArchive{Path: cachedImage}}

ovfDescriptor, err := archive.ReadOvf("*.ovf")
if err != nil {
	// Open the corrupt OVA file
	f, ferr := os.Open(cachedImage)
	if ferr != nil {
		err = fmt.Errorf("%s, %w", err.Error(), ferr)
	}
	defer f.Close()

	// Get a sha256 on the corrupt OVA file
	// and the size of the file
	h := sha256.New()
	written, cerr := io.Copy(h, f)
	if cerr != nil {
		err = fmt.Errorf("%s, %w", err.Error(), cerr)
	}

	return fmt.Errorf("ova %s has a sha256 of %x and a size of %d bytes, failed to read the ovf descriptor %w", cachedImage, h.Sum(nil), written, err)
}

ovfEnvelope, err := archive.ReadEnvelope(ovfDescriptor)
if err != nil {
	return fmt.Errorf("failed to parse ovf: %w", err)
}

Description of problem:

The OCP UI recently enabled ES and FR, and a new Memsource project template was created for the upload operation. So we need to update the memsource-upload.sh script to make use of the new project template ID.

 

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/322

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.

Component name: ose-cluster-control-plane-machine-set-operator-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml

Description of problem:

Users could filter routes by status on OCP 4.16 and earlier versions, but this filter disappeared on OCP 4.17 and 4.18.
    

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-10-09-114619
4.18.0-0.nightly-2024-10-09-113533
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Go to the Routes list page.
    2.
    3.
    

Actual results:

1. There is no filter on the status field.
    

Expected results:

1. There should be a filter on the status field. Refer to the filter on 4.16: https://drive.google.com/file/d/1j0QdO98cMy0ots8rtHdB82MSWilxkOGr/view?usp=drive_link
    

Additional info:


    

Description of problem:

    When creating a sample application from the OCP Dev Console, the deployments, services, and routes get created, but no BuildConfig is created for the application, and hence the application throws: ImagePullBackOff: Back-off pulling image "nodejs-sample:latest"

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. OCP Web Console -> Developer Mode -> Add -> Samples -> Select Any "Builder Images" type Application -> Create     
    2. Check BuildConfig for this application.
    3.
    

Actual results:

    No BuildConfig gets created.

Expected results:

    Application should create a build and the image should be available for the application deployment.

Additional info:

    

Please review the following PR: https://github.com/openshift/node_exporter/pull/152

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Update the tested instance types for IBM Cloud.

Version-Release number of selected component (if applicable):

4.17

How reproducible:

1. Some new instance types need to be added
2. They must match the memory and CPU limitations

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

https://docs.openshift.com/container-platform/4.16/installing/installing_ibm_cloud_public/installing-ibm-cloud-customizations.html#installation-ibm-cloud-tested-machine-types_installing-ibm-cloud-customizations     

Description of problem:

    When an IDP name contains whitespace, it causes the oauth-server to panic if it is built with Go 1.22 or higher.

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

    1. Create a cluster with OCP 4.17
    2. Create IDP with whitespaces in the name.
    3. oauth-server panics.
    

Actual results:

    oauth-server panics (if Go is at version 1.22 or higher).

Expected results:

    NO REGRESSION, it worked with Go 1.21 and lower.

Additional info:

    

Please review the following PR: https://github.com/openshift/cloud-provider-nutanix/pull/35

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The first attempt to clear the input value on the Expand PVC modal does not set the value to zero; instead, the value is cleared and then set to 1.

We need to clear it again before the input value becomes 0.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-11-01-053925    

How reproducible:

Always    

Steps to Reproduce:

    1. Create a PVC with size 300MiB, and make sure it's in Bound status
    2. Go to PVC details -> Actions -> Expand PVC, select the input value and press the 'backspace/delete' button
    

Actual results:

2. the input value is set to 1    

Expected results:

2. the input value should be set to 0 on a clear action    

Additional info:

    screenshot https://drive.google.com/file/d/1Y-FwiCndGpnR6A8ZR1V9weumBi2xzcp0/view?usp=drive_link

Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/852

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/165

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

 The `aws-ebs-csi-driver-node-` pods appear to be failing to deploy far too often in CI recently.

Version-Release number of selected component (if applicable):

    4.14

How reproducible:

  in a statistically significant pattern 

Steps to Reproduce:

    1. Run the OCP test suite enough times for a statistically significant pattern to emerge
    

Actual results:

    fail [github.com/openshift/origin/test/extended/authorization/scc.go:76]: 1 pods failed before test on SCC errors
Error creating: pods "aws-ebs-csi-driver-node-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted-v2: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[5]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, provider restricted-v2: .containers[0].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[0].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[0].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider restricted-v2: .containers[1].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[1].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[1].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider restricted-v2: .containers[2].hostNetwork: Invalid value: true: Host network is not allowed to be used, provider restricted-v2: .containers[2].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used, provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for DaemonSet.apps/v1/aws-ebs-csi-driver-node -n openshift-cluster-csi-drivers happened 4 times

Expected results:

Test pass 

Additional info:

Link to the regression dashboard - https://sippy.dptools.openshift.org/sippy-ng/component_readiness/capability?baseEndTime=2023-10-31%2023%3A59%3A59&baseRelease=4.14&baseStartTime=2023-10-04%2000%3A00%3A00&capability=SCC&component=oauth-apiserver&confidence=95&excludeArches=arm64%2Cheterogeneous%2Cppc64le%2Cs390x&excludeClouds=openstack%2Cibmcloud%2Clibvirt%2Covirt%2Cunknown&excludeVariants=hypershift%2Cosd%2Cmicroshift%2Ctechpreview%2Csingle-node%2Cassisted%2Ccompact&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&pity=5&sampleEndTime=2023-12-11%2023%3A59%3A59&sampleRelease=4.15&sampleStartTime=2023-12-05%2000%3A00%3A00

[sig-auth][Feature:SCC][Early] should not have pod creation failures during install [Suite:openshift/conformance/parallel]

Description of problem:

On the OperatorHub page, the operators are not showing, and the following error message appears:

"Oh no! Something went wrong."

TypeError:

Description: 
A.reduce is not a function

 

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Go to the operator hub page on the web console. 
2.
3.

Actual results:

"Oh no! Something went wrong."

Expected results:

Should list all the operators. 

Additional info:

 

Description of problem:

When viewing the list page with 'All Projects' selected, it does not show all Ingress resources.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-09-09-212926    

How reproducible:

    Always

Steps to Reproduce:

    1. create Ingress under different project
$ oc get ingress -A
NAMESPACE   NAME        CLASS    HOSTS         ADDRESS   PORTS   AGE
39-3        example-2   <none>   example.com             80      8m56s
default     example     <none>   example.com             80      9m43s
    2. Go to Networking -> Ingresses and choose 'All Projects'
    3.
    

Actual results:

    2. Only one Ingress resource listed 

Expected results:

2. Should list Ingresses from all projects    

Additional info:

    

Description of problem:

    In cri-o, the first interface in a CNI result is used as the Pod IP in Kubernetes. In net-attach-def client lib version 1.7.4, we mark the first CNI result as the "default=true" interface in the network-status annotation. This is problematic for CNV together with OVN-K UDN, which needs to know that the UDN interface is the default=true one.

Version-Release number of selected component (if applicable):

    4.18,4.17

How reproducible:

    Reproducible only under specific circumstances without an entire OVN-K stack.

Therefore, use https://gist.github.com/dougbtv/a97e047c9872b2a40d275bb27af85789 to validate this functionality. This requires installing a custom CNI plugin using the script in the gist named 'z-dummy-cni-script.sh': create it as /var/lib/cni/bin/dummyresult on a host, make it executable, and make sure it is on the same node you label with multusdebug=true.
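
To see which interface the client library marked as the default, the pod's network-status annotation can be inspected directly (pod and namespace names are placeholders):

~~~
# Each entry in the annotation is a JSON object with "interface", "ips" and an optional
# "default": true marker; the bug is about which entry carries that marker.
oc -n my-namespace get pod my-pod \
  -o jsonpath='{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status}' | jq .
~~~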

Please review the following PR: https://github.com/openshift/monitoring-plugin/pull/178

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Observing a CI test where the metal3 Pod is deleted and allowed to recreate on another host, it took 5 attempts to start the new pod because static-ip-manager was crashlooping with the following log:

+ '[' -z 172.22.0.3/24 ']'
+ '[' -z enp1s0 ']'
+ '[' -n enp1s0 ']'
++ ip -o addr show dev enp1s0 scope global
+ [[ -n 2: enp1s0    inet 172.22.0.134/24 brd 172.22.0.255 scope global dynamic noprefixroute enp1s0\       valid_lft 3sec preferred_lft 3sec ]]
+ ip -o addr show dev enp1s0 scope global
+ grep -q 172.22.0.3/24
ERROR: "enp1s0" is already set to ip address belong to different subset than "172.22.0.3/24"
+ echo 'ERROR: "enp1s0" is already set to ip address belong to different subset than "172.22.0.3/24"'
+ exit 1

The error message is misleading about what is actually checked (apart from the whole subnet/subset typo). It doesn't appear this should ever work for IPv4, since we don't ever expect the Provisioning VIP to appear on the interface before we've set it. (With IPv6 this should often work thanks to an appalling and unsafe hack. Not to suggest that grepping for an IPv4 address complete with .'s in it is safe either.)

 

Eventually the pod does start up, with this in the log:

+ '[' -z 172.22.0.3/24 ']'
+ '[' -z enp1s0 ']'
+ '[' -n enp1s0 ']'
++ ip -o addr show dev enp1s0 scope global
+ [[ -n '' ]]
+ /usr/sbin/ip address flush dev enp1s0 scope global
+ /usr/sbin/ip addr add 172.22.0.3/24 dev enp1s0 valid_lft 300 preferred_lft 300

So essentially this only worked because there are no IP addresses on the provisioning interface.

In the original (error) log the machine's IP 172.22.0.134/24 has a valid lifetime of 3s, so that likely explains why it later disappears. The provisioning network is managed, so the IP address comes from dnsmasq in the former incarnation of the metal3 pod. We effectively prevent the new pod from starting until the DHCP addresses have timed out, even though we will later flush them to ensure no stale ones are left behind.

The check was originally added by https://github.com/openshift/ironic-static-ip-manager/pull/27 but that only describes what it does and not the reason. There's no linked ticket to indicate what the purpose was.
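For clarity, the failing check can be reconstructed from the trace above roughly as follows (a sketch; the variable names are assumptions, the real script lives in ironic-static-ip-manager):

# Fail if the interface already has any global address that is not the expected provisioning IP
if [[ -n "$(ip -o addr show dev "$PROVISIONING_INTERFACE" scope global)" ]]; then
  if ! ip -o addr show dev "$PROVISIONING_INTERFACE" scope global | grep -q "$PROVISIONING_IP"; then
    echo "ERROR: \"$PROVISIONING_INTERFACE\" is already set to ip address belong to different subset than \"$PROVISIONING_IP\""
    exit 1
  fi
fi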

Description of problem:

The 'pseudolocalizes navigation' test is failing due to https://github.com/openshift/networking-console-plugin/issues/46 and CI is blocked. We discussed this as a team and believe the best option is to remove this test so that future plugin changes do not block CI.

 

Description of problem:

'Remove alternate Service' button doesn't remove alternative service edit section    

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-09-09-120947    

How reproducible:

    Always

Steps to Reproduce:

    1. Go to the Route creation form: Networking -> Routes -> Create Route -> Form view
    2. click on 'Add alternate Service'
    3. click on 'Remove alternate Service'
    

Actual results:

3. The alternate service edit section cannot be removed, and since its fields are mandatory, the user cannot create the Route successfully without choosing an alternate service; otherwise the user sees the error
Error "Required value" for field "spec.alternateBackends[0].name".

Expected results:

clicking on 'Remove alternate Service' button should remove alternate service edit section  

Additional info:

    

Description of problem:

    According to the doc https://docs.openshift.com/container-platform/4.16/storage/understanding-persistent-storage.html#pv-access-modes_understanding-persistent-storage, all access modes should be supported for the `file.csi.azure.com` provisioner, but ROX is disabled in the PVC creation form.

Version-Release number of selected component (if applicable):

 4.17.0-0.nightly-2024-09-08-135628       

How reproducible:

Always    

Steps to Reproduce:

1. Go to the PVC creation page and select a storageclass whose provisioner is `file.csi.azure.com`
2. check access mode dropdown values     

Actual results:

ROX is disabled        

Expected results:

ROX should be enabled, all access modes should be enabled       

Additional info:

    

Description of problem:

   When installing OpenShift 4.16 on vSphere using the IPI method with a template, it fails with the error below:
2024-08-07T09:55:51.4052628Z             "level=debug msg=  Fetching Image...",
2024-08-07T09:55:51.4054373Z             "level=debug msg=  Reusing previously-fetched Image",
2024-08-07T09:55:51.4056002Z             "level=debug msg=  Fetching Common Manifests...",
2024-08-07T09:55:51.4057737Z             "level=debug msg=  Reusing previously-fetched Common Manifests",
2024-08-07T09:55:51.4059368Z             "level=debug msg=Generating Cluster...",
2024-08-07T09:55:51.4060988Z             "level=info msg=Creating infrastructure resources...",
2024-08-07T09:55:51.4063254Z             "level=debug msg=Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202406251923-0/x86_64/rhcos-416.94.202406251923-0-vmware.x86_64.ova?sha256=893a41653b66170c7d7e9b343ad6e188ccd5f33b377f0bd0f9693288ec6b1b73'",
2024-08-07T09:55:51.4065349Z             "level=debug msg=image download content length: 12169",
2024-08-07T09:55:51.4066994Z             "level=debug msg=image download content length: 12169",
2024-08-07T09:55:51.4068612Z             "level=debug msg=image download content length: 12169",
2024-08-07T09:55:51.4070676Z             "level=error msg=failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed during pre-provisioning: failed to use cached vsphere image: bad status: 403"

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

    All the time in user environment

Steps to Reproduce:

    1. Try a disconnected IPI install on vSphere using a template.
    2.
    3.
    

Actual results:

    No cluster installation

Expected results:

    Cluster installed with indicated template

Additional info:

    - 4.14 works as expected in customer environment
    - 4.15 works as expected in customer environment

Description of problem:

    Adjust OVS Dynamic Pinning tests to hypershift. Port 7_performance_kubelet_node/cgroups.go  and 7_performance_kubelet_node/kubelet.go to hypershift

Version-Release number of selected component (if applicable):

4.18    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

 

Additional info:

     This bug is created to port test cases to 4.17 branch

Please review the following PR: https://github.com/openshift/ibm-powervs-block-csi-driver-operator/pull/74

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The term "Label" and "Selector" for pod selector is confusing in NetworkPolicies form.
Suggestion:
1. change the term accordingly
Label -> Key
Selector -> Value
2. redunce the length of the input dialog

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

On 4.17, ABI jobs fail with error

level=debug msg=Failed to register infra env. Error: 1 error occurred:
level=debug msg=	* mac-interface mapping for interface eno12399np0 is missing

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-05-24-193308

How reproducible:

On Prow CI ABI jobs, always

Steps to Reproduce:

    1. Generate ABI ISO starting with an agent-config file defining multiple network interfaces with `enabled: false`
    2. Boot the ISO
    3. Wait for error
    

Actual results:

    Install fails with error 'mac-interface mapping for interface xxxx is missing'

Expected results:

    Install completes

Additional info:

The check fails on the 1st network interface defined with `enabled: false`

Prow CI ABI Job: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-baremetal-pxe-ha-agent-ipv4-static-connected-f14/1797619997015543808

agent-config.yaml: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-baremetal-pxe-ha-agent-ipv4-static-connected-f14/1797619997015543808/artifacts/baremetal-pxe-ha-agent-ipv4-static-connected-f14/baremetal-lab-agent-install/artifacts/agent-config.yaml
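For illustration, the failing shape of host entry looks roughly like this (a sketch, assuming the `enabled: false` flags sit under the nmstate ipv4/ipv6 sections of the extra interfaces; the MAC is a placeholder and the exact file is linked above):

cat > agent-config.yaml <<'EOF'
hosts:
  - hostname: extra-worker-0
    interfaces:                  # mac-interface mapping only for the primary NIC
      - name: eno1
        macAddress: <mac-address>
    networkConfig:
      interfaces:
        - name: eno1
          type: ethernet
          state: up
        - name: eno12399np0      # extra NIC, no entry in the mapping above
          type: ethernet
          state: up
          ipv4:
            enabled: false
          ipv6:
            enabled: false
EOF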

Description of problem:

When you update the IngressController's Scope on PowerVS, Alibaba Cloud, or OpenStack, a Progressing status condition is added that only says:

"The IngressController scope was changed from "Internal" to "External"

It's missing the instructions we see on AWS which begin with "To effectuate this change, you must delete the service..."

These platforms do NOT have mutable scope (meaning you must delete the service to effectuate), so the instructions should be included.

Version-Release number of selected component (if applicable):

    4.12+

How reproducible:

    100%

Steps to Reproduce:

    1. On PowerVS, Alibaba Cloud, or OpenStack, create an IngressController
    2. Now change the scope of ingresscontroller.spec.endpointPublishingStrategy.loadBalancer.scope (see the example patch below)
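For reference, step 2 can be done with a patch like this (a sketch; the IngressController name is assumed to be "default"):

oc patch ingresscontroller/default -n openshift-ingress-operator --type=merge \
  -p '{"spec":{"endpointPublishingStrategy":{"type":"LoadBalancerService","loadBalancer":{"scope":"External"}}}}'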

Actual results:

    Missing "To effectuate this change, you must delete the service..." instructions

Expected results:

    Should contain "To effectuate this change, you must delete the service..." instructions

Additional info:

    

Prometheus HTTP API provides POST endpoints to fetch metrics: https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries

Those endpoints are used in the go client: https://github.com/prometheus/client_golang/blob/main/api/prometheus/v1/api.go#L1438

 

So a viewer-only program/user relying on the go client, or using these POST endpoints to fetch metrics, currently needs to create an additional Role+Binding for that purpose [1].

It would be much more convenient if that permission was directly included in the existing cluster-monitoring-view role, since it's actually used for reading.

 

[1]Role+Binding example

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: metrics
rules:
  - verbs:
      - create
    apiGroups:
      - metrics.k8s.io
    resources:
      - pods
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: metrics
subjects:
  - kind: User
    apiGroup: rbac.authorization.k8s.io
    name: test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: metrics
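For context, a POST instant query of the kind referenced above looks like this (bash sketch; the query and token handling are illustrative):

HOST="$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')"
TOKEN="$(oc whoami -t)"
# --data-urlencode makes curl issue a POST to the instant-query endpoint
curl -sk -H "Authorization: Bearer $TOKEN" --data-urlencode 'query=up' "https://$HOST/api/v1/query"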

[internal] cf slack discussion here https://redhat-internal.slack.com/archives/C0VMT03S5/p1724684997333529?thread_ts=1715862728.898369&cid=C0VMT03S5

 

 

Description of problem:

    When testing unreleased OCP versions, the NodePool fails with:
    - lastTransitionTime: "2024-11-18T07:11:20Z"
    message: 'Failed to get release image: the latest version supported is: "4.18.0".
      Attempting to use: "4.19.0-0.nightly-2024-11-18-064347"'
    observedGeneration: 1
    reason: ValidationFailed
    status: "False"
    type: ValidReleaseImage


We should allow for skipping NP image validation with the hypershift.openshift.io/skip-release-image-validation annotation
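For illustration, the proposed opt-out could be applied like this (a sketch; the annotation value is assumed to be "true" and the names are placeholders):

oc annotate nodepool <nodepool-name> -n <clusters-namespace> \
  hypershift.openshift.io/skip-release-image-validation=true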

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    100%

Steps to Reproduce:

    1.Try to create a NP with 4.19 payload
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

In OCPBUGS-38414, a new featuregate was turned on that didn't work correctly on metal (or at least its tests didn't). Metal should have techpreview jobs to ensure new features are tested properly. I think the right matrix is:

  • e2e-metal-ovn-techpreview
  • e2e-metal-ovn-ipv6-techpreview
  • e2e-metal-ovn-dualstack-techpreview

On standard CI jobs, we incorporate this by wiring in the appropriate FEATURE_SET variable, but metal jobs don't currently have a way to do this as far as I can tell.

These should be release informers.

 

https://github.com/openshift/release/blob/5ce4d77a6317479f909af30d66bc0285ffd38dbd/ci-operator/step-registry/ipi/conf/ipi-conf-commands.sh#L63-L68 is the relevant step
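For context, wiring the feature set in ultimately amounts to a single field in the generated install-config (a sketch; TechPreviewNoUpgrade is the feature set these techpreview jobs would exercise):

# In the CI step environment:
export FEATURE_SET=TechPreviewNoUpgrade
# ...which is intended to end up in the generated install-config.yaml as:
#   featureSet: TechPreviewNoUpgrade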

 

Please review the following PR: https://github.com/openshift/cluster-policy-controller/pull/152

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Start last run option from the Action menu does not work on the BuildConfig details page
    

Version-Release number of selected component (if applicable):


    

How reproducible:

Every time
    

Steps to Reproduce:

    1. Create workloads with builds
    2. Go to the Builds page from navigation
    3. Select the build config
    4. Select the `Start last run` option from the Action menu
    

Actual results:

The option doesn't work
    

Expected results:

The option should work
    

Additional info:

Attaching video
    

https://drive.google.com/file/d/10shQqcFbIKfE4Jv60AxNYBXKz08EdUAK/view?usp=sharing

The hypershift team has reported a nil pointer dereference causing a crash when attempting to call the validation method on an NTO performance profile.

This was detected as the hypershift team was attempting to complete a revendoring under OSASINFRA-3643

Appears to be fallout from https://github.com/openshift/cluster-node-tuning-operator/pull/1086

Error:

--- FAIL: TestGetTuningConfig (0.02s)
    --- FAIL: TestGetTuningConfig/gets_a_single_valid_PerformanceProfileConfig (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x2ec0fc3]

goroutine 329 [running]:
testing.tRunner.func1.2({0x31c21e0, 0x651c8c0})
	/home/emilien/sdk/go1.22.0/src/testing/testing.go:1631 +0x3f7
testing.tRunner.func1()
	/home/emilien/sdk/go1.22.0/src/testing/testing.go:1634 +0x6b6
panic({0x31c21e0?, 0x651c8c0?})
	/home/emilien/sdk/go1.22.0/src/runtime/panic.go:770 +0x132
github.com/openshift/cluster-node-tuning-operator/pkg/apis/performanceprofile/v2.(*PerformanceProfile).getNodesList(0xc000fa4000)
	/home/emilien/git/github.com/shiftstack/hypershift/vendor/github.com/openshift/cluster-node-tuning-operator/pkg/apis/performanceprofile/v2/performanceprofile_validation.go:594 +0x2a3
github.com/openshift/cluster-node-tuning-operator/pkg/apis/performanceprofile/v2.(*PerformanceProfile).ValidateBasicFields(0xc000fa4000)
	/home/emilien/git/github.com/shiftstack/hypershift/vendor/github.com/openshift/cluster-node-tuning-operator/pkg/apis/performanceprofile/v2/performanceprofile_validation.go:132 +0x65
github.com/openshift/hypershift/hypershift-operator/controllers/nodepool.validateTuningConfigManifest({0xc000f34a00, 0x1ee, 0x200})
	/home/emilien/git/github.com/shiftstack/hypershift/hypershift-operator/controllers/nodepool/nto.go:237 +0x307
github.com/openshift/hypershift/hypershift-operator/controllers/nodepool.(*NodePoolReconciler).getTuningConfig(0xc000075cd8, {0x50bf5f8, 0x65e4e40}, 0xc000e05408)
	/home/emilien/git/github.com/shiftstack/hypershift/hypershift-operator/controllers/nodepool/nto.go:187 +0x834
github.com/openshift/hypershift/hypershift-operator/controllers/nodepool.TestGetTuningConfig.func1(0xc000e07a00)
	/home/emilien/git/github.com/shiftstack/hypershift/hypershift-operator/controllers/nodepool/nto_test.go:459 +0x297
testing.tRunner(0xc000e07a00, 0xc000693650)
	/home/emilien/sdk/go1.22.0/src/testing/testing.go:1689 +0x21f
created by testing.(*T).Run in goroutine 325
	/home/emilien/sdk/go1.22.0/src/testing/testing.go:1742 +0x826 

 

Description of problem:

Specify long cluster name in install-config, 
==============
metadata:
  name: jima05atest123456789test123

Create cluster, installer exited with below error:
08-05 09:46:12.788  level=info msg=Network infrastructure is ready
08-05 09:46:12.788  level=debug msg=Creating storage account
08-05 09:46:13.042  level=debug msg=Collecting applied cluster api manifests...
08-05 09:46:13.042  level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed provisioning resources after infrastructure ready: error creating storage account jima05atest123456789tsh586sa: PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima05atest123456789t-sh586-rg/providers/Microsoft.Storage/storageAccounts/jima05atest123456789tsh586sa
08-05 09:46:13.042  level=error msg=--------------------------------------------------------------------------------
08-05 09:46:13.042  level=error msg=RESPONSE 400: 400 Bad Request
08-05 09:46:13.043  level=error msg=ERROR CODE: AccountNameInvalid
08-05 09:46:13.043  level=error msg=--------------------------------------------------------------------------------
08-05 09:46:13.043  level=error msg={
08-05 09:46:13.043  level=error msg=  "error": {
08-05 09:46:13.043  level=error msg=    "code": "AccountNameInvalid",
08-05 09:46:13.043  level=error msg=    "message": "jima05atest123456789tsh586sa is not a valid storage account name. Storage account name must be between 3 and 24 characters in length and use numbers and lower-case letters only."
08-05 09:46:13.043  level=error msg=  }
08-05 09:46:13.043  level=error msg=}
08-05 09:46:13.043  level=error msg=--------------------------------------------------------------------------------
08-05 09:46:13.043  level=error
08-05 09:46:13.043  level=info msg=Shutting down local Cluster API controllers...
08-05 09:46:13.298  level=info msg=Stopped controller: Cluster API
08-05 09:46:13.298  level=info msg=Stopped controller: azure infrastructure provider
08-05 09:46:13.298  level=info msg=Stopped controller: azureaso infrastructure provider
08-05 09:46:13.298  level=info msg=Shutting down local Cluster API control plane...
08-05 09:46:15.177  level=info msg=Local Cluster API system has completed operations    

See azure doc[1], the naming rules on storage account name, it must be between 3 and 24 characters in length and may contain numbers and lowercase letters only.

The prefix of the storage account created by the installer seems to have changed to use the infraID with CAPI-based installation; it was "cluster" when installing with terraform.

Is it possible to change back to using "cluster" as the storage account prefix, to stay consistent with terraform? Several storage accounts exist once cluster installation is completed: one is created by the installer starting with "cluster", and others are created by the image-registry starting with "imageregistry". QE has some CI profiles[2] and automated test cases that rely on the installer storage account and need to search for the "cluster" prefix, and customers may have similar scenarios.

[1] https://learn.microsoft.com/en-us/azure/storage/common/storage-account-overview
[2] https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/install/heterogeneous/ipi-install-heterogeneous-commands.sh#L241
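For reference, the generated name from the error above fails the quoted naming rule simply on length; a minimal check (bash sketch):

SA_NAME="jima05atest123456789tsh586sa"
if (( ${#SA_NAME} < 3 || ${#SA_NAME} > 24 )) || [[ ! "$SA_NAME" =~ ^[a-z0-9]+$ ]]; then
  echo "invalid storage account name (${#SA_NAME} chars): $SA_NAME"
fi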

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Similar to the work done for AWS STS and Azure WIF support, the console UI (specifically OperatorHub) needs to:

  1. warn users when they are on a GCP cluster that supports GCP's Workload Identity Management and the operator they will be installing supports it
  2. Subscribing to an operator that supports it can be customized in the UI by adding fields to the subscription config field that need to be provided to the operator at install time.

CONSOLE-3776 added filtering for the GCP WIF case for the operator-hub tile view. Part of the change was also to check for the annotation which indicates that the operator supports GCP's WIF:

features.operators.openshift.io/token-auth-gcp: "true"

 

AC:

  • Add warning alert to the operator-hub-item-details component, if the cluster is GCP with WIF, similar to Azure and AWS.
  • Add warning alert to the operator-hub-subscribe component, if the cluster is GCP with WIF, similar to Azure and AWS.
  • If the cluster is in GCP WIF mode and the operator claims support for it, the subscription page provides 4 additional fields to configure, which will be set on the Subscription's spec.config.env field:
    • POOL_ID
    • PROVIDER_ID
    • SERVICE_ACCOUNT_EMAIL
  • Default subscription to manual for installs on WIF mode clusters for operators that support it.

 

Design docs
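For illustration, a Subscription carrying the env fields listed above might look like this (a sketch; the operator name, namespace, channel, and values are placeholders, and only the env names come from the AC):

cat <<'EOF' | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: example-operator
  namespace: openshift-operators
spec:
  name: example-operator
  channel: stable
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual
  config:
    env:
    - name: POOL_ID
      value: my-workload-identity-pool
    - name: PROVIDER_ID
      value: my-workload-identity-provider
    - name: SERVICE_ACCOUNT_EMAIL
      value: operator-sa@my-project.iam.gserviceaccount.com
EOF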

Please review the following PR: https://github.com/openshift/oc/pull/1870

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-api-provider-vsphere/pull/50

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.

Component name: ose-vsphere-cluster-api-controllers-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml

Description of problem:
Attempting to Migrate from OpenShiftSDN to OVNKubernetes but experiencing the below Error once the Limited Live Migration is started.

+ exec /usr/bin/hybrid-overlay-node --node ip-10-241-1-192.us-east-2.compute.internal --config-file=/run/ovnkube-config/ovnkube.conf --bootstrap-kubeconfig=/var/lib/kubelet/kubeconfig --cert-dir=/etc/ovn/ovnkube-node-certs --cert-duration=24h
I0829 14:06:20.313928   82345 config.go:2192] Parsed config file /run/ovnkube-config/ovnkube.conf
I0829 14:06:20.314202   82345 config.go:2193] Parsed config: {Default:{MTU:8901 RoutableMTU:0 ConntrackZone:64000 HostMasqConntrackZone:0 OVNMasqConntrackZone:0 HostNodePortConntrackZone:0 ReassemblyConntrackZone:0 EncapType:geneve EncapIP: EncapPort:6081 InactivityProbe:100000 OpenFlowProbe:180 OfctrlWaitBeforeClear:0 MonitorAll:true OVSDBTxnTimeout:1m40s LFlowCacheEnable:true LFlowCacheLimit:0 LFlowCacheLimitKb:1048576 RawClusterSubnets:100.64.0.0/15/23 ClusterSubnets:[] EnableUDPAggregation:true Zone:global} Logging:{File: CNIFile: LibovsdbFile:/var/log/ovnkube/libovsdb.log Level:4 LogFileMaxSize:100 LogFileMaxBackups:5 LogFileMaxAge:0 ACLLoggingRateLimit:20} Monitoring:{RawNetFlowTargets: RawSFlowTargets: RawIPFIXTargets: NetFlowTargets:[] SFlowTargets:[] IPFIXTargets:[]} IPFIX:{Sampling:400 CacheActiveTimeout:60 CacheMaxFlows:0} CNI:{ConfDir:/etc/cni/net.d Plugin:ovn-k8s-cni-overlay} OVNKubernetesFeature:{EnableAdminNetworkPolicy:true EnableEgressIP:true EgressIPReachabiltyTotalTimeout:1 EnableEgressFirewall:true EnableEgressQoS:true EnableEgressService:true EgressIPNodeHealthCheckPort:9107 EnableMultiNetwork:true EnableMultiNetworkPolicy:false EnableStatelessNetPol:false EnableInterconnect:false EnableMultiExternalGateway:true EnablePersistentIPs:false EnableDNSNameResolver:false EnableServiceTemplateSupport:false} Kubernetes:{BootstrapKubeconfig: CertDir: CertDuration:10m0s Kubeconfig: CACert: CAData:[] APIServer:https://api-int.nonamenetwork.sandbox1730.opentlc.com:6443 Token: TokenFile: CompatServiceCIDR: RawServiceCIDRs:198.18.0.0/16 ServiceCIDRs:[] OVNConfigNamespace:openshift-ovn-kubernetes OVNEmptyLbEvents:false PodIP: RawNoHostSubnetNodes:migration.network.openshift.io/plugin= NoHostSubnetNodes:<nil> HostNetworkNamespace:openshift-host-network PlatformType:AWS HealthzBindAddress:0.0.0.0:10256 CompatMetricsBindAddress: CompatOVNMetricsBindAddress: CompatMetricsEnablePprof:false DNSServiceNamespace:openshift-dns DNSServiceName:dns-default} Metrics:{BindAddress: OVNMetricsBindAddress: ExportOVSMetrics:false EnablePprof:false NodeServerPrivKey: NodeServerCert: EnableConfigDuration:false EnableScaleMetrics:false} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} Gateway:{Mode:shared Interface: EgressGWInterface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false V4JoinSubnet:100.64.0.0/16 V6JoinSubnet:fd98::/64 V4MasqueradeSubnet:169.254.169.0/29 V6MasqueradeSubnet:fd69::/125 MasqueradeIPs:{V4OVNMasqueradeIP:169.254.169.1 V6OVNMasqueradeIP:fd69::1 V4HostMasqueradeIP:169.254.169.2 V6HostMasqueradeIP:fd69::2 V4HostETPLocalMasqueradeIP:169.254.169.3 V6HostETPLocalMasqueradeIP:fd69::3 V4DummyNextHopMasqueradeIP:169.254.169.4 V6DummyNextHopMasqueradeIP:fd69::4 V4OVNServiceHairpinMasqueradeIP:169.254.169.5 V6OVNServiceHairpinMasqueradeIP:fd69::5} DisablePacketMTUCheck:false RouterSubnet: SingleNode:false DisableForwarding:false AllowNoUplink:false} MasterHA:{ElectionLeaseDuration:137 ElectionRenewDeadline:107 ElectionRetryPeriod:26} ClusterMgrHA:{ElectionLeaseDuration:137 ElectionRenewDeadline:107 ElectionRetryPeriod:26} HybridOverlay:{Enabled:true RawClusterSubnets: ClusterSubnets:[] VXLANPort:4789} OvnKubeNode:{Mode:full DPResourceDeviceIdsMap:map[] MgmtPortNetdev: MgmtPortDPResourceName:} ClusterManager:{V4TransitSwitchSubnet:100.88.0.0/16 V6TransitSwitchSubnet:fd97::/64}}
F0829 14:06:20.315468   82345 hybrid-overlay-node.go:54] illegal network configuration: built-in join subnet "100.64.0.0/16" overlaps cluster subnet "100.64.0.0/15"

The OpenShift Container Platform 4 - Cluster has been installed with the below configuration and therefore has a conflict because of the clusterNetwork with the Join Subnet of OVNKubernetes.

$ oc get cm -n kube-system cluster-config-v1 -o yaml
apiVersion: v1
data:
  install-config: |
    additionalTrustBundlePolicy: Proxyonly
    apiVersion: v1
    baseDomain: sandbox1730.opentlc.com
    compute:
    - architecture: amd64
      hyperthreading: Enabled
      name: worker
      platform: {}
      replicas: 3
    controlPlane:
      architecture: amd64
      hyperthreading: Enabled
      name: master
      platform: {}
      replicas: 3
    metadata:
      creationTimestamp: null
      name: nonamenetwork
    networking:
      clusterNetwork:
      - cidr: 100.64.0.0/15
        hostPrefix: 23
      machineNetwork:
      - cidr: 10.241.0.0/16
      networkType: OpenShiftSDN
      serviceNetwork:
      - 198.18.0.0/16
    platform:
      aws:
        region: us-east-2
    publish: External
    pullSecret: ""

So, following the procedure, the below steps were executed but the problem is still being reported.

oc patch network.operator.openshift.io cluster --type='merge' -p='{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipv4":{"internalJoinSubnet": "100.68.0.0/16"}}}}}'

Checking whether the change was applied, one can see it being there/configured.

$ oc get network.operator cluster -o yaml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2024-08-29T10:05:36Z"
  generation: 376
  name: cluster
  resourceVersion: "135345"
  uid: 37f08c71-98fa-430c-b30f-58f82142788c
spec:
  clusterNetwork:
  - cidr: 100.64.0.0/15
    hostPrefix: 23
  defaultNetwork:
    openshiftSDNConfig:
      enableUnidling: true
      mode: NetworkPolicy
      mtu: 8951
      vxlanPort: 4789
    ovnKubernetesConfig:
      egressIPConfig: {}
      gatewayConfig:
        ipv4: {}
        ipv6: {}
        routingViaHost: false
      genevePort: 6081
      ipsecConfig:
        mode: Disabled
      ipv4:
        internalJoinSubnet: 100.68.0.0/16
      mtu: 8901
      policyAuditConfig:
        destination: "null"
        maxFileSize: 50
        maxLogFiles: 5
        rateLimit: 20
        syslogFacility: local0
    type: OpenShiftSDN
  deployKubeProxy: false
  disableMultiNetwork: false
  disableNetworkDiagnostics: false
  kubeProxyConfig:
    bindAddress: 0.0.0.0
  logLevel: Normal
  managementState: Managed
  migration:
    mode: Live
    networkType: OVNKubernetes
  observedConfig: null
  operatorLogLevel: Normal
  serviceNetwork:
  - 198.18.0.0/16
  unsupportedConfigOverrides: null
  useMultiNetworkPolicy: false

Following the above, the Limited Live Migration is triggered, which then suddenly stops because of the error shown.

oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'

Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.16.9

How reproducible:
Always

Steps to Reproduce:
1. Install OpenShift Container Platform 4 with OpenShiftSDN, the configuration shown above and then update to OpenShift Container Platform 4.16
2. Change internalJoinSubnet to prevent a conflict with the Join Subnet of OVNKubernetes: oc patch network.operator.openshift.io cluster --type='merge' -p='{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipv4":{"internalJoinSubnet": "100.68.0.0/16"}}}}}'
3. Initiate the Limited Live Migration running oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'
4. Check the logs of ovnkube-node using oc logs ovnkube-node-XXXXX -c ovnkube-controller

Actual results:

+ exec /usr/bin/hybrid-overlay-node --node ip-10-241-1-192.us-east-2.compute.internal --config-file=/run/ovnkube-config/ovnkube.conf --bootstrap-kubeconfig=/var/lib/kubelet/kubeconfig --cert-dir=/etc/ovn/ovnkube-node-certs --cert-duration=24h
I0829 14:06:20.313928   82345 config.go:2192] Parsed config file /run/ovnkube-config/ovnkube.conf
I0829 14:06:20.314202   82345 config.go:2193] Parsed config: {Default:{MTU:8901 RoutableMTU:0 ConntrackZone:64000 HostMasqConntrackZone:0 OVNMasqConntrackZone:0 HostNodePortConntrackZone:0 ReassemblyConntrackZone:0 EncapType:geneve EncapIP: EncapPort:6081 InactivityProbe:100000 OpenFlowProbe:180 OfctrlWaitBeforeClear:0 MonitorAll:true OVSDBTxnTimeout:1m40s LFlowCacheEnable:true LFlowCacheLimit:0 LFlowCacheLimitKb:1048576 RawClusterSubnets:100.64.0.0/15/23 ClusterSubnets:[] EnableUDPAggregation:true Zone:global} Logging:{File: CNIFile: LibovsdbFile:/var/log/ovnkube/libovsdb.log Level:4 LogFileMaxSize:100 LogFileMaxBackups:5 LogFileMaxAge:0 ACLLoggingRateLimit:20} Monitoring:{RawNetFlowTargets: RawSFlowTargets: RawIPFIXTargets: NetFlowTargets:[] SFlowTargets:[] IPFIXTargets:[]} IPFIX:{Sampling:400 CacheActiveTimeout:60 CacheMaxFlows:0} CNI:{ConfDir:/etc/cni/net.d Plugin:ovn-k8s-cni-overlay} OVNKubernetesFeature:{EnableAdminNetworkPolicy:true EnableEgressIP:true EgressIPReachabiltyTotalTimeout:1 EnableEgressFirewall:true EnableEgressQoS:true EnableEgressService:true EgressIPNodeHealthCheckPort:9107 EnableMultiNetwork:true EnableMultiNetworkPolicy:false EnableStatelessNetPol:false EnableInterconnect:false EnableMultiExternalGateway:true EnablePersistentIPs:false EnableDNSNameResolver:false EnableServiceTemplateSupport:false} Kubernetes:{BootstrapKubeconfig: CertDir: CertDuration:10m0s Kubeconfig: CACert: CAData:[] APIServer:https://api-int.nonamenetwork.sandbox1730.opentlc.com:6443 Token: TokenFile: CompatServiceCIDR: RawServiceCIDRs:198.18.0.0/16 ServiceCIDRs:[] OVNConfigNamespace:openshift-ovn-kubernetes OVNEmptyLbEvents:false PodIP: RawNoHostSubnetNodes:migration.network.openshift.io/plugin= NoHostSubnetNodes:<nil> HostNetworkNamespace:openshift-host-network PlatformType:AWS HealthzBindAddress:0.0.0.0:10256 CompatMetricsBindAddress: CompatOVNMetricsBindAddress: CompatMetricsEnablePprof:false DNSServiceNamespace:openshift-dns DNSServiceName:dns-default} Metrics:{BindAddress: OVNMetricsBindAddress: ExportOVSMetrics:false EnablePprof:false NodeServerPrivKey: NodeServerCert: EnableConfigDuration:false EnableScaleMetrics:false} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} Gateway:{Mode:shared Interface: EgressGWInterface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false V4JoinSubnet:100.64.0.0/16 V6JoinSubnet:fd98::/64 V4MasqueradeSubnet:169.254.169.0/29 V6MasqueradeSubnet:fd69::/125 MasqueradeIPs:{V4OVNMasqueradeIP:169.254.169.1 V6OVNMasqueradeIP:fd69::1 V4HostMasqueradeIP:169.254.169.2 V6HostMasqueradeIP:fd69::2 V4HostETPLocalMasqueradeIP:169.254.169.3 V6HostETPLocalMasqueradeIP:fd69::3 V4DummyNextHopMasqueradeIP:169.254.169.4 V6DummyNextHopMasqueradeIP:fd69::4 V4OVNServiceHairpinMasqueradeIP:169.254.169.5 V6OVNServiceHairpinMasqueradeIP:fd69::5} DisablePacketMTUCheck:false RouterSubnet: SingleNode:false DisableForwarding:false AllowNoUplink:false} MasterHA:{ElectionLeaseDuration:137 ElectionRenewDeadline:107 ElectionRetryPeriod:26} ClusterMgrHA:{ElectionLeaseDuration:137 ElectionRenewDeadline:107 ElectionRetryPeriod:26} HybridOverlay:{Enabled:true RawClusterSubnets: ClusterSubnets:[] VXLANPort:4789} OvnKubeNode:{Mode:full DPResourceDeviceIdsMap:map[] MgmtPortNetdev: MgmtPortDPResourceName:} ClusterManager:{V4TransitSwitchSubnet:100.88.0.0/16 V6TransitSwitchSubnet:fd97::/64}}
F0829 14:06:20.315468   82345 hybrid-overlay-node.go:54] illegal network configuration: built-in join subnet "100.64.0.0/16" overlaps cluster subnet "100.64.0.0/15"

Expected results:
OVNKubernetes Limited Live Migration to recognize the change applied for internalJoinSubnet and don't report any CIDR/Subnet overlap during the OVNKubernetes Limited Live Migration

Additional info:
N/A

Affected Platforms:
OpenShift Container Platform 4.16 on AWS

Description of problem:

TestIngressControllerNamespaceSelectorUpdateShouldClearRouteStatus failed due to a previously seen issue with using an outdated IngressController object on update:

    router_status_test.go:248: failed to update ingresscontroller: Operation cannot be fulfilled on ingresscontrollers.operator.openshift.io "ic-namespace-selector-test": the object has been modified; please apply your changes to the latest version and try again

Version-Release number of selected component (if applicable):

4.12-4.17

How reproducible:

<5% (Seen only once)

Steps to Reproduce:

    1. Run TestIngressControllerNamespaceSelectorUpdateShouldClearRouteStatus on a busy cluster with other tests in parallel until it fails

Actual results:

   Flake

Expected results:

    No flake

Additional info:

Search.CI Link

"7 runs, 57% failed, 25% of failures match = 14% impact"

Example Failure

I think we should address all possible "Operation cannot be fulfilled on ingresscontroller" flakes together. 

Description of problem:

The webpack dependency in the @openshift-console/dynamic-plugin-sdk-webpack package is listed as "5.75.0", i.e. not a semver range but an exact version.

If a plugin project updates its webpack dependency to a newer version, the package manager may not hoist node_modules/@openshift/dynamic-plugin-sdk-webpack (which is a dependency of the ☝️ package), which then causes problems during the webpack build.
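A quick way to see whether hoisting happened after bumping webpack (a sketch; run in the plugin repo after yarn install):

yarn why @openshift/dynamic-plugin-sdk-webpack
ls node_modules/@openshift/dynamic-plugin-sdk-webpack >/dev/null 2>&1 \
  || echo "not hoisted to the top-level node_modules"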

Steps to Reproduce:

1. git clone https://github.com/kubevirt-ui/kubevirt-plugin
2. modify webpack dependency in package.json to a newer version
3. yarn install # missing node_modules/@openshift/dynamic-plugin-sdk-webpack
4. yarn build   # results in build errors due to ^^

Actual results:

Build errors due to missing node_modules/@openshift/dynamic-plugin-sdk-webpack

Expected results:

No build errors

 

Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/76

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cloud-network-config-controller/pull/152

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

The quorum checker starts to become very difficult to maintain and we're having a lot more problems with concurrent controllers as identified in OCPBUGS-31849.

To avoid plastering the code in all places where a revision rollout could happen, we should invert the control and tell the revision controller when we do not want to have a rollout at all.

Links to some of the discussions:

AC:

Add precondition to the revision controller - this would halt the whole revision process

  • introduce a callback true/false to skip the creation of new revision if the quorum is about to be violated.

Description of problem:

The following logs are from namespaces/openshift-apiserver/pods/apiserver-6fcd57c747-57rkr/openshift-apiserver/openshift-apiserver/logs/current.log

    2024-06-06T15:57:06.628216833Z E0606 15:57:06.628186       1 finisher.go:175] FinishRequest: post-timeout activity - time-elapsed: 139.823053ms, panicked: true, err: <nil>, panic-reason: runtime error: invalid memory address or nil pointer dereference
2024-06-06T15:57:06.628216833Z goroutine 192790 [running]:
2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers/finisher.finishRequest.func1.1()
2024-06-06T15:57:06.628216833Z  k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/finisher/finisher.go:105 +0xa5
2024-06-06T15:57:06.628216833Z panic({0x498ac60?, 0x74a51c0?})
2024-06-06T15:57:06.628216833Z  runtime/panic.go:914 +0x21f
2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer.(*ImageStreamImporter).importImages(0xc0c5bf0fc0, {0x5626bb0, 0xc0a50c7dd0}, 0xc07055f4a0, 0xc0a2487600)
2024-06-06T15:57:06.628216833Z  github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer/importer.go:263 +0x1cf5
2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer.(*ImageStreamImporter).Import(0xc0c5bf0fc0, {0x5626bb0, 0xc0a50c7dd0}, 0x0?, 0x0?)
2024-06-06T15:57:06.628216833Z  github.com/openshift/openshift-apiserver/pkg/image/apiserver/importer/importer.go:110 +0x139
2024-06-06T15:57:06.628216833Z github.com/openshift/openshift-apiserver/pkg/image/apiserver/registry/imagestreamimport.(*REST).Create(0xc0033b2240, {0x5626bb0, 0xc0a50c7dd0}, {0x5600058?, 0xc07055f4a0?}, 0xc08e0b9ec0, 0x56422e8?)
2024-06-06T15:57:06.628216833Z  github.com/openshift/openshift-apiserver/pkg/image/apiserver/registry/imagestreamimport/rest.go:337 +0x1574
2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers.(*namedCreaterAdapter).Create(0x55f50e0?, {0x5626bb0?, 0xc0a50c7dd0?}, {0xc0b5704000?, 0x562a1a0?}, {0x5600058?, 0xc07055f4a0?}, 0x1?, 0x2331749?)
2024-06-06T15:57:06.628216833Z  k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/create.go:254 +0x3b
2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers.CreateResource.createHandler.func1.1()
2024-06-06T15:57:06.628216833Z  k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/create.go:184 +0xc6
2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers.CreateResource.createHandler.func1.2()
2024-06-06T15:57:06.628216833Z  k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/create.go:209 +0x39e
2024-06-06T15:57:06.628216833Z k8s.io/apiserver/pkg/endpoints/handlers/finisher.finishRequest.func1()
2024-06-06T15:57:06.628216833Z  k8s.io/apiserver@v0.29.2/pkg/endpoints/handlers/finisher/finisher.go:117 +0x84

Version-Release number of selected component (if applicable):

We applied it to all clusters in CI and checked 3 of them; all 3 share the same errors.

oc --context build09 get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-rc.3   True        False         3d9h    Error while reconciling 4.16.0-rc.3: the cluster operator machine-config is degraded

oc --context build02 get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-rc.2   True        False         15d     Error while reconciling 4.16.0-rc.2: the cluster operator machine-config is degraded

oc --context build03 get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.16   True        False         34h     Error while reconciling 4.15.16: the cluster operator machine-config is degraded

How reproducible:

We applied this PR https://github.com/openshift/release/pull/52574/files to the clusters.

It breaks at least 3 of them.

"qci-pull-through-cache-us-east-1-ci.apps.ci.l2s4.p1.openshiftapps.com" is a registry cache server https://github.com/openshift/release/blob/master/clusters/app.ci/quayio-pull-through-cache/qci-pull-through-cache-us-east-1.yaml

Additional info:

There are lots of image imports in OpenShift CI jobs.

It feels like the registry cache server returns unexpected results to the openshift-apiserver:

2024-06-06T18:13:13.781520581Z E0606 18:13:13.781459       1 strategy.go:60] unable to parse manifest for "sha256:c5bcd0298deee99caaf3ec88de246f3af84f80225202df46527b6f2b4d0eb3c3": unexpected end of JSON input 

Our theory is that the import requests from all CI clusters crashed the cache server, and it sent some unexpected data which caused the apiserver to panic.

 

The expected behaviour is that if the image cannot be pulled from the first mirror in the ImageDigestMirrorSet, it fails over to the next one.
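For reference, that failover is driven by ImageDigestMirrorSet entries of roughly this shape (a sketch; the actual source/mirror pairs used by the CI clusters are defined in the release repo file linked above):

cat <<'EOF' | oc apply -f -
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: qci-pull-through-cache
spec:
  imageDigestMirrors:
  - source: quay.io
    mirrors:
    - qci-pull-through-cache-us-east-1-ci.apps.ci.l2s4.p1.openshiftapps.com
EOF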

Description of problem:

Navigation:
           Storage -> StorageClasses -> Create StorageClass -> Provisioner -> kubernetes.io/gce-pd

Issue:
           "Type" "Select GCE type" "Zone" "Zones" "Replication type" "Select Replication type" are in English.
        

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-06-01-063526

How reproducible:

Always

Steps to Reproduce:

1. Log into the web console and set language to non en_US
2. Navigate to Storage -> StorageClasses -> Create StorageClass -> Provisioner
3. Select Provisioner "kubernetes.io/gce-pd"
4. Observe that "Type" "Select GCE type" "Zone" "Zones" "Replication type" "Select Replication type" are in English

Actual results:

Content is in English

Expected results:

Content should be in set language.

Additional info:

Screenshot reference attached

Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/313

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

We added the following to openstack-kubelet-nodename.service: https://github.com/openshift/machine-config-operator/pull/4570

But wait-for-br-ex-up.service is disabled, so it doesn't normally do anything. This is why it doesn't break anything on other platforms, even though it's never going to work the way we are currently configuring workers for HCP. However, this Wants directive enables it when openshift-kubelet-nodename is added to the systemd transaction, so adding it broke us.

"Wants" adds it to the transaction and it hangs. If it failed it would be fine, but it doesn't. It also adds a RequiredBy on node-valid-hostname.

"br-ex" is up but it doesn't matter because that's not what it's testing. It's testing that /run/nodeip-configuration/br-ex-up  exists, which it won't because it's written by /etc/NetworkManager/dispatcher.d/30-resolv-prepender, which is empty.

Version-Release number of selected component (if applicable):

4.18  

Component Readiness has found a potential regression in the following test:

[sig-node] node-lifecycle detects unexpected not ready node

Extreme regression detected.
Fishers Exact probability of a regression: 100.00%.
Test pass rate dropped from 100.00% to 84.62%.

Sample (being evaluated) Release: 4.18
Start Time: 2024-10-29T00:00:00Z
End Time: 2024-11-05T23:59:59Z
Success Rate: 84.62%
Successes: 33
Failures: 6
Flakes: 0

Base (historical) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-10-01T23:59:59Z
Success Rate: 100.00%
Successes: 79
Failures: 0
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&FeatureSet=default&Installer=ipi&LayeredProduct=none&Network=ovn&NetworkAccess=default&Platform=aws&Procedure=none&Scheduler=default&SecurityMode=default&Suite=serial&Topology=ha&Upgrade=none&baseEndTime=2024-10-01%2023%3A59%3A59&baseRelease=4.17&baseStartTime=2024-09-01%2000%3A00%3A00&capability=Other&columnGroupBy=Architecture%2CNetwork%2CPlatform%2CTopology&component=Node%20%2F%20Kubelet&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20aws%20serial%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=CGroupMode%3Av2&includeVariant=ContainerRuntime%3Arunc&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Network%3Aovn&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&includeVariant=Topology%3Amicroshift&minFail=3&passRateAllTests=0&passRateNewTests=95&pity=5&sampleEndTime=2024-11-05%2023%3A59%3A59&sampleRelease=4.18&sampleStartTime=2024-10-29%2000%3A00%3A00&testId=openshift-tests%3A1f3a2a9f8d7b6e8deb502468746bc363&testName=%5Bsig-node%5D%20node-lifecycle%20detects%20unexpected%20not%20ready%20node

Description of problem:

The node-joiner tool does not honour additionalNTPSources.
As mentioned in https://docs.openshift.com/container-platform/4.16/installing/installing_with_agent_based_installer/installation-config-parameters-agent.html,
setting additionalNTPSources is possible at day 1, but the setting is not honoured at day 2 (when adding nodes).


    

How reproducible:

always

    

Steps to Reproduce:

Create an agent config with

AdditionalNTPSources:
  - 10.10.10.10
  - 10.10.10.11
hosts:
    - hostname: extra-worker-0
      interfaces:
        - name: eth0
          macAddress:  0xDEADBEEF
    - hostname: extra-worker-1
      interfaces:
        - name: eth0
          macAddress: 00:02:46:e3:9e:8c
    - hostname:  0xDEADBEEF
      interfaces:
        - name: eth0
          macAddress:  0xDEADBEEF

    

Actual results:

NTP on added node cannot join the NTP server.
ntp-synced Status:failure Message:Host couldn't synchronize with any NTP server

    

Expected results:

NTP on added node can contact the NTP server.

Please review the following PR: https://github.com/operator-framework/operator-marketplace/pull/574

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

The version page in our docs is out of date and needs to be updated with the current versioning standards we expect.

 

The minimum supported OCP management cluster / Kubernetes version needs to be added.

Please review the following PR: https://github.com/openshift/images/pull/192

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    In an effort to ensure all HA components are not degraded by design during normal e2e tests or upgrades, we are collecting all operators that blip Degraded=True during any payload job run.

This card captures image-registry operator that blips Degraded=True during upgrade runs.

Example Job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-e2e-gcp-ovn-upgrade/1843366977876267008

Reasons associated with the blip: ProgressDeadlineExceeded, NodeCADaemonControllerError

For now, we put an exception in the test. But it is expected that teams take action to fix those and remove the exceptions after the fixes go in.

Exception can be found here: https://github.com/openshift/origin/blob/fd6fe36319c39b51ab0f02ecb8e2777c0e1bb210/pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go#L319

See linked issue for more explanation on the effort.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

After click "Don's show again" on Lightspeed popup modal, it went to user preference page, and highlighted "Hide Lightspeed" part with rectangle lines, if now click "Lightspeed" popup button at the right bottom, the highlighted rectangle lines lay above the popup modal.
    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-09-09-150616
    

How reproducible:

Always
    

Steps to Reproduce:

    1.Clicked "Don's show again" on Lightspeed popup modal, it went to user preference page, and highlighted "Hide Lightspeed" part with rectangle lines. At the same time, click "Lightspeed" popup button at the right bottom.
    2.
    3.
    

Actual results:

1. The highlighted rectangle lines lay above the popup modal.
Screenshot: https://drive.google.com/drive/folders/15te0dbavJUTGtqRYFt-rM_U8SN7euFK5?usp=sharing
    

Expected results:

1. The Lightspeed popup modal should be on the top layer.
    

Additional info:


    


Description of problem:

The openshift-ingress/router-default never stops reconciling in the ingress operator.

 

2024-08-22T15:59:22.789Z	INFO	operator.ingress_controller	controller/controller.go:114	reconciling	{"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2024-08-22T15:59:22.799Z	INFO	operator.status_controller	controller/controller.go:114	Reconciling	{"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2024-08-22T15:59:22.868Z	INFO	operator.ingress_controller	ingress/deployment.go:135	updated router deployment	{"namespace": "openshift-ingress", "name": "router-default", "diff": "  &v1.Deployment{\n  \tTypeMeta:   {},\n  \tObjectMeta: {Name: \"router-default\", Namespace: \"openshift-ingress\", UID: \"6cf98392-8782-4741-b5c9-ce63fb77879a\", ResourceVersion: \"6244\", ...},\n  \tSpec: v1.DeploymentSpec{\n  \t\tReplicas: &1,\n  \t\tSelector: &{MatchLabels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\"}},\n  \t\tTemplate: {ObjectMeta: {Labels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\", \"ingresscontroller.operator.openshift.io/hash\": \"9c69cc8d\"}, Annotations: {\"target.workload.openshift.io/management\": `{\"effect\": \"PreferredDuringScheduling\"}`}}, Spec: {Volumes: {{Name: \"default-certificate\", VolumeSource: {Secret: &{SecretName: \"default-ingress-cert\", DefaultMode: &420}}}, {Name: \"service-ca-bundle\", VolumeSource: {ConfigMap: &{LocalObjectReference: {Name: \"service-ca-bundle\"}, Items: {{Key: \"service-ca.crt\", Path: \"service-ca.crt\"}}, DefaultMode: &420, Optional: &false}}}, {Name: \"stats-auth\", VolumeSource: {Secret: &{SecretName: \"router-stats-default\", DefaultMode: &420}}}, {Name: \"metrics-certs\", VolumeSource: {Secret: &{SecretName: \"router-metrics-certs-default\", DefaultMode: &420}}}}, Containers: {{Name: \"router\", Image: \"registry.build05.ci.openshift.org/ci-op-vxbb8hxy/stable@sha256:f\"..., Ports: {{Name: \"http\", ContainerPort: 80, Protocol: \"TCP\"}, {Name: \"https\", ContainerPort: 443, Protocol: \"TCP\"}, {Name: \"metrics\", ContainerPort: 1936, Protocol: \"TCP\"}}, Env: {{Name: \"DEFAULT_CERTIFICATE_DIR\", Value: \"/etc/pki/tls/private\"}, {Name: \"DEFAULT_DESTINATION_CA_PATH\", Value: \"/var/run/configmaps/service-ca/service-ca.crt\"}, {Name: \"RELOAD_INTERVAL\", Value: \"5s\"}, {Name: \"ROUTER_ALLOW_WILDCARD_ROUTES\", Value: \"false\"}, ...}, ...}}, RestartPolicy: \"Always\", TerminationGracePeriodSeconds: &3600, ...}},\n  \t\tStrategy: v1.DeploymentStrategy{\n- \t\t\tType:          \"RollingUpdate\",\n+ \t\t\tType:          \"\",\n- \t\t\tRollingUpdate: s\"&RollingUpdateDeployment{MaxUnavailable:25%,MaxSurge:25%,}\",\n+ \t\t\tRollingUpdate: nil,\n  \t\t},\n  \t\tMinReadySeconds:      30,\n  \t\tRevisionHistoryLimit: &10,\n  \t\t... // 2 identical fields\n  \t},\n  \tStatus: {ObservedGeneration: 1, Replicas: 1, UpdatedReplicas: 1, ReadyReplicas: 1, ...},\n  }\n"}
2024-08-22T15:59:22.884Z	ERROR	operator.ingress_controller	controller/controller.go:114	got retryable error; requeueing	{"after": "1m0s", "error": "IngressController is degraded: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)"}
2024-08-22T15:59:24.820Z	INFO	operator.ingress_controller	handler/enqueue_mapped.go:103	queueing ingress	{"name": "default", "related": ""}
2024-08-22T15:59:24.820Z	INFO	operator.ingress_controller	handler/enqueue_mapped.go:103	queueing ingress	{"name": "default", "related": ""}
2024-08-22T15:59:24.820Z	INFO	operator.ingress_controller	controller/controller.go:114	reconciling	{"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2024-08-22T15:59:24.887Z	INFO	operator.ingress_controller	ingress/deployment.go:135	updated router deployment	{"namespace": "openshift-ingress", "name": "router-default", "diff": "  &v1.Deployment{\n  \tTypeMeta:   {},\n  \tObjectMeta: {Name: \"router-default\", Namespace: \"openshift-ingress\", UID: \"6cf98392-8782-4741-b5c9-ce63fb77879a\", ResourceVersion: \"7194\", ...},\n  \tSpec: v1.DeploymentSpec{\n  \t\tReplicas: &1,\n  \t\tSelector: &{MatchLabels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\"}},\n  \t\tTemplate: {ObjectMeta: {Labels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\", \"ingresscontroller.operator.openshift.io/hash\": \"9c69cc8d\"}, Annotations: {\"target.workload.openshift.io/management\": `{\"effect\": \"PreferredDuringScheduling\"}`}}, Spec: {Volumes: {{Name: \"default-certificate\", VolumeSource: {Secret: &{SecretName: \"default-ingress-cert\", DefaultMode: &420}}}, {Name: \"service-ca-bundle\", VolumeSource: {ConfigMap: &{LocalObjectReference: {Name: \"service-ca-bundle\"}, Items: {{Key: \"service-ca.crt\", Path: \"service-ca.crt\"}}, DefaultMode: &420, Optional: &false}}}, {Name: \"stats-auth\", VolumeSource: {Secret: &{SecretName: \"router-stats-default\", DefaultMode: &420}}}, {Name: \"metrics-certs\", VolumeSource: {Secret: &{SecretName: \"router-metrics-certs-default\", DefaultMode: &420}}}}, Containers: {{Name: \"router\", Image: \"registry.build05.ci.openshift.org/ci-op-vxbb8hxy/stable@sha256:f\"..., Ports: {{Name: \"http\", ContainerPort: 80, Protocol: \"TCP\"}, {Name: \"https\", ContainerPort: 443, Protocol: \"TCP\"}, {Name: \"metrics\", ContainerPort: 1936, Protocol: \"TCP\"}}, Env: {{Name: \"DEFAULT_CERTIFICATE_DIR\", Value: \"/etc/pki/tls/private\"}, {Name: \"DEFAULT_DESTINATION_CA_PATH\", Value: \"/var/run/configmaps/service-ca/service-ca.crt\"}, {Name: \"RELOAD_INTERVAL\", Value: \"5s\"}, {Name: \"ROUTER_ALLOW_WILDCARD_ROUTES\", Value: \"false\"}, ...}, ...}}, RestartPolicy: \"Always\", TerminationGracePeriodSeconds: &3600, ...}},\n  \t\tStrategy: v1.DeploymentStrategy{\n- \t\t\tType:          \"RollingUpdate\",\n+ \t\t\tType:          \"\",\n- \t\t\tRollingUpdate: s\"&RollingUpdateDeployment{MaxUnavailable:25%,MaxSurge:25%,}\",\n+ \t\t\tRollingUpdate: nil,\n  \t\t},\n  \t\tMinReadySeconds:      30,\n  \t\tRevisionHistoryLimit: &10,\n  \t\t... // 2 identical fields\n  \t},\n  \tStatus: {ObservedGeneration: 1, Replicas: 1, UpdatedReplicas: 1, ReadyReplicas: 1, ...},\n  }\n"}
2024-08-22T15:59:24.911Z	INFO	operator.route_metrics_controller	controller/controller.go:114	reconciling	{"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2024-08-22T15:59:24.911Z	INFO	operator.status_controller	controller/controller.go:114	Reconciling	{"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2024-08-22T15:59:24.911Z	INFO	operator.certificate_controller	controller/controller.go:114	Reconciling	{"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2024-08-22T15:59:24.911Z	INFO	operator.ingressclass_controller	controller/controller.go:114	reconciling	{"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2024-08-22T15:59:24.913Z	INFO	operator.ingress_controller	controller/controller.go:114	reconciling	{"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2024-08-22T15:59:24.924Z	INFO	operator.status_controller	controller/controller.go:114	Reconciling	{"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2024-08-22T15:59:24.984Z	INFO	operator.ingress_controller	ingress/deployment.go:135	updated router deployment	{"namespace": "openshift-ingress", "name": "router-default", "diff": "  &v1.Deployment{\n  \tTypeMeta:   {},\n  \tObjectMeta: {Name: \"router-default\", Namespace: \"openshift-ingress\", UID: \"6cf98392-8782-4741-b5c9-ce63fb77879a\", ResourceVersion: \"7194\", ...},\n  \tSpec: v1.DeploymentSpec{\n  \t\tReplicas: &1,\n  \t\tSelector: &{MatchLabels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\"}},\n  \t\tTemplate: {ObjectMeta: {Labels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\", \"ingresscontroller.operator.openshift.io/hash\": \"9c69cc8d\"}, Annotations: {\"target.workload.openshift.io/management\": `{\"effect\": \"PreferredDuringScheduling\"}`}}, Spec: {Volumes: {{Name: \"default-certificate\", VolumeSource: {Secret: &{SecretName: \"default-ingress-cert\", DefaultMode: &420}}}, {Name: \"service-ca-bundle\", VolumeSource: {ConfigMap: &{LocalObjectReference: {Name: \"service-ca-bundle\"}, Items: {{Key: \"service-ca.crt\", Path: \"service-ca.crt\"}}, DefaultMode: &420, Optional: &false}}}, {Name: \"stats-auth\", VolumeSource: {Secret: &{SecretName: \"router-stats-default\", DefaultMode: &420}}}, {Name: \"metrics-certs\", VolumeSource: {Secret: &{SecretName: \"router-metrics-certs-default\", DefaultMode: &420}}}}, Containers: {{Name: \"router\", Image: \"registry.build05.ci.openshift.org/ci-op-vxbb8hxy/stable@sha256:f\"..., Ports: {{Name: \"http\", ContainerPort: 80, Protocol: \"TCP\"}, {Name: \"https\", ContainerPort: 443, Protocol: \"TCP\"}, {Name: \"metrics\", ContainerPort: 1936, Protocol: \"TCP\"}}, Env: {{Name: \"DEFAULT_CERTIFICATE_DIR\", Value: \"/etc/pki/tls/private\"}, {Name: \"DEFAULT_DESTINATION_CA_PATH\", Value: \"/var/run/configmaps/service-ca/service-ca.crt\"}, {Name: \"RELOAD_INTERVAL\", Value: \"5s\"}, {Name: \"ROUTER_ALLOW_WILDCARD_ROUTES\", Value: \"false\"}, ...}, ...}}, RestartPolicy: \"Always\", TerminationGracePeriodSeconds: &3600, ...}},\n  \t\tStrategy: v1.DeploymentStrategy{\n- \t\t\tType:          \"RollingUpdate\",\n+ \t\t\tType:          \"\",\n- \t\t\tRollingUpdate: s\"&RollingUpdateDeployment{MaxUnavailable:25%,MaxSurge:25%,}\",\n+ \t\t\tRollingUpdate: nil,\n  \t\t},\n  \t\tMinReadySeconds:      30,\n  \t\tRevisionHistoryLimit: &10,\n  \t\t... // 2 identical fields\n  \t},\n  \tStatus: {ObservedGeneration: 1, Replicas: 1, UpdatedReplicas: 1, ReadyReplicas: 1, ...},\n  }\n"}
2024-08-22T15:59:43.457Z	INFO	operator.ingress_controller	controller/controller.go:114	reconciling	{"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2024-08-22T15:59:43.539Z	INFO	operator.ingress_controller	ingress/deployment.go:135	updated router deployment	{"namespace": "openshift-ingress", "name": "router-default", "diff": "  &v1.Deployment{\n  \tTypeMeta:   {},\n  \tObjectMeta: {Name: \"router-default\", Namespace: \"openshift-ingress\", UID: \"6cf98392-8782-4741-b5c9-ce63fb77879a\", ResourceVersion: \"7194\", ...},\n  \tSpec: v1.DeploymentSpec{\n  \t\tReplicas: &1,\n  \t\tSelector: &{MatchLabels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\"}},\n  \t\tTemplate: {ObjectMeta: {Labels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\", \"ingresscontroller.operator.openshift.io/hash\": \"9c69cc8d\"}, Annotations: {\"target.workload.openshift.io/management\": `{\"effect\": \"PreferredDuringScheduling\"}`}}, Spec: {Volumes: {{Name: \"default-certificate\", VolumeSource: {Secret: &{SecretName: \"default-ingress-cert\", DefaultMode: &420}}}, {Name: \"service-ca-bundle\", VolumeSource: {ConfigMap: &{LocalObjectReference: {Name: \"service-ca-bundle\"}, Items: {{Key: \"service-ca.crt\", Path: \"service-ca.crt\"}}, DefaultMode: &420, Optional: &false}}}, {Name: \"stats-auth\", VolumeSource: {Secret: &{SecretName: \"router-stats-default\", DefaultMode: &420}}}, {Name: \"metrics-certs\", VolumeSource: {Secret: &{SecretName: \"router-metrics-certs-default\", DefaultMode: &420}}}}, Containers: {{Name: \"router\", Image: \"registry.build05.ci.openshift.org/ci-op-vxbb8hxy/stable@sha256:f\"..., Ports: {{Name: \"http\", ContainerPort: 80, Protocol: \"TCP\"}, {Name: \"https\", ContainerPort: 443, Protocol: \"TCP\"}, {Name: \"metrics\", ContainerPort: 1936, Protocol: \"TCP\"}}, Env: {{Name: \"DEFAULT_CERTIFICATE_DIR\", Value: \"/etc/pki/tls/private\"}, {Name: \"DEFAULT_DESTINATION_CA_PATH\", Value: \"/var/run/configmaps/service-ca/service-ca.crt\"}, {Name: \"RELOAD_INTERVAL\", Value: \"5s\"}, {Name: \"ROUTER_ALLOW_WILDCARD_ROUTES\", Value: \"false\"}, ...}, ...}}, RestartPolicy: \"Always\", TerminationGracePeriodSeconds: &3600, ...}},\n  \t\tStrategy: v1.DeploymentStrategy{\n- \t\t\tType:          \"RollingUpdate\",\n+ \t\t\tType:          \"\",\n- \t\t\tRollingUpdate: s\"&RollingUpdateDeployment{MaxUnavailable:25%,MaxSurge:25%,}\",\n+ \t\t\tRollingUpdate: nil,\n  \t\t},\n  \t\tMinReadySeconds:      30,\n  \t\tRevisionHistoryLimit: &10,\n  \t\t... // 2 identical fields\n  \t},\n  \tStatus: {ObservedGeneration: 1, Replicas: 1, UpdatedReplicas: 1, ReadyReplicas: 1, ...},\n  }\n"}
2024-08-22T16:01:07.866Z	INFO	operator.route_metrics_controller	handler/enqueue_mapped.go:103	queueing ingresscontroller	{"name": "default"}
2024-08-22T16:01:07.866Z	INFO	operator.route_metrics_controller	handler/enqueue_mapped.go:103	queueing ingresscontroller	{"name": "default"}
2024-08-22T16:01:07.866Z	INFO	operator.route_metrics_controller	controller/controller.go:114	reconciling	{"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2024-08-22T16:01:07.870Z	INFO	operator.route_metrics_controller	handler/enqueue_mapped.go:103	queueing ingresscontroller	{"name": "default"}
2024-08-22T16:01:07.870Z	INFO	operator.route_metrics_controller	handler/enqueue_mapped.go:103	queueing ingresscontroller	{"name": "default"}
2024-08-22T16:01:07.870Z	INFO	operator.route_metrics_controller	controller/controller.go:114	reconciling	{"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2024-08-22T16:01:07.899Z	INFO	operator.route_metrics_controller	handler/enqueue_mapped.go:103	queueing ingresscontroller	{"name": "default"}
2024-08-22T16:01:07.899Z	INFO	operator.route_metrics_controller	handler/enqueue_mapped.go:103	queueing ingresscontroller	{"name": "default"}
2024-08-22T16:01:07.899Z	INFO	operator.route_metrics_controller	controller/controller.go:114	reconciling	{"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2024-08-22T16:01:08.957Z	INFO	operator.route_metrics_controller	handler/enqueue_mapped.go:103	queueing ingresscontroller	{"name": "default"}
2024-08-22T16:01:08.957Z	INFO	operator.route_metrics_controller	handler/enqueue_mapped.go:103	queueing ingresscontroller	{"name": "default"}
2024-08-22T16:01:08.957Z	INFO	operator.route_metrics_controller	controller/controller.go:114	reconciling	{"request": {"name":"default","namespace":"openshift-ingress-operator"}} 

Version-Release number of selected component (if applicable):

4.17    

 

The diff is:

❯ cat /tmp/msg.json | jq -r '.diff'
  &v1.Deployment{
  	TypeMeta:   {},
  	ObjectMeta: {Name: "router-default", Namespace: "openshift-ingress", UID: "6cf98392-8782-4741-b5c9-ce63fb77879a", ResourceVersion: "6244", ...},
  	Spec: v1.DeploymentSpec{
  		Replicas: &1,
  		Selector: &{MatchLabels: {"ingresscontroller.operator.openshift.io/deployment-ingresscontroller": "default"}},
  		Template: {ObjectMeta: {Labels: {"ingresscontroller.operator.openshift.io/deployment-ingresscontroller": "default", "ingresscontroller.operator.openshift.io/hash": "9c69cc8d"}, Annotations: {"target.workload.openshift.io/management": `{"effect": "PreferredDuringScheduling"}`}}, Spec: {Volumes: {{Name: "default-certificate", VolumeSource: {Secret: &{SecretName: "default-ingress-cert", DefaultMode: &420}}}, {Name: "service-ca-bundle", VolumeSource: {ConfigMap: &{LocalObjectReference: {Name: "service-ca-bundle"}, Items: {{Key: "service-ca.crt", Path: "service-ca.crt"}}, DefaultMode: &420, Optional: &false}}}, {Name: "stats-auth", VolumeSource: {Secret: &{SecretName: "router-stats-default", DefaultMode: &420}}}, {Name: "metrics-certs", VolumeSource: {Secret: &{SecretName: "router-metrics-certs-default", DefaultMode: &420}}}}, Containers: {{Name: "router", Image: "registry.build05.ci.openshift.org/ci-op-vxbb8hxy/stable@sha256:f"..., Ports: {{Name: "http", ContainerPort: 80, Protocol: "TCP"}, {Name: "https", ContainerPort: 443, Protocol: "TCP"}, {Name: "metrics", ContainerPort: 1936, Protocol: "TCP"}}, Env: {{Name: "DEFAULT_CERTIFICATE_DIR", Value: "/etc/pki/tls/private"}, {Name: "DEFAULT_DESTINATION_CA_PATH", Value: "/var/run/configmaps/service-ca/service-ca.crt"}, {Name: "RELOAD_INTERVAL", Value: "5s"}, {Name: "ROUTER_ALLOW_WILDCARD_ROUTES", Value: "false"}, ...}, ...}}, RestartPolicy: "Always", TerminationGracePeriodSeconds: &3600, ...}},
  		Strategy: v1.DeploymentStrategy{
- 			Type:          "RollingUpdate",
+ 			Type:          "",
- 			RollingUpdate: s"&RollingUpdateDeployment{MaxUnavailable:25%,MaxSurge:25%,}",
+ 			RollingUpdate: nil,
  		},
  		MinReadySeconds:      30,
  		RevisionHistoryLimit: &10,
  		... // 2 identical fields
  	},
  	Status: {ObservedGeneration: 1, Replicas: 1, UpdatedReplicas: 1, ReadyReplicas: 1, ...},
  } 

Description of problem:

    Based on the results in [Sippy|https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=gcp&Platform=gcp&Scheduler=default&SecurityMode=default&Suite=unknown&Suite=unknown&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=operator-conditions&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Etcd&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20gcp%20unknown%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-08-19%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-08-13%2000%3A00%3A00&testId=Operator%20results%3A45d55df296fbbfa7144600dce70c1182&testName=operator%20conditions%20etcd], it appears that the periodic tests are not waiting for the etcd operator to complete before exiting.

The test is supposed to wait for up to 20 mins after the final control plane machine is rolled, to allow operators to settle. But we are seeing the etcd operator triggering 2 further revisions after this happens.

We need to understand whether the etcd operator is correctly rolling out, or whether these changes should have rolled out before the final machine went away, and whether there is a way to add more stability to our checks so that all of the operators stabilise and have been stable for at least some period (1 minute).
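
For illustration, a minimal sketch of what a "stable for at least one minute" check could look like in such a wait loop, assuming a hypothetical operatorsSettled() helper that reports whether all ClusterOperators are currently settled (this is not the actual test code):

package stability

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForStableOperators polls until operatorsSettled has reported true
// continuously for a full one-minute window, resetting the window whenever
// the operators blip back to unsettled.
func waitForStableOperators(ctx context.Context, operatorsSettled func() (bool, error)) error {
	const (
		window   = 1 * time.Minute
		interval = 10 * time.Second
		timeout  = 20 * time.Minute
	)
	var stableSince time.Time
	return wait.PollUntilContextTimeout(ctx, interval, timeout, true, func(ctx context.Context) (bool, error) {
		ok, err := operatorsSettled()
		if err != nil || !ok {
			stableSince = time.Time{} // reset the window on any error or blip
			return false, nil         // keep polling; transient failures shouldn't abort the wait
		}
		if stableSince.IsZero() {
			stableSince = time.Now()
		}
		return time.Since(stableSince) >= window, nil
	})
}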

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Context

Outcomes

  • all dynamic plugins owned by our team are FIPS compliant
    • no dynamic plugin includes non-compliant build options for Go
  • Dynamic plugins list to be checked
    • Troubleshooting panel
    • Logging
    • Monitoring
    • Dashboards
    • Distributed tracing

Steps

  1. Remove non-compliant build options for Go
  2. Sync with QE to test on a FIPS-compliant OS using the FIPS or Die feature; check whether these tests can be automated
  3. Update COO midstream with the fix commit

Acceptance Criteria

  1. All golang-based containers use the ENV GOEXPERIMENT=strictfipsruntime.
  2. All golang-based containers use the ENV CGO_ENABLED=1.
  3. All golang-based containers use the build tag strictfipsruntime.
  4. All golang-based containers do not use static linking.
  5. All golang-based containers do not use the build tag no_openssl.
  6. All containers use a RHEL ELS runner base image, e.g. registry.redhat.io/rhel9-4-els/rhel:9.4.
  7. All images pass the check-payload checks successfully.

Description of problem:

On the NetworkPolicies page, the position of the title and the tabs does not match other pages. It should use the same style as the others: move the title above the tabs.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

    If an instance type is specified in the install-config.yaml, the installer will try to validate its availability in the given region and that it meets the minimum requirements for OCP nodes. When that happens, the `ec2:DescribeInstanceTypes` permission is used, but it is not validated by the installer as a required permission for installs.

Version-Release number of selected component (if applicable):

    4.16+

How reproducible:

    Always by setting an instanceType in the install-config.yaml

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    If you install with a user with minimal permissions, you'll get the error:

level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: [controlPlane.platform.aws: Internal error: error listing instance types: fetching instance types: UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-8phprrsm-ccf9a-minimal-perm is not authorized to perform: ec2:DescribeInstanceTypes because no identity-based policy allows the ec2:DescribeInstanceTypes action
                level=error msg=	status code: 403, request id: 559344f4-0fc3-4a6c-a6ee-738d4e1c0099, compute[0].platform.aws: Internal error: error listing instance types: fetching instance types: UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-8phprrsm-ccf9a-minimal-perm is not authorized to perform: ec2:DescribeInstanceTypes because no identity-based policy allows the ec2:DescribeInstanceTypes action   
                level=error msg=	status code: 403, request id: 584cc325-9057-4c31-bb7d-2f4458336605]

Expected results:

    The installer fails with an explicit message saying that `ec2:DescribeInstanceTypes` is required.

Additional info:

    

Description of problem:

The cluster policy controller does not get the same feature flags that other components in the control plane are getting.
    

Version-Release number of selected component (if applicable):

4.18
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Create hosted cluster
    2. Get cluster-policy-controller-config configmap from control plane namespace
    

Actual results:

Default feature gates are not included in the config
    

Expected results:

Feature gates are included in the config
    

Additional info:


    

This E2E tests whether etcd is able to block the rollout of a new revision when quorum is not safe.

Description of problem:

    e980 is a valid system type for the Madrid region, but it is not listed as such in the installer.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Easily

Steps to Reproduce:

    1. Try to deploy to mad02 with SysType set to e980
    2. Fail
    3.
    

Actual results:

    Installer exits

Expected results:

    Installer should continue as it's a valid system type.

Additional info:

    

Please review the following PR: https://github.com/openshift/csi-operator/pull/78

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

When the user sets the Chinese language and checks the OpenShift Lightspeed nav modal, "Meet OpenShift Lightspeed" is translated as just "OpenShift Lightspeed"; "Meet" is not translated.
    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-11-10-133523
    

How reproducible:

Always
    

Steps to Reproduce:

    1. When Chinese language is set, check the "Meet OpenShift Lightspeed" on OpenShift Lightspeed nav modal.
    2.
    3.
    

Actual results:

1. The "Meet OpenShift Lightspeed" is translated to "OpenShift Lightspeed", "Meet" is not translated.
    

Expected results:

1. "Meet" could be translated in Chinese. It has been translated for other languages.
    

Additional info:


    

The HyperShift codebase has numerous examples of MustParse*() functions being used on non-constant input. This is not their intended use, as any failure will cause a panic in the controller.

In a few cases they are called on user-provided input, meaning any authenticated user can (intentionally or unintentionally) deny service to all other users by providing invalid input that continuously crashes the HostedCluster controller.

This is probably a security issue, but as I have already described it in https://github.com/openshift/hypershift/pull/4546 there is no reason to embargo it.
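
As an illustration of the safer pattern (not the actual HyperShift fix): reserve resource.MustParse for compile-time constants and use resource.ParseQuantity, which returns an error, for anything user-provided. validateVolumeSize below is a hypothetical helper:

package example

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// Fine: the input is a constant, so a parse failure is a programming error.
var defaultVolumeSize = resource.MustParse("30Gi")

// validateVolumeSize handles user-provided input: ParseQuantity returns an
// error instead of panicking, so bad input can be rejected or surfaced as a
// condition rather than crashing the controller.
func validateVolumeSize(userValue string) (resource.Quantity, error) {
	q, err := resource.ParseQuantity(userValue)
	if err != nil {
		return resource.Quantity{}, fmt.Errorf("invalid volume size %q: %w", userValue, err)
	}
	return q, nil
}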

Description of problem:

oc-mirror should not panic when it fails to get the release signature

Version-Release number of selected component (if applicable):

oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202407291514.p0.gdbf115f.assembly.stream.el9-dbf115f", GitCommit:"dbf115f547a19f12ab72e7b326be219a47d460a0", GitTreeState:"clean", BuildDate:"2024-07-29T15:52:52Z", GoVersion:"go1.22.4 (Red Hat 1.22.4-2.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

100%

Steps to Reproduce:

1) Mirror2disk + disk2mirror with the following imagesetconfig, and mirror to the enterprise registry:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    channels:
    - name: stable-4.15                                             
      type: ocp
      minVersion: '4.15.18'
      maxVersion: '4.15.18'

2) Set up squid with a whitelist allowing only the enterprise registry and the OSUS service:
cat /etc/squid/squid.conf
http_port 3128
coredump_dir /var/spool/squid
acl whitelist dstdomain "/etc/squid/whitelist"
http_access allow whitelist
http_access deny !whitelist

cat /etc/squid/whitelist 
my-route-zhouy.apps.yinzhou-88.qe.devcluster.openshift.com             -------------registry route  (oc get route -n your registry app's project)
update-service-oc-mirror-route-openshift-update-service.apps.yinzhou-88.qe.devcluster.openshift.com        ---osus route  (oc get route -n openshift-update-service)

sudo systemctl restart squid
export https_proxy=http://127.0.0.1:3128
export http_proxy=http://127.0.0.1:3128

3) Set the registry redirect with:
cat ~/.config/containers/registries.conf 
[[registry]]
  location = "quay.io"
  insecure = false
  blocked = false
  mirror-by-digest-only = false
  prefix = ""
  [[registry.mirror]]
    location = "my-route-zhouy.apps.yinzhou-88.qe.devcluster.openshift.com"
    insecure = false


4) Use the same imagesetconfig and mirror to a new folder :
`oc-mirror -c config-38037.yaml file://new-folder --v2`

Actual results:

4) The oc-mirror command panics with the error:

I0812 06:45:26.026441  199941 core-cincinnati.go:508] Using proxy 127.0.0.1:3128 to request updates from https://update-service-oc-mirror-route-openshift-update-service.apps.yinzhou-417.qe.devcluster.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.15&id=a6097264-8b29-438f-9e71-4aba1e9ec32d
2024/08/12 06:45:26  [ERROR]  : http request Get "https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=0f55261077557d1bb909c06b115e0c79b0025677be57ba2f045495c11e2443ee/signature-1": Forbidden
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x3d1e3f6]
 
goroutine 1 [running]:
github.com/openshift/oc-mirror/v2/internal/pkg/release.SignatureSchema.GenerateReleaseSignatures({

{0x55d8670, 0xc000729e00}

, {0x4c7b348, 0x15}, {0xc000058c60, 0x1c, {...}, {...}, {...}, {...}, ...}, ..., ...}, ...)
/home/fedora/yinzhou/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/release/signature.go:97 +0x676
github.com/openshift/oc-mirror/v2/internal/pkg/release.(*CincinnatiSchema).GetReleaseReferenceImages(0xc0007203c0, {0x55caf68, 0x764c060})
/home/fedora/yinzhou/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/release/cincinnati.go:230 +0x70b
github.com/openshift/oc-mirror/v2/internal/pkg/release.(*LocalStorageCollector).ReleaseImageCollector(0xc000b12388, {0x55caf68, 0x764c060})
/home/fedora/yinzhou/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/release/local_stored_collector.go:58 +0x407
github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).CollectAll(0xc000ae8908, {0x55caf68, 0x764c060})
/home/fedora/yinzhou/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:955 +0x122
github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).RunMirrorToDisk(0xc000ae8908, 0xc0005f3b08, {0xa?, 0x20?, 0x20?})
/home/fedora/yinzhou/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:707 +0x1aa
github.com/openshift/oc-mirror/v2/internal/pkg/cli.(*ExecutorSchema).Run(0xc000ae8908, 0xc0005f1640?, {0xc0005f1640?, 0x0?, 0x0?})
/home/fedora/yinzhou/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:459 +0x149
github.com/openshift/oc-mirror/v2/internal/pkg/cli.NewMirrorCmd.func1(0xc0005f3b08, {0xc0005f1640, 0x1, 0x4})
/home/fedora/yinzhou/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/internal/pkg/cli/executor.go:207 +0x32a
github.com/spf13/cobra.(*Command).execute(0xc0005f3b08, {0xc000166010, 0x4, 0x4})
/home/fedora/yinzhou/oc-mirror/vendor/github.com/spf13/cobra/command.go:987 +0xab1
github.com/spf13/cobra.(*Command).ExecuteC(0xc0005f3b08)
/home/fedora/yinzhou/oc-mirror/vendor/github.com/spf13/cobra/command.go:1115 +0x3ff
github.com/spf13/cobra.(*Command).Execute(0x741ec38?)
/home/fedora/yinzhou/oc-mirror/vendor/github.com/spf13/cobra/command.go:1039 +0x13
main.main()
/home/fedora/yinzhou/oc-mirror/cmd/oc-mirror/main.go:10 +0x18
 

Expected results:

The command may fail, but it should not panic
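
A minimal sketch of the expected behaviour rather than the actual oc-mirror code: when the signature fetch fails, check the error and status and propagate an error instead of continuing with a nil response, so the command exits with a message rather than a nil pointer dereference. fetchReleaseSignature is a hypothetical helper:

package signatures

import (
	"context"
	"fmt"
	"io"
	"net/http"
)

// fetchReleaseSignature shows the pattern: check the error and the HTTP status
// before touching the response body, and return an error so the caller can
// fail the command cleanly instead of panicking on a nil value later.
func fetchReleaseSignature(ctx context.Context, client *http.Client, url string) ([]byte, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		// e.g. the proxy refused the request: surface the error to the caller.
		return nil, fmt.Errorf("fetching release signature from %s: %w", url, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("fetching release signature from %s: unexpected status %s", url, resp.Status)
	}
	return io.ReadAll(resp.Body)
}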
 

 

Description of problem:

Before the fix for https://issues.redhat.com/browse/OCPBUGS-42253 is merged upstream and propagated, we can apply a temporary fix directly in the samples operator repo, unblocking us from the need to wait for that to happen.
    

Version-Release number of selected component (if applicable):


    

How reproducible:

always
    

Steps to Reproduce:

    1.oc new-app openshift/rails-postgresql-example
    2.
    3.
    

Actual results:

app pod in crash loop
    

Expected results:

app working
    

Additional info:


    

Description of the problem:

BE 2.35.1 - OCP 4.17 ARM64 cluster -  Selecting CNV in UI throws the following error:
Local Storage Operator is not available when arm64 CPU architecture is selected
How reproducible:

100%

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

Description of problem:

    This PR introduces graceful shutdown functionality to the Multus daemon by adding a /readyz endpoint alongside the existing /healthz. The /readyz endpoint starts returning 500 once a SIGTERM is received, indicating the daemon is in shutdown mode. During this time, CNI requests can still be processed for a short window. The daemonset configs have been updated to increase terminationGracePeriodSeconds from 10 to 30 seconds, ensuring we have a bit more time for these clean shutdowns.This addresses a race condition during pod transitions where the readiness check might return true, but a subsequent CNI request could fail if the daemon shuts down too quickly. By introducing the /readyz endpoint and delaying the shutdown, we can handle ongoing CNI requests more gracefully, reducing the risk of disruptions during critical transitions.
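
A rough sketch of the pattern described above, not the actual Multus change; the listen address and drain duration are placeholders:

package main

import (
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var shuttingDown atomic.Bool

	mux := http.NewServeMux()
	// /healthz stays OK for as long as the process is alive.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// /readyz flips to 500 after SIGTERM, signalling shutdown mode to callers that
	// check readiness before sending CNI requests, while in-flight requests can
	// still be served during the drain window.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if shuttingDown.Load() {
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	srv := &http.Server{Addr: "127.0.0.1:8181", Handler: mux} // placeholder address
	go func() { _ = srv.ListenAndServe() }()

	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)
	<-sigCh

	shuttingDown.Store(true)     // readiness now reports 500
	time.Sleep(25 * time.Second) // drain window, within terminationGracePeriodSeconds=30
	_ = srv.Close()
}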

Version-Release number of selected component (if applicable):

    

How reproducible:

    Difficult to reproduce, might require CI signal

I talked with Gerd Oberlechner; hack/app-sre/saas_template.yaml is not used anymore in app-interface.

It should be safe to remove this.

Please review the following PR: https://github.com/openshift/cluster-api-provider-metal3/pull/21

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    The bug fix for https://issues.redhat.com/browse/OCPBUGS-41184 introduced a machine type validation error.

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-multi-2024-10-14-021053

How reproducible:

    Always

Steps to Reproduce:

    1. "create install-config", and then insert the machine type settings (see [1])  
    2. "create manifests" (or "create cluster")
     

Actual results:

    ERROR failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: [controlPlane.platform.gcp.type: Not found: "custom", compute[0].platform.gcp.type: Not found: "custom"] 

Expected results:

    Success

Additional info:

    FYI the 4.17 PROW CI test failure: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-amd64-nightly-gcp-ipi-mini-perm-custom-type-f28/1845589157397663744

The Telemetry userPreference added to the General tab in https://github.com/openshift/console/pull/13587 results in empty nodes being output to the DOM.  This results in extra spacing any time a new user preference is added to the bottom of the General tab.

Description of problem:

    The issue comes from https://issues.redhat.com/browse/OCPBUGS-37540?focusedId=25386451&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25386451.
An error message is shown when gathering the bootstrap log bundle, even though the log bundle gzip file is generated.

ERROR Invalid log bundle or the bootstrap machine could not be reached and bootstrap logs were not collected.

Version-Release number of selected component (if applicable):

    4.17+

How reproducible:

    Always

Steps to Reproduce:

    1. Run `openshift-install gather bootstrap --dir <install-dir>`
    2.
    3.
    

Actual results:

    Error message shown in output of command `openshift-install gather bootstrap --dir <install-dir>`

Expected results:

    No error message shown there.

Additional info:

Analysis from Rafael, https://issues.redhat.com/browse/OCPBUGS-37540?focusedId=25387767&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25387767     

After the multi-VC changes were merged, the following warnings get logged when we use this tool:

E0812 13:04:34.813216   13159 config_yaml.go:208] Unmarshal failed: yaml: unmarshal errors:
line 1: cannot unmarshal !!seq into config.CommonConfigYAML
I0812 13:04:34.813376   13159 config.go:272] ReadConfig INI succeeded. INI-based cloud-config is deprecated and will be removed in 2.0. Please use YAML based cloud-config.

This looks a bit scarier than it should.

Description of problem:

storageNotConfiguredMessage contains a link to https://docs.openshift.com/container-platform/%s/monitoring/configuring-the-monitoring-stack.html, which leads to a 404; it needs to be changed to https://docs.openshift.com/container-platform/%s/observability/monitoring/configuring-the-monitoring-stack.html
    

Version-Release number of selected component (if applicable):

4.16
    

How reproducible:

always

The fields in the Shipwright build form show no hints or default values. They should provide examples and hints to help users provide correct values when creating a build.

 

For example:

  • builder-image: example url of s2i builder image in the internal registry, or link to list of imagestreams in the "openshift" namespace
  • output-image: a hint with the text "Example for OpenShift internal registry: image-registry.openshift-image-registry.svc:5000/<namespace>/<image-name>:latest"

Description of problem:

    IHAC running a 4.16.1 OCP cluster. In their cluster, the image-registry pod is restarting with the messages below:

message: "/image-registry/vendor/github.com/aws/aws-sdk-go/service/s3/api.go:7629 +0x1d0\ngithub.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).doWalk(0xc000a3c120, {0x28924c0, 0xc0001f5b20}, 0xc00083bab8, {0xc00125b7d1, 0x20}, {0x2866860, 0x1}, 0xc00120a8d0)\n\t/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/driver/s3-aws/s3.go:1135 +0x348\ngithub.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).Walk(0xc000675ec0?, {0x28924c0, 0xc0001f5b20}, {0xc000675ec0, 0x20}, 0xc00083bc10?)\n\t/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/driver/s3-aws/s3.go:1095 +0x148\ngithub.com/distribution/distribution/v3/registry/storage/driver/base.(*Base).Walk(0xc000519480, {0x2892778?, 0xc00012cf00?}, {0xc000675ec0, 0x20}, 0x1?)\n\t/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/driver/base/base.go:237 +0x237\ngithub.com/distribution/distribution/v3/registry/storage.getOutstandingUploads({0x2892778, 0xc00012cf00}, {0x289d728?, 0xc000519480})\n\t/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/purgeuploads.go:70 +0x1f9\ngithub.com/distribution/distribution/v3/registry/storage.PurgeUploads({0x2892778, 0xc00012cf00}, {0x289d728?, 0xc000519480?}, {0xc1a937efcf6aec96, 0xfffddc8e973b8a89, 0x3a94520}, 0x1)\n\t/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/purgeuploads.go:34 +0x12d\ngithub.com/distribution/distribution/v3/registry/handlers.startUploadPurger.func1()\n\t/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/handlers/app.go:1139 +0x33f\ncreated by github.com/distribution/distribution/v3/registry/handlers.startUploadPurger in goroutine 1\n\t/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/handlers/app.go:1127 +0x329\n" reason: Error startedAt: "2024-08-27T09:08:14Z" name: registry ready: true restartCount: 250 started: true

Version-Release number of selected component (if applicable):

    4.16.1

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

   All the pods are restarting.

Expected results:

    It should not restart.

Additional info:

https://redhat-internal.slack.com/archives/C013VBYBJQH/p1724761756273879    
upstream report: https://github.com/distribution/distribution/issues/4358

Service: sorting by the Labels, Pod selector, and Location columns doesn't work

Routes: sorting doesn't work for any column

Ingress: sorting by the Host column doesn't work

Please review the following PR: https://github.com/openshift/cluster-config-operator/pull/426

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-samples-operator/pull/559

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/364

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

4.15 and 4.16

$ oc explain prometheus.spec.remoteWrite.sendExemplars
GROUP:      monitoring.coreos.com
KIND:       Prometheus
VERSION:    v1

FIELD: sendExemplars <boolean>

DESCRIPTION:
    Enables sending of exemplars over remote write. Note that exemplar-storage
    itself must be enabled using the `spec.enableFeature` option for exemplars
    to be scraped in the first place. 
     It requires Prometheus >= v2.27.0.

no `spec.enableFeature` option

$ oc explain prometheus.spec.enableFeature
GROUP:      monitoring.coreos.com
KIND:       Prometheus
VERSION:    v1
error: field "enableFeature" does not exist 

should be `spec.enableFeatures` 

$ oc explain prometheus.spec.enableFeatures
GROUP:      monitoring.coreos.com
KIND:       Prometheus
VERSION:    v1

FIELD: enableFeatures <[]string>
DESCRIPTION:
    Enable access to Prometheus feature flags. By default, no features are
    enabled. 
     Enabling features which are disabled by default is entirely outside the
    scope of what the maintainers will support and by doing so, you accept that
    this behaviour may break at any time without notice. 
     For more information see
    https://prometheus.io/docs/prometheus/latest/feature_flags/ 

Version-Release number of selected component (if applicable):

 4.15 and 4.16

How reproducible:

always

Description of problem:

When a user is trying to deploy a Hosted Cluster using HyperShift, if the hostedCluster CR defines, under Spec.Configuration.Proxy.HTTPSProxy, a proxy URL that is missing the port (because it uses the default port), that value is passed as-is into the "kube-apiserver-proxy" YAML manifest under spec.containers.command, like below:

$ oc get pod -n kube-system kube-apiserver-proxy-xxxxx -o yaml | yq '.spec.containers[].command' [ "control-plane-operator", "kubernetes-default-proxy", "listen-addr=172.20.0.1:6443", "proxy-addr=example.proxy.com", "-apiserver-addr=<apiserver-IP>:<port>" ]

Then this code will parse these values.

This command has these flags, which the container uses to make the API calls.
 
The net.Dial function from the golang net package expects host:port or ip:port. Check the docs here: https://pkg.go.dev/net#Dial
 

For TCP and UDP networks, the address has the form "host:port". The host must be a literal IP address, or a host name that can be resolved to IP addresses. The port must be a literal port number or a service name.

So the pod will end up having this issue:

2024-08-19T06:55:44.831593820Z {"level":"error","ts":"2024-08-19T06:55:44Z","logger":"kubernetes-default-proxy","msg":"failed diaing backend","proxyAddr":"example.proxy.com","error":"dial tcp: address example.proxy.com: missing port in address","stacktrace":"github.com/openshift/hypershift/kubernetes-default-proxy.(*server).run.func1\n\t/hypershift/kubernetes-default-proxy/kubernetes_default_proxy.go:89"}

Some ideas on how to solve this are below:

  • Validate the hostedCluster CR
  • Add logic to append the default port if missing (see the sketch after this list)
  • Something else?
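
A minimal sketch of the second idea, using a hypothetical ensurePort helper (this is not the HyperShift implementation):

package proxyaddr

import "net"

// ensurePort returns addr unchanged if it already carries a port, and appends
// defaultPort otherwise, so the result is always in the "host:port" form that
// net.Dial expects.
func ensurePort(addr, defaultPort string) string {
	if _, _, err := net.SplitHostPort(addr); err == nil {
		return addr // already host:port
	}
	return net.JoinHostPort(addr, defaultPort)
}

// Example:
//   ensurePort("example.proxy.com", "3128")      -> "example.proxy.com:3128"
//   ensurePort("example.proxy.com:8080", "3128") -> "example.proxy.com:8080"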

How reproducible:

Try to deploy a Hosted Cluster using the HyperShift operator with a proxy URL that has no port (e.g. example.proxy.com instead of example.proxy.com:<port>) in the hostedCluster CR under "Spec.Configuration.Proxy.HTTPSProxy". This will result in the below error in the kube-apiserver-proxy container: "missing port in address"

Actual results:

The kube-apiserver-proxy container returns "missing port in address"

Expected results:

The kube-apiserver-proxy container should not return "missing port in address"

Additional info:

This can be worked around by adding a ":" and a port number after the proxy IP/URL in the hostedCluster "Spec.Configuration.Proxy.HTTPSProxy" field.

Please review the following PR: https://github.com/openshift/cluster-storage-operator/pull/500

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

AWS EBS, Azure Disk and Azure File operators are now built from cmd/ and pkg/; no code is used from the legacy/ dir, and we should remove it.

There are still test manifests in the legacy/ directory that are still used! They need to be moved somewhere else, and Dockerfile.*.test and the CI steps must be updated!

Technically, this is a copy of STOR-1797, but we need a bug to be able to backport aws-ebs changes to 4.15 and not use legacy/ directory there too.

When https://github.com/openshift/machine-config-operator/pull/4597 landed, bootstrap test startup began to fail because it doesn't install the required CRDs. This is because the CRDs no longer live in the MCO repo and the startup code needs to be reconciled to pick up the MCO-specific CRDs from the o/api repo.

Please review the following PR: https://github.com/openshift/ironic-image/pull/539

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Prow jobs upgrading from 4.9 to 4.16 are failing when they upgrade from 4.12 to 4.13.

Nodes become NotReady when MCO tries to apply the new 4.13 configuration to the MCPs.

The failing job is: periodic-ci-openshift-openshift-tests-private-release-4.16-amd64-nightly-4.16-upgrade-from-stable-4.9-azure-ipi-f28

We have reproduced the issue and we found an ordering cycle error in the journal log

Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 systemd-journald.service[838]: Runtime Journal (/run/log/journal/960b04f10e4f44d98453ce5faae27e84) is 8.0M, max 641.9M, 633.9M free.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found ordering cycle on network-online.target/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on node-valid-hostname.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on ovs-configuration.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on firstboot-osupdate.target/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on machine-config-daemon-firstboot.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Found dependency on machine-config-daemon-pull.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: machine-config-daemon-pull.service: Job network-online.target/start deleted to break ordering cycle starting with machine-config-daemon-pull.service/start
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: Queued start job for default target Graphical Interface.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: systemd-journald.service: unit configures an IP firewall, but the local system does not support BPF/cgroup firewalling.
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: (This warning is only shown for the first unit using IP firewalling.)
Wed 2024-07-24 21:12:17 UTC ci-op-g94jvswm-cc71e-998q8-master-2 init.scope[1]: systemd-journald.service: Deactivated successfully.

    

Version-Release number of selected component (if applicable):

    Using IPI on Azure, these are the version involved in the current issue upgrading from 4.9 to 4.13:
    
      version: 4.13.0-0.nightly-2024-07-23-154444
      version: 4.12.0-0.nightly-2024-07-23-230744
      version: 4.11.59
      version: 4.10.67
      version: 4.9.59

    

How reproducible:

    Always
    

Steps to Reproduce:

    1. Upgrade an IPI on Azure cluster from 4.9 to 4.13. Theoretically, upgrading from 4.12 to 4.13 should be enough, but we reproduced it following the whole path.

    

Actual results:


    Nodes become not ready
$ oc get nodes
NAME                                                 STATUS                        ROLES    AGE     VERSION
ci-op-g94jvswm-cc71e-998q8-master-0                  Ready                         master   6h14m   v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-master-1                  Ready                         master   6h13m   v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-master-2                  NotReady,SchedulingDisabled   master   6h13m   v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus1-c7ngb   NotReady,SchedulingDisabled   worker   6h2m    v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus2-2ppf6   Ready                         worker   6h4m    v1.25.16+306a47e
ci-op-g94jvswm-cc71e-998q8-worker-centralus3-nqshj   Ready                         worker   6h6m    v1.25.16+306a47e

And in the NotReady nodes we can see the ordering cycle error mentioned in the description of this ticket.
    

    

Expected results:

No ordering cycle error should happen and the upgrade should be executed without problems.
    

Additional info:


    

Description of problem:

When a machineconfig fails to generate, we set upgradeable=false and degrade pools. The expectation is that the CO would also degrade after some time (normally 30 minutes) since the master pool is degraded, but that doesn't seem to be happening. Based on our initial investigation, the event/degrade is happening but it seems to be cleared.
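
For illustration only (not the actual MCO code), the expected behaviour amounts to a time-based check like the following, assuming a hypothetical poolDegradedSince timestamp tracked for the pool:

package mcostatus

import "time"

// shouldDegradeOperator reports whether the ClusterOperator should go Degraded:
// only once the pool has stayed degraded past the grace period, so a transient
// pool degradation does not flap the CO condition but a persistent one surfaces.
func shouldDegradeOperator(poolDegradedSince time.Time, now time.Time) bool {
	const gracePeriod = 30 * time.Minute
	if poolDegradedSince.IsZero() {
		return false // the pool is not currently degraded
	}
	return now.Sub(poolDegradedSince) >= gracePeriod
}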

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    Should be always

Steps to Reproduce:

    1. Apply a wrong config, such as a bad image.config object:
spec:
  registrySources:
    allowedRegistries:
    - test.reg
    blockedRegistries:
    - blocked.reg
    
    2. upgrade the cluster or roll out a new MCO pod
    3. observe that pools are degraded but the CO isn't
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Occasional machine-config daemon panics in techpreview jobs. For example, this run has:

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1076/pull-ci-openshift-cluster-version-operator-master-e2e-aws-ovn-techpreview/1819082707058036736

And the referenced logs include a full stack trace, the crux of which appears to be:

E0801 19:23:55.012345    2908 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 127 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x2424b80, 0x4166150})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0004d5340?})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x2424b80?, 0x4166150?})
	/usr/lib/golang/src/runtime/panic.go:770 +0x132
github.com/openshift/machine-config-operator/pkg/helpers.ListPools(0xc0007c5208, {0x0, 0x0})
	/go/src/github.com/openshift/machine-config-operator/pkg/helpers/helpers.go:142 +0x17d
github.com/openshift/machine-config-operator/pkg/helpers.GetPoolsForNode({0x0, 0x0}, 0xc0007c5208)
	/go/src/github.com/openshift/machine-config-operator/pkg/helpers/helpers.go:66 +0x65
github.com/openshift/machine-config-operator/pkg/daemon.(*PinnedImageSetManager).handleNodeEvent(0xc000a98480, {0x27e9e60?, 0xc0007c5208})
	/go/src/github.com/openshift/machine-config-operator/pkg/daemon/pinned_image_set.go:955 +0x92
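
The stack trace suggests the node event handler calls into ListPools with a nil lister/client. A generic sketch of the kind of guard that avoids this class of panic in an informer event handler (not the actual MCO fix; names are illustrative):

package daemonsketch

import (
	corev1 "k8s.io/api/core/v1"
	listerscorev1 "k8s.io/client-go/listers/core/v1"
	"k8s.io/klog/v2"
)

// handleNodeEvent validates the object type and that required dependencies are
// non-nil before using them, so an event that races with initialization logs an
// error instead of panicking the whole daemon.
func handleNodeEvent(nodeLister listerscorev1.NodeLister, obj interface{}) {
	node, ok := obj.(*corev1.Node)
	if !ok {
		klog.Errorf("unexpected object type %T in node event", obj)
		return
	}
	if nodeLister == nil {
		klog.V(4).Infof("skipping event for node %s: lister not initialized yet", node.Name)
		return
	}
	// ... safe to list/look up related objects via the lister from here on ...
}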

Version-Release number of selected component (if applicable):

$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?name=^periodic&type=junit&search=machine-config-daemon.*Observed+a+panic' | grep 'failures match'
periodic-ci-openshift-release-master-ci-4.17-e2e-azure-ovn-techpreview (all) - 37 runs, 62% failed, 13% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-techpreview-serial (all) - 6 runs, 83% failed, 20% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-techpreview (all) - 5 runs, 60% failed, 33% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.17-ocp-e2e-aws-ovn-arm64-techpreview-serial (all) - 10 runs, 40% failed, 25% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-techpreview-serial (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-techpreview-serial (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-techpreview-serial (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.17-ocp-e2e-aws-ovn-arm64-techpreview (all) - 10 runs, 40% failed, 25% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.18-e2e-gcp-ovn-techpreview (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-techpreview-serial (all) - 6 runs, 17% failed, 200% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-single-node-techpreview-serial (all) - 7 runs, 100% failed, 14% of failures match = 14% impact
periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-single-node-techpreview (all) - 7 runs, 57% failed, 50% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-techpreview (all) - 18 runs, 17% failed, 33% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-techpreview (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-aws-ovn-arm64-techpreview-serial (all) - 11 runs, 18% failed, 50% of failures match = 9% impact
periodic-ci-openshift-release-master-ci-4.17-e2e-azure-ovn-techpreview-serial (all) - 7 runs, 57% failed, 25% of failures match = 14% impact

How reproducible:

Looks like ~15% impact in the CI runs that CI Search turns up.

Steps to Reproduce:

Run lots of CI. Look for MCD panics.

Actual results

CI Search results above.

Expected results

No hits.

Description of problem:

    Infrastructure object with platform None is ignored by node-joiner tool

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    Always

Steps to Reproduce:

    1. Run the node-joiner add-nodes command
    

Actual results:

    Currently the node-joiner tool retrieves the platform type from the kube-system/cluster-config-v1 config map

Expected results:

Retrieve the platform type from the infrastructure cluster object
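
A minimal sketch of reading the platform type from the infrastructure object with the OpenShift config client (this is not the node-joiner implementation; kubeconfig handling is simplified):

package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"

	configclient "github.com/openshift/client-go/config/clientset/versioned"
)

// platformType reads status.platformStatus.type from the cluster-scoped
// infrastructure object named "cluster"; it is "None" for platform-agnostic clusters.
func platformType(kubeconfig string) (string, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return "", err
	}
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		return "", err
	}
	infra, err := client.ConfigV1().Infrastructures().Get(context.TODO(), "cluster", metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	if infra.Status.PlatformStatus == nil {
		return "", fmt.Errorf("infrastructure object has no platformStatus")
	}
	return string(infra.Status.PlatformStatus.Type), nil
}

func main() {
	t, err := platformType(os.Getenv("KUBECONFIG"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("platform type:", t)
}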

Additional info:

    

Description of problem:

All openstack-cinder-csi-driver-node pods are in CrashLoopBackOff status during IPI installation with a proxy configured:

2024-10-18 11:27:41.936 | NAMESPACE                                          NAME                                                         READY   STATUS             RESTARTS       AGE
2024-10-18 11:27:41.946 | openshift-cluster-csi-drivers                      openstack-cinder-csi-driver-node-9dkwz                       1/3     CrashLoopBackOff   61 (59s ago)   106m
2024-10-18 11:27:41.956 | openshift-cluster-csi-drivers                      openstack-cinder-csi-driver-node-cdf2d                       1/3     CrashLoopBackOff   53 (19s ago)   90m
2024-10-18 11:27:41.966 | openshift-cluster-csi-drivers                      openstack-cinder-csi-driver-node-chnj6                       1/3     CrashLoopBackOff   61 (85s ago)   106m
2024-10-18 11:27:41.972 | openshift-cluster-csi-drivers                      openstack-cinder-csi-driver-node-fwgg4                       1/3     CrashLoopBackOff   53 (32s ago)   90m
2024-10-18 11:27:41.979 | openshift-cluster-csi-drivers                      openstack-cinder-csi-driver-node-h5qg8                       1/3     CrashLoopBackOff   61 (88s ago)   106m
2024-10-18 11:27:41.989 | openshift-cluster-csi-drivers                      openstack-cinder-csi-driver-node-jbzj9                       1/3     CrashLoopBackOff   52 (42s ago)   90m
 

The pod complains with below:
 

2024-10-18T11:20:57.226298852Z W1018 11:20:57.226085 1 main.go:87] Failed to GetOpenStackProvider: Get "https://10.46.44.29:13000/": dial tcp 10.46.44.29:13000: i/o timeout  

It looks like it is not using the proxy to reach the OSP API.
Version-Release number of selected component (if applicable):

   4.18.0-0.nightly-2024-10-16-094159

Must-gather for 4.18 proxy installation (& must-gather for successful 4.17 proxy installation for comparison) in private comment.

After changing internalJoinSubnet and internalTransitSwitchSubnet on day 2 and doing live migration, the ovnkube-node pod crashed.

The network config is below; the service CIDR uses the same subnet as the OVN default internalTransitSwitchSubnet:

 

    clusterNetwork:
    - cidr: 100.64.0.0/15
      hostPrefix: 23
    serviceNetwork:
    - 100.88.0.0/16

 

and then:

 

oc patch network.operator.openshift.io cluster --type='merge' -p='{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipv4":{"internalJoinSubnet": "100.82.0.0/16"}}}}}'
oc patch network.operator.openshift.io cluster --type='merge' -p='{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipv4":{"internalTransitSwitchSubnet": "100.69.0.0/16"}}}}}'

 

 

 

with error: 

 start-ovnkube-node ${OVN_KUBE_LOG_LEVEL} 29103 29105 State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: Error Message: EmptyLbEvents:false PodIP: RawNoHostSubnetNodes:migration.network.openshift.io/plugin= NoHostSubnetNodes:<nil> HostNetworkNamespace:openshift-host-network PlatformType:AWS HealthzBindAddress:0.0.0.0:10256 CompatMetricsBindAddress: CompatOVNMetricsBindAddress: CompatMetricsEnablePprof:false DNSServiceNamespace:openshift-dns DNSServiceName:dns-default} Metrics:{BindAddress: OVNMetricsBindAddress: ExportOVSMetrics:false EnablePprof:false NodeServerPrivKey: NodeServerCert: EnableConfigDuration:false EnableScaleMetrics:false} OvnNorth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} OvnSouth:{Address: PrivKey: Cert: CACert: CertCommonName: Scheme: ElectionTimer:0 northbound:false exec:<nil>} Gateway:{Mode:shared Interface: EgressGWInterface: NextHop: VLANID:0 NodeportEnable:true DisableSNATMultipleGWs:false V4JoinSubnet:100.64.0.0/16 V6JoinSubnet:fd98::/64 V4MasqueradeSubnet:100.254.0.0/17 V6MasqueradeSubnet:fd69::/125 MasqueradeIPs:

{V4OVNMasqueradeIP:169.254.169.1 V6OVNMasqueradeIP:fd69::1 V4HostMasqueradeIP:169.254.169.2 V6HostMasqueradeIP:fd69::2 V4HostETPLocalMasqueradeIP:169.254.169.3 V6HostETPLocalMasqueradeIP:fd69::3 V4DummyNextHopMasqueradeIP:169.254.169.4 V6DummyNextHopMasqueradeIP:fd69::4 V4OVNServiceHairpinMasqueradeIP:169.254.169.5 V6OVNServiceHairpinMasqueradeIP:fd69::5}

DisablePacke
 
 
 

Please review the following PR: https://github.com/openshift/prom-label-proxy/pull/376

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

For example, the toggle buttons on the Node and Pod logs pages don't have unique identifiers; it's hard to locate these buttons during automation.

`Select a path` toggle button has definition
<button class="pf-v5-c-menu-toggle" type="button" aria-label="Select a path" aria-expanded="false">
  <span class="pf-v5-c-menu-toggle__text">openshift-apiserver</span>
  <span class="pf-v5-c-menu-toggle__controls">...........
</button>

`Select a log file` toggle button
<button class="pf-v5-c-menu-toggle" type="button" aria-expanded="false">
  <span class="pf-v5-c-menu-toggle__text">Select a log file
  </span><span class="pf-v5-c-menu-toggle__controls">.......
</button>

Since there are many toggle buttons on the page, it's quite hard to locate a specific one without a distinguishing identifier
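
A minimal sketch of one possible fix (hypothetical markup, not the current console code): giving each toggle a distinct aria-label or data-test attribute would provide a stable selector for automation, e.g.

<button class="pf-v5-c-menu-toggle" type="button" aria-label="Select a log file" data-test="log-file-toggle" aria-expanded="false">
  <span class="pf-v5-c-menu-toggle__text">Select a log file</span>
  <span class="pf-v5-c-menu-toggle__controls">...</span>
</button>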

Version-Release number of selected component (if applicable):

    

How reproducible:

Always    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/126

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

QE Liang Quan requested a review of https://github.com/openshift/origin/pull/28912 and the OWNERS file doesn't reflect current staff available to review.

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    N/A

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    OWNERS file contains
  - danehans
  - frobware
  - knobunc
  - Miciah
  - miheer
  - sgreene570

Expected results:

Add new OWNERS as reviewers/approvers:
- alebedev87
- candita
- gcs278
- rfredette
- Thealisyed
- grzpiotrowski

Move old OWNERS to emeritus_approvers:
  - danehans 
  - sgreene570

Additional info:

    Example in https://github.com/openshift/cluster-ingress-operator/blob/master/OWNERS
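
For illustration, one possible shape of the updated OWNERS file (a sketch only; whether the new names land in approvers, reviewers, or both is up to the team, and the remaining existing approvers are kept):

approvers:
  - frobware
  - knobunc
  - Miciah
  - miheer
  - alebedev87
  - candita
  - gcs278
  - rfredette
  - Thealisyed
  - grzpiotrowski
reviewers:
  - alebedev87
  - candita
  - gcs278
  - rfredette
  - Thealisyed
  - grzpiotrowski
emeritus_approvers:
  - danehans
  - sgreene570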

Component Readiness has found a potential regression in the following test:

operator conditions control-plane-machine-set

Probability of significant regression: 100.00%

Sample (being evaluated) Release: 4.17
Start Time: 2024-08-03T00:00:00Z
End Time: 2024-08-09T23:59:59Z
Success Rate: 92.05%
Successes: 81
Failures: 7
Flakes: 0

Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 429
Failures: 0
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=gcp&Platform=gcp&Scheduler=default&SecurityMode=default&Suite=unknown&Suite=unknown&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=operator-conditions&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Cloud%20Compute%20%2F%20Other%20Provider&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20gcp%20unknown%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-08-09%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-08-03%2000%3A00%3A00&testId=Operator%20results%3A6d9ee55972f66121016367d07d52f0a9&testName=operator%20conditions%20control-plane-machine-set

Description of problem:

Operator is not getting installed. There are multiple install plans getting created/deleted for the same operator. There is not even any error indicated in the Subscription or elsewhere. The bundle unpacking job is completed.
Images:
quay.io/nigoyal/odf-operator-bundle:v0.0.1
quay.io/nigoyal/odf-operator-catalog:v0.0.1

Version-Release number of selected component (if applicable):

4.18    

How reproducible:

Always    

Steps to Reproduce:

Create the below manifests
   
---
apiVersion: v1
kind: Namespace
metadata:
  labels:
    openshift.io/cluster-monitoring: "true"
    pod-security.kubernetes.io/audit: baseline
    pod-security.kubernetes.io/audit-version: v1.25
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: v1.25
    pod-security.kubernetes.io/warn: baseline
    pod-security.kubernetes.io/warn-version: v1.25
  name: openshift-storage
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: odf-operatorgroup
  namespace: openshift-storage
spec:
  targetNamespaces:
  - openshift-storage
---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: odf-catalogsource
  namespace: openshift-storage
spec:
  grpcPodConfig:
    securityContextConfig: legacy
  displayName: Openshift Data Foundation
  image: quay.io/nigoyal/odf-operator-catalog:v0.0.1
  priority: 100
  publisher: ODF
  sourceType: grpc
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: odf-subscription
  namespace: openshift-storage
spec:
  channel: alpha
  name: odf-operator
  source: odf-catalogsource
  sourceNamespace: openshift-storage 

Actual results:

Operator is not getting installed.    

Expected results:

Operator should get installed.    

Additional info:

The bundle is a unified bundle created from multiple bundles.

Slack Discussion: https://redhat-internal.slack.com/archives/C3VS0LV41/p1726026365936859

Description of the problem:

When attempting to install a spoke cluster, the AgentClusterInstall is not being generated correctly because the release image certificate is not trusted

  - lastProbeTime: "2024-08-20T20:10:16Z"
    lastTransitionTime: "2024-08-20T20:10:16Z"
    message: "The Spec could not be synced due to backend error: failed to get release
      image 'quay.io/openshift-release-dev/ocp-release@sha256:58c9cdeddb33100ee29441e374467592cbd39c3fc56552c57bf2a183a85025f3'.
      Please ensure the releaseImage field in ClusterImageSet '4.17.0' is valid,  (error:
      command 'oc adm release info -o template --template '{{.metadata.version}}'
      --insecure=false --icsp-file=/tmp/icsp-file98462205 quay.io/openshift-release-dev/ocp-release@sha256:58c9cdeddb33100ee29441e374467592cbd39c3fc56552c57bf2a183a85025f3
      --registry-config=/tmp/registry-config740495490' exited with non-zero exit code
      1: \nFlag --icsp-file has been deprecated, support for it will be removed in
      a future release. Use --idms-file instead.\nerror: unable to read image quay.io/openshift-release-dev/ocp-release@sha256:58c9cdeddb33100ee29441e374467592cbd39c3fc56552c57bf2a183a85025f3:
      Get \"https://quay.io/v2/\": tls: failed to verify certificate: x509: certificate
      signed by unknown authority\n)." 

How reproducible:

Intermittent 

Steps to reproduce:

1. Attempt to create cluster resources after assisted-service is running

Actual results:

AgentClusterInstall fails due to certificate errors

Expected results:

The registry housing the release image has its certificate verified correctly 

Additional Info:

Restarting the assisted-service pod fixes the issue. It seems like there is a race condition between the operator setting up the configmap with the correct contents and the assisted pod starting and mounting the configmap to /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem

Please review the following PR: https://github.com/openshift/machine-api-provider-aws/pull/111

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.

Component name: ose-machine-api-provider-aws-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml

Description of problem:

While upgrading the cluster from the web console, the below warning message is observed. 
~~~
Warning alert:Admission Webhook Warning
ClusterVersion version violates policy 299 - "unknown field \"spec.desiredUpdate.channels\"", 299 - "unknown field \"spec.desiredUpdate.url\""
~~~

There are no such fields in the clusterVersion yaml for which the warning message fired.

From the documentation here: https://docs.openshift.com/container-platform/4.16/rest_api/config_apis/clusterversion-config-openshift-io-v1.html 

It's possible to see that "spec.desiredUpdate" exists, but there is no mention of values "channels" or "url" under desiredUpdate.



Note: This is not impacting the cluster upgrade. However creating confusion among customers due to the warning message.

 

Version-Release number of selected component (if applicable):

    4.16

How reproducible:

Everytime    

Steps to Reproduce:

    1. Install cluster of version 4.16.4
    2. Upgrade the cluster from web-console to the next-minor version
    3.
    

Actual results:

    The "Admission Webhook Warning" described above is shown during the upgrade.

Expected results:

    Upgrade should proceed with no such warnings.

Additional info:

    

Description of problem:

Upon upgrade of 4.16.15, OLM is failing to upgrade operator cluster service versions due to a TLS validation error. 

From the OLM controller manager pod, logs show this: 
oc logs -n openshift-operator-lifecycle-manager olm-operator-7c9f76554-j22j5 | grep "tls" | head
"tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "Red Hat, Inc.")"

It's also observed in the api-server-operator logs that many webhooks are affected with the following errors: 
$ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-8445495998-s6wgd | grep "failed to connect" | tail
W1018 21:44:07.641047       1 degraded_webhook.go:147] failed to connect to webhook "machineautoscalers.autoscaling.openshift.io" via service "cluster-autoscaler-operator.openshift-machine-api.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority
W1018 21:44:08.647623       1 degraded_webhook.go:147] failed to connect to webhook "machineautoscalers.autoscaling.openshift.io" via service "cluster-autoscaler-operator.openshift-machine-api.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority
W1018 21:53:58.542660       1 degraded_webhook.go:147] failed to connect to webhook "clusterautoscalers.autoscaling.openshift.io" via service "cluster-autoscaler-operator.openshift-machine-api.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority 

This is causing the OLM controller to hang and is failing to install/upgrade operators based on the OLM controller logs. 

 

How reproducible:

    Very reproducible upon upgrade from 4.16.14 to 4.16.15 on any OpenShift Dedicated or ROSA OpenShift cluster.

Steps to Reproduce:

    1. Install OSD or ROSA cluster at 4.16.14 or below
    2. Upgrade to 4.16.15
    3. Attempt to install or upgrade operator via new ClusterServiceVersion     

Actual results:

# API SERVER OPERATOR
    $ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-666b796d8b-lqp56 | grep "failed to connect" | tail
W1013 20:59:49.131870       1 degraded_webhook.go:147] failed to connect to webhook "webhook.pipeline.tekton.dev" via service "tekton-pipelines-webhook.openshift-pipelines.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "tekton-pipelines-webhook.openshift-pipelines.svc")
W1013 20:59:50.147945       1 degraded_webhook.go:147] failed to connect to webhook "webhook.pipeline.tekton.dev" via service "tekton-pipelines-webhook.openshift-pipelines.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "tekton-pipelines-webhook.openshift-pipelines.svc")

#OLM 
$ oc logs -n openshift-operator-lifecycle-manager olm-operator-7c9f76554-j22j5 | grep "tls" | head
2024/10/13 12:00:08 http: TLS handshake error from 10.128.18.80:53006: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "Red Hat, Inc.")
2024/10/14 11:45:05 http: TLS handshake error from 10.130.19.10:36766: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "Red Hat, Inc.")

Expected results:

    no tls validation errors upon upgrade or installation of operators via OLM

Additional info:

    

Description of problem:

On route creation page, when check on "Secure Route", select "Edge" or "Re-encrypt" TLS termination, there is "TLS certificates for edge and re-encrypt termination. If not specified, the router&apos;s default certificate is used." under "Certificates".  "router&apos;s" should be "router's"
    

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-09-18-003538
4.18.0-0.nightly-2024-09-17-060032
    

How reproducible:

Always
    

Steps to Reproduce:

    1.Check on route creation page, when check on "Secure Route", select "Edge" or "Re-encrypt" TLS termination.
    2.
    3.
    

Actual results:

1. There is "TLS certificates for edge and re-encrypt termination. If not specified, the router&apos;s default certificate is used.
    

Expected results:

1. "router&apos;s" should be "router's"
    

Additional info:


    

As an engineer I would like to have a functional test that makes sure the etcd recovery function works as expected without deploying a full OCP cluster or HostedCluster.

Alternatives:

  • Use an ensure function to test this procedure in a living cluster
  • Add a test using a Kubernetes cluster instead of an OCP cluster and create the etcd cluster there.

Description of problem:

On the overview page's getting started resources card, there is an "OpenShift LightSpeed" link when this operator is available on the cluster; the text should be updated to "OpenShift Lightspeed" to keep it consistent with the operator name.
    

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-08-08-013133
4.16.0-0.nightly-2024-08-08-111530
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Check overview page's getting started resources card,  
    2.
    3.
    

Actual results:

1. There is "OpenShift LightSpeed" link  in "Explore new features and capabilities"
    

Expected results:

1. The text should be "OpenShift Lightspeed" to keep it consistent with the operator name.
    

Additional info:


    

Description of the problem:

 When provisioning a hosted cluster using a ZTP workflow to create BMH and NodePool CRs, corresponding agents are created for the BMHs, but those agents do not get added to the hostedCluster as they are not set to spec.approved=true

This is a recent change in behavior, and appears to be related to the commit meant to allow BMH CRs to be safely restored by OADP in DR scenarios.

Manually approving the agents results in a successful installation.
Setting the PAUSE_PROVISIONED_BMHS boolean to false also results in a successful installation.
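
For reference, the manual-approval workaround is a one-line patch per agent (namespace and agent name below are placeholders):

oc -n <agents-namespace> patch agent <agent-name> --type merge -p '{"spec":{"approved":true}}'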

How reproducible:

 Always

Steps to reproduce:

1. Create BMH and NodePool for HostedCluster

2. Observe creation of agents on cluster

3. Observe agents do not join cluster

Actual results:

Agents exist but are not added to the NodePool.

Expected results:

Agents and their machines are added to the nodepool and the hosted cluster sees nodes appear.

Description of problem:

If the folder is undefined and the datacenter exists in a datacenter-based folder,
the installer will create the entire path of folders from the root of vCenter, which is incorrect.

This does not occur if folder is defined.

An upstream bug was identified when debugging this:

https://github.com/vmware/govmomi/issues/3523

Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/163

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/router/pull/623

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

For light theme, the Lightspeed logo should use the multi-color version.

For dark theme, the Lightspeed logo should use the single color version for both the button and the content.

Background

In order for customers to easily access the troubleshooting panel in the console, we need to add a button that can be accessed globally.

Outcomes

  • The troubleshooting panel can be triggered from the application launcher menu, present in the OpenShift console masthead

 

 

Description of problem:

Get "https://openshift.default.svc/.well-known/oauth-authorization-server": tls: failed to verify certificate: x509: certificate is valid for localhost, kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, kube-apiserver, kube-apiserver.ocm-production-2b0eqpjq13aaba19ncgajh1asp39602g-faldana-hcp.svc, kube-apiserver.ocm-production-2b0eqpjq13aaba19ncgajh1asp39602g-faldana-hcp.svc.cluster.local, api.faldana-hcp.rvvd.p3.openshiftapps.com, api.faldana-hcp.hypershift.local, not openshift.default.svc

Version-Release number of selected component (if applicable):

    4.15.9

How reproducible:

    stable

Steps to Reproduce:

    Get "https://openshift.default.svc/.well-known/oauth-authorization-server"

Actual results:

    x509: certificate is valid for ... kubernetes.default.svc ..., not openshift.default.svc

Expected results:

    OK

Additional info:

    Works fine with ROSA Classic.

The context: customer is configuring access to the RHACS console via Openshift Auth Provider.

Discussion:
https://redhat-internal.slack.com/archives/C028JE84N59/p1715048866276889

Description of problem:

    When using an internal publishing strategy, the client is not properly initialized and will cause a code path to be hit which tries to access a field of a null pointer.

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    Easily

Steps to Reproduce:

    1. Try to deploy a private cluster
    2. segfault
    3. 
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

When user changes Infrastructure object, e.g. adds a new vCenter, the operator generates a new driver config (Secret named vsphere-csi-config-secret), but the controller pods are not restarted and use the old config.

Version-Release number of selected component (if applicable):

4.17.0-0.nightly *after* 2024-08-09-031511

How reproducible: always

Steps to Reproduce:

  1. Enable TechPreviewNoUpgrade
  2. Add a new vCenter to infrastructure. It can be the same one as the existing one - we just need to trigger "disable CSI migration when there are 2 or more vCenters"
  3. See that vsphere-csi-config-secret changed and has `migration-datastore-url =` (i.e. empty string value)

Actual results: the controller pods are not restarted

Expected results: the controller pods are  restarted
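
As a hedged manual workaround until the operator handles this itself, the controller Deployment can be restarted after the config change (the deployment name below is assumed from the component; verify it in the cluster):

oc -n openshift-cluster-csi-drivers rollout restart deployment/vmware-vsphere-csi-driver-controller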

Description of problem:

The AWS Cluster API Provider (CAPA) runs a required check to resolve the DNS Name for load balancers it creates. If the CAPA controller (in this case, running in the installer) cannot resolve the DNS record, CAPA will not report infrastructure ready. We are seeing in some cases, that installations running on local hosts (we have not seen this problem in CI) will not be able to resolve the LB DNS name record and the install will fail like this:

    DEBUG I0625 17:05:45.939796    7645 awscluster_controller.go:295] "Waiting on API server ELB DNS name to resolve" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/umohnani-4-16test-5ndjw" namespace="openshift-cluster-api-guests" name="umohnani-4-16test-5ndjw" reconcileID="553beb3d-9b53-4d83-b417-9c70e00e277e" cluster="openshift-cluster-api-guests/umohnani-4-16test-5ndjw" 
DEBUG Collecting applied cluster api manifests...  
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: infrastructure was not ready within 15m0s: client rate limiter Wait returned an error: context deadline exceeded

We do not know why some hosts cannot resolve these records, but it could be something like issues with the local DNS resolver cache, DNS records are slow to propagate in AWS, etc.

 

Version-Release number of selected component (if applicable):

    4.16, 4.17

How reproducible:

    Not reproducible / unknown -- this seems to be dependent on specific hosts and we have not determined why some hosts face this issue while others do not.

Steps to Reproduce:

n/a    

Actual results:

Install fails because CAPA cannot resolve LB DNS name 

Expected results:

    As the DNS record does exist, install should be able to proceed.

Additional info:

Slack thread:

https://redhat-internal.slack.com/archives/C68TNFWA2/p1719351032090749

Description of problem:

    When verifying OCPBUGS-38869 or in 4.18, the MOSB is still in the updating state even though the build pod is successfully removed, and an error is seen in the machine-os build pod

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Apply any MOSC.
    2. See that the build pod completes successfully.
    3. But the MOSB is still in the updating state.
    4. And an error can be seen in the machine-os build pod.
 

Actual results:
I have applied below MOSC

oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineOSConfig
metadata:
  name: abc   
spec:
  machineConfigPool:
    name: worker
  buildOutputs:
    currentImagePullSecret:
      name: $(oc get -n openshift-machine-config-operator sa default -ojsonpath='{.secrets[0].name}')
  buildInputs:
    imageBuilder:
      imageBuilderType: PodImageBuilder
    baseImagePullSecret:
      name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
    renderedImagePushSecret:
      name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
    renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest"
    containerFile:
    - containerfileArch: noarch
      content: |-
        # Pull the centos base image and enable the EPEL repository.
        FROM quay.io/centos/centos:stream9 AS centos
        RUN dnf install -y epel-release
        # Pull an image containing the yq utility.
        FROM docker.io/mikefarah/yq:latest AS yq
        # Build the final OS image for this MachineConfigPool.
        FROM configs AS final
        # Copy the EPEL configs into the final image.
        COPY --from=yq /usr/bin/yq /usr/bin/yq
        COPY --from=centos /etc/yum.repos.d /etc/yum.repos.d
        COPY --from=centos /etc/pki/rpm-gpg/RPM-GPG-KEY-* /etc/pki/rpm-gpg/
        # Install cowsay and ripgrep from the EPEL repository into the final image,
        # along with a custom cow file.
        RUN sed -i 's/\$stream/9-stream/g' /etc/yum.repos.d/centos*.repo && \
            rpm-ostree install cowsay ripgrep
EOF
 
$ oc get machineosconfig
NAME   AGE
abc    45m

$  oc logs build-rendered-worker-c67571b26a7e0d94dc2bf01dca97bbe5 -f
  ...
Copying blob sha256:a8157ed01dfc7fe15c8f2a86a3a5e30f7fcb7f3e50f8626b32425aaf821ae23d
Copying config sha256:4b15e94c47f72b6c082272cf1547fdd074bd3539b327305285d46926f295a71b
Writing manifest to image destination
+ return 0 

$  oc get machineosbuild
NAME                                                              PREPARED   BUILDING   SUCCEEDED   INTERRUPTED   FAILED
worker-rendered-worker-c67571b26a7e0d94dc2bf01dca97bbe5-builder   False      True       False       False         False

$  oc logs machine-os-builder-654fc664bb-qvjkn  | grep -i error
I1003 16:12:52.463155       1 pod_build_controller.go:296] Error syncing pod openshift-machine-config-operator/build-rendered-worker-c67571b26a7e0d94dc2bf01dca97bbe5: unable to update with build pod status: could not update MachineOSConfig"abc": MachineOSConfig.machineconfiguration.openshift.io "abc" is invalid: [observedGeneration: Required value, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]

 

Expected results:

MOSB should be successful    

Additional info:

    

Please review the following PR: https://github.com/openshift/cluster-olm-operator/pull/65

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Component Readiness has found a potential regression in the following test:

[sig-network] pods should successfully create sandboxes by adding pod to network

Probability of significant regression: 96.41%

Sample (being evaluated) Release: 4.17
Start Time: 2024-08-27T00:00:00Z
End Time: 2024-09-03T23:59:59Z
Success Rate: 88.37%
Successes: 26
Failures: 5
Flakes: 12

Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 98.46%
Successes: 43
Failures: 1
Flakes: 21

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=metal&Platform=metal&Scheduler=default&SecurityMode=default&Suite=unknown&Suite=unknown&Topology=ha&Topology=ha&Upgrade=minor&Upgrade=minor&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=Other&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Networking%20%2F%20cluster-network-operator&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20metal%20unknown%20ha%20minor&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Ametal&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-09-03%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-08-27%2000%3A00%3A00&testId=openshift-tests-upgrade%3A65e48733eb0b6115134b2b8c6a365f16&testName=%5Bsig-network%5D%20pods%20should%20successfully%20create%20sandboxes%20by%20adding%20pod%20to%20network

 

Here is an example run

We see the following signature for the failure:

 

namespace/openshift-etcd node/master-0 pod/revision-pruner-11-master-0 hmsg/b90fda805a - 111.86 seconds after deletion - firstTimestamp/2024-09-02T13:14:37Z interesting/true lastTimestamp/2024-09-02T13:14:37Z reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_revision-pruner-11-master-0_openshift-etcd_08346d8f-7d22-4d70-ab40-538a67e21e3c_0(d4b61f9ff9f2ddfd3b64352203e8a3eafc2c3bd7c3d31a0a573bc29e4ac6da57): error adding pod openshift-etcd_revision-pruner-11-master-0 to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: 'ContainerID:"d4b61f9ff9f2ddfd3b64352203e8a3eafc2c3bd7c3d31a0a573bc29e4ac6da57" Netns:"/var/run/netns/97dc5eb9-19da-462f-8b2e-c301cfd7f3cf" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-etcd;K8S_POD_NAME=revision-pruner-11-master-0;K8S_POD_INFRA_CONTAINER_ID=d4b61f9ff9f2ddfd3b64352203e8a3eafc2c3bd7c3d31a0a573bc29e4ac6da57;K8S_POD_UID=08346d8f-7d22-4d70-ab40-538a67e21e3c" Path:"" ERRORED: error configuring pod [openshift-etcd/revision-pruner-11-master-0] networking: Multus: [openshift-etcd/revision-pruner-11-master-0/08346d8f-7d22-4d70-ab40-538a67e21e3c]: error waiting for pod: pod "revision-pruner-11-master-0" not found  

 

The same signature has been reported for both azure and x390x as well.

 

It is worth mentioning that the sdn to ovn transition adds some complication to our analysis. From the component readiness above, you will see most of the failures are for the job periodic-ci-openshift-release-master-nightly-X.X-upgrade-from-stable-X.X-e2e-metal-ipi-ovn-upgrade. This is a new job for 4.17 and therefore misses base stats in 4.16.

 

So we ask for:

  1. An analysis of the root cause and impact of this issue
  2. Team can compare relevant 4.16 sdn jobs to see if this is really a regression.
  3. Given the current passing rate of 88%, what priority should we give to this?
  4. Since this is affecting component readiness, and management depends on a green dashboard for release decision, we need to figure out what is the best approach for handling this issue.  

 

Please review the following PR: https://github.com/openshift/must-gather/pull/441

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/csi-operator/pull/115

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    When running a cypress test locally, with auth disabled, while logged in as kubeadmin (e.g., running pipeline-ci.feature within test-cypress-pipelines), the before-each fails because it expects there to be an empty message when we are actually logged in as kubeadmin

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    always

Steps to Reproduce:

    1. Run console the ./contrib/oc-environment.sh way while logged into kubeadmin
    2. Run pipeline-ci.feature within the test-cypress-pipelines yarn script in the frontend folder
    
    

Actual results:

    The after-each of the tests fail

Expected results:

    The after-each of the tests are allowed to pass

Additional info:

    

As an OCP user, I want storage operators restarted quickly and the newly started operator to start leading immediately, without a ~3 minute wait.

This means that the old operator should release its leadership after it receives SIGTERM and before it exits. Right now, storage operators fail to release the leadership in ~50% of cases.
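
A minimal sketch of the intended behavior, assuming the operator uses client-go leader election (the namespace, lock name and durations below are illustrative, not taken from any specific operator); the key part is ReleaseOnCancel combined with cancelling the context on SIGTERM:

// Sketch only: leader election that releases its lease on SIGTERM.
package main

import (
	"context"
	"os"
	"os/signal"
	"syscall"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Cancel the context on SIGTERM so the lease is released before the process exits.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	lock, err := resourcelock.New(resourcelock.LeasesResourceLock,
		"openshift-cluster-csi-drivers", "example-operator-lock", // illustrative namespace/name
		client.CoreV1(), client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")})
	if err != nil {
		panic(err)
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   137 * time.Second, // illustrative durations
		RenewDeadline:   107 * time.Second,
		RetryPeriod:     26 * time.Second,
		ReleaseOnCancel: true, // release the lease on shutdown instead of letting it expire
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { <-ctx.Done() }, // run controllers until shutdown
			OnStoppedLeading: func() { os.Exit(0) },
		},
	})
}

With this pattern, the replacement pod should log "successfully acquired lease" almost immediately after the old pod terminates, which is what step 3 below checks for.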

Steps to reproduce:

  1. Delete an operator Pod (`oc delete pod xyz`).
  2. Wait for a replacement Pod to be created.
  3. Check logs of the replacement Pod. It should contain "successfully acquired lease XYZ" relatively quickly after the Pod start (+/- 1 second?)
  4. Go to 1. and retry few times.

 

This is a hack'n'hustle "work", not tied to any Epic; I'm using it just to get proper QE and to track which operators are being updated (see linked GitHub PRs).

Description of problem:

    See https://github.com/prometheus/prometheus/issues/14503 for more details
    

Version-Release number of selected component (if applicable):

    4.16
    

How reproducible:


    

Steps to Reproduce:
1. Make Prometheus scrape a target that exposes multiple samples of the same series with different explicit timestamps, for example:

# TYPE requests_per_second_requests gauge
# UNIT requests_per_second_requests requests
# HELP requests_per_second_requests test-description
requests_per_second_requests 16 1722466225604
requests_per_second_requests 14 1722466226604
requests_per_second_requests 40 1722466227604
requests_per_second_requests 15 1722466228604
# EOF

2. Not all the samples will be ingested
3. If Prometheus continues scraping that target for a moment, the PrometheusDuplicateTimestamps alert will fire.
Actual results:


    

Expected results: all the samples should be considered (of course, if the timestamps are too old or too far in the future, Prometheus may refuse them).


    

Additional info:

     Regression introduced in Prometheus 2.52.
    Proposed upstream fixes: https://github.com/prometheus/prometheus/pull/14683 https://github.com/prometheus/prometheus/pull/14685 
    

Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/850

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/oc/pull/1871

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

The helm charts in the monitoring-plugin can currently either deploy the monitoring-plugin in its CMO state or with the acm-alerting feature flag enabled. Update them so that they also work with the incidents feature flag.

Description of problem:

The release signature should be saved in the archive tar file instead of counting on the enterprise cache (or working-dir)

Version-Release number of selected component (if applicable):

oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202407291514.p0.gdbf115f.assembly.stream.el9-dbf115f", GitCommit:"dbf115f547a19f12ab72e7b326be219a47d460a0", GitTreeState:"clean", BuildDate:"2024-07-29T15:52:52Z", GoVersion:"go1.22.4 (Red Hat 1.22.4-2.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

100%

Steps to Reproduce:

1) Prepare data for the enterprise registry using mirror2disk+disk2mirror mode with the following config and commands:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    graph: true
    channels:
    - name: stable-4.15

`oc-mirror -c config-38037.yaml  file://out38037 --v2`
`oc-mirror -c config-38037.yaml --from file://out38037  docker://my-route-zhouy.apps.yinzhou-88.qe.devcluster.openshift.com --v2  --dest-tls-verify=false`

  2) Prepare the env to simulate the enclave cluster :
cat /etc/squid/squid.conf
http_port 3128
coredump_dir /var/spool/squid
acl whitelist dstdomain "/etc/squid/whitelist"
http_access allow whitelist
http_access deny !whitelist

cat /etc/squid/whitelist 
my-route-zhouy.apps.yinzhou-88.qe.devcluster.openshift.com             -------------registry route  (oc get route -n your registry app's project)
update-service-oc-mirror-route-openshift-update-service.apps.yinzhou-88.qe.devcluster.openshift.com        ---osus route  (oc get route -n openshift-update-service)

sudo systemctl restart squid
export https_proxy=http://127.0.0.1:3128
export http_proxy=http://127.0.0.1:3128

Setting registry redirect with : 
cat ~/.config/containers/registries.conf 
[[registry]]
  location = "quay.io"
  insecure = false
  blocked = false
  mirror-by-digest-only = false
  prefix = ""
  [[registry.mirror]]
    location = "my-route-zhouy.apps.yinzhou-88.qe.devcluster.openshift.com"
    insecure = false


3) Simulate the enclave mirror with the same imagesetconfig using the command:
 `oc-mirror -c config-38037.yaml file://new-folder --v2`

Actual results:

3) The mirror2disk failed with error :   

I0812 06:45:26.026441  199941 core-cincinnati.go:508] Using proxy 127.0.0.1:3128 to request updates from https://update-service-oc-mirror-route-openshift-update-service.apps.yinzhou-417.qe.devcluster.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.15&id=a6097264-8b29-438f-9e71-4aba1e9ec32d
2024/08/12 06:45:26  [ERROR]  : http request Get "https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=0f55261077557d1bb909c06b115e0c79b0025677be57ba2f045495c11e2443ee/signature-1": Forbidden

 

Expected results:

No error, and the signature should be contained in the archive tar file rather than relying on the enterprise cache (for customer usage, the enclave cluster may be on a different machine, or it may not use the same directory).
 

 

Description of problem:

From the output of "oc adm upgrade --help":
...
    --to-latest=false:
        Use the next available version.
...

seems like "Use the latest available version" is more appropriate.


Version-Release number of selected component (if applicable):

    4.14.0 

How reproducible:

    100%

Steps to Reproduce:

    1. [kni@ocp-edge119 ~]$ oc adm upgrade --help  

Actual results:

...
    --to-latest=false:         Use the next available version. 
...

Expected results:

...     
    --to-latest=false:         Use the latest available version. 
...    

Additional info:

    

Description of problem:

    NodePool Controller doesn't respect LatestSupportedVersion https://github.com/openshift/hypershift/blob/main/support/supportedversion/version.go#L19

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Create HostedCluster / NodePool
    2. Upgrade both HostedCluster and NodePool at the same time to a version higher than the LatestSupportedVersion
    

Actual results:

    NodePool tries to upgrade to the new version while the HostedCluster ValidReleaseImage condition fails with: 'the latest version supported is: "x.y.z". Attempting to use: "x.y.z"'

Expected results:

    NodePool ValidReleaseImage condition also fails

Additional info:

    

Description of problem:

IBM ROKS uses Calico as its CNI. In previous versions of OpenShift, OpenShiftSDN would create iptables rules that would force a local endpoint for the DNS service. 

Starting in OCP 4.17 with the removal of SDN, IBM ROKS is not using OVN-K, and therefore the local endpoint for the DNS service is not working as expected. 

IBM ROKS is asking that the code block be restored to restore the functionality previously seen in OCP 4.16

https://github.com/openshift/sdn/blob/release-4.16/vendor/k8s.io/kubernetes/pkg/proxy/iptables/proxier.go#L979-L992

Without this functionality IBM ROKS is not able to GA OCP 4.17

User Story:

As a user of HyperShift, I want to be able to:

  • to pull the api server network proxy image from the release image

so that I can achieve

  • the HCCO will be using the image from the release image the HC is using.

Acceptance Criteria:

Description of criteria:

  • The api server network proxy image is no longer hardcoded.
  • All required tests are passing on the PR.

Out of Scope:

N/A

Engineering Details:

% oc adm release info quay.io/openshift-release-dev/ocp-release:4.14.33-multi -a ~/all-the-pull-secrets.json --pullspecs | grep apiserver
  apiserver-network-proxy 

This does not require a design proposal.
This does not require a feature gate.

Description of the problem:

The ingress TLS certificate, which is the one presented to HTTP clients e.g. when requesting resources under *.apps.<cluster-name>.<base-domain>, is not signed by a certificate included in the cluster's CA certificates. This results in those ingress HTTP requests failing with the error: `tls: failed to verify certificate: x509: certificate signed by unknown authority`.

How reproducible:

100%

Steps to reproduce:

1. Before running an IBU, verify that the target cluster's ingress works properly:

2. Run an IBU.

3. Perform steps 1. and 2. again. You will see the error `curl: (60) SSL certificate problem: self signed certificate in certificate chain`.

 
Alternative steps using openssl:

1. Run an IBU

2. Download the cluster's CA bundle `oc config view --raw -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' | base64 --decode > ca.crt`

3. Download the ingress certificate `openssl s_client -connect oauth.apps.target.ibo1.redhat.com:443 -showcerts </dev/null </dev/null 2>/dev/null | awk '/BEGIN CERTIFICAT/,/END CERTIFICATE/ {print}' > ingress.crt`

4. Try to verify the cert with the CA chain: `openssl verify -CAfile ca.crt ingress.crt` - this step fails.

Actual results:

Ingress HTTP requests using the cluster's CA TLS transport fail with unknown certificate authority error.

Expected results:

Ingress HTTP requests using the cluster's CA TLS transport should succeed.

Related to a component regression we found, for which it looked like we had no clear test to catch it; sample runs:

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ipi-ovn-kube-apiserver-rollout/1827763939853733888

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ipi-ovn-ipv4/1826908352773361664

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ipi-ovn-dualstack/1828844069434953728

All three runs show a pattern. The actual test failures look unpredictable: some tests are passing at the same time, while others fail to talk to the apiserver.

The pattern we see is 1 or more tests failing right at the start of e2e testing, disruption, etcd log messages indicating slowness, and etcd leadership state changes.

Because the tests are unpredictable, we'd like a test that catches this symptom. We think the safest way to do this is to look for disruption within x minutes of the first e2e test.

This would be implemented as a monitortest, likely somewhere around here: https://github.com/openshift/origin/blob/master/pkg/monitortests/kubeapiserver/legacykubeapiservermonitortests/monitortest.go

Although it would be reasonable to add a new monitortest in the parent package above this level.

The test would need to do the following (a rough sketch of the scan, using stand-in types, follows this list):

  • scan final intervals for the earliest interval with source=SourceE2ETest (constant in monitorapi/types.go), and save its start time
  • scan final intervals for those with source=SourceDisruption, and reason=DisruptionBegan, and a backend matching one of the apiservers (kube, openshift, oauth)
  • flake the test (return a failure junit result + a success junit result) if we see any SourceDisruption intervals within X minutes of that first e2e test.
  • Choose X based on what we see in the above links.
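
A rough sketch of that scan, using simplified stand-in types rather than origin's real monitorapi structs (the field names and backend names below are illustrative assumptions, not origin's actual API):

// Sketch of the interval scan described above; not origin's real types.
package main

import (
	"fmt"
	"time"
)

type interval struct {
	Source  string // stand-in for the interval source, e.g. "E2ETest", "Disruption"
	Reason  string // e.g. "DisruptionBegan"
	Backend string // e.g. "kube-api", "openshift-api", "oauth-api" (illustrative names)
	From    time.Time
}

// disruptionsNearFirstE2E returns apiserver disruption intervals that begin within
// window of the earliest e2e test interval; the monitortest would flake
// (fail junit + success junit) if this returns anything.
func disruptionsNearFirstE2E(intervals []interval, window time.Duration) []interval {
	var first *time.Time
	for _, iv := range intervals {
		if iv.Source == "E2ETest" && (first == nil || iv.From.Before(*first)) {
			t := iv.From
			first = &t
		}
	}
	if first == nil {
		return nil // no e2e tests ran, nothing to check
	}
	var hits []interval
	for _, iv := range intervals {
		if iv.Source != "Disruption" || iv.Reason != "DisruptionBegan" {
			continue
		}
		switch iv.Backend {
		case "kube-api", "openshift-api", "oauth-api":
		default:
			continue // only apiserver backends matter here
		}
		if d := iv.From.Sub(*first); d > -window && d < window {
			hits = append(hits, iv) // disruption began within the window around the first test
		}
	}
	return hits
}

func main() {
	start := time.Now()
	intervals := []interval{
		{Source: "E2ETest", From: start},
		{Source: "Disruption", Reason: "DisruptionBegan", Backend: "kube-api", From: start.Add(2 * time.Minute)},
	}
	fmt.Println(disruptionsNearFirstE2E(intervals, 5*time.Minute))
}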

Description of problem:

This is essentially an incarnation of the bug https://bugzilla.redhat.com/show_bug.cgi?id=1312444 that was fixed in OpenShift 3 but is now present again.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

Select a template in the console web UI, try to enter a multiline value.

Actual results:

It's impossible to enter line breaks.

Expected results:

It should be possible to achieve entering a multiline parameter when creating apps from templates.

Additional info:

I also filed an issue here https://github.com/openshift/console/issues/13317.
P.S. It's happening on https://openshift-console.osci.io, not sure what version of OpenShift they're running exactly.

Description of problem:

We need to bump the Kubernetes version to the latest API version OCP is using.

This is what was done last time:

https://github.com/openshift/cluster-samples-operator/pull/409

Find latest stable version from here: https://github.com/kubernetes/api

This is described in wiki: https://source.redhat.com/groups/public/appservices/wiki/cluster_samples_operator_release_activities

    

Version-Release number of selected component (if applicable):


    

How reproducible:

Not really a bug, but we're using OCPBUGS so that automation can manage the PR lifecycle (SO project is no longer kept up-to-date with release versions, etc.).
    

Description of problem:

    There is another panic that occurred in https://issues.redhat.com/browse/OCPBUGS-34877?focusedId=25580631&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-25580631 which should be fixed

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    AdditionalTrustedCA is not wired correctly, so the configmap is not found by its operator. This feature is meant to be exposed by XCMSTRAT-590, but at the moment it seems to be broken

Version-Release number of selected component (if applicable):

    4.16.5

How reproducible:

    Always

Steps to Reproduce:

1. Create a configmap containing a registry and PEM cert, like https://github.com/openshift/openshift-docs/blob/ef75d891786604e78dcc3bcb98ac6f1b3a75dad1/modules/images-configuration-cas.adoc#L17 (a sketch of such a configmap follows these steps)  
2. Refer to it in .spec.configuration.image.additionalTrustedCA.name     
3. image-registry-config-operator is not able to find the cm and the CO is degraded
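
For illustration, the configmap from step 1 might look like the following (a hedged sketch: the registry hostname and namespace are placeholders; data keys are registry hostnames mapped to PEM bundles):

apiVersion: v1
kind: ConfigMap
metadata:
  name: registry-additional-ca-q9f6x5i4
  namespace: <hostedcluster-namespace>    # placeholder; the namespace the HostedCluster lives in
data:
  registry.example.com: |
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----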
    

Actual results:

   CO is degraded

Expected results:

    certs are used.

Additional info:

I think we may be missing a copy of the configmap from the cluster namespace to the target namespace. It should also be deleted if the original is deleted.

 

 % oc get hc -n ocm-adecorte-2d525fsstsvtbv1h8qss14pkv171qhdd -o jsonpath="{.items[0].spec.configuration.image.additionalTrustedCA}" | jq
{
  "name": "registry-additional-ca-q9f6x5i4"
}

 

 

% oc get cm -n ocm-adecorte-2d525fsstsvtbv1h8qss14pkv171qhdd registry-additional-ca-q9f6x5i4
NAME                              DATA   AGE
registry-additional-ca-q9f6x5i4   1      16m

 

 

logs of cluster-image-registry operator

 

E0814 13:22:32.586416       1 imageregistrycertificates.go:141] ImageRegistryCertificatesController: unable to sync: failed to update object *v1.ConfigMap, Namespace=openshift-image-registry, Name=image-registry-certificates: image-registry-certificates: configmap "registry-additional-ca-q9f6x5i4" not found, requeuing

 

 

CO is degraded

 

% oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
console                                    4.16.5    True        False         False      3h58m
csi-snapshot-controller                    4.16.5    True        False         False      4h11m
dns                                        4.16.5    True        False         False      3h58m
image-registry                             4.16.5    True        False         True       3h58m   ImageRegistryCertificatesControllerDegraded: failed to update object *v1.ConfigMap, Namespace=openshift-image-registry, Name=image-registry-certificates: image-registry-certificates: configmap "registry-additional-ca-q9f6x5i4" not found
ingress                                    4.16.5    True        False         False      3h59m
insights                                   4.16.5    True        False         False      4h
kube-apiserver                             4.16.5    True        False         False      4h11m
kube-controller-manager                    4.16.5    True        False         False      4h11m
kube-scheduler                             4.16.5    True        False         False      4h11m
kube-storage-version-migrator              4.16.5    True        False         False      166m
monitoring                                 4.16.5    True        False         False      3h55m

 

 

Description of problem:

    When an image is referenced by tag and digest, oc-mirror skips the image

Version-Release number of selected component (if applicable):

    

How reproducible:

    Do mirror to disk and disk to mirror using the registry.redhat.io/redhat/redhat-operator-index:v4.16 and the operator multiarch-tuning-operator

Steps to Reproduce:

    1 mirror to disk
    2 disk to mirror

Actual results:

    docker://gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1@sha256:d4883d7c622683b3319b5e6b3a7edfbf2594c18060131a8bf64504805f875522 (Operator bundles: [multiarch-tuning-operator.v0.9.0] - Operators: [multiarch-tuning-operator]) error: Invalid source name docker://localhost:55000/kubebuilder/kube-rbac-proxy:v0.13.1:d4883d7c622683b3319b5e6b3a7edfbf2594c18060131a8bf64504805f875522: invalid reference format

Expected results:

The image should be mirrored    

Additional info:

    

Please review the following PR: https://github.com/openshift/image-customization-controller/pull/128

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

Creating a tuned profile with the annotation tuned.openshift.io/deferred: "update" before labeling the target node, then labeling the node with profile=, the value of kernel.shmmni is applied immediately, but the message [The TuneD daemon profile is waiting for the next node restart: openshift-profile] is shown. After rebooting the node, kernel.shmmni is restored to its default value instead of being set to the expected value.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Creating OCP cluster with latest 4.18 nightly version
    2. Create a tuned profile before labeling the node (an illustrative profile is sketched after these steps);
       please refer to issue 1 if you want to reproduce the issue in the doc https://docs.google.com/document/d/1h-7AIyqf7sHa5Et2XF7a-RuuejwVkrjhiFFzqZnNfvg/edit
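
A hedged sketch of the kind of Tuned object used in step 2 (the deferred annotation, profile name and kernel.shmmni come from the description; the remaining values are illustrative):

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile
  namespace: openshift-cluster-node-tuning-operator
  annotations:
    tuned.openshift.io/deferred: "update"
spec:
  profile:
  - name: openshift-profile
    data: |
      [main]
      summary=Deferred sysctl example
      include=openshift-node
      [sysctl]
      kernel.shmmni=8192
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile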
    

Actual results:

   Executing oc get profile shows the message [The TuneD daemon profile is waiting for the next node restart: openshift-profile], and the kernel.shmmni value reverts to the default after the node reboots.

Expected results:

    It should show the message [TuneD profile applied], and the sysctl value should be kept as expected after the node reboot.

Additional info:

    

Description of problem:

    (MISSING) text appears in the `oc adm must-gather --help` output.
The 4.15.0 `oc` client (https://mirror.openshift.com/pub/openshift-v4/multi/clients/ocp/4.15.0/ppc64le/openshift-client-linux.tar.gz) introduces strange output in `oc adm must-gather --help`.

Version-Release number of selected component (if applicable):

    4.15.0 and higher

How reproducible:

    4.15.0 and higher you can run the reproducer steps

Steps to Reproduce:

    1.curl -O -L  https://mirror.openshift.com/pub/openshift-v4/multi/clients/ocp/4.15.0/ppc64le/openshift-client-linux.tar.gz     
2. untar
    3. ./oc adm must-gather --help     

Actual results:

    # ./oc adm must-gather --help
    --volume-percentage=30:        Specify maximum percentage of must-gather pod's allocated volume that can be used. If this limit is exceeded,        must-gather will stop gathering, but still copy gathered data. Defaults to 30%!(MISSING)

Expected results:

    No (MISSING) content in the output

Additional info:
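
    For context, markers like %!(MISSING) are emitted by Go's fmt package when a format string contains a problem, typically an unescaped literal % in text passed through a printf-style call; the exact marker depends on what follows the %. A minimal illustration (not the actual oc code path):

package main

import "fmt"

func main() {
	// A verb with no matching argument triggers fmt's error marker.
	fmt.Printf("Defaults to 30%d\n") // prints: Defaults to 30%!d(MISSING)

	// Escaping the percent sign keeps the text intact.
	fmt.Printf("Defaults to 30%%\n") // prints: Defaults to 30%
}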

    

Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/207

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/console/pull/14238

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of the problem:

Trying to create a cluster from the UI fails.

 

How reproducible:

 

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

Description of problem:

Create VPC and subnets with following configs [refer to attached CF template]:
Subnets (subnets-pair-default) in CIDR 10.0.0.0/16
Subnets (subnets-pair-134) in CIDR 10.134.0.0/16
Subnets (subnets-pair-190) in CIDR 10.190.0.0/16

Create the cluster into subnets-pair-134; the bootstrap process fails [see attached log-bundle logs]:

level=debug msg=I0605 09:52:49.548166 	937 loadbalancer.go:1262] "adding attributes to load balancer" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" namespace="openshift-cluster-api-guests" name="yunjiang29781a-86-rvqd9" reconcileID="a9310bd5-acc7-4b01-8a84-e47139fc0d1d" cluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" attrs=[{"Key":"load_balancing.cross_zone.enabled","Value":"true"}]
level=debug msg=I0605 09:52:49.909861 	937 awscluster_controller.go:291] "Looking up IP address for DNS" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" namespace="openshift-cluster-api-guests" name="yunjiang29781a-86-rvqd9" reconcileID="a9310bd5-acc7-4b01-8a84-e47139fc0d1d" cluster="openshift-cluster-api-guests/yunjiang29781a-86-rvqd9" dns="yunjiang29781a-86-rvqd9-int-19a9485653bf29a1.elb.us-east-2.amazonaws.com"
level=debug msg=I0605 09:52:53.483058 	937 reflector.go:377] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: forcing resync
level=debug msg=Fetching Bootstrap SSH Key Pair...

Checking security groups:
<infraid>-lb allows 10.0.0.0/16:6443 and 10.0.0.0/16:22623
<infraid>-apiserver-lb allows 10.0.0.0/16:6443 and 10.134.0.0/16:22623 (and 0.0.0.0/0:6443)

Are these settings correct?

    

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-06-03-060250
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Create subnets using the attached CF template
    2. Create cluster into subnets which CIDR is 10.134.0.0/16
    3.
    

Actual results:

Bootstrap process fails.
    

Expected results:

Bootstrap succeeds.
    

Additional info:

No issues if creating cluster into subnets-pair-default (10.0.0.0/16)
No issues if only one CIDR in VPC, e.g. set VpcCidr to 10.134.0.0/16 in https://github.com/openshift/installer/blob/master/upi/aws/cloudformation/01_vpc.yaml
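
For reference, a minimal install-config sketch for step 2 (cluster name, base domain and subnet IDs are placeholders; the machine network is the 10.134.0.0/16 subnets-pair-134, while the VPC's primary CIDR stays 10.0.0.0/16):

apiVersion: v1
baseDomain: example.com
metadata:
  name: mycluster
networking:
  machineNetwork:
  - cidr: 10.134.0.0/16
platform:
  aws:
    region: us-east-2
    subnets:
    - subnet-0aaaaaaaaaaaaaaaa    # private subnet from subnets-pair-134 (placeholder ID)
    - subnet-0bbbbbbbbbbbbbbbb    # public subnet from subnets-pair-134 (placeholder ID)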

    

Description of problem:


When Secure Boot is enabled, TuneD reports the following error because debugfs access is restricted:

tuned.utils.commands: Writing to file '/sys/kernel/debug/sched/migration_cost_ns' error: '[Errno 1] Operation not permitted: '/sys/kernel/debug/sched/migration_cost_ns''
tuned.plugins.plugin_scheduler: Error writing value '5000000' to 'migration_cost_ns'

This issue has been reported with the following tickets:

As this is a confirmed limitation of the NTO due to the TuneD component, we should document this as a limitation in the OpenShift Docs:
https://docs.openshift.com/container-platform/4.16/nodes/nodes/nodes-node-tuning-operator.html

Expected Outcome:

  • Document that the NTO cannot leverage some TuneD features when Secure Boot is enabled.

Please review the following PR: https://github.com/openshift/telemeter/pull/543

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/csi-driver-nfs/pull/146

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    The smoke test for OLM run by the OpenShift e2e suite is specifying an unavailable operator for installation, causing it to fail.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always (when using 4.17+ catalog versions)

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

The customer uses the Azure File CSI driver, and without this they cannot make use of Azure Workload Identity, which was one of the banner features of OCP 4.14. The feature is currently available in 4.16; however, it will take the customer 3-6 months to validate 4.16 and start its rollout, putting their plan to complete a large migration to Azure by the end of 2024 at risk.
Could you please backport either the 1.29.3 feature for Azure Workload Identity or rebase our Azure File CSI driver in 4.14 and 4.15 to at least 1.29.3, which includes the desired feature.
    

Version-Release number of selected component (if applicable):

azure-file-csi-driver in 4.14 and 4.15
- In 4.14, azure-file-csi-driver is version 1.28.1
- In 4.15, azure-file-csi-driver is version 1.29.2
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Install ocp 4.14 with Azure Workload Managed Identity
    2. Try to configure Managed Workload Identity with the Azure File CSI driver

https://github.com/kubernetes-sigs/azurefile-csi-driver/blob/master/docs/workload-identity-static-pv-mount.md
    

Actual results:

The feature is not usable
    

Expected results:

Azure Workload Identity should be manageable with the Azure File CSI driver as part of the whole feature
    

Additional info:

    

Description of problem:

    Sort function on NetworkPolicies page is incorrect after enable Pagination

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-09-17-060032

How reproducible:

    Always

Steps to Reproduce:

    1. Create multiple resources for NetworkPolicies
    2. Navigate to Networking -> NetworkPolicies page-> NetworkPolicies Tab
    3. Make sure the option of '15 per page' has been selected
    4. Click the 'Name column' button to sort the table
    

Actual results:

    The sort result is not correct

PFA: https://drive.google.com/file/d/12-eURLqMPZM5DNxfAPoWzX1CJr0Wyf_u/view?usp=drive_link

Expected results:

    Table data can be sorted by using resource name, even if pagination is enabled

Additional info:

    

In analytics events, console sends the Organization.id from OpenShift Cluster Manager's Account Service, rather than the Organization.external_id. The external_id is meaningful company-wide at Red Hat, while the plain id is only meaningful within OpenShift Cluster Manager. You can use id to look up external_id in OCM, but it's an extra step we'd like to avoid if possible.

cc Ali Mobrem 

Description of problem:

co/ingress always stays healthy even though the operator pod logs errors:

2024-07-24T06:42:09.580Z    ERROR    operator.canary_controller    wait/backoff.go:226    error performing canary route check    {"error": "error sending canary HTTP Request: Timeout: Get \"https://canary-openshift-ingress-canary.apps.hongli-aws.qe.devcluster.openshift.com\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
    

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-2024-07-20-191204

How reproducible:

    100%

Steps to Reproduce:

    1. install AWS cluster
    2. Update ingresscontroller/default, adding "endpointPublishingStrategy.loadBalancer.allowedSourceRanges", e.g.

spec:
  endpointPublishingStrategy:
    loadBalancer:
      allowedSourceRanges:
      - 1.1.1.2/32

    3. The above setting drops most traffic to the LB, so some operators become degraded
    

Actual results:

    co/authentication and console degraded but co/ingress is still good

$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.17.0-0.nightly-2024-07-20-191204   False       False         True       22m     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.hongli-aws.qe.devcluster.openshift.com/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
 
console                                    4.17.0-0.nightly-2024-07-20-191204   False       False         True       22m     RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.hongli-aws.qe.devcluster.openshift.com): Get "https://console-openshift-console.apps.hongli-aws.qe.devcluster.openshift.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
 
ingress                                    4.17.0-0.nightly-2024-07-20-191204   True        False         False      3h58m   


check the ingress operator log and see:

2024-07-24T06:59:09.588Z    ERROR    operator.canary_controller    wait/backoff.go:226    error performing canary route check    {"error": "error sending canary HTTP Request: Timeout: Get \"https://canary-openshift-ingress-canary.apps.hongli-aws.qe.devcluster.openshift.com\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}

Expected results:

    co/ingress status should reflect the real condition timely

Additional info:

    Even though the co/ingress status can be updated in some scenarios, it is always less sensitive than authentication and console; we end up relying on authentication/console to know whether routes are healthy, which defeats the purpose of the ingress canary route.

 

Since we are not going to support adding a 2nd vCenter as a day-2 operation, we need to block users from doing this.

It looks like the must gather pods are the worst culprits but these are not actually considered to be platform pods.

Step 1: Exclude must gather pods from this test.

Step 2: Research the other failures.

Description of the problem:

the GPU data in our host inventory is wrong

How reproducible:

Always

Steps to reproduce:

1.

2.

3.

Actual results:


 "gpus": [ \{ "address": "0000:00:0f.0" } ],

Expected results:

Description of problem:

OCPBUGS-42772 is verified, but testing found an oauth-server panic with OAuth 2.0 IDP names that contain whitespace.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-10-31-190119    

How reproducible:

Always    

Steps to Reproduce:

1. Set up Google IDP with below:
$ oc create secret generic google-secret-1 --from-literal=clientSecret=xxxxxxxx -n openshift-config
$ oc edit oauth cluster
spec:
  identityProviders:
  - google:
      clientID: 9745..snipped..apps.googleusercontent.com
      clientSecret:
        name: google-secret-1
      hostedDomain: redhat.com
    mappingMethod: claim
    name: 'my Google idp'
    type: Google
...

Actual results:

oauth-server panic:

$ oc get po -n openshift-authentication
NAME                               READY   STATUS             RESTARTS
oauth-openshift-59545c6f5-dwr6s    0/1     CrashLoopBackOff   11 (4m10s ago)
...

$ oc logs -p -n openshift-authentication oauth-openshift-59545c6f5-dwr6s
Copying system trust bundle
I1101 03:40:09.883698       1 dynamic_serving_content.go:113] "Loaded a new cert/key pair" name="serving-cert::/var/config/system/secrets/v4-0-config-system-serving-cert/tls.crt::/var/config/system/secrets/v4-0-config-system-serving-cert/tls.key"
I1101 03:40:09.884046       1 dynamic_serving_content.go:113] "Loaded a new cert/key pair" name="sni-serving-cert::/var/config/system/secrets/v4-0-config-system-router-certs/apps.hongli-az.qe.azure.devcluster.openshift.com::/var/config/system/secrets/v4-0-config-system-router-certs/apps.hongli-az.qe.azure.devcluster.openshift.com"
I1101 03:40:10.335739       1 audit.go:340] Using audit backend: ignoreErrors<log>
I1101 03:40:10.347632       1 requestheader_controller.go:244] Loaded a new request header values for RequestHeaderAuthRequestController
panic: parsing "/oauth2callback/my Google idp": at offset 0: invalid method "/oauth2callback/my"

goroutine 1 [running]:
net/http.(*ServeMux).register(...)
        net/http/server.go:2738
net/http.(*ServeMux).Handle(0x29844c0?, {0xc0008886a0?, 0x2984420?}, {0x2987fc0?, 0xc0006ff4a0?})
        net/http/server.go:2701 +0x56
github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).getAuthenticationHandler(0xc0006c28c0, {0x298f618, 0xc0008a4d00}, {0x2984540, 0xc000171450})
        github.com/openshift/oauth-server/pkg/oauthserver/auth.go:407 +0x11ad
github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).getAuthorizeAuthenticationHandlers(0xc0006c28c0, {0x298f618, 0xc0008a4d00}, {0x2984540, 0xc000171450})
        github.com/openshift/oauth-server/pkg/oauthserver/auth.go:243 +0x65
github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).WithOAuth(0xc0006c28c0, {0x2982500, 0xc0000aca80})
        github.com/openshift/oauth-server/pkg/oauthserver/auth.go:108 +0x21d
github.com/openshift/oauth-server/pkg/oauthserver.(*OAuthServerConfig).buildHandlerChainForOAuth(0xc0006c28c0, {0x2982500?, 0xc0000aca80?}, 0xc000785888)
        github.com/openshift/oauth-server/pkg/oauthserver/oauth_apiserver.go:342 +0x45
k8s.io/apiserver/pkg/server.completedConfig.New.func1({0x2982500?, 0xc0000aca80?})
        k8s.io/apiserver@v0.29.2/pkg/server/config.go:825 +0x28
k8s.io/apiserver/pkg/server.NewAPIServerHandler({0x252ca0a, 0xf}, {0x2996020, 0xc000501a00}, 0xc0005d1740, {0x0, 0x0})
        k8s.io/apiserver@v0.29.2/pkg/server/handler.go:96 +0x2ad
k8s.io/apiserver/pkg/server.completedConfig.New({0xc000785888?, {0x0?, 0x0?}}, {0x252ca0a, 0xf}, {0x29b41a0, 0xc000171370})
        k8s.io/apiserver@v0.29.2/pkg/server/config.go:833 +0x2a5
github.com/openshift/oauth-server/pkg/oauthserver.completedOAuthConfig.New({{0xc0005add40?}, 0xc0006c28c8?}, {0x29b41a0?, 0xc000171370?})
        github.com/openshift/oauth-server/pkg/oauthserver/oauth_apiserver.go:322 +0x6a
github.com/openshift/oauth-server/pkg/cmd/oauth-server.RunOsinServer(0xc000451cc0?, 0xc000810000?, 0xc00061a5a0)
        github.com/openshift/oauth-server/pkg/cmd/oauth-server/server.go:45 +0x73
github.com/openshift/oauth-server/pkg/cmd/oauth-server.(*OsinServerOptions).RunOsinServer(0xc00030e168, 0xc00061a5a0)
        github.com/openshift/oauth-server/pkg/cmd/oauth-server/cmd.go:108 +0x259
github.com/openshift/oauth-server/pkg/cmd/oauth-server.NewOsinServerCommand.func1(0xc00061c300?, {0x251a8c8?, 0x4?, 0x251a8cc?})
        github.com/openshift/oauth-server/pkg/cmd/oauth-server/cmd.go:46 +0xed
github.com/spf13/cobra.(*Command).execute(0xc000780008, {0xc00058d6c0, 0x7, 0x7})
        github.com/spf13/cobra@v1.7.0/command.go:944 +0x867
github.com/spf13/cobra.(*Command).ExecuteC(0xc0001a3b08)
        github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5
github.com/spf13/cobra.(*Command).Execute(...)
        github.com/spf13/cobra@v1.7.0/command.go:992
k8s.io/component-base/cli.run(0xc0001a3b08)
        k8s.io/component-base@v0.29.2/cli/run.go:146 +0x290
k8s.io/component-base/cli.Run(0xc00061a5a0?)
        k8s.io/component-base@v0.29.2/cli/run.go:46 +0x17
main.main()
        github.com/openshift/oauth-server/cmd/oauth-server/main.go:46 +0x2de

Expected results:

No panic

Additional info:

Tried in old env like 4.16.20 with same steps, no panic:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.20   True        False         95m     Cluster version is 4.16.20

$ oc get po -n openshift-authentication
NAME                               READY   STATUS    RESTARTS   AGE    
oauth-openshift-7dfcd8c8fd-77ltf   1/1     Running   0          116s   
oauth-openshift-7dfcd8c8fd-sr97w   1/1     Running   0          89s    
oauth-openshift-7dfcd8c8fd-tsrff   1/1     Running   0          62s

Description of problem:

The new monitor test api-unreachable-from-client-metrics does not pass in MicroShift. Since this is a monitor test, there is no way to skip it, and a fix is needed.
This test is breaking the conformance job for MicroShift, which is critical for that job to become blocking.

Version-Release number of selected component (if applicable):

4.18

How reproducible:

Run conformance over MicroShift.

Steps to Reproduce:

1.
2.
3.

Actual results:

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-microshift-release-4.18-periodics-e2e-aws-ovn-ocp-conformance/1828583537415032832

Expected results:

 

Additional info:

 

Please review the following PR: https://github.com/openshift/images/pull/191

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

oc-mirror should fail when the call to the Cincinnati API fails

Version-Release number of selected component (if applicable):

oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202407291514.p0.gdbf115f.assembly.stream.el9-dbf115f", GitCommit:"dbf115f547a19f12ab72e7b326be219a47d460a0", GitTreeState:"clean", BuildDate:"2024-07-29T15:52:52Z", GoVersion:"go1.22.4 (Red Hat 1.22.4-2.el9) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

100%

Steps to Reproduce:

1) Set squid proxy;
2) use following imagesetconfig to mirror ocp:
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    graph: true
    channels:
    - name: stable-4.15                                             
      type: ocp
      minVersion: '4.15.18'
      maxVersion: '4.15.18'

oc-mirror -c config.yaml  file://out38037 --v2

Actual results:

2) oc-mirror fails to reach the Cincinnati API, but it just logs an error, states that there are 0 images to copy, and continues

oc-mirror -c config-38037.yaml  file://out38037 --v2

2024/08/13 04:27:41  [WARN]   : ⚠️  --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/08/13 04:27:41  [INFO]   : 👋 Hello, welcome to oc-mirror
2024/08/13 04:27:41  [INFO]   : ⚙️  setting up the environment for you...
2024/08/13 04:27:41  [INFO]   : 🔀 workflow mode: mirrorToDisk 
2024/08/13 04:27:41  [INFO]   : 🕵️  going to discover the necessary images...
2024/08/13 04:27:41  [INFO]   : 🔍 collecting release images...
I0813 04:27:41.388376  203687 core-cincinnati.go:508] Using proxy 127.0.0.1:3128 to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.15&id=1454eaf7-7f41-4678-ae88-30d4957e24f9
2024/08/13 04:27:41  [ERROR]  : get release images: error list APIRequestError: channel "stable-4.15": RemoteFailed: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.15&id=1454eaf7-7f41-4678-ae88-30d4957e24f9": Forbidden 
2024/08/13 04:27:41  [WARN]   : error during graph image processing - SKIPPING: Get "https://api.openshift.com/api/upgrades_info/graph-data": Forbidden
2024/08/13 04:27:41  [INFO]   : 🔍 collecting operator images...
2024/08/13 04:27:41  [INFO]   : 🔍 collecting additional images...
2024/08/13 04:27:41  [INFO]   : 🚀 Start copying the images...
2024/08/13 04:27:41  [INFO]   : images to copy 0 
2024/08/13 04:27:41  [INFO]   : === Results ===
2024/08/13 04:27:41  [INFO]   : 📦 Preparing the tarball archive...
2024/08/13 04:27:41  [INFO]   : mirror time     : 464.620593ms
2024/08/13 04:27:41  [INFO]   : 👋 Goodbye, thank you for using oc-mirror

 

 

Expected results:

When the Cincinnati API (api.openshift.com) is not reachable, oc-mirror should fail immediately

 

Expected results:

networking-console-plugin deployment has the required-scc annotation   

Additional info:

The deployment does not have any annotation about it   

CI warning

# [sig-auth] all workloads in ns/openshift-network-console must set the 'openshift.io/required-scc' annotation
annotation missing from pod 'networking-console-plugin-7c55b7546c-kc6db' (owners: replicaset/networking-console-plugin-7c55b7546c); suggested required-scc: 'restricted-v2'
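
For reference, a minimal sketch of the annotation the warning asks for, placed on the deployment's pod template (value taken from the suggested required-scc above):

spec:
  template:
    metadata:
      annotations:
        openshift.io/required-scc: restricted-v2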

Please review the following PR: https://github.com/openshift/installer/pull/8962

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

The network section will be delivered using the networking-console-plugin through the cluster-network-operator.
So we have to remove the section from here to avoid duplication.

Version-Release number of selected component (if applicable):
4.18

How reproducible:
Always

Steps to Reproduce:

  1. Open the network section

Actual results:
Service, Route, Ingress and NetworkPolicy are defined two times in the section

Expected results:
Service, Route, Ingress and NetworkPolicy are defined only one time in the section

Additional info:

From david:

pod/metal3-static-ip-set in namespace/openshift-machine-api should trip some kind of test due to restartCount=5 on its container. Let's say any pod created after the install is finished should have restartCount=0, and see how many pods fail that criterion.

We should have a test that makes sure pods created after the cluster is up do not have a non-zero restartCount.

Please review the following PR: https://github.com/openshift/openshift-controller-manager/pull/330

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/1140

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

   When installing a GCP cluster with the CAPI-based method, the kube-api firewall rule that is created always uses a source range of 0.0.0.0/0. In the prior Terraform-based method, internally published clusters were limited to the network_cidr. This change opens up the API to additional sources, which could be problematic, for example when traffic is routed from a non-cluster subnet.

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    Always

Steps to Reproduce:

    1. Install a cluster in GCP with publish: internal
    2.
    3.
    

Actual results:

    Kube-api firewall rule has source of 0.0.0.0/0

Expected results:

    Kube-api firewall rule has a more limited source of network_cidr

Additional info:
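
For reference, a sketch of the relevant install-config fields for step 1 (values illustrative); the expectation is that the generated kube-api firewall rule's source range is limited to a cluster CIDR (presumably the machineNetwork below, which is what the Terraform flow's network_cidr corresponded to) rather than 0.0.0.0/0:

publish: Internal
networking:
  machineNetwork:
  - cidr: 10.0.0.0/16
platform:
  gcp:
    projectID: my-project        # illustrative
    region: us-central1          # illustrative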

    

Please review the following PR: https://github.com/openshift/csi-operator/pull/87

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Tests:

  • Kubectl logs all pod logs the Deployment has 2 replicas and each pod has 2 containers should get logs from all pods based on default container
  • Kubectl logs all pod logs the Deployment has 2 replicas and each pod has 2 containers should get logs from each pod and each container in Deployment

are being disabled in https://github.com/openshift/kubernetes/blob/master/openshift-hack/e2e/annotate/rules.go

These tests should be enabled after the 1.31 kube bump in oc

Component Readiness has found a potential regression in the following test:

[sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers for ns/openshift-image-registry

Probability of significant regression: 98.02%

Sample (being evaluated) Release: 4.17
Start Time: 2024-08-15T00:00:00Z
End Time: 2024-08-22T23:59:59Z
Success Rate: 94.74%
Successes: 180
Failures: 10
Flakes: 0

Base (historical) Release: 4.16
Start Time: 2024-05-31T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 89
Failures: 0
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=aws&Platform=aws&Scheduler=default&SecurityMode=default&Suite=unknown&Suite=unknown&Topology=ha&Topology=ha&Upgrade=micro&Upgrade=micro&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=Other&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Image%20Registry&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20aws%20unknown%20ha%20micro&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-08-22%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-08-15%2000%3A00%3A00&testId=openshift-tests-upgrade%3A10a9e2be27aa9ae799fde61bf8c992f6&testName=%5Bsig-cluster-lifecycle%5D%20pathological%20event%20should%20not%20see%20excessive%20Back-off%20restarting%20failed%20containers%20for%20ns%2Fopenshift-image-registry

Also hitting 4.17, I've aligned this bug to 4.18 so the backport process is cleaner.

The problem appears to be a permissions error preventing the pods from starting:

2024-08-22T06:14:14.743856620Z ln: failed to create symbolic link '/etc/pki/ca-trust/extracted/pem/directory-hash/ca-certificates.crt': Permission denied

Originating from this code: https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/podtemplatespec.go#L489

Both 4.17 and 4.18 nightlies bumped RHCOS, and that bump includes a package upgrade like this:

container-selinux-3-2.231.0-1.rhaos4.16.el9-noarch -> container-selinux-3-2.231.0-2.rhaos4.17.el9-noarch

With slightly different versions in each stream, but both were on 3-2.231.

Hits other tests too:

operator conditions image-registry
Operator upgrade image-registry
[sig-cluster-lifecycle] Cluster completes upgrade
[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
[sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable after finishing upgrade [Late][Suite:upgrade]

Description of problem:

Checked in 4.17.0-0.nightly-2024-09-18-003538: the default thanos-ruler retention time is 24h, not the 15d mentioned in https://github.com/openshift/cluster-monitoring-operator/blob/release-4.17/Documentation/api.md#thanosrulerconfig. The issue exists in 4.12+.

$ for i in $(oc -n openshift-user-workload-monitoring get sts --no-headers | awk '{print $1}'); do echo $i; oc -n openshift-user-workload-monitoring get sts $i -oyaml | grep retention; echo -e "\n"; done
prometheus-user-workload
        - --storage.tsdb.retention.time=24h

thanos-ruler-user-workload
        - --tsdb.retention=24h

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-09-18-003538    

How reproducible:

always

Steps to Reproduce:

1. see the description

Actual results:

default thanos-ruler retention time is 15d in api.md

Expected results:

should be 24h

Additional info:

    

Related with https://issues.redhat.com/browse/OCPBUGS-23000

The cluster-autoscaler by default evicts all of those pods, including those coming from DaemonSets.
In the case of EFS CSI drivers, whose volumes are mounted as NFS, this causes stale NFS mounts, and application workloads are not terminated gracefully. 

Version-Release number of selected component (if applicable):

4.11

How reproducible:

- While scaling down a node from the cluster-autoscaler-operator, the DS pods are being evicted.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

CSI pods should not be evicted by the cluster autoscaler (at least not before workload termination), as this can produce data corruption

Additional info:

It is possible to disable eviction of the CSI pods by adding the following annotation to the CSI driver pods:
cluster-autoscaler.kubernetes.io/enable-ds-eviction: "false"
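
For illustration, a minimal sketch of where that annotation goes so it ends up on the DaemonSet's pods (names and image are illustrative):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: efs-csi-node                        # illustrative
  namespace: openshift-cluster-csi-drivers
spec:
  selector:
    matchLabels:
      app: efs-csi-node
  template:
    metadata:
      labels:
        app: efs-csi-node
      annotations:
        # asks the cluster autoscaler not to evict this DaemonSet pod during scale-down
        cluster-autoscaler.kubernetes.io/enable-ds-eviction: "false"
    spec:
      containers:
      - name: csi-driver                    # illustrative
        image: registry.example.com/efs-csi-driver:latest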

Description of problem:

In discussion of https://issues.redhat.com/browse/OCPBUGS-37862 it was noticed that sometimes the haproxy-monitor is reporting "API is not reachable through HAProxy" which means it is removing the firewall rule to direct traffic to HAProxy. This is not ideal since it means keepalived will likely fail over the VIP and it may be breaking existing connections to HAProxy.

There are a few possible reasons for this. One is that we only require two failures of the healthcheck in the monitor to trigger this removal. For something we don't expect to need to happen often during normal operation of a cluster, this is probably a bit too harsh, especially since we only check every 6 seconds so it's not like we're looking for quick error detection. This is more a bootstrapping thing and a last ditch effort to keep the API functional if something has gone terribly wrong in the cluster. If it takes a few more seconds to detect an outage that's better than detecting outages that aren't actually outages.

The first thing we're going to try to fix this is to increase what amounts to the "fall" value for the monitor check. If that doesn't eliminate the problem we will have to look deeper at the HAProxy behavior during node reboots.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Panic seen in the CI job below when running the following command:

$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?name=^periodic&type=junit&search=machine-config-controller.*Observed+a+panic' | grep 'failures match'
periodic-ci-openshift-insights-operator-stage-insights-operator-e2e-tests-periodic (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-insights-operator-release-4.17-insights-operator-e2e-tests-periodic (all) - 2 runs, 100% failed, 50% of failures match = 50% impact

Panic observed:

E0910 09:00:04.283647       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 268 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x36c8b40, 0x5660c90})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000ce8540?})
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x36c8b40?, 0x5660c90?})
	/usr/lib/golang/src/runtime/panic.go:770 +0x132
github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).updateNode(0xc000d6e360, {0x3abd580?, 0xc00224a608}, {0x3abd580?, 0xc001bd2308})
	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:585 +0x1f3
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:246
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:976 +0xea
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001933f70, {0x3faaba0, 0xc000759710}, 0x1, 0xc00097bda0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000750f70, 0x3b9aca00, 0x0, 0x1, 0xc00097bda0)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*processorListener).run(0xc000dc2630)
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:972 +0x69
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 261
	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x33204b3] 

 

Version-Release number of selected component (if applicable):

    

How reproducible:

Seen in this CI run: https://prow.ci.openshift.org/job-history/test-platform-results/logs/periodic-ci-openshift-insights-operator-stage-insights-operator-e2e-tests-periodic

Steps to Reproduce:

$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?name=^periodic&type=junit&search=machine-config-controller.*Observed+a+panic' | grep 'failures match'

Actual results:

    

Expected results:

 No panic should be observed

Additional info:

    

Failures beginning in 4.18.0-0.ci-2024-10-08-185524

 Suite run returned error: unable to extract image references from release payload: failed extracting image-references from "registry.ci.openshift.org/ocp/release:4.18.0-0.ci-2024-10-08-185524": error during image extract: exit status 1 (error: unable to read image registry.ci.openshift.org/ocp/release:4.18.0-0.ci-2024-10-08-185524: unauthorized: authentication required
)
error running options: unable to extract image references from release payload: failed extracting image-references from "registry.ci.openshift.org/ocp/release:4.18.0-0.ci-2024-10-08-185524": error during image extract: exit status 1 (error: unable to read image registry.ci.openshift.org/ocp/release:4.18.0-0.ci-2024-10-08-185524: unauthorized: authentication required
)error: unable to extract image references from release payload: failed extracting image-references from "registry.ci.openshift.org/ocp/release:4.18.0-0.ci-2024-10-08-185524": error during image extract: exit status 1 (error: unable to read image registry.ci.openshift.org/ocp/release:4.18.0-0.ci-2024-10-08-185524: unauthorized: authentication required
) 

revert

Undiagnosed panic detected in pod

This test is failing the majority of the time on hypershift jobs.

The failure looks straightforward:

{  pods/openshift-kube-controller-manager_kube-controller-manager-ip-10-0-18-18.ec2.internal_cluster-policy-controller.log.gz:E1015 12:53:31.246033       1 scctopsamapping.go:336] "Observed a panic" panic="unknown volume type: image" panicGoValue="&errors.errorString{s:\"unknown volume type: image\"}" stacktrace=<

We're close to no longer being able to see back that far, but it looks like this may have started Oct 3rd.

For job runs with the test failure see here.

Description of problem:

    When hosted zones are created in the cluster creator account, and the ingress role is a role in the cluster creator account, the private link controller fails to create DNS records in the local zone.

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    Always

Steps to Reproduce:

    1. Set up shared vpc infrastructure in which the hosted zone and local zone exist in the cluster creator account. 
    2. Create a hosted cluster
    

Actual results:

    The hosted cluster never gets nodes to join because it is missing records in the local hosted zone.

Expected results:

    The hosted cluster completes installation with available nodes.

Additional info:

    Creating the hosted zones in the cluster creator account is an alternative way of setting up shared vpc infrastructure. In this mode, the role to assume for creating DNS records is a role in the cluster creator account and not in the vpc account.
  • When a cluster-admin user clicks 'Create Route' on the list page, it keeps loading forever; sometimes this also happens for a normal user
  • Deleting a Route doesn't work: every time we delete a route, a new one is generated (this seems only reproducible when there is more than one route)
  • On the Route creation form, `Path` is a required field, but when we create a 'Secure Route' it reports the error
    Error "Invalid value: "/": passthrough termination does not support paths" for field "spec.path".

Description of problem:

The [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility setup test frequently fails on OpenStack platform, which in turn also causes the [sig-network] can collect pod-to-service poller pod logs and [sig-network] can collect host-to-service poller pod logs tests to fail.

These failures happen frequently on vh-mecha, for example in all CSI jobs, such as 4.16-e2e-openstack-csi-cinder.

   

Please review the following PR: https://github.com/openshift/cluster-baremetal-operator/pull/442

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

    These two tests have been flaking more often lately. The TestLeaderElection flake is partially (but not solely) connected to OCPBUGS-41903.

   TestOperandProxyConfiguration seems to fail in the teardown while waiting for other cluster operators to become available.

   Although these flakes aren't customer facing, they considerably slow development cycles (due to retests) and also consume more resources than they should (every retest runs on a new cluster), so we want to backport the fixes.

Version-Release number of selected component (if applicable):

    4.18, 4.17, 4.16, 4.15, 4.14

How reproducible:

    Sometimes

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

    installing into GCP shared VPC with BYO hosted zone failed with error "failed to create the private managed zone"

Version-Release number of selected component (if applicable):

    4.17.0-0.nightly-multi-2024-08-26-170521

How reproducible:

    Always

Steps to Reproduce:

    1. Pre-create the DNS private zone in the service project, with the zone's DNS name like "<cluster name>.<base domain>", and bind it to the shared VPC
    2. activate the service account having minimum permissions, i.e. no permission to bind a private zone to the shared VPC in the host project (see [1])
    3. "create install-config" and then insert the interested settings (e.g. see [2])
    4. "create cluster"     

Actual results:

    It still tries to create a private zone, which is unexpected.

failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed provisioning resources after infrastructure ready: failed to create the private managed zone: failed to create private managed zone: googleapi: Error 403: Forbidden, forbidden

Expected results:

    The installer should use the pre-configured dns private zone, rather than try to create a new one. 

Additional info:

The 4.16 epic adding the support: https://issues.redhat.com/browse/CORS-2591

One PROW CI test which succeeded using Terraform installation: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-multi-nightly-4.17-upgrade-from-stable-4.17-gcp-ipi-xpn-mini-perm-byo-hosted-zone-arm-f28/1821177143447523328

The PROW CI test which failed: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.17-multi-nightly-gcp-ipi-xpn-mini-perm-byo-hosted-zone-amd-f28-destructive/1828255050678407168

Description of problem:

    OCP conformance MonitorTests can fail depending on the order in which the CSI driver pods and ClusterRole are applied. The ServiceAccount, ClusterRole, and ClusterRoleBinding should likely be applied before the deployment/pods.

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    60%

Steps to Reproduce:

    1. Create IPI cluster on IBM Cloud
    2. Run OCP Conformance w/ MonitorTests
    

Actual results:

    : [sig-auth][Feature:SCC][Early] should not have pod creation failures during install [Suite:openshift/conformance/parallel]

{  fail [github.com/openshift/origin/test/extended/authorization/scc.go:76]: 1 pods failed before test on SCC errors
Error creating: pods "ibm-vpc-block-csi-node-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[5]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[6]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[7]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[9]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, provider restricted-v2: .containers[0].runAsUser: Invalid value: 0: must be in the ranges: [1000180000, 1000189999], provider restricted-v2: .containers[1].runAsUser: Invalid value: 0: must be in the ranges: [1000180000, 1000189999], provider restricted-v2: .containers[1].privileged: Invalid value: true: Privileged containers are not allowed, provider restricted-v2: .containers[2].runAsUser: Invalid value: 0: must be in the ranges: [1000180000, 1000189999], provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount] for DaemonSet.apps/v1/ibm-vpc-block-csi-node -n openshift-cluster-csi-drivers happened 7 times

Ginkgo exit error 1: exit with code 1}

Expected results:

    No pod creation failures using the wrong SCC, because the ClusterRole/ClusterRoleBinding, etc. had not been applied yet.

Additional info:

Sorry, I did not see an IBM Cloud Storage component listed in the targeted Component field for this bug, so I selected the generic Storage component. Please forward as necessary/possible.


Items to consider:

ClusterRole:  https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/assets/rbac/privileged_role.yaml

ClusterRoleBinding:  https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/assets/rbac/node_privileged_binding.yaml

The ibm-vpc-block-csi-node-* pods eventually reach Running using the privileged SCC. I do not know whether it is possible to stage the resources within the CSI driver operator
https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/9288e5078f2fe3ce2e69a4be3d94622c164c3dbd/pkg/operator/starter.go#L98-L99
so that they are created prior to the CSI driver DaemonSet (`node.yaml`); perhaps order matters within the list.

Example of failure in CI:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/8235/pull-ci-openshift-installer-master-e2e-ibmcloud-ovn/1836521032031145984

 

Description of problem:

On "Search" page, search resource VolumeSnapshots/VolumeSnapshotClasses and filter with label, the filter doesn't work.
    

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-04-08-024331
    

How reproducible:

always
    

Steps to Reproduce:

To reproduce VolumeSnapshot bug:  
    1. Go to VolumeSnapshots page under a namespace that has VolumeSnapshotClaims defined (e.g. openshift-pipelines)
    2. Create two new VolumeSnapshots - use one of the defined VolumeSnapshotClaims during creation.
    3. Click on one of the created VolumeSnapshots and add a label - e.g. "demoLabel".
    4. Go to "Search" page, choose "VolumeSnapshots" resource, filter with any label, eg "demoLabel", "something"
To reproduce VolumeSnapshotClass bug: 
    1. Go to VolumeSnapshotsClasses page
    2. Create two new VolumeSnapshotClasses. 
    3. Click on one of the created VolumeSnapshotClasses and add a label - e.g. "demoLabel".    
    4. Go to "Search" page, choose "VolumeSnapshots" resource, filter with any label, eg "demoLabel", "something"

 

Actual results:

1. Label filters don't work.
2. VolumeSnapshots are listed without being filtered by label.
3. VolumeSnapshotClasses are listed without being filtered by label.

Expected results:

1. VSs and VSCs should be filtered by label.

    

Additional info:

Screenshots VS: 
https://drive.google.com/drive/folders/1GEUgOn5FXr-l3LJNF-FWBmn-bQ8uE_rD?usp=sharing   
Screenshoft VSC:
https://drive.google.com/drive/folders/1gI7PNCzcCngfmFT5oI1D6Bask5EPsN7v?usp=sharing

 

  

Description of problem:

    %s is not populated with the authoritativeAPI value when the cluster is enabled for migration

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-09-23-182657

How reproducible:

    Always

Steps to Reproduce:

Set featuregate as below 
spec:
  featureSet: CustomNoUpgrade
  customNoUpgrade:
    enabled:
    - MachineAPIMigration    

Update via `oc edit --subresource status` to add the `.status.authoritativeAPI` field to see the behaviour of the pausing.

eg- oc edit --subresource status machineset.machine.openshift.io miyadav-2709a-5v7g7-worker-eastus2

 

Actual results:

    status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2024-09-27T07:22:58Z"
    reason: AuthoritativeAPI is set to MachineAPI
    severity: The AuthoritativeAPI is set to %s
    status: "False"
    type: Paused
  fullyLabeledRepl

Expected results:

    status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2024-09-27T07:22:58Z"
    reason: AuthoritativeAPI is set to MachineAPI
    severity: The AuthoritativeAPI is set to MachineAPI
    status: "False"
    type: Paused
  fullyLabeledRepl

Additional info:

    related to - https://issues.redhat.com/browse/OCPCLOUD-2565

    message: 'The AuthoritativeAPI is set to '
    reason: AuthoritativeAPIMachineAPI
    severity: Info
    status: "False"
    type: Paused

 

Description of problem:

   Specifying additionalTrustBundle in the HostedCluster doesn't propagate down to the worker nodes

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    100%

Steps to Reproduce:

    1. Create a ConfigMap with the additionalTrustBundle
    2. Reference the ConfigMap in HC.Spec.AdditionalTrustBundle
    3. Debug the worker nodes and check whether the additionalTrustBundle has been updated
    

Actual results:

    The additionalTrustBundle hasn't propagated down to the nodes

Expected results:

     The additionalTrustBundle propagates down to the nodes

Additional info:
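
For reference, a minimal sketch of the configuration under test (names are illustrative; the ConfigMap lives in the same namespace as the HostedCluster and typically carries the bundle under the ca-bundle.crt key):

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-ca-bundle            # illustrative
  namespace: clusters
data:
  ca-bundle.crt: |
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
---
apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example
  namespace: clusters
spec:
  additionalTrustBundle:
    name: user-ca-bundle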

    

Description of problem:

Redfish exception occurred while provisioning a worker using HW RAID configuration on HP server with ILO 5:

step': 'delete_configuration', 'abortable': False, 'priority': 0}: Redfish exception occurred. Error: The attribute StorageControllers/Name is missing from the resource /redfish/v1/Systems/1/Storage/DE00A000

spec used:
spec:
  raid:
    hardwareRAIDVolumes:
    - name: test-vol
      level: "1"
      numberOfPhysicalDisks: 2
      sizeGibibytes: 350
  online: true

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

    1. Provision an HPE worker with iLO 5 using Redfish
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Software production has changed the key they want ART to sign with. ART is currently signing with the original key we were provided and sigstore-3.

Description of problem:

After configuring remote write for the UWM Prometheus named "user-workload" in the ConfigMap named user-workload-monitoring-config, the proxyUrl (matching the cluster proxy resource) is not injected at all.

Version-Release number of selected component (if applicable):

4.16.4

How reproducible:

100%

Steps to Reproduce:

1. Configure proxy custom resource in RHOCP 4.16.4 cluster
2. Create the user-workload-monitoring-config ConfigMap in the openshift-user-workload-monitoring project
3. Inject remote-write config (without specifically configuring proxy for remote-write)
4. After saving the modification in  user-workload-monitoring-config configmap, check the remoteWrite config in Prometheus user-workload CR. Now it does NOT contain the proxyUrl. Example snippet:
==============
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
[...]
  name: user-workload
  namespace: openshift-user-workload-monitoring
spec:
[...]
  remoteWrite:
  - url: http://test-remotewrite.test.svc.cluster.local:9090    <<== No Proxy URL Injected 

Actual results:

UWM prometheus CR named "user-workload" doesn't inherit the proxyURL from cluster proxy resource.

Expected results:

UWM prometheus CR named "user-workload" should inherit proxyURL from cluster proxy resource and it should also respect noProxy which is configured in cluster proxy.

Additional info:
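
For illustration, a sketch of the field expected to be injected into the rendered Prometheus CR for remote-write targets not excluded by noProxy (URLs are illustrative):

spec:
  remoteWrite:
  - url: https://remote-write.example.com/api/v1/write
    proxyUrl: http://proxy.example.com:3128    # expected to be inherited from the cluster proxy resource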

    

Description of problem:

    CNO doesn't report, as a metric, when there is a network overlap when live migration is initiated. 

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

 

Please review the following PR: https://github.com/openshift/cluster-capi-operator/pull/200

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Cloud Compute / Other Provider". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.

Component name: ose-cluster-capi-operator-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml

Please review the following PR: https://github.com/openshift/thanos/pull/151

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Other Incomplete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled

The GCP regions `us-east2` and `us-east3` don't have zones, and when these regions are used for creating a cluster, the installer crashes with the stack trace below.

$ openshift-install version
openshift-install 4.16.0-0.ci.test-2024-04-23-055943-ci-ln-z602w5b-latest
built from commit 0bbbb0261b724628c8e68569f31f86fd84669436
release image registry.build03.ci.openshift.org/ci-ln-z602w5b/release@sha256:a0df6e54dfd5d45e8ec6f2fcb07fa99cf682f7a79ea3bc07459d3ba1dbb47581
release architecture amd64
$ yq-3.3.0 r test4/install-config.yaml platform
gcp:
  projectID: openshift-qe
  region: us-east2
  userTags:
  - parentID: 54643501348
    key: ocp_tag_dev
    value: bar
  - parentID: openshift-qe
    key: Su.Shi-Jiang_Cheng_Zi
    value: SHI NIAN
  userLabels:
  - key: createdby
    value: installer-qe
  - key: a
    value: 8
$ yq-3.3.0 r test4/install-config.yaml credentialsMode
Passthrough
$ yq-3.3.0 r test4/install-config.yaml featureSet
TechPreviewNoUpgrade
$ yq-3.3.0 r test4/install-config.yaml metadata.name
jiwei-0424a
$
$ openshift-install create cluster --dir test4
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
panic: runtime error: index out of range [0] with length 0
goroutine 1 [running]:
github.com/openshift/installer/pkg/asset/installconfig/gcp.(*Client).GetMachineTypeWithZones(0xc0017a7f90?, {0x1f6dd998, 0x23ab4be0}, {0xc0007e6650, 0xc}, {0xc0007e6660, 0x8}, {0x7c2b604, 0xd})
        /go/src/github.com/openshift/installer/pkg/asset/installconfig/gcp/client.go:142 +0x5e8
github.com/openshift/installer/pkg/asset/installconfig/gcp.ValidateInstanceType({0x1f6fe5e8?, 0xc0007e0428?}, 0xc001a7cde0, {0xc0007e6650?, 0x27f?}, {0xc0007e6660?, 0x40ffcf?}, {0xc000efe980, 0x0, 0x0}, ...)
        /go/src/github.com/openshift/installer/pkg/asset/installconfig/gcp/validation.go:80 +0x6c
github.com/openshift/installer/pkg/asset/installconfig/gcp.validateInstanceTypes({0x1f6fe5e8, 0xc0007e0428}, 0xc00107f080)
        /go/src/github.com/openshift/installer/pkg/asset/installconfig/gcp/validation.go:189 +0x4f7
github.com/openshift/installer/pkg/asset/installconfig/gcp.Validate({0x1f6fe5e8?, 0xc0007e0428}, 0xc00107f080)
        /go/src/github.com/openshift/installer/pkg/asset/installconfig/gcp/validation.go:63 +0xf45
github.com/openshift/installer/pkg/asset/installconfig.(*InstallConfig).platformValidation(0xc0011d8f80)
        /go/src/github.com/openshift/installer/pkg/asset/installconfig/installconfig.go:199 +0x21a
github.com/openshift/installer/pkg/asset/installconfig.(*InstallConfig).finish(0xc0011d8f80, {0x7c518a9, 0x13})
        /go/src/github.com/openshift/installer/pkg/asset/installconfig/installconfig.go:171 +0x6ce
github.com/openshift/installer/pkg/asset/installconfig.(*InstallConfig).Load(0xc0011d8f80, {0x1f69a550?, 0xc001155c70?})
        /go/src/github.com/openshift/installer/pkg/asset/installconfig/installconfig.go:112 +0x55
github.com/openshift/installer/pkg/asset/store.(*storeImpl).load(0xc000c5f440, {0x1f6c8080, 0xc0011d8ac0}, {0xc001163c6c, 0x4})
        /go/src/github.com/openshift/installer/pkg/asset/store/store.go:264 +0x33f
github.com/openshift/installer/pkg/asset/store.(*storeImpl).load(0xc000c5f440, {0x1f6cc230, 0xc001199360}, {0x7c056f3, 0x2})
        /go/src/github.com/openshift/installer/pkg/asset/store/store.go:247 +0x23a
github.com/openshift/installer/pkg/asset/store.(*storeImpl).load(0xc000c5f440, {0x7f88420a5000, 0x23a57c20}, {0x0, 0x0})
        /go/src/github.com/openshift/installer/pkg/asset/store/store.go:247 +0x23a
github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc000c5f440, {0x1f6ddab0, 0xc0011b6eb0}, {0x7f88420a5000, 0x23a57c20}, {0x0, 0x0})
        /go/src/github.com/openshift/installer/pkg/asset/store/store.go:201 +0x1b1
github.com/openshift/installer/pkg/asset/store.(*storeImpl).Fetch(0x7fffbc31408f?, {0x1f6ddab0?, 0xc0011b6eb0?}, {0x7f88420a5000, 0x23a57c20}, {0x23a27e60, 0x8, 0x8})
        /go/src/github.com/openshift/installer/pkg/asset/store/store.go:77 +0x54
github.com/openshift/installer/pkg/asset/store.(*fetcher).FetchAndPersist(0xc001155c60, {0x1f6ddab0, 0xc0011b6eb0}, {0x23a27e60, 0x8, 0x8})
        /go/src/github.com/openshift/installer/pkg/asset/store/assetsfetcher.go:47 +0x165
main.newCreateCmd.runTargetCmd.func3({0x7fffbc31408f?, 0x5?})
        /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:301 +0x6a
main.newCreateCmd.runTargetCmd.func4(0xc000fdf600?, {0xc001199260?, 0x4?, 0x7c06e81?})
        /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:315 +0x102
github.com/spf13/cobra.(*Command).execute(0x23a324c0, {0xc001199220, 0x2, 0x2})
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:987 +0xaa3
github.com/spf13/cobra.(*Command).ExecuteC(0xc001005500)
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:1115 +0x3ff
github.com/spf13/cobra.(*Command).Execute(...)
        /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:1039
main.installerMain()
        /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:62 +0x385
main.main()
        /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:36 +0x11d 

Additional info: slack thread discussion https://redhat-internal.slack.com/archives/C01V1DP387R/p1713959395808119

1. Proposed title of this feature request
Collect number of resources in etcd

2. What is the nature and description of the request?
The number of resources is useful in several scenarios, like kube-apiserver high memory usage.

3. Why does the customer need this? (List the business requirements here)
The information will be useful for OpenShift Support. The number of resources is useful in several scenarios, like kube-apiserver high memory usage.

4. List any affected packages or components.
must-gather
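Not part of the request, just a hedged sketch of what such a collection could look like if done directly against etcd (must-gather would more likely shell out to etcdctl; the endpoint, TLS setup, and the /kubernetes.io/ key prefix below are assumptions):

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// In a real cluster this needs the etcd client certificates; omitted here.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Count keys per resource prefix without transferring the values.
	for _, prefix := range []string{"/kubernetes.io/secrets/", "/kubernetes.io/events/", "/kubernetes.io/pods/"} {
		resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix(), clientv3.WithCountOnly())
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> %d objects\n", prefix, resp.Count)
	}
}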

This feature conditionally creates a button within the VirtualizedTable component that allows clients to download the data within the table as comma-separated values (.csv). 

 

Both PRs are needed to test the feature.

The PRs are 

https://github.com/openshift/console/pull/14050

and 

https://github.com/openshift/monitoring-plugin/pull/133

 

The monitoring-plugin passes a string called 'csvData', which contains metrics data formatted as comma-separated values. The console then consumes 'csvData' in the 'VirtualizedTable' component. 'VirtualizedTable' renders the 'Export as CSV' button only if this 'csvData' property is present; without it, the 'Export as CSV' button will not render.

 

The console's CI/CD pipeline > tide requires that issues have a valid Jira reference, presumably in this (OpenShift Console) board. This ticket is a duplicate of

https://issues.redhat.com/browse/OU-431

 

 

Component Readiness has found a potential regression in the following test:

[sig-storage] [Serial] Volume metrics Ephemeral should create volume metrics with the correct BlockMode PVC ref [Suite:openshift/conformance/serial] [Suite:k8s]

Probability of significant regression: 100.00%

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&FeatureSet=default&Installer=ipi&Network=ovn&NetworkAccess=default&Platform=aws&Scheduler=default&SecurityMode=default&Suite=serial&Topology=ha&Upgrade=none&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-31%2000%3A00%3A00&capability=Other&columnGroupBy=Platform%2CArchitecture%2CNetwork&component=Storage&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20aws%20serial%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-08-12%2023%3A59%3A59&sampleRelease=4.17&sampleStartTime=2024-08-06%2000%3A00%3A00&testId=openshift-tests%3Acf25df8798e0307458a9892a76dd7a4a&testName=%5Bsig-storage%5D%20%5BSerial%5D%20Volume%20metrics%20Ephemeral%20should%20create%20volume%20metrics%20with%20the%20correct%20BlockMode%20PVC%20ref%20%5BSuite%3Aopenshift%2Fconformance%2Fserial%5D%20%5BSuite%3Ak8s%5D

Description of problem:

AWS api-int LB is either misconfigured or buggy: it allows connections while the apiserver is being shut down. The termination log has

>Request to %q (source IP %s, user agent %q) through a connection created very late in the graceful termination process (more than 80%% has passed), possibly a sign for a broken load balancer setup.

messages, and the in-cluster monitoring suite shows multiple one-second disruptions

Slack thread

For troubleshooting OSUS cases, the default must-gather doesn't collect OSUS information, and an inspect of the openshift-update-service namespace is missing several OSUS-related resources such as UpdateService, ImageSetConfiguration, and possibly more.

Create a specific must-gather image for OSUS (as there are for other operators/components [1]).

[1] https://access.redhat.com/solutions/5459251

User Story

The Cluster API provider Azure has a deployment manifest that deploys the Azure Service Operator from the mcr.microsoft.com/k8s/azureserviceoperator:v2.6.0 image.

We need to set up OpenShift builds of the operator and update the manifest generator to use the OpenShift image.

Background

Azure has split the API calls out of their provider so that it now uses the service operator. We now need to ship the service operator as part of the CAPI operator to make sure that we can support CAPZ.

Steps

  • Request for the ASO repo to be created and built in OpenShift
  • Set up release repo configuration and basic testing for ASO repo
  • Set up ART build for ASO repo
  • Fill out prodsec survey (https://issues.redhat.com/browse/PSSECDEV-7783)
  • Update the manifest generator in cluster-capi-operator to replace the image with our image stream.
  • Manifest generator should know how to filter parts of ASO so that we only ship the parts we care about for Azure
  • Create manifests for deploying the subset of required ASO that we need

Stakeholders

  • Cluster Infrastructure

Definition of Done

  • CAPZ deploys OpenShift version of ASO
  • Docs
  • Testing

As an OpenShift engineer, I want to keep the vSphere provider up to date with the most current version of CAPI so that we don't fall behind and cause potential future problems.

 

For disconnected clusters, we will need to move to use ImageDigestMirrorSet (IDMS) since ImageContentSourcePolicy (ICSP) is currently deprecated and will eventually be removed.

 

There are several scenarios:

  1. Both ICSP and IDMS exist, we will need to handle both and somehow merge them into IDMS as that is the path forward
  2. Only ICSP exists, we will need to warn users and move it to IDMS - will this be done by us or the users?
  3. Only IDMS exists, we need logic to handle this 

 


Currently we use the ICSP flag when using oc CLI commands. We need to use the IDMS flag instead:

https://github.com/openshift/oc/blob/34c69c72be5a0c71863965a5c6480c236b0f843e/pkg/cli/image/extract/extract.go#L190

To allow for easier injection of values and if/else switches, we should move the existing pod template in pod.yaml to a Go template (see the sketch after the acceptance criteria below).

AC:

  • pod.yaml should be able to contain common gotpl directives
  • targetconfigcontroller should render a go template with values
  • etcd pod should start and render as it was before
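A minimal sketch of the rendering step, using only the standard library text/template package; the value struct and the template contents are hypothetical placeholders for whatever targetconfigcontroller ends up defining:

package main

import (
	"os"
	"text/template"
)

// templateValues is a hypothetical value struct; the real field set would be
// whatever targetconfigcontroller decides to render.
type templateValues struct {
	Image        string
	EnableBackup bool
}

// podYAML stands in for the contents of pod.yaml once it carries gotpl directives.
const podYAML = `apiVersion: v1
kind: Pod
metadata:
  name: etcd
spec:
  containers:
  - name: etcd
    image: {{ .Image }}
{{- if .EnableBackup }}
  - name: backup
    image: {{ .Image }}
{{- end }}
`

func main() {
	tmpl := template.Must(template.New("pod.yaml").Parse(podYAML))
	if err := tmpl.Execute(os.Stdout, templateValues{
		Image:        "quay.io/example/etcd:latest",
		EnableBackup: true,
	}); err != nil {
		panic(err)
	}
}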

Description of problem:

Sippy complains about pathological events in ns/openshift-cluster-csi-drivers in vsphere-ovn-serial jobs. See this job as one example.

Jan noticed that the DaemonSet generation is 10-12, while in 4.17 it is 2. Why is our operator updating the DaemonSet so often?

I wrote a quick "one-liner" to generate json diffs from the vmware-vsphere-csi-driver-operator logs:

prev=''; grep 'DaemonSet "openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-node" changes' openshift-cluster-csi-drivers_vmware-vsphere-csi-driver-operator-5b79c58f6f-hpr6g_vmware-vsphere-csi-driver-operator.log | sed 's/^.*changes: //' | while read -r line; do diff <(echo $prev | jq .) <(echo $line | jq .); prev=$line; echo "####"; done 

It really seems to be only operator.openshift.io/spec-hash and operator.openshift.io/dep-* fields changing in the json diffs:

####
4,5c4,5
<       "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q==",
<       "operator.openshift.io/spec-hash": "fb274874404ad6706171c6774a369876ca54e037fcccc200c0ebf3019a600c36"
---
>       "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A==",
>       "operator.openshift.io/spec-hash": "27a1bab0c00ace8ac21d95a5fe9a089282e7b2b3ec042045951bd5e26ae01a09"
13c13
<           "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q=="
---
>           "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A=="
####
4,5c4,5
<       "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A==",
<       "operator.openshift.io/spec-hash": "27a1bab0c00ace8ac21d95a5fe9a089282e7b2b3ec042045951bd5e26ae01a09"
---
>       "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q==",
>       "operator.openshift.io/spec-hash": "fb274874404ad6706171c6774a369876ca54e037fcccc200c0ebf3019a600c36"
13c13
<           "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A=="
---
>           "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q=="
#### 

The deployment is also changing in the same way. We need to find what is causing the spec-hash and dep-* fields to change and avoid the unnecessary churn that causes new daemonset / deployment rollouts.

 

Version-Release number of selected component (if applicable):

4.18.0

How reproducible:

~20% failure rate in 4.18 vsphere-ovn-serial jobs

Steps to Reproduce:

    

Actual results:

operator rolls out unnecessary daemonset / deployment changes

Expected results:

don't roll out changes unless there is a spec change

Additional info:

    


What

Add a test that grants an unauthorized group access to images via RBAC.

Why

  • We broke this feature with changes to RBAC, and we want to protect it.

Request for sending data via telemetry for OpenTelemetry operator

The goal is to collect metrics about some features used by the OpenTelemetry operator, because this will be useful for improving the product: knowing which features customers use, we can better focus our efforts on improving those features.

opentelemetry_collector_receivers

opentelemetry_collector_receivers is a gauge that represents the number of OpenTelemetry collector instances that use a certain receiver.
Labels

  • type, possible values are:
  • jaegerreceiver
  • hostmetricsreceiver
  • opencensusreceiver
  • prometheusreceiver
  • zipkinreceiver
  • kafkareceiver
  • filelogreceiver
  • journaldreceiver
  • k8seventsreceiver
  • kubeletstatsreceiver
  • k8sclusterreceiver
  • k8sobjectsreceiver

Cardinality: 12

opentelemetry_collector_exporters

opentelemetry_collector_exporters is a gauge that represents the number of OpenTelemetry collector instances that use a certain exporter.
Labels

  • type - possible values are:
  • debugexporter
  • loggingexporter
  • otlpexporter
  • otlphttpexporter
  • prometheusexporter
  • lokiexporter
  • kafkaexporter
  • awscloudwatchlogsexporter
  • loadbalancingexporter

Cardinality: 9

opentelemetry_collector_processors

opentelemetry_collector_processors is a gauge that represents the number of OpenTelemetry collector instances that use a certain processor.
Labels

  • type - possible values are:
  • batchprocessor
  • memorylimiterprocessor
  • attributesprocessor
  • resourceprocessor
  • spanprocessor
  • k8sattributesprocessor
  • resourcedetectionprocessor
  • filterprocessor
  • routingprocessor
  • cumulativetodeltaprocessor
  • groupbyattrsprocessor

Cardinality: 11

opentelemetry_collector_extensions

opentelemetry_collector_extensions is a gauge that represents the number of OpenTelemetry collector instances that use a certain extension.
Labels

  • type - possible values are:
  • zpagesextension
  • ballastextension
  • memorylimiterextension
  • jaegerremotesampling
  • healthcheckextension
  • pprofextension
  • oauth2clientauthextension
  • oidcauthextension
  • bearertokenauthextension
  • filestorage

Cardinality: 10

opentelemetry_collector_connectors

opentelemetry_collector_connectors is a gauge that represents the number of OpenTelemetry collector instances that use a certain connector.
Labels

  • type - possible values are:
  • spanmetricsconnector
  • forwardconnector

Cardinality: 2

opentelemetry_collector_info

opentelemetry_collector_info is a gauge that represents the number of OpenTelemetry collector instances that use a certain deployment type.
Labels

  • type - possible values are:
  • deployment
  • daemonset
  • sidecar
  • statefulset

Cardinality: 4

This test failed 3 times in the last week with the following error:

KubeAPIErrorBudgetBurn was at or above info for at least 2m28s on platformidentification.JobType{Release} (maxAllowed=0s): pending for 1h33m52s, firing for 2m28s: Sep 16 21:20:56.839 - 148s E namespace/openshift-kube-apiserver alert/KubeAPIErrorBudgetBurn alertstate/firing severity/critical ALERTS{alertname="KubeAPIErrorBudgetBurn", alertstate="firing", long="6h", namespace="openshift-kube-apiserver", prometheus="openshift-monitoring/k8s", severity="critical", short="30m"}

 

It didn't fail a single time in the previous month on 4.17, nor in the month before we shipped 4.16, so I'm proposing this as a blocker to be investigated. Below is the boilerplate Component Readiness text:

 


Component Readiness has found a potential regression in the following test:

[bz-kube-apiserver][invariant] alert/KubeAPIErrorBudgetBurn should not be at or above info

Probability of significant regression: 99.04%

Sample (being evaluated) Release: 4.17
Start Time: 2024-09-10T00:00:00Z
End Time: 2024-09-17T23:59:59Z
Success Rate: 85.71%
Successes: 18
Failures: 3
Flakes: 0

Base (historical) Release: 4.16
Start Time: 2024-05-28T00:00:00Z
End Time: 2024-06-27T23:59:59Z
Success Rate: 100.00%
Successes: 74
Failures: 0
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&Architecture=amd64&FeatureSet=default&FeatureSet=default&Installer=ipi&Installer=ipi&Network=ovn&Network=ovn&NetworkAccess=default&Platform=azure&Platform=azure&Scheduler=default&SecurityMode=default&Suite=serial&Suite=serial&Topology=ha&Topology=ha&Upgrade=none&Upgrade=none&baseEndTime=2024-06-27%2023%3A59%3A59&baseRelease=4.16&baseStartTime=2024-05-28%2000%3A00%3A00&capability=Alerts&columnGroupBy=Architecture%2CNetwork%2CPlatform&component=kube-apiserver&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20azure%20serial%20ha%20none&ignoreDisruption=1&ignoreMissing=0&includeVariant=Architecture%3Aamd64&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&pity=5&sampleEndTime=2024-09-17%2023%3A59%3A59&samplePRNumber=&samplePROrg=&samplePRRepo=&sampleRelease=4.17&sampleStartTime=2024-09-10%2000%3A00%3A00&testId=openshift-tests%3Ad6b41cee7afca1c2a0b52f9e6975425f&testName=%5Bbz-kube-apiserver%5D%5Binvariant%5D%20alert%2FKubeAPIErrorBudgetBurn%20should%20not%20be%20at%20or%20above%20info&view=

Some of the E2E tests could be considered read-only, such as looping until a PromQL expression is true.

Additionally, some tests are non-disruptive: all their operations are performed within a temporary namespace without impacting the monitoring components' statuses.

We can t.Parallel() them to save some minutes.
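As a rough illustration (the promQLHolds helper is a hypothetical stand-in for whatever client the suite uses), a read-only check can opt into parallel execution like this:

package e2e

import (
	"context"
	"testing"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// promQLHolds is a hypothetical stand-in for the client the suite uses to
// evaluate a PromQL expression; it only exists to keep this sketch compiling.
var promQLHolds = func(ctx context.Context, query string) (bool, error) { return true, nil }

// TestReadOnlyInvariant shows the shape of a read-only, non-disruptive check
// that can safely opt into parallel execution.
func TestReadOnlyInvariant(t *testing.T) {
	t.Parallel() // read-only: no monitoring component state is mutated

	err := wait.PollUntilContextTimeout(context.Background(), 10*time.Second, 5*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			ok, err := promQLHolds(ctx, `min(up{job="prometheus-k8s"}) == 1`)
			if err != nil {
				return false, nil // transient query error: keep polling
			}
			return ok, nil
		})
	if err != nil {
		t.Fatal(err)
	}
}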

Also, we can:

  • Isolate specific tests to enable parallel execution
  • Enhance the resilience of some tests and fix those prone to errors
  • Fix some tests that were running wrong checks
  • Make some of the tests idempotent so they can be easily debugged and run locally

Description of problem:

    certrotation controller is using applySecret/applyConfigmap functions from library-go to update secret/configmap. This controller has several replicas running in parallel, so it may overwrite changes applied by a different replica, which leads to unexpected signer updates and corrupted CA bundles.

applySecret/applyConfigmap does an initial Get and then calls Update, which overwrites the changes done to a copy received from the informer.
Instead, it should issue .Update calls directly using a copy received from the informer, so that etcd rejects the change if it is made after the resourceVersion was updated in parallel.
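A minimal sketch of the suggested pattern, not the actual library-go code: mutate a DeepCopy of the informer's object and call Update directly, letting the API server reject stale writes with a conflict.

package example

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	corev1listers "k8s.io/client-go/listers/core/v1"
)

// rotateSignerSecret mutates a copy of the informer's object and updates it
// directly, so a write racing with another replica fails with a conflict
// instead of silently overwriting the other replica's signer.
func rotateSignerSecret(ctx context.Context, client kubernetes.Interface, lister corev1listers.SecretLister, ns, name string, cert, key []byte) error {
	cached, err := lister.Secrets(ns).Get(name)
	if err != nil {
		return err
	}
	secret := cached.DeepCopy() // never mutate the informer's cached object
	if secret.Data == nil {
		secret.Data = map[string][]byte{}
	}
	secret.Data["tls.crt"] = cert
	secret.Data["tls.key"] = key

	// The copy still carries the informer's resourceVersion, so the API server
	// rejects this Update if another replica already rotated the secret.
	_, err = client.CoreV1().Secrets(ns).Update(ctx, secret, metav1.UpdateOptions{})
	if apierrors.IsConflict(err) {
		return nil // a newer revision exists; let the next sync re-evaluate it
	}
	return err
}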

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

On "Networking"->"NetworkPolicies" page, when "MultiNetworkPolicies disabled", on "NetworkPolicies" tab, select a project, eg "default" from dropdown list. Then click tab "MultiNetworkPolicies", and click back to "NetworkPolicies" tab, the project dropdown is set to "All Projects" automatically
    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-09-09-150616
4.17.0-0.nightly-2024-09-09-120947
    
    

How reproducible:

Always
    

Steps to Reproduce:

    1.On "Networking"->"NetworkPolicies" page, when "MultiNetworkPolicies disabled", on "NetworkPolicies" tab, select a project, eg "default" from dropdown list. Then click tab "MultiNetworkPolicies", and click back to "NetworkPolicies" tab
    2.
    3.
    

Actual results:

1. The project dropdown is set to "All Projects" automatically
    

Expected results:

1. The project dropdown should be set to "default" as originally selected.
    

Additional info:


    

Request for sending data via telemetry

The goal is to collect metrics about Cluster Logging Operator 6.y, so that we can track usage of features in the new release version.

openshift_logging:log_forwarder_pipelines:sum

"openshift_logging:log_forwarder_pipelines:sum" represents the number of logging pipelines managed by CLO per namespace.

Labels

  • "resource_namespace", the namespace in which the forwarder is deployed. Most customers only use "openshift-logging".
  • "version", the version of the Cluster Logging Operator

The cardinality of the metric is "one per namespace", which for most clusters will be one.

openshift_logging:log_forwarder_pipelines:count

"openshift_logging:log_forwarder_pipelines:count" represents the number of deployed ClusterLogForwarders per namespace.

Labels

  • "resource_namespace", the namespace in which the forwarder is deployed. Most customers only use "openshift-logging".
  • "version", the version of the Cluster Logging Operator

The cardinality of the metric is "one per namespace", which for most clusters will be one.

openshift_logging:log_forwarder_input_type:sum

"openshift_logging:log_forwarder_input_type:sum" represents the number of inputs managed by CLO per namespace.

Labels

  • "resource_namespace", the namespace in which the forwarder is deployed. Most customers only use "openshift-logging".
  • "version", the version of the Cluster Logging Operator
  • "input", the type of input used. There are four input types.

The cardinality of the metric is "one per namespace and input type". I expect this to be two for most customers.

openshift_logging:log_forwarder_output_type:sum

"openshift_logging:log_forwarder_output_type:sum" represents the number of outputs managed by CLO per namespace.

Labels

  • "resource_namespace", the namespace in which the forwarder is deployed. Most customers only use "openshift-logging".
  • "version", the version of the Cluster Logging Operator
  • "output", the type of output used. There are eleven output types.

The cardinality of the metric is "one per namespace and output type". I expect most customers to use one or two output types.

openshift_logging:vector_component_received_bytes_total:rate5m

"openshift_logging:vector_component_received_bytes_total:rate5m" represents current total log rate for a cluster for log collectors managed by CLO.

Labels

  • "namespace", the namespace in which the forwarder is deployed. Most customers only use "openshift-logging".

The cardinality of the metric is "one per namespace", which for most clusters will be one.

Links

Component exposing the metric: https://github.com/openshift/cluster-logging-operator/blob/master/internal/metrics/telemetry/telemetry.go#L25-L47

The recording rules for these metrics are currently reviewed in this PR: https://github.com/openshift/cluster-logging-operator/pull/2823

User Story

As a developer I want to have automated e2e testing on PRs so that I can make sure the changes for cluster-api-provider-ibmcloud are thoroughly tested.

Steps

  • Add e2e tests script to the CAPIBM repo
  • Add CI presubmit job configuration to the release repo
  • Make sure everything is wired up correctly

Stakeholders

  • Cluster Infra team
  • IBM PowerVS team

Background 

The monitoring-plugin is still using Patternfly v4; it needs to be upgraded to Patternfly v5. This major version release deprecates components in the monitoring-plugin. These components will need to be replaced/removed to accommodate the version update. 

We need to remove the deprecated components from the monitoring plugin, extending the work from CONSOLE-4124

Work to be done: 

  • upgrade monitoring-plugin > package.json > Patternfly v5
  • Remove/replace any deprecated components after upgrading to Patternfly v5. 

Outcome 

  • The monitoring-plugin > package.json will be upgraded to use Patternfly v5
  • Any deprecated components from Patternfly v4 will be removed or replaced by similar Patternfly v5 components

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

Delete the openshift-monitoring/monitoring-plugin-cert secret, SCO will re-create a new one with different content

Actual results:

- monitoring-plugin is still using the old cert content.
- If the cluster doesn’t show much activity, the hash may take time to be updated.

Expected results:

CMO should detect that exact change and run a sync to  recompute and set the new hash.

Additional info:

- We shouldn't rely on another change to trigger the sync loop.
- CMO should maybe watch that secret? (its name isn't known in advance). 

Whenever we update dependencies in the main module or the api module, compilation breaks for developers that are using a go workspace locally. We can ensure that the dependencies are kept in sync by running a 'go work sync' in a module where the hypershift repo is a symlinked child.

Along with disruption monitoring via external endpoint we should add in-cluster monitors which run the same checks over:

  • service network (kubernetes.default.svc)
  • api-int endpoint (via hostnetwork)
  • localhosts (on masters only)

These tests should be implemented as deployments with anti-affinity landing on different nodes. Deployments are selected so that the nodes could properly be drained. These deployments are writing to host disk and on restart the pod will pick up existing data. When a special configmap is created the pod will stop collecting disruption data.

External part of the test will create deployments (and necessary RBAC objects) when test is started, create stop configmap when it ends and collect data from the nodes. The test will expose them on intervals chart, so that the data could be used to find the source of disruption

Description of problem:

[vmware-vsphere-csi-driver-operator] driver controller/node/webhook update events repeat pathologically    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-11-03-161006    

How reproducible:

Always    

Steps to Reproduce:

    1. Install an Openshift cluster on vSphere of version 4.17 nightly.
    2. Upgrade the cluster to 4.18 nightly.
    3. Check the driver controller/node/webhook update events should not repeat pathologically.     

CI failure record -> https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-vsphere-ovn-upgrade/1854191939318976512  

Actual results:

 In step 3: the driver controller/node/webhook update events repeat pathologically   

Expected results:

 In step 3: the driver controller/node/webhook update events should not repeat pathologically    

Additional info:

    

Description of problem:

   Setting userTags in the install-config file for AWS does not support all characters that AWS considers valid, as per [1].
platform:
  aws:
    region: us-east-1
    propagateUserTags: true
    userTags:
      key1: "Test Space" 
      key2: value2

ERROR failed to fetch Metadata: failed to load asset "Install Config": failed to create install config: invalid "install-config.yaml" file: platform.aws.userTags[key1]: Invalid value: "Test Space": value contains invalid characters

The documentation at: https://docs.openshift.com/container-platform/4.16/installing/installing_aws/installation-config-parameters-aws.html#installation-configuration-parameters-optional-aws_installation-config-parameters-aws does not refer to any restrictions.

However:

Validation is done here:

https://github.com/openshift/installer/blob/74ee94f2a34555a41107a5a7da627ab5de0c7373/pkg/types/aws/validation/platform.go#L106

Which in turn refers to a regex here:

https://github.com/openshift/installer/blob/74ee94f2a34555a41107a5a7da627ab5de0c7373/pkg/types/aws/validation/platform.go#L17

Which allows these characters: `^[0-9A-Za-z_.:/=+-@]*$`

[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html#tag-restrictions).
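The rejection can be reproduced with the regex cited above; this is only an illustration, not the installer's full validation code:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Character allow-list currently used by the installer for AWS userTags
	// (pkg/types/aws/validation/platform.go).
	allowed := regexp.MustCompile(`^[0-9A-Za-z_.:/=+-@]*$`)

	for _, v := range []string{"value2", "Test Space"} {
		fmt.Printf("%q allowed=%v\n", v, allowed.MatchString(v))
	}
	// "value2" passes, "Test Space" is rejected because the space character is
	// outside the allow-list, even though AWS tag values may contain spaces.
}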

Version-Release number of selected component (if applicable):

    

How reproducible:

    100 %

Steps to Reproduce:

    1. Create a install-config with a value usertags as mention in description.
    2. Run the installer.


   
    

Actual results:

Command failed with below error:

ERROR failed to fetch Metadata: failed to load asset "Install Config": failed to create install config: invalid "install-config.yaml" file: platform.aws.userTags[key1]: Invalid value: "Test Space": value contains invalid characters

 

Expected results:

    Installer should run successfully.

Additional info:

    When a userTags value contains a space, the installer fails to validate the install-config.

The ovnkube-node pods are crash looping with:

1010 23:12:06.421980 6605 ovnkube.go:137] failed to run ovnkube: [failed to initialize libovsdb NB client: failed to connect to unix:/var/run/ovn/ovnnb_db.sock: database OVN_Northbound validation error (8): database model contains a model for table Sample that does not exist in schema. database model contains a model for table Sampling_App that does not exist in schema. Mapper Error. Object type nbdb.ACL contains field SampleEst (*string) ovs tag sample_est: Column does not exist in schema. Mapper Error. Object type nbdb.NAT contains field Match (string) ovs tag match: Column does not exist in schema. database model contains a model for table Sample_Collector that does not exist in schema. Mapper Error. Object type nbdb.LogicalRouterPort contains field DhcpRelay (*string) ovs tag dhcp_relay: Column does not exist in schema. database model contains a model for table DHCP_Relay that does not exist in schema. database model contains a client index for table ACL that does not exist in schema, failed to start node network controller: error in syncing cache for *v1.Pod informer] 

The OVN builds for cs9 are old and have not been built with the latest changes. The team is working to build the RPMs, and once we have them, we need builds of ovn-kubernetes with the latest OVN RPMs to fix this issue.

Description of problem:

There are lots of customers that deploy clusters that are not directly connected to the Internet, so they use a corporate proxy. Customers have been unable to correctly understand how to configure a cluster-wide proxy for a new HostedCluster, and they are running into issues deploying the HostedCluster.

For example, given the following configuration:

--
apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  creationTimestamp: null
  name: cluster-hcp
  namespace: clusters
spec:
  configuration:
    proxy:
      httpProxy: http://proxy.testlab.local:80
      httpsProxy: http://proxy.testlab.local:80
      noProxy: testlab.local,192.168.0.0/16
--

A customer would normally add the MachineNetwork CIDR and the local domain to the noProxy variable. However, this will cause a problem in OpenShift Virtualization: the Hosted Control Plane KAS won't be able to contact the nodes' kubelet, since pods will try to reach tcp/10250 through the proxy, causing an error. So in this scenario it is necessary to add the Hub cluster ClusterNetwork CIDR to the noProxy variable:

--
noProxy: testlab.local,192.168.0.0/16,10.128.0.0/14
--

However, I was unable to find this information in our documentation. Also, there is a known issue that is explained in the following kcs: https://access.redhat.com/solutions/7068827

The problem is that the Hosted Cluster deploys the control-plane-operator binary instead of the haproxy binary in the kube-apiserver-proxy pods under kube-system in the HostedCluster. The KCS explains that the problem is fixed, but it is not clear to the customer which subnetwork should be added to noProxy to trigger the logic that deploys the haproxy image, so that the proxy is not used to expose the internal Kubernetes endpoint (172.20.0.1).

The code seems to compare whether the HostedCluster ClusterNetwork (10.132.0.0/14), the ServiceNetwork (172.31.0.0/16), or the internal Kubernetes address (172.20.0.1) is listed in the noProxy variable in order to honor the noProxy setting and deploy the haproxy images. This led us, through trial and error, to find the correct way to honor noProxy and allow the HostedCluster to work correctly: connecting from the kube-apiserver-proxy pods to the hosted KAS and from the hosted KAS to the kubelet while bypassing the cluster-wide proxy. A minimal sketch of this comparison follows.
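A minimal illustration of that comparison (not HyperShift's actual code); the CIDRs and addresses are the ones quoted in this description:

package main

import (
	"fmt"
	"strings"
)

// bypassesProxy mirrors, in simplified form, the comparison described above:
// the haproxy-based data path is used only when one of the given values is
// already present in the noProxy list.
func bypassesProxy(noProxy string, candidates ...string) bool {
	want := map[string]bool{}
	for _, c := range candidates {
		want[c] = true
	}
	for _, entry := range strings.Split(noProxy, ",") {
		if want[strings.TrimSpace(entry)] {
			return true
		}
	}
	return false
}

func main() {
	noProxy := "testlab.local,192.168.0.0/16,10.132.0.0/14"
	fmt.Println(bypassesProxy(noProxy, "10.132.0.0/14", "172.31.0.0/16", "172.20.0.1")) // true
}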

The questions are:

1.- Is it possible to add the information in our documentation about what is the correct way to configure a HostedCluster using noProxy variables?
2.- What is the correct subnet that needs to be added to the noProxy variable so the haproxy images are deployed instead of control-plane operator and allow kube-apiserver-proxy pods to bypass the cluster-wide proxy?






      

Version-Release number of selected component (if applicable):

    4.14.z, 4.15.z, 4.16.z

How reproducible:

Deploy a HostedCluster using noProxy variables    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    Components from the Hosted Cluster are still using the proxy, not honoring the noProxy variables that were set.

Expected results:

    Hosted Cluster should be able to deploy correctly.

Additional info:

    

 There have been several instances where assisted would start downloading ClusterImageSet images and it could cause issues like

  • MGMT-18839 - there are so many ClusterImageSets and it takes so long to go through all of them that its client times out.
  • MGMT-17403 - osImage is ignored in the infraEnv
  • MGMT-19023 - image selected isn't the one specified on the infra env (disconnected environment)
  • Slack thread - InfraEnv chooses image for Ironic Agent based on hub which might not match the desired architecture - chooses based on CVO of Hub HERE - should add multiarch or allow choosing the arch that matches
  • MGMT-19047 - ACM created image failed - after a successful installation, 15 days later after assisted was restarted, the release image is no longer in the cache and fails to be pulled/found/recreated
  • MGMT-19112 - Downloading images takes a long time

 

Possible solution ideas:

  • Download the images asynchronously and do this when assisted first starts up
  • After the image downloads successfully, mark the ClusterImageSet as already downloaded so that assisted doesn't retry it

 

 As mentioned in the previous review when this was added (https://github.com/openshift/assisted-service/pull/4650/files#r1044872735), the "late binding usecase would be broken for OKD", so to prevent this we should detect whether the infra-env is late-bound and not check for the image if it is.

 The only time a requested ClusterImageSet is cached is when a Cluster is created.

 

This leads to problems such as

  1. If a cluster is already created, but assisted happens to restart, then the cached image is lost and never gets recached
  2. The cluster requests a specific image, but that might not be the one chosen because assisted may cache all the other cluster image set images 

Installs of recent nightly/stable 4.16 SCOS releases are branded as OpenShift instead of OKD.

Testing on the following versions shows incorrect branding on the oauth URL:

4.16.0-0.okd-scos-2024-08-15-225023
4.16.0-0.okd-scos-2024-08-20-110455
4.16.0-0.okd-scos-2024-08-21-155613

Description of problem:

Bootstrap process failed due to API_URL and API_INT_URL are not resolvable:

Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Consumed 1min 457ms CPU time.
Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 1.
Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: Stopped Bootstrap a Kubernetes cluster.
Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Consumed 1min 457ms CPU time.
Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster.
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Check if API and API-Int URLs are resolvable during bootstrap
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Checking if api.yunjiang-dn16d.qe.gcp.devcluster.openshift.com of type API_URL is resolvable
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting stage resolve-api-url
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Unable to resolve API_URL api.yunjiang-dn16d.qe.gcp.devcluster.openshift.com
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Checking if api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com of type API_INT_URL is resolvable
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting stage resolve-api-int-url
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Unable to resolve API_INT_URL api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8905]: https://localhost:2379 is healthy: successfully committed proposal: took = 7.880477ms
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting cluster-bootstrap...
Feb 06 06:41:59 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: Starting temporary bootstrap control plane...
Feb 06 06:41:59 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: Waiting up to 20m0s for the Kubernetes API
Feb 06 06:42:00 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: API is up

install logs:
...
time="2024-02-06T06:54:28Z" level=debug msg="Unable to connect to the server: dial tcp: lookup api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com on 169.254.169.254:53: no such host"
time="2024-02-06T06:54:28Z" level=debug msg="Log bundle written to /var/home/core/log-bundle-20240206065419.tar.gz"
time="2024-02-06T06:54:29Z" level=error msg="Bootstrap failed to complete: timed out waiting for the condition"
time="2024-02-06T06:54:29Z" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
...


    

Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-02-05-184957,openshift/machine-config-operator#4165

    

How reproducible:


Always.
    

Steps to Reproduce:

    1. Enable custom DNS on gcp: platform.gcp.userProvisionedDNS:Enabled and featureSet:TechPreviewNoUpgrade
    2. Create cluster
    3.
    

Actual results:

Failed to complete bootstrap process.
    

Expected results:

See description.

    

Additional info:

I believe 4.15 is affected as well once https://github.com/openshift/machine-config-operator/pull/4165 is backported to 4.15; currently it fails at an earlier phase, see https://issues.redhat.com/browse/OCPBUGS-28969

This duplicate issue was created because openshift/console github bots require a valid CONSOLE Jira to be associated with all PRs. 

Description

Migrate Developer View > Observe > Silences Tab code from openshift/console to openshift/monitoring-plugin. This is part of the ongoing effort to consolidate code between the Administrative and Developer Views of the Observe section. 

Related Jira Issue 

https://issues.redhat.com/browse/OU-257

Related PRs 

 

User Story:

As a HyperShift service provider, I want to be able to:

  • Disable ignition-server (non-RHCOS use case)

so that I can achieve

  • Reducing managed control plane footprint
  • Operating HyperShift control planes at scale

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Toggle to disable ignition-server deployment
  • Implementation options:
    • HostedCluster annotation (static rendering)
    • HostedCluster spec/API (static rendering)
    • Reconcile based on NodePool existence (dynamic rendering)

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

The cluster-baremetal-operator sets up a number of watches for resources using Owns(), which have no effect because the Provisioning CR does not (and should not) own any resources of the given types, or using EnqueueRequestForObject{}, which similarly has no effect because the resource name and namespace differ from those of the Provisioning CR.

The commit https://github.com/openshift/cluster-baremetal-operator/pull/351/commits/d4e709bbfbae6d316f2e76bec18b0e10a45ac93e should be reverted as it adds considerable complexity to no effect whatsoever.

The correct way to trigger a reconcile of the Provisioning CR is using EnqueueRequestsFromMapFunc(watchOCPConfigPullSecret) (note that the map function watchOCPConfigPullSecret() is poorly named: it always returns the name/namespace of the Provisioning CR singleton, regardless of the input, which is what we want). We should replace the ClusterOperator, Proxy, and Machine watches with ones of this form, as sketched below.

See https://github.com/openshift/cluster-baremetal-operator/pull/423/files#r1628777876 and https://github.com/openshift/cluster-baremetal-operator/pull/351/commits/d4e709bbfbae6d316f2e76bec18b0e10a45ac93e#r1628776168 for commentary.
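For reference, a sketch of a watch of that form, assuming controller-runtime v0.15+ builder signatures; corev1 types stand in for the real Provisioning/ClusterOperator/Proxy/Machine types so the snippet stays self-contained, and the singleton name is a placeholder for whatever watchOCPConfigPullSecret returns.

package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// provisioningSingleton is a placeholder for the name/namespace that
// watchOCPConfigPullSecret already returns.
var provisioningSingleton = types.NamespacedName{Name: "example-provisioning"}

type reconciler struct{ client.Client }

func (r *reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

func setup(mgr ctrl.Manager) error {
	// Every event on a watched resource enqueues exactly one reconcile of the
	// singleton, regardless of which object changed.
	enqueueProvisioning := handler.EnqueueRequestsFromMapFunc(
		func(ctx context.Context, _ client.Object) []reconcile.Request {
			return []reconcile.Request{{NamespacedName: provisioningSingleton}}
		})

	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}).                       // stand-in for the Provisioning type
		Watches(&corev1.Secret{}, enqueueProvisioning). // stand-in for the ClusterOperator/Proxy/Machine watches
		Complete(&reconciler{Client: mgr.GetClient()})
}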

 

Description of problem:

    This bug is filed as a result of https://access.redhat.com/support/cases/#/case/03977446.
Although both nodes' topologies are equivalent, PPC reported a false negative:

  Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU corres differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]

Version-Release number of selected component (if applicable):

    4.14

How reproducible:

    always

Steps to Reproduce:

    1.TBD
    2.
    3.
    

Actual results:

    Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU corres differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]

Expected results:

    The topologies match; the PPC should work fine.

Additional info:

    

Description of problem:

The tests below fail on an ipv6primary dualstack cluster because the router deployed by the tests is not prepared for dualstack:

[sig-network][Feature:Router][apigroup:image.openshift.io] The HAProxy router should serve a route that points to two services and respect weights [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

[sig-network][Feature:Router][apigroup:operator.openshift.io] The HAProxy router should respond with 503 to unrecognized hosts [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

[sig-network][Feature:Router][apigroup:operator.openshift.io] The HAProxy router should serve routes that were created from an ingress [apigroup:route.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

[sig-network][Feature:Router][apigroup:route.openshift.io][apigroup:operator.openshift.io] The HAProxy router should support reencrypt to services backed by a serving certificate automatically [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

[sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router should override the route host for overridden domains with a custom value [apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

[sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router should override the route host with a custom value [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

[sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router should run even if it has no access to update status [apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

[sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router should serve the correct routes when scoped to a single namespace and label set [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

[sig-network][Feature:Router][apigroup:route.openshift.io] when FIPS is disabled the HAProxy router should serve routes when configured with a 1024-bit RSA key [Feature:Networking-IPv4] [Suite:openshift/conformance/parallel]

[sig-network][Feature:Router] The HAProxy router should expose prometheus metrics for a route [apigroup:route.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

That is confirmed by accessing to the router pod and checking the connectivity locally:

sh-4.4$  curl -k -s -m 5 -o /dev/null -w '%{http_code}\n' --header 'Host: FIRST.example.com' "http://127.0.0.1/Letter"                      
200
sh-4.4$ echo $?
0
   
sh-4.4$  curl -k -s -m 5 -o /dev/null -w '%{http_code}\n' --header 'Host: FIRST.example.com' "http://fd01:0:0:5::551/Letter" 
000
sh-4.4$ echo $?
3             
sh-4.4$  curl -k -s -m 5 -o /dev/null -w '%{http_code}\n' --header 'Host: FIRST.example.com' "http://[fd01:0:0:5::551]/Letter"
000
sh-4.4$ echo $?
7    

The default router deployed in the cluster supports dualstack. Hence it's possible, and required, to update the router image configuration used in the tests so it can answer on both IPv4 and IPv6.

Version-Release number of selected component (if applicable): https://github.com/openshift/origin/tree/release-4.15/test/extended/router/
How reproducible: Always.
Steps to Reproduce: Run the tests in ipv6primary dualstack cluster.
Actual results: Tests failing as below:

    <*errors.errorString | 0xc001eec080>:
    last response from server was not 200:
{
    s: "last response from server was not 200:\n",
}
occurred
Ginkgo exit error 1: exit with code 1
    

Expected results: Test passing

Looking at the logs for ironic-python-agent in a preprovisioning image, we get each log message twice - once directly from the agent process and once from podman:

Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.834 1 DEBUG ironic_python_agent.ironic_api_client [-] Heartbeat: announcing callback URL https://10.9.53.20:9999, API version is 1.68 heartbeat /usr/lib/python3.9/site-packages/ironic_python_agent/ironic_api_client.p
y:186
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.834 1 DEBUG ironic_python_agent.ironic_api_client [-] Heartbeat: announcing callback URL https://10.9.53.20:9999, API version is 1.68 heartbeat /usr/lib/python3.9/site-packages/ironic_python_agent/ironic_api_client.py:186
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent [-] error sending heartbeat to ['https://10.9.49.125:6385']: ironic_python_agent.errors.HeartbeatError: Error heartbeating to agent API: Error 404: Node 6f7546a2-f49e-4d8d-88f6-a462d53
868b6 could not be found.
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent Traceback (most recent call last):
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent   File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 148, in do_heartbeat
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent     self.api.heartbeat(
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent   File "/usr/lib/python3.9/site-packages/ironic_python_agent/ironic_api_client.py", line 200, in heartbeat
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent     raise errors.HeartbeatError(error)
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent ironic_python_agent.errors.HeartbeatError: Error heartbeating to agent API: Error 404: Node 6f7546a2-f49e-4d8d-88f6-a462d53868b6 could not be found.
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent 
Oct 30 14:23:45 localhost.localdomain podman[3035]: 2024-10-30 14:23:44.867 1 INFO ironic_python_agent.agent [-] sleeping before next heartbeat, interval: 5.029721378959369
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent [-] error sending heartbeat to ['https://10.9.49.125:6385']: ironic_python_agent.errors.HeartbeatError: Error heartbeating to agent API: Error 404: Node 6f7546a2-f49e-4d8d-88f6-a462d53868b6 could not be found.
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent Traceback (most recent call last):
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent   File "/usr/lib/python3.9/site-packages/ironic_python_agent/agent.py", line 148, in do_heartbeat
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent     self.api.heartbeat(
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent   File "/usr/lib/python3.9/site-packages/ironic_python_agent/ironic_api_client.py", line 200, in heartbeat
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent     raise errors.HeartbeatError(error)
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent ironic_python_agent.errors.HeartbeatError: Error heartbeating to agent API: Error 404: Node 6f7546a2-f49e-4d8d-88f6-a462d53868b6 could not be found.
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.866 1 ERROR ironic_python_agent.agent 
Oct 30 14:23:44 localhost.localdomain ironic-agent[3071]: 2024-10-30 14:23:44.867 1 INFO ironic_python_agent.agent [-] sleeping before next heartbeat, interval: 5.029721378959369

This is confusing and unnecessary, especially as the two sets of logs can be interleaved (note also the non-monotonic timestamps in the third column).

The log above actually comes from a ZTP deployment (the one in OCPBUGS-44026), but the IPA configuration even for that ultimately comes from the image-customization-controller.

Currently there is no log driver flag passed to podman so we get the default, which is journald. We have the use_stderr option set in the IPA config so logs get written to stderr, which podman will send to journald. We are also running the podman pod in the foreground, which I think will cause it to also pass the stderr to systemd, which also sends it to the journal.

I believe another side-effect of this misconfiguration is that one lot of logs show up red in journalctl and the other don't. Ideally we would have colour-coded logs by severity. This can be turned on for the stderr logging by setting log_color in the IPA config, but it might be better to enable use-journal instead of use-stderr so we get native logging to journald.

Duplication of Issue from OU Jira board 

Duplicate of https://issues.redhat.com/browse/OU-259

The openshift/console CI needs a valid issue on the OpenShift Console Jira board. 

Overview 

This PR aims to consolidate code for the Developer perspective > Observe > Dashboard page. We will remove the code that renders this page from openshift/console. The page will now be rendered by openshift/monitoring-plugin through this PR: openshift/monitoring-plugin#167.

Testing 

Must be tested with PR: openshift/monitoring-plugin#167

This PR #14192 removes the Developer perspective > Observe > Dashboard page
This PR openshift/monitoring-plugin#167 adds the Developer perspective > Observe > Dashboard page

Expected Results: All behaviors should be the same as before the migration.

 

Request for sending data via telemetry for tempo operator.

The goal is to collect metrics about some features used by the Tempo operator, because this will be useful for improving the product: knowing which features customers use, we can better focus our efforts on improving those features.

tempo_operator_tempostack_multi_tenancy

tempo_operator_tempostack_multi_tenancy is a gauge that represents the number of TempoStack instances that use multi-tenancy.
Labels

  • type - possible values are: enabled/disabled

tempo_operator_tempostack_managed

tempo_operator_tempostack_managed is a gauge that represents the number of managed/unmanaged TempoStack instances.
Labels

  • state - possible values are: Managed/Unmanaged

tempo_operator_tempostack_jaeger_ui

tempo_operator_tempostack_jaeger_ui is a gauge that represents the number of TempoStack instances with the Jaeger UI enabled or disabled.
Labels

  • enabled - possible values are:  true/false

tempo_operator_tempostack_storage_backend

tempo_operator_tempostack_storage_backend is a gauge that represents the number of TempoStack instances that use a certain storage type.
Labels

  • type - possible values are: azure, gcs ,s3

 

Description of problem:

Customer wants to boot a VM using the Assisted Installer ISO. The customer originally installed the OpenShift Container Platform cluster using version 4.13, however in the meantime the cluster was updated to 4.16.

As a result, the customer updated the field "osImageVersion" to "4.16.11". This led to the new ISO being generated as expected. However, when reviewing the "status" of the InfraEnv, they can still see the following URL:

~~~
  isoDownloadURL: 'https://assisted-image-service-multicluster-engine.cluster.example.com/byapikey/<REDACTED>/4.13/x86_64/minimal.iso' 
~~~

Other artifacts are also still showing "?version=4.13":

~~~
    kernel: 'https://assisted-image-service-multicluster-engine.cluster.example.com/boot-artifacts/kernel?arch=x86_64&version=4.13'
    rootfs: 'https://assisted-image-service-multicluster-engine.cluster.example.com/boot-artifacts/rootfs?arch=x86_64&version=4.13'
~~~

The workaround is to download the ISO after manually replacing the version in the URL, which works as expected.

Version-Release number of selected component (if applicable):

RHACM 2.10
OpenShift Container Platform 4.16.11

How reproducible:

Always on the customer side

Steps to Reproduce:

    1. Create a cluster with an InfraEnv with the "osImageVersion" set to 4.14 (or 4.13)
    2. Update the cluster to the next OpenShift Container Platform version
    3. Update the InfraEnv "osImageVersion" field with the new version (you may need to create the ClusterImageSet first); see the example below
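
For reference, a minimal InfraEnv carrying the field from step 3 might look roughly like this (names, namespace, and the pull secret reference are illustrative):

~~~
apiVersion: agent-install.openshift.io/v1beta1
kind: InfraEnv
metadata:
  name: example-infraenv
  namespace: example-namespace
spec:
  pullSecretRef:
    name: pull-secret
  # Bumped from the originally installed 4.13 after the cluster update
  osImageVersion: "4.16.11"
~~~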

Actual results:

URLs in the "status" of the InfraEnv are not updated with the new version    

Expected results:

URLs in the "status" of the InfraEnv are updated with the new version

Additional info:

* Discussion in Slack: https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1726483803662359

Description of problem:

Using the 4.16.0/4.17.0 UI with ODF StorageClasses it is not possible to:
- create an RWOP PVC
- create an RWOP clone
- restore a snapshot to an RWOP PVC

Please see the attached screenshot.
The RWOP access mode should be added to all the relevant screens in the UI.

Version-Release number of selected component (if applicable):

    OCP 4.16.0 & 4.17.0
    ODF (OpenShift Data Foundation) 4.16.0 & 4.17.0

How reproducible:

    

Steps to Reproduce:

    1. Open the UI and go to OperatorHub
    2. Install ODF; once installed, refresh the console so the ConsolePlugin gets loaded
    3. Go to the "StorageSystem" operand and create the CR using the custom UI (you can just keep clicking "Next" with the default options; it works well on an AWS cluster)
    4. Wait for the "ocs-storagecluster-cephfs" and "ocs-storagecluster-ceph-rbd" StorageClasses to be created by the ODF operator
    5. Go to the PVC creation page and try to create a new PVC (using the StorageClasses mentioned in step 4)
    6. Try to create a clone
    7. Try to restore an RWOP PVC from an existing snapshot
    

Actual results:

It is not possible to create an RWOP PVC, create an RWOP clone, or restore a snapshot to an RWOP PVC using the 4.16.0 & 4.17.0 UI.

Expected result: 

 

It should be possible to create an RWOP PVC, create an RWOP clone, and restore a snapshot to an RWOP PVC.
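
For illustration, the kind of PVC the console forms should be able to produce once RWOP is offered (the StorageClass name is taken from the steps above; the PVC name and size are illustrative):

~~~
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-rwop-pvc
spec:
  accessModes:
  - ReadWriteOncePod   # the RWOP mode currently missing from the console forms
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-ceph-rbd
~~~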

 

Additional info:

https://github.com/openshift/console/blob/master/frontend/public/components/storage/shared.ts#L111-L119 >> this list needs to be updated

Description of problem:

Spun out of https://issues.redhat.com/browse/OCPBUGS-38121, we noticed that there were logged requests against a non-existent certificatesigningrequests.v1beta1.certificates.k8s.io API in 4.17.

These requests should not be logged if the API doesn't exist.

See also the Slack discussion: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1724854657518169

Version-Release number of selected component (if applicable):

    4.17

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of the problem:

Orange complains about the following log errors, though they do not cause actual problems:

~~~
$ kubectl -n assisted-installer logs agentinstalladmission-995576d54-mnnrb | tail -5
E0904 12:10:11.728584       1 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.FlowSchema: failed to list *v1.FlowSchema: flowschemas.flowcontrol.apiserver.k8s.io is forbidden: User "system:serviceaccount:assisted-installer:agentinstalladmission" cannot list resource "flowschemas" in API group "flowcontrol.apiserver.k8s.io" at the cluster scope
W0904 12:10:44.955236       1 reflector.go:539] k8s.io/client-go/informers/factory.go:159: failed to list *v1.FlowSchema: flowschemas.flowcontrol.apiserver.k8s.io is forbidden: User "system:serviceaccount:assisted-installer:agentinstalladmission" cannot list resource "flowschemas" in API group "flowcontrol.apiserver.k8s.io" at the cluster scope
E0904 12:10:44.955259       1 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.FlowSchema: failed to list *v1.FlowSchema: flowschemas.flowcontrol.apiserver.k8s.io is forbidden: User "system:serviceaccount:assisted-installer:agentinstalladmission" cannot list resource "flowschemas" in API group "flowcontrol.apiserver.k8s.io" at the cluster scope
W0904 12:10:47.189983       1 reflector.go:539] k8s.io/client-go/informers/factory.go:159: failed to list *v1.PriorityLevelConfiguration: prioritylevelconfigurations.flowcontrol.apiserver.k8s.io is forbidden: User "system:serviceaccount:assisted-installer:agentinstalladmission" cannot list resource "prioritylevelconfigurations" in API group "flowcontrol.apiserver.k8s.io" at the cluster scope
E0904 12:10:47.190006       1 reflector.go:147] k8s.io/client-go/informers/factory.go:159: Failed to watch *v1.PriorityLevelConfiguration: failed to list *v1.PriorityLevelConfiguration: prioritylevelconfigurations.flowcontrol.apiserver.k8s.io is forbidden: User "system:serviceaccount:assisted-installer:agentinstalladmission" cannot list resource "prioritylevelconfigurations" in API group "flowcontrol.apiserver.k8s.io" at the cluster scope
~~~

How reproducible:

Every time agentinstalladmission starts.

Steps to reproduce:

1. In a Kubernetes cluster, install the infrastructure operator.

2. Create the AgentServiceConfig CR:

~~~
apiVersion: agent-install.openshift.io/v1beta1
kind: AgentServiceConfig
metadata:
  name: agent
spec:
  ingress:
    className: nginx
    assistedServiceHostname: assisted-service.example.com
    imageServiceHostname: image-service.example.com
  databaseStorage:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 8Gi
  filesystemStorage:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 8Gi
  imageStorage:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
  osImages: [{"openshiftVersion":"4.17.0","cpuArchitecture":"x86_64","url":"https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/pre-release/4.17.0-ec.3/rhcos-4.17.0-ec.3-x86_64-live.x86_64.iso","version":"4.17.0"}]
~~~

3. Check the agentinstalladmission container logs.

Actual results:

Errors show up in the log.

Expected results:

The log is free of these errors.
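
One possible way to make these messages go away, assuming the intent is to grant the informer the access it asks for rather than to disable it, is RBAC along the following lines (a sketch, not necessarily the chosen fix):

~~~
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: agentinstalladmission-flowcontrol-reader
rules:
# The resources named in the "forbidden" errors above
- apiGroups: ["flowcontrol.apiserver.k8s.io"]
  resources: ["flowschemas", "prioritylevelconfigurations"]
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: agentinstalladmission-flowcontrol-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: agentinstalladmission-flowcontrol-reader
subjects:
- kind: ServiceAccount
  name: agentinstalladmission
  namespace: assisted-installer
~~~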