Back to index

4.20.0-0.konflux-nightly-2025-04-02-123413

Jump to: Incomplete Features | Incomplete Epics | Other Complete | Other Incomplete |

Changes from 4.19.0-ec.4

Note: this page shows the Feature-Based Change Log for a release

Incomplete Features

When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release.

Goal:
Graduate Gateway API with Istio to GA (full support) to unify the management of cluster ingress with a common, open, expressive, and extensible API.

Description:
Gateway API is the evolution of upstream Kubernetes Ingress APIs. The upstream project is part of Kubernetes, working under SIG-NETWORK. OpenShift is contributing to the development, building a leadership position, and preparing OpenShift to support Gateway API, with Istio as our supported implementation.

The pluggable nature of the Gateway API implementation enables support for additional, optional third-party ingress technologies.

Epic Goal

  • Graduate the Gateway API, with the Istio Gateway implementation, to GA in a future release

Problem: As an administrator, I would like to securely expose cluster resources to remote clients and services while providing a self-service experience to application developers.

GA: A feature is implemented as GA so that developers can issue an update to the Tech Preview MVP and:

  • can no longer change APIs without following a deprecation or backwards-compatibility process
  • are required to fix bugs customers uncover
  • must support upgrading the cluster and their component
  • provide docs
  • provide education to CEE about the feature
  • must also follow Red Hat's support policy for GA

Why is this important?

  • Reduces the burden on Red Hat developers to maintain IngressController and Route custom resources
  • Brings OpenShift ingress configuration more in line with standard Kubernetes APIs
  • Demonstrates Red Hat’s leadership in the Kubernetes community.

Scenarios

  1. ...

Acceptance Criteria

  • Gateway API and Istio Gateway are in an acceptable standing for GA
  • Istio Gateway installation without sidecars enabled
  • Decision completed on whether a new operator is required, especially for upgrade and status reports
  • Decision completed on whether Ingress->Gateway (or Route->Gateway) translation is needed
  • Enhancement Proposals, Migration details, Tech Enablement, and other input for QA and Docs as needed
  • API server integration, Installation, CI, E2E tests, Upgrade details, Telemetry as needed
  • TBD

Dependencies (internal and external)

  1. OSSM release schedule aligned with OpenShift's cadence, or workaround designed
  2. ...tbd

Previous Work (Optional):

  1. https://issues.redhat.com/browse/NE-993
  2. https://issues.redhat.com/browse/NE-1036 

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Gateway API (GWAPI) and Istio logs are not in the must-gather reports.

Add Gateway API resources and possibly OSSM resources to the operator's relatedObjects field.

Overview

Gateway API is the next generation of the Ingress API in upstream Kubernetes.

OpenShift Service Mesh (OSSM) and several other offerings of ours, like Kuadrant, MicroShift, and OpenShift AI, all have critical dependencies on Gateway API's API resources. However, even though Gateway API is an official Kubernetes project, its API resources are not available in the core API (like Ingress) and instead require the installation of Custom Resource Definitions (CRDs).

OCP will be fully in charge of managing the life cycle of the Gateway API CRDs going forward. This will make Gateway API a "core-like" API on OCP. If the CRDs are already present on a cluster when it upgrades to the version where they become managed, the cluster admin is responsible for the safety of existing Gateway API implementations. The Cluster Ingress Operator (CIO) enacts a process called "CRD Management Succession" to ensure the transfer of control occurs safely, which includes multiple pre-upgrade checks and CIO startup checks.

Acceptance Criteria

  • If not present, the Gateway API CRDs should be deployed at cluster install time, with management thereafter handled by the platform
  • Any existing CRDs not managed by the platform should be removed, or management and control transferred to the platform
  • Only the platform can manage or make changes to the Gateway API CRDs; all others will be blocked
  • Documentation about these APIs, and the process of upgrading to a version where they are managed, needs to be provided

Cross-Team Coordination

The organization as a whole needs to be made aware of this, as new projects will continue to pop up with Gateway API support over the years. This includes (but is not limited to):

  • OSSM Team (Istio)
  • Connectivity Link Team (Kuadrant)
  • MicroShift Team
  • OpenShift AI Team (KServe)

Importantly, our cluster infrastructure work with Cluster API (CAPI) is working through similar dilemmas for the CAPI CRDs, so we need to work directly with that team, as they have already broken a lot of ground here. Here are the relevant docs with the work they've done so far:

What?

From OCP 4.19 onward we will ensure the Gateway API CRDs are present at a specific version, behind a dedicated feature gate that defaults to true. If we cannot ensure the CRDs are present at the expected version, we will mark the cluster degraded.

Why?

See the description of NE-1898.

How?

The Cluster Ingress Operator (CIO) currently provides some logic around handling the Gateway API CRDs, and a chunk of this work is simply updating that. The CIO should:

  • deploy the CRDs at the current expected version
    • at the time of writing this is v1.2.1, but is subject to change
  • upgrade any present CRDs to the current expected version
    • similar to NE-1952, ensure that ONLY the exact CRDs we expect are present:
      • this means only GatewayClass, Gateway, HTTPRoute and ReferenceGrant (and not older versions of them)
      • if the wrong CRD versions are present, the CIO overwrites them with the correct version
      • if unexpected CRDs become present (e.g. TCPRoute), the CIO reports them using Degraded=True on the clusteroperator status (a minimal sketch of this check follows the list)
      • for any of the above situations to occur, something has to be very broken (e.g. the cluster admin destroyed the VAP). Provide an appropriate alert when these situations are encountered.
  • Make the ControllerName for GatewayClass include a /v1 version suffix to indicate which implementation is provided (i.e. for now, OSSM)
    • this opens up the door for DP, TP and potentially eventual GA releases of alternative Gateway API implementations other than OSSM, which can then be enabled via a new GatewayClass
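
As a rough illustration of the check described in the list above, the logic boils down to comparing the CRDs in the gateway.networking.k8s.io group against a pinned expected set. This is a minimal sketch only, not the actual CIO code; the expected names and the version pinning live in the operator itself:

  import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
  )

  // expectedGatewayAPICRDs is the set of Gateway API CRDs the platform manages
  // at the pinned bundle version (v1.2.1 at the time of writing).
  var expectedGatewayAPICRDs = map[string]bool{
    "gatewayclasses.gateway.networking.k8s.io":  true,
    "gateways.gateway.networking.k8s.io":        true,
    "httproutes.gateway.networking.k8s.io":      true,
    "referencegrants.gateway.networking.k8s.io": true,
  }

  // checkGatewayAPICRDs lists CRDs in the gateway.networking.k8s.io group and
  // returns an error naming any unexpected ones, so the caller can set
  // Degraded=True on the clusteroperator and raise an alert.
  func checkGatewayAPICRDs(ctx context.Context, client apiextensionsclient.Interface) error {
    crds, err := client.ApiextensionsV1().CustomResourceDefinitions().List(ctx, metav1.ListOptions{})
    if err != nil {
      return err
    }
    var unexpected []string
    for _, crd := range crds.Items {
      if crd.Spec.Group != "gateway.networking.k8s.io" {
        continue
      }
      if !expectedGatewayAPICRDs[crd.Name] {
        unexpected = append(unexpected, crd.Name)
      }
    }
    if len(unexpected) > 0 {
      return fmt.Errorf("unexpected Gateway API CRDs present: %v", unexpected)
    }
    return nil
  }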

Helpful Links

See some of the current CRD management logic here.

Use cases:

  1. As a developer, I would like to test for unacceptable failures that exist in the Gateway API with Ingress product.

This Epic is a placeholder for stories regarding e2e and unit tests that are missing for old features, and for determining whether OSSM 3.x TP2 bugs affect us before they are fixed in GA. There is already one epic for DNS, and test cases should be added for any new features in the release.

Write and run test cases that are currently missing.

see thread: https://redhat-internal.slack.com/archives/CBWMXQJKD/p1740071510670649?thread_ts=1740071056.472839&cid=CBWMXQJKD

and https://github.com/openshift/api?tab=readme-ov-file#defining-featuregate-e2e-tests

The tests to be covered in Origin are (a simplified sketch of one such check follows the list):

  • Verify Gateway API CRDs and ensure required CRDs are already installed
  • Verify Gateway API CRDs and ensure existing CRDs can not be deleted
  • Verify Gateway API CRDs and ensure existing CRDs can not be updated
  • Verify Gateway API CRDs and ensure CRD of standard group can not be created
  • Verify Gateway API CRDs and ensure CRD of experimental group is not installed
  • Verify Gateway API CRDs and ensure CRD of experimental group can not be created
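
As a simplified illustration (the real tests live in Origin and use its Ginkgo-based suite), the "existing CRDs can not be deleted" case reduces to issuing a delete for a managed CRD and expecting the request to be rejected, e.g. by the validating admission policy:

  import (
    "context"
    "testing"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
  )

  // verifyManagedCRDCannotBeDeleted fails if the cluster allows deletion of a
  // Gateway API CRD that the platform manages.
  func verifyManagedCRDCannotBeDeleted(ctx context.Context, t *testing.T, client apiextensionsclient.Interface) {
    err := client.ApiextensionsV1().CustomResourceDefinitions().Delete(ctx, "gateways.gateway.networking.k8s.io", metav1.DeleteOptions{})
    if err == nil {
      t.Fatal("expected deletion of a managed Gateway API CRD to be blocked")
    }
  }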

Feature Overview (aka. Goal Summary)  

Cgroup v1 was deprecated in OCP 4.16. RHEL will be removing support for cgroup v1 in RHEL 10, so we will remove it in OCP 4.19.

Goal 

Upgrade Scenario

For clusters running cgroup v1 on OpenShift 4.18 or earlier, upgrading to OpenShift 4.19 will be blocked. To proceed with the upgrade, clusters on OpenShift 4.18 must first switch from cgroup v1 to cgroup v2. Once this transition is complete, the cluster upgrade to OpenShift 4.19 can be performed.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • Remove the support for cgroup v1 in 4.19

Why is this important?

  • With dependent components like systemd and RHCOS moving away from cgroup v1, it is important for the node to make this move as well.

Scenarios

  1. As a system administrator, I would like to make sure my cluster doesn't use cgroup v1 from 4.19 onwards.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Remove the CgroupModeV1 config option from the openshift/api repository
Ref: https://github.com/openshift/api/blob/master/config/v1/types_node.go#L84 

Add a CRD validation check on the CgroupMode field of the nodes.config spec to reject updates to "v1" and only allow "v2" and "" as valid values.
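
A sketch of what the tightened enum could look like in openshift/api once CgroupModeV1 is removed (the exact marker placement and comments in the final PR may differ):

  // CgroupMode describes the cgroup mode to be used on the node.
  // With cgroup v1 removed, only the empty string (platform default) and "v2"
  // are accepted; updates to "v1" are rejected by the CRD validation.
  // +kubebuilder:validation:Enum="";v2
  type CgroupMode string

  const (
    CgroupModeEmpty CgroupMode = "" // use the platform default
    CgroupModeV2    CgroupMode = "v2"
  )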

 

Latest update:
Raise a PR with the updated enhancement proposal to handle the removal of cgroup v1

Feature Overview (aka. Goal Summary)  

Enable OpenShift to be deployed on Confidential VMs on GCP using Intel TDX technology

Goals (aka. expected user outcomes)

Users deploying OpenShift on GCP can choose to deploy Confidential VMs using Intel TDX technology to rely on confidential computing to secure the data in use

Requirements (aka. Acceptance Criteria):

As a user, I can choose OpenShift Nodes to be deployed with the Confidential VM capability on GCP using Intel TDX technology at install time

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Background

This is a piece of a higher-level effort to secure data in use with OpenShift on every platform

Documentation Considerations

Documentation on how to use this new option must be added as usual

Epic Goal

  • Add support to deploy Confidential VMs on GCP using Intel TDX technology

Why is this important?

  • As part of the Zero Trust initiative we want to enable OpenShift to support data in use protection using confidential computing technologies

Scenarios

  1. As a user I want all my OpenShift Nodes to be deployed as Confidential VMs on Google Cloud using Intel TDX technology

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Previous Work (Optional):

  1. We enabled Confidential VMs for GCP using SEV technology already - OCPSTRAT-690

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

LUKS encryption is required for certain customer environments, e.g. to be PCI compliant, and the current implementation with network-based LUKS encryption is a) complex and b) not reliable and secure. We therefore need to give our customers a way to have the root device encrypted securely, with an IBM hardware-based HSM protecting the LUKS key. This is akin to the TPM approach of storing the LUKS key while fencing it off from the user.

Hardware-based LUKS encryption requires injecting the reading of secure keys into Clevis at boot time.

Goals (aka. expected user outcomes)

Provide hardware-based root volume encryption

Requirements (aka. Acceptance Criteria):

Provide hardware-based root volume encryption with LUKS

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations (list applicable specific needs; N/A = not applicable):
Self-managed, managed, or both: Y
Classic (standalone cluster): Y
Hosted control planes: Y
Multi node, Compact (three node), or Single node (SNO), or all: Y
Connected / Restricted Network: Y
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): IBM Z
Operator compatibility: n/a
Backport needed (list applicable versions): n/a
UI need (e.g. OpenShift Console, dynamic plugin, OCM): n/a
Other (please specify):

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Once ignition spec 3.5 stabilizes, we should switch to using spec 3.5 as the default in the MCO to enable additional features in RHCOS.

 

(example: https://issues.redhat.com/browse/MULTIARCH-3776 needs 3.5)

This story covers all the needed work from the code side that needs to be done to support the 3.5 ignition spec.

To support 3.5 we need to, from a high level perspective:

  • Bump the ignition dependency to v2.20.0, which contains the 3.5 types.
  • Switch all imports that point to 3.4 to point to 
    github.com/coreos/ignition/v2/config/v3_5/types (a minimal sketch follows this list).
  • Create the conversion logic and update the existing ones:
    • convertIgnition34to22 changes to convertIgnition35to22
    • convertIgnition22to34 changes to convertIgnition22to35
    • create convertIgnition35to34
  • Update UTs to reflect above changes and to cover the new function.
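
A minimal sketch of the resulting default config helper once the types import moves to v3_5, assuming ignition v2.20.0, which ships the config/v3_5/types package (the function name here mirrors the existing MCO pattern but is illustrative):

  import (
    ign3_5types "github.com/coreos/ignition/v2/config/v3_5/types"
  )

  // NewIgnConfig returns an empty config stamped with the new default spec
  // version (3.5.0), mirroring what is done today for spec 3.4.
  func NewIgnConfig() ign3_5types.Config {
    return ign3_5types.Config{
      Ignition: ign3_5types.Ignition{
        Version: ign3_5types.MaxVersion.String(),
      },
    }
  }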

 

Done When:

  • The code points to the new ignition release
  • The code uses ignition 3.5 as the default version
  • The unit tests are updated to match the changes to the already existing code plus the new conversion functions

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • CKAO needs to expose a topology signal based on the current OpenShift topology and ingest it into telemetry

Why is this important?

  • Some stakeholders want to query telemetry through the lens of topology. Today the topology is often derived from node roles and their respective counts. With additional topologies this becomes ambiguous.
  • Allows stakeholders to embrace the upcoming OpenShift-specific variants and make changes to their monitoring stack based on the topology they are in.

Scenarios

  • Dashboards that cater to particular types of OpenShift variants.
  • Alerts that utilize the topology that they are in.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • We believe CKAO is the best place to expose a metric satisfying the above requirements (a sketch of such a metric follows this list).
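
As an illustration only, an info-style gauge carrying the topology as labels could look like the following; the metric and label names are assumptions, not the final CKAO metric:

  import (
    "github.com/prometheus/client_golang/prometheus"
  )

  // clusterTopologyInfo reports the control plane and infrastructure topology
  // as labels on a constant-value gauge, which telemetry can then ingest.
  var clusterTopologyInfo = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
      Name: "cluster_infrastructure_topology_info",
      Help: "Control plane and infrastructure topology of the cluster.",
    },
    []string{"control_plane_topology", "infrastructure_topology"},
  )

  func init() {
    prometheus.MustRegister(clusterTopologyInfo)
  }

  // recordTopology resets the gauge and records the currently declared
  // topology so dashboards and alerts can filter on it.
  func recordTopology(controlPlane, infra string) {
    clusterTopologyInfo.Reset()
    clusterTopologyInfo.WithLabelValues(controlPlane, infra).Set(1)
  }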

 

As a developer of TNF, I need:

  • To have a feature gate ensuring clusters are not upgradable while in dev or tech preview
  • To have a new control plane topology for TNF set in the installer
  • To fix any operator logic that is sensitive to topology declarations

Acceptance Criteria

  • Feature gate is added for TNF
  • New control plane topology is added to the infrastructure spec
  • Topology-sensitive operators are updated with TNF specific logic
  • Installer is updated to set the new topology in the infra config

TNF Enhancement Proposal

As a developer of 2NO, I need:

  • Cluster installations of 2NO to succeed
  • The Authentication operator not to degrade when there are only 2 API server replicas and the infra topology is `DualReplica` (see the sketch after this list)
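
A minimal sketch of the kind of topology-sensitive adjustment involved, assuming the new control plane topology value is exposed in openshift/api as configv1.DualReplicaTopologyMode (illustrative only, not the actual operator change):

  import (
    configv1 "github.com/openshift/api/config/v1"
  )

  // expectedAPIServerReplicas returns how many API server replicas an operator
  // should expect for the cluster's declared control plane topology, so that a
  // two-replica DualReplica cluster is not reported as degraded.
  func expectedAPIServerReplicas(infra *configv1.Infrastructure) int32 {
    switch infra.Status.ControlPlaneTopology {
    case configv1.SingleReplicaTopologyMode:
      return 1
    case configv1.DualReplicaTopologyMode:
      return 2
    default:
      return 3
    }
  }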

Acceptance Criteria

  • PR is merged in openshift authentication operator repo
  • Unit test is added

In order to add TNF support to the authentication operator, it would be best to do the dependency update in a separate PR, to avoid mixing behavior differences between the dependency changes and the TNF change.

As a developer of TNF, I need:

  • To be able to install TNF clusters using the core installer

Acceptance Criteria

  • The OpenShift installer enforces the feature gate for enabling two-node OpenShift with fencing
  • The installer enforces the new validation requirements for fencing credentials on the `baremetal` and `none` platforms

TNF Enhancement Proposal

As a developer of 2NO, I need:

  • To update the openshift-installer so that it reports issues found in the Pacemaker service when an installation fails

Acceptance Criteria

  • PR is merged in openshift/installer

As a developer of 2NO, I need:

  • To ensure that the control plane can accept fencing credentials

Acceptance Criteria

  • PR is merged in openshift/installer
  • Unit test is added (as required)

Feature Overview (aka. Goal Summary)

This feature aims to comprehensively refactor and standardize various components across HCP, ensuring consistency, maintainability, and reliability. The overarching goal is to increase customer satisfaction by increasing speed to market and to save engineering budget by reducing incidents/bugs. This will be achieved by reducing technical debt, improving code quality, and simplifying the developer experience across multiple areas, including CLI consistency, NodePool upgrade mechanisms, networking flows, and more. By addressing these areas holistically, the project aims to create a more sustainable and scalable codebase that is easier to maintain and extend.

Goals (aka. Expected User Outcomes)

  • Unified Codebase: Achieve a consistent and unified codebase across different HCP components, reducing redundancy and making the code easier to understand and maintain.
  • Enhanced Developer Experience: Streamline the developer workflow by reducing boilerplate code, standardizing interfaces, and improving documentation, leading to faster and safer development cycles.
  • Improved Maintainability: Refactor large, complex components into smaller, modular, and more manageable pieces, making the codebase more maintainable and easier to evolve over time.
  • Increased Reliability: Enhance the reliability of the platform by increasing test coverage, enforcing immutability where necessary, and ensuring that all components adhere to best practices for code quality.
  • Simplified Networking and Upgrade Mechanisms: Standardize and simplify the handling of networking flows and NodePool upgrade triggers, providing a clear, consistent, and maintainable approach to these critical operations.

Requirements (aka. Acceptance Criteria)

  • Standardized CLI Implementation: Ensure that the CLI is consistent across all supported platforms, with increased unit test coverage and refactored dependencies.
  • Unified NodePool Upgrade Logic: Implement a common abstraction for NodePool upgrade triggers, consolidating scattered inputs and ensuring a clear, consistent upgrade process.
  • Refactored Controllers: Break down large, monolithic controllers into modular, reusable components, improving maintainability and readability.
  • Improved Networking Documentation and Flows: Update networking documentation to reflect the current state, and refactor network proxies for simplicity and reusability.
  • Centralized Logic for Token and Userdata Generation: Abstract the logic for token and userdata generation into a single, reusable library, improving code clarity and reducing duplication.
  • Enforced Immutability for Critical API Fields: Ensure that immutable fields within key APIs are enforced through proper validation mechanisms, maintaining API coherence and predictability.
  • Documented and Clarified Service Publish Strategies: Provide clear documentation on supported service publish strategies, and lock down the API to prevent unsupported configurations.

Use Cases (Optional)

  • Developer Onboarding: New developers can quickly understand and contribute to the HCP project due to the reduced complexity and improved documentation.
  • Consistent Operations: Operators and administrators experience a more predictable and consistent platform, with reduced bugs and operational overhead due to the standardized and refactored components.

Out of Scope

  • Introduction of new features or functionalities unrelated to the refactor and standardization efforts.
  • Major changes to user-facing commands or APIs beyond what is necessary for standardization.

Background

Over time, the HyperShift project has grown organically, leading to areas of redundancy, inconsistency, and technical debt. This comprehensive refactor and standardization effort is a response to these challenges, aiming to improve the project's overall health and sustainability. By addressing multiple components in a coordinated way, the goal is to set a solid foundation for future growth and development.

Customer Considerations

  • Minimal Disruption: Ensure that existing users experience minimal disruption during the refactor, with clear communication about any changes that might impact their workflows.
  • Enhanced Stability: Customers should benefit from a more stable and reliable platform as a result of the increased test coverage and standardization efforts.

Documentation Considerations

Ensure all relevant project documentation is updated to reflect the refactored components, new abstractions, and standardized workflows.

This overarching feature is designed to unify and streamline the HCP project, delivering a more consistent, maintainable, and reliable platform for developers, operators, and users.

Goal

Refactor and modularize controllers and other components to improve maintainability, scalability, and ease of use.

User Story:

As a (user persona), I want to be able to:

  • As an external dev I want to be able to add new components to the CPO easily
  • As a core dev I want to feel safe when adding new components to the CPO
  • As a core dev I want to add new components to the CPO without copy/pasting big chunks of code

 

https://issues.redhat.com//browse/HOSTEDCP-1801 introduced a new abstraction to be used by ControlPlane components. We need to refactor every component to use this abstraction. 

Acceptance Criteria:

Description of criteria:

All ControlPlane Components are refactored:

  • HCCO
  • kube-apiserver (Mulham)
  • kube-controller-manager (Mulham)
  • ocm (Mulham)
  • etcd (Mulham)
  • oapi (Mulham)
  • scheduler (Mulham)
  • clusterpolicy (Mulham)
  • CVO (Mulham)
  • oauth (Mulham)
  • hcp-router (Mulham)
  • storage (Mulham)
  • CCO (Mulham)
  • CNO (Jparrill)
  • CSI (Jparrill)
  • dnsoperator
  • ignition (Ahmed)
  • ingressoperator 
  • machineapprover
  • nto
  • olm
  • pkioperator
  • registryoperator 
  • snapshotcontroller

 

Example PR to refactor cloud-credential-operator : https://github.com/openshift/hypershift/pull/5203
docs: https://github.com/openshift/hypershift/blob/main/support/controlplane-component/README.md
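
As a purely illustrative sketch of the registration pattern this refactor moves components onto (the real interface is defined in support/controlplane-component, linked above; the names below are hypothetical, not the actual hypershift API):

  import (
    "context"
  )

  // ControlPlaneComponent is a hypothetical stand-in for the shared
  // abstraction: each component declares its name and how to adapt its
  // manifests, and the framework owns the repetitive reconcile, availability,
  // and ownership logic.
  type ControlPlaneComponent interface {
    Name() string
    Adapt(ctx context.Context) error
  }

  // register collects components so the control plane operator can reconcile
  // them uniformly instead of each one copy/pasting reconcile boilerplate.
  func register(components ...ControlPlaneComponent) map[string]ControlPlaneComponent {
    all := make(map[string]ControlPlaneComponent, len(components))
    for _, c := range components {
      all[c.Name()] = c
    }
    return all
  }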

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

Upgrade the OCP console to PatternFly 6.

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

The core OCP Console should be upgraded to PF 6 and the Dynamic Plugin Framework should add support for PF6 and deprecate PF4.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

Console, Dynamic Plugin Framework, Dynamic Plugin Template, and Examples all should be upgraded to PF6 and all PF4 code should be removed.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

As a company, we have all agreed to make our products look and feel the same. The current level is PF6.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Problem:

Console is adopting PF6 and removing PF4 support. This creates lots of UI issues in the Developer Console, which we need to fix.

Goal:

Fix all the UI issues in the ODC related to PF6 upgrade

Why is it important?

Acceptance criteria:

  1. Fix all the ODC issues https://docs.google.com/spreadsheets/d/1J7udCkoCks7Pc_jIRdDDBDtbY4U5OOZu_kG4aqX1GlU/edit?gid=0#gid=0

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Most of the *-theme-dark classes defined in the console code base were for PF5 and are likely unnecessary in PF6 (although the version number was updated).  We should evaluate each class and determine if it is still necessary.  If it is not, we should remove it.

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

<your text here>

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

<your text here>

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Epic Goal

  • Based on user analytics, many customers switch back and forth between perspectives, averaging 15 times per session.
  • The following steps will be needed:
    • Surface all Dev specific Nav items in the Admin Console
    • Disable the Dev perspective by default but allow admins to enable via console setting
    • All quickstarts need to be updated to reflect the removal of the dev perspective
    • Guided tour to show updated nav for the merged perspective

Why is this important?

  • We need to alleviate this pain point and improve the overall user experience for our users.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

Hypershift currently allows NodePools to be up to three minor versions behind the HostedCluster control plane (y-3), by virtue of referencing the floating upstream docs (which changed from n-2 to n-3), but only tests configurations up to two minor versions behind at best (y-2).

This feature will align the allowed NodePool skew with the tested and supported versions to improve stability and prevent users from deploying unsupported configurations.

Background

Hypershift currently allows for NodePool minor version skew based on the upstream Kubernetes skew policy. However, our testing capacity only allows us to fully validate up to y-2 skew at best. This mismatch creates a potential risk for users deploying unsupported configurations.

Goals (aka. expected user outcomes)

  • As a cluster service consumer, I can reliably deploy NodePools with a maximum minor version skew of y-2.
  • As a cluster service consumer, I am prevented from deploying NodePools with an unsupported skew (y-3) or any untested upstream limit.
  • As a cluster service consumer, I understand the supported NodePool skew limits through clear documentation.

Requirements (aka. Acceptance Criteria):

  • The HCP documentation accurately reflects the supported NodePool minor version skew of y-2 and removes the floating reference to upstream Kubernetes.
  • HCP prevents the creation of NodePools with a skew greater than y-2 (for even-numbered 4.y control planes) or y-1 (for odd-numbered 4.y control planes) through code guardrails. Example of how it's done in standalone OpenShift: https://github.com/openshift/cluster-kube-apiserver-operator/pull/1199#issue-958481467
  • Existing NodePools with a skew greater than y-2 are not impacted (but upgrading the control plane may be blocked).
  • Appropriate error messages are provided to users when attempting to create NodePools with an unsupported skew.
Deployment considerations (list applicable specific needs; N/A = not applicable):
Self-managed, managed, or both: both
Classic (standalone cluster):
Hosted control planes: yes
Multi node, Compact (three node), or Single node (SNO), or all:
Connected / Restricted Network:
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x):
Operator compatibility:
Backport needed (list applicable versions):
UI need (e.g. OpenShift Console, dynamic plugin, OCM):
Other (please specify):

Use Cases (Optional):

  • Main Success Scenario: A user attempts to create a NodePool with a supported skew (y-2 or less). The NodePool is created successfully.
  • Alternative Flow Scenario: A user attempts to create a NodePool with an unsupported skew (y-3). An error message is displayed, and the NodePool creation is blocked by a gate.

Customer Considerations

Customers who have deployed NodePools with a skew greater than y-2 may need to upgrade their NodePools before upgrading the HostedCluster control plane in the future.

Documentation Considerations

The HCP documentation on NodePool versioning and upgrading needs to be updated to reflect the new supported skew limits.

Interoperability Considerations

Impacts ROSA/ARO HCP

Goal

The goal of this feature is to align the allowed NodePool minor version skew with the tested and supported versions (y-2) to improve stability and prevent users from deploying unsupported configurations. This feature ensures that only configurations that have been fully validated and tested are deployed, reducing the risk of instability or issues with unsupported version skews.

Why is this important? (mandatory)

This is important because the current mismatch between the allowed NodePool skew (which allows up to y-3) and the actual tested configurations (which only support up to y-2) creates a risk for users deploying unsupported configurations. These unsupported configurations could lead to untested or unstable deployments, causing potential issues or failures within the cluster. By enforcing a stricter version skew policy, this change will:

  1. Improve stability: Users are prevented from deploying configurations that could cause compatibility or functionality issues.
  2. Provide clarity: Users will have a clear understanding of the supported NodePool skew limits, making it easier to plan and manage deployments.
  3. Enhance user experience: By eliminating unsupported configurations, we can reduce the likelihood of unexpected failures and create a more predictable environment for users.

Scenarios

Main Success Scenario:

  • A user attempts to create a NodePool with a supported version skew (y-2 or less).
  • The NodePool is created successfully, and the user’s cluster remains stable.

Alternative Flow Scenario:

  • A user attempts to create a NodePool with an unsupported version skew (y-3).
  • The system blocks the creation of the NodePool and displays a clear error message informing the user that the version skew is unsupported. The NodePool creation is prevented by a gate, ensuring the user cannot proceed with an invalid configuration.

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams (and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Other 

User Story:

As a developer, I want to be able to:

  • Add Minor Version Compatibility Validation to CLI

Acceptance Criteria:

Description of criteria:

  • When adding a NodePool whose version is incompatible with the ControlPlane version, block the creation (a minimal sketch of the skew check follows this list).
    • For even-numbered 4.x control plane versions, a skew of up to y-2 is allowed.
    • For odd-numbered 4.x control plane versions, a skew of up to y-1 is allowed.
    • The NodePool version cannot be higher than the ControlPlane version.
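
A minimal sketch of that skew check (illustrative; the real CLI validation parses full versions and returns a user-facing error):

  // isNodePoolSkewSupported reports whether a NodePool at minor version
  // npMinor may be created against a 4.y control plane at minor version
  // cpMinor: NodePools may never be newer than the control plane,
  // even-numbered control plane minors allow a skew of two, odd-numbered
  // minors allow a skew of one.
  func isNodePoolSkewSupported(cpMinor, npMinor int) bool {
    if npMinor > cpMinor {
      return false
    }
    maxSkew := 1
    if cpMinor%2 == 0 {
      maxSkew = 2
    }
    return cpMinor-npMinor <= maxSkew
  }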

Goal

Support for more than one disk in machineset API for vSphere provider

 

Feature description

Customers using vSphere should be able to create machines with more than one disk. This is already available for other cloud and on-prem providers.

Why do customers need this?

To have Proper disk layout that better address their needs. Some examples are using the local storage operator or ODF.

 

Affected packages or components

RHCOS, Machine API, Cluster Infrastructure, CAPV.

User Story:
As an OpenShift administrator, I need to be able to configure my OpenShift cluster to have additional disks on each vSphere VM so that I can use the new data disks for various OS needs.

 

Description: 
The goal of this epic is to allow the cluster administrator to install, and to configure after install, new machines with additional disks attached to each virtual machine for various OS needs.

 

Required:

  • Installer allows configuring additional disks for control plane and compute virtual machines
  • Control Plane Machine Sets (CPMS) allows configuring control plane virtual machines with additional disks
  • Machine API (MAPI) allows for configuring Machines and MachineSets with additional disks
  • Cluster API (CAPI) allows for configuring Machines and MachineSets with additional disks

 

Nice to Have:

 

Acceptance Criteria:

 

Notes:

USER STORY:

As an OpenShift administrator, I want to be able to configure thin provisioning for my new data disks so that I can adjust behavior that may differ from my default storage policy.

DESCRIPTION:

Currently, the Machine API changes force the thin-provisioned flag to true. We need to add a flag to allow the admin to configure this. The default behavior will be to not set the flag and use the default storage policy.
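
A hypothetical sketch of the new knob; the field and enum names are illustrative, not the final Machine API vSphere provider spec:

  // VSphereDataDisk describes an additional disk attached to a vSphere VM.
  type VSphereDataDisk struct {
    // SizeGiB is the size of the additional disk in GiB.
    SizeGiB int32 `json:"sizeGiB"`
    // ProvisioningMode controls how the backing VMDK is provisioned. When
    // unset, the thinProvisioned attribute is not set on the clone call and
    // the datastore's default storage policy applies.
    // +optional
    ProvisioningMode DiskProvisioningMode `json:"provisioningMode,omitempty"`
  }

  // DiskProvisioningMode enumerates the supported provisioning behaviors.
  // +kubebuilder:validation:Enum=Thin;Thick;EagerlyZeroed
  type DiskProvisioningMode string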

ACCEPTANCE CRITERIA:

  • API has new flag
  • Machine API has been modified to use the new flag if set; otherwise, the thinProvisioned attribute is not set during clone.

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • Cluster Infrastructure owned CAPI components should be running on Kubernetes 1.30
  • target is 4.18 since CAPI is always a release behind upstream

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

To align with the 4.19 release, dependencies need to be updated to 1.30. This should be done by rebasing/updating as appropriate for the repository

Feature Overview (aka. Goal Summary)  

We need to maintain our dependencies across all the libraries we use in order to stay in compliance. 

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

<your text here>

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.

As a user, I do not want to load polyfills for browsers that OCP console no longer supports.

Add unit tests for the Timestamp component to prevent regressions like https://issues.redhat.com/browse/OCPBUGS-51202

 

AC:

  • A unit test is implemented which tests the functionality of the Timestamp component
  • The unit test exercises both the relative and full date formats.

Feature Overview

Note: This feature will be a TechPreview in 4.16 since the newly introduced API must graduate to v1.

Overarching Goal

Customers should be able to update and boot a cluster without a container registry in disconnected environments. This feature is for Baremetal disconnected cluster.

Background

  • For a single node cluster effectively cut off from all other networking, update the cluster despite the lack of access to image registries, local or remote.
  • For multi-node clusters that could have a complete power outage, recover smoothly from that kind of disruption, despite the lack of access to image registries, local or remote.
  • Allow cluster node(s) to boot without any access to a registry in case all the required images are pinned

 

This epic describes the work required to GA a minimal viable version of the Machine Config Node feature to enable the subsequent GAing of the Pinned Image Sets feature. The GAing of status reporting as well as any further enhancements for the Machine Config Node feature will be tracked in MCO-1506.

Related Items:

Done when:

  • MCN API is GAed
  • MCN functionality is consistent across all MCPs (default & custom) and both clusters with and without OCL enabled
  • Tests are created, encompassing of major functionality, and passing
  • The team is confident that the state of MCN is robust enough to support the GAing of Pinned Image Sets

The first step in GAing the MCN API is finalizing the v1alpha1 API. This will allow for testing of the final API design before the API is graduated to V1. Since there are a fair amount of changes likely to be made for the MCN API, making our changes in v1alpha1 first seems to follow the API team’s preference of V1 API graduations only having minor changes.

Done when:

  • V1alpha1 API for MCN is finalized
  • The MCN API fields are all properly documented
  • All design decisions are appropriately documented

Feature Overview (aka. Goal Summary)  

In order for Managed OpenShift Hosted Control Planes to run as part of Azure Red Hat OpenShift, it is necessary to support the new AKS design for secrets/identities.

Goals (aka. expected user outcomes)

Hosted Cluster components use the secrets/identities provided/referenced in the Hosted Cluster resources creation.

Requirements (aka. Acceptance Criteria):

All OpenShift Hosted Cluster components running with the appropriate managed or workload identity.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations (list applicable specific needs; N/A = not applicable):
Self-managed, managed, or both: Managed
Classic (standalone cluster): No
Hosted control planes: Yes
Multi node, Compact (three node), or Single node (SNO), or all: All supported ARO/HCP topologies
Connected / Restricted Network: All supported ARO/HCP topologies
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): All supported ARO/HCP topologies
Operator compatibility: All core operators
Backport needed (list applicable versions): OCP 4.18.z
UI need (e.g. OpenShift Console, dynamic plugin, OCM): No
Other (please specify):

Background

This is a follow-up to OCPSTRAT-979 required by an AKS sweeping change to how identities need to be handled.

Documentation Considerations

Should only affect ARO/HCP documentation rather than Hosted Control Planes documentation.

Interoperability Considerations

Does not affect ROSA or any of the supported self-managed Hosted Control Planes platforms

Goal

  • Today, the current HyperShift Azure API for Control Plane Managed Identities (MI) stores the client ID and its certificate name for each MI. The goal for this epic is to modify this API to instead allow a NestedCredentialsObject to be stored for each Control Plane MI.
  • In ARO HCP, CS will store the NestedCredentialsObject for each Control Plane MI, in its JSON format, in Azure Key Vault under a secret name for each MI. The secret name for a Control Plane MI will be provided to the HyperShift Azure API (i.e. HostedCluster). The control plane operator will read and parse the ClientID, ClientSecret, AuthenticationEndpoint, and TenantID for each Control Plane MI and either pass or use this data to use ClientCertificate authentication for each Control Plane component that needs to authenticate with Azure Cloud (a minimal sketch follows this list).
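
A minimal sketch of that parse-and-authenticate step using the Azure SDK for Go; the JSON field names and the mapping of the secret to certificate material are assumptions, not a confirmed schema:

  import (
    "encoding/json"
    "os"

    "github.com/Azure/azure-sdk-for-go/sdk/azcore"
    "github.com/Azure/azure-sdk-for-go/sdk/azcore/cloud"
    "github.com/Azure/azure-sdk-for-go/sdk/azidentity"
  )

  // credentialsObject models only the fields this epic needs; the JSON keys
  // are assumptions about the object stored in Key Vault.
  type credentialsObject struct {
    ClientID               string `json:"client_id"`
    ClientSecret           string `json:"client_secret"`
    TenantID               string `json:"tenant_id"`
    AuthenticationEndpoint string `json:"authentication_endpoint"`
  }

  // newCredentialFromMountedSecret reads the secret mounted by the Secrets
  // Store CSI driver, parses the fields above, and builds a client-certificate
  // credential that overrides the authority host as described in this epic.
  func newCredentialFromMountedSecret(path string) (*azidentity.ClientCertificateCredential, error) {
    raw, err := os.ReadFile(path)
    if err != nil {
      return nil, err
    }
    var creds credentialsObject
    if err := json.Unmarshal(raw, &creds); err != nil {
      return nil, err
    }
    // Assumption: the client "secret" carries the certificate and key material.
    certs, key, err := azidentity.ParseCertificates([]byte(creds.ClientSecret), nil)
    if err != nil {
      return nil, err
    }
    opts := &azidentity.ClientCertificateCredentialOptions{
      ClientOptions: azcore.ClientOptions{
        Cloud: cloud.Configuration{ActiveDirectoryAuthorityHost: creds.AuthenticationEndpoint},
      },
    }
    return azidentity.NewClientCertificateCredential(creds.TenantID, creds.ClientID, certs, key, opts)
  }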

Why is this important?

  • As part of the msi-dataplane repository walk-through, a gap in the way ARO HCP is approaching authentication as managed identities for control plane components was found.  The gap was that we're not overriding the ActiveDirectoryAuthorityHost as requested by the MSI team when authenticating as a managed identity.  This prompted a wider discussion with HyperShift which led to the proposal here and allowing HyperShift to use the full nested credentials objects and leverage the fields they need within the struct. 

Scenarios

  1. The HyperShift Azure API supports only a secret name for each Control Plane MI (instead of a client ID and certificate name today).
  2. The Control Plane Operator, using the SecretsStore CSI Driver, will retrieve the NestedCredentialsObject from Azure Key Vault and mount it to a volume in any pod needing to authenticate with Azure Cloud.
  3. The Control Plane Operator, possibly through a parsing function from the library-go or msi-dataplane repo, will parse the ClientID, ClientSecret, AuthenticationEndpoint, and TenantID from the NestedCredentialsObject and either use or pass this data along to authenticate with ClientCertificate. This will be done for each control plane component needing to authenticate to Azure Cloud.
  4. Remove the filewatcher functionality from HyperShift and in OpenShift repos (CIO, CIRO, CNO/CNCC)

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. External - dependent on upstream communities accepting the changes needed to support ActiveDirectoryAuthorityHost in ClientCertificateCredentialOptions.
  2. External - dependent on Microsoft having the SDK ready prior to HyperShift's work on this epic.

Previous Work (Optional):

  1. Previous Microsoft work:
    1. https://github.com/Azure/msi-dataplane/pull/29
    2. https://github.com/Azure/msi-dataplane/pull/30 

Open questions:

  1. This information is retrieved from a 1P Microsoft application; to my knowledge, there is no way for HyperShift to test this in our current environments. Is that correct?
  2. Can HyperShift get a mock or real example of the JSON structure that would be stored in the Key Vault? (This would be used in development and unit testing, since we cannot retrieve a real version of this in our current test environments.)

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As an ARO HCP user, I want to be able to:

  • have the Secret Store CSI driver retrieve the NestedCredentialsObject from an Azure Key Vault based on the control plane component's secret name in the HyperShift Azure API for these control plane components: CAPZ, Cloud Provider, KMS, and CPO.
  • the corresponding HCP pod needing to authenticate with Azure Cloud reads the ClientID, ClientSecret, AuthenticationEndpoint, and TenantID from the NestedCredentialsObject and uses the data to authenticate with Azure Cloud

so that

  • the NestedCredentialsObject is mounted in a volume in the pod needing to authenticate with Azure Cloud
  • the ClientCertificate authentication is using the right fields needed for managed identities in ARO HCP.

Acceptance Criteria:

Description of criteria:

  • Upstream documentation 
  • Update all the SecretProviderClasses to pull from the new HyperShift Azure API field holding the secret name
  • Update each HyperShift HCP component to use UserAssignedIdentityCredentials
  • Remove the filewatcher functionality from HyperShift

Out of Scope:

Updating any external OpenShift components that run in the HCP

Engineering Details:

 This does not require a design proposal.
This does not require a feature gate.

As an ARO HCP user, I want to be able to:

  • have the Secret Store CSI driver retrieve the UserAssignedIdentityCredentials from an Azure Key Vault based on the control plane component's secret name in the HyperShift Azure API for these control plane components: CNO, CIRO, CSO, and CIO.
  • the corresponding HCP pod needing to authenticate with Azure Cloud can read the UserAssignedIdentityCredentials object from its mounted file path and use the data to authenticate with Azure Cloud

so that

  • the UserAssignedIdentityCredentials is mounted in a volume in the pod needing to authenticate with Azure Cloud

Acceptance Criteria:

Description of criteria:

  • Upstream documentation 
  • Update all the SecretProviderClasses to pull from the new HyperShift Azure API field holding the secret name
  • Update each HyperShift HCP component to use UserAssignedIdentityCredentials
  • Remove the filewatcher functionality from OpenShift repos (CIO, CIRO, CNO/CNCC)

Out of Scope:

Updating any HyperShift-only components that run in the HCP

Engineering Details:

 This does not require a design proposal.
This does not require a feature gate.

Summary

The installation process for the OpenShift Virtualization Engine (OVE) has been identified as a critical area for improvement to address customer concerns regarding its complexity compared to competitors like VMware, Nutanix, and Proxmox. Customers often struggle with disconnected environments, operator configuration, and managing external dependencies, making the initial deployment challenging and time-consuming. 

To resolve these issues, the goal is to deliver a streamlined, opinionated installation workflow that leverages existing tools like the Agent-Based Installer, the Assisted Installer, and the OpenShift Appliance (all sharing the same underlying technology) while pre-configuring essential operators and minimizing dependencies, especially the need for an image registry before installation.

By focusing on enterprise customers, particularly VMware administrators working in isolated networks, this effort aims to provide a user-friendly, UI-based installation experience that simplifies cluster setup and ensures quick time-to-value.

Objectives and Goals

Primary Objectives

  • Simplify the OpenShift Virtualization installation process to reduce complexity for enterprise customers coming from VMware vSphere.
  • Enable installation in disconnected environments with minimal prerequisites.
  • Eliminate the dependency on a pre-existing image registry in disconnected installations.
  • Provide a user-friendly, UI-driven installation experience for users used to VMware vSphere.

Goals

  • Deliver an installation experience leveraging existing tools like the Agent-Based Installer, Assisted Installer, and OpenShift Appliance, i.e. the Assisted Service.
  • Pre-configure essential operators for OVE and minimize external day 1 dependencies (see OCPSTRAT-1811 "Agent Installer interface to install Operators") 
  • Ensure successful installation in disconnected environments with standalone OpenShift, with minimal requirements and no pre-existing registry

Personas

Primary Audience 

VMware administrators transitioning to OpenShift Virtualization in isolated/disconnected environments.

Pain Points

  • Lack of UI-driven workflows; writing YAML files is a barrier for the target user (virtualization platform admins)
  • Complex setup requirements (e.g., image registries in disconnected environments).
  • Difficulty in configuring network settings interactively.
  • Lack of understanding of when to use a specific installation method
  • Difficulty finding the relevant installation method (in the docs or at console.redhat.com)

Technical Requirements

Image Registry Simplification

  • Eliminate the dependency on an existing external image registry for disconnected environments.
  • Support a workflow similar to the OpenShift Appliance model, where users can deploy a cluster without external dependencies.

Agent-Based Installer Enhancements

  • Extend the existing UI to capture all essential data points (e.g., cluster details, network settings, storage configuration) without requiring YAML files.
  • Install without a pre-existing registry in disconnected environment
  • Install required operators for virtualization
  • OpenShift Virtualization Reference Implementation Guide v1.0.2
  • List of Operators:
    • OpenShift Virtualization Operator
    • Machine and Node Configuration
    • Machine Config Operator
    • Node Health Check Operator
    • Fence Agents Remediation Operator
    • Additional Operators
    • Node Maintenance Operator
    • OpenShift Logging
    • MetalLB
    • Migration Toolkit for Virtualization
    • Migration Toolkit for Containers
    • Compliance Operator
    • Kube Descheduler Operator
    • NUMA Resources Operator
    • Ansible Automation Platform Operator
    • Network
    • NMState Operator
    • Node Failure
    • Self Node Remediation Operator
    • Disaster Recovery
    • OADP
    • ODF
  • Note: we need each operator owner to enable their operator so it can be installed via the installer. We won't block the release if the full list of operators is not yet included; operators will be added as required and prioritized with each team.

User experience requirements

The first area of focus is a disconnected environment. We target these environments with the Agent-Based Installer.

The current docs for installing in disconnected environments are very long and hard to follow.

Installation without pre-existing Image Registry

In disconnected installations, an image registry is currently required before the installation process can start. We must simplify this so that users can start the installation with a single image, without having to explicitly set up a registry first.

This isn't a new requirement: in the past we've analyzed options for this and even did a POC. We could revisit this point; see Deploy OpenShift without external registry in disconnected environments.

The OpenShift Appliance can in fact be installed without a registry. 

Additionally, we started work in this direction AGENT-262 (Strategy to complete installations where there isn't a pre-existing registry).

We also had the field (Brandon Jozsa) doing a POC which was promising:

https://gist.github.com/v1k0d3n/cbadfb78d45498b79428f5632853112a 

User Interface (no configuration files)

Users coming from VMware vSphere expect a UI. They aren't used to writing YAML files, and this has been identified as a blocker for some of them. We must provide a simple UI to stand up a cluster.

Proposed Workflow 

https://miro.com/app/board/uXjVLja4xXQ=/ 

PRD and notes from regular meetings

Epic Goal

  • Set up a workflow to generate an ISO that contains all the relevant pieces to install an OVE cluster

Why is this important?

  • As per OCPSTRAT-1874, the user must be able to install an OVE cluster into a disconnected environment, with the help of a UI, and without being explicitly required to set up an external registry

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

 

Previous work:

Dependencies (internal and external)

  1. ...

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Recently the appliance allowed using an internal registry (see https://github.com/openshift/appliance/pull/349).

Modify the script to use that (instead of the external one), and test the installation workflow.

 

Currently the builder script embeds the agent-setup-tui.service in the ignition files, but places the script itself directly in the ISO. For consistency, the script should also be placed inside the ISO ignition.

Feature goal (what are we trying to solve here?)

The OpenShift Virtualization team marked the following operators (see https://issues.redhat.com/browse/OCPSTRAT-1874 and https://issues.redhat.com/browse/RFE-6327):

 

Node Health Check Operator
Fence Agents Remediation Operator
Node Maintenance Operator
Migration Toolkit for Virtualization
Kube Descheduler Operator
NMState Operator
Self Node Remediation Operator

 

as the MVP operators for the virtualization bundle.

MTV and k8s nmstate are already supported by AI.

 

DoD (Definition of Done)

The operators can be installed in AI and are part of the virtualization bundle

Does it need documentation support?

yes

Feature origin (who asked for this feature?)

  • Internal request

      • According to the OpenShift Virtualization team, these are operators that OpenShift Virtualization users will probably need. Fabian Deutsch 

Reasoning (why it’s important?)

  • Allow users to easily migrate to OpenShift Virtualization with minimal day-2 operator setup.
  • The request is to extend AI to include these components by default when virtualization is selected, in order to be more welcoming to new adopters and lower the entrance barrier.
    In other words, make the installer more opinionated in order to provide an ideal virt platform out of the box, if the admin indicated that it will be a virt platform.

Competitor analysis reference

  • Do our competitors have this feature?
    • No idea. Need to check

Feature usage (do we have numbers/data?)

  • We have no data - the feature doesn’t exist anywhere

Feature availability (why should/shouldn't it live inside the UI/API?)

  • The feature should have a UI (the virtualization bundle is already there, it just has to be extended).
  • There is no need to be able to install the whole bundle via the API (the API is for advanced users while this feature is for beginners).

Feature Overview (aka. Goal Summary)  

Once CCM was moved out-of-tree for Azure, the 'azurerm_user_assigned_identity' resource the Installer creates is no longer required. To make sure the Installer only creates the minimum permissions required to deploy OpenShift on Azure, this resource, created at install time, needs to be removed.

Goals (aka. expected user outcomes)

The installer no longer creates the 'azurerm_user_assigned_identity' resource, which is no longer required for the Nodes.

Requirements (aka. Acceptance Criteria)

The Installer only creates the minimum permissions required to deploy OpenShift on Azure

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Background

Once CCM was moved out-of-tree, this permission is no longer required. We are implementing this change in 4.19 and backporting it to 4.18.z.

At the same time, for customers running previous OpenShift releases, we will test upgrades between EUS releases (4.14.z - 4.16.z - 4.18.z) where the `azurerm_user_assigned_identity` resource has already been removed, to ensure the upgrade process works with no issues and OpenShift does not report any issues because of this change.

Customer Considerations

A KCS will be created for customers running previous OpenShift releases who want to remove this resource

Documentation Considerations

The new permissions requirements will be documented

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Remove automatic (opinionated) creation (and attachment) of identities to Azure nodes
  • Allow API to configure identities for nodes

Why is this important?

  • Creating and attaching identities to nodes requires elevated permissions
  • The identities are no longer required (or used) so we can reduce the required permissions

Scenarios

  1. Users want to do a default ipi install that just works without the User Access Admin role
  2. Users want to BYO user-assigned identity (requires some permissions)
  3. Users want to use a system assigned identity

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

Enable OpenShift to be deployed on Confidential VMs on GCP using AMD SEV-SNP technology

Goals (aka. expected user outcomes)

Users deploying OpenShift on GCP can choose to deploy Confidential VMs using AMD SEV-SNP technology to rely on confidential computing to secure the data in use

Requirements (aka. Acceptance Criteria):

As a user, I can choose OpenShift Nodes to be deployed with the Confidential VM capability on GCP using AMD SEV-SNP technology at install time

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Background

This is a piece of a higher-level effort to secure data in use with OpenShift on every platform

Documentation Considerations

Documentation on how to use this new option must be added as usual

Epic Goal

  • Add support to deploy Confidential VMs on GCP using AMD SEV-SNP technology

Why is this important?

  • As part of the Zero Trust initiative we want to enable OpenShift to support data in use protection using confidential computing technologies

Scenarios

  1. As a user I want all my OpenShift Nodes to be deployed as Confidential VMs on Google Cloud using SEV-SNP technology

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Previous Work (Optional):

  1. We enabled Confidential VMs for GCP using SEV technology already - OCPSTRAT-690

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Goal

Add Nutanix platform integration support to the Agent-based Installer

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

Implement Migration core for MAPI to CAPI for AWS

  • This feature covers the design and implementation of converting from using the Machine API (MAPI) to Cluster API (CAPI) for AWS
  • This Design investigates possible solutions for AWS
  • Once AWS shim/sync layer is implemented use the architecture for other clouds in phase-2 & phase 3

Acceptance Criteria

When customers use CAPI, there must be no negative effect from switching over to using CAPI: migration of Machine resources must be seamless, and the fields in MAPI/CAPI should reconcile from both CRDs.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • To bring MAPI and CAPI to feature parity and unblock conversions between MAPI and CAPI resources

Why is this important?

  • Blocks migration to Cluster API

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

When converting CAPI2MAPI, we convert CAPA's `AdditionalSecurityGroups` into the security groups for MAPA. While this looks correct, there are also fields like `SecurityGroupOverrides` which, when present, currently cause an error.

We need to understand how security groups work today in MAPA, compare that to CAPA, and be certain that we are correctly handling the conversion here.

Is CAPA doing anything else under the hood? Is it currently applying extra security groups that are standard that would otherwise cause issues?
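
A minimal sketch of the conversion behaviour described above, using hypothetical local types in place of the real CAPA and MAPA structs; erroring on SecurityGroupOverrides mirrors the current behaviour and is not a final decision:

package convert

import "fmt"

// capaSecurityGroups is a hypothetical stand-in for the CAPA fields discussed above.
type capaSecurityGroups struct {
	AdditionalSecurityGroups []string          // extra security group IDs attached to the machine
	SecurityGroupOverrides   map[string]string // per-role overrides, currently unsupported in conversion
}

// mapaSecurityGroup is a hypothetical stand-in for the MAPA security group reference.
type mapaSecurityGroup struct {
	ID string
}

// capi2mapiSecurityGroups converts the CAPA additional security groups into the
// MAPA security group list and errors when overrides are present, since there is
// no agreed MAPA equivalent yet.
func capi2mapiSecurityGroups(in capaSecurityGroups) ([]mapaSecurityGroup, error) {
	if len(in.SecurityGroupOverrides) > 0 {
		return nil, fmt.Errorf("securityGroupOverrides are not supported for conversion")
	}
	out := make([]mapaSecurityGroup, 0, len(in.AdditionalSecurityGroups))
	for _, id := range in.AdditionalSecurityGroups {
		out = append(out, mapaSecurityGroup{ID: id})
	}
	return out, nil
}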

Steps

  • Understand how security groups work in CAPA and MAPA
  • Determine if our current conversion of security groups is appropriate and understand the role of securityGroupOverrides
  • Update documentation/make appropriate changes to the security groups conversion based on the above findings.

Stakeholders

  • Cluster infra

Definition of Done

  • We are confident that converted machines behave correctly with respect to the security group configuration.
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

Why is this important?

  • We need to build out the core so that development of the migration for individual providers can then happen in parallel
  •  

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Background

Presently, the mapi2capi and capi2mapi code cannot handle translations of owner references.

We need to be able to map CAPI/MAPI machines to their correct CAPI/MAPI MachineSet/CPMS and have the owner references correctly set.

This requires identifying the correct owner and determining the correct UID to set.

This will likely mean extending the conversion utils to be able to make API calls to identify the correct owners.

Owner references for non-MachineSet types should still cause an error.
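
A minimal sketch of how the conversion code could resolve the owner UID with an API call, assuming a controller-runtime client is made available to the conversion utils (how that client is wired in is exactly the open question in the steps below):

package convert

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/utils/ptr"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// convertMachineSetOwnerRef rewrites a MachineSet owner reference so that it
// points at the equivalent MachineSet (same name) in the target API group and
// namespace, resolving the UID with an API call. Non-MachineSet owners error.
func convertMachineSetOwnerRef(ctx context.Context, c client.Client, in metav1.OwnerReference, targetGVK schema.GroupVersionKind, targetNamespace string) (metav1.OwnerReference, error) {
	if in.Kind != "MachineSet" {
		return metav1.OwnerReference{}, fmt.Errorf("unsupported owner kind %q", in.Kind)
	}

	owner := &unstructured.Unstructured{}
	owner.SetGroupVersionKind(targetGVK)
	if err := c.Get(ctx, client.ObjectKey{Namespace: targetNamespace, Name: in.Name}, owner); err != nil {
		return metav1.OwnerReference{}, fmt.Errorf("looking up owning MachineSet: %w", err)
	}

	return metav1.OwnerReference{
		APIVersion:         targetGVK.GroupVersion().String(),
		Kind:               targetGVK.Kind,
		Name:               owner.GetName(),
		UID:                owner.GetUID(),
		Controller:         ptr.To(true),
		BlockOwnerDeletion: ptr.To(true),
	}, nil
}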

Steps

  • Add a client to the conversion util constructors (members of the conversion structs?)
  • OR handle this outside of the conversion library?
  • Work out the correct way to convert MachineSet/CPMS owner references between namespaces

Stakeholders

  • Cluster Infra

Definition of Done

  • Owner references are correctly converted between MAPI and CAPI machines
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

Background

To enable CAPI MachineSets to still mirror MAPI MachineSets accurately, and to enable MAPI MachineSets to be implemented by CAPI MachineSets in the future, we need to implement a way to convert CAPI Machines back into MAPI Machines.

These steps assume that the CAPI Machine is authoritative, or that there is no MAPI Machine; a minimal sketch of the resulting flow follows the Behaviours list below.

Behaviours

  • If no Machine exists in MAPI
    • But the CAPI Machine is owned, and that owner exists in MAPI
      • Create a MAPI Machine to mirror the CAPI Machine
      • MAPI Machines should set authority to CAPI on create
  • If a MAPI Machine exists
    • Convert infrastructure template from InfraMachine to providerSpec
    • Update spec and status fields of MAPI Machine to reflect CAPI Machine
  •  On failures
    • Set Synchronized condition to False and report error on MAPI resource
  • On success
    • Set Synchronized condition to True on MAPI resource
    • Set status.synchronizedGeneration to match the auth resource generation
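
A minimal, self-contained sketch of the flow described in the Behaviours list; the machine type and its fields are hypothetical stand-ins for the real CAPI/MAPI resources and conditions handled by the sync controller:

package sync

// machine is a hypothetical stand-in for either a CAPI or MAPI machine,
// carrying only the fields needed to illustrate the behaviours above.
type machine struct {
	Name         string
	Authority    string // "CAPI" or "MachineAPI"
	Synchronized bool
	SyncError    string
	SyncedGen    int64
	Generation   int64
	OwnerInMAPI  bool
}

// mirrorCAPIMachine applies the create/update rules to a CAPI machine and its
// (possibly nil) MAPI mirror, returning the resulting MAPI mirror.
func mirrorCAPIMachine(capi *machine, mapi *machine) *machine {
	if mapi == nil {
		if !capi.OwnerInMAPI {
			return nil // no MAPI owner: nothing to mirror
		}
		// Create a mirror that is CAPI-authoritative from the start.
		mapi = &machine{Name: capi.Name, Authority: "CAPI"}
	}

	// Convert the infrastructure template to a providerSpec and copy spec/status
	// (elided here), then report the outcome on the Synchronized condition.
	if err := convertAndCopy(capi, mapi); err != nil {
		mapi.Synchronized = false
		mapi.SyncError = err.Error()
		return mapi
	}
	mapi.Synchronized = true
	mapi.SyncedGen = capi.Generation
	return mapi
}

// convertAndCopy is a placeholder for the CAPI to MAPI conversion library call.
func convertAndCopy(capi, mapi *machine) error {
	return nil
}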

Steps

  • Implement conversion based on the behaviours outlined above using the CAPI to MAPI conversion library

Stakeholders

  • Cluster Infra

Definition of Done

  • When a CAPI MachineSet scales up and is mirrored in MAPI, the CAPI Machine gets mirrored into MAPI
  • Docs
  • <Add docs requirements for this card>
  • Testing
  • <Explain testing that will be added>

PERSONAS:

The following personas are borrowed from Hypershift docs used in the user stories below.

  • Cluster service consumer: The user empowered to request control planes, request workers, and drive upgrades or modify externalized configuration. Likely not empowered to manage or access cloud credentials or infrastructure encryption keys. In the case of managed services, this is someone employed by the customer.
  • Cluster service provider: The user hosting cluster control planes, responsible for up-time. UI for fleet wide alerts, configuring AWS account to host control planes in, user provisioned infra (host awareness of available compute), where to pull VMs from. Has cluster admin management. In the case of managed services, this persona represents Red Hat SRE.

USER STORY:

  • As a cluster service consumer, I want to provision hosted control planes and clusters without the Image Registry, so that my hosted clusters do not contain resources from a component I do not use (such as workloads, storage accounts, pull secrets, etc.), which allows me to save on computing resources.
  • As a cluster service provider, I want users to be able to disable the Image Registry so that I don't need to maintain hosted control plane components that users don't care about.

ACCEPTANCE CRITERIA:

What is "done", and how do we measure it? You might need to duplicate this a few times.
 
Given a
When  b
Then  c
 
CUSTOMER EXPERIENCE:

Only fill this out for Product Management / customer-driven work. Otherwise, delete it.

  • Does this feature require customer facing documentation? YES/NO
    • If yes, provide the link once available
  • Does this feature need to be communicated with the customer? YES/NO
      • How far in advance does the customer need to be notified?
      • Ensure PM signoff that communications for enabling this feature are complete
  • Does this feature require a feature enablement run (i.e. feature flags update) YES/NO
    • If YES, what feature flags need to change?
      • FLAG1=valueA
    • If YES, is it safe to bundle this feature enablement with other feature enablement tasks? YES/NO

 

BREADCRUMBS:

Where can SREs look for additional information? Mark with "N/A" if these items do not exist yet so Functional Teams know they need to create them.

  • ADR: a
  • Design Doc: b
  • Wiki: c
  • Similar Work PRs: d
  • Subject Matter Experts: e
  • PRD: f

 

NOTES:

If there's anything else to add.

User Story

As a hypershift CLI user, I want to be able to disable the image registry capability when creating hosted clusters via `hypershift create cluster`.

Definition of Done

Mark with an X when done; strikethrough for non-applicable items. All items
must be considered before closing this issue.

[ ] Ensure all pull request (PR) checks, including ci & e2e, are passing
[ ] Document manual test steps and results
[ ] Manual test steps executed by someone other than the primary implementer or a test artifact such as a recording are attached
[ ] All PRs are merged
[ ] Ensure necessary actions to take during this change's release are communicated and documented
[ ] Troubleshooting Guides (TSGs), ADRs, or other documents are updated as necessary

TBD

GROOMING CHECKLIST:

You can find out more information about ARO workflow, including roles and responsibilities here. Some items in the list should be left for Team Leads (TL) and Region Leads (RL) to perform. Otherwise, all other fields should be populated.

  1. Size and Scope - this Epic can be accomplished by a single Functional Team (ideally in under a quarter)
  2. Component has been set to `ARO`
  3. Labels have been set appropriately for the type of work (only use one):
    1. Product Management (customer-driven) work should always include `mcsp-aro` and `Ready4TLGrooming`
    2. Toil mitigation (SRE-Driven) work should always include `shift-improvement` and `Ready4RLGrooming`
    3. If the shift-improvement item is appropriate for new team members, add label 'Good-First-Item'
  4. Links for Parents, Blockers, and/or Dependencies have been set for Epics/Issues, including those that must complete before work on this item can begin
  5. The below template is complete for User Story, Acceptance Criteria, Customer Experience, and Breadcrumbs
  6. (TL/RL only) The FixVersion has been set to `SREPYYYYQX` for the expected release period for the Epic
  7. (TL/RL only) If applicable, set a Target End Date for items that have a promised delivery date
  8. (TL/RL only) The Priority has been set
  9. (TL/RL only) When all above items are complete, the `team-X-backlog` and `Ready4FLGrooming` labels have been added, the `Ready4TLGrooming` or `Ready4RLGrooming` label has been removed, and this checklist has been deleted from the Epic

USER STORY:

What are we attempting to achieve? You might need to duplicate this a few times.

As a/an a
I want  b
So that  c

 
ACCEPTANCE CRITERIA:

What is "done", and how do we measure it? You might need to duplicate this a few times.
 
Given a
When  b
Then  c
 
CUSTOMER EXPERIENCE:

Only fill this out for Product Management / customer-driven work. Otherwise, delete it.

  • Does this feature require customer facing documentation? YES/NO
    • If yes, provide the link once available
  • Does this feature need to be communicated with the customer? YES/NO
      • How far in advance does the customer need to be notified?
      • Ensure PM signoff that communications for enabling this feature are complete
  • Does this feature require a feature enablement run (i.e. feature flags update) YES/NO
    • If YES, what feature flags need to change?
      • FLAG1=valueA
    • If YES, is it safe to bundle this feature enablement with other feature enablement tasks? YES/NO

 

BREADCRUMBS:

Where can SREs look for additional information? Mark with "N/A" if these items do not exist yet so Functional Teams know they need to create them.

  • ADR: a
  • Design Doc: b
  • Wiki: c
  • Similar Work PRs: d
  • Subject Matter Experts: e
  • PRD: f

 

NOTES:

If there's anything else to add.

Incomplete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

All images using cachito on Brew should also work with cachi2 on Konflux. https://issues.redhat.com/browse/ART-11902 outlines the ART automation that will support these changes, but ARTists can start testing by adding the annotations to the PipelineRun directly. 

 

If an image build fails on Konflux in a way that requires changes to the Dockerfile, an OCPBUGS ticket should be raised. The process doc (which is attached to this ticket) should also be attached to the bug ticket. ARTists will work with the image owners to hash out any issues until the image builds successfully on both Konflux and Brew.

Feature goal (what are we trying to solve here?)

CAPI Agent Control Plane Provider and CAPI Bootstrap Provider will provide an easy way to install clusters through CAPI.

 

Those providers will not be generic OpenShift providers, as they are geared towards Bare Metal. They will leverage the Assisted Installer ZTP flow and will benefit BM users by avoiding the need to provision a bootstrap node (as opposed to a regular OpenShift install, where the bootstrap node is required), while complying better with the CAPI interface.

milestones:

  • [spike] create PoC with full provisioning flow
    • install control plane nodes (SNO, compact)
    • add workers
    • coupled with metal3

 

  • production ready code
    • review naming/repository to be hosted etc
    • review project structure
    • review

DoD (Definition of Done)

  • Can install a new cluster via CAPI by using the OpenShift Agent Control Plane provider
  • Can upgrade a managed cluster via CAPI
  • Can decommission a managed cluster via CAPI
  • Support scaling/autoscaling

Does it need documentation support?

Yes

Feature origin (who asked for this feature?)

  • A Customer asked for it

    • Sylva
    • ...

Reasoning (why it’s important?)

  • ..

Competitor analysis reference

  • Do our competitors have this feature?
    • Yes, they have it and we can have some reference
    • No, it's unique or explicit to our product
    • No idea. Need to check

Feature usage (do we have numbers/data?)

  • ..

We should leverage the on-prem data collection system to identify when a cluster has been installed with the CAPI provider.

Feature goal (what are we trying to solve here?)

Deprecate high_availability_mode as it was replaced by control_plane_count

DoD (Definition of Done)

high_availability_mode is no longer used in our code

Does it need documentation support?

Yes

Feature origin (who asked for this feature?)

  • Catching up with OpenShift

Reasoning (why it’s important?)

  • It no longer has any meaning once control_plane_count was introduced

Epic goal

When an Assisted Service SaaS user creates a new OpenShift cluster, the nmstate operator should be installed, bundled together with other operators, when the virtualization capability is requested. Operator bundling is out of the scope of this epic; it only covers enabling nmstate operator installation in the assisted installer, without any UI support.

Why is this important?

nmstate operator is one of the enablers of virtualization platforms

Scenarios

  1. When a RH cloud user logs into console.redhat SaaS, they can leverage the Assisted Service SaaS flow to create a new cluster
  2. During the Assisted Service SaaS create flow, a RH cloud user can see a list of available capabilities (i.e. operator bundles) that they may want to install at the same time as the cluster is created.
  3. An option is offered to check a box next to "Virtualization Platform".
  4. The RH cloud user can read a tool-tip or info-box with short description of the capability and click a link for more details to review documentation.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Ensure nmstate release channel can automatically deploy the latest x.y.z without needing any DevOps/SRE intervention
  • Ensure nmstate release channel can be updated quickly (if not automatically) to ensure the later release x.y can be offered to the cloud user.

 Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Assisted installer should not define any resource requirements for the operators unless specifically stated in their official installation instructions.

Following our migration to konflux in MGMT-18343, we will use this epic for future tasks related to konflux.

More and more tasks are becoming mandatory in the Konflux pipeline.

Konflux used to have an automation that opened PRs to add those tasks. It seems it's not triggered anymore, so we have to add those tasks manually.

As of today, this raises a warning in the IntegrationTest pipeline that is very likely not seen by anyone. (The build pipeline does not raise any warning.)

In the short term we have to add those tasks to all pipelines (maybe only the product one? I haven't checked):

  • One of "sast-coverity-check", "sast-coverity-check-oci-ta" tasks is missing and will be required on 2025-04-01T00:00:00Z
  • One of "sast-shell-check", "sast-shell-check-oci-ta" tasks is missing and will be required on 2025-04-01T00:00:00Z
  • One of "sast-unicode-check", "sast-unicode-check-oci-ta" tasks is missing and will be required on 2025-04-01T00:00:00Z

In the long term, if we can't get the Konflux PR automation back, we should have some automation that detects the warning and informs us that we have to update the pipelines.

Slack thread: https://redhat-internal.slack.com/archives/C04PZ7H0VA8/p1741091688194839 
PR example: https://github.com/openshift/assisted-service/pull/7358 

Goal

Add support for syncing CA bundle to the credentials generated by Cloud Credential Operator.

Why is this important?

It is generally necessary to provide a CA file to OpenStack clients in order to communicate with a cloud that uses self-signed certificates. The cloud-credential-operator syncs clouds.yaml files to various namespaces so that services running in those namespaces are able to communicate with the cloud, but it does not sync the CA file. Instead, this must be managed using another mechanism. This has led to some odd situations, such as the Cinder CSI driver operator inspecting cloud-provider configuration to pull out this file.

We should start syncing not only the clouds.yaml file but also the CA file to anyone that requests it via a CredentialsRequest. Once we've done this, we can modify other components such as the Installer, CSI Driver Operator, Hypershift, and CCM Operator to pull the CA file from the same secrets that they pull the clouds.yaml from, rather than the litany of places they currently use.
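
A minimal sketch of the syncing idea, assuming the root secret keys are "clouds.yaml" and a hypothetical "cacert" key for the CA bundle; the real cloud-credential-operator actuator logic is more involved:

package sync

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// syncOpenStackCredentials copies the clouds.yaml and (if present) the CA
// bundle from the root secret in kube-system into the secret requested by a
// CredentialsRequest in the target namespace.
func syncOpenStackCredentials(ctx context.Context, c client.Client, targetNamespace, targetName string) error {
	root := &corev1.Secret{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: "kube-system", Name: "openstack-credentials"}, root); err != nil {
		return err
	}

	target := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Namespace: targetNamespace, Name: targetName},
		Data:       map[string][]byte{"clouds.yaml": root.Data["clouds.yaml"]},
	}
	if ca, ok := root.Data["cacert"]; ok { // the "cacert" key name is an assumption
		target.Data["cacert"] = ca
	}

	// Create-or-update handling is elided; a real controller would use
	// server-side apply or CreateOrUpdate from controller-runtime.
	return c.Create(ctx, target)
}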

Scenarios

  • As a deployer, I should be able to update all cloud credential-related information - including certificates - in one central place and see these rolled out to all components that require them.

Acceptance Criteria

  • The cloud-credential-operator is capable of consuming a CA cert from kube-system / openstack-credentials and rolling this out to the secrets in other namespaces
  • The installer includes the CA cert in the root kube-system / openstack-credentials secret
  • The UPI playbooks are modified to include the CA cert in the root kube-system / openstack-credentials secret
  • No regressions. Since we use self-signed certificates in many of our CI systems, we should see regressions early.
  • Release notes and credential rotation documentation is updated to document this change

Dependencies (internal and external)

None.

Previous Work (Optional):

None.

Open questions::

None.

The Installer creates the initial version of the root credential secret at kube-system / openstack-credentials, which cloud-credential-operator (CCO) will consume. Once we have support in CCO for consuming a CA cert from this root credential, we should modify the Installer to start populating the CA cert field. We should also stop adding the CA cert to the openshift-cloud-controller-manager / cloud-conf config map since the Cloud Config Operator (and CSI Drivers) will be able to start consuming the CA cert from the secret instead. This may need to be done separately depending on the order that patches land in.

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?   Does it improve security, performance, supportability, etc?  Why is work a priority?

Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Other 

Feature gates must demonstrate completeness and reliability.

As per https://github.com/openshift/api?tab=readme-ov-file#defining-featuregate-e2e-tests:

  1. Tests must contain either [OCPFeatureGate:<FeatureGateName>] or the standard upstream [FeatureGate:<FeatureGateName>].
  2. There must be at least five tests for each FeatureGate.
  3. Every test must be run on every TechPreview platform we have jobs for. (Ask for an exception if your feature doesn't support a variant.)
  4. Every test must run at least 14 times on every platform/variant.
  5. Every test must pass at least 95% of the time on every platform/variant.

If your FeatureGate lacks automated testing, there is an exception process that allows QE to sign off on the promotion by commenting on the PR.

The introduced functionality is not that complex. The only newly introduced ability is to modify the CVO log level using the API. However, we should still introduce an e2e test or tests to demonstrate that the CVO correctly reconciles the new configuration API. 

The tests may be (a minimal sketch of one such check follows this list):

  • Check whether the CVO notices a new configuration in a reasonable time.
  • Check whether the CVO increments the observedGeneration correctly.
  • Check whether the CVO changes its log level correctly.
  • TODO: Think of more cases.
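
A minimal Ginkgo sketch of one such check, where the feature gate name, the group/version/kind, and the k8sClient variable are placeholders and assumptions rather than the final API:

package e2e

import (
	"context"
	"time"

	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/gomega"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// k8sClient is assumed to be initialised by the test suite setup.
var k8sClient client.Client

var _ = ginkgo.Describe("[OCPFeatureGate:ClusterVersionOperatorConfiguration] CVO configuration", func() {
	ginkgo.It("is reconciled in a reasonable time", func(ctx context.Context) {
		// The GVK and resource name below are placeholders for the new CVO
		// configuration API; adjust to the real API once it is finalised.
		cfg := &unstructured.Unstructured{}
		cfg.SetGroupVersionKind(schema.GroupVersionKind{
			Group: "operator.openshift.io", Version: "v1alpha1", Kind: "ClusterVersionOperator",
		})

		gomega.Eventually(func(g gomega.Gomega) {
			g.Expect(k8sClient.Get(ctx, client.ObjectKey{Name: "cluster"}, cfg)).To(gomega.Succeed())
			observed, found, err := unstructured.NestedInt64(cfg.Object, "status", "observedGeneration")
			g.Expect(err).NotTo(gomega.HaveOccurred())
			g.Expect(found).To(gomega.BeTrue())
			// Observing the latest spec generation implies the CVO picked up the
			// new configuration (including a log level change).
			g.Expect(observed).To(gomega.Equal(cfg.GetGeneration()))
		}).WithTimeout(2 * time.Minute).WithPolling(5 * time.Second).Should(gomega.Succeed())
	})
})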

Definition of Done:

  • e2e test/s exists to ensure that the CVO is correctly reconciling the new configuration API

Other Complete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled

Description of problem:

Azure creates a NIC in the "provisioning failed" state, and the code is not checking the provisioning status.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

100%

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

https://github.com/openshift/machine-api-provider-azure/blob/main/pkg/cloud/azure/actuators/machine/reconciler.go

https://pkg.go.dev/github.com/Azure/azure-sdk-for-go@v68.0.0+incompatible/services/network/mgmt/2021-02-01/network#InterfacePropertiesFormat
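
A minimal sketch of the missing check, using the SDK package linked above; where exactly this belongs in the reconciler and how failed NICs should be handled are left open:

package azure

import (
	"fmt"

	"github.com/Azure/azure-sdk-for-go/services/network/mgmt/2021-02-01/network"
)

// checkNICProvisioningState returns an error when the network interface did
// not finish provisioning successfully (e.g. it is in a failed state), instead
// of treating it as usable.
func checkNICProvisioningState(nic network.Interface) error {
	if nic.InterfacePropertiesFormat == nil {
		return fmt.Errorf("network interface has no properties")
	}
	if state := string(nic.ProvisioningState); state != "Succeeded" {
		return fmt.Errorf("network interface %q provisioning state is %q, expected Succeeded", derefString(nic.Name), state)
	}
	return nil
}

func derefString(s *string) string {
	if s == nil {
		return ""
	}
	return *s
}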

Description of problem:

cluster with custom endpoints, fail to ssh to the created bastion and master vm

Version-Release number of selected component (if applicable):

4.19 pre-merge 
main@de563b96, merging: #9523 f1119b4a, #9397 487587cf, #9385 e365e12c    

How reproducible:

Always    

Steps to Reproduce:

1. create install-config with a custom service endpoint
  serviceEndpoints:
  - name: COS
    url: https://s3.direct.jp-tok.cloud-object-storage.appdomain.cloud
2. create the cluster
3.
    

Actual results:

create the cluster failed.
ssh to the bootstrap and master vm failed

Expected results:

create the cluster succeed.    

Additional info:
the VNC console of ci-op-lgk38x3xaa049-hk2z5-bootstrap:

Mar 05 11:36:34 ignition[783]: error at $.ignition.config.replace.source, line 1 col 1542: unable to parse url
Mar 05 11:36:34 ignition[783]: error at $.ignition.config.replace.httpHeaders, line 1 col 50: unable to parse url
Mar 05 11:36:34 ignition[783]: failed to fetch config: config is not valid
Mar 05 11:36:34 ignition[783]: failed to acquire config: config is not valid
Mar 05 11:36:34 systemd[1]: ignition-fetch-offline.service: Main process exited, code=exited, status=1/FAILURE
Mar 05 11:36:34 ignition[783]: Ignition failed: config is not valid
Mar 05 11:36:34 systemd[1]: ignition-fetch-offline.service: Failed with result 'exit-code'.
Mar 05 11:36:34 systemd[1]: Failed to start Ignition (fetch-offline).
Mar 05 11:36:34 systemd[1]: ignition-fetch-offline.service: Triggering OnFailure dependencies.
Generating "/run/initramfs/rdsosreport.txt"

the VNC console of ci-op-lgk38x3xaa049-hk2z5-master-0:

[ 2284.471078] ignition[840]: GET https://api-int.ci-op-lgk38x3xaa049.private-ibmcloud-1.qe.devcluster.openshift.com:22623/config/master: attempt #460
[ 2284.477585] ignition[840]: GET error: Get "https://api-int.ci-op-lgk38x3xaa049.private-ibmcloud-1.qe.devcluster.openshift.com:22623/config/master": EOF

Description of the problem:

Created a cluster booted from iSCSI multipath.
When the node was discovered, the mpath nodes were up.

I changed one of the paths to offline by adding blackhole routing:
ip route add blackhole 192.168.145.1/32

The disk validation caught it, but the message exposes an internal function name.

Disk is not eligible for installation.

  • iSCSI disk is not in running state
  • Cannot parse iSCSI host IP : ParseAddr(""): unable to parse IP

Looks like the address that is set named ->  
Iface IPaddress: [default]

Probably we will have to change the validation message to something more general like:
"iSCSI IPv4 address is not routable" (and not ParseAddr)

 

How reproducible:

 

Steps to reproduce:

1.

2.

3.

Actual results:

 

Expected results:

Description of problem:

Console shows a time out error when trying to edit a deployment with the annotation `image.openshift.io/triggers: ''`.

Version-Release number of selected component (if applicable):

4.12

How reproducible:

Always

Steps to Reproduce:

1. Install a 4.12 cluster
2. Create a deployment with annotation `image.openshift.io/triggers: ''` 
3. Select edit deployment in console
4. Console gives time out error

Actual results:

Console gives time out error

Expected results:

Console should be able to handle bad values

Additional info:

The issue is observed when checking from the Actions section: Deployment -> <name_of_deployment> -> Actions -> Edit Deployment. The page gives the error "Oh no! Something went wrong" when the annotation is present. When the annotation is removed, the deployment is shown.

This fix updates OpenShift 4.19 to Kubernetes v1.32.3, incorporating the latest upstream changes and fixes.

For details on the changes included in this update, see the Kubernetes changelog:

https://github.com/kubernetes/kubernetes/blob/release-1.32/CHANGELOG/CHANGELOG-1.32.md#changelog-since-v1322

Description of problem:

platform.powervs.clusterOSImage is still required and should not be removed from the install-config    

Version-Release number of selected component (if applicable):

4.19.0    

Steps to Reproduce:

    1. Specify OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE and try to deploy
    2. The deploy does not use the override value
    

Actual results:

The value of platform.powervs.clusterOSImage will be ignored.    

Expected results:

The deploy uses the overridden value of OS_IMAGE_OVERRIDE    

Additional info:

    

Description of problem:

when trying to use the ImageSetConfig as described below i see that oc-mirror gets killed abruptly.
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
   channels:
   - name: stable-4.16    # Version of OpenShift to be mirrored
     minVersion: 4.16.30  # Minimum version of OpenShift to be mirrored
     maxVersion: 4.16.30  # Maximum version of OpenShift to be mirrored
     shortestPath: true
     type: ocp
   graph: true
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.16
    full: false
  - catalog: registry.redhat.io/redhat/certified-operator-index:v4.16
    full: false
  - catalog: registry.redhat.io/redhat/community-operator-index:v4.16
    full: false
  helm: {}

    

Version-Release number of selected component (if applicable):

     4.18
    

How reproducible:

    Always
    

Steps to Reproduce:

    1. Use the ImageSetConfig as above
    2. Run command `oc-mirror -c /tmp/config.yaml file://test --v2`
    3.
    

Actual results:

    The oc-mirror command gets killed even with about 24GB of RAM and a 12-core CPU; for some customers it never worked even with 64GB of RAM.

2025/03/03 10:40:01  [INFO]   : :mag: collecting operator images...
2025/03/03 10:40:01  [DEBUG]  : [OperatorImageCollector] setting copy option o.Opts.MultiArch=all when collecting operator images
2025/03/03 10:40:01  [DEBUG]  : [OperatorImageCollector] copying operator image registry.redhat.io/redhat/redhat-operator-index:v4.16
     (24s) Collecting catalog registry.redhat.io/redhat/redhat-operator-index:v4.16 
2025/03/03 10:40:26  [DEBUG]  : [OperatorImageCollector] manifest 2be15a52aa4978d9134dfb438e51c01b77c9585578244b97b8ba1d4f5e6c0ea1
     (5m59s) Collecting catalog registry.redhat.io/redhat/redhat-operator-index:v4.16 
2025/03/03 10:46:01  [WARN]   : error parsing image registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9 : registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9 unable to parse image correctly : tag and dige ✓   (5m59s) Collecting catalog registry.redhat.io/redhat/redhat-operator-index:v4.16 
2025/03/03 10:46:01  [DEBUG]  : [OperatorImageCollector] copying operator image registry.redhat.io/redhat/certified-operator-index:v4.16
 ⠦   (2s) Collecting catalog registry.redhat.io/redhat/certified-operator-index:v4.16 
2025/03/03 10:46:03  [DEBUG]  : [OperatorImageCollector] manifest 816c65bcab1086e3fa158e2391d84c67cf96916027c59ab8fe44cf68a1bfe57a
2025/03/03 10:46:03  [DEBUG]  : [OperatorImageCollector] label /configs
 ✓   (51s) Collecting catalog registry.redhat.io/redhat/certified-operator-index:v4.16 
2025/03/03 10:46:53  [DEBUG]  : [OperatorImageCollector] copying operator image registry.redhat.io/redhat/community-operator-index:v4.16
 ⠇   (11s) Collecting catalog registry.redhat.io/redhat/community-operator-index:v4.16 
2025/03/03 10:47:04  [DEBUG]  : [OperatorImageCollector] manifest 7a8cb7df2447b26c43b274f387197e0789c6ccc55c18b48bf0807ee00286550d
 ⠹   (34m26s) Collecting catalog registry.redhat.io/redhat/community-operator-index:v4.16 
Killed
    

Expected results:

     oc-mirror process should not get killed abruptly.
    

Additional info:

    More info in the link here: https://redhat-internal.slack.com/archives/C02JW6VCYS1/p1740783474190069
    

Description of problem:

"Export as CSV" on "Observe"->"Alerting" page is not marked for i18n.
    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-12-12-133926
    

How reproducible:

Always
    

Steps to Reproduce:

    1.Check "Export as CSV" on "Observe"->"Alerting" page.
    2.
    3.
    

Actual results:

1. It's not marked for i18n
    

Expected results:

1. Should be marked for i18n
    

Additional info:

"Export as CSV" also need i18n for each languages.
    

Description of problem:

Fix labels for the OpenTelemetry allow list; currently all labels have an exporter/receiver postfix on them. This is incorrect, because the name of the exporter/importer doesn't contain such a postfix.

 

 

Description of problem:

    When debugging a node using the OpenShift Console, the logs of the <NodeName>-debug pod are not accessible from either the Console UI or the CLI. However, when debugging the node via CLI (oc debug node/<node_name>), the logs are accessible as expected.

Version-Release number of selected component (if applicable):

    OpenShift Versions Tested: 4.8.14, 4.8.18, 4.9.0 ... so 4.12

How reproducible:

    always

Steps to Reproduce:

1. Open OpenShift Console.
2. Navigate to Compute → Node → <node_name> → Terminal.
3. Run any command in the terminal.
4. A new <NodeName>-debug pod is created in a dynamically generated namespace (openshift-debug-node-xxx).
5. Try to access logs:

Console UI: Workloads → Pod → <NodeName>-debug → Logs → Logs not visible.
CLI: Run oc logs <NodeName-debug_pod> -n <openshift-debug-node-xxx> → No logs available.

Actual results:

Logs of the <NodeName>-debug pod are not available in either the Console UI or CLI when debugging via Console.

Expected results:

The <NodeName>-debug pod logs should be accessible in both the Console UI and CLI, similar to the behavior observed when debugging via oc debug node/<node_name>.

Additional info:

Debugging via CLI (oc debug node/<node_name>) creates the debug pod in the current namespace (e.g., <project_name>). Logs are accessible via:

$ oc logs -n <project_name> -f <NodeName-debug_pod>

Debugging via Console creates the pod in a new dynamic namespace (openshift-debug-node-xxx), and logs are not accessible.

Possible Cause:
Namespace issue - Debug pod is created in openshift-debug-node-xxx, which may not be configured to expose logs correctly.

Description of problem:

    Two favorite icon is shows on same page. Operator details page with CR.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. install Red Hat Serverless operator
    2. navigate to Operator details > knative serving page
    

Actual results:

    Two star icon on the same page

Expected results:

Only one star icon should present on a page

Additional info:

    

Description of problem

HyperShift currently seems to only maintain one version at a time in status on a FeatureGate resource. For example, in a HostedControlPlane that had been installed a while back and recently gone through 4.14.37 > 4.14.38 > 4.14.39, the only version in FeatureGate status was 4.14.39:

$ jq -r '.status.featureGates[].version' featuregates.yaml
4.14.39

Compare that with standalone clusters, where FeatureGates status is appended with each release. For example, in this 4.18.0-rc.0 to 4.18.0-rc.1 CI run:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/release-openshift-origin-installer-e2e-aws-upgrade/1865110488958898176/artifacts/e2e-aws-upgrade/must-gather.tar | tar -xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b7fd0a8ff4df55c00e9e4e676d8c06fad2222fe83282fbbea3dad3ff9aca1ebb/cluster-scoped-resources/config.openshift.io/featuregates/cluster.yaml | yaml2json | jq -r '.status.featureGates[].version'
4.18.0-rc.1
4.18.0-rc.0

The append approach allows consumers to gracefully transition over time, as they each update from the outgoing version to the incoming version. With the current HyperShift logic, there's a race between the FeatureGate status bump and the consuming component bumps:

  1. HCP running vA
  2. HostedControlPlane spec bumped to request vB, and vB control-plane operator launched.
  3. CPO (or some other HyperShift component?) pushes vB status to FeatureGate.
  4. All the vA components looking for vA in FeatureGate status break.
  5. Dangerous race period, hopefully the CPO doesn't get stuck here.
  6. CPO bumps the other components to vB.
  7. All the vB components looking for vB in FeatureGate status are happy.

In this bug, I'm asking for HyperShift to adopt the standalone approach of appending to FeatureGate status instead of dropping the outgoing version, to avoid that kind of race window, at least until there's some assurance that the update to the incoming version has completely rolled out. Standalone pruning removes versions that no longer exist in ClusterVersion history. Checking a long-lived standalone cluster I have access to, I see:

$ oc get -o json featuregate cluster | jq -r '.status.featureGates[].version'
4.18.0-ec.4
4.18.0-ec.3
...
4.14.0-ec.1
4.14.0-ec.0
$ oc get -o json featuregate cluster | jq -r '.status.featureGates[].version' | wc -l
27

so it seems like pruning is currently either non-existent, or pretty relaxed.
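
To make the race concrete, here is a minimal sketch, assuming a simplified stand-in for the config.openshift.io/v1 FeatureGate status type (this is not HyperShift's or library-go's actual code), of how a consumer looks up its desired version and why dropping the outgoing version breaks still-running vA components:

package main

import (
	"fmt"
)

// FeatureGateDetails mirrors the shape of .status.featureGates[] entries:
// one block of enabled/disabled gates per payload version (simplified).
type FeatureGateDetails struct {
	Version string
	Enabled []string
}

// featuresForVersion is roughly what a consumer does on startup: it looks for
// the entry matching the version it was rolled out for and fails hard if that
// version is missing (compare the simple_featuregate_reader errors quoted below).
func featuresForVersion(status []FeatureGateDetails, desired string) ([]string, error) {
	for _, fg := range status {
		if fg.Version == desired {
			return fg.Enabled, nil
		}
	}
	return nil, fmt.Errorf("missing desired version %q in featuregates.config.openshift.io/cluster", desired)
}

func main() {
	// Append-style status (standalone behaviour): both versions present,
	// so vA and vB consumers can each find their entry during the update.
	appended := []FeatureGateDetails{
		{Version: "4.14.39", Enabled: []string{"GateB"}},
		{Version: "4.14.38", Enabled: []string{"GateA"}},
	}
	// Replace-style status (current HyperShift behaviour): only vB survives.
	replaced := appended[:1]

	if _, err := featuresForVersion(appended, "4.14.38"); err == nil {
		fmt.Println("append: vA consumer still works mid-update")
	}
	if _, err := featuresForVersion(replaced, "4.14.38"); err != nil {
		fmt.Println("replace: vA consumer breaks:", err)
	}
}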

Version-Release number of selected component

Seen in a 4.14.38 to 4.14.39 HostedCluster update. May or may not apply to more recent 4.y.

How reproducible

Unclear

Steps to Reproduce

  1. Install vA HostedCluster.
  2. Watch the cluster FeatureGate's status.
  3. Update to vB.
  4. Wait for the update to complete.

Actual results

When vB is added to FeatureGate status, vA is dropped.

If the CPO gets stuck during the transition, some management-cluster-side pods (cloud-network-config-controller, cluster-network-operator, ingress-operator, cluster-storage-operator, etc.) crash loop with logs like:

E1211 15:43:58.314619       1 simple_featuregate_reader.go:290] cluster failed with : unable to determine features: missing desired version "4.14.38" in featuregates.config.openshift.io/cluster
E1211 15:43:58.635080       1 simple_featuregate_reader.go:290] cluster failed with : unable to determine features: missing desired version "4.14.38" in featuregates.config.openshift.io/cluster

Expected results

vB is added to FeatureGate status early in the update, vA is preserved through much of the update, and vA is only removed once it seems like there might not be any more consumers (i.e. when the version is dropped from ClusterVersion history, if you want to match the current standalone handling).

Additional info

None yet.

Description of problem:

    A webhook validation error should be returned when marketType is invalid, as is already done for other fields, for example:
liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms1.yaml 
Error from server (Forbidden): error when creating "ms1.yaml": admission webhook "validation.machineset.machine.openshift.io" denied the request: providerSpec.networkInterfaceType: Invalid value: "1": Valid values are: ENA, EFA and omitted

Version-Release number of selected component (if applicable):

    4.19.0-0.nightly-2025-03-05-160850

How reproducible:

always    

Steps to Reproduce:

    1.Install an AWS cluster

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-0.nightly-2025-03-05-160850   True        False         5h37m   Cluster version is 4.19.0-0.nightly-2025-03-05-160850

    2. Create a MachineSet with an invalid marketType, for example marketType: "1". The machine gets stuck in Provisioning; although some messages are visible in the machine providerStatus and the machine-controller log, an explicit webhook error should be returned to be consistent with other fields.

huliu-aws36a-6bslb-worker-us-east-2aa-f89jk   Provisioning                                         8m42s

  providerStatus:
    conditions:
    - lastTransitionTime: "2025-03-06T07:49:51Z"
      message: invalid MarketType "1"
      reason: MachineCreationFailed
      status: "False"
      type: MachineCreation

E0306 08:01:07.645341       1 actuator.go:72] huliu-aws36a-6bslb-worker-us-east-2aa-f89jk error: huliu-aws36a-6bslb-worker-us-east-2aa-f89jk: reconciler failed to Create machine: failed to launch instance: invalid MarketType "1"
W0306 08:01:07.645377       1 controller.go:409] huliu-aws36a-6bslb-worker-us-east-2aa-f89jk: failed to create machine: huliu-aws36a-6bslb-worker-us-east-2aa-f89jk: reconciler failed to Create machine: failed to launch instance: invalid MarketType "1"
E0306 08:01:07.645427       1 controller.go:341] "msg"="Reconciler error" "error"="huliu-aws36a-6bslb-worker-us-east-2aa-f89jk: reconciler failed to Create machine: failed to launch instance: invalid MarketType \"1\"" "controller"="machine-controller" "name"="huliu-aws36a-6bslb-worker-us-east-2aa-f89jk" "namespace"="openshift-machine-api" "object"={"name":"huliu-aws36a-6bslb-worker-us-east-2aa-f89jk","namespace":"openshift-machine-api"} "reconcileID"="e3aeeeda-2537-4e83-a787-2cbcf9926646"
I0306 08:01:07.645499       1 recorder.go:104] "msg"="huliu-aws36a-6bslb-worker-us-east-2aa-f89jk: reconciler failed to Create machine: failed to launch instance: invalid MarketType \"1\"" "logger"="events" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"huliu-aws36a-6bslb-worker-us-east-2aa-f89jk","uid":"a7ef8a7b-87d5-4569-93a4-47a7a2d16325","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"118757"} "reason"="FailedCreate" "type"="Warning"


liuhuali@Lius-MacBook-Pro huali-test % oc get machineset huliu-aws36a-6bslb-worker-us-east-2aa -oyaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    capacity.cluster-autoscaler.kubernetes.io/labels: kubernetes.io/arch=amd64
    machine.openshift.io/GPU: "0"
    machine.openshift.io/memoryMb: "16384"
    machine.openshift.io/vCPU: "4"
  creationTimestamp: "2025-03-06T07:49:50Z"
  generation: 1
  labels:
    machine.openshift.io/cluster-api-cluster: huliu-aws36a-6bslb
  name: huliu-aws36a-6bslb-worker-us-east-2aa
  namespace: openshift-machine-api
  resourceVersion: "118745"
  uid: 65e94786-6c1a-42b8-9bf3-9fe0d3f4adf3
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: huliu-aws36a-6bslb
      machine.openshift.io/cluster-api-machineset: huliu-aws36a-6bslb-worker-us-east-2aa
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: huliu-aws36a-6bslb
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: huliu-aws36a-6bslb-worker-us-east-2aa
    spec:
      lifecycleHooks: {}
      metadata: {}
      providerSpec:
        value:
          ami:
            id: ami-0e763ecd8ccccbc99
          apiVersion: machine.openshift.io/v1beta1
          blockDevices:
          - ebs:
              encrypted: true
              iops: 0
              kmsKey:
                arn: ""
              volumeSize: 120
              volumeType: gp3
          capacityReservationId: ""
          credentialsSecret:
            name: aws-cloud-credentials
          deviceIndex: 0
          iamInstanceProfile:
            id: huliu-aws36a-6bslb-worker-profile
          instanceType: m6i.xlarge
          kind: AWSMachineProviderConfig
          marketType: "1"
          metadata:
            creationTimestamp: null
          metadataServiceOptions: {}
          placement:
            availabilityZone: us-east-2a
            region: us-east-2
          securityGroups:
          - filters:
            - name: tag:Name
              values:
              - huliu-aws36a-6bslb-node
          - filters:
            - name: tag:Name
              values:
              - huliu-aws36a-6bslb-lb
          subnet:
            filters:
            - name: tag:Name
              values:
              - huliu-aws36a-6bslb-subnet-private-us-east-2a
          tags:
          - name: kubernetes.io/cluster/huliu-aws36a-6bslb
            value: owned
          userDataSecret:
            name: worker-user-data
status:
  fullyLabeledReplicas: 1
  observedGeneration: 1
  replicas: 1
liuhuali@Lius-MacBook-Pro huali-test % 

    

Actual results:

   The machine is stuck in Provisioning, with error messages visible only in the machine providerStatus and the machine-controller log

Expected results:

   An explicit webhook error should be returned, consistent with other fields, for example:

liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms1.yaml
Error from server (Forbidden): error when creating "ms1.yaml": admission webhook "validation.machineset.machine.openshift.io" denied the request: providerSpec.networkInterfaceType: Invalid value: "1": Valid values are: ENA, EFA and omitted

Additional info:

New feature testing for https://issues.redhat.com/browse/OCPCLOUD-2780    
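
For illustration, a minimal sketch of the kind of admission check being requested; the field path and the set of valid values used here (OnDemand, Spot, CapacityBlock) are assumptions for the example, not taken from the actual machine-api webhook code:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/validation/field"
)

// validMarketTypes is an assumption for illustration; the authoritative list
// lives in the machine-api provider spec types.
var validMarketTypes = map[string]bool{
	"":              true, // omitted
	"OnDemand":      true,
	"Spot":          true,
	"CapacityBlock": true,
}

// validateMarketType sketches the webhook check: reject unknown values at
// admission time instead of letting the machine fail later in the actuator.
func validateMarketType(marketType string) *field.Error {
	if validMarketTypes[marketType] {
		return nil
	}
	return field.Invalid(
		field.NewPath("providerSpec", "value", "marketType"),
		marketType,
		"Valid values are: OnDemand, Spot, CapacityBlock and omitted",
	)
}

func main() {
	if err := validateMarketType("1"); err != nil {
		// This is the kind of message the reporter expects oc create to return.
		fmt.Println("admission webhook would deny the request:", err)
	}
}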

Description of problem:

The OpenShift installer does not validate whether the apiVIPs and ingressVIPs are specified when the load balancer is configured as UserManaged; it falls back to the default behaviour of picking the 5th and 7th IPs of the machine network.

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    100%

Steps to Reproduce:

1. Create an install-config.yaml file with the following content:

$ cat ocp4/install-config.yaml

apiVersion: v1
baseDomain: mydomain.test
compute:
- name: worker
  platform:
    openstack:
      type: m1.xlarge
  replicas: 3
controlPlane:
  name: master
  platform:
    openstack:
      type: m1.xlarge
  replicas: 3
metadata:
  name: mycluster
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 192.168.10.0/24
platform:
  openstack:
    loadBalancer:
      type: UserManaged


2. Run the following command to generate manifests:
$ openshift-install create manifests --dir ocp4

3. Check the generated cluster-config.yaml:
$ cat ocp4/manifests/cluster-config.yaml     

4. Observe the following unexpected output:
platform:
  openstack:
    cloud: openstack
    externalDNS: null
    apiVIPs:
    - 192.168.10.5
    ingressVIPs:
    - 192.168.10.7
    loadBalancer:
      type: UserManaged

Actual results:

    The apiVIPs and ingressVIPs fields are unexpectedly added to cluster-config.yaml.

Expected results:

The apiVIPs and ingressVIPs fields should not be automatically assigned.
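
A minimal sketch of the guard being requested, assuming made-up function names and the documented 5th/7th-IP fallback (this is not the installer's actual defaulting code): skip automatic VIP assignment when the load balancer type is UserManaged.

package main

import (
	"fmt"
	"net/netip"
)

// defaultVIPs mimics the documented fallback: the 5th host IP for the API VIP
// and the 7th for the ingress VIP within the machine network CIDR.
func defaultVIPs(machineCIDR string) (api, ingress string, err error) {
	prefix, err := netip.ParsePrefix(machineCIDR)
	if err != nil {
		return "", "", err
	}
	addr := prefix.Addr()
	for i := 0; i < 5; i++ {
		addr = addr.Next()
	}
	api = addr.String()
	addr = addr.Next()
	addr = addr.Next()
	ingress = addr.String()
	return api, ingress, nil
}

// assignVIPs shows the requested behaviour: with a UserManaged load balancer,
// leave empty VIPs empty (or fail validation) instead of silently defaulting.
func assignVIPs(loadBalancerType, machineCIDR string, apiVIPs, ingressVIPs []string) ([]string, []string, error) {
	if loadBalancerType == "UserManaged" {
		// VIPs are owned by the external load balancer; do not invent them.
		return apiVIPs, ingressVIPs, nil
	}
	if len(apiVIPs) == 0 || len(ingressVIPs) == 0 {
		api, ingress, err := defaultVIPs(machineCIDR)
		if err != nil {
			return nil, nil, err
		}
		if len(apiVIPs) == 0 {
			apiVIPs = []string{api}
		}
		if len(ingressVIPs) == 0 {
			ingressVIPs = []string{ingress}
		}
	}
	return apiVIPs, ingressVIPs, nil
}

func main() {
	api, ingress, _ := assignVIPs("UserManaged", "192.168.10.0/24", nil, nil)
	fmt.Println("UserManaged:", api, ingress) // stays empty, as expected
	api, ingress, _ = assignVIPs("OpenShiftManagedDefault", "192.168.10.0/24", nil, nil)
	fmt.Println("Managed defaults:", api, ingress) // 192.168.10.5 / 192.168.10.7
}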

Additional info:

    

Description of problem:

the CIS "plugin did not respond" blocked the public install    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2025-03-14-195326     

How reproducible:

 Always   

Steps to Reproduce:

    1. Create a public IPI cluster on the IBM Cloud platform
    2.
    3.
    

Actual results:

level=info msg=Creating infrastructure resources...
msg=Error: Plugin did not respond    
...
msg=panic: runtime error: invalid memory address or nil pointer dereference
msg=[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x24046dc]
level=error
level=error msg=goroutine 2090 [running]:
level=error msg=github.com/IBM-Cloud/terraform-provider-ibm/ibm/service/cis.ResourceIBMCISDnsRecordRead(0xc003573900, {0x4ed2fa0?, 0xc00380c008?})

Expected results:

Cluster creation succeeds.

Additional info:

https://github.com/IBM-Cloud/terraform-provider-ibm/issues/6066
ibm_cis_dns_record leads to plugin crash    
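
The linked upstream issue points at a nil dereference inside ResourceIBMCISDnsRecordRead. As a generic illustration only (the types and function below are invented, not the provider's real code), this is the defensive pattern that turns such a crash into a readable error:

package main

import (
	"errors"
	"fmt"
)

// dnsRecord stands in for the API response object the provider dereferences.
type dnsRecord struct {
	Content *string
}

// readRecord sketches the defensive pattern: check the response and its
// pointer-valued fields before dereferencing, and return a diagnostic error
// instead of panicking with a nil pointer dereference.
func readRecord(rec *dnsRecord) (string, error) {
	if rec == nil {
		return "", errors.New("CIS DNS record lookup returned no result")
	}
	if rec.Content == nil {
		return "", errors.New("CIS DNS record has no content field set")
	}
	return *rec.Content, nil
}

func main() {
	if _, err := readRecord(nil); err != nil {
		// The installer would surface this as a readable error rather than
		// the "Plugin did not respond" panic seen in the actual results.
		fmt.Println("error:", err)
	}
}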

Description of problem:

/k8s/all-namespaces/volumesnapshots returns 404 Page Not Found    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2025-03-17-135359    

How reproducible:

Always    

Steps to Reproduce:

1. Navigate to Storage -> VolumeSnapshots and make sure 'All Projects' is selected.
2. Click the 'Create VolumeSnapshot' button; the user is redirected to the /k8s/ns/default/volumesnapshots/~new/form page and the project selection changes to 'default'.
3. Open the project selector dropdown and change the project to 'All Projects' again.

$ oc get volumesnapshots -A
No resources found

Actual results:

3. The URL path changes to /k8s/all-namespaces/volumesnapshots and we see the error:
404: Page Not Found
The server doesn't have a resource type "volumesnapshots". Try refreshing the page if it was recently added.

Expected results:

3. The page should display volumesnapshots in all projects; volumesnapshot resources can be successfully listed/queried:
$ oc get volumesnapshots -A
No resources found
 

Additional info:

    

Description of problem:

When creating a cluster with instance type Standard_M8-4ms, the installer fails to provision machines.

install-config:
================
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure:
      type: Standard_M8-4ms

Create cluster:
=====================
$ ./openshift-install create cluster --dir ipi3
INFO Waiting up to 15m0s (until 2:31AM UTC) for machines [jimainstance01-h45wv-bootstrap jimainstance01-h45wv-master-0 jimainstance01-h45wv-master-1 jimainstance01-h45wv-master-2] to provision... 
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: control-plane machines were not provisioned within 15m0s: client rate limiter Wait returned an error: context deadline exceeded 
INFO Shutting down local Cluster API controllers... 
INFO Stopped controller: Cluster API              
WARNING process cluster-api-provider-azure exited with error: signal: killed 
INFO Stopped controller: azure infrastructure provider 
INFO Stopped controller: azureaso infrastructure provider 
INFO Shutting down local Cluster API control plane... 
INFO Local Cluster API system has completed operation

In openshift-install.log, creation of all machines failed with the error below:
=================
time="2024-09-20T02:17:07Z" level=debug msg="I0920 02:17:07.757980 1747698 recorder.go:104] \"failed to reconcile AzureMachine: failed to reconcile AzureMachine service virtualmachine: failed to get desired parameters for resource jimainstance01-h45wv-rg/jimainstance01-h45wv-bootstrap (service: virtualmachine): reconcile error that cannot be recovered occurred: failed to validate the memory capability: failed to parse string '218.75' as int64: strconv.ParseInt: parsing \\\"218.75\\\": invalid syntax. Object will not be requeued\" logger=\"events\" type=\"Warning\" object={\"kind\":\"AzureMachine\",\"namespace\":\"openshift-cluster-api-guests\",\"name\":\"jimainstance01-h45wv-bootstrap\",\"uid\":\"d67a2010-f489-44b4-9be9-88d7b136a45b\",\"apiVersion\":\"infrastructure.cluster.x-k8s.io/v1beta1\",\"resourceVersion\":\"1530\"} reason=\"ReconcileError\""
...
time="2024-09-20T02:17:12Z" level=debug msg="Checking that machine jimainstance01-h45wv-bootstrap has provisioned..."
time="2024-09-20T02:17:12Z" level=debug msg="Machine jimainstance01-h45wv-bootstrap has not yet provisioned: Failed"
time="2024-09-20T02:17:12Z" level=debug msg="Checking that machine jimainstance01-h45wv-master-0 has provisioned..."
time="2024-09-20T02:17:12Z" level=debug msg="Machine jimainstance01-h45wv-master-0 has not yet provisioned: Failed"
time="2024-09-20T02:17:12Z" level=debug msg="Checking that machine jimainstance01-h45wv-master-1 has provisioned..."
time="2024-09-20T02:17:12Z" level=debug msg="Machine jimainstance01-h45wv-master-1 has not yet provisioned: Failed"
time="2024-09-20T02:17:12Z" level=debug msg="Checking that machine jimainstance01-h45wv-master-2 has provisioned..."
time="2024-09-20T02:17:12Z" level=debug msg="Machine jimainstance01-h45wv-master-2 has not yet provisioned: Failed"
... 

The same error also appears in .clusterapi_output/Machine-openshift-cluster-api-guests-jimainstance01-h45wv-bootstrap.yaml
===================
$ yq-go r Machine-openshift-cluster-api-guests-jimainstance01-h45wv-bootstrap.yaml 'status'
noderef: null
nodeinfo: null
lastupdated: "2024-09-20T02:17:07Z"
failurereason: CreateError
failuremessage: 'Failure detected from referenced resource infrastructure.cluster.x-k8s.io/v1beta1,
  Kind=AzureMachine with name "jimainstance01-h45wv-bootstrap": failed to reconcile
  AzureMachine service virtualmachine: failed to get desired parameters for resource
  jimainstance01-h45wv-rg/jimainstance01-h45wv-bootstrap (service: virtualmachine):
  reconcile error that cannot be recovered occurred: failed to validate the memory
  capability: failed to parse string ''218.75'' as int64: strconv.ParseInt: parsing
  "218.75": invalid syntax. Object will not be requeued'
addresses: []
phase: Failed
certificatesexpirydate: null
bootstrapready: false
infrastructureready: false
observedgeneration: 1
conditions:
- type: Ready
  status: "False"
  severity: Error
  lasttransitiontime: "2024-09-20T02:17:07Z"
  reason: Failed
  message: 0 of 2 completed
- type: InfrastructureReady
  status: "False"
  severity: Error
  lasttransitiontime: "2024-09-20T02:17:07Z"
  reason: Failed
  message: 'virtualmachine failed to create or update. err: failed to get desired
    parameters for resource jimainstance01-h45wv-rg/jimainstance01-h45wv-bootstrap
    (service: virtualmachine): reconcile error that cannot be recovered occurred:
    failed to validate the memory capability: failed to parse string ''218.75'' as
    int64: strconv.ParseInt: parsing "218.75": invalid syntax. Object will not be
    requeued'
- type: NodeHealthy
  status: "False"
  severity: Info
  lasttransitiontime: "2024-09-20T02:16:27Z"
  reason: WaitingForNodeRef
  message: ""


From the above error, the installer seems unable to parse the memory of instance type Standard_M8-4ms, which is a decimal rather than an integer.

$ az vm list-skus --size Standard_M8-4ms  --location southcentralus | jq -r '.[].capabilities[] | select(.name=="MemoryGB")'
{
  "name": "MemoryGB",
  "value": "218.75"
}
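
A minimal sketch of the change this implies, assuming an invented helper name (the actual capability-validation code lives in the Azure Cluster API provider): parse MemoryGB as a float instead of an int64 so fractional values such as 218.75 are accepted.

package main

import (
	"fmt"
	"strconv"
)

// memoryGBFromCapability parses the MemoryGB SKU capability. Using ParseFloat
// accepts both integer ("16") and fractional ("218.75") values, which is what
// Standard_M8-4ms reports; ParseInt on "218.75" is the failure in the logs above.
func memoryGBFromCapability(value string) (float64, error) {
	mem, err := strconv.ParseFloat(value, 64)
	if err != nil {
		return 0, fmt.Errorf("failed to parse MemoryGB capability %q: %w", value, err)
	}
	return mem, nil
}

func main() {
	for _, v := range []string{"218.75", "16"} {
		mem, err := memoryGBFromCapability(v)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("MemoryGB %s parsed as %.2f\n", v, mem)
	}
}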

Version-Release number of selected component (if applicable):

4.17.0-0.nightly-2024-09-16-082730

 

How reproducible:

 Always

Steps to Reproduce:

    1. Set the controlPlane instance type to Standard_M8-4ms in the install-config
    2. Create the cluster
    3.
    

Actual results:

    Installation failed

Expected results:

    Installation succeeded

Additional info:

    

Description of problem:

We should use the resource kind HelmChartRepository on the details page, in the action items, and in the breadcrumb link.

Version-Release number of selected component (if applicable):

4.19.0-0.nightly-2025-03-09-063419    

How reproducible:

Always    

Steps to Reproduce:

    1. Navigate to the Helm -> Repositories page and click on a HelmChartRepository
    2. Check the details page heading, the breadcrumb link name, and the action item names
    3.
    

Actual results:

Details page heading is: Helm Chart Repository
Breadcrumb link name is: Repositories -> Helm Chart Repository details
Two action items are: Edit Helm Chart Repository and Delete Helm Chart Repository  

Expected results:

We should use HelmChartRepository (no space between words) in all of these places

Additional info:

    

When a (Fibre Channel) multipath disk is discovered by the assisted-installer-agent, the wwn field is not included:

  {
    "bootable": true,
    "by_id": "/dev/disk/by-id/wwn-0xdeadbeef",
    "drive_type": "Multipath",
    "has_uuid": true,
    "holders": "dm-3,dm-5,dm-7",
    "id": "/dev/disk/by-id/wwn-0xdeadbeef",
    "installation_eligibility": {
      "eligible": true,
      "not_eligible_reasons": null
    },
    "name": "dm-2",
    "path": "/dev/dm-2",
    "size_bytes": 549755813888
  }, 

Thus there is no way to match this disk with a wwn: root device hint. Since assisted does not allow installing directly to a fibre channel disk (without multipath) until 4.19 with MGMT-19631, and there is no /dev/disk/by-path/ symlink for a multipath device, this means that when there are multiple multipath disks in the system there is no way to select between them other than by size.

When ghw lists the disks, it fills in the WWN field from the ID_WWN_WITH_EXTENSION or ID_WWN udev values. It's not clear to me how udev is creating the /dev/disk/by-id/ symlink without those fields. There is a separate DM_WWN field (DM = Device Mapper), but I don't see it used in udev rules for whole disks, only for partitions. I don't have access to any hardware so it's impossible to say what the data in /run/udev/data looks like.
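
A minimal sketch of one possible fallback in the agent's inventory code; the function and map keys below are illustrative assumptions, and whether DM_WWN or the by-id symlink is the right source would need the hardware access mentioned above:

package main

import (
	"fmt"
	"strings"
)

// wwnForDisk sketches a fallback chain for multipath devices: prefer the
// udev-provided WWN fields (what ghw reads today), then DM_WWN, and finally
// derive the value from a /dev/disk/by-id/wwn-* symlink if one exists.
func wwnForDisk(udev map[string]string, byID string) string {
	for _, key := range []string{"ID_WWN_WITH_EXTENSION", "ID_WWN", "DM_WWN"} {
		if v := udev[key]; v != "" {
			return v
		}
	}
	const prefix = "/dev/disk/by-id/wwn-"
	if strings.HasPrefix(byID, prefix) {
		return strings.TrimPrefix(byID, prefix)
	}
	return ""
}

func main() {
	// Mirrors the inventory entry quoted above: no udev WWN keys are set,
	// but a wwn-0x... by-id path is present.
	udev := map[string]string{}
	fmt.Println(wwnForDisk(udev, "/dev/disk/by-id/wwn-0xdeadbeef")) // 0xdeadbeef
}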

Description of problem:

The TestControllerEventuallyReconciles test within the e2e-gcp-op-ocl test suite fails very often, which prevents the rest of the job from running. This reduces confidence in the test suite and lowers the overall quality signal for OCL.

 

Version-Release number of selected component (if applicable):

N/A

 

How reproducible:

Often.

 

Steps to Reproduce:

Run the e2e-gcp-op-ocl job by opening a PR. The job will eventually fail on this test.

 

Actual results:

The TestControllerEventuallyReconciles test fails on a fairly consistent basis.

 

Expected results:

The test should pass.

 

Additional info:

I suspect that part of the problem is that the "success" criteria used by the Build Controller and by the e2e test suite are not the same. As part of a potential fix, I exported the success-criteria function so that it can be reused by the e2e test suite, and I also turned certain hard-coded values into constants so that they can be adjusted from one central place.
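
A minimal sketch of the shape of that change, with invented names rather than the actual machine-config-operator symbols: one shared success predicate plus constants for the polling knobs, used by both the controller and the e2e suite.

package main

import (
	"fmt"
	"time"
)

// Tunables that were previously hard-coded in two places; keeping them as
// constants lets the controller and the e2e test adjust them centrally.
const (
	defaultPollInterval = 10 * time.Second
	defaultPollTimeout  = 20 * time.Minute
)

// buildState is a simplified stand-in for the MachineOSBuild status fields
// the real code inspects.
type buildState struct {
	succeeded      bool
	imagePushspec  string
	degradedReason string
}

// isBuildSuccess is the shared success criteria: the build reported success,
// produced a pushable image, and nothing is degraded. Exporting one predicate
// like this keeps the controller and the e2e suite in agreement.
func isBuildSuccess(s buildState) bool {
	return s.succeeded && s.imagePushspec != "" && s.degradedReason == ""
}

func main() {
	s := buildState{succeeded: true, imagePushspec: "quay.io/example/os-image@sha256:abc"}
	fmt.Println("poll every", defaultPollInterval, "for up to", defaultPollTimeout)
	fmt.Println("build success:", isBuildSuccess(s))
}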

Upstream Cluster API enabled MachineSet preflight checks by default in https://github.com/kubernetes-sigs/cluster-api/pull/11228

We want to disable this functionality in HCP for the following reasons (a sketch of one way to do so follows the list):

  • MachineSetPreflightCheckControlPlaneIsStable
    We currently don't express intent for a version via spec.version but via spec.release.
  • MachineSetPreflightCheckKubernetesVersionSkew
    We preserve our ability to control our skew policy at a higher layer, i.e. the NodePool API.
  • MachineSetPreflightCheckKubeadmVersionSkew
    We don't run kubeadm at all.
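
A minimal sketch of one way HCP could opt out, assuming the upstream machineset.cluster.x-k8s.io/skip-preflight-checks annotation with "*" meaning skip all checks (the exact annotation key and accepted values should be verified against the Cluster API version HyperShift vendors):

package main

import "fmt"

// skipPreflightAnnotation is the upstream CAPI annotation (assumed here) that
// disables MachineSet preflight checks; "*" skips all of them.
const skipPreflightAnnotation = "machineset.cluster.x-k8s.io/skip-preflight-checks"

// withPreflightsDisabled is the kind of mutation HCP's NodePool controller
// could apply to every CAPI MachineSet it reconciles, since HCP expresses
// versions via spec.release and enforces skew policy in the NodePool API.
func withPreflightsDisabled(annotations map[string]string) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[skipPreflightAnnotation] = "*"
	return annotations
}

func main() {
	ms := map[string]string{"hypershift.openshift.io/nodePool": "clusters/example"}
	fmt.Println(withPreflightsDisabled(ms))
}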

Description of problem:

While debugging an ocp-42855 failure, the hostedcluster's Degraded condition is True.

Version-Release number of selected component (if applicable):

quay.io/openshift-release-dev/ocp-release:4.12.0-rc.6-x86_64

How reproducible:

follow ocp-42855 test steps

Steps to Reproduce:

1. Create a basic hosted cluster using the hypershift tool
2. Check the hostedcluster conditions

Actual results:

[hmx@ovpn-12-45 hypershift]$ oc get pods -n clusters-mihuanghy
NAME                                                  READY   STATUS             RESTARTS         AGE
aws-ebs-csi-driver-controller-9c46694f-mqrlc          7/7     Running            0                55m
aws-ebs-csi-driver-operator-5d7867bc9f-hqzd5          1/1     Running            0                55m
capi-provider-6df855dbb5-tcmvq                        2/2     Running            0                58m
catalog-operator-7544b8d6d8-dk4hh                     2/2     Running            0                57m
certified-operators-catalog-7f8f6598b5-2blv4          0/1     CrashLoopBackOff   15 (4m20s ago)   57m
cloud-network-config-controller-545fcfc797-mgszj      3/3     Running            0                55m
cluster-api-54c7f7c477-kgvzn                          1/1     Running            0                58m
cluster-autoscaler-658756f99-vr2hk                    1/1     Running            0                58m
cluster-image-registry-operator-84d84dbc9f-zpcsq      3/3     Running            0                57m
cluster-network-operator-9b6985cc8-sd7d7              1/1     Running            0                57m
cluster-node-tuning-operator-65c8f6fbb9-xzpws         1/1     Running            0                57m
cluster-policy-controller-b5c76cf58-b4rth             1/1     Running            0                57m
cluster-storage-operator-7474f76c99-9chl7             1/1     Running            0                57m
cluster-version-operator-646d97ccc9-l72m5             1/1     Running            0                57m
community-operators-catalog-774fdb48fc-z6s4d          1/1     Running            0                57m
control-plane-operator-5bc8c4c996-4nz8c               2/2     Running            0                58m
csi-snapshot-controller-5b7d6bb685-vf8rf              1/1     Running            0                55m
csi-snapshot-controller-operator-6f74db85c6-89bts     1/1     Running            0                57m
csi-snapshot-webhook-57c5bd7f85-lqnwf                 1/1     Running            0                55m
dns-operator-767c5bbdd8-rb7fl                         1/1     Running            0                57m
etcd-0                                                2/2     Running            0                58m
hosted-cluster-config-operator-88b9d49b7-2gvbt        1/1     Running            0                57m
ignition-server-949d9fd8c-cgtxb                       1/1     Running            0                58m
ingress-operator-5c6f5d4f48-gh7fl                     3/3     Running            0                57m
konnectivity-agent-79c5ff9585-pqctc                   1/1     Running            0                58m
konnectivity-server-65956d468c-lpwfv                  1/1     Running            0                58m
kube-apiserver-d9f887c4b-xwdcx                        5/5     Running            0                58m
kube-controller-manager-64b6f757f9-6qszq              2/2     Running            0                52m
kube-scheduler-58ffcdf789-fch2n                       1/1     Running            0                57m
machine-approver-559d66d4d6-2v64w                     1/1     Running            0                58m
multus-admission-controller-8695985fbc-hjtqb          2/2     Running            0                55m
oauth-openshift-6b9695fc7f-pf4j6                      2/2     Running            0                55m
olm-operator-bf694b84-gvz6x                           2/2     Running            0                57m
openshift-apiserver-55c69bc497-x8bft                  2/2     Running            0                52m
openshift-controller-manager-8597c66d58-jb7w2         1/1     Running            0                57m
openshift-oauth-apiserver-674cd6df6d-ckg55            1/1     Running            0                57m
openshift-route-controller-manager-76d78f897c-9mfmj   1/1     Running            0                57m
ovnkube-master-0                                      7/7     Running            0                55m
packageserver-7988d8ddfc-wnh6l                        2/2     Running            0                57m
redhat-marketplace-catalog-77547cc685-hnh65           0/1     CrashLoopBackOff   15 (4m15s ago)   57m
redhat-operators-catalog-7784d45f54-58lgg             1/1     Running            0                57m


 
{
                "lastTransitionTime": "2022-12-31T18:45:28Z",
                "message": "[certified-operators-catalog deployment has 1 unavailable replicas, redhat-marketplace-catalog deployment has 1 unavailable replicas]",
                "observedGeneration": 3,
                "reason": "UnavailableReplicas",
                "status": "True",
                "type": "Degraded"
            },

Expected results:

Degraded is False 

Additional info:

$ oc describe pod certified-operators-catalog-7f8f6598b5-2blv4 -n clusters-mihuanghy
Name:                 certified-operators-catalog-7f8f6598b5-2blv4
Namespace:            clusters-mihuanghy
Priority:             100000000
Priority Class Name:  hypershift-control-plane
Node:                 ip-10-0-202-149.us-east-2.compute.internal/10.0.202.149
Start Time:           Sun, 01 Jan 2023 02:47:03 +0800
Labels:               app=certified-operators-catalog
                      hypershift.openshift.io/control-plane-component=certified-operators-catalog
                      hypershift.openshift.io/hosted-control-plane=clusters-mihuanghy
                      olm.catalogSource=certified-operators
                      pod-template-hash=7f8f6598b5
Annotations:          hypershift.openshift.io/release-image: quay.io/openshift-release-dev/ocp-release:4.12.0-rc.6-x86_64
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "openshift-sdn",
                            "interface": "eth0",
                            "ips": [
                                "10.131.0.38"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "openshift-sdn",
                            "interface": "eth0",
                            "ips": [
                                "10.131.0.38"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      openshift.io/scc: restricted-v2
                      seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:               Running
IP:                   10.131.0.38
IPs:
  IP:           10.131.0.38
Controlled By:  ReplicaSet/certified-operators-catalog-7f8f6598b5
Containers:
  registry:
    Container ID:   cri-o://f32b8d4c31b729c1b7deef0da622ddd661d840428aa4847968b1b2b3bf76b6cf
    Image:          registry.redhat.io/redhat/certified-operator-index:v4.11
    Image ID:       registry.redhat.io/redhat/certified-operator-index@sha256:93f667597eee33b9bdbc9a61af60978b414b6f6df8e7c5f496c4298c1dfe9b62
    Port:           50051/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sun, 01 Jan 2023 03:39:44 +0800
      Finished:     Sun, 01 Jan 2023 03:39:44 +0800
    Ready:          False
    Restart Count:  15
    Requests:
      cpu:        10m
      memory:     160Mi
    Liveness:     exec [grpc_health_probe -addr=:50051] delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:    exec [grpc_health_probe -addr=:50051] delay=5s timeout=5s period=10s #success=1 #failure=3
    Startup:      exec [grpc_health_probe -addr=:50051] delay=0s timeout=1s period=10s #success=1 #failure=15
    Environment:  <none>
    Mounts:       <none>
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:            <none>
QoS Class:          Burstable
Node-Selectors:     <none>
Tolerations:        hypershift.openshift.io/cluster=clusters-mihuanghy:NoSchedule
                    hypershift.openshift.io/control-plane=true:NoSchedule
                    node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                    node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                    node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  Normal   Scheduled       54m                    default-scheduler  Successfully assigned clusters-mihuanghy/certified-operators-catalog-7f8f6598b5-2blv4 to ip-10-0-202-149.us-east-2.compute.internal
  Normal   AddedInterface  53m                    multus             Add eth0 [10.131.0.38/23] from openshift-sdn
  Normal   Pulling         53m                    kubelet            Pulling image "registry.redhat.io/redhat/certified-operator-index:v4.11"
  Normal   Pulled          53m                    kubelet            Successfully pulled image "registry.redhat.io/redhat/certified-operator-index:v4.11" in 40.628843349s
  Normal   Pulled          52m (x3 over 53m)      kubelet            Container image "registry.redhat.io/redhat/certified-operator-index:v4.11" already present on machine
  Normal   Created         52m (x4 over 53m)      kubelet            Created container registry
  Normal   Started         52m (x4 over 53m)      kubelet            Started container registry
  Warning  BackOff         3m59s (x256 over 53m)  kubelet            Back-off restarting failed container

$ oc describe pod redhat-marketplace-catalog-77547cc685-hnh65 -n clusters-mihuanghy
Name:                 redhat-marketplace-catalog-77547cc685-hnh65
Namespace:            clusters-mihuanghy
Priority:             100000000
Priority Class Name:  hypershift-control-plane
Node:                 ip-10-0-202-149.us-east-2.compute.internal/10.0.202.149
Start Time:           Sun, 01 Jan 2023 02:47:03 +0800
Labels:               app=redhat-marketplace-catalog
                      hypershift.openshift.io/control-plane-component=redhat-marketplace-catalog
                      hypershift.openshift.io/hosted-control-plane=clusters-mihuanghy
                      olm.catalogSource=redhat-marketplace
                      pod-template-hash=77547cc685
Annotations:          hypershift.openshift.io/release-image: quay.io/openshift-release-dev/ocp-release:4.12.0-rc.6-x86_64
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "openshift-sdn",
                            "interface": "eth0",
                            "ips": [
                                "10.131.0.40"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "openshift-sdn",
                            "interface": "eth0",
                            "ips": [
                                "10.131.0.40"
                            ],
                            "default": true,
                            "dns": {}
                        }]
                      openshift.io/scc: restricted-v2
                      seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:               Running
IP:                   10.131.0.40
IPs:
  IP:           10.131.0.40
Controlled By:  ReplicaSet/redhat-marketplace-catalog-77547cc685
Containers:
  registry:
    Container ID:   cri-o://7afba8993dac8f1c07a2946d8b791def3b0c80ce62d5d6160770a5a9990bf922
    Image:          registry.redhat.io/redhat/redhat-marketplace-index:v4.11
    Image ID:       registry.redhat.io/redhat/redhat-marketplace-index@sha256:074498ac11b5691ba8975e8f63fa04407ce11bb035dde0ced2f439d7a4640510
    Port:           50051/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sun, 01 Jan 2023 03:39:49 +0800
      Finished:     Sun, 01 Jan 2023 03:39:49 +0800
    Ready:          False
    Restart Count:  15
    Requests:
      cpu:        10m
      memory:     340Mi
    Liveness:     exec [grpc_health_probe -addr=:50051] delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:    exec [grpc_health_probe -addr=:50051] delay=5s timeout=5s period=10s #success=1 #failure=3
    Startup:      exec [grpc_health_probe -addr=:50051] delay=0s timeout=1s period=10s #success=1 #failure=15
    Environment:  <none>
    Mounts:       <none>
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:            <none>
QoS Class:          Burstable
Node-Selectors:     <none>
Tolerations:        hypershift.openshift.io/cluster=clusters-mihuanghy:NoSchedule
                    hypershift.openshift.io/control-plane=true:NoSchedule
                    node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                    node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                    node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason          Age                  From               Message
  ----     ------          ----                 ----               -------
  Normal   Scheduled       55m                  default-scheduler  Successfully assigned clusters-mihuanghy/redhat-marketplace-catalog-77547cc685-hnh65 to ip-10-0-202-149.us-east-2.compute.internal
  Normal   AddedInterface  55m                  multus             Add eth0 [10.131.0.40/23] from openshift-sdn
  Normal   Pulling         55m                  kubelet            Pulling image "registry.redhat.io/redhat/redhat-marketplace-index:v4.11"
  Normal   Pulled          54m                  kubelet            Successfully pulled image "registry.redhat.io/redhat/redhat-marketplace-index:v4.11" in 40.862526792s
  Normal   Pulled          53m (x3 over 54m)    kubelet            Container image "registry.redhat.io/redhat/redhat-marketplace-index:v4.11" already present on machine
  Normal   Created         53m (x4 over 54m)    kubelet            Created container registry
  Normal   Started         53m (x4 over 54m)    kubelet            Started container registry
  Warning  BackOff         21s (x276 over 54m)  kubelet            Back-off restarting failed container

   $ oc describe deployment redhat-marketplace-catalog -n clusters-mihuanghy
Name:                   redhat-marketplace-catalog
Namespace:              clusters-mihuanghy
CreationTimestamp:      Sun, 01 Jan 2023 02:47:03 +0800
Labels:                 hypershift.openshift.io/managed-by=control-plane-operator
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               olm.catalogSource=redhat-marketplace
Replicas:               1 desired | 1 updated | 1 total | 0 available | 1 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:       app=redhat-marketplace-catalog
                hypershift.openshift.io/control-plane-component=redhat-marketplace-catalog
                hypershift.openshift.io/hosted-control-plane=clusters-mihuanghy
                olm.catalogSource=redhat-marketplace
  Annotations:  hypershift.openshift.io/release-image: quay.io/openshift-release-dev/ocp-release:4.12.0-rc.6-x86_64
  Containers:
   registry:
    Image:      registry.redhat.io/redhat/redhat-marketplace-index:v4.11
    Port:       50051/TCP
    Host Port:  0/TCP
    Requests:
      cpu:              10m
      memory:           340Mi
    Liveness:           exec [grpc_health_probe -addr=:50051] delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:          exec [grpc_health_probe -addr=:50051] delay=5s timeout=5s period=10s #success=1 #failure=3
    Startup:            exec [grpc_health_probe -addr=:50051] delay=0s timeout=1s period=10s #success=1 #failure=15
    Environment:        <none>
    Mounts:             <none>
  Volumes:              <none>
  Priority Class Name:  hypershift-control-plane
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      False   MinimumReplicasUnavailable
  Progressing    False   ProgressDeadlineExceeded
OldReplicaSets:  <none>
NewReplicaSet:   redhat-marketplace-catalog-77547cc685 (1/1 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  22m   deployment-controller  Scaled up replica set redhat-marketplace-catalog-77547cc685 to 1
[hmx@ovpn-12-45 hypershift]$ oc get hostedcluster -A
NAMESPACE   NAME        VERSION       KUBECONFIG                   PROGRESS    AVAILABLE   PROGRESSING   MESSAGE
clusters    mihuanghy   4.12.0-rc.6   mihuanghy-admin-kubeconfig   Completed   True        False         The hosted control plane is available


$ oc describe deployment certified-operators-catalog -n clusters-mihuanghy
Name:                   certified-operators-catalog
Namespace:              clusters-mihuanghy
CreationTimestamp:      Sun, 01 Jan 2023 02:47:03 +0800
Labels:                 hypershift.openshift.io/managed-by=control-plane-operator
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               olm.catalogSource=certified-operators
Replicas:               1 desired | 1 updated | 1 total | 0 available | 1 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:       app=certified-operators-catalog
                hypershift.openshift.io/control-plane-component=certified-operators-catalog
                hypershift.openshift.io/hosted-control-plane=clusters-mihuanghy
                olm.catalogSource=certified-operators
  Annotations:  hypershift.openshift.io/release-image: quay.io/openshift-release-dev/ocp-release:4.12.0-rc.6-x86_64
  Containers:
   registry:
    Image:      registry.redhat.io/redhat/certified-operator-index:v4.11
    Port:       50051/TCP
    Host Port:  0/TCP
    Requests:
      cpu:              10m
      memory:           160Mi
    Liveness:           exec [grpc_health_probe -addr=:50051] delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:          exec [grpc_health_probe -addr=:50051] delay=5s timeout=5s period=10s #success=1 #failure=3
    Startup:            exec [grpc_health_probe -addr=:50051] delay=0s timeout=1s period=10s #success=1 #failure=15
    Environment:        <none>
    Mounts:             <none>
  Volumes:              <none>
  Priority Class Name:  hypershift-control-plane
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      False   MinimumReplicasUnavailable
  Progressing    False   ProgressDeadlineExceeded
OldReplicaSets:  <none>
NewReplicaSet:   certified-operators-catalog-7f8f6598b5 (1/1 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  21m   deployment-controller  Scaled up replica set certified-operators-catalog-7f8f6598b5 to 1

warning  React Hook React.useMemo has a missing dependency: 'hasRevealableContent'

 


Description of problem:

    The position of the play/pause button on the Events page differs between when there are no events and when there are events.

Version-Release number of selected component (if applicable):

    4.19.0

How reproducible:

    always

Steps to Reproduce:

    1. open the events page
    2. observe play/pause button position shift
    
    

Actual results:

    the button moves

Expected results:

    no shift

Additional info:

    

Other Incomplete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled

We recently hit a limit in our subscription where we could no longer create role assignments for service principals.

This is because we are not deleting role assignments made during our CI runs. We previously thought we didn't have to delete those, but it turns out we need to.

We're just getting a regexp search bar and then a blank chart. Using the browser dev tools console, we see this error:

Uncaught SyntaxError: import declarations may only appear at top level of a module timelines-chart:1:1
Uncaught ReferenceError: TimelinesChart is not defined
    renderChart https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-csi/1902920985443569664/artifacts/e2e-vsphere-ovn-csi/openshift-e2e-test/artifacts/junit/e2e-timelines_spyglass_20250321-040532.html:33606
    <anonymous> https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-csi/1902920985443569664/artifacts/e2e-vsphere-ovn-csi/openshift-e2e-test/artifacts/junit/e2e-timelines_spyglass_20250321-040532.html:33650

This seems to be hitting 4.18 as well; we're not sure exactly when it started.

Description of the problem:

For some hardware, particularly SimplyNUC (https://edge.simplynuc.com/), it was found that when the motherboard serial number is not set it defaults to "-". Since this is treated as a valid string by the UUID generation in https://github.com/openshift/assisted-installer-agent/blob/master/src/scanners/machine_uuid_scanner.go#L96-L107, all such hosts end up with the same UUID, causing installation failures.
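
A minimal sketch of the kind of guard this implies; the helper and the placeholder list below are illustrative assumptions, not the actual machine_uuid_scanner code:

package main

import (
	"fmt"
	"strings"
)

// placeholderSerials are vendor defaults that carry no identity; treating
// them as valid input makes every such host hash to the same UUID.
var placeholderSerials = map[string]bool{
	"": true, "-": true, "none": true, "to be filled by o.e.m.": true,
}

// usableSerial reports whether a motherboard serial should contribute to the
// machine UUID; callers would fall back to another identifier (for example a
// NIC MAC) when it returns false.
func usableSerial(serial string) bool {
	return !placeholderSerials[strings.ToLower(strings.TrimSpace(serial))]
}

func main() {
	for _, s := range []string{"-", "ABC123456"} {
		fmt.Printf("%q usable: %v\n", s, usableSerial(s))
	}
}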

The user-ca-bundle on our managed cluster contains two copies of all the entries from the parent cluster's trusted ca bundle, resulting in a massive user-ca-bundle, around 300 entries.

Our configMap on the hub cluster that contains the registries.conf and ca-bundle.crt has only one CA cert; our understanding is that this should be the only CA cert transferred into the new managed cluster.

We may be missing configuration somewhere, but we are unable to find anything and do not know where that would be configured. Our agentServiceConfig only specifies our one configMap.

We are deploying the cluster on baremetal using the ztp cluster-instance pattern.

This is causing us to be unable to deploy more clusters from our hub cluster due to the ignition file being too large.
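
Whatever the root cause of the duplication turns out to be, a minimal sketch of deduplicating byte-identical PEM blocks before the bundle is embedded in ignition; where (or whether) this would be wired into assisted-service is an assumption left open here:

package main

import (
	"crypto/sha256"
	"encoding/pem"
	"fmt"
)

// dedupePEM drops byte-identical certificate blocks from a concatenated PEM
// bundle, which would keep a doubled user-ca-bundle (and the ignition file
// that embeds it) at its original size.
func dedupePEM(bundle []byte) []byte {
	seen := map[[32]byte]bool{}
	var out []byte
	for rest := bundle; ; {
		var block *pem.Block
		block, rest = pem.Decode(rest)
		if block == nil {
			break
		}
		key := sha256.Sum256(block.Bytes)
		if seen[key] {
			continue
		}
		seen[key] = true
		out = append(out, pem.EncodeToMemory(block)...)
	}
	return out
}

func main() {
	const cert = "-----BEGIN CERTIFICATE-----\nMIIB\n-----END CERTIFICATE-----\n"
	doubled := []byte(cert + cert)
	fmt.Println("duplicate dropped:", len(dedupePEM(doubled)) == len(cert))
}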

Description of problem:

    When adding a node with `oc adm node-image`, the command is unable to pull the release image container and fails to generate the new node ISO.

Version-Release number of selected component (if applicable):

    

How reproducible:

    100%

Steps to Reproduce:

    1. Deploy OpenShift cluster with private registry in an offline environment
    2. Create the nodes-config.yaml for new nodes
    3. Run "oc adm node-image create --dir=/tmp/assets
    

Actual results:

    The command fails with an error saying that it cannot pull from quay.io/openshift-release-dev/ocp-release@shaXXXXX

Expected results:

    The command generates an ISO used to add the new worker nodes

Additional info:

    When creating the initial agent ISO using the "openshift-install agent create image" command, we can see in the output that a sub-command, "oc adm release extract", is run. When the install-config.yaml contains an ImageContentSourcePolicy or ImageDigestMirrorSet section, a flag (--icsp or --idms) is added to "oc adm release extract" which carries the mappings from quay.io to the private registry.

    The oc command does not have a top-level icsp or idms flag. The oc adm node-image command needs an icsp or idms flag so that it knows to pull the release image from the private registry instead of quay.io.

    Without this flag, the oc command has no way to know that it should be pulling container images from a private registry.
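
A minimal sketch of what honouring such a flag amounts to; the types and function below are illustrative assumptions, not oc's actual mirror-resolution code: rewrite the quay.io release-image reference to its configured mirror before pulling.

package main

import (
	"fmt"
	"strings"
)

// mirrorRule mirrors the shape of an ImageDigestMirrorSet entry: pulls for
// anything under Source may be served from one of the Mirrors.
type mirrorRule struct {
	Source  string
	Mirrors []string
}

// resolveMirror rewrites an image reference according to the first matching
// rule, which is roughly what "oc adm release extract --idms=..." does and
// what "oc adm node-image create" would need to do in a disconnected cluster.
func resolveMirror(image string, rules []mirrorRule) string {
	for _, r := range rules {
		if strings.HasPrefix(image, r.Source) && len(r.Mirrors) > 0 {
			return r.Mirrors[0] + strings.TrimPrefix(image, r.Source)
		}
	}
	return image
}

func main() {
	rules := []mirrorRule{{
		Source:  "quay.io/openshift-release-dev/ocp-release",
		Mirrors: []string{"registry.example.internal:5000/ocp-release"},
	}}
	ref := "quay.io/openshift-release-dev/ocp-release@sha256:abc123"
	fmt.Println(resolveMirror(ref, rules))
}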