
4.19.0-0.nightly-2024-11-22-092717


Changes from 4.18.0-ec.4

Note: this page shows the Feature-Based Change Log for a release

Incomplete Features

When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release.

Provide a simple way to get a VM-friendly networking setup, without having to configure the underlying physical network.

Goal

Primary user-defined networks can be managed from the UI and the user flow is seamless.

User Stories

  • As a cluster admin,
    I want to use the UI to define a ClusterUserDefinedNetwork, assigned with a namespace selector.
  • As a project admin,
    I want to use the UI to define a UserDefinedNetwork in my namespace.
  • As a project admin,
    I want to be prompted to create a UserDefinedNetwork before I create any Pods/VMs in my new project.
  • As a project admin running VMs in a namespace with UDN defined,
    I expect the "pod network" to be called "user-defined primary network",
    and I expect that when using it, the proper network binding is used.
  • As a project admin,
    I want to use the UI to request a specific IP for my VM connected to UDN.

UX doc

https://docs.google.com/document/d/1WqkTPvpWMNEGlUIETiqPIt6ZEXnfWKRElBsmAs9OVE0/edit?tab=t.0#heading=h.yn2cvj2pci1l

Non-Requirements

  • <List of things not included in this epic, to alleviate any doubt raised during the grooming process.>

Notes

Feature Overview (aka. Goal Summary)

As a result of HashiCorp's license change to BSL, Red Hat OpenShift needs to remove the use of HashiCorp's Terraform from the installer – specifically for IPI deployments, which currently use Terraform for setting up the infrastructure.

To avoid an increased support overhead once the license changes at the end of the year, we want to provision IBM Cloud VPC infrastructure without the use of Terraform.

Requirements (aka. Acceptance Criteria):

  • The IBM Cloud VPC IPI Installer no longer contains or uses Terraform.
  • The new provider should aim to provide the same results and have parity with the existing IBM Cloud VPC Terraform provider. Specifically, we should aim for feature parity against the install config and the cluster it creates to minimize impact on existing customers' UX.

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

Out of Scope

High-level list of items that are out of scope. Initial completion during Refinement status.

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.

Epic Goal

  • Replace Terraform infrastructure and machine (bootstrap, control plane) provisioning with CAPI-based approach.

Feature Overview (aka. Goal Summary)  

  • With this next-gen OLM GA release (graduated from ‘Tech Preview’), customers can: 
    • discover collections of k8s extension/operator contents released in the FBC format with richer visibility into their release channels, versions, update graphs, and the deprecation information (if any) to make informed decisions about installing and/or updating them.
    • install a k8s extension/operator declaratively and potentially automate with GitOps to ensure predictable and reliable deployments.
    • update a k8s extension/operator to a desired target version or keep it updated within a specific version range for security fixes without breaking changes.
    • remove a k8s extension/operator declaratively and entirely including cleaning up its CRDs and other relevant on-cluster resources (with a way to opt out of this coming up in a later release).
  • To address the security needs of 30% of our customers who run clusters in disconnected environments, the GA release will include cluster extension lifecycle management functionality for offline environments.
  • [Tech Preview] (Cluster)Extension lifecycle management can handle runtime signature validation for container images to support OpenShift’s integration with the rising Sigstore project for secure validation of cloud-native artifacts.

Goals (aka. expected user outcomes)

1. Pre-installation:

  • Customers can access a collection of k8s extension contents from a set of default catalogs leveraging the existing catalog images shipped with OpenShift (in the FBC format) with the new Catalog API from the OLM v1 GA release.
  • With the new GAed Catalog API, customers get richer package content visibility in their release channels, versions, update graphs, and the deprecation information (if any) to help make informed decisions about installation and/or update.
  • With the new GAed Catalog API, customers can render the catalog content in their clusters with fewer resources in terms of CPU and memory usage and faster performance.
  • Customers can filter the available packages based on the package name and see the relevant information from the metadata shipped within the package. 

2. Installation:

  • Customers using a ServiceAccount with sufficient permissions can install a k8s extension/operator with a desired target version or the latest version within a specific version range (from the associated channel) to get the latest security fixes.
  • Customers can easily automate the installation flow declaratively with GitOps to ensure predictable and reliable deployments.
  • Customers get protection from having two conflicting k8s extensions/operators owning the same API objects, i.e., no conflicting ownership, ensuring cluster stability.
  • Customers can access the metadata of the installed k8s extension/operator to see essential information such as its provided APIs, example YAMLs of its provided APIs, descriptions, infrastructure features, valid subscriptions, etc.
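
The "latest version within a specific version range" behaviour described above can be illustrated with a minimal sketch. This is not OLM v1's resolver: the function names, the plain MAJOR.MINOR.PATCH parsing, and the half-open range (lower bound included, upper bound excluded) are assumptions made for illustration only.

```python
from typing import Iterable, Optional, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Parse a plain 'MAJOR.MINOR.PATCH' string into a comparable tuple."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def latest_in_range(available: Iterable[str], lower: str, upper: str) -> Optional[str]:
    """Return the highest version v with lower <= v < upper, or None if none match."""
    low, high = parse_version(lower), parse_version(upper)
    candidates = [v for v in available if low <= parse_version(v) < high]
    # Compare numerically, so that 1.10.1 ranks above 1.2.3.
    return max(candidates, key=parse_version, default=None)


# Hypothetical versions published in one channel of a package.
channel_versions = ["1.1.0", "1.2.0", "1.2.3", "1.10.1", "2.0.0"]
print(latest_in_range(channel_versions, "1.2.0", "2.0.0"))  # prints 1.10.1
```

Excluding the upper bound is one way to stay on a major version: the newest 1.x fixes are picked up while the potentially breaking 2.0.0 is not.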

3. Update:

  • Customers can see what updates are available for their k8s extension/operators in the form of immediate target versions and the associated update channels.
  • Customers can trigger the update of a k8s extension/operator with a desired target version or the latest version within a specific version range (from the associated channel) to get the latest security fixes.
  • Customers get protection from workload or k8s extension/operator breakage due to CustomResourceDefinition (CRD) being upgraded to a backward incompatible version during an update.
  • During an OpenShift cluster update, customers are informed when installed k8s extensions/operators do not support the next OpenShift version (when annotated by the package author/provider).  Customers must update those k8s extensions/operators to a newer/compatible version before OLM unblocks the OpenShift cluster update.

4. Uninstallation/Deletion:

  • Customers can declaratively and cleanly remove an installed k8s extension/operator, including deleting CustomResourceDefinitions (CRDs), custom resource objects (CRs) of the CRDs, and other relevant resources, to revert the cluster to its state before the installation.

5. Disconnected Environments for High-Security Workloads:

  • Approximately 30% of our customers prioritize high security by running their clusters in internet-disconnected environments, especially for mission-critical production workloads. To benefit these users, our supported GA release needs to include cluster extension lifecycle management functionality that functions within these disconnected environments.

6. [Tech Preview] Signature Validation for Secure Workflows:

  • The Red Hat-sponsored Sigstore project is gaining traction in the Kubernetes community, aiming to simplify the signing of cloud-native artifacts. OpenShift leverages Sigstore tooling to enable scalable and flexible signature validation, including support for disconnected environments. This functionality will be available as a Tech Preview in 4.17 and is targeted for General Availability (GA) Tech Preview Phase 2 in the upcoming 4.18 release. To fully support this integration as a Tech Preview release, the (cluster)extension lifecycle management needs to (be prepared to) handle runtime validation of Sigstore signatures for container images.

Requirements (aka. Acceptance Criteria):

All the expected user outcomes and the acceptance criteria in the engineering epics are covered.

Background

OLM: Gateway to the OpenShift Ecosystem

Operator Lifecycle Manager (OLM) has been a game-changer for OpenShift Container Platform (OCP) 4.  Since its launch in 2019, OLM has fostered a rich ecosystem, expanding from a curated set of 25 operators to over 100 officially supported Red Hat operators and hundreds more from certified ISVs and the community.

OLM empowers users to manage diverse technologies with ease, including ACM, ACS, Quay, GitOps, Pipelines, Service Mesh, Serverless, and Virtualization.  It has also facilitated the introduction of groundbreaking operators for entirely new workloads, like Nvidia GPU, PTP, Windows Machine Config, SR-IOV networking, and more.  Today, a staggering 91% of our connected customers leverage OLM's capabilities.

OLM v0: A Stepping Stone

While OLM v0 has been instrumental, it has limitations.  The API design, not fully GitOps-friendly or entirely declarative, presents a steeper learning curve due to its complexity.  Furthermore, OLM v0 was designed with the assumption of namespace-scoped CRDs (Custom Resource Definitions), allowing for independent operator installations and parallel versions within a single cluster.  However, this functionality never materialized in core Kubernetes, and OLM v0's attempt to simulate it has introduced limitations and bugs.

The Operator Framework Team: Building the Future

The Operator Framework team is the cornerstone of the OpenShift ecosystem.  They build and manage OLM, the Operator SDK, operator catalog formats, and tooling (opm, file-based catalogs).  Their work directly impacts how operators are developed, packaged, delivered, and managed by users and SRE teams on OpenShift clusters.

A Streamlined Future with OLM v1

The Operator Framework team has undergone significant restructuring to focus on the next generation of OLM – OLM v1.  This transition includes moving the Operator SDK to a feature-complete state with ongoing maintenance for compatibility with the latest Kubernetes and controller-runtime libraries.  This strategic shift allows the team to dedicate resources to completely revamping OLM's API and management concepts for catalog content delivery.  

Leveraging learnings and customer feedback since OCP 4's inception, OLM v1 is designed to be a major overhaul, and it will be shipped as a Generally Available (GA) feature in OpenShift 4.17.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

1. Pre-installation:

  • [GA release] Docs provide instructions on how to add Red Hat-provided Operator catalogs with the pull secret for catalogs hosted on a secure registry.
  • [GA release] Docs provide instructions on how to discover the Operator packages from a catalog.
  • [GA release] Docs provide instructions on how to query and inspect the metadata of Operator bundles and find feasible ones to be installed with the OLM v1.

2. Installation:

  • [GA release] Docs provide instructions on how to use a ServiceAccount with sufficient permissions to install a k8s extension/operator with a desired target version or the latest version within a specific version range to get the latest security fixes.
  • [GA release] Docs provide instructions on how to automate the installation flow declaratively with GitOps to ensure predictable and reliable deployments.
  • [GA release] Docs mention the OLM v1’s protection from having two conflicting k8s extensions/operators owning the same API objects, i.e., no conflicting ownership, ensuring cluster stability.
  • [GA release] Docs provide instructions on how to access the metadata of the installed k8s extension/operator to see essential information such as its provided APIs, example YAMLs of its provided APIs, descriptions, infrastructure features, valid subscriptions, etc.
  • [GA release] Docs explain how to create RBACs from a CRD to grant cluster users access to the installed k8s extension/operator's provided APIs.

3. Update:

  • [GA release] Docs provide instructions on how to see what updates are available for their k8s extension/operators in the form of immediate target versions and the associated update channels.
  • [GA release] Docs provide instructions on how to trigger the update of a k8s extension/operator with a desired target version or the latest version within a specific version range to get the latest security fixes.
  • [GA release] Docs mention OLM v1’s protection from workload or k8s extension/operator breakage due to CustomResourceDefinition (CRD) being upgraded to a backward incompatible version during an update.
  • [GA release] Docs mention OLM v1 will block the OpenShift cluster update if installed k8s extensions/operators do not support the next OpenShift version (when annotated by the package author/provider).  Provide instructions on how to find and update to a newer/compatible version before OLM unblocks the OpenShift cluster update.

4. Uninstallation/Deletion:

  • [GA release] Docs provide instructions on how to cleanly remove an installed k8s extension/operator including deleting CustomResourceDefinitions (CRDs), custom resource objects (CRs) of the CRDs, and other relevant resources.
  • [GA release] Docs provide instructions to verify the cluster has been reverted to its original state after uninstalling a k8s extension/operator.

Relevant upstream CNCF OLM v1 requirements, engineering brief, and epics:

1. Pre-installation:

2. Installation:

3. Update:

4. Uninstallation/Deletion:

Relevant documents:

 

OCP/Telco Definition of Done
Epic Template descriptions and documentation.


Epic Goal

  • Once OLM v1.0.0 is feature complete and the team feels comfortable enabling it by default, we should remove the OLM v1 feature flag and deploy it on all clusters by default.
  • We should also introduce OLMv1 behind a CVO capability to give customers the option of leaving it disabled in their clusters.

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • OLMv1 is enabled by default in OCP
  • OLMv1 can be fully disabled at install/upgrade time using CVO capabilities

 

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>
  1. We are encountering toleration misses in origin Azure tests, which are preventing our components from stabilizing.
  2. cluster-olm-operator is spamming the API server with condition lastUpdateTimes.
  3. The disconnected environment in CI/origin is different from OLMv1 expectations (but we do feel that v1 disconnected functionality is getting enough validation elsewhere to be confident).  Created OCPBUGS-44810 to align expectations of the disconnected environments.

 

Update cluster-olm-operator manifests to be in the payload by default.

 

A/C:

 - Removed "release.openshift.io/feature-set: TechPreviewNoUpgrade" annotation
 - Ensure the following cluster profiles are targeted by all manifests:
    - include.release.openshift.io/hypershift: "true"
    - include.release.openshift.io/ibm-cloud-managed: "true"
    - include.release.openshift.io/self-managed-high-availability: "true"
    - include.release.openshift.io/single-node-developer: "true"
 - No installation related annotations are present in downstream operator-controller and catalogd manifests
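
The first two acceptance criteria above can be expressed as a small check over a manifest's annotations. This is a minimal sketch, not part of cluster-olm-operator or its CI: the function name and the example dictionaries are hypothetical, and only the annotation keys and values are taken from the list above.

```python
FEATURE_SET_ANNOTATION = "release.openshift.io/feature-set"
REQUIRED_PROFILE_ANNOTATIONS = (
    "include.release.openshift.io/hypershift",
    "include.release.openshift.io/ibm-cloud-managed",
    "include.release.openshift.io/self-managed-high-availability",
    "include.release.openshift.io/single-node-developer",
)


def check_manifest_annotations(annotations):
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    # The TechPreviewNoUpgrade gate must have been removed.
    if annotations.get(FEATURE_SET_ANNOTATION) == "TechPreviewNoUpgrade":
        problems.append(f"still gated by {FEATURE_SET_ANNOTATION}: TechPreviewNoUpgrade")
    # Every listed cluster profile must be targeted with the string "true".
    for key in REQUIRED_PROFILE_ANNOTATIONS:
        if annotations.get(key) != "true":
            problems.append(f"cluster profile not targeted: {key}")
    return problems


# Hypothetical metadata.annotations of a manifest that satisfies the criteria.
good = {key: "true" for key in REQUIRED_PROFILE_ANNOTATIONS}
# Hypothetical manifest that is still Tech Preview and misses one profile.
bad = dict(good, **{FEATURE_SET_ANNOTATION: "TechPreviewNoUpgrade"})
del bad["include.release.openshift.io/hypershift"]

print(check_manifest_annotations(good))       # prints []
print(len(check_manifest_annotations(bad)))   # prints 2
```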

OpenShift offers "capabilities" to allow users to select which components to include in the cluster at install time.

It was decided that the capability name should be OperatorLifecycleManagerV1.

A/C:

 - ClusterVersion resource updated with OLM v1 capability
 - cluster-olm-operator manifests updated with capability.openshift.io/name=OperatorLifecycleManagerV1 annotation
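
The effect of the capability annotation can be sketched as a simple gate: a manifest that names a capability is applied only when that capability is enabled for the cluster. This is a simplified model for illustration, not the cluster-version operator's implementation; the function name and the example capability sets are assumptions, and each manifest is assumed to name at most one capability.

```python
CAPABILITY_ANNOTATION = "capability.openshift.io/name"
OLM_V1_CAPABILITY = "OperatorLifecycleManagerV1"


def manifest_is_enabled(annotations, enabled_capabilities):
    """Return True if the manifest should be applied for the given enabled capabilities."""
    capability = annotations.get(CAPABILITY_ANNOTATION)
    # Manifests without the annotation are not gated by any capability.
    return capability is None or capability in enabled_capabilities


olm_manifest = {CAPABILITY_ANNOTATION: OLM_V1_CAPABILITY}

# Hypothetical cluster where the capability is enabled.
print(manifest_is_enabled(olm_manifest, {OLM_V1_CAPABILITY, "Console"}))  # prints True
# Hypothetical cluster where the user left OLM v1 disabled.
print(manifest_is_enabled(olm_manifest, {"Console"}))                     # prints False
```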

Feature Overview (aka. Goal Summary)  

Customers who deploy a large number of OpenShift on OpenStack clusters want to minimise the resource requirements of their cluster control planes.

Customers deploying RHOSO (OpenShift services for OpenStack, i.e. OpenStack control plane on bare metal OpenShift) already have a bare metal management cluster capable of serving Hosted Control Planes.

We should enable self-hosted (i.e. on-prem) Hosted Control Planes to serve Hosted Control Planes to OpenShift on OpenStack clusters, with a specific focus of serving Hosted Control Planes from the RHOSO management cluster.

Goals (aka. expected user outcomes)

As an enterprise IT department and OpenStack customer, I want to provide self-managed OpenShift clusters to my internal customers with minimum cost to the business.

As an internal customer of said enterprise, I want to be able to provision an OpenShift cluster for myself using the business's existing OpenStack infrastructure.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

TBD
 

Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Goal

  • Ability to run Cinder and Manila operators as controller Pods in a hosted control plane
  • Ability to run the Node DaemonSet in guest clusters

Why is this important?

  • Continue supporting usage of CSIs for the guest cluster, just as is possible with standalone OpenShift clusters.

Scenarios


  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. https://github.com/openshift/enhancements/blob/master/enhancements/storage/storage-hypershift.md
  2. https://issues.redhat.com/browse/OCPSTRAT-210

Open questions:

In OSASINFRA-3608, we merged the openshift/openstack-cinder-csi-driver-operator repository into openshift/csi-operator and modified it to take advantage of the new generator framework provided therein. Now, we want to build on this, adding Hypershift-specific assets and tweaking whatever else is needed.

In an HCP deployment, the hosted-cluster-config-operator is responsible for deploying other operators, such as the cluster-storage-operator. We need to modify this operator to deploy cluster-storage-operator and enable the openstack-manila-csi-driver-operator when deployed in an OpenStack environment.

In an HCP deployment, the hosted-cluster-config-operator is responsible for deploying other operators, such as the cluster-storage-operator. We need to modify this operator to deploy cluster-storage-operator and enable the openstack-cinder-csi-driver-operator when deployed in an OpenStack environment.

Feature Overview

Ability to install OpenShift on Nutanix with nodes having multiple NICs (multiple subnets) from IPI and for autoscaling with MachineSets.

 


Feature Overview

Implement authorization to secure API access for different user personas/actors in the agent-based installer.

User Personas:

  • Read-Only Access: For "wait-for" and "monitor-add-nodes" commands.
  • Read-Write Access: For systemd services and the agent service.

This is 

Goals

The agent-based installer APIs have implemented basic security measures through authentication, as covered in AGENT-145.

To further enhance security, it is crucial to implement user persona/actor-based authorization, allowing for differentiated access control, such as read-only or read-write permissions, based on the user's role.

The goal of this implementation is to provide a more robust and secure API framework, ensuring that users can only perform actions appropriate to their role.

Epic Goal

  • Implement authorization to secure API access for different user personas/actors in the agent-based installer.
  • User Personas:
    • Read-Only Access: For "wait-for" and "monitor-add-nodes" commands.
    • Read-Write Access: For systemd services and the agent service.

Why is this important?

  • The agent-based installer APIs have implemented basic security measures through authentication, as covered in AGENT-145. To further enhance security, it is crucial to implement user persona/actor-based authorization, allowing for differentiated access control, such as read-only or read-write permissions, based on the user's role. This approach will provide a more robust and secure API framework, ensuring that users can only perform actions appropriate to their role. 

Scenarios

  1. Users running the wait-for or monitor-add-nodes commands should have read-only permissions. They should not be able to write to the API. If they attempt a write operation, an appropriate error message should be displayed, indicating that they are not authorized to write.
  2. Users associated with running systemd services should have both read and write permissions.
  3. Users associated with running the agent service should also have read and write permissions.
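
The three scenarios above amount to a mapping from persona to permitted operations. The following is a minimal sketch of such a check, not the agent-based installer's implementation: the persona names, the use of HTTP methods to distinguish reads from writes, and the error type are all hypothetical.

```python
from enum import Enum


class Persona(Enum):
    WATCHER = "watcher"    # hypothetical: wait-for / monitor-add-nodes commands
    SYSTEMD = "systemd"    # hypothetical: systemd services
    AGENT = "agent"        # hypothetical: the agent service


READ_METHODS = frozenset({"GET", "HEAD"})
WRITE_METHODS = frozenset({"POST", "PUT", "PATCH", "DELETE"})

# Scenario 1 is read-only; scenarios 2 and 3 are read-write.
ALLOWED_METHODS = {
    Persona.WATCHER: READ_METHODS,
    Persona.SYSTEMD: READ_METHODS | WRITE_METHODS,
    Persona.AGENT: READ_METHODS | WRITE_METHODS,
}


class NotAuthorizedError(Exception):
    """Raised when a persona attempts an operation it is not permitted to perform."""


def authorize(persona, method):
    """Raise NotAuthorizedError unless the persona may use the given HTTP method."""
    method = method.upper()
    if method not in ALLOWED_METHODS[persona]:
        raise NotAuthorizedError(
            f"persona '{persona.value}' is not authorized to perform {method} requests"
        )


authorize(Persona.WATCHER, "GET")    # allowed: read-only persona reading
authorize(Persona.AGENT, "POST")     # allowed: read-write persona writing
try:
    authorize(Persona.WATCHER, "POST")
except NotAuthorizedError as err:
    print(err)  # prints persona 'watcher' is not authorized to perform POST requests
```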

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1.  

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story:

As an ABI user, I want to be able to:

  • add worker nodes on day 2 when the authorization implementation creates 3 separate auth tokens, one for each user persona
  • save the 3 auth tokens generated when creating the nodes ISO into the cluster as a secret
  • regenerate the auth tokens and refresh the asset store if the tokens stored in the cluster secret have expired.

so that I can achieve

  • successful installation
  • adding workers to a cluster
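
The "regenerate if expired" part of this story can be sketched as follows. This is an illustration under stated assumptions, not the installer's code: the persona names, the 48-hour lifetime, the opaque random token values, and the plain dictionary standing in for the cluster secret are all hypothetical.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

PERSONAS = ("watcher", "systemd", "agent")   # hypothetical persona names
TOKEN_LIFETIME = timedelta(hours=48)         # hypothetical token lifetime


@dataclass(frozen=True)
class AuthToken:
    value: str
    expires_at: datetime


def new_token(now):
    """Create an opaque random token that expires TOKEN_LIFETIME after 'now'."""
    return AuthToken(secrets.token_urlsafe(32), now + TOKEN_LIFETIME)


def refresh_tokens(stored, now):
    """Return (tokens, regenerated); regenerate all three if any is missing or expired."""
    needs_refresh = any(
        name not in stored or stored[name].expires_at <= now for name in PERSONAS
    )
    if not needs_refresh:
        return dict(stored), False
    return {name: new_token(now) for name in PERSONAS}, True


now = datetime(2024, 11, 22, tzinfo=timezone.utc)
tokens, regenerated = refresh_tokens({}, now)                      # nothing stored yet
print(regenerated)                                                 # prints True
tokens, regenerated = refresh_tokens(tokens, now + timedelta(hours=1))
print(regenerated)                                                 # prints False
tokens, regenerated = refresh_tokens(tokens, now + timedelta(hours=49))
print(regenerated)                                                 # prints True
```

Regenerating all three tokens together, rather than only the expired one, keeps the stored set consistent; whether the real implementation does this is not stated in the story.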

Acceptance Criteria:

Description of criteria:

  • Upstream documentation
  • Point 1
  • Point 2
  • Point 3

(optional) Out of Scope:

Detail about what is specifically not being delivered in the story

Engineering Details:

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Feature Overview

Improve the cluster expansion with the agent workflow added in OpenShift 4.16 (TP) and OpenShift 4.17 (GA) with:

  • Caching the RHCOS image for faster node addition (i.e. no extraction of the image every time)
  • Add a single node with just one command, no need to write config files describing node
  • Support creating PXE artifacts 

Goals

Improve the user experience and functionality of the commands to add nodes to clusters using the image creation functionality.

Epic Goal

  • Cleanup/carryover work from AGENT-682 and WRKLDS-937 that were non-urgent for GA of the day 2 implementation

Make the output of the two commands more consistent by using the recently introduced base command logger.

Currently the oc node-image create command does not report any relevant information that could help the user understand where each element (for example, the SSH key) was retrieved from, which makes it more difficult to troubleshoot any issue.

For this reason, it would be useful for the node-joiner tool to produce a proper JSON file reporting all the details about the relevant resources fetched for generating the image. The oc command should be able to expose them when required (i.e. via a command flag).

Currently all the *.iso files generated by the node-joiner tool are copied back to the user. Since the node-joiner also creates the node-config unconditionally, that one is copied even if it was not requested, which is confusing for the end user.

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

<your text here>

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

<your text here>

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

link back to OCPSTRAT-1644 somehow

 

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

 
Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?   Does it improve security, performance, supportability, etc?  Why is work a priority?

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

We need to do a lot of R&D and fix some known issues (e.g., see linked BZs). 

 

R&D targeted at 4.16 and productisation of this feature in 4.17

 

Goal
To make the current implementation of the HAProxy config manager the default configuration.

Objectives

  • Disable pre-allocation route blueprints
  • Limit dynamic server allocation
  • Provide customer opt-out
    • Offer customers a handler to opt out of the default config manager implementation.

 

The goal of this user story is to combine the code from the smoke test user story and results from the spike into an implementation PR.

Since multiple gaps were discovered a feature gate will be needed to ensure stability of OCP before the feature can be enabled by default.

https://issues.redhat.com/browse/NE-1788 describes 3 gaps in the implementation of DAC:

  • Idled services are woken up by the health check from the servers set by DAC (server-template).
  • ALPN TLS extension is not enabled for reencrypt routes.
  • Dynamic servers produce dummy metrics.

Additional gaps were discovered along the way:

This story aims at fixing those gaps.

Feature Overview (aka. Goal Summary)  

Add support for the Installer to configure IPV4Subnet to customize internal OVN network in BYO VPC.

Goals (aka. expected user outcomes)

As an OpenShift user I'm able to provide IPv4 subnets to the Installer so I can customize the OVN networks at install time

Requirements (aka. Acceptance Criteria):

The Installer will allow the user to provide the information via the install config manifest and this information will be used at install time to configure the OVN network and deploy the cluster into an existing VPC provided by the user. 
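
As a rough illustration of the kind of check such an input might need, the sketch below parses a user-supplied IPv4 subnet and rejects one that overlaps the machine network. The helper name `validate_internal_subnet` and the overlap rule are assumptions made for illustration; the text above does not name the install config field or describe the Installer's actual validation.

```python
import ipaddress


def validate_internal_subnet(subnet: str, machine_cidr: str) -> ipaddress.IPv4Network:
    """Parse a user-supplied IPv4 subnet and check it against the machine network.

    Hypothetical helper: the overlap rule is an assumption, not Installer code.
    """
    # strict=True rejects CIDRs with host bits set, e.g. 10.0.0.1/16.
    network = ipaddress.IPv4Network(subnet, strict=True)
    machine = ipaddress.IPv4Network(machine_cidr, strict=True)
    if network.overlaps(machine):
        raise ValueError(f"{subnet} overlaps machine network {machine_cidr}")
    return network
```

Both a malformed CIDR and an overlapping range surface as `ValueError`, so a caller could report them to the user in the same way.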

Background

This is a requirement for ROSA, ARO and OSD

Documentation Considerations

As any other option for the Installer this will be documented as usual.

Implementation Considerations

Terraform is used for creating or referencing VPCs

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

Feature Overview (aka. Goal Summary)  

Configure IPV4Subnet to customize internal OVN network in BYOVPC

Goals (aka. expected user outcomes)

Users are able to successfully provide IPV4Subnets through the install config that are used to customize the OVN networks.

Requirements (aka. Acceptance Criteria):

  • Install config parameter is added to accept user input.
  • Input is provided to the OVN network during installation and is used to install them onto the BYOVPC

Use Cases (Optional):

ROSA, ARO and OSD need this for their products.

Questions to Answer (Optional):

-

Out of Scope

Other cloud platforms except AWS

Background

-

Customer Considerations

-

Documentation Considerations

-

Interoperability Considerations

-

 

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Incomplete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled

  • Make CMO config parsing/unmarshalling more strict
  • Implement the validation logic (make sure it uses the same code blocks as when CMO tries to apply the config, to avoid divergence)
  • Expose the webhook on CMO.
  • Provide an opt-out mechanism
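
The idea of strict parsing with one shared validation path can be sketched as follows. This is a minimal illustration, not CMO code: the function name `parse_strict` is hypothetical, the key set is an illustrative subset, and JSON is used only so the example runs with the standard library.

```python
import json

# Illustrative subset of top-level keys; the real set is defined by CMO.
KNOWN_KEYS = {"prometheusK8s", "alertmanagerMain", "enableUserWorkload"}


def parse_strict(raw: str, known_keys=KNOWN_KEYS) -> dict:
    """Parse a JSON config and fail on any top-level key that is not known."""
    config = json.loads(raw)
    if not isinstance(config, dict):
        raise ValueError("config must be a mapping")
    unknown = sorted(set(config) - set(known_keys))
    if unknown:
        raise ValueError(f"unknown fields: {', '.join(unknown)}")
    return config
```

If both the webhook and the code that applies the config call the same function, a misspelled key such as `prometheusk8s` is rejected at admission time with the same error the operator would report later.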

Epic Goal
Through this epic, we will update our CI to use an available agent-based workflow instead of the libvirt openshift-installer, allowing us to eliminate the use of terraform in our deployments.

Why is this important?
There is an active initiative in openshift to remove terraform from the openshift installer.

Acceptance Criteria

  • All tasks within the epic are completed.

Done Checklist

  • CI - For new features (non-enablement), existing Multi-Arch CI jobs are not broken by the Epic
  • All the stories, tasks, sub-tasks and bugs that belong to this epic need to have been completed and indicated by a status of 'Done'.

Context thread.

Description of problem:

     While monitoring the 4.18 agent-based installer CI job for s390x (https://github.com/openshift/release/pull/50293), I discovered unexpected behavior once the installation triggers the reboot-into-disk step for the 2nd and 3rd control plane nodes. (The first control plane node is rebooted last because it's also the bootstrap node.) Instead of rebooting successfully as expected, the node fails to find the OSTree and drops to dracut, stalling the installation.

Version-Release number of selected component (if applicable):

    OpenShift 4.18 on s390x only; discovered using agent installer

How reproducible:

    Try to install OpenShift 4.18 using agent-based installer on s390x

Steps to Reproduce:

    1. Boot nodes with XML (see attached)
    2. Wait for installation to get to reboot phase.
    

Actual results:

    Control plane nodes fail to reboot.

Expected results:

    Control plane nodes reboot and installation progresses.

Additional info:

    See attached logs.

The history of this epic starts with this PR, which triggered a lengthy conversation around the workings of the image API with respect to importing imagestream images as single vs manifest-listed. The imagestreams today by default have the `importMode` flag set to `Legacy` to avoid breaking behavior of existing clusters in the field. This makes sense for single-arch clusters deployed with a single-arch payload, but when users migrate to the multi payload, more often than not their intent is to add nodes of other architecture types. When this happens, it gives rise to problems when using imagestreams with the default behavior of importing a single manifest image. The oc commands do have a new flag to toggle the importMode, but this breaks functionality for existing users who just want to create an imagestream and use it with existing commands.

There was a discussion with David Eads and other staff engineers, and it was decided that the approach to be taken is to default imagestreams' importMode to `preserveOriginal` if the cluster is installed with or upgraded to a multi payload. So a few things need to happen to achieve this:

  • CVO would need to expose a field in the status section indicative of the type of payload in the cluster (single vs multi)
  • cluster-openshift-apiserver-operator would read this field and add it to the apiserver configmap. openshift-apiserver would use this value to determine the setting of importMode value.
  • Document clearly that the behavior of imagestreams in a cluster with multi payload is different from the traditional single payload
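
The defaulting rule described above can be sketched as a single function. The name `default_import_mode` and the `"single"`/`"multi"` strings are hypothetical stand-ins for whatever value the CVO ends up exposing; the returned strings follow the spelling used in this text, and the exact API spelling is an assumption.

```python
def default_import_mode(payload_type: str) -> str:
    """Return the default importMode for newly created imagestreams.

    payload_type is the hypothetical value the CVO would expose
    ("single" or "multi"); anything other than "multi" keeps the legacy default.
    """
    if payload_type.lower() == "multi":
        return "preserveOriginal"
    return "Legacy"
```

Keeping `Legacy` as the fallback means a missing or unrecognized payload type does not change behavior on existing single-arch clusters.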

Some open questions:

  • What happens to existing imagestreams on upgrades?
  • How do we handle CVO managed imagestreams? (IMO, CVO managed imagestreams should always set importMode to preserveOriginal as the images are associated with the payload)

 

For the apiserver operator to figure out the payload type and set the import mode defaults, the CVO needs to expose that value through the status field. This information is available today in the conditions list, but it's not pretty to extract it and infer the payload type as it is contained in the message string. The way to do it today is shown here. It would be better for CVO to expose it as a separate field which can be easily consumed by any controller and also be used for telemetry in the future.
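
To show why inferring the payload type from a message string is fragile, here is a minimal sketch of that kind of extraction. It assumes the condition message contains a marker of the form `architecture="..."`; that shape, and the function name `payload_architecture`, are assumptions for illustration and are not taken from the CVO source.

```python
import re
from typing import Optional

# Assumed message shape; the real condition text may differ or change.
ARCH_PATTERN = re.compile(r'architecture="(?P<arch>[^"]+)"')


def payload_architecture(cluster_version: dict) -> Optional[str]:
    """Scan status.conditions messages for an architecture marker."""
    conditions = cluster_version.get("status", {}).get("conditions", [])
    for condition in conditions:
        match = ARCH_PATTERN.search(condition.get("message", ""))
        if match:
            return match.group("arch")
    # No marker found: the caller cannot tell single from multi.
    return None
```

Any rewording of the message silently turns the result into `None`, which is the kind of coupling a dedicated status field would avoid.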

 

Epic Goal

  • Update all images that we ship with OpenShift to the latest upstream releases and libraries.
  • Exact content of what needs to be updated will be determined as new images are released upstream, which is not known at the beginning of OCP development work. We don't know what new features will be included and should be tested and documented. Especially new CSI drivers releases may bring new, currently unknown features. We expect that the amount of work will be roughly the same as in the previous releases. Of course, QE or docs can reject an update if it's too close to deadline and/or looks too big.

Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).

Why is this important?

  • We want to ship the latest software that contains new features and bugfixes.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.

(Using separate cards for each driver because these updates can be more complicated)

https://github.com/openshift/gcp-pd-csi-driver

Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.
Please wait until openshift/api, openshift/library-go, and openshift/client-go are updated to the newest Kubernetes release! There may be non-trivial changes in these libraries.

This includes (but is not limited to):

  • Kubernetes:
    • client-go
    • controller-runtime
  • OCP:
    • library-go
    • openshift/api
    • openshift/client-go
    • operator-sdk

Operators:

  • csi-operator
  • openstack-cinder-csi-driver-operator
  • gcp-pd-csi-driver-operator
  • gcp-filestore-csi-driver-operator
  • csi-driver-manila-operator
  • vmware-vsphere-csi-driver-operator
  • ibm-vpc-block-csi-driver-operator
  • ibm-powervs-block-csi-driver-operator
  • secrets-store-csi-driver-operator
  • cluster-storage-operator
  • cluster-csi-snapshot-controller-operator
  • local-storage-operator
  • vsphere-problem-detector

(please cross-check with *-operator + vsphere-problem-detector in our tracking sheet)

EOL, do not upgrade:

  • github.com/oVirt/csi-driver-operator
  • github.com/openshift/alibaba-disk-csi-driver-operator
  • github.com/openshift/csi-driver-shared-resource-operator

The following operators were migrated to csi-operator, do not update these obsolete repos:

  • github.com/openshift/aws-efs-csi-driver-operator
  • github.com/openshift/azure-disk-csi-driver-operator
  • github.com/openshift/azure-file-csi-driver-operator

tools/library-bump.py  and tools/bump-all  may be useful. For 4.16, this was enough:

mkdir 4.16-bump
cd 4.16-bump
../library-bump.py --debug --web <file with repo list> STOR-1574 --run "$PWD/../bump-all github.com/google/cel-go@v0.17.7" --commit-message "Bump all deps for 4.16" 

4.17 perhaps needs an older prometheus:

../library-bump.py --debug --web <file with repo list> STOR-XXX --run "$PWD/../bump-all github.com/google/cel-go@v0.17.8 github.com/prometheus/common@v0.44.0 github.com/prometheus/client_golang@v1.16.0 github.com/prometheus/client_model@v0.4.0 github.com/prometheus/procfs@v0.10.1" --commit-message "Bump all deps for 4.17" 

4.18 special:

Add "spec.unhealthyEvictionPolicy: AlwaysAllow" to all PodDisruptionBudget objects of all our operators + operands. See WRKLDS-1490 for details
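
As a minimal sketch of the change, the helper below sets the field on a PodDisruptionBudget manifest represented as a plain dict. The function name `with_always_allow` is hypothetical; in the operators themselves the change would be made in the manifests or Go code.

```python
import copy


def with_always_allow(pdb: dict) -> dict:
    """Return a copy of a PodDisruptionBudget manifest with
    spec.unhealthyEvictionPolicy set to AlwaysAllow."""
    if pdb.get("kind") != "PodDisruptionBudget":
        raise ValueError("not a PodDisruptionBudget manifest")
    # Work on a copy so the caller's manifest is left untouched.
    patched = copy.deepcopy(pdb)
    patched.setdefault("spec", {})["unhealthyEvictionPolicy"] = "AlwaysAllow"
    return patched
```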

There has been a change in the library-go function called `WithReplicasHook`. See https://github.com/openshift/library-go/pull/1796.

Other Complete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled

Description of problem:

    `tag:UntagResources` is required for the AWS SDK call [UntagResourcesWithContext](https://github.com/openshift/installer/blob/master/pkg/destroy/aws/shared.go#L121) when removing the "shared" tag from the IAM profile.

Version-Release number of selected component (if applicable):

    4.17+

How reproducible:

    always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    time="2024-11-19T12:22:19Z" level=debug msg="search for IAM instance profiles"
time="2024-11-19T12:22:19Z" level=debug msg="Search for and remove tags in us-east-1 matching kubernetes.io/cluster/ci-op-y8wbktiq-e515e-q6kvb: shared"
time="2024-11-19T12:22:19Z" level=debug msg="Nothing to clean for shared iam resource" arn="arn:aws:iam::460538899914:instance-profile/ci-op-y8wbktiq-e515e-byo-profile-worker"
time="2024-11-19T12:22:19Z" level=debug msg="Nothing to clean for shared iam resource" arn="arn:aws:iam::460538899914:instance-profile/ci-op-y8wbktiq-e515e-byo-profile-master"
time="2024-11-19T12:22:19Z" level=info msg="untag shared resources: AccessDeniedException: User: arn:aws:iam::460538899914:user/ci-op-y8wbktiq-e515e-minimal-perm is not authorized to perform: tag:UntagResources because no identity-based policy allows the tag:UntagResources action\n\tstatus code: 400, request id: 464de6ab-3de5-496d-a163-594dade11619"

See: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/58833/rehearse-58833-pull-ci-openshift-installer-release-4.18-e2e-aws-ovn-custom-iam-profile/1858807924600606720

Expected results:

    The permission is added to the required list when a BYO IAM profile is used and the "shared" tag is removed from the profiles.
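
The expected behavior can be sketched as a conditional addition to the required permission set. The helper `destroy_permissions` and its `base` argument are hypothetical and stand in for the installer's existing list; only the `tag:UntagResources` action comes from the report above.

```python
def destroy_permissions(base: set, byo_iam_profile: bool) -> set:
    """Return the permissions needed for cluster destroy.

    Hypothetical helper: `base` stands in for the existing required list.
    """
    required = set(base)
    if byo_iam_profile:
        # Removing the "shared" tag from a user-provided profile needs this action.
        required.add("tag:UntagResources")
    return required
```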

Additional info:

    

Description of problem:

During the integration of Manila into csi-operator, a new controller was added to csi-operator that checks whether a precondition is valid in order to trigger all the other controllers. The precondition defined for Manila checks that Manila shares exist, and if that is the case it syncs the CSI Driver and the Storage Classes. We need to handle the error returned in case those syncs fail.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

The finally tasks do not get removed and remain in the pipeline.    

Version-Release number of selected component (if applicable):

    In all supported OCP versions

How reproducible:

    Always

Steps to Reproduce:

1. Create a finally task in a pipeline in pipeline builder
2. Save pipeline
3. Edit pipeline and remove finally task in pipeline builder
4. Save pipeline
5. Observe that the finally task has not been removed

Actual results:

The finally tasks do not get removed and remain in the pipeline.    

Expected results:

Finally task gets removed from pipeline when removing the finally tasks and saving the pipeline in the "pipeline builder" mode.    

Additional info:

    

Description of problem:

    If zones are not specified in the install-config.yaml, the installer will discover all the zones available for the region. Then it will try to filter those zones based on the instance type, which requires the `ec2:DescribeInstanceTypeOfferings` permission.

Version-Release number of selected component (if applicable):

    4.16+

How reproducible:

    Always, by not specifying zones in the install-config.yaml and installing a cluster with a minimal-permissions user.

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    TBA

Expected results:

    A failure message indicating that `ec2:DescribeInstanceTypeOfferings` is needed when zones are not set.
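
A preflight check along these lines could produce that message. This is a sketch under stated assumptions: `check_zone_discovery` is a hypothetical name, `install_config` is a simplified dict rather than the real schema, and `granted` is the set of permissions the user is known to have.

```python
def check_zone_discovery(install_config: dict, granted: set) -> None:
    """Fail early when zones are unset and the filtering permission is missing.

    Hypothetical preflight: `install_config` is a simplified dict, not the real schema.
    """
    needed = "ec2:DescribeInstanceTypeOfferings"
    zones = install_config.get("zones") or []
    if not zones and needed not in granted:
        raise PermissionError(
            f"{needed} is needed when zones are not set in install-config.yaml"
        )
```

When zones are listed explicitly the check passes without the permission, which matches the description that the filtering step only runs for discovered zones.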

Additional info:

    

Other Incomplete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled

IBI and IBU use different labels for the var-lib-containers partition.
This results in failure to mount the partition in case of label mismatch (var-lib-containers vs varlibcontainers).
We should always use `var-lib-containers` as the label

See more details in the slack thread
https://redhat-internal.slack.com/archives/C05JHD9QYTC/p1731542185936629

 

Installer part

https://github.com/openshift/installer/blob/master/pkg/asset/imagebased/image/imagebased_config.go#L33

https://github.com/openshift/installer/blob/master/pkg/types/imagebased/imagebased_config_types.go#L65

 

Lca part (less interesting as config will be generated in installer)

https://github.com/openshift-kni/lifecycle-agent/blob/main/api/ibiconfig/ibiconfig.go#L20

Description of problem:

[vmware-vsphere-csi-driver-operator] driver controller/node/webhook update events repeat pathologically    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-11-03-161006    

How reproducible:

Always    

Steps to Reproduce:

    1. Install an OpenShift cluster on vSphere of version 4.17 nightly.
    2. Upgrade the cluster to 4.18 nightly.
    3. Check that the driver controller/node/webhook update events do not repeat pathologically.

CI failure record -> https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-vsphere-ovn-upgrade/1854191939318976512  

Actual results:

 In step 3: the driver controller/node/webhook update events repeat pathologically   

Expected results:

 In step 3: the driver controller/node/webhook update events should not repeat pathologically    

Additional info: