4.10.37

Changes from 4.9.59

Note: this page shows the Feature-Based Change Log for a release

Complete Features

These features were completed when this image was assembled

Feature Overview
Insights Advisor for OpenShift is integrated within OpenShift Cluster Manager. This limits adding new features and sharing a codebase between RHEL Advisor and the OCM Insights Advisor tab. Insights Advisor for OpenShift lacks certain features from the RHEL UI, and the codebase is not a 1:1 clone.
As a customer of Insights, I will have the same or a very similar user experience with Insights for OpenShift and Insights for RHEL. The workflows will share the main concepts, the UI elements will be the same, and features introduced to Advisor will automatically be considered for all supported platforms.
As an OpenShift user, I will still see integrations of Insights Advisor within OpenShift Cluster Manager that show aggregated information for the customer account and a single-cluster view of Advisor data. These integrations will point to the new Insights Advisor for OpenShift app, which will be tightly integrated into OpenShift Cluster Manager.

  • Note: The application will reuse the codebase but will run as a separate app for OpenShift. There's no intent to merge RHEL and OpenShift workflows into a single app.

Goals

  • Q2CY21: Explore possibility to unify codebase between RHEL Advisor and OCM Insights Advisor tab. Identify architecture misalignments, create UI mockups to merge the two existing UIs.
  • Q3CY21: Integrate OpenShift into the Advisor codebase, stand up the Insights Advisor for OpenShift application, and change the integration in OpenShift Cluster Manager to point at the new app.
  • Q4CY21: Deliver the missing screens of Insights Advisor for OpenShift (Systems and Recommendations views).

Requirements

  • UX overview of UI elements in both UIs - Marie Doruskova
  • Architecture overview/misalignments for both UIs - Jan Zeleny [~fjansen]

Benefits

  • Feature parity between RHEL and OpenShift
  • Adopting new features developed by RHEL Advisor team quicker
  • Smaller maintenance cost

Questions to answer...

  • Possible deviations between OpenShift and RHEL
  • Remediation workflow different between OpenShift and RHEL

Out of Scope

  • Single app that combines RHEL hosts and OpenShift clusters. Goal is still to differentiate between platforms and offer view only for a single platform.
  • Direct/Supervised remediations and integration of remediations with Advanced Cluster Manager (as a Service)

Background, and strategic fit

  • Insights Advisor for OpenShift follows the goal of introducing multiple applications that add value for OpenShift customers under the Insights brand. The current UI and integration of Advisor into OpenShift Cluster Manager don't follow the pattern that other Insights for OpenShift applications can/will follow.

Documentation Considerations

  • OCM documentation is impacted, existing workflows described in OCM documentation will persist. The placement of the application within OCM will be different.

 

The OCP WebConsole has an Insights Advisor widget in the main dashboard, which has been redirecting users to OCM. Because the Insights Advisor tab in OCM is being decommissioned, the links should point to Advisor instead.

4.10 code freeze = 28 January (marking the task as urgent)

Problem Alignment

The Problem

Today, all individual configuration, for example routing configuration, is set through a single configuration file that only admins have access to. If an environment has multiple tenants and each tenant uses, for example, a different system to notify its teams in case of an issue, then someone needs to file a request with an admin to add the required settings.

That can be bothersome for individual teams, since requests like that usually disappear in the backlog of an administrator. At the same time, administrators might get tons of requests that they have to look at and prioritize, which takes them away from more crucial work.

We would like to introduce a more self-service approach where individual teams can create their own configuration for their needs without the administrator's involvement.

Last but not least, since Monitoring is deployed as a core service of OpenShift, there are multiple restrictions that the SRE team has to apply to all OSD and ROSA clusters. One restriction limits customers' ability to use the central Alertmanager, which is owned and managed by the SRE team. Due to security concerns, the SRE team can't give users access to the centrally managed secret, so users cannot add their own routing information.

High-Level Approach

Provide a new API (based on the Operator CRD approach) as part of the Prometheus Operator that allows creating a subset of the Alertmanager configuration without touching the central Alertmanager configuration file.

Please note that we do not plan to support additional individual webhooks with this work. Customers will need to deploy their own version of the third party webhooks.

Goal & Success

  • Allow users to deploy individual configurations that allow setting up Alertmanager for their needs without an administrator.

Solution Alignment

Key Capabilities

  • As an OpenShift administrator, I want to control who can CRUD individual configuration so that I can make sure that no unknown third party can touch the central Alertmanager instance shipped within OpenShift Monitoring.
  • As a team owner, I want to deploy a routing configuration to push notifications for alerts to my system of choice.

Key Flows

Team A wants to send all their important notifications to a specific Slack channel.

  • Administrator gives permission to Team A to allow creating a new configuration CR in their individual namespace.
  • Team A creates a new configuration CR.
  • Team A configures what alerts should go into their Slack channel.

Open Questions & Key Decisions (optional)

  • Do we want to improve anything inside the developer console to allow configuration?
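
To make the Team A flow above more concrete, here is a minimal sketch of the kind of configuration CR a team might create in its own namespace. It assumes the CR is the Prometheus Operator AlertmanagerConfig resource; the field names reflect one reading of that CRD and should be checked against the shipped version. The namespace, channel, secret name, and the severity matcher are all hypothetical values chosen for illustration.

```python
import json


def team_alert_routing(namespace, team, slack_channel, secret_name, secret_key):
    """Build an AlertmanagerConfig-style manifest routing a team's alerts to Slack.

    Assumption: field names follow the Prometheus Operator v1alpha1 CRD.
    """
    receiver = f"{team}-slack"
    return {
        "apiVersion": "monitoring.coreos.com/v1alpha1",
        "kind": "AlertmanagerConfig",
        "metadata": {"name": f"{team}-routing", "namespace": namespace},
        "spec": {
            # Which alerts go to the receiver (hypothetical matcher).
            "route": {
                "receiver": receiver,
                "matchers": [{"name": "severity", "value": "critical"}],
            },
            "receivers": [
                {
                    "name": receiver,
                    "slackConfigs": [
                        {
                            "channel": slack_channel,
                            # The webhook URL lives in a Secret owned by the team,
                            # not in the central Alertmanager secret.
                            "apiURL": {"name": secret_name, "key": secret_key},
                        }
                    ],
                }
            ],
        },
    }


manifest = team_alert_routing("team-a", "team-a", "#team-a-alerts", "slack-webhook", "url")
print(json.dumps(manifest, indent=2))
```

The point of the sketch is that everything the team touches lives in its own namespace, so the administrator only has to grant permission on this one resource type there.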

Feature Overview

This Feature is a general "catch all" for the time being. There are a number of existing priorities from Q1 that should be aligned with existing priorities below but if not, assign to this feature as needed.

Goals

In order to get a better overall portfolio view, we'll leverage this Feature to gather work that doesn't fall into other existing priorities on this board. As this list grows, the portfolio priority grooming team will look to split out or handle appropriately.

Requirements

A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts.  If a non MVP requirement slips, it does not shift the feature.

 

Requirement Notes isMvp?

 

 

(Optional) Use Cases

< How will the user interact with this feature? >

< Which users will use this and when will they use it? >

< Is this feature used as part of current user interface? >

Out of Scope

 

Background, and strategic fit

< What does the person writing code, testing, documenting need to know? >

Assumptions

< Are there assumptions being made regarding prerequisites and dependencies?>

< Are there assumptions about hardware, software or people resources?>

Customer Considerations

< Are there specific customer environments that need to be considered (such as working with existing h/w and software)?>

Documentation Considerations

< What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)? >

<What does success look like?>

< Does this feature have doc impact? Possible values are: New Content, Updates to existing content,  Release Note, or No Doc Impact?>

 <If unsure and no Technical Writer is available, please contact Content Strategy. If yes, complete the following.>

  • <What concepts do customers need to understand to be successful in [action]?>
  • <How do we expect customers will use the feature? For what purpose(s)?>
  • <What reference material might a customer want/need to complete [action]?>
  • <Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available. >
  • <What is the doc impact (New Content, Updates to existing content, or Release Note)?>

Questions

Question Outcome
   

 

Problem:

Console provides supporting UI for operators, which is dynamically enabled when the operator is installed by using feature flags that check for the presence of CRDs. However, operators have their own release cadence, separate from OpenShift, which makes aligning the UI with the API difficult. As new features are released for the operator, the UI falls out of sync with the APIs, and customers must wait until the following OpenShift release to get any new UI.

Goal:

  • Create an extensibility mechanism which allows Red Hat operators to build and package their own UI that extends the console.
  • Make console extensible in areas required to support the needs of contributing plugins.

Why is it important?

  • Allows an operator to maintain their own UI and release at their own cadence.
  • Alleviates the pressure on console to deliver UI features for multiple operators within a release.

Use cases:

  1. Serverless / Pipelines / Helm to contribute resource details pages, import flows, topology visuals etc...

Acceptance criteria:

  1. Red Hat Operator can build their own UI which is deployed alongside the operator and extend the dev-console
    1. The objective is to reach a point where this is possible; however, code will not be moved to a separate repository, nor deployed by an operator.
  2. New extensions for console to allow operators to extend the various areas of console needed in order to provide the proper user experience.
  3. Enable operators to override the static built in support, and supply their own UI

Dependencies (External/Internal):

Design Artifacts:

Console extensions:
https://docs.google.com/document/d/1HW5_cl6cOX5P14PQN-1_8c60o9dMY6HbFDRftH6aTno/edit

Dynamic Plugins:
https://docs.google.com/document/d/19BAFo_8BtMZVvKsU-bE61bZpSydeYONkCMWntMU9NgE/edit

Enhancement proposal:
https://github.com/openshift/enhancements/pull/441

Exploration:

Note:

  • plugin framework covered by another epic
  • out of scope:
    • moving plugins to separate git repository

Description

As a developer, I want to be able to contribute a dynamic plugin extension that overrides the same extension contributed by a static plugin.

Acceptance Criteria

  1. A dynamic plugin contribution should replace the static plugin contribution of the same name
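
The override rule in the acceptance criterion can be sketched as a merge keyed by extension name. This is only an illustration of the rule, not the console's actual implementation (which is not written in Python); `resolve_extensions`, the `name`/`component` fields, and the sample extension names are hypothetical.

```python
def resolve_extensions(static_exts, dynamic_exts):
    """Merge contributions; a dynamic one replaces a static one of the same name."""
    # Start from the static contributions, keyed by name.
    resolved = {ext["name"]: {**ext, "source": "static"} for ext in static_exts}
    # Dynamic contributions overwrite any static entry with the same name.
    for ext in dynamic_exts:
        resolved[ext["name"]] = {**ext, "source": "dynamic"}
    return list(resolved.values())


static_exts = [
    {"name": "helm-details-page", "component": "StaticHelmDetails"},
    {"name": "topology-view", "component": "StaticTopology"},
]
dynamic_exts = [{"name": "helm-details-page", "component": "DynamicHelmDetails"}]

resolved = resolve_extensions(static_exts, dynamic_exts)
for ext in resolved:
    print(ext["name"], ext["source"], ext["component"])
```

Static contributions with no dynamic counterpart are left untouched, so existing built-in support keeps working until an operator ships its own UI.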

Additional Details:

Incomplete Features

When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release

tldr: three basic claims, the rest is explanation and one example

  1. We cannot improve long term maintainability solely by fixing bugs.
  2. Teams should be asked to produce designs for improving maintainability/debugability.
  3. Specific maintenance items (or investigation of maintenance items), should be placed into planning as peer to PM requests and explicitly prioritized against them.

While bugs are an important metric, fixing bugs is different than investing in maintainability and debugability. Investing in fixing bugs will help alleviate immediate problems, but doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation where it gets harder and harder to add features.

One alternative is to ask teams to produce ideas for how they would improve future maintainability and debugability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.

I have a concrete example of one such outcome of focusing on bugs vs quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard-to-diagnose problems across the stack. The alternative is to create a point-to-point network connectivity capability. This would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.

We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.


Relevant links:

Feature Overview

Plugin teams need a mechanism to extend the OCP console that is decoupled enough that they can deliver at the cadence of their projects and not be forced into the OCP Console release timelines.

The OCP Console Dynamic Plugin Framework will enable all our plugin teams to do the following:

  • Extend the Console
  • Deliver UI code with their Operator
  • Work in their own git Repo
  • Deliver at their own cadence

Goals

    • Operators can deliver console plugins separate from the console image and update plugins when the operator updates.
    • The dynamic plugin API is similar to the static plugin API to ease migration.
    • Plugins can use shared console components such as list and details page components.
    • Shared components from core will be part of a well-defined plugin API.
    • Plugins can use Patternfly 4 components.
    • Cluster admins control what plugins are enabled.
    • Misbehaving plugins should not break console.
    • Existing static plugins are not affected and will continue to work as expected.
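
Two of the goals above (cluster admins control what is enabled, and a misbehaving plugin should not break console) can be illustrated with a small loader sketch. It is not the framework's real loading code; `load_plugins` and the plugin names are hypothetical, and a plugin is modelled simply as a callable that returns its extensions.

```python
def load_plugins(available, enabled_names):
    """Load only admin-enabled plugins; a failing plugin is recorded, not raised."""
    loaded, failed = {}, {}
    for name in enabled_names:
        init = available.get(name)
        if init is None:
            failed[name] = "plugin not found"
            continue
        try:
            loaded[name] = init()
        except Exception as err:
            # Isolate the failure so the remaining plugins still load.
            failed[name] = str(err)
    return loaded, failed


def good_plugin():
    return ["nav-item"]


def broken_plugin():
    raise RuntimeError("bad manifest")


available = {"good": good_plugin, "broken": broken_plugin, "unused": good_plugin}
loaded, failed = load_plugins(available, ["good", "broken"])
print("loaded:", loaded)
print("failed:", failed)
```

Plugins that are available but not on the enabled list are never initialized, which matches the note that plugins are not enabled by default.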

Out of Scope

    • Initially we don't plan to make this a public API. The target use is for Red Hat operators. We might reevaluate later when dynamic plugins are more mature.
    • We can't avoid breaking changes in console dependencies such as Patternfly even if we don't break the console plugin API itself. We'll need a way for plugins to declare compatibility.
    • Plugins won't be sandboxed. They will have full JavaScript access to the DOM and network. Plugins won't be enabled by default, however. A cluster admin will need to enable the plugin.
    • This proposal does not cover allowing plugins to contribute backend console endpoints.

 

Requirements

 

Requirement                                             Notes                                                        isMvp?
UI to enable and disable plugins                                                                                     YES
Dynamic Plugin Framework in place                                                                                    YES
Testing Infra up and running                                                                                         YES
Docs and read me for creating and testing Plugins                                                                    YES
CI - MUST be running successfully with test automation  This is a requirement for ALL features.                      YES
Release Technical Enablement                            Provide necessary release enablement details and documents.  YES

 
 Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Currently, webpack tree shakes PatternFly and only includes the components used by console in its vendor bundle. We need to expose all of the core PatternFly components for use in dynamic plugins, which means we have to disable tree shaking for PatternFly. We should expose PatternFly as a separate bundle. This will allow browsers to cache more efficiently, since they only need to load the PF bundle again when we upgrade PatternFly.

Open Questions

What parts of PatternFly do we consider core?

Acceptance Criteria

  • All PatternFly core components are exposed to dynamic plugins
  • PatternFly is exposed as a separate bundle that is not part of the main vendor bundle

cc Christian Vogt Vojtech Szocs Joseph Caiani James Talton

Feature Overview

  • This Section: High-level description of the feature, i.e. an executive summary
  • Note: A Feature is a capability or a well defined set of functionality that delivers business value. Features can include additions or changes to existing functionality. Features can easily span multiple teams, and multiple releases.

 

Goals

  • This Section: Provide a high-level goal statement, giving user context and expected user outcome(s) for this feature

 

Requirements

  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

 

Requirement                                             Notes                                                        isMvp?
CI - MUST be running successfully with test automation  This is a requirement for ALL features.                      YES
Release Technical Enablement                            Provide necessary release enablement details and documents.  YES

 

(Optional) Use Cases

This Section: 

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

 

Questions to answer…

  • ...

 

Out of Scope

 

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

 

Assumptions

  • ...

 

Customer Considerations

  • ...

 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

As a user, I want the ability to run a pod in debug mode.

This should be the equivalent of running:  oc debug pod

Acceptance Criteria for MVP

  • Build on the crash-loop back-off popover from https://github.com/openshift/console/pull/7302 to include a description of what crash-loop back-off is, a link to view logs, a link to view events, and a link to debug (container-name) in terminal. If more than one container is crash-looping, list them individually.
  • Create a debug container page that includes breadcrumbs as well as the terminal to debug. Add an informational alert at the top to make it clear that this is a temporary pod and that closing this page will delete it.
  • Add debug in terminal as an action to the logs toolbar. Only enable the action when the crash-loop back-off status occurs for the selected container. Add a tooltip to explain when the action is disabled.
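
The enable/disable rule for the debug action can be sketched against the pod status. This assumes the crash-loop back-off state is read from `status.containerStatuses[].state.waiting.reason` being `CrashLoopBackOff`, which is how Kubernetes reports it; the helper names and the sample pod are hypothetical and the console itself does not use Python.

```python
def crash_looping_containers(pod):
    """Return names of containers whose waiting reason is CrashLoopBackOff."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        waiting = status.get("state", {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            names.append(status["name"])
    return names


def debug_action_enabled(pod, container_name):
    """The debug-in-terminal action is only enabled for a crash-looping container."""
    return container_name in crash_looping_containers(pod)


pod = {
    "status": {
        "containerStatuses": [
            {"name": "app", "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
            {"name": "sidecar", "state": {"running": {"startedAt": "2022-01-01T00:00:00Z"}}},
        ]
    }
}

print(crash_looping_containers(pod))
print(debug_action_enabled(pod, "sidecar"))
```

Returning a list rather than a single flag covers the case where more than one container is crash-looping and each needs its own debug link.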

Assets
Designs (WIP): https://docs.google.com/document/d/1b2n9Ox4xDNJ6AkVsQkXc5HyG8DXJIzU8tF6IsJCiowo/edit#

OCP/Telco Definition of Done
Feature Template descriptions and documentation.

Feature Overview

  • Connect OpenShift workloads to Google services with Google Workload Identity

Goals

  • Customers want to be able to manage and operate OpenShift on Google Cloud Platform with workload identity, much like they do with AWS + STS or Azure + workload identity.
  • Customers want to be able to manage and operate operators and customer workloads on top of OCP on GCP with workload identity.

Requirements

  • Add support to CCO for the Installation and Upgrade using both UPI and IPI methods with GCP workload identity.
  • Support installs and upgrades for connected and disconnected/restricted environments.
  • Support the use of Operators with GCP workload identity with minimal friction.
  • Support for HyperShift and non-HyperShift clusters.
  • This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

Requirement                                             Notes                                                        isMvp?
CI - MUST be running successfully with test automation  This is a requirement for ALL features.                      YES
Release Technical Enablement                            Provide necessary release enablement details and documents.  YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

Epic Goal

  • Complete the implementation for GCP workload identity, including support and documentation.

Why is this important?

  • Many customers want to follow best security practices for handling credentials.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

We need to ensure the following in the OpenShift operators:

1) Make sure the operator uses v0.0.0-20210218202405-ba52d332ba99 or a later version of the golang.org/x/oauth2 module.

2) Mount the OIDC token in the operator pod; this needs to go in the deployment. We have already done this for cluster-image-registry-operator.

3) For workload identity to work, the GCP credentials that the operator pod uses should be of the external_account type (not service_account). The external_account credential type includes the path to the OIDC token and the URL of the service account to impersonate, along with other details. Credentials of this type can be generated from the GCP console or programmatically (supported by ccoctl). The operator pod can then consume them from a kube secret. Make appropriate code changes to the operators so that they can consume these new credentials.
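
As a rough illustration of item 3, the sketch below checks whether a credentials JSON document looks like an external_account configuration rather than a service_account key. The JSON field names follow Google's documented credential configuration format as best understood here; the audience, service account, and token path in the sample are placeholders, and `check_workload_identity_credentials` is a hypothetical helper, not part of ccoctl or any operator.

```python
import json


def check_workload_identity_credentials(raw_json):
    """Return a list of problems; an empty list means the config looks usable."""
    cfg = json.loads(raw_json)
    problems = []
    if cfg.get("type") != "external_account":
        problems.append(f"type is {cfg.get('type')!r}, expected 'external_account'")
    # Path to the mounted OIDC token (see item 2).
    if not cfg.get("credential_source", {}).get("file"):
        problems.append("credential_source.file (path to the OIDC token) is missing")
    # URL of the service account to impersonate.
    if not cfg.get("service_account_impersonation_url"):
        problems.append("service_account_impersonation_url is missing")
    return problems


# Placeholder values only; none of these identify a real project or account.
sample = json.dumps({
    "type": "external_account",
    "audience": "//iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/POOL/providers/PROVIDER",
    "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
    "token_url": "https://sts.googleapis.com/v1/token",
    "service_account_impersonation_url": "https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/SA_EMAIL:generateAccessToken",
    "credential_source": {"file": "/var/run/secrets/openshift/serviceaccount/token"},
})
legacy = json.dumps({"type": "service_account", "private_key": "REDACTED"})

print(check_workload_identity_credentials(sample))
print(check_workload_identity_credentials(legacy))
```

A check like this could run at operator start-up to fail fast with a readable message when a long-lived service_account key is supplied by mistake.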

 

The following repos need one or more of the above changes

Feature Overview

Enable sharing ConfigMap and Secret across namespaces

Requirements

Requirement                                             Notes  isMvp?
Secrets and ConfigMaps can get shared across namespaces        YES

Questions to answer…

NA

Out of Scope

NA

Background, and strategic fit

Consumption of RHEL entitlements has been a challenge on OCP 4 since it moved to a cluster-based entitlement model, compared to the node-based (RHEL subscription manager) entitlement model. To provide an experience sufficiently similar to OCP 3, the entitlement certificates that are made available on the cluster (OCPBU-93) should be shared across namespaces. This avoids the need for a cluster admin to copy these entitlements into each namespace, which leads to additional operational challenges for updating and refreshing them.

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

Epic Goal

  • Deliver the Projected Resources CSI driver via the OpenShift Payload

Why is this important?

  • Projected resource shares will be a core feature of OpenShift. The share and CSI driver have multiple use cases that are important to users and cluster administrators.
  • The use of projected resources will be critical to distributing Simple Content Access (SCA) certificates to workloads, such as Deployments, DaemonSets, and OpenShift Builds.

Scenarios

As a developer using OpenShift
I want to mount a Simple Content Access certificate into my build
So that I can access RHEL content within a Docker strategy build.

As an application developer or administrator
I want to share credentials across namespaces
So that I don't need to copy credentials to every workspace

Acceptance Criteria

  • OCP conformance suite must ensure that the projected resource CSI driver is installed on every OpenShift deployment.
  • OCP build suite tests that projected resource CSI driver volumes can be added to builds, but only if builds support inline CSI volumes.
  • Release Technical Enablement - Docs and demos on how to create a Projected Resource share and add it as a volume to workloads. A special use case for adding RHEL entitlements to builds should be included.
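
For the enablement material mentioned above, a share and its use as a volume might look roughly like the sketch below. It assumes the share API ends up as the SharedSecret kind named later in these notes; the API group/version, the CSI driver name, and the `sharedSecret` volume attribute are assumptions based on the tech preview and should be verified. The share name, secret name, and namespace are example values.

```python
def shared_secret(name, source_namespace, source_secret):
    """Cluster-scoped share pointing at a Secret in another namespace (assumed schema)."""
    return {
        "apiVersion": "sharedresource.openshift.io/v1alpha1",
        "kind": "SharedSecret",
        "metadata": {"name": name},
        "spec": {"secretRef": {"name": source_secret, "namespace": source_namespace}},
    }


def shared_secret_volume(volume_name, share_name):
    """Inline (ephemeral) CSI volume that mounts the share into a workload."""
    return {
        "name": volume_name,
        "csi": {
            "driver": "csi.sharedresource.openshift.io",  # assumed driver name
            "readOnly": True,
            "volumeAttributes": {"sharedSecret": share_name},
        },
    }


share = shared_secret("rhel-entitlement", "openshift-config-managed", "etc-pki-entitlement")
volume = shared_secret_volume("entitlement", share["metadata"]["name"])
print(share)
print(volume)
```

The workload references the share by name only, so the underlying Secret is never copied into the consuming namespace.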

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

As a cluster admin
I want the cluster storage operator to install the shared resources CSI driver
So that I can test the shared resources CSI driver on my cluster

Acceptance Criteria

  • Cluster storage operator uses image references to resolve the csi-driver-shared-resource-operator and all images needed to deploy the csi driver.
  • Shared resources CSI driver is installed when the cluster enables the CSIDriverSharedResources feature gate, OR
  • Shared resource CSI driver is installed when the cluster enables the TechPreviewNoUpgrade feature set
  • CI ensures that if the TechPreviewNoUpgrade feature set is enabled on the cluster, the shared resource CSI driver is deployed and functions correctly.
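
The two install conditions above reduce to a simple either/or check. The sketch below only restates that rule; `should_install_shared_resource_driver` is a hypothetical helper, and how the operator actually reads the feature set and gates from the cluster is not shown.

```python
def should_install_shared_resource_driver(feature_set, enabled_gates):
    """Install when the gate is on OR the tech preview feature set is enabled."""
    return (
        feature_set == "TechPreviewNoUpgrade"
        or "CSIDriverSharedResources" in enabled_gates
    )


# Default cluster: no feature set, no extra gates.
print(should_install_shared_resource_driver("", set()))
# Tech preview feature set enabled.
print(should_install_shared_resource_driver("TechPreviewNoUpgrade", set()))
# Only the individual gate enabled.
print(should_install_shared_resource_driver("", {"CSIDriverSharedResources"}))
```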

Docs Impact

Docs will need to identify how to install the shared resources CSI driver (by enabling the tech preview feature set)

Notes

Tasks:

  • Add the Share APIs (SharedSecret, SharedConfigMap) to openshift/api
  • Generate clients in openshift/client-go for Share APIs
  • Update the CSI driver name used in the enum for the ClusterCSIDriver custom resource.
  • Generate custom resource definitions and include them in the deployment YAMLs for the shared resource operator
  • Add YAML deployment manifests for the shared resource operator to the cluster storage operator (include necessary RBAC)
  • Ensure cluster storage operator has permission to create custom resource definitions
  • Enhance the cluster storage operator to install the shared resource CSI driver only when the cluster enables the CSIDriverSharedResources feature gate

Note that to be able to test all of this on any cloud provider, we need STOR-616 to be implemented. We can work around this by making the CSI driver installable on AWS or GCP for testing purposes.

The cluster storage operator has cluster-admin permissions. However, no other CSI driver managed by the operator includes a CRD for its API.

See https://issues.redhat.com/browse/BUILD-159?focusedCommentId=16360509&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16360509

User Story

As an OpenShift engineer
I want to know which clusters are using the Shared Resource CSI Driver
So that I can be proactive in supporting customers who are using this tech preview feature

Acceptance Criteria

  • Key metrics for the shared resource CSI driver are exported to Telemeter via the cluster monitoring operator.

Docs Impact

None - metrics exported to telemetry are not formally documented.

QE Impact

QE can verify that the query/recording rule for cluster monitoring operator returns data if the cluster has the Shared Resource CSI driver installed and utilizes a SharedSecret or SharedConfigMap in a pod/workload.

PX Impact

Insights rules can potentially be created off of these exported metrics. This would allow CEE to identify which clusters are using SharedSecrets or SharedConfigMaps, especially if we are exporting mount failure metrics.

Notes

To implement this, a Prometheus query/recording rule needs to be added to the cluster monitoring operator. Once approved by the monitoring team, the metric data will be available on DataHub once 4.10 clusters are installed with the updated version of the monitoring operator.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Allow CSI volumes to be mounted into a build

Why is this important?

  • CSI volumes allow data to be mounted into containers via ephemeral CSI Volumes
  • Ephemeral CSI volumes are provided by CSI drivers that support this feature. Such drivers include:
  • When using sensitive credentials in a build, accessing secrets as a mounted volume ensures that these credentials are not present in the resulting container image.

Scenarios

  1. Access private artifact repositories (Artifactory, JFrog, Maven)
  2. Download RHEL packages in a build

Acceptance Criteria

  • Builds can mount a CSI volume
  • Content in the CSI volume is not present in the resulting container image.
  • If SCCs do not support fine controls over CSI volumes, provide this feature on a TechPreview basis with a feature gate.

Dependencies (internal and external)

  1. Buildah - support mounting of volumes when building with a Dockerfile
  2. (optional) Auth - use SCCs to control which CSI drivers are allowed to be used with ephemeral CSI volumes.

Previous Work (Optional):

  1. BUILD-257 - Build Resource Volume Mounts

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

As an OpenShift cluster admin
I want to use the TechPreview feature set to try out CSI volumes in builds
So that I can test using CSI volumes in builds through tech preview features

Minimum Acceptance Criteria

  • Openshift-controller-manager-operator reads the cluster feature gate (BuildCSIVolumes).
  • If the feature gate for Build CSI volumes is enabled, the operator enables configuration in openshift-controller-manager to turn on CSI volumes for the build controller.
  • CI testing verifies that we pass the appropriate feature gate to the build controller when tech preview is enabled.

Docs Impact

Product documentation will not be required until BUILD-275 is complete. Documentation for CSI volumes in builds will need to note that the TechPreviewNoUpgrade feature set needs to be enabled on the cluster.

PX Impact

Additional training enablement materials may not be needed - product docs should be sufficient.

QE Impact

Full e2e testing may not be feasible until BUILD-275 is completed.
CI testing should verify that the appropriate configuration values were passed to the build controller.

We will likely need a new CI job that installs the cluster with tech preview enabled before we verify that the BuildCSIVolumes feature gate has been enabled.

Open Questions

  • Should ocm-o mark itself Upgradeable=false if we detect the BuildCSIVolumes feature gate has been enabled?

Notes

OpenShift already has feature gates baked into the core platform via the FeatureGate API object. For this feature, we need to declare a feature gate that is added to the TechPreviewNoUpgrade feature set, which openshift-controller-manager-operator then reads and applies to the build controller.

Feature gate needs to be proposed to openshift/api (add to the TechPreviewNoUpgrade feature set).
An example PR on how to do this: https://github.com/openshift/api/pull/982.
Once approved, the updated tech preview feature set needs to be vendored into openshift/library-go.
Openshift-controller-manager-operator needs to read the feature gate, pass it on to the build controller.
The build controller has its own configuration "API" - this was a relic of the 3.x master configuration that is not exposed to admins in OCP 4.x: https://github.com/openshift/api/blob/master/openshiftcontrolplane/v1/types.go#L198-L207
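
The flow described in these notes can be sketched roughly as follows. This is a minimal illustration in Python rather than the operator's actual Go code. The `FEATURE_SETS` table, the `render_build_controller_config` function, and the `buildCSIVolumesEnabled` key are hypothetical names chosen for the example; the real feature set contents come from openshift/api.

```python
# Hypothetical sketch: how an operator might translate the cluster feature
# gate into build controller configuration. All names are illustrative only.

# Assumed mapping of feature set name -> gates it enables (not the real list).
FEATURE_SETS = {
    "": set(),  # default feature set
    "TechPreviewNoUpgrade": {"BuildCSIVolumes"},
}


def enabled_gates(feature_gate_spec):
    """Return the set of gates enabled by a FeatureGate-like spec dict."""
    feature_set = feature_gate_spec.get("featureSet", "")
    return set(FEATURE_SETS.get(feature_set, set()))


def render_build_controller_config(feature_gate_spec):
    """Produce the (hypothetical) config handed to the build controller."""
    gates = enabled_gates(feature_gate_spec)
    return {"buildCSIVolumesEnabled": "BuildCSIVolumes" in gates}
```

A CI check along the lines described above would then assert on the rendered configuration in both the default and the tech preview case.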

A separate operator checks whether a *NoUpgrade feature set is enabled and, if so, marks the cluster as unable to be upgraded to the next minor OCP release.
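
As a rough illustration of that check, the sketch below derives an Upgradeable-style condition from the feature set name. It assumes that "a *NoUpgrade feature set" means any feature set whose name ends in `NoUpgrade`. The function name and the `reason` strings are hypothetical and are not taken from any operator's code.

```python
def upgradeable_condition(feature_set):
    """Return a hypothetical Upgradeable condition for a feature set name."""
    if feature_set.endswith("NoUpgrade"):
        return {
            "type": "Upgradeable",
            "status": "False",
            "reason": "FeatureSetBlocksUpgrade",  # illustrative reason string
        }
    return {"type": "Upgradeable", "status": "True", "reason": "AsExpected"}
```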

To test this in CI, we need a suite that runs with the TechPreviewNoUpgrade feature set enabled. The step registry has primitives which bring up a cluster with tech preview features enabled. We will need to update ocm-o's CI configuration to run our operator tests with tech preview enabled. Testing for this specific feature will need to have separate logic that verifies we are sending the right configuration to the build controller under normal and TechPreview mode.

 

Existing techpreview CI step registry setups (note the per cloud elements, which make sense, since the existing CSI drivers are per cloud):

./ci-operator/step-registry/ipi/aws/pre/techpreview
./ci-operator/step-registry/ipi/azure/pre/techpreview
./ci-operator/step-registry/ipi/conf/aws/techpreview
./ci-operator/step-registry/ipi/conf/azure/techpreview
./ci-operator/step-registry/ipi/conf/techpreview
./ci-operator/step-registry/ipi/conf/openstack/techpreview
./ci-operator/step-registry/ipi/openstack/pre/techpreview
./ci-operator/step-registry/openshift/e2e/aws/techpreview
./ci-operator/step-registry/openshift/e2e/gcp/techpreview
./ci-operator/step-registry/openshift/e2e/azure/techpreview
./ci-operator/step-registry/openshift/e2e/openstack/techpreview
./ci-operator/step-registry/openshift/e2e/vsphere/techpreview

 

Given shared resources span all clouds, etc. does that mean we touch each of these, or create a new one, or both?

Feature Overview (aka. Goal Summary)  

Upstream Kubernetes is following other SIGs by moving its in-tree cloud providers to an out-of-tree plugin format, Cloud Controller Manager, at some point in a future Kubernetes release. OpenShift needs to be ready to action this change.

Goals (aka. expected user outcomes)

Bring together all the cloud controller managers (AWS, GCP, Azure), complete testing and prepare for final GA

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

 

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

 

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Prepare the Cluster Cloud Controller Manager Operator (CCCMO) component, introduced in 4.9, for GA

Why is this important?

  • We must ensure that the component is stable before we can declare the product GA

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Initial work was started there: https://github.com/lobziik/cluster-cloud-controller-manager-operator/pull/1/files

Provider-specific code needs to be isolated in respective packages, with an interface introduced to leverage it (regular and bootstrap manifest rendering should be there at the moment).

DoD:

  • Introduce templating logic to replace existing substitution mixture
  • Isolate templating logic so that this is transparent to the core of the CCCMO
  • Improve testing of the substitution
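
To illustrate the idea of a single templating step that is isolated from the operator core and easy to test, here is a minimal Python sketch. The CCCMO itself is written in Go, so this only mirrors the concept. The manifest text, the `${provider}` and `${image}` placeholders, and the `render_manifest` helper are hypothetical.

```python
from string import Template

# Hypothetical manifest template; the placeholders are illustrative only.
DEPLOYMENT_TEMPLATE = Template(
    "kind: Deployment\n"
    "metadata:\n"
    "  name: ${provider}-cloud-controller-manager\n"
    "spec:\n"
    "  template:\n"
    "    spec:\n"
    "      containers:\n"
    "      - image: ${image}\n"
)


def render_manifest(template, values):
    """Render a manifest, failing loudly if a placeholder has no value."""
    try:
        return template.substitute(values)
    except KeyError as missing:
        raise ValueError(f"no value supplied for placeholder {missing}") from None
```

Because rendering is one pure function, a unit test can cover both a successful substitution and a missing value without touching the rest of the operator.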

Complete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.

As an adopter of the @openshift-console/dynamic-plugin-sdk I want to easily integrate into my development pipeline so that I can extend the OCP console.

Trying to pull in the dynamic-plugin-sdk into ACM is proving to be problematic. We would have to move to older dependencies. Integrating with webpack and typescript requires a very specific setup.

The dynamic-plugin-sdk has only really been used internally by OCP and is strongly tied to the setup and dependencies of OCP. For the dynamic-plugin-sdk to be externally consumable by adopters, it should be as easy to use as other webpack plugins such as HtmlWebpackPlugin or CompressionPlugin.

Acceptance Criteria

  • Uses up to date dependencies - not tied to specific versions OCP console uses
  • Includes its own dependencies - does not require adopters to include those dependencies
  • The dynamic demo plugin should be updated to use newer dependencies and use the plugin without a bunch of tweaks to tsconfig paths. 

Currently

  • requires old dependencies 
    • ts-node 5.0.1 → 10.2.1

 

Update console from Cypress 6.0.0 to 8.5.0. Changes that impact us:

  • cypress run is headless by default
  • cy.intercept URL matching is more strict
  • Uncaught exception and unhandled promise rejection checks are more strict

https://docs.cypress.io/guides/references/migration-guide#Migrating-to-Cypress-8-0

The console has many instances of old variables, $grid-float-breakpoint and $grid-gutter-width, controlling margins/padding and responsive breakpoints throughout the Admin and Dev Console. These do not provide spacing and behaviors consistent with PatternFly components, which use their own variables: $pf-global-gutter-md, $pf-global-gutter, and $pf-global-breakpoint-{size}. By replacing these, the intent is to bring the console closer to a pure PatternFly structure and behavior, requiring fewer overrides and customizations.

Epic Goal

  • Improve CI testing of the image registry components.

Why is this important?

  • The image registry, image API and the image pruner had a lot of tests removed during the transition to 4.0. This may make the platform less stable and/or slow down the team.

Scenarios

  1. ...

Acceptance Criteria

  • CI - tests should be more stable and have broader coverage

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.

In the image-registry, we have packages origin-common and kubernetes-common. The problem is that this code doesn't get updates. We can replace them with more supported library-go.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As an OpenShift engineer
I want image-registry to use the latest k8s libraries
so that image-registry can benefit from new upstream features.

Acceptance criteria

  • image-registry uses k8s.io/api v1.23.z
  • image-registry uses latest openshift/api, openshift/library-go, openshift/client-go
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

User Story

As a developer using Jenkins to build my application
I want to use the base Jenkins agent image as a sidecar in my PodTemplate
So that I can use any s2i builder image in my Jenkins pipelines

Acceptance Criteria

  • Provide new Kubernetes Plugin Pod Templates which use the sidecar pattern for NodeJS and Maven.
  • Add documentation on how to use the new pod template in a Jenkinsfile (need to specify the container where the build occurs).
  • Add documentation on how developers can provide an inline pod template within a Jenkinsfile. Documentation should have the following formats:
    • New YAML declarative format
    • Deprecated Groovy format
  • Existing pipelines that use the default Kubernetes Plugin Pod Templates do not break.
  • End to end testing (for client or sync plugin) verifies that the new pod templates work.

QE Impact

QE will need to verify that the new pod templates can successfully execute a JenkinsPipeline build.

Docs Impact

Documentation needs to be updated to explain how to use the new template.

PX Impact

Unclear if we need new CEE/PX materials beyond doc updates.

Notes

We currently have built-in pod templates for NodeJS and Maven, which use specialized agent images that include NodeJS or Maven.
Blog post here outlines the process: https://developers.redhat.com/blog/2020/06/04/an-easier-way-to-create-custom-jenkins-containers/

The Groovy style of declaring in-line pod templates is deprecated in favor of a YAML-style format.

Existing documentation for the Jenkins pod templates: https://docs.openshift.com/container-platform/4.9/openshift_images/using_images/images-other-jenkins.html#images-other-jenkins-config-kubernetes_images-other-jenkins

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

Epic Goal

  • As a CFE team, we would like to enable query logging for all Prometheus read paths
  • As part of this, we would like to enable audit & query logging for Prometheus Adapter (aggregated server audit log), Prometheus (query log) and ThanosQuerier (query log)

Why is this important?

  • This would help all parties (customers, app-sres, CCX, monitoring team, ...) to debug an overloaded Prometheus instance.

Scenarios

  1. When a customer faces high CPU consumption in any of the Prometheus instances, they can enable audit logging in Prometheus Adapter to see which component is calling the metrics API
  2. When a customer faces high CPU consumption in any of the Prometheus instances, they can enable query logging in all Prometheus instances (PM & UWM) and ThanosQuerier to see which queries are executed frequently
  3. https://bugzilla.redhat.com/show_bug.cgi?id=1982302
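
For the first scenario, once audit logs are available, finding the component that calls the metrics API most often could look like the sketch below. It assumes the log is a stream of JSON audit events, one per line, with the `requestURI` and `user.username` fields of the Kubernetes audit Event format. The `top_callers` helper and the default `/apis/metrics.k8s.io/` prefix are assumptions made for illustration.

```python
import json
from collections import Counter


def top_callers(audit_lines, uri_prefix="/apis/metrics.k8s.io/"):
    """Count audit events per user for requests under a URI prefix."""
    counts = Counter()
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between events
        event = json.loads(line)
        if not event.get("requestURI", "").startswith(uri_prefix):
            continue
        username = event.get("user", {}).get("username", "<unknown>")
        counts[username] += 1
    # Most frequent callers first.
    return counts.most_common()
```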

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Prometheus Adapter audit logs must be enabled by default
  • Prometheus Adapter audit logs must be preserved after each CI run

Open questions::

  1. Should we enable ThanosRuler query logs?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

After investigating a complex Bugzilla involving many applications making queries to prometheus-adapter, we've noticed that we were lacking insights on the requests made to prometheus-adapter. To have such information for an aggregated API, the best would be to have audit logs for prometheus-adapter. This wasn't configurable before, but with https://github.com/kubernetes-sigs/custom-metrics-apiserver/pull/92, upstream users should now be able to configure it.

Since this would greatly help in investigating prometheus-adapter Bugzilla in the future, it would be great if we allowed OpenShift users to configure the audit logs so that they could provide them to us.

Note for the assignee: as of the time of the creation of this ticket, the upstream PR hasn't been merged in custom-metrics-apiserver and thus wasn't synced in prometheus-adapter. So we will have to wait a bit before starting to look into this ticket.

 

DoD:

  • Allow OpenShift users to configure audit logs for prometheus-adapter
  • Integrate with must-gather
  • Document how to configure audit logs in the official OpenShift documentation
  • Upstream jsonnet patch that enables this feature through a configuration
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

The console needs to know the network type capabilities to show/hide some Network Policy form fields.

As a result of https://issues.redhat.com/browse/NETOBSERV-27, this logic is implemented as a features document inside the console code. The console fetches the network type from the network operator and checks the supported features against this document.

However, this limits the feature to admin users, as other logged-in users do not have permissions to fetch the network type.

This task aims to modify the current Cluster Network Operator to expose the network capabilities as an `sdn-public` Config Map, writeable only by the SDN, readable by any `system:authenticated` user.
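
A consumer of such a Config Map could decide which form fields to show roughly as sketched below. The feature keys, the field names, and both helper functions are hypothetical. The only assumption carried over from Kubernetes is that Config Map data values are strings, so booleans arrive as "true" or "false".

```python
def supported_features(config_map_data):
    """Parse a ConfigMap-style string map into the set of enabled features."""
    return {
        name
        for name, value in config_map_data.items()
        if value.strip().lower() == "true"
    }


def visible_fields(form_fields, config_map_data):
    """Keep fields whose required feature is supported, or that need none.

    form_fields is a list of (field_name, required_feature_or_None) pairs.
    """
    features = supported_features(config_map_data)
    return [
        field
        for field, required in form_fields
        if required is None or required in features
    ]
```

Since the data is readable by any authenticated user, the same check works for non-admin users without fetching the network type from the operator.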

Enhancement Proposal PR: https://github.com/openshift/enhancements/pull/875

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

We want to configure 'default' and 'allowed' values in the validation webhook for the Guest Accelerators field in GCPProviderSpec. Also revendor it to include the newly added Guest Accelerators field.

This can be done after https://github.com/openshift/cluster-api-provider-gcp/pull/172  is merged.

DoD:

  • Make sure that validations return errors on issues with GPU configuration
  • Ensure the unit tests for the webhooks are updated

Description:

OpenShift on RHV is composed of the following subprojects that the team maintains:

Each of those projects currently uses the generated oVirt API project go-ovirt.

This leads to a number of issues:

  1. Duplicated code between the subprojects: Since the go-ovirt is a thin layer around the API then a lot of the code which interacts with oVirt is duplicated between the projects, which leads to all the classic duplication problems such as maintaining the project, lack of clear conventions, and so on.
  2. Bad error handling and unclear errors:
    1. Since go-ovirt is a thin layer, a lot of error handling and checking needs to be done. Because it often looks like a certain error can be ignored, it is never checked, which could lead to unexpected situations.
    2. Since the errors returned from the oVirt Engine are sometimes unclear, when we return those errors to the users or log them it is hard to understand what the actual issue is.
  3. Lack of retries: sometimes an operation takes time because some condition needs to be met, or fails due to infrastructure issues. The go-ovirt library doesn't contain any retry logic, which means each client needs to implement its own; this is not done at the moment and would cause more duplicated code.
  4. Poor logging: The current go-ovirt library doesn't log anything, and all the logs come from the subprojects, this leads to:
    1. Inconsistent logging between the projects.
    2. Lack of logs.
  5. Almost no test coverage:
    1. It's very hard to mock and write tests with go-ovirt since there are so many calls, but it will be much easier to mock and write tests with go-ovirt-client.
    2. go-ovirt only has rudimentary tests.

Then came go-ovirt-client, go-ovirt-client-log, go-ovirt-client-log-klog and k8sOVirtCredentialsMonitor to the rescue!

The go-ovirt-client is a wrapper around the go-ovirt which contains all the error handling/retry logic/logs/tests needed to provide a decent user experience and an easy-to-use API to the oVirt engine.
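
The kind of retry logic that such a wrapper centralizes can be sketched as follows. This is a generic Python illustration, not the go-ovirt-client API. `RetryableError`, `call_with_retries`, and the linear backoff are all assumptions made for the example.

```python
import time


class RetryableError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(operation, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call operation(), retrying on RetryableError with linear backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except RetryableError as err:
            if attempt == attempts:
                # Out of attempts: surface a clear error that keeps the cause.
                raise RuntimeError(
                    f"operation failed after {attempts} attempts: {err}"
                ) from err
            sleep(delay_seconds * attempt)
```

Keeping this in one place means each subproject calls the wrapper instead of writing its own loop, and the injectable `sleep` argument keeps unit tests fast.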

go-ovirt-client-log is a library to unify the logging logic between the projects, it is used by go-ovirt-client and should be used by all the sub-projects.

go-ovirt-client-log-klog is a companion library to go-ovirt-client-log enabling logging via the Kubernetes "klog" facility.

k8sOVirtCredentialsMonitor is a utility for monitoring the oVirt credentials secret, which will automatically update the oVirt credentials if they are changed.

We aim to move all projects which are using the go-ovirt to use go-ovirt-client, go-ovirt-client-log and k8sOVirtCredentialsMonitor instead.

Benefits for the eng:

  • Possible to write unit tests.
  • Easier to maintain since less code duplication - reduce the amount of code.
  • Test coverage exists on the ovirt-client as well.
  • No (or fewer) bugs regarding operations that need retry or polling logic.
  • Solves a number of existing bugs

Benefits for the customers:

  • Clearer error messages and logs.
  • Fewer bugs.

Acceptance criteria:

  1. No sub-project uses go-ovirt directly - at least 90% of the calls to go-ovirt should be migrated to go-ovirt-client.
  2. All sub-projects should use the corresponding go-ovirt-client-log for logging.
  3. The CSI driver and the cluster provider use k8sOVirtCredentialsMonitor.
  4. CI tests are green for all components.

How to test:

  1. QE regression - make sure all flows are still working.
  2. Green CI on all jobs.
  3. Keep an eye out for log messages that might confuse customers.

Description:

  1. Identify all the communication between ovirt-csi-driver and the go-ovirt.
  2. Port all the logic to go-ovirt-client.
  3. Port all calls on ovirt-csi-driver to go-ovirt-client.

Acceptance:

ovirt-csi-driver uses go-ovirt-client for 95% of all oVirt-related logic.

T-shirt size: M

Goal:

Provide an easy and successful experience for front end developers to build and deploy their applications

Why is it important?

Currently, the front end dev experience is not positive. It's much easier for them to use other platforms. Improving the front end dev experience will enable us to gain more marketshare

Use cases:

  1. Need to be able to override the npm command when using Node Builder Image
  2. Need to expose target port
  3. Need access to the URL to access my application

Although we provide the ability for 2 & 3 today, the current journey does not match with the mental model of the front end developer

Acceptance criteria:

  1. When importing an app, I should be able to easily provide the npm build and run commands
  2. When opting in to create a route, the target port should be exposed without having to open any Advanced Options
  3. After importing my app, if a route is exposed, I should be able to access/copy that URL

Dependencies (External/Internal):

Design Artifacts:

Desired UX experience

  • enable user to provide the *Build Command* when Node Builder image is being used
  • enable user to provide the *Run Command* when Node Builder image is being used
  • expose the Target Port under *Create a route to the Application* rather than inside Show advanced Routing options
  • NEED TO FINALIZE HOW TO PROVIDE THE ROUTE TO EASILY COPY – Inline Notification maybe? As well as side panel?

Note:

Description

As a user, I want to have the option to add additional labels to a Route, as I could do in OCP3. See RFE-622

The additional labels should only be added to the route, not the service or other components. The advanced option "Labels" should not be touched; those labels are still added to all components.

As a small addition, we should also always show the "Target port", since it also defines the Service port. To make this clearer, the "Target port" should be shown before the "Create a route to the Application" checkbox.

Acceptance Criteria

The following changes should be applied to the Import flow (from Git, from Container, ...) and to the Edit page as well:

  1. Move the option "Target port" before the checkbox "Create a route to the Application" and do not hide the "Target port" when the checkbox is disabled
  2. Add a new "Additional route labels" option, with a label input field to the "Advanced Routing options"
  3. Save (Import) and update (Edit) the labels to the Route resource. When editing a Deployment with a Route the route labels should not show the shared labels.
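
The labeling rule in these criteria can be sketched as two small functions: one that computes labels per resource on save, and one that recovers the route-only labels on edit. This is an illustrative Python sketch, not the console's TypeScript code, and both function names are hypothetical.

```python
def build_labels(shared_labels, route_labels):
    """Return labels per resource; route-only labels reach only the Route."""
    base = dict(shared_labels)
    return {
        "Deployment": dict(base),
        "Service": dict(base),
        "Route": {**base, **route_labels},
    }


def editable_route_labels(route_resource_labels, shared_labels):
    """Labels to show under 'Additional route labels' when editing.

    Shared labels are filtered out so only route-specific ones remain.
    """
    return {
        key: value
        for key, value in route_resource_labels.items()
        if shared_labels.get(key) != value
    }
```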

Additional Details:

Problem:

This epic is mainly focused on the 4.10 Release QE activities

Goal:

1. Identify the scenarios for automation
2. Segregate the test scenarios into smoke, regression and other user stories
a. Update the https://docs.jboss.org/display/ODC/Automation+Status+Report
3. Align with layered operator teams for updating scripts
4. Work closely with dev team for epic automation
5. Create the automation scripts using cypress
6. Implement CI for nightly builds
7. Execute scripts on sprint basis

Why is it important?

To track the QE progress in one place, on the 4.10 Release Confluence page

Use cases:

  1. <case>

Acceptance criteria:

  1. <criteria>

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

There are several places in the code that map the old action items "From Git", "From Dockerfile" and "From Devfile" to the new action "Import from Git".

We should avoid mapping different strings to the new version and instead update our tests so that the feature and page object files match the latest frontend code.

Code areas I found are marked with

      // TODO (ODC-6455): Tests should use latest UI labels like "Import from Git" instead of mapping strings

Incomplete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled

Epic Goal

  • Port all remaining Protractor tests to Cypress

Why is this important?

  • Protractor is very hard to debug when tests fail/flake
  • Once all protractor tests are ported we can remove all Protractor dependencies, scripts, and configuration files.
  • Cypress has better debugging, plug-ins, and reporting tools

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Please read: migrating-protractor-tests-to-cypress

Protractor test to migrate:  `frontend/integration-tests/tests/oauth.scenario.ts`
Large but straightforward

47) OAuth

   48) BasicAuth IDP
      ✔ creates a Basic Authentication IDP
      ✔ shows the BasicAuth IDP on the OAuth settings page

   49) GitHub IDP
      ✔ creates a GitHub IDP
      ✔ shows the GitHub IDP on the OAuth settings page

   50) GitLab IDP
      ✔ creates a GitLab IDP
      ✔ shows the GitLab IDP on the OAuth settings page

   51) Google IDP
      ✔ creates a Google IDP
      ✔ shows the Google IDP on the OAuth settings page

   52) Keystone IDP
      ✔ creates a Keystone IDP
      ✔ shows the Keystone IDP on the OAuth settings page

   53) LDAP IDP
      ✔ creates a LDAP IDP
      ✔ shows the LDAP IDP on the OAuth settings page

   54) OpenID IDP
      ✔ creates a OpenID IDP
      ✔ shows the OpenID IDP on the OAuth settings page

Acceptance Criteria

  • Protractor test ported to cypress
  • Remove any unused legacy `data-test-id`s
  • Protractor test deleted, and no longer referenced in `frontend/integration-tests/protractor.conf.ts`
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

 

Background

As a follow up to OCPCLOUD-693, we need to, once all of the API definitions are present in openshift/api, migrate the existing code bases to use the new API locations.

 

This will include:

  • Machine API Operator
  • Cluster Machine Approver
  • Cluster API Provider AWS|Azure|GCP|IBM|Alibaba|OpenStack|Kubevirt
  • Cluster API actuator pkg
  • Installer
  • WMCO
  • MCO
  • Hive
  • Grep OpenShift for other references to our old APIs

Steps

  • Replace the Machine API imports with the new openshift/API MAPI locations

Stakeholders

  • Cluster Infra
  • Owners of the repos listed above

Definition of Done

  • The openshift/API definitions are used across components in the MAPI ecosystem
  • Docs
  • Generated docs for API types should now come from openshift/API
  • Testing
  • Regular regression testing should be sufficient, this is a copy paste for the most part and we expect the code won't compile if we break this

Problem:

Complete all the 4.9 epic feature automation user stories and merge them to the master branch.

Goal:

4.9 epics automation completion

Why is it important?

Tech debt should be completed

Use cases:

  1. <case>

Acceptance criteria:

Create the PRs for the 4.9 epic user story automation
Review them
Merge them to the 4.10 master branch and the 4.9 master branch

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Description

As a user, I want to store my delivery pipelines in a Git repository as the source of truth and execute the pipeline on OpenShift on Git events, so that I can version and trace changes to the delivery pipelines in Git.

Use Cases

  • Developer can see the list of Git repositories that are added to the namespace for pipeline-as-code execution
  • Developer can navigate from the Console to the Git repository on the Git provider
  • For each Git repository, developer can see the details of the last pipeline execution and the commit id that triggered it, with the possibility of navigating to the Git commit in the Git provider
  • Developer can see the list of pipelinerun executions related to a Git repository in a chronological order and the commit id that triggered each

Acceptance Criteria

  1. As a user, looking at the Pipelines page in the Developer Console, I should be able to see a list of (a) Git repositories that are added to the namespace for PAC execution AND (b) all pipelines in the namespace
  2. As a user, I should be able to navigate to a details page of the git repo.
    1. This details page should provide access to (a) details of the git repo and (b) a list of pipeline runs.
    2. This PLR tab should show more information than the typical PLR List view, including SHA (commit id), commit message, branch & trigger type
  3. As a user, when looking at a Pipeline Run Details page that is associated with a git repo (PAC), the page should:
    1. Indicate that it's from a specific git repo rather than a PL resource
    2. Include the SHA (commit id), commit message, branch & trigger type

Other Complete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled

Description of problem:

To avoid any potential bugs, the oVirt CSI driver should use the latest go-ovirt-client, preferably the tagged 1.0.0 version.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

The CMO e2e tests create a bunch of resources. These should be cleaned up on a successful run. However:

  • Some test failures leave the created resources behind, which have to be cleaned up before a re-run.
  • There have been developer reports that even successful runs don't tidy up everything.

In a CI context this is rarely a problem. Running the tests locally, however, can become quite awkward, especially with repeated runs on the same cluster.

We should tag all resources created by the e2e tests with a label (app.kubernetes.io/created-by: cmo-e2e-test).
This will allow easy cleanup by deleting all resources with that label and will allow for checking proper clean-up.

DoD:
All e2e resources get properly tagged.
It is straightforward to ensure that future code changes don't skip adding this tag.
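
A minimal sketch of how the second DoD item could be checked, assuming the test resources are available as plain dictionaries shaped like Kubernetes manifests. The helper name `resources_missing_label` and the sample objects are hypothetical; only the label key and value come from the proposal above.

```python
# Label proposed above for every resource created by the CMO e2e tests.
E2E_LABEL_KEY = "app.kubernetes.io/created-by"
E2E_LABEL_VALUE = "cmo-e2e-test"


def resources_missing_label(resources):
    """Return "Kind/name" for each manifest that lacks the e2e label.

    `resources` is a list of dicts shaped like Kubernetes manifests.
    This helper is hypothetical and is not part of the CMO code base.
    """
    missing = []
    for res in resources:
        metadata = res.get("metadata") or {}
        labels = metadata.get("labels") or {}
        if labels.get(E2E_LABEL_KEY) != E2E_LABEL_VALUE:
            missing.append(f"{res.get('kind', '?')}/{metadata.get('name', '?')}")
    return missing


# Illustrative objects only; the names are made up for this example.
sample = [
    {"kind": "ConfigMap",
     "metadata": {"name": "tagged", "labels": {E2E_LABEL_KEY: E2E_LABEL_VALUE}}},
    {"kind": "Secret", "metadata": {"name": "untagged"}},
]
print(resources_missing_label(sample))
```

A check of this kind could run at the end of a test run. Cleanup itself would then rely on a label selector such as `app.kubernetes.io/created-by=cmo-e2e-test`.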

Description of problem:

See https://bugzilla.redhat.com/show_bug.cgi?id=2104275

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-709. The following is the description of the original issue:

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

Catalog affected : icr.io/cpopen/datapower-operator-catalog:1.6.2

opm render icr.io/cpopen/datapower-operator-catalog:1.6.2 -o yaml > catalog.yaml

 yq 'select(.schema == "olm.channel") | select(.name=="v1.6")' catalog.yaml
entries:
  - name: datapower-operator.v1.6.0
    skipRange: '>=1.0.0 <1.6.0'
  - name: datapower-operator.v1.6.1
    replaces: datapower-operator.v1.6.0
    skipRange: '>=1.0.0 <1.6.1'
  - name: datapower-operator.v1.6.2
    replaces: datapower-operator.v1.6.1
    skipRange: '>=1.0.0 <1.6.2'
name: v1.6
package: datapower-operator
schema: olm.channel

This worked fine until the 4.10 resolver changes. Using both replaces and skipRange also appears to be supported, as explained here:
https://v0-18-z.olm.operatorframework.io/docs/concepts/olm-architecture/operator-catalog/creating-an-update-graph/#skiprange

How reproducible:
Use the following subscription, install 1.6.0, and then upgrade to 1.6.2.

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/datapower-operator.openshift-operators: ""
  name: datapower-operator
  namespace: openshift-operators
spec:
  channel: v1.6
  installPlanApproval: Manual
  name: datapower-operator
  source: datapower
  sourceNamespace: openshift-marketplace
  startingCSV: datapower-operator.v1.6.0

Error in the subscription YAML:

  conditions:
  - lastTransitionTime: "2022-09-09T13:42:08Z"
    message: all available catalogsources are healthy
    reason: AllCatalogSourcesHealthy
    status: "False"
    type: CatalogSourcesUnhealthy
  - message: 'a unique replacement chain within a channel is required to determine
      the relative order between channel entries, but 2 replacement chains were found
      in channel "v1.6" of package "datapower-operator": datapower-operator.v1.6.2...datapower-operator.v1.6.0,
      datapower-operator.v1.6.1...datapower-operator.v1.6.0'
    reason: ErrorPreventedResolution
    status: "True"
    type: ResolutionFailed

Logs

I0909 13:43:51.492784       1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"", Name:"openshift-operators", UID:"aacda32d-748f-4408-88df-f895e74a23fe", APIVersion:"v1", ResourceVersion:"1260", FieldPath:""}): type: 'Warning' reason: 'ResolutionFailed' a unique replacement chain within a channel is required to determine the relative order between channel entries, but 2 replacement chains were found in channel "v1.6" of package "datapower-operator": datapower-operator.v1.6.2...datapower-operator.v1.6.0, datapower-operator.v1.6.1...datapower-operator.v1.6.0
E0909 13:43:52.095288       1 queueinformer_operator.go:290] sync "openshift-operators" failed: a unique replacement chain within a channel is required to determine the relative order between channel entries, but 2 replacement chains were found in channel "v1.6" of package "datapower-operator": datapower-operator.v1.6.2...datapower-operator.v1.6.0, datapower-operator.v1.6.1...datapower-operator.v1.6.0

Expected results:

If this upgrade strategy, which worked fine before, is still valid, this error should not occur. According to the affected catalog maintainer, this also seems to affect the reconciliation of other resources.
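
The following sketch does not reproduce the OLM resolver. It only lists, for the channel entries quoted above, which other entries fall inside each entry's skipRange, so the overlap between the replaces edges and the skipRange values is easy to see. The helper names and the simplified range parser are assumptions made for this illustration.

```python
import re


def parse_version(text):
    """Turn '1.6.1' into (1, 6, 1) so versions compare as tuples."""
    return tuple(int(part) for part in text.split("."))


def in_skip_range(version, skip_range):
    """Check a version against a range such as '>=1.0.0 <1.6.2'.

    Only the space-separated '>=', '>', '<=', '<' comparators used in
    the channel above are handled; this is not a full semver parser.
    """
    v = parse_version(version)
    for comparator in skip_range.split():
        op, bound = re.fullmatch(r"(>=|<=|>|<)(\d+\.\d+\.\d+)", comparator).groups()
        b = parse_version(bound)
        ok = {">=": v >= b, "<=": v <= b, ">": v > b, "<": v < b}[op]
        if not ok:
            return False
    return True


# Channel entries copied from the rendered catalog above.
entries = [
    {"version": "1.6.0", "replaces": None, "skipRange": ">=1.0.0 <1.6.0"},
    {"version": "1.6.1", "replaces": "1.6.0", "skipRange": ">=1.0.0 <1.6.1"},
    {"version": "1.6.2", "replaces": "1.6.1", "skipRange": ">=1.0.0 <1.6.2"},
]

for entry in entries:
    skipped = [e["version"] for e in entries
               if in_skip_range(e["version"], entry["skipRange"])]
    print(entry["version"], "replaces", entry["replaces"], "and skips", skipped)
```

In this model, 1.6.2 replaces 1.6.1 and also skips both 1.6.1 and 1.6.0, while 1.6.1 replaces and skips 1.6.0. Whether this overlap is what leads the resolver to report two chains is not established by this report.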

The console-operator codebase contains many inline manifests. Instead, we should put those manifests into a `/bindata` folder, from which they will be read and then updated as needed.

The current integration of prometheus-adapter in OpenShift uses the platform Prometheus as a backend to get metrics. The problem with this design is that we are getting metrics from 2 different Prometheus instances which don't have replicated data, so two queries sent at the same time to prometheus-adapter might yield different results since the underlying promQL queries executed by prometheus-adapter might be on different Prometheus servers. The consequence is that we end up having inconsistent data across multiple autoscaling requests.

This can be easily tested by running:

$ while true ; do date; oc adm top pod -n openshift-monitoring  prometheus-k8s-0 ; echo; sleep 1 ;done 

Mon Jul 26 03:55:07 EDT 2021
NAME               CPU(cores)   MEMORY(bytes)   
prometheus-k8s-0   208m         4879Mi          

Mon Jul 26 03:55:08 EDT 2021                               
NAME               CPU(cores)   MEMORY(bytes)   
prometheus-k8s-0   246m         4877Mi          

Mon Jul 26 03:55:09 EDT 2021                               
NAME               CPU(cores)   MEMORY(bytes)   
prometheus-k8s-0   208m         4879Mi          

Mon Jul 26 03:55:10 EDT 2021
NAME               CPU(cores)   MEMORY(bytes)   
prometheus-k8s-0   246m         4877Mi          

This isn't a bug in itself since it was designed that way, but we could do better by using thanos-querier as a backend instead of the platform Prometheus, because it will deduplicate the metrics from both instances and serve one consistent result based on the data it gets from the Prometheus instances.

DoD:

  • Use thanos-querier as a backend for prometheus-adapter
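
A toy model of the inconsistency described above, assuming two replicas whose latest samples differ and a backend that answers queries from each replica in turn. The replica names, timestamps, and the "newest sample wins" merge rule are assumptions for illustration; this is not how Thanos merges series internally.

```python
from itertools import cycle

# Toy model: two Prometheus replicas that scraped the same target at
# slightly different moments, so their latest samples differ.
# The values mirror the alternating CPU readings shown above.
replicas = {
    "prometheus-k8s-0": {"timestamp": 100, "cpu_millicores": 208},
    "prometheus-k8s-1": {"timestamp": 101, "cpu_millicores": 246},
}


def query_load_balanced(backends):
    """Answer each query from the next replica in turn (round robin)."""
    for name in cycle(sorted(backends)):
        yield backends[name]["cpu_millicores"]


def query_merged(backends):
    """Merge both replicas and always return the newest sample.

    This is a simplification for illustration, not Thanos's algorithm.
    """
    newest = max(backends.values(), key=lambda s: s["timestamp"])
    return newest["cpu_millicores"]


balanced = query_load_balanced(replicas)
print([next(balanced) for _ in range(4)])
print([query_merged(replicas) for _ in range(4)])
```

The first list alternates between the two readings, as in the `oc adm top pod` loop above, while the merged view returns the same value on every call.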

Description of problem:

When a DNS hostname is queried from a pod on a given node, the response comes from a random CoreDNS pod rather than preferring the local one. Is this the expected result?

# In OCP v4.8.13 case
// Ran the dig command on the node that is running the test-7cc4488d48-tqc4m pod shown below.
sh-4.4# while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done
:
07:16:33 :172.217.175.238
07:16:34 :172.217.175.238 <--- Refreshed the upstream result
07:16:36 :142.250.207.46
07:16:37 :142.250.207.46

// The dig results in the pod match those on the node shown above.
$ oc rsh  test-7cc4488d48-tqc4m bash -c 'while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done'
:
07:16:35 :172.217.175.238 
07:16:36 :172.217.175.238 <--- At the same time, the pod dig result is also refreshed.
07:16:37 :142.250.207.46
07:16:38 :142.250.207.46


But in the v4.10 case, in contrast, the DNS query results vary and are returned randomly, regardless of the local DNS results on the node, as follows.

# In the OCP v4.10.23 case, the pod's responses from the DNS service are not consistent.
$ oc rsh test-848fcf8ddb-zrcbx  bash -c 'while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done'
07:23:00 :142.250.199.110
07:23:01 :142.250.207.46
07:23:02 :142.250.207.46
07:23:03 :142.250.199.110
07:23:04 :142.250.199.110
07:23:05 :172.217.161.78

# Even though the node that is running the pod keeps responding with the same IP...
sh-4.4# while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done
07:23:00 :172.217.161.78
07:23:01 :172.217.161.78
07:23:02 :172.217.161.78
07:23:03 :172.217.161.78
07:23:04 :172.217.161.78
07:23:05 :172.217.161.78

Version-Release number of selected component (if applicable):

v4.10.23 (ROSA)
SDN: OpenShiftSDN

How reproducible:

You can always reproduce this issue by running "dig google.com" from any pod and from the node the pod is running on, as detailed in the "Description" above.

Steps to Reproduce:

1. Run any usual pod, and check which node the pod is running on.
2. Run dig google.com on the pod and the node.
3. Check whether the IP returned in the pod is consistent with the one returned on the node.

Actual results:

The response IPs are not consistent; a random IP is returned.

Expected results:

The response IP is consistent and reflects a preference for the local DNS pod.

Additional info:

This issue affects EgressNetworkPolicy dnsName feature.

Description of problem:
On an OSD cluster, the cluster admin is not allowed to update ClusterVersion details; however, the console renders an editable YAML editor.

Version-Release number of selected component (if applicable):
4.10.18

How reproducible:
Always

Steps to Reproduce:
1. Navigate to the ClusterVersion page /k8s/cluster/config.openshift.io~v1~ClusterVersion/version and click on the YAML tab
2. The cluster-admin is able to make changes in the YAML editor; however, saving the changes reports
An error occurred
admission webhook "regular-user-validation.managed.openshift.io" denied the request: Prevented from accessing Red Hat managed resources. This is in an effort to prevent harmful actions that may cause unintended consequences or affect the stability of the cluster. If you have any questions about this, please reach out to Red Hat support at https://access.redhat.com/support

Actual results:
2. The cluster admin user is able to edit but is not allowed to save the changes

Expected results:

 

ISSUE 2:

 

Steps to Reproduce:
1. On the OSD console, the cluster admin user adds an IDP from "Administration" > "Cluster Settings" > "Configuration" > "OAuth".
2.
3.

Actual results:
1. The IDP could be added successfully.

Expected results:
1. Adding an IDP from the OSD console should be disabled.

 

Created from:

+++ This bug was initially created as a clone of OCPBUGSM-46761 +++

Description of problem:
When using the admin console, under "Cluster Settings" and choosing Upstream Configuration, the "window" that appears has a dead link to documentation for how to create a local (disconnected) update server.

The link in the window is
https://access.redhat.com/documentation/en-us/openshift_container_platform/4.10/html/updating_clusters/installing-update-service

Assume the right link should be something like:
https://docs.openshift.com/container-platform/4.10/updating/updating-restricted-network-cluster.html#update-restricted-network-cluster-update-service

Version-Release number of selected component (if applicable):

4.10.*

— Additional comment from plarsen@redhat.com on 2022-06-17 16:33:53 UTC —

Created attachment 1890947
Screenshot showing the "window" with the 404 link

— Additional comment from rhamilto@redhat.com on 2022-06-28 19:56:17 UTC —

Thank you, Peter, for wonderfully documenting the bug!

This is a clone of issue OCPBUGS-1523. The following is the description of the original issue:

Description of problem:
In a completely disconnected cluster, the dev catalog takes too long to load

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Use a completely disconnected cluster
2. On the Add page, go to the All services page
3.

Actual results:
Taking too much time to load

Expected results:
Time taken should be reduced

Additional info:
Attached a gif for reference

+++ This bug was initially created as a clone of Bug #2070318 +++

Description of problem:
In an OCP VRRP deployment (using OCP cluster networking), each control node has an additional data interface configured alongside the regular management interface. In some deployments, the Kubernetes service address 172.30.0.1:443 is NAT'ed to the data interface instead of the management interface (10.40.1.4:6443 vs. 10.30.1.4:6443, as we configure the bootstrap node), even though the default route is set to the 10.30.1.0 network. Because of that, all requests to 172.30.0.1:443 fail. After 10-15 minutes, OCP corrects this on its own and NATs correctly to 10.30.1.4:6443.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1. Provision an OCP cluster using cluster networking for DNS and load balancing instead of an external DNS and load balancer. Provision the host with one management interface and an additional interface for the data network. Along with the OCP manifests, add a manifest to create a pod that triggers communication with kube-apiserver.

2. Start the cluster installation.

3. While the first two master nodes are installing, check the custom pod's log in the cluster to see GET operations to kube-apiserver timing out. Check the nft table and follow the IP chains to see that the Kubernetes service IP address was NAT'ed to the data IP address instead of the management IP. This does not happen every time; we have seen it in roughly half of the attempts.

Actual results:
After 10-15 minutes OCP will correct that by itself.

Expected results:
Wrong natting should not happen.

Additional info:
ClusterID: 24bbde0b-79b3-4ae6-afc5-cb694fa48895
ClusterVersion: Stable at "4.8.29"
ClusterOperators:
clusteroperator/authentication is not available (OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ocp-binhle-wqepch.contrail.juniper.net/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) because OAuthServerRouteEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.ocp-binhle-wqepch.contrail.juniper.net/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
clusteroperator/baremetal is degraded because metal3 deployment inaccessible
clusteroperator/console is not available (RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health): Get "https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) because RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health): Get "https://console-openshift-console.apps.ocp-binhle-wqepch.contrail.juniper.net/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
clusteroperator/dns is progressing: DNS "default" reports Progressing=True: "Have 4 available DNS pods, want 5."
clusteroperator/ingress is degraded because The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
clusteroperator/insights is degraded because Unable to report: unable to build request to connect to Insights server: Post "https://cloud.redhat.com/api/ingress/v1/upload": dial tcp: lookup cloud.redhat.com on 172.30.0.10:53: read udp 10.128.0.26:53697->172.30.0.10:53: i/o timeout
clusteroperator/network is progressing: DaemonSet "openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)

— Additional comment from bnemec@redhat.com on 2022-03-30 20:00:25 UTC —

This is not managed by runtimecfg, but in order to route the bug correctly I need to know which CNI plugin you're using - OpenShiftSDN or OVNKubernetes. Thanks.

— Additional comment from lpbinh@gmail.com on 2022-03-31 08:09:11 UTC —

Hi Ben,

We were deploying Contrail CNI with OCP. However, this issue happens at very early deployment time, right after the bootstrap node is started and there's no SDN/CNI there yet.

— Additional comment from bnemec@redhat.com on 2022-03-31 15:26:23 UTC —

Okay, I'm just going to send this to the SDN team then. They'll be able to provide more useful input than I can.

— Additional comment from trozet@redhat.com on 2022-04-04 15:22:21 UTC —

Can you please provide the iptables rules causing the DNAT as well as the routes on the host? Might be easiest to get a sosreport during initial bring up during that 10-15 min when the problem occurs.

— Additional comment from lpbinh@gmail.com on 2022-04-05 16:45:13 UTC —

All nodes have two interfaces:

eth0: 10.30.1.0/24
eth1: 10.40.1.0/24

machineNetwork is 10.30.1.0/24
default route points to 10.30.1.1

The kubeapi service ip is 172.30.0.1:443

all Kubernetes services are supposed to be reachable via machineNetwork (10.30.1.0/24)

To make the kubeapi service IP reachable in the host network, something (the OpenShift installer?) creates a set of NAT rules that translate the service IP to the real IP of the nodes that have kubeapi active.

Initially kubeapi is only active on the bootstrap node, so there should be a NAT rule like

172.30.0.1:443 -> 10.30.1.10:6443 (assuming that 10.30.1.10 is the bootstrap node's IP address in the machine network)

However, what we see is
172.30.0.1:443 -> 10.40.1.10:6443 (which is the bootstrap node's eth1 IP address)

The rule is configured on the controller nodes and leads to asymmetric routing: the controller sends a packet from the machineNetwork (10.30.1.x) to 172.30.0.1, which is translated and forwarded to 10.40.1.10. That node then tries to reply on the 10.40.1.0 network, which fails because the request came from the 10.30.1.0 network.

So, we want to understand why the OpenShift installer picks the 10.40.1.x IP address rather than the 10.30.1.x IP for the NAT rule. What is the mechanism for choosing the IP when the system has multiple interfaces with IPs configured?

Note: after a while (10-20 minutes) the bootstrap process resets itself and then it picks the correct ip address from the machineNetwork and things start to work.
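
The expectation in this comment, that the NAT target should be the bootstrap node's address inside the machineNetwork, can be sketched as follows. The function name `pick_machine_network_ip` is hypothetical, and the sketch does not reproduce how the installer or kube-proxy actually choose the endpoint address; it only shows the selection the reporter expected, using the addresses quoted above.

```python
import ipaddress


def pick_machine_network_ip(candidate_ips, machine_network):
    """Return the first candidate address inside machine_network.

    Hypothetical helper for illustration; it does not reproduce how the
    installer or kube-proxy actually choose the endpoint address.
    """
    network = ipaddress.ip_network(machine_network)
    for ip in candidate_ips:
        if ipaddress.ip_address(ip) in network:
            return ip
    return None


# Addresses from the comment above: eth1 is listed first on purpose to
# show that the choice does not depend on interface order.
bootstrap_ips = ["10.40.1.10", "10.30.1.10"]
print(pick_machine_network_ip(bootstrap_ips, "10.30.1.0/24"))
```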

— Additional comment from smerrow@redhat.com on 2022-04-13 13:55:04 UTC —

Note from Juniper regarding requested SOS report:

In reference to https://bugzilla.redhat.com/show_bug.cgi?id=2070318 that @Binh Le has been working on. The mustgather was too big to upload for this Bugzilla. Can you access this link?
https://junipernetworks-my.sharepoint.com/:u:/g/personal/sleigon_juniper_net/ETOrHMqao1tLm10Gmq9rzikB09H5OUwQWZRAuiOvx1nZpQ

  • Making note private to hide partner link

— Additional comment from smerrow@redhat.com on 2022-04-21 12:24:33 UTC —

Can we please get an update on this BZ?

Do let us know if there is any other information needed.

— Additional comment from trozet@redhat.com on 2022-04-21 14:06:00 UTC —

Can you please provide another link to the sosreport? Looks like the link is dead.

— Additional comment from smerrow@redhat.com on 2022-04-21 19:01:39 UTC —

See mustgather here:
https://drive.google.com/file/d/16y9IfLAs7rtO-SMphbYBPgSbR4od5hcQ

— Additional comment from trozet@redhat.com on 2022-04-21 20:57:24 UTC —

Looking at the must-gather I think your iptables rules are most likely coming from the fact that kube-proxy is installed:

[trozet@fedora must-gather.local.288458111102725709]$ omg get pods -n openshift-kube-proxy
NAME                         READY   STATUS    RESTARTS   AGE
openshift-kube-proxy-kmm2p   2/2     Running   0          19h
openshift-kube-proxy-m2dz7   2/2     Running   0          16h
openshift-kube-proxy-s9p9g   2/2     Running   1          19h
openshift-kube-proxy-skrcv   2/2     Running   0          19h
openshift-kube-proxy-z4kjj   2/2     Running   0          19h

I'm not sure why this is installed. Is it intentional? I don't see the configuration in CNO to enable kube-proxy. Anyway the node IP detection is done via:
https://github.com/kubernetes/kubernetes/blob/f173d01c011c3574dea73a6fa3e20b0ab94531bb/cmd/kube-proxy/app/server.go#L844
Which just looks at the IP of the node. During bare metal install a VIP is chosen and used with keepalived for kubelet to have kapi access. I don't think there is any NAT rule for services until CNO comes up. So I suspect what really is happening is your node IP is changing during install, and kube-proxy is getting deployed (either intentionally or unintentionally) and that is causing the behavior you see. The node IP is chosen via the node ip configuration service:
https://github.com/openshift/machine-config-operator/blob/da6494c26c643826f44fbc005f26e0dfd10513ae/templates/common/_base/units/nodeip-configuration.service.yaml
This service determines the node IP by looking at which interfaces have a default route and which of those routes has the lowest metric. With your 2 interfaces, do they both have default routes? If so, are they using DHCP, and is it perhaps random which route gets installed with a lower metric?
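
The selection rule described in this comment can be sketched as below. This models only the rule as stated here (default route present, lowest metric wins); the real nodeip-configuration service may apply further checks. The function name, the interface dictionaries, and the metric values are hypothetical.

```python
def pick_node_ip(interfaces):
    """Pick the address of the interface whose default route has the
    lowest metric, ignoring interfaces without a default route.

    Sketch of the rule as described in the comment above; the real
    nodeip-configuration service may apply further checks.
    """
    with_default = [i for i in interfaces if i["default_route_metric"] is not None]
    if not with_default:
        return None
    return min(with_default, key=lambda i: i["default_route_metric"])["ip"]


# Hypothetical host with two default routes; the metric values are made up.
interfaces = [
    {"name": "eth0", "ip": "10.30.1.7", "default_route_metric": 100},
    {"name": "eth1", "ip": "10.40.1.7", "default_route_metric": 200},
]
print(pick_node_ip(interfaces))
```

Under this rule, if the two metrics were swapped (for example because DHCP installed the routes in a different order), the eth1 address would be chosen instead, which is the scenario the question above is probing.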

— Additional comment from trozet@redhat.com on 2022-04-21 21:13:15 UTC —

Correction: looks like standalone kube-proxy is installed by default when the provider is not SDN, OVN, or kuryr so this looks like the correct default behavior for kube-proxy to be deployed.

— Additional comment from lpbinh@gmail.com on 2022-04-25 04:05:14 UTC —

Hi Tim,

You are right, kube-proxy is deployed by default and we don't change that behavior.

There is only one default route, configured for the management interface (10.30.1.x). We used to have a default route for the data/VRRP interface (10.40.1.x) with a higher metric. As said, we no longer have a default route for the second interface but still encounter the issue quite often.

— Additional comment from trozet@redhat.com on 2022-04-25 14:24:05 UTC —

Binh, can you please provide a sosreport for one of the nodes that shows this behavior? Then we can try to figure out what is going on with the interfaces and the node ip service. Thanks.

— Additional comment from trozet@redhat.com on 2022-04-25 16:12:04 UTC —

Actually Ben reminded me that the invalid endpoint is actually the bootstrap node itself:
172.30.0.1:443 -> 10.30.1.10:6443 (assuming that 10.30.1.10 is the bootstrap node's IP address in the machine network)

vs
172.30.0.1:443 -> 10.40.1.10:6443 (which is the bootstrap node's eth1 IP address)

So maybe a sosreport off that node is necessary? I'm not as familiar with the bare metal install process, so I'm moving this back to Ben.

— Additional comment from lpbinh@gmail.com on 2022-04-26 08:33:45 UTC —

Created attachment 1875023 [details] sosreport

— Additional comment from lpbinh@gmail.com on 2022-04-26 08:34:59 UTC —

Created attachment 1875024 [details] sosreport-part2

Hi Tim,

We observe this issue when deploying clusters using OpenStack instances as our infrastructure is based on OpenStack.

I followed the steps here to collect the sosreport:
https://docs.openshift.com/container-platform/4.8/support/gathering-cluster-data.html
The resulting sosreport is 22MB, which exceeds the permitted size (19MB), so I split it into 2 files (xaa and xab). If you can't join them, we will need to put the collected sosreport on a shared drive, as we did with the must-gather data.

Here are some notes about the cluster:

First two control nodes are below, ocp-binhle-8dvald-ctrl-3 is the bootstrap node.

[core@ocp-binhle-8dvald-ctrl-2 ~]$ oc get node
NAME                       STATUS   ROLES    AGE   VERSION
ocp-binhle-8dvald-ctrl-1   Ready    master   14m   v1.21.8+ed4d8fd
ocp-binhle-8dvald-ctrl-2   Ready    master   22m   v1.21.8+ed4d8fd

We see the behavior that wrong nat'ing was done at the beginning, then corrected later:

sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 3 bytes 180 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
  chain KUBE-SVC-NPX46M4PTMTKRN6Y {
    counter packets 3 bytes 180 jump KUBE-SEP-VZ2X7DROOLWBXBJ4
  }
}
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4
table ip nat {
  chain KUBE-SEP-VZ2X7DROOLWBXBJ4 {
    ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ
    meta l4proto tcp counter packets 3 bytes 180 dnat to 10.40.1.7:6443
  }
}
sh-4.4#
sh-4.4#
<....after a while....>
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
  chain KUBE-SVC-NPX46M4PTMTKRN6Y {
    counter packets 0 bytes 0 jump KUBE-SEP-X33IBTDFOZRR6ONM
  }
}
sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 0 bytes 0 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
  chain KUBE-SVC-NPX46M4PTMTKRN6Y {
    counter packets 0 bytes 0 jump KUBE-SEP-X33IBTDFOZRR6ONM
  }
}
sh-4.4# nft list chain ip nat KUBE-SEP-X33IBTDFOZRR6ONM
table ip nat {
  chain KUBE-SEP-X33IBTDFOZRR6ONM {
    ip saddr 10.30.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ
    meta l4proto tcp counter packets 0 bytes 0 dnat to 10.30.1.7:6443
  }
}
sh-4.4#

— Additional comment from lpbinh@gmail.com on 2022-05-12 17:46:51 UTC —

@trozet@redhat.com May we have an update on the fix, or the plan for the fix? Thank you.

— Additional comment from lpbinh@gmail.com on 2022-05-18 21:27:45 UTC —

Created support Case 03223143.

— Additional comment from vkochuku@redhat.com on 2022-05-31 16:09:47 UTC —

Hello Team,

Any update on this?

Thanks,
Vinu K

— Additional comment from smerrow@redhat.com on 2022-05-31 17:28:54 UTC —

This issue is causing delays in Juniper's CI/CD pipeline and makes for a less than ideal user experience for deployments.

I'm getting a lot of pressure from the partner on this for an update and progress. I've had them open a case [1] to help progress.

Please let us know if there is any other data needed by Juniper or if there is anything I can do to help move this forward.

[1] https://access.redhat.com/support/cases/#/case/03223143

— Additional comment from vpickard@redhat.com on 2022-06-02 22:14:23 UTC —

@bnemec@redhat.com Tim mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2070318#c14 that this issue appears to be at BM install time. Is this something you can help with, or do we need help from the BM install team?

— Additional comment from bnemec@redhat.com on 2022-06-03 18:15:17 UTC —

Sorry, I missed that this came back to me.

(In reply to Binh Le from comment #16)
> We observe this issue when deploying clusters using OpenStack instances as
> our infrastructure is based on OpenStack.

This does not match the configuration in the must-gathers provided so far, which are baremetal. Are we talking about the same environments?

I'm currently discussing this with some other internal teams because I'm unfamiliar with this type of bootstrap setup. I need to understand what the intended behavior is before we decide on a path forward.

— Additional comment from rurena@redhat.com on 2022-06-06 14:36:54 UTC —

(In reply to Ben Nemec from comment #22)
> Sorry, I missed that this came back to me.
>
> (In reply to Binh Le from comment #16)
> > We observe this issue when deploying clusters using OpenStack instances as
> > our infrastructure is based on OpenStack.
>
> This does not match the configuration in the must-gathers provided so far,
> which are baremetal. Are we talking about the same environments?
>
> I'm currently discussing this with some other internal teams because I'm
> unfamiliar with this type of bootstrap setup. I need to understand what the
> intended behavior is before we decide on a path forward.

I spoke to the CU; they tell me that all work should be on baremetal. They were probably just testing on OSP and pointing out that they saw the same behavior.

— Additional comment from bnemec@redhat.com on 2022-06-06 16:19:37 UTC —

Okay, I see now that this is an assisted installer deployment. Can we get the cluster ID assigned by AI so we can take a look at the logs on our side? Thanks.

— Additional comment from lpbinh@gmail.com on 2022-06-06 16:38:56 UTC —

Here is the cluster ID, copied from the bug description:
ClusterID: 24bbde0b-79b3-4ae6-afc5-cb694fa48895

In regard to your earlier question about OpenStack & baremetal (2022-06-03 18:15:17 UTC):

We had an issue with platform validation in OpenStack earlier. Host validation was failing with the error message “Platform network settings: Platform OpenStack Compute is allowed only for Single Node OpenShift or user-managed networking.”

We found that there is no platform type "OpenStack" available in [https://github.com/openshift/assisted-service/blob/master/models/platform_type.go#L29], so we set "baremetal" as the platform type on our computes. That is why you are seeing baremetal as the platform type.

Thank you

— Additional comment from ercohen@redhat.com on 2022-06-08 08:00:18 UTC —

Hey, first, you are correct: when you set 10.30.1.0/24 as the machine network, the bootstrap process should use the IP on that subnet in the bootstrap node.

I'm trying to understand how exactly this cluster was installed.
You are using on-prem deployment of assisted-installer (podman/ACM)?
You are trying to form a cluster from OpenStack VMs?
You set the platform to Baremetal where?
Did you set user-managed-networking?

Some more info: when using the OpenStack platform you should install the cluster with user-managed-networking.
That's what the failing validation is for.

— Additional comment from bnemec@redhat.com on 2022-06-08 14:56:53 UTC —

Moving to the assisted-installer component for further investigation.

— Additional comment from lpbinh@gmail.com on 2022-06-09 07:37:54 UTC —

@Eran Cohen:

Please see my response inline.

You are using on-prem deployment of assisted-installer (podman/ACM)?
--> Yes, we are using on-prem deployment of assisted-installer.

You are trying to form a cluster from OpenStack VMs?
--> Yes.

You set the platform to Baremetal where?
--> It was set in the Cluster object, Platform field when we model the cluster.

Did you set user-managed-networking?
--> Yes, we set it to false for VRRP.

— Additional comment from itsoiref@redhat.com on 2022-06-09 08:17:23 UTC —

@lpbinh@gmail.com can you please share the assisted logs that you can download once the cluster has failed or been installed?
That will help us see the full picture.

— Additional comment from
ercohen@redhat.com
on 2022-06-09 08:23:18 UTC —

OK, as noted before when using OpenStack platform you should install the cluster with user-managed-netwroking (set to true).
Can you explain how you workaround this failing validation? “Platform network settings: Platform OpenStack Compute is allowed only for Single Node OpenShift or user-managed networking.”
What does this mean exactly? 'we set "baremetal" as the platform type on our computes'

To be honest I'm surprised that the installation was completed successfully.

@oamizur@redhat.com I thought installing on OpenStack VMs with the baremetal platform (user-managed-networking=false) would always fail?

— Additional comment from
lpbinh@gmail.com
on 2022-06-10 16:04:56 UTC —

@itsoiref@redhat.com: I will reproduce and collect the logs. Are they supposed to be included in the provided must-gather?
@ercohen@redhat.com:

  • user-managed-networking is set to true when we use an external load balancer and DNS server. For VRRP we use OpenShift's internal LB and DNS server, hence it's set to false, following the doc.
  • As explained, OpenShift returns the platform type as 'none' for OpenStack (https://github.com/openshift/assisted-service/blob/master/models/platform_type.go#L29), therefore we set the platform type as 'baremetal' in the cluster object for provisioning the cluster using OpenStack VMs.

— Additional comment from
itsoiref@redhat.com
on 2022-06-13 13:08:17 UTC —

@lpbinh@gmail.com you will have a download_logs link in the UI. Those logs are not part of must-gather.

— Additional comment from
lpbinh@gmail.com
on 2022-06-14 18:52:02 UTC —

Created attachment 1889993 [details] cluster log per need info request - Cluster ID caa475b0-df04-4c52-8ad9-abfed1509506

Attached is the cluster log per the need info request.
Cluster ID: caa475b0-df04-4c52-8ad9-abfed1509506
In this reproduction, OpenShift did not resolve the issue by itself: the wrong NAT rule remained and the cluster deployment eventually failed.

sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 2 bytes 120 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
chain KUBE-SVC-NPX46M4PTMTKRN6Y { counter packets 2 bytes 120 jump KUBE-SEP-VZ2X7DROOLWBXBJ4 }
}
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4 { ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 2 bytes 120 dnat to 10.40.1.7:6443 }
}
Tue Jun 14 17:40:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4 { ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 2 bytes 120 dnat to 10.40.1.7:6443 }
}
Tue Jun 14 17:59:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4 { ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 9 bytes 540 dnat to 10.40.1.7:6443 }
}
Tue Jun 14 18:17:38 UTC 2022
sh-4.4#
sh-4.4#
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4 { ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 7 bytes 420 dnat to 10.40.1.7:6443 }
}
Tue Jun 14 18:49:28 UTC 2022
sh-4.4#

— Additional comment from
lpbinh@gmail.com
on 2022-06-14 18:56:06 UTC —

Created attachment 1889994 [details] cluster log per need info request - Cluster ID caa475b0-df04-4c52-8ad9-abfed1509506

Please find the cluster-log attached per your request. In this deployment the wrong NAT was not automatically resolved by OpenShift hence the deployment failed eventually.

sh-4.4# nft list table ip nat | grep 172.30.0.1
meta l4proto tcp ip daddr 172.30.0.1 tcp dport 443 counter packets 2 bytes 120 jump KUBE-SVC-NPX46M4PTMTKRN6Y
sh-4.4# nft list chain ip nat KUBE-SVC-NPX46M4PTMTKRN6Y
table ip nat {
chain KUBE-SVC-NPX46M4PTMTKRN6Y { counter packets 2 bytes 120 jump KUBE-SEP-VZ2X7DROOLWBXBJ4 }
}
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4 { ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 2 bytes 120 dnat to 10.40.1.7:6443 }
}
Tue Jun 14 17:40:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4 { ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 2 bytes 120 dnat to 10.40.1.7:6443 }
}
Tue Jun 14 17:59:19 UTC 2022
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4 { ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 9 bytes 540 dnat to 10.40.1.7:6443 }
}
Tue Jun 14 18:17:38 UTC 2022
sh-4.4#
sh-4.4#
sh-4.4# nft list chain ip nat KUBE-SEP-VZ2X7DROOLWBXBJ4; date
table ip nat {
chain KUBE-SEP-VZ2X7DROOLWBXBJ4 { ip saddr 10.40.1.7 counter packets 0 bytes 0 jump KUBE-MARK-MASQ meta l4proto tcp counter packets 7 bytes 420 dnat to 10.40.1.7:6443 }
}
Tue Jun 14 18:49:28 UTC 2022
sh-4.4#

— Additional comment from
itsoiref@redhat.com
on 2022-06-15 15:59:22 UTC —

@lpbinh@gmail.com just for the record, we don't support baremetal OCP on OpenStack; that's why the validation is failing.

— Additional comment from
lpbinh@gmail.com
on 2022-06-15 17:47:39 UTC —

@itsoiref@redhat.com as explained, it's just a workaround on our side to make OCP work in our lab, and from my understanding, from the OCP perspective the deployment is seen as bare metal only, unrelated to OpenStack (please correct me if I am wrong).

We have done thousands of OCP cluster deployments in our automation so far; if that were why the validation is failing, it should fail every time. However, it only occurs occasionally, when nodes have 2 interfaces and use the OCP internal DNS and load balancer, and it sometimes resolves by itself and sometimes does not.

— Additional comment from
itsoiref@redhat.com
on 2022-06-19 17:00:01 UTC —

For now I can assume that this endpoint is causing the issue:
{
    "apiVersion": "v1",
    "kind": "Endpoints",
    "metadata": {
        "creationTimestamp": "2022-06-14T17:31:10Z",
        "labels": {
            "endpointslice.kubernetes.io/skip-mirror": "true"
        },
        "name": "kubernetes",
        "namespace": "default",
        "resourceVersion": "265",
        "uid": "d8f558be-bb68-44ac-b7c2-85ca7a0fdab3"
    },
    "subsets": [
        {
            "addresses": [
                { "ip": "10.40.1.7" }
            ],
            "ports": [
                {
                    "name": "https",
                    "port": 6443,
                    "protocol": "TCP"
                }
            ]
        }
    ]
},

— Additional comment from
itsoiref@redhat.com
on 2022-06-21 17:03:51 UTC —

The issue is that the kube-api service advertises the wrong IP. It does so because kubelet chooses one arbitrarily, and we currently have no mechanism to set the kubelet IP, especially in the bootstrap flow.
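The selection that the reporter expects can be sketched as follows. This is only an illustration, not the kubelet or assisted-installer implementation: `pick_node_ip` is a hypothetical helper, 10.30.1.0/24 is the machine network named earlier in this thread, 10.40.1.7 is the address seen in the NAT rules, and 10.30.1.5 is a made-up address on the machine network.

```python
import ipaddress


def pick_node_ip(candidate_ips, machine_network):
    """Return the first candidate address inside the machine network, or None."""
    network = ipaddress.ip_network(machine_network)
    for candidate in candidate_ips:
        if ipaddress.ip_address(candidate) in network:
            return candidate
    return None


# A node with two interfaces: only one address is on the machine network.
# 10.30.1.5 is a hypothetical address used for illustration.
node_ips = ["10.40.1.7", "10.30.1.5"]
print(pick_node_ip(node_ips, "10.30.1.0/24"))
```

With an arbitrary choice either address may be advertised; filtering by the machine network makes the result deterministic whenever exactly one interface sits on that subnet.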

— Additional comment from
lpbinh@gmail.com
on 2022-06-22 16:07:29 UTC —

@itsoiref@redhat.com how do you perform OCP deployments in setups that have multiple interfaces if kubelet is left to choose an interface arbitrarily instead of being configured with a specific IP address to listen on? With what you describe above, the chance of deployment failure in systems with multiple interfaces would be high.

— Additional comment from
dhellard@redhat.com
on 2022-06-24 16:32:26 UTC —

I set the Customer Escalation flag = Yes, per ACE EN-52253.
The impact is noted by the RH Account team: "Juniper is pressing and this impacts the Unica Next Project at Telefónica Spain. Unica Next is a critical project for Red Hat. We go live the 1st of July and this issue could impact the go live dates. We need clear information about the status and its possible resolution."

— Additional comment from
itsoiref@redhat.com
on 2022-06-26 07:28:44 UTC —

I have sent an image with a possible fix to Juniper and am waiting for their feedback. Once they confirm it works for them, we will proceed with the PRs.

— Additional comment from
pratshar@redhat.com
on 2022-06-30 13:26:26 UTC —

=== In Red Hat Customer Portal Case 03223143 ===
— Comment by Prateeksha Sharma on 6/30/2022 6:56 PM —

//EMT note//

Update from our consultant Manuel Martinez Briceno -

====
On 28th June, 2022 the last feedback from the Juniper Project Manager and our Partner Manager was that they are testing the fix. They didn't give an estimated time to finish, but we will be tracking this closely and will share any news.
====

Thanks & Regards,
Prateeksha Sharma
Escalation Manager | RHCSA
Global Support Services, Red Hat

 As mentioned in [1], the cluster monitoring operator doesn't define the relatedObjects field in the ClusterOperator manifest which is initially deployed by CVO [2].
If the CMO pod fails to start, the must-gather might miss information from the monitoring namespace. Note that once CMO runs, it will update the initial ClusterOperator object with the proper information [3].

[1] http://mailman-int.corp.redhat.com/archives/aos-devel/2021-May/msg00139.html
[2] https://github.com/openshift/cluster-monitoring-operator/blob/master/manifests/0000_50_cluster-monitoring-operator_06-clusteroperator.yaml
[3] https://github.com/openshift/cluster-monitoring-operator/blob/a6bc9824035ceb8dbfe7c53cf0c138bfb2ec5643/pkg/client/status_reporter.go#L49-L63

+++ This bug was initially created as a clone of Bug #2117811 +++

Description of problem:
We are currently unable to merge any pull requests to fix CVEs because of the use of the xmlstarlet command line utility which is not currently packaged for and available for RHEL.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Another bug will be opened to reenable this code once the xmlstarlet package is available in RHEL or we find an alternative fix.

#Description of problem:

Developer Console > +ADD > Developer Catalog > Service > select Types Templates > Instantiate Template

Input values in Instantiate Template disappear randomly.

#Version-Release number of selected component (if applicable):

  • Customer ENV
    • OCP4.10.9
    • Developer Console
    • Edge 88.x / Edge 85.0.x / Chrome 97.x / Chrome 88.x
    • Internet Disconnected OCP cluster
  • quicklab test ENV
    • Developer Console
    • OCP4.10.12
    • Chrome 100

#How reproducible:

I reproduced this issue in ocp410ovn shared cluster in the quicklab

Select Apache HTTP Server > Input name "test" in Application Hostname box
After several seconds, the value has disappeared in the web console.

#Steps to Reproduce:

0. Developer Console > +ADD > Developer Catalog > Service > select Types Templates > Instantiate Template

1. Input values in the box of the template menu.

2. The values disappear several seconds later (after about 20 seconds, or randomly).

3. Many users have experienced this issue.

  • The web browser versions of users experiencing this issue:
    • Customer: Edge 88.x / Edge 85.0.x / Chrome 97.x / Chrome 88.x
    • My browser: Chrome 102.x

==> The browser version doesn't matter.

#Actual results:

Input values in "Instantiate Template" disappear randomly.
Users can't use the Instantiate Template feature in the Dev console.

#Expected results:
Input values remain in the web console and users can create the object with "Instantiate Template".

#Additional info:

See "Application Name" has disappeared in the video I attached.

This is a clone of issue OCPBUGS-1354. The following is the description of the original issue:

This was originally reported in BZ as https://bugzilla.redhat.com/show_bug.cgi?id=2046335

Description of problem:

The issue reported here https://bugzilla.redhat.com/show_bug.cgi?id=1954121 still occurs (tested on OCP 4.8.11; the customer also verified that the issue can happen even with OpenShift 4.7.30, 4.8.17 and 4.9.11)

How reproducible:

Attaching a NIC to a master node triggers the issue

Steps to Reproduce:
1. Deploy an OCP cluster (I've tested it IPI on AWS)
2. Attach a second NIC to a running master node (in my case "ip-10-0-178-163.eu-central-1.compute.internal")

Actual results:

~~~
$ oc get node ip-10-0-178-163.eu-central-1.compute.internal -o json | jq ".status.addresses"
[
  { "address": "10.0.178.163", "type": "InternalIP" },
  { "address": "10.0.187.247", "type": "InternalIP" },
  { "address": "ip-10-0-178-163.eu-central-1.compute.internal", "type": "Hostname" },
  { "address": "ip-10-0-178-163.eu-central-1.compute.internal", "type": "InternalDNS" }
]

$ oc get co etcd
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
etcd 4.8.11 True False True 31h

$ oc get co etcd -o json | jq ".status.conditions[0]"

{ "lastTransitionTime": "2022-01-26T15:47:42Z", "message": "EtcdCertSignerControllerDegraded: [x509: certificate is valid for 10.0.178.163, not 10.0.187.247, x509: certificate is valid for ::1, 10.0.178.163, 127.0.0.1, ::1, not 10.0.187.247]", "reason": "EtcdCertSignerController_Error", "status": "True", "type": "Degraded" }

~~~

Expected results:

To have the certificate valid also for the second IP (the newly created one "10.0.187.247")
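The mismatch can be expressed as a small check. This is a sketch under stated assumptions, not the etcd operator's logic: `missing_sans` is a hypothetical helper, and the inputs are copied from the node addresses and the Degraded message shown above.

```python
def missing_sans(node_addresses, cert_sans):
    """Return the node's InternalIP addresses that the certificate does not cover."""
    covered = set(cert_sans)
    internal_ips = [a["address"] for a in node_addresses if a["type"] == "InternalIP"]
    return [ip for ip in internal_ips if ip not in covered]


# Values taken from the `oc get node` and `oc get co etcd` output in this report.
addresses = [
    {"address": "10.0.178.163", "type": "InternalIP"},
    {"address": "10.0.187.247", "type": "InternalIP"},
    {"address": "ip-10-0-178-163.eu-central-1.compute.internal", "type": "Hostname"},
]
sans = ["::1", "10.0.178.163", "127.0.0.1"]
print(missing_sans(addresses, sans))
```

An empty result would mean the serving certificate covers every InternalIP the node reports; here the newly attached address is the one left out.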

Additional info:

Deleting the following secrets seems to solve the issue:
~~~
$ oc get secret -n openshift-etcd | grep kubernetes.io/tls | grep ^etcd
etcd-client kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 61s
etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 60s
etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 60s
etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal kubernetes.io/tls 2 58s
etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal kubernetes.io/tls 2 59s
etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal kubernetes.io/tls 2 58s

$ oc get secret -n openshift-etcd | grep kubernetes.io/tls | grep ^etcd | awk '{print $1}' | xargs -I {} oc delete secret {} -n openshift-etcd
secret "etcd-client" deleted
secret "etcd-peer-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-peer-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-ip-10-0-202-187.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-132-49.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-178-163.eu-central-1.compute.internal" deleted
secret "etcd-serving-metrics-ip-10-0-202-187.eu-central-1.compute.internal" deleted

$ oc get co etcd -o json | jq ".status.conditions[0]"

{ "lastTransitionTime": "2022-01-26T15:52:21Z", "message": "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found", "reason": "AsExpected", "status": "False", "type": "Degraded" }

~~~

copy of BZ https://bugzilla.redhat.com/show_bug.cgi?id=2053622

Description of problem:

PodDisruptionBudgetAtLimit Warning alert when CR replica count is zero.

Version-Release number of selected component (if applicable):
4.7

How reproducible: Every time

Steps to Reproduce:
1. oc new-project test

2. oc new-app httpd

3. oc create -f pdb.yaml

$ cat pdb.yaml

~~~
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      deployment: httpd
~~~

$ oc get pdb
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
my-pdb N/A 0 0 3h27m

4. oc scale deployment httpd --replicas=0

5. Wait for some time; the alert will be triggered in the console.

Actual results: unexpected warning alert

Expected results: As we are intentionally dropping down the replicas it should not generate an alert.
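The expected behaviour can be sketched as a predicate. This is a hedged illustration rather than the actual alerting rule: it assumes the alert fires when the number of healthy pods equals the number the budget requires, and the function and parameter names are hypothetical.

```python
def pdb_at_limit(current_healthy, desired_healthy, expected_pods):
    """Report 'at limit' only when the budget actually selects some pods."""
    # With zero expected pods the comparison 0 == 0 would otherwise be true,
    # which is the unexpected warning described in this report.
    return expected_pods > 0 and current_healthy == desired_healthy


print(pdb_at_limit(current_healthy=0, desired_healthy=0, expected_pods=0))
print(pdb_at_limit(current_healthy=1, desired_healthy=1, expected_pods=1))
```

The guard on `expected_pods` is the design choice that matters: a workload that was intentionally scaled to zero has nothing to disrupt, so it should not be reported.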

Description of problem:

Customer is facing issue similar to https://github.com/devfile/api/issues/897

Version-Release number of selected component (if applicable):

OCP 4.10.17

How reproducible:
N/A
Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Tried working around it with ALL_PROXY but it did not help. Note because the console operator reverts changes pretty quickly testing this was a bit of a PITA

Description of problem:

OCP Upgrade failing

Version-Release number of the following components:

oc version
Client Version: 4.8.0-202108312109.p0.git.0d10c3f.assembly.stream-0d10c3f
Server Version: 4.10.13
Kubernetes Version: v1.23.5+b463d71

How reproducible: Always

Steps to Reproduce:
1. Create the following SCC (that has `readOnlyRootFilesystem: true`):
~~~
cat << EOF | oc create -f -
allowHostDirVolumePlugin: true
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: true
allowedCapabilities: []
apiVersion: security.openshift.io/v1
defaultAddCapabilities: []
fsGroup:
  type: MustRunAs
groups: []
kind: SecurityContextConstraints
metadata:
  annotations:
    meta.helm.sh/release-name: azure-arc
    meta.helm.sh/release-namespace: default
  labels:
    app.kubernetes.io/managed-by: Helm
  name: kube-aad-proxy-scc
priority: null
readOnlyRootFilesystem: true
requiredDropCapabilities: []
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
users:
- system:serviceaccount:azure-arc:azure-arc-kube-aad-proxy-sa
volumes:
- configMap
- hostPath
- secret
EOF
~~~

2. oc adm upgrade --to=4.10.20

Actual results:

SCC kube-aad-proxy-scc, which has readOnlyRootFilesystem is injected inside the pod version-4.10.20-smvt9-6vqwc, causing it to fail.
~~~
$ oc get po -n openshift-cluster-version
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-6b5c8ff5c8-4bmxx   1/1     Running   0          33m
version-4.10.20-smvt9-6vqwc                 0/1     Error     0          10s
$ oc logs version-4.10.20-smvt9-6vqwc -n openshift-cluster-version
oc logs version-4.10.20-smvt9-6vqwc
mv: cannot remove '/manifests/0000_00_cluster-version-operator_00_namespace.yaml': Read-only file system
mv: cannot remove '/manifests/0000_00_cluster-version-operator_01_adminack_configmap.yaml': Read-only file system
mv: cannot remove '/manifests/0000_00_cluster-version-operator_01_admingate_configmap.yaml': Read-only file system
mv: cannot remove '/manifests/0000_00_cluster-version-operator_01_clusteroperator.crd.yaml': Read-only file system
mv: cannot remove '/manifests/0000_00_cluster-version-operator_01_clusterversion.crd.yaml': Read-only file system
mv: cannot remove '/manifests/0000_00_cluster-version-operator_02_roles.yaml': Read-only file system
mv: cannot remove '/manifests/0000_00_cluster-version-operator_03_deployment.yaml': Read-only file system
mv: cannot remove '/manifests/0000_90_cluster-version-operator_00_prometheusrole.yaml': Read-only file system
mv: cannot remove '/manifests/0000_90_cluster-version-operator_01_prometheusrolebinding.yaml': Read-only file system
mv: cannot remove '/manifests/0000_90_cluster-version-operator_02_servicemonitor.yaml': Read-only file system
mv: cannot remove '/manifests/0001_00_cluster-version-operator_03_service.yaml': Read-only file system
~~~

Expected results:

Pod version-4.10.20-smvt9-6vqwc should run fine

Additional info:

I don't know why, but SCC kube-aad-proxy-scc is injected inside pod version-4.10.20-smvt9-6vqwc:
~~~
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
              "10.129.0.70"
          ],
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
              "10.129.0.70"
          ],
          "default": true,
          "dns": {}
      }]
    openshift.io/scc: kube-aad-proxy-scc ### HERE
  creationTimestamp: "2022-07-25T16:47:39Z"
  generateName: version-4.10.20-5xqtv-
  labels:
    controller-uid: ba707bbe-1825-4f80-89ce-f6bf2301a812
    job-name: version-4.10.20-5xqtv
  name: version-4.10.20-5xqtv-9gcwk
  namespace: openshift-cluster-version
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: version-4.10.20-5xqtv
    uid: ba707bbe-1825-4f80-89ce-f6bf2301a812
  resourceVersion: "40040"
  uid: 0d668d3d-7452-463f-a421-4dfee9c89c23
spec:
  containers:
  - args:
    - -c
    - mkdir -p /etc/cvo/updatepayloads/KsrCX7X9QbtoXkW3TkPcww && mv /manifests /etc/cvo/updatepayloads/KsrCX7X9QbtoXkW3TkPcww/manifests
      && mkdir -p /etc/cvo/updatepayloads/KsrCX7X9QbtoXkW3TkPcww && mv /release-manifests
      /etc/cvo/updatepayloads/KsrCX7X9QbtoXkW3TkPcww/release-manifests
    command:
    - /bin/sh
    image: quay.io/openshift-release-dev/ocp-release@sha256:b89ada9261a1b257012469e90d7d4839d0d2f99654f5ce76394fa3f06522b600
    imagePullPolicy: IfNotPresent
    name: payload
    resources:
      requests:
        cpu: 10m
        ephemeral-storage: 2Mi
        memory: 50Mi
    securityContext:
      privileged: true
      readOnlyRootFilesystem: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/cvo/updatepayloads
      name: payloads
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-fwblb
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: default-dockercfg-smmf4
  nodeName: ip-10-0-215-206.eu-central-1.compute.internal
  nodeSelector:
    node-role.kubernetes.io/master: ""
  preemptionPolicy: PreemptLowerPriority
  priority: 1000000000
  priorityClassName: openshift-user-critical
  restartPolicy: OnFailure
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1000030000
    seLinuxOptions:
      level: s0:c6,c0
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - key: node-role.kubernetes.io/master
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - hostPath:
      path: /etc/cvo/updatepayloads
      type: ""
    name: payloads
  - name: kube-api-access-fwblb
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
      - configMap:
          items:
          - key: service-ca.crt
            path: service-ca.crt
          name: openshift-service-ca.crt
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-07-25T16:47:39Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-07-25T16:47:39Z"
    message: 'containers with unready status: [payload]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-07-25T16:47:39Z"
    message: 'containers with unready status: [payload]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-07-25T16:47:39Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://ac6f6a5d8925620f1a2835a50fe26ea02d35e3a5c2d033015f38fde5206daf8c
    image: quay.io/openshift-release-dev/ocp-release@sha256:b89ada9261a1b257012469e90d7d4839d0d2f99654f5ce76394fa3f06522b600
    imageID: quay.io/openshift-release-dev/ocp-release@sha256:b89ada9261a1b257012469e90d7d4839d0d2f99654f5ce76394fa3f06522b600
    lastState:
      terminated:
        containerID: cri-o://fdac85e975eb00a3abd08e18061ae3673a857769ddfc87ca94a3527a8c7b83f3
        exitCode: 1
        finishedAt: "2022-07-25T16:47:42Z"
        reason: Error
        startedAt: "2022-07-25T16:47:42Z"
    name: payload
    ready: false
    restartCount: 2
    started: false
    state:
      terminated:
        containerID: cri-o://ac6f6a5d8925620f1a2835a50fe26ea02d35e3a5c2d033015f38fde5206daf8c
        exitCode: 1
        finishedAt: "2022-07-25T16:47:56Z"
        reason: Error
        startedAt: "2022-07-25T16:47:56Z"
  hostIP: 10.0.215.206
  phase: Running
  podIP: 10.129.0.70
  podIPs:
  - ip: 10.129.0.70
  qosClass: Burstable
  startTime: "2022-07-25T16:47:39Z"
~~~

Tracker bug for bootimage bump in 4.10. This bug should block bugs which need a bootimage bump to fix.

This is a clone of issue OCPBUGS-515. The following is the description of the original issue:

When a thin provisioned COW format disk is created on OCP on RHV via the CSI driver (a PVC, https://github.com/openshift/ovirt-csi-driver/blob/master/deploy/example/storage-claim.yaml, with for example requested storage of 100 GB), the go-ovirt-client behaviour makes it so that the created disk has a virtual size of 100 GB and its actual size is 110 GB.

But this is a thin provisioned disk, so the initial size of the disk should be the default of the engine and then grow as needed; it shouldn't be this big.

This causes all the disks created this way to be functionally preallocated (since it eats all that space), which is a real waste of space.

How reproducible: 100%

Steps to Reproduce:
1. Create a storage claim (PVC) in OpenShift (https://github.com/openshift/ovirt-csi-driver/blob/master/deploy/example/storage-claim.yaml) using the default storage class (or any other storage class with thinProvisioning: "true") and with requested storage of, e.g., 100Gi

$ oc create -f storage-claim.yaml

2. In the RHV web console navigate to Storage -> Disks and check Virtual size and Actual size of the created disk (PVC)

Actual results:
Disk from our example with requested storage 100GB reports virtual size 100GB and actual size 110 GB.

Expected results:
Thin provisioned disks should start with small initial size and then grow as needed, so its actual size should be considerably smaller (the default initial size set by the engine should be 2.5 GB if I'm not mistaken).

Note: The extra 10 GB in the actual size is caused by the overhead for the qcow2 disk format, which is 10%; this was tracked as a separate issue here:
https://bugzilla.redhat.com/show_bug.cgi?id=2097139
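The figures in this report follow from simple arithmetic, sketched below. The helper name and the integer percentage parameter are illustrative assumptions; only the 10% overhead and the 100 GB request come from the report.

```python
def allocated_size_gb(virtual_gb, overhead_percent=10):
    """Size reserved when the full virtual size plus format overhead is allocated."""
    # Integer arithmetic keeps the result exact for whole-GB sizes.
    return virtual_gb + virtual_gb * overhead_percent // 100


print(allocated_size_gb(100))
```

A disk that is really thin provisioned would instead start near the engine's default initial size and only approach this figure as data is written.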

Description of problem:
This is a clone of https://issues.redhat.com/browse/OCPBUGS-658

 

Description of problem: Numerous erroneous logs in OVN master

I0823 18:00:11.163491       1 obj_retry.go:1063] Retry object setup: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k
I0823 18:00:11.163546       1 obj_retry.go:1096] Removing old object: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k
I0823 18:00:11.163555       1 pods.go:124] Deleting pod: openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k
I0823 18:00:11.163631       1 obj_retry.go:1103] Retry delete failed for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k, will try again later: deleteLogicalPort failed for pod openshift-operator-lifecycle-manager_collect-profiles-27687900-hlp6k: unable to locate portUUID+nodeName for pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k: error getting logical port <nil>: object not found
W0823 18:00:41.163633       1 obj_retry.go:1031] Dropping retry entry for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k: exceeded number of failed attempts

Must-gather: http://shell.lab.bos.redhat.com/~anusaxen/must-gather.local.2234927131259452300/

 

Version-Release number of selected component (if applicable): 4.12.0-0.nightly-2022-08-23-031342

How reproducible: Always

Steps to Reproduce:
1. Bring up OVN cluster on 4.12
2.
3.

Actual results: deleteLogicalPort failed for already gone object

Expected results: deleteLogicalPort should not keep retrying post object deletion
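The expected behaviour can be sketched as a retry loop that treats "not found" as a finished deletion. This is an illustration in Python, not the ovn-kubernetes retry code: `NotFoundError` and `delete_with_retry` are hypothetical names.

```python
class NotFoundError(Exception):
    """Raised by a delete callback when the object no longer exists."""


def delete_with_retry(delete_fn, max_attempts=3):
    """Call delete_fn until it succeeds, stopping early if the object is gone."""
    for attempt in range(1, max_attempts + 1):
        try:
            delete_fn()
            return "deleted"
        except NotFoundError:
            # Nothing left to delete, so retrying cannot change the outcome.
            return "already-gone"
        except Exception:
            if attempt == max_attempts:
                raise
    raise ValueError("max_attempts must be at least 1")


def delete_missing_port():
    raise NotFoundError("logical port not found")


print(delete_with_retry(delete_missing_port))
```

Other errors are still retried up to the limit, so only the "object not found" case is short-circuited.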

Additional info:

Before platformStatus, the operator used to get information about AWS and GCP from the install-config config map. This code can be removed.

Description of problem:

+++ This bug was initially created as a clone of Bug #2118514 +++

Various CI steps use the upi-installer container for its access to the
aws cli tools among other things. However, most of those steps also
curl yq directly from GitHub. We can save ourselves some headaches
when GitHub is down by just embedding the binary in the image already.

Whenever GitHub has issues or throttles us, the yq hash check errors out with a mismatch. The mismatch is probably because GitHub is returning an error page, although our scripts hide it.
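A sketch of why the failure shows up as a hash mismatch, and how a download step could report the real cause. This is not the CI step's script: `verify_download` is a hypothetical helper that receives the HTTP status and body from whatever client performed the download.

```python
import hashlib


def verify_download(status_code, body, expected_sha256):
    """Fail on the HTTP error first, then compare the SHA-256 digest."""
    if status_code != 200:
        # Without this check an error page is hashed like a binary and the
        # only visible symptom is a checksum mismatch.
        raise RuntimeError(f"download failed with HTTP status {status_code}")
    actual = hashlib.sha256(body).hexdigest()
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch: got {actual}")
    return True


payload = b"example binary contents"
digest = hashlib.sha256(payload).hexdigest()
print(verify_download(200, payload, digest))
```

Embedding the binary in the image, as proposed above, removes the download entirely; the check is only relevant for steps that still fetch it at run time.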

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Running the discovery cache every 10 minutes has a significant productivity impact on using kubectl on clusters with many CRDs as it is taking time to run these unnecessary requests.
The discovery cache doesn't really have to run every 10 minutes, as CRDs don't change that often. A lot of unnecessary load is created on clients and servers.

Version-Release number of selected component (if applicable):

4.10.z

How reproducible:

Cluster with a lot of CRDs

Steps to Reproduce:

https://github.com/kubernetes/kubernetes/issues/107130

Actual results:

Significant growth in kubectl request completion time every 10 minutes

Expected results:

Significant growth in kubectl request completion time only every 24 hours
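The effect of the cache lifetime can be sketched with a simple freshness check. This is an illustration, not kubectl's discovery cache code: `cache_is_fresh` is a hypothetical helper, and the two lifetimes are the 10 minutes and 24 hours mentioned in this report.

```python
from datetime import datetime, timedelta


def cache_is_fresh(written_at, now, ttl):
    """Return True while the cached discovery data is younger than the TTL."""
    return now - written_at < ttl


written = datetime(2022, 1, 1, 12, 0)
now = written + timedelta(minutes=30)

# After 30 minutes a 10-minute TTL forces a full rediscovery of every API
# group, while a 24-hour TTL still serves the cached result.
print(cache_is_fresh(written, now, timedelta(minutes=10)))
print(cache_is_fresh(written, now, timedelta(hours=24)))
```

On a cluster with many CRDs each refresh means one request per API group version, which is why a longer lifetime reduces both client latency and server load.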

Additional info:

https://github.com/kubernetes/kubernetes/issues/107130

*USER STORY:*

 As a customer or OpenShift engineer, I want to see the user agent for anything calling from OpenShift -> vSphere to eliminate troubleshooting guesswork.

*DESCRIPTION:*

A question in #forum-vmware was raised where we identified that the user-agent may not be configured for all OpenShift components calling to vSphere API.

https://coreos.slack.com/archives/CH06KMDRV/p1627368902058800

*Required:*

Audit of OpenShift components calling to vSphere API to make sure user agent strings are set appropriately.

*Nice to have:*

How can this be prevented in the future? How can we minimize maintenance costs added by new PRs/bugs reported from this spike?

*ACCEPTANCE CRITERIA:*

New PRs or bug reports for each affected component.

Acceptance criteria:

  • All tests (including e2e) pass
  • No regressions are introduced
  • openshift/api points to a recent commit on the master branch

Goal

We have several use cases where dynamic plugins need to proxy to another service on the cluster. One example is the Helm plugin. We would like to move the backend code for Helm to a separate service on the cluster, and the Helm plugin could proxy to that service for its requests. This is required to make Helm a dynamic plugin. Similarly if we want to have ACM contribute any views through dynamic plugins, we will need a way for ACM to proxy to its services (e.g., for Search).

It's possible for plugins to make requests to services exposed through routes today, but that has several problems:

  1. It requires that the service be exposed outside the cluster, which is not always desired.
  2. It requires the service support CORS headers for the console.
  3. There is no way to specify a CA file for the route if it's not trusted by the browser.
  4. Plugins will not have access to the user's access token on the client, which means that there is no simple way to handle auth.

Plugins need a way to declare in-cluster services that they need to connect to. The console backend will need to set up proxies to those services on console load. This also requires that the console operator be updated to pass the configuration to the console backend.
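A minimal sketch of what such a declaration could carry and how a backend might turn it into a proxy target. Everything here is an assumption for illustration: the `ServiceProxy` fields, the helper names, the example service, and the default `cluster.local` cluster domain are not taken from the ConsolePlugin API.

```python
from dataclasses import dataclass


@dataclass
class ServiceProxy:
    """A hypothetical in-cluster service declared by a plugin."""
    alias: str
    name: str
    namespace: str
    port: int
    authorize: bool = False  # forward the user's token when True


def backend_url(proxy, cluster_domain="cluster.local"):
    """Build the in-cluster URL the console backend would proxy to."""
    return f"https://{proxy.name}.{proxy.namespace}.svc.{cluster_domain}:{proxy.port}"


def forwarded_headers(proxy, user_token):
    """Attach the user's token only when the plugin asked for it."""
    return {"Authorization": f"Bearer {user_token}"} if proxy.authorize else {}


helm = ServiceProxy(alias="helm", name="helm-backend", namespace="helm", port=8443, authorize=True)
print(backend_url(helm))
print(forwarded_headers(helm, "example-token"))
```

Because the request goes through the console backend, the service does not need a route, CORS headers, or a browser-trusted certificate, which addresses the four problems listed above.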

 

This work will apply only to single clusters.

 

Open Questions

  • What happens when a multitenant isolated network policy is configured on the cluster? See https://docs.openshift.com/container-platform/4.7/networking/network_policy/multitenant-network-policy.html
  • How do we (and can we?) support this for multi-cluster where console is running on a different hub cluster?
  • Do we need to auth for all requests?

Acceptance Criteria

  • Plugins can declare a service to proxy to in the ConsolePlugin resource
  • Plugins can specify a CA cert for the service
  • Console falls back to the service signing CA if none is specified
  • Plugins have a way of specifying whether the user's authentication token is included in requests through the service proxy
  • Dynamic plugin enhancement is updated with the implementation details
  • Support for server-side events (SSE) for ACM
  • Add support, or a flag, if auth is needed for each request.  
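
The acceptance criteria above can be sketched as a small resolution step that the console backend might run per proxied request. This is only an illustration: the `ServiceProxy` fields, the `resolve_proxy` helper, the example service names, and the service CA mount path are assumptions, not the actual ConsolePlugin API or console implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumed mount path of the service signing CA inside the console pod.
SERVICE_CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt"


@dataclass
class ServiceProxy:
    """Hypothetical shape of one proxy declaration in a ConsolePlugin resource."""
    alias: str
    namespace: str
    name: str
    port: int
    ca_certificate: Optional[str] = None  # PEM text; None means "not specified"
    authorize: bool = False               # forward the user's token when True


def resolve_proxy(proxy: ServiceProxy, user_token: str) -> Dict[str, object]:
    """Return the upstream URL, CA source, and extra headers for one request."""
    # In-cluster DNS name of the service; nothing is exposed through a route.
    target = f"https://{proxy.name}.{proxy.namespace}.svc.cluster.local:{proxy.port}"
    # Fall back to the service signing CA when the plugin supplies none.
    ca = proxy.ca_certificate if proxy.ca_certificate else SERVICE_CA_PATH
    headers = {}
    if proxy.authorize:
        # Only attach the user's token when the plugin asked for it.
        headers["Authorization"] = f"Bearer {user_token}"
    return {"target": target, "ca": ca, "headers": headers}


# Hypothetical Helm backend service used purely as an example.
helm = ServiceProxy(alias="helm", namespace="helm-ns", name="helm-backend", port=8443)
print(resolve_proxy(helm, user_token="sha256~example"))
```

Keeping the token decision on the backend means the browser never needs direct access to the user's access token, which is the fourth problem listed for route-based access.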

cc Ali Mobrem [~christianmvogt]

This is a clone of issue OCPBUGS-1786. The following is the description of the original issue:

This is a clone of issue OCPBUGS-1677. The following is the description of the original issue:

Description of problem:
pkg/devfile/sample_test.go fails after devfile registry was updated (https://github.com/devfile/registry/pull/126)

This issue is about updating our assertion so that the CI job runs successfully again. We might want to backport this as well.

OCPBUGS-1678 is about updating the code so that the test uses a mock response instead of the latest registry content, or checks some specific attributes instead of comparing the full JSON response.
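
The second option (checking specific attributes rather than the full JSON) could look roughly like the sketch below. It is written in Python for brevity, not in the Go used by pkg/devfile/sample_test.go, and the mock payload, field names, and helper names are hypothetical.

```python
import json

# Hypothetical mock of a devfile registry index response; the field names
# are illustrative and not taken from the real registry.
MOCK_INDEX = json.dumps([
    {"name": "nodejs-basic", "displayName": "Basic Node.js", "tags": ["NodeJS"]},
    {"name": "python-basic", "displayName": "Basic Python", "tags": ["Python"]},
])


def sample_names(index_json: str) -> set:
    """Extract only the stable attribute the test cares about."""
    return {entry["name"] for entry in json.loads(index_json)}


def check_samples(index_json: str, required: set) -> bool:
    """Pass when every required sample is present; extra entries are ignored."""
    return required <= sample_names(index_json)


print(check_samples(MOCK_INDEX, {"nodejs-basic"}))
```

A check like this keeps passing when the registry adds samples or changes unrelated fields, which is the kind of upstream change that broke the full-JSON comparison.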

Version-Release number of selected component (if applicable):
4.12

How reproducible:
Always

Steps to Reproduce:
1. Clone openshift/console
2. Run ./test-backend.sh

Actual results:
Unit tests fail

Expected results:
Unit tests should pass again

Additional info:

Description of problem:

Jenkins install-plugins.sh script does not ignore update requests for locked versions of plugins, and does not verify that the locked version was actually included in the bundle-plugins.txt file.
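
The two missing behaviours can be sketched as follows. This is a Python illustration, not the logic of install-plugins.sh itself; it assumes plugin lists use `name:version` lines, and the function names and version numbers are made up for the example.

```python
from typing import Dict, List


def parse_plugin_list(lines: List[str]) -> Dict[str, str]:
    """Parse 'name:version' lines, skipping blanks and '#' comments."""
    plugins = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition(":")
        plugins[name] = version
    return plugins


def apply_updates(bundle: Dict[str, str], updates: Dict[str, str],
                  locked: Dict[str, str]) -> Dict[str, str]:
    """Apply requested updates, but never move a locked plugin off its pin."""
    result = dict(bundle)
    for name, version in updates.items():
        if name in locked:
            continue  # ignore update requests for locked plugins
        result[name] = version
    return result


def verify_locked(bundle: Dict[str, str], locked: Dict[str, str]) -> List[str]:
    """Return locked plugins whose pinned version is missing from the bundle."""
    return sorted(n for n, v in locked.items() if bundle.get(n) != v)


# Illustrative versions only.
bundle = parse_plugin_list(["git:4.11.3", "kubernetes:1.30.1", ""])
locked = {"kubernetes": "1.30.1"}
updated = apply_updates(bundle, {"git": "4.11.4", "kubernetes": "1.31.0"}, locked)
print(updated, verify_locked(updated, locked))
```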

Version-Release number of selected component (if applicable):


How reproducible:

Run make plugins-list


+++ This bug was initially created as a clone of Bug #2118717 +++
https://bugzilla.redhat.com/show_bug.cgi?id=2118717

Description of problem:
This BZ is a spin-off of BZ-2114945 so we can track possible issues with new TCP connections from pods failing to be created on the nodes, leading to pods that cannot start or that crash.

Version-Release number of selected component (if applicable):
OCP 4.10.24 with OVN-Kubernetes

How reproducible:
Periodically on the customer only so far.

— Additional comment from Andre Costa on 2022-08-10 16:30:00 UTC —

There are 3 must-gathers here that were gathered during the issues and after the restart of the ovnk-master pods, which makes all these issues go away so that pods establish connections immediately.

This must-gather was taken at 11 AM today when they received a report from one of the customers:

https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/3f820e22-3d1e-4c01-b56d-e479a6cc5578

Customer reported the issue again and this time we also got sosreport and inspect from the project.
In the pod they get errors like this (the same ones we saw on the call with them last week, where it seems no TCP connection entries are created at all. At first we thought it was DNS, but the same issues occur even when using IPs directly):
-----------------
mx-toni-dev toni-dev-build 0/1 Error 0 18m 10.195.80.253 demchdc5vvx <none> <none>

[z0003rbj-z07@stuart ~]$ oc logs toni-dev-build
time="2022-08-10T10:59:02Z" level=info msg="Start building app with registry type openshift"
time="2022-08-10T10:59:02Z" level=info msg="Adding ssl certificate /etc/ssl/certs/ca-bundle.crt"
time="2022-08-10T10:59:02Z" level=info msg="Certificate /etc/ssl/certs/ca-bundle.crt has been added successfully"
time="2022-08-10T10:59:02Z" level=info msg="Updating docker config with registry credentials"
time="2022-08-10T10:59:02Z" level=info msg="Docker config has been updated with registry credentials"
time="2022-08-10T10:59:02Z" level=info msg="Downloading MDA from https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/eba21059-8896-4cb7-8971-4d61f6756273/71042"
time="2022-08-10T10:59:32Z" level=error msg="Failed to build mendix app, failed to create application layer failed to download MDA from https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/eba21059-8896-4cb7-8971-4d61f6756273/71042, Get \"https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/eba21059-8896-4cb7-8971-4d61f6756273/71042\": proxyconnect tcp: dial tcp: i/o timeout: Get \"https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/eba21059-8896-4cb7-8971-4d61f6756273/71042\": proxyconnect tcp: dial tcp: i/o timeout"
-----------------------

This keeps happening if they continue to run the builds which they did and created the MG and sosreport:

https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/d10d9646-5437-4806-b34f-3acab4d83266

https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/1e9e0575-66b0-4184-a3d4-e84fa245c9a2

And like we have seen so far restarting the ovnk-master pods makes these connections work immediately again:

https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/e03ee9c6-2ce4-40d3-a484-84da61607de5

https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/652ebe47-b71f-4d63-8499-cf989b3ebfc7

— Additional comment from Andre Costa on 2022-08-10 16:30:43 UTC —

— Additional comment from Tim Rozet on 2022-08-10 22:36:06 UTC —

Thanks for the must-gathers. Flavio and I examined them, and there is definitely a bug here in ovn-kube. The toni-dev-build pod is deleted and recreated multiple times, and during this time it moves to different nodes. However, due to a bug in OVNK, its logical switch port is updated with the new IP address and information as if it were moving to the new node, but stays on the previous logical switch. So, for example, this is what happens:

1. The pod is originally assigned to node demchdc6zax. This node's cluster subnet is 10.195.79.0/24:

2022-08-09T09:22:57.078688727+00:00 stderr F I0809 09:22:57.078632 2239319 cni.go:248] [mx-toni-dev/toni-dev-build b1a4fb0be20ff717f85fd0fffab4fb303bbcb0f8b68aced4852fb7a2465d2df1] ADD finished CNI request [mx-toni-dev/toni-dev-build b1a4fb0be20ff717f85fd0fffab4fb303bbcb0f8b68aced4852fb7a2465d2df1], result "{\"interfaces\":[{\"name\":\"b1a4fb0be20ff71\",\"mac\":\"a6:68:38:ad:66:c8\"},{\"name\":\"eth0\",\"mac\":\"0a:58:0a:c3:4f:52\",\"sandbox\":\"/var/run/netns/e354d2d5-83cb-406f-a2d9-c5f3e786bae4\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.195.79.82/24\",\"gateway\":\"10.195.79.1\"}],\"dns\":{}}", err <nil>

2. Over time this pod is completed, deleted, recreated many times. Until eventually it lands on demchdc5vvx the next day:
2022-08-10T08:43:58.428759111Z I0810 08:43:58.428719 1837017 cni.go:248] [mx-toni-dev/toni-dev-build 5d4f195cbda5269e5451593987be9d69ea828ee549bc447a5bbe50db847c182a] ADD finished CNI request [mx-toni-dev/toni-dev-build 5d4f195cbda5269e5451593987be9d69ea828ee549bc447a5bbe50db847c182a], result "{\"interfaces\":[{\"name\":\"5d4f195cbda5269\",\"mac\":\"a2:22:6c:1d:43:c1\"},{\"name\":\"eth0\",\"mac\":\"0a:58:0a:c3:50:1e\",\"sandbox\":\"/var/run/netns/bae8f77a-b368-4b3a-86dc-df925330fa26\"}],\"ips\":[{\"version\":\"4\",\"interface\":1,\"address\":\"10.195.80.30/24\",\"gateway\":\"10.195.80.1\"}],\"dns\":{}}", err <nil>

3. Although it lands on a new node, in OVNK we update the old port (somehow the old port is not being removed) that is attached to the old switch:
[root@fedora ~]# ovn-nbctl list logical_switch_port c345cc07-8a89-4e70-beff-d8d9f4dac46a
_uuid : c345cc07-8a89-4e70-beff-d8d9f4dac46a
addresses : ["0a:58:0a:c3:50:1e 10.195.80.30"]
dhcpv4_options : []
dhcpv6_options : []
dynamic_addresses : []
enabled : []
external_ids : {namespace=mx-toni-dev, pod="true"}
ha_chassis_group : []
name : mx-toni-dev_toni-dev-build
options : {iface-id-ver="655cc043-53d5-4550-9c0a-74095a0fbde4", requested-chassis=demchdc5vvx}

parent_name : []
port_security : ["0a:58:0a:c3:50:1e 10.195.80.30"]
tag : []
tag_request : []
type : ""
up : false

[root@fedora ~]# ovn-nbctl lsp-list demchdc6zax | grep c345cc07-8a89-4e70-beff-d8d9f4dac46a
c345cc07-8a89-4e70-beff-d8d9f4dac46a (mx-toni-dev_toni-dev-build)
[root@fedora ~]# ovn-nbctl lsp-list demchdc5vvx | grep c345cc07-8a89-4e70-beff-d8d9f4dac46a
[root@fedora ~]#

This will cause the pod not to be able to send any traffic as its IP is in the wrong subnet for this switch.

4. Additionally the default node SNAT for this pod is in the right place:
[root@fedora ~]# ovn-nbctl lr-nat-list GR_demchdc6zax | grep 10.195.80.30
[root@fedora ~]# ovn-nbctl lr-nat-list GR_demchdc5vvx | grep 10.195.80.30
snat 139.25.144.25 10.195.80.30

5. But there is no egress IP reroute or SNAT entry for this pod:
Egress IP:
status:
items:

  • egressIP: 139.25.144.72
    node: demchdc5z6x
    name: egress-mx-toni-dev

[root@fedora ~]# ovn-nbctl lr-nat-list GR_demchdc5z6x | grep 10.195.80.30
[root@fedora ~]# ovn-nbctl lr-nat-list GR_demchdc5z6x | grep 139.25.144.72
snat 139.25.144.72 10.195.77.156
snat 139.25.144.72 10.195.80.40
snat 139.25.144.72 10.195.76.65
snat 139.25.144.72 10.195.76.184
snat 139.25.144.72 10.195.80.42
snat 139.25.144.72 10.195.80.156

6. We see in the ovnkube-master logs that ovnk attempts to delete this pod, but it fails because we try to delete a logical switch port that is still bound to the wrong logical switch:
2022-08-10T09:07:32.033303027Z I0810 09:07:32.033270 1 client.go:781] "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.80.30]}}] Timeout:<nil> Where:[where column _uuid == {4fd75aab-222c-4f17-9407-3764f9b6760b}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:c345cc07-8a89-4e70-beff-d8d9f4dac46a}]}}] Timeout:<nil> Where:[where column _uuid == {f5073eaa-3f72-4ec2-94c3-3744d412864a}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {c345cc07-8a89-4e70-beff-d8d9f4dac46a}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]"

2022-08-10T09:07:32.037857343Z I0810 09:07:32.037163 1 pods_retry.go:57] [655cc043-53d5-4550-9c0a-74095a0fbde4/mx-toni-dev/toni-dev-build] teardown retry failed; will try again later
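
The checks in steps 3 and 5 can be reproduced offline with a short sketch. The addresses and NAT lines are copied from the output above; the helper names are hypothetical, and the parser only handles the three-field `snat` lines as they appear in this report.

```python
import ipaddress


def pod_ip_on_switch(pod_ip: str, switch_subnet: str) -> bool:
    """True when the pod IP belongs to the logical switch's node subnet."""
    return ipaddress.ip_address(pod_ip) in ipaddress.ip_network(switch_subnet)


def snat_logical_ips(nat_lines) -> set:
    """Collect logical IPs from lines shaped like 'snat <external_ip> <logical_ip>'."""
    ips = set()
    for line in nat_lines:
        fields = line.split()
        if len(fields) == 3 and fields[0] == "snat":
            ips.add(fields[2])
    return ips


# Values copied from the report above.
POD_IP = "10.195.80.30"
OLD_SWITCH_SUBNET = "10.195.79.0/24"  # demchdc6zax, where the port still lives
EGRESS_NAT = [
    "snat 139.25.144.72 10.195.77.156",
    "snat 139.25.144.72 10.195.80.40",
    "snat 139.25.144.72 10.195.76.65",
    "snat 139.25.144.72 10.195.76.184",
    "snat 139.25.144.72 10.195.80.42",
    "snat 139.25.144.72 10.195.80.156",
]

print(pod_ip_on_switch(POD_IP, OLD_SWITCH_SUBNET))  # False: wrong subnet for this switch
print(POD_IP in snat_logical_ips(EGRESS_NAT))       # False: no egress SNAT entry
```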

From the engineering side we need to try to reproduce this. Flavio and I think this toni-dev-build must be a stateful set running to completion. We need to investigate further to understand how we ended up trying to add the new pod without first deleting the old. Could be a bug related to completed pods logic or retry logic that was added in 4.10.

— Additional comment from Anurag saxena on 2022-08-11 04:52:37 UTC —

Thanks @trozet@redhat.com for investigation. @huirwang@redhat.com @jechen@redhat.com Probably we can follow these steps to internally repro the issue.

— Additional comment from Andre Costa on 2022-08-11 15:09:25 UTC —

Hi all,

Many thanks for the great work here; the customer is relieved that we found something.

In the meantime, here is a new example that looks like another case caused by this bug. We saw this live last week, and today the customer saw it again on another project:

- Similar build pod started by their operator:

1h21m Warning ErrorAddingLogicalPort pod/ngat-dev-build deleteLogicalPort failed for pod mx-ngat-dev_ngat-dev-build: cannot delete logical switch port mx-ngat-dev_ngat-dev-build, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.76.185]}}] Timeout:<nil> Where:[where column _uuid == {961a2f8f-14bb-44f9-a945-0dc2211324c7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:d300fdb7-d337-4c64-8e31-7ff02889d9fb}]}}] Timeout:<nil> Where:[where column _uuid == {bc92e86b-a1ae-4771-8e59-0588657bf70e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {d300fdb7-d337-4c64-8e31-7ff02889d9fb}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s) UUID:{GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s)
1h21m Warning ErrorAddingLogicalPort pod/ngat-dev-build failed to ensurePod mx-ngat-dev/ngat-dev-build since it is not yet scheduled
Unknown Normal Scheduled pod/ngat-dev-build Successfully assigned mx-ngat-dev/ngat-dev-build to demchdc5vux
1h21m Normal AddedInterface pod/ngat-dev-build Add eth0 [10.195.81.116/24] from ovn-kubernetes
1h21m Normal Pulling pod/ngat-dev-build Pulling image "private-cloud.registry.mendix.com/image-builder:2.2.0"
1h21m Normal Pulled pod/ngat-dev-build Successfully pulled image "private-cloud.registry.mendix.com/image-builder:2.2.0" in 622.636779ms
1h21m Normal Created pod/ngat-dev-build Created container mendix-build
1h21m Normal Started pod/ngat-dev-build Started container mendix-build
1h20m Warning ErrorAddingLogicalPort pod/ngat-dev-build deleteLogicalPort failed for pod mx-ngat-dev_ngat-dev-build: cannot delete logical switch port mx-ngat-dev_ngat-dev-build, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.81.116]}}] Timeout:<nil> Where:[where column _uuid == {961a2f8f-14bb-44f9-a945-0dc2211324c7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:d300fdb7-d337-4c64-8e31-7ff02889d9fb}]}}] Timeout:<nil> Where:[where column _uuid == {95072525-7d91-4f1f-9ad8-38a4171b978e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {d300fdb7-d337-4c64-8e31-7ff02889d9fb}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s) UUID:{GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s)
1h2m Warning ErrorAddingLogicalPort pod/ngat-dev-build failed to ensurePod mx-ngat-dev/ngat-dev-build since it is not yet scheduled
Unknown Normal Scheduled pod/ngat-dev-build Successfully assigned mx-ngat-dev/ngat-dev-build to demchdc5vvx
1h2m Normal AddedInterface pod/ngat-dev-build Add eth0 [10.195.80.33/24] from ovn-kubernetes
1h2m Normal Pulling pod/ngat-dev-build Pulling image "private-cloud.registry.mendix.com/image-builder:2.2.0"
1h2m Normal Pulled pod/ngat-dev-build Successfully pulled image "private-cloud.registry.mendix.com/image-builder:2.2.0" in 599.594895ms
1h2m Normal Created pod/ngat-dev-build Created container mendix-build
1h2m Normal Started pod/ngat-dev-build Started container mendix-build
Unknown Normal Scheduled pod/ngat-dev-build Successfully assigned mx-ngat-dev/ngat-dev-build to demchdc6z2x
28m Warning ErrorAddingLogicalPort pod/ngat-dev-build failed to ensurePod mx-ngat-dev/ngat-dev-build since it is not yet scheduled
27m Normal AddedInterface pod/ngat-dev-build Add eth0 [10.195.76.114/24] from ovn-kubernetes
27m Normal Pulling pod/ngat-dev-build Pulling image "private-cloud.registry.mendix.com/image-builder:2.2.0"
27m Normal Pulled pod/ngat-dev-build Successfully pulled image "private-cloud.registry.mendix.com/image-builder:2.2.0" in 575.907145ms
27m Normal Created pod/ngat-dev-build Created container mendix-build
27m Normal Started pod/ngat-dev-build Started container mendix-build
26m Warning ErrorAddingLogicalPort pod/ngat-dev-build deleteLogicalPort failed for pod mx-ngat-dev_ngat-dev-build: cannot delete logical switch port mx-ngat-dev_ngat-dev-build, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.76.114]}}] Timeout:<nil> Where:[where column _uuid == {961a2f8f-14bb-44f9-a945-0dc2211324c7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:d300fdb7-d337-4c64-8e31-7ff02889d9fb}]}}] Timeout:<nil> Where:[where column _uuid == {bc92e86b-a1ae-4771-8e59-0588657bf70e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {d300fdb7-d337-4c64-8e31-7ff02889d9fb}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s) UUID:{GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s)
26m Warning ErrorAddingLogicalPort pod/ngat-dev-build failed to ensurePod mx-ngat-dev/ngat-dev-build since it is not yet scheduled
Unknown Normal Scheduled pod/ngat-dev-build Successfully assigned mx-ngat-dev/ngat-dev-build to demchdc6z2x
26m Normal AddedInterface pod/ngat-dev-build Add eth0 [10.195.76.114/24] from ovn-kubernetes
26m Normal Pulling pod/ngat-dev-build Pulling image "private-cloud.registry.mendix.com/image-builder:2.2.0"
26m Normal Pulled pod/ngat-dev-build Successfully pulled image "private-cloud.registry.mendix.com/image-builder:2.2.0" in 606.11715ms
26m Normal Created pod/ngat-dev-build Created container mendix-build
26m Normal Started pod/ngat-dev-build Started container mendix-build
26m Warning ErrorAddingLogicalPort pod/ngat-dev-build deleteLogicalPort failed for pod mx-ngat-dev_ngat-dev-build: cannot delete logical switch port mx-ngat-dev_ngat-dev-build, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.76.114]}}] Timeout:<nil> Where:[where column _uuid == {961a2f8f-14bb-44f9-a945-0dc2211324c7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:d300fdb7-d337-4c64-8e31-7ff02889d9fb}]}}] Timeout:<nil> Where:[where column _uuid == {bc92e86b-a1ae-4771-8e59-0588657bf70e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {d300fdb7-d337-4c64-8e31-7ff02889d9fb}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s) UUID:{GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s)
22m Warning ErrorAddingLogicalPort pod/ngat-dev-build failed to ensurePod mx-ngat-dev/ngat-dev-build since it is not yet scheduled
Unknown Normal Scheduled pod/ngat-dev-build Successfully assigned mx-ngat-dev/ngat-dev-build to demchdc6z2x
22m Normal AddedInterface pod/ngat-dev-build Add eth0 [10.195.76.114/24] from ovn-kubernetes
22m Normal Pulling pod/ngat-dev-build Pulling image "private-cloud.registry.mendix.com/image-builder:2.2.0"
22m Normal Pulled pod/ngat-dev-build Successfully pulled image "private-cloud.registry.mendix.com/image-builder:2.2.0" in 534.430683ms
22m Normal Created pod/ngat-dev-build Created container mendix-build
22m Normal Started pod/ngat-dev-build Started container mendix-build
21m Warning ErrorAddingLogicalPort pod/ngat-dev-build deleteLogicalPort failed for pod mx-ngat-dev_ngat-dev-build: cannot delete logical switch port mx-ngat-dev_ngat-dev-build, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.76.114]}}] Timeout:<nil> Where:[where column _uuid == {961a2f8f-14bb-44f9-a945-0dc2211324c7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:d300fdb7-d337-4c64-8e31-7ff02889d9fb}]}}] Timeout:<nil> Where:[where column _uuid == {bc92e86b-a1ae-4771-8e59-0588657bf70e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {d300fdb7-d337-4c64-8e31-7ff02889d9fb}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s) UUID:{GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s)

  • Eventually the pod seems to start but then fails again with the same TCP timeout:

2022-08-11T08:26:24.264204783Z time="2022-08-11T08:26:24Z" level=error msg="Failed to build mendix app, failed to create application layer failed to download MDA from https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/ef924516-7121-4372-a306-b0055c445766/71268, Get \"https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/ef924516-7121-4372-a306-b0055c445766/71268\": proxyconnect tcp: dial tcp: i/o timeout: Get \"https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/ef924516-7121-4372-a306-b0055c445766/71268\": proxyconnect tcp: dial tcp: i/o timeout"

  • Customer took a sosreport from the node where the pod terminated, in case it is worth checking:

https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/bd3f86b2-6d2a-44f7-bb0f-eec85630586d?usePresignedUrl=true

— Additional comment from Tim Rozet on 2022-08-11 21:19:12 UTC —

We were able to reproduce the issue locally. There are two potential paths that can cause a stale logical switch port to be re-used on the wrong node:
Scenario 1: pod is created, deleted, and recreated on another node extremely quickly. This plays out like this:
Events:
1. pod toni is created on node A
2. pod toni is deleted on node A
3. pod toni is recreated on node B

What happens in ovnk:
1. pod toni is created on node B
2. pod toni is deleted on node A (fails)

This happens because we grab the latest version of the pod to add in event 1, which by the time we grab it is actually the value from event 3. Then, after processing event 1, we move to event 2. When we delete the pod, we use what was given to us in the event, which is now the wrong information for the removal: it points at node A, while the port is actually on node B. The fix is to ignore the value in the pod spec on deletion and use what we store in our internal port cache. If there is no entry in the cache, we search OVN NBDB to find the right switch (this is an expensive operation, so we want to avoid it when possible).

Flavio is working on a fix for this.
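
The idea behind the Scenario 1 fix can be sketched as below. This is a simplified Python model, not ovn-kubernetes code: the `PortTracker` class, its methods, and the use of a plain dict as a stand-in for OVN NBDB are all assumptions made for illustration.

```python
class PortTracker:
    """Toy model of a port cache backed by a switch -> ports mapping."""

    def __init__(self, nbdb_switch_ports):
        # nbdb_switch_ports: switch name -> set of port names (stand-in for NBDB)
        self.nbdb = nbdb_switch_ports
        self.cache = {}  # port name -> switch name recorded at add time

    def add_port(self, port, switch):
        self.nbdb.setdefault(switch, set()).add(port)
        self.cache[port] = switch

    def find_switch(self, port):
        switch = self.cache.get(port)
        if switch is not None:
            return switch
        # Cache miss: scan every switch (the expensive path).
        for name, ports in self.nbdb.items():
            if port in ports:
                return name
        return None

    def delete_port(self, port, node_from_event):
        # node_from_event is deliberately ignored because it may be stale.
        switch = self.find_switch(port)
        if switch is None:
            return None
        self.nbdb[switch].discard(port)
        self.cache.pop(port, None)
        return switch


tracker = PortTracker({})
# The add was processed with the latest pod state, so the port lands on node B.
tracker.add_port("mx-toni-dev_toni-dev-build", "node-b")
# The delete event still names node A, but the cache points at node B.
removed = tracker.delete_port("mx-toni-dev_toni-dev-build", node_from_event="node-a")
print(removed)
```

With the cache consulted first, the stale node in the delete event no longer decides which switch is touched, and the full NBDB scan only runs on a cache miss.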

Scenario 2: pod is created, runs to completion, is deleted very quickly, and then is recreated on another node:
Events:
1. pod toni is created on node A
2. pod toni runs to completion
3. pod toni immediately is deleted
4. pod toni is recreated on node B

What happens in ovnk:
1. pod toni is created on node A
2. completion causes an update event, ovnk does not delete the pod (bug)
3. deletion event is processed, but since it is a completed pod, we ignore it (since we should have already deleted it in step 2)
4. pod toni is recreated on node B, ovnk sees there is already a switch port for toni on the wrong switch, and just updates that with the new information

This happens when a pod goes to completed, but is deleted very quickly. In step 2 during update event we try to grab the latest version of the pod, but it doesn't exist anymore since it was deleted. In this case we skip the update, instead of tearing down the pod. The delete code in step 3 assumes that if the pod is completed we must have already handled it in update, so the stale port stays around until an add later re-uses it. Patryk already has a fix for this:

https://github.com/ovn-org/ovn-kubernetes/pull/3071
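
One way to read the intended behaviour for Scenario 2 is sketched below. It is not the code from the linked pull request: the `handle_update` function, the dict-based port cache, and the pod representation are hypothetical, and only the pod phase names come from Kubernetes.

```python
def handle_update(port_cache: dict, pod_key: str, latest_pod) -> str:
    """Process an update event for a pod.

    latest_pod is None when the pod has already been deleted from the API
    by the time the update event is handled.
    """
    completed = latest_pod is not None and latest_pod.get("phase") in ("Succeeded", "Failed")
    if latest_pod is None or completed:
        # Tear the port down instead of skipping the update, so that a later
        # add on another node cannot re-use a stale port.
        port_cache.pop(pod_key, None)
        return "torn down"
    return "kept"


cache = {"mx-toni-dev/toni-dev-build": "node-a"}
# The pod completed and was deleted before the update event was processed.
print(handle_update(cache, "mx-toni-dev/toni-dev-build", None), cache)
```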

There may still be more egress IP issues to investigate (tracked in other bugs) after fixing this, and we will look into those after fixing these fundamental pod issues.

— Additional comment from Tim Rozet on 2022-08-11 22:19:30 UTC —

Andre, re: comment 5. This looks like the same issue. If you have a must gather we can confirm. I think the sosreport from the worker node is not enough. The referential integrity violation occurs because we attempt to do these operations:
1. remove the logical switch port from the logical switch
2. delete the logical switch port

In this case:
1. we remove the logical switch port from the logical switch, but the switch port actually exists on a different switch (node), so this is a no-op
2. we try to delete the logical switch port; NBDB reports a violation and refuses to delete it, because another switch still holds a reference to this object
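
The two operations and their failure mode can be modelled in a few lines. This is a toy simulation with plain Python sets standing in for the Logical_Switch `ports` column; it does not use the OVSDB client, and the exception class and function name are made up for the example.

```python
class ReferentialIntegrityError(Exception):
    """Raised when a port row is deleted while a switch still references it."""


def delete_port(switches: dict, port_uuid: str, switch_from_event: str) -> None:
    # Step 1: mutate the switch named in the event. Removing a member that
    # is not in the set is a no-op, just like the mutate on the wrong switch.
    switches[switch_from_event].discard(port_uuid)
    # Step 2: delete the port row, refusing while any switch still holds it.
    holders = [name for name, ports in switches.items() if port_uuid in ports]
    if holders:
        raise ReferentialIntegrityError(
            f"cannot delete Logical_Switch_Port row {port_uuid} "
            f"because of {len(holders)} remaining reference(s)"
        )


# The port is still attached to the old switch, but the delete targets the new one.
switches = {"demchdc6zax": {"c345cc07"}, "demchdc5vvx": set()}
try:
    delete_port(switches, "c345cc07", "demchdc5vvx")
    outcome = "deleted"
except ReferentialIntegrityError as err:
    outcome = str(err)
print(outcome)
```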

— Additional comment from Anurag saxena on 2022-08-12 21:16:56 UTC —

@rbrattai@redhat.com Can you help verify this and the backports? Feel free to re-assign to someone else if needed while I am on PTO. Thanks!

Description of problem:
During a fresh installation on a BareMetal platform, the monitoring cluster operator fails and becomes degraded. Further troubleshooting shows that the "alertmanagers" are not in a ready state (5/6).

Logs from the alertmanager:

level=info ts=2022-05-03T07:18:08.011Z caller=main.go:225 msg="Starting Alertmanager" version="(version=0.23.0, branch=rhaos-4.10-rhel-8, revision=0993e91aab7afce476de5c45bead4ebb8d1295a7)"
level=info ts=2022-05-03T07:18:08.011Z caller=main.go:226 build_context="(go=go1.17.5, user=root@df86d88450ef, date=20220409-10:25:31)"

The alertmanager-main pods are failing to start due to a startupProbe timeout; this seems related to BZ 2037073.
We tried to manually increase the timers in the startupProbe, but it was not possible.

Version-Release number of selected component (if applicable):
OCP 4.10.10

How reproducible:
OCP IPI Baremetal install on HPE ProLiant BL460c Gen10. CU tried several times to redeploy, always with the same outcome.

Actual results:
CMO is not being deployed

Expected results:
CMO deploys without errors

Additional info:

  • CU is deploying OCP 4.10 IPI on a baremetal disconnected cluster
  • cluster is 3 nodes with masters schedulable

 

+++ This bug was initially created as a clone of Bug #2117423 +++

Description of problem:

Backport https://github.com/openshift/kubernetes/pull/1295 to 4.10


+++ This bug was initially created as a clone of Bug #2081562 +++

Description of problem:

lifecycle.postStart does not have network connectivity on the OpenShiftSDN CNI. (OVNKubernetes does not have the issue.)

Version-Release number of selected component (if applicable):
4.10

How reproducible:
always

Steps to Reproduce:
1. create statefulset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
$ oc create -f statefulset.yaml
$ cat statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: httpd
spec:
  serviceName: "httpd"
  replicas: 1
  selector:
    matchLabels:
      app: httpd
  template:
    metadata:
      labels:
        app: httpd
    spec:
      containers:
      - name: httpd
        image: registry.redhat.io/rhel8/httpd-24:1-191
        ports:
        - containerPort: 80
          name: web
        lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c
              - curl -k https://<IP:PORT> > /tmp/urltest.txt
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Actual results:

PostStartHook fails
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
36s Normal Killing pod/httpd-0 FailedPostStartHook
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expected results:

PostStartHook should not fail.

Additional info:

Adding a dummy initContainer works around the issue.
Something like this:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
spec:
  initContainers:
  - name: init-myservice
    image: busybox:1.28
    command: ['sh', '-c', 'sleep 2']
  containers:
  - name: httpd
    image: registry.redhat.io/rhel8/httpd-24:1-191
    ports:
    - containerPort: 80
      name: web
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/sh
          - -c
          - curl -k <IP:PORT> > /tmp/urltest.txt
....
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

— Additional comment from rphillips@redhat.com on 2022-05-11 19:48:10 UTC —

crio's contract with networking is to have networking up when the container starts. Moving to the openshift-sdn team to help triage what is going on.

— Additional comment from hyoskim@redhat.com on 2022-06-09 00:40:33 UTC —

Hello,

Is there any update on this issue?

— Additional comment from npinaeva@redhat.com on 2022-06-09 07:53:38 UTC —

Hello, yes, we found the root cause and are working on the fix now. The PR should be ready by the end of the week.

— Additional comment from aos-team-art-private@bot.bugzilla.redhat.com on 2022-07-24 15:21:48 UTC —

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.11 release.

— Additional comment from errata-xmlrpc@redhat.com on 2022-07-27 00:18:40 UTC —

This bug has been added to advisory RHSA-2022:5069 by OpenShift Release Team Bot (ocp-build/buildvm.openshift.eng.bos.redhat.com@REDHAT.COM)

— Additional comment from swasthan@redhat.com on 2022-07-27 05:38:55 UTC —

Hello Team, thank you for the help so far!

May we know if this is going to backport in v4.10.z as well?

Regards,
Swadeep

— Additional comment from zzhao@redhat.com on 2022-07-27 06:40:30 UTC —

The PR with the fix is merged into build 4.12.0-0.nightly-2022-07-24-180529,
so I am updating the target version to 4.12.

— Additional comment from zzhao@redhat.com on 2022-07-27 06:48:33 UTC —

Still failing on build 4.12.0-0.nightly-2022-07-26-131732.

After creating the statefulset above, the pod still does not work and shows the same error:

27s Warning FailedPostStartHook pod/httpd-0 Exec lifecycle hook ([/bin/sh -c curl -k https://<IP:PORT> > /tmp/urltest.txt]) for Container "httpd" in Pod "httpd-0_default(7e519841-7092-4513-928b-03c7783ddc7d)" failed - error: command '/bin/sh -c curl -k https://<IP:PORT> > /tmp/urltest.txt' exited with 1: /bin/sh: -c: line 0: syntax error near unexpected token `>'...
85s Normal Killing pod/httpd-0 FailedPostStartHook

— Additional comment from npinaeva@redhat.com on 2022-07-27 12:50:53 UTC —

Hey @zzhao@redhat.com can you share full statefulset yaml you're running?
Doesn't "line 0: syntax error near unexpected token `>'..." mean bash command is wrong?

— Additional comment from zzhao@redhat.com on 2022-07-27 13:36:43 UTC —

(In reply to Nadia Pinaeva from comment #9)
> Hey @zzhao@redhat.com can you share full statefulset yaml you're running?
> Doesn't "line 0: syntax error near unexpected token `>'..." mean bash
> command is wrong?

I'm using the statefulset from comment 0

$ cat statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: httpd
spec:
  serviceName: "httpd"
  replicas: 1
  selector:
    matchLabels:
      app: httpd
  template:
    metadata:
      labels:
        app: httpd
    spec:
      containers:
      - name: httpd
        image: registry.redhat.io/rhel8/httpd-24:1-191
        ports:
        - containerPort: 80
          name: web
        lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c
              - curl -k https://<IP:PORT> > /tmp/urltest.txt

— Additional comment from npinaeva@redhat.com on 2022-07-27 14:02:54 UTC —

Did you replace <IP:PORT> here "curl -k https://<IP:PORT> > /tmp/urltest.txt"?
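For reference, the syntax error is what a shell reports when the placeholder is left in: `<IP:PORT` is parsed as an input redirection, and the `>` that closes the placeholder is then followed by another `>` instead of a file name. A minimal sketch of a pre-flight check for such leftovers follows; the helper name and the placeholder pattern are assumptions for illustration, not part of any OpenShift tooling.

```python
import re

# Assumed placeholder style: an identifier wrapped in angle brackets, e.g. <IP:PORT>.
PLACEHOLDER = re.compile(r"<[A-Za-z][A-Za-z0-9_:-]*>")


def unreplaced_placeholders(command: str) -> list:
    """Return every template placeholder still present in a hook command."""
    return PLACEHOLDER.findall(command)


template_cmd = "curl -k https://<IP:PORT> > /tmp/urltest.txt"
real_cmd = "curl -k https://172.30.0.1:443 > /tmp/urltest.txt"

print(unreplaced_placeholders(template_cmd))
print(unreplaced_placeholders(real_cmd))
```

The shell redirection to /tmp/urltest.txt is not flagged, because a lone `>` has no matching `<identifier>` form.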

— Additional comment from zzhao@redhat.com on 2022-07-28 07:43:25 UTC —

(In reply to Nadia Pinaeva from comment #11)
> Did you replace <IP:PORT> here "curl -k https://<IP:PORT> >
> /tmp/urltest.txt"?

oh my bad

Tested again after replacing the IP and port, using the following:


apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: httpd
spec:
  serviceName: "httpd"
  replicas: 1
  selector:
    matchLabels:
      app: httpd
  template:
    metadata:
      labels:
        app: httpd
    spec:
      containers:
      - name: httpd
        image: registry.redhat.io/rhel8/httpd-24:1-191
        ports:
        - containerPort: 80
          name: web
        lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c
              - curl -k https://172.30.0.1:443 > /tmp/urltest.txt

on 4.12.0-0.nightly-2022-07-27-133042

$ oc get pod
NAME READY STATUS RESTARTS AGE
httpd-0 1/1 Running 0 2m28s

— Additional comment from npinaeva@redhat.com on 2022-07-29 13:02:53 UTC —

@swasthan@redhat.com yes, we are going to backport it to 4.10 (hopefully it will be faster than the fix itself).

This is a clone of issue OCPBUGS-613. The following is the description of the original issue:

Description of problem:

The path used by --rotated-pod-logs to gather the rotated pod logs from /var/log/pods node folder via /api/v1/nodes/${NODE}/proxy/logs/${LOG_PATH} is only valid for regular pods but not for static pods.

The main problem is that, while regular pods have their rotated logs at /var/log/pods/${POD_NAME}_${POD_UID_IN_API}/${CONTAINER_NAME}, static pods have them at /var/log/pods/${POD_NAME}_${CONFIG_HASH}/${CONTAINER_NAME}, because the UID cannot be known when the static pod is created (static pods are created by kubelet before they are registered in the kube-apiserver, and the UID is assigned by the kube-apiserver).

The visible results of that are:

  • Spurious errors of not found resources related to the pods.
  • Rotated pod logs are not gathered even if present.
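
The path difference described above can be sketched as follows. This is a minimal illustration, not the `oc adm inspect` implementation: the helper names are hypothetical, the directory layout is the one given in this description, and two points are assumptions: that the static pod hash can be read from the `kubernetes.io/config.hash` annotation of the mirror pod, and that ${LOG_PATH} is relative to /var/log.

```python
# Sketch only: helper names are hypothetical and this is not the oc code.
CONFIG_HASH_ANNOTATION = "kubernetes.io/config.hash"  # assumed source of ${CONFIG_HASH}


def rotated_log_dir(pod: dict, container: str) -> str:
    """Return the node directory holding rotated logs for one container."""
    meta = pod["metadata"]
    annotations = meta.get("annotations") or {}
    # Static pods: the directory is keyed by the config hash, not the API UID.
    ident = annotations.get(CONFIG_HASH_ANNOTATION, meta["uid"])
    return f"/var/log/pods/{meta['name']}_{ident}/{container}"


def proxy_logs_path(node: str, log_dir: str) -> str:
    """Build the node proxy API path; assumes LOG_PATH is relative to /var/log."""
    prefix = "/var/log/"
    rel = log_dir[len(prefix):] if log_dir.startswith(prefix) else log_dir
    return f"/api/v1/nodes/{node}/proxy/logs/{rel}"


regular = {"metadata": {"name": "web-0", "uid": "1111"}}
static = {"metadata": {"name": "etcd-master-0", "uid": "2222",
                       "annotations": {CONFIG_HASH_ANNOTATION: "abcd"}}}

print(rotated_log_dir(regular, "httpd"))
print(proxy_logs_path("master-0", rotated_log_dir(static, "etcd")))
```

Building the path from the API UID alone, as in the regular-pod branch, is what yields a "could not find the requested resource" error for a static pod.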

Version-Release number of selected component (if applicable):

4.10

How reproducible:

Always if there are static pods.

Steps to Reproduce:

1. oc adm inspect --rotated-pod-logs ns/openshift-etcd (or any other project with static pods).

Actual results:

  • Rotated pod logs are not gathered.
  • Errors like these:
    error: errors occurred while gathering data:
        one or more errors occurred while gathering pod-specific data for namespace: openshift-etcd
    
        [one or more errors occurred while gathering container data for pod etcd-master-0.example.net:
    
        the server could not find the requested resource, one or more errors occurred while gathering container data for pod etcd-master-1.example.net:
    
        the server could not find the requested resource, one or more errors occurred while gathering container data for pod etcd-master-2.example.net:
    
        the server could not find the requested resource]
    

Expected results:

No errors like the ones above and rotated pod logs to be gathered, if present.

Additional info:

Despite being marked as experimental, this --rotated-pod-logs is used in must-gather, so this issue can be easily reproduced by just running a default must-gather. I focused on bare oc adm inspect reproducers for simplicity.

Description of problem:
A customer's upgrade cannot start because availableUpdates is null in the ClusterVersion CR.
Version-Release number of the following components:

How reproducible:
Sometimes
Steps to Reproduce:
1.
2.
3.

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:

Additional info:

This is a clone of issue OCPBUGS-516. The following is the description of the original issue:

Description of problem:

Setting a telemeter proxy in the cluster-monitoring-config config map does not work as expected

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
The following KCS article details the steps to add a proxy.
The steps have been verified on 4.7 but do not work on 4.8, 4.9 or 4.10:

https://access.redhat.com/solutions/6172402

When testing on 4.8, 4.9 and 4.10, the proxy settings were also nested under `telemeterClient`, which triggered a telemeter restart, but the proxy settings do not get set in the deployment as they do in 4.7.

Actual results:

On 4.8, 4.9 and 4.10, a configuration without the nested `telemeterClient`
does not trigger a restart of the telemeter pod.

Expected results:

I think the proxy settings should be nested under `telemeterClient`
and should set the environment variables in the deployment.

Additional info:

This is a backport of https://bugzilla.redhat.com/show_bug.cgi?id=2116382 from 4.12 to 4.11.z. Creating manually because as seen in https://github.com/openshift/cluster-monitoring-operator/pull/1743 `/cherry-pick` doesn't work for bugs originally created in bugzilla

Description of problem:
Previously we had a bug opened for "Reduce buildah log level for default build log level [NEEDINFO]":

https://bugzilla.redhat.com/show_bug.cgi?id=1996883

We suggested that the customer use secrets; however, the customer confirmed that they are already using secrets as per:
https://docs.openshift.com/container-platform/4.8/cicd/builds/creating-build-inputs.html#builds-input-secrets-configmaps_creating-build-inputs

However, they are not able to use the "--quiet" build argument because they are using the OpenShift s2i config:

strategy:
  type: Source
  sourceStrategy:
    from:
      kind: ImageStreamTag
      namespace: xxxxxxx
      name: 's2i-xxxxxx-xxxxxxx:v1.0.0'

Actual results:
With this setting, the secret, as well as every other OpenShift secret, is printed in the build logs.

Expected results:
Sensitive information (ENV) should not appear in build logs

Additional info:
Maybe pass the --quiet option via the buildconfig for s2i?

— Additional comment from taxu@redhat.com on 2022-07-01 03:42:46 UTC —

Hi Build team,

I see that the target release for this bug is 4.11.0

Please let us know where to add the LOG_LEVEL for s2i and/or passing "--quiet" options once the fix is deployed.

Kind regards,

Tao Xu

— Additional comment from aos-team-art-private@redhat.com on 2022-07-13 06:44:30 UTC —

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.12 release.

— Additional comment from rdlugyhe@redhat.com on 2022-07-26 20:34:06 UTC —

Please approve the updated Doc Text.

— Additional comment from cdaley@redhat.com on 2022-07-26 20:49:05 UTC —

Approved

— Additional comment from jitsingh@redhat.com on 2022-07-27 04:48:18 UTC —

verified

— Additional comment from oarribas@redhat.com on 2022-08-15 12:11:23 UTC —

@Corey, I can see this BZ now verified for 4.12. Can this be backported to previous OCP versions?

— Additional comment from cdaley@redhat.com on 2022-08-15 13:57:02 UTC —

What version are you interested in getting it backported to?

— Additional comment from oarribas@redhat.com on 2022-08-15 15:46:10 UTC —

@Corey, I can see only 4.10 and 4.11 as "Full Support" in [1], so at least to those versions.

The static authorizer feature has landed in upstream kube-rbac-proxy. Let's use it by configuring a static authorizer for all requests that hit a /metrics endpoint.

DoD:

  • Downstream kube-rbac-proxy is synced.
  • All CMO operands are configured with static authorization.
  • Bugzillas created for all non-monitoring components using kube-rbac-proxy for metrics authn/authz.
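
The idea behind a static authorizer can be sketched conceptually as below. This is not kube-rbac-proxy code or its configuration format: the class, the rule fields, and the service account used in the example are illustrative assumptions. The point is only that a request matching a fixed rule is allowed locally, while anything else is deferred to the regular API server authorization check.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    """One statically allowed request (hypothetical shape)."""
    user: str
    verb: str
    path: str


class StaticAuthorizer:
    def __init__(self, rules):
        self._rules = set(rules)

    def authorize(self, user: str, verb: str, path: str) -> Optional[bool]:
        """Return True when a static rule matches, None to defer the decision."""
        if Rule(user, verb, path) in self._rules:
            return True
        return None


# Illustrative rule: let one scraper identity read /metrics without a remote check.
scraper = "system:serviceaccount:openshift-monitoring:prometheus-k8s"
authorizer = StaticAuthorizer([Rule(scraper, "get", "/metrics")])

print(authorizer.authorize(scraper, "get", "/metrics"))
print(authorizer.authorize("someone-else", "get", "/metrics"))
```

Returning None rather than False for non-matching requests keeps the static list an optimization for known callers instead of a replacement for the normal check.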

Currently, Telemeter is not equipped with a configurable request limit for the receive endpoint (for full context see: https://github.com/openshift/cluster-monitoring-operator/pull/1416). It uses the default limit defined in the code base; however, this limit might not be suitable for our usage.

As part of this ticket, we should:

1) Understand what the appropriate request size limit is for our use cases

2) Make the limit configurable in Telemeter via a flag

3) Deploy the changes, initially to the staging environment, to enable our team to test them.
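
Step 2 could look roughly like the sketch below. Telemeter itself is not written in Python; the flag name `--receive-limit-bytes`, the default value, and the helper names are all hypothetical and only illustrate replacing a hard-coded limit with a configurable one.

```python
import argparse

# Placeholder default for illustration only; not Telemeter's actual limit.
DEFAULT_LIMIT_BYTES = 15 * 1024


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="receive endpoint sketch")
    parser.add_argument(
        "--receive-limit-bytes",  # hypothetical flag name
        type=int,
        default=DEFAULT_LIMIT_BYTES,
        help="maximum accepted request body size in bytes",
    )
    return parser


def status_for_body(body: bytes, limit: int) -> int:
    """Return 413 (Payload Too Large) when the body exceeds the limit, else 200."""
    return 413 if len(body) > limit else 200


args = build_parser().parse_args(["--receive-limit-bytes", "10"])
print(status_for_body(b"short", args.receive_limit_bytes))
print(status_for_body(b"x" * 11, args.receive_limit_bytes))
```

Keeping the existing default when the flag is omitted means current deployments behave the same until staging tests suggest a better value.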

Add a Makefile rule in CMO to execute all the different rules that are used for verification and validation. Currently, some of them might not be in the right place, for example `check-assets`, which is part of `generate` despite not being responsible for any generation. https://github.com/openshift/cluster-monitoring-operator/pull/1151/files#r629371735

DoD:

  • Add a new rule in CMO to handle verification
  • Add a CI job for this rule

Other Incomplete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled

Description of problem:

Intended to backport the corresponding https://bugzilla.redhat.com/show_bug.cgi?id=2095852 which has been fixed already for this version.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 
  • CVE-2022-36882
  • CVE-2022-29047
  • CVE-2022-30945
  • CVE-2022-30946
  • CVE-2022-30948
  • CVE-2022-30952
  • CVE-2022-30953
  • CVE-2022-30954
  • CVE-2022-34174
  • CVE-2022-36883
  • CVE-2022-36884
  • CVE-2022-36885
  • CVE-2022-34177
  • CVE-2022-34176
  • CVE-2022-36881