Back to index

4.11.49

Jump to: Complete Features | Incomplete Features | Complete Epics | Incomplete Epics | Other Complete | Other Incomplete |

Changes from 4.10.67

Note: this page shows the Feature-Based Change Log for a release

Complete Features

These features were completed when this image was assembled

Problem:

Certain Insights Advisor features differentiate between RHEL and OCP advisor

Goal:

Address top priority UI misalignments between RHEL and OCP advisor. Address UI features dropped from Insights ADvisor for OCP GA.

 

Scope:

Specific tasks and priority of them tracked in https://issues.redhat.com/browse/CCXDEV-7432

 
 
 
 

 

This contains all the Insights Advisor widget deliverables for the OCP release 4.11.

Scope
It covers only minor bug fixes and improvements:

  • better error handling during internal outages in data processing
  • add "last refresh" timestamp in the Advisor widget

Show the error message (mocked in CCXDEV-5868) if the Prometheus metrics `cluster_operator_conditions{name="insights"}` contain two true conditions: UploadDegraded and Degraded at the same time. This state occurs if there was an IO archive upload error = problems with the pipeline.

Expected for 4.11 OCP release.

Scenario: Check if the Insights Advisor widget in the OCP WebConsole UI shows the time of the last data analysis
Given: OCP WebConsole UI and the cluster dashboard is accessible
And: CCX external data pipeline is in a working state
And: administrator A1 has access to his cluster's dashboard
And: Insights Operator for this cluster is sending archives
When: administrator A1 clicks on the Insights Advisor widget
Then: the results of the last analysis are showed in the Insights Advisor widget
And: the time of the last analysis is shown in the Insights Advisor widget 

Acceptance criteria:

  1. The time of the last analysis is shown in the Insights Advisor widget for the scenario above
  2. The way it is presented is defined within the scope of https://issues.redhat.com/browse/CCXDEV-5869 (mockup task)
  3. The source of this timestamp must be a result of running the Prometheus metric (last archive upload time):
    max_over_time(timestamp(changes(insightsclient_request_send_total\{status_code="202"}[1m]) > 0)[24h:1m])
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic Goal

  • Allow admin user to create new alerting rules, targeting metrics in any namespace
  • Allow cloning of existing rules to simplify rule creation
  • Allow creation of silences for existing alert rules

Why is this important?

  • Currently, any platform-related metrics (exposed in a openshift-, kube- and default namespace) cannot be used to form a new alerting rule. That makes it very difficult for administrators to enrich our out of the box experience for the OpenShift Container Platform with new rules that may be specific to their environments.
  • Additionally, we had requests from customer to allow modifications of our existing, out of the box alerting rules (for instance tweaking the alert expression or changing the severity label). Unfortunately, that is not easy since most rules come from several open source projects, or other OpenShift components, and any modifications would make a seamless upgrade not really seamless anymore. Imagine K8s changes metrics again (see 1.14) and we have to update our rules. We would not know what modifications have been done (even just the threshold might be difficult if upstream changes that as well) and we would not be able to upgrade these rules.

Scenarios

  • I'd like to modify the query expression of an existing rule (because the threshold value doesn't match with my environment).

Cloning the existing rule should end up with a new rule in the same namespace.
Modifications can now be done to the new rule.
(Optional) You can silence the existing rule.

  • I'd like to create a new rule based on a metric only available to an openshift-* namespace

Create a new PrometheusRule object inside the namespace that includes the metrics you need to form the alerting rule.

  • I'd like to update the label of an existing rule.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Ability to distinguish between rules deployed by us (CMO) and user created rules

Dependencies (internal and external)

Previous Work (Optional):

Open questions::

  1. Distinguish between operator-created rules and user-created rules
    Currently no such mechanism exists. This will need to be added to prometheus-operator or cluster-monitoring-operator.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

CMO should reconcile the platform Prometheus configuration with the alert-relabel-config resources.

 

DoD

  • Alerts changed via alert-relabel-configs are evaluated by the Platform monitoring stack.
  • Product alerts which are overriden aren't sent to Alertmanager

CMO should reconcile the platform Prometheus configuration with the AlertingRule resources.

 

DoD

  • Alerts added via AlertingRule resources are evaluated by the Platform monitoring stack.

Managing PVs at scale for a fleet creates difficulties where "one size does not fit all". The ability for SRE to deploy prometheus with PVs and have retention based an on a desired size would enable easier management of these volumes across the fleet. 

 

The prometheus-operator exposes retentionSize.

Field Description
retentionSize Maximum amount of disk space used by blocks. Supported units: B, KB, MB, GB, TB, PB, EB. Ex: 512MB.

This is a feature request to enable this configuration option via CMO cluster-monitoring-config ConfigMap.

 

cc Simon Pasquier  

Epic Goal

  • Cluster admins want to configure the retention size for their metrics.

Why is this important?

  • While it is possible to define how long metrics should be retained on disk, it's not possible to tell the cluster monitoring operator how much data it should keep. For OSD/ROSA in particular, it would facilitate the management of the fleet if the retention size could be configured based on the persistent volume size because it would avoid issues with the storage getting full and monitoring being down when too many metrics are produced.

Scenarios

  • As a cluster admin, I want to define the maximum amount of data to be retained on the persistent volume.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • The cluster-monitoring-config config and the user-workload-monitoring-config configmap allow to configure the retention size for
    • Prometheus (Platform and UWM)
    • Thanos Ruler (to be confirmed)
  • Proper validation is in place preventing bad user inputs from breaking the stack.

Dependencies (internal and external)

  1. Thanos ruler doesn't support retention size (only retention time).

Previous Work (Optional):

  1. None

Open questions::

  1. None

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Problem Alignment

The Problem

Today, all configuration for setting individual, for example, routing configuration is done via a single configuration file that only admins have access to. If an environment uses multiple tenants and each tenant, for example, has different systems that they are using to notify teams in case of an issue, then someone needs to file a request w/ an admin to add the required settings.

That can be bothersome for individual teams, since requests like that usually disappear in the backlog of an administrator. At the same time, administrators might get tons of requests that they have to look at and prioritize, which takes them away from more crucial work.

We would like to introduce a more self service approach whereas individual teams can create their own configuration for their needs w/o the administrators involvement.

Last but not least, since Monitoring is deployed as a Core service of OpenShift there are multiple restrictions that the SRE team has to apply to all OSD and ROSA clusters. One restriction is the ability for customers to use the central Alertmanager that is owned and managed by the SRE team. They can't give access to the central managed secret due to security concerns so that users can add their own routing information.

High-Level Approach

Provide a new API (based on the Operator CRD approach) as part of the Prometheus Operator that allows creating a subset of the Alertmanager configuration without touching the central Alertmanager configuration file.

Please note that we do not plan to support additional individual webhooks with this work. Customers will need to deploy their own version of the third party webhooks.

Goal & Success

  • Allow users to deploy individual configurations that allow setting up Alertmanager for their needs without an administrator.

Solution Alignment

Key Capabilities

  • As an OpenShift administrator, I want to control who can CRUD individual configuration so that I can make sure that any unknown third person can touch the central Alertmanager instance shipped within OpenShift Monitoring.
  • As a team owner, I want to deploy a routing configuration to push notifications for alerts to my system of choice.

Key Flows

Team A wants to send all their important notifications to a specific Slack channel.

  • Administrator gives permission to Team A to allow creating a new configuration CR in their individual namespace.
  • Team A creates a new configuration CR.
  • Team A configures what alerts should go into their Slack channel.
  • Open Questions & Key Decisions (optional)
  • Do we want to improve anything inside the developer console to allow configuration?

Epic Goal

  • Allow users to manage Alertmanager for user-defined alerts and have the feature being fully supported.

Why is this important?

  • Users want to configure alert notifications without admin intervention.
  • The feature is currently Tech Preview, it should be generally available to benefit a bigger audience.

Scenarios

  1. As a cluster admin, I can deploy an Alertmanager service dedicated for user-defined alerts (e.g. separated from the existing  Alertmanager already used for platform alerts).
  2. As an application developer, I can silence alerts from the OCP console.
  3. As an application developer, I'm not allowed to configure invalid AlertmanagerConfig objects.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • The AlertmanagerConfig CRD is v1beta1
  • The validating webhook service checking AlertmanagerConfig resources is highly-available.

Dependencies (internal and external)

  1. Prometheus operator upstream should migrate the AlertmanagerConfig CRD from v1alpha1 to v1beta1
  2. Console enhancements likely to be involved (see below).

Previous Work (Optional):

  1. Part of the feature is available as Tech Preview (MON-880).

Open questions:

  1. Coordination with the console team to support the Alertmanager service dedicated for user-defined alerts.
  2. Migration steps for users that are already using the v1alpha1 CRD.

Done Checklist

 * CI - CI is running, tests are automated and merged.
 * Release Enablement <link to Feature Enablement Presentation>
 * DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
 * DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
 * DEV - Downstream build attached to advisory: <link to errata>
 * QE - Test plans in Polarion: <link or reference to Polarion>
 * QE - Automated tests merged: <link or reference to automated tests>
 * DOC - Downstream documentation merged: <link to meaningful PR> 

 

Now that upstream supports AlertmanagerConfig v1beta1 (see MON-2290 and https://github.com/prometheus-operator/prometheus-operator/pull/4709), it should be deployed by CMO.

DoD:

  • Kubernetes API exposes and supports the v1beta1 version for AlertmanagerConfig CRD (in addition to v1alpha1).
  • Users can manage AlertmanagerConfig v1beta1 objects seamlessly.
  • AlertmanagerConfig v1beta1 objects are reconciled in the generated Alertmanager configuration.
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Epic Goal

  • The goal is to support metrics federation for user-defined monitoring via the /federate Prometheus endpoint (both from within and outside of the cluster).

Why is this important?

  • It is already possible to configure remote write for user-defined monitoring to push metrics outside of the cluster but in some cases, the network flow can only go from the outside to the cluster and not the opposite. This makes it impossible to leverage remote write.
  • It is already possible to use the /federate endpoint for the platform Prometheus (via the internal service or via the OpenShift route) so not supporting for UWM doesn't provide a consistent experience.
  • If we don't expose the /federate endpoint for the UWM Prometheus, users would have no supported way to store and query application metrics from a central location.

Scenarios

  1. As a cluster admin, I want to federate user-defined metrics using the Prometheus /federate endpoint.
  2. As a cluster admin, I want that the /federate endpoint to UWM is accessible via an OpenShift route.
  3. As a cluster admin, I want that the access to the /federate endpoint to UWM requires authentication (with bearer token only) & authorization (the required permissions should match the permissions on the /federate endpoint of the Platform Prometheus).

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Documentation - information about the recommendations and limitations/caveats of the federation approach.
  • User can federate user-defined metrics from within the cluster
  • User can federate user-defined metrics from the outside via the OpenShift route.

Dependencies (internal and external)

  1. None

Previous Work (Optional):

  1. None

Open questions:

  1. None

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

DoD

  • User can federate UWM metrics from outside of the cluster via the OpenShift route.
  • E2E test added to the CMO test suite.

DoD

  • User can federate UWM metrics within the cluster from the prometheus-user-workload.openshift-user-workload-monitoring.svc:9092 service
  • The service requires authentication via bearer token and authorization (same permissions as for federating platform metrics)

Copy/paste from [_https://github.com/openshift-cs/managed-openshift/issues/60_]

Which service is this feature request for?
OpenShift Dedicated and Red Hat OpenShift Service on AWS

What are you trying to do?
Allow ROSA/OSD to integrate with AWS Managed Prometheus.

Describe the solution you'd like
Remote-write of metrics is supported in OpenShift but it does not work with AWS Managed Prometheus since AWS Managed Prometheus requires AWS SigV4 auth.

  • Note that Prometheus supports AWS SigV4 since v2.26 and OpenShift 4.9 uses v2.29.

Describe alternatives you've considered
There is the workaround to use the "AWS SigV4 Proxy" but I'd think this is not properly supported by RH.
https://mobb.ninja/docs/rosa/cluster-metrics-to-aws-prometheus/

Additional context
The customer wants to use an open and portable solution to centralize metrics storage and analysis. If they also deploy to other clouds, they don't want to have to re-configure. Since most clouds offer a Prometheus service (or it's easy to self-manage Prometheus), app migration should be simplified.

Epic Goal

The cluster monitoring operator should allow OpenShift customers to configure remote write with all authentication methods supported by upstream Prometheus.

We will extend CMO's configuration API to support the following authentications with remote write:

  • Sigv4
  • Authorization
  • OAuth2

Why is this important?

Customers want to send metrics to AWS Managed Prometheus that require sigv4 authentication (see https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-secure-metric-ingestion.html#AMP-secure-auth).

Scenarios

  1. As a cluster admin, I want to forward platform/user metrics to remote write systems requiring Sigv4 authentication.
  2. As a cluster admin, I want to forward platform/user metrics to remote write systems requiring OAuth2 authentication.
  3. As a cluster admin, I want to forward platform/user metrics to remote write systems requiring custom Authorization header for authentication (e.g. API key).

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • It is possible for a cluster admin to configure any authentication method that is supported by Prometheus upstream for remote write (both platform and user-defined metrics):
    • Sigv4
    • Authorization
    • OAuth2

Dependencies (internal and external)

  • In theory none because everything is already supported by the Prometheus operator upstream. We may discover bugs in the upstream implementation though that may require upstream involvement.

Previous Work

  • After CMO started exposing the RemoteWrite specification in MON-1069, additional authentication options where added to prometheus and prometheus-operator but CMO didn't catch up on these.

Open Questions

  • None

Prometheus and Prometheus operator already support custom Authorization for remote write. This should be possible to configure the same in the CMO configuration:

 

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
      - url: "https://remote-write.endpoint"
        Authorization:
          type: Bearer
          credentials:
            name: credentials
            key: token

DoD:

  • Ability to configure custom Authorization for remote write in the openshift-monitoring/cluster-monitoring-config configmap
  • Ability to configure custom Authorization for remote write in the openshift-user-workload-monitoring/user-workload-monitoring-config configmap

Prometheus and Prometheus operator already support sigv4 authentication for remote write. This should be possible to configure the same in the CMO configuration:

 

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
      - url: "https://remote-write.endpoint"
        sigv4:
          accessKey:
            name: aws-credentialss
            key: access
          secretKey:
            name: aws-credentials
            key: secret

          profile: "SomeProfile"

          roleArn: "SomeRoleArn"

DoD:

  • Ability to configure sigv4 authentication for remote write in the openshift-monitoring/cluster-monitoring-config configmap
  • Ability to configure sigv4 authentication for remote write in the openshift-user-workload-monitoring/user-workload-monitoring-config configmap
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Description

As WMCO user, I want to make sure containerd logging information has been updated in documents and scripts.

Acceptance Criteria

  • update must-gather to collect containerd logs
  • Internal/Customer Documents and log collecting scripts must have containerd specific information (ex: location of logs). 

Summary (PM+lead)

Configure audit logging to capture login, logout and login failure details

Motivation (PM+lead)

TODO(PM): update this

Customer who needs login, logout and login failure details inside the openshift container platform.
I have checked for this on my test cluster but the audit logs do not contain any user name specifying login or logout details. For successful logins or logout, on CLI and openshift console as well we can see 'Login successful' or 'Invalid credentials'.

Expected results: Login, logout and login failures should be captured in audit logging.

Goals (lead)

  1. Login, logout and login failures should be captured in audit logs

Non-Goals (lead)

  1. Don't attempt to log login failures in the IdP login flow that goes beyond timeout, if it the information is not available in explicit oauth-server requests (e.g. github password login error).
  2. Logout does not involve oauth-server (but is a simple API object deletion in oauth-apiserver). Hence, the audit log discussed here won't include logout.

Deliverables

  1. Changes to oauth-server to log into /varLog/oauth-server/audit.log on the master node.
  2. Documentation

Proposal (lead)

The apiserver pods today have ´/var/log/<kube|oauth|openshift>-apiserver` mounted from the host and create audit files there using the upstream audit event format (JSON lines following https://github.com/kubernetes/apiserver/blob/92392ef22153d75b3645b0ae339f89c12767fb52/pkg/apis/audit/v1/types.go#L72). These events are apiserver specific, but as oauth authentication flow events are also requests, we can use the apiserver event format to log logins, login failures and logouts. Hence, we propose to make oauth-server to create /var/log/oauth-server/audit.log files on the master nodes using that format.

When the login flow does not finish within a certain time (e.g. 10min), we can artificially create an event to show a login failure in the audit logs.

User Stories (PM)

Dependencies (internal and external, lead)

Previous Work (lead)

Open questions (lead)

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

 

🏆 What

Let the Cluster Authentication Operator deliver the policy to OAuthServer.

💖 Why

In order to know if authn events should be logged, OAuthServer needs to be aware of it.

🗒 Notes

Create an observer to deliver the audit policy to the oauth server

Make the authentication-operator react to the new audit field in the oauth.config/cluster object. Write an observer watching this field, such an observer will translate the top-level configuration into oauth-server config and add it to the rest of the observed config.

* Stanislav Láznička

OCP/Telco Definition of Done
Feature Template descriptions and documentation.

Feature Overview.

Early customer feedback is that they see SNO as a great solution covering smaller footprint deployment, but are wondering what is the evolution story OpenShift is going to provide where more capacity or high availability are needed in the future.

While migration tooling (moving workload/config to new cluster) could be a mid-term solution, customer desire is not to include extra hardware to be involved in this process.

 For Telecommunications Providers, at the Far Edge they intend to start small and then grow. Many of these operators will start with a SNO-based DU deployment as an initial investment, but as DUs evolve, different segments of the radio spectrum are added, various radio hardware is provisioned and features delivered to the Far Edge, the Telecommunication Providers desire the ability for their Far Edge deployments to scale up from 1 node to 2 nodes to n nodes. On the opposite side of the spectrum from SNO is MMIMO where there is a robust cluster and workloads use HPA.

Goals

  • Provide the capability to expand a single replica control plane topology to host more workloads capacity - add worker
  • Provide the capability to expand a single replica control plane to be a highly available control plane
  • To satisfy MMIMO Telecommunications providers will want the ability to scale a SNO to a multi-node cluster that can support HPA.
  • Telecommunications providers do not want workload (DU specifically) downtime when migrating from SNO to a multi-node cluster.
  • Telecommunications providers wish to be able to scale from one to two or more nodes to support a variety of radio hardware.
  • Support CP scaling (CP HA) for 2 node cluster, 3 node cluster and n node cluster. As the number of nodes in the cluster increases so does the failure domain of the cluster. The cluster is now supporting more cell sectors and therefore has more of a need for HA and resiliency including the cluster CP.

Requirements

  • TBD
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Epic Goal

  • Documented and supported flow for adding 1, 2, 3 or more workers to a Single Node OpenShift (SNO) deployment without requiring cluster downtime and the understanding that this action will not make the cluster itself highly available.

Why is this important?

  • Telecommunications and Edge scenarios where HA is handled via failover to another site but single site capacity may vary or need to be expanded over time.
  • Similar scenarios exist for some ISV vendors where OpenShift is an implementation detail of how they deliver their solution on top of another platform (e.g. VMware).

Scenarios

  1. Adding a worker to a single node openshift cluster.
  2. Adding a second worker to a single node openshift cluster.
  3. Adding a third worker to a single node openshift cluster.
  4. Removing a worker node from a single node openshift cluster that has had 1 or more workers added.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Customer facing documentation of the add worker flow for SNO.

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

  1. Presumably there is a scale limit on how many workers could be added to an SNO control plane, and it is lower than the limit for a "normal" 3 node control plane. It is not anticipated that this limit will be established in this epic. Intent is to focus on small scale sites where adding 1-3 worker nodes would be beneficial.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Rebase OpenShift components to k8s v1.24

Why is this important?

  • Rebasing ensures components work with the upcoming release of Kubernetes
  • Address tech debt related to upstream deprecations and removals.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. k8s 1.24 release

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

  • As an infrastructure owner, I want a repeatable method to quickly deploy the initial OpenShift cluster.
  • As an infrastructure owner, I want to install the first (management, hub, “cluster 0”) cluster to manage other (standalone, hub, spoke, hub of hubs) clusters.

Goals

  • Enable customers and partners to successfully deploy a single “first” cluster in disconnected, on-premises settings

Requirements

4.11 MVP Requirements

  • Customers and partners needs to be able to download the installer
  • Enable customers and partners to deploy a single “first” cluster (cluster 0) using single node, compact, or highly available topologies in disconnected, on-premises settings
  • Installer must support advanced network settings such as static IP assignments, VLANs and NIC bonding for on-premises metal use cases, as well as DHCP and PXE provisioning environments.
  • Installer needs to support automation, including integration with third-party deployment tools, as well as user-driven deployments.
  • In the MVP automation has higher priority than interactive, user-driven deployments.
  • For bare metal deployments, we cannot assume that users will provide us the credentials to manage hosts via their BMCs.
  • Installer should prioritize support for platforms None, baremetal, and VMware.
  • The installer will focus on a single version of OpenShift, and a different build artifact will be produced for each different version.
  • The installer must not depend on a connected registry; however, the installer can optionally use a previously mirrored registry within the disconnected environment.

Use Cases

  • As a Telco partner engineer (Site Engineer, Specialist, Field Engineer), I want to deploy an OpenShift cluster in production with limited or no additional hardware and don’t intend to deploy more OpenShift clusters [Isolated edge experience].
  • As a Enterprise infrastructure owner, I want to manage the lifecycle of multiple clusters in 1 or more sites by first installing the first  (management, hub, “cluster 0”) cluster to manage other (standalone, hub, spoke, hub of hubs) clusters [Cluster before your cluster].
  • As a Partner, I want to package OpenShift for large scale and/or distributed topology with my own software and/or hardware solution.
  • As a large enterprise customer or Service Provider, I want to install a “HyperShift Tugboat” OpenShift cluster in order to offer a hosted OpenShift control plane at scale to my consumers (DevOps Engineers, tenants) that allows for fleet-level provisioning for low CAPEX and OPEX, much like AKS or GKE [Hypershift].
  • As a new, novice to intermediate user (Enterprise Admin/Consumer, Telco Partner integrator, RH Solution Architect), I want to quickly deploy a small OpenShift cluster for Poc/Demo/Research purposes.

Questions to answer…

  •  

Out of Scope

Out of scope use cases (that are part of the Kubeframe/factory project):

  • As a Partner (OEMs, ISVs), I want to install and pre-configure OpenShift with my hardware/software in my disconnected factory, while allowing further (minimal) reconfiguration of a subset of capabilities later at a different site by different set of users (end customer) [Embedded OpenShift].
  • As an Infrastructure Admin at an Enterprise customer with multiple remote sites, I want to pre-provision OpenShift centrally prior to shipping and activating the clusters in remote sites.

Background, and strategic fit

  • This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  1. The user has only access to the target nodes that will form the cluster and will boot them with the image presented locally via a USB stick. This scenario is common in sites with restricted access such as government infra where only users with security clearance can interact with the installation, where software is allowed to enter in the premises (in a USB, DVD, SD card, etc.) but never allowed to come back out. Users can't enter supporting devices such as laptops or phones.
  2. The user has access to the target nodes remotely to their BMCs (e.g. iDrac, iLo) and can map an image as virtual media from their computer. This scenario is common in data centers where the customer provides network access to the BMCs of the target nodes.
  3. We cannot assume that we will have access to a computer to run an installer or installer helper software.

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

References

 

 

Epic Goal

  • As an OpenShift infrastructure owner, I need to be able to integrate the installation of my first on-premises OpenShift cluster with my automation flows and tools.
  • As an OpenShift infrastructure owner, I must be able to provide the CLI tool with manifests that contain the definition of the cluster I want to deploy
  • As an OpenShift Infrastructure owner, I must be able to get the validation errors in a programmatic way
  • As an OpenShift Infrastructure owner, I must be able to get the events and progress of the installation in a programmatic way
  • As an OpenShift Infrastructure owner, I must be able to retrieve the kubeconfig and OpenShift Console URL in a programmatic way

Why is this important?

  • When deploying clusters with a large number of hosts and when deploying many clusters, it is common to require to automate the installations.
  • Customers and partners usually use third party tools of their own to orchestrate the installation.
  • For Telco RAN deployments, Telco partners need to repeatably deploy multiple OpenShift clusters in parallel to multiple sites at-scale, with no human intervention.

Scenarios

  1. Monitoring flow:
    1. I generate all the manifests for the cluster,
    2. call the CLI tool pointint to the manifests path,
    3. Obtain the installation image from the nodes
    4. Use my infrastructure capabilities to boot the image on the target nodes
    5. Use the tool to connect to assisted service to get validation status and events
    6. Use the tool to retrieve credentials and URL for the deployed cluster

Acceptance Criteria

  • Backward compatibility between OCP releases with automation manifests (they can be applied to a newer version of OCP).
  • Installation progress and events can be tracked programatically
  • Validation errors can be obtained programatically
  • Kubeconfig and console URL can be obtained programatically
  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

References

User Story:

As a deployer, I want to be able to:

  • Get the credentials for the cluster that is going to be deployed

so that I can achieve

  • Checking the installed cluster for installation completion
  • Connect and administer the cluster that gets installed

 

Currently the Assisted Service generates the credentials by running the ignition generation step of the oepnshift-installer. This is why the credentials are only retrievable from the REST API towards the end of the installation.

In the BILLI usage, which takes down assisted service before the installation is complete there is no obvious point at which to alert the user that they should retrieve the credentials. This means that we either need to:

  • Allow the user to pass the admin key that will then get signed by the generated CA and replace the key that is made by openshift-installer (would mean new functionality in AI)
  • Allow the key to be retrieved by SSH with the fleeting command from the node0 (after it has generated). The command should be able to wait until it is possible
  • Have the possibility to POST it somewhere

Acceptance Criteria:

  • The admin key is generated and usable to check for installation completeness

This requires/does not require a design proposal.
This requires/does not require a feature gate.

Feature Overview

The AWS-specific code added in OCPPLAN-6006 needs to become GA and with this we want to introduce a couple of Day2 improvements.
Currently the AWS tags are defined and applied at installation time only and saved in the infrastructure CRD's status field for further operator use, which in turn just add the tags during creation.

Saving in the status field means it's not included in Velero backups, which is a crucial feature for customers and Day2.
Thus the status.resourceTags field should be deprecated in favour of a newly created spec.resourceTags with the same content. The installer should only populate the spec, consumers of the infrastructure CRD must favour the spec over the status definition if both are supplied, otherwise the status should be honored and a warning shall be issued.

Being part of the spec, the behaviour should also tag existing resources that do not have the tags yet and once the tags in the infrastructure CRD are changed all the AWS resources should be updated accordingly.

On AWS this can be done without re-creating any resources (the behaviour is basically an upsert by tag key) and is possible without service interruption as it is a metadata operation.

Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.

Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").

Once confident we have all components updated, we should introduce an end2end test that makes sure we never create resources that are untagged.

After that, we can remove the experimental flag and make this a GA feature.

Goals

  • Inclusion in the cluster backups
  • Flexibility of changing tags during cluster lifetime, without recreating the whole cluster

Requirements

  • This Section:* A list of specific needs or objectives that a Feature must deliver to satisfy the Feature.. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

List any affected packages or components.

  • Installer
  • Cluster Infrastructure
  • Storage
  • Node
  • NetworkEdge
  • Internal Registry
  • CCO

RFE-1101 described user defined tags for AWS resources provisioned by an OCP cluster. Currently user can define tags which are added to the resources during creation. These tags cannot be updated subsequently. The propagation of the tags is controlled using experimental flag. Before this feature goes GA we should define and implement a mechanism to exclude any experimental flags. Day2 operations and deletion of tags is not in the scope.

RFE-2012 aims to make the user-defined resource tags feature GA. This means that user defined tags should be updatable.

Currently the user-defined tags during install are passed directly as parameters of the Machine and Machineset resources for the master and worker. As a result these tags cannot be updated by consulting the Infrastructure resource of the cluster where the user defined tags are written.

The MCO should be changed such that during provisioning the MCO looks up the values of the tags in the Infrastructure resource and adds the tags during creation of the EC2 resources. The MCO should also watch the infrastructure resource for changes and when the resource tags are updated it should update the tags on the EC2 instances without restarts.

Acceptance Criteria:

  • e2e test where the ResourceTags are updated and then the test verifies that the tags on the ec2 instances are updated without restarts. now moved to CFE-179

Feature Overview  

Much like core OpenShift operators, a standardized flow exists for OLM-managed operators to interact with the cluster in a specific way to leverage AWS STS authorization when using AWS APIs as opposed to insecure static, long-lived credentials. OLM-managed operators can implement integration with the CloudCredentialOperator in well-defined way to support this flow.

Goals:

Enable customers to easily leverage OpenShift's capabilities around AWS STS with layered products, for increased security posture. Enable OLM-managed operators to implement support for this in well-defined pattern.

Requirements:

  • CCO gets a new mode in which it can reconcile STS credential request for OLM-managed operators
  • A standardized flow is leveraged to guide users in discovering and preparing their AWS IAM policies and roles with permissions that are required for OLM-managed operators 
  • A standardized flow is defined in which users can configure OLM-managed operators to leverage AWS STS
  • An example operator is used to demonstrate the end2end functionality
  • Clear instructions and documentation for operator development teams to implement the required interaction with the CloudCredentialOperator to support this flow

Use Cases:

See Operators & STS slide deck.

 

Out of Scope:

  • handling OLM-managed operator updates in which AWS IAM permission requirements might change from one version to another (which requires user awareness and intervention)

 

Background:

The CloudCredentialsOperator already provides a powerful API for OpenShift's cluster core operator to request credentials and acquire them via short-lived tokens. This capability should be expanded to OLM-managed operators, specifically to Red Hat layered products that interact with AWS APIs. The process today is cumbersome to none-existent based on the operator in question and seen as an adoption blocker of OpenShift on AWS.

 

Customer Considerations

This is particularly important for ROSA customers. Customers are expected to be asked to pre-create the required IAM roles outside of OpenShift, which is deemed acceptable.

Documentation Considerations

  • Internal documentation needs to exists to guide Red Hat operator developer teams on the requirements and proposed implementation of integration with CCO and the proposed flow
  • External documentation needs to exist to guide users on:
    • how to become aware that the cluster is in STS mode
    • how to become aware of operators that support STS and the proposed CCO flow
    • how to become aware of the IAM permissions requirements of these operators
    • how to configure an operator in the proposed flow to interact with CCO

Interoperability Considerations

  • this needs to work with ROSA
  • this needs to work with self-managed OCP on AWS

Market Problem

This Section: High-Level description of the Market Problem ie: Executive Summary

  • As a customer of OpenShift layered products, I need to be able to fluidly, reliably and consistently install and use OpenShift layered product Kubernetes Operators into my ROSA STS clusters, while keeping a STS workflow throughout.
  •  
  • As a customer of OpenShift on the big cloud providers, overall I expect OpenShift as a platform to function equally well with tokenized cloud auth as it does with "mint-mode" IAM credentials. I expect the same from the Kubernetes Operators under the Red Hat brand (that need to reach cloud APIs) in that tokenized workflows are equally integrated and workable as with "mint-mode" IAM credentials.
  •  
  • As the managed services, including Hypershift teams, offering a downstream opinionated, supported and managed lifecycle of OpenShift (in the forms of ROSA, ARO, OSD on GCP, Hypershift, etc), the OpenShift platform should have as close as possible, native integration with core platform operators when clusters use tokenized cloud auth, driving the use of layered products.
  • .
  • As the Hypershift team, where the only credential mode for clusters/customers is STS (on AWS) , the Red Hat branded Operators that must reach the AWS API, should be enabled to work with STS credentials in a consistent, and automated fashion that allows customer to use those operators as easily as possible, driving the use of layered products.

Why it Matters

  • Adding consistent, automated layered product integrations to OpenShift would provide great added value to OpenShift as a platform, and its downstream offerings in Managed Cloud Services and related offerings.
  • Enabling Kuberenetes Operators (at first, Red Hat ones) on OpenShift for the "big3" cloud providers is a key differentiation and security requirement that our customers have been and continue to demand.
  • HyperShift is an STS-only architecture, which means that if our layered offerings via Operators cannot easily work with STS, then it would be blocking us from our broad product adoption goals.

Illustrative User Stories or Scenarios

  1. Main success scenario - high-level user story
    1. customer creates a ROSA STS or Hypershift cluster (AWS)
    2. customer wants basic (table-stakes) features such as AWS EFS or RHODS or Logging
    3. customer sees necessary tasks for preparing for the operator in OperatorHub from their cluster
    4. customer prepares AWS IAM/STS roles/policies in anticipation of the Operator they want, using what they get from OperatorHub
    5. customer's provides a very minimal set of parameters (AWS ARN of role(s) with policy) to the Operator's OperatorHub page
    6. The cluster can automatically setup the Operator, using the provided tokenized credentials and the Operator functions as expected
    7. Cluster and Operator upgrades are taken into account and automated
    8. The above steps 1-7 should apply similarly for Google Cloud and Microsoft Azure Cloud, with their respective token-based workload identity systems.
  2. Alternate flow/scenarios - high-level user stories
    1. The same as above, but the ROSA CLI would assist with AWS role/policy management
    2. The same as above, but the oc CLI would assist with cloud role/policy management (per respective cloud provider for the cluster)
  3. ...

Expected Outcomes

This Section: Articulates and defines the value proposition from a users point of view

  • See SDE-1868 as an example of what is needed, including design proposed, for current-day ROSA STS and by extension Hypershift.
  • Further research is required to accomodate the AWS STS equivalent systems of GCP and Azure
  • Order of priority at this time is
    • 1. AWS STS for ROSA and ROSA via HyperShift
    • 2. Microsoft Azure for ARO
    • 3. Google Cloud for OpenShift Dedicated on GCP

Effect

This Section: Effect is the expected outcome within the market. There are two dimensions of outcomes; growth or retention. This represents part of the “why” statement for a feature.

  • Growth is the acquisition of net new usage of the platform. This can be new workloads not previously able to be supported, new markets not previously considered, or new end users not previously served.
  • Retention is maintaining and expanding existing use of the platform. This can be more effective use of tools, competitive pressures, and ease of use improvements.
  • Both of growth and retention are the effect of this effort.
    • Customers have strict requirements around using only token-based cloud credential systems for workloads in their cloud accounts, which include OpenShift clusters in all forms.
      • We gain new customers from both those that have waited for token-based auth/auth from OpenShift and from those that are new to OpenShift, with strict requirements around cloud account access
      • We retain customers that are going thru both cloud-native and hybrid-cloud journeys that all inevitably see security requirements driving them towards token-based auth/auth.
      •  

References

As an engineer I want the capability to implement CI test cases that run at different intervals, be it daily, weekly so as to ensure downstream operators that are dependent on certain capabilities are not negatively impacted if changes in systems CCO interacts with change behavior.

Acceptance Criteria:

Create a stubbed out e2e test path in CCO and matching e2e calling code in release such that there exists a path to tests that verify working in an AWS STS workflow.

Feature Overview

Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This is a feature tracking epics required to get that work done.  

Goals

  1. Have an option to do upgrades in more discrete steps under admin control. Specifically, these steps are: 
    • Control plane upgrade
    • Worker nodes upgrade
    • Workload enabling upgrade (i..e. Router, other components) or infra nodes
  2. Better visibility into any errors during the upgrades and documentation of what they error means and how to recover. 
  3. An user experience around an end-2-end back-up and restore after a failed upgrade 
  4. OTA-810  - Better Documentation: 
    • Backup procedures before upgrades. 
    • More control over worker upgrades (with tagged pools between user Vs admin)
    • The kinds of pre-upgrade tests that are run, the errors that are flagged and what they mean and how to address them. 
    • Better explanation of each discrete step in upgrades, and what each CVO Operator is doing and potential errors, troubleshooting and mitigating actions.

References

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Provide a one click option to perform an upgrade which pauses all non master pools

Why is this important?

  • Customers are increasingly asking that the overall upgrade is broken up into more digestible pieces
  • This is the limit of what's possible today
    • R&D work will be done in the future to allow for further bucketing of upgrades into Control Plane, Worker Nodes, and Workload Enabling components (ie: router) That will however take much more consideration and rearchitecting

Scenarios

  1. An admin selecting their upgrade is offered two options "Upgrade Cluster" and "Upgrade Control Plane"
    1. If the admin selects Upgrade Cluster they get the pre 4.10 behavior
    2. If the admin selects Upgrade Control Plane all non master pools are paused and an upgrade is initiated
  1. A tooltip should clarify what the difference between the two are
  2. The pool progress bars should indicate pause/unpaused status, non master pools should allow for unpausing

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. While this epic doesn't specifically target upgrading from 4.N to 4.N+1 to 4.N+2 with non master pools paused it would fundamentally enable that and it would simplify the UX described in Paused Worker Pool Upgrades

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Goal
Add the ability to choose between a full cluster upgrade (which exists today) or control plane upgrade (which will pause all worker pools) in the console.

Background
Currently in the console, users only have the ability to complete a full cluster upgrade. For many customers, upgrades take longer than what their maintenance window allows. Users need the ability to upgrade the control plane independently of the other worker nodes. 

Ex. Upgrades of huge clusters may take too long so admins may do the control plane this weekend, worker-pool-A next weekend, worker-pool-B the weekend after, etc.  It is all at a pool level, they will not be able to choose specific hosts.

Requirements

  1. Changes to the Update modal:
    1. Add the ability to choose between a cluster upgrade and a control plane upgrade (the design does not default to a selection but rather disables the update button to force the user to make a conscious decision)
    2. link out to documentation to learn more about update strategies
  2. Changes to the in progress check list:
    1. Add a status above the worker pool section to let users know that all worker pools are paused and an action to resume all updates
    2. Add a "resume update" button for each worker pool entry
  3. Changes to the update status:
    1. When all master pools are updated successfully, change the status from what we have today "Up to date" to something like "Control plane up to date - all worker pools paused"
  4. Add an inline alert that lets users know there is a 60 day window to update all worker pools. In the alert, include the sentiment that worker pools can remain paused as long as is normally safe, which means until certificate rotation becomes critical which is at about 60 days. The admin would be advised to unpause them in order to complete the full upgrade. If the MCPs are paused, the certification rotation does not happen, which causes the cluster to become degraded and causes failure in multiple 'oc' commands, including but not limited to 'oc debug', 'oc logs', 'oc exec' and 'oc attach'. (Are we missing anything else here?) Inline alert logic:
    1. From day 60 to day 10 use the default alert.
    2. From day 10 to day 3 use the warning alert.
    3. From day 3 to 0 use the critical alert and continue to persist until resolved.

Design deliverables: 

Goal
Improve the UX on the machine config pool page to reflect the new enhancements on the cluster settings that allows users to select the ability to update the control plane only.

Background
Currently in the console, users only have the ability to complete a full cluster upgrade. For many customers, upgrades take longer than what their maintenance window allows. Users need the ability to upgrade the control plane independently of the other worker nodes. 

Ex. Upgrades of huge clusters may take too long so admins may do the control plane this weekend, worker-pool-A next weekend, worker-pool-B the weekend after, etc.  It is all at a pool level, they will not be able to choose specific hosts.

Requirements

  1. Changes to the table:
    1. Remove "Updated, updating and paused" columns. We could also consider adding column management to this table and hide those columns by default.
    2. Add "Update status" as a column, and surface the same status on cluster settings. Not true or false values but instead updating, paused, and up to date.
    3. Surface the update action in the table row.
  2. Add an inline alert that lets users know there is a 60 day window to update all worker pools. In the alert, include the sentiment that worker pools can remain paused as long as is normally safe, which means until certificate rotation becomes critical which is at about 60 days. The admin would be advised to unpause them in order to complete the full upgrade. If the MCPs are paused, the certification rotation does not happen, which causes the cluster to become degraded and causes failure in multiple 'oc' commands, including but not limited to 'oc debug', 'oc logs', 'oc exec' and 'oc attach'. (Are we missing anything else here?) Add the same alert logic to this page as the cluster settings:
    1. From day 60 to day 10 use the default inline alert.
    2. From day 10 to day 3 use the warning inline alert.
    3. From day 3 to 0 use the critical alert and continue to persist until resolved.

Design deliverables: 

OCP/Telco Definition of Done
Feature Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Feature --->
<--- Remove the descriptive text as appropriate --->

Feature Overview

  • As RH OpenShift Product Owners, we want to enable new providers/platforms/service with varying levels of capabilities and integration with minimal reliance on OpenShift Engineering.
  • As a new provider/platform partner, I want to enable my solution (hardware and/or software) with OpenShift with minimal effort.

 

Problem

  • It is currently challenging for us to enable new platforms / providers without taking the heavy burden on doing the platform specific development ourselves.

Goals

  • We want to enable the long-tail new platforms/providers to expand our reach into new markets and/or support new use cases.
  • We want to remove strict dependencies we have on Engineering teams to review, support and test new providers.
  • We want to lower the effort required for onboarding new platforms/providers.
  • We want to enable new platform/providers to self-certify.
  • We want to define tiered model for provider/platform integration that delineates ownership and responsibilities throughout new provider/platform development lifecycle and support model.
  • We want to reduce time to onboard new provider/platform – ideally to a single release.
  • We want to maintain consistent customer experience across all providers/platforms.

Requirements

  • Step-by-step guide on how to add a new platform/provider for each tier
  • Certification tool for partner to self-certify
  • Certification tool results for (at least) each Y/minor release submitted by partner to Red Hat for acknowledgement
  • DCI program to enable partners to run CI with OpenShift on their platform
  • Well documented, accessible, and up-to-date test suites for providing the test coverage of the partner
  • CI includes upgrade testing of OpenShift with partner's components
  • Partner component upgrade failure should not block OpenShift upgrade
  • Partner code is available in repositories in the openshift org on github with an open source license compatible with OpenShift

 

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • ...

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • ...

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

 

References

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Running the OPCT with the latest version (v0.1.0) on OCP 4.11.0, the openshift-tests is reporting an incorrect counter for the "total" field.

In the example below, after the 1127th test, the total follows the same counter of executed. I also would assume that the total is incorrect before that point as the test continues the execution increases both counters.

 

openshift-tests output format: [failed/executed/total]

started: (0/1126/1127) "[sig-storage] PersistentVolumes-expansion  loopback local block volume should support online expansion on node [Suite:openshift/conformance/parallel] [Suite:k8s]"

passed: (38s) 2022-08-09T17:12:21 "[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (default fs)] provisioning should provision storage with mount options [Suite:openshift/conformance/parallel] [Suite:k8s]"

started: (0/1127/1127) "[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: tmpfs] [Testpattern: Generic Ephemeral-volume (block volmode) (late-binding)] ephemeral should support two pods which have the same volume definition [Suite:openshift/conformance/parallel] [Suite:k8s]"

passed: (6.6s) 2022-08-09T17:12:21 "[sig-storage] Downward API volume should provide container's memory request [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]"

started: (0/1128/1128) "[sig-storage] In-tree Volumes [Driver: cinder] [Testpattern: Dynamic PV (immediate binding)] topology should fail to schedule a pod which has topologies that conflict with AllowedTopologies [Suite:openshift/conformance/parallel] [Suite:k8s]"

skip [k8s.io/kubernetes@v1.24.0/test/e2e/storage/framework/testsuite.go:116]: Driver local doesn't support GenericEphemeralVolume -- skipping
Ginkgo exit error 3: exit with code 3

skipped: (400ms) 2022-08-09T17:12:21 "[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: tmpfs] [Testpattern: Generic Ephemeral-volume (block volmode) (late-binding)] ephemeral should support two pods which have the same volume definition [Suite:openshift/conformance/parallel] [Suite:k8s]"

started: (0/1129/1129) "[sig-storage] In-tree Volumes [Driver: emptydir] [Testpattern: Dynamic PV (default fs)] capacity provides storage capacity information [Suite:openshift/conformance/parallel] [Suite:k8s]" 

 

OPCT output format [executed/total (failed failures)]

Tue, 09 Aug 2022 14:12:13 -03> Global Status: running
JOB_NAME                         | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
openshift-conformance-validated  | running    |            | 1112/1127 (0 failures)    | status=running                                    
openshift-kube-conformance       | complete   |            | 352/352 (0 failures)      | waiting for post-processor...                     
Tue, 09 Aug 2022 14:12:23 -03> Global Status: running
JOB_NAME                         | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
openshift-conformance-validated  | running    |            | 1120/1127 (0 failures)    | status=running                                    
openshift-kube-conformance       | complete   |            | 352/352 (0 failures)      | waiting for post-processor...                     
Tue, 09 Aug 2022 14:12:33 -03> Global Status: running
JOB_NAME                         | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
openshift-conformance-validated  | running    |            | 1139/1139 (0 failures)    | status=running                                    
openshift-kube-conformance       | complete   |            | 352/352 (0 failures)      | waiting for post-processor...                     
Tue, 09 Aug 2022 14:12:43 -03> Global Status: running
JOB_NAME                         | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
openshift-conformance-validated  | running    |            | 1185/1185 (0 failures)    | status=running                                    
openshift-kube-conformance       | complete   |            | 352/352 (0 failures)      | waiting for post-processor...                     
Tue, 09 Aug 2022 14:12:53 -03> Global Status: running
JOB_NAME                         | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
openshift-conformance-validated  | running    |            | 1188/1188 (0 failures)    | status=running                                    
openshift-kube-conformance       | complete   |            | 352/352 (0 failures)      | waiting for post-processor...      

 

 

 

 

Goal

Increase integration of Shipwright, Tekton, Argo CD in OpenShift GitOps with OpenShift platform and related products such as ACM.

Incomplete Features

When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release

Feature Overview

We drive OpenShift cross-market customer success and new customer adoption with constant improvements and feature additions to the existing capabilities of our OpenShift Core Networking (SDN and Network Edge). This feature captures that natural progression of the product.

Goals

  • Feature enhancements (performance, scale, configuration, UX, ...)
  • Modernization (incorporation and productization of new technologies)

Requirements

  • Core Networking Stability
  • Core Networking Performance and Scale
  • Core Neworking Extensibility (Multus CNIs)
  • Core Networking UX (Observability)
  • Core Networking Security and Compliance

In Scope

  • Network Edge (ingress, DNS, LB)
  • SDN (CNI plugins, openshift-sdn, OVN, network policy, egressIP, egress Router, ...)
  • Networking Observability

Out of Scope

There are definitely grey areas, but in general:

  • CNV
  • Service Mesh
  • CNF

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

User Story: As a customer in a highly regulated environment, I need the ability to secure DNS traffic when forwarding requests to upstream resolvers so that I can ensure additional DNS traffic and data privacy.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Create a PR in openshift/cluster-ingress-operator to implement configurable router probe timeouts.

The PR should include the following:

  • Changes to the ingress operator's ingress controller to allow the user to configure the readiness and liveness probe's timeoutSeconds values.
  • Changes to existing unit tests to verify that the new functionality works properly.
  • Write E2E test to verify that the new functionality works properly.

tldr: three basic claims, the rest is explanation and one example

  1. We cannot improve long term maintainability solely by fixing bugs.
  2. Teams should be asked to produce designs for improving maintainability/debugability.
  3. Specific maintenance items (or investigation of maintenance items), should be placed into planning as peer to PM requests and explicitly prioritized against them.

While bugs are an important metric, fixing bugs is different than investing in maintainability and debugability. Investing in fixing bugs will help alleviate immediate problems, but doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation where it gets harder and harder to add features.

One alternative is to ask teams to produce ideas for how they would improve future maintainability and debugability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.

I have a concrete example of one such outcome of focusing on bugs vs quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard to diagnose problem across the stack. The alternative is to create a point-to-point network connectivity capability. this would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.

We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.


Relevant links:

In OCP 4.8 the router was changed to use the "random" balancing algorithm for non-passthrough routes by default. It was previously "leastconn".

Bug https://bugzilla.redhat.com/show_bug.cgi?id=2007581 shows that using "random" by default incurs significant memory overhead for each backend that uses it.

PR https://github.com/openshift/cluster-ingress-operator/pull/663
reverted the change and made "leastconn" the default again (OCP 4.8 onwards).

The analysis in https://bugzilla.redhat.com/show_bug.cgi?id=2007581#c40 shows that the default haproxy behaviour is to multiply the weight (specified in the route CR) by 16 as it builds its data structures for each backend. If no weight is specified then openshift-router sets the weight to 256. If you have many, many thousands of routes then this balloons quickly and leads to a significant increase in memory usage, as highlighted by customer cases attached to BZ#2007581.

The purpose of this issue is to both explore changing the openshift-router default weight (i.e., 256) to something smaller, or indeed unset (assuming no explicit weight has been requested), and to measure the memory usage within the context of the existing perf&scale tests that we use for vetting new haproxy releases.

It may be that the low-hanging change is to not default to weight=256 for backends that only have one pod replica (i.e., if no value specified, and there is only 1 pod replica, then don't default to 256 for that single server entry).

Outcome: does changing the [default] weight value make it feasible to switch back to "random" as the default balancing algorithm for a future OCP release.

Revert router to using "random" once again in 4.11 once analysis is done on impact of weight and static memory allocation.

Per the 4.6.30 Monitoring DNS Post Mortem, we should add E2E tests to openshift/cluster-dns-operator to reduce the risk that changes to our CoreDNS configuration break DNS resolution for clients.  

To begin with, we add E2E DNS testing for 2 or 3 client libraries to establish a framework for testing DNS resolvers; the work of adding additional client libraries to this framework can be left for follow-up stories.  Two common libraries are Go's resolver and glibc's resolver.  A somewhat common library that is known to have quirks is musl libc's resolver, which uses a shorter timeout value than glibc's resolver and reportedly has issues with the EDNS0 protocol extension.  It would also make sense to test Java or other popular languages or runtimes that have their own resolvers. 

Additionally, as talked about in our DNS Issue Retro & Testing Coverage meeting on Feb 28th 2024, we also decided to add a test for testing a non-EDNS0 query for a larger than 512 byte record, as once was an issue in bug OCPBUGS-27397.   

The ultimate goal is that the test will inform us when a change to OpenShift's DNS or networking has an effect that may impact end-user applications. 

Feature Overview

  • This Section:* High-Level description of the feature ie: Executive Summary
  • Note: A Feature is a capability or a well defined set of functionality that delivers business value. Features can include additions or changes to existing functionality. Features can easily span multiple teams, and multiple releases.

 

Goals

  • This Section:* Provide high-level goal statement, providing user context and expected user outcome(s) for this feature

 

Requirements

  • This Section:* A list of specific needs or objectives that a Feature must deliver to satisfy the Feature.. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.

 

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

 

(Optional) Use Cases

This Section: 

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

 

Questions to answer…

  • ...

 

Out of Scope

 

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

 

Assumptions

  • ...

 

Customer Considerations

  • ...

 

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?  
  • New Content, Updates to existing content,  Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

During master nodes upgrade when nodes are getting drained there's currently no protection from two or more operands going down. If your component is required to be available during upgrade or other voluntary disruptions, please consider deploying PDB to protect your operands.

The effort is tracked in https://issues.redhat.com/browse/WRKLDS-293.

Example:

 

Acceptance Criteria:
1. Create PDB controller in console-operator for both console and downloads pods
2. Add e2e tests for PDB in single node and multi node cluster

 

Note: We should consider to backport this to 4.10

Goal
Add support for PDB (Pod Disruption Budget) to the console.

Requirements:

  • Add a list, detail, and yaml view (with samples) for PDBs. In addition, update the workloads page to support PDBs as well.
  • For the PBD list page include a table with name, namespace, selector, availability, allowed disruptions and created. In addition, to the table provide the main call to action to create a PDB.
  • For the PDB details page provide a Details, YAML and Pods tab. The Pods tab will include a list pods associated with the PBD - make sure to surface the owner column.
  • When users create a PDB from the list page, take them to the YAML and provide samples to enhance the creation experience. Sample 1: Set max unavailable to 0, Sample 2: Set min unavailable to 25% (confirming samples with stakeholders). In the case that a PDB has already been applied, warn users that it is not recommended to add another. Cover use cases as well that keep users from creating poor policies - for example, setting the minimum available to zero.
  • Add the ability to add/edit/view PBDs on a workload. If we edit a PDB applied to multiple workloads, warn users that this change will affect all workloads and not only the one they are currently editing. When a PDB has been applied, add a new filed to the details page with a link to the PDB and policy.

Designs:

Samuel Padgett Colleen Hart

When viewing the Installed Operators list set to 'All projects' and then selecting an operator that is available in 'All namespaces' (globally installed,) upon clicking the operator to view its details the user is taken into the details of that operator in installed namespace (project selector will switch to the install namespace.)

This can be disorienting then to look at the lists of custom resource instances and see them all blank, since the lists are showing instances only in the currently selected project (the install namespace) and not across all namespaces the operator is available in.

It is likely that making use of the new Operator resource will improve this experience (CONSOLE-2240,) though that may still be some releases away. it should be considered if it's worth a "short term" fix in the meantime.

Note: The informational alert was not implemented. It was decided that since "All namespaces" is displayed in the radio button, the alert was not needed.

Feature Overview

Enable sharing ConfigMap and Secret across namespaces

Requirements

Requirement Notes isMvp?
Secrets and ConfigMaps can get shared across namespaces   YES

Questions to answer…

NA

Out of Scope

NA

Background, and strategic fit

Consumption of RHEL entitlements has been a challenge on OCP 4 since it moved to a cluster-based entitlement model compared to the node-based (RHEL subscription manager) entitlement mode. In order to provide a sufficiently similar experience to OCP 3, the entitlement certificates that are made available on the cluster (OCPBU-93) should be shared across namespaces in order to prevent the need for cluster admin to copy these entitlements in each namespace which leads to additional operational challenges for updating and refreshing them. 

Documentation Considerations

Questions to be addressed:
 * What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
 * Does this feature have doc impact?
 * New Content, Updates to existing content, Release Note, or No Doc Impact
 * If unsure and no Technical Writer is available, please contact Content Strategy.
 * What concepts do customers need to understand to be successful in [action]?
 * How do we expect customers will use the feature? For what purpose(s)?
 * What reference material might a customer want/need to complete [action]?
 * Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
 * What is the doc impact (New Content, Updates to existing content, or Release Note)?

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Require volumes that use the Shared Resources CSI driver to specify readOnly: true in order to create the pod
  • Reserve the "openshift-" prefix for SharedSecrets and SharedConfigMaps, such that these resources can only be created by OpenShift operators. We must do this while the driver is tech preview.

Why is this important?

  • readOnly: true must be specified in order for the driver to mount the volume correctly. If this is not set, the volume mount is rejected and the pod will be stuck in a Pending/Initializing state.
  • A validating admission webhook will ensure that the pods won't be created in such a state, improving user experience.
  • Openshift operators may want/need to create SharedSecrets and SharedConfigMaps so they can be used as system level resources. For example, Insights Operator can automatically create a SharedSecret for the Simple Content Access cert.

Scenarios

  1. As a developer, I want to consume shared Secrets and ConfigMaps in my workloads so that I can have access to shared credentials and configuration.
  2. As a cluster admin, I want the Insights operator to automatically create a SharedSecret for my cluster's simple content access certificate.
  3. As a cluster admin/SRE, I want OpenShift to use SharedConfigMaps to distribute cluster certificate authorities so that data is not duplicated in ConfigMaps across my cluster.

Acceptance Criteria

  • Pods must have readOnly: true set to use the shared resource CSI Driver - admission should be rejected if this is not set.
  • Documentation updated to reflect this requirement.
  • Users (admins?) are not allowed to create SharedSecrets or SharedConfigMaps with the "openshift-" prefix.

Dependencies (internal and external)

  1. ART - to create payload image for the webhook
  2. Arch review for the enhancement proposal (Apiserver/control plane team)

Previous Work (Optional):

  1. BUILD-293 - Shared Resources tech preview

Open questions::

  1. From email exchange with David Eads:  "Thinking ahead to how we'd like to use this in builds once we're GA, are we likely to choose openshift-etc-pki-entitlement as one of our well-known names?  If we do, what sort of validation (if any) would we like to provide on the backing secret and does that require any new infrastructure?"

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

As a developer using SharedSecrets and ConfigMaps
I want to ensure all pods set readOnly; true on admission
So that I don't have pods stuck in the "Pending" state because of a bad volume mount

Acceptance Criteria

  • Pods which reference the Shared Resource CSI driver must set readOnly: true on admission.
  • If readOnly: true is not set, or is set to false, the pod should not be created.
  • Appropriate testing in place to verify behavior

QE Impact

QE will need to verify the new Pod Admission behavior

Docs Impact

Docs will need to ensure that readOnly: true is required and must be set to true.

PX Impact

None.

QE testing/verification of the feature - require readOnly to be true

Actions:

1. Create smoke test and submit to GitHub
2. Run script to integrate smoke test with Polarion

User Story

As an OpenShift engineer
I want the shared resource CSI Driver webhook to be installed with the cluster storage operator
So that the webhook is deployed when the CSI driver is deployed

Acceptance Criteria

  • Shared Resource CSI Driver operator deploys the webhook alongside the CSI driver
  • Cluster storage operator is updated if needed to deploy the shared resource CSI driver webhook.

Docs Impact

None - no new functional capabilities will be added

QE Impact

None - we can verify in CI that we are deploying the webhook correctly.

PX Impact

None - no new functional capabilities will be added

Notes

The scope of this story is to just deploy the "hello world" webhook with the Cluster Storage Operator.
Adding the live ValidatingWebhook configuration and service will be done in a separate story.

User Story

As an OpenShift engineer,
I want to initialize a validating admission webhook for the shared resource CSI driver
So that I can eventually require readOnly: true to be set on all pods that use the Shared Resource CSI Driver

Acceptance Criteria

  • Container image created in CI which builds a "hello world" binary for the future validating webhook.
  • ART sets up downstream build process for the image.

QE Impact

None.

Docs Impact

None.

PX Impact

None.

Notes

This is a prerequisite for implementing the validating admission webhook.
We need to have ART build the container image downstream so that we can add the correct image references for the CVO.
If we reference images in the CVO manifests which do not have downstream counterparts, we break the downstream build for the payload.

CI is capable of producing multiple images for a GitHub repository. For example, github.com/openshift/oc produces 4-5 images with various capabilities.

We did similar work in BUILD-234 - some of these steps are not required.

See also:

Complete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled

Summary (PM+lead)

https://issues.redhat.com/browse/AUTH-2 revealed that, in prinicipal, Pod Security Admission is possible to integrate into OpenShift while retaining SCC functionality.

 

This epic is about the concrete steps to enable Pod Security Admission by default in OpenShift

Motivation (PM+lead)

Goals (lead)

  • Enable Pod Security Admission in "restricted" policy level by default
  • Migrate existing core workloads to comply to the "restricted" pod security policy level

Non-Goals (lead)

  • Other OpenShift workloads must be migrated by the individual responsible teams.

Deliverables

Proposal (lead)

Enhancement - https://github.com/openshift/enhancements/pull/1010

User Stories (PM)

Dependencies (internal and external, lead)

Previous Work (lead)

Open questions (lead)

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

ingress-operator must comply to pod security. The current audit warning is:

 

{   "objectRef": "openshift-ingress-operator/deployments/ingress-operator",   "pod-security.kubernetes.io/audit-violations": "would violate PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.run AsNonRoot=true), seccompProfile (pod or containers \"ingress-operator\", \"kube-rbac-proxy\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")" }

dns-operator must comply to restricted pod security level. The current audit warning is:

{   "objectRef": "openshift-dns-operator/deployments/dns-operator",   "pod-security.kubernetes.io/audit-violations": "would violate PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.allowPrivilegeEscalation=false), unre stricted capabilities (containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.runAsNonRoot=tr ue), seccompProfile (pod or containers \"dns-operator\", \"kube-rbac-proxy\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")" }

Epic Goal

HyperShift provisions OpenShift clusters with externally managed control-planes. It follows a slightly different process for provisioning clusters. For example, HyperShift uses cluster API as a backend and moves all the machine management bits to the management cluster.  

Why is this important?

showing machine management/cluster auto-scaling tabs in the console is likely to confuse users and cause unnecessary side effects. 

Definition of Done

  • MachineConfig and MachineConfigPool should not be present, they should be either removed or hidden when the cluster is spawned using HyperShift. 
  • Cluster Settings show say the control plane is externally managed and be read-only.
  • Cluster Settings -> Configuration resources should be read-only, maybe hide the tab
  • Some resources should go in an allowlist. Most will be hidden
  • Review getting started steps

See Design Doc: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#

 

Setup / Testing

It's based on the SERVER_FLAG controlPlaneTopology being set to External is really the driving factor here; this can be done in one of two ways:

  • Locally via a Bridge Variable, export BRIDGE_CONTROL_PLANE_TOPOLOGY_MODE="External"
  • Locally / OnCluster via modifying the window.SERVER_FLAGS.controlPlaneTopology to External in the dev tools

To test work related to cluster upgrade process, use a 4.10.3 cluster set on the candidate-4.10 upgrade channel using 4.11 frontend code.

If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml to the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to suspend kubeadmin notifier, from the global notifications, since it contain link for updating the cluster OAuth configuration (see attachment).

 

 

If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml co the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to remove the ability to “Add identity providers” under “Set up your Cluster”. In addition to the getting started card, we should remove the ability to update a cluster on the details card when applicable (anything that changes a cluster version should be read only).

Summary of changes to the overview page:

  • Remove the ability to “Add identify providers” under “Set up your Cluster”
  • Remove cluster update CTA from the details card
  • Remove update alerts from the status card

Check section 03 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#

If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml to the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need to suspend these notifications:

  • cluster upgrade notifications
  • new channel available notifications

For these we will need to check `ControlPlaneTopology`, if it's set to 'External' and also check if the user can edit cluster version(either by creating a hook or an RBAC call, eg. `canEditClusterVersion`)

 

Check section 05 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#

Based on Cesar's comment we should be removing the `Control Plane` section, if the infrastructure.status.controlplanetopology being "External".

If the Infrastructure.Status.ControlPlaneTopology is set to 'External', the console-operator will pass this information via the console-config.yaml co the console. Console pod will get re-deployed and will store the topology mode information as a SERVER_FLAG. Based on that value we need surface a message that the control plane is externally managed and add following changes:

  • Remove update button
  • Make channel read only
  • Link out to read only CV details page
  • Remove the ability to edit upstream configuration
  • Remove the cluster autoscaler field
  • Add an alert to the page so that users know the control plane is externally managed

In general, anything that changes a cluster version should be read only.

Check section 02 for more info: https://docs.google.com/document/d/1k76JtRRHBdCCEjHPqKcYvbNVsuaGmRhWDLESWIm0mbo/edit#

 

Epic Goal

Why is this important?

  • So the UX satisfies the current trands, where dark mode is becoming a standard for modern services.  

Acceptance Criteria

  • OCP admin console must be rendered in a preferred mode based on `prefers-color-scheme` media query
  • OCP admin console must be rendered in a preferred mode selected in the User Setting page
  • Create an followup epic/story for and listing and tracking changes needed in OCP console's dynamic plugins

Dependencies (internal and external)

  1. PatternFly - Dark mode PF variables

Previous Work (Optional):

  1. Mike Coker has worked on a POC from the PF point of view on both the admin and dev console, and the screenshot results are listed below along with the repo branch. Also listed is a document covering some of the common issues found when putting together the admin console POC. https://github.com/mcoker/console/tree/dark-theme
    Background POC work completed for reference:

PatternFly Dark Theme Handbookhttps://docs.google.com/document/d/1mRYEfUoOjTsSt7hiqjbeplqhfo3_rVDO0QqMj2p67pw/edit

Admin Console -> Workloads & Pods

Dev Console -> Gotcha pages: Observe Dashboard and Metrics, Add, Pipelines: builder, list, log, and run

Open questions::

  1. Who should be responsible for updating DynamicPlugins to be able to render in dark mode?

As a developer, I want to be able to scope the changes needed to enable dark mode for the admin console. As such, I need to investigate how much of the console will display dark mode using PF variables and also define a list of gotcha pages/components which will need special casing above and beyond PF variable settings.

 

Acceptance criteria:

As a developer, I want to be able to fix remaining issues from the spreadsheet of issues generated after the initial pass and spike of adding dark theme to the console.. As such, I need to make sure to either complete all remaining issues for the spreadsheet, or, create a bug or future story for any remaining issues in these two documents.

 

Acceptance criteria:

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.

The Cluster Dashboard Details Card Protractor integration test was failing at high rate, and despite multiple attempts to fix, was never fully resolved, so it was disabled as a way to fix https://bugzilla.redhat.com/show_bug.cgi?id=2068594. Migrating this entire file to Cypress should give us better debugging capability, which is what was done to fix a similarly problematic project dashboard Protractor test.

This epic contains all the Dynamic Plugins related stories for OCP release-4.11 

Epic Goal

  • Track all the stories under a single epic

Acceptance Criteria

  •  

Currently, you need to navigate to

Cluster Settings ->
Global configuration ->
Console (operator) config ->
Console plugins

to see and managed plugins. This takes a lot of clicks and is not discoverable. We should look at surfacing plugin details where they're easier to find – perhaps on the Cluster Settings page – or at least provide a more convenient link somewhere in the UI.

AC: Add the Dynamic Plugins section to the Status Card in the overview that will contain:

  • count of active and non-active plugins
  • link to the ConsolePlugins instances page
  • status of the loaded plugins and breakout error

cc Ali Mobrem Robb Hamilton

We have a Timestamp component for consistent display of dates and times that we should expose through the SDK. We might also consider a hook that formats dates and times for places were you don't want or cant use the component, eg. times on a chart. 

This will become important when we add a user preference for dates so that plugins show consistent dates and times as console. If I set my user preference to UTC dates, console should show UTC dates everywhere.

 

AC:

  • Expose the Timestamp component inside the SDK. 
  • Replace the connect with useSelector hook
  • Keep the original component and proxy it to the new one in the SDK

 

 

 

cc Jakub Hadvig Sho Weimer 

Currently, enabled plugins can fail to load for a variety of reasons. For instance, plugins don't load if the plugin name in the manifest doesn't match the ConsolePlugin name or the plugin has an invalid codeRef. There is no indication in the UI that something has gone wrong. We should explore ways to report this problem in the UI to cluster admins. Depending on the nature of the issue, an admin might be able to resolve the issue or at least report a bug against the plugin.

The message about failing could appear in the notification drawer and/or console plugins tab on the operator config. We could also explore creating an alert if a plugin is failing.

 

AC:

  • Add notification into the Notification Drawer in case a Dynamic Plugin will error out during load.
  • Render these errors in the status card, notification section, as well.
  • For each failed plugin we should create a separate notification.

We need to provide a base for running integration tests using the dynamic plugins. The tests should initially

  • Create a deployment and service to run the dynamic demo plugin
  • Update the console operator config to enable the plugin
  • Wait for the plugin to be available
  • Test at least one extension point used by the plugin (such as adding items to the nav)
  • Disable the plugin when done

Once the basic framework is in place, we can update the demo plugin and add new integration tests when we add new extension points.

https://github.com/openshift/console/tree/master/frontend/dynamic-demo-plugin

 

https://github.com/openshift/enhancements/blob/master/enhancements/console/dynamic-plugins.md

 

https://github.com/openshift/console/tree/master/frontend/packages/console-plugin-sdk

In the 4.11 release, a console.openshift.io/default-i18next-namespace annotation is being introduced. The annotation indicates whether the ConsolePlugin contains localization resources. If the annotation is set to "true", the localization resources from the i18n namespace named after the dynamic plugin (e.g. plugin__kubevirt), are loaded. If the annotation is set to any other value or is missing on the ConsolePlugin resource, localization resources are not loaded. 

 

In case these resources are not present in the dynamic plugin, the initial console load will be slowed down. For more info check BZ#2015654

 

AC:

  • console-operator should be checking for the new console.openshift.io/use-i18n annotation, update the console-config.yaml accordingly and redeploy the console server
  • console server should pick up the changes in the console-config.yaml and only load the i18n namespace that are available

 

Follow up of https://issues.redhat.com/browse/CONSOLE-3159

 

 

Goal

  • Add the ability for users to select supported but not recommended updates.
  • Refine workflow when both "upgradeable=false" and "supported-but-not-recommended" updates occur

Background
RFE: for 4.10, Cincinnati and the cluster-version operator are adding conditional updates (a.k.a. targeted edge blocking): https://issues.redhat.com/browse/OTA-267

High-level plans in https://github.com/openshift/enhancements/blob/master/enhancements/update/targeted-update-edge-blocking.md#update-client-support-for-the-enhanced-schema

Example of what the oc adm upgrade UX will be in https://github.com/openshift/enhancements/blob/master/enhancements/update/targeted-update-edge-blocking.md#cluster-administrator.

The oc implementation landed via https://github.com/openshift/oc/pull/961.

Design

  • Use case 01: "supported but not recommended" occurs to the latest version:
    • Add an info icon next to the version on update path with a pop-over to explain about why updating to this version is supported, but not recommended and a link to known risks
    • Identify the difference in "recommended" versions, "supported but not recommended" versions, and "blocked" versions (upgradeable=false) in the + more modal.
    • The latest version is pre-selected in the dropdown in the update modal with an inline alert to inform users about supported-but-not-recommended version with link to known risks. Users can choose to update to another recommended versions, update to a supported-but-not-recommended one, or wait.
    • The "recommended" and "supported but not recommended" updates are separated in the dropdown.
    • If a user selects a "recommended" update, the inline alert disappears.
  • Use case 02: When both "upgradeable=false" and "supported but not recommended" occur:
    • Add an alert banner to explain why users shouldn’t update to the latest version and link to how to resolve on the cluster settings details page. Users have the options to resolve the issue, update to a patch version, or wait.
    • If users open the update modal without resolving the "upgradeable=false" issue, the next recommended version is pre-selected. An expandable link "View blocked versions (#)" is included under the dropdown to show "upgradeable=false" versions with resolve link.
    • If users resolve the "upgradeable=false" issue, the cluster settings page will change to use case 01
    • Question: Priority on changing the upgradeable=false alert banner in update modal and blocked versions in dropdown

See design doc: https://docs.google.com/document/d/1Nja4whdsI5dKmQNS_rXyN8IGtRXDJ8gXuU_eSxBLMIY/edit#

See marvel: https://marvelapp.com/prototype/h3ehaa4/screen/86077932

Update the cluster settings page to inform the user when the latest available update is supported but not recommended. Add an informational popover to the latest version in  update path visualization.

The "Update Version" modal on the cluster settings page should be updated to give users information about recommended, not recommended, and blocked update versions.

  • When the modal is opened, the latest recommended update version should be pre-selected in the version dropdown.
  • Blocked versions should no longer be displayed in the version dropdown, and should instead be displayed in a collapsible field below the dropdown.
  • When blocked versions are present, a link should be provided to the cluster operator tab. The version dropdown itself should have two labeled sections: "Recommended" and "Supported but not recommended".
  • When the user selects a "Supported but not recommended" item from the version dropdown, an inline info alert should appear below the version selection field and should provide a link to known risks associated with the selected version. This is an external link provided through the ClusterVersion API.

Epic Goal

  • Add telemetry so that we know how image stream features are used.

Why is this important?

  • We have a long standing epic to create image streams v2. We need to better understand how image streams are used today.

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Epic Goal

  • Make the image registry distributed across availability zones.

Why is this important?

  • The registry should be highly available and zone failsafe.

Scenarios

  1. As an administrator I want to rely on a default configuration that spreads image registry pods across topology zones so that I don't suffer from a long recovery time (>6 mins) in case of a complete zone failure if all pods are impacted.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. Pod's topologySpreadConstraints

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: https://github.com/openshift/cluster-image-registry-operator/pull/730
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Story: As an administrator I want to rely on a default configuration that spreads image registry pods across topology zones so that I don't suffer from a long recovery time (>6 mins) in case of a complete zone failure if all pods are impacted.

Background: The image registry currently uses affinity/anti-affinity rules to spread registry pods across different hosts. However this might cause situations in which all pods end up on hosts of a single zone, leading to a long recovery time of the registry if that zone is lost entirely. However due to problems in the past with the preferred setting of anti-affinity rule adherence the configuration was forced instead with required and the rules became constraints. With zones as constraints the internal registry would not have deployed anymore in environments with a single zone, e.g. internal CI environment. Pod topology constraints is a new API that is supported in OCP which can also relax constraints in case they cannot be satisfied. Details here: https://docs.openshift.com/container-platform/4.7/nodes/scheduling/nodes-scheduler-pod-topology-spread-constraints.html

Acceptance criteria:

  • by default the internal registry is deployed with at least two replica
  • by default the topology constraints should be on a zone-basis, so that by defaults one registry pod is scheduled in each zone
  • when constraints can't be satisfied the registry should deploy anyway
  • we should not do this in SNO environments
  • the registry should still work on SNO environments

Open Questions:

  • what happens in environments where the storage is zone dependent?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

As an OpenShift administrator
I want to provide the registry operator with a custom certificate authority for S3 storage
so that I can use a third-party S3 storage provider.

Acceptance criteria

  1. Users can specify a configmap name (from openshift-config) in config.imageregistry/cluster's spec.storage.s3.
  2. The operator uses CA from this configmap to check S3 bucket.
  3. The image registry pod uses CA from this configmap to access the S3 bucket.
  4. When a custom CA is defined, the operator/image-registry should still trust certificate authorities that are used by Amazon S3 and other well-known CAs.
  5. An end-to-end test that runs minio and checks the image registry becomes healthy with it.

Goal

Remove Jenkins from the OCP Payload.

Problem

  • Jenkins images are "non-trival in size, impact experience around OCP payloads
  • Security advisories cannot be handled once, but against all actively supported OCP releases, adding to response time for handling said advisories
  • Some customers may now want to upgrade Jenkins as OCP upgrades (making this configurable is more ideal)

Why is this important

  • This is an engineering motivated item to reduce costs so we have more cycles for strategic work
  • Aside from the team itself, top level OCP architects want this to reduce the image size, improve general OCP upgrade experience
  • Sends a mix message with respect to what is startegic CI/CI when Jenkins is baked into OCP, but Tekton/Pipelines is an add-on, day 2 install sort of thing

Dependencies (internal and external)

See epic linking - need alternative non payload image available to provide relatively seamless migration

 

Also, the EP for this is approved and merged at https://github.com/openshift/enhancements/blob/master/enhancements/builds/remove-jenkins-payload.md

Estimate (xs, s, m, l, xl, xxl):

Questions:

       PARTIAL ANSWER ^^:  confirmed with Ben Parees in https://coreos.slack.com/archives/C014MHHKUSF/p1646683621293839 that EP merging is currently sufficient OCP "technical leadership" approval.

 

Previous work

 

Customers

assuming none

User Stories

 

As maintainers of the OpenShift jenkins component, we need run Jenkins CI for PR testing against openshift/jenkins, openshift/jenkins-sync-plugin, openshift/jenkins-client-plugin, openshift/jenkins-openshift-login-plugin, using images built in the CI pipeline but not injected into CI test clusters via sample operator overriding the jenkins sample imagestream with the jenkins payload image.

 

As maintainers of the OpenShift Jenkins component, we need Jenkins periodics for the client and sync plugins to run against the latest non payload, CPaas image, promoted to CI's image locations on quay.io, for the current release in development.

 

As maintainers of the OpenShift Jenkins component, we need Jenkins related tests outside of very basic Jenkins Pipieline Strategy Build Config verification, removed from openshift-tests in OpenShift Origin, using a non-payload, CPaas image pertinent to the branch in question.

Acceptance criteria

  • all PR CI Tests do not utilize samples operator manipulation of the jenkins imagestream with the in payload image, but rather images including the PRs changes
  • all periodic CI Tests do not utilize samples operator manipulation of the jenkins imagestream with the in payload image, but rather CI promoted images for the current release pushed to quay.io

High Level, we ideally want to vet the new CPaas image via CI and periodics BEFORE we start changing the samples operator so that it does not manipulate the jenkins imagestream (our tests will override the samples operator override)

QE Impact

NONE ... QE should wait until JNKS-254

Docs Impact

NONE

PX Impact

 

NONE

Launch Checklist

Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated

Notes

  • Our CSI shared resource experience will help us here
  • but the old IMAGE_FORMAT stuff is deprecated, and does not work well with step registry stuff
  • instead, we need to use https://docs.ci.openshift.org/docs/architecture/ci-operator/#dependency-overrides
  • Makefile level logic will use `oc tag` to update the jenkins imagestream created as part of samples to override the use of the in payload image with the image build by the PR, or for periodics, with what has been promoted to quay.io
  • Ultimately, CI step registry for capturing the `oc tag` update the imagestream logic is the probably end goal
  • JNKS-268 might change how we do periodics, but the current thought is to get existing periodics working with the CPaas image first

Possible staging

1) before CPaas is available, we can validate images generated by PRs to openshift/jenkins, openshift/jenkins-sync-plugin, openshift/jenkins-client-plugin by taking the image built by the image (where the info needed to get the right image from the CI registry is in the IMAGE_FORMAT env var) and then doing an `oc tag --source=docker <PR image ref> openshift/jenkins:2` to replace the use of the payload image in the jenkins imagestream in the openshift namespace with the PRs image

2) insert 1) in https://github.com/openshift/release/blob/master/ci-operator/step-registry/jenkins/sync-plugin/e2e/jenkins-sync-plugin-e2e-commands.sh and https://github.com/openshift/release/blob/master/ci-operator/step-registry/jenkins/client-plugin/tests/jenkins-client-plugin-tests-commands.sh where you test for IMAGE_FORMAT being set

3) or instead of 2) you update the Makefiles for the plugins to call a script that does the same sort of thing, see what is in IMAGE_FORMAT, and if it has something, do the `oc tag`

 

https://github.com/openshift/release/pull/26979 is a prototype of how to stick the image built from a PR and conceivably the periodics to get the image built from it and tag it into the jenkins imagestream in the openshift namespace in the test cluster

 

Epic Goal

  • Remove this UI from our stack that we cannot support.

Why is this important?

  • Reduce support burden.
  • Remove Bugzilla burden of addressing continuous CVEs found in this project.

Acceptance Criteria

  • All Prometheus upstream UI links are removed
  • Related documentation is updated
  • Ports/routes etc configured to expose access to this UI are removed such that no configuration we provide enables access to this UI or its codepaths.
  • There is no reason any CVEs found in this UI would ever require intervention by the Monitoring Team.

Dependencies (internal and external)

  1. Make the Prometheus Targets information available in Console UI (https://issues.redhat.com/browse/MON-1079)

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

After installing or upgrading to the latest OCP version, the existing OpenShift route to the prometheus-k8s service is updated to be a path-based route to '/api/v1'.

DoD:

  • It is not possible to access the Prometheus UI via the OpenShift route
  • Using a bearer token with sufficient permissions, it is possible to access the /api/v1/* endpoints via the OpenShift route.
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Following up on https://issues.redhat.com/browse/MON-1320, we added three new CLI flags to Prometheus to apply different limits on the samples' labels. These new flags are available starting from Prometheus v2.27.0, which will most likely be shipped in OpenShift 4.9.

The limits that we want to look into for OCP are the following ones:

# Per-scrape limit on number of labels that will be accepted for a sample. If
# more than this number of labels are present post metric-relabeling, the
# entire scrape will be treated as failed. 0 means no limit.
[ label_limit: <int> | default = 0 ]

# Per-scrape limit on length of labels name that will be accepted for a sample.
# If a label name is longer than this number post metric-relabeling, the entire
# scrape will be treated as failed. 0 means no limit.
[ label_name_length_limit: <int> | default = 0 ]

# Per-scrape limit on length of labels value that will be accepted for a sample.
# If a label value is longer than this number post metric-relabeling, the
# entire scrape will be treated as failed. 0 means no limit.
[ label_value_length_limit: <int> | default = 0 ]

We could benefit from them by setting relatively high values that could only induce unbound cardinality and thus reject the targets completely if they happened to breach our constrainst.

DoD:

  • Being able to configure label scrape limits for UWM

Epic Goal

When users configure CMO to interact with systems outside of an OpenShift cluster, we want to provide an easy way to add the cluster ID to the data send.

Why is this important?

Technically this can be achieved today, by adding an identifying label to the remote_write configuration for a given cluster. The operator adding the remote_write integration needs to take care that the label is unique over the managed fleet of clusters. This however adds management complexity. Any given cluster already has a pseudo-unique datum, that can be used for this purpose.

  • Starting in 4.9 we support the Prometheus remote_write feature to send metric data to a storage integration outside of the cluster similar to our own Telemetry service.
  • In Telemetry we already use the cluster ID to distinguish the various clusters.
  • For users of remote_write this could add an easy way to add such distinguishing information.

Scenarios

  1. An organisation with multiple OpenShift clusters want to store their metric data centralized in a dedicated system and use remote_write in all their clusters to send this data. When querying their centralized storage, metadata (here a label) is needed to separate the data of the various clusters.
  2. Service providers who manage multiple clusters for multiple customers via a centralized storage system need distinguishing metadata too. See https://issues.redhat.com/browse/OSD-6573 for example

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • Document how to use this feature

Dependencies (internal and external)

  1. none

Previous Work (Optional):

  1. none

Open questions::

  1.  

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Implementation proposal:

 

Expose a flag in the CMO configuration, that is false by default (keeps backward compatibility) and when set to true will add the _id label to a remote_write configuration. More specifically it will be added to the top of a remote_write relabel_config list via the replace action. This will add the label as expect, but additionally a user could alter this label in a later relabel config to suit any specific requirements (say rename the label or add additional information to the value).
The location of this flag is the remote_write Spec, so this can be set for individual remote_write configurations.

We currently use a sample app to e2e test remote write in CMO.
In order to test the addition of the cluster_id relabel config, we need to confirm that the metrics send actually have the expected label.
For this test we should use Prometheus as the remote_write target. This allows us to query the metrics send via remote write and confirm they have the expected label.

Add an optional boolean flag to CMOs definition of RemoteWriteSpec that if true adds an entry in the specs WriteRelabelConfigs list.

I went with adding the relabel config to all user-supplied remote_write configurations. This path has no risk for backwards compatibility (unless users use the {}tmp_openshift_cluster_id{} label, seems unlikely) and reduces overall complexity, as well as documentation complexity.

The entry should look like what is already added to the telemetry remote write config and it should be added as the first entry in the list, before any user supplied relabel configs.

Epic Goal

  • Offer the option to double the scrape intervals for CMO controlled ServiceMonitors in single node deployments
  • Alternatively automatically double the same scrape intervals if CMO detects an SNO setup

The potential target ServiceMonitors are:

  • kubelet
  • kube-state-metrics
  • node-exporter
  • etcd
  • openshift-state-metrics

Why is this important?

  • Reduce CPU usage in SNO setups
  • Specifically doubling the scrape interval is important because:
  1. we are confident that this will have the least chance to interfere with existing rules. We typically have rate queries over the last 2 minutes (no shorter time window). With 30 second scrape intervals (the current default) this gives us 4 samples in any 2 minute window. rate needs at least 2 samples to work, we want another 2 for failure tolerance. Doubling the scrape interval will still give us 2 samples in most 2 minute windows. If a scrape fails, a few rule evaluations might fail intermittently.
  2. We expect a measureable reduction of CPU resources (see previous work)

Scenarios

  1. RAN deployments (Telco Edge) are SNO deployments. In these setups a full CMO deployment is often not needed and the default setup consumes too many resources. OpenShift as a whole has only very limited CPU cycles available and too many cycles are spend on Monitoring

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Previous Work (Optional):

  1. https://issues.redhat.com/browse/MON-1569

Open questions:

  1. Whether doubling some scrape intervals reduces CPU usage to fit into the assigned budget

Non goals

  • Allow arbitrarily long scrape intervals. This will interfere with alert and recoring rules
  • Implement a global override to scrape intervals.
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Description

As a user, I want the topology view to be less cluttered as I doom out showing only information that I can discern and still be able to get a feel for the status of my project.

Acceptance Criteria

  1. When zoomed to 50% scale, all labels & decorators will be hidden. Label are shown when hovering over the node
  2. When zoomed to 30% scale, all labels, decorators, pod rings & icons will be hidden. Node shape remains the same, and background is either white, yellow or red. Background color is determined based on aggregate status of pods, alerts, builds and pipelines. Tooltip is available showing node name as well as the "things" which are attributing to the warning/error status.

Additional Details:

Description

As a user, I want to understand which service bindings connected a service to a component successfully or not. Currently it's really difficult to understand and needs inspection into each ServiceBinding resource (yaml).

Acceptance Criteria

  1. Show a status badge on the SB details page
  2. Show a Status field in the right column of the SB details page
  3. Show the Status field in the right column of the Topology side panel when a SB is selected
  4. Show an indicator in the Topology view which will help to differentiate when the service binding is in error state
  5. Define the available statuses & associated icons 🥴
    1. Connected
    2. Error
  6. Error states defined by the SB conditions … if any of these 3 are not True, the status will be displayed as Error

Additional Details:

See also https://docs.google.com/document/d/1OzE74z2RGO5LPjtDoJeUgYBQXBSVmD5tCC7xfJotE00/edit

Problem:

This epic is mainly focused on the 4.10 Release QE activities

Goal:

1. Identify the scenarios for automation
2. Segregate the test Scenarios into smoke, Regression and other user stories
a. Update the https://docs.jboss.org/display/ODC/Automation+Status+Report
3. Align with layered operator teams for updating scripts
3. Work closely with dev team for epic automation
4. Create the automation scripts using cypress
5. Implement CI for nightly builds
6. Execute scripts on sprint basis

Why is it important?

To the track the QE progress at one place in 4.10 Release Confluence page

Use cases:

  1. <case>

Acceptance criteria:

  1. <criteria>

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

Acceptance criteria:

  1. Execute the automation scripts on ODC nightly builds in OpenShift CI (prow) periodically
  2. provide a separate job for each "plugin" (like pipelines, knative, etc.)

Goal:

This epic covers a number of customer requests(RFEs) as well as increases usability.

Why is it important?

Customer satisfaction as well as improved usability.

Acceptance Criteria

  1. Allow user to re-arrange the resources which have ben added to nav by the user
  2. Improved user experience (form based experience)
    1. Form based editing of Routes
    2. Form based creation and editing of Config Maps
    3. Form base creation of Deployments
  3. Improved discovery
    1. Include Share my project on the Add page to increase discoverability
    2. NS Helm Chart Repo
      1. Add tile to Add page for discoverability
      2. Provide a form driven creation experience
      3. User should be able to switch back and forth from Form/YAML
      4. change the intro text to the below & have the link in the intro text bring up the full page form
        1. Browse for charts that help manage complex installations and upgrades. Cluster administrators can customize the content made available in the catalog. Alternatively, developers can try to configure their own custom Helm Chart repository.

Dependencies (External/Internal):

None

Exploration:

Miro board from Epic Exploration

Description

As a user, I should be able to switch between the form and yaml editor while creating the ProjectHelmChartRepository CR.

Acceptance Criteria

  1. Convert the create form into a form-yaml switcher
  2. Display this form-yaml view in Search -> ProjectHelmChartRepositories in both perspectives

Additional Details:

Form component https://github.com/openshift/console/pull/11227

Description

As a user, I want to use a form to create Deployments

Acceptance Criteria

  1. Use existing edit Deployment form component for creating Deployments
  2. Display the form when clicked on `Create Deployment` in the Deployments Search page in the Dev perspective
  3. The `Create Deployment` button in the Deployments list page & the search page in the Admin perspective should have a similar experience.

Additional Details:

Edit deployment form ODC-5007

Problem:

Currently we are only able to get limited telemetry from the Dev Sandbox, but not from any of our managed clusters or on prem clusters.

Goals:

  1. Enable gathering segment telemetry whenever cluster telemetry is enabled on OSD clusters
  2. Have our OSD clusters opt into telemetry by default
  3. Work with PM & UX to identify additional metrics to capture in addition to what we have enabled currently on Sandbox.
  4. Ability to get a single report from woopra across all of our Sandbox and OSD clusters.
  5. Be able to generate a report including metrics of a single cluster or all clusters of a certain type ( sandbox, or OSD)

Why is it important?

In order to improve properly analyze usage and the user experience, we need to be able to gather as much data as possible.

Acceptance Criteria

  1. Extend console backend (bridge) to provide configuration as SERVER_FLAGS
    // JS type
    telemetry?: Record<string, string>
    
    1. Read the annotation of the cluster ConfigMap for telemetry data and pass them into the internal serverconfig.
    2. Pass through this internal serverconfig and export it as SERVER_FLAGS.
    3. Add a new --telemetry CLI option so that the telemetry options could be tested in a dev environment:
      ./bin/bridge --telemetry SEGMENT_API_KEY=a-key-123-xzy
      ./bin/bridge --telemetry CONSOLE_LOG=debug
      
  2. TBD: In best case the new annotation could be read from the cluster ConfigMap...
    1. Otherwise update the console-operator to pass the annotation from the console cluster configuration to the console ConfigMap.

Additional Details:

  1. More information about the integration with the backend could be found in the Telemetry on OSD clusters Google Doc

Goal:
Enhance oc adm release new (and related verbs info, extract, mirror) with heterogeneous architecture support

tl;dr

oc adm release new (and related verbs info, extract, mirror) would be enhanced to optionally allow the creation of manifest list release payloads. The manifest list flow would be triggered whenever the CVO image in an imagestream was a manifest list. If the CVO image is a standard manifest, the generated release payload will also be a manifest. If the CVO image is a manifest list, the generated release payload would be a manifest list (containing a manifest for each arch possessed by the CVO manifest list).

In either case, oc adm release new would permit non-CVO component images to be manifest or manifest lists and pass them through directly to the resultant release manifest(s).

If a manifest list release payload is generated, each architecture specific release payload manifest will reference the same pullspecs provided in the input imagestream.

 

More details in Option 1 of https://docs.google.com/document/d/1BOlPrmPhuGboZbLZWApXszxuJ1eish92NlOeb03XEdE/edit#heading=h.eldc1ppinjjh

Incomplete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled

Epic Goal

  • Update image registry dependencies (Kubernetes and OpenShift) to the latest versions.

Why is this important?

  • New versions usually bring improvements that are needed by the registry and help with getting updates for z-stream.

Scenarios

  1. As an OpenShift engineer, I want my components to use the versions of dependencies, so that they get fixes for known issues and can be easily updated in z-stream.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated

Dependencies (internal and external)

  1. Kubernetes 1.24

Previous Work (Optional):

  1. IR-210

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>

As a OpenShift engineer
I want image-registry to use the latest k8s libraries
so that image-registry can benefit from new upstream features.

Acceptance criteria

  • image-registry uses k8s.io/api v1.24.z
  • image-registry uses latest openshift/api, openshift/library-go, openshift/client-go

Epic Goal

  • Provide a dedicated dashboard for NVIDIA GPU usage visualization in the OpenShift Console.

Why is this important?

  • Customers that use GPUs in their clusters usually have the GPU workloads as the main purpose of their cluster. As such, it makes much more sense to have the details about the usage they are doing of GPGPU resources AND CPU/RAM rather than just CPU/RAM

Scenarios

  1. As an admin of a cluster dedicated to data science, I want to quickly find out how much of my very costly resources are currently in use and if things are getting queued due to lack of resources

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Dependencies (internal and external)

  1. The NVIDIA GPU Operator must export to prometheus the relevant data

Open questions::

  1. Will NVIDIA agree to these extra data exports in their GPU Operator?

I asked Zvonko Kaiser and he seemed open to it. I need to confirm with Shiva Merla

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • Run OpenShift builds that do not execute as the "root" user on the host node.

Why is this important?

  • OpenShift builds require an elevated set of capabilities to build a container image
  • Builds currently run as root to maintain adequate performance
  • Container workloads should run as non-root from the host's perspective. Containers running as root are a known security risk.
  • Builds currently run as root and require a privileged container. See BUILD-225 for removing the privileged container requirement.

Scenarios

  1. Run BuildConfigs in a multi-tenant environment
  2. Run BuildConfigs in a heightened security environment/deployment

Acceptance Criteria

  • Developers can opt into running builds in a cri-o user namespace by providing an environment variable with a specific value.
  • When the correct environment variable is provided, builds run in a cri-o user namespace, and the build pod does not require the "privileged: true" security context.
  • User namespace builds can pass basic test scenarios for the Docker and Source strategy build.
  • Steps to run unprivileged builds are documented.

Dependencies (internal and external)

  1. Buildah supports running inside a non-privileged container
  2. CRI-O allows workloads to opt into running containers in user namespaces.

Previous Work (Optional):

  1. BUILD-225 - remove privileged requirement for builds.

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

User Story

As a developer building container images on OpenShift
I want to specify that my build should run without elevated privileges
So that builds do not run as root from the host's perspective with elevated privileges

Acceptance Criteria

  • Developers can provide an environment variable to indicate the build should not use privileged containers
  • When the correct env var + value is specified, builds run in a user namespace (non-root on the host)

QE Impact

No QE required for Dev Preview. OpenShift regression testing will verify that existing behavior is not impacted.

Docs Impact

We will need to document how to enable this feature, with sufficient warnings regarding Dev Preview.

PX Impact

This likely warrants an OpenShift blog post, potentially?

Notes

Other Complete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled

Description of problem:

The storageclass "thin-csi" is created by vsphere-CSI-Driver-Operator, after deleting it manually, it should be re-created immediately. 

Version-Release number of selected component (if applicable):

4.11.4

How reproducible:

Always

Steps to Reproduce:

1. Check storageclass in running cluster, thin-csi is present:
$ oc get sc 
NAME             PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
thin (default)   kubernetes.io/vsphere-volume   Delete          Immediate              false                  41m
thin-csi         csi.vsphere.vmware.com         Delete          WaitForFirstConsumer   true                   38m
2. Delete thin-csi storageclass:
$ oc delete sc thin-csi
storageclass.storage.k8s.io "thin-csi" deleted
3. Check storageclass again, thin-csi is not present:
$ oc get sc
NAME             PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
thin (default)   kubernetes.io/vsphere-volume   Delete          Immediate           false                  50m
4. Check vmware-vsphere-csi-driver-operator log:
......
I0909 03:47:42.172866       1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1662695014\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1662695014\" (2022-09-09 02:43:34 +0000 UTC to 2023-09-09 02:43:34 +0000 UTC (now=2022-09-09 03:47:42.172853123 +0000 UTC))"I0909 03:49:38.294962       
1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOFI0909 03:49:38.295468       
1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOFI0909 03:49:38.295765       
1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF

5. Only first time creating in vmware-vsphere-csi-driver-operator log:
$ oc -n openshift-cluster-csi-drivers logs vmware-vsphere-csi-driver-operator-7cc6d44b5c-c8czw | grep -i "storageclass"I0909 03:46:31.865926   1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-csi-drivers", Name:"vmware-vsphere-csi-driver-operator", UID:"9e0c3e2d-d403-40a1-bf69-191d7aec202b", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'StorageClassCreated' Created StorageClass.storage.k8s.io/thin-csi because it was missing 

Actual results:

The storageclass "thin-csi" could not be re-created after deleting

Expected results:

The storageclass "thin-csi" should be re-created after deleting

Additional info:

 

Description of problem:

With every pod update we are executing a mutate operation to add the pod port to the port group or add the pod IP to an address set. This functionally doesn't hurt, since mutate will not add duplicate values to the same set. However, this is bad for performance. For example, with a 730 network policies affecting a pod, and issuing 7 pod updates would result in over 5k transactions.

This is a clone of issue OCPBUGS-13183. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10990. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10526. The following is the description of the original issue:

Description of problem:


Version-Release number of selected component (if applicable):

 4.13.0-0.nightly-2023-03-17-161027 

How reproducible:

Always

Steps to Reproduce:

1.  Create a GCP XPN cluster with flexy job template ipi-on-gcp/versioned-installer-xpn-ci, then 'oc descirbe node'

2. Check logs for cloud-network-config-controller pods

Actual results:


 % oc get nodes
NAME                                                          STATUS   ROLES                  AGE    VERSION
huirwang-0309d-r85mj-master-0.c.openshift-qe.internal         Ready    control-plane,master   173m   v1.26.2+06e8c46
huirwang-0309d-r85mj-master-1.c.openshift-qe.internal         Ready    control-plane,master   173m   v1.26.2+06e8c46
huirwang-0309d-r85mj-master-2.c.openshift-qe.internal         Ready    control-plane,master   173m   v1.26.2+06e8c46
huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal   Ready    worker                 162m   v1.26.2+06e8c46
huirwang-0309d-r85mj-worker-b-5txgq.c.openshift-qe.internal   Ready    worker                 162m   v1.26.2+06e8c46
 `oc describe node`, there is no related egressIP annotations 
% oc describe node huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal 
Name:               huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=n2-standard-4
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-central1
                    failure-domain.beta.kubernetes.io/zone=us-central1-a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal
                    kubernetes.io/os=linux
                    machine.openshift.io/interruptible-instance=
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=n2-standard-4
                    node.openshift.io/os_id=rhcos
                    topology.gke.io/zone=us-central1-a
                    topology.kubernetes.io/region=us-central1
                    topology.kubernetes.io/zone=us-central1-a
Annotations:        csi.volume.kubernetes.io/nodeid:
                      {"pd.csi.storage.gke.io":"projects/openshift-qe/zones/us-central1-a/instances/huirwang-0309d-r85mj-worker-a-wsrls"}
                    k8s.ovn.org/host-addresses: ["10.0.32.117"]
                    k8s.ovn.org/l3-gateway-config:
                      {"default":{"mode":"shared","interface-id":"br-ex_huirwang-0309d-r85mj-worker-a-wsrls.c.openshift-qe.internal","mac-address":"42:01:0a:00:...
                    k8s.ovn.org/node-chassis-id: 7fb1870c-4315-4dcb-910c-0f45c71ad6d3
                    k8s.ovn.org/node-gateway-router-lrp-ifaddr: {"ipv4":"100.64.0.5/16"}
                    k8s.ovn.org/node-mgmt-port-mac-address: 16:52:e3:8c:13:e2
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.32.117/32"}
                    k8s.ovn.org/node-subnets: {"default":["10.131.0.0/23"]}
                    machine.openshift.io/machine: openshift-machine-api/huirwang-0309d-r85mj-worker-a-wsrls
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-bec5065070ded51e002c566a9c5bd16a
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-bec5065070ded51e002c566a9c5bd16a
                    machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-bec5065070ded51e002c566a9c5bd16a
                    machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-bec5065070ded51e002c566a9c5bd16a
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true


 % oc logs cloud-network-config-controller-5cd96d477d-2kmc9  -n openshift-cloud-network-config-controller  
W0320 03:00:08.981493       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0320 03:00:08.982280       1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock...
E0320 03:00:38.982868       1 leaderelection.go:330] error retrieving resource lock openshift-cloud-network-config-controller/cloud-network-config-controller-lock: Get "https://api-int.huirwang-0309d.qe.gcp.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cloud-network-config-controller/configmaps/cloud-network-config-controller-lock": dial tcp: lookup api-int.huirwang-0309d.qe.gcp.devcluster.openshift.com: i/o timeout
E0320 03:01:23.863454       1 leaderelection.go:330] error retrieving resource lock openshift-cloud-network-config-controller/cloud-network-config-controller-lock: Get "https://api-int.huirwang-0309d.qe.gcp.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cloud-network-config-controller/configmaps/cloud-network-config-controller-lock": dial tcp: lookup api-int.huirwang-0309d.qe.gcp.devcluster.openshift.com on 172.30.0.10:53: read udp 10.129.0.14:52109->172.30.0.10:53: read: connection refused
I0320 03:02:19.249359       1 leaderelection.go:258] successfully acquired lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock
I0320 03:02:19.250662       1 controller.go:88] Starting node controller
I0320 03:02:19.250681       1 controller.go:91] Waiting for informer caches to sync for node workqueue
I0320 03:02:19.250693       1 controller.go:88] Starting secret controller
I0320 03:02:19.250703       1 controller.go:91] Waiting for informer caches to sync for secret workqueue
I0320 03:02:19.250709       1 controller.go:88] Starting cloud-private-ip-config controller
I0320 03:02:19.250715       1 controller.go:91] Waiting for informer caches to sync for cloud-private-ip-config workqueue
I0320 03:02:19.258642       1 controller.go:182] Assigning key: huirwang-0309d-r85mj-master-2.c.openshift-qe.internal to node workqueue
I0320 03:02:19.258671       1 controller.go:182] Assigning key: huirwang-0309d-r85mj-master-1.c.openshift-qe.internal to node workqueue
I0320 03:02:19.258682       1 controller.go:182] Assigning key: huirwang-0309d-r85mj-master-0.c.openshift-qe.internal to node workqueue
I0320 03:02:19.351258       1 controller.go:96] Starting node workers
I0320 03:02:19.351303       1 controller.go:102] Started node workers
I0320 03:02:19.351298       1 controller.go:96] Starting secret workers
I0320 03:02:19.351331       1 controller.go:102] Started secret workers
I0320 03:02:19.351265       1 controller.go:96] Starting cloud-private-ip-config workers
I0320 03:02:19.351508       1 controller.go:102] Started cloud-private-ip-config workers
E0320 03:02:19.589704       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-1.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-1.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue
E0320 03:02:19.615551       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-0.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-0.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue
E0320 03:02:19.644628       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-2.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-2.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue
E0320 03:02:19.774047       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-0.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-0.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue
E0320 03:02:19.783309       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-1.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-1.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue
E0320 03:02:19.816430       1 controller.go:165] error syncing 'huirwang-0309d-r85mj-master-2.c.openshift-qe.internal': error retrieving the private IP configuration for node: huirwang-0309d-r85mj-master-2.c.openshift-qe.internal, err: error retrieving the network interface subnets, err: googleapi: Error 404: The resource 'projects/openshift-qe/regions/us-central1/subnetworks/installer-shared-vpc-subnet-1' was not found, notFound, requeuing in node workqueue

Expected results:

EgressIP should work

Additional info:

It can be reproduced in  4.12 as well, not regression issue.

This bug is a backport clone of [Bugzilla Bug 2094362](https://bugzilla.redhat.com/show_bug.cgi?id=2094362). The following is the description of the original bug:

Description of problem:
A change [1] was introduced to split the kube-apiserver SLO rules into 2 groups to reduce the load on Prometheus (see bug 2004585).

[1] https://github.com/openshift/cluster-kube-apiserver-operator/commit/4a1751ee86cda37f0d9ea520cac09f91ebc3abe9

Version-Release number of selected component (if applicable):
4.9 (because the change was backported to 4.9.z)

How reproducible:
Always

Steps to Reproduce:
1. Install OCP 4.9
2. Retrieve kube-apiserver-slos*
oc get -n openshift-kube-apiserver prometheusrules kube-apiserver-slos -o yaml
oc get -n openshift-kube-apiserver prometheusrules kube-apiserver-slos-basic -o yaml

Actual results:

The KubeAPIErrorBudgetBurn alert with labels

{long="1h",namespace="openshift-kube-apiserver",severity="critical",short="5m"}

exists both in kube-apiserver-slos and kube-apiserver-slos-basic.

The alerting rules is evaluated twice. The same is true for recording rules like "apiserver_request:burnrate1h" and in this case, it can trigger warning logs in the Prometheus pods:

> level=warn component="rule manager" group=kube-apiserver.rules msg="Error on ingesting out-of-order result from rule evaluation" numDropped=283

Expected results:

I presume that kube-apiserver-slos shouldn't exist since it's been replaced by kube-apiserver-slos-basic and kube-apiserver-slos-extended.

Additional info:

Discovered while investigating bug 2091902

Description of problem:

Upgrade OCP 4.11 --> 4.12 fails with one 'NotReady,SchedulingDisabled' node and MachineConfigDaemonFailed.

Version-Release number of selected component (if applicable):

Upgrade from OCP 4.11.0-0.nightly-2022-09-19-214532 on top of OSP RHOS-16.2-RHEL-8-20220804.n.1 to 4.12.0-0.nightly-2022-09-20-040107.

Network Type: OVNKubernetes

How reproducible:

Twice out of two attempts.

Steps to Reproduce:

1. Install OCP 4.11.0-0.nightly-2022-09-19-214532 (IPI) on top of OSP RHOS-16.2-RHEL-8-20220804.n.1.
   The cluster is up and running with three workers:
   $ oc get clusterversion
   NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
   version   4.11.0-0.nightly-2022-09-19-214532   True        False         51m     Cluster version is 4.11.0-0.nightly-2022-09-19-214532

2. Run the OC command to upgrade to 4.12.0-0.nightly-2022-09-20-040107:
$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-09-20-040107 --allow-explicit-upgrade --force=true
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requesting update to release image registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-09-20-040107 

3. The upgrade is not succeeds: [0]
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-09-19-214532   True        True          17h     Unable to apply 4.12.0-0.nightly-2022-09-20-040107: wait has exceeded 40 minutes for these operators: network

One node degrided to 'NotReady,SchedulingDisabled' status:
$ oc get nodes
NAME                          STATUS                        ROLES    AGE   VERSION
ostest-9vllk-master-0         Ready                         master   19h   v1.24.0+07c9eb7
ostest-9vllk-master-1         Ready                         master   19h   v1.24.0+07c9eb7
ostest-9vllk-master-2         Ready                         master   19h   v1.24.0+07c9eb7
ostest-9vllk-worker-0-4x4pt   NotReady,SchedulingDisabled   worker   18h   v1.24.0+3882f8f
ostest-9vllk-worker-0-h6kcs   Ready                         worker   18h   v1.24.0+3882f8f
ostest-9vllk-worker-0-xhz9b   Ready                         worker   18h   v1.24.0+3882f8f

$ oc get pods -A | grep -v -e Completed -e Running
NAMESPACE                                          NAME                                                         READY   STATUS      RESTARTS       AGE
openshift-openstack-infra                          coredns-ostest-9vllk-worker-0-4x4pt                          0/2     Init:0/1    0              18h
 
$ oc get events
LAST SEEN   TYPE      REASON                                        OBJECT            MESSAGE
7m15s       Warning   OperatorDegraded: MachineConfigDaemonFailed   /machine-config   Unable to apply 4.12.0-0.nightly-2022-09-20-040107: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
7m15s       Warning   MachineConfigDaemonFailed                     /machine-config   Cluster not available for [{operator 4.11.0-0.nightly-2022-09-19-214532}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]

$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h    
baremetal                                  4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
cloud-controller-manager                   4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
cloud-credential                           4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
cluster-autoscaler                         4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
config-operator                            4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
console                                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h    
control-plane-machine-set                  4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h    
csi-snapshot-controller                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
dns                                        4.12.0-0.nightly-2022-09-20-040107   True        True          False      19h     DNS "default" reports Progressing=True: "Have 5 available node-resolver pods, want 6."
etcd                                       4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
image-registry                             4.12.0-0.nightly-2022-09-20-040107   True        True          False      18h     Progressing: The registry is ready...
ingress                                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h    
insights                                   4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
kube-apiserver                             4.12.0-0.nightly-2022-09-20-040107   True        True          False      18h     NodeInstallerProgressing: 1 nodes are at revision 11; 2 nodes are at revision 13
kube-controller-manager                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
kube-scheduler                             4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
kube-storage-version-migrator              4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
machine-api                                4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
machine-approver                           4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
machine-config                             4.11.0-0.nightly-2022-09-19-214532   False       True          True       16h     Cluster not available for [{operator 4.11.0-0.nightly-2022-09-19-214532}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
marketplace                                4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
monitoring                                 4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h    
network                                    4.12.0-0.nightly-2022-09-20-040107   True        True          True       19h     DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2022-09-20T14:16:13Z...
node-tuning                                4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h    
openshift-apiserver                        4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h    
openshift-controller-manager               4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h    
openshift-samples                          4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h    
operator-lifecycle-manager                 4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
service-ca                                 4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h    
storage                                    4.12.0-0.nightly-2022-09-20-040107   True        True          False      19h     ManilaCSIDriverOperatorCRProgressing: ManilaDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...

[0] http://pastebin.test.redhat.com/1074531

Actual results:

OCP 4.11 --> 4.12 upgrade fails.

Expected results:

OCP 4.11 --> 4.12 upgrade success.

Additional info:

Attached logs of the NotReady node - [^journalctl_ostest-9vllk-worker-0-4x4pt.log.tar.gz]

This is a clone of issue OCPBUGS-8044. The following is the description of the original issue:

Description of problem:

rhbz#2089199, backported to 4.11.5, shifted the etcd Grafana dashboard from the monitoring operator to the etcd operator. During the shift, the ConfigMap was renamed from grafana-dashboard-etcd to etcd-dashboard. However, we did not include logic for garbage-collecting the obsolete dasboard, so clusters that update from 4.11.1 and similar into 4.11.>=5 or 4.12+ currently end up with both the obsolete and new ConfigMaps. We should grow code to remove the obsolete ConfigMap.

Version-Release number of selected component (if applicable):

4.11.>=5 and 4.12+ are currently exposed.

How reproducible:

100%

Steps to Reproduce:

1. Install 4.11.1.
2. Update to a release that defines the etcd-dashboard ConfigMap.
3. Check for etcd dashboards with oc -n openshift-config-managed get configmaps | grep etcd.

Actual results:

Both etcd-dashboard and grafana-dashboard-etcd exist:

$ oc -n openshift-config-managed get configmaps | grep etcd
etcd-dashboard                                        1      196d
grafana-dashboard-etcd                                1      2y282d

Another example is 4.11.1 to 4.11.5 CI:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1570415394001260544/artifacts/e2e-aws-upgrade/configmaps.json | jq -r '.items[].metadata | select(.namespace == "openshift-config-managed" and (.name | contains("etcd"))) | .name'
etcd-dashboard
grafana-dashboard-etcd

Expected results:

Only etcd-dashboard still exists.

Additional info:

A new manifest for the outgoing ConfigMap that sets the release.openshift.io/delete: "true" annotation would ask the cluster-version operator to reap the obsolete ConfigMap.

An RW mutex was introduced to the project auth cache with https://github.com/openshift/openshift-apiserver/pull/267, taking exclusive access during cache syncs. On clusters with extremely high object counts for namespaces and RBAC, syncs appear to be extremely slow (on the order of several minutes). The project LIST handler acquires the same mutex in shared mode as part of its critical path.

Description of problem:

The metal3 Pod of openshift-machine-api is in CrashLoopBackOff status.

Version-Release number of selected component (if applicable):

4.10.31

How reproducible:

Always reproduce in IPv6

Steps to Reproduce:

1. Preparing to configure ipi on the provisioning node
   - RHEL 8 ( haproxy, named, mirror registry, rhcos_cache_server ..)

2. configuring the install-config.yaml (attached)
   - provisioningNetwork: disabled
   - machine network: only IPv6
   - disconnected installation
   
3. deploy the cluster 

Actual results:

It is not possible to add worker nodes because metal3 does not start normally.

Expected results:

metal3 starts normally in IPv6 environment

Additional info:

1. attached must-gather
https://drive.google.com/file/d/1GKxj3syROIMnURx_PYzOYhJdEuXBNXVW/view?usp=sharing

2. pod status
[kni@prov ~]$ oc get pod
NAME                                           READY   STATUS             RESTARTS          AGE
cluster-autoscaler-operator-6656bfd7b9-bt4j8   2/2     Running            0                 35h
cluster-baremetal-operator-6bbdd6758-rmxgq     2/2     Running            0                 35h
machine-api-controllers-55fb545b56-kl5sj       7/7     Running            0                 34h
machine-api-operator-845b6cf855-q7gdd          2/2     Running            0                 35h
metal3-574876cfdb-98fmz                        6/7     CrashLoopBackOff   179 (3m17s ago)   14h
metal3-image-cache-5mq2w                       1/1     Running            0                 14h
metal3-image-cache-nftpj                       1/1     Running            0                 14h
metal3-image-cache-t7whh                       1/1     Running            0                 14h
metal3-image-customization-68d4d6d99b-dbqgn    1/1     Running            0                 15h

[kni@prov ~]$ oc logs metal3-574876cfdb-98fmz -c metal3-httpd
AH00526: Syntax error on line 8 of /etc/httpd/conf.d/vmedia.conf:
The port number "2001:feed:101::102" is outside the appropriate range (i.e., 1..65535).

This is a clone of issue OCPBUGS-6850. The following is the description of the original issue:

This is a clone of issue OCPBUGS-6503. The following is the description of the original issue:

Description of problem:

While looking into OCPBUGS-5505 I discovered that some 4.10->4.11 upgrade job runs perform an Admin Ack check, while some do not. 4.11 has a ack-4.11-kube-1.25-api-removals-in-4.12 gate, so these upgrade jobs sometimes test that Upgradeable goes false after the ugprade, and sometimes they do not. This is only determined by the polling race condition: the check is executed once per 10 minutes, and we cancel the polling after upgrade is completed. This means that in some cases we are lucky and manage to run one check before the cancel, and sometimes we are not and only check while still on the base version.

Example job that checked admin acks post-upgrade:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444032104304640

$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444032104304640/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Waiting for Upgradeable to be AdminAckRequired'
Jan  6 21:16:40.153: INFO: Waiting for Upgradeable to be AdminAckRequired ...

Example job that did not check admin acks post-upgrade:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444033509396480

$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444033509396480/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Waiting for Upgradeable to be AdminAckRequired'

Version-Release number of selected component (if applicable):

4.11+ openshift-tests

How reproducible:

nondeterministic, wild guess is ~30% of upgrade jobs

Steps to Reproduce:

1. Inspect the E2E test log of an upgrade jobs and compare the time of the update ("Completed upgrade") with the time of the last check ( "Skipping admin ack", "Gate .* not applicable to current version", "Admin Ack verified') done by the admin ack test

Actual results:

Jan 23 00:47:43.842: INFO: Admin Ack verified
Jan 23 00:57:43.836: INFO: Admin Ack verified
Jan 23 01:07:43.839: INFO: Admin Ack verified
Jan 23 01:17:33.474: INFO: Completed upgrade to registry.build01.ci.openshift.org/ci-op-z09ll8fw/release@sha256:322cf67dc00dd6fa4fdd25c3530e4e75800f6306bd86c4ad1418c92770d58ab8

No check done after the upgrade

Expected results:

Jan 23 00:57:37.894: INFO: Admin Ack verified
Jan 23 01:07:37.894: INFO: Admin Ack verified
Jan 23 01:16:43.618: INFO: Completed upgrade to registry.build01.ci.openshift.org/ci-op-z8h5x1c5/release@sha256:9c4c732a0b4c2ae887c73b35685e52146518e5d2b06726465d99e6a83ccfee8d
Jan 23 01:17:57.937: INFO: Admin Ack verified

One or more checks done after upgrade

Description of problem:

The last egressIP in IP capacity cannot be applied correctly

Version-Release number of selected component (if applicable):

4.11.0-0.nightly-2022-11-27-164248 

How reproducible:

Found this failure in automation case which used to pass

Steps to Reproduce:

In AWS, one worker node is with IP capacity "ipv4:14"
 $oc describe node ip-10-0-129-21.us-west-2.compute.internal | egrep -C 3 egress
Annotations:        cloud.network.openshift.io/egress-ipconfig:
                      [{"interface":"eni-03820ba0eca427fbf","ifaddr":{"ipv4":"10.0.128.0/18"},"capacity":{"ipv4":14,"ipv6":15}}]
                    csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0cb3ae15bf3cffd9f"}
                    k8s.ovn.org/host-addresses: ["10.0.129.21"]

1. Label above node as egress node
2. Created 15 egressIP objects, each egressIP object with one egressIP.

Actual results:

 13 egressIPs applied correctly , last one is in wrong status.
% oc get egressip
NAME                EGRESSIPS      ASSIGNED NODE                               ASSIGNED EGRESSIPS
egressip-47208-0    10.0.135.200   ip-10-0-129-21.us-west-2.compute.internal   10.0.135.200
egressip-47208-1    10.0.162.178   ip-10-0-129-21.us-west-2.compute.internal   10.0.162.178
egressip-47208-10   10.0.144.46    ip-10-0-129-21.us-west-2.compute.internal   10.0.144.46
egressip-47208-11   10.0.191.91    ip-10-0-129-21.us-west-2.compute.internal   10.0.191.91
egressip-47208-12   10.0.133.215   ip-10-0-129-21.us-west-2.compute.internal   10.0.133.215
egressip-47208-13   10.0.174.207   ip-10-0-129-21.us-west-2.compute.internal   10.0.174.207
egressip-47208-14   10.0.176.224   ip-10-0-129-21.us-west-2.compute.internal   10.0.176.224
egressip-47208-2    10.0.184.114   ip-10-0-129-21.us-west-2.compute.internal   10.0.184.114
egressip-47208-3    10.0.167.224   ip-10-0-129-21.us-west-2.compute.internal   10.0.167.224
egressip-47208-4    10.0.187.148   ip-10-0-129-21.us-west-2.compute.internal   10.0.187.148
egressip-47208-5    10.0.184.109   ip-10-0-129-21.us-west-2.compute.internal   10.0.184.109
egressip-47208-6    10.0.155.208                                               
egressip-47208-7    10.0.134.13    ip-10-0-129-21.us-west-2.compute.internal   10.0.134.13
egressip-47208-8    10.0.142.255   ip-10-0-129-21.us-west-2.compute.internal   10.0.142.255
egressip-47208-9    10.0.170.197

% oc get cloudprivateipconfig
NAME           AGE
10.0.133.215   113s
10.0.134.13    113s
10.0.135.200   113s
10.0.142.255   113s
10.0.144.46    113s
10.0.155.208   113s
10.0.162.178   113s
10.0.167.224   113s
10.0.174.207   113s
10.0.176.224   113s
10.0.184.109   113s
10.0.184.114   113s
10.0.187.148   113s
10.0.191.91    113s

 oc get cloudprivateipconfig 10.0.155.208  -o yaml
apiVersion: cloud.network.openshift.io/v1
kind: CloudPrivateIPConfig
metadata:
  annotations:
    k8s.ovn.org/egressip-owner-ref: egressip-47208-6
  creationTimestamp: "2022-11-28T09:13:47Z"
  finalizers:
  - cloudprivateipconfig.cloud.network.openshift.io/finalizer
  generation: 1
  name: 10.0.155.208
  resourceVersion: "72869"
  uid: 0143a07b-0a30-4de8-bfd7-589ed0c3d7dc
spec:
  node: ip-10-0-129-21.us-west-2.compute.internal
status:
  conditions:
  - lastTransitionTime: "2022-11-28T09:16:51Z"
    message: 'Error processing cloud assignment request, err: <nil>'
    observedGeneration: 1
    reason: CloudResponseError
    status: "False"
    type: Assigned
  node: ip-10-0-129-21.us-west-2.compute.internal

Expected results:

As IP capacity for this node is 14, so here should have 14 egress IP applied successfully. No cloudprivateipconfig IP in error status.


Additional info:


This is a clone of issue OCPBUGS-1629. The following is the description of the original issue:

Description of problem:

It is a disconnected cluster on AWS. There is an issue configuring Egress IP where the cluster uses STS. While looking into cloud-network-config-controller pod it is trying to connect to the global sts service "https://sts.amazonaws.com/" rather it should connect to the regional one "https://ec2.ap-southeast-1.amazonaws.com".

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Create a disconected OCP cluster on AWS.
$ oc get netnamespace | grep egress
egress-ip-test                                     2689387    ["172.16.1.24"]
$ oc get hostsubnet
NAME                                              HOST                                              HOST IP        SUBNET          EGRESS CIDRS   EGRESS IPS
ip-172-16-1-151.ap-southeast-1.compute.internal   ip-172-16-1-151.ap-southeast-1.compute.internal   172.16.1.151   10.130.0.0/23                  
ip-172-16-1-53.ap-southeast-1.compute.internal    ip-172-16-1-53.ap-southeast-1.compute.internal    172.16.1.53    10.131.0.0/23                  ["172.16.1.24"]
ip-172-16-2-15.ap-southeast-1.compute.internal    ip-172-16-2-15.ap-southeast-1.compute.internal    172.16.2.15    10.128.0.0/23                  
ip-172-16-2-77.ap-southeast-1.compute.internal    ip-172-16-2-77.ap-southeast-1.compute.internal    172.16.2.77    10.128.2.0/23                  
ip-172-16-3-111.ap-southeast-1.compute.internal   ip-172-16-3-111.ap-southeast-1.compute.internal   172.16.3.111   10.129.0.0/23                  
ip-172-16-3-79.ap-southeast-1.compute.internal    ip-172-16-3-79.ap-southeast-1.compute.internal    172.16.3.79    10.129.2.0/23                  
$ oc logs sdn-controller-6m5kb -n openshift-sdn I0922 04:09:53.348615       1 vnids.go:105] Allocated netid 2689387 for namespace "egress-ip-test"
E0922 04:24:00.682018       1 egressip.go:254] Ignoring invalid HostSubnet ip-172-16-1-53.ap-southeast-1.compute.internal (host: "ip-172-16-1-53.ap-southeast-1.compute.internal", ip: "172.16.1.53", subnet: "10.131.0.0/23"): related node object "ip-172-16-1-53.ap-southeast-1.compute.internal" has an incomplete annotation "cloud.network.openshift.io/egress-ipconfig", CloudEgressIPConfig: <nil>
 $ oc logs cloud-network-config-controller-5c7556db9f-x78bs -n openshift-cloud-network-config-controller

E0922 04:26:59.468726       1 controller.go:165] error syncing 'ip-172-16-2-77.ap-southeast-1.compute.internal': error retrieving the private IP configuration for node: ip-172-16-2-77.ap-southeast-1.compute.internal, err: error: cannot list ec2 instance for node: ip-172-16-2-77.ap-southeast-1.compute.internal, err: WebIdentityErr: failed to retrieve credentials
caused by: RequestError: send request failed
caused by: Post "https://sts.amazonaws.com/": dial tcp 54.239.29.25:443: i/o timeout, requeuing in node workqueue
$ oc get Infrastructure -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: Infrastructure
  metadata:
    creationTimestamp: "2022-09-22T03:28:15Z"
    generation: 1
    name: cluster
    resourceVersion: "598"
    uid: 994da301-2a96-43b7-b43b-4b7c18d4b716
  spec:
    cloudConfig:
      name: ""
    platformSpec:
      aws:
        serviceEndpoints:
        - name: sts
          url: https://sts.ap-southeast-1.amazonaws.com
        - name: ec2
          url: https://ec2.ap-southeast-1.amazonaws.com
        - name: elasticloadbalancing
          url: https://elasticloadbalancing.ap-southeast-1.amazonaws.com
      type: AWS
  status:
    apiServerInternalURI: https://api-int.openshiftyy.ocpaws.sadiqueonline.com:6443
    apiServerURL: https://api.openshiftyy.ocpaws.sadiqueonline.com:6443
    controlPlaneTopology: HighlyAvailable
    etcdDiscoveryDomain: ""
    infrastructureName: openshiftyy-wfrpf
    infrastructureTopology: HighlyAvailable
    platform: AWS
    platformStatus:
      aws:
        region: ap-southeast-1
        serviceEndpoints:
        - name: ec2
          url: https://ec2.ap-southeast-1.amazonaws.com
        - name: elasticloadbalancing
          url: https://elasticloadbalancing.ap-southeast-1.amazonaws.com
        - name: sts
          url: https://sts.ap-southeast-1.amazonaws.com
      type: AWS
kind: List
metadata:
  resourceVersion: ""
$ oc get secret aws-cloud-credentials -n openshift-machine-api -o json |jq -r .data.credentials |base64 -d 
[default]
sts_regional_endpoints = regional
role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-machine-api-aws-cloud-credentials
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token
 
[ec2-user@ip-172-17-1-229 ~]$ oc get secret cloud-credential-operator-iam-ro-creds -n openshift-cloud-credential-operator -o json |jq -r .data.credentials |base64 -d 
[default]
sts_regional_endpoints = regional
role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-cloud-credential-operator-cloud-creden
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token
 
[ec2-user@ip-172-17-1-229 ~]$ oc get secret installer-cloud-credentials -n openshift-image-registry -o json |jq -r .data.credentials |base64 -d 
[default]
sts_regional_endpoints = regional
role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-image-registry-installer-cloud-credent
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token
 
[ec2-user@ip-172-17-1-229 ~]$ oc get secret cloud-credentials -n openshift-ingress-operator -o json |jq -r .data.credentials |base64 -d 
[default]
sts_regional_endpoints = regional
role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-ingress-operator-cloud-credentials
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token
 
[ec2-user@ip-172-17-1-229 ~]$ oc get secret cloud-credentials -n openshift-cloud-network-config-controller -o json |jq -r .data.credentials |base64 -d 
[default]
sts_regional_endpoints = regional
role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-cloud-network-config-controller-cloud-
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token
 
[ec2-user@ip-172-17-1-229 ~]$ oc get secret ebs-cloud-credentials -n openshift-cluster-csi-drivers -o json |jq -r .data.credentials |base64 -d
[default]
sts_regional_endpoints = regional
role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-cluster-csi-drivers-ebs-cloud-credenti
web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token
 

 

Actual results:

Egress IP not configured properly and cloud-network-config-controller trying to connect to global STS service.

Expected results:

Egress IP should get configured and cloud-network-config-controller should connect to regional STS service instead of global.

Additional info:

 

Description of problem:

In a 4.11 cluster with only openshift-samples enabled, the 4.12 introduced optional COs console and insights are installed. While upgrading to 4.12, CVO considers them to be disabled explicitly and skips reconciling them. So these COs are not upgraded to 4.12.

Installed COs cannot be disabled, so CVO is supposed to implicitly enable them.


$ oc get clusterversion -oyaml
{
  "apiVersion": "config.openshift.io/v1",
     "kind": "ClusterVersion",
     "metadata": {
         "creationTimestamp": "2022-09-30T05:02:31Z",
         "generation": 3,
         "name": "version",
         "resourceVersion": "134808",
         "uid": "bd95473f-ffda-402d-8fe3-74f852a9d6eb"
     },
     "spec": {
         "capabilities": {
             "additionalEnabledCapabilities": [
                 "openshift-samples"
             ],
             "baselineCapabilitySet": "None"
         },
         "channel": "stable-4.11",
         "clusterID": "8eda5167-a730-4b39-be1d-214a80506d34",
         "desiredUpdate": {
             "force": true,
             "image": "registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc",
             "version": ""
         }
     },
     "status": {
         "availableUpdates": null,
         "capabilities": {
             "enabledCapabilities": [
                 "openshift-samples"
             ],
             "knownCapabilities": [
                 "Console",
                 "Insights",
                 "Storage",
                 "baremetal",
                 "marketplace",
                 "openshift-samples"
             ]
         },
         "conditions": [
             {
                 "lastTransitionTime": "2022-09-30T05:02:33Z",
                 "message": "Unable to retrieve available updates: currently reconciling cluster version 4.12.0-0.nightly-2022-09-28-204419 not found in the \"stable-4.11\" channel",
                 "reason": "VersionNotFound",
                 "status": "False",
                 "type": "RetrievedUpdates"
             },
             {
                 "lastTransitionTime": "2022-09-30T05:02:33Z",
                 "message": "Capabilities match configured spec",
                 "reason": "AsExpected",
                 "status": "False",
                 "type": "ImplicitlyEnabledCapabilities"
             },
             {
                 "lastTransitionTime": "2022-09-30T05:02:33Z",
                 "message": "Payload loaded version=\"4.12.0-0.nightly-2022-09-28-204419\" image=\"registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc\" architecture=\"amd64\"",
                 "reason": "PayloadLoaded",
                 "status": "True",
                 "type": "ReleaseAccepted"
             },
             {
                 "lastTransitionTime": "2022-09-30T05:23:18Z",
                 "message": "Done applying 4.12.0-0.nightly-2022-09-28-204419",
                 "status": "True",
                 "type": "Available"
             },
             {
                 "lastTransitionTime": "2022-09-30T07:05:42Z",
                 "status": "False",
                 "type": "Failing"
             },
             {
                 "lastTransitionTime": "2022-09-30T07:41:53Z",
                 "message": "Cluster version is 4.12.0-0.nightly-2022-09-28-204419",
                 "status": "False",
                 "type": "Progressing"
             }
         ],
         "desired": {
             "image": "registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc",
             "version": "4.12.0-0.nightly-2022-09-28-204419"
         },
         "history": [
             {
                 "completionTime": "2022-09-30T07:41:53Z",
                 "image": "registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc",
                 "startedTime": "2022-09-30T06:42:01Z",
                 "state": "Completed",
                 "verified": false,
                 "version": "4.12.0-0.nightly-2022-09-28-204419"
             },
             {
                 "completionTime": "2022-09-30T05:23:18Z",
                 "image": "registry.ci.openshift.org/ocp/release@sha256:5a6f6d1bf5c752c75d7554aa927c06b5ea0880b51909e83387ee4d3bca424631",
                 "startedTime": "2022-09-30T05:02:33Z",
                 "state": "Completed",
                 "verified": false,
                 "version": "4.11.0-0.nightly-2022-09-29-191451"
             }
         ],
         "observedGeneration": 3,
         "versionHash": "CSCJ2fxM_2o="
     }
 }

$ oc get co
 NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      93m     
cloud-controller-manager                   4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h56m   
cloud-credential                           4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h59m   
cluster-autoscaler                         4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h53m   
config-operator                            4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m   
console                                    4.11.0-0.nightly-2022-09-29-191451   True        False         False      3h45m   
control-plane-machine-set                  4.12.0-0.nightly-2022-09-28-204419   True        False         False      117m    
csi-snapshot-controller                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m   
dns                                        4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h53m   
etcd                                       4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h52m   
image-registry                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h46m   
ingress                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      151m    
insights                                   4.11.0-0.nightly-2022-09-29-191451   True        False         False      3h48m   
kube-apiserver                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h50m   
kube-controller-manager                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h51m   
kube-scheduler                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h51m   
kube-storage-version-migrator              4.12.0-0.nightly-2022-09-28-204419   True        False         False      91m     
machine-api                                4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h50m   
machine-approver                           4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m   
machine-config                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h52m   
monitoring                                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h44m   
network                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h55m   
node-tuning                                4.12.0-0.nightly-2022-09-28-204419   True        False         False      113m    
openshift-apiserver                        4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h48m   
openshift-controller-manager               4.12.0-0.nightly-2022-09-28-204419   True        False         False      113m    
openshift-samples                          4.12.0-0.nightly-2022-09-28-204419   True        False         False      116m    
operator-lifecycle-manager                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m   
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m   
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h48m   
service-ca                                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m   
storage                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m 

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-28-204419

How reproducible:

Always

Steps to Reproduce:

1. Install a 4.11 cluster with only openshift-samples enabled
2. Upgrade to 4.12
3.

Actual results:

The 4.12 introduced optional CO console and insights are not upgraded to 4.12

Expected results:

All the installed COs get upgraded

Additional info:

 

This is a clone of issue OCPBUGS-11404. The following is the description of the original issue:

This is a clone of issue OCPBUGS-11333. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10690. The following is the description of the original issue:

Description of problem:

according to PR: https://github.com/openshift/cluster-monitoring-operator/pull/1824, startupProbe for UWM prometheus/platform prometheus should be 1 hour, but startupProbe for UWM prometheus is still 15m after enabled UWM, platform promethues does not have issue, startupProbe is increased to 1 hour

$ oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-0 -oyaml | grep startupProbe -A20
    startupProbe:
      exec:
        command:
        - sh
        - -c
        - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready;
          elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready;
          else exit 1; fi
      failureThreshold: 60
      periodSeconds: 15
      successThreshold: 1
      timeoutSeconds: 3
...

$ oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep startupProbe -A20
    startupProbe:
      exec:
        command:
        - sh
        - -c
        - if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready;
          elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready;
          else exit 1; fi
      failureThreshold: 240
      periodSeconds: 15
      successThreshold: 1
      timeoutSeconds: 3

 

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-19-052243

How reproducible:

always

Steps to Reproduce:

1. enable UWM, check startupProbe for UWM prometheus/platform prometheus
2.
3.

Actual results:

startupProbe for UWM prometheus is still 15m

Expected results:

startupProbe for UWM prometheus should be 1 hour

Additional info:

since startupProbe for platform prometheus is increased to 1 hour, and no similar bug for UWM prometheus, won't fix the issue is OK.

This is a clone of issue OCPBUGS-1428. The following is the description of the original issue:

Description of problem:

When using an OperatorGroup attached to a service account, AND if there is a secret present in the namespace, the operator installation will fail with the message:
the service account does not have any API secret sa=testx-ns/testx-sa
This issue seems similar to https://bugzilla.redhat.com/show_bug.cgi?id=2094303 - which was resolved in 4.11.0 - however, the new element now, is that the presence of a secret in the namespace  is causing the issue.
The name of the secret seems significant - suggesting something somewhere is depending on the order that secrets are listed in. For example, If the secret in the namespace is called "asecret", the problem does not occur. If it is called "zsecret", the problem always occurs.
"zsecret" is not a "kubernetes.io/service-account-token". The issue I have raised here relates to Opaque secrets - zsecret is an Opaque secret. The issue may apply to other types of secrets, but specifically my issue is that when there is an opaque secret present in the namespace, the operator install fails as described. I aught to be allowed to have an opaque secret present in the namespace where I am installing the operator.

Version-Release number of selected component (if applicable):

4.11.0 & 4.11.1

How reproducible:

100% reproducible

Steps to Reproduce:

1.Create namespace: oc new-project testx-ns
2. oc apply -f api-secret-issue.yaml

Actual results:

 

Expected results:

 

Additional info:

API YAML:

cat api-secret-issue.yaml 
apiVersion: v1
kind: Secret
metadata:
  name: zsecret
  namespace: testx-ns
  annotations:
   kubernetes.io/service-account.name: testx-sa
type: Opaque
stringData:
  mykey: mypass

apiVersion: v1
kind: ServiceAccount
metadata:
  name: testx-sa
  namespace: testx-ns

kind: OperatorGroup
apiVersion: operators.coreos.com/v1
metadata:
  name: testx-og
  namespace: testx-ns
spec:
  serviceAccountName: "testx-sa"
  targetNamespaces:
  - testx-ns

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: testx-role
  namespace: testx-ns
rules:

  • apiGroups: ["*"]
      resources: ["*"]
      verbs: ["*"] 
      

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: testx-rolebinding
  namespace: testx-ns
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: testx-role
subjects:

  • kind: ServiceAccount
      name: testx-sa
      namespace: testx-ns

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd-operator
  namespace: testx-ns
spec:
  channel: singlenamespace-alpha
  installPlanApproval: Automatic
  name: etcd
  source: community-operators
  sourceNamespace: openshift-marketplace

This is a clone of issue OCPBUGS-10496. The following is the description of the original issue:

Description of problem:

Customer is running machine learning (ML) tasks on OpenShift Container Platform, for which large models need to be embedded in the container image. When building a new container image with large container image layers (>=10GB) and pushing it to the internal image registry, this fails with the following error message:

error: build error: Failed to push image: writing blob: uploading layer to https://image-registry.openshift-image-registry.svc:5000/v2/example/example-image/blobs/uploads/b305b374-af79-4dce-afe0-afe6893b0ada?_state=[..]: blob upload invalid

In the image registry Pod we can see the following error message:

time="2023-01-30T14:12:22.315726147Z" level=error msg="upload resumed at wrong offest: 10485760000 != 10738341637" [..]
time="2023-01-30T14:12:22.338264863Z" level=error msg="response completed with error" err.code="blob upload invalid" err.message="blob upload invalid" [..]

Backend storage is AWS S3. We suspect that this could be the following upstream bug: https://github.com/distribution/distribution/issues/1698

Version-Release number of selected component (if applicable):

Customer encountered the issue on OCP 4.11.20. We reproduced the issue on OCP 4.11.21:

$  oc version
Client Version: 4.12.0
Kustomize Version: v4.5.7
Server Version: 4.11.21
Kubernetes Version: v1.24.6+5658434

How reproducible:

Always

Steps to Reproduce:

1. Install OpenShift Container Platform cluster 4.11.21 on AWS
2. Confirm registry storage is on AWS S3
3. Create a new build including a 10GB file using the following command: `printf "FROM registry.fedoraproject.org/fedora:37\nRUN dd if=/dev/urandom of=/bigfile bs=1M count=10240" | oc new-build -D -`
4. Wait for some time for the build to run

Actual results:

Pushing the new build fails with the following error message:

error: build error: Failed to push image: writing blob: uploading layer to https://image-registry.openshift-image-registry.svc:5000/v2/example/example-image/blobs/uploads/b305b374-af79-4dce-afe0-afe6893b0ada?_state=[..]: blob upload invalid

Expected results:

Push of large container image layers succeeds

Additional info:

Description of problem:

Manual backport of 
* https://github.com/openshift/cluster-dns-operator/pull/336
* https://github.com/openshift/cluster-dns-operator/pull/339

Version-Release number of selected component (if applicable):

4.11

Description of problem:

Version-Release number of selected component (if applicable):
4.11

How reproducible:
Always

Steps to Reproduce:
1. Enable UWM + dedicated UWM Alertmanager
2. Deploy an application + service monitor + alerting rule which fires always
3. Go to the OCP dev console and silence the alert.

Actual results:
Nothing happens

Expected results:
The alert notification is muted.

Additional info:
Copied from https://bugzilla.redhat.com/show_bug.cgi?id=2100860

ENV:
OCP 4.11.29
VSphere IPI

ISSUE:
If scaling up 3-5 machines the machines will transition to running, join the cluster.

If scaling up a large number of machines, 10-20+ some will transition to running and join the cluster.
The remaining will stay in provisioned/provisioning state and never power on
If those nodes are manually powered on they join the cluster and are healthy.

OBSERVATION:
There are ~16k tags shared across multiple VMWare Cloud Foundation data centers.
In the logs it is observed the reconciliation of tags occur and then some of the nodes will power on and transition to running.
Do not see in the logs where the reconciliation of the tags on the remaining nodes runs again or completes.

Clone of https://bugzilla.redhat.com/show_bug.cgi?id=2106803 to backport the e2e fix to 4.11 and 4.10.

Description of problem: E2E: intermittent failure is seen on tests for devfile due to network call to devfile registry

Deploy git workload with devfile from topology page: A-04-TC01

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_console/11726/pull-ci-openshift-console-master-e2e-gcp-console/1547046629775773696/artifacts/e2e-gcp-console/test/artifacts/gui_test_screenshots/cypress/screenshots/e2e/add-flow-ci.feature/Deploy%20git%20workload%20with%20devfile%20from%20topology%20page%20A-04-TC01%20--%20after%20each%20hook%20(failed).png

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_console/11768/pull-ci-openshift-console-master-e2e-gcp-console/1547061730478133248

Version-Release number of selected component (if applicable):

How reproducible: Intermittent

Steps to Reproduce:
1. Run test for add-flow-ci.feature to test Deploy git workload with devfile from topology page: A-04-TC01

Actual results:

Expected results: Show always pass

Additional info:

This is a clone of issue OCPBUGS-1329. The following is the description of the original issue:

Description of problem:

etcd and kube-apiserver pods get restarted due to failed liveness probes while deleting/re-creating pods on SNO

Version-Release number of selected component (if applicable):

4.10.32

How reproducible:

Not always, after ~10 attempts

Steps to Reproduce:

1. Deploy SNO with Telco DU profile applied
2. Create multiple pods with local storage volumes attached(attaching yaml manifest)
3. Force delete and re-create pods 10 times

Actual results:

etcd and kube-apiserver pods get restarted, making to cluster unavailable for a period of time

Expected results:

etcd and kube-apiserver do not get restarted

Additional info:

Attaching must-gather.

Please let me know if any additional info is required. Thank you!

Description of problem:

Customer is facing issue similar to https://github.com/devfile/api/issues/897

Version-Release number of selected component (if applicable):

OCP 4.10.17

How reproducible:
N/A
Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Tried working around it with ALL_PROXY but it did not help. Note because the console operator reverts changes pretty quickly testing this was a bit of a PITA

This is a clone of issue OCPBUGS-2438. The following is the description of the original issue:

Description of problem:

On the alert details page and alerting rule details page, clicking on a field that has a popover help throws an uncaught JavaScript error.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Go to Observe > Alerting pages
2. Click on an alert (or go to the rules tab then click on a rule)
3. Click on one of the underlined fields (those that have a popover help)

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-10943. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10661. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10591. The following is the description of the original issue:

Description of problem:

Starting with 4.12.0-0.nightly-2023-03-13-172313, the machine API operator began receiving an invalid version tag either due to a missing or invalid VERSION_OVERRIDE(https://github.com/openshift/machine-api-operator/blob/release-4.12/hack/go-build.sh#L17-L20) value being passed tot he build.

This is resulting in all jobs invoked by the 4.12 nightlies failing to install.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2023-03-13-172313 and later

How reproducible:

consistently in 4.12 nightlies only(ci builds do not seem to be impacted).

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

Example of failure https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-csi/1635331349046890496/artifacts/e2e-aws-csi/gather-extra/artifacts/pods/openshift-machine-api_machine-api-operator-866d7647bd-6lhl4_machine-api-operator.log

This is a clone of issue OCPBUGS-9955. The following is the description of the original issue:

Description of problem:

OCP cluster installation (SNO) using assisted installer running on ACM hub cluster. 
Hub cluster is OCP 4.10.33
ACM is 2.5.4

When a cluster fails to install we remove the installation CRs and cluster namespace from the hub cluster (to eventually redeploy). The termination of the namespace hangs indefinitely (14+ hours) with finalizers remaining. 

To resolve the hang we can remove the finalizers by editing both the secret pointed to by BareMetalHost .spec.bmc.credentialsName and BareMetalHost CR. When these finalizers are removed the namespace termination completes within a few seconds.

Version-Release number of selected component (if applicable):

OCP 4.10.33
ACM 2.5.4

How reproducible:

Always

Steps to Reproduce:

1. Generate installation CRs (AgentClusterInstall, BMH, ClusterDeployment, InfraEnv, NMStateConfig, ...) with an invalid configuration parameter. Two scenarios validated to hit this issue:
  a. Invalid rootDeviceHint in BareMetalHost CR
  b. Invalid credentials in the secret referenced by BareMetalHost.spec.bmc.credentialsName
2. Apply installation CRs to hub cluster
3. Wait for cluster installation to fail
4. Remove cluster installation CRs and namespace

Actual results:

Cluster namespace remains in terminating state indefinitely:
$ oc get ns cnfocto1
NAME       STATUS        AGE    
cnfocto1   Terminating   17h

Expected results:

Cluster namespace (and all installation CRs in it) are successfully removed.

Additional info:

The installation CRs are applied to and removed from the hub cluster using argocd. The CRs have the following waves applied to them which affects the creation order (lowest to highest) and removal order (highest to lowest):
Namespace: 0
AgentClusterInstall: 1
ClusterDeployment: 1
NMStateConfig: 1
InfraEnv: 1
BareMetalHost: 1
HostFirmwareSettings: 1
ConfigMap: 1 (extra manifests)
ManagedCluster: 2
KlusterletAddonConfig: 2

 

Description of problem:
Follow-up of: https://issues.redhat.com/browse/SDN-2988

This failure is perma-failing in the e2e-metal-ipi-ovn-dualstack-local-gateway jobs.

Example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-ovn-dualstack-local-gateway/1597574181430497280
Search CI: https://search.ci.openshift.org/?search=when+using+openshift+ovn-kubernetes+should+ensure+egressfirewall+is+created&maxAge=336h&context=1&type=junit&name=e2e-metal-ipi-ovn-dualstack-local-gateway&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Sippy: https://sippy.dptools.openshift.org/sippy-ng/jobs/4.13/analysis?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-ovn-dualstack-local-gateway%22%7D%5D%7D

Version-Release number of selected component (if applicable):

4.12,4.13

How reproducible:

Every time

Steps to Reproduce:

1. Setup dualstack KinD cluster
2. Create egress fw policy with spec
Spec:
  Egress:
    To:
      Cidr Selector:  0.0.0.0/0
    Type:             Deny
3. create a pod and ping to 1.1.1.1

Actual results:

Egress policy does not block flows to external IP

Expected results:

Egress policy blocks flows to external IP

Additional info:

It seems mixing ip4 and ip6 operands in ACL matchs doesnt work

In order to delete the correct GCP cloud resources, the "--credentials-requests-dir" parameter must be passed to "ccoctl gcp delete". This was fixed for 4.12 as part of https://github.com/openshift/cloud-credential-operator/pull/489 but must be backported for previous releases. See https://github.com/openshift/cloud-credential-operator/pull/489#issuecomment-1248733205 for discussion regarding this bug.

To reproduce, create GCP infrastructure with a name parameter that is a subset of another set of GCP infrastructure's name parameter. I will "ccoctl gcp create all" with "name=abutcher-gcp" and "name=abutcher-gcp1".

$ ./ccoctl gcp create-all \
--name=abutcher-gcp \
--region=us-central1 \
--project=openshift-hive-dev \
--credentials-requests-dir=./credrequests

$ ./ccoctl gcp create-all \
--name=abutcher-gcp1 \
--region=us-central1 \
--project=openshift-hive-dev \
--credentials-requests-dir=./credrequests

Running "ccoctl gcp delete --name=abutcher-gcp" will result in GCP infrastructure for both "abutcher-gcp" and "abutcher-gcp1" being deleted. 

$ ./ccoctl gcp delete --name abutcher-gcp --project openshift-hive-dev
2022/10/24 11:30:06 Credentials loaded from file "/home/abutcher/.gcp/osServiceAccount.json"
2022/10/24 11:30:06 Deleted object .well-known/openid-configuration from bucket abutcher-gcp-oidc
2022/10/24 11:30:07 Deleted object keys.json from bucket abutcher-gcp-oidc
2022/10/24 11:30:07 OIDC bucket abutcher-gcp-oidc deleted
2022/10/24 11:30:09 IAM Service account abutcher-gcp-openshift-image-registry-gcs deleted
2022/10/24 11:30:10 IAM Service account abutcher-gcp-openshift-gcp-ccm deleted
2022/10/24 11:30:11 IAM Service account abutcher-gcp1-openshift-cloud-network-config-controller-gcp deleted
2022/10/24 11:30:12 IAM Service account abutcher-gcp-openshift-machine-api-gcp deleted
2022/10/24 11:30:13 IAM Service account abutcher-gcp-openshift-ingress-gcp deleted
2022/10/24 11:30:15 IAM Service account abutcher-gcp-openshift-gcp-pd-csi-driver-operator deleted
2022/10/24 11:30:16 IAM Service account abutcher-gcp1-openshift-ingress-gcp deleted
2022/10/24 11:30:17 IAM Service account abutcher-gcp1-openshift-image-registry-gcs deleted
2022/10/24 11:30:19 IAM Service account abutcher-gcp-cloud-credential-operator-gcp-ro-creds deleted
2022/10/24 11:30:20 IAM Service account abutcher-gcp1-openshift-gcp-pd-csi-driver-operator deleted
2022/10/24 11:30:21 IAM Service account abutcher-gcp1-openshift-gcp-ccm deleted
2022/10/24 11:30:22 IAM Service account abutcher-gcp1-cloud-credential-operator-gcp-ro-creds deleted
2022/10/24 11:30:24 IAM Service account abutcher-gcp1-openshift-machine-api-gcp deleted
2022/10/24 11:30:25 IAM Service account abutcher-gcp-openshift-cloud-network-config-controller-gcp deleted
2022/10/24 11:30:25 Workload identity pool abutcher-gcp deleted

 

This bug is a backport clone of [Bugzilla Bug 2034883](https://bugzilla.redhat.com/show_bug.cgi?id=2034883). The following is the description of the original bug:

Description of problem:

Situation (starting point):

  • There is an ongoing change to the machine-config-daemon daemonset being applied by the machine-config-operator pod. It is waiting for the daemonset to roll out.
  • There are some nodes not ready, so daemonset rollout never ends and waiting on that ends in timeout error.

Problem:

  • Machine-config-operator pods stops trying to reconcile stuff whenever it finds timeout error in waiting for the machine-config-daemon rollout
  • This implies that the `spec.kubeAPIServerServingCAData` field of controllerconfig/machine-config-controller object is not updated when the kube-apiserver-operator updates kube-apiserver-to-kubelet-client-ca configmap.
  • Without that field updated, a kube-apiserver-to-kubelet-client-ca change is never rolled out to the nodes.
  • That ultimately leads to cluster-wide unavailability of "oc logs", "oc rsh" etc. commands when the kube-apiserver-operator starts using a client cert signed by the new kube-apiserver-to-kubelet-client-ca to access kubelet ports.

Version-Release number of MCO (Machine Config Operator) (if applicable):

4.7.21

Platform (AWS, VSphere, Metal, etc.): (not relevant)

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)?
(Y/N/Not sure): Y

How reproducible:

Always if the said conditions are met.

Steps to Reproduce:
1. Have some nodes not ready
2. Force a change that requires machine-config-daemon daemonset rollout (I think that changing proxy settings would work for this)
3. Wait until a new kube-apiserver-to-kubelet-client-ca is rolled out by kube-apiserver-operator

Actual results:

New kube-apiserver-to-kubelet-client-ca not forwarded to controllerconfig, kube-apiserver-to-kubelet-client-ca not deployed on nodes

Expected results:

kube-apiserver-to-kubelet-client-ca forwarded to controllerconfig, kube-apiserver-to-kubelet-client-ca deployed to nodes.

Additional info:

In comments

This is a clone of issue OCPBUGS-6517. The following is the description of the original issue:

Description of problem:

When the cluster is configured with Proxy the swift client in the image registry operator is not using the proxy to authenticate with OpenStack, so it's unable to reach the OpenStack API. This issue became evident since recently the support was added to not fallback to cinder in case swift is available[1].

[1]https://github.com/openshift/cluster-image-registry-operator/pull/819

 

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. Deploy a cluster with proxy and restricted installation
2. 
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-5100. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5068. The following is the description of the original issue:

Description of problem:

virtual media provisioning fails when iLO Ironic driver is used

Version-Release number of selected component (if applicable):

4.13

How reproducible:

Always

Steps to Reproduce:

1. attempt virtual media provisioning on a node configured with ilo-virtualmedia:// drivers
2.
3.

Actual results:

Provisioning fails with "An auth plugin is required to determine endpoint URL" error

Expected results:

Provisioning succeeds

Additional info:

Relevant log snippet:

3742 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector [None req-e58ac1f2-fac6-4d28-be9e-983fa900a19b - - - - - -] Unable to start managed inspection for node e4445d43-3458-4cee-9cbe-6da1de75      78cd: An auth plugin is required to determine endpoint URL: keystoneauth1.exceptions.auth_plugins.MissingAuthPlugin: An auth plugin is required to determine endpoint URL
 3743 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector Traceback (most recent call last):
 3744 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/inspector.py", line 210, in _start_managed_inspection
 3745 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     task.driver.boot.prepare_ramdisk(task, ramdisk_params=params)
 3746 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic_lib/metrics.py", line 59, in wrapped
 3747 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     result = f(*args, **kwargs)
 3748 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/ilo/boot.py", line 408, in prepare_ramdisk
 3749 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     iso = image_utils.prepare_deploy_iso(task, ramdisk_params,
 3750 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/image_utils.py", line 624, in prepare_deploy_iso
 3751 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     return prepare_iso_image(inject_files=inject_files)
 3752 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/image_utils.py", line 537, in _prepare_iso_image
 3753 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     image_url = img_handler.publish_image(
 3754 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/image_utils.py", line 193, in publish_image
 3755 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     swift_api = swift.SwiftAPI()
 3756 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/common/swift.py", line 66, in __init__
 3757 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     endpoint = keystone.get_endpoint('swift', session=session)

Description of problem:

Intended to backport the corresponding https://bugzilla.redhat.com/show_bug.cgi?id=2095852 which has been fixed already for this version.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

1. Proposed title of this feature request
--> Alert generation when the etcd container memory consumption goes beyond 90%

2. What is the nature and description of the request?
--> When the etcd database starts growing rapidly due to some high number of objects like secrets, events, or configmap generation by application/workload, the memory and CPU consumption of APIserver and etcd container (control plane component) spikes up and eventually the control plane nodes goes to hung/unresponsive or crash due to out of memory errors as some of the critical processes/services running on master nodes get killed. Hence we request an alert/alarm when the ETCD container's memory consumption goes beyond 90% so that the cluster administrator can take some action before the cluster/nodes go unresponsive.

I see we already have a etcdExcessiveDatabaseGrowth Prometheus rule which helps when the surge in etcd writes leading to a 50% increase in database size over the past four hours on etcd instance however it does not consider the memory consumption:

$ oc get prometheusrules etcd-prometheus-rules -o yaml|grep -i etcdExcessiveDatabaseGrowth -A 9

  • alert: etcdExcessiveDatabaseGrowth
    annotations:
    description: 'etcd cluster "{{ $labels.job }}": Observed surge in etcd writes
    leading to 50% increase in database size over the past four hours on etcd
    instance {{ $labels.instance }}, please check as it might be disruptive.'
    expr: |
    increase(((etcd_mvcc_db_total_size_in_bytes/etcd_server_quota_backend_bytes)*100)[240m:1m]) > 50
    for: 10m
    labels:
    severity: warning

3. Why does the customer need this? (List the business requirements here)
--> Once the etcd memory consumption goes beyond 90-95% of total ram as it's system critical container, the OCP cluster goes unresponsive causing revenue loss to business and impacting the productivity of users of the openshift cluster. 

 

4. List any affected packages or components.
--> etcd

This is a clone of issue OCPBUGS-5019. The following is the description of the original issue:

This is a clone of issue OCPBUGS-4941. The following is the description of the original issue:

Description of problem: This is a follow-up to OCPBUGS-3933.

The installer fails to destroy the cluster when the OpenStack object storage omits 'content-type' from responses, and a container is empty.

Version-Release number of selected component (if applicable):

4.8.z

How reproducible:

Likely not happening in customer environments where Swift is exposed directly. We're seeing the issue in our CI where we're using a non-RHOSP managed cloud.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

Upgrade to 4.10 is stuck looping in syncEgressFirewall

We see transacting operations with context deadline exceeded.

It looks to be trying to process 2.8 million records is one go.

2023-02-21T19:55:06.514097513Z I0221 19:55:06.435220 1 client.go:781] "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:acls Mutator:delete Value:{GoSet:[{GoUUID:6a3ad543-a77d-4700-83b8-5ccae6b2d067} ID:1c5297ff-8588-467a-93f4-22f22d609563} {GoUUID:f6288ed3-3928-45a8-ae57-40ed94cfa249} {GoUUID:04bf90c2-fde1-4a10-baaa-6a3f1d8e2931} {GoUUID:c6609536-857c-48ae-9125-9505753180 a8} {GoUUID:c79b4398-d7cc-4dcf-8c1d-11484f318324} {GoUUID:4323ac2c-033e-43c3-885b-e951cd7a4159} {GoUUID:7b316a80-076f-4266-b7d2-bd69b1d4b874} {GoUUID:57dfecb2-2f94-4cd8-a277-8 b28205e1048} {GoUUID:2c039f15-ff11-4ceb-aa82-bcbe82fc86d1} {GoUUID:063c4121-73c3-4d53-a89d-1063e775146b} {GoUUID:25c788e3-6146-4571-98bf-61010100a22a} {GoUUID:3d3c150f-1296-4d 91-b334-506f28bff4bd}]}}] Timeout:<nil> Where:[where column _uuid == {ba9652de-5aae-4a74-a512-29f775e38c19}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]: context deadline exceeded 

2023-02-21T19:55:18.739739417Z E0221 19:55:18.643127       1 master.go:1369] Failed (will retry) in syncing syncEgressFirewall: failed to remove reject acl from node logical switches: error while removing ACLS: [6a3ad543-a77d-4700-83b8-5ccae6b2d067 8e004991-0382-455f-9901-33ef724acbc2 

Everything is built into one operation via:
https://github.com/openshift/ovn-kubernetes/blob/release-4.10/go-controller/pkg/libovsdbops/switch.go#L243

TrandactAndCheck is being called with a 10s timeout and this operation never completes. 
 

Version-Release number of selected component (if applicable):

4.10.50

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

Upgrade completes

Additional info:

 

Description of problem:

Currently when installing Openshift on the Openstack cluster name length limit is allowed to  14 characters.
Customer wants to know if is it possible to override this validation when installing Openshift on Openstack and create a cluster name that is greater than 14 characters.

Version : OCP 4.8.5 UPI Disconnected 
Environment : Openstack 16 

Issue:
User reports that they are getting error for OCP cluster in Openstack UPI, where the name of the cluster is > 14 characters.

Error events :
~~~
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["/usr/local/bin/openshift-install", "create", "manifests", "--dir=/home/gitlab-runner/builds/WK8mkokN/0/CPE/SKS/pipelines/non-prod/ocp4-openstack-build/ocpinstaller/install-upi"], "delta": "0:00:00.311397", "end": "2022-09-03 21:38:41.974608", "msg": "non-zero return code", "rc": 1, "start": "2022-09-03 21:38:41.663211", "stderr": "level=fatal msg=failed to fetch Master Machines: failed to load asset \"Install Config\": invalid \"install-config.yaml\" file: metadata.name: Invalid value: \"sks-osp-inf-cpe-1-cbr1a\": cluster name is too long, please restrict it to 14 characters", "stderr_lines": ["level=fatal msg=failed to fetch Master Machines: failed to load asset \"Install Config\": invalid \"install-config.yaml\" file: metadata.name: Invalid value: \"sks-osp-inf-cpe-1-cbr1a\": cluster name is too long, please restrict it to 14 characters"], "stdout": "", "stdout_lines": []}
~~~

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

 

Actual results:

Users are getting error "cluster name is too long" when clustername contains more than 14 characters for OCP on Openstack

Expected results:

The 14 characters limit should be change for the OCP clustername on Openstack

Additional info:

 

Description of problem:

Similar to OCPBUGS-11636 ccoctl needs to be updated to account for the s3 bucket changes described in https://aws.amazon.com/blogs/aws/heads-up-amazon-s3-security-changes-are-coming-in-april-of-2023/

these changes have rolled out to us-east-2 and China regions as of today and will roll out to additional regions in the near future

See OCPBUGS-11636 for additional information

Version-Release number of selected component (if applicable):

 

How reproducible:

Reproducible in affected regions.

Steps to Reproduce:

1. Use "ccoctl aws create-all" flow to create STS infrastructure in an affected region like us-east-2. Notice that document upload fails because the s3 bucket is created in a state that does not allow usage of ACLs with the s3 bucket.

Actual results:

./ccoctl aws create-all --name abutchertestue2 --region us-east-2 --credentials-requests-dir ./credrequests --output-dir _output
2023/04/11 13:01:06 Using existing RSA keypair found at _output/serviceaccount-signer.private
2023/04/11 13:01:06 Copying signing key for use by installer
2023/04/11 13:01:07 Bucket abutchertestue2-oidc created
2023/04/11 13:01:07 Failed to create Identity provider: failed to upload discovery document in the S3 bucket abutchertestue2-oidc: AccessControlListNotSupported: The bucket does not allow ACLs
        status code: 400, request id: 2TJKZC6C909WVRK7, host id: zQckCPmozx+1yEhAj+lnJwvDY9rG14FwGXDnzKIs8nQd4fO4xLWJW3p9ejhFpDw3c0FE2Ggy1Yc=

Expected results:

"ccoctl aws create-all" successfully creates IAM and S3 infrastructure. OIDC discovery and JWKS documents are successfully uploaded to the S3 bucket and are publicly accessible.

Additional info:

 

This is a clone of issue OCPBUGS-11998. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10678. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10655. The following is the description of the original issue:

Description of problem:
The dev console shows a list of samples. The user can create a sample based on a git repository. But some of these samples doesn't include a git repository reference and could not be created.

Version-Release number of selected component (if applicable):
Tested different frontend versions against a 4.11 cluster and all (oldest tested frontend was 4.8) show the sample without git repository.

But the result also depends on the installed samples operator and installed ImageStreams.

How reproducible:
Always

Steps to Reproduce:

  1. Switch to the Developer perspective
  2. Navigate to Add > All Samples
  3. Search for Jboss
  4. Click on "JBoss EAP XP 4.0 with OpenJDK 11" (for example)

Actual results:
The git repository is not filled and the create button is disabled.

Expected results:
Samples without git repositories should not be displayed in the list.

Additional info:
The Git repository is saved as "sampleRepo" in the ImageStream tag section.

Description of problem:

go test -mod=vendor -test.v -race  github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops
# github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops [github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops.test]
pkg/libovsdbops/acl_test.go:98:15: undefined: FindACLs
pkg/libovsdbops/acl_test.go:105:15: undefined: UpdateACLsOps
FAIL    github.com/ovn-org/ovn-kubernetes/go-controller/pkg/libovsdbops [build failed]
FAIL

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This bug was initially created as a copy of
Bug #2096605
I am copying this bug because: the parent bug solved the validation aspect of diskType but now the description of diskType in
https://github.com/openshift/installer/blob/master/data/data/install.openshift.io_installconfigs.yaml#L2914-L2923
needs to be updated.

Version: 4.11.0-0.nightly-2022-06-06-201913

Platform: vSphere IPI

What happened?
1. If user inputs an invalid value for platform.vsphere.diskType in install-config.yaml file, there is no validation checking for diskType and doesn't exit with error, but continues the installation, which is not the same behavior as in 4.10.

After all vms are provisioned, I checked that the disk provision type is thick.

2. If user doesn't set platform.vsphere.diskType in install-config.yaml file, the default disk provision type is thick, but not the vSphere default storage policy. On VMC, the default policy is thin, so maybe the description of diskType should also need to be updated.

$ ./openshift-install explain installconfig.platform.vsphere.diskType
KIND: InstallConfig
VERSION: v1

RESOURCE: <string>
Valid Values: "","thin","thick","eagerZeroedThick"
DiskType is the name of the disk provisioning type, valid values are thin, thick, and eagerZeroedThick. When not specified, it will be set according to the default storage policy of vsphere.

What did you expect to happen?
validation for diskType

How to reproduce it (as minimally and precisely as possible)?
set diskType to invalid value in install-config.yaml and install the cluster

This is a clone of issue OCPBUGS-5879. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5505. The following is the description of the original issue:

Description of problem:

The upgradeability check in CVO is throttled (essentially cached) for a nondeterministic period of time, same as the minimal sync period computed at runtime. The period can be up to 4 minutes, determined at CVO start time as 2minutes * (0..1 + 1). We agreed with Trevor that such throttling is unnecessarily aggressive (the check is not that expensive). It also causes CI flakes, because the matching test only has 3 minutes timeout. Additionally, the non-determinism and longer throttling results makes UX worse by actions done in the cluster may have their observable effect delayed.

Version-Release number of selected component (if applicable):

discovered in 4.10 -> 4.11 upgrade jobs

How reproducible:

The test seems to flake ~10% of 4.10->4.11 Azure jobs (sippy). There does not seem to be that much impact on non-Azure jobs though which is a bit weird.

Steps to Reproduce:

Inspect the CVO log and E2E logs from failing jobs with the provided [^check-cvo.py] helper:

$ ./check-cvo.py cvo.log && echo PASS || echo FAIL

Preferably, inspect CVO logs of clusters that just underwent an upgrade (upgrades makes the original problematic behavior more likely to surface)

Actual results:

$ ./check-cvo.py openshift-cluster-version_cluster-version-operator-5b6966c474-g4kwk_cluster-version-operator.log && echo PASS || echo FAIL
FAIL: Cache hit at 11:59:55.332339 0:03:13.665006 after check at 11:56:41.667333
FAIL: Cache hit at 12:06:22.663215 0:03:13.664964 after check at 12:03:08.998251
FAIL: Cache hit at 12:12:49.997119 0:03:13.665598 after check at 12:09:36.331521
FAIL: Cache hit at 12:19:17.328510 0:03:13.664906 after check at 12:16:03.663604
FAIL: Cache hit at 12:25:44.662290 0:03:13.666759 after check at 12:22:30.995531
Upgradeability checks:           5
Upgradeability check cache hits: 12
FAIL

Note that the bug is probabilistic, so not all unfixed clusters will exhibit the behavior. My guess of the incidence rate is about 30-40%.

Expected result

$ ./check-cvo.py openshift-cluster-version_cluster-version-operator-7b8f85d455-mk9fs_cluster-version-operator.log && echo PASS || echo FAIL
Upgradeability checks:           12
Upgradeability check cache hits: 11
PASS

The actual numbers are not relevant (unless the upgradeabilily check count is zero, which means the test is not conclusive, the script warns about that), lack of failure is.

Additional info:

$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1607602927633960960/artifacts/e2e-azure-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7b7d4b5bbd-zjqdt_cluster-version-operator.log | grep upgradeable.go
...
I1227 06:50:59.023190       1 upgradeable.go:122] Cluster current version=4.10.46
I1227 06:50:59.042735       1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later.
I1227 06:51:14.024345       1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later.
I1227 06:53:23.080768       1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later.
I1227 06:56:59.366010       1 upgradeable.go:122] Cluster current version=4.11.0-0.ci-2022-12-26-193640

$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1607602927633960960/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Kubernetes 1.25 and therefore OpenShift 4.12'
Dec 27 06:51:15.319: INFO: Waiting for Upgradeable to be AdminAckRequired for "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions." ...
Dec 27 06:54:15.413: FAIL: Error while waiting for Upgradeable to complain about AdminAckRequired with message "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.": timed out waiting for the condition
The test passes. Also, the "Upgradeable conditions were recently checked, will try later." messages in CVO logs should never occur after a deterministic, short amount of time (I propose 1 minute) after upgradeability was checked.

I tested the throttling period in https://github.com/openshift/cluster-version-operator/pull/880. With the period of 15m, the test passrate was 4 of 9. Wiht the period of 1m, the test did not fail at all.

Some context in Slack thread

Please review the following PR: https://github.com/openshift/cluster-bootstrap/pull/72

The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

This is a clone of issue OCPBUGS-7650. The following is the description of the original issue:

This is a clone of issue OCPBUGS-672. The following is the description of the original issue:

Description of problem:

Redhat-operator part of the marketplace is failing regularly due to startup probe timing out connecting to registry-server container part of the same pod within 1 sec which in turn increases CPU/Mem usage on Master nodes:

62m         Normal    Scheduled                pod/redhat-operators-zb4j7                         Successfully assigned openshift-marketplace/redhat-operators-zb4j7 to ip-10-0-163-212.us-west-2.compute.internal by ip-10-0-149-93
62m         Normal    AddedInterface           pod/redhat-operators-zb4j7                         Add eth0 [10.129.1.112/23] from ovn-kubernetes
62m         Normal    Pulling                  pod/redhat-operators-zb4j7                         Pulling image "registry.redhat.io/redhat/redhat-operator-index:v4.11"
62m         Normal    Pulled                   pod/redhat-operators-zb4j7                         Successfully pulled image "registry.redhat.io/redhat/redhat-operator-index:v4.11" in 498.834447ms
62m         Normal    Created                  pod/redhat-operators-zb4j7                         Created container registry-server
62m         Normal    Started                  pod/redhat-operators-zb4j7                         Started container registry-server
62m         Warning   Unhealthy                pod/redhat-operators-zb4j7                         Startup probe failed: timeout: failed to connect service ":50051" within 1s
62m         Normal    Killing                  pod/redhat-operators-zb4j7                         Stopping container registry-server


Increasing the threshold of the probe might fix the problem:
  livenessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=:50051
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    name: registry-server
    ports:
    - containerPort: 50051
      name: grpc
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=:50051
      failureThreshold: 3
      initialDelaySeconds: 5
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5 

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

1. Install OSD cluster using 4.11.0-0.nightly-2022-08-26-162248 payload
2. Inspect redhat-operator pod in openshift-marketplace namespace
3. Observe the resource usage ( CPU and Memory ) of the pod 

Actual results:

Redhat-operator failing leading to increase to CPU and Mem usage on master nodes regularly during the startup

Expected results:

Redhat-operator startup probe succeeding and no spikes in resource on master nodes

Additional info:

Attached cpu, memory and event traces.

 

This is a clone of issue OCPBUGS-7732. The following is the description of the original issue:

Description of problem:

When services are deleted, the services controller cache should also remove the service from its top level cache to avoid growing forever.

While this is not an issue in 4.13 once the lb_cache rework merges [1], the 4.12 and older branches have this problem because that rework is meant for 4.13 only.

[1]: https://github.com/ovn-org/ovn-kubernetes/pull/3387

This is the location where alreadyApplied is not deleting the removal: 
https://github.com/openshift/ovn-kubernetes/blob/cf9fb51510e1870961bf3a0f064b73536757a4f8/go-controller/pkg/ovn/controller/services/services_controller.go#L269

It should do the similar changes depicted here (currently merged upstream):
https://github.com/ovn-org/ovn-kubernetes/blob/cd78ae1af4657d38bdc41003a8737aa958d62b9d/go-controller/pkg/ovn/controller/services/services_controller.go#L322-L324

 

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1. create service -- use unique name
2. remove service
3. notice how alreadyApplied grows and never gets smaller
4. repeat

Actual results:

^^

Expected results:

alreadyApplied should not grow forever

Additional info:

 

This is a clone of issue OCPBUGS-10314. The following is the description of the original issue:

This is a clone of issue OCPBUGS-8741. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5889. The following is the description of the original issue:

Description of problem:

Customer running a cluster with following config:
4.10.23
AWS/IPI
OVNKubernetes

Observed that in namespace with networkpolicy rules enabled, and a policy for allow-from-same namespace, pods will have different behaviors when calling service IP's hosted in that same namespace.

Example:
Deployment1 with two pods (A/B) exists in namespace <EXAMPLE>
Deployment2 with 1 pod hosting a service and route exists in same namespace
Pod A will unexpectedly stop being able to call service IP of deployment2; Pod B will never lose access to calling service IP of deployment2.

Pod A remains able to call out through br-ex interface, tag the ROUTE address, and reach deployment2 pod via haproxy (this never breaks)

Pod A remains able to reach the local gateway on the node

Host node for Pod A is able to reach the service IP of deployment2 and remains able to do so, even while pod A is impacted.

Issue can be mitigated by applying a label or annotation to pod A, which immediately allows it to reach internal service IPs again within the namespace.

I suspect that the issue is to do with the networkpolicy rules failing to stay updated on the pod object, and the pod needs to be 'refreshed' --> label appendation/other update, to force the pod to 'remember' that it is allowed to call peers within the namespace.

Additional relevant data:
- pods affects throughout cluster; no specific project/service/deployment/application
- pods ride on different nodes all the time (no one node affected)
- pods with fail condition are on same node with other pods without issue
- multiple namespaces see this problem
- all namespaces are using similar networkpolicy isolation and allow-from-same-namespace ruleset (which matches our documentation on syntax).



Version-Release number of selected component (if applicable):

4.10.23

How reproducible:

every time --> unclear what the trigger is that causes this; pods will be functional and several hours/days later, will stop being able to talk to peer services.

Steps to Reproduce:

1. deploy pod with at least two replicas in a namespace with allow-from same network policy
2. deploy a different service and route example httpd instance in same namespace
3. observe that one of the two pods may fail to reach service IP after some time
4. apply annotation to pod and it is immediately able to reach services again.

Actual results:

pods intermittently fail to reach internal service addresses, but are able to be interacted with otherwise, and can reach upstream/external addresses including routes on cluster. 

Expected results:

pods should not lose access to service network peers. 

Additional info:

see next comments for relevant uploads/sosreports and inspects.

This is a clone of BZ https://bugzilla.redhat.com/show_bug.cgi?id=2117374 which fixed 4.12

 
Description of problem:
OCP 4.11 introduced the `restricted-v2` SecurityContextConstraint as the default binding for all authenticated users. When a Pod that would have normally be admitted successfully using the `restricted` SCC, but cannot be admitted using a `restricted-v2` SCC, it's not clear to the customer that the problem is related to the default SCC changing from restricted to restricted-v2.

Suggestion is to generate an additional message for this specific use case that makes it clear to customers that they have a workload that is not compatible with restricted-v2.

Example message:
Error creating: pods "nginx-ingress-controller-75bffcfdf8-" is forbidden: the pod fails to validate against the default `restricted-v2` security context constraint, but would validate successfully against the `restricted` security context constraint.

The goal of such a message is to give the customer a breadcrumb that the `restricted-v2` SCC is activated and set as the default, which they may have not been aware of. This would give them a way to find alternatives, such as changing the global default, fixing the application or assigning the `restricted` SCC to the Namespace or ServiceAccount.

Version-Release number of selected component (if applicable):
4.11

How reproducible:
Always

Steps to Reproduce:
1. Install OCP 4.11.0
2. Create a deployment with a podtemplate with .spec.containers[*].securityContext.allowPrivigeEscalation = true in a new namespace

Actual results:
The pod will fail to be admitted with an error such as:

Warning FailedCreate 27m (x25 over 76m) replicaset-controller Error creating: pods "nginx-ingress-controller-75bffcfdf8-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, spec.containers[0].securityContext.allowPrivilegeEscalation: Invalid value: true: Allowing privilege escalation for containers is not allowed, provider "restricted": Forbidden: not usable by user or serviceaccount, provider "nonroot-v2": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "noobaa": Forbidden: not usable by user or serviceaccount, provider "noobaa-endpoint": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "rook-ceph": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "rook-ceph-csi": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount]

Expected results:
Add an additional or replacement error message when there is no SCC assignment other than `restricted-v2` AND `restricted` would have worked:

Error creating: pods "nginx-ingress-controller-75bffcfdf8-" is forbidden: the pod fails to validate against the default `restricted-v2` security context constraint, but would validate successfully against the `restricted` security context constraint.

Additional info:

This is a clone of issue OCPBUGS-15099. The following is the description of the original issue:

This is a clone of issue OCPBUGS-14943. The following is the description of the original issue:

This is a clone of issue OCPBUGS-14668. The following is the description of the original issue:

Description of problem:

visiting global configurations page will return error after 'Red Hat OpenShift Serverless' is installed, the error persist even operator is uninstalled

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-06-212044

How reproducible:

Always

Steps to Reproduce:

1. Subscribe 'Red Hat OpenShift Serverless' from OperatorHub, wait for the operator to be successfully installed
2. Visit Administration -> Cluster Settings -> Configurations tab

Actual results:

react_devtools_backend_compact.js:2367 unhandled promise rejection: TypeError: Cannot read properties of undefined (reading 'apiGroup') 
    at r (main-chunk-e70ea3b3d562514df486.min.js:1:1)
    at main-chunk-e70ea3b3d562514df486.min.js:1:1
    at Array.map (<anonymous>)
    at main-chunk-e70ea3b3d562514df486.min.js:1:1
overrideMethod @ react_devtools_backend_compact.js:2367
window.onunhandledrejection @ main-chunk-e70ea3b3d562514df486.min.js:1

main-chunk-e70ea3b3d562514df486.min.js:1 Uncaught (in promise) TypeError: Cannot read properties of undefined (reading 'apiGroup')
    at r (main-chunk-e70ea3b3d562514df486.min.js:1:1)
    at main-chunk-e70ea3b3d562514df486.min.js:1:1
    at Array.map (<anonymous>)
    at main-chunk-e70ea3b3d562514df486.min.js:1:1

 

Expected results:

no errors

Additional info:

 

Description of problem:

Looks like a regression was introduced and the Windows networking tests started to fail accross all 4.11 CI jobs for both Linux-to-Windows and Windows-to-Windows pods communication. 

https://github.com/openshift/ovn-kubernetes/pull/1415 merged without passing Windows networking tests.

https://github.com/openshift/windows-machine-config-operator/pull/1359  shows networking tests failing accross all CI jobs.





Version-Release number of selected component (if applicable):

4.11

How reproducible:

Always 

Steps to Reproduce:

1. Trigger a job retest in any of the above mentioned PRs
 

Actual results:

Windows networking test fail

Expected results:

Windows networking test pass

Additional info:

Direct link to a release-4.11 aws-e2e-operator CI job showing the failed networking test

Description of problem:

See https://bugzilla.redhat.com/show_bug.cgi?id=2104275

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-2495. The following is the description of the original issue:

Failures like:

$ oc login --token=...

Logged into "https://api..." as "..." using the token provided.

Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get projects.project.openshift.io)

break login, which tries to gather information before saving the configuration, including a giant project list.

Ideally login would be able to save the successful login credentials, even when the informative gathering had difficulties. And possibly the informative gathering could be made conditional (--quiet or similar?) so expensive gathering could be skipped in use-cases where the context was not needed.

This is a clone of issue OCPBUGS-5016. The following is the description of the original issue:

Description of problem:

When editing any pipeline in the openshift console, the correct content cannot be obtained (the obtained information is the initial information).

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

Developer -> Pipeline -> select pipeline -> Details -> Actions -> Edit Pipeline -> YAML view -> Cancel ->  Actions -> Edit Pipeline -> YAML view 

Actual results:

displayed content is incorrect.

Expected results:

Get the content of the current pipeline, not the "pipeline create" content.

Additional info:

If cancel or save in the "Pipeline Builder" interface after "Edit Pipeline", can get the expected content.
~
Developer -> Pipeline -> select pipeline -> Details -> Actions -> Edit Pipeline -> Pipeline builder -> Cancel ->  Actions -> Edit Pipeline -> YAML view :Display resource content normally
~

This is a clone of issue OCPBUGS-675. The following is the description of the original issue:

Description of problem:

A cluster hit a panic in etcd operator in bootstrap:
I0829 14:46:02.736582 1 controller_manager.go:54] StaticPodStateController controller terminated
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1e940ab]

goroutine 2701 [running]:
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.checkSingleMemberHealth({0x29374c0, 0xc00217d920}, 0xc0021fb110)
github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:135 +0x34b
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth.func1()
github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:58 +0x7f
created by github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth
github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:54 +0x2ac
Version-Release number of selected component (if applicable):

 

How reproducible:

Pulled up a 4.12 cluster and hit panic during bootstrap

Steps to Reproduce:

1.
2.
3.

Actual results:

panic as above

Expected results:

no panic

Additional info:

 

Description of problem:

When queried dns hostname from certain pod on the certain node, responded from random coredns pod, not prefer local one. Is it expected result ?

# In OCP v4.8.13 case
// Ran dig command on the certain node which is running the following test-7cc4488d48-tqc4m pod.
sh-4.4# while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done
:
07:16:33 :172.217.175.238
07:16:34 :172.217.175.238 <--- Refreshed the upstream result
07:16:36 :142.250.207.46
07:16:37 :142.250.207.46

// The dig results is matched with the running node one as you can see the above one.
$ oc rsh  test-7cc4488d48-tqc4m bash -c 'while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done'
:
07:16:35 :172.217.175.238 
07:16:36 :172.217.175.238 <--- At the same time, the pod dig result is also refreshed.
07:16:37 :142.250.207.46
07:16:38 :142.250.207.46


But in v4.10 case, in contrast, the dns query result is various and responded randomly regardless local dns results on the node as follows.

# In OCP v4.10.23 case, pod's response from DNS services are not consistent.
$ oc rsh test-848fcf8ddb-zrcbx  bash -c 'while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done'
07:23:00 :142.250.199.110
07:23:01 :142.250.207.46
07:23:02 :142.250.207.46
07:23:03 :142.250.199.110
07:23:04 :142.250.199.110
07:23:05 :172.217.161.78

# Even though the node which is running the pod keep responding the same IP...
sh-4.4# while : ; do echo -n "$(date '+%H:%M:%S') :"; dig google.com +short; sleep 1; done
07:23:00 :172.217.161.78
07:23:01 :172.217.161.78
07:23:02 :172.217.161.78
07:23:03 :172.217.161.78
07:23:04 :172.217.161.78
07:23:05 :172.217.161.78

Version-Release number of selected component (if applicable):

v4.10.23 (ROSA)
SDN: OpenShiftSDN

How reproducible:

You can always reproduce this issue using "dig google.com" from both any pod and the node the pod running according to the above "Description" details.

Steps to Reproduce:

1. Run any usual pod, and check which node the pod is running on.
2. Run dig google.com on the pod and the node.
3. Check the IP is consistent with the running node each other. 

Actual results:

The response IPs are not consistent and random IP is responded.

Expected results:

The response IP is kind of consistent, and aware of prefer local dns.

Additional info:

This issue affects EgressNetworkPolicy dnsName feature.

Description of problem:

For OVNK to become CNCF complaint, we need to support session affinity timeout feature and enable the e2e's on OpenShift side. This bug tracks the efforts to get this into 4.12 OCP.

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-855. The following is the description of the original issue:

Description of problem:

When setting the allowedregistries like the example below, the openshift-samples operator is degraded:

oc get image.config.openshift.io/cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Image
metadata:
  annotations:
    release.openshift.io/create-only: "true"
  creationTimestamp: "2020-12-16T15:48:20Z"
  generation: 2
  name: cluster
  resourceVersion: "422284920"
  uid: d406d5a0-c452-4a84-b6b3-763abb51d7a5
spec:
  additionalTrustedCA:
    name: registry-ca
  allowedRegistriesForImport:
  - domainName: quay.io
    insecure: false
  - domainName: registry.redhat.io
    insecure: false
  - domainName: registry.access.redhat.com
    insecure: false
  - domainName: registry.redhat.io/redhat/redhat-operator-index
    insecure: true
  - domainName: registry.redhat.io/redhat/redhat-marketplace-index
    insecure: true
  - domainName: registry.redhat.io/redhat/certified-operator-index
    insecure: true
  - domainName: registry.redhat.io/redhat/community-operator-index
    insecure: true
  registrySources:
    allowedRegistries:
    - quay.io
    - registry.redhat.io
    - registry.rijksapps.nl
    - registry.access.redhat.com
    - registry.redhat.io/redhat/redhat-operator-index
    - registry.redhat.io/redhat/redhat-marketplace-index
    - registry.redhat.io/redhat/certified-operator-index
    - registry.redhat.io/redhat/community-operator-index


oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.21   True        False         False      5d13h   
baremetal                                  4.10.21   True        False         False      450d    
cloud-controller-manager                   4.10.21   True        False         False      94d     
cloud-credential                           4.10.21   True        False         False      624d    
cluster-autoscaler                         4.10.21   True        False         False      624d    
config-operator                            4.10.21   True        False         False      624d    
console                                    4.10.21   True        False         False      42d     
csi-snapshot-controller                    4.10.21   True        False         False      31d     
dns                                        4.10.21   True        False         False      217d    
etcd                                       4.10.21   True        False         False      624d    
image-registry                             4.10.21   True        False         False      94d     
ingress                                    4.10.21   True        False         False      94d     
insights                                   4.10.21   True        False         False      104s    
kube-apiserver                             4.10.21   True        False         False      624d    
kube-controller-manager                    4.10.21   True        False         False      624d    
kube-scheduler                             4.10.21   True        False         False      624d    
kube-storage-version-migrator              4.10.21   True        False         False      31d     
machine-api                                4.10.21   True        False         False      624d    
machine-approver                           4.10.21   True        False         False      624d    
machine-config                             4.10.21   True        False         False      17d     
marketplace                                4.10.21   True        False         False      258d    
monitoring                                 4.10.21   True        False         False      161d    
network                                    4.10.21   True        False         False      624d    
node-tuning                                4.10.21   True        False         False      31d     
openshift-apiserver                        4.10.21   True        False         False      42d     
openshift-controller-manager               4.10.21   True        False         False      22d     
openshift-samples                          4.10.21   True        True          True       31d     Samples installation in error at 4.10.21: &errors.errorString{s:"global openshift image configuration prevents the creation of imagestreams using the registry "}
operator-lifecycle-manager                 4.10.21   True        False         False      624d    
operator-lifecycle-manager-catalog         4.10.21   True        False         False      624d    
operator-lifecycle-manager-packageserver   4.10.21   True        False         False      31d     
service-ca                                 4.10.21   True        False         False      624d    
storage                                    4.10.21   True        False         False      113d  


After applying the fix as described here(  https://access.redhat.com/solutions/6547281 ) it is resolved:
oc patch configs.samples.operator.openshift.io cluster --type merge --patch '{"spec": {"samplesRegistry": "registry.redhat.io"}}'

But according the the BZ this should be fixed in 4.10.3 https://bugzilla.redhat.com/show_bug.cgi?id=2027745 but the issue is still occur in our 4.10.21 cluster:

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.21   True        False         31d     Error while reconciling 4.10.21: the cluster operator openshift-samples is degraded

Version-Release number of selected component (if applicable):

 

How reproducible:

100%

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem: This is a follow-up to OCPBUGS-2795 and OCPBUGS-2941.

The installer fails to destroy the cluster when the OpenStack object storage omits 'content-type' from responses. This can happen on responses with HTTP status code 204, where a reverse proxy is truncating content-related headers (see this nginX bug report). In such cases, the Installer errors with:

level=error msg=Bulk deleting of container "5ifivltb-ac890-chr5h-image-registry-fnxlmmhiesrfvpuxlxqnkoxdbl" objects failed: Cannot extract names from response with content-type: []

Listing container object suffers from the same issue as listing the containers and this one isn't fixed in latest versions of gophercloud. I've reported https://github.com/gophercloud/gophercloud/issues/2509 and fixing it with https://github.com/gophercloud/gophercloud/issues/2510, however we likely won't be able to backport the bump to gophercloud master back to release-4.8 so we'll have to look for alternatives.

I'm setting the priority to critical as it's causing all our jobs to fail in master.

Version-Release number of selected component (if applicable):

4.8.z

How reproducible:

Likely not happening in customer environments where Swift is exposed directly. We're seeing the issue in our CI where we're using a non-RHOSP managed cloud.

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

Description of problem:

When a pod runs to a completed state, we typically rely on the update event that will indicate to us that this pod is completed. At that point the pod IP is released and the port configuration is removed in OVN. The subsequent delete event for this pod will be ignored because it should have been cleaned up in the previous update.

However, there can be cases where the update event is missed with pod completed. In this case we will only receive a delete with pod completed event, and ignore tearing down the pod. The end result is the pod is not cleaned up in OVN and the IP address remains allocated, reducing the amount of address range available to launch another pod. This can lead to exhausting all IP addresses available for pod allocation on a node.

Version-Release number of selected component (if applicable):

4.10.24

How reproducible:

Not sure how to reproduce this. I'm guessing some lag in kapi updates can cause the completed update event and the final delete event to be combined into a single event.

Steps to Reproduce:

1.
2.
3.

Actual results:

Port still exists in OVN, IP remains allocated for a deleted pod.

Expected results:

IP should be freed, port should be removed from OVN.

Additional info:

 

This is a clone of issue OCPBUGS-4250. The following is the description of the original issue:

Description of problem:

Added a script to collect PodNetworkConnectivityChecks to able to view the overall status of the pod network connectivity.

Current must-gather collects the contents of `openshift-network-diagnostics` but does not collect the PodNetworkConnectivityCheck.

Version-Release number of selected component (if applicable):

4.12, 4.11, 4.10

Description of problem:

[OVN][OSP] After reboot egress node, egress IP cannot be applied anymore.

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-11-07-181244

How reproducible:

Frequently happened in automation. But didn't reproduce it in manual.

Steps to Reproduce:

1. Label one node as egress node

2.
Config one egressIP object
STEP: Check  one EgressIP assigned in the object.

Nov  8 15:28:23.591: INFO: egressIPStatus: [{"egressIP":"192.168.54.72","node":"huirwang-1108c-pg2mt-worker-0-2fn6q"}]

3.
Reboot the node, wait for the node ready.


Actual results:

EgressIP cannot be applied anymore. Waited more than 1 hour.
 oc get egressip
NAME             EGRESSIPS       ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-47031   192.168.54.72    

Expected results:

The egressIP should be applied correctly.

Additional info:


Some logs
E1108 07:29:41.849149       1 egressip.go:1635] No assignable nodes found for EgressIP: egressip-47031 and requested IPs: [192.168.54.72]
I1108 07:29:41.849288       1 event.go:285] Event(v1.ObjectReference{Kind:"EgressIP", Namespace:"", Name:"egressip-47031", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'NoMatchingNodeFound' no assignable nodes for EgressIP: egressip-47031, please tag at least one node with label: k8s.ovn.org/egress-assignable


W1108 07:33:37.401149       1 egressip_healthcheck.go:162] Could not connect to huirwang-1108c-pg2mt-worker-0-2fn6q (10.131.0.2:9107): context deadline exceeded
I1108 07:33:37.401348       1 master.go:1364] Adding or Updating Node "huirwang-1108c-pg2mt-worker-0-2fn6q"
I1108 07:33:37.437465       1 egressip_healthcheck.go:168] Connected to huirwang-1108c-pg2mt-worker-0-2fn6q (10.131.0.2:9107)

After this log, seems like no logs related to "192.168.54.72" happened.

Description of problem:

Remove the self-provisioner role for the system authenticated users as per  https://access.redhat.com/solutions/4040541 to stop users from having the ability to create new projects, but the customer has found this is only partially working. It appears that when you use cluster Web UI Administrator view, the "Create Project" button is not available but switching to the default Developer view default user can create a project

Version-Release number of selected component (if applicable):

 

How reproducible:

Follow https://access.redhat.com/solutions/1529893

Steps to Reproduce:

1. oc adm policy remove-cluster-role-from-group self-provisioner system:authenticated:oauth
2. log back in as user and switch between admin/Dev view
3. User still has link showing in Dev console

Actual results:

Create new project link still exists

Expected results:

Create new project link should be removed, similar to Admin Console

 

Additional info:

Although the loink still exists, the user get's a correct permission denied message.

This is a clone of issue OCPBUGS-7409. The following is the description of the original issue:

This is a clone of issue OCPBUGS-7374. The following is the description of the original issue:

Originally reported by lance5890 in issue https://github.com/openshift/cluster-etcd-operator/issues/1000

The controllers sometimes get stuck on listing members in failure scenarios, this is known and can be mitigated by simply restarting the CEO. 

similar BZ 2093819 with stuck controllers was fixed slightly different in https://github.com/openshift/cluster-etcd-operator/commit/4816fab709e11e0681b760003be3f1de12c9c103

 

This fix was contributed by lance5890, thanks a lot!

 

Description of problem:

When scaling down the machineSet for worker nodes, a PV(vmdk) file got deleted.

Version-Release number of selected component (if applicable):

4.10

How reproducible:

N/A

Steps to Reproduce:

1. Scale down worker nodes
2. Check VMware logs and VM gets deleted with vmdk still attached

Actual results:

After scaling down nodes, volumes still attached to the VM get deleted alongside the VM

Expected results:

Worker nodes scaled down without any accidental deletion

Additional info:

 

This is a clone of issue OCPBUGS-1522. The following is the description of the original issue:

Description of problem:

Normal user cannot open the debug container from the pods(crashLoopbackoff) they created, And would be got error message:
pods "<pod name>" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>

Version-Release number of selected component (if applicable):

4.12.0-0.nightly-2022-09-20-040107, 4.11.z, 4.10.z

How reproducible:

Always

Steps to Reproduce:

1. Login OCP as a normal user
   eg: flexy-htpasswd-provider
2. Create a project, go to Developer prespective -> +Add page
3. Click "Import from Git", and provide below data to get a Pods with CrashLoopBackOff state
   Git Repo URL: https://github.com/sclorg/nodejs-ex.git
   Name: nodejs-ex-git
   Run command: star a wktw
4. Navigate to /k8s/ns/<project name>/pods page, find the pod with CrashLoopBackOff status, and go to it details page -> Logs Tab
5. Click the link of "Debug container"
6. Check if the Debug container can be opened

Actual results:

6. Error message would be shown on page, user cannot open debug container via UI
   pods "nodejs-ex-git-6dd986d8bd-9h2wj-debug-tkqk2" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>

Expected results:

6. Normal user could use debug container without any error message

Additional info:

The debug container could be created for the normal user successfully via CommandLine
 $ oc debug <crashloopbackoff pod name> -n <project name>

Description of problem:

Created two egressIP object, egressIPs in one egressIP object cannot be applied successfully 

Version-Release number of selected component (if applicable):

4.11.0-0.nightly-2022-11-27-164248

How reproducible:

Frequently happen in auto case

Steps to Reproduce:

1.  Label two nodes as egress nodes
  oc get nodes -o wide
NAME                                   STATUS   ROLES    AGE    VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
huirwang-1128a-s6j6t-master-0          Ready    master   154m   v1.24.6+5658434   10.0.0.8      <none>        Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa)   4.18.0-372.32.1.el8_6.x86_64   cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8
huirwang-1128a-s6j6t-master-1          Ready    master   154m   v1.24.6+5658434   10.0.0.7      <none>        Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa)   4.18.0-372.32.1.el8_6.x86_64   cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8
huirwang-1128a-s6j6t-master-2          Ready    master   153m   v1.24.6+5658434   10.0.0.6      <none>        Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa)   4.18.0-372.32.1.el8_6.x86_64   cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8
huirwang-1128a-s6j6t-worker-westus-1   Ready    worker   135m   v1.24.6+5658434   10.0.1.5      <none>        Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa)   4.18.0-372.32.1.el8_6.x86_64   cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8
huirwang-1128a-s6j6t-worker-westus-2   Ready    worker   136m   v1.24.6+5658434   10.0.1.4      <none>        Red Hat Enterprise Linux CoreOS 411.86.202211232221-0 (Ootpa)   4.18.0-372.32.1.el8_6.x86_64   cri-o://1.24.3-6.rhaos4.11.gitc4567c0.el8

 % oc get node huirwang-1128a-s6j6t-worker-westus-1 --show-labels
NAME                                   STATUS   ROLES    AGE    VERSION           LABELS
huirwang-1128a-s6j6t-worker-westus-1   Ready    worker   136m   v1.24.6+5658434   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D4s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=westus,failure-domain.beta.kubernetes.io/zone=0,k8s.ovn.org/egress-assignable=true,kubernetes.io/arch=amd64,kubernetes.io/hostname=huirwang-1128a-s6j6t-worker-westus-1,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=Standard_D4s_v3,node.openshift.io/os_id=rhcos,topology.disk.csi.azure.com/zone=,topology.kubernetes.io/region=westus,topology.kubernetes.io/zone=0

 % oc get node huirwang-1128a-s6j6t-worker-westus-2 --show-labels
NAME                                   STATUS   ROLES    AGE    VERSION           LABELS
huirwang-1128a-s6j6t-worker-westus-2   Ready    worker   136m   v1.24.6+5658434   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_D4s_v3,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=westus,failure-domain.beta.kubernetes.io/zone=0,k8s.ovn.org/egress-assignable=true,kubernetes.io/arch=amd64,kubernetes.io/hostname=huirwang-1128a-s6j6t-worker-westus-2,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=Standard_D4s_v3,node.openshift.io/os_id=rhcos,topology.disk.csi.azure.com/zone=,topology.kubernetes.io/region=westus,topology.kubernetes.io/zone=0
2. Created two egressIP objects
3.

Actual results:

egressip-47032 was not applied to any egress node
% oc get egressip
NAME             EGRESSIPS    ASSIGNED NODE                          ASSIGNED EGRESSIPS
egressip-47032   10.0.1.166                                          
egressip-47034   10.0.1.181   huirwang-1128a-s6j6t-worker-westus-1   10.0.1.181

%  oc get cloudprivateipconfig                      
NAME         AGE
10.0.1.130   6m25s
10.0.1.138   6m34s
10.0.1.166   6m34s
10.0.1.181   6m25s
%  oc get cloudprivateipconfig  10.0.1.166  -o yaml
apiVersion: cloud.network.openshift.io/v1
kind: CloudPrivateIPConfig
metadata:
  annotations:
    k8s.ovn.org/egressip-owner-ref: egressip-47032
  creationTimestamp: "2022-11-28T10:27:37Z"
  finalizers:
  - cloudprivateipconfig.cloud.network.openshift.io/finalizer
  generation: 1
  name: 10.0.1.166
  resourceVersion: "87528"
  uid: 5221075a-35d0-4670-a6a7-ddfc6cbc700b
spec:
  node: huirwang-1128a-s6j6t-worker-westus-1
status:
  conditions:
  - lastTransitionTime: "2022-11-28T10:33:29Z"
    message: 'Error processing cloud assignment request, err: <nil>'
    observedGeneration: 1
    reason: CloudResponseError
    status: "False"
    type: Assigned
  node: huirwang-1128a-s6j6t-worker-westus-1
%  oc get cloudprivateipconfig  10.0.1.138  -o yaml 
apiVersion: cloud.network.openshift.io/v1
kind: CloudPrivateIPConfig
metadata:
  annotations:
    k8s.ovn.org/egressip-owner-ref: egressip-47032
  creationTimestamp: "2022-11-28T10:27:37Z"
  finalizers:
  - cloudprivateipconfig.cloud.network.openshift.io/finalizer
  generation: 1
  name: 10.0.1.138
  resourceVersion: "87523"
  uid: e4604e76-64d8-4735-87a2-eb50d28854cc
spec:
  node: huirwang-1128a-s6j6t-worker-westus-2
status:
  conditions:
  - lastTransitionTime: "2022-11-28T10:33:29Z"
    message: 'Error processing cloud assignment request, err: <nil>'
    observedGeneration: 1
    reason: CloudResponseError
    status: "False"
    type: Assigned
  node: huirwang-1128a-s6j6t-worker-westus-2

oc logs cloud-network-config-controller-6f7b994ddc-vhtbp  -n openshift-cloud-network-config-controller

.......
E1128 10:30:43.590807       1 controller.go:165] error syncing '10.0.1.138': error assigning CloudPrivateIPConfig: "10.0.1.138" to node: "huirwang-1128a-s6j6t-worker-westus-2", err: network.InterfacesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="InvalidRequestFormat" Message="Cannot parse the request." Details=[{"code":"DuplicateResourceName","message":"Resource /subscriptions//resourceGroups//providers/Microsoft.Network/networkInterfaces/ has two child resources with the same name (huirwang-1128a-s6j6t-worker-westus-2_10.0.1.138)."}], requeuing in cloud-private-ip-config workqueue
I1128 10:30:44.051422       1 cloudprivateipconfig_controller.go:271] CloudPrivateIPConfig: "10.0.1.166" will be added to node: "huirwang-1128a-s6j6t-worker-westus-1"
E1128 10:30:44.301259       1 controller.go:165] error syncing '10.0.1.166': error assigning CloudPrivateIPConfig: "10.0.1.166" to node: "huirwang-1128a-s6j6t-worker-westus-1", err: network.InterfacesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="InvalidRequestFormat" Message="Cannot parse the request." Details=[{"code":"DuplicateResourceName","message":"Resource /subscriptions//resourceGroups//providers/Microsoft.Network/networkInterfaces/ has two child resources with the same name (huirwang-1128a-s6j6t-worker-westus-1_10.0.1.166)."}], requeuing in cloud-private-ip-config workqueue
..........

Expected results:

EgressIP can be applied successfully.

Additional info:


Tracker bug for bootimage bump in 4.11. This bug should block bugs which need a bootimage bump to fix.

This is a clone of issue OCPBUGS-7373. The following is the description of the original issue:

Originally reported by lance5890 in issue https://github.com/openshift/cluster-etcd-operator/issues/1000

Under some circumstances the static pod machinery fails to populate the node status in time to generate the correct env variables for ETCD_URL_HOST, ETCD_NAME etc. The pods that come up will fail to accept those variables.

This is particularly pronounced in SNO topologies, leading to installation failures. 

The fix is to fail fast in the targetconfig/envvar controller to ensure the CEO goes degraded instead of silently failing on the rollout of an invalid static pod.

This is a clone of issue OCPBUGS-1678. The following is the description of the original issue:

Description of problem:
pkg/devfile/sample_test.go fails after devfile registry was updated (https://github.com/devfile/registry/pull/126)

OCPBUGS-1677 is about updating our assertion so that the CI job runs successfully again. We might want to backport this as well.

This is about updating the code that the test should use a mock response instead of the latest registry content OR check some specific attributes instead of comparing the full JSON response.

Version-Release number of selected component (if applicable):
4.12

How reproducible:
Always

Steps to Reproduce:
1. Clone openshift/console
2. Run ./test-backend.sh

Actual results:
Unit tests fail

Expected results:
Unit tests should pass again

Additional info:

This is a clone of issue OCPBUGS-212. The following is the description of the original issue:

Description of problem:

oc --context build02 get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-ec.1   True        False         45h     Error while reconciling 4.12.0-ec.1: the cluster operator kube-controller-manager is degraded

oc --context build02 get co kube-controller-manager
NAME                      VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-controller-manager   4.12.0-ec.1   True        False         True       2y87d   GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.153.28:9091: connect: cannot assign requested address

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

build02 is a build farm cluster in CI production.
I can provide credentials to access the cluster if needed.

This is a clone of issue OCPBUGS-676. The following is the description of the original issue:

the machine approver isn't recognizing hostnames that use capital letters as valid even though DNS is case-insensitive

an example of this is in OHSS-14709:

I0822 19:04:51.587266       1 controller.go:114] Reconciling CSR: csr-vdtpv
I0822 19:04:51.600941       1 csr_check.go:156] csr-vdtpv: CSR does not appear to be client csr
I0822 19:04:51.603648       1 csr_check.go:542] retrieving serving cert from ip-100-66-119-117.ec2.internal (100.66.119.117:10250)
I0822 19:04:51.604003       1 csr_check.go:181] Failed to retrieve current serving cert: dial tcp 100.66.119.117:10250: connect: connection refused
I0822 19:04:51.604017       1 csr_check.go:201] Falling back to machine-api authorization for ip-100-66-119-117.ec2.internal
E0822 19:04:51.604024       1 csr_check.go:392] csr-vdtpv: DNS name 'ip-100-66-119-117.tech-ace-maint-prd.aws.delta.com' not in machine names: ip-100-66-119-117.ec2.internal ip-100-66-119-117.ec2.internal ip-100-66-119-117.tech-ACE-maint-prd.aws.delta.com
I0822 19:04:51.604033       1 csr_check.go:204] Could not use Machine for serving cert authorization: DNS name 'ip-100-66-119-117.tech-ace-maint-prd.aws.delta.com' not in machine names: ip-100-66-119-117.ec2.internal ip-100-66-119-117.ec2.internal ip-100-66-119-117.tech-ACE-maint-prd.aws.delta.com
I0822 19:04:51.606777       1 controller.go:199] csr-vdtpv: CSR not authorized

This can be worked around by manually approving the CSR

The relevant line in the machine approver appears to be here: https://github.com/openshift/cluster-machine-approver/blob/master/pkg/controller/csr_check.go#L378

This is a clone of issue OCPBUGS-416. The following is the description of the original issue:

Description of problem:
The pod fails to mount the PVC using IBM Cloud VPC block storage.

Version-Release number of selected component (if applicable):

How reproducible:
The steps can be followed here: from this link
https://cloud.ibm.com/docs/openshift?topic=openshift-vpc-block
.
The error occurs when the application pod tried to mount the VPC.

Steps to Reproduce:
Describe above.

Actual results:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 26m default-scheduler Successfully assigned default/test to a100-huge-m25p7-worker-3-with-secondary-xdwvl
Normal SuccessfulAttachVolume 26m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-8721c341-739d-4607-bbcb-9dcf66ef6dba"
Warning FailedMount 26m (x2 over 26m) kubelet MountVolume.MountDevice failed for volume "pvc-8721c341-739d-4607-bbcb-9dcf66ef6dba" : rpc error: code = Internal desc = {RequestID: ffbb97b4-e4d0-4016-87a9-dc46f80c5478 , Code: FormatAndMountFailed, Description: Failed to format '/dev/disk/by-id/virtio-0777-6872e22d-5c00-4' and mount it at '/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8721c341-739d-4607-bbcb-9dcf66ef6dba/globalmount', BackendError: format of disk "/dev/disk/by-id/virtio-0777-6872e22d-5c00-4" failed: type"ext4") target"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8721c341-739d-4607-bbcb-9dcf66ef6dba/globalmount") options"defaults") errcode:(exit status 1) output:(mke2fs 1.45.6 (20-Mar-2020)
The file /dev/disk/by-id/virtio-0777-6872e22d-5c00-4 does not exist and no size was specified.
) , Action: Please check if there is any error in POD describe related with volume attach}
Warning FailedMount 22m kubelet Unable to attach or mount volumes: unmounted volumes=[bs-pvc], unattached volumes=[kube-api-access-6bgvj bs-pvc]: timed out waiting for the condition
Warning FailedMount 4m11s (x9 over 24m) kubelet Unable to attach or mount volumes: unmounted volumes=[bs-pvc], unattached volumes=[bs-pvc kube-api-access-6bgvj]: timed out waiting for the condition
Warning FailedMount 3m51s (x17 over 26m) kubelet MountVolume.MountDevice failed for volume "pvc-8721c341-739d-4607-bbcb-9dcf66ef6dba" : rpc error: code = Internal desc = {RequestID: 1a12a7c5-3bd0-41cf-b8a9-90dd3224c2fb , Code: FormatAndMountFailed, Description: Failed to format '/dev/disk/by-id/virtio-0777-6872e22d-5c00-4' and mount it at '/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8721c341-739d-4607-bbcb-9dcf66ef6dba/globalmount', BackendError: format of disk "/dev/disk/by-id/virtio-0777-6872e22d-5c00-4" failed: type"ext4") target"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8721c341-739d-4607-bbcb-9dcf66ef6dba/globalmount") options"defaults") errcode:(exit status 1) output:(mke2fs 1.45.6 (20-Mar-2020)

Expected results:
The pod should successfully mount the PVC

Additional info:
Had a debugging session with Sameer Shaikh and Arashad Ahamad from the IBM VPC block storage team. The conclusion is that the udevadm utility is missing in the IPI image used by the IBM Cloud VPC block storage CSI.

  1. oc exec -it ibm-vpc-block-csi-controller-5fbb46bdc6-k7kpf -n openshift-cluster-csi-drivers -c iks-vpc-block-driver bash
    kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD][COMMAND] instead.
    bash-4.4$ which udevadm
    which: no udevadm in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)

Description of problem:

This is just a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2105570 for purposes of cherry-picking.

Version-Release number of selected component (if applicable):

4.13

How reproducible:

 

Steps to Reproduce:

1.
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

This is a clone of issue OCPBUGS-3235. The following is the description of the original issue:

Description of problem:

Frequently we see the loading state of the topology view, even when there aren't many resources in the project.

Including an example

Prerequisites (if any, like setup, operators/versions):

Steps to Reproduce

  1. load topology
  2. if it loads successfully, keep trying  until it fails to load

Actual results:

topology will sometimes hang with the loading indicator showing indefinitely

Expected results:

topology should load consistently without fail

Reproducibility (Always/Intermittent/Only Once):

intermittent

Build Details:

4.9

Additional info:

Description of problem:

NodePort port not accessible

Version-Release number of selected component (if applicable):

OCP 4.8.20

How reproducible:

$oc -n ui-nprd get services -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
docker-registry ClusterIP 10.201.219.240 <none> 5000/TCP 24d app=registry
docker-registry-lb LoadBalancer 10.201.252.253 internal-xxxxxx.xx-xxxx-1.elb.amazonaws.com 5000:30779/TCP 3d22h app=registry
docker-registry-np NodePort 10.201.216.26 <none> 5000:32428/TCP 3d16h app=registry

$oc debug node/ip-xxx.ca-central-1.compute.internal
Starting pod/ip-xxx.ca-central-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.81.23.96
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# nc -vz 10.81.23.96 32428
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connection timed out.

In a new-created namespaces the same deployment works:

[RHEL7:> oc project
Using project "test-c1" on server "https://api.xx.xx.xxxx.xx.xx:6443".
[RHEL7:- ~/tmp]> oc port-forward service/docker-registry-np 5000:5000
Forwarding from 127.0.0.1:5000 -> 5000

[1]+ Stopped oc4 port-forward service/docker-registry-np 5000:5000
[RHEL7: ~/tmp]> bg %1
[1]+ oc4 port-forward service/docker-registry-np 5000:5000 &
[RHEL7: ~/tmp]> nc -v localhost 5000
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 127.0.0.1:5000.
Handling connection for 5000

[RHEL7: ~/tmp]> kill %1
[RHEL7: ~/tmp]>
[1]+ Terminated oc4 port-forward service/docker-registry-np 5000:5000
[RHEL7: ~/tmp]> oc get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
docker-registry-np NodePort 10.201.224.174 <none> 5000:31793/TCP 68s

[RHEL7: ~/tmp]> oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
registry-75b7c7fd94-rx29j 1/1 Running 0 7m5s 10.201.1.29 ip-xxx.ca-central-1.compute.internal <none> <none>
[RHEL7: ~/tmp]> oc debug node/ip-xxx.ca-central-1.compute.internal
Starting pod/ip-xxxca-central-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.81.23.87
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# nc -v 10.81.23.87 31793
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.81.23.87:31793.

Actual results:

  • Working on new created namespace
  • Not working on already created namespace

Expected results:

  • Suppose to work on all namespaces.

Additional info:

  • This cluster get upgrade from 4.7.x to 4.8 and then they manually enable OVN.
  • The issue was happening on all namespaces but after restarting the ovnkube-master-xxxx pods only the newly created namespaces work.

Description of problem:

Cluster running 4.10.52 had three aws-ebs-csi-driver-node pods begin to consume multiple GB of memory, causing heavy node memory pressure as the pods have no memory limit. 

All other aws-ebs-csi-driver-node pods were still in the 50-70MB range:

NAME                                            CPU(cores)   MEMORY(bytes)   
aws-ebs-csi-driver-controller-59867579b-d6s2q   0m           397Mi           
aws-ebs-csi-driver-controller-59867579b-t4wgq   0m           276Mi           
aws-ebs-csi-driver-node-4rmvk                   0m           53Mi            
aws-ebs-csi-driver-node-5799f                   0m           50Mi            
aws-ebs-csi-driver-node-6dpvg                   0m           59Mi            
aws-ebs-csi-driver-node-6ldzk                   0m           65Mi            
aws-ebs-csi-driver-node-6mbk5                   0m           54Mi            
aws-ebs-csi-driver-node-bkvsr                   0m           50Mi            
aws-ebs-csi-driver-node-c2fb2                   0m           62Mi            
aws-ebs-csi-driver-node-f422m                   0m           61Mi            
aws-ebs-csi-driver-node-lwzbb                   6m           1940Mi          
aws-ebs-csi-driver-node-mjznt                   0m           53Mi            
aws-ebs-csi-driver-node-pczsj                   0m           62Mi            
aws-ebs-csi-driver-node-pmskn                   0m           3493Mi          
aws-ebs-csi-driver-node-qft8w                   0m           68Mi            
aws-ebs-csi-driver-node-v5bpx                   11m          2076Mi          
aws-ebs-csi-driver-node-vn8km                   0m           84Mi            
aws-ebs-csi-driver-node-ws6hx                   0m           73Mi            
aws-ebs-csi-driver-node-xsk7k                   0m           59Mi            
aws-ebs-csi-driver-node-xzwlh                   0m           55Mi            
aws-ebs-csi-driver-operator-8c5ffb6d4-fk6zk     5m           88Mi            

Deleting the pods caused them to recreate, with normal memory consumption levels.

Version-Release number of selected component (if applicable):

4.10.52

How reproducible:

Unknown

This is a clone of issue OCPBUGS-753. The following is the description of the original issue:

Description of problem:

The default dns-default pod is missing the "target.workload.openshift.io/management:" annotation. 

As a result when the workload partitioning feature is enabled on SNO, this pod resources will not get mutated and pinned to the reserved cpuset.

This is a regresion from 4.10. Pod spec from 4.10.17

Annotations:
...
   resources.workload.openshift.io/dns: {"cpushares": 51}
   resources.workload.openshift.io/kube-rbac-proxy: {"cpushares": 10}
   target.workload.openshift.io/management {"effect":"PreferredDuringScheduling"}

Version-Release number of selected component (if applicable):

4.11.0

How reproducible:

100%

Steps to Reproduce:

1. Install a SNO and check the annotation
2.
3.

Actual results:

 

Expected results:

 

Additional info:

 

 

This is a clone of issue OCPBUGS-2508. The following is the description of the original issue:

Description of problem:

Installer fails due to Neutron policy error when creating Openstack servers for OCP master nodes.

$ oc get machines -A
NAMESPACE               NAME                          PHASE          TYPE   REGION   ZONE   AGE
openshift-machine-api   ostest-kwtf8-master-0         Running                               23h
openshift-machine-api   ostest-kwtf8-master-1         Running                               23h
openshift-machine-api   ostest-kwtf8-master-2         Running                               23h
openshift-machine-api   ostest-kwtf8-worker-0-g7nrw   Provisioning                          23h
openshift-machine-api   ostest-kwtf8-worker-0-lrkvb   Provisioning                          23h
openshift-machine-api   ostest-kwtf8-worker-0-vwrsk   Provisioning                          23h

$ oc -n openshift-machine-api logs machine-api-controllers-7454f5d65b-8fqx2 -c machine-controller
[...]
E1018 10:51:49.355143       1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="error creating Openstack instance: Failed to create port err: Request forbidden: [POST https://overcloud.redhat.local:13696/v2.0/ports], error message: {\"NeutronError\": {\"type\": \"PolicyNotAuthorized\", \"message\": \"(rule:create_port and (rule:create_port:allowed_address_pairs and (rule:create_port:allowed_address_pairs:ip_address and rule:create_port:allowed_address_pairs:ip_address))) is disallowed by policy\", \"detail\": \"\"}}" "name"="ostest-kwtf8-worker-0-lrkvb" "namespace"="openshift-machine-api"

Version-Release number of selected component (if applicable):

4.10.0-0.nightly-2022-10-14-023020

How reproducible:

Always

Steps to Reproduce:

1. Install 4.10 within provider networks (in primary or secondary interface)

Actual results:

Installation failure:
4.10.0-0.nightly-2022-10-14-023020: some cluster operators have not yet rolled out

Expected results:

Successful installation

Additional info:

Please find must-gather for installation on primary interface link here and for installation on secondary interface link here.

 

This is a clone of issue OCPBUGS-4897. The following is the description of the original issue:

This is a clone of issue OCPBUGS-2500. The following is the description of the original issue:

Description of problem:

When the Ux switches to the Dev console the topology is always blank in a Project that has a large number of components.

Version-Release number of selected component (if applicable):

 

How reproducible:

Always occurs

Steps to Reproduce:

1.Create a project with at least 12 components (Apps, Operators, knative Brokers)
2. Go to the Administrator Viewpoint
3. Switch to Developer Viewpoint/Topology
4. No components displayed
5. Click on 'fit to screen'
6. All components appear

Actual results:

Topology renders with all controls but no components visible (see screenshot 1)

Expected results:

All components should be visible

Additional info:

 

Description of problem:
This is a clone of https://issues.redhat.com/browse/OCPBUGS-469

 

Description of problem: Numerous erroreneous logs in OVN master

I0823 18:00:11.163491       1 obj_retry.go:1063] Retry object setup: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k
I0823 18:00:11.163546       1 obj_retry.go:1096] Removing old object: *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k
I0823 18:00:11.163555       1 pods.go:124] Deleting pod: openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k
I0823 18:00:11.163631       1 obj_retry.go:1103] Retry delete failed for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k, will try again later: deleteLogicalPort failed for pod openshift-operator-lifecycle-manager_collect-profiles-27687900-hlp6k: unable to locate portUUID+nodeName for pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k: error getting logical port <nil>: object not found
W0823 18:00:41.163633       1 obj_retry.go:1031] Dropping retry entry for *v1.Pod openshift-operator-lifecycle-manager/collect-profiles-27687900-hlp6k: exceeded number of failed attempts

Must-gather: http://shell.lab.bos.redhat.com/~anusaxen/must-gather.local.2234927131259452300/

 

Version-Release number of selected component (if applicable): 4.12.0-0.nightly-2022-08-23-031342

How reproducible: Always

Steps to Reproduce:
1. Bring up OVN cluster on 4.12
2.
3.

Actual results: deleteLogicalPort failed for already gone object

Expected results: deleteLogicalPort should not keep retrying post object deletion

Additional info:

Description of problem:

Disconnected IPI OCP 4.10.22 cluster install on baremetal fails when hostname of master nodes does not include "master"    

Version-Release number of selected component (if applicable): 4.10.22

How reproducible:  Perform disconnected IPI install of OCP 4.10.22 on bare metal with master nodes that do not contain the text "master"

Steps to Reproduce:

Perform disconnected IPI install of OCP 4.10.22 on bare metal with master nodes that do not contain the text "master"

Actual results: master nodes do come up.

Expected results: master nodes should come up despite that the text "master" is not in their hostname.

Additional info:

Disconnected IPI OCP 4.10.22 cluster install on baremetal fails when hostname of master nodes does not include "master"    

The code for the cluster-baremetal-operator at the following link: 

https://github.com/openshift/cluster-baremetal-operator/blob/49d7b249c5dcef8228f206eff4530a25f03b201f/controllers/provisioning_controller.go#L441

The following condition is concerning:

if strings.Contains(bmh.Name, "master") && len(bmh.Spec.BootMACAddress) > 0

The packages reveal that bmh.Name references the name inside the metadata of the BMH object. 

Should a customer have masters with names that do not include the text "master", the above condition can never become true, and so, the following slice is never created :

macs = append(macs, bmh.Spec.BootMACAddress)

 

 

This is a clone of issue OCPBUGS-10977. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10890. The following is the description of the original issue:

This is a clone of issue OCPBUGS-10649. The following is the description of the original issue:

Description of problem:

After a replace upgrade from OCP 4.14 image to another 4.14 image first node is in NotReady.

jiezhao-mac:hypershift jiezhao$ oc get node --kubeconfig=hostedcluster.kubeconfig 
NAME                     STATUS   ROLES  AGE   VERSION
ip-10-0-128-175.us-east-2.compute.internal  Ready   worker  72m   v1.26.2+06e8c46
ip-10-0-134-164.us-east-2.compute.internal  Ready   worker  68m   v1.26.2+06e8c46
ip-10-0-137-194.us-east-2.compute.internal  Ready   worker  77m   v1.26.2+06e8c46
ip-10-0-141-231.us-east-2.compute.internal  NotReady  worker  9m54s  v1.26.2+06e8c46

- lastHeartbeatTime: "2023-03-21T19:48:46Z"
  lastTransitionTime: "2023-03-21T19:42:37Z"
  message: 'container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady
   message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/.
   Has your network provider started?'
  reason: KubeletNotReady
  status: "False"
  type: Ready

Events:
 Type   Reason          Age         From          Message
 ----   ------          ----        ----          -------
 Normal  Starting         11m         kubelet        Starting kubelet.
 Normal  NodeHasSufficientMemory 11m (x2 over 11m)  kubelet        Node ip-10-0-141-231.us-east-2.compute.internal status is now: NodeHasSufficientMemory
 Normal  NodeHasNoDiskPressure  11m (x2 over 11m)  kubelet        Node ip-10-0-141-231.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
 Normal  NodeHasSufficientPID   11m (x2 over 11m)  kubelet        Node ip-10-0-141-231.us-east-2.compute.internal status is now: NodeHasSufficientPID
 Normal  NodeAllocatableEnforced 11m         kubelet        Updated Node Allocatable limit across pods
 Normal  Synced          11m         cloud-node-controller Node synced successfully
 Normal  RegisteredNode      11m         node-controller    Node ip-10-0-141-231.us-east-2.compute.internal event: Registered Node ip-10-0-141-231.us-east-2.compute.internal in Controller
 Warning ErrorReconcilingNode   17s (x30 over 11m) controlplane      nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation

ovnkube-master log:

I0321 20:55:16.270197       1 default_network_controller.go:667] Node add failed for ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:16.270209       1 obj_retry.go:326] Retry add failed for *v1.Node ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:16.270273       1 event.go:285] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-141-231.us-east-2.compute.internal", UID:"621e6289-ca5a-4e17-afff-5b49961cfb38", APIVersion:"v1", ResourceVersion:"52970", FieldPath:""}): type: 'Warning' reason: 'ErrorReconcilingNode' nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:17.851497       1 master.go:719] Adding or Updating Node "ip-10-0-137-194.us-east-2.compute.internal"
I0321 20:55:25.965132       1 master.go:719] Adding or Updating Node "ip-10-0-128-175.us-east-2.compute.internal"
I0321 20:55:45.928694       1 client.go:783]  "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:update Table:NB_Global Row:map[options:{GoMap:map[e2e_timestamp:1679432145 mac_prefix:2e:f9:d8 max_tunid:16711680 northd_internal_version:23.03.1-20.27.0-70.6 northd_probe_interval:5000 svc_monitor_mac:fe:cb:72:cf:f8:5f use_logical_dp_groups:true]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {c8b24290-296e-44a2-a4d0-02db7e312614}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]"
I0321 20:55:46.270129       1 obj_retry.go:265] Retry object setup: *v1.Node ip-10-0-141-231.us-east-2.compute.internal
I0321 20:55:46.270154       1 obj_retry.go:319] Adding new object: *v1.Node ip-10-0-141-231.us-east-2.compute.internal
I0321 20:55:46.270164       1 master.go:719] Adding or Updating Node "ip-10-0-141-231.us-east-2.compute.internal"
I0321 20:55:46.270201       1 default_network_controller.go:667] Node add failed for ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:46.270209       1 obj_retry.go:326] Retry add failed for *v1.Node ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:46.270284       1 event.go:285] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-141-231.us-east-2.compute.internal", UID:"621e6289-ca5a-4e17-afff-5b49961cfb38", APIVersion:"v1", ResourceVersion:"52970", FieldPath:""}): type: 'Warning' reason: 'ErrorReconcilingNode' nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:55:52.916512       1 reflector.go:559] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Namespace total 5 items received
I0321 20:56:06.910669       1 reflector.go:559] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Pod total 12 items received
I0321 20:56:15.928505       1 client.go:783]  "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:update Table:NB_Global Row:map[options:{GoMap:map[e2e_timestamp:1679432175 mac_prefix:2e:f9:d8 max_tunid:16711680 northd_internal_version:23.03.1-20.27.0-70.6 northd_probe_interval:5000 svc_monitor_mac:fe:cb:72:cf:f8:5f use_logical_dp_groups:true]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {c8b24290-296e-44a2-a4d0-02db7e312614}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]"
I0321 20:56:16.269611       1 obj_retry.go:265] Retry object setup: *v1.Node ip-10-0-141-231.us-east-2.compute.internal
I0321 20:56:16.269637       1 obj_retry.go:319] Adding new object: *v1.Node ip-10-0-141-231.us-east-2.compute.internal
I0321 20:56:16.269646       1 master.go:719] Adding or Updating Node "ip-10-0-141-231.us-east-2.compute.internal"
I0321 20:56:16.269688       1 default_network_controller.go:667] Node add failed for ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:56:16.269697       1 obj_retry.go:326] Retry add failed for *v1.Node ip-10-0-141-231.us-east-2.compute.internal, will try again later: nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation
I0321 20:56:16.269724       1 event.go:285] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-141-231.us-east-2.compute.internal", UID:"621e6289-ca5a-4e17-afff-5b49961cfb38", APIVersion:"v1", ResourceVersion:"52970", FieldPath:""}): type: 'Warning' reason: 'ErrorReconcilingNode' nodeAdd: error adding node "ip-10-0-141-231.us-east-2.compute.internal": could not find "k8s.ovn.org/node-subnets" annotation

cluster-network-operator log:

I0321 21:03:38.487602       1 log.go:198] Set operator conditions:
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2023-03-21T17:39:26Z"
  status: "True"
  type: Available
I0321 21:03:38.488312       1 log.go:198] Skipping reconcile of Network.operator.openshift.io: spec unchanged
I0321 21:03:38.499825       1 log.go:198] Set ClusterOperator conditions:
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2023-03-21T17:39:26Z"
  status: "True"
  type: Available
I0321 21:03:38.571013       1 log.go:198] Set HostedControlPlane conditions:
- lastTransitionTime: "2023-03-21T17:38:24Z"
  message: All is well
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ValidAWSIdentityProvider
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: Configuration passes validation
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ValidHostedControlPlaneConfiguration
- lastTransitionTime: "2023-03-21T19:24:24Z"
  message: ""
  observedGeneration: 3
  reason: QuorumAvailable
  status: "True"
  type: EtcdAvailable
- lastTransitionTime: "2023-03-21T17:38:23Z"
  message: Kube APIServer deployment is available
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: KubeAPIServerAvailable
- lastTransitionTime: "2023-03-21T20:26:29Z"
  message: ""
  observedGeneration: 3
  reason: AsExpected
  status: "False"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:37:11Z"
  message: All is well
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: InfrastructureReady
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: External DNS is not configured
  observedGeneration: 3
  reason: StatusUnknown
  status: Unknown
  type: ExternalDNSReachable
- lastTransitionTime: "2023-03-21T19:24:24Z"
  message: ""
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: Available
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: Reconciliation active on resource
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ReconciliationActive
- lastTransitionTime: "2023-03-21T17:38:25Z"
  message: All is well
  reason: AsExpected
  status: "True"
  type: AWSDefaultSecurityGroupCreated
- lastTransitionTime: "2023-03-21T19:30:54Z"
  message: 'Error while reconciling 4.14.0-0.nightly-2023-03-20-201450: the cluster
    operator network is degraded'
  observedGeneration: 3
  reason: ClusterOperatorDegraded
  status: "False"
  type: ClusterVersionProgressing
- lastTransitionTime: "2023-03-21T17:39:11Z"
  message: Condition not found in the CVO.
  observedGeneration: 3
  reason: StatusUnknown
  status: Unknown
  type: ClusterVersionUpgradeable
- lastTransitionTime: "2023-03-21T17:44:05Z"
  message: Done applying 4.14.0-0.nightly-2023-03-20-201450
  observedGeneration: 3
  reason: FromClusterVersion
  status: "True"
  type: ClusterVersionAvailable
- lastTransitionTime: "2023-03-21T19:55:15Z"
  message: Cluster operator network is degraded
  observedGeneration: 3
  reason: ClusterOperatorDegraded
  status: "True"
  type: ClusterVersionFailing
- lastTransitionTime: "2023-03-21T17:39:11Z"
  message: Payload loaded version="4.14.0-0.nightly-2023-03-20-201450" image="registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-03-20-201450"
    architecture="amd64"
  observedGeneration: 3
  reason: PayloadLoaded
  status: "True"
  type: ClusterVersionReleaseAccepted
- lastTransitionTime: "2023-03-21T17:39:21Z"
  message: ""
  reason: AsExpected
  status: "False"
  type: network.operator.openshift.io/ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: network.operator.openshift.io/Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  message: ""
  reason: AsExpected
  status: "True"
  type: network.operator.openshift.io/Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: network.operator.openshift.io/Progressing
- lastTransitionTime: "2023-03-21T17:39:27Z"
  message: ""
  reason: AsExpected
  status: "True"
  type: network.operator.openshift.io/Available
I0321 21:03:39.450912       1 pod_watcher.go:125] Operand /, Kind= openshift-multus/multus updated, re-generating status
I0321 21:03:39.450953       1 pod_watcher.go:125] Operand /, Kind= openshift-multus/multus updated, re-generating status
I0321 21:03:39.493206       1 log.go:198] Set operator conditions:
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2023-03-21T17:39:26Z"
  status: "True"
  type: Available
I0321 21:03:39.494050       1 log.go:198] Skipping reconcile of Network.operator.openshift.io: spec unchanged
I0321 21:03:39.508538       1 log.go:198] Set ClusterOperator conditions:
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "False"
  type: ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: Progressing
- lastTransitionTime: "2023-03-21T17:39:26Z"
  status: "True"
  type: Available
I0321 21:03:39.684429       1 log.go:198] Set HostedControlPlane conditions:
- lastTransitionTime: "2023-03-21T17:38:24Z"
  message: All is well
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ValidAWSIdentityProvider
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: Configuration passes validation
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ValidHostedControlPlaneConfiguration
- lastTransitionTime: "2023-03-21T19:24:24Z"
  message: ""
  observedGeneration: 3
  reason: QuorumAvailable
  status: "True"
  type: EtcdAvailable
- lastTransitionTime: "2023-03-21T17:38:23Z"
  message: Kube APIServer deployment is available
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: KubeAPIServerAvailable
- lastTransitionTime: "2023-03-21T20:26:29Z"
  message: ""
  observedGeneration: 3
  reason: AsExpected
  status: "False"
  type: Degraded
- lastTransitionTime: "2023-03-21T17:37:11Z"
  message: All is well
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: InfrastructureReady
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: External DNS is not configured
  observedGeneration: 3
  reason: StatusUnknown
  status: Unknown
  type: ExternalDNSReachable
- lastTransitionTime: "2023-03-21T19:24:24Z"
  message: ""
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: Available
- lastTransitionTime: "2023-03-21T17:37:06Z"
  message: Reconciliation active on resource
  observedGeneration: 3
  reason: AsExpected
  status: "True"
  type: ReconciliationActive
- lastTransitionTime: "2023-03-21T17:38:25Z"
  message: All is well
  reason: AsExpected
  status: "True"
  type: AWSDefaultSecurityGroupCreated
- lastTransitionTime: "2023-03-21T19:30:54Z"
  message: 'Error while reconciling 4.14.0-0.nightly-2023-03-20-201450: the cluster
    operator network is degraded'
  observedGeneration: 3
  reason: ClusterOperatorDegraded
  status: "False"
  type: ClusterVersionProgressing
- lastTransitionTime: "2023-03-21T17:39:11Z"
  message: Condition not found in the CVO.
  observedGeneration: 3
  reason: StatusUnknown
  status: Unknown
  type: ClusterVersionUpgradeable
- lastTransitionTime: "2023-03-21T17:44:05Z"
  message: Done applying 4.14.0-0.nightly-2023-03-20-201450
  observedGeneration: 3
  reason: FromClusterVersion
  status: "True"
  type: ClusterVersionAvailable
- lastTransitionTime: "2023-03-21T19:55:15Z"
  message: Cluster operator network is degraded
  observedGeneration: 3
  reason: ClusterOperatorDegraded
  status: "True"
  type: ClusterVersionFailing
- lastTransitionTime: "2023-03-21T17:39:11Z"
  message: Payload loaded version="4.14.0-0.nightly-2023-03-20-201450" image="registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-03-20-201450"
    architecture="amd64"
  observedGeneration: 3
  reason: PayloadLoaded
  status: "True"
  type: ClusterVersionReleaseAccepted
- lastTransitionTime: "2023-03-21T17:39:21Z"
  message: ""
  reason: AsExpected
  status: "False"
  type: network.operator.openshift.io/ManagementStateDegraded
- lastTransitionTime: "2023-03-21T19:53:10Z"
  message: DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making
    progress - last change 2023-03-21T19:42:39Z
  reason: RolloutHung
  status: "True"
  type: network.operator.openshift.io/Degraded
- lastTransitionTime: "2023-03-21T17:39:21Z"
  message: ""
  reason: AsExpected
  status: "True"
  type: network.operator.openshift.io/Upgradeable
- lastTransitionTime: "2023-03-21T19:42:39Z"
  message: |-
    DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
  reason: Deploying
  status: "True"
  type: network.operator.openshift.io/Progressing
- lastTransitionTime: "2023-03-21T17:39:27Z"
  message: ""
  reason: AsExpected
  status: "True"
  type: network.operator.openshift.io/Available

Version-Release number of selected component (if applicable):

 

How reproducible:

 

Steps to Reproduce:

1. management cluster 4.13
2. bring up the hostedcluster and nodepool in 4.14.0-0.nightly-2023-03-19-234132
3. upgrade the hostedcluster to 4.14.0-0.nightly-2023-03-20-201450 
4. replace upgrade the nodepool to 4.14.0-0.nightly-2023-03-20-201450 

Actual results

First node is in NotReady

Expected results:

All nodes should be Ready

Additional info:

No issue with replace upgrade from 4.13 to 4.14

 

 

 

 

 

 

This is a clone of issue OCPBUGS-5191. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5164. The following is the description of the original issue:

Description of problem:

It looks like the ODC doesn't register KNATIVE_SERVING and KNATIVE_EVENTING flags. Those are based on KnativeServing and KnativeEventing CRs, but they are looking for v1alpha1 version of those: https://github.com/openshift/console/blob/f72519fdf2267ad91cc0aa51467113cc36423a49/frontend/packages/knative-plugin/console-extensions.json#L6-L8
This PR https://github.com/openshift-knative/serverless-operator/pull/1695 moved the CRs to v1beta1, and that breaks that ODC discovery.

Version-Release number of selected component (if applicable):

Openshift 4.8, Serverless Operator 1.27

Additional info:

https://coreos.slack.com/archives/CHGU4P8UU/p1671634903447019

 

Description of problem:

During an upgrade from 4.10.0-0.nightly-2022-09-06-081345 -> 4.11.0-0.nightly-2022-09-06-074353

DNS operator stuck in progressing state with following error in dns-defalt pod

2022-09-07T00:47:35.959931289Z [WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: i/o timeout

must-gather: http://shell.lab.bos.redhat.com/~anusaxen/must-gather-136172-050448657.tar.gz

Version-Release number of selected component (if applicable):

4.11.0-0.nightly-2022-09-06-074353

How reproducible:

Intermittent

Steps to Reproduce:

1. Perform upgrade from 4.10->4.11
2.
3.

Actual results:

Upgrade unsuccessful due to API svc unreachability

Expected results:

Upgrade should be successful

Additional info:

$ omg get co | grep -v "True.*False.*False"
NAME                                      VERSION                             AVAILABLE  PROGRESSING  DEGRADED  SINCE
dns                                       4.10.0-0.nightly-2022-09-06-081345  True       True         False     3h4m
network                                   4.11.0-0.nightly-2022-09-06-074353  True       True         False     3h5m

$ omg logs dns-default-5mqz4 -c dns -n openshift-dns
/home/anusaxen/Downloads/must-gather-136208-318613238/must-gather.local.4033693973186012455/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-d995e1608ba83a1f71b5a1c49705aa3cef38618e0674ee256332d1b0e2cb0e23/namespaces/openshift-dns/pods/dns-default-5mqz4/dns/dns/logs/current.log
2022-09-07T00:45:36.355161941Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
2022-09-07T00:45:36.854403897Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
2022-09-07T00:45:37.354367197Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
2022-09-07T00:45:37.861427861Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
2022-09-07T00:45:38.355095996Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
2022-09-07T00:45:38.855063626Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
2022-09-07T00:45:39.355170472Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
2022-09-07T00:45:39.855296624Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
2022-09-07T00:45:40.355481413Z [INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
2022-09-07T00:45:40.855455283Z [WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
2022-09-07T00:45:40.856166731Z .:5353
2022-09-07T00:45:40.856166731Z [INFO] plugin/reload: Running configuration SHA512 = 7c3db3182389e270fe3af971fef9e7157a34bde01eb750d58c2d4895a965ecb2e2d00df34594b6a819653088b6efa2590e748ce23f9e16cbb0f44d74a4fc3587
2022-09-07T00:45:40.856166731Z CoreDNS-1.9.2
2022-09-07T00:45:40.856166731Z linux/amd64, go1.18.4, 
2022-09-07T00:46:05.957503195Z [WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: i/o timeout
2022-09-07T00:47:25.444876534Z [INFO] SIGTERM: Shutting down servers then terminating
2022-09-07T00:47:25.444984367Z [INFO] plugin/health: Going into lameduck mode for 20s
2022-09-07T00:47:35.959931289Z [WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: i/o timeout

Description of problem:

If you set a services cluster IP to an IP with a leading zero (e.g. 192.168.0.011), ovn-k should normalise this and remove the leading zero before sending it to ovn.

This was seen by me on a CI run executing the k8 test here: test/e2e/network/funny_ips.go +75

you can reproduce using that above test.

Have a read of the text there:

 43 // What are funny IPs:  
 44 // The adjective is because of the curl blog that explains the history and the problem of liberal  
 45 // parsing of IP addresses and the consequences and security risks caused the lack of normalization,
 46 // mainly due to the use of different notations to abuse parsers misalignment to bypass filters.
 47 // xref: https://daniel.haxx.se/blog/2021/04/19/curl-those-funny-ipv4-addresses/   
 48 //     
 49 // Since golang 1.17, IPv4 addresses with leading zeros are rejected by the standard library.
 50 // xref: https://github.com/golang/go/issues/30999
 51 //     
 52 // Because this change on the parsers can cause that previous valid data become invalid, Kubernetes
 53 // forked the old parsers allowing leading zeros on IPv4 address to not break the compatibility.
 54 //     
 55 // Kubernetes interprets leading zeros on IPv4 addresses as decimal, users must not rely on parser
 56 // alignment to not being impacted by the associated security advisory: CVE-2021-29923 golang
 57 // standard library "net" - Improper Input Validation of octal literals in golang 1.16.2 and below
 58 // standard library "net" results in indeterminate SSRF & RFI vulnerabilities. xref:
 59 // https://nvd.nist.gov/vuln/detail/CVE-2021-29923                                                                                                     

northd is logging an error about this also:

|socket_util|ERR|172.30.0.011:7180: bad IP address "172.30.0.011" 
...
2022-08-23T14:14:21.968Z|01839|ovn_util|WARN|bad ip address or port for load balancer key 172.30.0.011:7180

 

Also, I see the error:

E0823 14:14:34.135115    3284 gateway_shared_intf.go:600] Failed to delete conntrack entry for service e2e-funny-ips-8626/funny-ip: failed to delete conntrack entry for service e2e-funny-ips-8626/funny-ip with svcVIP 172.30.0.011, svcPort 7180, protocol TCP: value "<nil>" passed to DeleteConntrack is not an IP address 

We should normalise the IPs before sending to OVN-k. I see also theres conntrack error when trying to set this bad IP.

 

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. See above k8 test

Actual results:

Leading zero IP sent to OVN

Expected results:

No leading zero IP sent to OVN

Additional info:

This is a clone of issue OCPBUGS-11993. The following is the description of the original issue:

Description of problem:

There's argument number mismatch on release_vif() call while reverting
port association.

Version-Release number of selected component (if applicable):

 

How reproducible:

It's clear in the code, no need to reproduce this.

Steps to Reproduce:

1.
2.
3.

Actual results:

TypeError

Expected results:

KuryrPort released

Additional info:

 

This is a clone of issue OCPBUGS-8339. The following is the description of the original issue:

This is a clone of issue OCPBUGS-5287. The following is the description of the original issue:

Description of problem:

See https://issues.redhat.com/browse/THREESCALE-9015.  A problem with the Red Hat Integration - 3scale - Managed Application Services operator prevents it from installing correctly, which results in the failure of operator-install-single-namespace.spec.ts integration test.

Description of problem:

The 4.11 version of openshift-installer does not support the mon01 zone

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info: